NLP Tutorial

Part 2: Traditional NLP

Corpora, Tokens, and Types

Corpus: a text dataset, usually paired with metadata

Tokens: contiguous units of characters grouped together, roughly corresponding to words, numbers, and punctuation

Instance: a piece of text along with its corresponding metadata

Tokenization: the process of breaking text down into tokens

pengTweet.png

In the above tweet, the @’s and #’s should remain attached to their tokens during tokenization, and any emojis should be preserved. Many packages tokenize text for us, but here we will specifically be using NLTK and spaCy.

The spaCy packages must be installed separately. Install with these commands:

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_trf

or if you use conda:

conda install -c conda-forge spacy
python -m spacy download en_core_web_trf

Install NLTK with pip:

pip install --user -U nltk

or with conda:

conda install -c anaconda nltk

import spacy

# load the transformer-based English pipeline installed above
nlp = spacy.load('en_core_web_trf')
tweet = "Find out the reason that commands you to write; see whether it has spread its roots into the very depths of your heart."
# lowercase the text, run it through the pipeline, and keep each token as a string
cleanedTweet = [str(token) for token in nlp(tweet.lower())]
print(cleanedTweet)
['find', 'out', 'the', 'reason', 'that', 'commands', 'you', 'to', 'write', ';', 'see', 'whether', 'it', 'has', 'spread', 'its', 'roots', 'into', 'the', 'very', 'depths', 'of', 'your', 'heart', '.']

from nltk.tokenize import TweetTokenizer

fullTweet = "'Find out the reason that commands you to write; see whether it has spread its roots into the very depths of your heart.' Rainer Maria Rilke, born #onthisday in 1875. #WednesdayWisdom"

# TweetTokenizer keeps hashtags, @-mentions, and emoticons as single tokens
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(fullTweet.lower()))
["'", 'find', 'out', 'the', 'reason', 'that', 'commands', 'you', 'to', 'write', ';', 'see', 'whether', 'it', 'has', 'spread', 'its', 'roots', 'into', 'the', 'very', 'depths', 'of', 'your', 'heart', '.', "'", 'rainer', 'maria', 'rilke', ',', 'born', '#onthisday', 'in', '1875', '.', '#wednesdaywisdom']

Types: unique tokens in a corpus

Vocabulary: the set of all types in a corpus

Stopwords: common words such as articles and prepositions that serve a mostly grammatical purpose and are often filtered out
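
As a quick sketch, the vocabulary of the tokenized quote above is just the set of its tokens, and stopwords can be filtered with spaCy's built-in English stopword list (stopword lists vary from package to package, so treat this as one possible choice):

# the vocabulary is the set of types in the token list
vocabulary = set(cleanedTweet)
print(len(cleanedTweet), 'tokens,', len(vocabulary), 'types')

# drop stopwords using spaCy's English stopword list
from spacy.lang.en.stop_words import STOP_WORDS
print([token for token in cleanedTweet if token not in STOP_WORDS])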

Unigrams, Bigrams, Trigrams, …, N-Grams

N-Grams: fixed-length consecutive token sequences that occur in the text

def n_grams(text, n):
    '''
    takes tokens or text, returns a list of ngrams
    '''
    return [text[i:i+n] for i in range(len(text)-n+1)]


print("Bigram:\n",n_grams(cleanedTweet, 2)) # bigram
print("\nTrigram:\n",n_grams(cleanedTweet, 3)) # trigram

Bigram:
 [['find', 'out'], ['out', 'the'], ['the', 'reason'], ['reason', 'that'], ['that', 'commands'], ['commands', 'you'], ['you', 'to'], ['to', 'write'], ['write', ';'], [';', 'see'], ['see', 'whether'], ['whether', 'it'], ['it', 'has'], ['has', 'spread'], ['spread', 'its'], ['its', 'roots'], ['roots', 'into'], ['into', 'the'], ['the', 'very'], ['very', 'depths'], ['depths', 'of'], ['of', 'your'], ['your', 'heart'], ['heart', '.']]

Trigram:
 [['find', 'out', 'the'], ['out', 'the', 'reason'], ['the', 'reason', 'that'], ['reason', 'that', 'commands'], ['that', 'commands', 'you'], ['commands', 'you', 'to'], ['you', 'to', 'write'], ['to', 'write', ';'], ['write', ';', 'see'], [';', 'see', 'whether'], ['see', 'whether', 'it'], ['whether', 'it', 'has'], ['it', 'has', 'spread'], ['has', 'spread', 'its'], ['spread', 'its', 'roots'], ['its', 'roots', 'into'], ['roots', 'into', 'the'], ['into', 'the', 'very'], ['the', 'very', 'depths'], ['very', 'depths', 'of'], ['depths', 'of', 'your'], ['of', 'your', 'heart'], ['your', 'heart', '.']]

Sometimes character n-grams can be useful as well as they can help find patterns in suffixes and prefixes.
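
Since Python slicing works on strings as well as lists, the n_grams function defined above can be reused unchanged for character n-grams:

# character trigrams of a single word
print(n_grams("tokenization", 3))
['tok', 'oke', 'ken', 'eni', 'niz', 'iza', 'zat', 'ati', 'tio', 'ion']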

Lemmas and Stems

Lemmas: the root form of a word

Lemmatization: reduction of tokens to their lemmas

doc = nlp(u"Adam was seen bravely running to the scene")
print(doc)
for token in doc:
    print('{} --> {}'.format(token, token.lemma_))
Adam was seen bravely running to the scene
Adam --> Adam
was --> be
seen --> see
bravely --> bravely
running --> run
to --> to
the --> the
scene --> scene

For lemmatization, spaCy relies on predefined lookup tables and rules; for English, these were originally derived from WordNet.

Stemming: Use of defined rules to strip endings from words to create stems
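
A common choice of stemmer is the Porter stemmer from NLTK; a minimal sketch (note that stems, unlike lemmas, need not be real words):

from nltk.stem import PorterStemmer

# apply rule-based suffix stripping to a few words
stemmer = PorterStemmer()
for word in ['running', 'flies', 'bravely']:
    print('{} --> {}'.format(word, stemmer.stem(word)))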

Categorizing Sentences and Documents

Categorizing sentences and documents is commonly done for sentiment analysis of reviews, spam filtering, language identification, topic assignment, and email triaging.

Categorizing Words: POS Tagging

Labeling each word and symbol with its part of speech (POS).

print(doc)
for token in doc:
    print('{} - {}'.format(token, token.pos_))
Adam was seen bravely running to the scene
Adam - PROPN
was - AUX
seen - VERB
bravely - ADV
running - VERB
to - ADP
the - DET
scene - NOUN

Categorizing Spans: Chunking and Named Entity Recognition

Using our example sentence “Adam was seen bravely running to the scene”, we can label spans of text.

[NP Adam] [VP was seen] [VP bravely running] to [NP the scene]

print(doc)
for chunk in doc.noun_chunks:
    print ('{} - {}'.format(chunk, chunk.label_))
Adam was seen bravely running to the scene
Adam - NP
the scene - NP
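
Named entities can be pulled out in much the same way through doc.ents; here is a small sketch on a new sentence (the exact entity labels depend on the model):

entDoc = nlp(u"Rainer Maria Rilke was born in Prague in 1875.")
for ent in entDoc.ents:
    print('{} - {}'.format(ent.text, ent.label_))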

Structure of Sentence

Shallow Parsing: identifies phrasal units

Parsing: identifies relationships within and among phrasal units

Parse trees show the hierarchy of grammatical units within a sentence. Below is a constituent parse:

sentenceStruct.png

Dependency parsing is also very useful and clear for identifying relationships between words.

sentenceDepend.png

Graphs made using corenlp.run
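
spaCy also exposes the dependency parse directly on each token; a minimal sketch using the doc from the lemmatization example (relation labels depend on the model):

# each token stores its dependency relation and its syntactic head
for token in doc:
    print('{} --{}--> {}'.format(token.text, token.dep_, token.head.text))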

Word Senses and Semantics

Senses: Different meanings of a word

WordNet, which is accessible in Python through NLTK, attempts to catalog every sense of every word in the English language. For example, even a simple word like ‘zero’ produces a multitude of results:

wordNetZero.png
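
The same senses can be explored programmatically through NLTK's WordNet interface; a minimal sketch (assumes the WordNet data has been fetched with nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

# list every catalogued sense of 'zero' with its definition
for synset in wn.synsets('zero'):
    print(synset.name(), '-', synset.definition())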