Basic Natural Language Processing
Original Source: https://www.coursera.org/specializations/data-science-python
import nltk
# nltk.download()  # only needed the first time, to download the NLTK data packages
import numpy as np
text = "Children shouldn't drink sugary drinks before bed. Children should be healthy."
splitted_text = text.lower().split(' ')
1. Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
porter = nltk.PorterStemmer()
[porter.stem(t) for t in splitted_text]
['children',
"shouldn't",
'drink',
'sugari',
'drink',
'befor',
'bed.',
'children',
'should',
'be',
'healthy.']
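Note that a stem need not be a real word ('sugari', 'befor' above). NLTK also ships other stemmers; the lines below are a minimal sketch using the Snowball ("Porter2") stemmer, whose output may differ from Porter's on some tokens.
snowball = nltk.SnowballStemmer('english')  # English Snowball stemmer bundled with NLTK
[snowball.stem(t) for t in splitted_text]   # compare with the Porter output above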
2. Lemmatisation
Lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document.
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in splitted_text]
['child',
"shouldn't",
'drink',
'sugary',
'drink',
'before',
'bed.',
'child',
'should',
'be',
'healthy.']
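By default lemmatize() treats every token as a noun (pos='n'), which is why 'sugary' and "shouldn't" pass through unchanged. Passing the part of speech can change the result; the calls below are a small sketch (the practice section later uses the same idea with pos 'v').
WNlemma.lemmatize('was', pos='v')     # -> 'be' when treated as a verb
WNlemma.lemmatize('drinks', pos='v')  # -> 'drink'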
3. Tokenization
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing.
text_tokens = nltk.word_tokenize(text)
text_tokens
['Children',
'should',
"n't",
'drink',
'sugary',
'drinks',
'before',
'bed',
'.',
'Children',
'should',
'be',
'healthy',
'.']
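NLTK can also tokenize at the sentence level with nltk.sent_tokenize (it needs the 'punkt' resource installed by nltk.download()); sentence tokenization is used again in the practice section below.
nltk.sent_tokenize(text)  # should return the two sentences of `text` as separate strings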
4. Frequency of words
FreqDist(text_list) returns a dictionary-like object whose keys are the unique tokens in text_list and whose values are the number of occurrences of each token in text_list.
from nltk.probability import FreqDist
FreqDist(text_tokens)
FreqDist({'Children': 2, 'should': 2, '.': 2, "n't": 1, 'drink': 1, 'sugary': 1, 'drinks': 1, 'before': 1, 'bed': 1, 'be': 1, ...})
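FreqDist behaves like a dictionary (it is a Counter subclass), so you can look up individual tokens or ask for the most frequent ones. A minimal sketch, with an assumed variable name dist:
dist = FreqDist(text_tokens)
dist['Children']      # 2
dist.most_common(3)   # the three most frequent (token, count) pairs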
5. POS tagging
Tags each token with its part of speech.
nltk.pos_tag(text_tokens)
[('Children', 'NNP'),
('should', 'MD'),
("n't", 'RB'),
('drink', 'VB'),
('sugary', 'JJ'),
('drinks', 'NNS'),
('before', 'IN'),
('bed', 'NN'),
('.', '.'),
('Children', 'NNP'),
('should', 'MD'),
('be', 'VB'),
('healthy', 'JJ'),
('.', '.')]
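If you do not remember what a tag such as 'MD' or 'NNS' stands for, NLTK can print the Penn Treebank tag documentation (this needs the 'tagsets' resource from nltk.download()):
nltk.help.upenn_tagset('MD')   # modal auxiliary (can, should, will, ...)
nltk.help.upenn_tagset('NNS')  # noun, common, plural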
6. Parsing
Given a grammar that you provide, a parser detects the grammatical structure of a sentence. For example, if you want to detect an 'S -> NP VP, VP -> V NP' structure in a sentence, you can do the following.
# Parsing sentence structure
parsing_text = nltk.word_tokenize("Alice loves Bob")
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")
# grammar = nltk.data.load('mygrammar.cfg') # You can also load grammar from external file.
parser = nltk.ChartParser(grammar)
trees = parser.parse_all(parsing_text)
for tree in trees:
    print(tree)
(S (NP Alice) (VP (V loves) (NP Bob)))
You can also load the grammar from an external file (e.g. nltk.data.load('mygrammar.cfg')) and pass it to nltk.ChartParser in exactly the same way.
* POS tagging and parsing ambiguity
Many sentences can be interpreted, and therefore POS-tagged, in more than one way; nltk.pos_tag returns only one reading (a parsing example of this kind of ambiguity is sketched at the end of this subsection).
Also, as the example below shows, nltk.pos_tag does not recognize uncommon uses of words.
text18 = nltk.word_tokenize("The old man the boat")
nltk.pos_tag(text18)
[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]
nltk.pos_tag also does a poor job of tagging grammatically well-formed sentences that are meaningless.
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")
nltk.pos_tag(text19)
[('Colorless', 'NNP'),
('green', 'JJ'),
('ideas', 'NNS'),
('sleep', 'VBP'),
('furiously', 'RB')]
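Parsing ambiguity itself can be made explicit with a CFG that licenses more than one tree for the same sentence. The grammar below is adapted from the classic "I shot an elephant in my pajamas" example in the NLTK book and is a sketch, not part of the original notes: the prepositional phrase can attach either to the noun phrase or to the verb phrase, so the parser returns two trees.
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
ambig_parser = nltk.ChartParser(groucho_grammar)
for tree in ambig_parser.parse_all(nltk.word_tokenize("I shot an elephant in my pajamas")):
    print(tree)  # prints two parse trees, one per PP attachment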
7. Spelling Recommender
Given a misspelled word, recommend a correct spelling. The recommended word should start with the same letter as the misspelled word. To do this, first transform the words into n-grams; then compute the similarity between the n-grams of the misspelled word and the n-grams of every correct word in a word corpus, using distance functions.
from nltk.corpus import words
correct_spellings = words.words() # all words in nltk.corpus.words
len(correct_spellings)
236736
What is an n-gram?
list(nltk.ngrams('congratulations', n=3))
[('c', 'o', 'n'),
('o', 'n', 'g'),
('n', 'g', 'r'),
('g', 'r', 'a'),
('r', 'a', 't'),
('a', 't', 'u'),
('t', 'u', 'l'),
('u', 'l', 'a'),
('l', 'a', 't'),
('a', 't', 'i'),
('t', 'i', 'o'),
('i', 'o', 'n'),
('o', 'n', 's')]
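nltk.ngrams works on any sequence: applied to a string it yields character n-grams (which the spelling recommender below relies on), and applied to a token list it yields word n-grams. A quick sketch:
list(nltk.ngrams(text_tokens, n=2))[:3]  # first three word bigrams of the example sentence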
Jaccard distance
one minus the size of the intersection divided by the size of the union of the two sets (identical sets give a distance of 0, disjoint sets a distance of 1)
nltk.jaccard_distance(set(nltk.ngrams('graduate', n=3)), set(nltk.ngrams('gradually', n=3)))
0.5555555555555556
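You can verify this value by counting shared trigrams by hand; a quick check (variable names are mine):
a = set(nltk.ngrams('graduate', n=3))
b = set(nltk.ngrams('gradually', n=3))
len(a & b), len(a | b)        # 4 shared trigrams, 9 distinct trigrams in total
1 - len(a & b) / len(a | b)   # 1 - 4/9 = 0.555...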
Edit distance
the minimum number of single-character operations (insertions, deletions, or substitutions) required to change one word into the other. nltk.edit_distance can additionally count transpositions of two adjacent characters if you pass transpositions=True, but it does not by default.
nltk.edit_distance('graduate', 'gradually')
3
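To see where the 3 comes from, here is one minimal edit sequence (this decomposition is my own illustration, not something nltk reports): substitute 't' with 'l', substitute 'e' with 'l', then insert 'y'. Each intermediate pair is at edit distance 1:
for a, b in [('graduate', 'graduale'), ('graduale', 'graduall'), ('graduall', 'gradually')]:
    print(a, '->', b, nltk.edit_distance(a, b))  # each line prints a distance of 1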
1. Use Jaccard distance on the trigrams of the two words
entries=['cormulent', 'incendenece', 'validrate']
recommendations = []
for entry in entries:
    recommendations.append(sorted([(nltk.jaccard_distance(set(nltk.ngrams(entry, n=3)), set(nltk.ngrams(a, n=3))), a)
                                   for a in correct_spellings if a.startswith(entry[0])])[0][1])
recommendations
['corpulent', 'indecence', 'validate']
2. Use Edit distance on the two words
recommendations = []
for entry in entries:
    recommendations.append(sorted([(nltk.edit_distance(entry, a), a) for a in correct_spellings if a.startswith(entry[0])])[0][1])
recommendations
['corpulent', 'intendence', 'validate']
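The two recommenders differ only in the distance function, so it is natural to wrap them in a helper. The function below is a sketch (its name and signature are my own, not from the notes); note that min() breaks ties differently from the sorted() one-liners above, so results may differ on exact ties.
def recommend(entry, distance, candidates=correct_spellings):
    """Return the candidate word starting with the same letter that minimizes the distance."""
    same_initial = (w for w in candidates if w.startswith(entry[0]))
    return min(same_initial, key=lambda w: distance(entry, w))

# Usage, assuming the same entries as above:
# recommend('cormulent', nltk.edit_distance)
# recommend('cormulent', lambda a, b: nltk.jaccard_distance(set(nltk.ngrams(a, n=3)),
#                                                           set(nltk.ngrams(b, n=3))))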
Practices with moby.txt
with open('moby.txt', 'r') as f:
    moby_raw = f.read()
moby_tokens = nltk.word_tokenize(moby_raw)
1. After lemmatizing the verbs, how many unique tokens does the text have?
lemmatizer = nltk.WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w,'v') for w in moby_tokens]
len(set(lemmatized))
16900
2. What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)
len(set(moby_tokens))/len(moby_tokens)
0.08139566804842562
3. What are the 10 most frequently occurring words (excluding non-alphabetic tokens such as '.')?
freq = FreqDist(moby_tokens)
freq_tuple_list = [(freq[x], x) for x in set(moby_tokens) if x.isalpha()]
sorted(freq_tuple_list, key = lambda x: x[0], reverse = True)[:10]
[(13715, 'the'),
(6513, 'of'),
(6010, 'and'),
(4545, 'a'),
(4515, 'to'),
(3908, 'in'),
(2978, 'that'),
(2459, 'his'),
(2196, 'it'),
(2097, 'I')]
4. Find the longest word and that word’s length.
len_dict = {x: len(x) for x in set(moby_tokens)}
sorted(len_dict.items(), key = lambda x: x[1], reverse = True)[0]
("twelve-o'clock-at-night", 23)
5. What is the average number of tokens per sentence?
sentences = nltk.sent_tokenize(moby_raw) # tokenize to sentences
np.array(list(map(lambda x: len(nltk.word_tokenize(x)), sentences))).mean()
25.881952902963864
6. What are the 5 most frequent parts of speech in this text? What is their frequency?
from collections import Counter
moby_pos = nltk.pos_tag(moby_tokens)
pos_freq_dict = Counter([p for w, p in moby_pos])
sorted(pos_freq_dict.items(), key = lambda x: x[1], reverse = True)[:5]
[('NN', 32730), ('IN', 28657), ('DT', 25867), (',', 19204), ('JJ', 17620)]