Basic Natural Language Processing
Original Source: https://www.coursera.org/specializations/data-science-python
import nltk
# nltk.download()  # only needed the first time, to download the NLTK data packages
import numpy as np
text = "Children shouldn't drink sugary drinks before bed. Children should be healthy."
splitted_text = text.lower().split(' ')
1. Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
porter = nltk.PorterStemmer()
[porter.stem(t) for t in splitted_text]
['children',
"shouldn't",
'drink',
'sugari',
'drink',
'befor',
'bed.',
'children',
'should',
'be',
'healthy.']
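Note that a stem need not be a real word ('sugari', 'befor' above). NLTK also ships other stemmers; the lines below are a minimal sketch using the Snowball ("Porter2") stemmer, whose output may differ from Porter's on some tokens.
snowball = nltk.SnowballStemmer('english')  # English Snowball stemmer bundled with NLTK
[snowball.stem(t) for t in splitted_text]   # compare with the Porter output above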
2. Lemmatisation
Lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document.
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in splitted_text]
['child',
"shouldn't",
'drink',
'sugary',
'drink',
'before',
'bed.',
'child',
'should',
'be',
'healthy.']
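By default lemmatize() treats every token as a noun (pos='n'), which is why 'sugary' and "shouldn't" pass through unchanged. Passing the part of speech can change the result; the calls below are a small sketch (the practice section later uses the same idea with pos 'v').
WNlemma.lemmatize('was', pos='v')     # -> 'be' when treated as a verb
WNlemma.lemmatize('drinks', pos='v')  # -> 'drink'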
3. Tokenization
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing.
text_tokens = nltk.word_tokenize(text)
text_tokens
['Children',
'should',
"n't",
'drink',
'sugary',
'drinks',
'before',
'bed',
'.',
'Children',
'should',
'be',
'healthy',
'.']
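NLTK can also tokenize at the sentence level with nltk.sent_tokenize (it needs the 'punkt' resource installed by nltk.download()); sentence tokenization is used again in the practice section below.
nltk.sent_tokenize(text)  # should return the two sentences of `text` as separate strings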
4. Frequency of words
FreqDist(text_list) returns a dictionary-like object whose keys are the unique tokens in text_list and whose values are the number of occurrences of each token in text_list.
from nltk.probability import FreqDist
FreqDist(text_tokens)
FreqDist({'Children': 2, 'should': 2, '.': 2, "n't": 1, 'drink': 1, 'sugary': 1, 'drinks': 1, 'before': 1, 'bed': 1, 'be': 1, ...})
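FreqDist behaves like a dictionary (it is a Counter subclass), so you can look up individual tokens or ask for the most frequent ones. A minimal sketch, with an assumed variable name dist:
dist = FreqDist(text_tokens)
dist['Children']      # 2
dist.most_common(3)   # the three most frequent (token, count) pairs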
5. POS tagging
Tags each token with its part of speech.
nltk.pos_tag(text_tokens)
[('Children', 'NNP'),
('should', 'MD'),
("n't", 'RB'),
('drink', 'VB'),
('sugary', 'JJ'),
('drinks', 'NNS'),
('before', 'IN'),
('bed', 'NN'),
('.', '.'),
('Children', 'NNP'),
('should', 'MD'),
('be', 'VB'),
('healthy', 'JJ'),
('.', '.')]
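If you do not remember what a tag such as 'MD' or 'NNS' stands for, NLTK can print the Penn Treebank tag documentation (this needs the 'tagsets' resource from nltk.download()):
nltk.help.upenn_tagset('MD')   # modal auxiliary (can, should, will, ...)
nltk.help.upenn_tagset('NNS')  # noun, common, plural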
6. Parsing
Given a grammar that you provide, a parser detects the grammatical structure of a sentence. For example, if you want to detect an 'S -> NP VP, VP -> V NP' structure in a sentence, you can do the following.
# Parsing sentence structure
parsing_text = nltk.word_tokenize("Alice loves Bob")
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")
# grammar = nltk.data.load('mygrammar.cfg') # You can also load grammar from external file.
parser = nltk.ChartParser(grammar)
trees = parser.parse_all(parsing_text)
for tree in trees:
    print(tree)
(S (NP Alice) (VP (V loves) (NP Bob)))
You can also load the grammar from an external file (e.g. nltk.data.load('mygrammar.cfg')) and pass it to nltk.ChartParser in exactly the same way.
* POS tagging and parsing ambiguity
Many sentences can be interpreted, and therefore POS-tagged, in more than one way; nltk.pos_tag returns only one reading (a parsing example of this kind of ambiguity is sketched at the end of this subsection).
Also, as the example below shows, nltk.pos_tag does not recognize uncommon uses of words.
text18 = nltk.word_tokenize("The old man the boat")
nltk.pos_tag(text18)
[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]
nltk.pos_tag also does a poor job of tagging grammatically well-formed sentences that are meaningless.
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")
nltk.pos_tag(text19)
[('Colorless', 'NNP'),
('green', 'JJ'),
('ideas', 'NNS'),
('sleep', 'VBP'),
('furiously', 'RB')]
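Parsing ambiguity itself can be made explicit with a CFG that licenses more than one tree for the same sentence. The grammar below is adapted from the classic "I shot an elephant in my pajamas" example in the NLTK book and is a sketch, not part of the original notes: the prepositional phrase can attach either to the noun phrase or to the verb phrase, so the parser returns two trees.
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
ambig_parser = nltk.ChartParser(groucho_grammar)
for tree in ambig_parser.parse_all(nltk.word_tokenize("I shot an elephant in my pajamas")):
    print(tree)  # prints two parse trees, one per PP attachment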
7. Spelling Recommender
Given a misspelled word, recommend a correct spelling. The recommended word should start with the same letter as the misspelled word. To do this, first transform the words into n-grams; then compute the similarity between the n-grams of the misspelled word and the n-grams of every correct word in a word corpus, using distance functions.
from nltk.corpus import words
correct_spellings = words.words() # all words in nltk.corpus.words
len(correct_spellings)
236736
What is an n-gram?
list(nltk.ngrams('congratulations', n=3))
[('c', 'o', 'n'),
('o', 'n', 'g'),
('n', 'g', 'r'),
('g', 'r', 'a'),
('r', 'a', 't'),
('a', 't', 'u'),
('t', 'u', 'l'),
('u', 'l', 'a'),
('l', 'a', 't'),
('a', 't', 'i'),
('t', 'i', 'o'),
('i', 'o', 'n'),
('o', 'n', 's')]
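nltk.ngrams works on any sequence: applied to a string it yields character n-grams (which the spelling recommender below relies on), and applied to a token list it yields word n-grams. A quick sketch:
list(nltk.ngrams(text_tokens, n=2))[:3]  # first three word bigrams of the example sentence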
Jaccard distance
one minus the size of the intersection divided by the size of the union of the two sets (identical sets give a distance of 0, disjoint sets a distance of 1)
nltk.jaccard_distance(set(nltk.ngrams('graduate', n=3)), set(nltk.ngrams('gradually', n=3)))
0.5555555555555556
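You can verify this value by counting shared trigrams by hand; a quick check (variable names are mine):
a = set(nltk.ngrams('graduate', n=3))
b = set(nltk.ngrams('gradually', n=3))
len(a & b), len(a | b)        # 4 shared trigrams, 9 distinct trigrams in total
1 - len(a & b) / len(a | b)   # 1 - 4/9 = 0.555...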
Edit distance
the minimum number of single-character operations (insertions, deletions, or substitutions) required to change one word into the other. nltk.edit_distance can additionally count transpositions of two adjacent characters if you pass transpositions=True, but it does not by default.
nltk.edit_distance('graduate', 'gradually')
3
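To see where the 3 comes from, here is one minimal edit sequence (this decomposition is my own illustration, not something nltk reports): substitute 't' with 'l', substitute 'e' with 'l', then insert 'y'. Each intermediate pair is at edit distance 1:
for a, b in [('graduate', 'graduale'), ('graduale', 'graduall'), ('graduall', 'gradually')]:
    print(a, '->', b, nltk.edit_distance(a, b))  # each line prints a distance of 1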
1. Use Jaccard distance on the trigrams of the two words
entries=['cormulent', 'incendenece', 'validrate']
recommendations = []
for entry in entries:
    recommendations.append(sorted([(nltk.jaccard_distance(set(nltk.ngrams(entry, n=3)), set(nltk.ngrams(a, n=3))), a)
                                   for a in correct_spellings if a.startswith(entry[0])])[0][1])
recommendations
['corpulent', 'indecence', 'validate']
2. Use Edit distance on the two words
recommendations = []
for entry in entries:
    recommendations.append(sorted([(nltk.edit_distance(entry, a), a) for a in correct_spellings if a.startswith(entry[0])])[0][1])
recommendations
['corpulent', 'intendence', 'validate']
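The two recommenders differ only in the distance function, so it is natural to wrap them in a helper. The function below is a sketch (its name and signature are my own, not from the notes); note that min() breaks ties differently from the sorted() one-liners above, so results may differ on exact ties.
def recommend(entry, distance, candidates=correct_spellings):
    """Return the candidate word starting with the same letter that minimizes the distance."""
    same_initial = (w for w in candidates if w.startswith(entry[0]))
    return min(same_initial, key=lambda w: distance(entry, w))

# Usage, assuming the same entries as above:
# recommend('cormulent', nltk.edit_distance)
# recommend('cormulent', lambda a, b: nltk.jaccard_distance(set(nltk.ngrams(a, n=3)),
#                                                           set(nltk.ngrams(b, n=3))))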
Practices with moby.txt
with open('moby.txt', 'r') as f:
    moby_raw = f.read()
moby_tokens = nltk.word_tokenize(moby_raw)
1. After lemmatizing the verbs, how many unique tokens does the text have?
lemmatizer = nltk.WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w,'v') for w in moby_tokens]
len(set(lemmatized))
16900
2. What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)
len(set(moby_tokens))/len(moby_tokens)
0.08139566804842562
3. What are the 10 most frequently occurring words (excluding non-alphabetic tokens such as '.')?
freq = FreqDist(moby_tokens)
freq_tuple_list = [(freq[x], x) for x in set(moby_tokens) if x.isalpha()]
sorted(freq_tuple_list, key = lambda x: x[0], reverse = True)[:10]
[(13715, 'the'),
(6513, 'of'),
(6010, 'and'),
(4545, 'a'),
(4515, 'to'),
(3908, 'in'),
(2978, 'that'),
(2459, 'his'),
(2196, 'it'),
(2097, 'I')]
4. Find the longest word and that word’s length.
len_dict = {x: len(x) for x in set(moby_tokens)}
sorted(len_dict.items(), key = lambda x: x[1], reverse = True)[0]
("twelve-o'clock-at-night", 23)
5. What is the average number of tokens per sentence?
sentences = nltk.sent_tokenize(moby_raw) # tokenize to sentences
np.array(list(map(lambda x: len(nltk.word_tokenize(x)), sentences))).mean()
25.881952902963864
6. What are the 5 most frequent parts of speech in this text? What is their frequency?
from collections import Counter
moby_pos = nltk.pos_tag(moby_tokens)
pos_freq_dict = Counter([p for w, p in moby_pos])
sorted(pos_freq_dict.items(), key = lambda x: x[1], reverse = True)[:5]
[('NN', 32730), ('IN', 28657), ('DT', 25867), (',', 19204), ('JJ', 17620)]