Classification of Text
Original Source: https://www.coursera.org/specializations/data-science-python
Text Classification
Examples of Text Classification
- Topic identification: Is this news article about Art, Sports, or Technology?
- Spam detection: Is this email spam or not?
- Sentiment analysis: Is this movie review positive or negative?
- Spelling correction: ‘weather’ or ‘whether’?
Supervised Classification
Given labeled instances and their properties (features), learn a classification model, including feature importances (weights). Then apply the model to new instances to predict their labels.
Classification Paradigms
- Binary Classification: When there are only two possible classes
- Multi-class Classification: When there are more than two possible classes
- Multi-label Classification: When each data instance can have two or more labels
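A minimal sketch of how the three paradigms differ in the shape of the label data; the toy labels and the use of MultiLabelBinarizer are illustrative assumptions, not part of the original notes.
# binary: one label per instance, two possible classes
y_binary = [0, 1, 1, 0]
# multi-class: one label per instance, more than two possible classes
y_multiclass = ['Art', 'Sports', 'Technology', 'Sports']
# multi-label: each instance may carry several labels at once;
# MultiLabelBinarizer turns the label sets into a binary indicator matrix
from sklearn.preprocessing import MultiLabelBinarizer
y_multilabel = MultiLabelBinarizer().fit_transform(
    [{'Sports'}, {'Art', 'Technology'}, {'Technology'}, {'Art'}])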
Handling Textual Features
Commonly, words are used as the features of each text (a few of these choices are sketched after this list).
- Should we give smaller weight to commonly occurring words?
- Should we lowercase the text or leave it as is?
- Consider stemming and lemmatization
- Consider capitalization (e.g. US vs. us)
- Consider part of speech and grammatical structure (parsing)
- Consider grouping words of similar meaning (e.g. {buy, purchase})
- Consider grouping all digits as one feature
- Consider n-grams (e.g. White House)
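As referenced above, a small sketch of a few of these preprocessing choices using NLTK; the example sentence and the specific normalizers are assumptions for illustration, and the WordNet data must be downloaded once.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')                 # one-time download for the lemmatizer

text = "The US is buying new computers"
tokens = text.lower().split()            # lowercasing loses the US vs. us distinction

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])          # crude suffix stripping, e.g. 'buying' -> 'buy'

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary-based normalization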
Naive Bayes Classifier
Given a text X, what is the probability that each label y occurs?
Bayes' Rule: Pr(y|X) = Pr(y)Pr(X|y)/Pr(X)
optimal_y = argmax_y Pr(y|X) = argmax_y Pr(y)Pr(X|y)
Naive assumption: given the class label, the features are assumed to be independent of each other.
optimal_y = argmax_y Pr(y|X) = argmax_y Pr(y)Pr(x1|y)Pr(x2|y)…Pr(xn|y), where xi is the i-th token in the text.
Example: Predict whether the topic of a query is computer science, zoology, or entertainment. If the query is “Python download”, we compare the following three values and choose the topic that maximizes the product:
- Pr(computer science) Pr(Python|computer science) Pr(download|computer science)
- Pr(zoology) Pr(Python|zoology) Pr(download|zoology)
- Pr(entertainment) Pr(Python|entertainment) Pr(download|entertainment)
Learning parameters
- Prior probabilities: Pr(y) for all y in Y. If there are N instances in total and n of them are labeled as class y, then Pr(y) = n/N.
- Likelihood: Pr(xi|y) for all features xi and labels y in Y. If there are p instances of class y and feature xi appears in k of them, then Pr(xi|y) = k/p.
Smoothing
If any Pr(xi|y) = 0, then Pr(y|X) will be 0 no matter how high the probabilities of the other features are. To prevent this, we apply Laplace smoothing (additive smoothing), so that the likelihood becomes
Pr(xi|y) = (k + 1)/(p + V), where V is the number of features (the vocabulary size).
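A short sketch that puts the prior, the smoothed likelihoods, and the argmax together for the “Python download” example; all document counts below are made-up toy numbers.
# toy per-class counts: number of documents, and how many of them contain each query word
counts = {
    'computer science': {'docs': 40, 'python': 30, 'download': 20},
    'zoology':          {'docs': 30, 'python': 10, 'download': 1},
    'entertainment':    {'docs': 30, 'python': 2,  'download': 15},
}
N = sum(c['docs'] for c in counts.values())   # total number of documents
V = 2                                         # toy vocabulary size used for smoothing

scores = {}
for label, c in counts.items():
    prior = c['docs'] / N                                   # Pr(y) = n/N
    lik_python = (c['python'] + 1) / (c['docs'] + V)        # Laplace-smoothed Pr(xi|y)
    lik_download = (c['download'] + 1) / (c['docs'] + V)
    scores[label] = prior * lik_python * lik_download       # Pr(y) * Pr(x1|y) * Pr(x2|y)

print(max(scores, key=scores.get))   # label with the highest score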
Naive Bayes Variations
- Multinomial Naive Bayes: Each feature value is the number of times a token occurs in the text (word counts).
- Bernoulli Naive Bayes: Each feature is binary (the word is present or absent).
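A minimal scikit-learn sketch of the two variants; the toy documents and labels are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ['python download free', 'python snake habitat', 'download movie trailer']
labels = ['computer science', 'zoology', 'entertainment']

X_counts = CountVectorizer().fit_transform(docs)              # token counts
multinomial = MultinomialNB().fit(X_counts, labels)           # count-valued features

X_binary = CountVectorizer(binary=True).fit_transform(docs)   # present / absent
bernoulli = BernoulliNB().fit(X_binary, labels)               # binary features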
Support Vector Machine (SVM) for text classification
- Support vector machines are maximum-margin classifiers
- Larger parameter C means less regularization
- Linear kernels usually work best for text data
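A brief sketch of a linear SVM on tf-idf features; the toy reviews, labels, and the choice of LinearSVC are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ['great movie, loved it', 'terrible plot and acting',
        'wonderful performance', 'boring and far too long']
labels = [1, 0, 1, 0]                     # toy sentiment labels

X = TfidfVectorizer().fit_transform(docs)
clf = LinearSVC(C=1.0).fit(X, labels)     # linear kernel; larger C means less regularization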
Model selection in Scikit-learn
Two ways to evaluate a model (a sketch follows this list):
- Hold-out split: X_train, X_test, y_train, y_test = train_test_split(train_data, train_labels), then model.score(X_test, y_test)
- Cross-validation: cross_val_score(model, train_data, train_labels, cv=5)
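A self-contained sketch of both evaluation routes; the pipeline, the toy messages, and cv=3 (rather than 5, because the toy set is tiny) are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_data = ['win a free prize now', 'meeting at noon', 'free free free offer',
              'lunch tomorrow?', 'claim your free reward', 'see you later']
train_labels = [1, 0, 1, 0, 1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())

# 1) single hold-out split
X_train, X_test, y_train, y_test = train_test_split(train_data, train_labels, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

# 2) k-fold cross-validation on the full training data
print(cross_val_score(model, train_data, train_labels, cv=3))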
Vectorizing Text
1. CountVectorizer
Transforms the text into a sparse matrix in which each column holds the number of occurrences of a token.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
'The last document?',
]
vect = CountVectorizer().fit(corpus)
vect.get_feature_names()
['and',
'document',
'first',
'is',
'last',
'one',
'second',
'the',
'third',
'this']
vect.transform(corpus).toarray()
array([[0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
[0, 1, 0, 0, 1, 0, 0, 1, 0, 0]], dtype=int64)
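Once fitted, the same vectorizer can transform unseen text; tokens outside the learned vocabulary are simply dropped. A brief illustrative example (not from the original notes):
vect.transform(['This is a brand new document.']).toarray()
# -> array([[0, 1, 0, 1, 0, 0, 0, 0, 0, 1]])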
2. TfidfVectorizer
The tf–idf (term frequency–inverse document frequency) value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. This adjusts for the fact that some words appear more frequently in general.
tf-idf(d, t) = tf(d, t) * idf(t)
tf(d, t) = term frequency of word t in document d
idf(t) = log(n / (1 + df(t))), where df(t) = number of documents that contain word t
n = total number of documents
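A small sketch of the plain formula above, reusing the corpus defined earlier. Note that scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row, so its output below will not match these raw values exactly.
import math

docs = [d.lower().split() for d in corpus]       # crude whitespace tokenization
n = len(docs)                                    # total number of documents

def tfidf(term, doc):
    tf = doc.count(term)                         # tf(d, t): term frequency in the document
    df = sum(term in d for d in docs)            # df(t): documents containing the term
    return tf * math.log(n / (1 + df))           # tf(d, t) * idf(t)

print(tfidf('second', docs[1]))                  # 'second' appears twice in the second document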
from sklearn.feature_extraction.text import TfidfVectorizer
tfidv = TfidfVectorizer().fit(corpus)
tfidv.transform(corpus).toarray()
array([[0. , 0.38947624, 0.55775063, 0.4629834 , 0. ,
0. , 0. , 0.32941651, 0. , 0.4629834 ],
[0. , 0.24151532, 0. , 0.28709733, 0. ,
0. , 0.85737594, 0.20427211, 0. , 0.28709733],
[0.55666851, 0. , 0. , 0. , 0. ,
0.55666851, 0. , 0.26525553, 0.55666851, 0. ],
[0. , 0.38947624, 0.55775063, 0.4629834 , 0. ,
0. , 0. , 0.32941651, 0. , 0.4629834 ],
[0. , 0.45333103, 0. , 0. , 0.80465933,
0. , 0. , 0.38342448, 0. , 0. ]])
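The fitted vectorizer also exposes the learned idf weights, one per vocabulary term, which shows which tokens the corpus treats as rare (a brief addition, not in the original notes):
tfidv.idf_                 # aligned with tfidv.get_feature_names()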
* Arguments
- stop_words
If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == ‘word’.
vect = CountVectorizer(stop_words='english').fit(corpus)
vect.get_feature_names()
['document', 'second']
- ngram_range
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
vect = CountVectorizer(ngram_range=(1,2)).fit(corpus)
vect.get_feature_names()
['and',
'and the',
'document',
'first',
'first document',
'is',
'is the',
'is this',
'last',
'last document',
'one',
'second',
'second document',
'second second',
'the',
'the first',
'the last',
'the second',
'the third',
'third',
'third one',
'this',
'this is',
'this the']
- analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable
Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
vect = CountVectorizer(analyzer='char_wb').fit(corpus)
vect.get_feature_names()
[' ',
'.',
'?',
'a',
'c',
'd',
'e',
'f',
'h',
'i',
'l',
'm',
'n',
'o',
'r',
's',
't',
'u']
- min_df
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold.
vect = CountVectorizer(min_df=2).fit(corpus)
vect.get_feature_names()
['document', 'first', 'is', 'the', 'this']
Exercise
Fit and transform the training data X_train using a Count Vectorizer ignoring terms that have a document frequency strictly lower than 5 and using character n-grams from n=2 to n=5.
Using this document-term matrix and the following additional features:
- the length of document (number of characters)
- number of digits per document
- number of non-word characters per document (anything other than a letter, digit, or underscore)
fit a Logistic Regression model with regularization C=100. Then compute the area under the curve (AUC) score using the transformed test data.
Also find the 10 smallest and 10 largest coefficients from the model and return them along with the AUC score in a tuple.
The list of 10 smallest coefficients should be sorted smallest first, the list of 10 largest coefficients should be sorted largest first.
If the three features added to the document-term matrix appear in the list of coefficients, they should have the names: [‘length_of_doc’, ‘digit_count’, ‘non_word_char_count’]
import pandas as pd
import numpy as np
spam_data = pd.read_csv('spam.csv')
spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_data.head()
 | text | target |
---|---|---|
0 | Go until jurong point, crazy.. Available only ... | 0 |
1 | Ok lar... Joking wif u oni... | 0 |
2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
3 | U dun say so early hor... U c already then say... | 0 |
4 | Nah I don't think he goes to usf, he lives aro... | 0 |
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'],
spam_data['target'],
random_state=0)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
def add_feature(X, feature_to_add):
"""
Returns sparse feature matrix with added feature.
feature_to_add can also be a list of features.
"""
from scipy.sparse import csr_matrix, hstack
return hstack([X, csr_matrix(feature_to_add).T], 'csr')
vect = CountVectorizer(min_df=5, ngram_range=(2,5), analyzer='char_wb').fit(X_train)
vectorized_X_train = vect.transform(X_train)
# append document length, digit count, and non-word-character count as extra columns
added_X_train = add_feature(
    add_feature(
        add_feature(vectorized_X_train, X_train.str.len()),
        X_train.str.count(r'\d')),
    X_train.str.count(r'\W'))
vectorized_X_test = vect.transform(X_test)
# same three extra features for the test set, using the vectorizer fitted on the training data
added_X_test = add_feature(
    add_feature(
        add_feature(vectorized_X_test, X_test.str.len()),
        X_test.str.count(r'\d')),
    X_test.str.count(r'\W'))
model = LogisticRegression(C=100).fit(added_X_train, y_train)
predictions = model.predict_proba(added_X_test)[:, 1]   # probability of the positive (spam) class
feature_names = np.array(list(vect.get_feature_names())+['length_of_doc', 'digit_count', 'non_word_char_count'])
sorted_coef_index = model.coef_[0].argsort()             # ascending: most negative coefficients first
sorted_feature_names = feature_names[sorted_coef_index]
roc_auc_score(y_test, predictions), sorted_feature_names[:10], np.fliplr([sorted_feature_names[-10:]])
(0.9952506663497614,
array(['. ', '..', '? ', ' i', ' y', ' go', ':)', ' h', 'go', ' m'],
dtype='<U19'),
array([['digit_count', 'ne', 'ia', 'co', 'xt', ' ch', 'mob', ' x', 'ww',
'ar']], dtype='<U19'))