
2 latent methods for dimension reduction and topic modeling

Photo: https://pixabay.com/en/golden-gate-bridge-women-back-1030999/

Before today's state-of-the-art word embedding techniques, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) were the go-to approaches for many NLP problems. Both LSA and LDA take the same input: a bag-of-words matrix. LSA focuses on reducing the matrix dimension, while LDA solves topic modeling problems.

I will not go through the mathematical details, as there is a lot of great material for that; you may check the references below. For the sake of keeping it easy to understand, I did not do pre-processing such as stopword removal, even though it is a critical step when you use LSA, LSI or LDA in practice. After reading this article, you will know:

  • Latent Semantic Analysis (LSA)
  • Latent Dirichlet Allocation (LDA)
  • Take Away

Latent Semantic Analysis (LSA)

LSA was introduced by Deerwester et al. in 1990. The objective of LSA here is to reduce the dimension of the input for classification. The idea is that words will occur in similar pieces of text if they have similar meanings. In the NLP field, Latent Semantic Indexing (LSI) is often used as an alternative name.

First of all, we have m documents and n words as input. An m * n matrix can be constructed, where rows are documents and columns are words. You can fill it with count occurrences or TF-IDF scores. TF-IDF is usually better than raw counts, because a high raw frequency alone does not necessarily help classification.
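As a quick illustration of this input matrix, here is a minimal sketch on a toy two-document corpus (the corpus and variable names are my own, not from the original article):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = [
    "the stock market is up today",
    "sunny weather with a light cloud today",
]

# Count occurrence: each cell is the raw term frequency in a document
count_matrix = CountVectorizer().fit_transform(toy_corpus)
print(count_matrix.shape)  # (2 documents, vocabulary size)

# TF-IDF: each cell is the term frequency weighted by how rare the term is
tfidf_matrix = TfidfVectorizer().fit_transform(toy_corpus)
print(tfidf_matrix.toarray().round(2))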

Figure: TF-IDF weighting (source: http://mropengate.blogspot.com/2016/04/tf-idf-in-r-language.html)

The idea of TF-IDF is that a high raw frequency may not provide much information gain. In other words, rare words contribute more weight to the model. A word's importance increases with the number of occurrences within the same document (i.e. the training record), and decreases the more it occurs across the rest of the corpus (i.e. other training records). For more detail, you may check this blog post.
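In its classic form (there are several variants; scikit-learn's implementation, for example, adds smoothing), the weight of a term t in document d is:

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is how often t appears in d, df(t) is the number of documents containing t, and N is the total number of documents.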

The challenge is that the matrix is very sparse (high dimensional) and noisy (it includes lots of low-frequency words). So truncated SVD is adopted to reduce the dimension.


The idea of truncated SVD is to find the most valuable information and use a lower dimension t to represent the same thing.
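Concretely (my notation, not from the original article), truncated SVD keeps only the top t singular values of the document-term matrix A:

A ≈ A_t = U_t Σ_t V_tᵀ

where A is the m * n TF-IDF matrix, Σ_t holds the t largest singular values, and the rows of U_t Σ_t are the t-dimensional document representations used for classification below.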

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

dim = 50  # number of latent dimensions to keep

# Build the TF-IDF document-term matrix (x_train / x_test are raw text documents)
tfidf_vec = TfidfVectorizer(use_idf=True, norm='l2')
svd = TruncatedSVD(n_components=dim)
transformed_x_train = tfidf_vec.fit_transform(x_train)
transformed_x_test = tfidf_vec.transform(x_test)
print('TF-IDF output shape:', transformed_x_train.shape)

# Project the sparse TF-IDF matrix down to `dim` dimensions with truncated SVD
x_train_svd = svd.fit_transform(transformed_x_train)
x_test_svd = svd.transform(transformed_x_test)
print('LSA output shape:', x_train_svd.shape)

explained_variance = svd.explained_variance_ratio_.sum()
print("Sum of explained variance ratio: %d%%" % (int(explained_variance * 100)))

Output

TF-IDF output shape: (11314, 130107)
LSA output shape: (11314, 50)
Sum of explained variance ratio: 8%

We can see that the dimension is reduced from about 130k to just 50.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

# Fit a classifier on the 50-dimensional LSA features
lr_model = LogisticRegression(solver='newton-cg', n_jobs=-1)
lr_model.fit(x_train_svd, y_train)

# Report 5-fold cross-validated accuracy on the test features
cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(lr_model, x_test_svd, y_test, cv=cv, scoring='accuracy')
print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))

Output

Accuracy: 0.6511 (+/- 0.0201)

Latent Dirichlet Allocation (LDA)

LDA was introduced by David Blei, Andrew Ng and Michael I. Jordan in 2003. It is an unsupervised learning method, and topic modeling is the typical application. The assumption is that each document is a mixture of various topics, and every topic is a mixture of various words.

Figure: various words under various topics

Intuitively, you can imagine that we have two layers of aggregation. The first layer is the distribution of categories. For example, we have finance news, weather news and political news. The second layer is the distribution of words within each category. For instance, we can find “sunny” and “cloud” in weather news, while “money” and “stock” appear in finance news.

However, words like “a”, “with” and “can” do not contribute to the topic modeling problem. Such words exist across documents and will have roughly the same probability under every category. Therefore, stopword removal is a critical step for achieving a better result; the sketch below shows one way to do it.
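As one possible way to add this step (not part of the original code), scikit-learn's vectorizers accept a stop-word list directly:

from sklearn.feature_extraction.text import CountVectorizer

# Drop common English function words before building the bag-of-words matrix;
# a custom stop-word list could be passed instead of the built-in 'english' set.
# min_df=2 additionally drops words that appear in only one document.
vec = CountVectorizer(stop_words='english', min_df=2)
bow = vec.fit_transform(x_train)  # x_train: the same raw documents as before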


For a particular document d, we get its topic distribution θ. From this distribution θ, a topic t is chosen, and the corresponding word is then selected from that topic's word distribution ϕ.
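Written out as the usual LDA generative process (my summary, reusing the θ and ϕ above, with Dirichlet priors α and β):

θ_d ~ Dirichlet(α)   (topic distribution of document d)
ϕ_k ~ Dirichlet(β)   (word distribution of topic k)
for each word position n in document d:
    z_(d,n) ~ Multinomial(θ_d)           (choose a topic)
    w_(d,n) ~ Multinomial(ϕ_(z_(d,n)))   (choose a word from that topic)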

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def build_lda(x_train, num_of_topic=10):
    # LDA expects raw term counts, so use CountVectorizer rather than TF-IDF
    vec = CountVectorizer()
    transformed_x_train = vec.fit_transform(x_train)
    # get_feature_names() was renamed to get_feature_names_out() in newer scikit-learn
    feature_names = vec.get_feature_names()

    lda = LatentDirichletAllocation(
        n_components=num_of_topic, max_iter=5,
        learning_method='online', random_state=0)
    lda.fit(transformed_x_train)
    return lda, vec, feature_names

def display_word_distribution(model, feature_names, n_word):
    # For each topic, print the n_word terms with the highest weight
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        words = []
        for i in topic.argsort()[:-n_word - 1:-1]:
            words.append(feature_names[i])
        print(words)

lda_model, vec, feature_names = build_lda(x_train)
display_word_distribution(
    model=lda_model, feature_names=feature_names,
    n_word=5)

Output

Topic 0:
['the', 'for', 'and', 'to', 'edu']
Topic 1:
['c_', 'w7', 'hz', 'mv', 'ck']
Topic 2:
['space', 'nasa', 'cmu', 'science', 'edu']
Topic 3:
['the', 'to', 'of', 'for', 'and']
Topic 4:
['the', 'to', 'of', 'and', 'in']
Topic 5:
['the', 'of', 'and', 'in', 'were']
Topic 6:
['edu', 'team', 'he', 'game', '10']
Topic 7:
['ax', 'max', 'g9v', 'b8f', 'a86']
Topic 8:
['db', 'bike', 'ac', 'image', 'dod']
Topic 9:
['nec', 'mil', 'navy', 'sg', 'behanna']
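As the topics above show, several of them collapse onto function words because no stopword removal was done. Once the model is fitted, one possible way (not shown in the original article) to inspect which topics a single document mixes is LatentDirichletAllocation.transform:

# Per-document topic mixture: a vector of 10 probabilities that sums to ~1
doc_topic = lda_model.transform(vec.transform([x_train[0]]))
print(doc_topic.round(2))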

Take Away

To access all the code, you can visit my GitHub repo.

  • Both of them use a bag-of-words matrix as input.
  • The challenge of SVD is that it is hard to determine the optimal number of dimensions. In general, a low dimension consumes fewer resources, but we may not be able to distinguish words with opposite meanings, while a high dimension overcomes this at the cost of more resources. One way to pick a value is sketched below.
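As a simple heuristic (my own addition, not from the article), you can scan the cumulative explained variance of TruncatedSVD over a few candidate dimensions and pick the smallest one that works well enough downstream:

from sklearn.decomposition import TruncatedSVD

# transformed_x_train is the TF-IDF matrix built earlier
for dim in (50, 100, 200, 400):
    svd = TruncatedSVD(n_components=dim)
    svd.fit(transformed_x_train)
    ratio = svd.explained_variance_ratio_.sum()
    print("dim=%d explained variance=%.1f%%" % (dim, ratio * 100))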

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. You can reach me on Medium, LinkedIn or GitHub.

Reference

[1] SVD Tutorial: https://cs.fit.edu/~dmitra/SciComp/Resources/singular-value-decomposition-fast-track-tutorial.pdf

[2] CUHK LSI Tutorial: http://www1.se.cuhk.edu.hk/~seem5680/lecture/LSI-Eg.pdf

[3] Stanford LSI Tutorial: https://nlp.stanford.edu/IR-book/pdf/18lsi.pdf

[4] LSA and LDA Explanation: https://cs.stanford.edu/~ppasupat/a9online/1140.html

