NLP Tutorial: Topic Modeling in Python with BerTopic

Davis David (@davisdavid)

Data Scientist | AI Practitioner | Software Developer. Giving talks, teaching, writing.

Topic modeling is an unsupervised machine learning technique that automatically identifies the different topics present in a collection of documents (textual data). Data has become a key asset for running many businesses around the world. With topic modeling, you can collect unstructured datasets, analyze the documents, and obtain the relevant information you need to make better decisions.

There are different techniques for performing topic modeling (such as LDA), but in this NLP tutorial, you will learn how to use the BerTopic technique developed by Maarten Grootendorst.

Table of Contents:

  1. What is BerTopic
  2. How to Install BerTopic
  3. Load Olympic Tokyo Tweets Data
  4. Create BerTopic Model
  5. Select Top Topics
  6. Select One Topic
  7. Topic Modeling Visualization
  8. Topic Reduction
  9. Make Prediction 
  10. Save and Load Model

What is BerTopic?

BerTopic is a topic modeling technique that uses transformers (BERT embeddings) and class-based TF-IDF to create dense clusters. It also allows you to easily interpret and visualize the topics generated.

The BerTopic algorithm contains 3 stages:

1. Embed the textual data (documents)
In this step, the algorithm extracts document embeddings with BERT, or it can use any other embedding technique.

By default, it uses one of the following sentence-transformers models (a short sketch of supplying an embedding model explicitly follows the list):

  • "paraphrase-MiniLM-L6-v2"- This is an English BERT-based model trained specifically for semantic similarity tasks. 
  • "paraphrase-multilingual-MiniLM-L12-v2"- This is similar to the first, with one major difference is that the xlm models work for 50+ languages.

2. Cluster documents
It uses UMAP to reduce the dimensionality of the embeddings and HDBSCAN to cluster the reduced embeddings into groups of semantically similar documents.
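
You normally do not have to touch this step, but BerTopic accepts custom UMAP and HDBSCAN models if you want finer control over the clustering. A minimal sketch, with illustrative parameter values:

# A minimal sketch: supply custom UMAP and HDBSCAN models (parameter values are illustrative)
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean", prediction_data=True)
model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)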

3. Create a topic representation
The last step is to extract and reduce topics with class-based TF-IDF and then improve the coherence of the words with Maximal Marginal Relevance.
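
To build intuition for class-based TF-IDF: all documents in a cluster are joined into a single "class document", and TF-IDF is then computed across these class documents so that each topic gets its own word scores. A rough conceptual sketch (not BerTopic's actual implementation):

# Conceptual sketch of class-based TF-IDF, not BerTopic's exact implementation:
# join the documents of each cluster and run TF-IDF over the resulting class documents.
from sklearn.feature_extraction.text import TfidfVectorizer

clusters = {
    0: ["sweden scores a late goal", "great soccer match by the swedes"],
    1: ["proud of team india", "congrats on the gold medal"],
}
class_docs = [" ".join(docs) for docs in clusters.values()]
vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(class_docs)  # one row of word scores per topic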

How to Install BerTopic

You can install the package via pip:

pip install bertopic

If you are interested in the visualization options, you need to install them as follows.

pip install bertopic[visualization]

BerTopic supports different transformers and language backends that you can use to create a model. You can install one according to the options available below.

  • pip install bertopic[flair]
  • pip install bertopic[gensim]
  • pip install bertopic[spacy]
  • pip install bertopic[use]

The Libraries 

We will use the following libraries to load the data and create a model with BerTopic.

#import packages

import pandas as pd 
import numpy as np
from bertopic import BERTopic

Step 1. Load Data

In this NLP tutorial, we will use Tokyo 2020 Olympic tweets with the goal of creating a model that can automatically categorize the tweets by topic.

You can download the datasets here.

#load data 
import pandas as pd 
 
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/data/tokyo_2020_tweets.csv", engine='python')
 
# select only 6000 tweets 
df = df[0:6000]

NB: We selected only 6,000 tweets for computational reasons.

Step 2. Create Model

To create a model using BERTopic, you need to load the tweets as a list and then pass the list to the fit_transform method. This method will do the following:

  • Fit the model on the collection of tweets.
  • Generate topics.
  • Return the topic assigned to each tweet (together with its probability).
# create model 
 
model = BERTopic(verbose=True)
 
#convert to list 
docs = df.text.to_list()
 
topics, probabilities = model.fit_transform(docs)

Step 3. Select Top Topics

After training the model, you can view the generated topics and their sizes in descending order.

model.get_topic_freq().head(11)

Note: Topic -1 is the largest, and it refers to outlier tweets that are not assigned to any of the generated topics. In this case, we will ignore Topic -1.
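
If you want to inspect the topic sizes without the outlier topic, you can filter it out of the returned DataFrame. A minimal sketch (get_topic_freq returns a DataFrame with Topic and Count columns):

# A minimal sketch: exclude the outlier topic (-1) when listing topic sizes
freq = model.get_topic_freq()
freq = freq[freq.Topic != -1]
freq.head(10)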

Step 4. Select One Topic

You can select a specific topic and get the top n words for that topic and their c-TF-IDF scores.

model.get_topic(6)

For this selected topic, the common words are sweden, goal, rolfo, swedes, goals, and soccer. It is clear that this topic focuses on the Swedish soccer team.
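
You can inspect several topics in the same way. A minimal sketch that prints the top words of the largest topics, skipping the outlier topic:

# A minimal sketch: print the top words of the largest topics, skipping topic -1
for topic_id in model.get_topic_freq().Topic.head(6):
    if topic_id == -1:
        continue
    top_words = [word for word, score in model.get_topic(topic_id)]
    print(topic_id, top_words[:6])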

Step 5: Topic Modeling Visualization

BerTopic allows you to visualize the topics that were generated in a way very similar to LDAvis. This will give you more insight into the quality of the topics. In this article, we will look at three methods for visualizing the topics.

Visualize Topics 

The visualize_topics method can help you visualize the generated topics with their sizes and corresponding words. The visualization is inspired by LDAvis.

model.visualize_topics()
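
The method returns an interactive Plotly figure, so if you want to keep or share the visualization you can also write it to an HTML file, for example:

# The visualization is a Plotly figure, so it can be saved as an HTML file
fig = model.visualize_topics()
fig.write_html("topics.html")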

Visualize Terms

The visualize_barchart method shows the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores. You can then compare topic representations to each other and gain more insights from the topics generated.

model.visualize_barchart()

In the above graph, you can see that the top words in Topic 4 are proud, thank, cheer4india, cheering, and congrats.

Visualize Topic Similarity 

You can also visualize how similar certain topics are to each other. To visualize the heatmap, simply call:

model.visualize_heatmap()

In the above graph, you can see that topic 93 is similar to topic 102 with a similarity score of 0.933.

Topic Reduction

Sometimes you may end up with too many or too few generated topics. BerTopic gives you options to control this behavior in a few different ways.

(a) You can set the number of topics you want by passing that number to the "nr_topics" argument. BerTopic will then find similar topics and merge them.

model = BERTopic(nr_topics=20) 

In the above code, the number of topics that will be generated is 20.

(b) Another option is to reduce the number of topics automatically. To use this option, you need to set "nr_topics" to "auto" before training the model.

model = BERTopic(nr_topics="auto")

(c) The last option is to reduce the number of topics after training the model. This is a great option if retraining the model will take many hours.

new_topics, new_probs = model.reduce_topics(docs, topics, probabilities, nr_topics=15)

In the above example, you reduce the number of topics to 15 after training the model.

Step 6: Make Prediction

To predict the topic of a new document, you need to pass the new instance(s) to the transform method.

topics, probs = model.transform(new_docs)
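
Here, new_docs is simply a list of raw text strings. A minimal sketch with a hypothetical tweet:

# A hypothetical new tweet; new_docs can be any list of raw strings
new_docs = ["What a great swimming final for Team USA at Tokyo 2020!"]
topics, probs = model.transform(new_docs)
print(topics)  # the predicted topic id for each new document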

Step 7: Save Model

You can save a trained model by using the save method.

model.save("my_topics_model")

Step 8: Load Model

You can load the model by using the load method.

BerTopic_model = BERTopic.load("my_topics_model")

Final Thoughts on Topic Modeling in Python with BerTopic

In this NLP tutorial, you have learned how to:

  • Create a BerTopic model.
  • Select the topics generated.
  • Visualize topics and the words per topic to gain more insights.
  • Use different techniques to reduce the number of topics generated.
  • Make predictions on new documents.
  • Save and load a BerTopic model.

BerTopic has a lot of features to offer when creating a model. For example, if you have a dataset in a specific language (by default, it supports English), you can choose the language by setting the language parameter when configuring the model.

model = BERTopic(language="German")

Note: Select a language for which an embedding model exists.

If you have a mixture of languages in your documents, you can set language="multilingual" to support more than 50 languages.
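
For example, a minimal sketch of configuring a multilingual model:

# A minimal sketch: one model for documents in mixed languages
model = BERTopic(language="multilingual")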

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

You can also find me on Twitter @Davis_McDavid.

And you can read more articles like this here.

Want to keep up to date with all the latest in Python? Subscribe to our newsletter in the footer below.
