BERTopic

BERTopic is a topic modeling technique that leverages transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. It even supports visualizations similar to LDAvis!

The corresponding Medium post can be found here.
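
As a rough illustration of the c-TF-IDF idea mentioned above, the sketch below joins all documents of a cluster into one bag of words and weights each term by its frequency inside the class times an inverse frequency across classes. The weighting used here, tf(t, c) * log(1 + A / f(t)) with A the average number of words per class, is one formulation of class-based TF-IDF and is an assumption; the library's internal implementation may differ. The function name class_tfidf and the toy data are hypothetical and not part of BERTopic's API.

```python
import math
from collections import Counter

def class_tfidf(docs_per_class):
    """Toy c-TF-IDF: docs_per_class maps a class id to a list of tokenized documents."""
    # Join each class into a single bag of words
    class_counts = {
        c: Counter(tok for doc in docs for tok in doc)
        for c, docs in docs_per_class.items()
    }
    # f(t): frequency of term t across all classes
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per class
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            t: (n / n_words) * math.log(1 + avg_words / total_counts[t])
            for t, n in counts.items()
        }
    return scores

# Hypothetical toy clusters of tokenized documents
toy = {
    0: [["windows", "disk", "drive"], ["windows", "dos", "file"]],
    1: [["game", "team", "season"], ["team", "player", "file"]],
}
scores = class_tfidf(toy)
# Highest-scoring word per class
best_word = {c: max(s, key=s.get) for c, s in scores.items()}
print(best_word)  # {0: 'windows', 1: 'team'}
```

Words that are frequent within one class but rare across classes score highest, which is why "file", shared by both toy classes, ranks below "disk" in class 0.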

Installation

Installation can be done using PyPI:

pip install bertopic

To use the visualization options, install BERTopic as follows:

pip install bertopic[visualization]

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation here or you can follow along with the Google Colab notebook here.

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset, which consists of English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

model = BERTopic(language="english")
topics, probabilities = model.fit_transform(docs)

After generating topics and their probabilities, we can access the frequent topics that were generated:

>>> model.get_topic_freq().head()
Topic	Count
-1	7288
49	3992
30	701
27	684
11	568

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 49:

>>> model.get_topic(49)
[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
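
Since topic -1 collects outliers, it can be convenient to drop it before tallying topics by hand, and to turn the (word, score) pairs returned by get_topic() into a short label. The sketch below works on plain Python lists; the helper names and sample values are hypothetical and only mirror the output shapes shown above.

```python
from collections import Counter

def topic_counts(topics, outlier=-1):
    """Count documents per topic, skipping the outlier topic."""
    return Counter(t for t in topics if t != outlier)

def topic_label(word_scores, n_words=3):
    """Join the n highest-scoring words of a topic into a label."""
    ranked = sorted(word_scores, key=lambda ws: ws[1], reverse=True)
    return "_".join(word for word, _ in ranked[:n_words])

# Hypothetical per-document topic assignments, shaped like the `topics` list
topics = [49, -1, 49, 30, -1, 27, 49]
counts = topic_counts(topics)
print(counts.most_common(1))  # [(49, 3)]

# Shape mirrors the get_topic(49) output above (scores shortened)
words = [("windows", 0.0062), ("drive", 0.0050), ("dos", 0.0048), ("file", 0.0041)]
print(topic_label(words))  # windows_drive_dos
```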

Supported Languages
Use language="multilingual" to select a model that supports 50+ languages.

Moreover, the following languages are supported:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanize, Bosnian, Breton, Bulgarian, Burmese, Burmese zawgyi font, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanize, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanize, Telugu, Telugu Romanize, Thai, Turkish, Ukrainian, Urdu, Urdu Romanize, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish

Visualize Topics

After training our BERTopic model, we could go through perhaps a hundred topics one by one to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the generated topics in a way very similar to LDAvis:

model.visualize_topics()

Visualize Topic Probabilities

The variable probabilities that is returned from transform() or fit_transform() can be used to understand how confident BERTopic is that certain topics can be found in a document.

To visualize the distributions, we simply call:

# Make sure to input the probabilities of a single document!
model.visualize_distribution(probabilities[0])
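
The same per-document probabilities can also be inspected without plotting. The sketch below ranks the topics for one document by probability. It assumes, as a simplification, that position i in the vector corresponds to topic i; the helper name top_topics and the sample values are hypothetical.

```python
def top_topics(doc_probabilities, n=3):
    """Return the n (topic_index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(doc_probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Hypothetical probability vector for a single document, e.g. probabilities[0]
doc_probs = [0.02, 0.61, 0.05, 0.30, 0.02]
print(top_topics(doc_probs, n=2))  # [(1, 0.61), (3, 0.3)]
```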

Embedding Models

You can select any model from sentence-transformers and pass it to BERTopic via the embedding_model parameter:

from bertopic import BERTopic
model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

You can also use previously generated embeddings by passing them to fit_transform():

model = BERTopic()
topics, probabilities = model.fit_transform(docs, embeddings)

Click here for a list of supported sentence transformers models.
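
When passing precomputed embeddings, each embedding row has to belong to the document at the same position in docs. A small sanity check before calling fit_transform() can catch mismatches early. The helper below is a hypothetical sketch, not part of BERTopic; it accepts any sequence of equal-length rows, such as a list of lists or a 2-D array.

```python
def check_embeddings(docs, embeddings):
    """Return (n_docs, n_dims); raise ValueError if embeddings do not line up with docs."""
    if len(docs) != len(embeddings):
        raise ValueError(f"{len(docs)} documents but {len(embeddings)} embedding rows")
    dims = {len(row) for row in embeddings}
    if len(dims) > 1:
        raise ValueError(f"embedding rows have differing lengths: {sorted(dims)}")
    n_dims = dims.pop() if dims else 0
    return len(docs), n_dims

# Hypothetical toy inputs; real embeddings would come from an embedding model
sample_docs = ["first document", "second document"]
sample_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_embeddings(sample_docs, sample_embeddings))  # (2, 3)
```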

Overview

Methods	Code
Fit the model	model.fit(docs)
Fit the model and predict documents	model.fit_transform(docs)
Predict new documents	model.transform([new_doc])
Access single topic	model.get_topic(12)
Access all topics	model.get_topics()
Get topic freq	model.get_topic_freq()
Visualize Topics	model.visualize_topics()
Visualize Topic Probability Distribution	model.visualize_distribution(probabilities[0])
Update topic representation	model.update_topics(docs, topics, n_gram_range=(1, 3))
Reduce nr of topics	model.reduce_topics(docs, topics, probabilities, nr_topics=30)
Find topics	model.find_topics("vehicle")
Save model	model.save("my_model")
Load model	BERTopic.load("my_model")

Github GitHub - MaartenGr/BERTopic: Leveraging BERT and c-TF-IDF to create easil...
