GitHub - MaartenGr/BERTopic: Leveraging BERT and c-TF-IDF to create easil...
source link: https://github.com/MaartenGr/BERTopic
BERTopic
BERTopic is a topic modeling technique that leverages transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. It even supports visualizations similar to LDAvis!
The corresponding Medium post can be found here.
Installation
BERTopic can be installed from PyPI:
pip install bertopic
To use the visualization options, install BERTopic as follows:
pip install bertopic[visualization]
Getting Started
For an in-depth overview of the features of BERTopic, you can check the full documentation here or follow along with the Google Colab notebook here.
Quick Start
We start by extracting topics from the well-known 20 newsgroups dataset, which consists of English documents:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

model = BERTopic(language="english")
topics, probabilities = model.fit_transform(docs)
After generating topics and their probabilities, we can access the frequent topics that were generated:
>>> model.get_topic_freq().head()
Topic    Count
-1       7288
49       3992
30       701
27       684
11       568
-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 49:
>>> model.get_topic(49)
[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
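To make the two outputs above easier to work with, here is a minimal sketch in plain Python. It assumes the values have already been copied out of get_topic_freq() and get_topic(49); the helper names drop_outliers and topic_label are hypothetical and not part of BERTopic.

```python
# Sample values copied from the outputs above. In practice they would come
# from model.get_topic_freq() and model.get_topic(49).
topic_counts = {-1: 7288, 49: 3992, 30: 701, 27: 684, 11: 568}
topic_49 = [
    ("windows", 0.006152228076250982),
    ("drive", 0.004982897610645755),
    ("dos", 0.004845038866360651),
    ("file", 0.004140142872194834),
    ("disk", 0.004131678774810884),
]


def drop_outliers(counts):
    """Return the counts without the outlier topic -1 (hypothetical helper)."""
    return {topic: n for topic, n in counts.items() if topic != -1}


def topic_label(word_scores, n_words=3):
    """Join the n highest-scoring words into a short label (hypothetical helper)."""
    ranked = sorted(word_scores, key=lambda pair: pair[1], reverse=True)
    return "_".join(word for word, _ in ranked[:n_words])


real_topics = drop_outliers(topic_counts)
largest = max(real_topics, key=real_topics.get)
print(largest, topic_label(topic_49))  # prints: 49 windows_drive_dos
```

Dropping -1 before picking the largest topic matters here because the outlier bucket holds more documents than any real topic.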
Supported Languages
Pass language="multilingual" to select a model that supports 50+ languages.
In addition, the following languages are supported:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese,
Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanize, Bosnian,
Breton, Bulgarian, Burmese, Burmese zawgyi font, Catalan, Chinese (Simplified),
Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto,
Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek,
Gujarati, Hausa, Hebrew, Hindi, Hindi Romanize, Hungarian, Icelandic, Indonesian,
Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean,
Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian,
Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian,
Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian,
Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak,
Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil,
Tamil Romanize, Telugu, Telugu Romanize, Thai, Turkish, Ukrainian,
Urdu, Urdu Romanize, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian,
Xhosa, Yiddish
Visualize Topics
After training our BERTopic model, we could go through perhaps a hundred topics one by one to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
Instead, we can visualize the topics that were generated in a way very similar to LDAvis:
model.visualize_topics()
Visualize Topic Probabilities
The variable probabilities that is returned from transform() or fit_transform() can be used to understand how confident BERTopic is that certain topics can be found in a document.
To visualize the distributions, we simply call:
# Make sure to input the probabilities of a single document!
model.visualize_distribution(probabilities[0])
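As a text-only complement to the plot, the following sketch lists the most probable topics for one document. It assumes that probabilities[0] is a one-dimensional sequence of floats in which position i holds the probability of topic i; that indexing is an assumption worth checking against your BERTopic version. The function top_topics and the example values are hypothetical.

```python
def top_topics(doc_probabilities, k=3):
    """Return the k (topic_index, probability) pairs with the highest probability.

    Assumes position i of doc_probabilities corresponds to topic i.
    """
    indexed = list(enumerate(doc_probabilities))
    indexed.sort(key=lambda pair: pair[1], reverse=True)
    return indexed[:k]


# Hypothetical probabilities for a single document over five topics.
example = [0.05, 0.60, 0.10, 0.20, 0.05]
print(top_topics(example, k=2))  # prints: [(1, 0.6), (3, 0.2)]
```

With a fitted model, the call would be top_topics(probabilities[0]).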
Embedding Models
You can select any model from sentence-transformers and pass it to BERTopic with embedding_model:
from bertopic import BERTopic

model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")
You can also use previously generated embeddings by passing them to fit_transform():
model = BERTopic()
topics, probabilities = model.fit_transform(docs, embeddings)
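The snippet above does not show where embeddings comes from. One possible way to produce it, sketched here under assumptions, is to encode the documents with sentence-transformers first. The wrapper function fit_with_precomputed_embeddings is hypothetical, and the model name is simply the one used in the embedding_model example. The imports sit inside the function so that defining it does not require the heavy dependencies to be loaded.

```python
def fit_with_precomputed_embeddings(
    docs, model_name="xlm-r-bert-base-nli-stsb-mean-tokens"
):
    """Encode docs once, then fit BERTopic on the precomputed embeddings.

    Hypothetical helper: assumes bertopic and sentence-transformers are installed.
    """
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    # Encode every document into a dense vector (one row per document).
    encoder = SentenceTransformer(model_name)
    embeddings = encoder.encode(docs, show_progress_bar=True)

    # Pass the vectors alongside the documents, as in the snippet above.
    model = BERTopic()
    topics, probabilities = model.fit_transform(docs, embeddings)
    return model, topics, probabilities, embeddings
```

Returning the embeddings as well lets you store them and skip the encoding step when refitting with different settings.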
Click here for a list of supported sentence transformers models.
Overview
| Methods | Code |
|---|---|
| Fit the model | model.fit(docs) |
| Fit the model and predict documents | model.fit_transform(docs) |
| Predict new documents | model.transform([new_doc]) |
| Access single topic | model.get_topic(12) |
| Access all topics | model.get_topics() |
| Get topic freq | model.get_topic_freq() |
| Visualize Topics | model.visualize_topics() |
| Visualize Topic Probability Distribution | model.visualize_distribution(probabilities[0]) |
| Update topic representation | model.update_topics(docs, topics, n_gram_range=(1, 3)) |
| Reduce nr of topics | model.reduce_topics(docs, topics, probabilities, nr_topics=30) |
| Find topics | model.find_topics("vehicle") |
| Save model | model.save("my_model") |
| Load model | BERTopic.load("my_model") |