FANATIC: FAst Noise-Aware TopIc Clustering

Authors: Ari Silburt, Anja Subasic, Evan Thompson, Carmeline Dsilva, Tarec Fares

General

This repo contains the research code and scripts used in the Silburt et al. (2021) paper "FANATIC: FAst Noise-Aware TopIc Clustering", and provides a basic overview of the code structure and its major components. For further questions, please contact the authors directly.

In particular, this repo allows a user to:

  • Download the Reddit data
  • Train a word2vec embedding model
  • Use FANATIC to cluster the Reddit data and dump results for downstream analysis

Note that the original paper results used an in-house preprocessor, but a very similar open-source one has been provided (see fanatic/preprocess/nltk_preprocessor.py).

License

Please read the LICENSE.

How-to

Setup

It is recommended to create a fresh Python virtual environment and run the following commands from the base of the repo:

pip install --upgrade pip
pip install -r requirements.txt

Then, in a Python shell, run:

import nltk
nltk.download('stopwords')

This repo has been tested against python3.7.

Download the Reddit Data

Data can be downloaded from Pushshift using wget, e.g. wget https://files.pushshift.io/reddit/submissions/RS_2017-11.zst. If data files are downloaded to the data/ directory, subsequent scripts are already set up to look there.
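
For example, to fetch one month of submissions into data/:

mkdir -p data
wget -P data https://files.pushshift.io/reddit/submissions/RS_2017-11.zst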

Training a Word2vec Embedding

A new word2vec model can be trained using embedding_driver.py; it is recommended to carefully inspect the arguments before running. In particular, one or more Reddit data files must first be downloaded and specified via the --data-files argument.
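
For example, a minimal run against the file downloaded above might look like this, leaving all other arguments at their defaults (consult embedding_driver.py for the full argument list):

python embedding_driver.py --data-files data/RS_2017-11.zst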

Cluster via FANATIC

Once the data has been downloaded and the word2vec model trained, a clustering run can be performed using the clustering_driver.py script. All input arguments are specified in the parse_args function of fanatic/arguments.py, including data, label, preprocessing, clustering-algorithm, and output arguments.
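
As an illustration, a run that reads 50,000 documents with half of them drawn from noise subreddits might look like the following. The --num-docs-read and --subreddit-noise-percentage flags are described under "Clustering labels" below; the --data-files flag name for this script is an assumption, so check parse_args for the exact names and defaults:

python clustering_driver.py \
    --data-files data/RS_2017-11.zst \
    --num-docs-read 50000 \
    --subreddit-noise-percentage 0.5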

See the Silburt et al. (2021) paper for a detailed explanation of FANATIC's hyperparameters.

Clustering labels

By default, the data is clustered against the data/subreddit_labels.json labels file, which indicates:

  • what subreddits are considered for clustering (all other subreddits are discarded).
  • whether the subreddit is a "coherent" or "noise" topic, where all noise topics are assigned the same NOISE label.
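
The exact schema is defined by the file itself; purely as an illustration, a labels file of roughly this shape would express both points above (all subreddit names and label values here are hypothetical):

{
    "askscience": "coherent",
    "jokes": "noise"
}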

The --subreddit-noise-percentage argument sets the fraction of documents that come from noise subreddits. If --num-docs-read and --subreddit-noise-percentage are incompatible with each other, honouring --subreddit-noise-percentage takes priority. If --subreddit-noise-percentage is set to None, the noise percentage follows the natural data distribution.

The data/subreddit_labels.json labels file can be replaced with a different one, or ignored entirely by setting --subreddit-labels-file None. When --subreddit-labels-file is set to None, all encountered subreddits are used and each subreddit becomes its own coherent topic, so the concept of "topic noise" disappears and the --subreddit-noise-percentage argument becomes irrelevant.

Clustering Outputs

After a successful clustering run, the following files are output:

  • fanatic_<dataset-id>_<seed-run>_labels_and_assignments.json: generated for each seed run; contains, for each document-id, the assignment (what cluster the document ended up in) and label (the label associated with the document). The full clustering result is therefore contained within this file for downstream analysis. Document-ids can be mapped back to the original data file should additional metadata be desired.
  • fanatic_<dataset-id>_<seed-run>_sample_clusters.txt: generated for each seed run; contains the first 10 documents from each cluster and the associated label, in the format <text> -> <label>. This gives the user a qualitative sense of what each cluster contains.
  • fanatic_<dataset-id>_<seed-run>_summary.txt: generated for each seed run; contains all input parameters and clustering stats/metrics. It is effectively a summary of the entire clustering run, allowing you to quickly parse results and/or recreate the job if needed. It can be consumed by configparser (see the sketch after this list).
  • fanatic_<dataset-id>_summary_averaged.txt: generated once per dataset-id; contains the input arguments and the clustering stats/metrics averaged across the seed runs.
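
As a sketch of downstream consumption, assuming illustrative filenames (substitute your own <dataset-id> and <seed-run>, and check the per-document record structure against your actual output):

import configparser
import json

# Per-document cluster assignments and labels (filename is illustrative).
with open("fanatic_mydataset_0_labels_and_assignments.json") as f:
    results = json.load(f)
print(f"{len(results)} documents in the clustering result")

# The per-seed summary file can be consumed with configparser.
config = configparser.ConfigParser()
config.read("fanatic_mydataset_0_summary.txt")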

In addition, the full clustering model can be dumped via pickle for deeper investigation by adding --flag-save-clusteringmodel. Warning: this file can become large, especially for big datasets.
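
The dumped model can later be reloaded for inspection; the filename below is hypothetical, so check the run's output for the actual name:

import pickle

with open("fanatic_mydataset_0_clustering_model.pkl", "rb") as f:  # hypothetical filename
    clustering_model = pickle.load(f)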

Custom Preprocessor / Featurizer

Results from the paper were generated using an in-house preprocessor that is not available to the public. Using nltk, we created a very similar preprocessor, located at fanatic/preprocess/nltk_preprocessor.py, which inherits from fanatic/preprocess/generic_preprocessor.py. Users are free to create their own custom preprocessors that also inherit from generic_preprocessor.py and experiment with more sophisticated features (e.g., BERT embeddings). See generic_preprocessor.py for additional documentation and requirements.
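
A custom preprocessor might look roughly like the sketch below. The base class name, method name, and signature are assumptions; consult generic_preprocessor.py for the actual required interface:

from fanatic.preprocess.generic_preprocessor import GenericPreprocessor  # class name assumed

class MyPreprocessor(GenericPreprocessor):
    # Hypothetical preprocessor: lowercases and whitespace-tokenizes.
    # Method name/signature are assumptions; see generic_preprocessor.py.
    def preprocess(self, text):
        return text.lower().split()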

Non-Reddit Datasets

Users are encouraged to substitute or modify fanatic/preprocess/read_data.py to read in different kinds of data. In particular, DATASET_INPUT_FIELD, DATASET_LABEL_FIELD and DATASET_ID_FIELD must be changed to extract the relevant content from the new dataset for downstream preprocessing and clustering.
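
For example, for a JSON-lines dataset whose records look like {"doc_id": ..., "category": ..., "text": ...}, the constants in fanatic/preprocess/read_data.py would be pointed at those fields (the field values below are illustrative):

DATASET_INPUT_FIELD = "text"      # field containing the document body
DATASET_LABEL_FIELD = "category"  # field containing the topic label
DATASET_ID_FIELD = "doc_id"       # field containing a unique document id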

Tests

Some basic unit tests can be run from the base of the repo with python3.7 -m pytest tests/unit/.

