

source link: https://github.com/jsalt18-sentence-repl/jiant
jiant
This repo contains the jiant sentence representation learning toolkit created at the 2018 JSALT Workshop by the General-Purpose Sentence Representation Learning team. It is an extensible platform meant to make it easy to run experiments that involve multitask and transfer learning across sentence-level NLP tasks.
The 'j' in jiant stands for JSALT. That's all the acronym we have.
To reproduce experiments from JSALT (bugs and all), use the jsalt-experiments branch. That will contain a snapshot of the code as of early August, potentially with updated documentation.
Dependencies
Make sure you have installed the packages listed in environment.yml. Where specific package versions are listed, those versions are required.
If you use conda (recommended, instructions for installing miniconda here), you can create an environment from this package with the following command:
conda env create -f environment.yml
To activate the environment run source activate jiant, and to deactivate run source deactivate.
Some requirements may only be needed for specific configurations. If you have trouble installing a specific dependency and suspect that it isn't needed for your use case, create an issue or a pull request, and we'll help you get by without it.
You will also need to install dependencies for nltk if you do not already have them:
python -m nltk.downloader -d /usr/share/nltk_data perluniprops nonbreaking_prefixes punkt
Submodules
This project uses git submodules to manage some dependencies on other research code, in particular for loading CoVe and the OpenAI transformer model. To make sure you get these repos when you download jiant/, add --recursive to your clone command:
git clone --recursive [email protected]:jsalt18-sentence-repl/jiant.git jiant
If you already cloned and just need to get the submodules, you can do:
git submodule update --init --recursive
Downloading data
The repo contains a convenience python script for downloading all GLUE data and standard splits.
python scripts/download_glue_data.py --data_dir data --tasks all
We also make use of many other data sources, including:
- Translation: WMT'14 EN-DE, WMT'17 EN-RU. Scripts to prepare the WMT data are in scripts/wmt/.
- Language modeling: Billion Word Benchmark, WikiText103. We use the English sentence tokenizer from the NLTK Punkt tokenizer models to preprocess the WikiText103 corpus; it is used only to break paragraphs into sentences. Word-level tokenization uses the default tokenizer, as for all other tasks, unless otherwise specified. We don't do any preprocessing on the BWB corpus.
- Image captioning: MSCOCO Dataset (http://cocodataset.org/#download). Specifically, we use the following splits: 2017 Train images [118K/18GB], 2017 Val images [5K/1GB], 2017 Train/Val annotations [241MB].
- Reddit: reddit_comments dataset. Specifically, we use the 2008 and 2009 tables.
- DisSent: Details for preparing the corpora are in scripts/dissent/README.
- DNC (Diverse Natural Language Inference Collection), i.e. Recast data: The DNC is currently being prepared for release for the EMNLP camera ready. Instructions on how to download the data are forthcoming.
- CCG: Details for preparing the corpora are in scripts/ccg/README.
- Edge probing analysis tasks: see probing/data for more information.
To incorporate the above data, place each dataset in its own directory within the data directory (see the task-directory mapping in src/preprocess.py and src/tasks.py).
Running
To run an experiment, make a config file similar to config/demo.conf with your model configuration. You can use the --overrides flag to override specific variables. For example:
python main.py --config_file config/demo.conf \
--overrides "exp_name = my_exp, run_name = foobar, d_hid = 256"
will run the demo config, but output to $JIANT_PROJECT_PREFIX/my_exp/foobar.
To run the demo config, you will have to set a few environment variables. The best way to do that is to follow the instructions in this script:
- $JIANT_PROJECT_PREFIX: the directory where outputs will be saved.
- $JIANT_DATA_DIR: location of the saved data. This is usually the location of the GLUE data.
- $WORD_EMBED: location of the word embeddings you want to use. For GloVe: 840B 300d GloVe. For fastText: 300d-2M. For ELMo, AllenNLP will download it for you.
- $FASTTEXT_MODEL_FILE: location of the fastText model; can be set to '.'
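For example, one way to set these in your shell before launching (the paths are illustrative; substitute your own locations):
export JIANT_PROJECT_PREFIX=/path/to/experiments
export JIANT_DATA_DIR=/path/to/glue_data
export WORD_EMBED=/path/to/word_embeddings
export FASTTEXT_MODEL_FILE=.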
Saving Preprocessed Data
Because preprocessing is expensive (e.g. building vocab and indexing for very large tasks like WMT or BWB), we often want to run multiple experiments using the same preprocessing. So, we group runs using the same preprocessing in a single experiment directory (set using the exp_dir flag) in which we store all shared preprocessing objects. Later runs will load the stored preprocessing. We write run-specific information (logs, saved models, etc.) to a run-specific directory (set using the run_dir flag), usually nested in the experiment directory. Experiment directories are written in project_dir. Overall, the directory structure looks like:
project_dir # directory for all experiments using jiant
|-- exp1/ # directory for a set of runs training and evaluating on FooTask and BarTask
| |-- preproc/ # shared indexed data of FooTask and BarTask
| |-- vocab/ # shared vocabulary built from examples from FooTask and BarTask
| |-- FooTask/ # shared FooTask class object
| |-- BarTask/ # shared BarTask class object
| |-- run1/ # run directory with some hyperparameter settings
| |-- run2/ # run directory with some different hyperparameter settings
| |
| [...]
|
|-- exp2/ # directory for runs with a different set of experiments, potentially using a different branch of the code
| |-- preproc/
| |-- vocab/
| |-- FooTask/
| |-- BazTask/
| |-- run1/
| |
| [...]
|
[...]
You should also set the data_dir and word_embs_file options to point to the directory containing the data (e.g. the output of the scripts/download_glue_data script) and the word embeddings file (optional; not needed when using ELMo, see later sections), respectively.
To force rereading and reloading of the tasks, perhaps because you changed the format or preprocessing of a task, delete the objects in the directories named for the tasks (e.g. QQP/) or use the option reload_tasks = 1.
To force rebuilding of the vocabulary, perhaps because you want to include vocabulary for more tasks, delete the objects in vocab/ or use the option reload_vocab = 1.
To force reindexing of a task's data, delete some or all of the objects in preproc/ or use the option reload_index = 1 and set reindex_tasks to the names of the tasks to be reindexed, e.g. reindex_tasks=\"sst,mnli\". You should do this whenever you rebuild the task objects or vocabularies.
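For example, an illustrative invocation that combines these options with the demo config to reindex only SST and MNLI:
python main.py --config_file config/demo.conf --overrides "reload_index = 1, reindex_tasks = \"sst,mnli\""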
Command-Line Options
All model configuration is handled through the config file system and the --overrides flag, but there are also a few command-line arguments that control the behavior of main.py. In particular:
- --tensorboard (or -t): use this to run a TensorBoard server while the trainer is running, serving on the port specified by --tensorboard_port (default is 6006). The trainer will write event data even if this flag is not used, and you can run TensorBoard separately as:
tensorboard --logdir <exp_dir>/<run_name>/tensorboard
- --notify <email_address>: use this to enable notification emails via SendGrid. You'll need to make an account and set the SENDGRID_API_KEY environment variable to contain (the text of) the client secret key.
- --remote_log (or -r): use this to enable remote logging via Google Stackdriver. You can set up credentials and set the GOOGLE_APPLICATION_CREDENTIALS environment variable; see Stackdriver Logging Client Libraries.
Model
The core model is a shared BiLSTM with task-specific components. When a language modeling objective is included in the set of training tasks, we use a bidirectional language model for all tasks, which is constructed to avoid cheating on the language modeling tasks.
We also include an experimental option to use a shared Transformer in place of the shared BiLSTM by setting sent_enc = transformer. When using a Transformer, we use the Noam learning rate scheduler, as that seems important to training the Transformer thoroughly.
Task-specific components include logistic regression and multi-layer perceptron for classification and regression tasks, and an RNN decoder with attention for sequence transduction tasks. To see the full set of available params, see config/defaults.conf. For a list of options affecting the execution pipeline (which configuration file to use, whether to enable remote logging or tensorboard, etc.), see the arguments section in main.py.
Trainer
The trainer was originally written to perform sampling-based multi-task training. At each step, a task is sampled and bpp_base (default: 1) batches of that task's training data are trained on.
The trainer evaluates the model on the validation data after a fixed number of gradient steps, set by val_interval.
The learning rate is scheduled to decay by lr_decay_factor (default: .5) whenever the validation score doesn't improve after task_patience (default: 1) validation checks.
Note: "epoch" is generally used in comments and variable names to refer to the interval between validation checks, not to a complete pass through any one training set.
If you're training only on one task, you don't need to worry about sampling schemes, but if you are training on multiple tasks, you can vary the sampling weights with weighting_method, e.g. weighting_method = uniform or weighting_method = proportional (to amount of training data). You can also scale the losses of each minibatch via scaling_method if you want to weight tasks with different amounts of training data equally throughout training.
For multi-task training, we use a shared global optimizer and LR scheduler for all tasks. In the global case, we use the macro average of each task's validation metrics to do LR scheduling and early stopping. When doing multi-task training, if at least one task's validation metric should decrease (e.g. perplexity), we invert the metrics of those tasks by averaging 1 - (val_metric / dec_val_scale) instead, so that the macro average is well behaved.
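As a rough illustration of that inversion (a minimal sketch, not the actual trainer code; the helper name and its arguments are assumptions), the macro-averaged validation score could be computed as:
```python
def macro_val_metric(tasks, val_metrics, dec_val_scale):
    """Macro-average of per-task validation metrics (illustrative sketch).

    tasks: task objects exposing .val_metric (metric name) and
        .val_metric_decreases (True if lower is better, e.g. perplexity).
    val_metrics: dict mapping metric name -> current validation value.
    dec_val_scale: the dec_val_scale config value used to rescale
        decreasing metrics.
    """
    scores = []
    for task in tasks:
        value = val_metrics[task.val_metric]
        if task.val_metric_decreases:
            # Invert metrics that should decrease so that "higher is better"
            # holds for every term in the macro average.
            value = 1 - value / dec_val_scale
        scores.append(value)
    return sum(scores) / len(scores)
```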
We have partial support for per-task optimizers (shared_optimizer = 0), but checkpointing may not behave correctly in this configuration. In the per-task case, we stop training on a task when its patience has run out or its optimizer hits the minimum learning rate.
Within a run, tasks are distinguished between training tasks and evaluation tasks. The logic of main.py is that the entire model is pretrained on all the training tasks, then the best model is loaded, and task-specific components are trained for each of the evaluation tasks with a frozen shared sentence encoder.
You can control which steps are performed or skipped by setting the flags do_train, train_for_eval, do_eval.
Specify training tasks with train_tasks = $TRAIN_TASKS where $TRAIN_TASKS is a comma-separated list of task names; similarly use eval_tasks to specify the eval-only tasks. For example, train_tasks = \"sst,mnli,foo\", eval_tasks = \"qnli,bar,sst,mnli,foo\" (HOCON notation requires escaped quotes in command-line arguments).
Note: if you want to train and evaluate on a task, that task must be in both train_tasks and eval_tasks.
Adding New Tasks
To add new tasks, you should:
- Add your data to the data_dir you intend to use. When constructing your task class (see the next bullet), make sure you specify the correct subfolder containing your data.
- Create a class in src/tasks.py (a minimal sketch of such a class appears after this list), and make sure that:
  - You decorate the task: in the line immediately before class MyNewTask():, add the line @register_task(task_name, rel_path='path/to/data') where task_name is the designation for the task used in train_tasks, eval_tasks and rel_path is the path to the data in data_dir. See EdgeProbingTasks in tasks.py for an example.
  - Your task inherits from existing classes as necessary (e.g. PairClassificationTask, SequenceGenerationTask, WikiTextLMTask, etc.).
  - The task definition includes the data loader, a method called load_data() which stores tokenized but un-indexed data for each split in attributes named task.{train,valid,test}_data_text. The formatting of each datum can be anything as long as your preprocessing code (in src/preprocess.py, see the next bullet) expects that format. Generally data are formatted as lists of inputs and outputs, e.g. MNLI is formatted as [[sentences1]; [sentences2]; [labels]] where sentences{1,2} is a list of the first (or second) sentences from each example. Make sure to call your data loader in initialization!
  - Your task implements a method task.get_sentences() that iterates over all text to index in order to build the vocabulary. For some types of tasks, e.g. SingleClassificationTask, you only need to set task.sentences to be a list of sentences (List[List[str]]).
  - Your task implements a method task.count_examples() that sets task.example_counts (Dict[str:int]): the number of examples per split (train, val, test). See here for an example.
  - Your task implements a method task.get_split_text() that takes in the name of a split and returns an iterable over the data in that split. This method will be called in preprocessing and passed to task.process_split (see the next bullet).
  - Your task implements a method task.process_split() that takes in a split of your data and produces a list of AllenNLP Instances. An Instance is a wrapper around a dictionary of (field_name, Field) pairs. Fields are objects that help with data processing (indexing, padding, etc.). Each input and output should be wrapped in a field of the appropriate type (TextField for text, LabelField for class labels, etc.). For MNLI, we wrap the premise and hypothesis in TextFields and the label in a LabelField. See the AllenNLP tutorial or the examples in src/tasks.py. The names of the fields, e.g. input1, can be anything, so long as the corresponding code in src/models.py (see the next bullet) expects those named fields. However, make sure that the values to be predicted are named labels (for classification or regression) or targs (for sequence generation)!
  - If your task requires task-specific label namespaces, e.g. for translation or tagging, set the attribute task._label_namespace to reserve a vocabulary namespace for your task's target labels. We strongly suggest including the task name in the target namespace. Your task should also implement task.get_all_labels(), which returns an iterable over the labels (possibly words, e.g. in the case of MT) in the task-specific namespace.
  - Your task has attributes task.val_metric (the name of the task-specific metric to track during training) and task.val_metric_decreases (bool, True if the val metric should decrease during training). You should also implement a task.get_metrics() method that implements the metrics you care about, using AllenNLP Scorer objects (typically set via task.scorer1, task.scorer2, etc.).
- In src/models.py, make sure that:
  - The correct task-specific module is being created for your task in build_module().
  - Your task is correctly being handled in forward() of MultiTaskModel. The model will receive the task class you created and a batch of data, where each batch is a dictionary with keys of the Instance objects you created in preprocessing, as well as a predict flag that indicates whether your forward function should generate predictions or not.
  - You create additional methods or add branches to existing methods as necessary. If you do add additional methods, make sure to make use of the sent_encoder attribute of the model, which is shared amongst all tasks.
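To make the checklist above concrete, here is a minimal, illustrative sketch of a new classification task for src/tasks.py. It only fills in the hooks described above; the constructor signature, the split attribute names, and the read_my_tsv() helper are assumptions, not the exact jiant API, so adapt them to the conventions of the existing tasks in src/tasks.py.
```python
from allennlp.data import Instance
from allennlp.data.fields import LabelField, TextField

# register_task and SingleClassificationTask are already defined in src/tasks.py.


@register_task('mynewtask', rel_path='MyNewTask/')  # rel_path is relative to data_dir
class MyNewTask(SingleClassificationTask):

    def __init__(self, path, max_seq_len, name='mynewtask', **kw):
        super().__init__(name, **kw)        # assumed parent constructor signature
        self.load_data(path, max_seq_len)   # call the data loader in initialization
        self.val_metric = '%s_accuracy' % name  # metric tracked during training
        self.val_metric_decreases = False

    def load_data(self, path, max_seq_len):
        # Store tokenized but un-indexed data per split as [[sentences], [labels]];
        # read_my_tsv() is a hypothetical helper that returns tokenized sentences.
        self.train_data_text = read_my_tsv(path, 'train.tsv', max_seq_len)
        self.val_data_text = read_my_tsv(path, 'dev.tsv', max_seq_len)
        self.test_data_text = read_my_tsv(path, 'test.tsv', max_seq_len)
        # For a SingleClassificationTask, get_sentences() only needs task.sentences.
        self.sentences = self.train_data_text[0] + self.val_data_text[0]

    def count_examples(self):
        self.example_counts = {
            'train': len(self.train_data_text[0]),
            'val': len(self.val_data_text[0]),
            'test': len(self.test_data_text[0]),
        }

    def get_split_text(self, split):
        # split is 'train', 'val', or 'test'
        return getattr(self, '%s_data_text' % split)

    def process_split(self, split, indexers):
        # Wrap each example in AllenNLP Fields; the value to be predicted must
        # be named 'labels' for a classification task.
        sents, labels = split
        instances = []
        for sent, label in zip(sents, labels):
            fields = {
                'input1': TextField(sent, token_indexers=indexers),
                'labels': LabelField(label),
            }
            instances.append(Instance(fields))
        return instances
```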
Note: The current training procedure is task-agnostic: we randomly sample a task to train on, pass a batch to the model, and receive an output dictionary that at least contains a loss key. Training loss should be calculated within the model; validation metrics should also be computed within AllenNLP Scorers and not in the training loop. So you should not need to modify the training loop; please reach out if you think you need to.
Feel free to create a pull request to add an additional task if you expect that it'll be useful to others.
Pretrained Embeddings
ELMo
We use the ELMo implementation provided by AllenNLP.
To use ELMo, set elmo to 1.
By default, AllenNLP will download and cache the pretrained ELMo weights. If you want to use a particular file containing ELMo weights, set elmo_weight_file_path = path/to/file.
To use only the character-level CNN word encoder from ELMo, set elmo_chars_only = 1. This is set by default.
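For example, an illustrative override (using the demo config) that enables the full ELMo representations rather than the character-CNN-only default:
python main.py --config_file config/demo.conf --overrides "elmo = 1, elmo_chars_only = 0"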
CoVe
We use the CoVe implementation provided here.
To use CoVe, clone the repo, set the option path_to_cove = "/path/to/cove/repo", and set cove = 1.
FastText
To use fastText, you can use either the pretrained vectors or the pretrained model. The former will have out-of-vocabulary (OOV) terms while the latter will not, so using the latter is preferred.
To use the pretrained model, follow the instructions here (specifically "Building fastText for Python") to setup the fastText package, then download the trained English model (note: 9.6G).
fastText will also need to be built in the jiant environment following these instructions.
To activate the fastText model within our framework, set the flag fastText = 1.
To use the pretrained vectors instead, download the pretrained vectors located here, preferably the 300-dimensional Common Crawl vectors, and set word_emb_file to point to the .vec file.
GloVe
To use GloVe pretrained word embeddings, download and extract the relevant files and set word_embs_file to the GloVe file.
Quick-Start on GCP (for JSALT internal use only)
For the JSALT workshop, we used Google Compute Engine as our main compute platform. If you're using Google Compute Engine, the private project instance images (cpu-workstation-template* and gpu-worker-template-*) already have all the required packages installed, plus the GLUE data and pre-trained embeddings downloaded to /usr/share/jsalt. Unfortunately, these images are not straightforward to share. To use them, clone this repo to your home directory, then test with:
python main.py --config_file config/demo.conf
You should see the model start training, and achieve an accuracy of > 70% on SST in a few minutes. The default config will write the experiment directory to $HOME/exp/<experiment_name> and the run directory to $HOME/exp/<experiment_name>/<run_name>, so you can find the demo output in ~/exp/jiant-demo/sst.
Getting Help
Post an issue here on GitHub if you have any problems, and create a pull request if you make any improvements (substantial or cosmetic) to the code that you're willing to share.