GitHub - feedly/transfer-nlp: NLP library designed for flexible research and dev...

source link: https://github.com/feedly/transfer-nlp


Welcome to the Transfer NLP library, a framework built on top of PyTorch whose goal is to progressively achieve 2 kinds of Transfer:

  • easy transfer of code: the framework should be modular enough so that you don't have to re-write everything each time you experiment with a new architecture / a new kind of task
  • easy transfer learning: the framework should be able to easily interact with pre-trained models and manipulate them in order to fine-tune some of their parts.

You can have an overview of the high-level API in this Colab Notebook, which shows how to use the framework on several examples. All examples in the notebook embed in-cell Tensorboard training monitoring!

Set up your environment

mkvirtualenv transfernlp
workon transfernlp

git clone https://github.com/feedly/transfer-nlp.git
cd transfer-nlp
pip install -r requirements.txt

The library is available on PyPI, but installing it via pip install transfer-nlp is not recommended yet.

Documentation

API documentation and an overview of the library can be found here

High-Level usage of the library

You can have a look at the Colab Notebook to get a simple sense of the library usage.

A basic usage is:

# Set up the experiment
config_file = [dict config, or str/Path to a JSON config file]
experiment = ExperimentConfig.from_json(experiment=config_file)

# Launch the training session
experiment['trainer'].train()

# Use the predictor for inference
input_json = {"inputs": [Some Examples]}
output_json = experiment['predictor'].json_to_json(input_json=input_json)

You can use this code with all existing experiments in experiments/.

How to experiment with the library?

For reproducible research and easy ablation studies, the library enforces the use of configuration files for experiments.

In Transfer-NLP, an experiment config file contains all the information needed to entirely define the experiment. This is where you insert the names of the different components your experiment will use. Transfer-NLP makes use of the Inversion of Control pattern, which allows you to define any kind of class you might need; the ExperimentConfig.from_json method will create a dictionary and instantiate your objects accordingly.

To use your own classes inside Transfer-NLP, you need to register them with the @register_plugin decorator. Instead of using a different registry for each kind of component (models, data loaders, vectorizers, optimizers, ...), a single registry is used here, in order to allow complete customization.
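To make the single-registry idea concrete, here is a minimal, self-contained sketch of the Inversion of Control pattern described above. This is an illustrative re-implementation, not transfer-nlp's actual code: the build helper and the MyModel class are hypothetical, and in the library you would import @register_plugin and use ExperimentConfig instead.

```python
# Minimal sketch of the single-registry IoC pattern (illustrative, NOT the
# library's actual implementation).

REGISTRY = {}  # one registry shared by all component kinds

def register_plugin(cls):
    """Register a class under its name so configs can refer to it by _name."""
    REGISTRY[cls.__name__] = cls
    return cls

def build(config):
    """Recursively instantiate any dict that carries a _name key."""
    if isinstance(config, dict):
        params = {k: build(v) for k, v in config.items() if k != "_name"}
        if "_name" in config:
            return REGISTRY[config["_name"]](**params)  # complex config
        return params
    if isinstance(config, list):
        return [build(v) for v in config]  # simple list
    return config  # simple parameter, returned as-is

@register_plugin
class MyModel:  # hypothetical user-defined component
    def __init__(self, embedding_dim, layers_dropout):
        self.embedding_dim = embedding_dim
        self.layers_dropout = layers_dropout

experiment = build({
    "model": {"_name": "MyModel",
              "embedding_dim": 100,
              "layers_dropout": [0.1, 0.2, 0.3]}
})
print(experiment["model"].embedding_dim)  # -> 100
```

A single registry means a config entry can name any registered class, whatever its role in the experiment, which is what makes every component swappable from the config file alone.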

Currently, the config file logic has 3 kinds of components:

  • simple parameters: parameters whose values you know in advance:
{"initial_learning_rate": 0.01,
"embedding_dim": 100,...}
  • simple lists: similar to simple parameters, but as a list:
{"layers_dropout": [0.1, 0.2, 0.3], ...}
  • Complex configs: this is where the library instantiates your objects. Such a config needs the _name of the object's class (which you must @register_plugin) and some parameters. If your class has default parameters and your config file doesn't contain them, objects are instantiated with their defaults; otherwise, the parameters have to be present in the config file. Sometimes, initialization parameters are not available before launching the experiment, e.g. your Model object might need a vocabulary size as init input. The config file logic makes it easy to deal with this while keeping the library code very general. You can have a look at the experiments for examples: surnames.py, news.py or cbow.py. The corresponding json files in experiments will show you examples of how to get started.
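Putting the three kinds of components together, a config file might look like the following sketch (the class name MyClassifier and the exact keys are illustrative, not taken from the library's examples):

```json
{
  "initial_learning_rate": 0.01,
  "embedding_dim": 100,
  "layers_dropout": [0.1, 0.2, 0.3],
  "model": {
    "_name": "MyClassifier",
    "embedding_dim": 100
  }
}
```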

Usage Pipeline

The goal of the config file is to load a Trainer and run the experiment from it. We provide a BasicTrainer in transfer_nlp.plugins.trainers. This basic trainer takes a model and some data as input and runs a whole training pipeline. We make use of the PyTorch-Ignite library to monitor events during training (logging metrics, manipulating learning rates, checkpointing models, etc.). Tensorboard logs are also included as an option: specify a tensorboard_logs simple parameter (a path) in the config file, then run tensorboard --logdir=path/to/logs in a terminal and you can monitor your experiment while it's training. Tensorboard comes with very nice utilities to keep track of the norms of your model weights, histograms, distributions, embedding visualizations, and more.

Slack integration

While experimenting with your own models / data, training might take some time. To get notified when your training finishes or crashes, we recommend the simple knockknock library by the folks at HuggingFace, which adds a simple decorator to your running function to notify you via Slack, e-mail, etc.

Some objectives to reach:

  • Unit-test everything
  • Include examples using state of the art pre-trained models
  • Include linguistic properties to models
  • Experiment with RL for sequential tasks
  • Include probing tasks to try to understand the properties that are learned by the models

Acknowledgment

The library was inspired by "Natural Language Processing with PyTorch" by Delip Rao and Brian McMahan. The experiments in experiments/, the Vocabulary building block, and the embeddings nearest-neighbors utilities are taken or adapted from the code provided in the book.

