
Deep Learning For NLP with PyTorch and Torchtext

Source: https://towardsdatascience.com/deep-learning-for-nlp-with-pytorch-and-torchtext-4f92d69052f

Torchtext’s Pre-trained Word Embedding, Dataset API, Iterator API, and training model with Torchtext and PyTorch


Picture by Clarissa Watson on Unsplash

PyTorch has been an awesome deep learning framework to work with. However, when it comes to NLP, I somehow could not find a utility library as good as torchvision. It turns out PyTorch has torchtext which, in my opinion, lacks examples of how to use it, and whose documentation [6] could be improved. Moreover, there are some great tutorials like [1] and [2], but we still need more examples.

This article's purpose is to give readers sample code showing how to use torchtext, in particular: how to use pre-trained word embeddings, how to use the dataset API, how to use the iterator API for mini-batching, and finally how to use all of these together to train a model.

Pre-Trained Word Embedding with Torchtext

There are some alternatives for pre-trained word embeddings, such as Spacy [3], Stanza (Stanford NLP) [4], and Gensim [5], but in this article I want to focus on doing word embedding with torchtext.

Available Word Embedding

You can see the list of pre-trained word embeddings in the torchtext documentation. At the time of writing, three pre-trained word embedding classes are supported: GloVe, FastText, and CharNGram, with little additional detail on how to load them. The exhaustive list is stated in the documentation, but it took me some time to find, so I will lay it out here.

charngram.100d
fasttext.en.300d
fasttext.simple.300d
glove.42B.300d
glove.840B.300d
glove.twitter.27B.25d
glove.twitter.27B.50d
glove.twitter.27B.100d
glove.twitter.27B.200d
glove.6B.50d
glove.6B.100d
glove.6B.200d
glove.6B.300d

There are two ways to load pre-trained word embeddings: instantiate a word embedding object directly, or use a Field instance.

Using Field Instance

You need a toy dataset to try this, so let's set one up.

import pandas as pd

df = pd.DataFrame([
    ['my name is Jack', 'Y'],
    ['Hi I am Jack', 'Y'],
    ['Hello There!', 'Y'],
    ['Hi I am cooking', 'N'],
    ['Hello are you there?', 'N'],
    ['There is a bird there', 'N'],
], columns=['text', 'label'])

Then we can construct Field objects that hold the metadata of the feature column and the label column.

from torchtext.data import Field

text_field = Field(
    tokenize='basic_english',
    lower=True
)
label_field = Field(sequential=False, use_vocab=False)

# sadly have to apply preprocess manually
preprocessed_text = df['text'].apply(lambda x: text_field.preprocess(x))

# load fasttext simple embedding with 300d
text_field.build_vocab(
    preprocessed_text,
    vectors='fasttext.simple.300d'
)

# get the vocab instance
vocab = text_field.vocab

To get the actual pre-trained word embedding tensor, you can use

vocab.vectors
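For the fasttext.simple.300d vectors loaded above, this is a float tensor with one 300-dimensional row per vocabulary entry. A quick check (my own sketch):

print(vocab.vectors.shape)  # torch.Size([len(vocab), 300])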

Initiate Word Embedding Object

Each of the following snippets downloads a large set of word embeddings, so be patient and do not execute all of them at once.

FastText

The FastText object has one parameter: language, which can be 'simple' or 'en'. Currently only 300 embedding dimensions are supported, as shown in the embedding list above.

from torchtext.vocab import FastText
embedding = FastText('simple')

CharNGram

from torchtext.vocab import CharNGram
embedding_charngram = CharNGram()

GloVe

The GloVe object has two parameters: name and dim. You can check the embedding list above to see which values each parameter supports.

from torchtext.vocab import GloVe
embedding_glove = GloVe(name='6B', dim=100)

Using Word Embedding

Using word embeddings through the torchtext API is super easy! Both the embedding objects above and the vocab built with the Field behave like a Python dict. For example, the vocab maps tokens to integer ids:

# known token; in my case this prints 12
print(vocab['are'])
# unknown token; this prints 0
print(vocab['crazy'])

As you can see, unknown tokens are handled without throwing an error! If you play with encoding words into integers, you will notice that, by default, the unknown token is encoded as 0 while the pad token is encoded as 1.
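As a quick sanity check of these defaults, and of dict-style lookup on an embedding object, here is a small sketch assuming the vocab and the FastText('simple') embedding built above:

print(vocab.stoi['<unk>'], vocab.stoi['<pad>'])   # 0 1
print(vocab.vectors[vocab['are']].shape)          # torch.Size([300]): the fastText vector for 'are'
print(embedding['hello'].shape)                   # torch.Size([300]): Vectors objects also support dict-style lookup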

Using Dataset API

Assuming the variable df has been defined as above, we now proceed to prepare the data by constructing a Field for both the feature and the label (this time with a fixed sequence length).

from torchtext.data import Field

text_field = Field(
    sequential=True,
    tokenize='basic_english',
    fix_length=5,
    lower=True
)
label_field = Field(sequential=False, use_vocab=False)

# sadly have to apply preprocess manually
preprocessed_text = df['text'].apply(
    lambda x: text_field.preprocess(x)
)

# load fasttext simple embedding with 300d
text_field.build_vocab(
    preprocessed_text,
    vectors='fasttext.simple.300d'
)

# get the vocab instance
vocab = text_field.vocab

I did not find a ready-made Dataset API to load a pandas DataFrame into a torchtext dataset, but it is pretty easy to build one.

from torchtext.data import Dataset, Example
# encode the string labels as integers
ltoi = {l: i for i, l in enumerate(df['label'].unique())}
df['label'] = df['label'].apply(lambda y: ltoi[y])

class DataFrameDataset(Dataset):
    def __init__(self, df: pd.DataFrame, fields: list):
        super(DataFrameDataset, self).__init__(
            [
                Example.fromlist(list(r), fields) 
                for i, r in df.iterrows()
            ], 
            fields
        )

We can now construct the DataFrameDataset and initialize it with the pandas DataFrame.

train_dataset, test_dataset = DataFrameDataset(
    df=df, 
    fields=(
        ('text', text_field),
        ('label', label_field)
    )
).split()

A bit of a warning here: Dataset.split may return three datasets (train, val, test) instead of the two values defined above.

Using Iterator Class for Mini-batching

We then use the BucketIterator class to easily construct a mini-batching iterator.

from torchtext.data import BucketIterator

train_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset, test_dataset),
    batch_sizes=(2, 2),
    sort=False
)

Remember to use sort=False, otherwise it will lead to an error when you try to iterate over test_iter: we have not defined a sort key, yet by default test_iter is set up to be sorted.
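As a quick check of what the iterator yields, here is a small sketch; the shapes assume the fix_length=5 and batch size of 2 defined above, with batch_first left at its default of False:

batch = next(iter(train_iter))
print(batch.text.shape)   # torch.Size([5, 2]): (fix_length, batch_size)
print(batch.label)        # a tensor with the two integer labels in the batch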

A little note: while I agree that we should eventually use the DataLoader API to handle mini-batching, at this moment I have not explored how to use DataLoader with torchtext.

Example in Training PyTorch Model

Let's define an arbitrary PyTorch model with one embedding layer and one linear layer. In this example I do not use a pre-trained word embedding; instead I use a new, untrained embedding layer.

import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

class ModelParam(object):
    def __init__(self, param_dict: dict = dict()):
        self.input_size = param_dict.get('input_size', 0)
        self.vocab_size = param_dict.get('vocab_size')
        self.embedding_dim = param_dict.get('embedding_dim', 300)
        self.target_dim = param_dict.get('target_dim', 2)

class MyModel(nn.Module):
    def __init__(self, model_param: ModelParam):
        super().__init__()
        self.embedding = nn.Embedding(
            model_param.vocab_size, 
            model_param.embedding_dim
        )
        self.lin = nn.Linear(
            model_param.input_size * model_param.embedding_dim, 
            model_param.target_dim
        )
        
    def forward(self, x):
        features = self.embedding(x).view(x.size()[0], -1)
        features = F.relu(features)
        features = self.lin(features)
        return features

Then I can easily iterate the training (and testing) routine as follows.
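The original training snippet was embedded as a gist and did not survive here, so the following is only a minimal sketch of such a loop under the definitions above; the epoch count, the cross-entropy loss, and the transpose of batch.text to the (batch_size, fix_length) layout the model expects are my own choices:

model_param = ModelParam(param_dict={
    'vocab_size': len(vocab),
    'input_size': 5,  # must match fix_length above
})
model = MyModel(model_param)
optimizer = Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for batch in train_iter:
        x = batch.text.t()          # (batch_size, fix_length)
        y = batch.label
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()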

Reusing The Pre-trained Word Embedding

It is easy to modify the current defined model to a model that used pre-trained embedding.

class MyModelWithPretrainedEmbedding(nn.Module):
    def __init__(self, model_param: ModelParam, embedding):
        super().__init__()
        self.embedding = embedding
        self.lin = nn.Linear(
            model_param.input_size * model_param.embedding_dim, 
            model_param.target_dim
        )
        
    def forward(self, x):
        features = self.embedding[x].reshape(x.size()[0], -1)
        features = F.relu(features)
        features = self.lin(features)
        return features

I made three lines of modifications. Notice that I changed the constructor input to accept an embedding. I also changed the view method to reshape and used the get operator [] instead of the call operator () to access the embedding, since here the embedding is a plain tensor of vectors rather than an nn.Embedding module.

model = MyModelWithPretrainedEmbedding(model_param, vocab.vectors)
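An alternative that is not used in the article, but worth noting: PyTorch's nn.Embedding.from_pretrained can wrap the same vectors in a regular embedding layer, so the original MyModel (with its () call and view) would work unchanged:

# build an embedding layer directly from the pre-trained vectors (frozen by default)
pretrained_layer = nn.Embedding.from_pretrained(vocab.vectors)
print(pretrained_layer.weight.shape)  # (len(vocab), 300)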

Conclusion

I have laid out my own exploration of using torchtext to handle text data in PyTorch. I started writing this article because I had trouble using torchtext with the tutorials currently available on the internet. I hope it reduces that overhead for others too.

Do you need help writing this code? Here's a link to Google Colab.

Link to Google Colab

References

[1] Nie, A. A Tutorial on Torchtext. 2017. http://anie.me/On-Torchtext/

[2] Text Classification with TorchText Tutorial. https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html

[3] Spacy Documentation. https://spacy.io/

[4] Stanza Documentation. https://stanfordnlp.github.io/stanza/

[5] Gensim Documentation. https://radimrehurek.com/gensim/

[6] Torchtext Documentation. https://pytorch.org/text/

