BERT to the rescue!

In this post, I want to show how to apply BERT to a simple text classification problem. I assume that you're more or less familiar with what BERT is on a high level, so I'll focus on the practical side and show you how to use it in your own work. Roughly speaking, BERT is a model that knows how to represent text. You give it a sequence as input; it then looks left and right several times and produces a vector representation for each word as the output.

In their paper, the authors describe two ways to work with BERT. The first is as a "feature extraction" mechanism: we use BERT's final output as the input to another model, "extracting" features from the text with BERT and then using them in a separate model for the actual task at hand. The second is "fine-tuning": we add additional layer(s) on top of BERT and train the whole thing together, so we both train our additional layer(s) and change (fine-tune) BERT's weights.

Here I want to show the second method and present a step-by-step solution to a very simple and popular text classification task: IMDB movie reviews sentiment classification. This may not be the hardest task to solve, and applying BERT to it might be slight overkill, but most of the steps shown here are the same for almost every task, no matter how complex it is.

Before diving into the actual code, let's understand the general structure of BERT and what we need to do to use it in a classification task. As mentioned before, the input to BERT is generally a sequence of words and the output is a sequence of vectors. BERT allows us to perform different tasks based on its output, so for different task types we need to change the input and/or the output slightly. In the figure below you can see 4 different task types; for each one, you can see what the input and output of the model should be.

[Figure: BERT input and output configurations for 4 different task types]

You can see that for the input, there's always a special [CLS] token (which stands for classification) at the start of each sequence, and a special [SEP] token that separates the two parts of the input.

For the output, if we're interested in classification, we need to use the output of the first token (the [CLS] token). For more complicated outputs, we can use the outputs of all the other tokens.

We are interested in "Single Sentence Classification" (top right), so we'll add the special [CLS] token and use its output as the input to a linear layer followed by a sigmoid activation, which performs the actual classification.

Now let's understand the task at hand: given a movie review, predict whether it's positive or negative. The dataset we use is 50,000 IMDB reviews (25K for train and 25K for test) from the PyTorch-NLP library. Each review is tagged pos or neg. There are 50% positive reviews and 50% negative reviews in both the train and test sets.

You can find all the code in this notebook.

1. Preparing the Data
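
The code snippets below assume a handful of imports and constants that the post doesn't spell out explicitly. Here is a sketch of what the notebook relies on (pytorch-nlp for the dataset, PyTorch-Pretrained-BERT for the model and tokenizer, Keras' pad_sequences for padding); the batch size is my own guess:

# Assumed imports and constants for the snippets below
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torchnlp.datasets import imdb_dataset                       # pytorch-nlp
from pytorch_pretrained_bert import BertTokenizer, BertModel     # PyTorch-Pretrained-BERT
from keras.preprocessing.sequence import pad_sequences

BATCH_SIZE = 4                   # assumed; pick whatever fits your GPU memory
EPOCHS = 10                      # the post trains for 10 epochs
device = torch.device('cuda')    # assumed: the post trains on a GPU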

We load the data using the pytorch-nlp library:

train_data, test_data = imdb_dataset(train=True, test=True)

Each instance in this dataset is a dictionary with 2 fields: text and sentiment:

{
    'sentiment': 'pos',  
    'text': 'Having enjoyed Joyces complex nove...'
}

We create two variables for each set, one for texts and one for the labels:

train_texts, train_labels = list(zip(*map(lambda d: (d['text'], d['sentiment']), train_data)))
test_texts, test_labels = list(zip(*map(lambda d: (d['text'], d['sentiment']), test_data)))

Next, we need to tokenize our texts. BERT was trained using WordPiece tokenization, which means that a word can be broken down into more than one sub-word. For example, if I tokenize the sentence "Hi my name is Dima" I'll get:

tokenizer.tokenize('Hi my name is Dima')
# OUTPUT
['hi', 'my', 'name', 'is', 'dim', '##a']

This kind of tokenization is beneficial when dealing with out-of-vocabulary words, and it may help represent complicated words better. The sub-words are constructed during training and depend on the corpus the model was trained on. We could of course use any other tokenization technique, but we'll get the best results if we tokenize with the same tokenizer the BERT model was trained with. The PyTorch-Pretrained-BERT library provides a tokenizer for each of BERT's models. Here we use the basic bert-base-uncased model; there are several other models, including much larger ones. The maximum sequence length for BERT is 512, so we'll truncate any review that is longer than that.

The code below creates the tokenizer, tokenizes each review, prepends the special [CLS] token, and keeps only the first 511 tokens of each review (so that with [CLS] each sequence is at most 512 tokens long), for both the train and test sets:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
train_tokens = list(map(lambda t: ['[CLS]'] + tokenizer.tokenize(t)[:511], train_texts))
test_tokens = list(map(lambda t: ['[CLS]'] + tokenizer.tokenize(t)[:511], test_texts))

Next, we need to convert each token in each review to its id in the tokenizer vocabulary. If a token is not present in the vocabulary, the tokenizer will use the special [UNK] token and its id:

train_tokens_ids = list(map(tokenizer.convert_tokens_to_ids, train_tokens))
test_tokens_ids = list(map(tokenizer.convert_tokens_to_ids, test_tokens))

Finally, we need to pad our input so all sequences have the same length of 512. That means that for any review shorter than 512 tokens, we'll add zeros until it reaches 512 tokens:

train_tokens_ids = pad_sequences(train_tokens_ids, maxlen=512, truncating="post", padding="post", dtype="int")
test_tokens_ids = pad_sequences(test_tokens_ids, maxlen=512, truncating="post", padding="post", dtype="int")

Our target variable is currently a list of neg and pos strings. We’ll convert it to numpy arrays of booleans:

train_y = np.array(train_labels) == 'pos'
test_y = np.array(test_labels) == 'pos'

2. Model Building

We’ll use PyTorch and the excellent PyTorch-Pretrained-BERT library for the model building. Actually, there’s a very similar model already implemented in this library and we could’ve used this one. For this post, I want to implement it myself so we can better understand what’s going on.

Before we create our model, let’s see how we can use the BERT model as implemented in the PyTorch-Pretrained-BERT library:

bert = BertModel.from_pretrained('bert-base-uncased')
x = torch.tensor(train_tokens_ids[:3])
y, pooled = bert(x, output_all_encoded_layers=False)
print('x shape:', x.shape)
print('y shape:', y.shape)
print('pooled shape:', pooled.shape)
# OUTPUT
x shape: (3, 512)
y shape: (3, 512, 768)
pooled shape: (3, 768)

First, we create the BERT model, then we create a PyTorch tensor with the first 3 reviews from our training set and pass it through the model. The output is two variables. Let's understand all the shapes: x is of size (3, 512) because we took only 3 reviews, 512 tokens each. y is of size (3, 512, 768); this is BERT's final-layer output for each token (we could pass output_all_encoded_layers=True to get the output of all 12 layers), and each token in each review is represented by a vector of size 768. pooled is of size (3, 768); this is the output corresponding to our [CLS] token, the first token in the sequence.
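
One detail worth noting (my addition, not from the original text): pooled is derived from the [CLS] token, but it is not literally y[:, 0, :]. In PyTorch-Pretrained-BERT, the final hidden state of [CLS] is passed through BERT's pooler, a linear layer followed by a Tanh, to produce pooled:

# pooled = tanh(Linear(final hidden state of [CLS])), so it differs from the raw vector
cls_hidden = y[:, 0, :]                      # raw final-layer vector of the [CLS] token, shape (3, 768)
print(torch.allclose(cls_hidden, pooled))    # False: the pooler transforms it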

Our goal is to take BERT's pooled output and apply a linear layer and a sigmoid activation on top of it. Here's what our model looks like:

class BertBinaryClassifier(nn.Module):
    def __init__(self, dropout=0.1):
        super(BertBinaryClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, tokens):
        _, pooled_output = self.bert(tokens, output_all_encoded_layers=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        proba = self.sigmoid(linear_output)
        return proba

Every model in PyTorch is an nn.Module object, which means every model we build must provide 2 methods. The __init__ method declares all the different parts the model will use. In our case, we create the BERT model that we'll fine-tune, a dropout layer, the linear layer, and the sigmoid activation. The forward method is the actual code that runs during the forward pass (like the predict method in sklearn or keras). Here we take the tokens input and pass it to the BERT model. The output of BERT is 2 variables, as we have seen before; we use only the second one (the _ name is used to emphasize that the first is not used). We take the pooled output, apply dropout, and pass it to the linear layer. Finally, we use the sigmoid activation to produce the actual probability.
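
As a quick sanity check (my own addition, not part of the original notebook), we can run the untrained classifier on a few reviews and verify that we get one probability per review:

# Forward pass on the first 3 reviews; gradients are not needed here
clf = BertBinaryClassifier()
with torch.no_grad():
    probas = clf(torch.tensor(train_tokens_ids[:3]))
print(probas.shape)   # torch.Size([3, 1])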

3. Training/Fine-tuning

The training is pretty standard. First, we prepare our tensors and data loaders:

train_tokens_tensor = torch.tensor(train_tokens_ids)
train_y_tensor = torch.tensor(train_y.reshape(-1, 1)).float()
test_tokens_tensor = torch.tensor(test_tokens_ids)
test_y_tensor = torch.tensor(test_y.reshape(-1, 1)).float()
train_dataset = TensorDataset(train_tokens_tensor, train_y_tensor)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE)
test_dataset = TensorDataset(test_tokens_tensor, test_y_tensor)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=BATCH_SIZE)

We'll use the Adam optimizer and the binary cross-entropy loss (BCELoss), and train the model for 10 epochs:

bert_clf = BertBinaryClassifier()
bert_clf = bert_clf.cuda()
optimizer = Adam(bert_clf.parameters(), lr=3e-6)
bert_clf.train()
for epoch_num in range(EPOCHS):
    for step_num, batch_data in enumerate(train_dataloader):
        token_ids, labels = tuple(t.to(device) for t in batch_data)
        probas = bert_clf(token_ids)
        loss_func = nn.BCELoss()
        batch_loss = loss_func(probas, labels)
        bert_clf.zero_grad()
        batch_loss.backward()
        optimizer.step()

For those who are not familiar with PyTorch, let's go over the code step by step.

First, we create the BertBinaryClassifier as we defined above. We move it to the GPU by calling bert_clf.cuda(). We create the Adam optimizer with our model's parameters (which the optimizer will update) and a learning rate I found to work well.

For each step in each epoch, we do the following:

• Move the batch tensors to the GPU with .to(device).
• Run the forward pass to get the predictions: probas = bert_clf(token_ids).
• Compute the loss: batch_loss = loss_func(probas, labels).
• Zero the previously accumulated gradients with bert_clf.zero_grad() and compute the new gradients with batch_loss.backward().
• Update the model's weights with optimizer.step().

After 10 epochs, I got pretty good results.
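
For completeness, here is a minimal evaluation sketch (my addition, not from the post): we switch the model to evaluation mode, run the test data loader without computing gradients, and threshold the predicted probabilities at 0.5:

# Evaluate on the test set (assumes the tensors and loaders defined above)
bert_clf.eval()
correct = 0
with torch.no_grad():
    for token_ids, labels in test_dataloader:
        token_ids, labels = token_ids.to(device), labels.to(device)
        probas = bert_clf(token_ids)
        preds = (probas > 0.5).float()
        correct += (preds == labels).sum().item()
print('test accuracy:', correct / len(test_dataset))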

Conclusion

BERT is a very powerful model and can be applied to many tasks. For me, it has provided very good results on the tasks I work on. I hope this post helped you better understand the practical aspects of working with BERT. As mentioned before, you can find all the code in this notebook.

