BERT to the rescue!

In this post, I want to show how to apply BERT to a simple text classification problem. I assume that you're more or less familiar with what BERT is on a high level, so I'll focus on the practical side and show you how to use it in your own work. Roughly speaking, BERT is a model that knows how to represent text. You give it a sequence as input; it then looks left and right several times and produces a vector representation for each word as the output.

In their paper, the authors describe two ways to work with BERT. The first is as a "feature extraction" mechanism: we use BERT's final output as the input to another model, "extracting" features from the text with BERT and then using them in a separate model for the actual task at hand. The second is "fine-tuning": we add additional layer(s) on top of BERT and train the whole thing together, so we both train our additional layer(s) and change (fine-tune) BERT's weights.

Here I want to show the second method and present a step-by-step solution to a very simple and popular text classification task: IMDB movie reviews sentiment classification. This may not be the hardest task to solve, and applying BERT to it might be slight overkill, but most of the steps shown here are the same for almost every task, no matter how complex it is.

Before diving into the actual code, let's understand the general structure of BERT and what we need to do to use it in a classification task. As mentioned before, the input to BERT is generally a sequence of words and the output is a sequence of vectors. BERT allows us to perform different tasks based on its output, so for different task types we need to change the input and/or the output slightly. In the figure below you can see 4 different task types; for each one, you can see what the input and output of the model should be.

[Figure: BERT input and output configurations for 4 different task types]

You can see that for the input, there's always a special [CLS] token (which stands for classification) at the start of each sequence, and a special [SEP] token that separates the two parts of the input.

For the output, if we're interested in classification, we need to use the output of the first token (the [CLS] token). For more complicated outputs, we can use the outputs of all the other tokens.

We are interested in "Single Sentence Classification" (top right), so we'll add the special [CLS] token and use its output as the input to a linear layer followed by a sigmoid activation, which performs the actual classification.

Now let's understand the task at hand: given a movie review, predict whether it's positive or negative. The dataset we use is 50,000 IMDB reviews (25K for train and 25K for test) from the PyTorch-NLP library. Each review is tagged pos or neg. There are 50% positive reviews and 50% negative reviews in both the train and test sets.

You can find all the code in this notebook.

1. Preparing the Data
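
The code snippets below assume a handful of imports and constants that the post doesn't spell out explicitly. Here is a sketch of what the notebook relies on (pytorch-nlp for the dataset, PyTorch-Pretrained-BERT for the model and tokenizer, Keras' pad_sequences for padding); the batch size is my own guess:

# Assumed imports and constants for the snippets below
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torchnlp.datasets import imdb_dataset                       # pytorch-nlp
from pytorch_pretrained_bert import BertTokenizer, BertModel     # PyTorch-Pretrained-BERT
from keras.preprocessing.sequence import pad_sequences

BATCH_SIZE = 4                   # assumed; pick whatever fits your GPU memory
EPOCHS = 10                      # the post trains for 10 epochs
device = torch.device('cuda')    # assumed: the post trains on a GPU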

We load the data using the pytorch-nlp library:

train_data, test_data = imdb_dataset(train=True, test=True)

Each instance in this dataset is a dictionary with 2 fields: text and sentiment:

{
    'sentiment': 'pos',  
    'text': 'Having enjoyed Joyces complex nove...'
}

We create two variables for each set, one for texts and one for the labels:

train_texts, train_labels = list(zip(*map(lambda d: (d['text'], d['sentiment']), train_data)))
test_texts, test_labels = list(zip(*map(lambda d: (d['text'], d['sentiment']), test_data)))

Next, we need to tokenize our texts. BERT was trained using WordPiece tokenization, which means that a word can be broken down into more than one sub-word. For example, if I tokenize the sentence "Hi my name is Dima" I'll get:

tokenizer.tokenize('Hi my name is Dima')
# OUTPUT
['hi', 'my', 'name', 'is', 'dim', '##a']

This kind of tokenization is beneficial when dealing with out-of-vocabulary words, and it may help represent complicated words better. The sub-words are constructed during training and depend on the corpus the model was trained on. We could of course use any other tokenization technique, but we'll get the best results if we tokenize with the same tokenizer the BERT model was trained with. The PyTorch-Pretrained-BERT library provides a tokenizer for each of BERT's models. Here we use the basic bert-base-uncased model; there are several other models, including much larger ones. The maximum sequence length for BERT is 512, so we'll truncate any review that is longer than that.

The code below creates the tokenizer, tokenizes each review, prepends the special [CLS] token, and keeps only the first 511 tokens of each review (so that with [CLS] each sequence is at most 512 tokens long), for both the train and test sets:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
train_tokens = list(map(lambda t: ['[CLS]'] + tokenizer.tokenize(t)[:511], train_texts))
test_tokens = list(map(lambda t: ['[CLS]'] + tokenizer.tokenize(t)[:511], test_texts))

Next, we need to convert each token in each review to its id in the tokenizer vocabulary. If a token is not present in the vocabulary, the tokenizer will use the special [UNK] token and its id:

train_tokens_ids = list(map(tokenizer.convert_tokens_to_ids, train_tokens))
test_tokens_ids = list(map(tokenizer.convert_tokens_to_ids, test_tokens))

Finally, we need to pad our input so all sequences have the same length of 512. That means that for any review shorter than 512 tokens, we'll add zeros until it reaches 512 tokens:

train_tokens_ids = pad_sequences(train_tokens_ids, maxlen=512, truncating="post", padding="post", dtype="int")
test_tokens_ids = pad_sequences(test_tokens_ids, maxlen=512, truncating="post", padding="post", dtype="int")

Our target variable is currently a list of neg and pos strings. We’ll convert it to numpy arrays of booleans:

train_y = np.array(train_labels) == 'pos'
test_y = np.array(test_labels) == 'pos'

2. Model Building

We’ll use PyTorch and the excellent PyTorch-Pretrained-BERT library for the model building. Actually, there’s a very similar model already implemented in this library and we could’ve used this one. For this post, I want to implement it myself so we can better understand what’s going on.

Before we create our model, let’s see how we can use the BERT model as implemented in the PyTorch-Pretrained-BERT library:

bert = BertModel.from_pretrained('bert-base-uncased')
x = torch.tensor(train_tokens_ids[:3])
y, pooled = bert(x, output_all_encoded_layers=False)
print('x shape:', x.shape)
print('y shape:', y.shape)
print('pooled shape:', pooled.shape)
# OUTPUT
x shape: (3, 512)
y shape: (3, 512, 768)
pooled shape: (3, 768)

First, we create the BERT model, then we create a PyTorch tensor with the first 3 reviews from our training set and pass it through the model. The output is two variables. Let's understand all the shapes: x is of size (3, 512) because we took only 3 reviews, 512 tokens each. y is of size (3, 512, 768); this is BERT's final-layer output for each token (we could pass output_all_encoded_layers=True to get the output of all 12 layers), and each token in each review is represented by a vector of size 768. pooled is of size (3, 768); this is the output corresponding to our [CLS] token, the first token in the sequence.
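
One detail worth noting (my addition, not from the original text): pooled is derived from the [CLS] token, but it is not literally y[:, 0, :]. In PyTorch-Pretrained-BERT, the final hidden state of [CLS] is passed through BERT's pooler, a linear layer followed by a Tanh, to produce pooled:

# pooled = tanh(Linear(final hidden state of [CLS])), so it differs from the raw vector
cls_hidden = y[:, 0, :]                      # raw final-layer vector of the [CLS] token, shape (3, 768)
print(torch.allclose(cls_hidden, pooled))    # False: the pooler transforms it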

Our goal is to take BERT's pooled output and apply a linear layer and a sigmoid activation on top of it. Here's what our model looks like:

class BertBinaryClassifier(nn.Module):
    def __init__(self, dropout=0.1):
        super(BertBinaryClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, tokens):
        _, pooled_output = self.bert(tokens, output_all_encoded_layers=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        proba = self.sigmoid(linear_output)
        return proba

Every model in PyTorch is an nn.Module object, which means every model we build must provide 2 methods. The __init__ method declares all the different parts the model will use. In our case, we create the BERT model that we'll fine-tune, a dropout layer, the linear layer, and the sigmoid activation. The forward method is the actual code that runs during the forward pass (like the predict method in sklearn or keras). Here we take the tokens input and pass it to the BERT model. The output of BERT is 2 variables, as we have seen before; we use only the second one (the _ name is used to emphasize that the first is not used). We take the pooled output, apply dropout, and pass it to the linear layer. Finally, we use the sigmoid activation to produce the actual probability.
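
As a quick sanity check (my own addition, not part of the original notebook), we can run the untrained classifier on a few reviews and verify that we get one probability per review:

# Forward pass on the first 3 reviews; gradients are not needed here
clf = BertBinaryClassifier()
with torch.no_grad():
    probas = clf(torch.tensor(train_tokens_ids[:3]))
print(probas.shape)   # torch.Size([3, 1])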

3. Training/Fine-tuning

The training is pretty standard. First, we prepare our tensors and data loaders:

train_tokens_tensor = torch.tensor(train_tokens_ids)
train_y_tensor = torch.tensor(train_y.reshape(-1, 1)).float()
test_tokens_tensor = torch.tensor(test_tokens_ids)
test_y_tensor = torch.tensor(test_y.reshape(-1, 1)).float()
train_dataset = TensorDataset(train_tokens_tensor, train_y_tensor)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE)
test_dataset = TensorDataset(test_tokens_tensor, test_y_tensor)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=BATCH_SIZE)

We'll use the Adam optimizer and the binary cross-entropy loss (BCELoss), and train the model for 10 epochs:

bert_clf = BertBinaryClassifier()
bert_clf = bert_clf.cuda()
optimizer = Adam(bert_clf.parameters(), lr=3e-6)
bert_clf.train()
for epoch_num in range(EPOCHS):
    for step_num, batch_data in enumerate(train_dataloader):
        token_ids, labels = tuple(t.to(device) for t in batch_data)
        probas = bert_clf(token_ids)
        loss_func = nn.BCELoss()
        batch_loss = loss_func(probas, labels)
        bert_clf.zero_grad()
        batch_loss.backward()
        optimizer.step()

For those who are not familiar with PyTorch, let's go over the code step by step.

First, we create the BertBinaryClassifier as we defined above. We move it to the GPU by calling bert_clf.cuda(). We create the Adam optimizer with our model's parameters (which the optimizer will update) and a learning rate I found to work well.

For each step in each epoch, we do the following:

• Move the batch tensors to the GPU with .to(device).
• Run the forward pass to get the predictions: probas = bert_clf(token_ids).
• Compute the loss: batch_loss = loss_func(probas, labels).
• Zero the previously accumulated gradients with bert_clf.zero_grad() and compute the new gradients with batch_loss.backward().
• Update the model's weights with optimizer.step().

After 10 epochs, I got pretty good results.
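
For completeness, here is a minimal evaluation sketch (my addition, not from the post): we switch the model to evaluation mode, run the test data loader without computing gradients, and threshold the predicted probabilities at 0.5:

# Evaluate on the test set (assumes the tensors and loaders defined above)
bert_clf.eval()
correct = 0
with torch.no_grad():
    for token_ids, labels in test_dataloader:
        token_ids, labels = token_ids.to(device), labels.to(device)
        probas = bert_clf(token_ids)
        preds = (probas > 0.5).float()
        correct += (preds == labels).sum().item()
print('test accuracy:', correct / len(test_dataset))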

Conclusion

BERT is a very powerful model and can be applied to many tasks. For me, it has provided very good results on the tasks I work on. I hope this post helped you better understand the practical aspects of working with BERT. As mentioned before, you can find all the code in this notebook.

