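Alibaba Tianchi NLP Beginner Competition, Bert Solution -3: Bert Pre-training and Classification

Master's student in computer science, Harbin Engineering University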

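Preface

This article documents my work on the Alibaba Tianchi NLP beginner competition. It walks through the whole data-processing pipeline and shows how to build a model from scratch, so it is suitable for newcomers.

The competition uses news text as its data; the dataset becomes visible and downloadable after registration. The text is anonymized at the character level. It covers 14 candidate categories: finance, lottery, real estate, stocks, home, education, technology, society, fashion, politics, sports, horoscope, games, and entertainment. In essence, this is a 14-class text classification problem.

The data consists of a training set of 200,000 samples, test set A with 50,000 samples, and test set B with 50,000 samples.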

Competition page: https://tianchi.aliyun.com/competition/entrance/531810/introduction

The data can be downloaded from the link above.

Code: https://github.com/zhangxiann/Tianchi-NLP-Beginner

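The Bert solution is covered in three articles: data preprocessing, a walkthrough of the Bert source code, and Bert pre-training and classification (this article).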

The previous article walked through the Bert source code.

This article shows how to pre-train Bert and how to use Bert for classification.

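Training Bert

We have finished going through the Bert source code, so now let's see how to train Bert.

The code for training Bert is in run_pretraining.py.

Script

The training script is run_pretraining.sh, with the following contents: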

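# --input_file: the preprocessed tfrecord files
# --output_dir: where the trained model is saved
# --do_train / --do_eval: enable training / evaluation
# --bert_config_file: path to the Bert config file
# --train_batch_size / --eval_batch_size: batch size for training / evaluation
# --max_seq_length: maximum sequence length
# --max_predictions_per_seq: maximum number of masked tokens per sequence
# --learning_rate: learning rate
python run_pretraining.py \
  --input_file=./records/*.tfrecord \
  --output_dir=./bert-mini \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./bert-mini/bert_config.json \
  --train_batch_size=128 \
  --eval_batch_size=128 \
  --max_seq_length=256 \
  --max_predictions_per_seq=32 \
  --learning_rate=1e-4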

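Training is driven mainly by TensorFlow's Estimator (here, TPUEstimator). The Estimator lets us define the training logic ourselves in a model_fn; once it is given the training input, it runs the training loop automatically.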
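To picture the control flow, here is a small framework-free sketch. It is not TensorFlow code: `ToyEstimator`, `toy_model_fn_builder`, `toy_input_fn`, and the dictionary-based "spec" are hypothetical stand-ins that only mimic how an Estimator calls a `model_fn` closure with a `mode` argument.

```python
# Framework-free sketch of the Estimator pattern (hypothetical names, not the TF API).
TRAIN, EVAL = "train", "eval"


def toy_model_fn_builder(learning_rate):
    """Return a model_fn closure that remembers the hyperparameters."""

    def model_fn(features, mode):
        # Toy "loss": mean squared value of the inputs in this batch.
        loss = sum(x * x for x in features) / len(features)
        if mode == TRAIN:
            # A real model_fn would also return a train_op built by an optimizer.
            return {"mode": mode, "loss": loss, "learning_rate": learning_rate}
        if mode == EVAL:
            return {"mode": mode, "loss": loss, "metrics": {"mean_loss": loss}}
        raise ValueError("Only TRAIN and EVAL modes are supported: %s" % mode)

    return model_fn


class ToyEstimator:
    """Calls model_fn on every batch supplied by the input function."""

    def __init__(self, model_fn):
        self.model_fn = model_fn

    def run(self, input_fn, mode):
        return [self.model_fn(batch, mode) for batch in input_fn()]


def toy_input_fn():
    # Two toy batches of features.
    return [[1.0, 2.0], [3.0, 4.0]]


estimator = ToyEstimator(toy_model_fn_builder(learning_rate=1e-4))
train_specs = estimator.run(toy_input_fn, TRAIN)
eval_specs = estimator.run(toy_input_fn, EVAL)
```

The point of the builder is that hyperparameters such as the learning rate are captured once in the closure, while the driver only ever passes features and a mode.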

The main functions in run_pretraining.py are model_fn_builder(), get_masked_lm_output(), and get_next_sentence_output().

model_fn_builder()

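This function builds the Bert model, obtains its outputs, and calls get_masked_lm_output() to compute the loss for predicting the masked tokens. The original code also calls get_next_sentence_output() to compute the next-sentence prediction loss, but here we do not predict the order of sentences, so that loss is not computed.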

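import tensorflow as tf  # TensorFlow 1.x API (tf.logging, tf.contrib)

import modeling      # modeling.py from the Bert code
import optimization  # optimization.py from the Bert code


def model_fn_builder(bert_config, init_checkpoint, learning_rate,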
                     num_train_steps, num_warmup_steps, use_tpu,
                     use_one_hot_embeddings):
    """Returns `model_fn` closure for TPUEstimator."""

    def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
        """The `model_fn` for TPUEstimator."""

        tf.logging.info("*** Features ***")
        for name in sorted(features.keys()):
            tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))
        # input_ids: [batch_size, seq_length]
        input_ids = features["input_ids"]
        # input_mask: [batch_size, seq_length]
        input_mask = features["input_mask"]
        # segment_ids: [batch_size, seq_length]
        segment_ids = features["segment_ids"]
        # masked_lm_positions: [batch_size, max_predictions_per_seq]
        masked_lm_positions = features["masked_lm_positions"]
        # masked_lm_ids: [batch_size, max_predictions_per_seq]
        masked_lm_ids = features["masked_lm_ids"]
        # masked_lm_weights: [batch_size, max_predictions_per_seq]
        masked_lm_weights = features["masked_lm_weights"]
        # NSP is not used here, so this variable is never read
        next_sentence_labels = features["next_sentence_labels"]

        is_training = (mode == tf.estimator.ModeKeys.TRAIN)
        # Build the Bert model
        model = modeling.BertModel(
            config=bert_config,
            is_training=is_training,
            input_ids=input_ids,
            input_mask=input_mask,
            token_type_ids=segment_ids,
            use_one_hot_embeddings=use_one_hot_embeddings)

        # Call get_masked_lm_output to compute the masked LM loss
        (masked_lm_loss,
         masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
            bert_config, model.get_sequence_output(), model.get_embedding_table(),
            masked_lm_positions, masked_lm_ids, masked_lm_weights)

        total_loss = masked_lm_loss

        # No NSP
        # (next_sentence_loss, next_sentence_example_loss,
        #  next_sentence_log_probs) = get_next_sentence_output(
        #     bert_config, model.get_pooled_output(), next_sentence_labels)
        #
        # total_loss = masked_lm_loss + next_sentence_loss

        tvars = tf.trainable_variables()

        initialized_variable_names = {}
        scaffold_fn = None
        # If a previously trained checkpoint is given, load its parameters
        if init_checkpoint:
            (assignment_map, initialized_variable_names
             ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
            if use_tpu:

                def tpu_scaffold():
                    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
                    return tf.train.Scaffold()

                scaffold_fn = tpu_scaffold
            else:
                tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

        tf.logging.info("**** Trainable Variables ****")
        for var in tvars:
            init_string = ""
            if var.name in initialized_variable_names:
                init_string = ", *INIT_FROM_CKPT*"
            tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                            init_string)

        output_spec = None
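        if mode == tf.estimator.ModeKeys.TRAIN:
            # Training
            # Build the optimizer and the training op
            train_op = optimization.create_optimizer(
                total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
            # Package the loss, train_op and scaffold_fn into the spec that
            # TPUEstimator expects model_fn to return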
            output_spec = tf.contrib.tpu.TPUEstimatorSpec(
                mode=mode,
                loss=total_loss,
                train_op=train_op,
                scaffold_fn=scaffold_fn)
        elif mode == tf.estimator.ModeKeys.EVAL:
            # Evaluation
            def metric_fn(masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
                          masked_lm_weights):
                """Computes the loss and accuracy of the model."""
                # masked_lm_log_probs: [batch_size * max_predictions_per_seq, vocab_size]
                masked_lm_log_probs = tf.reshape(masked_lm_log_probs,
                                                 [-1, masked_lm_log_probs.shape[-1]])
                # Take the index of the largest value as the predicted id: [batch_size * max_predictions_per_seq]
                masked_lm_predictions = tf.argmax(
                    masked_lm_log_probs, axis=-1, output_type=tf.int32)
                # masked_lm_example_loss: [batch_size * max_predictions_per_seq]
                masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1])
                masked_lm_ids = tf.reshape(masked_lm_ids, [-1])
                masked_lm_weights = tf.reshape(masked_lm_weights, [-1])
                # Weighted mean accuracy (padded prediction slots have weight 0)
                masked_lm_accuracy = tf.metrics.accuracy(
                    labels=masked_lm_ids,
                    predictions=masked_lm_predictions,
                    weights=masked_lm_weights)
                # Mean loss: the same weighted average as masked_lm_loss, accumulated over all eval batches
                masked_lm_mean_loss = tf.metrics.mean(
                    values=masked_lm_example_loss, weights=masked_lm_weights)
                # Return the accuracy and the loss
                return {
                    "masked_lm_accuracy": masked_lm_accuracy,
                    "masked_lm_loss": masked_lm_mean_loss,
                }

            eval_metrics = (metric_fn, [
                masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
                masked_lm_weights
            ])
            output_spec = tf.contrib.tpu.TPUEstimatorSpec(
                mode=mode,
                loss=total_loss,
                eval_metrics=eval_metrics,
                scaffold_fn=scaffold_fn)
        else:
            raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode))

        return output_spec

    return model_fn
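
The role of masked_lm_weights in metric_fn is easy to miss, so here is a minimal pure-Python sketch of the two metrics on a toy batch. The helper names `weighted_accuracy` and `weighted_mean` and all the numbers are made up for illustration; the real code uses tf.metrics.accuracy and tf.metrics.mean, which additionally accumulate across batches.

```python
def weighted_accuracy(labels, predictions, weights):
    """Accuracy over positions, counting each position by its weight."""
    correct = sum(w for label, pred, w in zip(labels, predictions, weights)
                  if label == pred)
    total = sum(weights)
    return correct / total if total else 0.0


def weighted_mean(values, weights):
    """Weighted mean of values; positions with weight 0 are ignored."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total if total else 0.0


# Flattened toy tensors: 2 sequences x 3 prediction slots.
masked_lm_ids = [5, 9, 0, 7, 0, 0]            # id 0 fills the padded slots
masked_lm_predictions = [5, 2, 0, 7, 3, 0]
masked_lm_weights = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
masked_lm_example_loss = [0.2, 1.4, 9.9, 0.5, 9.9, 9.9]

accuracy = weighted_accuracy(masked_lm_ids, masked_lm_predictions, masked_lm_weights)
mean_loss = weighted_mean(masked_lm_example_loss, masked_lm_weights)
```

Only the three slots with weight 1.0 count: two of them are predicted correctly, and the large losses on the padded slots do not affect the mean.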

get_masked_lm_output()

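get_masked_lm_output() computes the loss for the masked-token predictions.

Input parameters:

input_tensor: the output of the last layer of BertModel, with shape [batch_size, seq_length, hidden_size].
output_weights: the word embedding table, with shape [vocab_size, hidden_size].
positions: the positions of the masked tokens, with shape [batch_size, max_predictions_per_seq].
label_ids: the true token ids at the masked positions, with shape [batch_size, max_predictions_per_seq].
label_weights: the weight of each masked position (1.0 for a real prediction, 0.0 for padding), with shape [batch_size, max_predictions_per_seq].

The procedure is:

Gather from input_tensor the outputs at positions, giving shape [batch_size * max_predictions_per_seq, hidden_size].
Pass the gathered vectors through a dense layer and layer_norm; the shape is still [batch_size * max_predictions_per_seq, hidden_size].
Multiply by the transpose of output_weights and add output_bias to get logits, with shape [batch_size * max_predictions_per_seq, vocab_size], then apply log_softmax to get log_probs with the same shape.
Compute the weighted loss from log_probs and the one-hot true labels one_hot_labels.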
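
The last two steps can be checked by hand with a minimal pure-Python sketch. It only mirrors the arithmetic (log-softmax, picking the log-probability of the true token, weighted average with the 1e-5 term in the denominator); the function names and the toy numbers are illustrative and not taken from the Bert code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_lm_loss(logits, label_ids, label_weights):
    """Weighted average negative log-likelihood over prediction slots."""
    per_example_loss = []
    for row, label in zip(logits, label_ids):
        # Same value as -sum(log_probs * one_hot(label)).
        per_example_loss.append(-log_softmax(row)[label])
    numerator = sum(w * loss_i for w, loss_i in zip(label_weights, per_example_loss))
    denominator = sum(label_weights) + 1e-5
    return numerator / denominator, per_example_loss


# Toy input: 3 prediction slots, vocabulary of 4 tokens, last slot is padding.
logits = [[2.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0],
          [1.0, 1.0, 1.0, 1.0]]
label_ids = [0, 3, 0]
label_weights = [1.0, 1.0, 0.0]

loss, per_example_loss = masked_lm_loss(logits, label_ids, label_weights)
```

For the second slot the logits are uniform, so its loss is log(4); the third slot has weight 0.0 and contributes nothing to the final loss.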

# input_tensor: [batch_size, seq_length, hidden_size]
# output_weights: [vocab_size, hidden_size]
def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):
    """Get loss and log probs for the masked LM."""
    # Gather the hidden states at the masked positions
    # before: input_tensor is [batch_size, seq_length, hidden_size]
    # after:  input_tensor is [batch_size * max_predictions_per_seq, hidden_size]
    input_tensor = gather_indexes(input_tensor, positions)

    with tf.variable_scope("cls/predictions"):
        # We apply one more non-linear transformation before the output layer.
        # This matrix is not used after pre-training.
        # Pass the gathered vectors through a dense layer and layer_norm
        with tf.variable_scope("transform"):
            input_tensor = tf.layers.dense(
                input_tensor,
                units=bert_config.hidden_size,
                activation=modeling.get_activation(bert_config.hidden_act),
                kernel_initializer=modeling.create_initializer(
                    bert_config.initializer_range))
            input_tensor = modeling.layer_norm(input_tensor)

        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        # output_bias: [vocab_size]
        output_bias = tf.get_variable(
            "output_bias",
            shape=[bert_config.vocab_size],
            initializer=tf.zeros_initializer())

        # transpose_b=True transposes the second argument before multiplying
        # logits: [batch_size * max_predictions_per_seq, vocab_size]
        logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        # log_probs: [batch_size * max_predictions_per_seq, vocab_size]
        log_probs = tf.nn.log_softmax(logits, axis=-1)
        # [batch_size, max_predictions_per_seq] -> batch_size * max_predictions_per_seq
        label_ids = tf.reshape(label_ids, [-1])
        # [batch_size, max_predictions_per_seq] -> batch_size * max_predictions_per_seq
        label_weights = tf.reshape(label_weights, [-1])
        # one_hot_labels: [batch_size * max_predictions_per_seq, vocab_size]
        one_hot_labels = tf.one_hot(label_ids, depth=bert_config.vocab_size, dtype=tf.float32)

        # The `positions` tensor might be zero-padded (if the sequence is too
        # short to have the maximum number of predictions). The `label_weights`
        # tensor has a value of 1.0 for every real prediction and 0.0 for the
        # padding predictions.
        # per_example_loss: [batch_size * max_predictions_per_seq]; the element-wise
        # product with one_hot_labels keeps only the log-probability of the true token
        per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
        numerator = tf.reduce_sum(label_weights * per_example_loss)  # weighted sum of the losses
        denominator = tf.reduce_sum(label_weights) + 1e-5  # number of real predictions; 1e-5 avoids division by zero
        # Weighted average loss
        loss = numerator / denominator

    return (loss, per_example_loss, log_probs)
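
The code above calls gather_indexes() without showing it. As a rough illustration of the idea, here is a pure-Python sketch that works on nested lists instead of tensors. It assumes the flat-offset approach (flatten the batch, then add batch_index * seq_length to each position); treat it as an approximation of the helper in run_pretraining.py, not a copy of it.

```python
def gather_indexes(sequence_tensor, positions):
    """Pick the vectors at `positions` from each sequence.

    sequence_tensor: nested list, [batch_size][seq_length][width]
    positions: nested list, [batch_size][max_predictions_per_seq]
    returns: list of batch_size * max_predictions_per_seq vectors
    """
    seq_length = len(sequence_tensor[0])
    # Flatten to [batch_size * seq_length][width].
    flat_sequence = [vec for seq in sequence_tensor for vec in seq]
    output = []
    for batch_index, row in enumerate(positions):
        # Offset of this sequence inside the flattened list.
        offset = batch_index * seq_length
        for pos in row:
            output.append(flat_sequence[offset + pos])
    return output


# Toy input: batch_size = 2, seq_length = 3, width = 2.
# Each vector is [batch_index, token_index] so the result is easy to read.
sequence_tensor = [[[0, 0], [0, 1], [0, 2]],
                   [[1, 0], [1, 1], [1, 2]]]
positions = [[2, 0], [1, 0]]

gathered = gather_indexes(sequence_tensor, positions)
```

Note that a padded entry in positions (value 0) still gathers the vector at index 0; this is harmless because label_weights is 0.0 for those slots.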

After training finishes, the trained model is saved to output_dir.

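Converting to a PyTorch model

We trained the model with TensorFlow, but our text classification model is written in PyTorch, so the TensorFlow model has to be converted to a PyTorch model.

Here we use the conversion code provided by HuggingFace.

The code file is convert_checkpoint.py and the script file is convert_checkpoint.sh. The script is: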

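# BERT_BASE_DIR: the model directory
# --bert_config_file: the Bert config file
# --tf_checkpoint: name of the TensorFlow checkpoint
# --config: config file (the same bert_config.json)
# --pytorch_dump_path: name of the output PyTorch model
export BERT_BASE_DIR=./bert-mini
python convert_checkpoint.py \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --tf_checkpoint $BERT_BASE_DIR/bert_model.ckpt-100000 \
  --config $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_path $BERT_BASE_DIR/pytorch_model.bin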

Note that you need to install tensorflow, pytorch, and transformers first.
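
Before loading the converted model, it can help to check that the directory contains the files the loader will look for. The sketch below is an assumption on my part rather than part of the original project: as far as I know, `BertModel.from_pretrained()` given a directory looks for `config.json` and `pytorch_model.bin` by default (this may differ between transformers versions), so a config saved as bert_config.json may need to be copied to config.json. The helper name `missing_model_files` is hypothetical.

```python
import os
import tempfile


def missing_model_files(model_dir, required=("config.json", "pytorch_model.bin")):
    """Return the required file names that are absent from model_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]


# Demonstration on a temporary directory standing in for ./bert-mini.
with tempfile.TemporaryDirectory() as demo_dir:
    # Only the converted weights exist at first.
    open(os.path.join(demo_dir, "pytorch_model.bin"), "wb").close()
    missing_before = missing_model_files(demo_dir)
    # After adding a config.json, nothing is missing.
    open(os.path.join(demo_dir, "config.json"), "w").close()
    missing_after = missing_model_files(demo_dir)
```

In a real run you would call `missing_model_files("./bert-mini")` and fix whatever it reports before constructing the encoder.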

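Fine-tuning the Bert model

In an earlier article, "Alibaba Tianchi NLP Beginner Competition: TextCNN Solution with Detailed Code Comments and Walkthrough", we trained the model with TextCNN. The model architecture is shown below:

The `WordCNNEncoder` in the figure is TextCNN.

We replace TextCNN with Bert.

The resulting model architecture is shown below:

We only look at how `WordBertEncoder` is used. The rest of the model is the same as in the earlier article; see [Alibaba Tianchi NLP Beginner Competition: TextCNN Solution with Detailed Code Comments and Walkthrough](https://zhuanlan.zhihu.com/p/183862056).

The code of WordBertEncoder is shown below.

It first loads the converted PyTorch model.

In forward(), input_ids and token_type_ids are fed into the Bert model.

This yields sequence_output (the hidden states of the last Encoder layer for every token) and pooled_output (the hidden state of the first token in the last Encoder layer, passed through a linear layer and a Tanh activation).

The code is commented in detail.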
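
To make the difference between the two outputs concrete, here is a tiny pure-Python sketch of what the pooler does to the CLS hidden state: pooled = tanh(W h + b). The weights, bias, and hidden states are made-up toy values (the real model uses hidden_size 256 and learned parameters), and the function name `pooler` is only illustrative.

```python
import math


def pooler(cls_hidden, weight, bias):
    """Compute tanh(W @ h_cls + b) for one sentence using plain lists."""
    return [math.tanh(sum(w * h for w, h in zip(row, cls_hidden)) + b)
            for row, b in zip(weight, bias)]


# Toy sequence_output for one sentence: 3 tokens, hidden_size = 2.
sequence_output = [[0.5, -1.0], [0.1, 0.2], [0.3, 0.4]]
cls_hidden = sequence_output[0]        # what sequence_output[:, 0, :] selects per sentence
weight = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical pooler weights (identity matrix)
bias = [0.0, 0.0]

pooled_output = pooler(cls_hidden, weight, bias)
```

With self.pooled = False, the encoder below returns the raw CLS hidden state (cls_hidden here) rather than the transformed pooled_output.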

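import logging
import os.path as osp

import torch.nn as nn
from transformers import BertModel

# build word encoder
# `dir` (the project directory) and WhitespaceTokenizer are defined elsewhere in the project code
bert_path = osp.join(dir, './bert/bert-mini/')
dropout = 0.15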


class WordBertEncoder(nn.Module):
    def __init__(self):
        super(WordBertEncoder, self).__init__()
        self.dropout = nn.Dropout(dropout)

        self.tokenizer = WhitespaceTokenizer()
        # Load the Bert model
        self.bert = BertModel.from_pretrained(bert_path)

        self.pooled = False
        logging.info('Build Bert encoder with pooled {}.'.format(self.pooled))

    def encode(self, tokens):
        tokens = self.tokenizer.tokenize(tokens)
        return tokens

    # Parameters whose names contain 'bias' or 'LayerNorm.weight' get no weight decay
    # All other parameters get a weight decay of 0.01
    def get_bert_parameters(self):
        no_decay = ['bias', 'LayerNorm.weight']
        optimizer_parameters = [
            {'params': [p for n, p in self.bert.named_parameters() if not any(nd in n for nd in no_decay)],
             'weight_decay': 0.01},
            {'params': [p for n, p in self.bert.named_parameters() if any(nd in n for nd in no_decay)],
             'weight_decay': 0.0}
        ]
        return optimizer_parameters

    def forward(self, input_ids, token_type_ids):
        # bert_len is the sequence length
        # input_ids: sen_num * bert_len
        # token_type_ids: sen_num * bert_len

        # 256 is the hidden_size
        # The Bert model returns up to 4 elements: last_hidden_state, pooler_output,
        # and, only when requested, hidden_states and attentions.
        # Indexing works both for the tuple returned by older transformers versions
        # and for the ModelOutput object returned by newer ones.
        outputs = self.bert(input_ids=input_ids, token_type_ids=token_type_ids)
        # sequence_output: sen_num * bert_len * 256, the hidden states of the last Encoder layer
        sequence_output = outputs[0]
        # pooled_output: sen_num * 256. The hidden state of the first token (CLS) in the
        # last Encoder layer, passed through a linear layer and a Tanh activation.
        # It can be used directly for classification.
        pooled_output = outputs[1]

        if self.pooled:
            reps = pooled_output             # pooled CLS vector: sen_num * 256
        else:
            reps = sequence_output[:, 0, :]  # raw hidden state of the first token (CLS): sen_num * 256

        if self.training:
            reps = self.dropout(reps)

        return reps # sen_num * 256
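
The name-based grouping in get_bert_parameters() can be tried out without loading a model. The sketch below applies the same substring test to a short list of parameter names; the names are illustrative examples in the style of a HuggingFace BertModel, and `split_by_weight_decay` is a hypothetical helper that returns names instead of tensors.

```python
def split_by_weight_decay(param_names, no_decay=("bias", "LayerNorm.weight")):
    """Group parameter names the same way get_bert_parameters() groups tensors."""
    with_decay = [n for n in param_names if not any(nd in n for nd in no_decay)]
    without_decay = [n for n in param_names if any(nd in n for nd in no_decay)]
    return [
        {"params": with_decay, "weight_decay": 0.01},
        {"params": without_decay, "weight_decay": 0.0},
    ]


# Illustrative parameter names (not read from a real checkpoint).
names = [
    "embeddings.word_embeddings.weight",
    "embeddings.LayerNorm.weight",
    "embeddings.LayerNorm.bias",
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.0.attention.self.query.bias",
]

groups = split_by_weight_decay(names)
```

In the real encoder the two groups hold the parameter tensors themselves, and the resulting list is the per-parameter-group format that PyTorch optimizers accept.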

If you have any questions, feel free to leave a comment.

If you found this article helpful, please give it a like; it motivates me to write more good articles.
