Matching task: recommending similar questions based on a user's search query

 2 years ago
source link: https://my.oschina.net/u/4067628/blog/3302123

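This case study introduces text similarity matching, one of the most basic NLP task types, and uses PaddlePaddle to build a semantic matching model that computes how similar two texts are.

Download and install commands

## CPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle

## GPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu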

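1. Background

Text semantic matching, put simply, computes a similarity score for two pieces of text in order to decide whether they are similar.

This case study uses similar-content recommendation as the example. In a scenario such as Baidu Zhidao, when a user searches for a question, the system returns the most similar existing questions so that the user can consult their answers. For instance, when a user searches for "头发最近掉得很厉害怎么办?" ("My hair has been falling out badly lately, what should I do?"), the query can be semantically matched against the questions in the library. If a stored question such as "最近容易脱发怎么办?" ("I have been losing hair easily lately, what should I do?") scores above 0.9, it can be recommended to the user.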
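
The recommendation step described above can be sketched as follows. This is a minimal illustration, not the SimNet model: `toy_similarity` is a hypothetical stand-in scorer (cosine similarity over word counts) and `recommend` is a hypothetical helper. Only the 0.9 threshold comes from the text.

```python
import math
from collections import Counter


def toy_similarity(a, b):
    """Cosine similarity over word counts (stand-in for a trained model)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(query, library, score_fn=toy_similarity, threshold=0.9):
    """Return (question, score) pairs scoring above `threshold`, best first."""
    scored = [(question, score_fn(query, question)) for question in library]
    kept = [(question, score) for question, score in scored if score > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

In a real system `score_fn` would be the trained matching model, and the threshold would be tuned on validation data rather than fixed in advance.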

1.1 Example

Here we provide three sentence pairs as the data to be predicted.

In[1]
# Declare path variables
DATA_PATH = "/home/aistudio/data/data12739"  # dataset path
WORK_PATH = "/home/aistudio/work/similarity_net"  # directory the scripts run in
EVAL_PATH = WORK_PATH + "/evaluate"
PAIRWISE_MODEL_PATH = WORK_PATH + "/model_files/bow_pairwise/200"
POINTWISE_MODEL_PATH = WORK_PATH + "/model_files/bow_pointwise/200"
PAIRWISE_DATA_PATH = DATA_PATH + "/train_pairwise"
POINTWISE_DATA_PATH = DATA_PATH + "/train_pointwise"
INIT_MODEL = WORK_PATH + "/model_files/simnet_bow_pairwise_pretrained_model"
In[2]
# Unzip the dataset
!cd {DATA_PATH} && unzip -qo data.zip;
!cp {DATA_PATH}/data/* {DATA_PATH};
In[3]
# Initialize the settings in run.sh (pairwise mode, pretrained checkpoint)
!cd {WORK_PATH} && sed -i 's#TRAIN_DATA_PATH=.*$#TRAIN_DATA_PATH={PAIRWISE_DATA_PATH}#' run.sh;
!cd {WORK_PATH} && sed -i 's#pointwise#pairwise#g' run.sh;
!cd {WORK_PATH} && sed -i 's#save_steps 2000#save_steps 200#g' run.sh;
!cd {WORK_PATH} && sed -i 's#validation_steps 2000#validation_steps 200#g' run.sh;
!cd {WORK_PATH} && sed -i 's#CONFIG_PATH=.*$#CONFIG_PATH=./config/bow_pairwise.json#' run.sh;
!cd {WORK_PATH} && sed -i 's#INIT_CHECKPOINT=.*$#INIT_CHECKPOINT={INIT_MODEL}#' run.sh;
In[4]
# View the data to be predicted
!cd {DATA_PATH} && cat infer;
车头 如何 放置 车牌	前 牌照 怎么 弄
车头 如何 放置 车牌	如何 办理 北京 车牌
车头 如何 放置 车牌	后 牌照 怎么 装
In[6]
# Predict with the pretrained model and view the results
!cd {WORK_PATH} && sh run.sh infer && cat infer_result
-----------  Configuration Arguments -----------
batch_size: 128
compute_accuracy: False
config_path: ./config/bow_pairwise.json
do_infer: True
do_test: False
do_train: False
do_valid: False
enable_ce: False
epoch: 10
infer_data_dir: /home/aistudio/data/data12739/infer
infer_result_path: ./infer_result
init_checkpoint: /home/aistudio/work/similarity_net/model_files/simnet_bow_pairwise_pretrained_model
lamda: 0.91
output_dir: None
save_steps: 200
skip_steps: 10
task_mode: pairwise
task_name: simnet
test_data_dir: None
test_result_path: test_result
train_data_dir: None
use_cuda: True
valid_data_dir: None
validation_steps: 100
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:35:22.937739   147 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:35:22.941510   147 device_context.cc:267] device: 0, cuDNN Version: 7.3.
start test process ...
infer result saved in /home/aistudio/work/similarity_net/./infer_result
车头 如何 放置 车牌	前 牌照 怎么 弄	0.8589292764663696
车头 如何 放置 车牌	如何 办理 北京 车牌	0.8042251467704773
车头 如何 放置 车牌	后 牌照 怎么 装	0.8330044150352478

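As shown above, the model ranks the candidates reasonably well: the question about the front license plate scores highest, and the question about registering a Beijing plate scores lowest.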
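
To make the ranking explicit, the result lines can be parsed and sorted by score. This is a small sketch that assumes each line of infer_result has the form `query<TAB>candidate<TAB>score`, as the output above suggests; `rank_candidates` is a hypothetical helper, not part of the project scripts.

```python
def rank_candidates(infer_lines):
    """Parse 'query<TAB>candidate<TAB>score' lines and sort by score, best first."""
    rows = []
    for line in infer_lines:
        query, candidate, score = line.rstrip("\n").split("\t")
        rows.append((query, candidate, float(score)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# The three result lines printed by the pretrained model above.
infer_lines = [
    "车头 如何 放置 车牌\t前 牌照 怎么 弄\t0.8589292764663696",
    "车头 如何 放置 车牌\t如何 办理 北京 车牌\t0.8042251467704773",
    "车头 如何 放置 车牌\t后 牌照 怎么 装\t0.8330044150352478",
]
for query, candidate, score in rank_candidates(infer_lines):
    print(f"{score:.4f}  {candidate}")
```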

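2 Quick start

This section shows how to prepare the data, define the network structure, and then quickly train, evaluate, and run prediction with a semantic matching model.

2.1 Data preparation

Training a matching model generally requires three datasets: a training set (train.txt), a validation set (dev.txt), and a test set (test.txt).

  • Training set: used to fit the model parameters. The model adjusts its parameters directly on this data to obtain better results.
  • Validation set: used during training to check the model's state and convergence. It is typically used for tuning hyperparameters: the setting whose model performs best on the validation set is the one selected.
  • Test set: used to compute the evaluation metrics and verify how well the model generalizes.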
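
If you start from a single labeled file, a three-way split can be sketched as below. The 80/10/10 ratios, the fixed seed, and the helper name `split_dataset` are assumptions for illustration; the article does not say how its own dataset was split.

```python
import random


def split_dataset(examples, dev_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle examples and split them into train / dev / test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_dev = int(len(shuffled) * dev_ratio)
    n_test = int(len(shuffled) * test_ratio)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test
```

Keeping the test set untouched until the end is what lets it measure generalization rather than tuning effort.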

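Here we provide a labeled dataset that has already been word-segmented. Its directory layout is:

.
├── train.txt   # training set
├── dev.txt     # validation set
├── test.txt    # test set
├── infer.txt   # data to be predicted
├── vocab.txt   # vocabulary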

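In addition, this matching task supports two training modes, pointwise and pairwise, and the training data must be prepared differently for each. The two modes are described in section 3, "Concepts".

1. pointwise mode: the data has three tab-separated ('\t') columns. The first and second columns are the two sentences to compare, each segmented into words by spaces; the third column is the label (0 means not similar, 1 means similar). Files are UTF-8 encoded. Example: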

现在 安卓模拟器 哪个 好 用    电脑 安卓模拟器 哪个 更好      1
长 的 清新 是 什么 意思     小 清新 的 意思 是 什么 0

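2. pairwise mode: the data has three tab-separated ('\t') columns. The first column is a space-segmented sentence, the second column is a sentence similar to the first, and the third column is a sentence not similar to the first. Files are UTF-8 encoded. Example: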

现在 安卓模拟器 哪个 好 用     电脑 安卓模拟器 哪个 更好      电信 手机 可以 用 腾讯 大王 卡 吗 ?
土豆 一亩地 能 收 多少 斤      一亩 地土豆 产 多少 斤        一亩 地 用 多少 斤 土豆 种子

In both modes, the validation and test sets use the same format as the pointwise training set.
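
The two line formats can be read with a few lines of Python. This is a sketch based only on the format description above; `parse_pointwise` and `parse_pairwise` are hypothetical helpers and are not the project's own data reader.

```python
def parse_pointwise(line):
    """'left<TAB>right<TAB>label' -> (left_tokens, right_tokens, label)."""
    left, right, label = line.rstrip("\n").split("\t")
    label = int(label)
    if label not in (0, 1):
        raise ValueError(f"label must be 0 or 1, got {label}")
    return left.split(), right.split(), label


def parse_pairwise(line):
    """'query<TAB>positive<TAB>negative' -> three token lists."""
    query, positive, negative = line.rstrip("\n").split("\t")
    return query.split(), positive.split(), negative.split()


print(parse_pointwise("长 的 清新 是 什么 意思\t小 清新 的 意思 是 什么\t0"))
print(parse_pairwise("土豆 一亩地 能 收 多少 斤\t一亩 地土豆 产 多少 斤\t一亩 地 用 多少 斤 土豆 种子"))
```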

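2.2 Defining the network structure

Traditional machine-learning classification methods require many hand-crafted features, such as the number of words, the length of the text, or the part of speech of each word. With the development of deep learning, many models have been validated and adopted, including BOW, CNN, RNN, and BiLSTM. They need no hand-crafted features and instead learn representations on top of word embeddings.

Here we take the classic BOW model as an example and show how to define the network structure with PaddlePaddle.

The network is configured as follows.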

In[7]
"""
bow class
"""

import sys
sys.path.append("work/models/matching")
import paddle_layers as layers


class BOW(object): 
    """
    BOW
    """
    
    def __init__(self, conf_dict):
        """
        initialize
        """
        self.dict_size = conf_dict["dict_size"]
        self.task_mode = conf_dict["task_mode"]
        self.emb_dim = conf_dict["net"]["emb_dim"]
        self.bow_dim = conf_dict["net"]["bow_dim"]

    def predict(self, left, right):
        """
        Forward network
        """
        # embedding layer
        emb_layer = layers.EmbeddingLayer(self.dict_size, self.emb_dim, "emb")
        left_emb = emb_layer.ops(left)
        right_emb = emb_layer.ops(right)
        # representation layer: sum-pool the word embeddings, then apply softsign
        pool_layer = layers.SequencePoolLayer("sum")
        left_pool = pool_layer.ops(left_emb)
        right_pool = pool_layer.ops(right_emb)
        softsign_layer = layers.SoftsignLayer()
        left_soft = softsign_layer.ops(left_pool)
        right_soft = softsign_layer.ops(right_pool)
        # matching layer
        if self.task_mode == "pairwise":
            bow_layer = layers.FCLayer(self.bow_dim, None, "fc")
            left_bow = bow_layer.ops(left_soft)
            right_bow = bow_layer.ops(right_soft)
            cos_sim_layer = layers.CosSimLayer()
            pred = cos_sim_layer.ops(left_bow, right_bow)
            return left_bow, pred
        else:
            concat_layer = layers.ConcatLayer(1)
            concat = concat_layer.ops([left_soft, right_soft])
            bow_layer = layers.FCLayer(self.bow_dim, None, "fc")
            concat_fc = bow_layer.ops(concat)
            softmax_layer = layers.FCLayer(2, "softmax", "cos_sim")
            pred = softmax_layer.ops(concat_fc)
            return left_soft, pred
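
To see what the pairwise branch computes without installing PaddlePaddle, here is a framework-free sketch of the same sequence of steps: embedding lookup, sum pooling, softsign, a shared fully connected layer, and cosine similarity. All sizes and the random weights are toy values chosen for illustration, and the helper names are hypothetical; this is not the `paddle_layers` implementation.

```python
import math
import random


def softsign(vec):
    return [x / (1.0 + abs(x)) for x in vec]


def sum_pool(token_ids, emb):
    """Sum the embedding rows of all tokens in one sentence."""
    dim = len(emb[0])
    return [sum(emb[t][d] for t in token_ids) for d in range(dim)]


def fc(vec, weight, bias):
    """Fully connected layer, no activation: out_j = sum_i vec_i * W[i][j] + b_j."""
    return [sum(v * weight[i][j] for i, v in enumerate(vec)) + bias[j]
            for j in range(len(bias))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def bow_pairwise_score(left_ids, right_ids, emb, weight, bias):
    """Encode both sides with the same weights, then compare with cosine."""
    left = fc(softsign(sum_pool(left_ids, emb)), weight, bias)
    right = fc(softsign(sum_pool(right_ids, emb)), weight, bias)
    return cosine(left, right)


rng = random.Random(0)
dict_size, emb_dim, bow_dim = 10, 4, 3  # toy sizes, not the real config
emb = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(dict_size)]
weight = [[rng.uniform(-1, 1) for _ in range(bow_dim)] for _ in range(emb_dim)]
bias = [0.0] * bow_dim
print(bow_pairwise_score([1, 2, 3], [3, 2, 1], emb, weight, bias))
```

Because sum pooling ignores word order, the two inputs in the usage line contain the same bag of words and score approximately 1.0. That order-insensitivity is the defining trade-off of a BOW encoder.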

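With the network structure defined, we still need the training and prediction programs, the optimizer, the data provider, and so on. To make this easier to follow, the training, evaluation, and prediction steps are wrapped in the run.sh script.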

In[8]
# List all arguments and their descriptions
!cd {WORK_PATH} && python run_classifier.py -h
usage: run_classifier.py [-h] [--config_path CONFIG_PATH]
                         [--init_checkpoint INIT_CHECKPOINT]
                         [--output_dir OUTPUT_DIR] [--task_mode TASK_MODE]
                         [--epoch EPOCH] [--save_steps SAVE_STEPS]
                         [--validation_steps VALIDATION_STEPS]
                         [--skip_steps SKIP_STEPS]
                         [--verbose_result VERBOSE_RESULT]
                         [--test_result_path TEST_RESULT_PATH]
                         [--infer_result_path INFER_RESULT_PATH]
                         [--train_data_dir TRAIN_DATA_DIR]
                         [--valid_data_dir VALID_DATA_DIR]
                         [--test_data_dir TEST_DATA_DIR]
                         [--infer_data_dir INFER_DATA_DIR]
                         [--vocab_path VOCAB_PATH] [--batch_size BATCH_SIZE]
                         [--use_cuda USE_CUDA] [--task_name TASK_NAME]
                         [--do_train DO_TRAIN] [--do_valid DO_VALID]
                         [--do_test DO_TEST] [--do_infer DO_INFER]
                         [--compute_accuracy COMPUTE_ACCURACY] [--lamda LAMDA]
                         [--enable_ce]

optional arguments:
  -h, --help            show this help message and exit
  --enable_ce           If set, run the task with continuous evaluation logs.

model:
  model configuration and paths.

  --config_path CONFIG_PATH
                        Path to the json file for EmoTect model config.
                        Default: None.
  --init_checkpoint INIT_CHECKPOINT
                        Init checkpoint to resume training from. Default:
                        None.
  --output_dir OUTPUT_DIR
                        Directory path to save checkpoints Default: None.
  --task_mode TASK_MODE
                        task mode: pairwise or pointwise Default: None.

training:
  training options.

  --epoch EPOCH         Number of epoches for training. Default: 10.
  --save_steps SAVE_STEPS
                        The steps interval to save checkpoints. Default: 200.
  --validation_steps VALIDATION_STEPS
                        The steps interval to evaluate model performance.
                        Default: 100.

logging:
  logging related

  --skip_steps SKIP_STEPS
                        The steps interval to print loss. Default: 10.
  --verbose_result VERBOSE_RESULT
                        Whether to output verbose result. Default: True.
  --test_result_path TEST_RESULT_PATH
                        Directory path to test result. Default: test_result.
  --infer_result_path INFER_RESULT_PATH
                        Directory path to infer result. Default: infer_result.

data:
  Data paths, vocab paths and data processing options

  --train_data_dir TRAIN_DATA_DIR
                        Directory path to training data. Default: None.
  --valid_data_dir VALID_DATA_DIR
                        Directory path to valid data. Default: None.
  --test_data_dir TEST_DATA_DIR
                        Directory path to testing data. Default: None.
  --infer_data_dir INFER_DATA_DIR
                        Directory path to infer data. Default: None.
  --vocab_path VOCAB_PATH
                        Vocabulary path. Default: None.
  --batch_size BATCH_SIZE
                        Total examples' number in batch for training. Default:
                        32.

run_type:
  running type options.

  --use_cuda USE_CUDA   If set, use GPU for training. Default: False.
  --task_name TASK_NAME
                        The name of task to perform sentiment classification.
                        Default: None.
  --do_train DO_TRAIN   Whether to perform training. Default: False.
  --do_valid DO_VALID   Whether to perform dev. Default: False.
  --do_test DO_TEST     Whether to perform testing. Default: False.
  --do_infer DO_INFER   Whether to perform inference. Default: False.
  --compute_accuracy COMPUTE_ACCURACY
                        Whether to compute accuracy. Default: False.
  --lamda LAMDA         When task_mode is pairwise, lamda is the threshold for
                        calculating the accuracy. Default: 0.91.

customize:
  customized options.
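
The help text above says that in pairwise mode `lamda` is the threshold used when computing accuracy. One plausible reading is sketched below, assuming a pair is predicted "similar" when its score is at least `lamda`; the exact comparison used by run_classifier.py is not shown in this article, and `pairwise_accuracy` is a hypothetical helper.

```python
def pairwise_accuracy(scores, labels, lamda=0.91):
    """Threshold similarity scores at `lamda` and compare with 0/1 labels."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    preds = [1 if score >= lamda else 0 for score in scores]
    correct = sum(pred == label for pred, label in zip(preds, labels))
    return correct / len(labels)
```

Under this reading, the accuracy of a pairwise model depends on the chosen threshold, whereas AUC does not. That may be why the logs in this article report AUC and leave compute_accuracy set to False.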

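2.3 Training the pairwise model

With the example dataset, run the commands below to train the model on the training set (train.txt) and validate it on the validation set (dev.txt).

In[9]
# Train in pairwise mode: train_pairwise is the training set; dev and test are used to check convergence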
!cd {WORK_PATH} && sed -i 's#INIT_CHECKPOINT=.*$#INIT_CHECKPOINT=./#' run.sh;
!cd {WORK_PATH} && sh run.sh train;
-----------  Configuration Arguments -----------
batch_size: 64
compute_accuracy: False
config_path: ./config/bow_pairwise.json
do_infer: False
do_test: True
do_train: True
do_valid: True
enable_ce: False
epoch: 10
infer_data_dir: /home/aistudio/data/data12739/infer
infer_result_path: infer_result
init_checkpoint: ./
lamda: 0.958
output_dir: ./model_files
save_steps: 200
skip_steps: 10
task_mode: pairwise
task_name: simnet
test_data_dir: /home/aistudio/data/data12739/test
test_result_path: test_result
train_data_dir: /home/aistudio/data/data12739/train_pairwise
use_cuda: True
valid_data_dir: /home/aistudio/data/data12739/dev
validation_steps: 2000
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:35:40.220225   231 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:35:40.224256   231 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Load model from ./

     You can try our memory optimize feature to save your memory usage:
         # create a build_strategy variable to set memory optimize option
         build_strategy = compiler.BuildStrategy()
         build_strategy.enable_inplace = True
         build_strategy.memory_optimize = True
         
         # pass the build_strategy to with_data_parallel API
         compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
             loss_name=loss.name, build_strategy=build_strategy)
      
     !!! Memory optimize is our experimental feature !!!
         some variables may be removed/reused internal to save memory usage, 
         in order to fetch the right value of the fetch_list, please set the 
         persistable property to true for each variable in fetch_list

         # Sample
         conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None) 
         # if you need to fetch conv1, then:
         conv1.persistable = True

                 
I1024 07:35:40.244242   231 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I1024 07:35:40.246202   231 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
device count: 1
start train process ...
epoch: 0, loss: 0.005618, used time: 0 sec
epoch: 1, loss: 0.002024, used time: 0 sec
epoch: 2, loss: 0.001103, used time: 0 sec
epoch: 3, loss: 0.000167, used time: 0 sec
epoch: 4, loss: 0.000007, used time: 0 sec
epoch: 5, loss: 0.000007, used time: 0 sec
saving infer model in ./model_files/bow_pairwise/200
epoch: 6, loss: 0.000052, used time: 0 sec
epoch: 7, loss: 0.000076, used time: 0 sec
epoch: 8, loss: 0.000064, used time: 0 sec
epoch: 9, loss: 0.000003, used time: 0 sec
AUC of test is 0.732348

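After training finishes, a model directory named after the training mode is created under ./model_files. It contains the model files saved when the specified number of training steps was reached.

2.4 Evaluating the pairwise model

With the trained model, run the commands below to evaluate it on the test set (test.txt).

In[10]
# Evaluate the trained model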
!cd {WORK_PATH} && sed -i 's#INIT_CHECKPOINT=.*$#INIT_CHECKPOINT={PAIRWISE_MODEL_PATH}#' run.sh;
!cd {WORK_PATH} && sh run.sh eval;
-----------  Configuration Arguments -----------
batch_size: 128
compute_accuracy: False
config_path: ./config/bow_pairwise.json
do_infer: False
do_test: True
do_train: False
do_valid: False
enable_ce: False
epoch: 10
infer_data_dir: None
infer_result_path: infer_result
init_checkpoint: /home/aistudio/work/similarity_net/model_files/bow_pairwise/200
lamda: 0.958
output_dir: None
save_steps: 200
skip_steps: 10
task_mode: pairwise
task_name: simnet
test_data_dir: /home/aistudio/data/data12739/test
test_result_path: ./test_result
train_data_dir: None
use_cuda: True
valid_data_dir: None
validation_steps: 100
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:35:48.384515   254 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:35:48.388545   254 device_context.cc:267] device: 0, cuDNN Version: 7.3.
start test process ...
AUC of test is 0.733501
test result saved in /home/aistudio/work/similarity_net/./test_result

The "AUC of test" value is the final metric.
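
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen similar pair is scored above a randomly chosen dissimilar pair. The sketch below computes it directly from that definition; it is written for clarity rather than speed, and the project script may compute AUC differently.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random scoring and 1.0 to a perfect separation of similar and dissimilar pairs.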

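2.5 Predicting with the pairwise model

With an existing model, you can run prediction on an unlabeled dataset (infer.txt) and obtain the predicted similarity score for each sentence pair.

In[11]
# Predict with the trained model and view the results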
!cd {WORK_PATH} && sh run.sh infer && cat infer_result;
-----------  Configuration Arguments -----------
batch_size: 128
compute_accuracy: False
config_path: ./config/bow_pairwise.json
do_infer: True
do_test: False
do_train: False
do_valid: False
enable_ce: False
epoch: 10
infer_data_dir: /home/aistudio/data/data12739/infer
infer_result_path: ./infer_result
init_checkpoint: /home/aistudio/work/similarity_net/model_files/bow_pairwise/200
lamda: 0.91
output_dir: None
save_steps: 200
skip_steps: 10
task_mode: pairwise
task_name: simnet
test_data_dir: None
test_result_path: test_result
train_data_dir: None
use_cuda: True
valid_data_dir: None
validation_steps: 100
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:35:54.063849   269 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:35:54.067873   269 device_context.cc:267] device: 0, cuDNN Version: 7.3.
start test process ...
infer result saved in /home/aistudio/work/similarity_net/./infer_result
车头 如何 放置 车牌	前 牌照 怎么 弄	0.855399876832962
车头 如何 放置 车牌	如何 办理 北京 车牌	0.7997686564922333
车头 如何 放置 车牌	后 牌照 怎么 装	0.761825680732727

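2.6 Training the pointwise model

With the example dataset, run the commands below to train the model on the training set (train.txt) and validate it on the validation set (dev.txt).

In[12]
# Switch the training mode to pointwise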
!cd {WORK_PATH} && sed -i 's#TRAIN_DATA_PATH=.*$#TRAIN_DATA_PATH={POINTWISE_DATA_PATH}#' run.sh;
!cd {WORK_PATH} && sed -i 's#pairwise#pointwise#g' run.sh;
!cd {WORK_PATH} && sed -i 's#CONFIG_PATH=.*$#CONFIG_PATH=./config/bow_pointwise.json#' run.sh;
!cd {WORK_PATH} && sed -i 's#INIT_CHECKPOINT=.*$#INIT_CHECKPOINT=./#' run.sh;
In[13]
# Train in pointwise mode: train_pointwise is the training set; dev and test are used to check convergence
!cd {WORK_PATH} && sh run.sh train;
-----------  Configuration Arguments -----------
batch_size: 64
compute_accuracy: False
config_path: ./config/bow_pointwise.json
do_infer: False
do_test: True
do_train: True
do_valid: True
enable_ce: False
epoch: 10
infer_data_dir: /home/aistudio/data/data12739/infer
infer_result_path: infer_result
init_checkpoint: ./
lamda: 0.958
output_dir: ./model_files
save_steps: 200
skip_steps: 10
task_mode: pointwise
task_name: simnet
test_data_dir: /home/aistudio/data/data12739/test
test_result_path: test_result
train_data_dir: /home/aistudio/data/data12739/train_pointwise
use_cuda: True
valid_data_dir: /home/aistudio/data/data12739/dev
validation_steps: 2000
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:36:04.743561   293 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:36:04.747555   293 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Load model from ./

     You can try our memory optimize feature to save your memory usage:
         # create a build_strategy variable to set memory optimize option
         build_strategy = compiler.BuildStrategy()
         build_strategy.enable_inplace = True
         build_strategy.memory_optimize = True
         
         # pass the build_strategy to with_data_parallel API
         compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
             loss_name=loss.name, build_strategy=build_strategy)
      
     !!! Memory optimize is our experimental feature !!!
         some variables may be removed/reused internal to save memory usage, 
         in order to fetch the right value of the fetch_list, please set the 
         persistable property to true for each variable in fetch_list

         # Sample
         conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None) 
         # if you need to fetch conv1, then:
         conv1.persistable = True

                 
I1024 07:36:04.767258   293 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I1024 07:36:04.768844   293 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
device count: 1
start train process ...
epoch: 0, loss: 0.649951, used time: 0 sec
epoch: 1, loss: 0.366828, used time: 0 sec
epoch: 2, loss: 0.150767, used time: 0 sec
epoch: 3, loss: 0.062965, used time: 0 sec
epoch: 4, loss: 0.026085, used time: 0 sec
epoch: 5, loss: 0.011099, used time: 0 sec
saving infer model in ./model_files/bow_pointwise/200
epoch: 6, loss: 0.003839, used time: 0 sec
epoch: 7, loss: 0.002582, used time: 0 sec
epoch: 8, loss: 0.001462, used time: 0 sec
epoch: 9, loss: 0.000868, used time: 0 sec
AUC of test is 0.635277

2.7 Evaluating the pointwise model

With the trained model, run the commands below to evaluate it on the test set (test.txt).

In[14]
# Evaluate the trained model
!cd {WORK_PATH} && sed -i 's#INIT_CHECKPOINT=.*$#INIT_CHECKPOINT={POINTWISE_MODEL_PATH}#' run.sh;
!cd {WORK_PATH} && sh run.sh eval;
-----------  Configuration Arguments -----------
batch_size: 128
compute_accuracy: False
config_path: ./config/bow_pointwise.json
do_infer: False
do_test: True
do_train: False
do_valid: False
enable_ce: False
epoch: 10
infer_data_dir: None
infer_result_path: infer_result
init_checkpoint: /home/aistudio/work/similarity_net/model_files/bow_pointwise/200
lamda: 0.958
output_dir: None
save_steps: 200
skip_steps: 10
task_mode: pointwise
task_name: simnet
test_data_dir: /home/aistudio/data/data12739/test
test_result_path: ./test_result
train_data_dir: None
use_cuda: True
valid_data_dir: None
validation_steps: 100
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:36:12.950116   316 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:36:12.953586   316 device_context.cc:267] device: 0, cuDNN Version: 7.3.
start test process ...
AUC of test is 0.642783
test result saved in /home/aistudio/work/similarity_net/./test_result

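2.8 Predicting with the pointwise model

With an existing model, you can run prediction on an unlabeled dataset (infer.txt) and obtain the model's prediction for each sentence pair (in the output below, a 0/1 label).

In[15]
# Predict with the trained model and view the results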
!cd {WORK_PATH} && sh run.sh infer && cat infer_result;
-----------  Configuration Arguments -----------
batch_size: 128
compute_accuracy: False
config_path: ./config/bow_pointwise.json
do_infer: True
do_test: False
do_train: False
do_valid: False
enable_ce: False
epoch: 10
infer_data_dir: /home/aistudio/data/data12739/infer
infer_result_path: ./infer_result
init_checkpoint: /home/aistudio/work/similarity_net/model_files/bow_pointwise/200
lamda: 0.91
output_dir: None
save_steps: 200
skip_steps: 10
task_mode: pointwise
task_name: simnet
test_data_dir: None
test_result_path: test_result
train_data_dir: None
use_cuda: True
valid_data_dir: None
validation_steps: 100
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:36:18.112660   331 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:36:18.117071   331 device_context.cc:267] device: 0, cuDNN Version: 7.3.
start test process ...
infer result saved in /home/aistudio/work/similarity_net/./infer_result
车头 如何 放置 车牌	前 牌照 怎么 弄	1
车头 如何 放置 车牌	如何 办理 北京 车牌	1
车头 如何 放置 车牌	后 牌照 怎么 装	0

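3 Concepts

3.1 Model overview

Traditional text matching techniques, such as the vector space model (VSM) and BM25 in information retrieval, mainly address similarity at the lexical level. In practice their effectiveness is limited by issues such as polysemy and sentence structure. SimNet follows the approach of implicit continuous vector representations for semantics, but models the semantic matching problem end to end within a deep learning framework and unifies the two supervised learning schemes, pointwise and pairwise, in a single framework. In production, large volumes of user click behavior were converted into large-scale weakly labeled data; when first applied to web search, this brought a clear improvement in relevance.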
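
The limitation of lexical matching can be seen on the three sentence pairs from section 1.1. The sketch below uses Jaccard word overlap as a deliberately simple stand-in for lexical methods; it is not VSM or BM25, and `jaccard` is a hypothetical helper.

```python
def jaccard(a, b):
    """Jaccard overlap between the word sets of two pre-segmented sentences."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


query = "车头 如何 放置 车牌"
for candidate in ["前 牌照 怎么 弄", "如何 办理 北京 车牌", "后 牌照 怎么 装"]:
    print(candidate, round(jaccard(query, candidate), 3))
# 前 牌照 怎么 弄 0.0
# 如何 办理 北京 车牌 0.333
# 后 牌照 怎么 装 0.0
```

Word overlap puts the question about registering a Beijing plate first because it shares two surface words with the query, while the two paraphrases share none. The semantic model in section 1.1 ranks that question last.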

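3.2 The two training modes

  1. pointwise

The main idea of the pointwise approach is to turn the problem into binary classification. It has the following characteristics:

(1) the input is a sentence pair;

(2) the output is the relevance of the two sentences;

(3) the loss function measures the difference between the predicted score and the true score of the input pair.

  2. pairwise

The pairwise approach selects the correct answer by making its score clearly higher than the scores of incorrect candidates. It has the following characteristics:

(1) the input is a pair of similar sentences plus one unrelated sentence;

(2) the output is the contrast between the score of the similar pair and the score of the dissimilar pair;

(3) the loss function measures the difference between the score of the similar pair and the score of the dissimilar pair.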
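
The two loss ideas can be sketched as below. The article does not name the exact loss functions, so binary cross-entropy (pointwise) and a margin-based hinge loss (pairwise) are used here as common choices, and the margin value 0.1 is a hypothetical default.

```python
import math


def pointwise_loss(prob_similar, label):
    """Binary cross-entropy between predicted P(similar) and the 0/1 label."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prob_similar, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def pairwise_hinge_loss(pos_score, neg_score, margin=0.1):
    """Zero once the similar pair outscores the dissimilar pair by `margin`."""
    return max(0.0, margin - pos_score + neg_score)
```

The pointwise loss looks at one pair and its label at a time, while the pairwise loss only cares about the gap between two scores for the same query. This is why pairwise training fits ranking tasks naturally.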

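4 Advanced usage

This section shows how to choose suitable training data and how to fine-tune an existing model.

4.1 Choosing suitable training data (pairwise as the example)

| Data | Advantages | Disadvantages |
|---|---|---|
| 2k manually labeled examples; each query has only one positive or one negative | High data quality | Expensive to label; small volume |
| 20k weakly supervised examples from click logs; each query can have multiple positives/negatives | Easy to obtain; large volume | Relatively lower data quality |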

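4.1.1 Testing on the unicom test set after training on labeled data

The unicom test set matches each query against several different positives/negatives, so it is better suited to testing the ranking performance of a pairwise model.

In[16]
# The labeled data is the default dataset, so the pairwise model trained earlier can be tested on the unicom dataset directly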
!cd {EVAL_PATH} && sed -i 's#INIT_CHECKPOINT=.*$#INIT_CHECKPOINT={PAIRWISE_MODEL_PATH}/#' evaluate_unicom.sh;
!cd {EVAL_PATH} && sed -i 's#pointwise#pairwise#g' evaluate_unicom.sh;
!cd {EVAL_PATH} && sh evaluate_unicom.sh
-----------  Configuration Arguments -----------
batch_size: 128
compute_accuracy: False
config_path: /home/aistudio/work/similarity_net/config/bow_pairwise.json
do_infer: True
do_test: False
do_train: False
do_valid: False
enable_ce: False
epoch: 10
infer_data_dir: /home/aistudio/work/similarity_net/evaluate/unicom_infer
infer_result_path: /home/aistudio/work/similarity_net/evaluate/unicom_infer_result
init_checkpoint: /home/aistudio/work/similarity_net/model_files/bow_pairwise/200/
lamda: 0.91
output_dir: None
save_steps: 200
skip_steps: 10
task_mode: pairwise
task_name: simnet
test_data_dir: None
test_result_path: test_result
train_data_dir: None
use_cuda: True
valid_data_dir: None
validation_steps: 100
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:36:27.019397   352 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:36:27.023423   352 device_context.cc:267] device: 0, cuDNN Version: 7.3.
start test process ...
infer result saved in /home/aistudio/work/similarity_net/evaluate/unicom_infer_result
pos/neg of unicom data is 1.125801

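On this dataset, the pairwise model trained earlier on a small amount of labeled data reaches a positive-to-negative order ratio of 1.125801 in the run logged above (the original write-up quotes 1.090651, presumably from a different run). A sample of its predictions is shown in the source article.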
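
The positive-to-negative order ratio can be understood as follows: within each query, every (similar, dissimilar) candidate pair counts as correctly ordered if the similar candidate scores higher and as wrongly ordered if it scores lower, and the metric is the ratio of the two counts. The sketch below implements that common definition; how evaluate_unicom.sh computes it exactly is not shown in this article, so treat the details (such as ignoring ties) as assumptions.

```python
from collections import defaultdict


def pos_neg_order_ratio(records):
    """records: iterable of (query, score, label), label 1 = similar, 0 = not."""
    by_query = defaultdict(list)
    for query, score, label in records:
        by_query[query].append((score, label))

    correct = wrong = 0
    for items in by_query.values():
        positives = [s for s, y in items if y == 1]
        negatives = [s for s, y in items if y == 0]
        for p in positives:
            for n in negatives:
                if p > n:
                    correct += 1
                elif p < n:
                    wrong += 1  # ties are ignored in this sketch
    return correct / wrong if wrong else float("inf")
```

A ratio of 1.0 means the model orders pairs no better than chance, so values only slightly above 1 indicate weak ranking ability.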

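4.1.2 Training on weakly supervised data with hard negatives

In[17]
# Switch the training data to train_pairwise_v2 and retrain the pairwise model
PAIRWISE_DATA_PATH = DATA_PATH + "/train_pairwise_v2"
!cd {WORK_PATH} && sed -i 's#TRAIN_DATA_PATH=.*$#TRAIN_DATA_PATH={PAIRWISE_DATA_PATH}#' run.sh;
!cd {WORK_PATH} && sed -i 's#pointwise#pairwise#g' run.sh;
# There is more training data now, so save_steps and validation_steps are increased accordingly.
# Note: these patterns are unanchored, so a run.sh that already contains "validation_steps 2000"
# ends up with "validation_steps 20000", which is the value the log below reports.
!cd {WORK_PATH} && sed -i 's#save_steps 200#save_steps 2000#g' run.sh;
!cd {WORK_PATH} && sed -i 's#validation_steps 200#validation_steps 2000#g' run.sh;
!cd {WORK_PATH} && sed -i 's#INIT_CHECKPOINT=.*$#INIT_CHECKPOINT=./#' run.sh;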
!cd {WORK_PATH} && sh run.sh train;
-----------  Configuration Arguments -----------
batch_size: 64
compute_accuracy: False
config_path: ./config/bow_pairwise.json
do_infer: False
do_test: True
do_train: True
do_valid: True
enable_ce: False
epoch: 10
infer_data_dir: /home/aistudio/data/data12739/infer
infer_result_path: infer_result
init_checkpoint: ./
lamda: 0.958
output_dir: ./model_files
save_steps: 2000
skip_steps: 10
task_mode: pairwise
task_name: simnet
test_data_dir: /home/aistudio/data/data12739/test
test_result_path: test_result
train_data_dir: /home/aistudio/data/data12739/train_pairwise_v2
use_cuda: True
valid_data_dir: /home/aistudio/data/data12739/dev
validation_steps: 20000
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:36:35.157111   378 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:36:35.160354   378 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Load model from ./

     You can try our memory optimize feature to save your memory usage:
         # create a build_strategy variable to set memory optimize option
         build_strategy = compiler.BuildStrategy()
         build_strategy.enable_inplace = True
         build_strategy.memory_optimize = True
         
         # pass the build_strategy to with_data_parallel API
         compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
             loss_name=loss.name, build_strategy=build_strategy)
      
     !!! Memory optimize is our experimental feature !!!
         some variables may be removed/reused internal to save memory usage, 
         in order to fetch the right value of the fetch_list, please set the 
         persistable property to true for each variable in fetch_list

         # Sample
         conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None) 
         # if you need to fetch conv1, then:
         conv1.persistable = True

                 
I1024 07:36:35.179633   378 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I1024 07:36:35.181542   378 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
device count: 1
start train process ...
epoch: 0, loss: 0.009513, used time: 1 sec
epoch: 1, loss: 0.001531, used time: 1 sec
epoch: 2, loss: 0.000614, used time: 1 sec
epoch: 3, loss: 0.000517, used time: 1 sec
epoch: 4, loss: 0.000568, used time: 1 sec
epoch: 5, loss: 0.000492, used time: 1 sec
saving infer model in ./model_files/bow_pairwise/2000
epoch: 6, loss: 0.000644, used time: 2 sec
epoch: 7, loss: 0.000507, used time: 1 sec
epoch: 8, loss: 0.000454, used time: 2 sec
epoch: 9, loss: 0.000362, used time: 2 sec
AUC of test is 0.745907
In[18]
# Test the retrained pairwise model on the unicom dataset
PAIRWISE_MODEL_PATH = WORK_PATH + "/model_files/bow_pairwise/2000"
!cd {EVAL_PATH} && sed -i 's#INIT_CHECKPOINT=.*$#INIT_CHECKPOINT={PAIRWISE_MODEL_PATH}/#' evaluate_unicom.sh;
!cd {EVAL_PATH} && sed -i 's#pointwise#pairwise#g' evaluate_unicom.sh;
!cd {EVAL_PATH} && sh evaluate_unicom.sh
-----------  Configuration Arguments -----------
batch_size: 128
compute_accuracy: False
config_path: /home/aistudio/work/similarity_net/config/bow_pairwise.json
do_infer: True
do_test: False
do_train: False
do_valid: False
enable_ce: False
epoch: 10
infer_data_dir: /home/aistudio/work/similarity_net/evaluate/unicom_infer
infer_result_path: /home/aistudio/work/similarity_net/evaluate/unicom_infer_result
init_checkpoint: /home/aistudio/work/similarity_net/model_files/bow_pairwise/2000/
lamda: 0.91
output_dir: None
save_steps: 200
skip_steps: 10
task_mode: pairwise
task_name: simnet
test_data_dir: None
test_result_path: test_result
train_data_dir: None
use_cuda: True
valid_data_dir: None
validation_steps: 100
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:37:02.118157   404 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:37:02.121443   404 device_context.cc:267] device: 0, cuDNN Version: 7.3.
start test process ...
infer result saved in /home/aistudio/work/similarity_net/evaluate/unicom_infer_result
pos/neg of unicom data is 1.257695

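With the model trained on the larger weakly supervised dataset, the positive-to-negative order ratio rises to 1.257695 in the run logged above (the original write-up quotes 1.32). A sample of its predictions is shown in the source article.

You can also compare the two kinds of data with the pointwise model yourself; in the data folder, the files with the v2 suffix are the weakly supervised training data.

4.2 Fine-tuning the model

You can fine-tune from a pretrained model by changing INIT_CHECKPOINT in run.sh so that the pretrained model is loaded, which further improves results.

In[19]
# Point INIT_CHECKPOINT at the pretrained model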
!cd {WORK_PATH} && sed -i 's#INIT_CHECKPOINT=.*$#INIT_CHECKPOINT={INIT_MODEL}#' run.sh;
In[20]
# Reset the mode and the config file to pairwise
!cd {WORK_PATH} && sed -i 's#pointwise#pairwise#g' run.sh;
!cd {WORK_PATH} && sed -i 's#CONFIG_PATH=.*$#CONFIG_PATH=./config/bow_pairwise.json#' run.sh;
In[21]
# Load the pretrained model and fine-tune it
!cd {WORK_PATH} && sh run.sh train
-----------  Configuration Arguments -----------
batch_size: 64
compute_accuracy: False
config_path: ./config/bow_pairwise.json
do_infer: False
do_test: True
do_train: True
do_valid: True
enable_ce: False
epoch: 10
infer_data_dir: /home/aistudio/data/data12739/infer
infer_result_path: infer_result
init_checkpoint: /home/aistudio/work/similarity_net/model_files/simnet_bow_pairwise_pretrained_model
lamda: 0.958
output_dir: ./model_files
save_steps: 2000
skip_steps: 10
task_mode: pairwise
task_name: simnet
test_data_dir: /home/aistudio/data/data12739/test
test_result_path: test_result
train_data_dir: /home/aistudio/data/data12739/train_pairwise_v2
use_cuda: True
valid_data_dir: /home/aistudio/data/data12739/dev
validation_steps: 20000
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:37:13.661295   426 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:37:13.665354   426 device_context.cc:267] device: 0, cuDNN Version: 7.3.
Load model from /home/aistudio/work/similarity_net/model_files/simnet_bow_pairwise_pretrained_model

     You can try our memory optimize feature to save your memory usage:
         # create a build_strategy variable to set memory optimize option
         build_strategy = compiler.BuildStrategy()
         build_strategy.enable_inplace = True
         build_strategy.memory_optimize = True
         
         # pass the build_strategy to with_data_parallel API
         compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
             loss_name=loss.name, build_strategy=build_strategy)
      
     !!! Memory optimize is our experimental feature !!!
         some variables may be removed/reused internal to save memory usage, 
         in order to fetch the right value of the fetch_list, please set the 
         persistable property to true for each variable in fetch_list

         # Sample
         conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None) 
         # if you need to fetch conv1, then:
         conv1.persistable = True

                 
I1024 07:37:13.924934   426 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I1024 07:37:13.926918   426 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
device count: 1
start train process ...
epoch: 0, loss: 0.003804, used time: 2 sec
epoch: 1, loss: 0.000433, used time: 1 sec
epoch: 2, loss: 0.000272, used time: 1 sec
epoch: 3, loss: 0.000246, used time: 2 sec
epoch: 4, loss: 0.000193, used time: 1 sec
epoch: 5, loss: 0.000207, used time: 1 sec
saving infer model in ./model_files/bow_pairwise/2000
epoch: 6, loss: 0.000263, used time: 3 sec
epoch: 7, loss: 0.000384, used time: 1 sec
epoch: 8, loss: 0.000292, used time: 1 sec
epoch: 9, loss: 0.000287, used time: 2 sec
AUC of test is 0.872568
In[22]
# Evaluate the fine-tuned model on the unicom test set
!cd {EVAL_PATH} && sed -i 's#INIT_CHECKPOINT=.*$#INIT_CHECKPOINT={PAIRWISE_MODEL_PATH}/#' evaluate_unicom.sh;
!cd {EVAL_PATH} && sed -i 's#pointwise#pairwise#g' evaluate_unicom.sh;
!cd {EVAL_PATH} && sh evaluate_unicom.sh
-----------  Configuration Arguments -----------
batch_size: 128
compute_accuracy: False
config_path: /home/aistudio/work/similarity_net/config/bow_pairwise.json
do_infer: True
do_test: False
do_train: False
do_valid: False
enable_ce: False
epoch: 10
infer_data_dir: /home/aistudio/work/similarity_net/evaluate/unicom_infer
infer_result_path: /home/aistudio/work/similarity_net/evaluate/unicom_infer_result
init_checkpoint: /home/aistudio/work/similarity_net/model_files/bow_pairwise/2000/
lamda: 0.91
output_dir: None
save_steps: 200
skip_steps: 10
task_mode: pairwise
task_name: simnet
test_data_dir: None
test_result_path: test_result
train_data_dir: None
use_cuda: True
valid_data_dir: None
validation_steps: 100
verbose_result: True
vocab_path: /home/aistudio/data/data12739/term2id.dict
------------------------------------------------
W1024 07:37:41.263497   452 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1024 07:37:41.267405   452 device_context.cc:267] device: 0, cuDNN Version: 7.3.
start test process ...
infer result saved in /home/aistudio/work/similarity_net/evaluate/unicom_infer_result
pos/neg of unicom data is 1.592826

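As the logs show, after loading the pretrained model and fine-tuning for the same number of steps, the positive-to-negative order ratio rises from 1.257695 to 1.592826 (the original write-up quotes roughly 1.3 and 1.50), a substantial improvement.

The table below summarizes the pairwise model's evaluation results under the different strategies, using the figures quoted in the original write-up.

| Strategy | unicom (positive-to-negative order ratio) |
|---|---|
| 2,000 manually labeled training examples | 1.08 |
| 20k weakly supervised training examples from click logs | 1.30 |
| + fine-tuning from the pretrained BOW model | 1.50 |

Click the link to try the project hands-on in AI Studio: https://aistudio.baidu.com/aistudio/projectdetail/125034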
