
BertSum

This code is for the paper Fine-tune BERT for Extractive Summarization

Results on CNN/Dailymail (25/3/2019):

| Models | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- |
| Transformer Baseline | 40.9 | 18.02 | 37.17 |
| BERTSUM+Classifier | 43.23 | 20.22 | 39.60 |
| BERTSUM+Transformer | 43.25 | 20.24 | 39.63 |
| BERTSUM+LSTM | 43.22 | 20.17 | 39.59 |

Python version: This code is written for Python 3.6

Package Requirements: `pytorch`, `pytorch_pretrained_bert`, `tensorboardX`, `multiprocess`, `pyrouge`

Some code is borrowed from OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py)

Data Preparation For CNN/Dailymail

Option 1: download the processed data

Download https://drive.google.com/open?id=1lqQmKflLisi-JBzFY0DL4sxqKtzZMrxt

Unzip the zip file and put all .pt files into bert_data
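
To sanity-check the download, each .pt shard can be opened with `torch.load`. A minimal inspection sketch; the file name below is an assumption, so substitute any .pt file actually present in bert_data:

```python
# Minimal sketch: peek inside one preprocessed shard.
# The exact file name is an assumption -- use any .pt file in bert_data.
import torch

shard = torch.load("../bert_data/cnndm.train.0.bert.pt")
print(type(shard), len(shard))  # a shard holds many documents
first = shard[0]
print(list(first.keys()) if isinstance(first, dict) else type(first))
```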

Option 2: process the data yourself

Follow the steps in https://github.com/abisee/cnn-dailymail to generate the CoreNLP-tokenized, sentence-split datasets (cnn_stories_tokenized and dm_stories_tokenized), and merge them into one directory, merged_stories_tokenized.
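
The merge step itself is a plain file copy. A minimal sketch, assuming the tokenized files keep the .story extension and the directories sit next to each other:

```python
# Minimal sketch: merge the two tokenized directories into one.
# Assumes CoreNLP tokenization kept the .story file extension.
import shutil
from pathlib import Path

dst = Path("merged_stories_tokenized")
dst.mkdir(exist_ok=True)
for src_dir in ("cnn_stories_tokenized", "dm_stories_tokenized"):
    for story in Path(src_dir).glob("*.story"):
        shutil.copy(story, dst / story.name)
```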

Step 1

python preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -map_path MAP_PATH -lower 

  • RAW_PATH is the directory containing the tokenized files (../merged_stories_tokenized)
  • JSON_PATH is the target directory to save the generated json files (../json_data)
  • MAP_PATH is the directory containing the urls files (../urls)

Step 2

python preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -oracle_mode greedy -n_cpus 4 -log_file ../logs/preprocess.log
  • JSON_PATH is the directory containing the json files (../json_data)
  • BERT_DATA_PATH is the target directory to save the generated binary files (../bert_data)

  • -oracle_mode can be greedy or combination; combination is more accurate but takes much longer to process (a sketch of the greedy procedure follows this list)
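
For intuition, greedy oracle selection builds the extractive labels by repeatedly adding the sentence that most improves overlap with the reference summary, stopping when no remaining sentence helps. An illustrative sketch, using a crude bigram-overlap score as a stand-in for the ROUGE computation the preprocessing actually performs:

```python
# Illustrative sketch of greedy oracle labeling (not the repo's exact code).
# overlap_score is a crude stand-in for ROUGE.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(selected_sents, reference):
    cand = [tok for sent in selected_sents for tok in sent]
    ref_bi, cand_bi = ngrams(reference, 2), ngrams(cand, 2)
    return len(ref_bi & cand_bi) / max(len(ref_bi), 1)

def greedy_oracle(doc_sents, reference, max_sents=3):
    labels = [0] * len(doc_sents)       # 1 = sentence is in the oracle summary
    selected, best = [], 0.0
    for _ in range(max_sents):
        gains = [(overlap_score(selected + [s], reference), i)
                 for i, s in enumerate(doc_sents) if labels[i] == 0]
        if not gains:
            break
        score, i = max(gains)
        if score <= best:               # no sentence improves the score: stop
            break
        best, labels[i] = score, 1
        selected.append(doc_sents[i])
    return labels
```

The combination mode instead scores whole sentence subsets rather than adding one sentence at a time, which is where the extra processing cost comes from.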

Model Training

First run: For the first run, use a single GPU so the code can download the BERT model: change -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 to -visible_gpus 0 -gpu_ranks 0 -world_size 1. Once the download has finished, kill the process and rerun with multiple GPUs.
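
Equivalently, you can warm the cache once from Python before launching the multi-GPU run; a minimal sketch, assuming the repo loads the standard bert-base-uncased checkpoint:

```python
# Minimal sketch: download and cache the BERT weights once, so a later
# multi-GPU run finds them locally. 'bert-base-uncased' is an assumption
# about which checkpoint the repo loads.
from pytorch_pretrained_bert import BertModel, BertTokenizer

BertModel.from_pretrained("bert-base-uncased")
BertTokenizer.from_pretrained("bert-base-uncased")
```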

To train the BERT+Classifier model, run:

python train.py -mode train -encoder classifier -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_classifier -lr 2e-3 -visible_gpus 0,1,2  -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_classifier -use_interval true -warmup_steps 10000

To train the BERT+Transformer model, run:

python train.py -mode train -encoder transformer -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_transformer -lr 2e-3 -visible_gpus 0,1,2  -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_transformer -use_interval true -warmup_steps 10000 -ff_size 2048 -inter_layers 2 -heads 8

To train the BERT+RNN model, run:

python train.py -mode train -encoder rnn -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_rnn -lr 2e-3 -visible_gpus 0,1,2 -gpu_ranks 0,1,2 -world_size 3 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_rnn -use_interval true -warmup_steps 10000 -rnn_size 768
  • -mode can be {train, validate, test}; validate inspects the model directory and evaluates the model on each newly saved checkpoint, while test must be used together with -test_from, which indicates the checkpoint you want to use
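
For intuition, the three -encoder choices differ only in the summarization layers stacked on top of BERT: classifier scores each sentence's [CLS] vector directly with a linear layer and sigmoid, while transformer and rnn first pass those vectors through inter-sentence layers. An illustrative sketch of the simplest head (not the repo's exact code; shapes are assumptions for illustration):

```python
# Illustrative sketch of the "classifier" head: sigmoid(linear) applied to
# each sentence's [CLS] vector from BERT.
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, cls_vectors):
        # cls_vectors: (batch, n_sents, hidden_size), one vector per sentence
        return torch.sigmoid(self.linear(cls_vectors)).squeeze(-1)

scorer = SentenceScorer()
scores = scorer(torch.randn(2, 10, 768))   # (2, 10) selection probabilities
```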

Model Evaluation

After training has finished, run:

python train.py -mode validate -bert_data_path ../bert_data/cnndm -model_path MODEL_PATH  -visible_gpus 0  -gpu_ranks 0 -batch_size 30000  -log_file LOG_FILE  -result_path RESULT_PATH -test_all -block_trigram true
  • MODEL_PATH is the directory of saved checkpoints
  • RESULT_PATH is where you want to put decoded summaries (default ../results/cnndm)
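
The -block_trigram true flag enables trigram blocking during selection: a candidate sentence is skipped if it shares a trigram with the summary built so far, which reduces redundancy. An illustrative sketch of the idea (not the repo's exact code):

```python
# Illustrative sketch of trigram blocking in extractive selection.

def trigrams(text):
    toks = text.split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def select_with_trigram_blocking(ranked_sents, max_sents=3):
    summary, seen = [], set()
    for sent in ranked_sents:        # sentences sorted by model score, best first
        tri = trigrams(sent)
        if tri & seen:               # repeats a trigram already in the summary
            continue
        summary.append(sent)
        seen |= tri
        if len(summary) == max_sents:
            break
    return summary
```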

Trained Model

Download the trained model from https://drive.google.com/open?id=1YFWPOPAyf_Uhqj6ny7_-2iZiDy5BXNvL

Test it by running:

python train.py -mode test -bert_data_path ../bert_data/cnndm -test_from CP_FILE  -visible_gpus 0  -gpu_ranks 0 -batch_size 30000  -log_file LOG_FILE  -result_path RESULT_PATH -block_trigram true
  • CP_FILE is the path to the checkpoint file
