GitHub - vietai/SAT: Styled Augmented Translation

Style Augmented Translation

Introduction

By collecting high-quality data, we were able to train a model that outperforms Google Translate on 6 different domains of English-Vietnamese Translation.

English to Vietnamese Translation (BLEU score)

Vietnamese to English Translation (BLEU score)

Get data and model at Google Cloud Storage

Check out our demo web app

Visit our blog post for more details.

Using the code

This code is built on top of vietai/dab:

To prepare for training, generate tfrecords from raw text files:

python t2t_datagen.py \
--data_dir=$path_to_folder_contains_vocab_file \
--tmp_dir=$path_to_folder_that_contains_training_data \
--problem=$problem

To train a Transformer model on the generated tfrecords:

python t2t_trainer.py \
--data_dir=$path_to_folder_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_to_save_checkpoints

To run inference on the trained model:

python t2t_decoder.py \
--data_dir=$path_to_folder_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_contains_checkpoints

In this colab, we demonstrate how to run these three phases with the data and model hosted on Google Cloud Storage.
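As a rough illustration of how the three phases fit together, the sketch below assembles the datagen, training, and decoding commands from one set of settings. It only uses the flags shown above. The function name `build_phase_commands` and every path, problem name, and hparams set in the example are hypothetical placeholders, and the sketch assumes that the vocab file and the generated tfrecords live in the same data directory.

```python
import shlex


def build_phase_commands(data_dir, tmp_dir, output_dir, problem, hparams_set):
    """Return the datagen, train, and decode commands as argument lists.

    Assumption: data_dir holds the vocab file and also receives the tfrecords,
    so the same folder is passed to all three scripts.
    """
    datagen = [
        "python", "t2t_datagen.py",
        f"--data_dir={data_dir}",
        f"--tmp_dir={tmp_dir}",
        f"--problem={problem}",
    ]
    # Training and decoding take the same flags in the commands above.
    shared = [
        f"--data_dir={data_dir}",
        f"--problem={problem}",
        f"--hparams_set={hparams_set}",
        "--model=transformer",
        f"--output_dir={output_dir}",
    ]
    train = ["python", "t2t_trainer.py", *shared]
    decode = ["python", "t2t_decoder.py", *shared]
    return [datagen, train, decode]


if __name__ == "__main__":
    # Hypothetical values; replace them with your own paths and names.
    commands = build_phase_commands(
        data_dir="data",
        tmp_dir="raw_text",
        output_dir="checkpoints",
        problem="my_problem",
        hparams_set="my_hparams_set",
    )
    for command in commands:
        # Print each command so it can be inspected or pasted into a shell.
        print(shlex.join(command))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; each list could be passed to `subprocess.run(command, check=True)` once the values point at real data.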

Dataset

Our data contains roughly 3.36 million text pairs. After augmentation, it grows to 26.7 million text pairs. A more detailed breakdown of our data is shown in the table below.

Domain                     Pure    Augmented
Fictional Books         333,189    2,516,787
Legal Document        1,150,266    3,450,801
Medical Publication       5,861       27,588
Movies Subtitles        250,000    3,698,046
Software                 79,912      239,745
TED Talk                352,652    4,983,294
Wikipedia               645,326    1,935,981
News                     18,449      139,341
Religious texts         124,389    1,182,726
Educational content     397,008    8,475,342
No tag                    5,517       66,299
Total                 3,362,569   26,715,950
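To make the effect of augmentation easier to read off the table, the short sketch below computes how many augmented pairs there are per pure pair in each domain. The counts are copied from the table; the names `COUNTS` and `augmentation_factors` are only illustrative and are not part of the repository.

```python
# (pure, augmented) pair counts per domain, copied from the table above.
COUNTS = {
    "Fictional Books": (333_189, 2_516_787),
    "Legal Document": (1_150_266, 3_450_801),
    "Medical Publication": (5_861, 27_588),
    "Movies Subtitles": (250_000, 3_698_046),
    "Software": (79_912, 239_745),
    "TED Talk": (352_652, 4_983_294),
    "Wikipedia": (645_326, 1_935_981),
    "News": (18_449, 139_341),
    "Religious texts": (124_389, 1_182_726),
    "Educational content": (397_008, 8_475_342),
    "No tag": (5_517, 66_299),
}


def augmentation_factors(counts):
    """Return augmented / pure for every domain."""
    return {domain: aug / pure for domain, (pure, aug) in counts.items()}


# Column totals; these should match the "Total" row of the table.
total_pure = sum(pure for pure, _ in COUNTS.values())
total_augmented = sum(aug for _, aug in COUNTS.values())

for domain, factor in augmentation_factors(COUNTS).items():
    print(f"{domain:<20} x{factor:.2f}")
print(f"{'Total':<20} x{total_augmented / total_pure:.2f}")
```

Overall the augmented set is a little under 8 times the size of the pure set, but the factor varies considerably from one domain to another.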

The data sources are described in more detail here.
