source link: https://github.com/imcaspar/gpt2-ml
README.md
GPT2 for Multiple Languages
- Simplified GPT2 train scripts (based on Grover, supporting TPUs)
- Ported BERT tokenizer, multilingual corpus compatible
- 1.5B GPT2 pretrained Chinese model (~15G corpus, 100k steps)
- Batteries-included Colab demo
- 1.5B GPT2 pretrained Chinese model (~50G corpus, 1M steps)
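The ported BERT tokenizer segments unknown words with WordPiece, a greedy longest-match-first rule. A minimal sketch of that rule is below — the `wordpiece_tokenize` helper and the toy vocabulary are illustrative assumptions, not the repo's actual code.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation, as in BERT's WordPiece.

    Continuation pieces carry the "##" prefix; if no piece matches,
    the whole word falls back to the unknown token.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark mid-word continuation pieces
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate until it hits the vocab
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "token", "##izer"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

Because matching is per-character rather than per-whitespace-word, the same mechanism works for Chinese and other multilingual corpora, which is what makes the ported tokenizer corpus-agnostic.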
Pretrained Model
1.5B GPT2 pretrained Chinese model [Google Drive]
Corpus from THUCNews and nlp_chinese_corpus
Trained for 100k steps on a Cloud TPU Pod v3-256
Google Colab
With just 3 clicks (not including the Colab auth process), the 1.5B pretrained Chinese model demo is ready to go.
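Under the hood, a GPT-2 demo like this generates text one token at a time by sampling from the model's output logits, typically with top-k sampling. A self-contained sketch of one such sampling step follows — the `top_k_sample` helper and the logit values are made up for illustration, not taken from the repo.

```python
import numpy as np

def top_k_sample(logits, k=5, temperature=1.0, rng=None):
    """Draw one token id from the k highest-logit candidates.

    Temperature rescales the logits before the softmax; higher values
    flatten the distribution and make sampling more diverse.
    """
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=np.float64) / temperature
    top = np.argsort(logits)[-k:]            # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())  # stable softmax over the top-k
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

logits = [0.1, 3.2, -1.0, 2.5, 0.0, 1.7]   # toy vocabulary of 6 tokens
token_id = top_k_sample(logits, k=3)
assert token_id in {1, 3, 5}  # only the 3 highest-logit tokens can be drawn
```

In a real generation loop the sampled token is appended to the context and the model is queried again for the next step's logits.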
Train
Disclaimer
The contents of this repository are for academic research purposes only, and we do not provide any conclusive remarks.
Citing
@misc{GPT2-ML,
author = {Zhibo Zhang},
title = {GPT2-ML: GPT-2 for Multiple Languages},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/imcaspar/gpt2-ml}},
}
Reference
https://github.com/google-research/bert
https://github.com/rowanz/grover
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)