GPT2 for Multiple Languages

Chinese documentation | English

  • [x] Simplified GPT2 training scripts (based on Grover, supporting TPUs)
  • [x] Ported BERT tokenizer, compatible with multilingual corpora
  • [x] 1.5B GPT2 pretrained Chinese model (~15 GB corpus, 100k steps)
  • [x] Batteries-included Colab demo
  • [x] 1.5B GPT2 pretrained Chinese model (~30 GB corpus, 220k steps)

Pretrained Model

| Size | Language | Corpus | Vocab | Link1 | Link2 | SHA256 |
| --- | --- | --- | --- | --- | --- | --- |
| 1.5B Params | Chinese | ~30 GB | CLUE (8021 tokens) | Google Drive | Baidu Pan (ffz6) | e698cc97a7f5f706f84f58bb469d614e |
| 1.5B Params | Chinese | ~15 GB | BERT (21128 tokens) | Google Drive | Baidu Pan (q9vr) | 4a6e5124df8db7ac2bdd902e6191b807 |

The corpus is drawn from THUCNews and nlp_chinese_corpus.

Trained for 220k steps on a Cloud TPU Pod v3-256.
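
Since the model files are large downloads, it can be worth checking a file against the published checksum before loading it. The following is a minimal sketch, not part of this repository: the function names `hex_digest_of_file` and `matches_published_checksum` are hypothetical, and it uses only Python's standard `hashlib`. Note that the values listed in the checksum column are 32 hexadecimal characters long, while a full SHA-256 digest has 64, so the algorithm is left as a parameter rather than assumed.

```python
import hashlib


def hex_digest_of_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published, algorithm="sha256"):
    """Compare a file's digest with a published value, ignoring case and padding."""
    return hex_digest_of_file(path, algorithm) == published.strip().lower()
```

A call such as `matches_published_checksum("path/to/downloaded/file", "<value from the table>")` returns `True` only when the digests are identical. If the result is `False`, the download may be corrupt, or the published value may have been produced with a different algorithm than the one passed in.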


Google Colab

With just 2 clicks (not counting the Colab authentication process), the demo of the 1.5B pretrained Chinese model is ready to go:

[Colab Notebook]



The contents of this repository are for academic research purposes only, and we do not provide any conclusive remarks.


  author = {Zhibo Zhang},
  title = {GPT2-ML: GPT-2 for Multiple Languages},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{}},


Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)


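[Synced (机器之心)] With just three clicks, let Chinese GPT-2 generate a custom story for you

[Scientific Spaces (科学空间)] You can now play with Chinese GPT2 in Keras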

