Awesome Open Source

Pretrained BigBird Model for Korean

What is BigBird • How to Use • Pretraining • Evaluation Result • Docs • Citation

한국어 | English

Apache 2.0 Issues linter DOI

What is BigBird?

BigBird is a sparse-attention-based model introduced in BigBird: Transformers for Longer Sequences. It can handle longer sequences than a standard BERT.

🦅 Longer Sequence - Handles up to 4096 tokens, 8 times the 512-token maximum of BERT

⏱️ Computational Efficiency - Uses sparse attention instead of full attention, reducing the cost from O(n^2) to O(n)

How to Use

  • 🤗 The model uploaded to the Huggingface Hub can be used right away :)
  • We recommend transformers>=4.11.0, in which some issues have been fixed. (PR related to the MRC issue)
  • Use BertTokenizer instead of BigBirdTokenizer. (AutoTokenizer loads BertTokenizer.)
  • For detailed usage, see the BigBird Transformers documentation.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")  # BigBirdModel
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")  # BertTokenizer
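
As a minimal sketch of encoding a long document with this checkpoint: the helper names `tokenizer_kwargs` and `encode_long_text` are hypothetical, and the 4096-token limit is taken from the maximum length stated in this README. The sketch assumes PyTorch is installed alongside transformers.

```python
MODEL_NAME = "monologg/kobigbird-bert-base"
MAX_LEN = 4096  # maximum sequence length stated in this README


def tokenizer_kwargs(max_length: int = MAX_LEN) -> dict:
    """Tokenizer call arguments: truncate inputs longer than max_length."""
    return {"truncation": True, "max_length": max_length, "return_tensors": "pt"}


def encode_long_text(text: str):
    """Return the last hidden states for one document (hypothetical helper)."""
    # Imported lazily so tokenizer_kwargs() works without torch/transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)  # loads BertTokenizer
    model = AutoModel.from_pretrained(MODEL_NAME)  # loads BigBirdModel
    model.eval()

    inputs = tokenizer(text, **tokenizer_kwargs())
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (1, sequence_length, hidden_size)
    return outputs.last_hidden_state
```

Calling `encode_long_text(document)` downloads the checkpoint on first use, so it needs network access to the Huggingface Hub.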

Pretraining

See [Pretraining BigBird] for details.

| Model | Hardware | Max len | LR | Batch | Train Step | Warmup Step |
| --- | --- | --- | --- | --- | --- | --- |
| KoBigBird-BERT-Base | TPU v3-8 | 4096 | 1e-4 | 32 | 2M | 20k |
  • Trained on a variety of data, including the Modu Corpus (모두의 말뭉치), Korean Wikipedia, Common Crawl, and news data
  • Trained as an ITC (Internal Transformer Construction) model (ITC vs ETC)

Evaluation Result

1. Short Sequence (<=512)

See [Finetune on Short Sequence Dataset] for details.

| Model | NSMC (acc) | KLUE-NLI (acc) | KLUE-STS (pearsonr) | Korquad 1.0 (em/f1) | KLUE MRC (em/rouge-w) |
| --- | --- | --- | --- | --- | --- |
| KoELECTRA-Base-v3 | 91.13 | 86.87 | 93.14 | 85.66 / 93.94 | 59.54 / 65.64 |
| KLUE-RoBERTa-Base | 91.16 | 86.30 | 92.91 | 85.35 / 94.53 | 69.56 / 74.64 |
| KoBigBird-BERT-Base | 91.18 | 87.17 | 92.61 | 87.08 / 94.71 | 70.33 / 75.34 |

2. Long Sequence (>=1024)

See [Finetune on Long Sequence Dataset] for details.

| Model | TyDi QA (em/f1) | Korquad 2.1 (em/f1) | Fake News (f1) | Modu Sentiment (f1-macro) |
| --- | --- | --- | --- | --- |
| KLUE-RoBERTa-Base | 76.80 / 78.58 | 55.44 / 73.02 | 95.20 | 42.61 |
| KoBigBird-BERT-Base | 79.13 / 81.30 | 67.77 / 82.03 | 98.85 | 45.42 |

Docs

Citation

If you use KoBigBird, please cite it as follows.

@software{jangwon_park_2021_5654154,
  author       = {Jangwon Park and Donggyu Kim},
  title        = {KoBigBird: Pretrained BigBird Model for Korean},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5654154},
  url          = {https://doi.org/10.5281/zenodo.5654154}
}

Contributors

Jangwon Park and Donggyu Kim

Acknowledgements

KoBigBird was built with Cloud TPU support from the Tensorflow Research Cloud (TFRC) program.

We also thank Seyun Ahn for providing the wonderful logo.

