Project Name | Description | Stars | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language
---|---|---|---|---|---|---|---|---
Nlp_chinese_corpus | Large Scale Chinese Corpus for NLP | 8,344 | 6 months ago | | | 20 | mit |
Bert Pytorch | Google AI 2018 BERT pytorch implementation | 5,605 | 5 months ago | 5 | October 23, 2018 | 63 | apache-2.0 | Python
Clue | Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard | 3,345 | 6 months ago | | | 73 | | Python
Bertweet | BERTweet: A pre-trained language model for English Tweets (EMNLP-2020) | 447 | a year ago | | | | mit | Python
Chatbot_data | Chatbot data for Korean | 293 | 8 months ago | | | | mit |
Lotclass | [EMNLP 2020] Text Classification Using Label Names Only: A Language Model Self-Training Approach | 231 | 2 years ago | | | | apache-2.0 | Python
Parsbert | 🤗 ParsBERT: Transformer-based Model for Persian Language Understanding | 222 | 10 months ago | | | 6 | apache-2.0 | Jupyter Notebook
Robbert | A Dutch RoBERTa-based language model | 168 | a year ago | | | 11 | mit | Jupyter Notebook
Transformer Lm | Transformer language model (GPT-2) with sentencepiece tokenizer | 155 | 3 years ago | | | 8 | | Python
Jlm | A fast LSTM language model for large-vocabulary languages like Japanese and Chinese | 99 | 4 years ago | | | | mit | Python
HerBERT is a series of BERT-based language models trained for Polish language understanding.
All three HerBERT models are summarized below:
Model | Tokenizer | Vocab Size | Batch Size | Train Steps | KLEJ Score
---|---|---|---|---|---
`herbert-klej-cased-v1` | BPE | 50K | 570 | 180k | 80.5
`herbert-base-cased` | BPE-Dropout | 50K | 2560 | 50k | 86.3
`herbert-large-cased` | BPE-Dropout | 50K | 2560 | 60k | 88.4
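A quick way to check that a downloaded checkpoint matches the table is to inspect the tokenizer's vocabulary size. This is a minimal sketch, assuming only the standard `vocab_size` attribute of `transformers` tokenizers; the exact count may differ slightly from the rounded 50K shown above.

```python
from transformers import AutoTokenizer

# Load the tokenizer of one of the variants listed above and report its vocabulary size.
# The table rounds the vocabulary to 50K; the exact attribute value may differ slightly.
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
print(tokenizer.vocab_size)
```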
The full KLEJ Benchmark leaderboard is available here.
For more details about the model architecture, training process, corpora used, and evaluation, please refer to the papers cited at the end of this document.
Example of how to load the model:
from transformers import AutoTokenizer, AutoModel

model_names = {
    "herbert-klej-cased-v1": {
        "tokenizer": "allegro/herbert-klej-cased-tokenizer-v1",
        "model": "allegro/herbert-klej-cased-v1",
    },
    "herbert-base-cased": {
        "tokenizer": "allegro/herbert-base-cased",
        "model": "allegro/herbert-base-cased",
    },
    "herbert-large-cased": {
        "tokenizer": "allegro/herbert-large-cased",
        "model": "allegro/herbert-large-cased",
    },
}

tokenizer = AutoTokenizer.from_pretrained(model_names["herbert-base-cased"]["tokenizer"])
model = AutoModel.from_pretrained(model_names["herbert-base-cased"]["model"])
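Note that `herbert-klej-cased-v1` is the only variant whose tokenizer is published under a separate name, so both entries from the dictionary above are needed when loading it. A minimal sketch, using only the names already listed:

```python
# The KLEJ variant keeps its tokenizer in a separate repository,
# so the tokenizer and model names differ (both taken from the dictionary above).
tokenizer = AutoTokenizer.from_pretrained(model_names["herbert-klej-cased-v1"]["tokenizer"])
model = AutoModel.from_pretrained(model_names["herbert-klej-cased-v1"]["model"])
```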
And how to use the model:
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding="longest",
        add_special_tokens=True,
        return_tensors="pt",
    )
)
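The returned `output` is a standard Hugging Face model output. A minimal sketch of turning it into sentence-pair vectors, assuming only the usual `last_hidden_state` field of `transformers` encoder outputs:

```python
# One contextual vector per input token: (batch_size, sequence_length, hidden_size).
token_embeddings = output.last_hidden_state

# A simple mean over the token dimension gives a crude fixed-size representation.
sentence_vector = token_embeddings.mean(dim=1)
print(sentence_vector.shape)
```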
License: CC BY 4.0
If you use this model, please cite the following papers:
The `herbert-klej-cased-v1` version of the model:
@inproceedings{rybak-etal-2020-klej,
title = "{KLEJ}: Comprehensive Benchmark for Polish Language Understanding",
author = "Rybak, Piotr and Mroczkowski, Robert and Tracz, Janusz and Gawlik, Ireneusz",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.111",
pages = "1191--1201",
}
The `herbert-base-cased` or `herbert-large-cased` version of the model:
@inproceedings{mroczkowski-etal-2021-herbert,
title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
author = "Mroczkowski, Robert and
Rybak, Piotr and
Wr{\'o}blewska, Alina and
Gawlik, Ireneusz",
booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
month = apr,
year = "2021",
address = "Kiyv, Ukraine",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
pages = "1--10",
}
You can contact us at: [email protected]