HerBERT is a BERT-based language model trained on Polish corpora using only the MLM objective with dynamic masking of whole words.
HerBERT is a series of BERT-based language models trained for Polish language understanding.
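HerBERT was pretrained with the MLM objective using dynamic whole-word masking: whenever any subword piece of a word is selected for masking, all pieces of that word are masked together, and the selection is re-drawn each time a sentence is seen. The idea can be sketched in plain Python (an illustrative toy using WordPiece-style `##` continuation markers, not HerBERT's actual BPE tokenization pipeline):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Toy whole-word masking: subword pieces of one word (marked
    with a leading '##') are always masked together."""
    # Group subword indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    # Mask each whole word with probability mask_prob.
    masked = list(tokens)
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked
```

Because the mask is re-drawn on every call ("dynamic" masking), each training epoch sees a different mask pattern for the same sentence.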

All three HerBERT models are summarized below:

| Model | Tokenizer | Vocab Size | Batch Size | Train Steps | KLEJ Score |
|---|---|---|---|---|---|
| herbert-klej-cased-v1 | BPE | 50K | 570 | 180k | 80.5 |
| herbert-base-cased | BPE-Dropout | 50K | 2560 | 50k | 86.3 |
| herbert-large-cased | BPE-Dropout | 50K | 2560 | 60k | 88.4 |

The full KLEJ Benchmark leaderboard is available here.

For more details about the model architecture, training process, corpora used, and evaluation, please refer to the papers cited below.


Example of how to load the model:

```python
from transformers import AutoTokenizer, AutoModel

model_names = {
    "herbert-klej-cased-v1": {
        "tokenizer": "allegro/herbert-klej-cased-tokenizer-v1",
        "model": "allegro/herbert-klej-cased-v1",
    },
    "herbert-base-cased": {
        "tokenizer": "allegro/herbert-base-cased",
        "model": "allegro/herbert-base-cased",
    },
    "herbert-large-cased": {
        "tokenizer": "allegro/herbert-large-cased",
        "model": "allegro/herbert-large-cased",
    },
}

tokenizer = AutoTokenizer.from_pretrained(model_names["herbert-base-cased"]["tokenizer"])
model = AutoModel.from_pretrained(model_names["herbert-base-cased"]["model"])
```

And how to use the model (the text must be tokenized first; the model accepts tensors, not raw strings):

```python
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy.",
            )
        ],
        padding="longest",
        add_special_tokens=True,
        return_tensors="pt",
    )
)
```

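The call returns contextual token embeddings in `output.last_hidden_state`. A common way to turn these into a single sentence vector is attention-mask-aware mean pooling. The sketch below uses NumPy stand-ins for the real tensors so it runs without downloading the model; in practice you would pass `output.last_hidden_state.detach().numpy()` and the encoder's `attention_mask` instead:

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padded positions."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Stand-in tensors shaped like a transformers output:
# batch=1, seq_len=4, hidden=768 (the base model's hidden size).
hidden = np.random.randn(1, 4, 768)
mask = np.array([[1, 1, 1, 0]])  # last position is padding
sentence_emb = mean_pool(hidden, mask)  # shape (1, 768)
```

Masked positions contribute nothing to the sum and are excluded from the count, so padding does not dilute the average.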

HerBERT is released under the CC BY 4.0 license.


If you use this model, please cite the following papers:

The herbert-klej-cased-v1 version of the model:

```bibtex
@inproceedings{rybak-etal-2020-klej,
    title = "{KLEJ}: Comprehensive Benchmark for Polish Language Understanding",
    author = "Rybak, Piotr and Mroczkowski, Robert and Tracz, Janusz and Gawlik, Ireneusz",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.111",
    pages = "1191--1201",
}
```
The herbert-base-cased or herbert-large-cased version of the model:

```bibtex
@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert  and
      Rybak, Piotr  and
      Wr{\'o}blewska, Alina  and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kiyv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}
```


You can contact us at: [email protected]
