HerBERT

HerBERT is a series of BERT-based language models trained on Polish corpora using only the masked language modelling (MLM) objective with dynamic masking of whole words.

All three HerBERT models are summarized below:

Model                  Tokenizer    Vocab Size  Batch Size  Train Steps  KLEJ Score
herbert-klej-cased-v1  BPE          50K         570         180k         80.5
herbert-base-cased     BPE-Dropout  50K         2560        50k          86.3
herbert-large-cased    BPE-Dropout  50K         2560        60k          88.4

The full KLEJ Benchmark leaderboard is available at klejbenchmark.com.

For more details about the model architecture, training process, training corpora, and evaluation, please refer to the paper HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish (cited below).

Usage

Example of how to load the model:

from transformers import AutoTokenizer, AutoModel

# Hugging Face Hub identifiers for the tokenizer and weights of each variant.
model_names = {
    "herbert-klej-cased-v1": {
        "tokenizer": "allegro/herbert-klej-cased-tokenizer-v1", 
        "model": "allegro/herbert-klej-cased-v1",
    },
    "herbert-base-cased": {
        "tokenizer": "allegro/herbert-base-cased", 
        "model": "allegro/herbert-base-cased",
    },
    "herbert-large-cased": {
        "tokenizer": "allegro/herbert-large-cased", 
        "model": "allegro/herbert-large-cased",
    },
}

# Select a variant by its short name (a key of model_names), not the full Hub id.
tokenizer = AutoTokenizer.from_pretrained(model_names["herbert-base-cased"]["tokenizer"])
model = AutoModel.from_pretrained(model_names["herbert-base-cased"]["model"])
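
Note that herbert-klej-cased-v1 keeps its tokenizer in a separate repository (allegro/herbert-klej-cased-tokenizer-v1), which is why the mapping above stores tokenizer and model names independently. Loading that variant follows the same pattern:

# The KLEJ variant loads its tokenizer and weights from two different repositories.
tokenizer = AutoTokenizer.from_pretrained(model_names["herbert-klej-cased-v1"]["tokenizer"])
model = AutoModel.from_pretrained(model_names["herbert-klej-cased-v1"]["model"])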

And how to use the model:

# Encode a single pair of Polish sentences and run one forward pass.
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding="longest",
        add_special_tokens=True,
        return_tensors="pt",
    )
)
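
The forward pass above returns the encoder's hidden states. A minimal sketch of inspecting them, assuming a recent transformers version (older releases return plain tuples rather than ModelOutput objects); the first-token pooling shown is a common convention, not something this README prescribes:

# Contextual embeddings for every input token:
print(output.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)

# One simple sentence representation: the embedding of the first (special) token.
sentence_embedding = output.last_hidden_state[:, 0, :]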

License

CC BY 4.0

Citation

If you use this model, please cite the following papers:

For the herbert-klej-cased-v1 version of the model:

@inproceedings{rybak-etal-2020-klej,
    title = "{KLEJ}: Comprehensive Benchmark for Polish Language Understanding",
    author = "Rybak, Piotr and Mroczkowski, Robert and Tracz, Janusz and Gawlik, Ireneusz",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.111",
    pages = "1191--1201",
}

For the herbert-base-cased or herbert-large-cased versions of the model:

@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert  and
      Rybak, Piotr  and
      Wr{\'o}blewska, Alina  and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kiyv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}

Contact

You can contact us at: [email protected]
