Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Tokenizers | 8,056 | 362 | 3 months ago | 85 | November 14, 2023 | 233 | apache-2.0 | Rust | ||
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production | ||||||||||
Friso | 449 | 7 months ago | 7 | apache-2.0 | C | |||||
High-performance Chinese tokenizer with GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Its fully modular implementation can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP. | ||||||||||
Coccoc Tokenizer | 295 | 3 years ago | 3 | lgpl-3.0 | C++ | |||||
High-performance tokenizer for the Vietnamese language | ||||||||||
Open Nlp | 88 | 20 | 10 years ago | 7 | May 28, 2014 | 2 | other | Ruby | ||
Ruby bindings to the OpenNLP Java toolkit. | ||||||||||
Python Ucto | 29 | 2 | 1 | 6 months ago | 22 | October 31, 2023 | 5 | Cython | ||
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial as it appears. This binding makes the power of the Ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto). | ||||||||||
Sentencepiece | 12 | 3 | 9 months ago | 23 | July 22, 2023 | other | Rust | |||
Rust binding for the SentencePiece library |