Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Sentencepiece | 8,851 | 120 | 787 | 3 months ago | 34 | May 02, 2023 | 32 | apache-2.0 | C++ | |
Unsupervised text tokenizer for Neural Network-based text generation. | ||||||||||
Pkuseg Python | 6,001 | 4 | 8 | a year ago | 22 | June 19, 2020 | 119 | mit | Python | |
The pkuseg toolkit for multi-domain Chinese word segmentation | ||||||||||
Subword Nmt | 1,937 | 18 | 18 | 2 years ago | 8 | December 08, 2021 | 2 | mit | Python | |
Unsupervised Word Segmentation for Neural Machine Translation and Text Generation | ||||||||||
Pythainlp | 902 | 24 | 51 | 3 months ago | 101 | November 26, 2023 | 35 | apache-2.0 | Python | |
Thai Natural Language Processing in Python. | ||||||||||
Jieba Rs | 585 | 5 | 15 | 9 months ago | 40 | July 16, 2023 | 9 | mit | Rust | |
The Jieba Chinese Word Segmentation Implemented in Rust | ||||||||||
Ekphrasis | 583 | 7 | 2 years ago | 54 | May 17, 2022 | 18 | mit | Python | ||
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets from Twitter). | ||||||||||
Vncorenlp | 472 | | | | a year ago | | | | other | Java |
A Vietnamese natural language processing toolkit (NAACL 2018) | ||||||||||
Nagisa | 365 | 1 | 7 | 3 months ago | 22 | July 30, 2023 | 4 | mit | Python | |
A Japanese tokenizer based on recurrent neural networks | ||||||||||
Pycantonese | 290 | | | | a year ago | 24 | December 28, 2021 | 5 | mit | Python |
Cantonese Linguistics and NLP | ||||||||||
Python Wordsegment | 268 | | | | 4 years ago | | | 8 | other | Python |
English word segmentation, written in pure-Python, and based on a trillion-word corpus. |