| Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
|---|---|---|---|---|---|---|---|---|---|---|
| Chinesewordsegmentation | 427 | | | | 3 years ago | | | 2 | mit | Python |
| Chinese word segmentation algorithm without corpus | | | | | | | | | | |
| Word Embedding Dimensionality Selection | 320 | | | | 3 years ago | | | 6 | mit | Python |
| On the Dimensionality of Word Embedding | | | | | | | | | | |
| Pyate | 242 | | | | a year ago | 32 | March 02, 2022 | 8 | mit | HTML |
| PYthon Automated Term Extraction | | | | | | | | | | |
| Smart | 78 | | | | 6 months ago | | | 10 | gpl-3.0 | JavaScript |
| String Matching Algorithms Research Tool | | | | | | | | | | |
| Segment | 77 | | | | 7 years ago | | | | | Python |
| A tool to segment text based on frequencies and the Viterbi algorithm: "#TheBoyWhoLived" => ['#', 'The', 'Boy', 'Who', 'Lived'] | | | | | | | | | | |
| Wordfreq | 72 | | 4 | 1 | 2 years ago | 1 | October 30, 2012 | 4 | mit | HTML |
| Text corpus calculation in JavaScript. Supports Chinese and English. | | | | | | | | | | |
| End2endasr | 43 | | | | 5 years ago | | | 5 | | Python |
| Implements an end-to-end ASR algorithm with TensorFlow | | | | | | | | | | |
| Communityfitnet | 23 | | | | a month ago | | | 5 | | |
| This page is a companion for our paper on overfitting and underfitting of community detection methods on real networks, written by Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset. (arXiv:1802.10582) | | | | | | | | | | |
| Cocolian Nlp | 16 | | | | 5 years ago | | | | apache-2.0 | Java |
| This project aims to build a standardized NLP processing framework that provides enterprise-grade APIs, along with reference implementations and test packages. There are many NLP packages available, including those from the Chinese Academy of Sciences and Fudan University; by wrapping these common NLP tools, the framework lets an enterprise compare underlying NLP implementations and switch between them seamlessly as needed. | | | | | | | | | | |
| Reaper | 16 | | | | 9 years ago | 1 | October 05, 2015 | 5 | | Clojure |
| A text summarization framework written in Clojure. | | | | | | | | | | |
## Chinese word segmentation algorithm without corpus

```python
from wordseg import WordSegment

doc = u'十四是十四四十是四十,十四不是四十,四十不是十四'
ws = WordSegment(doc, max_word_len=2, min_aggregation=1, min_entropy=0.5)
ws.segSentence(doc)
```

This will generate the words:

```
十四 是 十四 四十 是 四十 ， 十四 不是 四十 ， 四十 不是 十四
```
In practice, `doc` should be a long document string for better results. In that case, `min_aggregation` should be set far greater than 1 (e.g., 50), and `min_entropy` should also be set greater than 0.5 (e.g., 1.5). Note that both the input and the output of this function should be unicode strings.
`WordSegment.segSentence` has an optional argument `method`, with possible values `WordSegment.L`, `WordSegment.S`, and `WordSegment.ALL`:

- `WordSegment.L`: if a long word that is a combination of several shorter words is found, return only the long word.
- `WordSegment.S`: return the several shorter words.
- `WordSegment.ALL`: return both the long word and the shorter words.

Thanks to Matrix67's article.
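The `min_aggregation` and `min_entropy` thresholds correspond to the two statistics that this corpus-free approach (following Matrix67's article) is built on: how strongly a candidate word's characters stick together (an aggregation score in the spirit of pointwise mutual information) and how varied the characters next to the candidate are (boundary entropy). The sketch below illustrates both measures on the example sentence. It is not the library's actual implementation, and the function names `aggregation` and `boundary_entropy` are mine:

```python
import math
from collections import Counter

def aggregation(text, word):
    # Ratio of the candidate's frequency to the product of its parts'
    # frequencies, minimized over every split point. A high value means the
    # characters co-occur far more often than chance, suggesting a real word.
    n = len(text)
    freq = lambda s: text.count(s) / n
    return min(
        freq(word) / (freq(word[:i]) * freq(word[i:]))
        for i in range(1, len(word))
    )

def boundary_entropy(text, word, side='right'):
    # Shannon entropy of the characters adjacent to each occurrence of the
    # candidate. High entropy means many different neighbors, i.e. a likely
    # word boundary; low entropy suggests the candidate is part of a longer word.
    neighbors = []
    start = text.find(word)
    while start != -1:
        idx = start + len(word) if side == 'right' else start - 1
        if 0 <= idx < len(text):
            neighbors.append(text[idx])
        start = text.find(word, start + 1)
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

doc = u'十四是十四四十是四十,十四不是四十,四十不是十四'
print(aggregation(doc, u'十四'))       # ≈ 1.5
print(boundary_entropy(doc, u'十四'))  # ≈ 1.099 (ln 3: three distinct right neighbors)
```

A segmenter in this style scores every substring up to `max_word_len` with both measures and keeps those above the thresholds, which is why longer input text and stricter thresholds give cleaner word lists.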