ChineseWordSegmentation

Chinese word segmentation algorithm without corpus

Usage

from wordseg import WordSegment

# Word boundaries are learned from the document itself; no external corpus is needed.
doc = u'十四是十四四十是四十,十四不是四十,四十不是十四'
ws = WordSegment(doc, max_word_len=2, min_aggregation=1, min_entropy=0.5)
ws.segSentence(doc)

This will produce the following segmentation:

十四 是 十四 四十 是 四十 , 十四 不是 四十 , 四十 不是 十四

In practice, doc should be a long enough document string for good results. In that case, min_aggregation should be set much greater than 1, such as 50, and min_entropy should also be set greater than 0.5, such as 1.5.

In addition, both the input and the output of this function should be unicode strings.
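For example, a run on a longer document might look like the sketch below. The file name corpus.txt is hypothetical; the threshold values follow the guidance above, and segSentence is assumed to return a list of segmented words as in the Usage example.

import io
from wordseg import WordSegment

# Hypothetical input file; any sufficiently long Chinese text works.
with io.open('corpus.txt', encoding='utf-8') as f:
    long_doc = f.read()  # read as a unicode string

# Stricter thresholds, as suggested above for long documents.
ws = WordSegment(long_doc, max_word_len=2, min_aggregation=50, min_entropy=1.5)
words = ws.segSentence(long_doc)  # assumed to return a list of unicode words
print(u' '.join(words))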

WordSegment.segSentence has an optional argument method, which takes one of the values WordSegment.L, WordSegment.S, or WordSegment.ALL (see the sketch after this list):

  • WordSegment.L: if a long word that is a combination of several shorter words is found, return only the long word.
  • WordSegment.S: return the shorter words instead.
  • WordSegment.ALL: return both the long word and the shorter ones.
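A minimal sketch of the method argument, reusing the doc and ws objects from the Usage example above (passing method as a keyword argument is an assumption):

# Keep only the long word when it decomposes into several shorter words.
long_only = ws.segSentence(doc, method=WordSegment.L)

# Return both granularities; assumed to yield a list of unicode words.
both = ws.segSentence(doc, method=WordSegment.ALL)
print(u' '.join(both))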

Reference

Thanks to Matrix67's article.
