
unidic-py

This is a version of UniDic for Contemporary Written Japanese packaged for use with pip.

It currently supports UniDic 3.1.0, the latest version. Note that the dictionary takes up about 770MB on disk after install. If you want a smaller package, try unidic-lite.

The data for this dictionary is hosted as part of the AWS Open Data Sponsorship Program. You can read the announcement here.

After installing via pip, you need to download the dictionary using the following command:

python -m unidic download
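
To confirm that the download finished, one option is to check the dictionary directory for the files a compiled MeCab dictionary normally contains. This is only a rough sketch: the helper name `missing_dictionary_files` and the list of expected files are assumptions for illustration, based on the usual MeCab dictionary layout, and are not part of unidic-py itself.

```python
import os

# Assumption: the downloaded dictionary follows the usual compiled MeCab
# layout. Adjust this list if your copy of the dictionary differs.
EXPECTED_FILES = ("sys.dic", "matrix.bin", "char.bin", "unk.dic", "dicrc")


def missing_dictionary_files(dicdir):
    """Return the expected files that are not present in dicdir."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(dicdir, name))
    ]


if __name__ == "__main__":
    try:
        import unidic
    except ImportError:
        print("unidic is not installed")
    else:
        missing = missing_dictionary_files(unidic.DICDIR)
        if missing:
            print("missing files: {}".format(", ".join(missing)))
        else:
            print("dictionary looks complete")
```

If files are reported missing, running the download command again is a reasonable first step.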

If unidic is installed, fugashi and mecab-python3 will use it automatically. If you prefer, you can also pass the dictionary path to MeCab manually:

import fugashi
import unidic
tagger = fugashi.Tagger('-d "{}"'.format(unidic.DICDIR))
# that's it!
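
As a slightly longer sketch of the manual approach, the snippet below wraps the argument string in a small helper and tokenizes a sentence. The helper names `mecab_dic_args` and `tokenize` are made up for this example. It assumes fugashi and the downloaded dictionary are both available, so the imports are deferred until `tokenize` is called.

```python
def mecab_dic_args(dicdir):
    """Build the MeCab -d argument for a dictionary directory.

    The path is quoted so that directories containing spaces still work.
    """
    return '-d "{}"'.format(dicdir)


def tokenize(text):
    """Return (surface, lemma) pairs for text using UniDic.

    Requires fugashi and a downloaded unidic dictionary.
    """
    import fugashi
    import unidic

    tagger = fugashi.Tagger(mecab_dic_args(unidic.DICDIR))
    return [(word.surface, word.feature.lemma) for word in tagger(text)]


if __name__ == "__main__":
    try:
        for surface, lemma in tokenize("麩菓子は、麩を主材料とした日本の菓子。"):
            print(surface, lemma, sep="\t")
    except ImportError:
        print("fugashi or unidic is not installed")
```

The same argument string should also work with mecab-python3 by passing it to `MeCab.Tagger`.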

Differences from the Official UniDic Release

This has a few changes from the official UniDic release to make it easier to use.

  • entries for 令和 have been added
  • single-character numeric and alphabetic words have been deleted
  • unk.def has been modified so unknown punctuation won't be marked as a noun

See the extras directory for details on how to replicate the build process.

License

The modern Japanese UniDic is available under the GPL, LGPL, or BSD license; see here. UniDic is developed by NINJAL, the National Institute for Japanese Language and Linguistics. UniDic is copyrighted by the UniDic Consortium and is distributed here under the terms of the BSD License.

The code in this repository is not written or maintained by NINJAL. The code is available under the MIT or WTFPL License, as you prefer.
