Awesome Open Source

PyThaiNLP: Thai Natural Language Processing in Python


PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK but with a focus on the Thai language.



You can contact the PyThaiNLP team or ask questions in the chat on Matrix.

Version   Description                 Status
3.1       Stable                      Change Log
dev       Release Candidate for 4.0   Change Log

Getting Started


PyThaiNLP provides standard NLP functions for Thai, such as part-of-speech tagging and linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via a command-line interface.

List of Features
  • Convenient character and word classes, like Thai consonants (pythainlp.thai_consonants), vowels (pythainlp.thai_vowels), digits (pythainlp.thai_digits), and stop words (pythainlp.corpus.thai_stopwords), comparable to constants like string.ascii_letters, string.digits, and string.punctuation
  • Thai linguistic unit segmentation/tokenization, including sentence (sent_tokenize), word (word_tokenize), and subword segmentations based on Thai Character Cluster (subword_tokenize)
  • Thai part-of-speech tagging (pos_tag)
  • Thai spelling suggestion and correction (spell and correct)
  • Thai transliteration (transliterate)
  • Thai soundex (soundex) with three engines (lk82, udom83, metasound)
  • Thai collation (sort by dictionary order) (collate)
  • Read out number to Thai words (bahttext, num_to_thaiword)
  • Thai datetime formatting (thai_strftime)
  • Fix for text typed with the wrong Thai/English keyboard layout (eng_to_thai, thai_to_eng)
  • Command-line interface for basic functions, like tokenization and pos tagging (run thainlp in your shell)
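As a rough illustration of a few of the functions listed above, the sketch below tokenizes a sentence, tags the resulting words, and reads a number out in Thai. It assumes PyThaiNLP is installed with its default engines. The helper names (tokenize, tag, read_number) and the sample sentence are made up for this example, and the exact output depends on the installed version and engine, so none is shown.

```python
def tokenize(text):
    """Split Thai text into words (hypothetical wrapper for this example)."""
    # Imported lazily so the module still loads when PyThaiNLP is absent.
    from pythainlp.tokenize import word_tokenize

    return word_tokenize(text, engine="newmm")  # returns a list of str


def tag(words):
    """Attach a part-of-speech tag to every word."""
    from pythainlp.tag import pos_tag

    return pos_tag(words)  # returns a list of (word, tag) pairs


def read_number(number):
    """Spell an integer out in Thai words."""
    from pythainlp.util import num_to_thaiword

    return num_to_thaiword(number)


if __name__ == "__main__":
    try:
        words = tokenize("ฉันรักภาษาไทย")
        print(words)
        print(tag(words))
        print(read_number(21))
    except ImportError:
        print("PyThaiNLP is not installed; run: pip install pythainlp")
```

The imports sit inside the functions only to keep the sketch loadable without the package; in ordinary code they would go at the top of the module.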


pip install --upgrade pythainlp

This will install the latest stable release of PyThaiNLP.

Install different releases:

  • Stable release: pip install --upgrade pythainlp
  • Pre-release (near ready): pip install --upgrade --pre pythainlp
  • Development (likely to break things): pip install

Installation Options

Some functionalities, like Thai WordNet, may require extra packages. To install those requirements, specify a set of [name] immediately after pythainlp:

pip install pythainlp[extra1,extra2,...]
List of possible `extras`
  • full (install everything)
  • attacut (to support attacut, a fast and accurate tokenizer)
  • benchmarks (for word tokenization benchmarking)
  • icu (for ICU, International Components for Unicode, support in transliteration and tokenization)
  • ipa (for IPA, International Phonetic Alphabet, support in transliteration)
  • ml (to support ULMFiT models for classification)
  • thai2fit (for Thai word vector)
  • thai2rom (for machine-learnt romanization)
  • wordnet (for Thai WordNet API)

For dependency details, look at extras variable in
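When installation is scripted, the extras syntax can be assembled programmatically. The sketch below builds the pip command for a chosen set of extras. The helper pip_install_command and the KNOWN_EXTRAS set are hypothetical names for this example; the set simply copies the list above and may not match the extras of a given release.

```python
import sys

# Extras copied from the list above (an assumption; releases may differ).
KNOWN_EXTRAS = {
    "full", "attacut", "benchmarks", "icu", "ipa",
    "ml", "thai2fit", "thai2rom", "wordnet",
}


def pip_install_command(extras=(), upgrade=True):
    """Return an argv list that installs pythainlp with the given extras."""
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError("unknown extras: " + ", ".join(unknown))

    requirement = "pythainlp"
    if extras:
        requirement += "[" + ",".join(extras) + "]"

    # Use the current interpreter's pip so the right environment is targeted.
    command = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        command.append("--upgrade")
    command.append(requirement)
    return command


if __name__ == "__main__":
    print(pip_install_command(["attacut", "wordnet"]))
```

Passing the result to subprocess.run as a list avoids shell quoting problems: in shells such as zsh, unquoted square brackets are treated as glob patterns, so pythainlp[attacut] typed directly may need quotes.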

Data directory

  • Some additional data, like word lists and language models, may be downloaded automatically at runtime.
  • PyThaiNLP caches this data under the directory ~/pythainlp-data by default.
  • The data directory can be changed by setting the environment variable PYTHAINLP_DATA_DIR.
  • See the data catalog (db.json) at
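The lookup rule described above (use PYTHAINLP_DATA_DIR if set, otherwise fall back to ~/pythainlp-data) can be sketched as follows. This is not the library's own code: resolve_data_dir is a hypothetical helper that only mirrors the documented behaviour, which can be handy for checking where cached files are likely to land.

```python
import os

ENV_VAR = "PYTHAINLP_DATA_DIR"
DEFAULT_DIR_NAME = "pythainlp-data"


def resolve_data_dir(environ=None):
    """Return the data directory implied by the environment (sketch only)."""
    environ = os.environ if environ is None else environ

    override = environ.get(ENV_VAR)
    if override:
        # An explicit setting wins; expand "~" and make the path absolute.
        return os.path.abspath(os.path.expanduser(override))

    # Documented default: a pythainlp-data folder in the home directory.
    return os.path.join(os.path.expanduser("~"), DEFAULT_DIR_NAME)


if __name__ == "__main__":
    print(resolve_data_dir())
```

To redirect the cache, set the variable in the environment before starting Python, for example in the shell profile or in the service configuration.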

Command-Line Interface

Some PyThaiNLP functionality is available from the command line through the thainlp command.

For example, to display the catalog of datasets:

thainlp data catalog

To show usage help:

thainlp help
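The same commands can be driven from a script. The sketch below wraps the two invocations shown above and returns nothing when the thainlp executable is not on the PATH. The helper names thainlp_argv and run_thainlp are hypothetical, and only the subcommands that appear in this section are used.

```python
import shutil
import subprocess


def thainlp_argv(*args):
    """Build the argument list for a thainlp invocation."""
    return ["thainlp", *args]


def run_thainlp(*args):
    """Run thainlp and return its standard output, or None if unavailable."""
    if shutil.which("thainlp") is None:
        return None  # PyThaiNLP (and its CLI entry point) is not installed
    result = subprocess.run(
        thainlp_argv(*args),
        capture_output=True,
        text=True,
        check=False,  # leave exit-code handling to the caller
    )
    return result.stdout


if __name__ == "__main__":
    output = run_thainlp("data", "catalog")
    print(output if output is not None else "thainlp not found on PATH")
```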


  • PyThaiNLP source code and notebooks: Apache Software License 2.0
  • Corpora, datasets, and documentation created by PyThaiNLP: Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0)
  • Language models created by PyThaiNLP: Creative Commons Attribution 4.0 International Public License (CC-by)
  • Other corpora and models that may be included with PyThaiNLP: see Corpus License

Contribute to PyThaiNLP

  • Please do fork and create a pull request :)
  • For style guide and other information, including references to algorithms we use, please refer to our contributing page.

Who uses PyThaiNLP?

You can read


If you use PyThaiNLP in your project or publication, please cite the library as follows:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, & Pattarawat Chormai. (2016, Jun 27). PyThaiNLP: Thai Natural Language Processing in Python. Zenodo.

or BibTeX entry:

    author       = {Wannaphong Phatthiyaphaibun and Korakot Chaovavanich and Charin Polpanumas and Arthit Suriyawongkul and Lalita Lowphansirikul and Pattarawat Chormai},
    title        = {{PyThaiNLP: Thai Natural Language Processing in Python}},
    month        = Jun,
    year         = 2016,
    doi          = {10.5281/zenodo.3519354},
    publisher    = {Zenodo},
    url          = {}


  • VISTEC-depa Thailand Artificial Intelligence Research Institute: From 2019 to 2022, our contributors Korakot Chaovavanich and Lalita Lowphansirikul were supported by the VISTEC-depa Thailand Artificial Intelligence Research Institute.
  • MacStadium: MacStadium provides us with a free Mac Mini M1 for running CI builds.

Made with | PyThaiNLP Team | "We build Thai NLP"

We have only one official repository, on GitHub, and one mirror, on GitLab.
Beware of malware if you use code from mirrors other than the official two at GitHub and GitLab.
