HanLP

Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
Alternatives To HanLP

| Project Name | Stars | Latest Release | Open Issues | License | Language | Description |
|---|---|---|---|---|---|---|
| HanLP | 30,911 | February 25, 2023 | 9 | apache-2.0 | Python | Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification |
| Spacy | 27,699 | October 16, 2023 | 90 | mit | Python | 💫 Industrial-strength Natural Language Processing (NLP) in Python |
| Nlp Progress | 21,962 | - | 52 | mit | Python | Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks. |
| Flair | 13,260 | October 28, 2023 | 58 | other | Python | A very simple framework for state-of-the-art Natural Language Processing (NLP) |
| Compromise | 10,947 | November 16, 2023 | 93 | mit | JavaScript | modest natural-language processing |
| Corenlp | 9,252 | March 03, 2021 | 178 | gpl-3.0 | Java | CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc. |
| Stanza | 6,863 | December 03, 2023 | 79 | other | Python | Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages |
| Deeppavlov | 6,402 | October 17, 2023 | 42 | apache-2.0 | Python | An open source library for deep learning end-to-end dialog systems and chatbots. |
| Snips Nlu | 3,796 | January 15, 2020 | 66 | apache-2.0 | Python | Snips Python library to extract meaning from text |
| Spark Nlp | 3,514 | October 26, 2023 | 49 | apache-2.0 | Scala | State of the Art Natural Language Processing |
Readme

HanLP: Han Language Processing

Chinese | Japanese | Docs | Forum

The multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x, advancing state-of-the-art deep learning techniques in both academia and industry. HanLP was designed from day one to be efficient, user-friendly, and extensible.

Thanks to open-access corpora like Universal Dependencies and OntoNotes, HanLP 2.1 now offers 10 joint tasks on 130 languages: tokenization, lemmatization, part-of-speech tagging, token feature extraction, named entity recognition, dependency parsing, constituency parsing, semantic role labeling, semantic dependency parsing, and abstract meaning representation (AMR) parsing.

For end users, HanLP offers lightweight RESTful APIs and native Python APIs.

RESTful APIs

Tiny packages of just a few KBs, suited to agile development and mobile applications. Anonymous users are welcome, but an auth key is recommended; a free one can be applied for here under the CC BY-NC-SA 4.0 license.

Tutorials for RESTful APIs

Python

pip install hanlp_restful

Create a client with our API endpoint and your auth key.

from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='mul') # mul: multilingual, zh: Chinese

Java

Insert the following dependency into your pom.xml.

<dependency>
  <groupId>com.hankcs.hanlp.restful</groupId>
  <artifactId>hanlp-restful</artifactId>
  <version>0.0.15</version>
</dependency>

Create a client with our API endpoint and your auth key.

HanLPClient HanLP = new HanLPClient("https://hanlp.hankcs.com/api", null, "mul"); // mul: multilingual, zh: Chinese

Quick Start

No matter which language you use, the same interface can be used to parse a document.

HanLP.parse(
    "In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments. 2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。")

See docs for visualization, annotation guidelines and more details.
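
The parse call returns annotations for every task the model performs. Below is a minimal sketch of inspecting the result, assuming the hanlp_restful client returns HanLP's dict-like Document keyed by task name (the exact keys depend on the loaded model):

doc = HanLP.parse('In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments.')
print(doc['tok'])  # tokens, one list per sentence
print(doc['ner'])  # named entities, one list per sentence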

Native APIs

pip install hanlp

HanLP requires Python 3.6 or later. GPU/TPU is suggested but not mandatory.

Quick Start

import hanlp

HanLP = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE)
print(HanLP(['In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments.',
             '2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。',
             '2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。']))
  • In particular, the Python HanLPClient can also be used as a callable function with the same semantics. See docs for visualization, annotation guidelines and more details.
  • To process Chinese or Japanese, HanLP provides mono-lingual models for each language that significantly outperform the multi-lingual model; see the sketch after this list and the docs for the full list of models.
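
Both points can be sketched in a few lines; the model identifier below is one example entry under hanlp.pretrained.tok, and any mono-lingual model listed in the docs can be substituted.

import hanlp

# The RESTful client object is itself callable: HanLP(text) behaves like HanLP.parse(text).
# A mono-lingual Chinese tokenizer, loaded by its identifier from hanlp.pretrained:
tok = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)
print(tok('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。'))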

Train Your Own Models

Writing DL models is not hard; the hard part is writing a model that reproduces the scores reported in papers. The snippet below shows how to surpass the state-of-the-art tokenizer in 6 minutes.

# Imports for the snippet below; the module paths assume the HanLP 2.1 package layout.
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
from hanlp.datasets.tokenization.sighan2005.pku import SIGHAN2005_PKU_TRAIN_ALL, SIGHAN2005_PKU_TEST

tokenizer = TransformerTaggingTokenizer()
save_dir = 'data/model/cws/sighan2005_pku_bert_base_96.7'
tokenizer.fit(
    SIGHAN2005_PKU_TRAIN_ALL,
    SIGHAN2005_PKU_TEST,  # Conventionally, no devset is used. See Tian et al. (2020).
    save_dir,
    'bert-base-chinese',
    max_seq_len=300,
    char_level=True,
    hard_constraint=True,
    sampler_builder=SortingSamplerBuilder(batch_size=32),
    epochs=3,
    adam_epsilon=1e-6,
    warmup_steps=0.1,
    weight_decay=0.01,
    word_dropout=0.1,
    seed=1660853059,
)
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)

The result is guaranteed to be 96.73 because the random seed is fixed. Unlike some overclaimed papers and projects, HanLP promises that every single digit in our scores is reproducible. Any reproducibility issue will be treated and fixed as a top-priority fatal bug.
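
Once trained, the tokenizer can be reloaded from save_dir for prediction. A minimal sketch, assuming hanlp.load accepts a local save directory the same way it accepts a pretrained identifier:

import hanlp

tokenizer = hanlp.load('data/model/cws/sighan2005_pku_bert_base_96.7')  # the save_dir used above
print(tokenizer('商品和服务'))  # segmented tokens, e.g. ['商品', '和', '服务']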

Performance

The performance of multi-task learning models is shown in the following table.

| lang | corpora | model | tok (fine) | tok (coarse) | pos (ctb) | pos (pku) | pos (863) | pos (ud) | ner (pku) | ner (msra) | ner (ontonotes) | dep | con | srl | sdp (SemEval16) | sdp (DM) | sdp (PAS) | sdp (PSD) | lem | fea | amr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mul | UD2.7 / OntoNotes5 | small | 98.62 | - | - | - | - | 93.23 | - | - | 74.42 | 79.10 | 76.85 | 70.63 | - | 91.19 | 93.67 | 85.34 | 87.71 | 84.51 | - |
| mul | UD2.7 / OntoNotes5 | base | 98.97 | - | - | - | - | 90.32 | - | - | 80.32 | 78.74 | 71.23 | 73.63 | - | 92.60 | 96.04 | 81.19 | 85.08 | 82.13 | - |
| zh | open | small | 97.25 | - | 96.66 | - | - | - | - | - | 95.00 | 84.57 | 87.62 | 73.40 | 84.57 | - | - | - | - | - | - |
| zh | open | base | 97.50 | - | 97.07 | - | - | - | - | - | 96.04 | 87.11 | 89.84 | 77.78 | 87.11 | - | - | - | - | - | - |
| zh | close | small | 96.70 | 95.93 | 96.87 | 97.56 | 95.05 | - | 96.22 | 95.74 | 76.79 | 84.44 | 88.13 | 75.81 | 74.28 | - | - | - | - | - | - |
| zh | close | base | 97.52 | 96.44 | 96.99 | 97.59 | 95.29 | - | 96.48 | 95.72 | 77.77 | 85.29 | 88.57 | 76.52 | 73.76 | - | - | - | - | - | - |
| zh | close | ernie | 96.95 | 97.29 | 96.76 | 97.64 | 95.22 | - | 97.31 | 96.47 | 77.95 | 85.67 | 89.17 | 78.51 | 74.10 | - | - | - | - | - | - |
  • Multi-task learning models often underperform their single-task counterparts according to our latest research. Similarly, mono-lingual models often outperform multi-lingual models. Therefore, we strongly recommend a single-task mono-lingual model if you are targeting high accuracy rather than speed; see the sketch after this list.
  • A state-of-the-art AMR model has been released.
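
As a sketch of that recommendation, single-task mono-lingual models can simply be loaded and chained by hand; the identifiers below are example entries under hanlp.pretrained, and any single-task models follow the same pattern.

import hanlp

# Accuracy-oriented Chinese processing with single-task, mono-lingual models
# instead of the joint multi-task model.
tok = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH)
pos = hanlp.load(hanlp.pretrained.pos.CTB9_POS_ELECTRA_SMALL)
tokens = tok('晓美焰来到北京立方庭参观自然语义科技公司')
print(tokens)
print(pos(tokens))  # one POS tag per token above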

Citing

If you use HanLP in your research, please cite our EMNLP paper:

@inproceedings{he-choi-2021-stem,
    title = "The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders",
    author = "He, Han and Choi, Jinho D.",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.451",
    pages = "5555--5577",
    abstract = "Multi-task learning with transformer encoders (MTL) has emerged as a powerful technique to improve performance on closely-related tasks for both accuracy and efficiency while a question still remains whether or not it would perform as well on tasks that are distinct in nature. We first present MTL results on five NLP tasks, POS, NER, DEP, CON, and SRL, and depict its deficiency over single-task learning. We then conduct an extensive pruning analysis to show that a certain set of attention heads get claimed by most tasks during MTL, who interfere with one another to fine-tune those heads for their own objectives. Based on this finding, we propose the Stem Cell Hypothesis to reveal the existence of attention heads naturally talented for many tasks that cannot be jointly trained to create adequate embeddings for all of those tasks. Finally, we design novel parameter-free probes to justify our hypothesis and demonstrate how attention heads are transformed across the five tasks during MTL through label analysis.",
}

License

Code

HanLP is licensed under Apache License 2.0. You can use HanLP in your commercial products for free. We would appreciate it if you add a link to HanLP on your website.

Models

Unless otherwise specified, all models in HanLP are licensed under CC BY-NC-SA 4.0.

References

https://hanlp.hankcs.com/docs/references.html
