Chinese-NLP-Corpus

A collection of Chinese NLP corpora

Open Domain

Corpora for the open domain, covering law, social media, and user reviews.

Word Segmentation and Part-of-Speech

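| Name | Description | Link |
|---|---|---|
| ZhuXian (诛仙) | POS and word segmentation annotations for the novel "Zhu Xian" (诛仙) | zhuxian |
| CNLC | Data from China's National Language Commission (国家语言委员会), split train:dev:test = 8:1:1 | CNLC |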

* The URLs in the table are out of date; the data can be found at the reference below.
Reference: https://github.com/hankcs/multi-criteria-cws/tree/master/data
Details of the corpus
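
The CNLC entry above is described as split train:dev:test = 8:1:1. For a corpus that ships as a single file, a similar split can be sketched as follows. This is a minimal illustration, not the procedure used to build CNLC: the function name `split_8_1_1`, the fixed seed, and the shuffle-then-slice approach are all assumptions made for the example.

```python
import random


def split_8_1_1(sentences, seed=0):
    """Shuffle the items and split them into train/dev/test at roughly 8:1:1.

    Hypothetical helper for illustration; the seed makes the split reproducible.
    """
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = n * 8 // 10  # 80% for training
    n_dev = n // 10        # 10% for development
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder, about 10%
    return train, dev, test


# Stand-in data: 100 integers in place of 100 annotated sentences.
train, dev, test = split_8_1_1(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```

Any remainder from integer division ends up in the test portion, so the three parts always cover the whole input.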

Named Entity Recognition (NER)

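| Name | Description | Link |
|---|---|---|
| MSRA | First of the most commonly used datasets for Chinese NER | MSRA |
| People's Daily | Second of the most commonly used datasets for Chinese NER | People's Daily |
| Weibo Data | Third of the most commonly used datasets for Chinese NER | Weibo |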

Text Classification

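| Name | Description | Link | Notes |
|---|---|---|---|
| CAIL2018 | Data from the 2018 China "法研杯" legal intelligence challenge (tasks: charge prediction, law article recommendation, prison term prediction). The dataset contains 2.68 million criminal law documents covering 183 charges and 202 law articles; prison terms include 0-25 years, life imprisonment, and the death penalty. | CAIL2018 | Competition website, GitHub |
| CSL - Classification | Abstracts of papers from natural science journals in the Chinese Scientific Literature dataset (CSL), classified by discipline according to the National Natural Science Foundation of China. | CSL - Classification | |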

Sentiment Analysis and Rating

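| Name | Description | Link | Notes |
|---|---|---|---|
| ChnSentiCorp_htl_all | 7,000+ hotel reviews: 5,000+ positive and 2,000+ negative | ChnSentiCorp_htl_all | |
| waimai_10k | User reviews collected from a food delivery platform: 4,000 positive and about 8,000 negative | waimai_10k | |
| online_shopping_10_cats | 10 categories (books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu, clothing, computers, hotels), 60,000+ reviews in total, about 30,000 positive and 30,000 negative | online_shopping_10_cats | |
| weibo_senti_100k | 100,000+ sentiment-labeled Sina Weibo posts, about 50,000 positive and 50,000 negative | weibo_senti_100k | Reference page. The dataset contains a large number of emoji, and results may depend on them; it is best to remove emoji before training. |
| simplifyweibo_4_moods | 360,000+ sentiment-labeled Sina Weibo posts with 4 emotions: about 200,000 joy, and about 50,000 each for anger, disgust, and low mood | simplifyweibo_4_moods | |
| dmsc_v2 | 28 movies, 700,000+ users, 2 million+ ratings/reviews | dmsc_v2 | |
| yf_dianping | 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings | yf_dianping | |
| yf_amazon | 520,000 products, 1,100+ categories, 1.42 million users, 7.2 million reviews/ratings | yf_amazon | |
| ez_douban | 50,000+ movies (30,000+ with titles, 20,000+ without), 28,000 users, 2.8 million ratings | ez_douban | |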
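
The note on weibo_senti_100k recommends removing emoji before training. A minimal sketch of such a preprocessing step is shown below. It is an illustration under stated assumptions, not part of the dataset's tooling: the helper name `strip_emoji` is hypothetical, the Unicode ranges cover common emoji blocks rather than every emoji, and the bracketed pattern assumes that emoticons may also appear as short codes such as `[哈哈]`.

```python
import re

# Common emoji code point ranges (an approximation, not an exhaustive list).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16
    "\u200D"                 # zero-width joiner
    "]+"
)

# Assumption: emoticons may also be written as short bracketed codes, e.g. [哈哈].
BRACKET_CODE_PATTERN = re.compile(r"\[[^\[\]\s]{1,8}\]")


def strip_emoji(text, bracket_codes=True):
    """Remove Unicode emoji and, optionally, bracketed emoticon codes."""
    text = EMOJI_PATTERN.sub("", text)
    if bracket_codes:
        text = BRACKET_CODE_PATTERN.sub("", text)
    return text


print(strip_emoji("太好了[哈哈]😀"))
```

Setting `bracket_codes=False` keeps bracketed text, which is safer when square brackets carry real content in the corpus.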

Other GitHub Repos

| Description | Link | Notes |
|---|---|---|
| Chinese NLP Corpus | SophonPlus/ChineseNlpCorpus | |
| awesome-chinese-nlp/Corpus (Chinese corpora) | crownpku/Awesome-Chinese-NLP | |
| Large Scale Chinese Corpus for NLP | brightmart/nlp_chinese_corpus | |
| Chinese natural language processing datasets | InsaneLife/ChineseNLPCorpus | |
| funNLP | fighting41love/funNLP | |

Medical Domain

Corpora for the Chinese medical domain, including medical terminology, QA, and clinical NER.

Benchmark

| Name | Description | Link | Notes |
|---|---|---|---|
| ChineseBLUE | The Chinese Biomedical Language Understanding Evaluation benchmark by Alibaba | ChineseBLUE | Conceptualized Representation Learning for Chinese Biomedical Text Mining |

Word Segmentation

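| Name | Description | Link | Notes |
|---|---|---|---|
| AMTTL | Word segmentation dataset for medical language. The source appears to be medical forums, so the data leans toward the open domain and differs from the language used in medical texts. | AMTTL | Adaptive Multi-Task Transfer Learning for Chinese Word Segmentation in Medical Text |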

Clinical NER

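| Name | Description | Link | Notes |
|---|---|---|---|
| CNMER | Chinese medical entity recognition dataset; entities include body parts, symptoms and signs, examinations, diseases, and treatments. | CNMER | Probably the CCKS2017 data. |
| CNMER | Recognition of 6 named entity types: diseases and diagnoses, anatomical sites, imaging examinations, laboratory tests, surgeries, and drugs | CCKS2018 data | |
| CNMER | Recognition of Chinese medical named entities | CCKS2019 data | Shared by OpenKG |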

Question Answer (QA)

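| Name | Description | Link | Notes |
|---|---|---|---|
| cMedQA | Data from an online medical forum, containing 54,000 questions and about 100,000 corresponding answers. | cMedQA | Chinese Medical Question Answer Matching Using End-to-End Character-Level Multi-Scale CNNs |
| cMedQA2 | Extended version of cMedQA, containing about 100,000 medical questions and about 200,000 corresponding answers. | cMedQA2 | Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection |
| webMedQA | Another online medical QA dataset, containing 60,000 questions and 310,000 answers, with question categories included. | webMedQA | Applying deep matching networks to Chinese medical question answering: A study and a dataset |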

Others

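| Name | Description | Link | Notes |
|---|---|---|---|
| medical-books | Open source medical books in LaTeX | medical-books | |
| awesome_Chinese_medical_NLP | A curated list of public resources for Chinese medical NLP | awesome_Chinese_medical_NLP | |
| Chinese_medical_NLP | Evaluation datasets, papers, and related resources for medical NLP (mainly Chinese) | Chinese_medical_NLP | |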