Project Name | Description | Stars | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|
Nlp_chinese_corpus | Large Scale Chinese Corpus for NLP | 7,386 | 4 months ago | - | - | 19 | mit | - |
Chinese Names Corpus | Chinese personal-name corpus and name generator: Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition. | 3,411 | 4 months ago | - | - | 6 | apache-2.0 | - |
Clue | Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard | 2,954 | 4 months ago | - | - | 71 | - | Python |
Cluedatasetsearch | Search all Chinese NLP datasets, plus commonly used English NLP datasets | 2,778 | 4 months ago | - | - | 6 | - | Python |
Textrecognitiondatagenerator | A synthetic data generator for text recognition | 2,607 | 16 days ago | 12 | November 15, 2020 | 114 | mit | Python |
Awesome_chinese_medical_nlp | A curated list of public Chinese medical NLP resources: terminologies, corpora, word vectors, pre-trained models, knowledge graphs, named entity recognition, QA, information extraction, models, papers, etc. | 1,411 | 2 months ago | - | - | - | - | - |
Chinesenlp | Datasets and SOTA results for every field of Chinese NLP | 1,329 | 2 years ago | - | - | 3 | - | HTML |
Cluener2020 | CLUENER2020: Fine-Grained Named Entity Recognition for Chinese | 1,196 | 4 months ago | - | - | 48 | - | Python |
Cdial Gpt | A Large-scale Chinese Short-Text Conversation Dataset and Chinese pre-training dialog models | 944 | 10 months ago | - | - | 10 | mit | Python |
Synthtext_chinese_version | Modified from https://github.com/ankush-me/SynthText.git to generate Chinese characters | 682 | 5 years ago | - | - | 30 | - | C++ |
We release the first Chinese reading comprehension dataset, which includes People Daily and Children's Fairy Tale (PD&CFT). We hope this will speed up future research in machine comprehension.
people_daily
children_fairy_tale
NOTE: As illustrated in the paper, the human evaluation test set does NOT consist of queries written by humans. It also consists of Cloze-style queries, but with the easy ones removed.
The statistics of the dataset are listed below.
- | PD-train | PD-valid | PD-test | CFT-auto | CFT-human |
---|---|---|---|---|---|
# Query | 870,710 | 3,000 | 3,000 | 1,646 | 1,953 |
Max # tokens in docs | 618 | 536 | 634 | 318 | 414 |
Max # tokens in query | 502 | 153 | 265 | 83 | 92 |
Avg # tokens in docs | 379 | 425 | 410 | 122 | 153 |
Avg # tokens in query | 38 | 38 | 41 | 20 | 20 |
Vocabulary | 248,160 | - | - | - | - |
Here is a sample of People Daily data,
1 ||| 1 1 2013
2 ||| 500 29.6% 1997
3 ||| 26.5% 1996
4 ||| 38.3%
5 ||| 12 31
6 ||| 12 78.1 11 72
7 ||| 2013 1995
8 |||
9 ||| XXXXX
10 |||
11 ||| XXXXX |||
This document consists of 10 sentences. Each sentence is in the form
sentence_id(space)|||(space)sentence
and the last line indicates the Query and Answer, in the form
sentence_id(space)|||(space)Query(space)|||(space)Answer
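The line format described above can be read with a few lines of Python. The following is only a minimal sketch, not the authors' loader: the names `ClozeExample` and `parse_document` are hypothetical, it assumes the caller has already grouped the lines belonging to one document, and it assumes tokens are separated by whitespace and that `XXXXX` marks the blank in the query, as in the sample.

```python
from typing import Iterable, List, NamedTuple


class ClozeExample(NamedTuple):
    """One document with its Cloze-style query (hypothetical container)."""
    sentences: List[List[str]]  # tokenized document sentences
    query: List[str]            # tokenized query, containing the blank marker
    answer: str                 # the word that fills the blank


def parse_document(lines: Iterable[str]) -> ClozeExample:
    """Parse the lines of ONE document in the 'id ||| text' format.

    Assumption: every line except the last is 'sentence_id ||| sentence',
    and the last line is 'sentence_id ||| Query ||| Answer'.
    """
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) < 2:
        raise ValueError("expected at least one sentence and one query line")

    sentences = []
    for row in rows[:-1]:
        # Drop the sentence id; keep the whitespace-separated tokens.
        _sid, _sep, text = row.partition("|||")
        sentences.append(text.split())

    parts = [p.strip() for p in rows[-1].split("|||")]
    if len(parts) != 3:
        raise ValueError("last line must be 'id ||| Query ||| Answer'")
    _sid, query, answer = parts
    return ClozeExample(sentences, query.split(), answer)


if __name__ == "__main__":
    # Toy English document, used only to illustrate the format.
    toy = [
        "1 ||| the cat sat on the mat",
        "2 ||| the dog barked",
        "3 ||| the XXXXX sat on the mat ||| cat",
    ]
    example = parse_document(toy)
    print(len(example.sentences), example.query, example.answer)
```

How consecutive documents are separated inside a data file is not stated here, so splitting a file into per-document line groups is left to the caller.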
Our data is available through GitHub
Our paper is available through
ISLRN: 343-112-755-039-0
http://www.islrn.org/resources/resources_info/7838/
Our data is released under the CC-BY-SA-4.0 licence.
If you wish to use this data in your work, please cite
@InProceedings{cui-etal-2016-consensus,
title = {Consensus Attention-based Neural Networks for Chinese Reading Comprehension},
author = {Cui, Yiming and Liu, Ting and Chen, Zhipeng and Wang, Shijin and Hu, Guoping},
booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
year = {2016},
address = {Osaka, Japan},
pages = {1777--1786},
}
DeepMind CNN / Daily Mail data: Pre-processed Data (recommended), Original Data
Children's Book Test (CBTest): Original Data
For any problems concerning the paper or data, please contact: admin [AT] ymcui [dot] com