Chinese Cloze Rc

A Chinese Cloze-style RC Dataset: People's Daily & Children's Fairy Tale (CFT)

PD&CFT: A Chinese Reading Comprehension Dataset

We release the first Chinese reading comprehension dataset, which consists of People's Daily and Children's Fairy Tale data (PD&CFT). We hope it will accelerate future research in machine comprehension.

Directory Guide

  • people_daily

    • pd.zip
      • pd.train (training file)
      • pd.valid (validation file)
      • pd.test (test file)
  • children_fairy_tale

    • cft.zip
      • cft.test.auto (automatically generated test set)
      • cft.test.human (human evaluated test set)

NOTE: As explained in the paper, the human evaluation test set does NOT consist of queries written by humans. It also contains Cloze-style queries, but with the easy ones eliminated.


Statistics

The statistics of the dataset are listed below.

                         PD-train   PD-valid   PD-test   CFT-auto   CFT-human
# Query                  870,710    3,000      3,000     1,646      1,953
Max # tokens in docs     618        536        634       318        414
Max # tokens in query    502        153        265       83         92
Avg # tokens in docs     379        425        410       122        153
Avg # tokens in query    38         38         41        20         20
Vocabulary               248,160    -          -         -          -
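
As a rough illustration of how token counts like those in the table could be computed, here is a minimal sketch. It assumes that tokens are separated by whitespace, that a document's length is the sum of its sentence lengths, and that the query placeholder counts as one token. The function name `token_stats` and the toy input are hypothetical, and the paper's exact counting rules may differ.

```python
from typing import Dict, List, Sequence, Tuple


def token_stats(samples: Sequence[Tuple[List[str], str]]) -> Dict[str, float]:
    """Compute simple token statistics over (sentences, query) pairs.

    Assumption: tokens are whitespace-separated, which may not match
    the counting used for the published statistics.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    # Document length = total number of tokens over all its sentences.
    doc_lens = [sum(len(s.split()) for s in sentences) for sentences, _ in samples]
    # Query length = number of tokens in the query, placeholder included.
    query_lens = [len(query.split()) for _, query in samples]
    return {
        "num_queries": len(samples),
        "max_doc_tokens": max(doc_lens),
        "max_query_tokens": max(query_lens),
        "avg_doc_tokens": sum(doc_lens) / len(doc_lens),
        "avg_query_tokens": sum(query_lens) / len(query_lens),
    }


# Toy input invented for illustration (not taken from the dataset).
toy = [
    (["a b c", "d e"], "a XXXXX c"),
    (["f g"], "g XXXXX h i"),
]
stats = token_stats(toy)
```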

Data Format

Here is a sample from the People's Daily data:

1 |||  1 1             2013                         
2 |||         500    29.6%   1997      
3 |||      26.5%   1996      
4 |||   38.3% 
5 |||  12 31                 
6 |||     12     78.1    11  72 
7 |||         2013  1995          
8 |||                   
9 |||      XXXXX               
10 |||                      
11 |||      XXXXX                ||| 

This document consists of 10 sentences. Each sentence has the form

sentence_id(space)|||(space)sentence

and the last line gives the Query and the Answer:

sentence_id(space)|||(space)Query(space)|||(space)Answer
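
The format above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the names `ClozeSample` and `iter_samples` are hypothetical, and it assumes that a sample ends at the first line with three `|||`-separated fields and that no sentence contains `|||` itself.

```python
from typing import Iterable, Iterator, List, NamedTuple

SEPARATOR = "|||"


class ClozeSample(NamedTuple):
    sentences: List[str]  # document sentences, in order
    query: str            # query sentence with the blank (XXXXX in the sample)
    answer: str           # word that fills the blank


def iter_samples(lines: Iterable[str]) -> Iterator[ClozeSample]:
    """Group lines into samples; a line with three fields closes a sample."""
    sentences: List[str] = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = [field.strip() for field in line.split(SEPARATOR)]
        if len(fields) == 2:
            # sentence_id ||| sentence
            sentences.append(fields[1])
        elif len(fields) == 3:
            # sentence_id ||| Query ||| Answer
            yield ClozeSample(sentences, fields[1], fields[2])
            sentences = []
        else:
            raise ValueError(f"unexpected line format: {line!r}")


# Toy English lines invented for illustration (the real data is Chinese).
toy_lines = [
    "1 ||| the cat sat on the mat",
    "2 ||| the dog barked",
    "3 ||| the XXXXX sat on the mat ||| cat",
]
samples = list(iter_samples(toy_lines))
```

Because `iter_samples` accepts any iterable of strings, an open file handle can be passed directly. The files are presumably UTF-8 encoded, but that is an assumption to verify against the actual data.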

Downloads

Our data is available through GitHub:

People Daily & Children's Fairy Tale (PD&CFT)

Paper

Our paper is available through:

ACL Anthology

arXiv Pre-print

International Standard Language Resource Number (ISLRN)

ISLRN: 343-112-755-039-0

http://www.islrn.org/resources/resources_info/7838/

Licence

Our data is released under the CC-BY-SA-4.0 licence.

Reference

If you use this data in your work, please cite:

@InProceedings{cui-etal-2016-consensus,
  title     = {Consensus Attention-based Neural Networks for Chinese Reading Comprehension},
  author    = {Cui, Yiming and Liu, Ting and Chen, Zhipeng and Wang, Shijin and Hu, Guoping},
  booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
  year      = {2016},
  address   = {Osaka, Japan},
  pages     = {1777--1786},
}

You may also be interested in

DeepMind CNN / Daily Mail data: Pre-processed Data (recommended), Original Data

Children's Book Test (CBTest): Original Data

Contact

For any problems concerning the paper or data, please contact: admin [AT] ymcui [dot] com
