Chinese Cloze RC

A Chinese Cloze-style RC Dataset: People's Daily & Children's Fairy Tale (CFT)

PD&CFT: A Chinese Reading Comprehension Dataset

We release the first Chinese reading comprehension dataset, which consists of People's Daily and Children's Fairy Tale (PD&CFT) data. We hope it will speed up future research in machine comprehension.

Directory Guide

  • people_daily

      • pd.train (training file)
      • pd.valid (validation file)
      • pd.test (test file)
  • children_fairy_tale

      • (automatically generated test set)
      • cft.test.human (human evaluated test set)

NOTE: As explained in the paper, the human-evaluated test set does NOT consist of queries written by humans. It also consists of Cloze-style queries, but with the easy ones eliminated.


The statistics of the dataset are listed below.

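|                       | PD-train | PD-valid | PD-test | CFT-auto | CFT-human |
|-----------------------|----------|----------|---------|----------|-----------|
| # Query               | 870,710  | 3,000    | 3,000   | 1,646    | 1,953     |
| Max # tokens in docs  | 618      | 536      | 634     | 318      | 414       |
| Max # tokens in query | 502      | 153      | 265     | 83       | 92        |
| Avg # tokens in docs  | 379      | 425      | 410     | 122      | 153       |
| Avg # tokens in query | 38       | 38       | 41      | 20       | 20        |
| Vocabulary            | 248,160  | -        | -       | -        | -         |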
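
One plausible way to compute figures like these is sketched below. The function name `corpus_stats`, the input layout (a list of document-token and query-token pairs), and the counting rules (document length as the total number of tokens, vocabulary taken over documents and queries together, the `XXXXX` placeholder counted like any other token) are assumptions for illustration; the exact counting procedure behind the published numbers is not described here.

```python
from typing import Dict, List, Sequence, Tuple


def corpus_stats(
    samples: Sequence[Tuple[List[str], List[str]]]
) -> Dict[str, float]:
    """Compute length and vocabulary statistics.

    Each sample is (document_tokens, query_tokens). This layout is
    hypothetical, chosen only for this sketch.
    """
    if not samples:
        raise ValueError("need at least one sample")

    doc_lens = [len(doc) for doc, _ in samples]
    query_lens = [len(query) for _, query in samples]

    # Assumption: vocabulary is the set of distinct tokens in docs and queries.
    vocab = set()
    for doc, query in samples:
        vocab.update(doc)
        vocab.update(query)

    return {
        "num_queries": len(samples),
        "max_doc_tokens": max(doc_lens),
        "max_query_tokens": max(query_lens),
        "avg_doc_tokens": sum(doc_lens) / len(doc_lens),
        "avg_query_tokens": sum(query_lens) / len(query_lens),
        "vocabulary": len(vocab),
    }
```

The averages are returned as floats; the table above appears to report them as whole numbers.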

Data Format

Here is a sample of the People's Daily data:

1 |||  1 1             2013                         
2 |||         500    29.6%   1997      
3 |||      26.5%   1996      
4 |||   38.3% 
5 |||  12 31                 
6 |||     12     78.1    11  72 
7 |||         2013  1995          
8 |||                   
9 |||      XXXXX               
10 |||                      
11 |||      XXXXX                ||| 

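This document consists of 10 sentences. As the sample shows, each sentence line has the form of a sentence number, the separator |||, and the sentence as space-separated tokens. The last line holds the query and the answer: the line number, |||, the query with the blank marked as XXXXX, |||, and the answer.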
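
The layout seen in the sample can be read with a small parser. The following is a minimal sketch, not the authors' loader: the names `ClozeSample` and `iter_samples` are hypothetical, and it assumes that a line with two `|||`-separated fields is a document sentence, that a line with three fields is the query line closing a sample, and that tokens are separated by whitespace.

```python
from typing import Iterable, Iterator, List, NamedTuple

SEP = "|||"            # field separator seen in the sample
PLACEHOLDER = "XXXXX"  # blank marker seen in the sample


class ClozeSample(NamedTuple):
    document: List[List[str]]  # tokenized sentences, in order
    query: List[str]           # tokenized query containing PLACEHOLDER
    answer: str


def iter_samples(lines: Iterable[str]) -> Iterator[ClozeSample]:
    """Yield one ClozeSample per query line (assumed to end a sample)."""
    document: List[List[str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between samples
        fields = [field.strip() for field in line.split(SEP)]
        if len(fields) == 2:
            # "<number> ||| <sentence>"
            document.append(fields[1].split())
        elif len(fields) == 3:
            # "<number> ||| <query> ||| <answer>"
            yield ClozeSample(document, fields[1].split(), fields[2])
            document = []
        else:
            raise ValueError(f"unexpected line layout: {raw!r}")
```

A file such as pd.train could then be read with `open(path, encoding="utf-8")` and passed to `iter_samples`; the UTF-8 encoding is likewise an assumption.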



Our data is available through GitHub:

People's Daily & Children's Fairy Tale (PD&CFT)


Our paper is available through:

ACL Anthology

arXiv Pre-print

International Standard Language Resource Number (ISLRN)

ISLRN: 343-112-755-039-0


Our data is released under the CC-BY-SA-4.0 licence.


If you wish to use this data in your work, please cite:

  title     = {Consensus Attention-based Neural Networks for Chinese Reading Comprehension},
  author    = {Cui, Yiming and Liu, Ting and Chen, Zhipeng and Wang, Shijin and Hu, Guoping},
  booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
  year      = {2016},
  address   = {Osaka, Japan},
  pages     = {1777--1786},

You may also be interested in:

DeepMind CNN / Daily Mail data: Pre-processed Data (recommended), Original Data

Children's Book Test (CBTest): Original Data


For any problems concerning the paper or data, please contact: admin [AT] ymcui [dot] com
