Pre Modern_chinese_corpus_dataset

A pre-modern Chinese language corpus dataset (近代汉语语料库数据集). It covers the period from the Song dynasty (10th century AD) to the Republic of China (early 20th century). All resources are plain-text (.txt) files arranged by period (Song, Yuan, Ming, Early Qing, Late Qing, and Republic of China). Each file is also labelled with its author and type of literature.

Pre-modern_Chinese_language_corpus

If you use this pre-modern Chinese corpus/dataset in a research paper or project, please cite:

蒋彦廷,潘雨婷,杨乐. 基于统计与词嵌入的近代汉语动量结构研究[J]. 西华大学学报(哲学社会科学版),2020,39(2):23−32.

JIANG Yan-ting, PAN Yu-ting, YANG Le. A Research on Verbal Classifiers Collocation in Pre-modern Chinese Based on Statistics and Word Embedding[J]. Journal of Xihua University (Philosophy & Social Sciences), 2020, 39(2): 23-32.


2020-2-18 update:

Fixed the broken download link.


2018-11-21 update:

1. Added essay (prose) texts for all 6 periods.

2. The total number of characters increased by more than 19.38 million.

3. Representative works added in this update:

- 元_散文_姚燧_牧庵集.txt
- 元_散文_戴表元_剡源文集(不含韵文部分).txt
- 元_散文_掲傒斯_文安集.txt
- 元_散文_苏天爵_元文类.txt
- 元_散文_苏天爵_滋溪文稿.txt
- 宋_散文_王安石_临川文集(不含前38卷韵文).txt
- 宋_散文_祖无择_龙学文集.txt
- 宋_散文_群星_五百家播芳大全文粹.txt
- 宋_散文_群星_宋文鉴(不含韵文部分).txt
- 宋_散文_群星_辽文萃.txt
- 宋_散文_苏轼_东坡全集(不含前33卷韵文).txt
- 明_散文_群星_明文海.txt
- 明_散文_群星_晚明二十家小品.txt
- 明_散文_群星_皇明文征(不含韵文部分).txt
- 民国_散文_巴金_巴金散文集.txt
- 民国_散文_徐志摩_徐志摩散文集.txt
- 民国_散文_朱自清_朱自清散文集.txt
- 民国_散文_杨绛_杨绛文集.TXT
- 民国_散文_梁实秋_林语堂散文集.txt
- 民国_散文_梁实秋_梁实秋散文集.txt
- 民国_散文_老舍_老舍散文集.txt
- 民国_散文_茅盾_茅盾散文集.txt
- 民国_散文_萧红_散文集.txt
- 民国_散文_郭沫若_郭沫若散文选集.txt
- 民国_散文_鲁迅_鲁迅文集.txt
- 清_散文_刘文武_清文精选(不含晚清梁启超林纾等).txt
- 清_散文_游戏主人_笑林广记.txt
- 清_散文_群星_皇清文颖.txt
- 清末_散文_群星_晚清文选.txt
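
The file names listed above appear to follow a `dynasty_genre_author_title.txt` pattern. The sketch below splits a name into those four fields, assuming the pattern holds for every file in the corpus. `CorpusFile` and `parse_corpus_filename` are hypothetical names used only for illustration; they are not part of the dataset.

```python
from pathlib import PurePath
from typing import NamedTuple


class CorpusFile(NamedTuple):
    """Metadata recovered from a corpus file name (hypothetical helper type)."""
    dynasty: str
    genre: str
    author: str
    title: str


def parse_corpus_filename(name: str) -> CorpusFile:
    """Split 'dynasty_genre_author_title.txt' into its four fields.

    Assumes the first three underscores separate the fields; any further
    underscores are kept as part of the title.
    """
    stem = PurePath(name).stem  # drops the extension, whether .txt or .TXT
    parts = stem.split("_", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected file name layout: {name!r}")
    return CorpusFile(*parts)


info = parse_corpus_filename("宋_散文_苏轼_东坡全集(不含前33卷韵文).txt")
print(info.dynasty, info.genre, info.author, info.title)
```

Splitting with a maximum of three cuts keeps a title intact even if it happens to contain an underscore.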


1.【Introduction】

This is a pre-modern Chinese language corpus of more than 280 million characters.

The corpus totals more than 966 MB across 968 text files, all encoded in UTF-8. The files are arranged in dynasty order (Song, Yuan, Ming, Early Qing, Late Qing, and Republic of China).

Each file is also labelled with its author and type of literature.
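
After downloading, a quick check against the figures above (968 files, more than 966 MB, UTF-8) can confirm the copy is complete. The sketch below assumes the corpus has been unpacked under a single local directory. The directory layout and the helper names `summarize_corpus` and `read_corpus_text` are assumptions, not something the dataset defines.

```python
from pathlib import Path
from typing import Tuple


def summarize_corpus(root: str) -> Tuple[int, int]:
    """Return (number of .txt files, total size in bytes) found under root."""
    files = [
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".txt"  # also matches .TXT
    ]
    return len(files), sum(p.stat().st_size for p in files)


def read_corpus_text(path: str) -> str:
    """Read one corpus file, decoding it as UTF-8 as the README states."""
    return Path(path).read_text(encoding="utf-8")


# Example usage (the path is hypothetical):
# n_files, n_bytes = summarize_corpus("path/to/corpus")
# print(n_files, "files,", round(n_bytes / 1024 ** 2), "MB")
```

The extension is compared case-insensitively because at least one file in the corpus uses `.TXT`.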

2.【Application areas of this corpus】

These language resources can support research in literature, philology, history, linguistics, the arts, traditional Chinese medicine, and the history of science and technology, as well as Chinese teaching, data mining, automatic text classification, and so on.

3.【Types of language resources】

The corpus includes the following types of literature:

(1) Poetry (诗歌);

(2) "Ci" lyric poetry (词);

(3) Drama (剧曲);

(4) Novels and huaben vernacular stories (小说话本);

(5) Military literature (军事类);

(6) Chinese medical literature (中医类);

(7) Arts and crafts literature (技艺类), e.g. musical instruments, chess, calligraphy, cooking, tea, and Chinese kung fu;

(8) Mathematical and physical sciences (数理科学): math, algorithms, astronomy, chemistry, physics;

(9) Agricultural literature (农业类);

(10) History and geography literature (历史地理类);

(11) Essays, i.e. non-rhymed prose (散文类).

4.【Language classification】

All the language resources are separated into 6 parts: (1) Song dynasty, (2) Yuan dynasty, (3) Ming dynasty, (4) Early Qing dynasty (1644-1840), (5) Late Qing dynasty (1840-1911), (6) Republic of China (1912-1948).

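5.【Number of characters in each category (excluding punctuation)】

| Period | Essays | Novels and huaben | History and geography | Poetry and Ci | Medicine | Agriculture | Drama | Math and sciences | Arts and crafts | Military | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Song | 5820561 | 141317 | 12835787 | 1680594 | 5419232 | 18930 | 0 | 285620 | 33288 | 445545 | 26680874 |
| Yuan | 1319350 | 1378162 | 5375872 | 2835050 | 1869542 | 189182 | 2423584 | 116977 | 50850 | 0 | 15558569 |
| Ming | 6423460 | 17357555 | 27279817 | 929987 | 15728504 | 552105 | 2639445 | 1454890 | 187069 | 803206 | 73356038 |
| Early Qing | 882491 | 33290363 | 39011391 | 544178 | 10659597 | 5692 | 1040341 | 3749246 | 501007 | 0 | 89684306 |
| Late Qing | 744835 | 9436857 | 19075096 | 124220 | 511873 | 0 | 1411883 | 0 | 0 | 19670 | 31324434 |
| Republic of China | 3853165 | 9458024 | 20204169 | 160852 | 319042 | 0 | 427896 | 0 | 0 | 136671 | 34559819 |
| Total | 19043862 | 71062278 | 123782132 | 6274881 | 34507790 | 765909 | 7943149 | 5606733 | 772214 | 1405092 | 271164040 |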
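
The counts in this section exclude punctuation, but the README does not state the exact counting rule. The sketch below shows one possible approach, assuming that whitespace and every character in a Unicode punctuation category (`P*`) are skipped. The `count_characters` helper is hypothetical and may not reproduce the figures above exactly.

```python
import unicodedata


def count_characters(text: str) -> int:
    """Count characters, skipping whitespace and Unicode punctuation.

    This is an assumed rule, not necessarily the one used for the table.
    """
    return sum(
        1
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


sample = "子曰:“学而时习之,不亦说乎?”"
print(count_characters(sample))  # 11
```

Full-width marks such as `,` and `?` fall in the `Po` category and the curly quotes in `Pi`/`Pf`, so the check on the category prefix covers both Chinese and Western punctuation.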

6.【Where to download these language resources?】

To obtain the corpus, email [email protected], add QQ 3225357264, or add WeChat ID jyt629000.

If you have any questions, or want to help enlarge this free, open corpus, please contact the editor, Jiang Yanting ([email protected]). Thanks!
