Awesome Open Source
Advertising
📦 10
All Projects
Application Programming Interfaces
📦 124
Applications
📦 192
Artificial Intelligence
📦 78
Blockchain
📦 73
Build Tools
📦 113
Cloud Computing
📦 80
Code Quality
📦 28
Collaboration
📦 32
Command Line Interface
📦 49
Community
📦 83
Companies
📦 60
Compilers
📦 63
Computer Science
📦 80
Configuration Management
📦 42
Content Management
📦 175
Control Flow
📦 213
Data Formats
📦 78
Data Processing
📦 276
Data Storage
📦 135
Economics
📦 64
Frameworks
📦 215
Games
📦 129
Graphics
📦 110
Hardware
📦 152
Integrated Development Environments
📦 49
Learning Resources
📦 166
Legal
📦 29
Libraries
📦 129
Lists Of Projects
📦 22
Machine Learning
📦 347
Mapping
📦 64
Marketing
📦 15
Mathematics
📦 55
Media
📦 239
Messaging
📦 98
Networking
📦 315
Operating Systems
📦 89
Operations
📦 121
Package Managers
📦 55
Programming Languages
📦 245
Runtime Environments
📦 100
Science
📦 42
Security
📦 396
Social Media
📦 27
Software Architecture
📦 72
Software Development
📦 72
Software Performance
📦 58
Software Quality
📦 133
Text Editors
📦 49
Text Processing
📦 136
User Interface
📦 330
User Interface Components
📦 514
Version Control
📦 30
Virtualization
📦 71
Web Browsers
📦 42
Web Servers
📦 26
Web User Interface
📦 210
The Top 52 Corpus Open Source Projects
Nlp_chinese_corpus
⭐
5,772
Large Scale Chinese Corpus for NLP
Corpora
⭐
3,968
A collection of small corpora of interesting data for the creation of bots and similar stuff.
Chinese Names Corpus
⭐
2,642
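Chinese personal-name corpus. Name generator. Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, English names. Useful for Chinese word segmentation and person-name entity recognition.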
Awesome Deeplearning Resources
⭐
2,348
Deep learning and deep reinforcement learning research papers and some code
Weibo_terminater
⭐
2,284
Final Weibo crawler: scrape anything from Weibo, including comments, post contents, followers, anything. The Terminator
Awesome Chatbot
⭐
1,650
Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:
Clue
⭐
1,574
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Dialog_corpus
⭐
1,549
Datasets for training Chinese and English chatbot (dialogue) systems
Cluedatasetsearch
⭐
1,399
Search all Chinese NLP datasets, plus commonly used English NLP datasets
Chatterbot Corpus
⭐
935
A multilingual dialog corpus
Company Names Corpus
⭐
829
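Company-name corpus. Organization-name corpus. Company short names, abbreviations, brand terms, enterprise names. Useful for Chinese word segmentation and organization-name entity recognition.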
Insuranceqa Corpus Zh
⭐
811
🚁 Insurance industry corpus, chatbot
Seq2seq Chatbot
⭐
770
Chatbot in 200 lines of code using TensorLayer
Quanteda
⭐
629
An R package for the Quantitative Analysis of Textual Data
Weixin_public_corpus
⭐
459
Corpus of WeChat official accounts
Small Chinese Corpus
⭐
456
Some useful small Chinese corpus datasets
Cluepretrainedmodels
⭐
442
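A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and specialized similarity models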
Awesome Persian Nlp Ir
⭐
441
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Bookcorpus
⭐
431
Crawl BookCorpus
Chinese Nlp Corpus
⭐
397
Collections of Chinese NLP corpora
Fuzzdata
⭐
367
Fuzzing resources for feeding various fuzzers with input. 🔧
Wordless
⭐
346
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Cluecorpus2020
⭐
256
Large-scale pre-training corpus for Chinese (100 GB)
Fakenewscorpus
⭐
248
A dataset of millions of news articles scraped from a curated list of data sources.
Korpora
⭐
230
Korean corpus repository
Nlvr
⭐
188
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Efaqa Corpus Zh
⭐
156
❤️ Emotional First Aid Dataset: a psychological counseling Q&A and chatbot corpus
Wp2txt
⭐
144
WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata.
Nlp_bahasa_resources
⭐
144
A Curated List of Datasets and Usable Library Resources for NLP in Bahasa Indonesia
Gossiping Chinese Corpus
⭐
136
Chinese question-answer corpus from the PTT Gossiping board
Prosody
⭐
136
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Code Docstring Corpus
⭐
132
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Khcoder
⭐
121
KH Coder: for Quantitative Content Analysis or Text Mining
Indonesian Nlp Resources
⭐
115
Data resources for Indonesian NLP
Colibri Core
⭐
112
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
Awesome Hungarian Nlp
⭐
111
A curated list of NLP resources for Hungarian
Sejong Corpus
⭐
110
Korean sejong corpus download and simple analysis
Datasets
⭐
104
Poetry-related datasets developed by THUAIPoet (Jiuge) group.
Pubmed Rct
⭐
101
PubMed 200k RCT dataset: a large dataset for sequential sentence classification.
Pansori
⭐
99
Tools for ASR Corpus Generation from Online Video
Chi Corpus
⭐
96
Mr. Chi's corpus
Lexicon Thai
⭐
92
Thai lexicon (vocabulary repository)
Pyclue
⭐
85
Python toolkit for the Chinese Language Understanding Evaluation (CLUE) benchmark
Ja.text8
⭐
80
Japanese text8 corpus for word embedding.
Russian_news_corpus
⭐
76
Russian mass media stemmed texts corpus / Corpus of lemmatized (morphologically normalized) texts from Russian mass media
Dataset List
⭐
72
lists of text corpus and more (mainly Japanese)
Blacklab
⭐
69
A corpus retrieval engine based on Apache Lucene
Coarij
⭐
55
Corpus of Annual Reports in Japan
Mitie_chinese_wikipedia_corpus
⭐
43
Pre-trained Wikipedia corpus by MITIE
Typing Assistant
⭐
30
Typing Assistant autocompletes words and suggests predictions for the next word. This makes typing faster and more intelligent, and reduces effort.
Lyrics Corpora
⭐
13
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts
Naive Bayes Classifier
⭐
6
The Naive Bayes classifier is a classification algorithm. It uses the Bernoulli and multinomial Naive Bayes equations to classify documents (text) as ham or spam.
1-52 of 52 projects