Awesome Open Source
Search results for dataset corpus
159 search results found
Nlp_chinese_corpus
⭐
8,344
Large Scale Chinese Corpus for NLP
Chinese Names Corpus
⭐
3,719
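Chinese personal names corpus. Name generator. Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, English names. Can be used for Chinese word segmentation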
Clue
⭐
3,345
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Cluedatasetsearch
⭐
2,778
Search all Chinese NLP datasets, with commonly used English NLP datasets included
Dialog_corpus
⭐
1,487
Corpora for training Chinese and English dialogue systems (Datasets for Training Chatbot System)
Entity Recognition Datasets
⭐
1,386
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Company Names Corpus
⭐
1,106
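Company names corpus. Organization names corpus. Company short names, abbreviations, brand terms, enterprise names. Can be used for Chinese word segmentation and organization-name entity recognition.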
Insuranceqa Corpus Zh
⭐
989
🚁 Insurance industry corpus, chatbot
Cdial Gpt
⭐
944
A Large-scale Chinese Short-Text Conversation Dataset and Chinese pre-training dialog models
Nlp Datasets
⭐
871
A list of datasets/corpora for NLP tasks, in reverse chronological order.
Voice_datasets
⭐
846
🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).
Ngram2vec
⭐
638
Four word embedding models implemented in Python, supporting arbitrary context features
Ubuntu Ranking Dataset Creator
⭐
570
A script that creates train, valid and test datasets for the ranking task from Ubuntu corpus dialogs.
Dl_eventextractionpapers
⭐
555
A collection of papers on deep-learning-based event extraction published since 2015
Text_renderer
⭐
543
Cluepretrainedmodels
⭐
536
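A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and specialized similarity models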
Cluecorpus2020
⭐
517
Large-scale Pre-training Corpus for Chinese (100G)
Cblue
⭐
515
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark (a benchmark for Chinese medical information processing)
Efaqa Corpus Zh
⭐
505
❤️ Emotional First Aid Dataset: psychological counseling Q&A and chatbot corpus
Korpora
⭐
500
Korean corpus repository
Gensim Data
⭐
492
Data repository for pretrained NLP models and NLP corpora.
Paws
⭐
403
This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification.
Chinese Nlp Corpus
⭐
378
Collections of Chinese NLP corpora
Github Typo Corpus
⭐
289
GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors
Nlp_bahasa_resources
⭐
260
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
Multi Criteria Cws
⭐
260
Simple Solution for Multi-Criteria Chinese Word Segmentation
Nsmc
⭐
259
Naver sentiment movie corpus
Corus
⭐
254
Links to Russian corpora + Python functions for loading and parsing
Ua Gec
⭐
246
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Lotclass
⭐
231
[EMNLP 2020] Text Classification Using Label Names Only: A Language Model Self-Training Approach
Awesome Hungarian Nlp
⭐
192
A curated list of NLP resources for Hungarian
Unify Emotion Datasets
⭐
189
A Survey and Experiments on Annotated Corpora for Emotion Classification in Text
Fakenewscorpus
⭐
184
A dataset of millions of news articles scraped from a curated list of data sources.
Robbert
⭐
180
A Dutch RoBERTa-based language model
Awesome Nlp Polish
⭐
169
A curated list of resources dedicated to Natural Language Processing (NLP) in Polish. Models, tools, datasets.
Pubmed Rct
⭐
166
PubMed 200k RCT dataset: a large dataset for sequential sentence classification.
Ml Datasets
⭐
161
Machine Learning datasets for Nepal
Awesome Scholarly Data Analysis
⭐
161
A curated collection of resources on scholarly data analysis ranging from datasets, papers, and code about bibliometrics, citation analysis, and other scholarly commons resources.
Tumcc
⭐
158
A corpus for identifying Chinese dark jargon in Telegram underground markets. Telegram Underground Market Chinese Corpus. Paper: Identification of Chinese Dark Jargons in Telegram Underground Markets Using Context-Oriented and Linguistic Features (IP&M, 2022).
Conversation Tensorflow
⭐
144
TensorFlow implementation of Conversation Models
Gossiping Chinese Corpus
⭐
136
Chinese question-answer corpus from the PTT Gossiping board
Pre Modern_chinese_corpus_dataset
⭐
132
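Pre-modern Chinese corpus dataset. Natural language processing, corpus, ancient Chinese, Old Chinese, Classical Chinese, digital humanities, computational linguistics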
Tts Portuguese Corpus
⭐
125
Open Source Text To Speech Portuguese Dataset
How2 Dataset
⭐
125
This repository contains code and metadata of How2 dataset
Open Korean Corpora
⭐
117
Open Korean NLP Dataset Curation for the Users All Around the Globe
Mtdata
⭐
115
A tool that locates, downloads, and extracts machine translation corpora
Prosody
⭐
104
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Neural Code Search Evaluation Dataset
⭐
98
Evaluation dataset consisting of natural language query and code snippet pairs
Indonesian Nlp Resources
⭐
98
Data resources for NLP in Bahasa Indonesia
Speech Corpus Collection
⭐
87
A Collection of Speech Corpora for ASR and TTS
Phrase At Scale
⭐
84
Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English
Sova Dataset
⭐
82
Datasets
⭐
78
Poetry-related datasets developed by THUAIPoet (Jiuge) group.
Curation Corpus
⭐
77
Code for obtaining the Curation Corpus abstractive text summarisation dataset
Hypenet
⭐
76
Integrated path-based and distributional method for hypernymy detection
Fcgec
⭐
75
The Corpus & Code for EMNLP 2022 paper "FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction" | FCGEC Chinese grammatical error correction corpus and STG model
Gutenberg
⭐
74
Pipeline to generate the Standardized Project Gutenberg Corpus
Xqa
⭐
74
Dataset and baseline for ACL 2019 paper "XQA: A Cross-lingual Open-domain Question Answering Dataset"
Laboro Bert Japanese
⭐
68
Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus
Gpt 2 Training
⭐
65
Training GPT-2 on a Russian language corpus
Danes
⭐
65
DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Quasar
⭐
64
Datasets for Question Answering by Search and Reading
Query Wellformedness
⭐
63
25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.
Dialogue Datasets
⭐
61
A collection of plain text dialogue datasets
Awesome Nlp Chinese Corpus
⭐
59
A curated list of resources on Chinese corpora for NLP (Natural Language Processing)
Video_music_book_datasets
⭐
57
NLP NER datasets video/music/book bio
Convolutional_seq2seq
⭐
56
fairseq: Convolutional Sequence to Sequence Learning (Gehring et al. 2017) by Chainer
Bert Commonsense
⭐
56
Code for papers "A Surprisingly Robust Trick for Winograd Schema Challenge" and "WikiCREM: A Large Unsupervised Corpus for Coreference Resolution"
Askubuntu
⭐
54
AskUbuntu Question Dataset
Coarij
⭐
54
Corpus of Annual Reports in Japan
Automatic Corpus Generation
⭐
53
This repository is for the paper "A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check"
Turkish Glove
⭐
51
Turkish GloVe - Repository for Turkish GloVe Word Embeddings
Tamil Nlp Catalog
⭐
51
Awesome List of Tamil NLP & AI Resources
Cross Language Dataset
⭐
50
A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection
Qasc
⭐
49
Repository for the Question Answering via Sentence Composition (QASC) dataset
3i4k
⭐
49
Intonation-aided intention identification for Korean
Nlp Corpora
⭐
49
List of NLP (Natural Language Processing) Corpora.
Personas
⭐
48
Datasets for Deep learning Personas
When In Rome
⭐
45
meta-corpus of and code library for the functional harmonic analysis of music
Image Verification Corpus
⭐
45
This contains an evolving dataset of fake and real images shared in social media.
Book Names Corpus
⭐
45
Book titles corpus. Also includes some film and game titles.
Cluemotionanalysis2020
⭐
42
CLUE Emotion Analysis Dataset: a fine-grained sentiment analysis dataset
Cnli
⭐
42
Baseline for the CNLI corpus
Science Result Extractor
⭐
42
Asset
⭐
42
A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations
Producttitlesummarizationcorpus
⭐
41
Dataset for CIKM 2018 paper "Multi-Source Pointer Network for Product Title Summarization"
Dialog Processing
⭐
41
NLG and NLU for dialogue processing
Odsqa
⭐
41
ODSQA: Open-Domain Spoken Question Answering Dataset
Ewiser
⭐
40
A Word Sense Disambiguation system integrating implicit and explicit external knowledge.
Autocorpus
⭐
38
AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets. Autocorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines.
Medical Names Corpus
⭐
38
Medical corpus. Medical institution names corpus. Drug standard codes.
Shabby Pages
⭐
34
ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground truth and distorted versions appropriate for use in training models to reverse distortions and recover to original denoised documents.
Open Australian Legal Corpus Creator
⭐
34
The code used to create and update the Open Australian Legal Corpus, the first and only multijurisdictional open corpus of Australian legislative and judicial documents.
Open2ch Dialogue Corpus
⭐
34
A dialogue corpus created by crawling Open2ch (おーぷん2ちゃんねる)
Feidegger
⭐
34
A Multi-modal Corpus of Fashion Images and Descriptions in German
Voxceleb
⭐
34
Mirror of the VoxCeleb dataset, a large-scale speaker identification dataset
Phomt
⭐
33
PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation (EMNLP 2021)
Vizml
⭐
33
Plotly dataset-visualization pairs, feature extraction scripts, and model training code for VizML (CHI 2019)
Machinelearningphishing
⭐
32
This project will determine which of the five supervised classification machine learning algorithms performs best in detecting phishy emails
Autoie
⭐
32
This is the repository for NLPCC2020 task AutoIE
1-100 of 159 search results