Awesome Open Source

Search results for dataset corpus

159 search results found

Nlp_chinese_corpus ⭐ 8,344

Large Scale Chinese Corpus for NLP

Chinese Names Corpus ⭐ 3,719

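Chinese personal names corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Can be used for Chinese word segmentation.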

Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard

Cluedatasetsearch ⭐ 2,778

Search all Chinese NLP datasets, with commonly used English NLP datasets included

Dialog_corpus ⭐ 1,487

Datasets for training Chinese and English dialogue (chatbot) systems

Entity Recognition Datasets ⭐ 1,386

A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

Company Names Corpus ⭐ 1,106

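Company names corpus and organization names corpus. Covers company short names, abbreviations, brand terms, and enterprise names. Can be used for Chinese word segmentation and organization-name entity recognition.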

Insuranceqa Corpus Zh ⭐ 989

🚁 Insurance industry corpus, chatbot

Cdial Gpt ⭐ 944

A Large-scale Chinese Short-Text Conversation Dataset and Chinese pre-training dialog models

Nlp Datasets ⭐ 871

A list of datasets/corpora for NLP tasks, in reverse chronological order.

Voice_datasets ⭐ 846

🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).

Ngram2vec ⭐ 638

Four word embedding models implemented in Python. Supporting arbitrary context features

Ubuntu Ranking Dataset Creator ⭐ 570

A script that creates train, valid and test datasets for the ranking task from Ubuntu corpus dialogs.

Dl_eventextractionpapers ⭐ 555

A collection of papers since 2015 on event extraction using deep learning methods

Text_renderer ⭐ 543

Cluepretrainedmodels ⭐ 536

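A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and specialized similarity models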

Cluecorpus2020 ⭐ 517

Large-scale Pre-training Corpus for Chinese (100G)

CBLUE, a benchmark for Chinese medical information processing: A Chinese Biomedical Language Understanding Evaluation Benchmark

Efaqa Corpus Zh ⭐ 505

❤️ Emotional First Aid Dataset: a corpus for psychological counseling Q&A and chatbots

Korpora ⭐ 500

Korean corpus repository

Gensim Data ⭐ 492

Data repository for pretrained NLP models and NLP corpora.

This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification.

Chinese Nlp Corpus ⭐ 378

Collections of Chinese NLP corpus

Github Typo Corpus ⭐ 289

GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

Nlp_bahasa_resources ⭐ 260

A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia

Multi Criteria Cws ⭐ 260

Simple Solution for Multi-Criteria Chinese Word Segmentation

Naver sentiment movie corpus

Links to Russian corpora + Python functions for loading and parsing

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Lotclass ⭐ 231

[EMNLP 2020] Text Classification Using Label Names Only: A Language Model Self-Training Approach

Awesome Hungarian Nlp ⭐ 192

A curated list of NLP resources for Hungarian

Unify Emotion Datasets ⭐ 189

A Survey and Experiments on Annotated Corpora for Emotion Classification in Text

Fakenewscorpus ⭐ 184

A dataset of millions of news articles scraped from a curated list of data sources.

Robbert ⭐ 180

A Dutch RoBERTa-based language model

Awesome Nlp Polish ⭐ 169

A curated list of resources dedicated to Natural Language Processing (NLP) in Polish. Models, tools, datasets.

Pubmed Rct ⭐ 166

PubMed 200k RCT dataset: a large dataset for sequential sentence classification.

Ml Datasets ⭐ 161

Machine Learning datasets for Nepal

Awesome Scholarly Data Analysis ⭐ 161

A curated collection of resources on scholarly data analysis ranging from datasets, papers, and code about bibliometrics, citation analysis, and other scholarly commons resources.

Corpus for identifying Chinese dark jargon in Telegram underground markets (Telegram Underground Market Chinese Corpus). Paper: Identification of Chinese Dark Jargons in Telegram Underground Markets Using Context-Oriented and Linguistic Features (IP&M, 2022).

Conversation Tensorflow ⭐ 144

TensorFlow implementation of Conversation Models

Gossiping Chinese Corpus ⭐ 136

Chinese question-answer corpus from the PTT Gossiping board

Pre Modern_chinese_corpus_dataset ⭐ 132

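Pre-modern Chinese corpus dataset; NLP corpus; ancient Chinese, Classical Chinese, literary Chinese; digital humanities; computational linguistics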

Tts Portuguese Corpus ⭐ 125

Open Source Text To Speech Portuguese Dataset

How2 Dataset ⭐ 125

This repository contains code and metadata of How2 dataset

Open Korean Corpora ⭐ 117

Open Korean NLP Dataset Curation for the Users All Around the Globe

A tool that locates, downloads, and extracts machine translation corpora

Prosody ⭐ 104

Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text

Neural Code Search Evaluation Dataset ⭐ 98

An evaluation dataset consisting of natural language query and code snippet pairs

Indonesian Nlp Resources ⭐ 98

Data resources for NLP in Bahasa Indonesia

Speech Corpus Collection ⭐ 87

A Collection of Speech Corpus for ASR and TTS

Phrase At Scale ⭐ 84

Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English

Sova Dataset ⭐ 82

Datasets ⭐ 78

Poetry-related datasets developed by THUAIPoet (Jiuge) group.

Curation Corpus ⭐ 77

Code for obtaining the Curation Corpus abstractive text summarisation dataset

Integrated path-based and distributional method for hypernymy detection

The corpus and code for the EMNLP 2022 paper "FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction" | FCGEC Chinese grammatical error correction corpus and STG model

Gutenberg ⭐ 74

Pipeline to generate the Standardized Project Gutenberg Corpus

Dataset and baseline for ACL 2019 paper "XQA: A Cross-lingual Open-domain Question Answering Dataset"

Laboro Bert Japanese ⭐ 68

Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus

Gpt 2 Training ⭐ 65

Training GPT-2 on a Russian language corpus

DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

Datasets for Question Answering by Search and Reading

Query Wellformedness ⭐ 63

25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.

Dialogue Datasets ⭐ 61

A collection of plain text dialogue datasets

Awesome Nlp Chinese Corpus ⭐ 59

A curated list of Chinese corpus resources for NLP (Natural Language Processing)

Video_music_book_datasets ⭐ 57

NLP NER datasets video/music/book bio

Convolutional_seq2seq ⭐ 56

fairseq: Convolutional Sequence to Sequence Learning (Gehring et al. 2017) by Chainer

Bert Commonsense ⭐ 56

Code for papers "A Surprisingly Robust Trick for Winograd Schema Challenge" and "WikiCREM: A Large Unsupervised Corpus for Coreference Resolution"

Askubuntu ⭐ 54

AskUbuntu Question Dataset

Corpus of Annual Reports in Japan

Automatic Corpus Generation ⭐ 53

This repository is for the paper "A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check"

Turkish Glove ⭐ 51

Türkçe GloVe - Repository for Turkish GloVe Word Embeddings

Tamil Nlp Catalog ⭐ 51

Awesome List of Tamil NLP & AI Resources

Cross Language Dataset ⭐ 50

A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection

Repository for the Question Answering via Sentence Composition (QASC) dataset

Intonation-aided intention identification for Korean

Nlp Corpora ⭐ 49

List of NLP (Natural Language Processing) Corpora.

Personas ⭐ 48

Datasets for Deep learning Personas

When In Rome ⭐ 45

A meta-corpus and code library for the functional harmonic analysis of music

Image Verification Corpus ⭐ 45

This contains an evolving dataset of fake and real images shared in social media.

Book Names Corpus ⭐ 45

Book titles corpus. Also includes some movie and game titles.

Cluemotionanalysis2020 ⭐ 42

CLUE Emotion Analysis Dataset: a fine-grained sentiment analysis dataset

Baseline for the CNLI corpus

Science Result Extractor ⭐ 42

A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations

Producttitlesummarizationcorpus ⭐ 41

Dataset for CIKM 2018 paper "Multi-Source Pointer Network for Product Title Summarization"

Dialog Processing ⭐ 41

NLG and NLU for dialogue processing

ODSQA: Open-Domain Spoken Question Answering Dataset

A Word Sense Disambiguation system integrating implicit and explicit external knowledge.

Autocorpus ⭐ 38

AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets. Autocorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines.

Medical Names Corpus ⭐ 38

Medical corpus. Medical institution names corpus. Drug standard codes.

Shabby Pages ⭐ 34

ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground truth and distorted versions appropriate for use in training models to reverse distortions and recover to original denoised documents.

Open Australian Legal Corpus Creator ⭐ 34

The code used to create and update the Open Australian Legal Corpus, the first and only multijurisdictional open corpus of Australian legislative and judicial documents.

Open2ch Dialogue Corpus ⭐ 34

A dialogue corpus built by crawling Open2ch (おーぷん2ちゃんねる)

Feidegger ⭐ 34

A Multi-modal Corpus of Fashion Images and Descriptions in German

Voxceleb ⭐ 34

mirror of VoxCeleb dataset - a large-scale speaker identification dataset

PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation (EMNLP 2021)

Plotly dataset-visualization pairs, feature extraction scripts, and model training code for VizML (CHI 2019)

Machinelearningphishing ⭐ 32

This project will determine which of five supervised classification machine learning algorithms performs best at detecting phishing emails

This is the repository for NLPCC2020 task AutoIE

1-100 of 159 search results
