Awesome Open Source

Search results for dataset natural language processing

205 search results found

Dialogsum ⭐ 153

DialogSum: A Real-life Scenario Dialogue Summarization Dataset - Findings of ACL 2021

Awesome Ukrainian Nlp ⭐ 146

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

DaNLP is a repository for Natural Language Processing resources for the Danish Language.

Emotion_dataset ⭐ 140

😄 Dataset for Emotion Classification

Compact high quality word embeddings for Russian language

Summarus ⭐ 140

Models for automatic abstractive summarization

Machine Learning Resources ⭐ 137

A curated list of awesome machine learning frameworks, libraries, courses, books and many more.

Commongen ⭐ 136

A Constrained Text Generation Challenge Towards Generative Commonsense Reasoning

Existing Medical Qa Datasets ⭐ 135

Multimodal Question Answering in the Medical Domain: A summary of Existing Datasets and Systems

Twitter Sentiment Cnn ⭐ 133

An implementation in TensorFlow of a convolutional neural network (CNN) to perform sentiment classification on tweets.

Repository for paper "SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference"

Pre Modern_chinese_corpus_dataset ⭐ 132

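Pre-modern Chinese corpus dataset: natural language processing corpus, ancient Chinese, classical Chinese, digital humanities, computational linguistics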

Thermostat ⭐ 131

Collection of NLP model explanations and accompanying analysis tools

Chatgpt Retrievalqa ⭐ 130

A dataset for training/evaluating Question Answering Retrieval models on ChatGPT responses, with the option of training/evaluating on real human responses.

Mongolian Nlp ⭐ 126

Useful resources for Mongolian NLP

Chariot ⭐ 121

Deliver the ready-to-train data to your NLP model.

Lecturebank ⭐ 121

LectureBank Dataset

Code for the ACL 2018 paper "Neural Document Summarization by Jointly Learning to Score and Select Sentences"

Open Korean Corpora ⭐ 117

Open Korean NLP Dataset Curation for the Users All Around the Globe

Hierarchical Attention Network ⭐ 117

Implementation of Hierarchical Attention Networks in PyTorch

Awesome Llm Human Preference Datasets ⭐ 116

A curated list of Human Preference Datasets for LLM fine-tuning, RLHF, and eval.

Mol Instructions ⭐ 116

Mol-Instructions is a Large-Scale Biomolecules Instruction Dataset for Large Language Models.

Active Nlp ⭐ 116

Bayesian Deep Active Learning for Natural Language Processing Tasks

A tool that locates, downloads, and extracts machine translation corpora

BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision

Fnc 1 Baseline ⭐ 113

A baseline implementation for FNC-1

Detecting Scientific Claim ⭐ 111

Extracting scientific claims from biomedical abstracts (powered by AllenNLP)

Machine reading comprehension on clinical case reports

Ask2transformers ⭐ 107

A Framework for Textual Entailment based Zero Shot text classification

Bertqa Attention On Steroids ⭐ 105

BertQA - Attention on Steroids

Falcon2.0 ⭐ 104

Falcon 2.0 is a joint entity and relation linking tool over Wikidata.

Prosody ⭐ 104

Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text

Recon NER: debug and correct annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data.

Cnn Text Classification ⭐ 101

Text classification with Convolutional Neural Networks on Yelp, IMDB & sentence polarity dataset v1.0

Financialdatasets ⭐ 100

SmoothNLP public financial text datasets, for NLP research only

Indonesian Nlp Resources ⭐ 98

Data resources for Indonesian-language NLP

We introduce MKQA, an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. For details, please refer to our paper, "MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering".

Korean Hate Speech ⭐ 93

Korean HateSpeech Dataset

Pytorch_gbw_lm ⭐ 90

PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

Enso: An Open Source Library for Benchmarking Embeddings + Transfer Learning Methods

Awesome_multimodel_llm ⭐ 89

Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLM). It covers datasets, tuning techniques, in-context learning, visual reasoning, foundational models, and more. Stay updated with the latest advancements.

Original implementation of the EMNLP 2020 paper "AmbigQA: Answering Ambiguous Open-domain Questions"

Sentiment ⭐ 85

An example project using a feed-forward neural network for text sentiment classification trained with 25,000 movie reviews from the IMDB website.

Phrase At Scale ⭐ 84

Detect common phrases in large amounts of text using a data-driven approach. Discovered phrases can be of arbitrary size. Can be used for languages other than English.

Persianqa ⭐ 84

Persian (Farsi) Question Answering Dataset (+ Models)

Pytreebank ⭐ 83

😡😇 Stanford Sentiment Treebank loader in Python

Fastlorachat ⭐ 83

Instruct-tune LLaMA on consumer hardware with shareGPT data

Dialogue Understanding ⭐ 82

This repository contains a PyTorch implementation of the baseline models from the paper "Utterance-level Dialogue Understanding: An Empirical Study"

Chabsa Dataset ⭐ 82

chakki's Aspect-Based Sentiment Analysis dataset

Marathinlp ⭐ 80

Marathi NLP is a repository dedicated to the development of tools and resources for the Marathi language.

Kobert Ner ⭐ 79

NER Task with KoBERT (with Naver NLP Challenge dataset)

Nlp Models ⭐ 77

NLP research experiments, built on PyTorch within the AllenNLP framework.

Doccano Client ⭐ 76

A simple client for doccano API.

Canrevan ⭐ 75

A library for collecting Naver news articles in bulk.

Text Segmentation ⭐ 73

Implementation of the paper: Text Segmentation as a Supervised Learning Task

Writing Editing Network ⭐ 72

Code for "Paper Abstract Writing through Editing Mechanism"

Mams For Absa ⭐ 72

A Multi-Aspect Multi-Sentiment Dataset for aspect-based sentiment analysis.

Wiki Split ⭐ 72

One million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits.

News Headlines Dataset For Sarcasm Detection ⭐ 68

High quality dataset for the task of Sarcasm Detection

Turkish Nlp Resources ⭐ 67

🔡 List of Tools, Libraries, Models, Datasets and other resources for Turkish NLP.

Name2nat ⭐ 66

name2nat: a Python package for nationality prediction from a name

XFUND: A Multilingual Form Understanding Benchmark

Farstail ⭐ 66

FarsTail: a Persian natural language inference dataset

DANeS is an open-source e-newspaper dataset created in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages (ACL'23)

The repo containing the Critical Role Dungeons and Dragons Dataset.

State of the art open-source translation for Indic languages.

The first-ever vast natural language generation benchmark for Indonesian, Sundanese, and Javanese. We provide multiple downstream tasks, pre-trained IndoGPT and IndoBART models, and starter code! (EMNLP 2021)

Simple Questions Generate Named Entity Recognition Datasets (EMNLP 2022)

N3 Collection ⭐ 63

N3 - A Collection of Datasets for Named Entity Recognition and Disambiguation in the NLP Interchange Format

Query Wellformedness ⭐ 63

25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.

Vtuber Livechat Dataset ⭐ 63

📊 VTuber 1B: Billion-scale Live Chat and Moderation Event Dataset

Code for "ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning" (ICLR 2020)

Rc Experiments ⭐ 61

Reading Comprehension Experiments repository.

Curated repository of notes from papers I'm reading, mostly NLP related. Updated regularly.

Shakkelha ⭐ 59

Neural Arabic text diacritization

Awesome Nlp Chinese Corpus ⭐ 59

A curated list of Chinese corpus resources for NLP (natural language processing)

ExpMRC: Explainability Evaluation for Machine Reading Comprehension

Discovery ⭐ 59

Mining Discourse Markers for Unsupervised Sentence Representation Learning

Char Rnn Tensorflow ⭐ 58

Multi-layer Recurrent Neural Networks for character-level language models, implemented in TensorFlow

Video_music_book_datasets ⭐ 57

NLP NER datasets for video, music, and books (BIO)

Deep Semantic Code Search ⭐ 57

Deep Semantic Code Search aims to explore a joint embedding space for code and description vectors and then use it for a code search application

Ake Datasets ⭐ 57

Large, curated set of benchmark datasets for evaluating automatic keyphrase extraction algorithms.

Mkg_analogy ⭐ 56

Code and datasets for the ICLR2023 paper "Multimodal Analogical Reasoning over Knowledge Graphs."

COVID-19 Open Research Dataset (CORD-19) Analysis

Doccano Transformer ⭐ 55

The official tool for transforming doccano format into common dataset formats.

Code for the paper "FactCHD: Benchmarking Fact-Conflicting Hallucination Detection".

Nlp Datasets ⭐ 54

Curation note of NLP datasets

Distractor Generation Race ⭐ 54

[AAAI 2019] Generating Distractors for Reading Comprehension Questions from Real Examinations

Chinese_book_dataset ⭐ 54

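Chinese book dataset / data mining / natural language processing / Chinese Library Classification / library and information science / text classification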

Corpus of Annual Reports in Japan

Prosocial Dialog ⭐ 53

🐥 Code and Dataset for our EMNLP 2022 paper - "ProsocialDialog: A Prosocial Backbone for Conversational Agents"

Document-level Attitude and Relation Extraction toolkit (AREkit) for sampling and prompting mass-media news into datasets for ML-model training

Text Style Transfer Benchmark ⭐ 52

Text style transfer benchmark

Text Mined Synthesis_public ⭐ 52

Code for the text-mined solid-state reactions dataset

Indic.page ⭐ 52

A directory of Indic (Indian) language computing resources.

Causalnewscorpus ⭐ 51

Participate in our Shared Task: Event Causality Identification with Causal News Corpus, featured under CASE @ RANLP 2023!

Tamil Nlp Catalog ⭐ 51

Awesome List of Tamil NLP & AI Resources

Snorkeling ⭐ 51

Extracting biomedical relationships from literature with Snorkel 🏊

101-200 of 205 search results
