Awesome Open Source

Search results for dataset natural language processing

205 search results found

Collection of Urdu datasets for POS, NER, and NLP tasks

Podium: a framework agnostic Python NLP library for data loading and preprocessing

A collection of publicly available bug reports

Irc Disentanglement ⭐ 48

Dataset and model for disentangling chat on IRC

Pytorch_basic_nmt ⭐ 48

A simple yet strong implementation of neural machine translation in pytorch

Code for the collection and analysis of the MTNT dataset

Wongnai Corpus ⭐ 47

Collection of Wongnai's datasets

Topic Rnn ⭐ 47

Implementation (in progress) of Dieng et al.'s TopicRNN: a neural topic model & RNN hybrid.

Glami 1m ⭐ 47

The largest multilingual image-text classification dataset. It contains fashion products.

Wiki Atomic Edits ⭐ 47

A dataset of atomic wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence. This dataset contains ~43 million edits across 8 languages.

Trscraper ⭐ 47

TRScraper is an application developed for use in natural language processing applications that enables text mining on large platforms with Turkish-language content.

Indonesian_datasets ⭐ 46

NLP Datasets for Indonesian

Africanlp Public Datasets ⭐ 46

A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.

Nodejs Stanford Classifier ⭐ 46

Nodejs wrapper for Stanford Classifier.

Official implementations for (1) BlonDe: An Automatic Evaluation Metric for Document-level Machine Translation and (2) Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus

Transformer Srl ⭐ 45

Reimplementation of a BERT-based model (Shi et al., 2019), currently the state of the art for English SRL. This model also implements predicate disambiguation.

Awesome Chinese Llm ⭐ 45

Awesome Chinese LLM: a curated list of datasets and model resources for Chinese large language models.

Cccapsnet ⭐ 44

A PyTorch implementation of Compositional Coding Capsule Network based on PRL 2022 paper "Compositional Coding Capsule Network with K-Means Routing for Text Classification"

Book Genre Classification ⭐ 44

Classification of books based on titles without prior knowledge of context or author

Awesome Resources For Scholarly Big Data ⭐ 44

Tools, datasets, corpora, and venue challenges for scholarly big data: pick up scattered pearls

Python package for understanding the difficulty of text classification datasets. (in CoNLL 2018)

Ua Datasets ⭐ 44

A collection of datasets for Ukrainian language

Text classification models. Used as a submodule for other projects.

Wikineural ⭐ 43

Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).

Synergy Dataset ⭐ 43

SYNERGY - Open machine learning dataset on study selection in systematic reviews

Huggingartists ⭐ 42

Lyrics generation with GPT2-based Transformer

Science Result Extractor ⭐ 42

Machine Learning ⭐ 41

This repository will contain everything beginners in ML and DL need. Follow and star this repo for regular updates.

ODSQA: OPEN-DOMAIN SPOKEN QUESTION ANSWERING DATASET

A Natural Portuguese Language Benchmark (Napolab) for the evaluation of language models.

Attention_is_all_you_need ⭐ 41

A Causal Relation Schema for Text

Bibsample ⭐ 40

Example of using the Dataset API in TensorFlow

A Word Sense Disambiguation system integrating implicit and explicit external knowledge.

Cdqa Annotator ⭐ 40

⛔ [NOT MAINTAINED] A web-based annotator for closed-domain question answering datasets with SQuAD format.

Ai Sentiment Analysis On Imdb Dataset ⭐ 40

Sentiment Analysis using Stochastic Gradient Descent on 50,000 Movie Reviews Compiled from the IMDB Dataset

C4 Dataset Script ⭐ 39

Inspired by Google's C4 (Colossal Clean Crawled Corpus), a series of data cleaning scripts focused on CommonCrawl data processing, including Chinese data processing and the cleaning methods from MassiveText.

Finbert Qa ⭐ 39

Financial Domain Question Answering with pre-trained BERT Language Model

Toefl Qa ⭐ 39

A question answering dataset for machine comprehension of spoken content

A Hierarchical Type system for fine grained entity typing

WikiWhy is a new benchmark for evaluating LLMs' ability to explain cause-effect relationships. It is a QA dataset containing 9000+ "why" question-answer-rationale triplets.

Smiles X ⭐ 38

Autonomous characterization of molecular compounds from small datasets without descriptors

Tensorflow implementation of ACL2020 paper "Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks."

Pn Summary ⭐ 37

A well-structured summarization dataset for the Persian language!

Fast Annotation Tool ⭐ 37

FAST is an annotation tool that focuses on mobile devices.

A large-scale (194k) Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.

Squirrel Datasets Core ⭐ 37

Squirrel dataset hub

Yelp_challenge ⭐ 37

Yelp dataset challenge: NLP & sentiment analysis

Datasets ⭐ 36

A collection of some 200 datasets. You can call it a mini-Kaggle :)

Pytorch Pqrnn ⭐ 36

Implementation of pQRNN in PyTorch

NLP on Yelp's DataSet Challenge

Crnn Pytorch ⭐ 36

✍️ Convolutional Recurrent Neural Network in Pytorch | Text Recognition

Datasetstation ⭐ 36

Quickly download Chinese datasets, process them, and run data analysis and visualization: a one-stop solution for data problems.

Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback

A Data Centric annotation tool for your Named Entity Recognition projects

Pqg Pytorch ⭐ 35

Paraphrase Generation model using pair-wise discriminator loss

Focused Empathy ⭐ 35

🤗 Code and Dataset for our EMNLP 2021 paper: "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes"

Ten Thousand German News Articles Dataset for Topic Classification

Persianner ⭐ 34

Named-Entity Recognition in Persian Language

Dialogue ⭐ 34

Open_type ⭐ 33

Chinesemrc Data ⭐ 33

A collection of the extractive MRC (machine reading comprehension) datasets available so far for Chinese.

Bothub is an open platform for predicting, training and sharing NLP datasets in multiple languages

Codesearch ⭐ 32

Models and datasets for annotated code search.

This repository contains the dataset and code for "WiCE: Real-World Entailment for Claims in Wikipedia" in EMNLP 2023.

Adept Augmentations ⭐ 31

A Python library aimed at dissecting and augmenting NER training data.

Machine Comprehension Train on MSMARCO with S-NET Extraction Modification

Very Deep Cnn Pytorch ⭐ 30

Very deep CNN for text classification

Diverse Natural Language Inference Collection - NLI dataset that can be used to evaluate how well models perform distinct types of reasoning (EMNLP 2018)

Sentence Autosegmentation ⭐ 30

Deep-learning based sentence auto-segmentation from unstructured text w/o punctuation

Code for the paper "Contextualized Weak Supervision for Text Classification"

Extractive_rc_by_runtime_mt ⭐ 30

Code and datasets of "Multilingual Extractive Reading Comprehension by Runtime Machine Translation"

⚖️ A Statutory Article Retrieval Dataset in French. (ACL 2022)

Easy multi-task learning with HuggingFace Datasets and Trainer

Biomedical Nlp Corpus ⭐ 29

Corpus (datasets) collection about biology and medical NLP.

110k Dutch Book Reviews Dataset for Sentiment Analysis

Implementation of the EMNLP 2020 paper "Counterfactual Generator: A Weakly-Supervised Method for Named Entity Recognition".

Cnn Question Classification Keras ⭐ 29

Chinese Question Classifier (Keras Implementation) on BQuLD

Numeric fused-head identification and resolution

Naturallanguageprocessing ⭐ 28

Natural Language Processing

Sentence Classification Pytorch ⭐ 28

Sentiment analysis with variable length sequences in pytorch

Surnames ⭐ 28

Distribution of surnames around the world, sorted by population

Dureader_qanet_bidaf ⭐ 27

Using QANet and BiDAF on DuReader datasets

Pytorch Transformer Kor Eng ⭐ 27

Transformer Implementation using PyTorch for Neural Machine Translation (Korean to English)

Noisemix ⭐ 27

NoiseMix - data generation for natural language

Repository for the paper "ViHOS: Vietnamese Hate and Offensive Spans Detection" (EACL 2023)

SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batching, and more. Supports datasets from Huggingface, torchdata iterables, or simple lists of dictionaries.

Spoken Squad ⭐ 26

A spoken question answering dataset on SQUAD

Moral_stories ⭐ 26

Data and code for the "Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences" (Emelin et al., 2021) paper.

Findvehicle ⭐ 26

FindVehicle: A NER dataset in transportation to extract keywords describing vehicles on the road

COSMOS: Catching Out-of-Context Misinformation using Self Supervised Learning (AAAI 2023)

Datasets ⭐ 25

Collections of many datasets you may need and play with.

Awesome Azeri Nlp ⭐ 24

Azerbaijani language processing software, models and datasets.

20 Newsgroups_text Classification ⭐ 24

"20 newsgroups" dataset - Text Classification using Multinomial Naive Bayes in Python.

Pban Pytorch ⭐ 24

A Position-aware Bidirectional Attention Network for Aspect-level Sentiment Analysis, PyTorch implementation.

🍳 NLPrep - dataset tool for many natural language processing tasks

Nlp_pemdc ⭐ 23

NLP Pretrained Embeddings, Models and Datasets Collections (NLP_PEMDC). The collection will keep updating.

Exams Qa ⭐ 23

A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering

Sarcasm dataset, 15K tweets, very high quality, both intended & perceived sarcasm, rich context

Mp Cnn Variants ⭐ 23

Variants of Multi-Perspective Convolutional Neural Networks

201-205 of 205 search results

Copyright 2018-2024 Awesome Open Source. All rights reserved.