Awesome Open Source

Search results for dataset natural language processing

205 search results found

Datasets ⭐ 18,319

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Doccano ⭐ 8,927

Open source annotation tool for machine learning practitioners.

Nlp_chinese_corpus ⭐ 8,344

Large Scale Chinese Corpus for NLP

Awesome Pretrained Chinese Nlp Models ⭐ 3,738

Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models

Models, data loaders and abstractions for language processing, powered by PyTorch

Cluedatasetsearch ⭐ 2,778

Search all Chinese NLP datasets, with commonly used English NLP datasets included

Textattack ⭐ 2,597

TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP https://textattack.readthedocs.io/en/master/

Pytorch Nlp ⭐ 2,180

Basic Utilities for PyTorch Natural Language Processing (NLP)

Codesearchnet ⭐ 2,054

Datasets, tools, and benchmarks for representation learning of code.

Medical_nlp ⭐ 1,969

Medical NLP competitions, datasets, large models, papers, and toolkits

Awesome_chinese_medical_nlp ⭐ 1,847

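A collection of public resources for Chinese medical NLP: terminologies, corpora, word embeddings, pretrained models, knowledge graphs, named entity recognition, QA, information extraction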

Chineseglue ⭐ 1,765

Language Understanding Evaluation benchmark for Chinese: datasets, baselines, pre-trained models, corpus and leaderboard

Awesome Domain Llm ⭐ 1,502

Collects and organizes open-source models, datasets, and evaluation benchmarks for vertical domains.

Transfer Learning Conv Ai ⭐ 1,499

🦄 State-of-the-Art Conversational AI with Transfer Learning

Deepmoji ⭐ 1,462

State-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm etc.

Entity Recognition Datasets ⭐ 1,386

A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

Wikisql ⭐ 1,370

A large annotated semantic parsing corpus for developing natural language interfaces.

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

Chinesenlp ⭐ 1,329

Datasets and SOTA results for every field of Chinese NLP

Dataprofiler ⭐ 1,310

What's in your data? Extract schema, statistics and entities from datasets

Projects ⭐ 1,207

🪐 End-to-end NLP workflows from prototype to production

Data Juicer ⭐ 994

A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Insuranceqa Corpus Zh ⭐ 989

🚁 Insurance industry corpus, chatbot

Textbox ⭐ 966

TextBox 2.0 is a text generation library with pre-trained language models

Chatgpt Comparison Detection ⭐ 921

Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥

Torchmoji ⭐ 882

😇 A PyTorch implementation of the DeepMoji model: state-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm, etc.

Source code of K-BERT (AAAI2020)

Chatito ⭐ 755

🎯🗯 Generate datasets for AI chatbots, NLP tasks, named entity recognition or text classification models using a simple DSL!

Prompt4reasoningpapers ⭐ 717

Repository for the ACL2023 paper "Reasoning with Language Model Prompting: A Survey".

Hate Speech And Offensive Language ⭐ 698

Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017

Thoughtsource ⭐ 680

A central, open resource for data and tools related to chain-of-thought reasoning in large language models. Developed @ Samwald research group: https://samwald.info/

Long Range Arena ⭐ 635

Long Range Arena for Benchmarking Efficient Transformers

Sequence Labeling Bilstm Crf ⭐ 605

The classical BiLSTM-CRF model implemented in TensorFlow, for sequence labeling tasks. In Vex version, everything is configurable.

Datasets Server ⭐ 578

Lightweight web API for visualizing and exploring all types of datasets - computer vision, speech, text, and tabular - stored on the Hugging Face Hub

Annotated Semantic Relationships Datasets ⭐ 565

A collection of public and free annotated datasets of relationships between entities/nominals (Portuguese and English)

Neuspell ⭐ 541

NeuSpell: A Neural Spelling Correction Toolkit

Cluecorpus2020 ⭐ 517

Large-scale Pre-training Corpus for Chinese (100G)

Efaqa Corpus Zh ⭐ 505

❤️ Emotional First Aid Dataset: psychological counseling Q&A and chatbot corpus

Complete Life Cycle Of A Data Science Project ⭐ 499

Complete-Life-Cycle-of-a-Data-Science-Project

Indonlu ⭐ 494

The first-ever vast natural language processing benchmark for the Indonesian language. We provide multiple downstream tasks, pre-trained IndoBERT models, and starter code! (AACL-IJCNLP 2020)

Convokit ⭐ 483

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.

Text2sql Data ⭐ 478

A collection of datasets that pair questions with SQL queries.

Attention Networks For Classification ⭐ 477

Hierarchical Attention Networks for Document Classification in PyTorch

RNNLG is an open source benchmark toolkit for Natural Language Generation (NLG) in spoken dialogue system application domains. It is released by Tsung-Hsien (Shawn) Wen from Cambridge Dialogue Systems Group under Apache License 2.0.

Dstc8 Schema Guided Dialogue ⭐ 464

The Schema-Guided Dialogue Dataset

Oie Resources ⭐ 435

A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.

Subreddit Analyzer ⭐ 422

A comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit.

Matterport3dsimulator ⭐ 414

AI Research Platform for Reinforcement Learning from Real Panoramic Images.

Openai Clip ⭐ 404

Simple implementation of OpenAI CLIP model in PyTorch.

Paperrobot ⭐ 384

Code for PaperRobot: Incremental Draft Generation of Scientific Ideas

Chinese Nlp Corpus ⭐ 378

Collections of Chinese NLP corpus

Awesomefakenews ⭐ 317

This repository contains recent research on fake news.

Transformer Pointer Generator ⭐ 314

An Abstractive Summarization Implementation with Transformer and Pointer-generator

Simple downloader for pre-trained word vectors

Cmrc2018 ⭐ 313

A Span-Extraction Dataset for Chinese Machine Reading Comprehension (CMRC 2018)

Automated Fact Checking Resources ⭐ 303

Links to conference/journal publications in automated fact-checking (resources for the TACL22/EMNLP23 paper).

Data Science Hacks ⭐ 300

Data Science Hacks consists of tips and tricks to help you become a better data scientist. The hacks are for everyone, from beginner to advanced, and cover Python, Jupyter Notebook, pandas, and so on.

Peerread ⭐ 297

Data and code for Kang et al., NAACL 2018's paper titled "A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications"

Language model fine-tuning on NER with an easy interface and cross-domain evaluation. "T-NER: An All-Round Python Library for Transformer-based Named Entity Recognition, EACL 2021"

Nlp_datasets ⭐ 285

My NLP datasets for Russian language

Rc Cnn Dailymail ⭐ 282

CNN/Daily Mail Reading Comprehension Task

Neural Sentiment Classification

Medquad ⭐ 275

Medical Question Answering Dataset of 47,457 QA pairs created from 12 NIH websites

Squirrel Core ⭐ 271

A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way 🌰

Primekg ⭐ 269

Precision Medicine Knowledge Graph (PrimeKG)

Nlp_bahasa_resources ⭐ 260

A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia

Multi Criteria Cws ⭐ 260

Simple Solution for Multi-Criteria Chinese Word Segmentation

Dialoglue ⭐ 256

DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue

Links to Russian corpora + Python functions for loading and parsing

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Chazutsu ⭐ 237

The tool to make NLP datasets ready to use

Persian Swear Words ⭐ 235

Persian Swear Dataset: a dataset of inappropriate and offensive Persian words that you can use in production to filter unwanted content from text.

Tensorflow_qrnn ⭐ 228

QRNN implementation for TensorFlow

Nlp_profiler ⭐ 227

A simple NLP library that allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.

Triviaqa ⭐ 227

Code for the TriviaQA reading comprehension dataset

Torchnlp ⭐ 221

Easy to use NLP library built on PyTorch and TorchText

Sota Extractor ⭐ 221

The SOTA extractor pipeline

Aidl_kb ⭐ 218

A Knowledge Base for the FB Group Artificial Intelligence and Deep Learning (AIDL)

Awesome Tensorlayer ⭐ 212

A curated list of dedicated resources and applications

Neuralqa ⭐ 207

NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT

[NeurIPS 2021] WRENCH: Weak supeRvision bENCHmark

Dataset ⭐ 194

Darija <-> English dataset

Awesome Hungarian Nlp ⭐ 192

A curated list of NLP resources for Hungarian

Unify Emotion Datasets ⭐ 189

A Survey and Experiments on Annotated Corpora for Emotion Classification in Text

Goodreads ⭐ 186

Code samples for the Goodreads datasets

Bert Attributeextraction ⭐ 185

Using BERT for attribute extraction in knowledge graphs: BERT-based fine-tuning and feature extraction methods for extracting attributes of person entries from Baidu Baike.

Text Summarization Repo ⭐ 184

A repository that organizes the main research topics in text summarization, must-read papers, and available models and data, along with recommended resources.

Fakenewscorpus ⭐ 184

A dataset of millions of news articles scraped from a curated list of data sources.

Awesome Llm Eval ⭐ 183

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs and models, mainly for evaluation of LLMs.

Financial News Dataset ⭐ 182

Reuters and Bloomberg

Robbert ⭐ 180

A Dutch RoBERTa-based language model

Lineflow ⭐ 178

⚡A Lightweight NLP Data Loader for All Deep Learning Frameworks in Python

Siamese Lstm ⭐ 172

Siamese LSTM for evaluating semantic similarity between sentences of the Quora Question Pairs Dataset.

Nlp Public Dataset ⭐ 172

Chinese and English NER datasets, English-Chinese machine translation datasets, and Chinese word segmentation datasets.

Awesome Nlp Polish ⭐ 169

A curated list of resources dedicated to Natural Language Processing (NLP) in Polish. Models, tools, datasets.

Comprehensive NLP Evaluation System

Pubmed Rct ⭐ 166

PubMed 200k RCT dataset: a large dataset for sequential sentence classification.

Trustllm ⭐ 164

TrustLLM: Trustworthiness in Large Language Models

Scanrefer ⭐ 163

[ECCV 2020] ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language

QANTA Quiz Bowl AI

1-100 of 205 search results
