| Project | Stars | Last Commit | Open Issues | License | Language | Description |
|---|---|---|---|---|---|---|
| Nlp_chinese_corpus | 8,245 | 16 days ago | 20 | mit | | Large-scale Chinese corpus for NLP |
| Vale | 3,119 | 7 days ago | 19 | mit | Go | :pencil: A syntax-aware linter for prose built with speed and extensibility in mind. |
| Deepmoji | 1,331 | a year ago | 9 | mit | Python | State-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm, etc. |
| Beir | 872 | a month ago | 60 | apache-2.0 | Python | A heterogeneous benchmark for information retrieval; evaluate your models across 15+ diverse IR datasets. |
| Bolt | 822 | 8 days ago | 39 | mit | C++ | A deep learning library with high performance and heterogeneous flexibility. |
| Torchmoji | 678 | 3 years ago | 15 | mit | Python | 😇 A PyTorch implementation of the DeepMoji model: state-of-the-art deep learning for analyzing sentiment, emotion, sarcasm, etc. |
| Long Range Arena | 537 | 2 months ago | 22 | apache-2.0 | Python | Long Range Arena for benchmarking efficient Transformers. |
| Rnnlg | 476 | 4 years ago | 3 | other | Python | An open-source benchmark toolkit for Natural Language Generation (NLG) in spoken dialogue systems, released by Tsung-Hsien (Shawn) Wen of the Cambridge Dialogue Systems Group under Apache License 2.0. |
| Indicnlp_catalog | 405 | 2 months ago | 110 | | | A collaborative catalog of NLP resources for Indic languages. |
| Indonlu | 400 | 6 months ago | 5 | apache-2.0 | Jupyter Notebook | The first large-scale NLP benchmark for Indonesian: multiple downstream tasks, pre-trained IndoBERT models, and starter code. (AACL-IJCNLP 2020) |
Hugging Face Datasets is an excellent library, but it lacks standardization: datasets often require preprocessing before they can be used interchangeably. tasksource streamlines interchangeable dataset usage to scale up evaluation or multi-task learning.

Each dataset is standardized to a `MultipleChoice`, `Classification`, or `TokenClassification` template with canonical fields. We focus on discriminative tasks (i.e., tasks with negative examples or classes) and do not yet support generation tasks, as those are addressed by promptsource. All implemented preprocessings are in tasks.py or tasks.md. A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.
```shell
pip install tasksource
```
```python
from tasksource import list_tasks, load_task

df = list_tasks()  # takes some time
for id in df[df.task_type == "MultipleChoice"].id:
    dataset = load_task(id)  # all yielded datasets can be used interchangeably
```
Browse the 500+ curated tasks in tasks.md (200+ MultipleChoice tasks, 200+ Classification tasks), and feel free to request a new task. Datasets are downloaded to `$HF_DATASETS_CACHE` (like any Hugging Face dataset), so make sure you have more than 100 GB of free space there.
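Before kicking off a large download, it can be worth checking where the cache points and how much space is left. A small standard-library helper (the default path below matches the usual Hugging Face layout, but that default is an assumption, not something tasksource documents):

```python
import os
import shutil

def free_space_gb(path):
    # Walk up to the nearest existing directory, then report free space there.
    while not os.path.exists(path):
        parent = os.path.dirname(path.rstrip(os.sep)) or os.sep
        if parent == path:
            break
        path = parent
    return shutil.disk_usage(path).free / 1e9

# Assumed default cache location when $HF_DATASETS_CACHE is unset.
cache = os.environ.get(
    "HF_DATASETS_CACHE",
    os.path.expanduser("~/.cache/huggingface/datasets"),
)
print(f"datasets cache: {cache} ({free_space_gb(cache):.0f} GB free)")
```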
A text encoder pretrained on tasksource reached state-of-the-art results: /deberta-v3-base-tasksource-nli

Tasksource pretraining is notably helpful for RLHF reward modeling.

The repo also contains recasting code that was used to convert tasksource datasets to an instruction format, providing one of the richest instruction-tuning datasets: /tasksource-instruct-v0
```python
from tasksource import MultipleChoice, concatenate_dataset_dict

codah = MultipleChoice('question_propmt', choices_list='candidate_answers',
                       labels='correct_answer_idx',
                       dataset_name='codah', config_name='codah')

winogrande = MultipleChoice('sentence', ['option1', 'option2'], 'answer',
                            dataset_name='winogrande', config_name='winogrande_xl',
                            splits=['train', 'validation', None])  # these test labels are not usable

tasks = [winogrande.load(), codah.load()]  # aligned datasets (same columns) can be used interchangeably
```
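The point of alignment is that, once two datasets share the same columns, they can be pooled freely for multi-task training. A plain-Python sketch of that invariant (illustrative field names and made-up data, not the tasksource implementation):

```python
# Sketch: two "datasets" (lists of examples) standardized to the same
# MultipleChoice-style columns can be pooled into one training set.
# The field names and examples here are made up for illustration.

winogrande_like = [
    {"inputs": "The trophy didn't fit because _ was too big.",
     "choice0": "the trophy", "choice1": "the suitcase", "labels": 0},
]
codah_like = [
    {"inputs": "He plugged in his phone and",
     "choice0": "it started charging.", "choice1": "it flew away.", "labels": 0},
]

columns = set(winogrande_like[0])
assert all(set(ex) == columns for ex in codah_like)  # schemas are aligned
pooled = winogrande_like + codah_like  # safe to mix once columns match
print(len(pooled))  # → 2
```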
For help integrating tasksource into your experiments, please contact [email protected].
For more details, refer to this article:

```bibtex
@article{sileo2023tasksource,
  title={tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation},
  author={Sileo, Damien},
  url={https://arxiv.org/abs/2301.05948},
  journal={arXiv preprint arXiv:2301.05948},
  year={2023}
}
```