Tasksource

Datasets collection and standardization for NLP extreme multitask learning

tasksource: 500+ dataset harmonization preprocessings for effortless extreme multi-task learning and evaluation

Hugging Face Datasets is an excellent library, but it lacks standardization, and datasets often require preprocessing before they can be used interchangeably. tasksource streamlines the use of interchangeable datasets to scale evaluation and multi-task learning.

Each dataset is standardized to a MultipleChoice, Classification, or TokenClassification template with canonical fields. We focus on discriminative tasks (i.e., tasks with negative examples or classes) and do not yet support generation tasks, as those are already addressed by promptsource. All implemented preprocessings are in tasks.py or tasks.md. A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.
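To illustrate the idea, a preprocessing can be thought of as a small function that maps a dataset's native columns onto the canonical template fields. The sketch below is hypothetical (the field and column names are illustrative, not tasksource's actual API):

```python
# Hypothetical sketch of what a "preprocessing" does conceptually:
# map a dataset's native column names onto canonical MultipleChoice fields.
# Field and column names here are illustrative, not tasksource's real API.

def to_multiple_choice(example):
    """Standardize one raw example into canonical MultipleChoice fields."""
    return {
        "inputs": example["question"],        # the prompt/context
        "choice0": example["answers"][0],     # candidate answers
        "choice1": example["answers"][1],
        "labels": example["correct_index"],   # index of the correct answer
    }

raw = {"question": "The sky is", "answers": ["green", "blue"], "correct_index": 1}
standardized = to_multiple_choice(raw)
```

Because every preprocessing emits the same canonical fields, downstream training and evaluation code never needs to know which dataset an example came from.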

Installation and usage:

```shell
pip install tasksource
```

```python
from tasksource import list_tasks, load_task

df = list_tasks()  # takes some time

for id in df[df.task_type == "MultipleChoice"].id:
    dataset = load_task(id)  # all yielded datasets can be used interchangeably
```

Browse the 500+ curated tasks in tasks.md (200+ MultipleChoice tasks, 200+ Classification tasks), and feel free to request a new task. Datasets are downloaded to $HF_DATASETS_CACHE (like any Hugging Face dataset), so ensure you have more than 100GB of space available.

Pretrained model:

A text encoder pretrained on tasksource reached state-of-the-art results: /deberta-v3-base-tasksource-nli

Tasksource pretraining is notably helpful for RLHF reward modeling.

tasksource-instruct

The repo also contains recasting code that was used to convert tasksource datasets to an instruction format, providing one of the richest instruction-tuning datasets: /tasksource-instruct-v0

Write and use custom preprocessings

```python
from tasksource import MultipleChoice, concatenate_dataset_dict

codah = MultipleChoice('question_propmt', choices_list='candidate_answers',
    labels='correct_answer_idx',
    dataset_name='codah', config_name='codah')

winogrande = MultipleChoice('sentence', ['option1', 'option2'], 'answer',
    dataset_name='winogrande', config_name='winogrande_xl',
    splits=['train', 'validation', None])  # winogrande test labels are not usable

tasks = [winogrande.load(), codah.load()]  # aligned datasets (same columns) can be used interchangeably
```
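Because aligned datasets share the same canonical columns, examples from different tasks can flow through a single training or evaluation loop. A minimal illustration of that interchangeability, using plain dicts in place of real Hugging Face datasets (all names below are illustrative):

```python
# Illustrative only: two "datasets" with identical canonical columns
# can be mixed and consumed by the same code path.
winogrande_like = [{"inputs": "The trophy didn't fit because _ was too big.",
                    "choice0": "the trophy", "choice1": "the suitcase", "labels": 0}]
codah_like = [{"inputs": "I dropped the glass and it",
               "choice0": "flew away", "choice1": "shattered", "labels": 1}]

mixed = winogrande_like + codah_like  # interchangeable because columns align
for ex in mixed:
    choices = [ex["choice0"], ex["choice1"]]
    gold = choices[ex["labels"]]  # the same access pattern works for every task
```

This is the property that makes extreme multi-task learning practical: one model input pipeline serves hundreds of tasks.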

Contact and citation

For help integrating tasksource into your experiments, please contact [email protected].

For more details, refer to this article:

@article{sileo2023tasksource,
  title={tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation},
  author={Sileo, Damien},
  url= {https://arxiv.org/abs/2301.05948},
  journal={arXiv preprint arXiv:2301.05948},
  year={2023}
}