Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Transformers | 87,738 | | 64 | 911 | 8 hours ago | 91 | June 21, 2022 | 617 | apache-2.0 | Python |
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX. | ||||||||||
Made With Ml | 32,763 | | | | 7 days ago | 5 | May 15, 2019 | 8 | mit | Jupyter Notebook |
Learn how to responsibly develop, deploy and maintain production machine learning applications. | ||||||||||
D2l En | 16,954 | | | | 8 days ago | | | 83 | other | Python |
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 400 universities from 60 countries including Stanford, MIT, Harvard, and Cambridge. | ||||||||||
Datasets | 15,594 | | 9 | 208 | 2 days ago | 52 | June 15, 2022 | 526 | apache-2.0 | Python |
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools | ||||||||||
Awesome Pytorch List | 13,786 | | | | a month ago | | | 2 | | |
A comprehensive list of PyTorch-related content on GitHub, such as different models, implementations, helper libraries, tutorials, etc. | ||||||||||
Dive Into Dl Pytorch | 13,747 | | | | a year ago | | | 76 | apache-2.0 | Jupyter Notebook |
This project converts the MXNet implementations in the original book Dive into Deep Learning (《动手学深度学习》) into PyTorch implementations. | ||||||||||
Best Of Ml Python | 13,088 | | | | 3 days ago | | | 15 | cc-by-sa-4.0 | |
🏆 A ranked list of awesome machine learning Python libraries. Updated weekly. | ||||||||||
Flair | 12,593 | | 24 | 52 | 2 days ago | 27 | May 20, 2022 | 73 | other | Python |
A very simple framework for state-of-the-art Natural Language Processing (NLP) | ||||||||||
Nlp Tutorial | 12,146 | | | | 21 days ago | | | 33 | mit | Jupyter Notebook |
Natural Language Processing Tutorial for Deep Learning Researchers | ||||||||||
Allennlp | 11,300 | | 117 | 67 | 4 months ago | 264 | April 14, 2022 | 94 | apache-2.0 | Python |
An open-source NLP research library, built on PyTorch. |
PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. torchnlp extends PyTorch to provide you with basic text data processing functions.
Logo by Chloe Yeo, Corporate Sponsorship by WellSaid Labs
Make sure you have Python 3.6+ and PyTorch 1.0+. You can then install pytorch-nlp using pip:
pip install pytorch-nlp
Or install the latest code directly from GitHub:
pip install git+https://github.com/PetrochukM/PyTorch-NLP.git
The complete documentation for PyTorch-NLP is available via our ReadTheDocs website.
Within an NLP data pipeline, you'll want to implement these basic steps:
Load the IMDB dataset, for example:
from torchnlp.datasets import imdb_dataset
# Load the imdb training dataset
train = imdb_dataset(train=True)
train[0] # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}
Load a custom dataset, for example:
from pathlib import Path
from torchnlp.download import download_file_maybe_extract
directory_path = Path('data/')
train_file_path = Path('trees/train.txt')
download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])
open(directory_path / train_file_path)
Don't worry, we'll handle caching for you!
Tokenize and encode your text as a tensor.
For example, a WhitespaceEncoder breaks text into tokens whenever it encounters a whitespace character.
from torchnlp.encoders.text import WhitespaceEncoder
loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
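To make the encoding step more concrete, here is a minimal pure-Python sketch of what a whitespace encoder does conceptually. It is not torchnlp's implementation: the class name `TinyWhitespaceEncoder`, the reserved tokens, and their indices are assumptions chosen for illustration, and it returns plain lists of integers rather than tensors.

```python
class TinyWhitespaceEncoder:
    """Hypothetical, simplified stand-in for a whitespace text encoder."""

    def __init__(self, samples, reserved=("<pad>", "<unk>")):
        # Reserved tokens come first, so padding is index 0 and unknown is index 1.
        self.vocab = list(reserved)
        self.token_to_index = {token: i for i, token in enumerate(self.vocab)}
        for sample in samples:
            for token in sample.split():
                if token not in self.token_to_index:
                    self.token_to_index[token] = len(self.vocab)
                    self.vocab.append(token)

    def encode(self, text):
        # Tokens never seen while building the vocabulary map to "<unk>".
        unk = self.token_to_index["<unk>"]
        return [self.token_to_index.get(token, unk) for token in text.split()]

    def decode(self, indices):
        return " ".join(self.vocab[i] for i in indices)


loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = TinyWhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
```

Each sentence becomes a list of vocabulary indices, and `decode` reverses the mapping, which is handy when you inspect model inputs.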
With your loaded and encoded data in hand, you'll want to batch your dataset.
import torch
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors
encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]
train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])
batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]
PyTorch-NLP builds on top of PyTorch's existing torch.utils.data.sampler, torch.stack and default_collate to support sequential inputs of varying lengths!
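The idea behind bucketing and padding can be sketched in plain Python. This is a simplified illustration, not the algorithm used by BucketBatchSampler or stack_and_pad_tensors: the helper names `bucket_batches` and `pad_batch` are hypothetical, the sketch simply sorts every example by length, and it works on lists of integers instead of tensors.

```python
def bucket_batches(lengths, batch_size):
    """Group example indices of similar length into batches (simplified: global sort)."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[start:start + batch_size] for start in range(0, len(order), batch_size)]


def pad_batch(sequences, padding_index=0):
    """Right-pad every sequence to the longest one and return the original lengths."""
    max_length = max(len(sequence) for sequence in sequences)
    padded = [list(sequence) + [padding_index] * (max_length - len(sequence))
              for sequence in sequences]
    return padded, [len(sequence) for sequence in sequences]


encoded = [[5, 6, 7, 8, 9], [1, 2], [3, 4, 5, 6], [7, 8, 9]]
batches = bucket_batches([len(sequence) for sequence in encoded], batch_size=2)
padded_batches = [pad_batch([encoded[i] for i in batch]) for batch in batches]
```

Because each batch holds sequences of similar length, little padding is wasted: here the two shortest examples are padded to length 3 and the two longest to length 5.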
With your batch in hand, you can use PyTorch to develop and train your model using gradient descent. For instance, see the example code for training on the Stanford Natural Language Inference (SNLI) Corpus.
PyTorch-NLP has a couple more NLP-focused utility packages to support you! 🤗
Now that you've set up your pipeline, you may want to ensure that some functions run deterministically.
Wrap any code that's random with fork_rng and you'll be good to go, like so:
import random
import numpy
import torch
from torchnlp.random import fork_rng
with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
This will always print:
Random: 224899943
Numpy: 843828735
Torch: 843828736
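If you are curious how forking a random number generator can work, the sketch below saves the generator state, optionally seeds it, and restores the saved state on exit. It is an illustration under assumptions, not the torchnlp implementation: `fork_python_rng` is a hypothetical helper that only covers Python's built-in `random` module, whereas the example above also covers NumPy and PyTorch.

```python
import random
from contextlib import contextmanager


@contextmanager
def fork_python_rng(seed=None):
    """Hypothetical helper: isolate changes to Python's global `random` state."""
    state = random.getstate()  # Remember the caller's generator state.
    if seed is not None:
        random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)  # Restore it, even if the block raised.


with fork_python_rng(seed=123):
    first = random.randint(1, 2**31)
with fork_python_rng(seed=123):
    second = random.randint(1, 2**31)
```

Both blocks draw the same number, and the random sequence outside the blocks continues as if they had never run.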
Now that you've computed your vocabulary, you may want to make use of pre-trained word vectors to set your embeddings, like so:
import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe
encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])
vocab_set = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
For example, from the neural network package, apply the state-of-the-art LockedDropout:
import torch
from torchnlp.nn import LockedDropout
input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)
# Apply a LockedDropout to `input_`
dropout(input_) # RETURNS: torch.FloatTensor (6x3x10)
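Locked (sometimes called variational) dropout samples a single mask and reuses it at every time step, instead of drawing a fresh mask per step. The pure-Python sketch below illustrates that idea on nested lists. It is not the torchnlp module: the function name `locked_dropout` and the assumed `[time][batch][feature]` layout (mirroring the 6x3x10 example above) are illustrative choices.

```python
import random


def locked_dropout(sequence, p=0.5, rng=random):
    """Apply one dropout mask per (batch, feature) position to every time step.

    `sequence` is a nested list shaped [time][batch][feature].
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    batch_size, feature_size = len(sequence[0]), len(sequence[0][0])
    scale = 1.0 / (1.0 - p)  # Inverted dropout keeps the expected value unchanged.
    # Sample the mask once; it does not depend on the time step.
    mask = [[scale if rng.random() >= p else 0.0 for _ in range(feature_size)]
            for _ in range(batch_size)]
    return [[[value * mask[b][f] for f, value in enumerate(features)]
             for b, features in enumerate(step)]
            for step in sequence]


ones = [[[1.0] * 10 for _ in range(3)] for _ in range(6)]  # 6 steps, batch of 3, 10 features
dropped = locked_dropout(ones, p=0.5, rng=random.Random(0))
```

Every time step of `dropped` is zeroed in exactly the same positions, which is the property that distinguishes locked dropout from ordinary dropout on sequences.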
Compute common NLP metrics such as the BLEU score.
from torchnlp.metrics import get_moses_multi_bleu
hypotheses = ["The brown fox jumps over the dog 笑"]
references = ["The quick brown fox jumps over the lazy dog 笑"]
# Compute BLEU score with the official BLEU perl script
get_moses_multi_bleu(hypotheses, references, lowercase=True) # RETURNS: 47.9
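For intuition about what the score measures, here is a simplified corpus-level BLEU in plain Python: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is a sketch under assumptions, not the Moses script: `simple_bleu` is a hypothetical function, it tokenizes on whitespace only, lowercases everything, and supports a single reference per hypothesis, so its scores can differ from the official tool.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, whitespace tokens."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_length = ref_length = 0
    for hypothesis, reference in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hypothesis.lower().split(), reference.lower().split()
        hyp_length += len(hyp_tokens)
        ref_length += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tokens, n), ngrams(ref_tokens, n)
            # Clip each n-gram count by how often it appears in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_length == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than their references.
    brevity_penalty = 1.0 if hyp_length > ref_length else math.exp(1.0 - ref_length / hyp_length)
    return 100.0 * brevity_penalty * math.exp(log_precision)


score = simple_bleu(["The brown fox jumps over the dog 笑"],
                    ["The quick brown fox jumps over the lazy dog 笑"])
```

On this sentence pair the sketch gives roughly 47.9, in line with the value shown above, but agreement on other inputs is not guaranteed because tokenization and reference handling differ.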
Longer examples are available in examples/ and may help you get started.
Need more help? We are happy to answer your questions via Gitter Chat.
We've released PyTorch-NLP because we found a lack of basic toolkits for NLP in PyTorch. We hope that other organizations can benefit from the project. We are thankful for any contributions from the community.
Read our contributing guide to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to PyTorch-NLP.
torchtext and PyTorch-NLP differ in architecture and feature set; otherwise, they are similar. Both provide pre-trained word vectors, datasets, iterators and text encoders. PyTorch-NLP also provides neural network modules and metrics. From an architecture standpoint, torchtext is object-oriented with external coupling, while PyTorch-NLP is object-oriented with low coupling.
AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.
If you find PyTorch-NLP useful for an academic publication, then please use the following BibTeX to cite it:
@misc{pytorch-nlp,
author = {Petrochuk, Michael},
title = {PyTorch-NLP: Rapid Prototyping with PyTorch Natural Language Processing (NLP) Tools},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/PetrochukM/PyTorch-NLP}},
}