Hi there!
This repository contains demos I made with the Transformers library by 🤗 HuggingFace. Currently, all of them are implemented in PyTorch.
NOTE: if you are not familiar with HuggingFace and/or Transformers, I highly recommend checking out our free course, which introduces you to several Transformer architectures (such as BERT, GPT-2, T5, BART, etc.), as well as an overview of the HuggingFace libraries, including Transformers, Tokenizers, Datasets, Accelerate and the hub.
For an overview of the ecosystem of HuggingFace for computer vision (June 2022), refer to this notebook with corresponding video.
Currently, it contains the following demos:
- `LayoutLMv2ForSequenceClassification` on RVL-CDIP
- `LayoutLMv2ForTokenClassification` on FUNSD
- `LayoutLMv2ForTokenClassification` on FUNSD using the 🤗 Trainer
- `LayoutLMv2ForTokenClassification` on FUNSD
- `LayoutLMv2ForTokenClassification` (when no labels are available) + Gradio demo
- `LayoutLMv2ForTokenClassification` on CORD
- `LayoutLMv2ForQuestionAnswering` on DOCVQA
- `LayoutLMv3ForTokenClassification` on the FUNSD dataset
- `PerceiverForOpticalFlow`
- `PerceiverForMultimodalAutoencoding`
- `TapasForQuestionAnswering` on the Microsoft Sequential Question Answering (SQA) dataset
- `TapasForSequenceClassification` on the Table Fact Checking (TabFact) dataset
- `ViLT` for visual question answering (VQA)
- `ViLT` to illustrate visual question answering (VQA)
- `ViLT` model
- `ViLT` for image-text retrieval
- `ViLT` to illustrate natural language for visual reasoning (NLVR)

... more to come! 🤗
If you have any questions regarding these demos, feel free to open an issue on this repository.
Btw, I was also the main contributor to add the following algorithms to the library:
All of them were an incredible learning experience. I'd recommend contributing an AI algorithm to the library to anyone!
Regarding preparing your data for a PyTorch model, there are a few options:
One option is to subclass `torch.utils.data.Dataset`, and then create a corresponding `DataLoader` (which is a Python generator that allows you to loop over the items of a dataset). When subclassing the `Dataset` class, one needs to implement 3 methods: `__init__`, `__len__` (which returns the number of examples of the dataset) and `__getitem__` (which returns an example of the dataset, given an integer index). Here's an example of creating a basic text classification dataset (assuming one has a CSV that contains 2 columns, namely "text" and "label"):

```python
import torch
from torch.utils.data import Dataset

class CustomTrainDataset(Dataset):
    def __init__(self, df, tokenizer):
        self.df = df
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get item
        item = self.df.iloc[idx]
        text = item['text']
        label = item['label']
        # encode text
        encoding = self.tokenizer(text, padding="max_length", max_length=128, truncation=True, return_tensors="pt")
        # remove batch dimension which the tokenizer automatically adds
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        # add label (the model expects this key to be called "labels")
        encoding["labels"] = torch.tensor(label)
        return encoding
```
Instantiating the dataset then happens as follows:
```python
from transformers import BertTokenizer
import pandas as pd

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
df = pd.read_csv("path_to_your_csv")
train_dataset = CustomTrainDataset(df=df, tokenizer=tokenizer)
```
Accessing the first example of the dataset can then be done as follows:
```python
encoding = train_dataset[0]
```
In practice, one creates a corresponding `DataLoader`, which allows you to get batches from the dataset:
```python
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
```
I often check whether the data is created correctly by fetching the first batch from the data loader, and then printing out the shapes of the tensors, decoding the input_ids back to text, etc.
```python
batch = next(iter(train_dataloader))
for k, v in batch.items():
    print(k, v.shape)

# decode the input_ids of the first example of the batch
print(tokenizer.decode(batch['input_ids'][0].tolist()))
```
Another option is the 🤗 Datasets library: loading a custom dataset as a `Dataset` object can be done as follows (you can install datasets using `pip install datasets`):

```python
from datasets import load_dataset

dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'],
                                          'test': 'my_test_file.csv'})
```
Here I'm loading local CSV files, but other formats are supported as well (including JSON, Parquet, txt), and you can also load data from a local Pandas dataframe or dictionary, for instance. You can check out the docs for all details.
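As a quick illustration of those last two cases (a minimal sketch; the column names and values below are just placeholders):

```python
import pandas as pd
from datasets import Dataset

# from a local Pandas dataframe (placeholder columns "text" and "label")
df = pd.DataFrame({"text": ["great movie", "terrible movie"], "label": [1, 0]})
dataset = Dataset.from_pandas(df)

# from a plain Python dictionary
dataset = Dataset.from_dict({"text": ["great movie", "terrible movie"], "label": [1, 0]})
```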
Regarding fine-tuning Transformer models (or more generally, PyTorch models), there are a few options:

One option is to use native PyTorch: you write the training and evaluation loop yourself, put the model in training/evaluation mode (`model.train()`/`model.eval()`), handle device placement (`model.to(device)`), etc. A typical training loop in PyTorch looks as follows (inspired by this great PyTorch intro tutorial):

```python
import torch
from transformers import BertForSequenceClassification

# Instantiate pre-trained BERT model with randomly initialized classification head
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# I almost always use a learning rate of 5e-5 when fine-tuning Transformer based models
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# put model on GPU, if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

epochs = 3  # number of epochs (not defined in the original snippet; adjust to your task)
for epoch in range(epochs):
    model.train()
    train_loss = 0.0
    for batch in train_dataloader:
        # put batch on device
        batch = {k: v.to(device) for k, v in batch.items()}
        # forward pass
        outputs = model(**batch)
        loss = outputs.loss
        train_loss += loss.item()
        # backward pass + parameter update
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"Loss after epoch {epoch}:", train_loss/len(train_dataloader))

    # evaluation (assumes an eval_dataloader, created analogously to train_dataloader)
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in eval_dataloader:
            # put batch on device
            batch = {k: v.to(device) for k, v in batch.items()}
            # forward pass
            outputs = model(**batch)
            loss = outputs.loss
            val_loss += loss.item()
    print(f"Validation loss after epoch {epoch}:", val_loss/len(eval_dataloader))
```
Another option is to use PyTorch Lightning, which abstracts the training loop away in a `Trainer` object: you can just do `trainer = Trainer()` and then `trainer.fit(model)`. The advantage is that you can start training models very quickly (hence the name lightning), as all training-related code is handled by the `Trainer` object. The disadvantage is that it may be more difficult to debug your model, as the training and evaluation is now abstracted away.
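As a rough sketch (assuming `pytorch_lightning` is installed and reusing the dataloaders from above; the class name, learning rate and number of epochs are just placeholders), the training loop above could be wrapped like this:

```python
import torch
import pytorch_lightning as pl
from transformers import BertForSequenceClassification

class BertClassifier(pl.LightningModule):
    def __init__(self, lr=5e-5):
        super().__init__()
        self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
        self.lr = lr

    def training_step(self, batch, batch_idx):
        outputs = self.model(**batch)
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def validation_step(self, batch, batch_idx):
        outputs = self.model(**batch)
        self.log("val_loss", outputs.loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

# the Trainer handles device placement, epochs, logging, etc.
trainer = pl.Trainer(max_epochs=3)
trainer.fit(BertClassifier(), train_dataloader, eval_dataloader)
```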
A third option is the HuggingFace `Trainer` API, which similarly abstracts the training loop away; it also includes a `Seq2SeqTrainer` for encoder-decoder models, such as BART, T5 and the `EncoderDecoderModel` classes. Note that all PyTorch example scripts of the Transformers library make use of the `Trainer`.
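As a minimal sketch (the `TrainingArguments` values below are placeholder hyperparameters, and `eval_dataset` is assumed to be created like `train_dataset`), fine-tuning the BERT classifier above with the `Trainer` could look like this:

```python
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# placeholder hyperparameters; adjust to your task
training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # the custom dataset defined earlier
    eval_dataset=eval_dataset,    # assumed: a validation split prepared the same way
)

trainer.train()
```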