PyTorch reimplementation of Google's repository for the ViT model released with the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
The paper shows that Transformers applied directly to image patches and pre-trained on large datasets perform very well on image recognition tasks.
Vision Transformer achieves state-of-the-art results in image recognition with a standard Transformer encoder and fixed-size patches. To perform classification, the authors use the standard approach of adding an extra learnable "classification token" to the sequence.
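As a rough illustration of the fixed-size patches and the extra classification token, the sketch below counts how many tokens the encoder would see for a square image. The function name `vit_sequence_length` is hypothetical, and the patch size of 16 is assumed from the "16" in the model name ViT-B_16; the two resolutions are the ones used in the results tables.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens the encoder sees for a square image (hypothetical helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    # One extra position for the learnable classification token.
    return num_patches + 1


for size in (224, 384):
    print(size, vit_sequence_length(size, patch_size=16))
```

Under these assumptions, a 224x224 input yields 197 tokens and a 384x384 input yields 577, which is one reason the higher resolution takes longer to train.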
# imagenet21k pre-train
wget https://storage.googleapis.com/vit_models/imagenet21k/{MODEL_NAME}.npz
# imagenet21k pre-train + imagenet2012 fine-tuning
wget https://storage.googleapis.com/vit_models/imagenet21k+imagenet2012/{MODEL_NAME}.npz
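If you prefer to build the download URL in a script, the pattern in the two commands above can be sketched as follows. The helper `checkpoint_url` and its `fine_tuned` flag are hypothetical names introduced here; the sketch only formats the URL and does not check that a given model name exists in either folder.

```python
BASE_URL = "https://storage.googleapis.com/vit_models"


def checkpoint_url(model_name: str, fine_tuned: bool = False) -> str:
    """Build the .npz URL for a model name, following the two wget patterns."""
    # fine_tuned=True selects the imagenet21k pre-train + imagenet2012 fine-tuning folder.
    folder = "imagenet21k+imagenet2012" if fine_tuned else "imagenet21k"
    return f"{BASE_URL}/{folder}/{model_name}.npz"


print(checkpoint_url("ViT-B_16"))
print(checkpoint_url("ViT-B_16", fine_tuned=True))
```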
python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz
CIFAR-10 and CIFAR-100 are downloaded automatically before training. To use a different dataset, you need to customize data_utils.py.
The default batch size is 512. If GPU memory is insufficient, you can still train by adjusting the value of --gradient_accumulation_steps.
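The idea behind gradient accumulation can be shown with a small pure-Python sketch. It assumes the usual scheme in which a batch is split into equal micro-batches whose scaled gradients are summed before a single optimizer step; how train.py implements the flag is not described here, and the toy model (one scalar weight with a mean-squared-error loss) and both function names are hypothetical.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_gradient(w, xs, ys, accumulation_steps):
    """Split the batch into equal micro-batches and sum their scaled gradients."""
    if len(xs) % accumulation_steps != 0:
        raise ValueError("batch size must be divisible by accumulation_steps")
    micro = len(xs) // accumulation_steps
    total = 0.0
    for i in range(accumulation_steps):
        lo, hi = i * micro, (i + 1) * micro
        # Dividing by the number of steps keeps the result equal to the full-batch mean.
        total += mse_gradient(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
    return total


xs = [float(i) for i in range(8)]
ys = [3.0 * x + 1.0 for x in xs]
full = mse_gradient(0.5, xs, ys)
accum = accumulated_gradient(0.5, xs, ys, accumulation_steps=4)
print(full, accum)
```

The two printed values agree up to floating-point rounding, so only one micro-batch needs to be held in memory at a time while the effective batch size stays the same.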
You can also use Automatic Mixed Precision (AMP) to reduce memory usage and train faster:
python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz --fp16 --fp16_opt_level O2
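One ingredient commonly paired with half-precision training is loss scaling, which keeps very small gradients from underflowing. The sketch below illustrates the idea with the standard library only; the gradient value and the scale factor are hypothetical, and whether the O2 setting in this repository uses this exact scale is not stated here.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_gradient = 1e-8   # hypothetical gradient, below the smallest fp16 subnormal
loss_scale = 65536.0   # hypothetical scale factor (2 ** 16)

# Stored directly in fp16, the gradient is lost.
unscaled = to_fp16(tiny_gradient)

# Scaled before the fp16 round trip, then unscaled in full precision.
scaled = to_fp16(tiny_gradient * loss_scale)
recovered = scaled / loss_scale

print(unscaled)
print(recovered)
```

The unscaled value underflows to zero, while the scaled path recovers the gradient to within fp16 rounding error.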
To verify that the converted model weights are correct, we compare our results with the authors' reported results. We trained using mixed precision, with --fp16_opt_level set to O2.
model | dataset | resolution | acc(official) | acc(this repo) | time |
---|---|---|---|---|---|
ViT-B_16 | CIFAR-10 | 224x224 | - | 0.9908 | 3h 13m |
ViT-B_16 | CIFAR-10 | 384x384 | 0.9903 | 0.9906 | 12h 25m |
ViT-B_16 | CIFAR-100 | 224x224 | - | 0.923 | 3h 9m |
ViT-B_16 | CIFAR-100 | 384x384 | 0.9264 | 0.9228 | 12h 31m |
R50-ViT-B_16 | CIFAR-10 | 224x224 | - | 0.9892 | 4h 23m |
R50-ViT-B_16 | CIFAR-10 | 384x384 | 0.99 | 0.9904 | 15h 40m |
R50-ViT-B_16 | CIFAR-100 | 224x224 | - | 0.9231 | 4h 18m |
R50-ViT-B_16 | CIFAR-100 | 384x384 | 0.9231 | 0.9197 | 15h 53m |
ViT-L_32 | CIFAR-10 | 224x224 | - | 0.9903 | 2h 11m |
ViT-L_32 | CIFAR-100 | 224x224 | - | 0.9276 | 2h 9m |
ViT-H_14 | CIFAR-100 | 224x224 | - | WIP | |

model | dataset | resolution | acc |
---|---|---|---|
ViT-B_16-224 | CIFAR-10 | 224x224 | 0.99 |
ViT-B_16-224 | CIFAR-100 | 224x224 | 0.9245 |
ViT-L_32 | CIFAR-10 | 224x224 | 0.9903 |
ViT-L_32 | CIFAR-100 | 224x224 | 0.9285 |

upstream | model | dataset | total_steps/warmup_steps | acc(official) | acc(this repo) |
---|---|---|---|---|---|
imagenet21k | ViT-B_16 | CIFAR-10 | 500/100 | 0.9859 | 0.9859 |
imagenet21k | ViT-B_16 | CIFAR-10 | 1000/100 | 0.9886 | 0.9878 |
imagenet21k | ViT-B_16 | CIFAR-100 | 500/100 | 0.8917 | 0.9072 |
imagenet21k | ViT-B_16 | CIFAR-100 | 1000/100 | 0.9115 | 0.9216 |
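To make the total_steps/warmup_steps column concrete, here is a minimal sketch of a learning-rate schedule with linear warmup followed by cosine decay. This shape is a common choice and is assumed here for illustration; the schedule actually used for the table above is not described in this text, and the function name and the base learning rate of 0.03 are hypothetical.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Mirrors the 500/100 setting from the table.
for step in (0, 50, 100, 300, 500):
    print(step, round(lr_at_step(step, 0.03, 100, 500), 5))
```

With these values the rate rises from zero to the base rate over the first 100 steps and returns to zero at step 500.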
ViT consists of a standard Transformer encoder, and each encoder block consists of a self-attention module and an MLP module. The attention map for an input image can be visualized from the attention scores of the self-attention layers.
Visualization code can be found at visualize_attention_map.
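One common way to turn per-layer attention scores into a single map is attention rollout: average the heads, add the identity to account for residual connections, renormalize, and multiply the layers together. The sketch below shows that technique on toy 3x3 matrices in pure Python. Whether visualize_attention_map does exactly this is an assumption, as are the function names, the example values, and the convention that token 0 is the classification token.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_rollout(layers):
    """Combine head-averaged attention matrices (rows sum to 1), first layer first."""
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Add the identity for the residual connection, then renormalize each row.
        mixed = [
            [(attn[i][j] + (1.0 if i == j else 0.0)) / 2.0 for j in range(n)]
            for i in range(n)
        ]
        rollout = matmul(mixed, rollout)
    return rollout


# Two toy layers over three tokens: token 0 plays the classification token.
layers = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]],
    [[0.5, 0.25, 0.25], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]],
]
rollout = attention_rollout(layers)
cls_to_patches = rollout[0][1:]  # how much the class token draws from each patch
print(cls_to_patches)
```

For a real image, the class-token row would be reshaped to the patch grid and upsampled to the image size to produce the overlay.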
@article{dosovitskiy2020,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
journal={arXiv preprint arXiv:2010.11929},
year={2020}
}