[ECCV 2022] Official repository for "MaxViT: Multi-Axis Vision Transformer". SOTA foundation models for classification, detection, segmentation, image quality, and generative modeling...
MaxViT: Multi-Axis Vision Transformer (ECCV 2022)


This repository hosts the official TensorFlow implementation of MaxViT models:

MaxViT: Multi-Axis Vision Transformer. ECCV 2022.
Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li
Google Research, University of Texas at Austin

Disclaimer: This is not an officially supported Google product.


  • Oct 12, 2022: Added the remaining ImageNet-1K and -21K checkpoints.
  • Oct 4, 2022: A list of updates
    • Added MaxViTTiny and MaxViTSmall checkpoints.
    • Added a Colab tutorial.
  • Sep 8, 2022: Our Google AI blog post covering both MaxViT and MAXIM is live.
  • Sep 7, 2022: @rwightman released a few small model weights in timm that achieve even better results than those in our paper.
  • Aug 26, 2022: our MaxViT models have been implemented in timm (pytorch-image-models). Kudos to @rwightman!
  • July 21, 2022: Initial code release of MaxViT models: accepted to ECCV'22.
  • Apr 6, 2022: MaxViT has been implemented by @lucidrains in vit-pytorch 😱 🤯
  • Apr 4, 2022: Initial upload to arXiv.

MaxViT Models

MaxViT is a family of hybrid (CNN + ViT) image classification models that outperforms both state-of-the-art ConvNets and Transformers across the board in parameter and FLOPs efficiency. The models also scale well to large datasets such as ImageNet-21K. Notably, because the grid attention used has linear complexity, MaxViT is able to "see" globally throughout the entire network, even in the earlier, high-resolution stages.
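To make the linear-complexity claim concrete, here is a minimal pure-Python sketch of the two token groupings that multi-axis attention alternates between: local windows (block attention) and strided positions spread over the whole map (grid attention). The function names are hypothetical and are not taken from this repository. The sketch assumes the window size `p` and grid size `g` are fixed and divide the feature-map height and width evenly. It only counts query-key pairs and does not compute attention itself.

```python
def block_groups(height, width, p):
    """Group pixel coordinates into non-overlapping p x p windows (local attention)."""
    groups = {}
    for y in range(height):
        for x in range(width):
            groups.setdefault((y // p, x // p), []).append((y, x))
    return list(groups.values())


def grid_groups(height, width, g):
    """Group pixel coordinates into g x g sets of strided positions (sparse global attention)."""
    stride_y, stride_x = height // g, width // g
    groups = {}
    for y in range(height):
        for x in range(width):
            # Positions sharing the same offset inside each stride cell attend together.
            groups.setdefault((y % stride_y, x % stride_x), []).append((y, x))
    return list(groups.values())


def attention_pairs(groups):
    """Query-key pairs when attention runs independently inside each group."""
    return sum(len(group) ** 2 for group in groups)


if __name__ == "__main__":
    for side in (8, 16):
        full = (side * side) ** 2  # pairs for unrestricted global attention
        block = attention_pairs(block_groups(side, side, 4))
        grid = attention_pairs(grid_groups(side, side, 4))
        print(side, full, block, grid)
    # One grid group on an 8 x 8 map: 16 tokens spread across the whole map.
    print(grid_groups(8, 8, 4)[0])
```

With a fixed group size, quadrupling the number of pixels (8 x 8 to 16 x 16) quadruples the pair count for both groupings, whereas full attention grows sixteenfold. Each grid group still spans the whole feature map, which is how global interaction remains affordable at high resolution.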

MaxViT meta-architecture:

Results on ImageNet-1k train and test:

Results on ImageNet-21k and JFT pre-trained models:

Colab Demo

We have released a Google Colab demo with a tutorial on how to run MaxViT on images.

Pretrained MaxViT Checkpoints

Results and checkpoints are listed below:

Name Resolution Top1 Acc. #Params FLOPs Model
MaxViT-T 224x224 83.62% 31M 5.6B ckpt
MaxViT-T 384x384 85.24% 31M 17.7B ckpt
MaxViT-T 512x512 85.72% 31M 33.7B ckpt
MaxViT-S 224x224 84.45% 69M 11.7B ckpt
MaxViT-S 384x384 85.74% 69M 36.1B ckpt
MaxViT-S 512x512 86.19% 69M 67.6B ckpt
MaxViT-B 224x224 84.95% 119M 24.2B ckpt
MaxViT-B 384x384 86.34% 119M 74.2B ckpt
MaxViT-B 512x512 86.66% 119M 138.5B ckpt
MaxViT-L 224x224 85.17% 212M 43.9B ckpt
MaxViT-L 384x384 86.40% 212M 133.1B ckpt
MaxViT-L 512x512 86.70% 212M 245.4B ckpt
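As a quick sanity check on the table, the sketch below compares how FLOPs grow with input resolution against how the pixel count grows, using the MaxViT-T rows. The helper name `scaling_vs_base` is hypothetical. The numbers are copied from the table, and no claim is made here about why the two ratios differ.

```python
# (resolution, GFLOPs) for MaxViT-T, copied from the table above.
MAXVIT_T_GFLOPS = {224: 5.6, 384: 17.7, 512: 33.7}


def scaling_vs_base(flops_by_res, base=224):
    """Return (resolution, pixel-count ratio, FLOPs ratio) relative to the base resolution."""
    rows = []
    for res, gflops in sorted(flops_by_res.items()):
        pixel_ratio = (res / base) ** 2           # growth in number of input pixels
        flops_ratio = gflops / flops_by_res[base]  # growth in compute
        rows.append((res, round(pixel_ratio, 2), round(flops_ratio, 2)))
    return rows


if __name__ == "__main__":
    for row in scaling_vs_base(MAXVIT_T_GFLOPS):
        print(row)
    # (224, 1.0, 1.0)
    # (384, 2.94, 3.16)
    # (512, 5.22, 6.02)
```

For these rows, FLOPs grow somewhat faster than the pixel count but stay far below the quadratic-in-pixels growth that full global self-attention would imply.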

Here is a list of ImageNet-21K pretrained and ImageNet-1K finetuned models:

Name Resolution Top1 Acc. #Params FLOPs 21k model 1k model
MaxViT-B 224x224 - 119M 24.2B ckpt -
MaxViT-B 384x384 - 119M 74.2B - ckpt
MaxViT-B 512x512 - 119M 138.5B - ckpt
MaxViT-L 224x224 - 212M 43.9B ckpt -
MaxViT-L 384x384 - 212M 133.1B - ckpt
MaxViT-L 512x512 - 212M 245.4B - ckpt
MaxViT-XL 224x224 - 475M 97.8B ckpt -
MaxViT-XL 384x384 - 475M 293.7B - ckpt
MaxViT-XL 512x512 - 475M 535.2B - ckpt


Should you find this repository useful, please consider citing:

  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},

Other Related Works

  • MAXIM: Multi-Axis MLP for Image Processing, CVPR 2022. Paper | Code
  • CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers, CoRL 2022. Paper | Code
  • Improved Transformer for High-Resolution GANs, NeurIPS 2021. Paper | Code
  • CoAtNet: Marrying Convolution and Attention for All Data Sizes, NeurIPS 2021. Paper
  • EfficientNetV2: Smaller Models and Faster Training, ICML 2021. Paper | Code

Acknowledgement: This repository builds on EfficientNets and CoAtNet.
