Attention Is All You Need Pytorch

A PyTorch implementation of the Transformer model in "Attention is All You Need".

Attention is all you need: A Pytorch Implementation

This is a PyTorch implementation of the Transformer model in "Attention is All You Need" (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, arXiv, 2017).

The Transformer is a novel sequence-to-sequence framework that uses the self-attention mechanism instead of convolution operations or recurrent structures, and achieves state-of-the-art performance on the WMT 2014 English-to-German translation task. (2017/06/12)

The official TensorFlow implementation can be found in tensorflow/tensor2tensor.

To learn more about the self-attention mechanism, you can read "A Structured Self-attentive Sentence Embedding".
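As a rough illustration of the self-attention idea, here is a minimal sketch of scaled dot-product attention. It is written in plain Python rather than PyTorch so that it runs without dependencies, and it is not the code used in this project. The function names `softmax` and `scaled_dot_product_attention` and the toy input `x` are assumptions made for the example; masking, multiple heads and learned projections are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of vectors.

    Each query is compared with every key; the scaled dot products are
    normalised with softmax and used to average the value vectors.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[i] for wj, v in zip(w, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Self-attention: queries, keys and values all come from the same sequence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)
```

Each row of `weights` sums to 1 and says how strongly one position attends to every other position in the same sequence.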

The project now supports training and translation with a trained model.

Note that this project is still a work in progress.

BPE-related parts are not yet fully tested.

If you have any suggestions or find an error, feel free to file an issue to let me know. :)


WMT'16 Multimodal Translation: de-en

An example of training for the WMT'16 Multimodal Translation task.

0) Download the spacy language models.

# conda install -c conda-forge spacy 
python -m spacy download en
python -m spacy download de

1) Preprocess the data with torchtext and spacy.

python -lang_src de -lang_trg en -share_vocab -save_data m30k_deen_shr.pkl
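The `-share_vocab` flag suggests that source and target tokens are indexed in a single table. The sketch below shows that idea in plain Python. It is an assumption about the general technique, not the torchtext-based preprocessing this project runs; the function `build_shared_vocab`, the `min_freq` parameter and the special tokens are hypothetical.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<s>", "</s>"]  # hypothetical special tokens

def build_shared_vocab(src_sentences, trg_sentences, min_freq=1):
    """Map tokens from both languages to ids in one shared table."""
    counts = Counter()
    for sentence in list(src_sentences) + list(trg_sentences):
        counts.update(sentence)
    # Sort by descending frequency, then alphabetically, for determinism.
    tokens = sorted((t for t, c in counts.items() if c >= min_freq),
                    key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate(SPECIALS + tokens)}

src = [["ein", "hund", "rennt"], ["ein", "mann"]]
trg = [["a", "dog", "runs"], ["a", "man"]]
vocab = build_shared_vocab(src, trg)
```

With one table for both languages, the encoder and decoder embeddings can refer to the same ids, which is what makes embedding weight sharing possible.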

2) Train the model

python -data_pkl m30k_deen_shr.pkl -log m30k_deen_shr -embs_share_weight -proj_share_weight -label_smoothing -output_dir output -b 256 -warmup 128000 -epoch 400
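The `-label_smoothing` flag in the command above refers to the regularisation used in the paper, which reports a smoothing value of 0.1. The following is a minimal plain-Python sketch of the idea, not this project's loss function. It assumes the smoothing mass is shared uniformly among the non-target classes (other formulations spread it over all classes); the names `smoothed_targets` and `cross_entropy` are hypothetical.

```python
import math

def smoothed_targets(target_index, num_classes, eps=0.1):
    """Put 1 - eps on the gold class and share eps among the others."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == target_index else off_value
            for i in range(num_classes)]

def cross_entropy(log_probs, target_dist):
    """Cross-entropy between a target distribution and log-probabilities."""
    return -sum(t * lp for t, lp in zip(target_dist, log_probs))

log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]  # toy model output
hard_loss = cross_entropy(log_probs, [1.0, 0.0, 0.0])
soft_loss = cross_entropy(log_probs, smoothed_targets(0, 3))
```

In this toy case the smoothed loss is higher than the one-hot loss, because the model is also charged for the small amount of probability the target assigns to the other classes.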

3) Test the model

python -data_pkl m30k_deen_shr.pkl -model trained.chkpt -output prediction.txt

[(WIP)] WMT'17 Multimodal Translation: de-en w/ BPE

1) Download and preprocess the data with BPE:

Since the interfaces are not unified, you need to switch the main function call from main_wo_bpe to main.

python -raw_dir /tmp/raw_deen -data_dir ./bpe_deen -save_data bpe_vocab.pkl -codes codes.txt -prefix deen
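For readers unfamiliar with byte pair encoding, the sketch below shows how merge operations can be learned from word frequencies. It follows the general BPE learning procedure (repeatedly merge the most frequent adjacent symbol pair) and is not the subword-nmt code this project borrows. The function names, the `</w>` end-of-word marker and the toy word counts are assumptions for illustration.

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Return the learned merges and the segmented vocabulary."""
    # Start from characters plus a hypothetical end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, 3)
```

On this toy input the frequent suffix `est` is assembled step by step, so `newest` and `widest` end up sharing one subword unit.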

2) Train the model

python -data_pkl ./bpe_deen/bpe_vocab.pkl -train_path ./bpe_deen/deen-train -val_path ./bpe_deen/deen-val -log deen_bpe -embs_share_weight -proj_share_weight -label_smoothing -output_dir output -b 256 -warmup 128000 -epoch 400

3) Test the model (not ready)

  • TODO:
    • Load vocabulary.
    • Perform decoding after the translation.
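The decoding step in the TODO list above amounts to joining subword units back into words. A minimal sketch, assuming the output uses the `@@` separator that subword-nmt applies by default; the function name `undo_bpe` is hypothetical and this is not part of the project yet.

```python
import re

def undo_bpe(line, separator="@@"):
    """Join subword units back into words by removing the separator."""
    sep = re.escape(separator)
    # Drop "@@ " inside the line and a dangling "@@" at the end of it.
    return re.sub(r"(%s )|(%s ?$)" % (sep, sep), "", line)

restored = undo_bpe("ein brau@@ ner hund spr@@ ingt")
```

Applying this to each line of the prediction file would give plain text that can be compared with the reference translations.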



  • Parameter settings:
    • batch size 256
    • warmup step 4000
    • epoch 200
    • lr_mul 0.5
    • label smoothing
    • do not apply BPE and shared vocabulary
    • target embedding / pre-softmax linear layer weight sharing.
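The warmup step and lr_mul settings above relate to the learning-rate schedule from the paper, which raises the rate linearly during warmup and then decays it with the inverse square root of the step number. The sketch below assumes that lr_mul acts as a plain multiplier on that formula and that the model dimension is 512, as in the paper's base model; the function name `transformer_lr` is hypothetical.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000, lr_mul=0.5):
    """Learning rate at a given optimizer step (steps start at 1)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear growth during warmup, inverse-square-root decay afterwards.
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return lr_mul * d_model ** -0.5 * scale

peak = transformer_lr(4000)      # the rate is highest at the end of warmup
halfway = transformer_lr(2000)   # half of the peak, by linearity
later = transformer_lr(16000)    # half of the peak again, after decay
```

The two branches of `min` meet exactly at `step == warmup_steps`, so a larger warmup value both delays and lowers the peak.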


  • coming soon.


  • Evaluation on the generated text.
  • Attention weight plot.


  • The byte pair encoding parts are borrowed from subword-nmt.
  • The project structure, some scripts and the dataset preprocessing steps are heavily borrowed from OpenNMT/OpenNMT-py.
  • Thanks for the suggestions from @srush, @iamalbert, @Zessay, @JulesGM, @ZiJianZhao, and @huanghoujing.