Awesome Open Source

A PyTorch Implementation of the Transformer: Attention Is All You Need

This implementation is largely based on an existing TensorFlow implementation.


Why This Project?

I'm new to PyTorch, so I have been implementing some projects with it. I recently read the paper Attention Is All You Need and was impressed by the idea, which led to this project. My results are similar to those of the original TensorFlow implementation.

Differences with the original paper

I don't intend to replicate the paper exactly. Rather, I aim to implement its main ideas and verify them in a SIMPLE and QUICK way. As a result, some parts of my code differ from the paper:

  • I used the IWSLT 2016 de-en dataset rather than the WMT dataset, because the former is much smaller and requires no special preprocessing.
  • I built the vocabulary from words, not subwords, for simplicity. Of course, you can try BPE or word-piece if you want.
  • I used learned (parameterized) positional encodings. The paper used a sinusoidal formula, but Noam, one of the authors, says both work. See the discussion on reddit.
  • The paper adjusted the learning rate according to the global step. I fixed the learning rate to a small number, 0.0001, simply because training was fast enough with the small dataset (only a couple of hours on a single GTX 1060!).
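For reference, here is a minimal pure-Python sketch of the two formulas from the paper that this project replaces: the sinusoidal positional encoding and the step-dependent learning rate. It is not code from this repository. The function names are made up for illustration, and the defaults (d_model=512, warmup_steps=4000) are the paper's base settings, not this project's.

```python
import math


def sinusoidal_position_encoding(max_len, d_model):
    """Return a max_len x d_model table of fixed positional encodings.

    Paper formula: PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
                   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    table = []
    for pos in range(max_len):
        row = []
        for dim in range(d_model):
            # Each sin/cos pair (2i, 2i+1) shares the same frequency.
            angle = pos / (10000 ** (2 * (dim // 2) / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def noam_learning_rate(step, d_model=512, warmup_steps=4000):
    """Paper schedule: linear warmup, then decay with 1/sqrt(step)."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    pe = sinusoidal_position_encoding(max_len=4, d_model=8)
    print(len(pe), len(pe[0]))
    for step in (100, 4000, 40000):
        print(step, noam_learning_rate(step))
```

With the paper's defaults the schedule peaks at step 4000 at roughly 0.0007, so the fixed 0.0001 used here is a conservative value by comparison.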

File description

  • includes all hyper parameters that are needed.
  • creates vocabulary files for the source and the target.
  • contains functions regarding loading and batching data.
  • has all building blocks for encoder/decoder networks.
  • has the model.
  • is for evaluation.
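As a rough illustration of the word-level vocabulary step, the sketch below counts whitespace-separated words and maps them to integer ids. It is an assumption-laden example, not the project's actual script: the special tokens, the min_count cutoff, and the function names build_vocab and encode are all hypothetical.

```python
from collections import Counter

# Assumed special tokens; the real project may use different ones.
SPECIALS = ["<PAD>", "<UNK>", "<S>", "</S>"]


def build_vocab(sentences, min_count=1):
    """Build word-to-id and id-to-word tables from raw sentences."""
    counts = Counter(word for s in sentences for word in s.split())
    # Most frequent words first; rare words are dropped and map to <UNK>.
    words = [w for w, c in counts.most_common() if c >= min_count]
    itos = SPECIALS + words
    stoi = {w: i for i, w in enumerate(itos)}
    return stoi, itos


def encode(sentence, stoi):
    """Turn a sentence into ids, appending the end-of-sentence token."""
    unk = stoi["<UNK>"]
    ids = [stoi.get(word, unk) for word in sentence.split()]
    return ids + [stoi["</S>"]]


if __name__ == "__main__":
    stoi, itos = build_vocab(["das ist ein Baum", "das ist gut"], min_count=2)
    print(itos)
    print(encode("das ist ein Baum", stoi))
```

Raising min_count shrinks the vocabulary at the cost of more unknown tokens, which is one reason a limited word-level vocabulary can hurt BLEU compared with subword units.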


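  • STEP 1. Download the IWSLT 2016 German-English parallel corpus and extract it to the corpora folder.
wget -qO- | tar xz; mv de-en corpora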
  • STEP 2. Adjust hyper parameters in if necessary.
  • STEP 3. Run to generate vocabulary files to the preprocessed folder.
  • STEP 4. Run or download the pretrained weights, put them into the './models/' folder, and change the eval_epoch in to 18
  • STEP 5. Show loss and accuracy in tensorboard
tensorboard --logdir runs


  • Run


I got a BLEU score of 16.7 (the TensorFlow implementation got 17.14). Recall that I trained with a small dataset and a limited vocabulary. Some of the evaluation results are shown below; details are available in the results folder.

source: Ich bin nicht sicher was ich antworten soll
expected: I'm not really sure about the answer
got: I'm not sure what I'm going to answer

source: Was macht den Unterschied aus
expected: What makes his story different
got: What makes a difference

source: Vielen Dank
expected: Thank you
got: Thank you

source: Das ist ein Baum
expected: This is a tree
got: So this is a tree
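To make the BLEU number more concrete, here is a small pure-Python sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty) applied to the four pairs above. This is not the evaluation code used in this project. It assumes lowercased whitespace tokenization and no smoothing, so its output is not comparable to the 16.7 reported above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.lower().split(), hyp.lower().split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)


if __name__ == "__main__":
    expected = ["I'm not really sure about the answer",
                "What makes his story different",
                "Thank you",
                "This is a tree"]
    got = ["I'm not sure what I'm going to answer",
           "What makes a difference",
           "Thank you",
           "So this is a tree"]
    print(100 * corpus_bleu(expected, got))
```

BLEU is usually reported over a whole test set. On only four sentences the score is dominated by the few 4-gram matches and says little about model quality.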
