Subword Nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation

Subword Neural Machine Translation

This repository contains preprocessing scripts to segment text into subword units. The primary purpose is to facilitate the reproduction of our experiments on Neural Machine Translation with subword units (see below for reference).


install via pip (from PyPI):

pip install subword-nmt

install via pip (from Github):

pip install

alternatively, clone this repository; the scripts are executable stand-alone.


Check the individual files for usage instructions.

To apply byte pair encoding to word segmentation, invoke these commands:

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}
subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}
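As a rough illustration of what learning BPE merge operations involves, the toy sketch below repeatedly merges the most frequent adjacent symbol pair. It is a simplified stand-in and not the package's implementation: the function name learn_bpe_toy, the tie-breaking rule and the sample corpus are assumptions made for this example.

```python
from collections import Counter


def learn_bpe_toy(lines, num_operations):
    """Toy BPE learner (hypothetical helper, not the subword-nmt code)."""
    # Represent each word as a tuple of symbols; the last symbol carries
    # the end-of-word marker </w>.
    vocab = Counter()
    for line in lines:
        for word in line.split():
            vocab[tuple(word[:-1]) + (word[-1] + "</w>",)] += 1

    merges = []
    for _ in range(num_operations):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # here (an assumption) so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the pair by the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    for left, right in learn_bpe_toy(["low low low lower"], 2):
        print(left, right)
```

Each returned pair corresponds to one merge operation, so num_operations plays the role of the -s argument above.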

To segment rare words into character n-grams, do the following:

subword-nmt get-vocab --train_file {train_file} --vocab_file {vocab_file}
subword-nmt segment-char-ngrams --vocab {vocab_file} -n {order} --shortlist {size} < {test_file} > {out_file}
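The idea behind character n-gram segmentation can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions: the shortlist is taken to be a set of frequent words that are left intact, every other word is cut into chunks of n characters joined by the @@ separator, and the function name segment_char_ngrams_sketch is hypothetical.

```python
def segment_char_ngrams_sketch(line, shortlist, n, separator="@@"):
    """Split words outside the shortlist into character n-grams.

    Hypothetical helper for illustration; not the subword-nmt code.
    """
    out = []
    for word in line.split():
        if word in shortlist:
            # Frequent words are passed through unchanged.
            out.append(word)
            continue
        # Cut the word into consecutive chunks of at most n characters.
        chunks = [word[i:i + n] for i in range(0, len(word), n)]
        # Every chunk except the last is marked as word-internal.
        out.extend(chunk + separator for chunk in chunks[:-1])
        out.append(chunks[-1])
    return " ".join(out)


if __name__ == "__main__":
    print(segment_char_ngrams_sketch("the translation", {"the"}, 3))
```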

The original segmentation can be restored with a simple replacement:

sed -r 's/(@@ )|(@@ ?$)//g'
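If sed is not available, the same replacement can be done in Python. The sketch below applies the regular expression from the sed command to one line at a time; the function name remove_bpe is an assumption made for this example.

```python
import re


def remove_bpe(line, separator="@@"):
    """Undo subword segmentation on a single line (without its newline)."""
    # Same pattern as the sed command: drop the separator followed by a
    # space, and a separator left dangling at the end of the line.
    pattern = r"({0} )|({0} ?$)".format(re.escape(separator))
    return re.sub(pattern, "", line)


if __name__ == "__main__":
    print(remove_bpe("I am fly@@ ing at no@@ on ."))
```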

If you cloned the repository and did not install a package, you can also run the individual commands as scripts:

./subword_nmt/ -s {num_operations} < {train_file} > {codes_file}


We found that for languages that share an alphabet, learning BPE on the concatenation of the (two or more) involved languages increases the consistency of segmentation, and reduces the problem of inserting/deleting characters when copying/transliterating names.

However, this introduces undesirable edge cases in that a word may be segmented in a way that has only been observed in the other language, and is thus unknown at test time. To prevent this, subword-nmt apply-bpe accepts a --vocabulary and a --vocabulary-threshold option so that it only produces symbols which also appear in the vocabulary (with at least some frequency).

To use this functionality, we recommend the following recipe (assuming L1 and L2 are the two languages):

Learn byte pair encoding on the concatenation of the training text, and get resulting vocabulary for each:

cat {train_file}.L1 {train_file}.L2 | subword-nmt learn-bpe -s {num_operations} -o {codes_file}
subword-nmt apply-bpe -c {codes_file} < {train_file}.L1 | subword-nmt get-vocab > {vocab_file}.L1
subword-nmt apply-bpe -c {codes_file} < {train_file}.L2 | subword-nmt get-vocab > {vocab_file}.L2

more conveniently, you can do the same with this command:

subword-nmt learn-joint-bpe-and-vocab --input {train_file}.L1 {train_file}.L2 -s {num_operations} -o {codes_file} --write-vocabulary {vocab_file}.L1 {vocab_file}.L2

re-apply byte pair encoding with vocabulary filter:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {train_file}.L1 > {train_file}.BPE.L1
subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L2 --vocabulary-threshold 50 < {train_file}.L2 > {train_file}.BPE.L2

as a last step, extract the vocabulary to be used by the neural network. Example with Nematus:

nematus/data/ {train_file}.BPE.L1 {train_file}.BPE.L2

[you may want to take the union of all vocabularies to support multilingual systems]

for test/dev data, re-use the same options for consistency:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {test_file}.L1 > {test_file}.BPE.L1
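To make the effect of the threshold concrete, the following sketch loads a vocabulary and reports which symbols of a segmented line fall outside it. It assumes that each vocabulary line holds a symbol and its frequency separated by a single space; the helper names load_vocab and out_of_vocabulary and the sample counts are hypothetical.

```python
def load_vocab(lines, threshold=None):
    """Return the set of symbols whose frequency reaches the threshold.

    Assumes one "symbol frequency" pair per line (an assumption about
    the vocabulary file format, not taken from the text above).
    """
    vocabulary = set()
    for line in lines:
        symbol, freq = line.strip("\r\n ").split(" ")
        if threshold is None or int(freq) >= threshold:
            vocabulary.add(symbol)
    return vocabulary


def out_of_vocabulary(segmented_line, vocabulary):
    """List the symbols of a segmented line that the filter would reject."""
    return [tok for tok in segmented_line.split() if tok not in vocabulary]


if __name__ == "__main__":
    vocab = load_vocab(["the 1200", "trans@@ 75", "lation 40"], threshold=50)
    print(out_of_vocabulary("the trans@@ lation", vocab))
```

With a threshold of 50, the symbol lation (frequency 40) is treated as unknown, so a filtered segmentation would have to fall back to smaller units for it.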


On top of the basic BPE implementation, this repository supports:

  • BPE dropout (Provilkov, Emelianenko and Voita, 2019): use the argument --dropout 0.1 for subword-nmt apply-bpe to randomly drop out possible merges. Doing this on the training corpus can improve quality of the final system; at test time, use BPE without dropout. In order to obtain reproducible results, argument --seed can be used to set the random seed.

    Note: In the original paper, the authors applied BPE-Dropout to each new batch separately. To approximate this behavior, you can copy the training corpus several times, so that the same sentence receives multiple different segmentations.

  • support for glossaries: use the argument --glossaries for subword-nmt apply-bpe to provide a list of subwords and/or regular expressions that should always be passed to the output without subword segmentation

echo "I am flying to <country>Switzerland</country> at noon ." | subword-nmt apply-bpe --codes subword_nmt/tests/data/bpe.ref
I am fl@@ y@@ ing to <@@ coun@@ tr@@ y@@ >@@ S@@ w@@ it@@ z@@ er@@ l@@ and@@ <@@ /@@ coun@@ tr@@ y@@ > at no@@ on .

echo "I am flying to <country>Switzerland</country> at noon ." | subword-nmt apply-bpe --codes subword_nmt/tests/data/bpe.ref --glossaries "<country>\w*</country>" "fly"
I am fly@@ ing to <country>Switzerland</country> at no@@ on .
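One way to picture the glossary mechanism is to split each word around glossary matches before any BPE merges are applied, so that the matched pieces can be passed through untouched. The sketch below does this with re.split and a capturing group. It is a simplification under the assumption that glossary entries contain no capturing groups of their own, and the function name isolate_glossaries is hypothetical.

```python
import re


def isolate_glossaries(word, glossaries):
    """Split a word into (piece, is_glossary) pairs.

    Hypothetical helper for illustration; not the subword-nmt code.
    Assumes the glossary patterns contain no capturing groups.
    """
    # A single capturing group makes re.split keep the matched text.
    pattern = re.compile("(" + "|".join(glossaries) + ")")
    pieces = [piece for piece in pattern.split(word) if piece]
    # Pieces that fully match a glossary entry are protected from BPE.
    return [(piece, pattern.fullmatch(piece) is not None) for piece in pieces]


if __name__ == "__main__":
    glossaries = [r"<country>\w*</country>", "fly"]
    for word in "I am flying to <country>Switzerland</country>".split():
        print(isolate_glossaries(word, glossaries))
```

Only the pieces flagged False would then be segmented further.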


The segmentation methods are described in:

Rico Sennrich, Barry Haddow and Alexandra Birch (2016): Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.


This repository implements the subword segmentation as described in Sennrich et al. (2016), but since version 0.2, there is one core difference related to end-of-word tokens.

In Sennrich et al. (2016), the end-of-word token </w> is initially represented as a separate token, which can be merged with other subwords over time:

u n d </w>
f u n d </w>

Since 0.2, end-of-word tokens are initially concatenated with the word-final character:

u n d</w>
f u n d</w>

The new representation ensures that when BPE codes are learned from the above examples and then applied to new text, the subword unit und is unambiguously word-final and un is unambiguously word-internal. This prevents each BPE merge operation from producing up to two different subword units. The implementation is backward-compatible and continues to accept old-style BPE files. New-style BPE files are identified by having the following first line: #version: 0.2
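The two initial representations can be written down directly. The sketch below builds the starting symbol sequence for a word in either style and checks a codes file for the version header; both function names are hypothetical, and the header check is a simplified reading of the rule stated above.

```python
def initial_symbols(word, new_style=True):
    """Return the symbol sequence a word starts with before any merges."""
    if new_style:
        # Since 0.2: </w> is attached to the word-final character.
        return tuple(word[:-1]) + (word[-1] + "</w>",)
    # Sennrich et al. (2016): </w> starts out as a separate symbol.
    return tuple(word) + ("</w>",)


def is_new_style(codes_lines):
    """Check whether the first line of a codes file is the 0.2 header."""
    return bool(codes_lines) and codes_lines[0].strip() == "#version: 0.2"


if __name__ == "__main__":
    print(initial_symbols("und", new_style=False))
    print(initial_symbols("und"))
    print(is_new_style(["#version: 0.2\n", "u n\n"]))
```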


This project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland, and from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).
