
Corpora Cleaning Tools

Tools for filtering and cleaning parallel and monolingual corpora in order to train better (neural) machine translation systems.

Inspired by the Data Filtering and Data Pre-processing sections of Tilde's WMT17 paper. This repository contains some of the more basic scripts, which help remove most of the junk from parallel corpora.

Tools included

  • parallel - tools for parallel corpora
  • mono - tools for monolingual corpora
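
To give a rough idea of what basic cleaning of a parallel corpus can look like, here is a minimal sketch. It is not one of the scripts in this repository: the function names `basic_parallel_filter` and `alpha_share`, the set of checks, and both thresholds are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def alpha_share(text: str) -> float:
    """Share of alphabetic characters among the non-space characters."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isalpha() for c in chars) / len(chars) if chars else 0.0


def basic_parallel_filter(
    pairs: Iterable[Pair],
    max_length_ratio: float = 3.0,  # hypothetical threshold, tune per language pair
    min_alpha_share: float = 0.5,   # hypothetical threshold
) -> Iterator[Pair]:
    """Yield sentence pairs that pass a few simple sanity checks."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs with an empty side.
        if not src or not tgt:
            continue
        # Drop untranslated pairs (source equals target).
        if src == tgt:
            continue
        # Drop exact duplicates of a pair that was already kept.
        if (src, tgt) in seen:
            continue
        # Drop pairs whose token counts differ too much.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_length_ratio:
            continue
        # Drop pairs that are mostly digits, punctuation or symbols.
        if min(alpha_share(src), alpha_share(tgt)) < min_alpha_share:
            continue
        seen.add((src, tgt))
        yield src, tgt


sample = [
    ("Hello world .", "Sveika , pasaule ."),           # kept
    ("Hello world .", "Sveika , pasaule ."),           # duplicate
    ("Copyright 2017", "Copyright 2017"),              # source equals target
    ("", "Tukša rinda"),                               # empty source
    ("Yes", "Jā , protams , tas ir pilnīgi skaidrs"),  # length ratio too high
    ("1 2 3 4 5 !", "1 , 2 , 3 !"),                    # no alphabetic characters
]
kept = list(basic_parallel_filter(sample))
print(kept)
```

The filter is written as a generator so that it can stream over large files line by line; only the set of kept pairs is held in memory for duplicate detection.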

Requirements

pip install subword-nmt
pip install langid
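
As an illustration of how a language identifier such as `langid` can be used for filtering, the sketch below keeps only pairs whose two sides are detected as the expected languages. The function `filter_by_language` and the toy classifier are hypothetical and exist only so that the example runs without extra packages; how the scripts in this repository actually call `langid` is not shown here.

```python
from typing import Callable, Iterable, Iterator, Tuple

Pair = Tuple[str, str]
# A classifier maps a string to a (language code, score) tuple,
# which is the shape that langid.classify returns.
Classifier = Callable[[str], Tuple[str, float]]


def filter_by_language(
    pairs: Iterable[Pair],
    src_lang: str,
    tgt_lang: str,
    classify: Classifier,
) -> Iterator[Pair]:
    """Yield pairs whose source and target are in the expected languages."""
    for src, tgt in pairs:
        if classify(src)[0] == src_lang and classify(tgt)[0] == tgt_lang:
            yield src, tgt


def toy_classify(text: str) -> Tuple[str, float]:
    """Stand-in classifier for the demo: Latvian diacritics mean 'lv', else 'en'."""
    is_latvian = any(c in "āčēģīķļņšūž" for c in text.lower())
    return ("lv" if is_latvian else "en", 1.0)


sample = [
    ("This is a house .", "Šī ir māja ."),           # en / lv: kept
    ("This is a house .", "This is a house too ."),  # target is not Latvian
    ("Šī ir māja .", "Šī ir māja ."),                # source is not English
]
kept = list(filter_by_language(sample, "en", "lv", toy_classify))
print(kept)
```

With `langid` installed, `langid.classify` can be passed in place of `toy_classify`. Restricting the candidate languages first with `langid.set_languages(["en", "lv"])` may make the predictions on short sentences more stable.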

Publications

If you use this tool, please cite the following paper:

Matīss Rikters (2018). "Impact of Corpora Quality on Neural Machine Translation." In Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018).

@inproceedings{Rikters2018BalticHLT,
	author = {Rikters, Matīss},
	booktitle = {Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018)},
	title = {{Impact of Corpora Quality on Neural Machine Translation}},
	address = {Tartu, Estonia},
	year = {2018}
}