Gutenberg Dialog

Build a dialog dataset from online books in many languages

gutenberg-dialog · twitter

Code for downloading and building your own version of the Gutenberg Dialog Dataset. Easily extendable with new languages. Try trained chatbots in various languages here:

Download datasets

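| Download link | Number of utterances | Average utterance length | Number of dialogues | Average dialogue length |
|---------------|----------------------|--------------------------|---------------------|-------------------------|
| English       | 14 773 741           | 22.17                    | 2 526 877           | 5.85                    |
| German        | 226 015              | 24.44                    | 43 440              | 5.20                    |
| Dutch         | 129 471              | 24.26                    | 23 541              | 5.50                    |
| Spanish       | 58 174               | 18.62                    | 6 912               | 8.42                    |
| Italian       | 41 388               | 19.47                    | 6 664               | 6.21                    |
| Hungarian     | 18 816               | 14.68                    | 2 826               | 6.66                    |
| Portuguese    | 16 228               | 21.40                    | 2 233               | 7.27                    |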

Download resources for The Gutenberg Dialogue Dataset paper

Download responses from GPT2 trainings here

Download data used in the paper here

Download trained models here

The gpt2_training_scripts folder contains code for running the trainings from the paper. Code adapted from here.


🔀   Generate your own dataset by tuning parameters affecting the size-quality trade-off of the dataset
🚀   The modular interface makes it easy to extend the dataset to other languages
💾   You can easily exclude books manually when building the dataset


Run which installs required packages.



The main file should be called from the root of the repo. The command below runs the dataset building pipeline for the comma-separated languages given as argument. Currently English, German, Dutch, Spanish, Portuguese, Italian, and Hungarian are supported.

python code/ -l=en,de,nl,es,pt,it,hu -a

All settable arguments can be seen below:

Pipeline steps

The -a flag controls whether to run the whole pipeline automatically. If -a is omitted, the steps to run have to be specified using flags (see help above). Once a step has finished, its output can be used in subsequent steps, and the step only has to be run again if parameters or code related to it are changed. All steps run separately for each language.
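As an illustration of how the step flags combine, here is a minimal argparse sketch. Only the flag names (-l, -a, -d, -f1, -e, -f2, -c) come from this README; the destination names, the `build_parser` and `steps_to_run` helpers, and the step ordering list are assumptions made for the example and are not the project's actual argument parser.

```python
import argparse

# Step names in pipeline order (hypothetical names chosen for this sketch).
STEP_FLAGS = ["download", "pre_filter", "extract", "post_filter", "create_dataset"]


def build_parser():
    """Build a parser that mirrors the flags described in this README."""
    parser = argparse.ArgumentParser(description="Sketch of the pipeline flags")
    # Comma-separated language codes, e.g. -l=en,de
    parser.add_argument("-l", dest="languages", required=True,
                        type=lambda value: value.split(","))
    parser.add_argument("-a", dest="run_all", action="store_true")
    parser.add_argument("-d", dest="download", action="store_true")
    parser.add_argument("-f1", dest="pre_filter", action="store_true")
    parser.add_argument("-e", dest="extract", action="store_true")
    parser.add_argument("-f2", dest="post_filter", action="store_true")
    parser.add_argument("-c", dest="create_dataset", action="store_true")
    return parser


def steps_to_run(args):
    """Return all steps when -a is given, otherwise only the selected ones."""
    if args.run_all:
        return list(STEP_FLAGS)
    return [step for step in STEP_FLAGS if getattr(args, step)]
```

With this sketch, parsing `-l=en,de -d -f1` selects only the download and pre-filter steps for English and German, while adding `-a` selects all five steps.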

1. Download (-d)

Download books for given languages.

Note: if all books fail to download with the error "Could not download book", a likely cause is that the default mirror used by the gutenberg package has become inaccessible. In that case, an alternate mirror can be selected via the GUTENBERG_MIRROR environment variable. For example:

python code/ ...
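The same variable can also be set from Python before launching the pipeline. The sketch below only prepares an environment mapping; the mirror URL is a placeholder and `env_with_mirror` is a hypothetical helper, neither of which comes from the project.

```python
import os

# Placeholder only: replace with a real mirror URL of your choice.
EXAMPLE_MIRROR = "https://mirror.example.org/gutenberg"


def env_with_mirror(mirror_url, base_env=None):
    """Return a copy of the environment with GUTENBERG_MIRROR set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GUTENBERG_MIRROR"] = mirror_url
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the download step, which leaves the environment of the current process untouched.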

2. Pre-filter (-f1)

Pre-filtering removes some old books and noise.

3. Extract (-e)

Dialogs are extracted from books. When extending the dataset to new languages (see section below), this is the step that can be modified, so the previous steps can be skipped once they have finished.

4. Post-filter (-f2)

A second filtering step removing some dialogs based on vocabulary.

5. Create dataset and manual filtering (-c)

This step puts together the final dataset and splits it into train/dev/test data. It also creates the author_and_title.txt file in the output directory, which lists all books (with titles and authors) used to extract the final dataset. To exclude books from the dataset, manually copy the corresponding lines from this file to banned_books.txt. In subsequent runs of any step, books listed in banned_books.txt will not be taken into account.
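Copying lines by hand gets tedious for long lists, so the sketch below automates it. It treats every line of author_and_title.txt as an opaque string, because the exact line format is not described here, and it takes the location of banned_books.txt as a parameter for the same reason. The `ban_books` helper and its keyword matching are assumptions for illustration, not part of the project.

```python
from pathlib import Path


def ban_books(output_dir, banned_path, keywords):
    """Append lines of author_and_title.txt that contain any keyword to the ban list.

    Returns the lines that were newly added.
    """
    source = Path(output_dir) / "author_and_title.txt"
    banned_file = Path(banned_path)
    already_banned = set()
    if banned_file.exists():
        already_banned = set(banned_file.read_text(encoding="utf-8").splitlines())

    added = []
    for line in source.read_text(encoding="utf-8").splitlines():
        if line in already_banned:
            continue  # do not duplicate existing entries
        if any(keyword.lower() in line.lower() for keyword in keywords):
            added.append(line)
            already_banned.add(line)

    if added:
        # Lines are copied verbatim so they match what the pipeline wrote.
        with banned_file.open("a", encoding="utf-8") as handle:
            for line in added:
                handle.write(line + "\n")
    return added
```

Matching is case-insensitive and substring-based, so a keyword can be part of an author name or of a title.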

Extending to other languages

The code can easily be extended to process other languages. Create a file named <language code>.py in the languages folder and define in it a class named after the capitalized language code (e.g. En for English), with LANG or any of the other subclasses as parent. Config parameters can be accessed through self.cfg. The three functions below have to be defined inside this class. Please see for an example.

Languages statistics


This function should return a dictionary whose keys are potential delimiters and whose values are functions, one per delimiter. Each function takes a line as input and returns a number, for example the count of delimiters in the line or a flag indicating whether the line contains the delimiter. Usually a weighted count is advisable, depending on the importance of the different delimiters. The values are used to determine the delimiter that should be used in the respective book (passed to the function below), and to filter out books that contain few delimiters. contains examples of multiple delimiters.
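A minimal sketch of such a dictionary is shown below, written as standalone functions rather than class methods. The function names, the two delimiters, and their weights are illustrative assumptions, and summing the per-line values to pick a delimiter is one plausible reading of the description above, not necessarily what the project does.

```python
def example_delimiters():
    """Map each candidate delimiter to a function scoring a single line."""
    return {
        # Straight double quotes come in pairs, so count each pair once.
        '"': lambda line: line.count('"') / 2,
        # A leading dash marks at most one utterance per line.
        "--": lambda line: 1 if line.lstrip().startswith("--") else 0,
    }


def pick_delimiter(lines, delimiter_functions):
    """Sum the scores over all lines and return the best delimiter with the totals."""
    totals = {
        delimiter: sum(score(line) for line in lines)
        for delimiter, score in delimiter_functions.items()
    }
    return max(totals, key=totals.get), totals
```

The totals can also serve as the basis for dropping books in which even the best delimiter scores very low.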

process_file(paragraph_list, delimiter)

This function should extract the dialogs from a book and append them to self.dialogs, which is a list of dialogs, and each dialog is a list of consecutive utterances. paragraph_list contains the book as a list of consecutive paragraphs. delimiter is the most common delimiter in this file which should be used to extract dialogs.
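The sketch below shows one simple way such an extraction could work, as a standalone function that returns the dialogs instead of appending to self.dialogs. It assumes the delimiter is a symmetric quote character and that a paragraph without any quoted text ends the current dialog; both are simplifications chosen for the example, and `extract_dialogs` is a hypothetical name.

```python
def extract_dialogs(paragraph_list, delimiter='"'):
    """Group quoted text from consecutive paragraphs into dialogs."""
    dialogs = []
    current = []
    for paragraph in paragraph_list:
        parts = paragraph.split(delimiter)
        # Odd-indexed parts lie between an opening and a closing delimiter.
        quoted = [part.strip() for part in parts[1::2] if part.strip()]
        if quoted:
            # All quoted pieces of one paragraph form a single utterance.
            current.append(" ".join(quoted))
        else:
            # A paragraph without quoted text closes the running dialog.
            if len(current) >= 2:
                dialogs.append(current)
            current = []
    if len(current) >= 2:
        dialogs.append(current)
    return dialogs
```

Dialogs with a single utterance are discarded here, since a lone quoted sentence carries no exchange.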


This function is used for post-processing dialogs (e.g. remove certain characters). It takes as input an utterance. Please note that nltk word tokenization is run automatically.
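As an example of this kind of post-processing, the sketch below strips a few markup characters and normalizes whitespace. The character set and the name `clean_utterance` are assumptions for illustration; which characters are worth removing depends on the language and the books.

```python
import re


def clean_utterance(utterance):
    """Remove emphasis and bracket characters, then collapse whitespace."""
    # Underscores and asterisks often mark emphasis in plain-text books.
    utterance = re.sub(r"[_*\[\]]", "", utterance)
    # Collapse runs of spaces, tabs, and newlines into a single space.
    utterance = re.sub(r"\s+", " ", utterance)
    return utterance.strip()
```

Tokenization is left out on purpose, since the pipeline already runs nltk word tokenization on its own.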



This project is licensed under the MIT License - see the LICENSE file for details.
Please include a link to this repo if you use the dataset or the code in your work, and consider citing the following paper:

    title = "The Gutenberg Dialogue Dataset",
    author = "Cs{\'a}ky, Rich{\'a}rd and Recski, G{\'a}bor",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics",
    month = apr,
    year = "2021",
    publisher = "Association for Computational Linguistics",
    url = "",