

This repository contains code for the paper:

CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach
[PDF] [ArXiv] [Code]
Oral Presentation
Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019

If you find this code useful, consider citing our work:

	title  = {CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog},  
	author = {Kottur, Satwik and Moura, Jos\'e M. F. and Parikh, Devi and   
	          Batra, Dhruv and Rohrbach, Marcus},  
	journal = {arXiv preprint arXiv:1903.03166},
	year   = {2019}  


Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible as it requires prohibitively-expensive complete annotation of the 'state' of all images and dialogs.

We develop CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for each of about 85k CLEVR images, for a total of 4.25M question-answer pairs.
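The dataset size quoted above follows directly from the three numbers in the text. A minimal sketch of the arithmetic, assuming the rounded figure of 85k images (the exact image count is not stated here, and the function name is purely illustrative):

```python
# Rounded figures taken from the description above; 85_000 is the
# "about 85k" approximation, not an exact image count.
NUM_IMAGES = 85_000
DIALOGS_PER_IMAGE = 5
ROUNDS_PER_DIALOG = 10


def total_qa_pairs(num_images, dialogs_per_image, rounds_per_dialog):
    """Each round contributes exactly one question-answer pair."""
    return num_images * dialogs_per_image * rounds_per_dialog


if __name__ == "__main__":
    total = total_qa_pairs(NUM_IMAGES, DIALOGS_PER_IMAGE, ROUNDS_PER_DIALOG)
    print(f"{total:,} question-answer pairs")  # 4,250,000
```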

We use CLEVR-Dialog to benchmark the performance of standard visual dialog models, in particular on visual coreference resolution as a function of the coreference distance. This is the first analysis of its kind for visual dialog models, and it was not possible without this dataset. We hope the findings from CLEVR-Dialog will help inform the development of future models for visual dialog.

This repository generates a version of our diagnostic dataset CLEVR-Dialog (figure above).


The code is written in Python 3 and has the following package dependencies:

pip install absl-py
# json is part of the Python standard library and needs no installation
pip install tqdm
pip install numpy

Directory Structure

The repository contains the following files:

  • Main script to generate the dataset
  • List of constraints for caption and question generation
  • Utility functions for dialog generation
  • List of global variables along with initialization

In addition, the dataset generation code requires the following files:

  • templates/synonyms.json: Compilation of words and their synonyms
  • templates/metainfo.json: Contains information about attributes and their values for CLEVR objects
  • templates/captions and templates/questions: Caption and question templates respectively.

CLEVR Images

Our dataset is built on CLEVR images, which can be downloaded from here. Extract the images and scene JSON files into the data/ folder. We use only the CLEVR train and val splits, as scene JSON files are unavailable for the test split.
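Before running generation, it can help to confirm that the scene files ended up where they are expected. The sketch below is not part of the repository. It assumes the layout implied by the example command further down (`scenes/CLEVR_train_scenes.json` under the data root) and assumes the val file follows the same naming pattern; `missing_scene_files` is a hypothetical helper name.

```python
import os
import tempfile


def missing_scene_files(data_root, splits=("train", "val")):
    """Return the expected scene JSON paths that are absent under data_root.

    Assumed layout: <data_root>/scenes/CLEVR_<split>_scenes.json
    """
    expected = [
        os.path.join(data_root, "scenes", f"CLEVR_{split}_scenes.json")
        for split in splits
    ]
    return [path for path in expected if not os.path.isfile(path)]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory that only contains the train file.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "scenes"))
        train_path = os.path.join(root, "scenes", "CLEVR_train_scenes.json")
        with open(train_path, "w") as handle:
            handle.write("{}")
        for path in missing_scene_files(root):
            print("missing:", os.path.basename(path))
```

In practice you would call `missing_scene_files("data/")` and proceed only if the returned list is empty.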

Generating CLEVR-Dialog Dataset

To generate the dataset, please check Additional details about the supported flags can be found in An example command is shown below:

python -u \
	--scene_path=${DATA_ROOT}"scenes/CLEVR_train_scenes.json" \
	--num_beams=100 \
	--num_workers=1 \
	--save_path=${DATA_ROOT}"clevr_dialog_train_raw.json"

CLEVR-Dialog Annotations

The generated JSON contains a list of dialogs on CLEVR images with the following fields:

  • split: Specifies if the CLEVR split is train/val.
  • image_index: CLEVR image index.
  • image_filename: CLEVR image filename.
  • dialogs: List of dialog instances for a given image, each with the following fields:
       |--caption: Caption for the dialog instance
       |--template_info: Template information for the dialog (caption + 10 questions)
       |--dialog: Text for the ten rounds of dialog, each with the following fields:
             |--question: Question text for the current round
             |--answer: Answer text for the current round
             |--template: Question template for the current round
       |--graph: Scene graph information for the dialog, with the following fields:
             |--objects: Objects with attributes discussed in the dialog
             |--counts: Specific object counts discussed in the dialog
             |--relationships: Object relationships discussed in the dialog
             |--exists: Object existences discussed in the dialog
             |--history: List of incremental scene graph information conveyed in each round
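The field layout above can be traversed with a few nested loops. The following is a minimal sketch rather than code from this repository: it uses a small hand-written sample that mimics the documented fields (the values, including the template names, are made up for illustration), and `iter_rounds` and `summarize` are hypothetical helper names. To use it on a generated file, replace the sample with `json.load` on the file passed as `--save_path`.

```python
import json
from collections import Counter

# Hand-written sample following the documented layout. All values are
# illustrative placeholders, not entries from the released dataset.
SAMPLE = [
    {
        "split": "train",
        "image_index": 0,
        "image_filename": "CLEVR_train_000000.png",
        "dialogs": [
            {
                "caption": "there is a red cube",
                "dialog": [
                    {
                        "question": "what is its size?",
                        "answer": "large",
                        "template": "example-template-a",
                    },
                    {
                        "question": "how many other objects are there?",
                        "answer": "3",
                        "template": "example-template-b",
                    },
                ],
            }
        ],
    }
]


def iter_rounds(annotations):
    """Yield (image_filename, caption, round_index, question, answer)."""
    for image_entry in annotations:
        for instance in image_entry["dialogs"]:
            for round_index, turn in enumerate(instance["dialog"]):
                yield (
                    image_entry["image_filename"],
                    instance["caption"],
                    round_index,
                    turn["question"],
                    turn["answer"],
                )


def summarize(annotations):
    """Count images, dialog instances, QA pairs, and answer frequencies."""
    rounds = list(iter_rounds(annotations))
    return {
        "num_images": len(annotations),
        "num_dialogs": sum(len(entry["dialogs"]) for entry in annotations),
        "num_qa_pairs": len(rounds),
        "answer_counts": Counter(row[4] for row in rounds),
    }


if __name__ == "__main__":
    # Round-trip through JSON to mirror reading a generated file from disk.
    annotations = json.loads(json.dumps(SAMPLE))
    summary = summarize(annotations)
    print("images:", summary["num_images"])
    print("dialogs:", summary["num_dialogs"])
    print("QA pairs:", summary["num_qa_pairs"])
```

Keeping the traversal in a generator means the same loop can feed statistics, filtering by template, or conversion to another format without loading extra copies of the data.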

The dataset used in the paper can be downloaded here: train and val splits.


For any questions, please feel free to contact the above contributor(s).


This project is licensed under the license found in the LICENSE file in the root directory of this source tree (here).
