PyTorch VQA implementation that achieved top performance in the (ECCV18) VizWiz Grand Challenge: Answering Visual Questions from Blind People. The code can be easily adapted for training on VQA 1.0/2.0 or any other dataset.
The implemented architecture is a variant of the VQA model described in Kazemi et al. (2017). Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering. Visual features are extracted using a ResNet-152 pretrained on ImageNet. Input questions are tokenized, embedded, and encoded with an LSTM. Image features and encoded questions are combined and used to compute multiple attention maps over the image features. The attended image features and the encoded questions are concatenated and finally fed to a 2-layer classifier that outputs probabilities over the answers (classes).
More information about the attention module can be found in Yang et al. (2015). Stacked Attention Networks for Image Question Answering.
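The pipeline described above can be sketched in PyTorch roughly as follows. This is an illustrative sketch, not the repository's exact code: the layer sizes, glimpse count, and module names are assumptions chosen to match the common "Show, Ask, Attend, and Answer" configuration (2048-d ResNet feature maps, 1024-d LSTM state, two attention glimpses).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Computes `glimpses` attention maps over spatial image features."""
    def __init__(self, v_dim, q_dim, mid_dim=512, glimpses=2):
        super().__init__()
        self.v_conv = nn.Conv2d(v_dim, mid_dim, 1)   # project image features
        self.q_lin = nn.Linear(q_dim, mid_dim)       # project question encoding
        self.out = nn.Conv2d(mid_dim, glimpses, 1)   # one map per glimpse

    def forward(self, v, q):
        # v: (B, v_dim, H, W), q: (B, q_dim)
        a = self.out(F.relu(self.v_conv(v) + self.q_lin(q)[:, :, None, None]))
        b, g, h, w = a.shape
        a = F.softmax(a.view(b, g, -1), dim=2).view(b, g, 1, h, w)
        # weighted sum of image features per glimpse -> (B, glimpses * v_dim)
        return (a * v.unsqueeze(1)).sum(dim=(3, 4)).view(b, -1)

class VQABaseline(nn.Module):
    def __init__(self, vocab_size, num_answers,
                 v_dim=2048, emb_dim=300, q_dim=1024, glimpses=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, q_dim, batch_first=True)
        self.attention = Attention(v_dim, q_dim, glimpses=glimpses)
        self.classifier = nn.Sequential(
            nn.Linear(glimpses * v_dim + q_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_answers),
        )

    def forward(self, v, q_tokens):
        # q_tokens: (B, T) token ids; v: (B, 2048, 14, 14) ResNet features
        _, (h, _) = self.lstm(self.embed(q_tokens))
        q = h[-1]                        # final LSTM hidden state
        v = F.normalize(v, p=2, dim=1)   # l2-normalize image features
        fused = torch.cat([self.attention(v, q), q], dim=1)
        return self.classifier(fused)    # logits over answer classes
```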
In order to consider all 10 answers given by the annotators, we exploit a Soft Cross-Entropy loss: a weighted average of the negative log-probabilities of each unique ground-truth answer. This loss function aligns better with the VQA evaluation metric used for the challenge submissions.
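A minimal sketch of such a loss, assuming each target row holds one weight per answer class (e.g. the fraction of the 10 annotators who gave that answer) and sums to 1:

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, target_scores):
    """Weighted average of negative log-probabilities over answer classes.

    logits:        (batch, num_answers) raw model outputs
    target_scores: (batch, num_answers) per-answer weights summing to 1 per row
    """
    log_probs = F.log_softmax(logits, dim=1)
    return -(target_scores * log_probs).sum(dim=1).mean()
```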
```
conda create --name viz_env python=3.6
source activate viz_env
pip install -r requirements.txt
```
```
wget https://ivc.ischool.utexas.edu/VizWiz/data/VizWiz_data_ver1.tar.gz
tar -xzf VizWiz_data_ver1.tar.gz
```
After unpacking the dataset, the Image folder will contain files with prefix
Those files should be removed before extracting the image features:
Set the paths to the downloaded data in the yaml configuration file
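The configuration might look roughly like the following; the key names and paths below are illustrative assumptions, so check the actual yaml file shipped with the repository:

```yaml
# Hypothetical layout -- adapt keys and paths to the repository's config file.
images:
  dir: /data/VizWiz/Images
annotations:
  train: /data/VizWiz/Annotations/train.json
  val: /data/VizWiz/Annotations/val.json
  test: /data/VizWiz/Annotations/test.json
features:
  path: /data/VizWiz/resnet152_features.h5
logs:
  dir: logs/
```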
Extract features from the input images (~26GB). The script will extract two types of features from the images:
Our model will use only the "Attention" features. However, it is possible to extend the implementation by designing new models that do not use attention mechanisms.
During training, two checkpoints are saved: the model with the highest validation accuracy and the model with the lowest validation loss.
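The checkpointing logic can be sketched as follows; the helper and file names are illustrative assumptions, not the repository's actual training loop:

```python
import os
import torch

def save_best_checkpoints(model, epoch, val_acc, val_loss, best, log_dir):
    """Keep two checkpoints: highest val accuracy and lowest val loss.

    `best` is a dict tracking the best metrics seen so far; it is updated
    in place so the caller can reuse it across epochs.
    """
    os.makedirs(log_dir, exist_ok=True)
    state = {"epoch": epoch, "model": model.state_dict(),
             "val_acc": val_acc, "val_loss": val_loss}
    if val_acc > best.get("acc", float("-inf")):
        best["acc"] = val_acc
        torch.save(state, os.path.join(log_dir, "best_accuracy.pt"))
    if val_loss < best.get("loss", float("inf")):
        best["loss"] = val_loss
        torch.save(state, os.path.join(log_dir, "best_loss.pt"))
```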
The path of the log directory is specified in the yaml configuration file