VIG Dataset

This repository contains the Visually Indicated sound Generation (VIG) dataset introduced in Visually Indicated Sound Generation by Perceptually Optimized Classification (Best Paper at the ECCV 2018 MULA Workshop).


Visually indicated sound generation aims to predict sound that is consistent with the visual content of a video. Previous work on visually indicated sounds addressed this problem with a single generative model that ignores the distinctive characteristics of different sound categories. Today, state-of-the-art sound classification networks can capture semantic-level information in the audio modality, and that information can also serve the purpose of visually indicated sound generation.


We explore generating fine-grained sound from a variety of sound classes, and leverage pre-trained sound classification networks to improve the audio generation quality. We propose a novel Perceptually Optimized Classification based Audio generation Network (POCAN), which generates sound conditioned on the sound class predicted from visual information. Additionally, a perceptual loss is calculated via a pre-trained sound classification network to align the semantic information between the generated sound and its ground truth during training. The framework of POCAN is shown below.
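The perceptual loss described above can be illustrated as an L2 distance between the embeddings that a pre-trained classifier assigns to the generated sound and to its ground truth. The sketch below is a minimal NumPy illustration, not the paper's implementation: the linear `extractor` is a stand-in for the real pre-trained sound classification network, whose architecture and feature layer are specified in the paper, not here.

```python
import numpy as np

def perceptual_loss(generated, ground_truth, feature_extractor):
    """L2 distance between classifier features of two sounds.

    `feature_extractor` stands in for a pre-trained sound
    classification network (an assumption for illustration only).
    """
    f_gen = feature_extractor(generated)
    f_gt = feature_extractor(ground_truth)
    return float(np.sum((f_gen - f_gt) ** 2))

# Toy stand-in: a fixed random linear projection as "features".
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 128))
extractor = lambda x: W @ x

gen = rng.standard_normal(128)   # placeholder for a generated waveform feature
gt = rng.standard_normal(128)    # placeholder for its ground-truth counterpart
loss = perceptual_loss(gen, gt, extractor)
print(loss)
```

During training this term would be added to the generation loss so that the generated sound and the ground truth agree at the semantic level, not just sample by sample.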

VIG dataset download

Data processing is based on Python 2.7.

We provide the YouTube ID for each video in the file vig_dl.lst. You may use tools like youtube-dl to download these videos (a sample download script is provided in this repository). In vig_dl.lst, each YouTube video is mapped to a file ID on each line. The files vig_train.lst and vig_test.lst specify the training and test videos by these file IDs, respectively. For annotations, we provide the start time (key name start_time), end time (key name end_time), and sound class label (key name vig_label) in the file vig_annotation.pkl. Each sound class is mapped to a class ID in the annotation file; the map between class name and class ID is provided in the file vig_class_map.pkl.
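A download helper along the lines described above might look like the following. This is a hedged sketch, not the repository's sample script: it assumes each line of vig_dl.lst holds a file ID followed by a YouTube ID, whitespace-separated (check the actual file, as the exact column order is not stated here), and it is written for Python 3 even though the repository targets Python 2.7.

```python
def parse_vig_list(lines):
    """Return {file_id: youtube_id} from vig_dl.lst-style lines.

    Assumed line format (not confirmed by the README):
        <file_id> <youtube_id>
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        file_id, youtube_id = line.split()[:2]
        mapping[file_id] = youtube_id
    return mapping

def download_command(youtube_id, out_path):
    """Build a youtube-dl command for one video (not executed here)."""
    url = "https://www.youtube.com/watch?v=" + youtube_id
    return ["youtube-dl", "-o", out_path, url]

# Toy input with made-up IDs, standing in for the real vig_dl.lst.
sample = ["vig_0001 AAAAAAAAAAA", "vig_0002 BBBBBBBBBBB"]
ids = parse_vig_list(sample)
cmd = download_command(ids["vig_0001"], "vig_0001.mp4")
print(cmd)
```

The commands could then be run with `subprocess.run`, and the start/end times from vig_annotation.pkl used to trim each downloaded video to its annotated clip.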

Some demo video clips as well as sound waveform and spectrogram are shown in the figure below.

Performance of POCAN on VIG

We choose recall at top K (R@K) as the metric for retrieving the correct sound from the VIG test set. The performance of POCAN is listed in the table below.

Model              K = 1     K = 5     K = 10
Owens et al. [1]   0.0997    0.2888    0.4640
POCAN              0.1223    0.3625    0.4802
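The R@K metric above can be computed as follows. This sketch assumes the standard retrieval setup with one relevant (ground-truth) sound per query, which matches the description above; the variable names are illustrative only.

```python
def recall_at_k(ranked_candidates, ground_truth, k):
    """Fraction of queries whose ground-truth item appears in the
    top-k retrieved candidates (one relevant item per query)."""
    hits = sum(1 for ranked, gt in zip(ranked_candidates, ground_truth)
               if gt in ranked[:k])
    return hits / len(ground_truth)

# Toy example: 4 queries, each with a ranked list of candidate sound IDs.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["a", "c", "b"]]
truth = ["a", "a", "a", "b"]
print(recall_at_k(ranked, truth, 1))  # 0.25
print(recall_at_k(ranked, truth, 2))  # 0.5
```

R@K is non-decreasing in K, which is why every row in the table grows from K = 1 to K = 10.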

More details can be found in the paper.


If you find this repository useful for your research, please consider citing the following work:

@inproceedings{chen2018visually,
  title={Visually Indicated Sound Generation by Perceptually Optimized Classification},
  author={Chen*, Kan and Zhang*, Chuanxi and Fang, Chen and Wang, Zhaowen and Bui, Trung and Nevatia, Ram},
  booktitle={ECCV MULA Workshop},
  year={2018}
}


[1] Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." In CVPR, 2016.
