3D-RetinaNet

3D-RetinaNet: a baseline model on the ROAD dataset

3D-RetinaNet for ROAD and UCF-24 datasets

This repository contains code for 3D-RetinaNet, a novel single-stage action detection network proposed along with the ROAD dataset. Our TPAMI paper contains a detailed description of 3D-RetinaNet and the ROAD dataset. This code covers training and evaluation for the ROAD and UCF-24 datasets.


Requirements

We need three things to get started with training: the datasets, Kinetics pre-trained weights, and PyTorch with torchvision and tensorboardX.

Dataset download and pre-process

Pytorch and weights

  • Install PyTorch and torchvision.
  • Install tensorboardX via pip install tensorboardx.
  • Pre-trained weights on Kinetics-400: download them by changing the current directory to kinetics-pt and running the bash file get_kinetics_weights.sh, or download them from Google-Drive. Name the folder kinetics-pt; it is important to name it right.

Training 3D-RetinaNet

  • We assume that you have downloaded the dataset and pre-trained weights and put them in the correct places.
  • To train 3D-RetinaNet using the training script, simply specify the parameters listed in main.py as flags or change them manually.

You will need 4 GPUs (each with at least 10GB VRAM) to run training.

Let's assume that you extracted the dataset in /home/user/road/ and the weights in /home/user/kinetics-pt/. Then your train command from the root directory of this repo is going to be:

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py /home/user/ /home/user/  /home/user/kinetics-pt/ --MODE=train --ARCH=resnet50 --MODEL_TYPE=I3D --DATASET=road --TRAIN_SUBSETS=train_3 --SEQ_LEN=8 --TEST_SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.0041

The second instance of /home/user/ in the above command specifies where checkpoint weights and logs are going to be stored. In this case, checkpoints and logs will be in /home/user/road/cache/<experiment-name>/.
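How the three positional paths fit together can be sketched as follows. This is a minimal illustration, not the parser from main.py: the positional names data_root, save_root and model_path and the helper experiment_dir are hypothetical, and only a few of the flags shown in the command above are mirrored.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser that mirrors the shape of the train command above."""
    parser = argparse.ArgumentParser(description="Sketch of a main.py-style CLI")
    # Positional names are assumptions; the README only shows their order.
    parser.add_argument("data_root", help="folder that contains the dataset folder, e.g. road/")
    parser.add_argument("save_root", help="folder under which checkpoints and logs are stored")
    parser.add_argument("model_path", help="folder holding the Kinetics pre-trained weights")
    parser.add_argument("--MODE", default="train")
    parser.add_argument("--DATASET", default="road")
    parser.add_argument("--BATCH_SIZE", type=int, default=4)
    parser.add_argument("--LR", type=float, default=0.0041)
    return parser


def experiment_dir(save_root, dataset, exp_name):
    """Compose <save_root>/<dataset>/cache/<exp_name>/ as described above."""
    return os.path.join(save_root, dataset, "cache", exp_name) + os.sep


args = build_parser().parse_args(
    ["/home/user/", "/home/user/", "/home/user/kinetics-pt/", "--MODE=train", "--DATASET=road"]
)
# "my-experiment" stands in for the experiment name, which the code derives itself.
print(experiment_dir(args.save_root, args.DATASET, "my-experiment"))
```

Passing a different second path moves checkpoints and logs without touching the dataset location.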

Different parameters in main.py will result in different performance. For ROAD, the validation split is selected automatically based on the training split number.

You can train on the ucf24 dataset by changing some command-line parameters, as the training schedule and learning rate differ from those used for road training.

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py /home/user/ /home/user/  /home/user/kinetics-pt/ --MODE=train --ARCH=resnet50 --MODEL_TYPE=I3D --DATASET=ucf24 --TRAIN_SUBSETS=train --VAL_SUBSETS=val --SEQ_LEN=8 --TEST_SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.00245 --MILESTONES=6,8 --MAX_EPOCHS=10

  • Training notes:
    • The network occupies almost 9.7GB of VRAM on each GPU. We used 1080Ti GPUs for training, and normal training takes about 24 hrs on the road dataset.
    • During training, a checkpoint is saved every epoch, and the frame-level mean AP (frame-mAP) on a subset of the validation split is logged.
    • The crucial parameters for the training process are LR, MILESTONES, MAX_EPOCHS, and BATCH_SIZE.
    • label_types is a very important variable: it defines which label types are used for training, and at validation time it is bumped up by one with the ego-action label type. It is created in data\dataset.py for each dataset separately and copied to args in main.py, then used again at evaluation time.
    • Event detection and triplet detection are used interchangeably in this code base.
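To get a feel for how LR, MILESTONES and MAX_EPOCHS interact, here is a small sketch of a step schedule in which the learning rate is multiplied by a factor gamma at each milestone epoch. It assumes the milestones are epoch indices and uses gamma=0.1; both are assumptions for illustration, and the helper names are hypothetical, so check main.py for the actual schedule.

```python
def parse_milestones(text):
    """Turn a flag value such as "6,8" into a list of epoch indices."""
    return [int(tok) for tok in text.split(",") if tok.strip()]


def lr_at_epoch(base_lr, milestones, epoch, gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed) for a step schedule.

    gamma=0.1 is an assumed decay factor, not a value taken from the repository.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** passed)


# Values from the UCF24 command above: --LR=0.00245 --MILESTONES=6,8 --MAX_EPOCHS=10
milestones = parse_milestones("6,8")
for epoch in range(10):
    print(epoch, lr_at_epoch(0.00245, milestones, epoch))
```

Under these assumptions the rate stays at its base value for epochs 0 to 5, drops once at epoch 6 and once more at epoch 8.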

Testing and Building Tubes

To generate the tubes and evaluate them, you first need frame-level detections, which are then linked together. It is pretty simple in our case. Similar to the training command, you can run the following commands. These can run on a single GPU.

There are various MODEs in main.py. You can do each step independently or together. At the moment, gen_dets mode generates and evaluates frame-wise detections and finally performs tube building and evaluation.
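The idea of linking frame-level detections into tubes can be sketched with a simple greedy, IoU-based linker. This is an illustration under stated assumptions, not the algorithm implemented in tubes.py: the function names, the input layout and the iou_thresh default are all hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tubes(frames, iou_thresh=0.25):
    """Greedily link per-frame detections into tubes.

    frames: one entry per frame, each a list of (box, score) tuples.
    Returns a list of tubes; each tube is a list of (frame_index, box, score).
    """
    finished, active = [], []
    for t, dets in enumerate(frames):
        unused = set(range(len(dets)))
        next_active = []
        # Extend the highest-scoring tubes first so they get first pick.
        active.sort(key=lambda tube: -sum(s for _, _, s in tube) / len(tube))
        for tube in active:
            last_box = tube[-1][1]
            best, best_iou = None, iou_thresh
            for i in unused:
                iou = box_iou(last_box, dets[i][0])
                if iou >= best_iou:
                    best, best_iou = i, iou
            if best is None:
                finished.append(tube)  # nothing overlaps: the tube ends here
            else:
                unused.discard(best)
                tube.append((t, dets[best][0], dets[best][1]))
                next_active.append(tube)
        # Detections that matched no tube start new tubes.
        for i in sorted(unused):
            next_active.append([(t, dets[i][0], dets[i][1])])
        active = next_active
    return finished + active


frames = [
    [((0, 0, 10, 10), 0.9)],
    [((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)],
    [((2, 2, 12, 12), 0.85)],
]
tubes = link_tubes(frames)
print([len(tube) for tube in tubes])
```

A greedy linker like this is easy to follow but can make locally poor choices; see tubes.py for the linking actually used.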

For ROAD dataset, run the following commands.

python main.py /home/user/ /home/user/  /home/user/kinetics-pt/ --MODE=gen_dets --MODEL_TYPE=I3D --TEST_SEQ_LEN=8 --TRAIN_SUBSETS=train_3 --SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.0041 

and for UCF24

python main.py /home/user/ /home/user/  /home/user/kinetics-pt/ --MODE=gen_dets --ARCH=resnet50 --MODEL_TYPE=I3D --DATASET=ucf24 --TRAIN_SUBSETS=train --VAL_SUBSETS=val --SEQ_LEN=8 --TEST_SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.00245 --EVAL_EPOCHS=10 --GEN_NMS=80 --TOPK=20 --PATHS_IOUTH=0.25 --TRIM_METHOD=indiv

  • Testing notes:
    • Evaluation can be done on a single GPU for test sequence lengths up to 32.
    • No temporal trimming is performed for the ROAD dataset; however, we use class-specific alphas with the temporal trimming formulation described in the paper, which relies on temporal label consistency.
    • Please go through the hyperparameters in main.py to understand their functions.
    • After building tubes, a detection .json file is dumped, which is used for evaluation; see tubes.py for more details.
    • See modules\evaluation.py and data\dataset.py for the frame-level and video-level evaluation code that computes frame-mAP and video-mAP.
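For readers new to the metric, the following sketch computes average precision for one class from frame-level detections at a fixed IoU threshold; frame-mAP is the mean of this value over classes. It is a generic illustration, not the code in modules\evaluation.py: the function names, the input layout and the all-point interpolation are assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_thresh=0.5):
    """AP for a single class.

    detections: list of (frame_id, score, box).
    ground_truth: dict mapping frame_id to a list of boxes.
    """
    num_gt = sum(len(boxes) for boxes in ground_truth.values())
    if num_gt == 0:
        return 0.0
    matched = {fid: [False] * len(boxes) for fid, boxes in ground_truth.items()}
    tp_flags = []
    # Visit detections from most to least confident.
    for fid, score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(ground_truth.get(fid, [])):
            iou = box_iou(box, gt_box)
            if iou > best_iou:
                best_iou, best_j = iou, j
        # A detection is a true positive if it hits a not-yet-matched box.
        if best_j >= 0 and best_iou >= iou_thresh and not matched[fid][best_j]:
            matched[fid][best_j] = True
            tp_flags.append(1)
        else:
            tp_flags.append(0)
    recalls, precisions, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        recalls.append(tp / num_gt)
        precisions.append(tp / k)
    # Make precision non-increasing in recall (all-point interpolation).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


gt = {0: [(0, 0, 10, 10)], 1: [(0, 0, 10, 10)]}
dets = [(0, 0.9, (0, 0, 10, 10)), (1, 0.8, (50, 50, 60, 60)), (1, 0.7, (0, 0, 10, 10))]
print(average_precision(dets, gt))
```

Video-level AP follows the same pattern, with tubes in place of boxes and a spatio-temporal overlap in place of box IoU.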


Here you will find the reproduced results from our paper. We use training split #3 for the reproduction, on a different machine from the one used to generate the results for the paper. Below are the results on validation split #3, which is closer to the test set than the other splits in terms of environmental conditions. There is a small change in the learning rate here, so the results differ slightly from the paper. Also, there are six tasks in the ROAD dataset, which makes it difficult to balance the learning among tasks.

The model is set to I3D with a resnet50 backbone. Kinetics pre-trained weights are used for resnet50I3D; the download link is given above in the Requirements section. Results on split #3 with a test sequence length of 8 are reported as <frame-mAP>/<video-mAP>.

Model                  I3D
Agentness              54.7/--
Agent                  31.1/26.0
Action                 22.0/16.1
Location               27.3/24.2
Duplexes               23.7/19.5
Events/triplets        13.9/15.5
AV-action              44.8/--
UCF24 results
Actionness             --
Action detection       --
ActionNess-framewise   --

Download pre-trained weights

  • Currently, we provide the models from the above table.
  • These models can be used to reproduce the above table, which is almost the same as in our paper.


If this work has been helpful in your research, please cite the following articles:

@ARTICLE {singh2022road,
author = {Singh, Gurkirt and Akrigg, Stephen and Di Maio, Manuele and Fontana, Valentina and Alitappeh, Reza Javanmard and Saha, Suman and Jeddisaravi, Kossar and Yousefi, Farzad and Culley, Jacob and Nicholson, Tom and others},
journal = {IEEE Transactions on Pattern Analysis & Machine Intelligence},
title = {ROAD: The ROad event Awareness Dataset for autonomous Driving},
year = {5555},
volume = {},
number = {01},
issn = {1939-3539},
pages = {1-1},
keywords = {roads;autonomous vehicles;task analysis;videos;benchmark testing;decision making;vehicle dynamics},
doi = {10.1109/TPAMI.2022.3150906},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {feb}
}

@inproceedings{singh2017online,
  title={Online real-time multiple spatiotemporal action localisation and prediction},
  author={Singh, Gurkirt and Saha, Suman and Sapienza, Michael and Torr, Philip HS and Cuzzolin, Fabio},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  year={2017}
}

@article{maddern20171,
  title={1 year, 1000 km: The Oxford RobotCar dataset},
  author={Maddern, Will and Pascoe, Geoffrey and Linegar, Chris and Newman, Paul},
  journal={The International Journal of Robotics Research},
  year={2017},
  publisher={SAGE Publications Sage UK: London, England}
}
