Keras implementation of video classifiers serving as web
The training data is UCF101 - Action Recognition Data Set. Codes are included that will download the UCF101 if they do not exist (due to their large size) in the demo/very_large_data folder. The download utility codes can be found in keras_video_classifier/library/utility/ucf directory
The video classifiers are defined and implemented in the keras_video_classifier/library directory.
By default the classifiers are trained using video files inside the dataset "UCF-101" located in demo/very_large_data (the videos files will be downloaded if not exist during training). However, the classifiers are generic and can be used to train on any other datasets (just change the data_set_name parameter in its fit() method to other dataset name instead of UCF-101 will allow it to be trained on other video datasets)
The opencv-python is used to extract frames from the videos.
The following deep learning models have been implemented and studied:
VGG16+LSTM: this approach uses VGG16 to extract features from individual frame of the video, the sequence of frame features are then taken into LSTM recurrent networks for classifier.
VGG16+Bidirectional LSTM: this approach uses VGG16 to extract features from individual frame of the video, the sequence of frame features are then taken into bidirectional LSTM recurrent networks for classifier.
Convolutional Network: this approach uses stores frames into the "channels" of input of the CNN which then classify the "image" (video frames stacked in the channels)
The trained models are available in the demo/models/UCF-101 folder (Weight files of two of the trained model are not included as they are too big to upload, they are
To train a deep learning model, say VGG16BidirectionalLSTMVideoClassifier, run the following commands:
pip install -r requirements.txt cd demo python vgg16_bidirectional_lstm_train.py
The training code in vgg16_bidirectional_lstm_train.py is quite straightforward and illustrated below:
import numpy as np from keras import backend as K from keras_video_classifier.library.recurrent_networks import VGG16BidirectionalLSTMVideoClassifier from keras_video_classifier.library.utility.plot_utils import plot_and_save_history from keras_video_classifier.library.utility.ucf.UCF101_loader import load_ucf K.set_image_dim_ordering('tf') data_set_name = 'UCF-101' input_dir_path = './very_large_data' output_dir_path = './models/' + data_set_name report_dir_path = './reports/' + data_set_name np.random.seed(42) # this line downloads the video files of UCF-101 dataset if they are not available in the very_large_data folder load_ucf(input_dir_path) classifier = VGG16BidirectionalLSTMVideoClassifier() history = classifier.fit(data_dir_path=input_dir_path, model_dir_path=output_dir_path, data_set_name=data_set_name) plot_and_save_history(history, VGG16BidirectionalLSTMVideoClassifier.model_name, report_dir_path + '/' + VGG16BidirectionalLSTMVideoClassifier.model_name + '-history.png')
After the training is completed, the trained models will be saved as cf-v1-. in the demo/models.
To use the trained deep learning model to predict the class label of a video, you can use the following code:
import numpy as np from keras_video_classifier.library.recurrent_networks import VGG16BidirectionalLSTMVideoClassifier from keras_video_classifier.library.utility.ucf.UCF101_loader import load_ucf, scan_ucf_with_labels vgg16_include_top = True data_set_name = 'UCF-101' data_dir_path = './very_large_data' model_dir_path = './models/' + data_set_name config_file_path = VGG16BidirectionalLSTMVideoClassifier.get_config_file_path(model_dir_path, vgg16_include_top=vgg16_include_top) weight_file_path = VGG16BidirectionalLSTMVideoClassifier.get_weight_file_path(model_dir_path, vgg16_include_top=vgg16_include_top) np.random.seed(42) # this line downloads the video files of UCF-101 dataset if they are not available in the very_large_data folder load_ucf(data_dir_path) predictor = VGG16BidirectionalLSTMVideoClassifier() predictor.load_model(config_file_path, weight_file_path) # scan_ucf returns a dictionary object of (video_file_path, video_class_label) where video_file_path # is the key and video_class_label is the value videos = scan_ucf_with_labels(data_dir_path, [label for (label, label_index) in predictor.labels.items()]) video_file_path_list = np.array([file_path for file_path in videos.keys()]) np.random.shuffle(video_file_path_list) correct_count = 0 count = 0 for video_file_path in video_file_path_list: label = videos[video_file_path] predicted_label = predictor.predict(video_file_path) print('predicted: ' + predicted_label + ' actual: ' + label) correct_count = correct_count + 1 if label == predicted_label else correct_count count += 1 accuracy = correct_count / count print('accuracy: ', accuracy)
Below shows the print out of demo/vgg16_bidirectional_lstm_predict.py towards the end of its execution:
predicted: Biking actual: Biking accuracy: 0.8593481989708405 Extracting frames from video: ./very_large_data/UCF-101\Billiards\v_Billiards_g24_c01.avi predicted: Billiards actual: Billiards accuracy: 0.8595890410958904 Extracting frames from video: ./very_large_data/UCF-101\BabyCrawling\v_BabyCrawling_g22_c06.avi predicted: BabyCrawling actual: BabyCrawling accuracy: 0.8598290598290599 Extracting frames from video: ./very_large_data/UCF-101\Bowling\v_Bowling_g13_c01.avi predicted: Bowling actual: Bowling accuracy: 0.8600682593856656 Extracting frames from video: ./very_large_data/UCF-101\BalanceBeam\v_BalanceBeam_g24_c04.avi predicted: BalanceBeam actual: BalanceBeam accuracy: 0.8603066439522998 Extracting frames from video: ./very_large_data/UCF-101\BrushingTeeth\v_BrushingTeeth_g12_c02.avi predicted: BrushingTeeth actual: BrushingTeeth accuracy: 0.8605442176870748 Extracting frames from video: ./very_large_data/UCF-101\BasketballDunk\v_BasketballDunk_g04_c01.avi predicted: BasketballDunk actual: BasketballDunk accuracy: 0.8607809847198642 Extracting frames from video: ./very_large_data/UCF-101\Bowling\v_Bowling_g04_c03.avi predicted: BenchPress actual: Bowling accuracy: 0.8593220338983051 Extracting frames from video: ./very_large_data/UCF-101\BaseballPitch\v_BaseballPitch_g19_c01.avi predicted: BaseballPitch actual: BaseballPitch accuracy: 0.8595600676818951 Extracting frames from video: ./very_large_data/UCF-101\Archery\v_Archery_g18_c03.avi predicted: Archery actual: Archery accuracy: 0.8597972972972973 Extracting frames from video: ./very_large_data/UCF-101\Bowling\v_Bowling_g19_c03.avi ...
20 classes from UCF101 is used to train the video classifier. 20 epochs are set for the training
Below is the train history for the VGG16+LSTM (top included for VGG16):
The LSTM with VGG16 (top included) feature extractor: (accuracy around 68.9% for training and 55% for validation)
Below is the train history for the VGG16+Bidirectional LSTM (top included for VGG16):
The bidirectional LSTM with VGG16 (top included) feature extractor: (accuracy around 89% for training and 77% for validation)
Below is the train history for the VGG16+LSTM (top not included for VGG16):
The LSTM with VGG16 (top not included)feature extractor: (accuracy around 100% for training and 98.83% for validation)
Below is the train history for the VGG16+LSTM (top not included for VGG16):
The LSTM with VGG16 (top not included) feature extractor: (accuracy around 100% for training and 98.57% for validation)
Below is the train history for the Convolutional Network:
The Convolutional Network: (accuracy around 22.73% for training and 28.75% for validation)