Action Recognition and Tracking Using Capsule Networks
Abstract
Capsule Neural Networks (CapsNets) are a recent class of deep neural networks that build hierarchical relationships between objects and their parts. The architecture finds agreements between low-level and high-level features across the different layers of the network. Unlike Convolutional Neural Networks (CNNs), which are built from individual neurons, CapsNets use the capsule as their building block. Each capsule is a group of neurons that captures spatial features of the input. When passing activations from one layer to the next, a low-level capsule sends its vote to a high-level capsule only when the coordinate frames of the two capsules agree. In this thesis, we study the performance of CapsNets on Human Action Recognition (HAR) and single object tracking (SOT) tasks. We proposed a simple Spatial ActionCaps architecture with dynamic routing to recognise actions from the spatial dimension. To reduce the sensitivity of CapsNets, we proposed a weight pooling algorithm that reduces the dimensionality of the extracted features and suppresses background noise. Our proposed architecture outperformed a baseline CNN architecture. In addition, we showed that CapsNets can encode an action's temporal information in the class feature vector. We tested Spatio-Temporal CapsNets on videos captured by a drone. The proposed CapsNets architecture with EM routing was able to recognise actions from unfamiliar viewpoints. Instead of weight pooling, we introduced a Binary Volume Comparison (BVC) layer to reduce the noise in the 3D features. To evaluate our architecture, we used four metrics for multi-label HAR. Our proposed architecture outperformed multiple CNN-based methods on the multi-label classes of the Okutama-Action dataset.
In addition, we proposed multi-modality CapsNets for SOT tasks. The proposed architecture generalised faster than a baseline CNN SOT network. The proposed routing algorithm finds agreements between the object in the bounding box of the first frame and the remaining video frames. The coarse location of the object is estimated from a background and foreground classification. Centreness and regression networks then help the network locate the object precisely in the remaining frames.
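As a rough illustration of the voting-by-agreement idea summarised above, the following is a minimal NumPy sketch of dynamic routing between two capsule layers. It is not the ActionCaps implementation: the function names (`squash`, `softmax`, `dynamic_routing`), the tensor layout of the votes, and the choice of three routing iterations are assumptions made for this example only.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-8):
    """Shrink vectors so their length lies in [0, 1) while keeping direction."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_routing(u_hat, num_iters=3):
    """Route votes from low-level capsules to high-level capsules.

    u_hat: array of shape (num_in, num_out, dim_out); u_hat[i, j] is the
           vote of low-level capsule i for high-level capsule j
           (assumed layout for this sketch).
    Returns the high-level capsule outputs v, shape (num_out, dim_out),
    and the coupling coefficients c, shape (num_in, num_out), that were
    used to compute v.
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))  # routing logits, start uniform
    for _ in range(num_iters):
        # Each low-level capsule distributes its vote over all outputs.
        c = softmax(b, axis=1)
        # Weighted sum of votes per high-level capsule, then squash.
        s = np.einsum("ij,ijd->jd", c, u_hat)
        v = squash(s)
        # Raise the logit where a vote agrees with the resulting output.
        b = b + np.einsum("ijd,jd->ij", u_hat, v)
    return v, c


if __name__ == "__main__":
    e = np.array([1.0, 0.0, 0.0, 0.0])
    # Four low-level capsules, two high-level capsules.
    # All votes for capsule 0 agree; votes for capsule 1 cancel out.
    u_hat = np.stack(
        [np.stack([e, sign * e]) for sign in (1.0, -1.0, 1.0, -1.0)]
    )
    v, c = dynamic_routing(u_hat)
    print("output lengths:", np.linalg.norm(v, axis=1))
    print("coupling to capsule 0:", c[:, 0])
```

In the small demonstration at the bottom, the low-level capsules cast identical votes for the first high-level capsule and mutually cancelling votes for the second, so the coupling coefficients shift towards the first capsule over the iterations.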
Keywords
tracking, action recognition, computer vision