Title: Video Person Re-identification for Future Automated Visual Surveillance Systems
Author: Alsehaim, Aishah
Supervisor: Breckon, Toby
Date issued: 2023-08-22
URI: https://hdl.handle.net/20.500.14154/68959
Type: Thesis
Language: en
Pages: 131
Keywords: Re-ID; person Re-ID

Abstract:

Person Re-identification (Re-ID) across a collection of surveillance cameras is becoming an increasingly vital component of intelligent surveillance systems. Due to the numerous variations in human pose, occlusion, viewpoint, illumination and background clutter, most contemporary video Re-ID studies use complex CNN-based network architectures with 3D convolutions or multi-branch networks in order to extract spatio-temporal video features. In this thesis, we address the significant challenge posed by person Re-ID by encoding person videos into robust, discriminative feature vectors that improve performance under these challenging conditions.

The extraction of strong, discriminative features is a fundamental aspect of person Re-ID, and CNN-based approaches have dominated this area. We show that a simple single-stream 2D convolutional network using the ResNet50-IBN architecture to extract frame-level features can achieve superior performance when combined with temporal attention to form clip-level features. By averaging clip-level features, the approach generalises to entire videos at no added cost. While other recent work relies on complicated, memory-intensive 3D convolutions or multi-stream network architectures, our method combines video Re-ID best practice with transfer learning between datasets to achieve superior person Re-ID results.

Moreover, we consider joint person Re-ID and action recognition within the context of automated surveillance, learning discriminative feature representations that both improve Re-ID performance and provide viable per-view (clip-wise) action recognition. Weakly labelled actions from the two leading benchmark video Re-ID datasets (MARS, LPW) are used to perform a hybrid Re-ID and action recognition task, optimised via a combination of task-specific loss terms within a multi-loss formulation. Our multi-branch 2D CNN architecture achieves results superior to previous work in the field solely because we treat Re-ID and action recognition as a multi-task problem.

Recently, vision transformer (ViT) architectures have been shown to boost fine-grained feature discrimination across a variety of vision tasks. To adapt ViT to video person Re-ID, two novel modules, Temporal Clip Shift and Shuffled (TCSS) and Video Patch Part Feature (VPPF), are proposed, enabling ViT architectures to effectively meet the challenges of the task. Overall, we present three novel deep learning architectures that address the video person Re-ID task, spanning CNN, multi-task learning and ViT approaches.
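To make the first contribution concrete, below is a minimal PyTorch sketch of temporal attention pooling over per-frame features, assuming globally pooled ResNet50-IBN frame descriptors as input; the single-linear scoring head, the 2048-dimensional feature size, and the function names are illustrative assumptions, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Aggregates per-frame features into one clip-level feature via
    learned temporal attention (generic formulation, assumed here)."""

    def __init__(self, feat_dim: int = 2048):  # 2048 = ResNet50 output dim
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # per-frame importance score

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim), e.g. pooled
        # ResNet50-IBN outputs for each frame of a clip.
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        return (weights * frame_feats).sum(dim=1)                # (B, D)

def video_feature(clip_feats: torch.Tensor) -> torch.Tensor:
    # Video-level descriptor: a simple average of clip-level features,
    # mirroring the "averaging" step described in the abstract.
    # clip_feats: (num_clips, feat_dim)
    return clip_feats.mean(dim=0)
```

Because the attention weights are normalised per clip and the video-level step is a parameter-free mean, extending clip-level features to whole videos adds no trainable parameters or extra compute beyond the per-clip forward passes, which is the "no added cost" claim in the abstract.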
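The joint Re-ID and action recognition objective can be illustrated with a hedged sketch of two task-specific heads over a shared backbone feature, combined into a single multi-loss. The cross-entropy-only terms, equal default weights, and class names here are assumptions (the thesis may additionally use metric-learning losses); only the overall multi-task structure is taken from the abstract.

```python
import torch
import torch.nn as nn

class JointReIDActionLoss(nn.Module):
    """Two task-specific classification heads over a shared feature,
    combined into a single multi-loss (illustrative structure only)."""

    def __init__(self, feat_dim: int, num_ids: int, num_actions: int):
        super().__init__()
        self.id_head = nn.Linear(feat_dim, num_ids)          # Re-ID branch
        self.action_head = nn.Linear(feat_dim, num_actions)  # action branch
        self.ce = nn.CrossEntropyLoss()

    def forward(self, shared_feat, id_labels, action_labels,
                w_id: float = 1.0, w_action: float = 1.0):
        # Each branch contributes its own task-specific term; the weights
        # w_id / w_action (assumed values) balance the combined multi-loss.
        loss_id = self.ce(self.id_head(shared_feat), id_labels)
        loss_action = self.ce(self.action_head(shared_feat), action_labels)
        return w_id * loss_id + w_action * loss_action
```

Training both heads against one shared representation is what forces the backbone features to remain discriminative for identity while also encoding clip-wise action cues, which is the mechanism the multi-task claim in the abstract rests on.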
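The abstract does not detail the TCSS or VPPF module internals, so no faithful implementation can be given here. Purely to illustrate the general idea of shifting token channels between frames of a clip — in the spirit of the well-known Temporal Shift Module, explicitly not the thesis's TCSS — the following hedged sketch shows how per-frame ViT tokens can exchange temporal context with no extra parameters.

```python
import torch

def temporal_token_shift(tokens: torch.Tensor, shift_div: int = 4) -> torch.Tensor:
    """Shift a fraction of each ViT token's channels to neighbouring
    frames so per-frame tokens exchange temporal context. Generic
    TSM-style shift for illustration; NOT the thesis's TCSS module."""
    # tokens: (batch, num_frames, num_tokens, embed_dim)
    b, t, n, d = tokens.shape
    fold = d // shift_div
    out = tokens.clone()
    out[:, 1:, :, :fold] = tokens[:, :-1, :, :fold]                  # shift forward in time
    out[:, :-1, :, fold:2 * fold] = tokens[:, 1:, :, fold:2 * fold]  # shift backward in time
    # Boundary frames keep their original channels in the shifted slots
    # (one possible padding choice).
    return out
```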