Human Action Recognition Based on Convolutional Neural Networks and Vision Transformers
dc.contributor.advisor | Xiaohao, Cai | |
dc.contributor.author | Alomar, Khaled Abdulaziz | |
dc.date.accessioned | 2025-05-12T07:05:23Z | |
dc.date.issued | 2025-05 | |
dc.description | This thesis seeks to deepen our understanding of the impact of deep-learning techniques on human action recognition (HAR). It addresses the challenges faced in HAR and proposes solutions focused on enhancing feature extraction and optimizing model design. This is accomplished through three distinct yet closely interconnected chapters (i.e., papers): (i) Data Augmentation in Classification and Segmentation: A Survey and New Strategies; (ii) TransNet: A Transfer Learning-Based Network for Human Action Recognition; and (iii) RNNs, CNNs, and Transformers in Human Action Recognition: A Survey and a Hybrid Model. The second chapter surveys existing data augmentation techniques in computer vision tasks, including segmentation and classification. Data augmentation is a well-established method in computer vision that can be especially beneficial for HAR by enhancing feature extraction. It addresses challenges such as limited datasets and class imbalance, yielding more robust feature extraction and reduced overfitting in neural networks. Studies have demonstrated that data augmentation significantly enhances the accuracy and generalizability of models in tasks such as image classification and segmentation; these techniques are subsequently utilized for HAR in the third chapter. The third chapter addresses two significant challenges in HAR: feature extraction and model complexity. It introduces a straightforward yet versatile and effective end-to-end deep learning architecture, termed TransNet, as a solution to these challenges. Extensive experimental results and comparisons with state-of-the-art models demonstrate the superior performance of TransNet in terms of flexibility, model complexity, transfer learning capability, training speed, and classification accuracy.
Additionally, this chapter introduces a novel strategy that utilizes autoencoders to form the 2D component of TransNet, referred to as TransNet+. TransNet+ enhances feature extraction by directing the model to extract specific features tailored to the task. It leverages the encoder part of an autoencoder, trained on computer vision tasks such as human semantic segmentation (HSS), to perform HAR. Extensive experimental results and comparisons with leading models further validate the superior performance of both TransNet and TransNet+ in HAR. The fourth chapter provides a comprehensive review of Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Vision Transformers (ViTs). It traces the progression from traditional methods to the latest neural network architectures, offering a chronological and extensive analysis of the existing literature on action recognition. The chapter proposes a novel hybrid model that integrates the strengths of CNNs and ViTs, and provides a detailed performance comparison of the proposed hybrid model against existing models, highlighting its efficacy in handling complex HAR tasks with improved accuracy and efficiency. The chapter also discusses emerging trends and future directions for HAR technologies. | |
dc.description.abstract | This thesis explores the impact of deep learning on human action recognition (HAR), addressing challenges in feature extraction and model optimization through three interconnected studies. The second chapter surveys data augmentation techniques in classification and segmentation, emphasizing their role in improving HAR by mitigating dataset limitations and class imbalance. The third chapter introduces TransNet, a transfer learning-based model, and its enhanced version, TransNet+, which utilizes autoencoders for improved feature extraction, demonstrating superior performance over existing models. The fourth chapter reviews CNNs, RNNs, and Vision Transformers, proposing a novel CNN-ViT hybrid model and comparing its effectiveness against state-of-the-art HAR methods, while also discussing future research directions. | |
dc.format.extent | 195 | |
dc.identifier.uri | https://hdl.handle.net/20.500.14154/75372 | |
dc.language.iso | en | |
dc.publisher | University of Southampton | |
dc.subject | Computer Science | |
dc.subject | Artificial Intelligence | |
dc.subject | Machine Learning | |
dc.subject | Deep Learning | |
dc.subject | Neural Networks | |
dc.subject | Computer Vision | |
dc.subject | Convolutional Neural Networks | |
dc.subject | Vision Transformers | |
dc.subject | Information Systems | |
dc.subject | Human Action Recognition | |
dc.title | Human Action Recognition Based on Convolutional Neural Networks and Vision Transformers | |
dc.type | Thesis | |
sdl.degree.department | School of Electronics and Computer Science | |
sdl.degree.discipline | Computer Science, Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks, Computer Vision, Information Systems | |
sdl.degree.grantor | University of Southampton | |
sdl.degree.name | Doctor of Philosophy |