Human Action Recognition Based on Convolutional Neural Networks and Vision Transformers

dc.contributor.advisorXiaohao, Cai
dc.contributor.authorAlomar, Khaled Abdulaziz
dc.date.accessioned2025-05-12T07:05:23Z
dc.date.issued2025-05
dc.descriptionThis thesis seeks to deepen our understanding of the impact of deep-learning techniques on human action recognition (HAR). It addresses the challenges faced in HAR and proposes solutions focused on enhancing feature extraction and optimizing model design. This is accomplished through three distinct yet closely interconnected chapters (i.e., papers): (i) Data Augmentation in Classification and Segmentation: A Survey and New Strategies; (ii) TransNet: A Transfer Learning-Based Network for Human Action Recognition; and (iii) RNNs, CNNs, and Transformers in Human Action Recognition: A Survey and a Hybrid Model. The second chapter surveys existing data augmentation techniques for computer vision tasks, including segmentation and classification. Data augmentation is a well-established method in computer vision and can be especially beneficial for HAR by enhancing feature extraction: it addresses challenges such as limited datasets and class imbalance, resulting in more robust feature extraction and reduced overfitting in neural networks. Studies have demonstrated that data augmentation significantly enhances the accuracy and generalizability of models in tasks like image classification and segmentation; these techniques are subsequently applied to HAR in the third chapter. The third chapter addresses two significant challenges in HAR: feature extraction and model complexity. It introduces a straightforward yet versatile and effective end-to-end deep learning architecture, termed TransNet, as a solution to these challenges. Extensive experimental results and comparisons with state-of-the-art models demonstrate the superior performance of TransNet in terms of flexibility, model complexity, transfer learning capability, training speed, and classification accuracy.
Additionally, this chapter introduces a novel strategy that utilizes autoencoders to form the 2D component of TransNet, referred to as TransNet+. TransNet+ enhances feature extraction by directing the model to extract specific features as needed: it leverages the encoder part of an autoencoder, trained on computer vision tasks such as human semantic segmentation (HSS), to perform HAR. Extensive experimental results and comparisons with leading models further validate the superior performance of both TransNet and TransNet+ in HAR. The fourth chapter provides a comprehensive review of Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Vision Transformers (ViTs). It examines the progression from traditional methods to the latest advancements in neural network architectures, offering a chronological and extensive analysis of the existing literature on action recognition. The chapter proposes a novel hybrid model that integrates the strengths of CNNs and ViTs, and offers a detailed performance comparison of the proposed hybrid model against existing models, highlighting its efficacy in handling complex HAR tasks with improved accuracy and efficiency. The chapter also discusses emerging trends and future directions for HAR technologies.
dc.description.abstractThis thesis explores the impact of deep learning on human action recognition (HAR), addressing challenges in feature extraction and model optimization through three interconnected studies. The second chapter surveys data augmentation techniques in classification and segmentation, emphasizing their role in improving HAR by mitigating dataset limitations and class imbalance. The third chapter introduces TransNet, a transfer learning-based model, and its enhanced version, TransNet+, which utilizes autoencoders for improved feature extraction, demonstrating superior performance over existing models. The fourth chapter reviews CNNs, RNNs, and Vision Transformers, proposing a novel CNN-ViT hybrid model and comparing its effectiveness against state-of-the-art HAR methods, while also discussing future research directions.
dc.format.extent195
dc.identifier.urihttps://hdl.handle.net/20.500.14154/75372
dc.language.isoen
dc.publisherUniversity of Southampton
dc.subjectComputer Science
dc.subjectArtificial Intelligence
dc.subjectMachine Learning
dc.subjectDeep Learning
dc.subjectNeural Networks
dc.subjectComputer Vision
dc.subjectConvolutional Neural Networks
dc.subjectVision Transformers
dc.subjectInformation Systems
dc.subjectHuman Action Recognition
dc.titleHuman Action Recognition Based on Convolutional Neural Networks and Vision Transformers
dc.typeThesis
sdl.degree.departmentSchool of Electronics and Computer Science
sdl.degree.disciplineComputer Science, Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks, Computer Vision, Information Systems
sdl.degree.grantorUniversity of Southampton
sdl.degree.nameDoctor of Philosophy

Files

Original bundle

Name: SACM-Dissertation.pdf
Size: 4.88 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.61 KB
Format: Item-specific license agreed to upon submission

Copyright owned by the Saudi Digital Library (SDL) © 2025