Human Action Recognition Based on Convolutional Neural Networks and Vision Transformers
dc.contributor.advisor | Xiaohao, Cai | |
dc.contributor.author | Alomar, Khaled Abdulaziz | |
dc.date.accessioned | 2025-05-12T07:05:23Z | |
dc.date.issued | 2025-05 | |
dc.description | This thesis seeks to deepen our understanding of the impact of deep-learning techniques on human action recognition (HAR). It addresses the challenges faced in HAR and proposes solutions focused on enhancing feature extraction and optimizing model design. This is accomplished through three distinct yet closely interconnected chapters (i.e., papers): (i) Data Augmentation in Classification and Segmentation: A Survey and New Strategies; (ii) TransNet: A Transfer Learning-Based Network for Human Action Recognition; and (iii) RNNs, CNNs, and Transformers in Human Action Recognition: A Survey and a Hybrid Model. The second chapter surveys existing data augmentation techniques in computer vision tasks, including segmentation and classification. Data augmentation is a well-established method in computer vision that can be especially beneficial for HAR by enhancing feature extraction. It addresses challenges such as limited datasets and class imbalance, yielding more robust feature extraction and reduced overfitting in neural networks. Studies have demonstrated that data augmentation significantly enhances the accuracy and generalizability of models in tasks such as image classification and segmentation; these techniques are subsequently utilized for HAR in the third chapter. The third chapter addresses two significant challenges in HAR: feature extraction and model complexity. It introduces a straightforward yet versatile and effective end-to-end deep learning architecture, termed TransNet, as a solution to these challenges. Extensive experimental results and comparisons with state-of-the-art models demonstrate the superior performance of TransNet in terms of flexibility, model complexity, transfer learning capability, training speed, and classification accuracy.
Additionally, this chapter introduces a novel strategy that utilizes autoencoders to form the 2D component of TransNet, referred to as TransNet+. TransNet+ enhances feature extraction by directing the model to extract specific features tailored to the task. It leverages the encoder part of an autoencoder, trained on computer vision tasks such as human semantic segmentation (HSS), to perform HAR. Extensive experimental results and comparisons with leading models further validate the superior performance of both TransNet and TransNet+ in HAR. The fourth chapter provides a comprehensive review of Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Vision Transformers (ViTs). It traces the progression from traditional methods to the latest neural network architectures, offering a chronological and extensive analysis of the existing literature on action recognition. The chapter proposes a novel hybrid model that integrates the strengths of CNNs and ViTs, and provides a detailed performance comparison of the proposed hybrid model against existing models, highlighting its efficacy in handling complex HAR tasks with improved accuracy and efficiency. The chapter also discusses emerging trends and future directions for HAR technologies. | |
dc.description.abstract | This thesis explores the impact of deep learning on human action recognition (HAR), addressing challenges in feature extraction and model optimization through three interconnected studies. The second chapter surveys data augmentation techniques in classification and segmentation, emphasizing their role in improving HAR by mitigating dataset limitations and class imbalance. The third chapter introduces TransNet, a transfer learning-based model, and its enhanced version, TransNet+, which utilizes autoencoders for improved feature extraction, demonstrating superior performance over existing models. The fourth chapter reviews CNNs, RNNs, and Vision Transformers, proposing a novel CNN-ViT hybrid model and comparing its effectiveness against state-of-the-art HAR methods, while also discussing future research directions. | |
dc.format.extent | 195 | |
dc.identifier.uri | https://hdl.handle.net/20.500.14154/75372 | |
dc.language.iso | en | |
dc.publisher | University of Southampton | |
dc.subject | Computer Science | |
dc.subject | Artificial Intelligence | |
dc.subject | Machine Learning | |
dc.subject | Deep Learning | |
dc.subject | Neural Networks | |
dc.subject | Computer Vision | |
dc.subject | Convolutional Neural Networks | |
dc.subject | Vision Transformers | |
dc.subject | Information Systems | |
dc.subject | Human Action Recognition | |
dc.title | Human Action Recognition Based on Convolutional Neural Networks and Vision Transformers | |
dc.type | Thesis | |
sdl.degree.department | School of Electronics and Computer Science | |
sdl.degree.discipline | Computer Science, Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks, Computer Vision, Information Systems | |
sdl.degree.grantor | University of Southampton | |
sdl.degree.name | Doctor of Philosophy |