Saudi Cultural Missions Theses & Dissertations

Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10


Search Results

Now showing 1 - 8 of 8
  • Item (Restricted)
    Facial Emotion Recognition via Label Distribution Learning and Customized Convolutional Layers
    (The University of Warwick, 2024-11) Almowallad, Abeer; Sanchez, Victor
    This thesis investigates the task of recognizing human emotions from facial expressions in images, a topic that has long been of interest to researchers in computer vision and machine learning. It addresses the challenge of deciphering a mixture of six basic emotions—happiness, sadness, anger, fear, surprise, and disgust—each presented with a distinct intensity. Previous studies have dealt with this challenge by using Label Distribution Learning (LDL) and focusing on optimizing a conditional probability function that reduces the relative entropy of the predicted distribution with respect to the target distribution, which limits the generality of the model. This thesis introduces three LDL frameworks to tackle this. First, we propose a deep learning framework for LDL, named EDL-LBCNN, which utilizes convolutional neural network (CNN) features to broaden the model's generalization capabilities and integrates a Local Binary Convolutional (LBC) layer to refine the texture information extracted by the CNN, targeting more precise emotion recognition. Second, we propose the VCNN-ELDL framework, which employs an innovative Visibility Convolutional Layer (VCL); the VCL retains the feature-extraction advantages of traditional convolutional (Conv) layers while reducing the number of learnable parameters and enhancing the capture of crucial texture features from facial images. Third, this research presents a novel Transformer architecture, the Visibility Convolutional Vision Transformer (VCLvT), which incorporates Depth-Wise Visibility Convolutional Layers (DepthVCL) to bolster spatial feature extraction; this approach yields promising outcomes, particularly on limited datasets, matching or exceeding state-of-the-art performance across different dataset sizes. Through these advancements, the thesis contributes robust, scalable models adept at interpreting the complex nuances of human emotions.
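    A minimal sketch of the relative-entropy (KL divergence) objective that LDL methods optimize, assuming a six-way emotion distribution per face image; the function names, emotion ordering and example values are illustrative, not taken from the thesis.

```python
# Sketch of the label-distribution-learning (LDL) objective the abstract
# refers to: minimising KL(target || predicted) between a predicted emotion
# distribution and the target distribution. Values are illustrative.
import numpy as np

EMOTIONS = ["happiness", "sadness", "anger", "fear", "surprise", "disgust"]

def kl_divergence(target: np.ndarray, predicted: np.ndarray, eps: float = 1e-12) -> float:
    """KL(target || predicted) for one image's emotion distribution."""
    target = np.clip(target, eps, 1.0)
    predicted = np.clip(predicted, eps, 1.0)
    return float(np.sum(target * np.log(target / predicted)))

# Example: a face that is mostly happy with a hint of surprise.
target = np.array([0.70, 0.02, 0.02, 0.02, 0.22, 0.02])
predicted = np.array([0.60, 0.05, 0.05, 0.05, 0.20, 0.05])
print(kl_divergence(target, predicted))  # smaller is better
```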
  • Item (Restricted)
    Beyond Sight: Generating Clothing Graphic Captions for Visually Impaired Users
    (University of Sheffield, 2024) Alluqmani, Amnah; Harvey, Morgan
    People with visual impairment (VI) often face challenges when performing everyday tasks, of which clothes shopping is the most challenging. Many people with VI shop online to avoid some of the challenges of physical shopping; however, buying clothes online presents many other limitations and barriers. More research is needed to address these challenges and to propose solutions that enhance the shopping experiences of people with VI. Most studies have relied on interviews alone, which provide only subjective, recall-biased information, and only a few have offered a solution. We therefore conducted two complementary studies, using both observational and interview approaches, to fill the gap in understanding how people with VI select and purchase clothes online. In addition, we built a VI-friendly clothing graphic captioning model, adopting a large language model (LLM) approach to simulate the needs of people with VI and a region-of-interest (ROI) approach to generate focused, informative captions. We conducted multiple assessments, including a manual analysis of ground-truth (GT) clothing image-text datasets and automatic and human evaluations of our proposed model. Our findings show that shopping websites carry inaccurate, misleading and contradictory clothing descriptions, and that people with VI rely mainly on (unreliable) search tools and verify product descriptions by reviewing customer comments. Our findings also indicate that people with VI are hesitant to accept assistance from automated systems; trust in such systems could be improved if researchers developed systems that better accommodate users' needs and preferences. The evaluation of the GT captions showed several limitations that disqualify them from consideration as gold-standard captions. The evaluation of our model revealed that adopting the LLM and ROI approaches generates the most informative captions with respect to the needs of people with VI, compared with BLIP-2 (which adopts a grid-based approach) and the GT captions. Our model achieved average scores of 1.73 and 1.774 from evaluators with VI and sighted evaluators, respectively; in comparison, BLIP-2 and the GT data obtained average scores of 1.460 and 1.440 from evaluators with VI and 1.565 and 1.671 from sighted evaluators, respectively. Furthermore, although our model is the most informative among the baselines, its accuracy is difficult to maintain and remains lower than that of the GT and BLIP-2 captions. This thesis investigates the factors that affect caption accuracy, which will be addressed further in future work.
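    A hypothetical sketch of how per-region descriptions might be composed into a single LLM prompt for a VI-friendly clothing caption; the prompt wording, the regions and the helper name are illustrative assumptions, not the thesis's actual pipeline.

```python
# Sketch of the ROI-plus-LLM captioning idea: per-region notes are gathered
# and composed into one prompt for a large language model. Illustrative only.
from typing import Dict

def build_caption_prompt(roi_descriptions: Dict[str, str]) -> str:
    """Compose region-of-interest descriptions into an LLM prompt aimed at
    generating a caption useful to a visually impaired online shopper."""
    regions = "\n".join(f"- {region}: {text}" for region, text in roi_descriptions.items())
    return (
        "You are describing a clothing product image for a visually impaired "
        "online shopper. Using only the region notes below, write one factual, "
        "non-contradictory caption covering colour, graphic and text details.\n"
        f"{regions}"
    )

print(build_caption_prompt({
    "chest graphic": "white line drawing of a mountain range",
    "text print": "the words 'EXPLORE MORE' in block capitals",
    "fabric": "plain navy cotton",
}))
```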
  • Item (Restricted)
    A New Way of Imagining
    (The University of Nottingham, 2024-05-09) Alshammary, Yazeed Hamoud; Wright, Amanda
    The realm of particle tracking is undergoing a significant transformation, driven by advancements in imaging technology. This thesis explores the innovative world of event cameras and their applications, focusing on their principles and comparative performance against established sCMOS (Scientific Complementary Metal-Oxide-Semiconductor) cameras. Event cameras, also known as dynamic vision sensors (DVS), represent a paradigm shift in imaging, operating on an event-based sensing principle that detects brightness changes asynchronously at each pixel. This capability allows for capturing rapid movements with high precision, making event cameras particularly suited for tracking particles in fluid flows, microfluidic applications, and other scenarios characterized by swift motion. This study aims to highlight the potential of event cameras beyond particle tracking, including advancements in robotics, augmented reality, and computer vision. In contrast, sCMOS cameras, known for their high sensitivity and low noise, have been pivotal in scientific imaging, especially in controlled environments requiring high-resolution, frame-based imaging. The thesis provides a comprehensive examination of both technologies, their operational mechanisms, applications, and comparative strengths. Through practical applications and detailed analysis, this research underscores the significance of event cameras in revolutionizing particle tracking and other dynamic imaging domains.
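    A minimal sketch of the event-based sensing principle described above, assuming a simple per-pixel log-brightness contrast threshold; the threshold and frame values are illustrative, not the thesis's acquisition setup.

```python
# A pixel emits an event when its log-brightness changes by more than a
# contrast threshold, instead of reporting full frames as sCMOS sensors do.
import numpy as np

def events_between_frames(prev: np.ndarray, curr: np.ndarray, threshold: float = 0.2):
    """Return (row, col, polarity) events between two grayscale frames."""
    eps = 1e-6
    delta = np.log(curr + eps) - np.log(prev + eps)
    rows, cols = np.nonzero(np.abs(delta) > threshold)
    polarity = np.sign(delta[rows, cols]).astype(int)  # +1 brighter, -1 darker
    return list(zip(rows.tolist(), cols.tolist(), polarity.tolist()))

prev = np.full((4, 4), 0.5)
curr = prev.copy()
curr[1, 2] = 0.9   # a particle moves into this pixel
curr[2, 1] = 0.1   # and leaves this one
print(events_between_frames(prev, curr))  # [(1, 2, 1), (2, 1, -1)]
```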
  • Item (Restricted)
    Understanding Perceptual Mesh Quality in Virtual Reality and Desktop Settings
    (Cardiff University, 2024-07) Alfarasani, Dalia A; Lai, Yukun
    This thesis focuses on 3D mesh quality, which is essential for immersive VR applications. It examines subjective methodologies for Quality of Experience (QoE) assessment and then develops objective quality metrics that incorporate QoE-influencing factors. Existing studies consider 3D mesh quality in desktop settings, but perceptual quality in a Virtual Reality (VR) setting can differ; this inspired us to measure mesh quality in a VR setting, which has received limited study. We consider how different 3D distortion types affect the perceptual quality of 3D meshes when viewed in a VR setup. Our experimental findings show that, in the VR setting, perception appears more sensitive to particular distortions than to others, compared with the desktop; this can provide helpful guidance for downstream applications. Furthermore, we evaluate state-of-the-art perceptually inspired mesh difference metrics for predicting the objective quality scores captured in VR and compare them with the desktop. The experimental results show that subjective scores in the VR setting are more consistent than those in the desktop setting. To better understand perceptual mesh quality, we further consider the problem of mesh saliency, which measures the perceptual importance of different regions on a mesh. Existing mesh saliency models, however, are largely built with hard-coded formulae or rely on indirect measures, which cannot capture true human perception. To generate ground-truth mesh saliency, we conduct subjective studies that collect eye-tracking data from participants and develop a method for mapping the eye-tracking data of individual views consistently onto a mesh. We further evaluate existing methods of measuring saliency and propose a new machine-learning-based method that better predicts subjective saliency values. The predicted saliency is also shown to help with mesh quality prediction, as salient regions tend to be more important perceptually, leading to a novel and effective mesh quality measure.
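    A toy sketch of one step described above: turning eye-tracking fixations into per-vertex saliency by accumulating hits at the nearest mesh vertex. The nearest-vertex mapping is an illustrative simplification, not the thesis's mapping method.

```python
# Accumulate gaze hit points onto mesh vertices to form a saliency histogram.
import numpy as np

def fixations_to_saliency(vertices: np.ndarray, fixations: np.ndarray) -> np.ndarray:
    """vertices: (V, 3) mesh positions; fixations: (F, 3) gaze hit points.
    Returns a normalised (V,) saliency histogram."""
    saliency = np.zeros(len(vertices))
    for point in fixations:
        nearest = np.argmin(np.linalg.norm(vertices - point, axis=1))
        saliency[nearest] += 1.0
    total = saliency.sum()
    return saliency / total if total > 0 else saliency

vertices = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=float)
fixations = np.array([[0.9, 0.1, 0.0], [1.1, 0.0, 0.0], [0.0, 0.0, 0.1]])
print(fixations_to_saliency(vertices, fixations))  # most gaze lands near vertex 1
```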
  • Item (Restricted)
    The Effectiveness of Facial Cues for Automatic Detection of Cognitive Impairment Using In-the-wild Data
    (Saudi Digital Library, 2023-11-30) Alzahrani, Fatimah; Christensen, Heidi; Maddock, Steve
    The development of automatic methods for the early detection of cognitive impairment (CI) has attracted much research interest due to its crucial role in helping people get suitable treatment or care. People with CI may experience various changes in their facial cues, such as eye blink rate and head movement. This thesis aims to investigate the use of facial cues to develop an automatic system for detecting CI using in-the-wild data. First, the term 'in-the-wild data' is defined, and the associated challenges are identified by analysing datasets used in previous work; in-the-wild data can undermine the reliability of state-of-the-art approaches. Second, the thesis investigates the automatic detection of neurodegenerative disorder, mild cognitive impairment and functional memory disorder, showing the applicability of the approach to health conditions with similar symptoms. Third, a novel multiple-thresholds (MTs) approach for extracting an eye blink rate feature is introduced; it addresses in-the-wild data challenges by generating multiple thresholds, resulting in a vector of blink rates for each participant, and the feasibility of this feature for detecting CI is then examined. Other features considered are the head turn rate, head turn statistical features, head movement statistical features and low-level features. The results show that these facial features significantly distinguish different health conditions.
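    A toy sketch of the multiple-thresholds idea, assuming a per-frame eye-openness signal is already available; the threshold values and the signal are illustrative assumptions, not the thesis's feature extraction.

```python
# Instead of one hand-tuned eye-openness threshold, compute blink rates at
# several thresholds, yielding a vector of blink rates per participant.
import numpy as np

def blink_rate_vector(eye_openness: np.ndarray, fps: float,
                      thresholds=(0.15, 0.20, 0.25, 0.30)) -> np.ndarray:
    """Count closed->open transitions below each threshold; return blinks/min."""
    minutes = len(eye_openness) / fps / 60.0
    rates = []
    for t in thresholds:
        closed = eye_openness < t
        blinks = np.count_nonzero(closed[:-1] & ~closed[1:])  # closures that end
        rates.append(blinks / minutes)
    return np.array(rates)

signal = np.array([0.35, 0.33, 0.10, 0.12, 0.34, 0.36,
                   0.18, 0.35, 0.34, 0.09, 0.33, 0.35])
print(blink_rate_vector(signal, fps=2.0))  # one blink rate per threshold
```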
  • Item (Restricted)
    Landmarks Retrieval Using Deep Learning
    (Saudi Digital Library, 2023-11-14) Alturki, Reef; Husain, Sameed
    In today’s digital age, the complexity and diversity of multimedia data are increasing rapidly, creating an urgent need for high-performing image retrieval systems that meet human demands. Retrieving similar images efficiently and effectively remains difficult for several reasons, including variations in object appearance, partial occlusions and changes in viewpoint and scale. Image retrieval has gained research attention due to its contribution to a variety of applications, such as mobile commerce, tourism, surveillance and robotics. Although extensive studies have been conducted to enhance the performance of image retrieval systems, they have yet to achieve the desired outcomes, and most concentrate on learning either local or global feature representations to handle the retrieval problem. This project tackles the retrieval problem by utilizing a range of advanced techniques to develop and analyze two models. The first model is a lightweight yet high-performing architecture that uses EfficientNet and ResNet as base CNNs, together with a fusion component that integrates local and global features after their extraction, resulting in a single descriptor that describes the image well. The second model focuses on learning deep local feature representation and aggregation through spatial attention, self-attention and cross-attention. To train these models, we utilized the Google Landmarks v2 dataset, currently the largest dataset available for image retrieval, and we used the ROxford and RParis datasets as benchmarks to evaluate our models’ effectiveness. Our analysis assessed the models against state-of-the-art models, and the evaluation results showed promising performance compared with existing solutions. Furthermore, this study gains deeper insight through Grad-CAM visualizations, which offer a clear view of how the model makes decisions and reveal the areas it focuses on, thereby enhancing the model’s interpretability. Finally, conclusions and potential areas for future research are pointed out.
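    A hypothetical sketch of the local-global fusion step in the first model, assuming mean aggregation of local descriptors and L2-normalised concatenation; the dimensions and the aggregation choice are illustrative, not the thesis's architecture.

```python
# Fuse a pooled global descriptor with aggregated local descriptors into one
# unit-norm vector per image, suitable for cosine-similarity retrieval.
import numpy as np

def l2_normalise(x: np.ndarray) -> np.ndarray:
    return x / (np.linalg.norm(x) + 1e-12)

def fuse_descriptors(global_feat: np.ndarray, local_feats: np.ndarray) -> np.ndarray:
    """global_feat: (Dg,); local_feats: (N, Dl). Returns one image descriptor."""
    pooled_local = local_feats.mean(axis=0)          # toy aggregation of local features
    fused = np.concatenate([l2_normalise(global_feat), l2_normalise(pooled_local)])
    return l2_normalise(fused)                       # unit norm for cosine retrieval

query = fuse_descriptors(np.random.rand(512), np.random.rand(100, 128))
index = fuse_descriptors(np.random.rand(512), np.random.rand(80, 128))
print(float(query @ index))  # cosine similarity used for ranking
```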
  • Item (Restricted)
    Action Recognition and Tracking Using Capsule Networks
    (2023-05-25) Algamdi, Abdullah M.; Sanchez, Victor
    Capsule Neural Networks (CapsNets) are deep neural networks that build hierarchical relationships between objects and their parts. The architecture finds agreement between low-level and high-level features across the layers of the network. Unlike the neurons in Convolutional Neural Networks (CNNs), CapsNets use the capsule as the building block of the network; each capsule is a group of neurons that captures spatial input features. When sending activations from one layer to the next, CapsNets send votes from a low-level capsule to a high-level capsule when they find agreement between the coordinate frames of the two capsules. In this thesis, we study the performance of CapsNets on Human Action Recognition (HAR) and single object tracking (SOT) tasks. We propose a simple Spatial ActionCaps architecture with dynamic routing to recognise actions from the spatial dimension. To overcome the sensitivity of CapsNets, we propose a weight pooling algorithm that reduces the dimensionality of the extracted features and the background noise; the proposed architecture outperforms a baseline CNN architecture. In addition, we show the ability of CapsNets to encode an action's temporal information in the class feature vector. We test Spatio-Temporal CapsNets on videos captured by drones; the proposed CapsNets architecture with EM routing is able to recognise actions from unfamiliar viewpoints. Instead of weight pooling, we introduce a Binary Volume Comparison (BVC) layer to reduce the noise in the 3D features. To evaluate our architecture, we use four metrics for multi-label HAR; the proposed architecture outperforms multiple CNN methods on the multi-label classes of the Okutama-Action dataset. Finally, we propose a multi-modality CapsNets architecture for SOT, which shows faster generalization than a baseline CNN SOT network. The proposed routing algorithm finds agreement between the object in the bounding box of the first frame and the remaining video frames; based on background and foreground classification, the coarse location of the object is found, and centreness and regression networks help the network locate the object precisely in the remaining frames.
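    A compact sketch of dynamic routing by agreement, the voting mechanism the abstract refers to, in which coupling coefficients grow where low-level votes agree with high-level outputs; the shapes and iteration count are illustrative assumptions.

```python
# Low-level capsules send votes to high-level capsules; couplings are updated
# toward the high-level capsules whose outputs agree with the votes.
import numpy as np

def squash(v: np.ndarray) -> np.ndarray:
    """Shrink vectors so their norms lie in [0, 1) while keeping direction."""
    norm2 = np.sum(v ** 2, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * v / np.sqrt(norm2 + 1e-9)

def dynamic_routing(votes: np.ndarray, iterations: int = 3) -> np.ndarray:
    """votes: (num_low, num_high, dim) predictions from low- to high-level capsules."""
    num_low, num_high, _ = votes.shape
    logits = np.zeros((num_low, num_high))
    for _ in range(iterations):
        coupling = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
        outputs = squash(np.einsum("ij,ijd->jd", coupling, votes))             # weighted sum
        logits += np.einsum("ijd,jd->ij", votes, outputs)                      # agreement
    return outputs

votes = np.random.randn(32, 10, 16)  # 32 low-level capsules voting for 10 classes
print(dynamic_routing(votes).shape)  # (10, 16) high-level capsule poses
```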
  • Item (Restricted)
    DEEP LEARNING MODELS FOR MOBILE AND WEARABLE BIOMETRICS
    (Saudi Digital Library, 2023-04-27) Almadan, Ali; Rattani, Ajita
    The mobile technology revolution has transformed mobile devices from communication tools to all-in-one platforms. As a result, more people are using smartphones to access e-commerce and banking services, replacing traditional desktop computers. However, smartphones are more prone to being lost or stolen, requiring effective user authentication mechanisms for securing transactions. Ocular biometrics offers accuracy, security, and ease of use on mobile devices for user authentication. In addition, face recognition technology has been widely adopted in intelligence gathering, law enforcement, surveillance, and consumer applications. This technology has recently been implemented in smartphones and body-worn cameras (BWC) for surveillance and situational awareness. However, these high-performing models require significant computational resources, making their deployment on resource-constrained smartphones challenging. To address this challenge, studies have proposed compact-size ocular-based deep-learning models for on-device deployment. In this context, we conduct a thorough analysis of existing neural network compression techniques applied standalone and in combination for ocular-based user authentication and facial recognition.
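    An illustrative sketch of two generic compression techniques of the kind the abstract surveys, applied in combination (magnitude pruning, then 8-bit uniform quantisation); this is a standard example, not the specific pipeline studied in the thesis.

```python
# Prune the smallest-magnitude weights, then quantise the survivors to 8 bits,
# two steps often combined for on-device deployment.
import numpy as np

def prune_by_magnitude(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def quantise_uint8(weights: np.ndarray):
    """Uniform affine quantisation to 8 bits; returns codes plus scale/offset."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0 or 1.0
    codes = np.round((weights - w_min) / scale).astype(np.uint8)
    return codes, scale, w_min

weights = np.random.randn(256, 256).astype(np.float32)
pruned = prune_by_magnitude(weights, sparsity=0.7)
codes, scale, offset = quantise_uint8(pruned)
recovered = codes.astype(np.float32) * scale + offset
print(np.count_nonzero(pruned) / pruned.size,    # remaining weight fraction
      float(np.abs(recovered - pruned).max()))   # worst-case quantisation error
```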

Copyright owned by the Saudi Digital Library (SDL) © 2025