Saudi Cultural Missions Theses & Dissertations
Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10
Item Restricted: Human Action Recognition Based on Convolutional Neural Networks and Vision Transformers (University of Southampton, 2025-05). Alomar, Khaled Abdulaziz; Xiaohao, Cai.
This thesis explores the impact of deep learning on human action recognition (HAR), addressing challenges in feature extraction and model optimization through three interconnected studies. The second chapter surveys data augmentation techniques in classification and segmentation, emphasizing their role in improving HAR by mitigating dataset limitations and class imbalance. The third chapter introduces TransNet, a transfer learning-based model, and its enhanced version, TransNet+, which utilizes autoencoders for improved feature extraction and demonstrates superior performance over existing models. The fourth chapter reviews CNNs, RNNs, and Vision Transformers, proposes a novel CNN-ViT hybrid model, compares its effectiveness against state-of-the-art HAR methods, and discusses future research directions.

Item Restricted: ADAPTIVE SELF-LEARNING AND MULTI-STAGE MODELING FOR EFFICIENT MEDICAL AND DENTAL IMAGE SEGMENTATION (University of Missouri - Kansas City, 2025). Alqarni, Saeed; Yugyung, Lee.
Medical imaging has revolutionized healthcare by enabling non-invasive visualization of anatomical structures and pathologies, significantly improving diagnostic accuracy, treatment planning, and patient monitoring. Modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound provide critical insights into the human body, yet precise medical image segmentation remains a challenging task. This difficulty arises from factors such as image variability, noise, artifacts, and the limited availability of annotated data needed to train robust segmentation models. Overcoming these hurdles is essential to unlock the full potential of medical imaging in diverse clinical applications. This dissertation presents a novel framework for efficient and accurate medical image segmentation, incorporating multi-stage transfer learning, uncertainty-driven data selection, and weakly supervised learning. By combining human-guided refinement with adaptive data selection, this research addresses fundamental barriers such as data scarcity, computational resource limitations, and the high cost of annotation. The framework is structured around three key objectives:
1. Adaptive Uncertainty Sampling with SAM (AUSAM), which introduces a flexible, real-time data selection and segmentation approach, reducing reliance on large annotated datasets through dynamic thresholds and DBSCAN clustering.
2. AUSAM-SL (Active Self-Learning with SAM), which integrates entropy-based active learning with iterative self-labeling, supported by SAM for initial training, refining the selection criteria and enhancing model predictions.
3. AUSAM-3D (3D Modeling for Domain-Aware Segmentation and Aggregation), which builds upon AUSAM by incorporating a spatial and volumetric dimension, improving segmentation accuracy for organs and tumors and enabling more clinically relevant outcomes.
Preliminary results on medical and dental imaging datasets (MRI, CT, X-ray) validate the effectiveness of the proposed framework in improving segmentation accuracy while maintaining computational efficiency. By integrating human feedback with semi-supervised and weakly supervised learning techniques, the research offers scalable solutions suitable for resource-constrained environments. This work advances the field of medical and dental image segmentation and provides practical methods for leveraging multi-stage learning in real-world applications where data and computational resources are limited.
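As an illustrative aside, a minimal sketch of the entropy-driven, cluster-based sample selection idea described in objective 1 might look like the following; the use of softmax class probabilities, scikit-learn's DBSCAN, and all thresholds are assumptions for illustration, not details taken from the dissertation.

```python
# Illustrative sketch only: entropy-based uncertainty scoring with DBSCAN
# grouping to decide which unlabeled samples to annotate next.
import numpy as np
from sklearn.cluster import DBSCAN

def prediction_entropy(probs: np.ndarray) -> np.ndarray:
    """Per-sample entropy of softmax probabilities, shape (N, C) -> (N,)."""
    eps = 1e-12
    return -(probs * np.log(probs + eps)).sum(axis=1)

def select_uncertain_samples(probs: np.ndarray,
                             embeddings: np.ndarray,
                             entropy_threshold: float = 1.0,
                             eps: float = 0.5,
                             min_samples: int = 5) -> list[int]:
    """Pick one representative per DBSCAN cluster among high-entropy samples."""
    entropy = prediction_entropy(probs)
    uncertain = np.where(entropy > entropy_threshold)[0]
    if uncertain.size == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings[uncertain])
    selected = []
    for cluster_id in set(labels) - {-1}:  # -1 marks DBSCAN noise points
        members = uncertain[labels == cluster_id]
        # take the most uncertain member of each cluster as its representative
        selected.append(int(members[np.argmax(entropy[members])]))
    return selected
```

The selected indices would then be sent for annotation (or SAM-assisted labeling), and the dynamic threshold could be adjusted between rounds.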
Item Restricted: LIGHTREFINENET-SFMLEARNER: SEMI-SUPERVISED VISUAL DEPTH, EGO-MOTION AND SEMANTIC MAPPING (Newcastle University, 2024). Alshadadi, Abdullah Turki; Holder, Chris.
The advancement of autonomous vehicles has garnered significant attention, particularly in the development of complex software stacks that enable navigation, decision-making, and planning. Among these, the Perception [1] component is critical, allowing vehicles to understand their surroundings and maintain localisation. Simultaneous Localisation and Mapping (SLAM) plays a key role by enabling vehicles to map unknown environments while tracking their positions. Historically, SLAM has relied on heuristic techniques, but with the advent of the "Perception Age" [2], research has shifted towards more robust, high-level environmental awareness driven by advances in computer vision and deep learning. In this context, MLRefineNet [3] has demonstrated superior robustness and faster convergence in supervised learning tasks. However, despite its improvements, MLRefineNet struggled to fully converge within 200 epochs when integrated into SfmLearner; nevertheless, clear improvements were observed with each epoch, indicating its potential for enhancing performance. SfmLearner [4] is a state-of-the-art deep learning model for visual odometry, known for its competitive depth and pose estimation, but it lacks a high-level understanding of the environment, which is essential for comprehensive perception in autonomous systems. This work addresses that limitation by introducing a multi-modal shared encoder-decoder architecture that integrates both semantic segmentation and depth estimation. The inclusion of high-level environmental understanding not only enhances scene interpretation (such as identifying roads, vehicles, and pedestrians) but also improves the depth estimation of SfmLearner. This multi-task learning approach strengthens the model's overall robustness, marking a significant step forward in the development of autonomous vehicle perception systems.

Item Restricted: Smart Glasses for the Blind and Visually Impaired (BVI) (NEWCASTLE UNIVERSITY, 2024-08-23). Aljamil, Ibtisam; Wang, Shidong.
Technology has rapidly advanced to become a powerful tool for enhancing quality of life and providing creative solutions to the daily obstacles faced by different societal groups. Blindness is a significant obstacle that impacts the lives of millions worldwide. When carrying out everyday activities, blind individuals face many challenges to movement and independent living. Consequently, there is an immediate need to create assistive tools and technologies that enhance the capabilities and independence of visually impaired individuals. The 'Smart Glasses for the Blind and Visually Impaired' (BVI) project is an innovative solution based on computer vision and artificial intelligence technologies, aiming to assist blind individuals in recognising their surroundings and navigating safely and quickly. The fundamental goal of this project is to create intelligent eyewear that incorporates cameras and sensors, data analysis software, and object recognition technologies, along with text-to-speech capabilities, object detection, and user guidance through audio feedback. The glasses employ a Raspberry Pi 4 single-board computer for camera image processing, using Optical Character Recognition (OCR) and the OpenCV library to identify text accurately, and a Single Shot Multibox Detector (SSD) to detect objects in the environment. This enables visually impaired individuals to navigate independently while receiving real-time alerts about potential hazards along their path, making the glasses a valuable tool in the lives of visually impaired people.
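A minimal sketch of how the SSD detection and OCR steps described above might be wired together with OpenCV and pytesseract follows; the specific model files, class list, and thresholds are assumptions for illustration, not details from the thesis.

```python
# Illustrative sketch only: one camera frame through SSD object detection and
# OCR, with the resulting alert strings intended for a text-to-speech engine.
import cv2
import pytesseract

# A MobileNet-SSD Caffe model is assumed here; any SSD variant would do.
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")
CLASSES = ["background", "person", "car", "chair", "door"]  # assumed label set

def process_frame(frame, conf_threshold=0.5):
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 scalefactor=0.007843, size=(300, 300),
                                 mean=127.5)
    net.setInput(blob)
    detections = net.forward()

    alerts = []
    for i in range(detections.shape[2]):
        confidence = float(detections[0, 0, i, 2])
        if confidence > conf_threshold:
            class_id = int(detections[0, 0, i, 1])
            name = CLASSES[class_id] if class_id < len(CLASSES) else "object"
            alerts.append(f"{name} ahead ({confidence:.0%})")

    # OCR on the whole frame; in practice this would run on detected text regions.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    text = pytesseract.image_to_string(gray).strip()
    if text:
        alerts.append(f"Text detected: {text}")
    return alerts  # these strings would be spoken via a text-to-speech engine
```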
Item Restricted: Multi-Stage and Multi-Target Data-Centric Approaches to Object Detection, Localization, and Segmentation in Medical Imaging (University of California San Diego, 2024). Albattal, Abdullah; Nguyen, Truong.
Object detection, localization, and segmentation in medical images are essential in several medical procedures. Identifying abnormalities and anatomical structures of interest within these images remains challenging due to the variability in patient anatomy, imaging conditions, and the inherent complexities of biological structures. To address these challenges, we propose a set of frameworks for real-time object detection and tracking in ultrasound scans and two frameworks for liver lesion detection and segmentation in single- and multi-phase computed tomography (CT) scans. The first framework for ultrasound object detection and tracking uses a segmentation model weakly trained on bounding-box labels as the backbone architecture; it outperformed state-of-the-art object detection models in detecting the Vagus nerve within scans of the neck. To improve the detection and localization accuracy of the backbone network, we propose a multi-path decoder UNet whose detection performance is on par with, or slightly better than, the more computationally expensive UNet++, which has 20% more parameters and requires twice the inference time. For liver lesion segmentation and detection in multi-phase CT scans, we propose an approach that first aligns the liver using liver segmentation masks, followed by deformable registration with the VoxelMorph model. We also propose a learning-free framework to estimate and correct abnormal deformations in deformable image registration models. The first framework for liver lesion segmentation is a multi-stage framework that incorporates models trained on each phase individually in addition to a model trained on all phases together, using a segmentation refinement and correction model that combines these models' predictions with the CT image to improve the overall lesion segmentation. This framework improves subject-wise segmentation performance by 1.6% while reducing performance variability across subjects by 8% and instances of segmentation failure by 50%. In the second framework, we propose a liver lesion mask selection algorithm that compares the separation of intensity features between the lesion and the surrounding tissue across predictions from multiple specialized models and selects the mask that maximizes this separation. The selection approach improves detection rates by 15.5% for small lesions and by 4.3% for lesions overall.
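A minimal sketch of the mask selection idea described above follows; using a standardized mean difference between lesion and surrounding-ring intensities as the separation score is an assumption for illustration, and the dissertation's actual criterion may differ.

```python
# Illustrative sketch only: pick, from several candidate lesion masks, the one
# whose lesion/surround intensity distributions are most separated.
import numpy as np
from scipy.ndimage import binary_dilation

def separation_score(ct_volume: np.ndarray, mask: np.ndarray, ring_width: int = 3) -> float:
    """Separation between intensities inside the mask and in a surrounding ring."""
    if mask.sum() == 0:
        return -np.inf
    ring = binary_dilation(mask, iterations=ring_width) & ~mask
    lesion, surround = ct_volume[mask], ct_volume[ring]
    pooled_std = np.sqrt(0.5 * (lesion.var() + surround.var())) + 1e-6
    return abs(lesion.mean() - surround.mean()) / pooled_std

def select_best_mask(ct_volume: np.ndarray, candidate_masks: list) -> np.ndarray:
    """Return the candidate mask that maximizes lesion/surround separation."""
    scores = [separation_score(ct_volume, m.astype(bool)) for m in candidate_masks]
    return candidate_masks[int(np.argmax(scores))]
```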
Item Restricted: Efficient Processing of Convolutional Neural Networks on the Edge: A Hybrid Approach Using Hardware Acceleration and Dual-Teacher Compression (University of Central Florida, 2024-07-05). Alhussain, Azzam; Lin, Mingjie.
This dissertation addresses the challenge of accelerating Convolutional Neural Networks (CNNs) for edge computing in computer vision applications by developing specialized hardware solutions that maintain high accuracy and perform real-time inference. Driven by open-source hardware design frameworks such as FINN and HLS4ML, this research focuses on hardware acceleration, model compression, and efficient implementation of CNN algorithms on AMD SoC-FPGAs using High-Level Synthesis (HLS) to optimize resource utilization and improve the throughput per watt of FPGA-based AI accelerators compared to fixed-logic chips such as CPUs, GPUs, and other edge accelerators. The dissertation introduces a novel CNN compression technique, "Two-Teachers Net," which uses PyTorch FX graph mode to train an 8-bit quantized student model via knowledge distillation from two teacher models, improving the accuracy of the compressed model by 1%-2% compared to existing solutions for edge platforms. The method can be applied to any CNN model and dataset for image classification and integrates seamlessly into existing AI hardware and software optimization toolchains, including Vitis-AI, OpenVINO, TensorRT, and ONNX, without architectural adjustments. This provides a scalable solution for deploying high-accuracy CNNs on low-power edge devices across applications such as autonomous vehicles, surveillance systems, robotics, healthcare, and smart cities.
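A minimal sketch of distilling a student from two teachers, in the spirit of the "Two-Teachers Net" idea above, follows; averaging the teachers' softened distributions, the loss weights, and the temperature are assumptions, and the quantization step is only indicated in a comment.

```python
# Illustrative sketch only: knowledge distillation from two teachers into one
# student by blending hard-label cross-entropy with a KL term.
import torch
import torch.nn.functional as F

def two_teacher_kd_loss(student_logits, teacher1_logits, teacher2_logits,
                        targets, temperature=4.0, alpha=0.7):
    """Distillation loss combining two softened teacher distributions."""
    t1 = F.softmax(teacher1_logits / temperature, dim=1)
    t2 = F.softmax(teacher2_logits / temperature, dim=1)
    teacher_probs = 0.5 * (t1 + t2)          # assumed equal weighting of teachers

    log_student = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(log_student, teacher_probs, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * ce

# For the 8-bit student, the model would typically be prepared for
# quantization-aware training with torch.ao.quantization FX graph mode
# (e.g., prepare_qat_fx before training and convert_fx afterwards); the exact
# flow depends on the target toolchain.
```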
Item Restricted: Automatic Sketch-to-Image Synthesis and Recognition (University of Dayton, 2024-05). Baraheem, Samah; Nguyen, Tam.
Images are used everywhere because they convey a story, a fact, or an imagination without words, and the human brain extracts knowledge from images faster than from words. However, creating an image from scratch is not only time-consuming but also a tedious task that requires skill, since an image contains rich features and fine-grained details such as color, brightness, saturation, luminance, texture, and shadow. Sketch-to-image synthesis offers a way to generate an image in less time and without artistic skill: hand sketches are much easier to produce, contain only the key structural information, and can be drawn quickly without training. Because sketches are often simple, rough, black-and-white, and sometimes imperfect, converting a sketch into an image is not a trivial problem. It has therefore attracted much research attention aimed at generating photorealistic images, yet the generated images still suffer from issues such as unnaturalness, ambiguity, distortion, and, most importantly, difficulty with complex input containing multiple objects. Most of these problems stem from converting a sketch into an image directly in one shot.
To this end, this dissertation proposes a new framework that divides the problem into sub-problems, leading to high-quality photorealistic images even for complicated sketches. Instead of directly mapping the input sketch into an image, we map the sketch into an intermediate result, a mask map, through instance segmentation and semantic segmentation at two levels: background segmentation and foreground segmentation. The background segmentation is formed based on the context of the existing foreground objects, and various natural scenes are implemented for both indoor and outdoor settings. A foreground segmentation process then adds each detected object sequentially and semantically into the constructed segmented background. Next, the mask map is converted into an image through an image-to-image translation model. Finally, a post-processing stage enhances the synthetic image further via background improvement and human face refinement. This not only produces better results but also makes it possible to generate images from complicated sketches with multiple objects. We further improve the framework with scene and size awareness. For size awareness, in the instance segmentation stage the objects' sizes may be modified based on the surrounding environment and their respective size priors to reflect reality and produce more realistic, naturalistic images. For scene awareness in the background improvement step, the scene is first defined based on context and classified by a scene classifier, after which a scene image is selected; the generated objects are then placed on the chosen scene image at pre-defined snapping points so that they sit in their proper locations and realism is maintained.
Furthermore, since generated images have improved over time regardless of the input modality, it has become hard at times to distinguish synthetic images from genuine ones. While this improves content and media, it poses a serious threat to legitimacy, authenticity, and security, so an automatic detection system for AI-generated images is a legitimate need; such a system can also serve as an evaluation tool for image synthesis models regardless of input modality. AI-generated images usually bear explicit or implicit artifacts that arise during the generation process. Prior research has focused on detecting synthetic images generated by one specific model or by similar models with similar architectures, which raises a generalization problem. To tackle this, we propose to fine-tune a pre-trained Convolutional Neural Network (CNN) model on a newly collected dataset of AI-generated images from different image synthesis architectures and different input modalities (text, sketch, and other sources such as another image or a mask) to improve generalization across tasks and architectures.
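A minimal sketch of fine-tuning a pre-trained CNN as a binary real-versus-generated classifier, as described above, follows; the ResNet-50 backbone, folder layout, and hyperparameters are assumptions for illustration.

```python
# Illustrative sketch only: fine-tune a pre-trained CNN to separate real
# images from AI-generated ones.
import torch
import torch.nn as nn
from torchvision import models, datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# ResNet-50 backbone with a 2-class head (real vs. generated).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)
model = model.to(device)

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Assumed folder layout: train/real/*.png and train/generated/*.png
train_set = datasets.ImageFolder("train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for epoch in range(5):
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```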
Our contribution is thus two-fold. First, we generate high-quality, realistic images from simple, rough, black-and-white sketches, for which a newly collected dataset of sketch-like images is compiled for training. Second, since artificial images have both advantages and disadvantages in the real world, we create an automated system that detects and localizes synthetic images among genuine ones, trained on a large collected dataset of generated and real images.

Item Restricted: Audio-to-Video Synthesis using Generative Adversarial Network (University of New South Wales, 2024-01-23). Aldausari, Nuha; Mohammadi, Gelareh; Sowmya, Arcot; Marcus, Nadine.
Video generation is often perceived as stringing together several image generators; however, in addition to visual quality, video generators must also account for motion smoothness and synchronicity with audio and text. Audio plays a crucial role in guiding visual content, as even slight discrepancies between audio and motion are noticeable to human eyes. Audio can therefore serve as a self-supervised signal for learning motion and building correlations between audio and motion. While there have been attempts to build promising audio-to-video generation models, these models typically rely on supervised signals such as keypoints, and annotating keypoints takes time and effort. This thesis therefore focuses on audio-based, pixel-level video generation without keypoints. The primary goal is to build models that generate temporally and spatially coherent video from audio inputs, and the thesis proposes multiple audio-to-video generator frameworks. The first proposed model, PhonicsGAN, uses GRU units for the audio to generate pixel-based videos. The subsequent frameworks each address particular challenges while pursuing the same objective. To improve the spatial quality of the generated videos, a model is proposed that adapts the image fusion concept to video generation, incorporating a multiscale fusion model that combines images with video frames. Because the temporal aspect of the video frames matters as well, a shuffling technique is proposed that presents each dataset sample with varied permutations to improve the model's temporal learning. We also propose a model that learns motion trajectories from sparse motion frames, using AdaIN to adjust the motion in the content frame towards the target frame and enhance the learning of video motion. All the proposed models are compared with state-of-the-art models to demonstrate their ability to generate high-quality videos from audio inputs. This thesis contributes to the field of video generation in several ways: first, by providing an extensive survey of GAN-based video generation techniques; second, by proposing and evaluating four pixel-based frameworks for improved audio-to-video generation, each addressing an important challenge in the field; and lastly, by collecting and publishing a new audio/visual dataset that the research community can use for further investigation in this area.
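A toy sketch of an audio-conditioned, GRU-based frame generator in the spirit of the pixel-level generation described above follows; the mel-spectrogram input, layer sizes, and upsampling stack are assumptions and do not reflect the actual PhonicsGAN architecture.

```python
# Illustrative sketch only: map an audio feature sequence to one RGB frame per
# timestep via a GRU followed by transposed convolutions.
import torch
import torch.nn as nn

class AudioToVideoGenerator(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.gru = nn.GRU(input_size=n_mels, hidden_size=hidden, batch_first=True)
        # Map each per-timestep hidden state to an 8x8 feature map, then upsample to 64x64.
        self.to_frame = nn.Sequential(
            nn.Linear(hidden, 128 * 8 * 8),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, mel):               # mel: (batch, time, n_mels)
        hidden_states, _ = self.gru(mel)  # (batch, time, hidden)
        b, t, h = hidden_states.shape
        frames = self.to_frame(hidden_states.reshape(b * t, h))
        return frames.view(b, t, 3, 64, 64)  # one 64x64 RGB frame per audio step
```

In a full GAN setup, a video discriminator judging both frame realism and audio-visual alignment would be trained adversarially against a generator of this kind.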
Item Restricted: Automatic Generation of a Coherent Story from a Set of Images (Saudi Digital Library, 2023-12). Aljawy, Zainy; Mian, Ajmal; Hassan, Ghulam Mubashar.
This dissertation explores vision-and-language (V&L) algorithms. While V&L methods succeed in image and video captioning tasks, the dynamic Visual Storytelling Task (VST) remains challenging: VST demands coherent stories from a set of images, requiring grammatical accuracy, flow, and style. The dissertation addresses these challenges. Chapter 2 presents a framework utilizing an advanced language model. Chapters 3 and 4 introduce novel techniques that integrate rich visual representations to enhance the generated stories. Chapter 5 introduces a new storytelling dataset with a comprehensive analysis. Chapter 6 proposes a state-of-the-art Transformer-based model for generating coherent and informative story descriptions from image sets.

Item Restricted: Seeing in the Dark: Towards Robust Pedestrian Detection at Nighttime (Saudi Digital Library, 2023-12-24). Althoupety, Afnan; Feng, Wu-chi.
"At some point in the day, everyone is a pedestrian" is a message from the National Highway Traffic Safety Administration (NHTSA) about pedestrian safety. In 2020, NHTSA reported that 6,516 pedestrians were killed in traffic crashes in the United States, one every 81 minutes on average. In relation to light conditions, 77% of pedestrian fatalities occurred in the dark, 20% in daylight, 2% at dusk, and 2% at dawn. To tackle the issue from a technological perspective, this dissertation addresses the robustness of pedestrian detection in dark conditions, drawing on image processing and learning-based approaches by: (i) proposing a pedestrian-luminance-aware brightening framework that moderately corrects image luminance so that pedestrians can be detected more robustly; (ii) proposing an image-to-image translation framework that learns the mapping between the night and day domains through adversarial (minimax-game) training of generators and discriminators, thereby alleviating dark-pedestrian detection using synthetic night images; and (iii) proposing a multi-modal framework that pairs RGB and infrared images to reduce the influence of the lighting factor and make pedestrian detection a fair game regardless of illumination variance.
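A minimal sketch of a luminance-aware brightening step in the spirit of item (i) above follows; the adaptive gamma heuristic and the target mean luminance are assumptions for illustration, not the dissertation's actual method.

```python
# Illustrative sketch only: moderately brighten a dark frame by gamma-correcting
# its value (luminance) channel so that dark pedestrians become easier to detect.
import cv2
import numpy as np

def brighten_night_image(bgr: np.ndarray, target_mean: float = 0.45) -> np.ndarray:
    """Adaptive gamma correction driven by the frame's mean luminance."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    value = hsv[..., 2] / 255.0
    mean_luminance = float(value.mean())
    if mean_luminance >= target_mean:        # already bright enough, leave untouched
        return bgr
    # Choose gamma so the mean luminance is pushed towards the target (gamma < 1 brightens).
    gamma = np.log(target_mean) / np.log(mean_luminance + 1e-6)
    hsv[..., 2] = np.clip((value ** gamma) * 255.0, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```

The corrected frame would then be passed to a standard pedestrian detector, with the correction kept moderate so that daytime or well-lit regions are not washed out.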