Saudi Cultural Missions Theses & Dissertations
Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10
Search Results: 10 items
Item Restricted: Smart Glasses for the Blind and Visually Impaired (BVI) (Newcastle University, 2024-08-23). Aljamil, Ibtisam; Wang, Shidong.

Technology has rapidly advanced to become a powerful tool for enhancing quality of life and providing creative solutions to the daily obstacles faced by different societal groups. Blindness is a significant obstacle that affects the lives of millions worldwide; when carrying out everyday activities, blind individuals face many mobility and independent-living challenges. Consequently, there is an immediate need for assistive tools and technologies that enhance the capabilities and independence of visually impaired individuals. The "Smart Glasses for the Blind and Visually Impaired" (BVI) project is an innovative solution based on computer vision and artificial intelligence technologies. It aims to help blind individuals recognise their surroundings and navigate safely and quickly. The fundamental goal of this project is to create intelligent eyewear that incorporates advanced cameras and sensors, data-analysis software, and object-recognition technologies, along with text-to-speech capabilities, object detection, and user guidance through audio feedback. The glasses employ a Raspberry Pi 4 for camera image processing, using Optical Character Recognition (OCR) and OpenCV to identify text accurately. To detect objects in the environment, the smart glasses use a Single Shot MultiBox Detector (SSD). This technology enables visually impaired individuals to navigate independently while receiving real-time alerts about potential hazards along their path.
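As an illustrative sketch only (the hazard-class list, thresholds, and area-based distance cue below are assumptions for the example, not the project's actual code), the detection-to-audio-feedback step might look like:

```python
# Sketch: turn SSD-style detections into spoken hazard alerts.
# Each detection is (label, confidence, box_area_fraction);
# a larger box-area fraction is used here as a crude proxy for proximity.

HAZARD_LABELS = {"car", "bicycle", "person", "stairs"}  # illustrative set

def hazard_alerts(detections, min_conf=0.5, near_area=0.15):
    """Return alert strings for confident, nearby hazards."""
    alerts = []
    for label, conf, area in detections:
        if label in HAZARD_LABELS and conf >= min_conf:
            distance = "very close" if area >= near_area else "ahead"
            alerts.append(f"Warning: {label} {distance}")
    return alerts

# A text-to-speech engine (e.g. espeak on the Raspberry Pi) would then
# read each returned alert aloud to the wearer.
```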
The numerous benefits of these glasses make them a valuable tool in the lives of visually impaired people.

Item Restricted: Multi-Stage and Multi-Target Data-Centric Approaches to Object Detection, Localization, and Segmentation in Medical Imaging (University of California San Diego, 2024). Albattal, Abdullah; Nguyen, Truong.

Object detection, localization, and segmentation in medical images are essential to several medical procedures. Identifying abnormalities and anatomical structures of interest within these images remains challenging due to variability in patient anatomy and imaging conditions and the inherent complexity of biological structures. To address these challenges, we propose a set of frameworks for real-time object detection and tracking in ultrasound scans, and two frameworks for liver lesion detection and segmentation in single- and multi-phase computed tomography (CT) scans. The first framework for ultrasound object detection and tracking uses a segmentation model weakly trained on bounding-box labels as the backbone architecture; it outperformed state-of-the-art object detection models in detecting the Vagus nerve within scans of the neck. To improve the detection and localization accuracy of the backbone network, we propose a multi-path decoder UNet whose detection performance is on par with, or slightly better than, the more computationally expensive UNet++, which has 20% more parameters and requires twice the inference time. For liver lesion segmentation and detection in multi-phase CT scans, we propose first aligning the liver using liver segmentation masks, followed by deformable registration with the VoxelMorph model. We also propose a learning-free framework to estimate and correct abnormal deformations in deformable image registration models.
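One standard learning-free signal for abnormal deformations, sketched below as an illustration (this is an assumption about the general technique, not the dissertation's actual method), is the Jacobian determinant of the deformation: the mapping folds wherever the determinant is non-positive.

```python
import numpy as np

def folding_fraction(disp):
    """Fraction of pixels where a 2-D deformation folds (Jacobian det <= 0).

    disp: (H, W, 2) displacement field; the mapping is x -> x + disp(x).
    """
    # Jacobian of the full mapping = identity + gradients of the displacement.
    # np.gradient returns derivatives along axis 0 (y) then axis 1 (x).
    dux_dy, dux_dx = np.gradient(disp[..., 0])
    duy_dy, duy_dx = np.gradient(disp[..., 1])
    det = (1.0 + dux_dx) * (1.0 + duy_dy) - dux_dy * duy_dx
    return float(np.mean(det <= 0.0))
```

A registration result with a high folding fraction would be flagged for correction.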
The first framework for liver lesion segmentation is a multi-stage framework that incorporates models trained on each phase individually in addition to a model trained on all phases together. It uses a segmentation refinement and correction model that combines these models' predictions with the CT image to improve the overall lesion segmentation, improving subject-wise segmentation performance by 1.6% while reducing performance variability across subjects by 8% and instances of segmentation failure by 50%. In the second framework, we propose a liver lesion mask selection algorithm that compares the separation of intensity features between the lesion and surrounding tissue across multi-specialized model predictions and selects the mask that maximizes this separation. The selection approach improves detection rates by 15.5% for small lesions and by 4.3% for lesions overall.

Item Restricted: Efficient Processing of Convolutional Neural Networks on the Edge: A Hybrid Approach Using Hardware Acceleration and Dual-Teacher Compression (University of Central Florida, 2024-07-05). Alhussain, Azzam; Lin, Mingjie.

This dissertation addresses the challenge of accelerating Convolutional Neural Networks (CNNs) for edge computing in computer vision applications by developing specialized hardware solutions that maintain high accuracy and perform real-time inference. Driven by open-source hardware design frameworks such as FINN and HLS4ML, this research focuses on hardware acceleration, model compression, and efficient implementation of CNN algorithms on AMD SoC-FPGAs using High-Level Synthesis (HLS) to optimize resource utilization and improve the throughput per watt of FPGA-based AI accelerators compared to traditional fixed-logic chips such as CPUs, GPUs, and other edge accelerators.
The dissertation introduces a novel CNN compression technique, "Two-Teachers Net," which uses PyTorch FX graph mode to train an 8-bit quantized student model via knowledge distillation from two teacher models, improving the accuracy of the compressed model by 1-2% over existing solutions for edge platforms. The method can be applied to any CNN model and dataset for image classification and integrates seamlessly into existing AI hardware and software optimization toolchains, including Vitis AI, OpenVINO, TensorRT, and ONNX, without architectural adjustments. This provides a scalable solution for deploying high-accuracy CNNs on low-power edge devices across applications such as autonomous vehicles, surveillance systems, robotics, healthcare, and smart cities.

Item Restricted: Automatic Sketch-to-Image Synthesis and Recognition (University of Dayton, 2024-05). Baraheem, Samah; Nguyen, Tam.

Images are used everywhere because they convey a story, a fact, or an imagination without any words; the human brain can extract knowledge from an image faster than from words. However, creating an image from scratch is not only time-consuming but also a tedious task that requires skill, since an image contains rich features and fine-grained details such as color, brightness, saturation, luminance, texture, and shadow. To generate an image in less time and without artistic skill, sketch-to-image synthesis can be used: hand sketches are much easier to produce, contain only the key structural information, and can be drawn quickly without training. Yet because sketches are often simple, rough, black-and-white, and sometimes imperfect, converting a sketch into an image is not a trivial problem.
Hence, this challenging problem has attracted researchers' attention, and much research has been conducted in this field to generate photorealistic images. However, the generated images still suffer from issues such as unnaturalness, ambiguity, distortion, and, most importantly, difficulty in generating images from complex input with multiple objects. Most of these problems stem from converting a sketch into an image directly in one shot. To this end, in this dissertation we propose a new framework that divides the problem into sub-problems, leading to high-quality photorealistic images even from complicated sketches. Instead of directly mapping the input sketch to an image, we map the sketch to an intermediate result, namely a mask map, through instance segmentation and semantic segmentation at two levels: background segmentation and foreground segmentation. The background segmentation is formed based on the context of the existing foreground objects, and various natural scenes are implemented for both indoor and outdoor settings. Then a foreground segmentation process is commenced, where each detected object is sequentially and semantically added into the constructed segmented background. Next, the mask map is converted into an image through an image-to-image translation model. Finally, a post-processing stage further enhances the synthetic image via background improvement and human face refinement. This not only generates better results but also makes it possible to generate images from complicated sketches with multiple objects. We further improve our framework by implementing scene and size sensing. For the size-awareness feature, in the instance segmentation stage the objects' sizes may be modified based on the surrounding environment and their respective size priors, to reflect reality and produce more realistic and naturalistic images.
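The sequential foreground-placement step can be sketched roughly as follows (rectangles stand in for instance masks, and all names here are illustrative assumptions, not the dissertation's code):

```python
import numpy as np

def compose_mask_map(h, w, bg_label, objects):
    """Build a label map by painting foreground objects over a background.

    objects: list of (label, y, x, height, width) rectangles standing in for
    instance masks; later objects overwrite earlier ones, mimicking the
    sequential, semantically ordered foreground-placement step.
    """
    mask = np.full((h, w), bg_label, dtype=np.int32)
    for label, y, x, oh, ow in objects:
        mask[y:y + oh, x:x + ow] = label
    return mask
```

The resulting label map would then be fed to an image-to-image translation model to produce the final image.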
Moreover, to implement the scene-awareness feature in the background improvement step, the scene is first defined based on context and classified with a scene classifier, and a scene image is selected; the generated objects are then placed on the chosen scene image at pre-defined snapping points so that objects land in their proper locations and realism is maintained. Furthermore, since generated images have improved over time regardless of the input modality, it has sometimes become hard to distinguish synthetic images from genuine ones. This improves content and media, but it is also a serious threat to legitimacy, authenticity, and security. Thus, an automatic detection system for AI-generated images is a legitimate need; such a system can also serve as an evaluation tool for image synthesis models, whatever the input modality. Indeed, AI-generated images usually bear explicit or implicit artifacts introduced during the generation process. Prior research focused on detecting synthetic images generated by one specific model, or by similar models with similar architectures; hence, a generalization problem has arisen. To tackle this problem, we propose to fine-tune a pre-trained Convolutional Neural Network (CNN) model on a specially collected new dataset. This dataset consists of AI-generated images from different image synthesis architectures and different input modalities, i.e., text, sketch, and other sources (another image or a mask), to help generalization across various tasks and architectures. Our contribution is two-fold. First, we generate high-quality realistic images from simple, rough, black-and-white sketches, for which a new dataset of sketch-like images was compiled for training purposes.
Second, since artificial images have advantages and disadvantages in the real world, we create an automated system able to detect and localize synthetic images among genuine ones, for which a large dataset of generated and real images was collected to train a CNN model.

Item Restricted: Audio-to-Video Synthesis using Generative Adversarial Network (University of New South Wales, 2024-01-23). Aldausari, Nuha; Mohammadi, Gelareh; Sowmya, Arcot; Marcus, Nadine.

Video generation is often perceived as stringing together several image generators. However, in addition to visual quality, video generators must also consider motion smoothness and synchronicity with audio and text. Audio plays a crucial role in guiding visual content, as even slight discrepancies between audio and motion are noticeable to human eyes. Thus, audio can serve as a self-supervised signal for learning motion and building correlations between audio and motion. While there have been promising attempts at audio-to-video generation models, these models typically rely on supervised signals such as keypoints, and annotating keypoints takes time and effort. This thesis therefore focuses on audio-based pixel-level video generation without keypoints. The primary goal is to build models that generate temporally and spatially coherent video from audio inputs. The thesis proposes multiple audio-to-video generator frameworks. The first proposed model, PhonicsGAN, uses GRU units to encode audio for generating pixel-based videos. Each subsequent framework addresses a particular challenge while pursuing the same main objective. To improve the spatial quality of the generated videos, a model is proposed that adapts the image-fusion concept to video generation, incorporating a multiscale fusion model that combines images with video frames. While the spatial quality of the video is important, the temporal aspect of the video frames must also be considered.
To address this, a shuffling technique is proposed that presents each dataset sample in varied permutations to improve the video's temporal learning. We also propose a new model that learns motion trajectories from sparse motion frames; AdaIN is utilised to adjust the motion in the content frame toward the target frame, enhancing the learning of video motion. All the proposed models are compared with state-of-the-art models to demonstrate their ability to generate high-quality videos from audio inputs. This thesis contributes to the field of video generation in several ways: firstly, by providing an extensive survey of GAN-based video generation techniques; secondly, by proposing and evaluating four pixel-based frameworks for improved audio-to-video generation, each addressing an important challenge in the field; and lastly, by collecting and publishing a new audio-visual dataset that the research community can use for further investigations in this area.

Item Restricted: Automatic Generation of a Coherent Story from a Set of Images (Saudi Digital Library, 2023-12). Aljawy, Zainy; Mian, Ajmal; Hassan, Ghulam Mubashar.

This dissertation explores vision and language (V&L) algorithms. While V&L methods succeed in image and video captioning tasks, the dynamic Visual Storytelling Task (VST) remains challenging: VST demands coherent stories from a set of images, requiring grammatical accuracy, flow, and style. The dissertation addresses these challenges. Chapter 2 presents a framework utilizing an advanced language model. Chapters 3 and 4 introduce novel techniques that integrate rich visual representations to enhance the generated stories. Chapter 5 introduces a new storytelling dataset with a comprehensive analysis.
Chapter 6 proposes a state-of-the-art Transformer-based model for generating coherent and informative story descriptions from image sets.

Item Restricted: Seeing in the Dark: Towards Robust Pedestrian Detection at Nighttime (Saudi Digital Library, 2023-12-24). Althoupety, Afnan; Feng, Wu-chi.

"At some point in the day, everyone is a pedestrian" is a message from the National Highway Traffic Safety Administration (NHTSA) about pedestrian safety. In 2020, NHTSA reported that 6,516 pedestrians were killed in traffic crashes in the United States, one pedestrian every 81 minutes on average. With respect to light conditions, 77% of pedestrian fatalities occurred in the dark, 20% in daylight, 2% at dusk, and 2% at dawn. To tackle the issue from a technological perspective, this dissertation addresses the robustness of pedestrian detection in dark conditions, drawing on image processing and learning-based approaches by: (i) proposing a pedestrian-luminance-aware brightening framework that moderately corrects image luminance so that pedestrians can be detected more robustly; (ii) proposing an image-to-image translation framework that learns the mapping between night and day domains through the adversarial game between generators and discriminators, using the resulting synthetic night images to ease detection of pedestrians in the dark; and (iii) proposing a multi-modal framework that pairs RGB and infrared images to reduce the lighting factor and make pedestrian detection a fair game regardless of illumination variance.

Item Restricted: Assessing Transit Oriented Development using Satellite Imagery: Riyadh vs. Phoenix (Saudi Digital Library, 2023-08-23). Almazroa, Noor; de Weck, Olivier.

As urbanization becomes the way of the future, demands on cities are becoming more urgent, and growing awareness of the need for sustainability and resilience makes the use of today's technology and data critical in decision-making and planning.
In the first part of this thesis, I combine several of these techniques and datasets to explore their ability to provide a helpful assessment of Transit-Oriented Development (TOD). This research assesses transit-oriented characteristics in two cities, Riyadh, Saudi Arabia, and Phoenix, Arizona, US, which share many similarities in urban design and climate. I use high-resolution satellite imagery with computer vision methods to detect the built area around public transit stations, measuring building density and, combined with land-use data, residential and non-residential density; both measurements are important indicators of the success of a public transportation system. Of the two building detection methods tested, the one based on deep learning was more precise, with better generalization ability, while the method based on classical image processing was more sensitive to threshold choices, showing considerable variability when tested on different years. Both methods, however, gave useful predictions of buildings. From their results, I found that Phoenix has a building density below 50%, even around the busiest downtown stations; Riyadh, on the other hand, is more compact, with more than 50% of the land developed. In the second part, I formulate a System Dynamics model that is validated against Phoenix's actual ridership for the 2010-2020 period and predicts transit ridership in Riyadh. The model closely approximated Phoenix's ridership up to 2016. The Riyadh model estimated that ridership would start at six million riders, surpassing the Royal Commission for Riyadh City (RCRC) initial prediction of 1.6 million.
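A minimal stock-and-flow sketch of the kind of System Dynamics ridership model described (the logistic adoption term, churn term, and all parameter names are illustrative assumptions, not the thesis's calibrated model):

```python
def simulate_ridership(initial, years, adoption_rate, churn_rate, capacity):
    """Toy stock-and-flow model of annual transit ridership.

    The stock (riders) grows via a logistic-style adoption inflow that
    saturates near `capacity`, and shrinks via a proportional churn outflow.
    Returns the yearly ridership trajectory, including year 0.
    """
    riders = [float(initial)]
    for _ in range(years):
        r = riders[-1]
        inflow = adoption_rate * r * (1.0 - r / capacity)
        outflow = churn_rate * r
        riders.append(r + inflow - outflow)
    return riders
```

Calibration would consist of fitting such rates to observed ridership (as done against Phoenix's 2010-2020 data) before projecting a new city.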
The results of both parts indicate that Riyadh, being more densely built over a smaller area, with a more extensive transportation system and a bigger population, has a strong incentive to promote a more transit-oriented built environment by increasing walkability and dense mixed-use development throughout the city.

Item Restricted: Deep Learning Approaches for Object Tracking and Motion Estimation of Ultrasound Imaging Sequences (Saudi Digital Library, 2023). Alshahrani, Mohammed; Almekkawy, Mohamed.

In recent decades, object tracking and motion estimation in medical imaging have gained importance as powerful tools for improving diagnostic accuracy and therapy efficiency. This importance has led researchers to search for faster and more accurate object tracking algorithms. Different approaches have been used for object tracking, such as object detection, motion estimation, and similarity matching, the last of which is the focus of this study. Similarity matching can be addressed in several ways. The classical method takes an object and searches for a similar object in the subsequent frame (since this is object tracking in a video sequence) by examining every sub-window of that frame and measuring a cost function between the reference object and each sub-window; this approach is inefficient and cannot achieve real-time tracking. The deep learning method for similarity matching instead uses twin convolutional networks that produce feature maps, which are combined by a correlation layer into a score map pointing to high-similarity areas. This study examined and developed object tracking algorithms to track objects of interest in the human liver using a correlation filter-based neural network (CFNet). The dataset used was CLUST-2D, provided by the Swiss Federal Institute of Technology in Zürich (ETH).
It contains approximately 96 ultrasound sequences of the liver from different patients. Three versions of the CFNet network were tested in this study. First, the baseline CFNet was trained; it struggled to track objects under significant displacements and deformations. To address this limitation, a second version, Advanced-CFNet, was developed; it is the main contribution of this study. This version incorporates dynamic template update and motion prediction modules, which improve object tracking by preventing tracker drift and keeping the template from being polluted by inappropriate appearances of the tracked object. The third version, Kalman-CFNet, uses a linear Kalman filter to estimate an object's motion and enhance robustness against unexpected motions. The comparative analysis demonstrated the superiority of Advanced-CFNet, which achieved lower root mean square error (RMSE) values than the other methods, particularly in challenging scenarios. These findings highlight the effectiveness of Advanced-CFNet for object tracking in liver ultrasound imaging.

Item Restricted: Deep Learning for Detecting and Classifying the Growth Stages of Weeds on Fields (ProQuest, 2023). Almalky, Abeer M.; Ahmed, Khaled R.

Due to the current and anticipated massive increase of the world population, expanding the agriculture cycle is necessary to accommodate expected human demand. However, weed invasion, a detrimental factor for agricultural production and quality, is a challenge for such agricultural expansion. Controlling weeds in fields therefore requires an accurate, automatic, low-cost, environmentally friendly, real-time weed detection technique.
Additionally, automating the detection, classification, and counting of weeds by growth stage is vital for applying appropriate weed-control techniques. The literature review shows a gap in research efforts that automate the classification of weed growth stages using deep learning (DL) models. Accordingly, in this thesis, a dataset of four growth stages of the weed Consolida regalis was collected using an unmanned aerial vehicle. In addition, we developed and trained one-stage and two-stage deep learning models: YOLOv5 and RetinaNet (with ResNet-101-FPN and ResNet-50-FPN backbones), and Faster R-CNN (with ResNet-101-DC5, ResNet-101-FPN, and ResNet-50-FPN backbones), respectively. Comparing the results of all trained models, we concluded that, on the one hand, the YOLOv5-small model detects weeds and classifies their growth stages in real time with the shortest inference time and the highest recall (0.794), and counts weed instances across the four growth stages in real time with a counting time of 0.033 milliseconds per frame. On the other hand, RetinaNet with a ResNet-101-FPN backbone gives accurate and precise results in the testing phase (average precision of 87.457). Although the YOLOv5-large model showed the highest precision in classifying almost all growth stages during training, it could not detect all objects in the test images. Overall, RetinaNet with a ResNet-101-FPN backbone offers accurate, high-precision results, while YOLOv5-small has the shortest real inference time for detection and growth-stage classification. Farmers can use the resulting deep learning model to detect, classify, and count weeds per growth stage automatically, reducing not only time and labor costs but also the use of chemicals to control weeds in fields.
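The per-growth-stage counting step could be sketched as follows (the stage names and confidence threshold are illustrative assumptions; in practice the (class, confidence) pairs would be parsed from a trained detector's output):

```python
from collections import Counter

# Illustrative class-id -> growth-stage mapping; the thesis's actual
# four stage labels for Consolida regalis may differ.
GROWTH_STAGES = {0: "cotyledon", 1: "two-leaf", 2: "four-leaf", 3: "flowering"}

def count_per_stage(detections, min_conf=0.25):
    """Count detected weed instances per growth stage.

    detections: iterable of (class_id, confidence) pairs for one frame.
    Detections below the confidence threshold are ignored.
    """
    counts = Counter()
    for class_id, conf in detections:
        if conf >= min_conf:
            counts[GROWTH_STAGES.get(class_id, "unknown")] += 1
    return dict(counts)
```

Aggregating these per-frame counts over a field survey yields the per-stage totals a farmer would act on.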