SACM - United States of America

Permanent URI for this collection: https://drepo.sdl.edu.sa/handle/20.500.14154/9668

Search Results

Restricted
AI-Enabled Autonomous Knowledge Extraction from Large-Scale Textual Data
(Saudi Digital Library, 2026) Alharbi, Abdulrahman; Obradovic, Zoran
The rapid growth of large-scale textual data across social media platforms, news media, and scientific repositories presents both unprecedented opportunities and significant challenges for extracting meaningful insights. During global events such as the COVID-19 pandemic, understanding public discourse requires analyzing vast amounts of noisy, heterogeneous, dynamic, and geographically distributed data. At the same time, the exponential increase in scientific publications has made traditional evidence synthesis methods increasingly labor-intensive, time-consuming, and difficult to scale. Existing approaches to textual knowledge extraction often operate in isolation, lack interpretability, fail to integrate heterogeneous data sources, and do not support scalable end-to-end automation. This dissertation addresses these limitations by proposing a unified framework for AI-enabled autonomous knowledge extraction from large-scale textual data. The research introduces a comprehensive pipeline that integrates sentiment analysis, topic modeling, semantic interpretation, spatiotemporal reasoning, and multi-agent automation for scalable, robust, and reproducible text analysis across heterogeneous domains. First, this work introduces TriLex, a novel unsupervised sentiment analysis framework that combines multiple lexicon-based sentiment analysis methods through weighted aggregation, majority voting, and a dynamic thresholding technique to improve robustness and accuracy for short and noisy textual data. Building on this foundation, a hierarchical spatiotemporal framework is developed to capture the evolution of public sentiment across global, national, and regional scales. The framework integrates over 7 million social media posts and thousands of news articles to analyze COVID-19 vaccine discourse across time, geographic regions, and platforms.
To enhance topic interpretability, this research integrates BERTopic with large language models (LLMs), enabling automated generation of coherent and context-aware topic representations for large-scale textual discourse. A cross-platform analytical framework is further introduced to examine temporal relationships between social media and news media discourse, demonstrating a bidirectional relationship in which news coverage and public discourse influence each other over time. Extending beyond discourse analysis, this dissertation introduces an Agentic AI framework that automates the end-to-end process of large-scale multilingual knowledge extraction and evidence synthesis. The proposed multi-agent system coordinates specialized agents for query generation, multilingual retrieval, metadata harmonization, title and abstract screening, and full-text analysis. Evaluated on a multilingual corpus of over 52,000 scientific records, the framework achieves high screening performance while substantially reducing processing time from months to hours, demonstrating significant improvements in scalability, robustness, and reproducibility. Collectively, this dissertation bridges the gap between analytical understanding and autonomous knowledge extraction from large-scale textual data. By integrating robust sentiment analysis, interpretable topic modeling, spatiotemporal discourse analysis, and autonomous AI systems within a unified framework, this work establishes a scalable and extensible paradigm for transforming heterogeneous textual data into actionable knowledge. The proposed methodologies are validated using real-world datasets spanning social media, news media, and scientific literature across diverse application domains.
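As an illustration of the kind of lexicon ensembling attributed to TriLex above, the following is a minimal sketch rather than the dissertation's implementation. It assumes three lexicon scorers have already produced polarity scores in [-1, 1]. The function names, the weights, and the neutral threshold are hypothetical, and a fixed threshold stands in for the dynamic thresholding the abstract mentions.

```python
from collections import Counter

def label(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a label (fixed threshold is an assumption)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

def ensemble_sentiment(scores, weights, threshold=0.05):
    """Plurality vote over per-lexicon labels; ties fall back to a weighted average score."""
    votes = Counter(label(s, threshold) for s in scores)
    ranked = votes.most_common()
    # A clear winner exists when the top count is not tied with the runner-up.
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    weighted = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return label(weighted, threshold)
```

When two of three scorers agree, the vote decides. When all three disagree, the weighted average breaks the tie, so the more trusted lexicon (higher weight) dominates.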
Restricted
Personalized Course Recommendations Leveraging Machine & Transfer Learning Toward Improved Student Outcomes
(Saudi Digital Library, 2026) Algarni, Shrooq; Frederick, Sheldon
At matriculation, university advising typically operates under tight informational constraints, often with no access to post-enrolment interaction history. We propose a unified, leakage-controlled pipeline that (i) predicts early dropout risk and (ii) generates cold-start programme recommendations using only pre-enrolment signals, with an optional early-warning variant that additionally incorporates first-term academic aggregates. The pipeline instantiates lightweight multimodal components: a tabular RNN, a DistilBERT encoder for short profile sentences, and a cross-attention fusion module, trained and evaluated end-to-end on a public benchmark (UCI id 697; n = 3630 students across 17 programmes). For dropout prediction, fusing text with numeric features yields the strongest thresholded performance (Hybrid RNN–DistilBERT: F1 ≈ 0.9161, MCC ≈ 0.7750), while simple ensembling modestly improves threshold-free discrimination (AUROC up to ≈ 0.9488 via Stacking Ensemble, compared to ≈ 0.9459 for Weighted Ensemble). A text-only branch performs substantially worse, indicating that numeric demographics and early curricular aggregates carry most of the predictive signal at this horizon. For programme recommendation, pre-enrolment demographics alone support actionable rankings (Demographic MLP: NDCG@10 ≈ 0.5793, Top-10 ≈ 0.9380), outperforming a popularity prior by roughly 25–27 percentage points in NDCG@10; adding text yields only marginal improvements in hit rate and does not improve NDCG on this cohort. Methodologically, we apply leakage guards, deterministic preprocessing, stratified splits, and comprehensive metric reporting to enable reproducibility on non-proprietary data. Practically, the pipeline supports orientation-time triage via high-recall early warning and shortlist generation for programme selection.
Overall, the results cast matriculation-time advising as a joint prediction–recommendation problem solvable with carefully engineered pre-enrolment views and lightweight multimodal models, without relying on historical interactions.
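For readers unfamiliar with the ranking metric reported above, NDCG@k can be computed as in the sketch below. It assumes exactly one relevant programme per student (binary relevance). That assumption is made for illustration only, since the abstract does not state how relevance was defined, and the function name is hypothetical.

```python
import math

def ndcg_at_k(ranked_items, relevant_item, k=10):
    """NDCG@k with a single relevant item: 1/log2(rank + 1) if it is in the top k, else 0."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == relevant_item:
            # The ideal DCG is 1.0 (relevant item at rank 1), so the DCG is already normalized.
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

Under this assumption a hit at rank 1 scores 1.0 and a hit at rank 3 scores 0.5, which is why a high Top-10 hit rate can coexist with a moderate NDCG@10.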
Restricted
Machine Learning Algorithms for Secure and Reliable Electric Grid Operations and Control
(Saudi Digital Library, 2026-09) Bahwal, Obai; Sankar, Lalitha
This dissertation develops Machine Learning (ML) algorithms for secure and reliable electric grid operations and control by addressing three related problems. The first part studies real-time event identification using synchrophasor measurements, physics-based modal decomposition, and interpretable classifiers to distinguish generation loss from load loss events. Targeted adversarial attacks are developed to evaluate robustness under both white box and gray box settings, showing that learned event identification models are susceptible to adversarial attacks and that simpler models such as logistic regression are generally more vulnerable than gradient boosting. The second part builds on this vulnerability analysis and focuses on enhancing the security of ML event identification models in a white box adversarial setting. Two mitigation strategies are developed: robust classification through iterative adversarial retraining, and a dual-classifier architecture that combines event classification with attack detection. Numerical results on the synthetic South Carolina 500-bus system show that while robust retraining provides only modest improvement, the dual classifier approach is highly effective, reducing successful undetected attacks to under 0.1%. The third part addresses reliable grid control through a forecast-integrated rolling-horizon Model Predictive Control (MPC) framework for net-demand balancing using Distributed Energy Resource Aggregators (DERAs). Each DERA is modeled as a generalized battery with state-of-charge, power, and ramping constraints, while Linear Regression (LR) and Long Short-Term Memory (LSTM) forecasting models are integrated with MPC to generate real-time allocation policies.
Using high-resolution California Independent System Operator (CAISO) net-demand data, results show close tracking of net-demand and reveal clear tradeoffs among forecast horizon, update frequency, and control performance, with LSTM generally benefiting longer time-shifts and LR remaining competitive for shorter update intervals. These three parts show that effective ML for power systems must be accurate, physically grounded, cyber-resilient, and compatible with real-time operational constraints.
Restricted
MALWARE CLASSIFICATION VIA BYTECODE VISUALIZATION AND MULTIMODAL DEEP LEARNING
(Saudi Digital Library, 2026) مكاوي, صالح; Kenneth, Barner; Michael De. Lucia
The rapid proliferation of Android malware poses a critical threat to mobile security, driven by the open-source nature of the Android ecosystem, broad access to its Software Development Kit (SDK), and the availability of multiple app distribution channels. Traditional detection methods, including signature-based, static, and dynamic analysis, often fail against novel or obfuscated variants that use encryption, packing, and polymorphism to evade detection. This dissertation addresses these limitations through three progressive, interconnected contributions. Contribution I introduces a semantic bytecode-to-image encoding method based on Shannon entropy, computed over sliding windows of Dalvik Executable (DEX) bytecode and mapped to the red and blue channels of an RGB image. This encoding directly exposes obfuscation artifacts in encrypted, packed, and compressed code regions that blind color-mapping approaches cannot distinguish. Evaluated on 184,474 samples from AndroZoo for binary malware classification, the entropy encoder outperforms the MalNet and Classbyte baselines across five state-of-the-art CNN architectures, achieving up to 95.77% accuracy and 98.25% ROC-AUC. Contribution II extends this encoding into a full multiclass framework and dataset, MalVis, by incorporating byte-level N-gram frequency statistics into the green channel. By combining entropy and N-gram signals, MalVis images capture both low-level randomness and high-level structural code patterns. To support the research community, we release the MalVis dataset, the largest publicly available Android malware visualization resource, comprising over 1.3 million labeled RGB images spanning nine malware families and benign samples. To validate that CNN models rely on the proposed encodings rather than spurious image artifacts, we apply Grad-CAM and Grad-CAM++ attention mapping, confirming that model attention consistently aligns with the entropy- and N-gram-encoded channels across all malware classes. 
Contribution III introduces ViCoMal, a late-fusion multimodal framework that addresses the single point of failure (SPOF) inherent in unimodal approaches. ViCoMal pairs MalVis image-based malware representations with 65 DEX-based contextual features, including 19 novel engineered features consisting of normalized risk scores, binary capability indicators, and rule-based behavioral profiles extracted using static analysis. Five CNN architectures and seven classical machine learning models are trained independently per modality, and their class-probability outputs are combined through eight fusion strategies. To handle a severe 169:1 class imbalance, SMOTE oversampling and class-weighted training are applied jointly. Evaluated under five-fold stratified cross-validation on the full 1.3M ViCoMal dataset, the best ensemble, which averages the predictions of the top three models (Random Forest, ResNet50, and XGBoost), achieves 91.84% accuracy, outperforming the strongest image-only baseline by 3.43 percentage points and the strongest contextual-only baseline by 2.05 percentage points. Together, these contributions establish a scalable, interpretable, and high-performing pipeline for Android malware detection, advancing the state of the art in both malware visualization and multimodal security analysis.
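The entropy encoding described in Contribution I can be sketched as follows. This is an illustrative approximation, not the MalVis implementation: the window size, the step, the linear scaling to 0-255, and the function names are all assumptions, since the abstract states only that Shannon entropy is computed over sliding windows of DEX bytecode and mapped to image channels.

```python
import math
from collections import Counter

def shannon_entropy(window: bytes) -> float:
    """Shannon entropy in bits per byte, in the range [0, 8]."""
    if not window:
        return 0.0
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_channel(data: bytes, window_size: int = 256, step: int = 256):
    """Slide a window over the bytes and scale each entropy value to a 0-255 intensity."""
    pixels = []
    # Inputs shorter than one window still produce a single (partial) window.
    for start in range(0, max(len(data) - window_size + 1, 1), step):
        h = shannon_entropy(data[start:start + window_size])
        pixels.append(round(h / 8.0 * 255))
    return pixels
```

Encrypted or packed regions have near-uniform byte distributions and so map to intensities near 255, while padding or repetitive code maps to values near 0. This is the contrast that a fixed byte-to-color mapping does not expose.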
Restricted
ENHANCING TRAFFIC SAFETY THROUGH AI-DRIVEN, PRIVACY-PRESERVING, AND SECURE IMPAIRED DRIVING DETECTION SYSTEMS
(Saudi Digital Library, 2026) Alsulieman, Razan; Sherif, Ahmed
Drunk driving remains a major threat to road safety worldwide, contributing significantly to traffic injuries and fatalities each year. Traditional detection approaches are largely reactive and vehicle-centric, relying on in-vehicle sensors, breathalyzers, or post-incident enforcement. These methods often depend on driver cooperation, intrusive hardware installations, or limited monitoring environments, restricting their scalability and effectiveness in large transportation systems. At the same time, modern cities increasingly deploy roadside cameras, surveillance networks, and drone-based monitoring systems, creating new opportunities for proactive intoxication detection at the infrastructure level. However, leveraging such external monitoring introduces challenges related to secure data collection, reliable AI-based analysis, privacy protection, and real-world deployment. This dissertation proposes a secure, privacy-preserving Artificial Intelligence framework for proactive drunk driving detection using out-of-vehicle surveillance data. The framework addresses three key aspects required for reliable infrastructure-level monitoring. First, a lightweight authentication scheme is developed to ensure secure data collection from distributed monitoring platforms such as drones and surveillance devices. The proposed design employs physically unclonable functions and symmetric cryptographic primitives to provide protection against impersonation, replay attacks, and device cloning while maintaining low computational overhead for resource-constrained environments. Second, AI-based intoxication detection models are developed using Machine Learning and Deep Learning techniques to analyze facial imagery captured under real-world surveillance conditions. Extensive experiments evaluate multiple models under varying noise and disruption scenarios to ensure robustness across both low- and high-resource computational environments.
The framework also incorporates explainable AI methods to improve transparency and verify that model decisions rely on meaningful facial features. Finally, the framework integrates privacy-preserving learning mechanisms through federated learning, enabling distributed model training without transferring sensitive facial images to centralized servers. This approach protects user privacy while maintaining strong detection performance across distributed monitoring nodes. These contributions establish a secure, scalable, and privacy-aware infrastructure-level system for proactive intoxication detection, supporting intelligent transportation systems aimed at improving traffic safety.
Embargo
Understanding Ransomware and Enhancing Their Detection Using Machine Learning
(Saudi Digital Library, 2026) Alzahrani, Saleh; Xiao, Yang
Ransomware attacks have escalated significantly in recent years, causing substantial financial losses and operational disruptions to individuals, organizations, and critical infrastructure worldwide. According to the Chainalysis 2024 Crypto Crime Report, ransomware attacks have imposed increasing financial burdens on victims over recent years. The total value received by ransomware attackers reached $1.1 billion in 2023, a significant rise from $567 million in 2022 and from $220 million in 2019. This trend highlights the evolving threat posed by ransomware as attackers continue to refine their methods. Despite the proliferation of detection methods, contemporary ransomware continues to evade traditional security measures through increasingly sophisticated evasion techniques. This dissertation addresses critical gaps in ransomware detection research through an investigation that combines in-depth malware analysis, evolutionary tracking, systematic literature review, novel detection methodology, and dataset development. The research begins with a detailed examination of Conti ransomware, one of the most notorious Ransomware-as-a-Service operations that caused approximately $45 million in damages and significantly impacted healthcare systems. Through analysis of leaked source code and controlled environment testing, this study reveals advanced evasion mechanisms including API disguise techniques, anti-hook mechanisms, and multithreaded encryption for rapid file encryption. Building upon this foundation, the research tracks Conti's evolution from its beta version through multiple iterations, categorizing samples into seven distinct versions. This longitudinal analysis demonstrates that modern ransomware success stems from continuous development and delivery practices, with features such as API hashing and runtime API loading being progressively integrated over time.
To contextualize these findings within the broader detection landscape, a survey of existing ransomware detection methods was conducted, examining both machine learning and non-machine learning approaches alongside available datasets. This survey identifies critical limitations in current research, specifically that non-machine learning methods fail to identify new samples from known variants, while machine learning approaches suffer from inadequate model design and the absence of comprehensive, standardized datasets. These deficiencies severely limit their effectiveness against emerging ransomware variants. Addressing these identified gaps, this dissertation introduces RansomFormer, a Transformer-based detection model that leverages cross-attention mechanisms to fuse Portable Executable byte data with Application Programming Interface information, including both static imports and dynamic sequence calls. Unlike existing single-feature approaches that ransomware developers can circumvent, RansomFormer's multi-modal architecture achieves exceptional accuracy of 99.25% on static datasets and 99.50% on combined static-dynamic datasets across more than 150 ransomware families. Furthermore, recognizing the fundamental need for comprehensive training data, this dissertation presents RanDS, a rigorously curated dataset comprising a large collection of ransomware samples spanning hundreds of families alongside a substantial set of benign samples, collected and verified over multiple years from an initial corpus of millions of malware files. RanDS includes several processed feature extraction datasets encompassing static raw strings, English strings, imported and exported APIs, demangled APIs, and dynamic behavioral activities, all made publicly available. 
This dissertation makes contributions to cybersecurity by providing deep insights into modern ransomware operations, demonstrating the importance of evolutionary analysis in understanding threat progression, and delivering both a detection methodology and a foundational dataset that address longstanding research limitations in the field.
Restricted
INTELLIGENT ROBOTICS WITH DIGITAL-TWIN ALIGNMENT: SEMANTIC NAVIGATION, MANIPULATION, PLANNING, AND HUMAN-TO-ROBOT ACTION TRANSFORMATION
(Saudi Digital Library, 2025) Alanazi, Ahmed Hamdan; Lee, Yugyung
This dissertation advances AI-empowered indoor robotics through four interconnected contributions that unify navigation, manipulation, semantic planning, and human-to-robot action transformation within a digital-twin-aligned framework. GRIP, a grid-aware semantic navigation module, integrates symbolic scene understanding with hybrid search-and-policy execution to achieve robust and context-aware ObjectNav. PathFormer, a transformer-based manipulation model structured around a 3D spatial–semantic grid, generates smooth, interpretable, and physically consistent trajectories that remain tightly aligned with digital-twin simulation. KG-Transformer, a knowledge-guided semantic planner, leverages a lightweight digital twin to calibrate execution, veto unsafe behaviors, and autonomously repair failing plans across diverse indoor environments. ActionFormer, an action-generation transformer, introduces a unified imitation-learning pipeline that integrates human-activity recognition, human-motion generation, and robot-motion generation. ActionFormer supports more than twenty complex human activities, producing robot-ready demonstrations that generalize across platforms and enable end-to-end imitation learning from video and landmark sequences. Collectively, these contributions establish a coherent foundation for AI-empowered robotics grounded in digital-twin intelligence. Across benchmarks and real-world deployments, GRIP yields up to 9.6% higher success rate and more than 2× gains in path efficiency (SPL, SAE). PathFormer produces digitally consistent manipulation trajectories validated through robust sim-to-real transfer. KG-Transformer achieves 99.6% executability, delivers a +4.6-point improvement on unseen-scene tasks, and eliminates safety violations in both simulated and multi-robot execution.
ActionFormer attains state-of-the-art performance in human-activity recognition and high execution accuracy across more than 20 activities, generating realistic human-motion traces and corresponding robot-motion trajectories for embodied robotic demonstration. Together, these advances deliver a trustworthy, semantically aligned, and high-performance simulation-to-reality pipeline that significantly enhances the adaptability, reliability, and real-world readiness of autonomous indoor robotic systems.
Restricted
Sensing, Scheduling, and Learning for Resource-Constrained Edge Systems
(Saudi Digital Library, 2025) Bukhari, Abdulrahman; Kim, Hyoseung
Recent advances in Internet of Things (IoT) technologies have sparked significant interest in developing learning-based sensing applications on embedded edge devices. These efforts, however, are challenged by adapting to unforeseen conditions in open-world environments and by the practical limitations of low-cost sensors in the field. This dissertation presents the design, implementation, and evaluation of resource-constrained edge systems that address these challenges through time-series sensing, scheduling, and classification. First, we present OpenSense, an open-world time-series sensing framework for performing inference and incremental classification on an embedded edge device, eliminating reliance on powerful cloud servers. To create time for on-device updates without missing events and to reduce sensing and communication overhead, we introduce two dynamic sensor-scheduling techniques: (i) a class-level period assignment scheduler that selects an appropriate sensing period for each inferred class and (ii) a Q-learning–based scheduler that learns event patterns to choose the sensing interval at each classification moment. Experimental results show that OpenSense incrementally adapts to unforeseen conditions and schedules effectively on a resource-constrained device. Second, to bridge the gap between theoretical potential and field practice for low-cost sensors, we present a comprehensive evaluation of a sensing and classification system for early stress and disease detection in avocado plants. The greenhouse deployment spans 72 plants in four treatment categories over six months. For leaves, spectral reflectance coupled with multivariate analysis and permutation testing yields statistically significant results and reliable inference. For soils, we develop a two-level hierarchical classification approach tailored to treatment characteristics that achieves 75–86% accuracy across avocado genotypes and outperforms conventional approaches by over 20%.
Embedded evaluations on Raspberry Pi and Jetson report end-to-end latency, computation, memory usage, and power consumption, demonstrating practical feasibility. In summary, the contributions are a generalized framework for dynamic, open-world learning on edge devices and an application-specific system for robust classification in noisy field deployments. These real-world deployments collectively outline a practical framework for designing intelligent, cloud-independent edge systems from sensing to inference.
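The Q-learning–based scheduler mentioned above can be illustrated with a standard tabular Q-learning update. This is a generic sketch, not OpenSense's scheduler: it assumes the state is the most recently inferred class and the actions are candidate sensing intervals in seconds, and the reward, learning rate, discount factor, and function names are hypothetical.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]

def choose_interval(q, state):
    """Greedily pick the sensing interval with the highest learned value for this state."""
    return max(q[state], key=q[state].get)

# Hypothetical table: states are inferred classes, actions are intervals in seconds.
q_table = {
    "idle": {1: 0.0, 10: 0.0},
    "event": {1: 0.0, 10: 0.0},
}
q_update(q_table, "idle", 10, reward=1.0, next_state="idle")
```

In such a design the reward would typically trade off energy saved by longer intervals against events missed, so the table learns to sense sparsely in quiet states and densely around events.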
Restricted
Sensing, Scheduling, and Learning for Resource-Constrained Edge Systems
(Saudi Digital Library, 2025) Bukhari, Abdulrahman Ismail Ibrahim; Kim, Hyoseung
Recent advances in Internet of Things (IoT) technologies have sparked significant interest in developing learning-based sensing applications on embedded edge devices. These efforts, however, are challenged by adapting to unforeseen conditions in open-world environments and by the practical limitations of low-cost sensors in the field. This dissertation presents the design, implementation, and evaluation of resource-constrained edge systems that address these challenges through time-series sensing, scheduling, and classification. First, we present OpenSense, an open-world time-series sensing framework for performing inference and incremental classification on an embedded edge device, eliminating reliance on powerful cloud servers. To create time for on-device updates without missing events and to reduce sensing and communication overhead, we introduce two dynamic sensor-scheduling techniques: (i) a class-level period assignment scheduler that selects an appropriate sensing period for each inferred class and (ii) a Q-learning–based scheduler that learns event patterns to choose the sensing interval at each classification moment. Experimental results show that OpenSense incrementally adapts to unforeseen conditions and schedules effectively on a resource-constrained device. Second, to bridge the gap between theoretical potential and field practice for low-cost sensors, we present a comprehensive evaluation of a sensing and classification system for early stress and disease detection in avocado plants. The greenhouse deployment spans 72 plants in four treatment categories over six months. For leaves, spectral reflectance coupled with multivariate analysis and permutation testing yields statistically significant results and reliable inference. For soils, we develop a two-level hierarchical classification approach tailored to treatment characteristics that achieves 75–86% accuracy across avocado genotypes and outperforms conventional approaches by over 20%.
Embedded evaluations on Raspberry Pi and Jetson report end-to-end latency, computation, memory usage, and power consumption, demonstrating practical feasibility. In summary, the contributions are a generalized framework for dynamic, open-world learning on edge devices and an application-specific system for robust classification in noisy field deployments. These real-world deployments collectively outline a practical framework for designing intelligent, cloud-independent edge systems from sensing to inference.
Restricted
EXPERIMENTAL STUDY OF THE IMPORTANCE OF DATA FOR MACHINE LEARNING-BASED BREAST CANCER OUTCOME PREDICTION
(Saudi Digital Library, 2024) Yamani, Wid; Wojtusiak, Janusz
Wid Yamani, Ph.D., George Mason University, 2025. Dissertation Director: Dr. Janusz Wojtusiak. Researchers have used various large-scale datasets to develop and validate predictive models in breast cancer outcome prediction. However, a notable gap exists due to the lack of a systematic comparison among these datasets regarding predictive performance, feature availability, and suitability for different analytical objectives. While each dataset has unique strengths and limitations, no comprehensive studies evaluate how these differences impact model performance, particularly across diverse timeframes, survival, and recurrence outcomes. This gap limits researchers in making informed choices about the most appropriate dataset for specific research questions. Effective modeling and prediction of breast cancer outcomes (such as cancer survival and recurrence) rely on the dataset's quality, the pre-processing techniques used to clean and transform data, and the choice of predictive models. Therefore, selecting a suitable dataset and identifying relevant variables are as crucial as the choice of the model itself. This dissertation addresses this gap by systematically comparing five prominent datasets (SEER Research 8, SEER Research 17, SEER Research Plus, SEER-Medicare, and Medicare Claims data), focusing on breast cancer survival and recurrence. It evaluates the predictive performance of each dataset using supervised machine learning methods, including logistic regression, random forest, and gradient boosting. The models were tested on metrics such as AUC, accuracy, recall, and precision, with gradient boosting delivering the most accurate results.
The findings indicate that SEER-Medicare, which integrates cancer registry data with three years of retrospective claims, outperformed the other datasets, achieving AUCs of 0.891 for 5-year survival and 0.942 for 10-year survival. This dataset's inclusion of comprehensive health information, including pre-existing conditions and other claims data, makes it particularly valuable for outcome prediction. However, a drawback of SEER-Medicare is that it primarily includes patients aged 65 and older, as it is based on Medicare data. This limitation reduces its suitability for predicting outcomes in younger breast cancer patients, a significant subgroup with distinct risk factors and treatment responses. SEER Research Plus ranked second, offering data on patient demographics, breast cancer characteristics, staging, outcomes, and treatment, with AUC values of 0.877, 0.901, and 0.937 for 5-year, 10-year, and 15-year survival, respectively. SEER Research 17 and SEER Research 8 include patient demographics, breast cancer characteristics, and staging information but lack treatment details. SEER Research 17, which covers a larger population with more variables, yielded AUC values of 0.870 for 5-year survival, 0.897 for 10-year survival, and 0.920 for 15-year survival. SEER Research 8, which covers a smaller population over a more extended period, yielded slightly lower AUC values of 0.857, 0.868, and 0.880 for 5-year, 10-year, and 15-year survival, respectively. Results indicate that including treatment and additional variables significantly enhances prediction accuracy, while data size is less critical. This dissertation is the first study to compare these SEER datasets comprehensively, providing crucial insights into how data characteristics influence breast cancer outcome modeling.