Saudi Cultural Missions Theses & Dissertations

Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10

Browse

Search Results

Now showing 1 - 8 of 8
  • ItemRestricted
    Metadata-Centric Cybersecurity Classification: A Fair Benchmark of LLMs and Classical Models
    (Saudi Digital Library, 2025) Binothman, Elyas; Chaudhry, Umair Bilal
    Cybersecurity breach classification supports triage and risk response but is hindered by heterogeneous reporting, class imbalance, and limited semantic coverage in traditional pipelines. Prior work has relied on rule-based heuristics and classical models (SVM, Random Forest) with heavy feature engineering, while recent LLM studies rarely evaluate breach metadata under identical, fair splits; severity labels are often absent or not reproducibly constructed. We present a metadata-centric benchmark on the Privacy Rights Clearinghouse chronology spanning two tasks: breach-type classification and severity tiering with three and five labels, with severity derived reproducibly from native fields using a Breach Level Index–style mapping. All models share one preprocessing recipe and a single stratified 80/20 train–test split. We compare parameter-efficient transformers (DistilBERT and T5 with LoRA) against tuned tabular baselines (Linear SVM, Random Forest, compact ANN). On breach type, DistilBERT achieves the strongest results (Accuracy 0.943; Macro–F1 0.840), surpassing tabular baselines. For severity, a class-weighted ANN on TF–IDF and categorical features attains the highest Macro–F1 at both granularities, while T5 shows high accuracy but low Macro–F1, indicating majority-class bias. The study contributes a unified PRC schema with transparent severity construction, a fair head-to-head comparison under identical conditions, and an efficiency-oriented training recipe suitable for modest hardware.
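The shared recipe in this abstract — one stratified 80/20 train–test split plus class-weighted training against imbalance — can be sketched in pure Python. This is a minimal illustration of the two ideas, not the authors' code; the function names are assumptions.

```python
from collections import Counter, defaultdict

def stratified_split(labels, test_frac=0.2):
    """Split example indices so each class contributes ~test_frac of its items to the test set."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        n_test = max(1, round(len(idxs) * test_frac))  # keep at least one test item per class
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

def class_weights(labels):
    """Inverse-frequency weights: rare classes get larger loss weights during training."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {lab: n / (k * c) for lab, c in counts.items()}
```

With an 8:2 imbalanced label set, every class still appears in the test split and the minority class receives a proportionally larger weight.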
  • ItemRestricted
    Cross-Lingual Transfer Learning for Arabic Sentiment Analysis
    (Saudi Digital Library, 2025) Bin Owayn, Najd Mohammed; Lauria, Stasha
    This dissertation presents a comprehensive investigation into the efficacy of cross-lingual transfer learning for Arabic sentiment analysis within low-resource contexts. The study rigorously compares the performance of a multilingual transformer model, XLM-RoBERTa (XLM-R), against a monolingual Arabic-specific model, CAMeLBERT, under varying data availability conditions, specifically zero-shot and few-shot learning paradigms. The primary objective is to identify the most effective and efficient modeling approach for accurate sentiment analysis when only limited Arabic training data is accessible. The research addresses the inherent challenges of Arabic sentiment analysis, including its complex morphology, pervasive dialectal variations, and the scarcity of large, annotated datasets. Utilizing a publicly available Arabic Company Reviews Dataset, the study systematically evaluates model performance across incrementally increasing amounts of labeled data: zero-shot application, and fine-tuning with 100, 500, and 1000 samples. This controlled experimental design allows for a direct, data-driven comparison of the models' efficiency and effectiveness. Key findings demonstrate that XLM-R exhibits remarkable zero-shot capabilities, achieving an accuracy of 0.829 and an Area Under the Curve (AUC) of 0.921 even without any direct fine-tuning on Arabic sentiment data. This underscores the power of large-scale multilingual pre-training in fostering language-agnostic sentiment understanding. With the introduction of limited Arabic labeled data, XLM-R's performance further improved, reaching an accuracy of 0.886 and an AUC of 0.942 with 1000 samples. The most substantial performance gains for XLM-R were observed during the initial stages of few-shot fine-tuning, highlighting its high data efficiency. 
In contrast, CAMeLBERT, designed as a monolingual Arabic model, showed poor zero-shot performance (accuracy 0.275, AUC 0.522), as anticipated due to its specialisation in Arabic linguistic structures rather than cross-lingual transfer. However, CAMeLBERT demonstrated exceptional adaptability and rapid improvement with few-shot fine-tuning. With a mere 100 labeled Arabic samples, its accuracy dramatically surged to 0.814 and AUC to 0.913. Its performance continued to improve, eventually approaching XLM-R's levels at 1000 samples (accuracy 0.868, AUC 0.936). This indicates that while monolingual models necessitate some target-language data to become effective, they can quickly leverage their deep linguistic understanding of Arabic to achieve competitive results. Learning curve analysis revealed that for both models, the most significant performance improvements occurred between the zero-shot and 100-sample conditions, with diminishing returns observed as the training data size increased further. This finding is crucial for practitioners, suggesting that a relatively small investment in data annotation can yield substantial performance gains, while further extensive annotation may offer only marginal improvements. In conclusion, this dissertation provides a data-driven cost-benefit analysis for practitioners navigating Arabic sentiment analysis in resource-constrained environments. It demonstrates that while monolingual models like CAMeLBERT can achieve competitive performance with modest amounts of labeled Arabic data, multilingual models like XLM-R offer a superior starting point with strong zero-shot capabilities and maintain a statistically significant edge even with limited fine-tuning data. This research contributes to a more nuanced understanding of the practical utility of cross-lingual transfer learning, advocating for its strategic adoption in scenarios where extensive Arabic data annotation is not feasible. 
Future work includes investigating domain-specific pre-training, exploring advanced few-shot learning techniques, and incorporating explicit dialectal Arabic analysis.
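The learning-curve finding above — large gains between zero-shot and 100 samples, diminishing returns after — can be made concrete with the CAMeLBERT accuracies reported in the abstract. The helper below is illustrative, not from the dissertation (the 500-sample point is omitted because it is not stated here).

```python
def marginal_gains(points):
    """points: (n_labeled, accuracy) pairs in increasing n.
    Returns the accuracy gained at each step; shrinking values indicate diminishing returns."""
    return [(n2, round(a2 - a1, 3))
            for (n1, a1), (n2, a2) in zip(points, points[1:])]

# CAMeLBERT accuracies as reported in the abstract
camelbert = [(0, 0.275), (100, 0.814), (1000, 0.868)]
```

Here the first 100 labels buy +0.539 accuracy, while the next 900 buy only +0.054 — the cost–benefit asymmetry the dissertation highlights.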
  • ItemRestricted
    ADVANCED LARGE LANGUAGE MODEL APPROACHES AND NATURAL LANGUAGE PROCESSING TECHNIQUES TO IMPROVE HATE SPEECH DETECTION USING AI
    (University of Central Florida, 2025) Almohaimeed, Saad; Boloni, Ladislau
    The proliferation of hate speech on social networks can create a significant negative social effect, making the development of AI-based classifiers that can identify and characterize different types of hateful speech in messages highly important for stakeholders. While this is a highly challenging problem, recent advances in language models promise to advance the state of the art such that even subtle and indirect forms of hate speech can be detected. In this dissertation we present a series of contributions that improve different aspects of hate speech classification. We developed THOS, a hate speech dataset consisting of 8.3k tweets. Compared to previous datasets, THOS contains fine-grained labels that identify not only whether a tweet is offensive or hateful, but also the target of the hate. Using this dataset, we studied the degree to which finer grained classification of tweets is feasible. In the follow-up work, we focus on the difficult problem of implicit hate speech, where hate is conveyed through subtle verbal constructs and allusions, without the use of explicitly offensive terms. We evaluate the efficacy of lexicon-based methods, transfer learning, and advanced LLMs such as GPT-4 on this problem. We found that the proposed techniques can boost the detection performance of implicit hate, although even advanced models often struggle with certain interpretations. In our third contribution, we introduce the Closest Positive Cluster (CPC) auxiliary loss, which improves the generalizability of classifiers across a wide range of datasets, resulting in enhanced performance for both explicit and implicit hate speech scenarios. Finally, given the scarcity of implicit hate speech datasets and the abundance of explicit hate datasets, we proposed an approach to generalize explicit hate datasets for the classification of implicit hate speech. Additionally, the proposed approach addresses noisy label correction issues commonly found in crowd-sourced datasets. 
Our method comprises three key components: influential sample identification, reannotation, and augmentation. We show that the approach improves generalization across datasets and enhances implicit hate classification.
  • ItemRestricted
    PyZoBot: A Platform for Conversational Information Extraction and Synthesis from Curated Zotero Reference Libraries through Advanced Retrieval-Augmented Generation
    (Virginia Commonwealth University, 2025) Alshammari, Suad; Wijesinghe, Dayanjan
    This work presents a systematic evaluation of PyZoBot, an AI-powered platform for literature-based question answering, using the Retrieval-Augmented Generation Assessment Scores (RAGAS) framework. The study focuses on a subset of 49 cardiology-related questions extracted from the BioASQ benchmark dataset. PyZoBot's performance was assessed across 32 configurations, including standard Retrieval-Augmented Generation (RAG) and GraphRAG pipelines, implemented with both OpenAI-based models (GPT-3.5-Turbo, GPT-4o) and open-source models (LLaMA 3.1, Mistral). To establish a comparative benchmark, responses generated by PyZoBot were evaluated alongside answers manually written by six PhD students and recent graduates from the pharmacotherapy field, using a curated Zotero library containing BioASQ-referenced documents. The evaluation applied four key RAGAS metrics—faithfulness, answer relevancy, context recall, and context precision—along with a composite harmonic score to determine overall performance. The findings reveal that 22 PyZoBot configurations surpassed the highest-performing human participant, with the top pipeline (GPT-3.5-Turbo + layout-aware chunking, k=10) achieving a harmonic RAGAS score of 0.6944. Statistical analysis using Kruskal-Wallis and Dunn’s post hoc tests confirmed significant differences across all metrics, especially in faithfulness and time efficiency. These results validate PyZoBot’s ability to support high-quality biomedical information synthesis and demonstrate the system’s potential to meet or exceed human performance in complex, evidence-based academic tasks.
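The composite harmonic score mentioned in this abstract is a harmonic mean over the four RAGAS metrics, which penalizes a configuration whose scores are uneven. A minimal sketch (the function name is an assumption, not PyZoBot's API):

```python
def harmonic_score(metrics):
    """Harmonic mean of metric values in (0, 1]; a single weak metric drags the score down."""
    vals = list(metrics.values())
    if any(v <= 0 for v in vals):
        return 0.0  # harmonic mean is undefined/degenerate at zero
    return len(vals) / sum(1.0 / v for v in vals)
```

Unlike the arithmetic mean, this score cannot be rescued by one strong metric: a pipeline with faithfulness 0.9 but answer relevancy 0.4 scores well below 0.65.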
  • ItemRestricted
    GPT-4 attempting to attack AI-text detectors
    (University of Adelaide, 2024-07-10) Alshehri, Nojoud; Lin, Yuhao
    Recent large language models (LLMs) generate machine content across a wide range of channels, including news, social media, and educational frameworks. The significant challenge of differentiating between AI-generated content and content written by humans has raised concerns about the potential misuse of LLMs. Academic integrity risks have become a growing concern due to the potential utilisation of these models in completing assignments and writing essays. Therefore, many detection tools have been developed to identify AI-generated and human-generated texts. The effectiveness of these tools against attack strategies and adversarial perturbations has not been adequately validated, specifically in the context of student essay writing. In this work, we aim to utilize the GPT-4 model to apply a series of perturbations to an essay generated originally by GPT-4 in order to confuse three AI detectors: GPTZero, DetectGPT, and ZeroGPT. The proposed attack technique produces a text as an adversarial sample used to examine the effect on the detection accuracy of AI detectors. The results demonstrate that utilizing GPT-4 to rephrase and apply perturbation at the sentence and word level is able to confuse the detection models and reduce their prediction probabilities. Moreover, the final essay, after applying the series of perturbations, maintains a reasonable amount of both writing quality and semantic similarity with the original GPT-generated essay. This project will provide insights for further improvements to increase the robustness of AI detectors and future AI-generated text classification studies.
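The word-level perturbation idea in this abstract can be illustrated with a crude pure-Python stand-in: swap individual words while leaving the surrounding text intact. The thesis drives the rewriting with GPT-4; the toy lexicon and function below are illustrative only.

```python
import re

# Toy synonym lexicon (illustrative only; the thesis uses GPT-4 for rephrasing)
SYNONYMS = {"demonstrate": "show", "significant": "notable", "utilize": "use"}

def word_swap(text, lexicon=SYNONYMS):
    """Replace whole words found in the lexicon, leaving punctuation and other tokens intact."""
    def repl(match):
        word = match.group(0)
        return lexicon.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", repl, text)
```

Even this trivial substitution changes the token statistics a detector sees while preserving the sentence's meaning, which is the core of the attack the thesis evaluates at scale.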
  • ItemRestricted
    Towards Numerical Reasoning in Machine Reading Comprehension
    (Imperial College London, 2024-02-01) Al-Negheimish, Hadeel; Russo, Alessandra; Madhyastha, Pranava
    Answering questions about a specific context often requires integrating multiple pieces of information and reasoning about them to arrive at the intended answer. Reasoning in natural language for machine reading comprehension (MRC) remains a significant challenge. In this thesis, we focus on numerical reasoning tasks. As opposed to current black-box approaches that provide little evidence of their reasoning process, we propose a novel approach that facilitates interpretable and verifiable reasoning by using Reasoning Templates for question decomposition. Our evaluations hinted at the existence of problematic behaviour in numerical reasoning models, underscoring the need for a better understanding of their capabilities. We conduct, as a second contribution of this thesis, a controlled study to assess how well current models understand questions and to what extent such models are basing their answers on textual evidence. Our findings indicate that applying transformations that obscure or destroy the syntactic and semantic properties of the questions does not change the output of the top-performing models. This behaviour reveals serious holes in how the models work. It calls into question evaluation paradigms that only use standard quantitative measures such as accuracy and F1 scores, as they lead to a false illusion of progress. To improve the reliability of numerical reasoning models in MRC, we propose and demonstrate, as our third contribution, the effectiveness of a solution to one of these fundamental problems: catastrophic insensitivity to word order. We do this by FORCED INVALIDATION: training the model to flag samples that cannot be reliably answered. We show it is highly effective at preserving word order importance in machine reading comprehension tasks and generalises well to other natural language understanding tasks. 
While our Reasoning Templates are competitive with the state-of-the-art on a single reasoning type, engineering them incurs a considerable overhead. Leveraging our better insights on natural language understanding and concurrent advancements in few-shot learning, we conduct a first investigation to overcome scalability limitations. Our fourth contribution combines large language models for question decomposition with symbolic rule learning for answer recomposition; with this approach, we surpass our previous results on Subtraction questions and generalise to more reasoning types.
  • ItemRestricted
    Creating Synthetic Data for Stance Detection Tasks using Large Language Models
    (Cardiff University, 2023-09-11) Alsemairi, Alhanouf; Manchego, Fernando Alva
    Stance detection is a natural language processing (NLP) task that analyses people’s stances (e.g. in favour, against or neutral) towards a specific topic. It is usually tackled using supervised classification approaches. However, collecting datasets with suitable human annotations is a resource-expensive process. The impressive capability of large language models (LLMs) in generating human-like text has revolutionized various NLP tasks. Therefore, in this dissertation, we investigate the capabilities of LLMs, specifically ChatGPT and Falcon, as a potential solution to create synthetic data that may address the data scarcity problem in stance detection tasks, and observe its impact on the performance of stance detection models. The study was conducted across various topics (e.g. Feminism, Covid-19) and two languages (English and Arabic). Different prompting approaches were employed to guide these LLMs in generating artificial data that is similar to real-world data. The results demonstrate a range of capabilities and limitations of LLMs for this use case. ChatGPT’s ethical guidelines affect its performance in simulating real-world tweets. Conversely, the open-source Falcon model’s performance in resembling the original data was better than ChatGPT’s; however, it could not create good Arabic tweets compared to ChatGPT. The study concludes that the current abilities of ChatGPT and Falcon are insufficient to generate diverse synthetic tweets. Thus, additional improvements are required to bridge the gap between synthesized and real-world data to enhance the performance of stance detection models.
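The prompting approaches described in this abstract amount to templating a generation request per topic, stance, and language. A minimal sketch of such a template builder; the wording and parameter names are assumptions, not the exact prompts used in the dissertation.

```python
def build_prompt(topic, stance, n=5, lang="English"):
    """Assemble a synthetic-data generation prompt for an LLM.
    Illustrative only: the dissertation's actual prompts are not reproduced here."""
    return (
        f"Write {n} realistic {lang} tweets that are {stance} the topic '{topic}'. "
        f"Vary tone, vocabulary and length; output one tweet per line."
    )
```

Such a template would be filled once per (topic, stance, language) cell and sent to ChatGPT or Falcon; the generated lines are then labeled with the requested stance for training the detection model.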
  • ItemRestricted
    Improving vulnerability description using natural language generation
    (Saudi Digital Library, 2023-10-25) Althebeiti, Hattan; Mohaisen, David
    Software plays an integral role in powering numerous everyday computing gadgets. As our reliance on software continues to grow, so does the prevalence of software vulnerabilities, with significant implications for organizations and users. As such, documenting vulnerabilities and tracking their development becomes crucial. Vulnerability databases addressed this issue by storing a record with various attributes for each discovered vulnerability. However, their contents suffer several drawbacks, which we address in our work. In this dissertation, we investigate the weaknesses associated with vulnerability descriptions in public repositories and alleviate such weaknesses through Natural Language Processing (NLP) approaches. The first contribution examines vulnerability descriptions in those databases and approaches to improve them. We propose a new automated method leveraging external sources to enrich the scope and context of a vulnerability description. Moreover, we exploit fine-tuned pretrained language models for normalizing the resulting description. The second contribution investigates the need for uniform and normalized structure in vulnerability descriptions. We address this need by breaking the description of a vulnerability into multiple constituents and developing a multi-task model to create a new uniform and normalized summary that maintains the necessary attributes of the vulnerability using the extracted features while ensuring a consistent vulnerability description. Our method proved effective in generating new summaries with the same structure across a collection of various vulnerability descriptions and types. Our final contribution investigates the feasibility of assigning the Common Weakness Enumeration (CWE) attribute to a vulnerability based on its description. CWE offers a comprehensive framework that categorizes similar exposures into classes, representing the types of exploitation associated with such vulnerabilities. 
Our approach utilizing pre-trained language models is shown to outperform a Large Language Model (LLM) on this task. Overall, this dissertation provides various technical approaches exploiting advances in NLP to improve publicly available vulnerability databases.

Copyright owned by the Saudi Digital Library (SDL) © 2026