Saudi Cultural Missions Theses & Dissertations
Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10
21 results
Search Results
Item Restricted
Improving Feature Location in Source Code via Large Language Model-Based Descriptive Annotations (Arizona State University, 2025-05) Alneif, Sultan; Alhindawi, Nouh
Feature location is a crucial task in software maintenance, aiding developers in identifying the precise segments of code responsible for specific functionalities. Traditional feature location methods, such as grep and static analysis, often result in high false-positive rates and inadequate ranking accuracy, increasing developer effort and reducing productivity. Information Retrieval (IR) techniques like Latent Semantic Indexing (LSI) have improved precision and recall but still struggle with lexical mismatches and semantic ambiguities. This research introduces an innovative method to enhance feature location by augmenting source code corpora with descriptive annotations generated by Large Language Models (LLMs), specifically Code Llama. The enriched corpora provide deeper semantic context, improving the alignment between developer queries and relevant source code components. Empirical evaluations were conducted on two open-source systems, HippoDraw and Qt, using standard IR performance metrics: precision, recall, First Relevant Position (FRP), and Last Relevant Position (LRP). Results showed significant performance gains: a 40% precision improvement in HippoDraw and a 26% improvement in Qt, while recall improved by 32% in HippoDraw and 24% in Qt. The findings highlight the efficacy of incorporating LLM-generated annotations, significantly reducing developer effort and enhancing software comprehension and maintainability. This research provides a practical and scalable solution for software maintenance and evolution tasks.
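The abstract above reports precision, recall, First Relevant Position (FRP), and Last Relevant Position (LRP). A minimal sketch of how such ranking metrics could be computed for a single feature-location query, assuming a ranked list of retrieved code elements and a known set of relevant ones (function and file names are illustrative, not taken from the thesis):

```python
def feature_location_metrics(ranked_results, relevant):
    """Precision, recall, and First/Last Relevant Position for one query
    over a ranked list of retrieved code elements."""
    relevant = set(relevant)
    hits = [i + 1 for i, doc in enumerate(ranked_results) if doc in relevant]  # 1-based ranks
    precision = len(hits) / len(ranked_results) if ranked_results else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    frp = hits[0] if hits else None   # rank of the first relevant element
    lrp = hits[-1] if hits else None  # rank of the last relevant element
    return {"precision": precision, "recall": recall, "FRP": frp, "LRP": lrp}

# Toy query against a ranked retrieval result from an annotation-enriched corpus
print(feature_location_metrics(
    ranked_results=["draw.cpp", "canvas.cpp", "plot.cpp", "io.cpp"],
    relevant={"canvas.cpp", "plot.cpp"},
))
```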
Item Restricted
Cross Dataset Fairness Evaluation of Transformer Based Sentiment Models (Saudi Digital Library, 2025-05-10) Zuiran, Sara; Bhattacharyya, Siddhartha
With the growing exploration of Natural Language Processing (NLP) systems in decision-making environments, it is essential to evaluate both the technical and the ethical aspects of the dataset and the NLP model to improve fairness. To assess fairness, the thesis examines demographic imbalances in sentiment classification models by evaluating transformer-based models fine-tuned on the Stanford Sentiment Treebank version 2 dataset (SST-2) against the demographically annotated Comprehensive Assessment of Language Model dataset (CALM). This work identifies performance disparities in sentiment prediction across demographic groups by examining sensitive attributes such as gender and race. The study evaluates both the RoBERTa and MentalBERT transformer models using a complete set of fairness metrics consisting of Statistical Parity Difference (SPD), Equal Opportunity Difference (EOD), False Positive Rates (FPR), False Negative Rates (FNR), Jensen-Shannon Divergence (JSD), and Wasserstein Distance (WD). The analysis examines both group-vs-rest and pairwise subgroup comparisons, including gender and ethnicity. Results show that applying adversarial mitigation reduced fairness disparities across demographic subgroups, with the most notable improvements observed for non-binary and Asian users. The observed disparities emphasize the challenge of reducing performance gaps across demographic subgroups in sentiment classification tasks. The thesis introduces a practical framework for evaluating demographic disparities, extends fairness analysis, and assesses the impact of mitigation techniques in cross-dataset sentiment classification. This research proposes a framework that demonstrates a path toward creating inclusive NLP systems and establishes the groundwork for future ethical Artificial Intelligence (AI) studies.
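Two of the group fairness metrics named above, Statistical Parity Difference (SPD) and Equal Opportunity Difference (EOD), can be illustrated with a small sketch; the subgroup encoding and privileged/unprivileged convention here are assumptions, not the thesis's exact protocol:

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """SPD: P(pred=1 | unprivileged group) - P(pred=1 | privileged group)."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 0].mean() - y_pred[group == 1].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """EOD: difference in true positive rate between unprivileged and privileged groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    def tpr(g):
        mask = (group == g) & (y_true == 1)
        return y_pred[mask].mean() if mask.any() else 0.0
    return tpr(0) - tpr(1)

# Toy example: group 1 = privileged subgroup, group 0 = unprivileged subgroup
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
group  = [1, 1, 1, 0, 0, 0]
print(statistical_parity_difference(y_pred, group))
print(equal_opportunity_difference(y_true, y_pred, group))
```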
Item Restricted
Analysing and Visualising (Cyber)crime data using Structured Occurrence Nets and Natural Language Processing (Newcastle University, 2025-03-01) Alshammari, Tuwailaa; Koutny, Maciej
Structured Occurrence Nets (SONs) are a Petri net-based formalism designed to represent the behaviour of complex evolving systems, capturing concurrent events and interactions between subsystems. Recently, the modelling and visualisation of crime and cybercrime investigations have gained increasing interest, and SONs have proven to be versatile tools for modelling and visualising such applications. This thesis presents two contributions aimed at making SON-based techniques suitable for real-life applications. The main contribution is motivated by the fact that manually developing SON models from unstructured text can be time-consuming, as it requires extensive reading, comprehension, and model construction. The thesis therefore develops a methodology for the formal representation of unstructured textual resources in English, which involves experimenting with, mapping, and deriving relationships between natural and formal languages, using SONs for crime modelling and visualisation as an application. The second contribution addresses the scalability of SON-based representations for cybercrime analysis: it provides a novel approach in which acyclic nets are extended with coloured features to reduce net size and aid visualisation. While the two contributions address distinct challenges, they are unified by their use of SONs as a formalism for modelling complex systems, and structured occurrence nets demonstrated their adaptability in representing both crime scenarios and cybercrime activities.

Item Restricted
IMPROVING ASPECT-BASED SENTIMENT ANALYSIS THROUGH LARGE LANGUAGE MODELS (Florida State University, 2024) Alanazi, Sami; Liu, Xiuwen
Aspect-Based Sentiment Analysis (ABSA) is a crucial task in Natural Language Processing (NLP) that seeks to extract sentiments associated with specific aspects within text data. While traditional sentiment analysis offers a broad view, ABSA provides a fine-grained approach by identifying sentiments tied to particular aspects, enabling deeper insights into user opinions across diverse domains. Despite improvements in NLP, accurately capturing aspect-specific sentiments, especially in complex and multi-aspect sentences, remains challenging due to the nuanced dependencies and variations in sentiment expression. Additionally, languages with limited annotated datasets, such as Arabic, present further obstacles in ABSA. This dissertation addresses these challenges by proposing methodologies that enhance ABSA capabilities through large language models and transformer architectures. Three primary approaches are developed and evaluated: first, aspect-specific sentiment classification using GPT-4 with prompt engineering to improve few-shot learning and in-context classification; second, triplet extraction utilizing an encoder-decoder framework based on the T5 model, designed to capture aspect-opinion-sentiment associations effectively; and lastly, Aspect-Aware Conditional BERT, an extension of AraBERT, incorporating a customized attention mechanism to dynamically adjust focus based on target aspects, particularly improving ABSA in multi-aspect Arabic text. Experimental results demonstrate that these proposed methods outperform current baselines across multiple datasets, particularly in improving sentiment accuracy and aspect relevance. This research contributes new model architectures and techniques that enhance ABSA for high-resource and low-resource languages, offering a scalable solution adaptable to various domains.
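As an illustration of the first approach described above (few-shot, in-context aspect sentiment classification with GPT-4), a hedged sketch using the OpenAI Python client; the prompt wording, few-shot examples, and helper names are placeholders, not those used in the dissertation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative few-shot examples: (sentence, aspect, sentiment)
FEW_SHOT = [
    ("The battery lasts forever but the screen is dim.", "battery", "positive"),
    ("The battery lasts forever but the screen is dim.", "screen", "negative"),
]

def classify_aspect_sentiment(sentence: str, aspect: str) -> str:
    """Ask the model for the sentiment expressed toward one aspect in one sentence."""
    examples = "\n".join(
        f'Sentence: "{s}"\nAspect: {a}\nSentiment: {label}' for s, a, label in FEW_SHOT
    )
    prompt = (
        "Classify the sentiment (positive, negative, or neutral) expressed toward the "
        "given aspect only.\n\n"
        f"{examples}\n\nSentence: \"{sentence}\"\nAspect: {aspect}\nSentiment:"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(classify_aspect_sentiment("The food was great but service was slow.", "service"))
```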
Item Restricted
Towards Representative Pre-training Corpora for Arabic Natural Language Processing (Clarkson University, 2024-11-30) Alshahrani, Saied Falah A; Matthews, Jeanna
Natural Language Processing (NLP) encompasses various tasks, problems, and algorithms that analyze human-generated textual corpora or datasets to produce insights, suggestions, or recommendations. These corpora and datasets are crucial for any NLP task or system, as they convey social concepts, including the views, culture, heritage, and perspectives of native speakers. However, a corpus or dataset in a particular language does not necessarily represent the culture of its native speakers. Some textual corpora or datasets are written organically by native speakers, while others are written by non-native speakers, translated from other languages, or generated using advanced NLP technologies, such as Large Language Models (LLMs). Yet, in the era of Generative Artificial Intelligence (GenAI), it has become increasingly difficult to distinguish between human-generated texts and machine-translated or machine-generated texts, especially when all these different types of texts are combined to create large corpora or datasets for pre-training NLP tasks, systems, and technologies.
Therefore, there is an urgent need to study the degree to which pre-training corpora or datasets represent native speakers and reflect their values, beliefs, cultures, and perspectives, and to investigate the potentially negative implications of using unrepresentative corpora or datasets for NLP tasks, systems, and technologies. One of the most widely utilized pre-training sources for NLP is Wikipedia, especially for low-resource languages like Arabic, due to its large multilingual content collection and massive array of quantifiable metadata. In this dissertation, we study the representativeness of Arabic NLP pre-training corpora, focusing specifically on the three Arabic Wikipedia editions: Arabic Wikipedia, Egyptian Arabic Wikipedia, and Moroccan Arabic Wikipedia. Our primary goals are to 1) raise awareness of the potential negative implications of using unnatural, inorganic, and unrepresentative corpora, meaning those generated or translated automatically without the input of native speakers, 2) find better ways to promote transparency and ensure that native speakers are involved through metrics, metadata, and online applications, and 3) reduce the impact of automatically generated or translated content by using machine learning algorithms to identify or detect it automatically. To do this, we first analyze the metadata of the three Arabic Wikipedia editions, focusing on differences using collected statistics such as total pages, articles, edits, registered and active users, administrators, and top editors. We document issues related to the automatic creation and translation of articles (content pages) from English to Arabic without review, revision, or supervision by native speakers. Secondly, we quantitatively study the performance implications of using unnatural, inorganic corpora that do not represent native speakers and are primarily generated using automation, such as bot-created articles or template-based translation. We intrinsically evaluate the performance of two main NLP tasks, Word Representation and Language Modeling, using the Word Analogy and Fill-Mask evaluation tasks on our two newly created datasets: the Arab States Analogy Dataset and the Masked Arab States Dataset. Thirdly, we assess the quality of Wikipedia corpora at the edition level rather than the article level by quantifying bot activities and enhancing Wikipedia's Depth metric. After analyzing the limitations of the existing Depth metric, we propose a bot-free version, the DEPTH+ metric, which excludes bot-created articles and bot-made edits on articles; we present its mathematical definition, highlight its features and limitations, and explain how the new metric more accurately reflects human collaboration depth within the Wikipedia project. Finally, we address the issue of template translation in the Egyptian Arabic Wikipedia by identifying template-translated articles and their characteristics. We explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions and employ the resulting insights to build multivariate machine learning classifiers that leverage article metadata to automatically detect template-translated articles.
We lastly deploy the best-performing classifier publicly as an online application and release the extracted, filtered, labeled, and preprocessed datasets to the research community so that others can benefit from our datasets and the web-based detection system.
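The final contribution above builds metadata-based classifiers to flag template-translated articles. A minimal scikit-learn sketch of that idea; the feature names, labels, and numbers are invented for illustration and do not reflect the dissertation's actual feature set or data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical per-article metadata features:
# [total_edits, distinct_human_editors, bot_edit_ratio, article_length_chars, reference_count]
X = np.array([
    [120, 15, 0.05, 18000, 40],   # organically written article
    [3,   1,  0.95, 900,   0],    # likely template-translated stub
    [45,  8,  0.10, 7000,  12],
    [2,   1,  0.90, 750,   1],
    [60, 11,  0.08, 9500,  20],
    [4,   1,  0.85, 1100,  0],
])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = template-translated, 0 = human-written

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```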
Item Restricted
Enhancing Biomedical Named Entity Recognition through Multi-Task Learning and Syntactic Feature Integration with BioBERT (De Montfort University, 2024-08) Alqulayti, Abdulaziz; Taherkhani, Aboozar
Biomedical Named Entity Recognition (BioNER) is a critical task in natural language processing (NLP) for extracting noteworthy knowledge from the ever-growing body of biomedical literature. The focus of this study is the creation of refined BioNER models that identify entities such as proteins, diseases, and genes with strong generalizability and accuracy. The study addresses important challenges in BioNER, such as morphological variation, the complex nature of biomedical terminology, and the ambiguity often seen in context-dependent language. It incorporates machine learning techniques including character-level embeddings through Bidirectional Long Short-Term Memory (BiLSTM) networks, pre-trained models such as BioBERT, a multi-task learning setup, and syntactic feature extraction. The methodology was applied to the NCBI Disease Corpus, a standard dataset for disease name recognition, and two main models were created: BioBERTForNER and BioBERTBiLSTMForNER. The BioBERTBiLSTM model contains an additional BiLSTM layer and showed exceptional performance by capturing long-term dependencies and complicated morphological patterns in biomedical text, reaching an F1-score of 0.938 and outperforming existing advanced systems as well as the baseline BioBERT model. The study also investigates the effect of syntactic features and character-level embeddings, demonstrating their vital role in improving precision and recall. The multi-task learning setup proved effective at reducing overfitting while helping the model generalize across different contexts. The final models not only set new benchmarks on the NCBI Disease Corpus but also present a multi-faceted, extensible strategy for BioNER, showing how architectural innovations and refined embedding methods can greatly enhance biomedical text mining. The results underscore the key role of advanced embedding techniques and multi-task learning in NLP, demonstrating their flexibility across various biomedical domains. Additionally, the study shows the potential for these improvements to be applied to real-world clinical data extraction, preparing the path for future work; extending these methodologies to additional, more diverse biomedical datasets could ultimately improve the efficiency and precision of automated biomedical information retrieval in clinical settings.
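A minimal PyTorch sketch of the kind of BioBERT-plus-BiLSTM token classifier the abstract describes; the checkpoint name, layer sizes, and label set are illustrative assumptions rather than the study's exact configuration:

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BioBERTBiLSTMForNER(nn.Module):
    """Sketch: BioBERT encoder, a BiLSTM over token states, and a per-token label head."""
    def __init__(self, num_labels: int, encoder_name: str = "dmis-lab/biobert-base-cased-v1.1"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(hidden, num_labels)  # per-token BIO label scores

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        states, _ = self.bilstm(states)   # capture longer-range dependencies over token states
        return self.classifier(states)    # shape: (batch, seq_len, num_labels)

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = BioBERTBiLSTMForNER(num_labels=3)  # e.g. B-Disease / I-Disease / O for NCBI Disease
batch = tokenizer(["Mutations in BRCA1 cause breast cancer ."], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)
```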
Item Restricted
A Quality Model to Assess Airport Services Using Machine Learning and Natural Language Processing (Cranfield University, 2024-04) Homaid, Mohammed; Moulitsas, Irene
In the dynamic environment of passenger experiences, precisely evaluating passenger satisfaction remains crucial. This thesis is dedicated to the analysis of Airport Service Quality (ASQ) through sentiment analysis of passenger reviews. The research investigates and proposes a novel model for assessing ASQ through the application of Machine Learning (ML) and Natural Language Processing (NLP) techniques, utilising a comprehensive dataset sourced from Skytrax that incorporates both text reviews and numerical ratings. The initial analysis reveals challenges for traditional and general NLP techniques when applied to specific domains such as ASQ, due to limitations like general lexicon dictionaries and pre-compiled stopword lists. To overcome these challenges, a domain-specific sentiment lexicon for airport service reviews is created using the Pointwise Mutual Information (PMI) scoring method, replacing the default VADER sentiment scores with those derived from the newly developed lexicon. The outcomes demonstrate that this specialised lexicon for the airport review domain substantially exceeds the benchmarks, delivering consistent and significant enhancements. Moreover, six unique methods for identifying stopwords within the Skytrax review dataset are developed, and the research reveals that employing dynamic methods for stopword removal markedly improves the performance of sentiment classification. Deep learning (DL), especially using transformer models, has revolutionised the processing of textual data, achieving unprecedented success. Therefore, novel models are developed through the meticulous development and fine-tuning of advanced deep learning models, specifically Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Encoder Representations from Transformers (BERT), tailored to the airport services domain. The results demonstrate superior performance, highlighting the BERT model's exceptional ability to seamlessly blend textual and numerical data. This progress marks a significant improvement upon the state-of-the-art results documented in the existing literature. To encapsulate, this thesis presents a thorough exploration of sentiment analysis, ML and DL methodologies, establishing a framework for the enhancement of ASQ evaluation through detailed analysis of passenger feedback.
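A compact sketch of one common way to derive PMI-based sentiment scores per word from labelled reviews, in the spirit of the lexicon-building step above; the exact scoring, smoothing, and integration with VADER used in the thesis may differ:

```python
import math
from collections import Counter

def pmi_sentiment_lexicon(reviews):
    """PMI sentiment scores from (tokens, label) pairs, label in {"pos", "neg"}.
    score(w) = PMI(w, pos) - PMI(w, neg); higher scores lean positive."""
    word_label, word_total, label_total = Counter(), Counter(), Counter()
    n = 0
    for tokens, label in reviews:
        for w in tokens:
            word_label[(w, label)] += 1
            word_total[w] += 1
            label_total[label] += 1
            n += 1
    def pmi(w, label):
        joint = word_label[(w, label)] / n
        if joint == 0:
            return 0.0  # simple choice for unseen pairs instead of -inf
        return math.log2(joint / ((word_total[w] / n) * (label_total[label] / n)))
    return {w: pmi(w, "pos") - pmi(w, "neg") for w in word_total}

# Toy tokenized airport reviews with polarity labels
reviews = [
    (["staff", "friendly", "quick", "security"], "pos"),
    (["long", "queue", "rude", "staff"], "neg"),
    (["friendly", "helpful", "clean", "lounge"], "pos"),
    (["delayed", "queue", "dirty", "lounge"], "neg"),
]
lexicon = pmi_sentiment_lexicon(reviews)
print(sorted(lexicon.items(), key=lambda kv: kv[1], reverse=True)[:3])
```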
Item Restricted
Adapting to Change: The Temporal Performance of Text Classifiers in the Context of Temporally Evolving Data (Queen Mary University of London, 2024-07-08) Alkhalifa, Rabab; Zubiaga, Arkaitz
This thesis delves into the evolving landscape of NLP, focusing on the temporal persistence of text classifiers amid the dynamic nature of language use. The primary objective is to understand how changes in language patterns over time affect the performance of text classification models and to develop methodologies for maintaining their effectiveness. The research begins by establishing a theoretical foundation for text classification and temporal data analysis, highlighting the challenges posed by the evolving use of language and its implications for NLP models. A detailed exploration of various datasets, including stance detection and sentiment analysis datasets, sets the stage for examining these dynamics. Dataset characteristics, such as linguistic variation and temporal vocabulary growth, are carefully examined to understand their influence on classifier performance. A series of experiments evaluates the performance of text classifiers across different temporal scenarios. The findings reveal a general trend of performance degradation over time, emphasizing the need for classifiers that can adapt to linguistic change. The experiments assess models' ability to estimate past and future performance based on their current efficacy and linguistic dataset characteristics, leading to valuable insights into the factors influencing model longevity. Innovative solutions are proposed to address the observed performance decline and adapt to temporal changes in language use, including incorporating temporal information into word embeddings and comparing various methods across temporal gaps.
The Incremental Temporal Alignment (ITA) method emerges as a significant contributor to enhancing classifier performance in same-period experiments, although it faces challenges in maintaining effectiveness over longer temporal gaps. Furthermore, the exploration of machine learning and statistical methods highlights their potential to maintain classifier accuracy in the face of longitudinally evolving data. The thesis culminates in a shared task evaluation, where participant-submitted models are compared against baseline models to assess their temporal persistence. This comparison provides a comprehensive understanding of the short-term, long-term, and overall persistence of the models, offering valuable insights to the field. The research identifies several future directions, including interdisciplinary approaches that integrate linguistics and sociology, tracking textual shifts on online platforms, extending the analysis to other classification tasks, and investigating the ethical implications of evolving language in NLP applications. This thesis contributes to the NLP field by highlighting the importance of evaluating text classifiers' temporal persistence and offering methodologies to enhance their sustainability in dynamically evolving language environments. The findings and proposed approaches pave the way for future research aimed at developing more robust, reliable, and temporally persistent text classification models.
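A minimal sketch of the kind of temporal-persistence evaluation described above: train a classifier on an early time period and measure how its performance holds on later periods. The data, model, and time buckets are toy stand-ins, not the thesis's datasets or methods:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy temporally bucketed data: year -> (texts, binary labels).
buckets = {
    2018: (["great phone", "terrible battery", "love it", "broken screen"], [1, 0, 1, 0]),
    2020: (["this slaps", "total scam", "works fine", "waste of money"], [1, 0, 1, 0]),
    2022: (["absolute banger", "mid at best", "highly recommend", "do not buy"], [1, 0, 1, 0]),
}

# Train on the earliest period only.
train_year = min(buckets)
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(*buckets[train_year])

# Score each period: the gap between train_year and the evaluation year approximates
# the temporal distance over which classifier persistence is being tested.
for year in sorted(buckets):
    texts, labels = buckets[year]
    print(year, round(f1_score(labels, model.predict(texts), zero_division=0), 3))
```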