SACM - United States of America
Permanent URI for this collection: https://drepo.sdl.edu.sa/handle/20.500.14154/9668
7 results
Search Results
Item Restricted
IMPROVING ASPECT-BASED SENTIMENT ANALYSIS THROUGH LARGE LANGUAGE MODELS (Florida State University, 2024)
Alanazi, Sami; Liu, Xiuwen

Aspect-Based Sentiment Analysis (ABSA) is a crucial task in Natural Language Processing (NLP) that seeks to extract sentiments associated with specific aspects within text data. While traditional sentiment analysis offers a broad view, ABSA provides a fine-grained approach by identifying sentiments tied to particular aspects, enabling deeper insights into user opinions across diverse domains. Despite advances in NLP, accurately capturing aspect-specific sentiments, especially in complex and multi-aspect sentences, remains challenging due to the nuanced dependencies and variations in sentiment expression. Additionally, languages with limited annotated datasets, such as Arabic, present further obstacles for ABSA. This dissertation addresses these challenges by proposing methodologies that enhance ABSA capabilities through large language models and transformer architectures. Three primary approaches are developed and evaluated: first, aspect-specific sentiment classification using GPT-4 with prompt engineering to improve few-shot learning and in-context classification; second, triplet extraction using an encoder-decoder framework based on the T5 model, designed to capture aspect-opinion-sentiment associations effectively; and third, Aspect-Aware Conditional BERT, an extension of AraBERT that incorporates a customized attention mechanism to dynamically adjust focus based on target aspects, particularly improving ABSA for multi-aspect Arabic text. Our experimental results demonstrate that these proposed methods outperform current baselines across multiple datasets, particularly in improving sentiment accuracy and aspect relevance.
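The few-shot, in-context classification setup described above can be sketched as a simple prompt builder. This is an illustrative sketch only: the instruction wording, example sentences, and labels below are hypothetical stand-ins, not the dissertation's actual prompts.

```python
# Hedged sketch: assembling a few-shot prompt for aspect-specific sentiment
# classification with an instruction-following model such as GPT-4.
# All example texts and labels here are hypothetical.

FEW_SHOT_EXAMPLES = [
    ("The battery life is amazing but the screen scratches easily.",
     "battery life", "positive"),
    ("The battery life is amazing but the screen scratches easily.",
     "screen", "negative"),
]

def build_absa_prompt(sentence: str, aspect: str) -> str:
    """Assemble an in-context classification prompt for one (sentence, aspect) pair."""
    lines = ["Classify the sentiment toward the given aspect as "
             "positive, negative, or neutral.", ""]
    for ex_sentence, ex_aspect, ex_label in FEW_SHOT_EXAMPLES:
        lines.append(f"Sentence: {ex_sentence}")
        lines.append(f"Aspect: {ex_aspect}")
        lines.append(f"Sentiment: {ex_label}")
        lines.append("")
    # The query pair goes last; the model completes the final "Sentiment:" slot.
    lines.append(f"Sentence: {sentence}")
    lines.append(f"Aspect: {aspect}")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_absa_prompt("The food was great but the service was slow.", "service")
```

The same sentence appears twice in the demonstrations with different aspects, which is the core of the aspect-specific framing: the model must condition its answer on the aspect, not on the overall tone of the sentence.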
This research contributes new model architectures and techniques that enhance ABSA for high-resource and low-resource languages, offering a scalable solution adaptable to various domains.

Item Restricted
Towards Representative Pre-training Corpora for Arabic Natural Language Processing (Clarkson University, 2024-11-30)
Alshahrani, Saied Falah A; Matthews, Jeanna

Natural Language Processing (NLP) encompasses various tasks, problems, and algorithms that analyze human-generated textual corpora or datasets to produce insights, suggestions, or recommendations. These corpora and datasets are crucial for any NLP task or system, as they convey social concepts, including the views, culture, heritage, and perspectives of native speakers. However, a corpus or dataset in a particular language does not necessarily represent the culture of its native speakers. Some textual corpora or datasets are written organically by native speakers, while others may be written by non-native speakers, translated from other languages, or generated using advanced NLP technologies such as Large Language Models (LLMs). In the era of Generative Artificial Intelligence (GenAI), it has become increasingly difficult to distinguish human-generated texts from machine-translated or machine-generated texts, especially when these different types of texts are combined into large corpora or datasets for pre-training NLP tasks, systems, and technologies. There is therefore an urgent need to study the degree to which pre-training corpora or datasets represent native speakers and reflect their values, beliefs, cultures, and perspectives, and to investigate the potentially negative implications of using unrepresentative corpora or datasets in NLP tasks, systems, and technologies.
One of the most widely utilized pre-training sources for NLP is Wikipedia, especially for low-resource languages like Arabic, due to its large multilingual content collection and the massive array of metadata that can be quantified. In this dissertation, we study the representativeness of Arabic NLP pre-training corpora, focusing specifically on three Arabic Wikipedia editions: Arabic Wikipedia, Egyptian Arabic Wikipedia, and Moroccan Arabic Wikipedia. Our primary goals are to 1) raise awareness of the potential negative implications of using unnatural, inorganic, and unrepresentative corpora, i.e., those generated or translated automatically without the input of native speakers; 2) find better ways to promote transparency and ensure that native speakers are involved, through metrics, metadata, and online applications; and 3) reduce the impact of automatically generated or translated content by using machine learning algorithms to identify or detect it automatically. To do this, we first analyze the metadata of the three Arabic Wikipedia editions, focusing on differences in collected statistics such as total pages, articles, edits, registered and active users, administrators, and top editors. We document issues related to the automatic creation and translation of articles (content pages) from English to Arabic without review, revision, or supervision by native speakers. Second, we quantitatively study the performance implications of using unnatural, inorganic corpora that do not represent native speakers and are primarily generated through automation, such as bot-created articles or template-based translation. We intrinsically evaluate the performance of two main NLP tasks, Word Representation and Language Modeling, using the Word Analogy and Fill-Mask evaluation tasks on our two newly created datasets: the Arab States Analogy Dataset and the Masked Arab States Dataset.
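The Word Analogy evaluation mentioned above is commonly scored with the 3CosAdd rule: the answer to "a is to b as c is to ?" is the vocabulary word whose embedding is closest to b - a + c. A minimal sketch, using hypothetical toy vectors (a real evaluation would use embeddings trained on the Wikipedia corpora discussed above):

```python
# Hedged sketch of the 3CosAdd word-analogy evaluation. The tiny vectors
# below are hypothetical; they are chosen only so the analogy resolves.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def solve_analogy(vocab, a, b, c):
    """Return the word maximizing cos(w, b - a + c), excluding the query words."""
    target = [bb - aa + cc for aa, bb, cc in zip(vocab[a], vocab[b], vocab[c])]
    candidates = (w for w in vocab if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vocab[w], target))

toy = {
    "riyadh":       [1.0, 0.1, 0.0],
    "saudi_arabia": [1.0, 1.0, 0.0],
    "cairo":        [0.0, 0.1, 1.0],
    "egypt":        [0.0, 1.0, 1.0],
}
print(solve_analogy(toy, "riyadh", "saudi_arabia", "cairo"))  # → egypt
```

An embedding trained on corpora that under-represent native-speaker content tends to fail exactly these culturally grounded analogies, which is what makes the task a useful intrinsic probe.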
Third, we assess the quality of Wikipedia corpora at the edition level rather than the article level by quantifying bot activities and enhancing Wikipedia's Depth metric. After analyzing the limitations of the existing Depth metric, we propose DEPTH+, a bot-free version that excludes bot-created articles and bot-made edits, presenting its mathematical definition, highlighting its features and limitations, and explaining how this new metric more accurately reflects the depth of human collaboration within the Wikipedia project. Finally, we address the issue of template translation in the Egyptian Arabic Wikipedia by identifying template-translated articles and their characteristics. We explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions, and employ the resulting insights to build multivariate machine learning classifiers that leverage article metadata to automatically detect template-translated articles. We deploy the best-performing classifier publicly as an online application and release the extracted, filtered, labeled, and preprocessed datasets so that the research community can benefit from our datasets and the web-based detection system.

Item Restricted
EXPLORING LANGUAGE MODELS AND QUESTION ANSWERING IN BIOMEDICAL AND ARABIC DOMAINS (University of Delaware, 2024-05-10)
Alrowili, Sultan; Shanker, K. Vijay

Despite the success of the Transformer model and its variants (e.g., BERT, ALBERT, ELECTRA, T5) on NLP tasks, similar success is not achieved when these models are applied to specific domains (e.g., biomedical) and limited-resource languages (e.g., Arabic). This research addresses some of the challenges in applying Transformer models to specialized domains and to languages that lack language processing resources. One reason for reduced performance in limited domains may be the lack of quality contextual representations.
We address this issue by adapting different types of language models, introducing five BioM-Transformer models for the biomedical domain and Funnel Transformer and T5 models for the Arabic language. For each of our models, we present experiments studying the impact of design factors (e.g., corpora and vocabulary domain, model scale, architecture design) on performance and efficiency. Our evaluation of the BioM-Transformer models shows that we obtain state-of-the-art results on several biomedical NLP tasks and achieved the top-performing models on the BLURB leaderboard. The evaluation of our small-scale Arabic Funnel and T5 models shows that we achieve comparable performance while utilizing less computation than the fine-tuning cost of existing Arabic models. Further, our base-scale Arabic language models extend state-of-the-art results on several Arabic NLP tasks while maintaining a fine-tuning cost comparable to existing base-scale models. Next, we focus on the question-answering task, specifically tackling issues in specialized domains and low-resource languages such as the limited size of question-answering datasets and the limited topic coverage within them. We employ several methods to address these issues in the biomedical domain, including models adapted to the domain and Task-to-Task Transfer Learning. We evaluate the effectiveness of these methods in the BioASQ10 (2022) challenge, achieving the top-performing system on several batches of the BioASQ10 challenge. In Arabic, we address similar issues by introducing a novel approach to creating question-answer-passage triplets and proposing a pipeline, Pair2Passage, to create large QA datasets. Using this method and pipeline, we create ArTrivia, a new Arabic question-answering dataset comprising more than 10,000 high-quality question-answer-passage triplets.
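Building Arabic QA triplets like those in ArTrivia typically requires normalizing answer strings so that surface variants of the same answer compare equal. The rules below are a standard illustrative subset of Arabic text normalization (strip diacritics, unify alef and ya variants, map ta marbuta, drop tatweel); they are not necessarily the exact rules used in the dissertation's pipeline.

```python
# Hedged sketch of Arabic answer normalization for QA evaluation.
# The specific rule set is illustrative, not the ArTrivia pipeline itself.
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # tanween, fatha..sukun, shadda
TATWEEL = "\u0640"                           # kashida elongation character

def normalize_answer(text: str) -> str:
    text = DIACRITICS.sub("", text)                        # remove tashkeel
    text = text.replace(TATWEEL, "")                       # drop elongation
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # آ أ إ -> ا
    text = text.replace("\u0649", "\u064A")                # ى -> ي
    text = text.replace("\u0629", "\u0647")                # ة -> ه
    return text.strip()

# Two surface forms of the same answer compare equal after normalization:
print(normalize_answer("مَكَّة") == normalize_answer("مكه"))  # → True
```

Without a step like this, exact-match scoring over Arabic answers penalizes purely orthographic variation, which is one reason answer normalization matters for dataset quality.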
We present a quantitative and qualitative analysis of ArTrivia that shows the importance of often overlooked components, such as answer normalization, in enhancing the quality of a question-answer dataset and guiding future annotation. In addition, our evaluation shows the ability of ArTrivia to build a question-answering model that can address the out-of-distribution issue in existing Arabic QA datasets.

Item Restricted
Entity Information Extraction And Normalization From Scientific And Clinical Texts (University of Alabama at Birmingham, 2024)
Almudaifer, Abdullateef; Yan, Da

In the analysis of scientific and clinical texts, the extraction of named entities and their relevant information, such as modifiers, plays a pivotal role. Recent advancements in natural language processing (NLP), particularly through the application of transfer learning from pre-trained Transformer models, have greatly enhanced the performance of entity extraction tasks. However, challenges persist with nested entities. This thesis investigates the impact of transfer learning on extracting nested entities and their modifiers, using Opioid Use Disorder (OUD) as a prototype. By adopting a multi-task training strategy, this work enhances the model's capacity to discern and categorize overlapping entities, a task that traditional transfer learning models often struggle with due to their single-focus training on flat entities. Moreover, entity modifiers, which can alter the semantics of entities extracted from clinical texts, are critical for interpreting clinical narratives accurately. Traditional models for identifying these modifiers often rely on regular expressions or feature weights, trained in isolation for each modifier. In contrast, this thesis proposes a novel, unified multi-task Transformer architecture that simultaneously learns and predicts multiple modifiers.
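The multi-task idea above can be reduced to a toy shape: one shared representation feeding one lightweight head per modifier, so all modifiers are predicted in a single pass. In the sketch below, hypothetical keyword features stand in for the shared Transformer encoder, and the modifier names and cue words are illustrative only.

```python
# Toy sketch of joint multi-modifier prediction over a shared representation.
# A real system would replace encode() with a shared Transformer encoder and
# the cue sets with learned classification heads; everything here is illustrative.

MODIFIER_CUES = {
    "negation":    {"no", "not", "denies", "without"},
    "uncertainty": {"possible", "probable", "suspected"},
}

def encode(text: str) -> set:
    """Stand-in for a shared encoder: a lowercase bag of words."""
    return set(text.lower().split())

def predict_modifiers(text: str) -> dict:
    """Every modifier head reads the same shared representation."""
    features = encode(text)
    return {name: bool(features & cues) for name, cues in MODIFIER_CUES.items()}

print(predict_modifiers("Patient denies opioid use, possible withdrawal."))
```

The point of the shared encoder is that evidence useful for one modifier (e.g., negation) is computed once and reused by all heads, which is what enables transfer between datasets with partially overlapping modifier sets.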
The effectiveness of this approach is validated on the ShARe and OUD datasets, demonstrating state-of-the-art results and highlighting the potential of transfer learning between datasets with partially similar modifiers in clinical texts. This work extends into document-level entity relation extraction, enhancing the ability to comprehensively understand and analyze relationships between entities within the scientific literature. Furthermore, the thesis addresses the essential task of entity normalization: linking textual mentions to ontology concepts. Despite the challenges posed by the diverse expression of concepts and the complexity of ontology graphs, this work introduces a model that utilizes graph neural networks (GNNs) to encode entity mentions and ontology concepts in a common hyperbolic space, aiming to enhance entity normalization performance in scientific and clinical texts.

Item Restricted
Development Techniques for Large Language Models for Low Resource Languages (University of Texas at Dallas, 2023-12)
Alsarra, Sultan; Khan, Latifur

Recent advancements in Natural Language Processing (NLP) driven by large language models have brought transformative changes to various sectors reliant on extensive text-based research. This dissertation is the culmination of techniques for crafting domain-specific large language models tailored to low-resource languages, offering invaluable support to researchers engaged in large-scale text analysis. The primary focus of these models is to address the nuances of politics, conflicts, and violence in the Middle East and Latin America using domain-specific, pre-trained large language models in Arabic and Spanish. Throughout the development of these language models, we construct a multitude of downstream tasks, including named entity recognition, binary classification, multi-label classification, and question answering.
Additionally, we lay out a roadmap for the creation of domain-specific large language models. Our core objective is to devise NLP strategies and methodologies that surmount the challenges posed by low-resource languages. This contribution extends to curating an extensive corpus of texts centered on regional politics and conflicts in Spanish and Arabic, thereby enriching research on large language models for low-resource languages. We assess the performance of our models against the Bidirectional Encoder Representations from Transformers (BERT) model as a baseline. Our findings establish that domain-specific pre-trained language models markedly enhance the performance of NLP models in politics and conflict analysis. This is observed in both Arabic and Spanish, across diverse types of downstream tasks. Consequently, our work equips researchers working on large language models for low-resource languages with potent tools. Simultaneously, it offers political and conflict analysts, including policymakers, scholars, and practitioners, novel approaches and instruments for deciphering the intricate dynamics of local politics and conflicts directly in Arabic and Spanish.

Item Restricted
Improving vulnerability description using natural language generation (Saudi Digital Library, 2023-10-25)
Althebeiti, Hattan; Mohaisen, David

Software plays an integral role in powering numerous everyday computing gadgets. As our reliance on software continues to grow, so does the prevalence of software vulnerabilities, with significant implications for organizations and users. As such, documenting vulnerabilities and tracking their development becomes crucial. Vulnerability databases address this issue by storing a record with various attributes for each discovered vulnerability. However, their contents suffer from several drawbacks, which we address in our work.
In this dissertation, we investigate the weaknesses associated with vulnerability descriptions in public repositories and alleviate those weaknesses through Natural Language Processing (NLP) approaches. The first contribution examines vulnerability descriptions in these databases and approaches to improving them. We propose a new automated method that leverages external sources to enrich the scope and context of a vulnerability description. Moreover, we exploit fine-tuned pre-trained language models to normalize the resulting description. The second contribution investigates the need for a uniform and normalized structure in vulnerability descriptions. We address this need by breaking the description of a vulnerability into multiple constituents and developing a multi-task model that creates a new uniform and normalized summary, maintaining the necessary attributes of the vulnerability using the extracted features while ensuring a consistent description. Our method proved effective in generating new summaries with the same structure across a collection of various vulnerability descriptions and types. Our final contribution investigates the feasibility of assigning the Common Weakness Enumeration (CWE) attribute to a vulnerability based on its description. CWE offers a comprehensive framework that categorizes similar exposures into classes representing the types of exploitation associated with such vulnerabilities. Our approach utilizing pre-trained language models is shown to outperform Large Language Models (LLMs) on this task. Overall, this dissertation provides various technical approaches that exploit advances in NLP to improve publicly available vulnerability databases.

Item Restricted
Deep Learning Methods to Investigate Online Hate Speech and Counterhate Replies to Mitigate Hateful Content (2025-05-15)
Albanyan, Abdullah; Blanco, Eduardo; Albert, Mark

Hateful content and offensive language are commonplace on social media platforms.
Many surveys show that high percentages of social media users experience online harassment. Previous efforts have been made to detect and remove online hate content automatically. However, removing users' content restricts free speech. A complementary strategy for addressing hateful content that does not interfere with free speech is to counter the hate with new content that diverts the discourse away from the hate. In this dissertation, we address the lack of previous work on counterhate arguments by analyzing and detecting them. First, we study the relationships between hateful tweets and their replies. Specifically, we analyze their fine-grained relationships by indicating whether a reply counters the hate, provides a justification, attacks the author of the tweet, or adds additional hate. The most notable finding is that most replies generally agree with the hateful tweets; only 20% of them counter the hate. Second, we focus on hate directed toward individuals and detect authentic counterhate arguments from online articles. We propose a methodology that assures the authenticity of the argument and its specificity to the individual of interest. We show that finding arguments in online articles is an efficient alternative to counterhate generation approaches, which may hallucinate unsupported arguments. Third, we investigate the replies to counterhate tweets beyond whether a reply agrees or disagrees with the counterhate tweet. We analyze the language of counterhate tweets that leads to certain types of replies and predict which counterhate tweets may elicit more hate instead of stopping it. We find that counterhate tweets containing profanity elicit replies that agree with the counterhate tweet. This dissertation presents several corpora, detailed corpus analyses, and deep learning-based approaches for the three tasks mentioned above.