Saudi Cultural Missions Theses & Dissertations
Permanent URI for this communityhttps://drepo.sdl.edu.sa/handle/20.500.14154/10
Browse
3 results
Search Results
Item Restricted EXPLORING LANGUAGE MODELS AND QUESTION ANSWERING IN BIOMEDICAL AND ARABIC DOMAINS(University of Delaware, 2024-05-10) Alrowili, Sultan; Shanker, K.VijayDespite the success of the Transformer model and its variations (e.g., BERT, ALBERT, ELECTRA, T5) in addressing NLP tasks, similar success is not achieved when these models are applied to specific domains (e.g., biomedical) and limited-resources language (e.g., Arabic). This research addresses issues to overcome some challenges in the use of Transformer models to specialized domains and languages that lack in language processing resources. One of the reasons for reduced performance in limited domains might be due to the lack of quality contextual representations. We address this issue by adapting different types of language models and introducing five BioM-Transformer models for the biomedical domain and Funnel transformer and T5 models for the Arabic language. For each of our models, we present experiments for studying the impact of design factors (e.g., corpora and vocabulary domain, model-scale, architecture design) on performance and efficiency. Our evaluation of BioM-Transformer models shows that we obtain state-of-the-art results on several biomedical NLP tasks and achieved the top-performing models on the BLURB leaderboard. The evaluation of our small scale Arabic Funnel and T5 models shows that we achieve comparable performance while utilizing less computation compared to the fine tuning cost of existing Arabic models. Further, our base-scale Arabic language models extend state-of-the-art results on several Arabic NLP tasks while maintaining a comparable fine-tuning cost to existing base-scale models. Next, we focus on the question-answering task, specifically tackling issues in specialized domains and low-resource languages such as the limited size of question-answering datasets and limited topics coverage within them. We employ several methods to address these issues in the biomedical domain, including the employment of models adapted to the domain and Task-to-Task Transfer Learning. We evaluate the effectiveness of these methods at the BioASQ10 (2022) challenge, showing that we achieved the top-performing system on several batches of the BioASQ10 challenge. In Arabic, we address similar existing issues by introducing a novel approach to create question-answer-passage triplets, and propose a pipeline, Pair2Passage, to create large QA datasets. Using this method and the pipeline, we create the ArTrivia dataset, a new Arabic question-answering dataset comprising more than +10,000 high-quality question-answer-passage triplets. We presented a quantitative and qualitative analysis of ArTrivia that shows the importance of some often overlooked yet important components, such as answer normalization in enhancing the quality of the question-answer dataset and future annotation. In addition, our evaluation shows the ability of ArTrivia to build a question-answering model that can address the out-of-distribution issue in existing Arabic QA datasets.22 0Item Restricted Improving vulnerability description using natural language generation(Saudi Digital Library, 2023-10-25) Althebeiti, Hattan; Mohaisen, DavidSoftware plays an integral role in powering numerous everyday computing gadgets. As our reliance on software continues to grow, so does the prevalence of software vulnerabilities, with significant implications for organizations and users. As such, documenting vulnerabilities and tracking their development becomes crucial. Vulnerability databases addressed this issue by storing a record with various attributes for each discovered vulnerability. However, their contents suffer several drawbacks, which we address in our work. In this dissertation, we investigate the weaknesses associated with vulnerability descriptions in public repositories and alleviate such weaknesses through Natural Language Processing (NLP) approaches. The first contribution examines vulnerability descriptions in those databases and approaches to improve them. We propose a new automated method leveraging external sources to enrich the scope and context of a vulnerability description. Moreover, we exploit fine-tuned pretrained language models for normalizing the resulting description. The second contribution investigates the need for uniform and normalized structure in vulnerability descriptions. We address this need by breaking the description of a vulnerability into multiple constituents and developing a multi-task model to create a new uniform and normalized summary that maintains the necessary attributes of the vulnerability using the extracted features while ensuring a consistent vulnerability description. Our method proved effective in generating new summaries with the same structure across a collection of various vulnerability descriptions and types. Our final contribution investigates the feasibility of assigning the Common Weakness Enumeration (CWE) attribute to a vulnerability based on its description. CWE offers a comprehensive framework that categorizes similar exposures into classes, representing the types of exploitation associated with such vulnerabilities. Our approach utilizing pre-trained language models is shown to outperform Large Language Model (LLM) for this task. Overall, this dissertation provides various technical approaches exploiting advances in NLP to improve publicly available vulnerability databases.10 0Item Restricted Hate Speech Detection for the Arabic Language(Saudi Digital Library, 2023-11-03) Alhejaili, Abrar; Moosavi, NafiseAs online social networks grow and communication technologies become more available, people can exercise their freedom of expression more than ever before. Even though the interaction between users on these platforms can be constructive, they are increasingly used for spreading hateful content, mainly due to the anonymity feature of these online platforms. Hate speech can induce cyber conflict, negatively impacting social life at both the individual and national levels. In spite of this, social network providers are unable to monitor all the content posted by their users. As a result, there is a need to detect hate speech automatically. This need increases when the text is written in a language like Arabic. Arabic is known for its challenges, complexities, and resource scarcity. This project uses transfer learning methods to adapt, and evaluate some pretrained models to detect hate speech in Arabic. Many experiments were conducted in this project to assess the transferring of some options from BERT and Sequence-to-Sequence families (e.g., DehateBERT, MARBERT, T5, and Flan-T5), and the transferring of preprocessing functions from a pretrained model (AraBERT). Experiments show that transfer learning by finetuning monolingual models has promising results to a different extent. In addition, the additional preprocessing can affect the performance in a good way. Nevertheless, dealing with low-frequency labels independently, such as our dataset’s hate class, is still challenging. Warning: This paper may include instances of offensive language.30 0