Saudi Cultural Missions Theses & Dissertations
Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10
15 results
Search Results
Item Restricted
Disinformation Classification Using Transformer based Machine Learning (Howard University, 2024)
alshaqi, Mohammed Al; Rawat, Danda B
The proliferation of false information via social media has become an increasingly pressing problem. Digital means of communication and social media platforms facilitate the rapid spread of disinformation, which calls for the development of advanced techniques for identifying incorrect information. This dissertation endeavors to devise effective multimodal techniques for identifying fraudulent news, considering the noteworthy influence that deceptive stories have on society. The study proposes and evaluates multiple approaches, starting with a transformer-based model that uses word embeddings for accurate text classification. This model significantly outperforms baseline methods such as hybrid CNN and RNN, achieving higher accuracy. The dissertation also introduces a novel BERT-powered multimodal approach to fake news detection, combining textual data with text extracted from images to improve accuracy. By leveraging the strengths of the BERT-base-uncased model for text processing and integrating it with image text extraction via OCR, this approach calculates a confidence score indicating the likelihood of news being real or fake. Rigorous training and evaluation show significant improvements in performance compared to state-of-the-art methods. Furthermore, the study explores the complexities of multimodal fake news detection, integrating text, images, and videos into a unified framework. By employing BERT for textual analysis and CNN for visual data, the multimodal approach demonstrates superior performance over traditional models in handling multiple media formats. Comprehensive evaluations using datasets such as ISOT and MediaEval 2016 confirm the robustness and adaptability of these methods in combating the spread of fake news.
This dissertation contributes valuable insights to fake news detection, highlighting the effectiveness of transformer-based models, emotion-aware classifiers, and multimodal frameworks. The findings provide robust solutions for detecting misinformation across diverse platforms and data types, offering a path forward for future research in this critical area.

Item Restricted
IS THE METAVERSE FAILING? ANALYSING SENTIMENTS TOWARDS THE METAVERSE (The University of Manchester, 2024)
Alharbi, Manal Dowaihi; Batista-Navarro, Riza
This dissertation investigates Aspect-Based Sentiment Analysis (ABSA) within the context of the Metaverse to better understand opinions on this emerging digital environment, particularly from a news perspective. The Metaverse, a virtual space where users can engage in various experiences, has attracted both positive and negative opinions, making it crucial to explore these sentiments to gain insights into public perspectives. A novel dataset of news articles related to the Metaverse was created, and Target Aspect-Sentiment Detection (TASD) models were applied to analyze sentiments expressed toward various aspects of the Metaverse, such as device performance and user privacy. A key contribution of this research is the evaluation of the TASD architecture, TAS-BERT, and its enhanced version, Advanced TAS-BERT (ATAS-BERT), which performs each task separately, on two datasets: the newly created Metaverse dataset and the SemEval15 Restaurant dataset. Both architectures were tested with different Transformer-based models, including BERT, DeBERTa, RoBERTa, and ALBERT, to assess performance, particularly in cases where the target is implicit. The findings demonstrate the ability of advanced Transformer models to handle complex tasks, even when the target is implicit. ALBERT performed well on the simpler Metaverse dataset, while DeBERTa and RoBERTa showed superior performance on both datasets.
This dissertation also suggests several areas for improvement in future research, such as processing paragraphs instead of individual sentences, utilizing Meta AI models for dataset annotation to enhance accuracy, and designing architectures specifically for models like DeBERTa, RoBERTa, and ALBERT, rather than relying on architectures originally designed for BERT, to improve performance. Additionally, incorporating enriched context representations, such as Part-of-Speech tags, could further enhance model performance.

Item Restricted
Developing a Generative AI Model to Enhance Sentiment Analysis for the Saudi Dialect (Texas Tech University, 2024-12)
Aftan, Sulaiman; Zhuang, Yu
Sentiment Analysis (SA) is a fundamental task in Natural Language Processing (NLP) with broad applications across various real-world domains. While Arabic is a globally significant language with several well-developed NLP models for its standard form, achieving high performance in sentiment analysis for the Saudi Dialect (SD) remains challenging. A key factor contributing to this difficulty is the inadequacy of SD datasets for training NLP models. To address this inadequacy, this study introduces a novel method for adapting a high-resource language model to a closely related but low-resource dialect by combining moderate effort in SD data collection with generative AI. AraBERT was then fine-tuned using a combination of collected SD data and additional SD data generated by GPT. The results demonstrate a significant improvement in SD sentiment analysis performance compared to an AraBERT model fine-tuned on the collected SD data alone. This work highlights an efficient approach to generating high-quality datasets for fine-tuning a model trained on a high-resource language to perform well in a low-resource dialect.
Leveraging generative AI reduces the effort required for data collection, making our approach a promising avenue for future research in low-resource NLP tasks.

Item Restricted
Automatic Detection and Verification System for Arabic Rumor News on Twitter (University of Technology Sydney, 2026-04)
Karali, Sami; Chin-Teng, Lin
Language models have been extensively studied and applied in various fields in recent years. However, the majority of language models are designed for English and perform significantly better in English than in other languages, such as Arabic. The differences between English and Arabic in terms of grammar, writing, and word-forming structures pose significant challenges in applying English-based language models to Arabic content. Therefore, there is a critical need to develop and refine models and methodologies that can effectively process Arabic content. This research aims to address the gaps in Arabic language models by developing innovative machine learning (ML) and natural language processing (NLP) methodologies. We apply the developed model to Arabic rumor detection on Twitter to test its effectiveness. To achieve this, the research is divided into three fundamental phases: 1) efficiently collecting and pre-processing a comprehensive dataset of Arabic news tweets; 2) refining ML models through an Enhanced Convolutional Neural Network (ECNN) equipped with N-gram feature maps for accurate rumor identification; 3) improving decision-making precision in rumor verification via ensemble learning techniques. In the first phase, the research develops a methodology for the collection and pre-processing of Arabic news tweets, aiming to establish a dataset optimized for rumor detection analysis. Leveraging a blend of automated and manual processes, the research navigates the intricacies of the Arabic language, enhancing the dataset’s quality for ML applications.
This foundational phase removes irrelevant data and normalizes text, setting a precedent for accuracy in subsequent detection tasks. The second phase develops an Enhanced Convolutional Neural Network (ECNN) model, which incorporates N-gram feature maps for a deeper linguistic analysis of tweets. This ECNN model, designed specifically for the Arabic language, departs from traditional rumor detection models by combining spatial feature extraction with the contextual insights provided by N-gram analysis. Empirical results underscore the ECNN model’s superior performance, demonstrating a marked improvement in the accuracy and efficiency of detecting and classifying rumors. The final phase of the study explores the efficacy of ensemble learning methods in enhancing the robustness and accuracy of rumor detection systems. By combining the ECNN model with Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), and Gated Recurrent Unit (GRU) networks within a stacked ensemble framework, the research develops a composite approach that significantly outperforms the individual models. The result is a state-of-the-art system for rumor verification that identifies rumors more accurately than the individual models, as demonstrated by empirical testing and analysis. This research contributes to bridging the gap between English-centric language models and Arabic language processing, demonstrating the importance of tailored approaches for different languages in the field of ML and NLP.
These contributions signify a major step forward in the field of Arabic NLP and ML and offer practical solutions for the real-world challenge of rumor proliferation on social media platforms, ultimately fostering a more reliable digital environment for Arabic-speaking communities.

Item Restricted
Evaluating CAMeL-BERT for Sentiment Analysis of Customer Satisfaction with STC (Saudi Telecom Company) Services (The University of Sussex, 2024-08-15)
Alotaibi, Fahad; Pay, Jack
In the age of informatics, platforms such as Twitter (X) play a crucial role in measuring public sentiment in both the private and public sectors. This study explores the application of machine learning, particularly deep learning, to perform sentiment analysis on tweets about Saudi Telecom Company (STC) services in Saudi Arabia. A comparative analysis was conducted between pre-trained sentiment analysis models in English and in Arabic to assess their effectiveness in classifying sentiments. In addition, the study highlights a challenge in existing Arabic models, which are based on English model architectures but trained on varied datasets, such as Modern Standard Arabic and Classical Arabic (Al-Fus’ha). These models often lack the capability to handle the diverse Arabic dialects commonly used on social media. To overcome this issue, the study fine-tuned a pre-trained Arabic model using a dataset of tweets related to STC services, specifically focusing on the Saudi dialect. Data was collected from Twitter (X), focusing on mentions of the Saudi Telecom Company (STC). Both English and Arabic models were applied to this data, and their performance in sentiment analysis was evaluated. The fine-tuned Arabic model (CAMeL-BERT) demonstrated improved accuracy and a better understanding of local dialects compared to its initial version.
The results highlight the importance of model adaptation for specific languages and contexts and underline the potential of CAMeL-BERT in sentiment analysis for Arabic-language content. The findings offer practical implications for enhancing customer service and engagement through more accurate sentiment analysis of social media content in the service-provider sector.

Item Restricted
Towards Automated Security and Privacy Policies Specification and Analysis (Colorado State University, 2024-07-03)
Alqurashi, Saja; Ray, Indrakshi
Security and privacy policies, vital for information systems, are typically expressed in natural language documents. Security policy is represented by Access Control Policies (ACPs) within security requirements, initially drafted in natural language and subsequently translated into enforceable policy. The unstructured and ambiguous nature of the natural language documents makes the manual translation process tedious, expensive, labor-intensive, and prone to errors. Privacy policies, on the other hand, present unique challenges because of their length and complexity. The dense language and extensive content of privacy policies can be overwhelming, hindering both novice users and experts from fully understanding the practices related to data collection and sharing. The disclosure of these data practices to users, as mandated by privacy regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), is of utmost importance. To address these challenges, we have turned to Natural Language Processing (NLP) to automate the extraction of critical information from natural language documents and the analysis of these security and privacy policies.
Thus, this dissertation aims to address two primary research questions:
Question 1: How can we automate the translation of Access Control Policies (ACPs) from natural language expressions to the formal model of Next Generation Access Control (NGAC) and subsequently analyze the generated model?
Question 2: How can we automate the extraction and analysis of data practices from privacy policies to ensure alignment with privacy regulations (GDPR and CCPA)?
Addressing these research questions necessitates the development of a comprehensive framework comprising two key components. The first component, SR2ACM, focuses on translating natural language ACPs into the NGAC model. This component introduces a series of contributions to the analysis of security policies. At the core of our contributions is an automated approach to constructing ACPs within the NGAC specification directly from natural language documents. Our approach integrates machine learning with software testing, a novel methodology to ensure the quality of the extracted access control model. The second component, Privacy2Practice, is designed to automate the extraction and analysis of the data practices from privacy policies written in natural language. We have developed an automated method to extract data practices mandated by privacy regulations and to analyze the disclosure of these data practices within the privacy policies. The novelty of this research lies in creating a comprehensive framework that identifies the critical elements within security and privacy policies. This framework thus enables automated extraction and analysis of both types of policies directly from natural language documents.

Item Restricted
Semantic Analysis of Amazon Reviews of Sustainable Products (University of Leeds, 2024-02-18)
Alotaibi, Amal; Dimitrova, Vania
Online shopping has grown to be an essential part of modern living, garnering a wealth of customer input.
This project advances the field of consumer feedback mining and semantic and sentiment analysis of customer reviews, which, when applied effectively, can enhance goods, services, or marketing initiatives. This project proposes a framework using Natural Language Processing (NLP) techniques to find customer preferences related to sustainability by mining customer review (CR) text. First, we implement the LDA and sLDA models using the Gensim package in Python to extract sustainability topics from CR. After that, we implement the BERTopic model to find the sustainability aspects in CR. Then, the overall sentiment for every review in each topic is calculated using the VADER sentiment library in Python. Lastly, we interpret the results and generate helpful insights for brand managers. The study uses Amazon product review data for sustainable Food and Grocery products. The findings of the proposed framework are promising, as we were able to identify the most discussed topics in sustainability aspects of products and produce an assessment that provides information about the aspects that customers are most satisfied with and those that can be improved. However, the sLDA and BERTopic models achieved the goal but fell short of expectations; BERTopic in particular was not accurate enough for weakly supervised text classification. The VADER sentiment tool also did not meet expectations because of the complexity of CR. Nevertheless, the text analysis specialist found that the structure is flexible enough to allow for future development and increased usage.
Ultimately, we think that these data will help brand managers create and improve future products, which will raise consumer satisfaction and boost revenue and profitability.

Item Restricted
EXPLORING LANGUAGE MODELS AND QUESTION ANSWERING IN BIOMEDICAL AND ARABIC DOMAINS (University of Delaware, 2024-05-10)
Alrowili, Sultan; Shanker, K.Vijay
Despite the success of the Transformer model and its variations (e.g., BERT, ALBERT, ELECTRA, T5) in addressing NLP tasks, similar success is not achieved when these models are applied to specific domains (e.g., biomedical) and limited-resource languages (e.g., Arabic). This research addresses some of the challenges in applying Transformer models to specialized domains and to languages that lack language processing resources. One reason for reduced performance in limited domains might be the lack of quality contextual representations. We address this issue by adapting different types of language models and introducing five BioM-Transformer models for the biomedical domain and Funnel transformer and T5 models for the Arabic language. For each of our models, we present experiments studying the impact of design factors (e.g., corpora and vocabulary domain, model scale, architecture design) on performance and efficiency. Our evaluation of BioM-Transformer models shows that they obtain state-of-the-art results on several biomedical NLP tasks and rank as top-performing models on the BLURB leaderboard. The evaluation of our small-scale Arabic Funnel and T5 models shows that they achieve comparable performance to existing Arabic models at a lower fine-tuning cost. Further, our base-scale Arabic language models extend state-of-the-art results on several Arabic NLP tasks while maintaining a comparable fine-tuning cost to existing base-scale models.
Next, we focus on the question-answering task, specifically tackling issues in specialized domains and low-resource languages, such as the limited size of question-answering datasets and the limited topic coverage within them. We employ several methods to address these issues in the biomedical domain, including the use of models adapted to the domain and Task-to-Task Transfer Learning. We evaluate the effectiveness of these methods at the BioASQ10 (2022) challenge, showing that we achieved the top-performing system on several batches of the BioASQ10 challenge. In Arabic, we address similar issues by introducing a novel approach to create question-answer-passage triplets, and propose a pipeline, Pair2Passage, to create large QA datasets. Using this method and the pipeline, we create the ArTrivia dataset, a new Arabic question-answering dataset comprising more than 10,000 high-quality question-answer-passage triplets. We present a quantitative and qualitative analysis of ArTrivia that shows the importance of some often overlooked components, such as answer normalization, in enhancing the quality of the question-answer dataset and future annotation. In addition, our evaluation shows that ArTrivia can be used to build a question-answering model that addresses the out-of-distribution issue in existing Arabic QA datasets.

Item Restricted
Legal Judgment Prediction for Canadian Appeal Cases (University of Ottawa, 2024)
Almuslim, Intisar; Inkpen, Fiana
Law is one of the knowledge domains that are most reliant on textual material. In this age of legal big data, and with the increased availability of legal text online, many researchers have started working on the development of legal intelligent systems and applications. These intelligent systems can provide great services and solve many problems in the legal domain.
Over the last few years, researchers have focused on predicting judicial case outcomes using Natural Language Processing (NLP) and Machine Learning (ML) methods over case documents. Legal Judgment Prediction (LJP) is the task of automatically predicting the outcome of a court case given the case description. To the best of our knowledge, no prior research with this intention has been conducted in English for appeal courts in Canada, as of 2023. Our proposed methodology focuses on predicting the outcomes of cases by looking only at the text of the cases. Because appeal court decisions are often binary, as in ’Allow’ or ’Dismiss’, the task is defined as a binary classification problem. This is the general approach in the literature as well. However, many of the previous LJP approaches utilized traditional classifiers or standard general language models (LMs). In our thesis, we constructed a Canadian Appeal-Law dataset (CanAL-DS) that contains a collection of decisions from different higher courts in Canada. In addition, we further pre-trained the LegalBERT model on our collected corpus, which combines around 50,000 documents of Canadian case law and legislation, resulting in (CanAL) LegalBERT, a Canadian Appeal-Law BERT-based legal model. Moreover, we proposed a novel Ensemble-Hierarchical CanAL (EH-CanAL) architecture that simulates the actual voting setting in appellate courts, showing great promise in LJP performance within Canadian case law. We improved the architecture with a multi-task component (MEH-CanAL) to help the model identify which legal paragraphs require the most attention and to facilitate its explainability. Results from our study demonstrate the potential for the proposed approaches to reshape traditional judicial decision-making and the efficacy of domain-specific language models.
Through this study, we hope to establish the basis for future research on the appellate law system of Canada and offer a baseline for future work.

Item Restricted
Towards Numerical Reasoning in Machine Reading Comprehension (Imperial College London, 2024-02-01)
Al-Negheimish, Hadeel; Russo, Alessandra; Madhyastha, Pranava
Answering questions about a specific context often requires integrating multiple pieces of information and reasoning about them to arrive at the intended answer. Reasoning in natural language for machine reading comprehension (MRC) remains a significant challenge. In this thesis, we focus on numerical reasoning tasks. As opposed to current black-box approaches that provide little evidence of their reasoning process, we propose a novel approach that facilitates interpretable and verifiable reasoning by using Reasoning Templates for question decomposition. Our evaluations hinted at the existence of problematic behaviour in numerical reasoning models, underscoring the need for a better understanding of their capabilities. We conduct, as a second contribution of this thesis, a controlled study to assess how well current models understand questions and to what extent such models are basing their answers on textual evidence. Our findings indicate that applying transformations that obscure or destroy the syntactic and semantic properties of the questions does not change the output of the top-performing models. This behaviour reveals serious holes in how the models work. It calls into question evaluation paradigms that only use standard quantitative measures such as accuracy and F1 scores, as they lead to an illusion of progress. To improve the reliability of numerical reasoning models in MRC, we propose and demonstrate, as our third contribution, the effectiveness of a solution to one of these fundamental problems: catastrophic insensitivity to word order.
We do this by FORCED INVALIDATION: training the model to flag samples that cannot be reliably answered. We show it is highly effective at preserving word order importance in machine reading comprehension tasks and generalises well to other natural language understanding tasks. While our Reasoning Templates are competitive with the state of the art on a single type, engineering them incurs a considerable overhead. Leveraging our better insights on natural language understanding and concurrent advancements in few-shot learning, we conduct a first investigation to overcome scalability limitations. Our fourth contribution combines large language models for question decomposition with symbolic rule learning for answer recomposition; we surpass our previous results on Subtraction questions and generalise to more reasoning types.
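
The Forced Invalidation idea described in the abstract above (training a model to flag samples that cannot be reliably answered, with a focus on word order) can be illustrated with a small data-preparation sketch. This is not the thesis's implementation: the function names `shuffle_words` and `add_invalid_samples`, the `INVALID` label, and the choice of word shuffling as the corrupting transformation are assumptions made purely for illustration.

```python
import random

# Hypothetical label a model would be trained to output for corrupted questions.
INVALID = "[INVALID]"


def shuffle_words(question, rng):
    """Return the question with its words reordered, when a different order exists."""
    words = question.split()
    if len(set(words)) < 2:
        return question  # a single (or repeated) word cannot be reordered
    shuffled = words[:]
    while shuffled == words:
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def add_invalid_samples(samples, seed=0):
    """Pair each (question, answer) sample with a word-shuffled copy labelled INVALID.

    Assumption: word shuffling stands in for whatever corruption makes a
    question unanswerable; the real training setup may differ.
    """
    rng = random.Random(seed)
    augmented = []
    for question, answer in samples:
        augmented.append((question, answer))
        corrupted = shuffle_words(question, rng)
        if corrupted != question:
            augmented.append((corrupted, INVALID))
    return augmented
```

Under these assumptions, a reading-comprehension model fine-tuned on the augmented set would see both the original question with its answer and a scrambled copy whose target is the invalid flag, which is one way to make its output depend on word order.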