SACM - United States of America
Permanent URI for this collection: https://drepo.sdl.edu.sa/handle/20.500.14154/9668
Search Results
5 results
Item Restricted
Disinformation Classification Using Transformer-Based Machine Learning (Howard University, 2024)
Alshaqi, Mohammed Al; Rawat, Danda B.

The proliferation of false information via social media has become an increasingly pressing problem. Digital means of communication and social media platforms facilitate the rapid spread of disinformation, which calls for the development of advanced techniques for identifying incorrect information. This dissertation endeavors to devise effective multimodal techniques for identifying fraudulent news, considering the noteworthy influence that deceptive stories have on society. The study proposes and evaluates multiple approaches, starting with a transformer-based model that uses word embeddings for accurate text classification. This model significantly outperforms baseline methods such as hybrid CNN and RNN models, achieving higher accuracy.

The dissertation also introduces a novel BERT-powered multimodal approach to fake news detection, combining textual data with text extracted from images to improve accuracy. By leveraging the strengths of the BERT-base-uncased model for text processing and integrating it with image text extraction via OCR, this approach calculates a confidence score indicating the likelihood of news being real or fake. Rigorous training and evaluation show significant improvements in performance compared to state-of-the-art methods.

Furthermore, the study explores the complexities of multimodal fake news detection, integrating text, images, and videos into a unified framework. By employing BERT for textual analysis and CNNs for visual data, the multimodal approach demonstrates superior performance over traditional models in handling multiple media formats. Comprehensive evaluations using datasets such as ISOT and MediaEval 2016 confirm the robustness and adaptability of these methods in combating the spread of fake news.
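The two-modality confidence score described above can be sketched roughly as follows. The model calls are stubbed out (the dissertation uses a fine-tuned BERT-base-uncased classifier and an OCR engine; trivial stand-ins appear here), and the weighted-fusion scheme and `ALPHA` weight are illustrative assumptions, not the authors' actual formulation.

```python
# Hypothetical sketch: fuse a score on the article text with a score on
# OCR-extracted image text into one confidence that the item is real.
# classify_text() and extract_image_text() are stand-ins for the real
# BERT classifier and OCR pass; ALPHA is an assumed fusion weight.

ALPHA = 0.7  # assumed weight for the article-text score vs. the OCR-text score

def classify_text(text: str) -> float:
    """Stand-in for a fine-tuned BERT-base-uncased classifier head.
    Returns P(real); a trivial keyword heuristic is used for illustration."""
    return 0.1 if "miracle cure" in text.lower() else 0.9

def extract_image_text(image_path: str) -> str:
    """Stand-in for an OCR pass (e.g., Tesseract) over the attached image."""
    return "breaking miracle cure found"  # pretend OCR output

def fused_confidence(article: str, image_path: str, alpha: float = ALPHA) -> float:
    """Weighted fusion of the two modality scores into one confidence."""
    text_score = classify_text(article)
    ocr_score = classify_text(extract_image_text(image_path))
    return alpha * text_score + (1 - alpha) * ocr_score

score = fused_confidence("Officials confirm the quarterly report.", "post.png")
label = "real" if score >= 0.5 else "fake"
```

A late-fusion weighted average is only one way to combine modalities; the dissertation's exact scoring function is not specified in the abstract.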
This dissertation contributes valuable insights to fake news detection, highlighting the effectiveness of transformer-based models, emotion-aware classifiers, and multimodal frameworks. The findings provide robust solutions for detecting misinformation across diverse platforms and data types, offering a path forward for future research in this critical area.

Item Restricted
Developing a Generative AI Model to Enhance Sentiment Analysis for the Saudi Dialect (Texas Tech University, 2024-12)
Aftan, Sulaiman; Zhuang, Yu

Sentiment Analysis (SA) is a fundamental task in Natural Language Processing (NLP) with broad applications across various real-world domains. While Arabic is a globally significant language with several well-developed NLP models for its standard form, achieving high performance in sentiment analysis for the Saudi Dialect (SD) remains challenging. A key factor contributing to this difficulty is the inadequacy of SD datasets for training NLP models. This study introduces a novel method for adapting a high-resource language model to a closely related but low-resource dialect, combining moderate effort in SD data collection with generative AI to address the inadequacy of SD datasets. AraBERT was then fine-tuned on a combination of collected SD data and additional SD data generated by GPT. The results demonstrate a significant improvement in SD sentiment analysis performance compared to an AraBERT model fine-tuned only on the collected SD datasets. This highlights an efficient way to generate high-quality datasets for fine-tuning a model trained on a high-resource language so that it performs well in a low-resource dialect.
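The augmentation step described above, expanding a small collected Saudi-dialect set with generated examples before fine-tuning, can be sketched as below. The GPT call and the AraBERT fine-tune are stubbed out, and every name here is illustrative rather than taken from the dissertation.

```python
# Hedged sketch of dataset augmentation for a low-resource dialect: a small
# collected set is expanded with label-preserving generated paraphrases, and
# the combined set would then be passed to an AraBERT fine-tuning loop.
# generate_sd_examples() stands in for the real generative-model prompt.

from typing import List, Tuple

Example = Tuple[str, str]  # (sd_text, sentiment_label)

def generate_sd_examples(seed: Example, n: int) -> List[Example]:
    """Stand-in for prompting a generative model with a labeled seed example
    and requesting n label-preserving paraphrases in the same dialect."""
    text, label = seed
    return [(f"{text} (paraphrase {i})", label) for i in range(n)]

def build_training_set(collected: List[Example], per_seed: int) -> List[Example]:
    """Collected SD data plus generated SD data, as the abstract describes."""
    augmented = list(collected)
    for seed in collected:
        augmented.extend(generate_sd_examples(seed, per_seed))
    return augmented

collected = [("wayed zain", "positive"), ("ma yiswa", "negative")]
train = build_training_set(collected, per_seed=3)
# train now holds the 2 collected plus 6 generated examples.
```

Keeping the seed label attached to each generated paraphrase is the key assumption; in practice generated examples usually need a quality filter before fine-tuning.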
Leveraging generative AI reduces the effort required for data collection, making our approach a promising avenue for future research in low-resource NLP tasks.

Item Restricted
Towards Automated Security and Privacy Policies Specification and Analysis (Colorado State University, 2024-07-03)
Alqurashi, Saja; Ray, Indrakshi

Security and privacy policies, vital for information systems, are typically expressed in natural language documents. Security policy is represented by Access Control Policies (ACPs) within security requirements, initially drafted in natural language and subsequently translated into enforceable policy. The unstructured and ambiguous nature of natural language documents makes the manual translation process tedious, expensive, labor-intensive, and prone to errors. Privacy policies, on the other hand, present unique challenges through their length and complexity. The dense language and extensive content of privacy policies can be overwhelming, hindering both novice users and experts from fully understanding the practices related to data collection and sharing. The disclosure of these data practices to users, as mandated by privacy regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), is of utmost importance. To address these challenges, we have turned to Natural Language Processing (NLP) to automate the extraction of critical information from natural language documents and the analysis of those security and privacy policies. This dissertation thus aims to address two primary research questions:

Question 1: How can we automate the translation of Access Control Policies (ACPs) from natural language expressions to the formal model of Next Generation Access Control (NGAC) and subsequently analyze the generated model?

Question 2: How can we automate the extraction and analysis of data practices from privacy policies to ensure alignment with privacy regulations (GDPR and CCPA)?
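The ACP-to-NGAC translation targeted by Question 1 begins by pulling policy elements out of sentences. A hypothetical sketch of that first step follows; the dissertation uses a machine-learning extractor, whereas a regex stands in here, and the pattern, function names, and example sentence are all assumptions.

```python
# Illustrative sketch (not the actual SR2ACM implementation) of extracting a
# (subject, action, resource) triple from a natural-language ACP sentence so
# it can later be mapped onto NGAC assignments and associations.

import re
from typing import Optional, Tuple

ACP_PATTERN = re.compile(
    r"^(?P<subject>[\w\s]+?)\s+(?:can|may|is allowed to)\s+"
    r"(?P<action>\w+)\s+(?:the\s+)?(?P<resource>[\w\s]+?)\.?$",
    re.IGNORECASE,
)

def extract_acp(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return (subject, action, resource), or None if the sentence
    does not look like an access-control statement."""
    m = ACP_PATTERN.match(sentence.strip())
    if not m:
        return None
    return (m.group("subject").strip().lower(),
            m.group("action").strip().lower(),
            m.group("resource").strip().lower())

triple = extract_acp("A nurse can read the patient record.")
# A matched triple becomes a candidate NGAC relation; unmatched sentences
# fall through for manual review.
```

A pattern-based extractor like this is brittle, which is exactly why the dissertation pairs machine learning with software testing to check the quality of the extracted model.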
Addressing these research questions necessitates the development of a comprehensive framework comprising two key components. The first component, SR2ACM, focuses on translating natural language ACPs into the NGAC model. This component introduces a series of innovative contributions to the analysis of security policies. At the core of our contributions is an automated approach to constructing ACPs within the NGAC specification directly from natural language documents. Our approach integrates machine learning with software testing, a novel methodology to ensure the quality of the extracted access control model. The second component, Privacy2Practice, is designed to automate the extraction and analysis of data practices from privacy policies written in natural language. We have developed an automated method to extract the data practices mandated by privacy regulations and to analyze the disclosure of these data practices within the privacy policies. The novelty of this research lies in creating a comprehensive framework that identifies the critical elements within security and privacy policies. This innovative framework thus enables automated extraction and analysis of both types of policies directly from natural language documents.

Item Restricted
Exploring Language Models and Question Answering in Biomedical and Arabic Domains (University of Delaware, 2024-05-10)
Alrowili, Sultan; Shanker, K. Vijay

Despite the success of the Transformer model and its variants (e.g., BERT, ALBERT, ELECTRA, T5) on NLP tasks, similar success has not been achieved when these models are applied to specialized domains (e.g., biomedical) and low-resource languages (e.g., Arabic). This research addresses challenges in applying Transformer models to specialized domains and to languages that lack language-processing resources. One reason for reduced performance in such domains may be the lack of quality contextual representations.
We address this issue by adapting different types of language models, introducing five BioM-Transformer models for the biomedical domain and Funnel Transformer and T5 models for the Arabic language. For each of our models, we present experiments studying the impact of design factors (e.g., corpus and vocabulary domain, model scale, architecture design) on performance and efficiency. Our evaluation of the BioM-Transformer models shows that we obtain state-of-the-art results on several biomedical NLP tasks, achieving the top-performing models on the BLURB leaderboard. The evaluation of our small-scale Arabic Funnel Transformer and T5 models shows that we achieve comparable performance while using less computation than the fine-tuning cost of existing Arabic models. Further, our base-scale Arabic language models extend state-of-the-art results on several Arabic NLP tasks while maintaining a fine-tuning cost comparable to existing base-scale models.

Next, we focus on the question-answering task, specifically tackling issues in specialized domains and low-resource languages, such as the limited size of question-answering datasets and the limited topic coverage within them. We employ several methods to address these issues in the biomedical domain, including the use of domain-adapted models and Task-to-Task Transfer Learning. We evaluate the effectiveness of these methods at the BioASQ10 (2022) challenge, achieving the top-performing system on several batches of the BioASQ10 challenge. In Arabic, we address similar issues by introducing a novel approach to creating question-answer-passage triplets and by proposing a pipeline, Pair2Passage, to create large QA datasets. Using this method and pipeline, we create ArTrivia, a new Arabic question-answering dataset comprising more than 10,000 high-quality question-answer-passage triplets.
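When building and evaluating question-answer-passage triplets like these, answers that differ only in surface form must compare equal. A minimal sketch of answer normalization follows; the exact rules used for ArTrivia are not given in the abstract, so the diacritic stripping and orthographic unifications below are assumed, illustrative choices.

```python
# Hedged sketch of Arabic answer normalization for QA-triplet matching:
# strip diacritics (harakat), unify common alef/taa-marbuta/alef-maqsura
# variants, and collapse whitespace/edge punctuation. Illustrative only.

import re
import unicodedata

def normalize_answer(answer: str) -> str:
    """Canonicalize an answer string before matching it against a passage."""
    text = answer.strip().lower()
    # Drop Arabic diacritics via Unicode combining-mark removal.
    text = "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(ch))
    # Unify common Arabic orthographic variants (assumed rules).
    text = re.sub("[إأآ]", "ا", text)
    text = text.replace("ة", "ه").replace("ى", "ي")
    # Collapse whitespace and strip punctuation at the edges.
    text = re.sub(r"\s+", " ", text).strip(" .,؟!\"'")
    return text

# Two surface forms of the same answer now compare equal:
same = normalize_answer("الرياض.") == normalize_answer("الرِّيَاض")
```

Without a step like this, exact-match scoring undercounts correct answers, which is presumably why the analysis singles out answer normalization as an often-overlooked component.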
We presented a quantitative and qualitative analysis of ArTrivia that shows the importance of often-overlooked components, such as answer normalization, in enhancing the quality of question-answering datasets and future annotation. In addition, our evaluation shows the ability of ArTrivia to build a question-answering model that can address the out-of-distribution issue in existing Arabic QA datasets.

Item Restricted
Toward Leveraging Artificial Intelligence to Support the Identification of Accessibility Challenges (2023)
Aljedaani, Wajdi Mohammed; Ludi, Stephanie; Mkaouer, Mohamed Wiem

Context: Today, mobile devices support disabled people and make their lives easier through their high accessibility and capability, e.g., finding accessible locations, picture- and voice-based communication, customized user interfaces, and vocabulary levels. These accessibility frameworks are directly integrated, as libraries, into various apps, providing them with accessibility functions. Like any other software, these frameworks regularly encounter errors, which developers report in the form of bug reports and users report in user reviews. User reviews include insights that are useful for app evolution. Reports related to accessibility faults need to be fixed urgently, since such faults significantly hinder the usability of apps. Recent studies have shown that developers may apply accessibility strategies by manually inspecting full or partial sets of reports to determine whether accessibility reports exist; unfortunately, these strategies rely entirely on the individual developer. As the number of received reviews grows, manually analyzing them becomes tedious and time-consuming, especially when searching for accessibility reviews.
Objective: The goal of this thesis is to support the automated identification of accessibility concerns in user reviews and bug reports, to help technology professionals prioritize their handling and thus create more inclusive apps. In particular, we propose a model that takes accessibility user reviews or bug reports as input and learns their keyword-based features to make a classification decision, for a given review, on whether it is about accessibility or not. To complement this goal, we aim to reveal insights from deaf and hard-of-hearing students about Blackboard, one of the most common Learning Management Systems (LMS) used by many universities, especially during the COVID-19 pandemic. The aim is to explore the accessibility challenges and concerns deaf and hard-of-hearing students faced in their e-learning experiences during the sudden shift to online learning during COVID-19.

Method: Our empirically driven study follows a mixture of qualitative and quantitative methods. We text-mine user reviews and bug report documentation. We identify the accessibility reports and categorize them based on the BBC standards and guidelines for mobile accessibility and the Web Content Accessibility Guidelines (WCAG 2.1). Then, we automatically classify a large set of user reviews and bug reports using the various classification models presented in the literature. After that, we used a mixed-methods approach, conducting a survey and interviews with deaf and hard-of-hearing students to identify their challenges and concerns regarding accessibility in the e-learning platform Blackboard.

Result: We introduced models that can accurately identify accessibility reviews and bug reports and automate their detection.
Our models (1) outperform two baselines, namely a keyword-based detector and a random classifier, and (2) achieve an accuracy of 91% with a relatively small training dataset, with accuracy improving further as the size of the training dataset increases. Our mixed methods with deaf and hard-of-hearing students revealed several difficulties, such as inadequate support and inaccessible content in learning systems.

Conclusion: Our models can automatically classify app reviews and bug reports as accessibility-related or not, so developers can easily detect accessibility issues in their products and improve them into more accessible and inclusive apps using their users' input. Our goal is to create sustainable change by including a model in the developer's software-maintenance pipeline and by raising awareness of existing errors that hinder the accessibility of mobile apps, which is a pressing need. In light of our findings from the Blackboard case study, Blackboard and its course material are not easily accessible to deaf and hard-of-hearing students. Thus, deaf students found learning extremely stressful during the pandemic.
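The keyword-based detector used above as a baseline can be sketched as follows. The thesis's learned models go beyond this; the lexicon, threshold, and function names here are illustrative assumptions, not the thesis's actual keyword list.

```python
# Sketch of a keyword-based accessibility-report detector, in the spirit of
# the baseline the thesis compares against: represent a review or bug report
# by accessibility-keyword hit counts and threshold the total.

ACCESSIBILITY_KEYWORDS = {  # assumed lexicon, not the thesis's actual list
    "screen reader", "talkback", "voiceover", "contrast", "font size",
    "captions", "color blind", "hard of hearing", "accessibility",
}

def keyword_features(report: str) -> dict:
    """Map a report to {keyword: hit-count} features."""
    text = report.lower()
    return {kw: text.count(kw) for kw in ACCESSIBILITY_KEYWORDS}

def is_accessibility_report(report: str, threshold: int = 1) -> bool:
    """Classify as accessibility-related if enough keyword hits are found."""
    return sum(keyword_features(report).values()) >= threshold

print(is_accessibility_report("TalkBack stops reading the menu after update"))
print(is_accessibility_report("App crashes when I rotate the screen"))
```

Such a detector misses accessibility reports phrased without lexicon words, which is precisely the gap a learned classifier over these features is meant to close.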