Saudi Cultural Missions Theses & Dissertations
Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10
12 results
Search Results
Item Restricted
Enhancing Opinion Mining in E-Commerce: The Role of Text Segmentation and K-Means Clustering in Transformer-Based Consumer Trust Analysis (Texas Tech University, 2025)
Alkhalil, Bandar; Zhuang, Yu
As the E-commerce market expands, customer reviews have become essential for companies aiming to understand consumer opinions. Building consumer trust is critical to the success of E-commerce businesses, as it significantly influences purchasing decisions. Understanding how to build this trust is essential, especially given that 93% of consumers report that online reviews influence their purchasing choices. Trust in E-commerce is commonly understood as a consumer’s willingness to rely on an online seller based on expectations of reliability, security, and competence. In other words, various factors affect consumer purchase decisions when shopping online. Customer reviews are crucial for gauging consumer opinions and can help identify the factors influencing trust in online shopping. However, current research primarily focuses on using transformer models to classify reviews as positive, negative, or neutral or to predict customer ratings based on the content of those reviews. This dissertation introduces a new approach that expands the capabilities of pre-trained transformer models, such as GPT, BART, and BERT, to extract trust factors from customer reviews, addressing a significant gap in the current literature. The research notably improves the model’s accuracy by utilizing text segmentation. Comparative analysis between segmented and unsegmented datasets, benchmarked against manually annotated reviews, demonstrates that text segmentation increases accuracy. Specifically, GPT-3.5 achieved an accuracy of 86.9%, representing a 15.5 percentage point improvement over unsegmented data. These findings validate segmentation as a critical technique for enhancing granularity and enabling models to identify nuanced trust factors effectively.
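As a rough illustration of the segmentation-and-benchmarking idea described in this abstract, the sketch below splits a review into sentence-like segments and scores predicted labels against manual annotations. The abstract does not say how segments are produced or how accuracy is computed, so the punctuation-based splitter, the exact-match accuracy, and the names `segment_review` and `accuracy` are all assumptions made for illustration only.

```python
import re


def segment_review(review):
    """Split a review into sentence-like segments.

    Assumption: segments are delimited by '.', '!' or '?' followed by
    whitespace. The dissertation may segment differently.
    """
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [part for part in parts if part]


def accuracy(predicted, gold):
    """Share of predicted labels that exactly match the manual labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one label is required")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Hypothetical review and labels, used only to exercise the functions.
segments = segment_review(
    "Fast delivery. The seller answered quickly! Would the refund work?"
)
score = accuracy(
    ["delivery", "service", "refund"],
    ["delivery", "service", "security"],
)
```

Each segment would then be passed to the model separately, so a review that mentions several trust factors yields one prediction per segment rather than a single prediction for the whole text.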
To further validate the effectiveness of our approach, a second experiment was conducted using a different dataset to determine whether segmentation would yield comparable or even better performance in terms of accuracy. In this experiment, text segmentation was applied before the initial factor extraction to enhance the identification of trust factors. However, the large number of extracted factors created new challenges, as many were redundant or represented similar concepts under different names, complicating large-scale analysis. To address this challenge, K-means clustering, combined with the elbow method, successfully standardized the 2,890 extracted factors and grouped them into nine key categories. This refined process further improved the GPT-3.5 model’s accuracy to 88.5%, demonstrating the scalability and robustness of the proposed methodology in handling large-scale review datasets. The findings highlight the centrality of text segmentation and underscore the crucial role of normalization techniques, particularly K-means clustering, in managing large-scale review datasets. By offering a scalable and adaptable framework, this dissertation provides actionable insights for improving E-commerce analytics. Furthermore, it lays the groundwork for broader applications, extending its suitability beyond E-commerce to other areas where manual labeling is challenging or resource-intensive.

Item Restricted
Embracing Emojis in Sarcasm Detection to Enhance Sentiment Analysis (University of Southampton, 2025)
Alsabban, Malak Abdullah; Hall, Wendy; Weal, Mark
People frequently share their ideas, concerns, and emotions on social networks, making sentiment analysis on social media increasingly important for understanding public opinion and user sentiment. Sentiment analysis provides an effective means of interpreting people's attitudes towards various topics, individuals, or ideas.
This thesis introduces the creation of an Emoji Dictionary (ED) to harness the rich contextual information conveyed by emojis. It acts as a valuable resource for deciphering the emotional nuances embedded in textual content, contributing to a deeper understanding of sentiment. In addition, the research explores the complex domain of sarcasm detection by proposing a novel Sarcasm Detection Approach (SDA). This approach identifies sarcasm by analysing conflicts between textual content and the accompanying emojis. The thesis addresses key challenges in sentiment analysis by evaluating and comparing emoji dictionaries and sarcasm detection approaches to enhance sentiment classification. Extensive experimentation on diverse datasets rigorously assesses the effectiveness of these methods in improving sentiment analysis accuracy and sarcasm detection performance, particularly in emoji-rich datasets. The findings highlight the crucial role of emojis as contextual cues, underscoring their value in sentiment analysis and sarcasm detection tasks. The outcomes of this thesis aim to advance sentiment analysis methodologies by offering insights into preprocessing strategies, leveraging the expressive potential of emojis through the Emoji Dictionary (ED), and introducing the Sarcasm Detection Approach (SDA). The research demonstrates that integrating emojis through these tools substantially enhances both sentiment analysis and sarcasm detection. By utilizing these tools, the study not only improves model performance but also opens avenues for further exploration into the nuanced complexities of digital communication.

Item Restricted
Evaluating Chess Moves by Analysing Sentiments in Teaching Textbooks (The University of Manchester, 2025)
Alrdahi, Haifa Saleh T; Batista-navarro, Riza
The rules of playing chess are simple to comprehend, and yet it is challenging to make accurate decisions in the game.
Hence, chess lends itself well to the development of an artificial intelligence (AI) system that simulates real-life problems, such as decision-making processes. Learning chess strategies has been widely investigated, with most studies focused on learning from previous games using search algorithms. Chess textbooks encapsulate grandmaster knowledge, which explains playing strategies. This thesis investigates three research questions on the possibility of unlocking hidden knowledge in chess teaching textbooks. Firstly, we contribute to the chess domain a new heterogeneous chess dataset, “LEAP”, which consists of structured data that represents the environment (the “board state”) and unstructured data that represents explanations of strategic moves. Additionally, we build a larger unstructured synthetic chess dataset to improve large language models' familiarity with the chess teaching context. With the LEAP dataset, we examined the characteristics of chess teaching textbooks and the challenges of using such a data source for training a Natural Language (NL)-based chess agent. We show by empirical experiments that the common approach of sentence-level evaluation of moves is not insightful. Secondly, we observed that chess teaching textbooks focus on explaining the move’s outcome for both players and often discuss multiple moves in one sentence, which confused the models in move evaluation. To address this, we introduce an auxiliary task that evaluates individual moves at the verb-phrase level. Furthermore, we show by empirical experiments the usefulness of adopting the Aspect-based Sentiment Analysis (ABSA) approach as an evaluation method for chess moves expressed in free text. With this, we have developed a fine-grained annotation scheme and a small-scale dataset for the chess-ABSA domain, “ASSESS”.
Finally, we examined the performance of a fine-tuned LLM encoder model for chess-ABSA and showed that the performance of the model for evaluating chess moves is comparable to scores obtained from a chess engine, Stockfish. Thirdly, we developed an instruction-based explanation framework, using prompt engineering with zero-shot learning to generate an explanation text of the move outcome. The framework also used a chess-ABSA decoder model that uses an instruction format; we evaluated its performance on the ASSESS dataset, which shows an overall improvement in performance. Finally, we evaluate the performance of the framework and discuss the possibilities and current challenges of generating large-scale unstructured data for chess, and the effect on the chess-ABSA decoder model.

Item Restricted
Disinformation Classification Using Transformer based Machine Learning (Howard University, 2024)
alshaqi, Mohammed Al; Rawat, Danda B
The proliferation of false information via social media has become an increasingly pressing problem. Digital means of communication and social media platforms facilitate the rapid spread of disinformation, which calls for the development of advanced techniques for identifying incorrect information. This dissertation endeavors to devise effective multimodal techniques for identifying fraudulent news, considering the noteworthy influence that deceptive stories have on society. The study proposes and evaluates multiple approaches, starting with a transformer-based model that uses word embeddings for accurate text classification. This model significantly outperforms baseline methods such as hybrid CNN and RNN, achieving higher accuracy. The dissertation also introduces a novel BERT-powered multimodal approach to fake news detection, combining textual data with extracted text from images to improve accuracy.
By leveraging the strengths of the BERT-base-uncased model for text processing and integrating it with image text extraction via OCR, this approach calculates a confidence score indicating the likelihood of news being real or fake. Rigorous training and evaluation show significant improvements in performance compared to state-of-the-art methods. Furthermore, the study explores the complexities of multimodal fake news detection, integrating text, images, and videos into a unified framework. By employing BERT for textual analysis and CNN for visual data, the multimodal approach demonstrates superior performance over traditional models in handling multiple media formats. Comprehensive evaluations using datasets such as ISOT and MediaEval 2016 confirm the robustness and adaptability of these methods in combating the spread of fake news. This dissertation contributes valuable insights to fake news detection, highlighting the effectiveness of transformer-based models, emotion-aware classifiers, and multimodal frameworks. The findings provide robust solutions for detecting misinformation across diverse platforms and data types, offering a path forward for future research in this critical area.

Item Restricted
AI-Driven Approaches for Privacy Compliance: Enhancing Adherence to Privacy Regulations (University of Warwick, 2024-02)
Alamri, Hamad; Maple, Carsten
This thesis investigates and explores some inherent limitations within the current privacy policy landscape, provides recommendations, and proposes potential solutions to address these issues. The first contribution of this thesis is a comprehensive study that addresses a significant gap in the literature. This study provides a detailed overview of the current landscape of privacy policies, covering both their limitations and proposed solutions, with the aim of identifying the most practical and applicable approaches for researchers in the field.
Second, the thesis tackles the challenge of privacy policy accessibility in app stores by introducing the App Privacy Policy Extractor (APPE) system. The APPE pipeline consists of various components, each developed to perform a specific task and provide insightful information about the apps' privacy policies. By analysing over two million apps in the iOS App Store, APPE offers unprecedented and comprehensive store-wide insights into policy distribution and can act as a mechanism for enforcing privacy policy requirements in app stores automatically. Third, the thesis investigates the issue of privacy policy complexity. By establishing generalisability across app categories and drawing attention to associated matters of time and cost, the study demonstrates that the current situation requires immediate and effective solutions. It suggests several recommendations and potential solutions. Finally, to enhance user engagement with privacy policies, a novel framework utilising a cost-effective unsupervised approach, based on the latest AI innovations, has been developed. The comparison of the findings of this study with state-of-the-art methods suggests that this approach can produce outcomes that are on par with those of human experts, or even surpass them, yet in a more efficient and automated manner.

Item Restricted
Evaluating CAMeL-BERT for Sentiment Analysis of Customer Satisfaction with STC (Saudi Telecom Company) Services (The University of Sussex, 2024-08-15)
Alotaibi, Fahad; Pay, Jack
In the age of informatics, platforms such as Twitter (X) play a crucial role in measuring public sentiment in both the private and public sectors. This study explores the application of machine learning, particularly deep learning, to perform sentiment analysis on tweets about Saudi Telecom Company (STC) services in Saudi Arabia.
A comparative analysis was conducted between pre-trained sentiment analysis models in English and in Arabic to assess their effectiveness in classifying sentiments. In addition, the study highlights a challenge in existing Arabic models, which are based on English model architectures but trained on varied datasets, such as Modern Standard Arabic and Classical Arabic (Al-Fus’ha). These models often lack the capability to handle the diverse Arabic dialects commonly used on social media. To overcome this issue, the study involved fine-tuning a pre-trained Arabic model using a dataset of tweets related to STC services, specifically focusing on the Saudi dialect. Data was collected from Twitter (X), focusing on mentions of the Saudi Telecom Company (STC). Both English and Arabic models were applied to this data, and their performance in sentiment analysis was evaluated. The fine-tuned Arabic model (CAMeL-BERT) demonstrated improved accuracy and a better understanding of local dialects compared to its initial version. The results highlight the importance of model adaptation for specific languages and contexts and underline the potential of CAMeL-BERT in sentiment analysis for Arabic-language content. The findings offer practical implications for enhancing customer service and engagement through more accurate sentiment analysis of social media content in the service-provider sector.

Item Restricted
Exploring Malnutrition in Residential Aged Care: A Study on Nursing Notes using Natural Language Processing and Large Language Models (University of Wollongong, 2024-03-21)
Alkhalaf, Mohammad; Yu, Ping
Population ageing has led to an increasing demand for services for older people. Residential aged care facilities (RACFs) in Australia provide a range of services for older people who can no longer live independently at home. These include accommodation, personal care, health care services and social and emotional support.
Despite efforts to provide comprehensive care, managing nutrition for older people in RACFs has been complex. Malnutrition has emerged as a prevalent issue within these facilities, raising serious health concerns. Therefore, understanding and addressing malnutrition has become a critical concern for the Australian government. To date, there has been a reliance on nutrition screening tools to assess older people’s nutritional care needs. Conducting these assessments requires adequate healthcare training and is time-consuming, so they are not carried out as frequently as needed to uncover the risk of malnutrition for older people in a timely manner. In Australia, the majority of RACFs have established electronic health record (EHR) systems to capture and record care recipients’ information. This includes medical diagnoses, regular nursing assessments, weight charts, care plans, periodic reviews, incident and infection reviews, and nursing progress reports. Therefore, RAC EHRs contain a wealth of information that can be mined to support aged care services. The advancement in natural language processing (NLP) technologies, in particular large language models (LLMs), provides an opportunity to uncover useful insights from the RAC EHRs. Therefore, this PhD research is dedicated to extending NLP technology to the under-studied area of RAC, and to designing, implementing and evaluating LLM applications in nutrition management among older individuals living in RACFs. It aims to design and develop a sophisticated machine learning framework capable of analysing both structured and unstructured EHR data to gain comprehensive insights into the malnutrition issue. Drawing on insights from the literature, the study begins by employing word embedding techniques integrated with cosine similarity and the UMLS ontology to extract nutrition-related terms from nursing notes in RACFs.
This uncovered the language style and terminology used by practising nurses and aged care workers in managing nutrition for the older people under their care. The subsequent development of 13 extraction rules identified relevant notes indicative of malnutrition, forming the basis for a training data set of 2,278 relevant nursing notes, which was used in the LLM implementation. To enhance the LLM's understanding of nursing notes, we randomly selected 500,000 notes for pre-training a domain-specific LLM based on the established RoBERTa model. This was followed by fine-tuning the LLM specifically for malnutrition note detection. Achieving an F1-score of 0.96, our model significantly surpassed previous models, ensuring more accurate classification of notes documenting malnutrition. Furthermore, we developed a framework integrating a generative LLM, Llama 2, and a retrieval-augmented generation (RAG) system to extract comprehensive summary information from malnutrition-related notes. This framework demonstrates high accuracy (90%) in identifying malnutrition risk factors from 1,399 notes. It generates detailed summaries about nutrition status from EHRs with 99% accuracy. Our study reveals a malnutrition prevalence rate of approximately 33% in the studied RACFs. There are 15 main categories and 43 subcategories of malnutrition risk factors. For the first time, this research identified the primary risk factors of malnutrition in RACFs, including poor appetite, which affects 17% of older people. This is followed by insufficient oral intake and dementia progression. To enhance malnutrition predictive capabilities, we fine-tuned the RAC domain-specific model to address the 512-token sequence length limitation of the RoBERTa model. This is achieved by extending the sequence length to support 1,536 tokens.
Augmented with risk factors, our model achieved an F1-score of 0.687, demonstrating its effectiveness in predicting malnutrition risk one month before the event onset. In conclusion, this research designs, develops and evaluates an innovative AI framework that leverages advanced AI technologies, particularly NLP and domain-specific LLMs, to tackle malnutrition among older people in residential aged care facilities. By analysing text data in EHRs, the AI framework identifies risk factors, summarises nutrition information, and predicts malnutrition one month before the event onset. After thorough evaluation by domain experts, the AI framework can be implemented as an automated assessment tool. Its implementation into aged care services will alleviate the time burden associated with nutrition care for health and aged care practitioners, supporting them in identifying risk factors of malnutrition for the older people under their care and in managing malnutrition efficiently. The framework’s scalability extends beyond residential aged care facilities. It can be further extended to other healthcare settings to improve nutrition care effectiveness and quality of life for consumers.

Item Restricted
Synonym-based Adversarial Attacks in Arabic Text Classification Systems (Clarkson University, 2024-05-21)
Alshahrani, Norah Falah S; Matthews, Jeanna
Text classification systems have been proven vulnerable to adversarial text examples, modified versions of the original text examples that are often unnoticed by human eyes, yet can force text classification models to alter their classification. Often, research works quantifying the impact of adversarial text attacks have been applied only to models trained in English. In this thesis, we introduce the first word-level study of adversarial attacks in Arabic.
Specifically, we use a synonym (word-level) attack using a Masked Language Modeling (MLM) task with a BERT model in a black-box setting to assess the robustness of the state-of-the-art text classification models to adversarial attacks in Arabic. To evaluate the grammatical and semantic similarities of the newly produced adversarial examples using our synonym BERT-based attack, we invite four human evaluators to assess and compare the produced adversarial examples with their original examples. We also study the transferability of these newly produced Arabic adversarial examples to various models and investigate the effectiveness of defense mechanisms against these adversarial examples on the BERT models. We find that fine-tuned BERT models were more susceptible to our synonym attacks than the other Deep Neural Network (DNN) models we trained, such as WordCNN and WordLSTM. We also find that fine-tuned BERT models were more susceptible to transferred attacks. Lastly, we find that fine-tuned BERT models successfully regain at least 2% in accuracy after applying adversarial training as an initial defense mechanism. We share our code scripts and trained models on GitHub at https://github.com/NorahAlshahrani/bert_synonym_attack.

Item Unknown
EXPLORING LANGUAGE MODELS AND QUESTION ANSWERING IN BIOMEDICAL AND ARABIC DOMAINS (University of Delaware, 2024-05-10)
Alrowili, Sultan; Shanker, K.Vijay
Despite the success of the Transformer model and its variations (e.g., BERT, ALBERT, ELECTRA, T5) in addressing NLP tasks, similar success is not achieved when these models are applied to specific domains (e.g., biomedical) and limited-resource languages (e.g., Arabic). This research addresses some of the challenges in applying Transformer models to specialized domains and to languages that lack language processing resources. One of the reasons for reduced performance in limited domains might be the lack of quality contextual representations.
We address this issue by adapting different types of language models and introducing five BioM-Transformer models for the biomedical domain and Funnel transformer and T5 models for the Arabic language. For each of our models, we present experiments for studying the impact of design factors (e.g., corpora and vocabulary domain, model-scale, architecture design) on performance and efficiency. Our evaluation of BioM-Transformer models shows that we obtain state-of-the-art results on several biomedical NLP tasks and achieve top-performing models on the BLURB leaderboard. The evaluation of our small-scale Arabic Funnel and T5 models shows that we achieve comparable performance while utilizing less computation compared to the fine-tuning cost of existing Arabic models. Further, our base-scale Arabic language models extend state-of-the-art results on several Arabic NLP tasks while maintaining a comparable fine-tuning cost to existing base-scale models. Next, we focus on the question-answering task, specifically tackling issues in specialized domains and low-resource languages such as the limited size of question-answering datasets and the limited topic coverage within them. We employ several methods to address these issues in the biomedical domain, including the employment of models adapted to the domain and Task-to-Task Transfer Learning. We evaluate the effectiveness of these methods at the BioASQ10 (2022) challenge, showing that we achieved the top-performing system on several batches of the BioASQ10 challenge. In Arabic, we address similar existing issues by introducing a novel approach to create question-answer-passage triplets, and propose a pipeline, Pair2Passage, to create large QA datasets. Using this method and the pipeline, we create the ArTrivia dataset, a new Arabic question-answering dataset comprising more than 10,000 high-quality question-answer-passage triplets.
We present a quantitative and qualitative analysis of ArTrivia that shows the importance of some often overlooked yet important components, such as answer normalization, in enhancing the quality of the question-answer dataset and future annotation. In addition, our evaluation shows the ability of ArTrivia to build a question-answering model that can address the out-of-distribution issue in existing Arabic QA datasets.

Item Unknown
Unsupervised Semantic Change Detection in Arabic (Queen Mary University of London, 2023-10-23)
Sindi, Kenan; Dubossarsky, Haim
This study employs three pretrained BERT models, AraBERT, CAMeLBERT (CA), and CAMeLBERT (MSA), to investigate semantic change in Arabic across distinct time periods. Analyzing word embeddings and cosine distance scores reveals variations in how the models capture semantic shifts. The research highlights the significance of training data quality and diversity, while acknowledging limitations in data scope. The project's outcome, a list of the most stable and most changed words, contributes to Arabic NLP by shedding light on semantic change detection, suggesting potential model selection strategies and areas for future exploration.
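
The cosine-distance scoring mentioned in this last abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: it presumes each word has already been reduced to one embedding vector per time period (the abstract does not say how the BERT embeddings are aggregated), and the function names `cosine_distance` and `rank_by_change` and the toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_change(period_a, period_b):
    """Rank words shared by both periods from most to least changed.

    Each argument maps a word to its embedding vector for one period.
    """
    shared = period_a.keys() & period_b.keys()
    scores = {w: cosine_distance(period_a[w], period_b[w]) for w in shared}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy two-dimensional vectors, not real model output.
early = {"word_a": [1.0, 0.0], "word_b": [1.0, 1.0]}
late = {"word_a": [0.0, 1.0], "word_b": [2.0, 2.0]}
ranking = rank_by_change(early, late)
```

Words at the top of `ranking` have the largest distance between periods and would be candidates for the "most changed" list; words at the bottom would be candidates for the "most stable" list.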