Saudi Cultural Missions Theses & Dissertations
Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10
Search Results
19 results
Item Restricted: Analysing and Visualising (Cyber)crime data using Structured Occurrence Nets and Natural Language Processing (Newcastle University, 2025-03-01). Alshammari, Tuwailaa; Koutny, Maciej.

Structured Occurrence Nets (SONs) are a Petri net-based formalism designed to represent the behaviour of complex evolving systems, capturing concurrent events and interactions between subsystems. Recently, the modelling and visualisation of crime and cybercrime investigations have gained increasing interest, and SONs have proven to be versatile tools for modelling and visualising such applications. This thesis presents two contributions aimed at making SON-based techniques suitable for real-life applications. The main contribution is motivated by the fact that manually developing SON models from unstructured text is time-consuming, requiring extensive reading, comprehension, and model construction. The thesis therefore develops a methodology for formally representing unstructured English textual resources, experimenting with, mapping, and deriving relationships between natural and formal languages, with SON-based crime modelling and visualisation as the application. The second contribution addresses the scalability of SON-based representations for cybercrime analysis: it provides a novel approach in which acyclic nets are extended with coloured features that reduce net size and thereby aid visualisation. While the two contributions address distinct challenges, they are unified by their use of SONs as a formalism for modelling complex systems, and they demonstrate the adaptability of structured occurrence nets in representing both crime scenarios and cybercrime activities.
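To make the formalism concrete, the sketch below shows one minimal way an acyclic net of crime-scenario events could be represented and checked for well-formedness; the event names, arcs, and representation are illustrative assumptions rather than the thesis's notation or tooling.

```python
# Minimal, illustrative sketch (not from the thesis): an acyclic net as a set of
# events with causal arcs, plus a check that the causal relation contains no cycle.
from collections import defaultdict

def is_acyclic(events, arcs):
    """Kahn's algorithm: return True if the causal arcs form an acyclic relation."""
    indegree = {e: 0 for e in events}
    succ = defaultdict(list)
    for src, dst in arcs:
        succ[src].append(dst)
        indegree[dst] += 1
    queue = [e for e in events if indegree[e] == 0]
    visited = 0
    while queue:
        e = queue.pop()
        visited += 1
        for nxt in succ[e]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return visited == len(events)

# Hypothetical crime-scenario events extracted from a textual report.
events = ["suspect_enters", "suspect_steals_laptop", "alarm_triggers", "suspect_flees"]
arcs = [("suspect_enters", "suspect_steals_laptop"),
        ("suspect_steals_laptop", "alarm_triggers"),
        ("alarm_triggers", "suspect_flees")]
print(is_acyclic(events, arcs))  # True: a well-formed (acyclic) ordering of occurrences
```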
Item Restricted: IMPROVING ASPECT-BASED SENTIMENT ANALYSIS THROUGH LARGE LANGUAGE MODELS (Florida State University, 2024). Alanazi, Sami; Liu, Xiuwen.

Aspect-Based Sentiment Analysis (ABSA) is a crucial task in Natural Language Processing (NLP) that seeks to extract sentiments associated with specific aspects within text data. While traditional sentiment analysis offers a broad view, ABSA provides a fine-grained approach by identifying sentiments tied to particular aspects, enabling deeper insights into user opinions across diverse domains. Despite improvements in NLP, accurately capturing aspect-specific sentiments, especially in complex and multi-aspect sentences, remains challenging due to nuanced dependencies and variation in how sentiment is expressed. Additionally, languages with limited annotated datasets, such as Arabic, present further obstacles for ABSA. This dissertation addresses these challenges by proposing methodologies that enhance ABSA capabilities through large language models and transformer architectures. Three primary approaches are developed and evaluated: first, aspect-specific sentiment classification using GPT-4 with prompt engineering to improve few-shot learning and in-context classification; second, triplet extraction using an encoder-decoder framework based on the T5 model, designed to capture aspect-opinion-sentiment associations; and third, Aspect-Aware Conditional BERT, an extension of AraBERT that incorporates a customised attention mechanism to dynamically adjust focus based on target aspects, particularly improving ABSA for multi-aspect Arabic text. Experimental results demonstrate that the proposed methods outperform current baselines across multiple datasets, particularly in sentiment accuracy and aspect relevance. This research contributes new model architectures and techniques that enhance ABSA for both high-resource and low-resource languages, offering a scalable solution adaptable to various domains.
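The first approach, few-shot in-context aspect sentiment classification, can be pictured roughly as follows. This is an illustrative sketch rather than the dissertation's actual prompts or settings: the prompt wording, example reviews, model name, and use of the OpenAI chat-completions client are all assumptions.

```python
# Illustrative sketch only: few-shot, in-context aspect sentiment classification
# via a chat-completion LLM API. Prompt wording and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

FEW_SHOT = """Classify the sentiment expressed towards the given aspect as positive, negative, or neutral.

Review: "The battery lasts forever but the screen is dim."
Aspect: battery
Sentiment: positive

Review: "The battery lasts forever but the screen is dim."
Aspect: screen
Sentiment: negative
"""

def classify_aspect_sentiment(review: str, aspect: str) -> str:
    prompt = f'{FEW_SHOT}\nReview: "{review}"\nAspect: {aspect}\nSentiment:'
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(classify_aspect_sentiment("Great food, but the service was painfully slow.", "service"))
```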
Item Restricted: Towards Representative Pre-training Corpora for Arabic Natural Language Processing (Clarkson University, 2024-11-30). Alshahrani, Saied Falah A; Matthews, Jeanna.

Natural Language Processing (NLP) encompasses various tasks, problems, and algorithms that analyze human-generated textual corpora or datasets to produce insights, suggestions, or recommendations. These corpora and datasets are crucial for any NLP task or system, as they convey social concepts, including the views, culture, heritage, and perspectives of native speakers. However, a corpus or dataset in a particular language does not necessarily represent the culture of its native speakers: some texts are written organically by native speakers, while others are written by non-native speakers, translated from other languages, or generated using advanced NLP technologies such as Large Language Models (LLMs). In the era of Generative Artificial Intelligence (GenAI), it has become increasingly difficult to distinguish human-generated texts from machine-translated or machine-generated texts, especially when these different types of text are combined into large corpora for pre-training NLP tasks, systems, and technologies.

Therefore, there is an urgent need to study the degree to which pre-training corpora represent native speakers and reflect their values, beliefs, cultures, and perspectives, and to investigate the potentially negative implications of using unrepresentative corpora for NLP tasks, systems, and technologies. Among the most widely used pre-training corpora for NLP are Wikipedia articles, especially for low-resource languages like Arabic, owing to Wikipedia's large multilingual content and the massive array of metadata that can be quantified. In this dissertation, we study the representativeness of Arabic NLP pre-training corpora, focusing on the three Arabic Wikipedia editions: Arabic Wikipedia, Egyptian Arabic Wikipedia, and Moroccan Arabic Wikipedia. Our primary goals are to 1) raise awareness of the potential negative implications of using unnatural, inorganic, and unrepresentative corpora, i.e., those generated or translated automatically without the input of native speakers, 2) find better ways to promote transparency and ensure that native speakers are involved, through metrics, metadata, and online applications, and 3) reduce the impact of automatically generated or translated content by using machine learning algorithms to identify or detect it automatically.

To do this, we first analyze the metadata of the three Arabic Wikipedia editions, examining differences in collected statistics such as total pages, articles, edits, registered and active users, administrators, and top editors, and we document issues related to the automatic creation and translation of articles (content pages) from English to Arabic without review, revision, or supervision by native speakers. Second, we quantitatively study the performance implications of using unnatural, inorganic corpora that do not represent native speakers and are primarily generated through automation, such as bot-created articles or template-based translation. We intrinsically evaluate two core NLP tasks, Word Representation and Language Modeling, using the Word Analogy and Fill-Mask evaluation tasks on our two newly created datasets: the Arab States Analogy Dataset and the Masked Arab States Dataset. Third, we assess the quality of Wikipedia corpora at the edition level rather than the article level by quantifying bot activities and enhancing Wikipedia's Depth metric. After analyzing the limitations of the existing Depth metric, we propose a bot-free version, the DEPTH+ metric, which excludes bot-created articles and bot-made edits; we present its mathematical definition, highlight its features and limitations, and explain how the new metric more accurately reflects the depth of human collaboration within the Wikipedia project. Finally, we address the issue of template translation in the Egyptian Arabic Wikipedia by identifying template-translated articles and their characteristics. We explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions, and use the resulting insights to build multivariate machine learning classifiers that leverage article metadata to automatically detect template-translated articles. We deploy the best-performing classifier publicly as an online application and release the extracted, filtered, labeled, and preprocessed datasets so that the research community can benefit from our datasets and the web-based detection system.
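The final step, metadata-based detection of template-translated articles, can be sketched roughly as below. The feature names, toy data, and choice of a random-forest classifier are illustrative assumptions, not the dissertation's actual feature set or model.

```python
# Illustrative sketch: classify Wikipedia articles as template-translated or not
# using simple article-metadata features. Feature names and data are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical per-article metadata: total edits, distinct human editors,
# bot edits, article length, and a template-translated label.
df = pd.DataFrame({
    "total_edits":   [3, 45, 2, 60, 4, 38],
    "human_editors": [1, 12, 1, 15, 1, 9],
    "bot_edits":     [2, 3, 1, 5, 3, 2],
    "length_chars":  [450, 8200, 390, 10400, 510, 7600],
    "template_translated": [1, 0, 1, 0, 1, 0],
})

X = df.drop(columns=["template_translated"])
y = df["template_translated"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```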
Item Restricted: Enhancing Biomedical Named Entity Recognition through Multi-Task Learning and Syntactic Feature Integration with BioBERT (De Montfort University, 2024-08). Alqulayti, Abdulaziz; Taherkhani, Aboozar.

Biomedical Named Entity Recognition (BioNER) is a critical natural language processing (NLP) task for extracting salient knowledge from the rapidly growing biomedical literature. This study focuses on building refined BioNER models that identify entities such as proteins, diseases, and genes with strong generalisability and accuracy. It addresses key challenges in BioNER, including morphological variation, the complexity of biomedical terminology, and the ambiguity of context-dependent language. The methodology combines character-level embeddings produced by Bidirectional Long Short-Term Memory (BiLSTM) networks, pre-trained models such as BioBERT, a multi-task learning setup, and syntactic feature extraction, and it is applied to the NCBI Disease Corpus, a standard dataset for disease name recognition. Two main models were developed: BioBERTForNER and BioBERTBiLSTMForNER. The latter adds a BiLSTM layer that captures long-range dependencies and complex morphological patterns in biomedical text, reaching an F1-score of 0.938 and outperforming both the baseline BioBERT model and existing systems. The study also investigates the contribution of syntactic features and character-level embeddings, showing their importance for improving precision and recall, while the multi-task learning component helps the models generalise across contexts and mitigates overfitting. Beyond setting new results on the NCBI Disease Corpus, the models offer a multi-faceted, extensible strategy for BioNER, showing how architectural innovations and refined embedding methods can substantially enhance biomedical text mining. The findings underscore the role of advanced embedding techniques and multi-task learning in NLP and their flexibility across biomedical domains, and they point to applications in real-world clinical data extraction. Extending these methodologies to additional, more diverse biomedical datasets could further improve the efficiency and precision of automated biomedical information retrieval in clinical settings.
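The BioBERT-plus-BiLSTM architecture described above can be pictured roughly as follows; this is an illustrative sketch under assumptions (model checkpoint name, hidden sizes, label set) and not the thesis's exact implementation.

```python
# Illustrative sketch (not the thesis's exact architecture): a BioBERT encoder with
# a BiLSTM layer on top for token-level NER tagging. Names and sizes are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BioBertBiLstmNer(nn.Module):
    def __init__(self, model_name="dmis-lab/biobert-base-cased-v1.1", num_labels=3, lstm_hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)  # e.g. O / B-Disease / I-Disease

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)   # re-contextualise encoder states with a BiLSTM
        return self.classifier(lstm_out)    # per-token label logits

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = BioBertBiLstmNer()
batch = tokenizer(["BRCA1 mutations are linked to breast cancer."], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # (batch, sequence_length, num_labels)
```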
Item Restricted: A Quality Model to Assess Airport Services Using Machine Learning and Natural Language Processing (Cranfield University, 2024-04). Homaid, Mohammed; Moulitsas, Irene.

In the dynamic environment of passenger experiences, precisely evaluating passenger satisfaction remains crucial. This thesis is dedicated to the analysis of Airport Service Quality (ASQ) by analysing passenger reviews through sentiment analysis. The research aims to investigate and propose a novel model for assessing ASQ through the application of Machine Learning (ML) and Natural Language Processing (NLP) techniques, using a comprehensive dataset sourced from Skytrax that incorporates both text reviews and numerical ratings. The initial analysis highlights the challenges that traditional, general-purpose NLP techniques face when applied to specific domains such as ASQ, owing to limitations such as general lexicon dictionaries and pre-compiled stopword lists. To overcome these challenges, a domain-specific sentiment lexicon for airport service reviews is created using the Pointwise Mutual Information (PMI) scoring method, and the default VADER sentiment scores are replaced with those derived from the newly developed lexicon. The outcomes demonstrate that this specialised lexicon for the airport review domain substantially exceeds the benchmarks, delivering consistent and significant improvements. Moreover, six distinct methods for identifying stopwords within the Skytrax review dataset are developed, and the research shows that dynamic stopword removal markedly improves sentiment classification performance. Deep learning (DL), especially with transformer models, has revolutionised the processing of textual data. Novel models are therefore built by carefully developing and fine-tuning advanced deep learning architectures, specifically Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Encoder Representations from Transformers (BERT), tailored to the airport services domain. The results demonstrate superior performance, highlighting the BERT model's ability to blend textual and numerical data, and mark a significant improvement upon the state-of-the-art results documented in the existing literature. To encapsulate, this thesis presents a thorough exploration of sentiment analysis, ML, and DL methodologies, establishing a framework for enhancing ASQ evaluation through detailed analysis of passenger feedback.
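The PMI-based lexicon idea can be sketched as follows; the scoring formula details, smoothing, toy reviews, and the way scores are injected into VADER's lexicon are assumptions for illustration, not the thesis's exact procedure.

```python
# Illustrative sketch of a PMI-based domain lexicon (not the thesis's formulation):
# score(w) = PMI(w, positive) - PMI(w, negative), estimated from labelled reviews,
# then injected into VADER's lexicon. Data, smoothing, and scaling are assumptions.
import math
from collections import Counter
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

reviews = [
    ("security queue was endless and staff were rude", "neg"),
    ("smooth check-in and friendly staff", "pos"),
    ("lounge was clean and boarding was quick", "pos"),
    ("endless delays and chaotic boarding", "neg"),
]

word_label = Counter()
word_count = Counter()
label_count = Counter(label for _, label in reviews)
for text, label in reviews:
    for w in set(text.split()):
        word_label[(w, label)] += 1
        word_count[w] += 1

def pmi(word, label, smoothing=0.5):
    p_wl = (word_label[(word, label)] + smoothing) / (len(reviews) + smoothing)
    p_w = (word_count[word] + smoothing) / (len(reviews) + smoothing)
    p_l = label_count[label] / len(reviews)
    return math.log2(p_wl / (p_w * p_l))

domain_lexicon = {w: pmi(w, "pos") - pmi(w, "neg") for w in word_count}

analyzer = SentimentIntensityAnalyzer()
analyzer.lexicon.update(domain_lexicon)  # override default scores with domain-derived ones
print(analyzer.polarity_scores("boarding was quick but staff were rude"))
```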
Item Restricted: Adapting to Change: The Temporal Performance of Text Classifiers in the Context of Temporally Evolving Data (Queen Mary University of London, 2024-07-08). Alkhalifa, Rabab; Zubiaga, Arkaitz.

This thesis delves into the evolving landscape of NLP, focusing on the temporal persistence of text classifiers amid the dynamic nature of language use. The primary objective is to understand how changes in language patterns over time affect the performance of text classification models and to develop methodologies for maintaining their effectiveness. The research begins by establishing a theoretical foundation for text classification and temporal data analysis, highlighting the challenges posed by the evolving use of language and its implications for NLP models. A detailed exploration of several datasets, including stance detection and sentiment analysis datasets, sets the stage for examining these dynamics, with dataset characteristics such as linguistic variation and temporal vocabulary growth examined for their influence on classifier performance. A series of experiments evaluates the performance of text classifiers across different temporal scenarios. The findings reveal a general trend of performance degradation over time, emphasising the need for classifiers that can adapt to linguistic change. The experiments also assess models' ability to estimate past and future performance from their current efficacy and the linguistic characteristics of the datasets, yielding insights into the factors that influence model longevity. Solutions are proposed to address the observed performance decline and adapt to temporal changes in language use, including incorporating temporal information into word embeddings and comparing various methods across temporal gaps. The Incremental Temporal Alignment (ITA) method emerges as a significant contributor to classifier performance in same-period experiments, although it faces challenges in maintaining effectiveness over longer temporal gaps. Furthermore, the exploration of machine learning and statistical methods highlights their potential to maintain classifier accuracy in the face of longitudinally evolving data. The thesis culminates in a shared task evaluation in which participant-submitted models are compared against baseline models to assess their temporal persistence, providing a comprehensive picture of short-term, long-term, and overall persistence. The research identifies several future directions, including interdisciplinary approaches that integrate linguistics and sociology, tracking textual shifts on online platforms, extending the analysis to other classification tasks, and investigating the ethical implications of evolving language in NLP applications. This thesis contributes to the NLP field by highlighting the importance of evaluating text classifiers' temporal persistence and offering methodologies to enhance their sustainability in dynamically evolving language environments, paving the way for more robust, reliable, and temporally persistent text classification models.
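A minimal picture of the kind of temporal evaluation described above, assuming a simple bag-of-words classifier and made-up examples from two time periods (not the thesis's datasets, models, or protocol), is sketched below.

```python
# Illustrative sketch: train on an early period, evaluate on later periods to observe
# temporal performance change. Data and model choice are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical (text, label, year) examples; a real study would use thousands per period.
data = [
    ("this phone is sick, love it", 1, 2014), ("battery dies too fast", 0, 2014),
    ("totally rad camera", 1, 2014),          ("screen cracked in a week", 0, 2014),
    ("this phone slaps, no cap", 1, 2022),    ("battery life is mid", 0, 2022),
    ("camera goes hard", 1, 2022),            ("screen is straight up fragile", 0, 2022),
]

train = [(t, y) for t, y, year in data if year == 2014]
vec = TfidfVectorizer().fit([t for t, _ in train])
clf = LogisticRegression().fit(vec.transform([t for t, _ in train]), [y for _, y in train])

for period in (2014, 2022):
    texts = [t for t, _, year in data if year == period]
    labels = [y for _, y, year in data if year == period]
    acc = accuracy_score(labels, clf.predict(vec.transform(texts)))
    print(f"test period {period}: accuracy = {acc:.2f}")  # accuracy tends to drop as language drifts
```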
Item Restricted: EXTRACTION OF TEMPORAL RELATIONSHIPS BETWEEN EVENTS FROM NEWS ARTICLES FOR TIMELINE GENERATION (University of Manchester, 2024-06-27). Alsayyahi, Sarah; Batista-Navarro, Riza.

Extracting temporal information from natural language texts is crucial for understanding the sequence and context of events, enhancing the accuracy of timeline generation and event analysis in various applications. However, within the NLP community, determining the temporal ordering of events has been recognised as a challenging task, owing to the inherent vagueness of temporal information in natural language texts such as news articles. In Temporal Information Extraction (TIE), different datasets and methods have been proposed to extract various types of temporal entities, including events, temporal expressions, temporal relations, and the relative order of events. Some of these tasks are considered easier than others: extracting temporal expressions or events, for instance, is easier than determining the optimal order of a set of events, because ordering requires commonsense and external knowledge that is not readily accessible to computers, whereas humans can effortlessly identify the chronological order by drawing on their knowledge and understanding. In this thesis, our goal was to improve the performance of state-of-the-art methods for determining the temporal order of events in news articles. Accordingly, we present the following contributions:

1. We reviewed the literature through a systematic survey, categorising tasks and datasets relevant to extracting the order of events mentioned in news articles, identifying existing findings, and highlighting research directions worth further investigation.

2. We proposed a novel annotation scheme with an unambiguous definition of the types of events and temporal relations of interest. Adopting this scheme, we developed the TIMELINE dataset, which annotates both verb and nominal events and covers long-distance temporal relations between events separated by more than one sentence.

3. We integrated problem-related features with a neural method to improve the extraction of temporal relations involving nominal events and relations from small classes (e.g., the EQUAL class). Integrating these features significantly improved the neural baseline and achieved state-of-the-art results on two datasets from the literature.

4. We proposed a framework that uses local search algorithms (e.g., Hill Climbing and Simulated Annealing) to generate document-level timelines from a set of temporal relations; these algorithms improved on current models and solved the problem in less time than state-of-the-art approaches.
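The local-search idea behind the fourth contribution can be sketched as follows; the events, pairwise relations, and the specific hill-climbing loop are illustrative assumptions rather than the thesis's algorithm.

```python
# Illustrative sketch: hill climbing over event orderings to satisfy as many pairwise
# BEFORE relations as possible. Events and relations are hypothetical.
import random

events = ["explosion", "evacuation", "investigation", "arrest"]
# Extracted pairwise relations: (a, b) means "a BEFORE b".
before = [("explosion", "evacuation"), ("explosion", "investigation"),
          ("investigation", "arrest"), ("evacuation", "investigation")]

def score(order):
    pos = {e: i for i, e in enumerate(order)}
    return sum(1 for a, b in before if pos[a] < pos[b])  # satisfied BEFORE constraints

def hill_climb(order, iterations=1000, seed=0):
    rng = random.Random(seed)
    best, best_score = list(order), score(order)
    for _ in range(iterations):
        i, j = rng.sample(range(len(best)), 2)   # propose swapping two events
        candidate = list(best)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        if score(candidate) >= best_score:        # accept non-worsening moves
            best, best_score = candidate, score(candidate)
    return best, best_score

timeline, satisfied = hill_climb(random.Random(42).sample(events, len(events)))
print(timeline, f"{satisfied}/{len(before)} relations satisfied")
```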
Item Restricted: Predicting Actions in Images using Distributed Lexical Representations (University of Sheffield, 2023-08-05). Alsunaidi, Abdulsalam; Gaizauskas, Rob.

Artificial intelligence has long sought to develop agents capable of perceiving the complex visual environment around us and communicating about it using natural language. In recent years, significant strides have been made towards this objective, particularly in the field of image content description. For instance, current artificial systems are able to classify images of a single object with a level of accuracy that is sometimes comparable to that of humans. Although there has been remarkable progress in recognising objects, there has been less headway in action recognition due to a significant limitation in the current approach: most advances in visual recognition rely on classifying images into distinct, non-overlapping categories. While this approach works well in many contexts, it is inadequate for understanding actions, as it constrains the categorisation of an action to a single interpretation and prevents an agent from proposing multiple possible interpretations. To tackle this fundamental limitation, this thesis proposes a framework that describes action-depicting images using multiple verbs and expands the vocabulary used to describe such images beyond the limitations of the training dataset. In particular, the framework leverages lexical embeddings as a supplementary tool to go beyond the verbs supplied as explicit labels for images in datasets used for supervised training of action classifiers; specifically, these embeddings are used to represent the target labels (i.e., verbs). By exploiting richer representations of human actions, this framework has the potential to improve the capability of artificial agents to accurately recognise and describe human actions in images. In this thesis, we focus on the representation of input images and target labels, examining various components for both, ranging from commonly used off-the-shelf options to custom-designed ones tailored to the task at hand. By carefully selecting and evaluating these components, we aim not only to improve the accuracy and effectiveness of the proposed framework but also to gain deeper insight into the potential of distributed lexical representations for action prediction in images.
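The label-embedding idea can be pictured roughly as below: rather than choosing one class from a closed verb inventory, an image representation is compared against lexical embeddings of many candidate verbs, so several plausible verbs (including ones never seen as training labels) can be proposed. All vectors and names here are toy values, not the thesis's embeddings or model.

```python
# Illustrative sketch: score candidate verbs by cosine similarity between a predicted
# image embedding and verb embeddings. All vectors here are made-up toy values.
import numpy as np

# Toy lexical embeddings for verbs (in practice, e.g. pre-trained word vectors).
verb_vectors = {
    "ride":  np.array([0.9, 0.1, 0.0]),
    "cycle": np.array([0.8, 0.2, 0.1]),
    "eat":   np.array([0.0, 0.9, 0.3]),
    "read":  np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_verbs(image_embedding, top_k=3):
    """Rank all candidate verbs, not just the training labels, by similarity."""
    scores = {v: cosine(image_embedding, vec) for v, vec in verb_vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# A hypothetical image-encoder output projected into the same space.
predicted = np.array([0.85, 0.15, 0.05])
print(predict_verbs(predicted))  # multiple plausible verbs, e.g. "ride" and "cycle"
```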