Saudi Cultural Missions Theses & Dissertations

Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10

Search Results

Now showing 1 - 10 of 17
  • ItemRestricted
    Towards Representative Pre-training Corpora for Arabic Natural Language Processing
    (Clarkson University, 2024-11-30) Alshahrani, Saied Falah A; Matthews, Jeanna
    Natural Language Processing (NLP) encompasses various tasks, problems, and algorithms that analyze human-generated textual corpora or datasets to produce insights, suggestions, or recommendations. These corpora and datasets are crucial for any NLP task or system, as they convey social concepts, including views, culture, heritage, and perspectives of native speakers. However, a corpus or dataset in a particular language does not necessarily represent the culture of its native speakers. Some textual corpora or datasets are written organically by native speakers, while others may be written by non-native speakers, translated from other languages, or generated using advanced NLP technologies, such as Large Language Models (LLMs). Yet, in the era of Generative Artificial Intelligence (GenAI), it has become increasingly difficult to distinguish between human-generated texts and machine-translated or machine-generated texts, especially when all these different types of texts, i.e., corpora or datasets, are combined to create large corpora or datasets for pre-training NLP tasks, systems, and technologies. Therefore, there is an urgent need to study the degree to which pre-training corpora or datasets represent native speakers and reflect their values, beliefs, cultures, and perspectives, and to investigate the potentially negative implications of using unrepresentative corpora or datasets for NLP tasks, systems, and technologies. Among the most widely utilized pre-training corpora or datasets for NLP are Wikipedia articles, especially for low-resource languages like Arabic, due to their large multilingual content collection and massive array of metadata that can be quantified. In this dissertation, we study the representativeness of the Arabic NLP pre-training corpora or datasets, focusing specifically on the three Arabic Wikipedia editions: Arabic Wikipedia, Egyptian Arabic Wikipedia, and Moroccan Arabic Wikipedia.
    Our primary goals are to 1) raise awareness of the potential negative implications of using unnatural, inorganic, and unrepresentative corpora (those generated or translated automatically without the input of native speakers), 2) find better ways to promote transparency and ensure that native speakers are involved through metrics, metadata, and online applications, and 3) strive to reduce the impact of automatically generated or translated content by using machine learning algorithms to identify or detect it automatically. To do this, we first analyze the metadata of the three Arabic Wikipedia editions, focusing on differences using collected statistics such as total pages, articles, edits, registered and active users, administrators, and top editors. We document issues related to the automatic creation and translation of articles (content pages) from English to Arabic without human (i.e., native speakers) review, revision, or supervision. Secondly, we quantitatively study the performance implications of using unnatural, inorganic corpora that do not represent native speakers and are primarily generated using automation, such as bot-created articles or template-based translation. We intrinsically evaluate the performance of two main NLP tasks, Word Representation and Language Modeling, using the Word Analogy and Fill-Mask evaluation tasks on our two newly created datasets: the Arab States Analogy Dataset and the Masked Arab States Dataset. Thirdly, we assess the quality of Wikipedia corpora at the edition level rather than the article level by quantifying bot activities and enhancing Wikipedia’s Depth metric.
    After analyzing the limitations of the existing Depth metric, we propose a bot-free version, called the DEPTH+ metric, that excludes bot-created articles and bot-made edits on articles. We present its mathematical definitions, highlight its features and limitations, and explain how this new metric accurately reflects human collaboration depth within the Wikipedia project. Finally, we address the issue of template translation in the Egyptian Arabic Wikipedia by identifying these template-translated articles and their characteristics. We explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions and employ the resulting insights to build multivariate machine learning classifiers that leverage article metadata to automatically detect template-translated articles. Lastly, we deploy the best-performing classifier publicly as an online application and release the extracted, filtered, labeled, and preprocessed datasets so that the research community can benefit from our datasets and the web-based detection system.
  • ItemRestricted
    Understanding Family Language Policies in Saudi Sojourning Families: insights from Mothers in Melbourne.
    (Monash University, 2024) Alsubaie, Samah; Fang, Nina
    This study investigates how ten Saudi sojourning mothers in Melbourne manage Family Language Policy (FLP) decisions regarding their children's language development. Unlike immigrants, who aim for long-term integration, sojourners live abroad temporarily, planning to return to their home country. Much research has been conducted on immigrants; however, few studies have focused on sojourners, particularly Saudi sojourning mothers. Therefore, they are the focus of this study. Using a qualitative approach, including semi-structured interviews, the study finds that all mothers prioritize maintaining Arabic for religious, cultural, and educational reasons. The research reveals the significant influence of external societal pressures and internal family dynamics on FLP choices, leading to a gap between the mothers' declared language ideologies and their actual practices. Despite these challenges, the mothers show a strong commitment to preserving their children's first language (L1) through consistent strategies. A key finding is the positive impact of fathers' active involvement in language education, which not only enhances language acquisition but also strengthens family unity and authority. The study highlights the complexities of FLP in transnational families and offers valuable insights into how parental roles and external factors shape language policies.
  • ItemRestricted
    Expanding our understanding of the uses of Modern Standard and Hijazi Colloquial Arabic in Education: A Study Exploring Learners’ Attention, Academic Performance, and Language Attitudes in Saudi Arabia
    (University of Sussex, 2024-07) Alamir, Sarah; Blair, Andrew; Alkabani, Feras
    This study investigates how the use of Hijazi Colloquial Arabic (HCA) and Modern Standard Arabic (MSA) in oral instruction affects students' sustained auditory attention and academic performance and their attitudes towards both varieties in education. To form a clear picture of how effective both varieties are, the results of a nine-week pre-post-test classroom experiment, a follow-up questionnaire, and interviews were used for analysis. First, two groups of undergraduate female students (aged between 20 and 27) assigned to the 'History of the Americas' module at Umm Al-Qura University and a professor were selected for the experiment. One group had 29 students, whereas the other had 25. One group was instructed in MSA, and the other in HCA. The study findings showed that both HCA and MSA oral instruction improved the students' ability to sustain auditory attention, leading to better academic performance, with HCA instruction being slightly more effective. In addition, the disparities in automaticity and language execution between HCA and MSA were negligible. When it comes to attitudes, both HCA and MSA groups had more positive perceptions of MSA. Their actions, however, did not reflect their beliefs and feelings. Their attitudes and the underlying reasons could be grouped into six and five categories, respectively. Globally speaking, standard codes in diglossic contexts receive positive attitudes despite the changing social circumstances, while societal changes impact colloquial codes’ perceptions. These results have implications for higher education in Saudi Arabia and other Arab countries when considering Colloquial Arabic codes (CAs) as a medium of instruction, as they should go hand in hand with MSA.
    This can be done through further research and by modifying language policies to promote the coexistence of the two codes, combining them in instruction according to contexts and the psychological aspects instructors want to provoke, and using non-featured CAs, such as the educated HCA or White dialect.
  • ItemRestricted
    Children’s Development of the Arabic Emphatic Consonants; An Acoustic Investigation
    (Macquarie University, 2024-02) Alkhudidi, Anwar; Benders, Titia; Demuth, Katherine; Holt, Rebecca; Szalay, Tuende
    This thesis examines the developmental trajectory in the production of plain-emphatic consonant contrasts among Saudi-Hijazi-Arabic-speaking children aged 3 to 6 years. The production of the articulatorily complex emphatic consonants involves a primary coronal constriction and a secondary pharyngeal/uvular constriction. Acoustically, emphatics exert a strong anticipatory and carryover coarticulatory influence that can extend to all segments within the same word, a phenomenon termed ‘emphasis spread’ (e.g., J. Al-Tamimi, 2017; Card, 1983; Jongman et al., 2011; Khattab et al., 2006; Zawaydeh & de Jong, 2011). Prior research, primarily based on impressionistic data, suggests emphatic segments are typically acquired late, after the age of 4 years (e.g., Alqattan, 2015; Amayreh, 2003; Amayreh & Dyson, 1998). However, auditory judgments may not fully capture the subtle developmental changes or gradations in the production of these consonants that are detectable through acoustic analysis (Macken & Barton, 1980; Mashaqba et al., 2022). Consequently, this thesis aims to acoustically examine the acquisition route of these complex emphatic consonants, focusing on both the consonantal and vocalic cues to the plain-emphatic contrast across different phonetic contexts. Specifically, this thesis acoustically examines the production of emphatic consonants across different word positions (initial, medial, and final) and across three vocalic contexts (/aː/, /iː/, and /uː/), and whether the effect of the emphatic segment extends bidirectionally beyond the immediately adjacent vowel. Target consonants examined were the voiceless plain-emphatic obstruents /t/ vs. /tˤ/ and /s/ vs. /sˤ/. A single-word repetition task was used to elicit speech from 38 Saudi-Hijazi-Arabic-speaking children aged between 3;1 and 6;11, and 13 adults serving as reference data. The acoustic measurements taken were VOT of stops and F1 and F2 of adjacent vowels.
    Across these three studies, children demonstrate a non-linear developmental trajectory, initially showing a faster increase in the size of the plain-emphatic contrast with age, with the rate of this increase slowing down as children grow older. Furthermore, there is substantial alignment between child and adult production patterns concerning positional effects, vowel context effects, and emphasis spread patterns, highlighting the potential role of input in the development of emphatic consonants. Finally, female children produced, on average, larger contrasts than males. The findings of each study are discussed in relation to previous literature on emphatic production in adults, serving as a benchmark for understanding the developmental stages and strategies observed in children. References to various aspects of child phonology and production, including the cross-linguistic development of coarticulation, are also discussed.
  • ItemRestricted
    EXPLORING LANGUAGE MODELS AND QUESTION ANSWERING IN BIOMEDICAL AND ARABIC DOMAINS
    (University of Delaware, 2024-05-10) Alrowili, Sultan; Shanker, K.Vijay
    Despite the success of the Transformer model and its variations (e.g., BERT, ALBERT, ELECTRA, T5) in addressing NLP tasks, similar success is not achieved when these models are applied to specific domains (e.g., biomedical) and limited-resource languages (e.g., Arabic). This research addresses some of the challenges in applying Transformer models to specialized domains and to languages that lack language processing resources. One reason for reduced performance in limited domains might be the lack of quality contextual representations. We address this issue by adapting different types of language models and introducing five BioM-Transformer models for the biomedical domain and Funnel transformer and T5 models for the Arabic language. For each of our models, we present experiments for studying the impact of design factors (e.g., corpora and vocabulary domain, model-scale, architecture design) on performance and efficiency. Our evaluation of BioM-Transformer models shows that we obtain state-of-the-art results on several biomedical NLP tasks and achieve the top-performing models on the BLURB leaderboard. The evaluation of our small-scale Arabic Funnel and T5 models shows that we achieve comparable performance while utilizing less computation compared to the fine-tuning cost of existing Arabic models. Further, our base-scale Arabic language models extend state-of-the-art results on several Arabic NLP tasks while maintaining a comparable fine-tuning cost to existing base-scale models. Next, we focus on the question-answering task, specifically tackling issues in specialized domains and low-resource languages such as the limited size of question-answering datasets and the limited topic coverage within them. We employ several methods to address these issues in the biomedical domain, including the employment of models adapted to the domain and Task-to-Task Transfer Learning.
    We evaluate the effectiveness of these methods at the BioASQ10 (2022) challenge, showing that we achieved the top-performing system on several batches of the BioASQ10 challenge. In Arabic, we address similar existing issues by introducing a novel approach to create question-answer-passage triplets, and propose a pipeline, Pair2Passage, to create large QA datasets. Using this method and the pipeline, we create the ArTrivia dataset, a new Arabic question-answering dataset comprising more than 10,000 high-quality question-answer-passage triplets. We present a quantitative and qualitative analysis of ArTrivia that shows the importance of some often overlooked components, such as answer normalization, in enhancing the quality of the question-answer dataset and future annotation. In addition, our evaluation shows the ability of ArTrivia to build a question-answering model that can address the out-of-distribution issue in existing Arabic QA datasets.
  • ItemRestricted
    Straddling Two Worlds: How Linguistic Backgrounds and Sociocultural Norms Influence the Experiences of Saudi Female Expats in Australia
    (University of Wollongong, 2023-03-08) Alhassoun, Lamia Abdulaziz; Ward, Rowena
    This study constructs a collective story of Saudi female expats (SFEs) as they navigate the transition from their conservative society in Saudi Arabia to a new one in Australia. It examines the impact of SFEs’ Arabic background and their English learning experiences in Australia on their lives and explores SFEs’ perceptions of their self-representation in the social and educational milieu in Australia. Additionally, it sheds light on the intricate relationships between language, culture, gender and self-representation. The study employs a demographic questionnaire and semi-structured interviews with twenty-two SFEs in Australia. The study adopts lenses from social identity theory (Erikson 1968; Tajfel & Turner 1979, 1986), intragroup marginalisation (Castillo et al. 2007), social learning theory (Bandura 1977; Ryle 2011) and Oberg's cultural shock theory (1960) to guide the analysis of the study data. The findings of this study reveal that SFEs, generally, have a positive attitude towards learning and using the English language. However, SFEs’ low self-confidence in their English language proficiency negatively impacted their cross-cultural interactions in Australia. The study attributes SFEs' low self-confidence to five factors: limited opportunities to practice English, a preference for socialising with Arabic speakers, the COVID-19 pandemic, Saudi Arabia's English education policies and limited interest in English improvement. The study also explores how SFEs represent themselves differently in Australia. Their reflections in the research interviews revealed that they define their ‘in’- and ‘out’-groups differently depending on the context in relation to their interlocutor’s ethnicity, gender, language and faith. SFEs indicated that when interacting with Saudi male compatriots, they tend to be formal and direct in line with the norms of their Saudi culture and upbringing marked by gender segregation.
    However, they are friendlier and more open with non-Saudi male interlocutors due to the more relaxed gender norms in Australia. One of the key findings is that SFEs’ interactions with non-Saudis are influenced by their sense of obligation to represent their faith and nationality in the best light. The study also shows that SFEs’ insufficient knowledge of Australian culture and their low confidence in their English skills tend to make their interactions with native English speakers direct and to the point. SFEs are aware that they can appear terse for this reason. The lack of opportunities to interact with the host/Australian community and learn about Australian culture was worsened by the social isolation policies imposed during the COVID-19 pandemic. This study contributes to the limited literature on the experiences of SFEs as they navigate cross-cultural contexts and self-representation in Australia. It offers valuable insights into the real-world challenges experienced by SFEs in adjusting to a more liberal society while maintaining their cultural identity. It sheds light on their perceptions of self-representation and their attitudes towards learning and using English and straddling cultures in Australia. Practical implications for improving cross-cultural interaction and strategies for enhancing English language education programs to better accommodate the growing SFE community in Australia are discussed in the conclusion.
  • ItemRestricted
    Exploring Emoji Sentiment Roles in Arabic Textual Content on Digital Social Networks
    (Saudi Digital Library, 2024-07-09) Hakami, Shatha Ali A; Hendley, Robert; Smith, Phillip
    In today’s digital landscape, emoji have risen as pivotal elements in articulating sentiment, especially within the intricacies of the Arabic language. This thesis examines the various roles that emoji can play in expressing sentiment in Arabic texts, highlighting their relevance both in academic and real-world contexts. Beginning with foundational insights, our investigation retraces the history of emoji as important non-verbal communicative tools in human interaction. Then, we explore the distinct challenges of sentiment analysis in Arabic and refer to a thorough review of previous studies to frame our method, identifying both established techniques and unexplored opportunities. At the heart of our research is the understanding that, depending on the context, an emoji can adopt a wide variety of sentiment roles. These include acting as an indicator, mitigator, emphasizer, reverser, releaser, or trigger of either negative or positive sentiment. Additionally, there are instances where an emoji simply maintains a neutral effect on the sentiment of the accompanying text. To study these roles, we gathered a large dataset, mainly from Twitter, and developed lexicons of words and emoji tailored for sentiment analysis in Arabic. These lexicons were the basis of our analysis model. By leveraging the insights gained from the emoji-roles sentiment lexicon and combining them with our established knowledge of the sentiment roles associated with specific emoji patterns, we significantly improve on the conventional sentiment classifier based on the emoji lexicon. Traditional methods often assign a static sentiment score to an emoji, failing to consider its varying roles in different textual contexts. Our refined approach corrects this oversight. Instead of considering a singular unchanging sentiment score for each emoji, the classifier dynamically retrieves sentiment scores based on the specific role the emoji plays within a given sentence.
    In conclusion, we compare our method with other Arabic sentiment analysis tools, demonstrating the value of our approach, especially within nuanced linguistic phenomena such as sarcasm and humour. This thesis sets the foundation for future Arabic research in this expanding domain.
  • ItemRestricted
    Long Annotated Translation of the First Chapter of "Unlawful Killings", Life, Love and Murder: Trials at the Old Bailey By Wendy Joseph
    (Saudi Digital Library, 2023) Alburaidi, Ibrahim Saleh; Mizori, Hassan
    This project introduces the translation of the first chapter of "Unlawful Killings," a true crime narrative by Her Honour Wendy Joseph QC, offering insights into the UK legal system. The translation aims to fill a gap in Arabic literature, providing Arabic readers access to a best-selling work not previously translated. The rationale is grounded in the book's thematic relevance, the translator's personal connection to the Old Bailey Court, and the absence of an Arabic version. The translation strategy employs direct translation, borrowing, calque, and literal translation, supplemented by oblique translation techniques. The target readership includes Arabic literature enthusiasts, and the potential publisher is "Athra," known for its commitment to quality translation. In addition to the translation, there is an annotation that provides further context, explanations of translation choices, and cultural insights. This comprehensive approach seeks to enrich Arabic literature, presenting a unique perspective on true crime and legal proceedings while maintaining linguistic accuracy and cultural resonance.
  • ItemRestricted
    An Investigation into Matching Learning Material to the Different Needs of Arabic Learners with Dyslexia
    (Saudi Digital Library, 2023-11-29) Alghabban, Weam Gaoud; Hendley, Robert
    Dyslexia is a common learning disability that affects people’s ability to spell and read words and their fluency in language. Adaptive e-learning is becoming increasingly popular as a tool to help individuals with dyslexia. It provides more-customised learning experiences and interactions based on the learners’ characteristics. Each learner with dyslexia has unique characteristics for which content should ideally be suitably tailored. However, adaptation to satisfy the individual needs and characteristics of learners with dyslexia is limited. In particular, the benefits of adapting e-learning based on dyslexia type or reading skill level have not yet been sufficiently explored, despite the type of dyslexia and the learner’s reading skill level being critical factors. Most previous studies have focused upon the technological aspects and have been marked by inadequately designed and controlled experiments to assess the system’s effectiveness. This limits the ability to understand the effectiveness of adaptation. This thesis aims to increase understanding about the value of adaptation of learning material based on individual dyslexia types and reading skill levels and to understand how this affects the learning experience of learners with dyslexia. To do this, an empirical evaluation through three controlled experiments with a reasonable number of subjects has been undertaken and assessed using the following metrics: learning gain, word understanding, learner satisfaction and perceived level of usability. In all three experiments, careful experimental design and precise reporting of results were prioritised. A dynamic, web-based e-learning system that matches learning material based on dyslexia type and/or reading skill level was implemented to support these experiments.
    Across the three experiments, the findings reveal that matching learning material to dyslexia type, reading skill level and the combination of both yields significantly better short- and long-term learning gains and improves the learners’ perception of their learning.
  • ItemRestricted
    Hate Speech Detection for the Arabic Language
    (Saudi Digital Library, 2023-11-03) Alhejaili, Abrar; Moosavi, Nafise
    As online social networks grow and communication technologies become more available, people can exercise their freedom of expression more than ever before. Even though the interaction between users on these platforms can be constructive, the platforms are increasingly used for spreading hateful content, mainly due to the anonymity they offer. Hate speech can induce cyber conflict, negatively impacting social life at both the individual and national levels. In spite of this, social network providers are unable to monitor all the content posted by their users. As a result, there is a need to detect hate speech automatically. This need increases when the text is written in a language like Arabic. Arabic is known for its challenges, complexities, and resource scarcity. This project uses transfer learning methods to adapt and evaluate some pretrained models to detect hate speech in Arabic. Many experiments were conducted in this project to assess the transfer of several models from the BERT and Sequence-to-Sequence families (e.g., DehateBERT, MARBERT, T5, and Flan-T5), and the transfer of preprocessing functions from a pretrained model (AraBERT). Experiments show that transfer learning by finetuning monolingual models yields promising results to varying extents. In addition, the additional preprocessing can improve performance. Nevertheless, dealing with low-frequency labels independently, such as our dataset’s hate class, is still challenging. Warning: This paper may include instances of offensive language.

Copyright owned by the Saudi Digital Library (SDL) © 2024