SACM - United States of America
Permanent URI for this collection: https://drepo.sdl.edu.sa/handle/20.500.14154/9668
Search Results
4 results
Item Restricted
Towards Representative Pre-training Corpora for Arabic Natural Language Processing (Clarkson University, 2024-11-30)
Alshahrani, Saied Falah A; Matthews, Jeanna

Natural Language Processing (NLP) encompasses various tasks, problems, and algorithms that analyze human-generated textual corpora or datasets to produce insights, suggestions, or recommendations. These corpora and datasets are crucial for any NLP task or system, as they convey social concepts, including the views, culture, heritage, and perspectives of native speakers. However, a corpus or dataset in a particular language does not necessarily represent the culture of its native speakers. Some textual corpora or datasets are written organically by native speakers, while others are written by non-native speakers, translated from other languages, or generated using advanced NLP technologies such as Large Language Models (LLMs). In the era of Generative Artificial Intelligence (GenAI), it has become increasingly difficult to distinguish human-generated texts from machine-translated or machine-generated texts, especially when these different types of texts are combined to create large corpora or datasets for pre-training NLP tasks, systems, and technologies. There is therefore an urgent need to study the degree to which pre-training corpora or datasets represent native speakers and reflect their values, beliefs, cultures, and perspectives, and to investigate the potentially negative implications of using unrepresentative corpora or datasets for NLP tasks, systems, and technologies. Among the most widely utilized pre-training corpora for NLP are Wikipedia articles, especially for low-resource languages like Arabic, due to their large multilingual content collection and the massive array of metadata that can be quantified.

In this dissertation, we study the representativeness of Arabic NLP pre-training corpora, focusing specifically on the three Arabic Wikipedia editions: Arabic Wikipedia, Egyptian Arabic Wikipedia, and Moroccan Arabic Wikipedia. Our primary goals are to 1) raise awareness of the potential negative implications of using unnatural, inorganic, and unrepresentative corpora, that is, those generated or translated automatically without the input of native speakers; 2) find better ways to promote transparency and ensure that native speakers are involved, through metrics, metadata, and online applications; and 3) reduce the impact of automatically generated or translated content by using machine learning algorithms to detect it automatically.

To do this, we first analyze the metadata of the three Arabic Wikipedia editions, focusing on differences in collected statistics such as total pages, articles, edits, registered and active users, administrators, and top editors. We document issues related to the automatic creation and translation of articles (content pages) from English to Arabic without review, revision, or supervision by native speakers. Second, we quantitatively study the performance implications of using unnatural, inorganic corpora that do not represent native speakers and are primarily generated through automation, such as bot-created articles or template-based translation.
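Edition-level metadata of the kind analyzed in the first step can be retrieved from the public MediaWiki API. The following is a minimal sketch of that kind of collection, not the dissertation's own tooling; it assumes the standard siteinfo statistics endpoint and the language codes ar, arz, and ary for the Arabic, Egyptian Arabic, and Moroccan Arabic Wikipedia editions.

```python
import requests

# Language codes assumed for the three Arabic Wikipedia editions:
# ar = Arabic, arz = Egyptian Arabic, ary = Moroccan Arabic.
EDITIONS = {"Arabic": "ar", "Egyptian Arabic": "arz", "Moroccan Arabic": "ary"}

def edition_statistics(lang_code: str) -> dict:
    """Fetch site-wide statistics (pages, articles, edits, users, admins, ...)
    for one Wikipedia edition via the MediaWiki siteinfo API."""
    response = requests.get(
        f"https://{lang_code}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "meta": "siteinfo",
            "siprop": "statistics",
            "format": "json",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["query"]["statistics"]

if __name__ == "__main__":
    for name, code in EDITIONS.items():
        stats = edition_statistics(code)
        print(name, {key: stats[key] for key in
                     ("pages", "articles", "edits", "users", "activeusers", "admins")})
```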
We intrinsically evaluate the performance of two main NLP tasks—Word Representation and Language Modeling—using the Word Analogy and Fill-Mask evaluation tasks on our two newly created datasets: the Arab States Analogy Dataset and the Masked Arab States Dataset. Third, we assess the quality of Wikipedia corpora at the edition level rather than the article level by quantifying bot activities and enhancing Wikipedia's Depth metric. After analyzing the limitations of the existing Depth metric, we propose a bot-free version, the DEPTH+ metric, which excludes bot-created articles and bot-made edits; we present its mathematical definitions, highlight its features and limitations, and explain how this new metric more accurately reflects the depth of human collaboration within the Wikipedia project. Finally, we address the issue of template translation in the Egyptian Arabic Wikipedia by identifying template-translated articles and their characteristics. We explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions, and we employ the resulting insights to build multivariate machine learning classifiers that leverage article metadata to automatically detect template-translated articles. We lastly deploy the best-performing classifier publicly as an online application and release the extracted, filtered, labeled, and preprocessed datasets so the research community can benefit from them and from the web-based detection system.

Item Restricted
EXPLORING LANGUAGE MODELS AND QUESTION ANSWERING IN BIOMEDICAL AND ARABIC DOMAINS (University of Delaware, 2024-05-10)
Alrowili, Sultan; Shanker, K. Vijay

Despite the success of the Transformer model and its variations (e.g., BERT, ALBERT, ELECTRA, T5) in addressing NLP tasks, similar success is not achieved when these models are applied to specific domains (e.g., biomedical) and low-resource languages (e.g., Arabic). This research addresses some of the challenges in applying Transformer models to specialized domains and to languages that lack language processing resources. One reason for the reduced performance in limited domains may be the lack of quality contextual representations. We address this issue by adapting different types of language models, introducing five BioM-Transformer models for the biomedical domain and Funnel Transformer and T5 models for the Arabic language. For each of our models, we present experiments studying the impact of design factors (e.g., corpora and vocabulary domain, model scale, architecture design) on performance and efficiency. Our evaluation of the BioM-Transformer models shows that we obtain state-of-the-art results on several biomedical NLP tasks and achieve the top-performing models on the BLURB leaderboard. The evaluation of our small-scale Arabic Funnel and T5 models shows that we achieve comparable performance while using less computation than the fine-tuning cost of existing Arabic models. Further, our base-scale Arabic language models extend state-of-the-art results on several Arabic NLP tasks while maintaining a comparable fine-tuning cost to existing base-scale models. Next, we focus on the question-answering task, specifically tackling issues in specialized domains and low-resource languages, such as the limited size of question-answering datasets and the limited topic coverage within them.
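The extractive question-answering setting referred to here takes a question plus a supporting passage and extracts an answer span. As a point of reference only, below is a minimal sketch using the Hugging Face transformers pipeline; the model identifier is a generic placeholder, not one of the models introduced in this work.

```python
from transformers import pipeline

# Placeholder checkpoint: any extractive QA model could be substituted here,
# including domain-adapted biomedical or Arabic models.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(
    question="Which leaderboard reports the biomedical evaluation results?",
    context=(
        "The adapted biomedical language models obtain state-of-the-art "
        "results on several tasks of the BLURB leaderboard."
    ),
)
print(result["answer"], round(result["score"], 3))
```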
We employ several methods to address these issues in the biomedical domain, including the use of models adapted to the domain and Task-to-Task Transfer Learning. We evaluate the effectiveness of these methods on the BioASQ10 (2022) challenge, achieving the top-performing system on several of its batches. In Arabic, we address similar issues by introducing a novel approach to creating question-answer-passage triplets and by proposing a pipeline, Pair2Passage, to create large QA datasets. Using this approach and pipeline, we create ArTrivia, a new Arabic question-answering dataset comprising more than 10,000 high-quality question-answer-passage triplets. We present a quantitative and qualitative analysis of ArTrivia that shows the importance of often overlooked components, such as answer normalization, in enhancing the quality of the question-answer dataset and of future annotation. In addition, our evaluation shows that ArTrivia can be used to build a question-answering model that addresses the out-of-distribution issue in existing Arabic QA datasets.

Item Restricted
Appearances are Deceiving: Long-Distance Subject Anaphors and Phasal Binding Domains (2023-05)
Almalki, Fahad; Punske, Jeffrey

An unusual behavior of anaphors is to occur in embedded subject positions and be bound across a finite clause boundary by a matrix subject. This thesis demonstrates that such constructions exist in Malki Arabic, as they do in other languages. First, the thesis shows that the embedded clause in which subject anaphors are allowed is a CP and not always a TP. Second, in light of current approaches that reduce the binding domains of classical binding theory to phase theory, a cross-clausal binding relation poses problems for those approaches, as a long-distance antecedence relation crosses a phase boundary. Taking long-distance bound subject anaphors as the main empirical focus of this thesis, I show that the cross-clausal binding relation in Malki Arabic is not bona fide evidence against reducing binding domains to phases. Following Wurmbrand (2019) and Lohninger et al. (2022), I propose that constructions with long-distance bound subject anaphors theoretically resemble cross-clausal A-dependencies, such as hyperraising and long-distance agreement, in that they undergo movement to a position at the edge of the embedded clause and show similar properties. Third, I show that reducing binding domains to whole phases is plausible, but taking spell-out domains as binding domains is untenable. Finally, the proposal in this thesis also sheds light on the possibility that the anaphor agreement effect is an interface condition, in addition to offering an account of the accusative-marked embedded subject in Modern Standard Arabic.

Item Restricted
Arabic Diacritics and Reading: A Proposed Psycholinguistic Approach to Foreign/Second-Language Pedagogy (Saudi Digital Library, 2023)
Alqazlan, Bandar; Morkus, Nader

Arabic orthography is mainly presented either in shallow orthography (with all diacritics) for novice students or in deep orthography (without diacritics) for superior readers. However, shallow orthography is heavily loaded with diacritics, which may burden the reading process, whereas deep orthography can cause ambiguity (heterophonic homographic words).
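To make the shallow/deep contrast concrete: deep orthography can be produced from shallow orthography by stripping the Arabic combining marks that encode short vowels and related signs. The sketch below is only an illustration of that distinction, not material from the thesis.

```python
import re

# Arabic diacritic (tashkeel) combining marks: fathatan..sukun (U+064B-U+0652)
# plus the dagger alif (U+0670). Removing them turns shallow orthography
# (fully diacritized) into deep orthography (undiacritized).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def to_deep_orthography(shallow_text: str) -> str:
    """Strip all short-vowel and related marks from diacritized Arabic text."""
    return DIACRITICS.sub("", shallow_text)

shallow = "كَتَبَ الطَّالِبُ الدَّرْسَ"   # "The student wrote the lesson", fully diacritized
print(to_deep_orthography(shallow))      # -> كتب الطالب الدرس
```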
Building upon the findings of current psycholinguistic research, this study introduces a systematic approach to addressing the issue of diacritics and reading effectively and economically. The proposed approach begins with shallow orthography for new words during the first six to twelve times they are encountered, to ensure lexical internalization, and then transitions to semi-deep orthography, a newly coined term for orthography in which only the needed diacritics are used. Semi-deep orthography is applied according to two principles: word frequency and ambiguity within the root-pattern system. The first principle is word frequency: the top 5,000 high-frequency words, accounting for approximately 90% of written discourse, do not need diacritics. The second principle is ambiguity within the root-pattern system, since this system produces nearly 85% of Arabic vocabulary and thus provides the basic unwritten-vowel framework required for reading. Occasionally, however, ambiguity emerges within the system, for example when diacritics are required to distinguish between the active and passive forms of a verb.
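As a rough illustration of this progression (an assumption-laden sketch, not an implementation from the study), the per-word decision logic might look as follows. The encounter counts, frequency list, and ambiguity flags are placeholders, and the choice to let ambiguity override the frequency rule is an assumption.

```python
from typing import Optional

def diacritization_for(encounters: int,
                       frequency_rank: Optional[int],
                       ambiguous: bool,
                       internalization_threshold: int = 12) -> str:
    """Decide how much diacritization a word receives under the proposed approach.

    encounters:      how many times the learner has already met this word.
    frequency_rank:  rank in a word-frequency list (None if absent); the top
                     5000 words are assumed to cover ~90% of written discourse
                     and to need no diacritics.
    ambiguous:       True if the undiacritized form is ambiguous within the
                     root-pattern system (e.g., active vs. passive verb forms).
    """
    if encounters < internalization_threshold:
        return "full diacritics"            # shallow orthography for new words
    if ambiguous:
        return "disambiguating diacritics"  # assumed to override the frequency rule
    if frequency_rank is not None and frequency_rank <= 5000:
        return "no diacritics"              # high-frequency words are read bare
    return "needed diacritics only"         # default semi-deep treatment

# A frequent, unambiguous word that has already been seen twenty times:
print(diacritization_for(encounters=20, frequency_rank=310, ambiguous=False))
# -> no diacritics
```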