Towards A Unified Framework for Computational Processing of Linguistic Code Switching

Publisher: Saudi Digital Library

Abstract

Linguistic Code Switching (CS) occurs when multilingual speakers alternate between two or more languages or dialects within a single conversation. CS can occur at various linguistic levels: at sentence boundaries, within a single utterance, or even within words at the morpheme level. Code switching is prevalent in multilingual societies; recent studies indicate that as much as 20% of user-generated content from some regions, such as Singapore, parts of Europe, and South Asia, is code-switched (Choudhury and Bali, 2019). CS content poses challenges for natural language processing (NLP) technologies: traditional techniques trained on one language quickly break down when the input includes two or more languages. The performance of NLP enabling technologies that normally yield good results, such as Part-of-Speech tagging and Base Phrase Chunking, degrades in proportion to the amount and depth of language mixing present. Accurate enabling technologies are important components for improving the performance of downstream NLP applications such as Machine Translation, Sentiment Analysis, and Information Extraction. For each of these applications, Part-of-Speech tagging (POS), Language Identification (LID), Base Phrase Chunking (BPC), and Tokenization (TOK) systems are used to pre-process the data and thus provide more information to the overall system. The main objective of this thesis is to investigate the feasibility and viability of building a unified learning model that can automatically process noisy CS data and to establish a basis for future research in CS. To this end, we develop novel models targeting NLP enabling technologies for processing code-switched data.
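To make the token-level LID task concrete, the toy sketch below tags each token of an intra-sentential English-Spanish code-switched utterance with a coarse language label. It is a deliberately naive lexicon-lookup heuristic, not the learned models developed in the thesis, and the word lists are hypothetical stand-ins for real lexicons.

```python
# Toy token-level language identification (LID) for code-switched text.
# Illustrative only: a real LID system uses learned, context-aware models.
# These tiny wordlists are hypothetical stand-ins for full lexicons.
EN_WORDS = {"i", "love", "this", "so", "much"}
ES_WORDS = {"pero", "no", "tengo", "tiempo", "hoy"}

def tag_tokens(sentence: str) -> list[tuple[str, str]]:
    """Assign a coarse language tag ('en', 'es', or 'unk') to each token."""
    tags = []
    for tok in sentence.lower().split():
        if tok in EN_WORDS:
            tags.append((tok, "en"))
        elif tok in ES_WORDS:
            tags.append((tok, "es"))
        else:
            # Unseen token: a learned model would fall back on subword
            # features and surrounding context rather than guessing.
            tags.append((tok, "unk"))
    return tags

# An English-Spanish switch inside one utterance:
print(tag_tokens("I love this pero no tengo tiempo"))
```

Even this crude heuristic shows why LID is a natural pre-processing step: once each token carries a language label, downstream taggers can condition on it instead of treating the mixed input as one unknown language.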
Our models cover the spectrum from classical machine learning algorithms that depend on hand-crafted feature engineering to deep neural networks that leverage representation learning. Classical machine learning algorithms rely on human expertise to define features that capture complex CS patterns, which demands intensive human effort. Thus, a special focus is given to deep learning approaches that can automatically learn CS patterns at different levels of linguistic complexity. We investigate various feature representation techniques that incorporate the relations between the languages involved in code-switched data. Additionally, we develop novel models based on two approaches: transfer learning and multi-task learning. These paradigms are appealing for CS data because they incorporate information from related tasks and languages, allow the models to learn more robust parameters for each task, and adapt well to low-resource scenarios. We systematically study these approaches' ability to process CS data, targeting the four following enabling technologies: Language Identification (LID), Tokenization (TOK), Part-of-Speech tagging (POS), and Base Phrase Chunking (BPC). Moreover, we develop a hierarchical multi-task learning (HMTL) architecture combining the four tasks. To the best of our knowledge, our BPC system is the first developed for CS data. We demonstrate that HMTL models and pre-trained language models (e.g., BERT and ELMo) are complementary technologies that are effective at processing CS data. Our results show that combining pre-trained feature representations extracted from ELMo with fastText embeddings significantly enhances the performance of the LID and POS tasks, outperforming the current state of the art.
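The hierarchical multi-task idea can be sketched in miniature: per-token features are formed by concatenating a contextual (ELMo-style) vector with a subword (fastText-style) vector, and each higher-level task receives the predictions of the tasks below it (LID, then POS, then BPC). The embeddings and per-task "models" below are hypothetical toy stand-ins, not the thesis's trained neural networks.

```python
# Minimal sketch of a hierarchical multi-task (HMTL) pipeline for CS data.
# All vectors and task models here are toy placeholders for illustration.

def concat_features(elmo_vec, fasttext_vec):
    """Combine a contextual (ELMo-style) vector with a subword (fastText-style) vector."""
    return elmo_vec + fasttext_vec

def hierarchical_pipeline(tokens, elmo, fasttext, lid_model, pos_model, bpc_model):
    feats = [concat_features(elmo[t], fasttext[t]) for t in tokens]
    lid = lid_model(feats)            # lowest task: language identification
    pos = pos_model(feats, lid)       # POS tagging sees the LID predictions
    bpc = bpc_model(feats, lid, pos)  # chunking sees both lower tasks
    return lid, pos, bpc

# Toy demonstration with dummy embeddings and hard-coded rule stubs:
tokens = ["I", "tengo", "time"]
elmo = {t: [0.1, 0.2] for t in tokens}      # stand-in contextual vectors
fasttext = {t: [0.3] for t in tokens}       # stand-in subword vectors
lid = lambda feats: ["en", "es", "en"]
pos = lambda feats, lid: ["PRON", "VERB", "NOUN"]
bpc = lambda feats, lid, pos: ["B-NP", "B-VP", "B-NP"]
print(hierarchical_pipeline(tokens, elmo, fasttext, lid, pos, bpc))
```

The design point the sketch illustrates is the supervision hierarchy: lower-level predictions are exposed as inputs to higher-level tasks, so the chunker can, for example, condition on both the language identity and the POS tag of each token.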


Copyright owned by the Saudi Digital Library (SDL) © 2025