Development Techniques for Large Language Models for Low Resource Languages

Date

2023-12

Publisher

University of Texas at Dallas

Abstract

Recent advances in Natural Language Processing (NLP) driven by large language models have transformed many sectors that rely on extensive text-based research. This dissertation presents techniques for building domain-specific large language models for low-resource languages, in support of researchers engaged in large-scale text analysis. The models focus on politics, conflict, and violence in the Middle East and Latin America, and are domain-specific, pre-trained large language models in Arabic and Spanish. While developing these models, we construct a range of downstream tasks, including named entity recognition, binary classification, multi-label classification, and question answering. We also lay out a roadmap for creating domain-specific large language models. Our core objective is to contribute NLP strategies and methodologies that overcome the challenges posed by low-resource languages. This contribution includes curating an extensive corpus of texts on regional politics and conflicts in Spanish and Arabic, which enriches research on large language models for low-resource languages. We evaluate our models against the Bidirectional Encoder Representations from Transformers (BERT) model as a baseline. Our findings show that domain-specific pre-trained language models markedly improve the performance of NLP models in politics and conflict analysis, in both Arabic and Spanish and across diverse types of downstream tasks. Our work thus gives researchers working on large language models for low-resource languages effective tools. It also offers political and conflict analysts, including policymakers, scholars, and practitioners, new approaches and instruments for understanding the dynamics of local politics and conflicts directly in Arabic and Spanish.

Keywords

Software Engineering, Large Language Models, Natural Language Processing, Development Cycle

Copyright owned by the Saudi Digital Library (SDL) © 2025