Cross-Lingual Transfer Learning for Arabic Sentiment Analysis

dc.contributor.advisor: Lauria, Stasha
dc.contributor.author: Bin Owayn, Najd Mohammed
dc.date.accessioned: 2025-11-17T04:43:00Z
dc.date.issued: 2025
dc.description.abstract: This dissertation presents a comprehensive investigation into the efficacy of cross-lingual transfer learning for Arabic sentiment analysis in low-resource contexts. The study rigorously compares the performance of a multilingual transformer model, XLM-RoBERTa (XLM-R), against a monolingual Arabic-specific model, CAMeLBERT, under varying data availability conditions, specifically zero-shot and few-shot learning paradigms. The primary objective is to identify the most effective and efficient modeling approach for accurate sentiment analysis when only limited Arabic training data is accessible.

The research addresses the inherent challenges of Arabic sentiment analysis, including its complex morphology, pervasive dialectal variation, and the scarcity of large annotated datasets. Utilizing a publicly available Arabic Company Reviews Dataset, the study systematically evaluates model performance across incrementally increasing amounts of labeled data: zero-shot application, and fine-tuning with 100, 500, and 1000 samples. This controlled experimental design allows a direct, data-driven comparison of the models' efficiency and effectiveness.

Key findings demonstrate that XLM-R exhibits remarkable zero-shot capabilities, achieving an accuracy of 0.829 and an Area Under the Curve (AUC) of 0.921 without any direct fine-tuning on Arabic sentiment data. This underscores the power of large-scale multilingual pre-training in fostering language-agnostic sentiment understanding. With the introduction of limited Arabic labeled data, XLM-R's performance improved further, reaching an accuracy of 0.886 and an AUC of 0.942 with 1000 samples. The most substantial gains for XLM-R were observed during the initial stages of few-shot fine-tuning, highlighting its high data efficiency.

In contrast, CAMeLBERT, designed as a monolingual Arabic model, showed poor zero-shot performance (accuracy 0.275, AUC 0.522), as anticipated given its specialization in Arabic linguistic structures rather than cross-lingual transfer. However, CAMeLBERT demonstrated exceptional adaptability and rapid improvement with few-shot fine-tuning: with only 100 labeled Arabic samples, its accuracy surged to 0.814 and its AUC to 0.913. Its performance continued to improve, eventually approaching XLM-R's levels at 1000 samples (accuracy 0.868, AUC 0.936). This indicates that while monolingual models need some target-language data to become effective, they can quickly leverage their deep understanding of Arabic to achieve competitive results.

Learning-curve analysis revealed that for both models the most significant improvements occurred between the zero-shot and 100-sample conditions, with diminishing returns as the training data size increased further. This finding is crucial for practitioners: a relatively small investment in data annotation can yield substantial performance gains, while further extensive annotation may offer only marginal improvements.

In conclusion, this dissertation provides a data-driven cost-benefit analysis for practitioners navigating Arabic sentiment analysis in resource-constrained environments. It demonstrates that while monolingual models like CAMeLBERT can achieve competitive performance with modest amounts of labeled Arabic data, multilingual models like XLM-R offer a superior starting point with strong zero-shot capabilities and maintain a statistically significant edge even with limited fine-tuning data. This research contributes to a more nuanced understanding of the practical utility of cross-lingual transfer learning, advocating for its strategic adoption in scenarios where extensive Arabic data annotation is not feasible. Future work includes investigating domain-specific pre-training, exploring advanced few-shot learning techniques, and incorporating explicit dialectal Arabic analysis.
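To make the data-budget protocol above concrete, the sketch below fine-tunes each model on 0, 100, 500, and 1000 labeled reviews and reports accuracy and AUC on a held-out set. This is a minimal illustration under stated assumptions, not the dissertation's actual code: the Hugging Face checkpoints (xlm-roberta-base and the CAMeL-Lab/bert-base-arabic-camelbert-mix variant), the binary label scheme, the hyperparameters, and the placeholder review data are all assumptions, and the prior source-language sentiment fine-tuning implied by XLM-R's zero-shot condition is elided.

# A minimal sketch of the few-shot comparison (assumed setup; see note above).
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, roc_auc_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_IDS = {
    "XLM-R": "xlm-roberta-base",
    # CAMeLBERT ships in several variants (mix/msa/da/ca); "mix" is assumed.
    "CAMeLBERT": "CAMeL-Lab/bert-base-arabic-camelbert-mix",
}

# Placeholder stand-in for the Arabic Company Reviews Dataset (loading elided;
# substitute the real review texts and 0/1 sentiment labels here).
train_texts, train_labels = ["خدمة ممتازة", "تجربة سيئة"] * 500, [1, 0] * 500
test_texts, test_labels = ["منتج رائع", "توصيل متأخر"] * 50, [1, 0] * 50

def evaluate_condition(model_id, n_samples):
    """Fine-tune on the first n_samples reviews, then score the test set.

    n_samples == 0 reproduces the zero-shot condition: the model sees no
    Arabic sentiment labels before evaluation.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=2  # binary positive/negative labels assumed
    )

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    test_ds = Dataset.from_dict(
        {"text": test_texts, "label": test_labels}
    ).map(tokenize, batched=True)

    train_ds = None
    if n_samples > 0:
        train_ds = Dataset.from_dict(
            {"text": train_texts[:n_samples],
             "label": train_labels[:n_samples]}
        ).map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="./runs", num_train_epochs=3,
                               per_device_train_batch_size=16,
                               report_to="none"),
        train_dataset=train_ds,
    )
    if train_ds is not None:
        trainer.train()

    logits = trainer.predict(test_ds).predictions
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return (accuracy_score(test_labels, probs.argmax(axis=1)),
            roc_auc_score(test_labels, probs[:, 1]))

# Learning-curve loop over the same data budgets as the study.
for name, model_id in MODEL_IDS.items():
    for n in (0, 100, 500, 1000):
        acc, auc = evaluate_condition(model_id, n)
        print(f"{name} n={n}: accuracy={acc:.3f}, AUC={auc:.3f}")

The diminishing-returns pattern the abstract reports would appear here as a shrinking gap between successive values of n; in practice one would also fix random seeds and average over several few-shot subsets, since 100-example samples are high-variance.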
dc.format.extent: 70
dc.identifier.uri: https://hdl.handle.net/20.500.14154/77006
dc.language.iso: en
dc.publisher: Saudi Digital Library
dc.subject: LLM
dc.subject: Transfer Learning
dc.subject: Sentiment Analysis
dc.title: Cross-Lingual Transfer Learning for Arabic Sentiment Analysis
dc.type: Thesis
sdl.degree.department: Computer Science
sdl.degree.discipline: Data Science and Analytics
sdl.degree.grantor: Brunel University
sdl.degree.name: Master of Science

Files

Original bundle

Name: SACM-Dissertation.pdf
Size: 3.08 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.61 KB
Description: Item-specific license agreed to upon submission

Copyright owned by the Saudi Digital Library (SDL) © 2026