Automatic Essay Scoring in Arabic: Development, Evaluation, and Advanced Techniques

Date

2025

Publisher

University of Bristol

Abstract

Automated Essay Scoring (AES) has advanced considerably due to recent progress in natural language processing (NLP). This thesis examines key challenges in AES, with a particular focus on the Arabic language, and proposes practical approaches informed by both computational techniques and educational theory. First, the research investigates how the formulation of essay questions affects the accuracy of automated scoring systems. A set of question-design criteria, derived from educational principles, is introduced and empirically tested. Experiments show that adherence to these criteria can significantly improve AES performance, with improvements of up to 40% observed using BERT-based models for English essays.

Given the limited resources for Arabic AES, this thesis introduces the AR-AES dataset, consisting of 2,046 essays from undergraduate students across multiple courses, annotated independently by two university instructors. This resource alleviates the scarcity of Arabic-language datasets for AES, supporting model development and evaluation. Experimental analyses using pretrained Arabic NLP models demonstrate that transformer-based approaches achieve the highest levels of agreement with human scores; in many cases, their predictions agree with the gold scores more closely than the two human annotators agree with each other. This level of agreement indicates that, under appropriate conditions, the proposed AES system may be suitable for assisting human markers in real-world educational settings.

Additionally, the thesis explores the potential of large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, for Arabic AES. Experiments with different training approaches (zero-shot, few-shot, and fine-tuning) demonstrate the importance of prompt engineering. A mixed-language prompting strategy, combining Arabic essays with English scoring guidelines, was found to notably enhance model performance. Nonetheless, fine-tuned AraBERT consistently yielded the strongest results, indicating that LLMs may not yet be the most effective option for Arabic AES tasks when training data is limited.

Finally, an active learning framework is introduced, integrating AraBERT with uncertainty- and diversity-based sampling strategies. This human-in-the-loop approach prioritises the essays that would benefit most from expert review, reducing the need for extensive manual annotation while preserving high scoring accuracy. Rather than replacing human markers, the system complements their efforts, offering a more efficient and consistent approach to large-scale essay evaluation.

Overall, this thesis advances AES by introducing explicit criteria for effective essay question design, while also addressing specific challenges in Arabic AES. It contributes a comprehensively annotated dataset, presents a systematic evaluation of state-of-the-art NLP models, and integrates active learning to balance automated scoring accuracy and human involvement.
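
The uncertainty- and diversity-based sampling described in the final contribution can be illustrated with a minimal sketch. The code below is not taken from the thesis: it assumes hypothetical inputs (probs and embeddings, mocked with random data) standing in for the softmax score-band probabilities and [CLS] vectors of a fine-tuned AraBERT classifier, and the helper select_for_review is an illustrative name. It combines entropy-based uncertainty with k-means clustering for diversity to queue essays for expert marking.

    # Minimal sketch, not the thesis implementation: pick essays for human
    # review in one active learning round by combining uncertainty (entropy)
    # with diversity (one pick per k-means cluster).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    n_essays, n_bands, dim = 200, 5, 768
    logits = rng.normal(size=(n_essays, n_bands))
    # Mock softmax outputs of a score-band classifier (assumption: AraBERT head).
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Mock essay embeddings (assumption: AraBERT [CLS] vectors).
    embeddings = rng.normal(size=(n_essays, dim))

    def select_for_review(probs, embeddings, budget=10):
        """Pick `budget` essays: cluster embeddings for diversity, then take
        the highest-entropy (most uncertain) essay from each cluster."""
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # uncertainty per essay
        clusters = KMeans(n_clusters=budget, n_init=10, random_state=0).fit_predict(embeddings)
        chosen = []
        for c in range(budget):
            members = np.where(clusters == c)[0]
            chosen.append(members[np.argmax(entropy[members])])  # most uncertain in this cluster
        return sorted(chosen)

    print("Essays queued for expert marking:", select_for_review(probs, embeddings))

In practice the selected essays would be scored by the human markers, added to the training set, and the model retrained before the next sampling round.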

Keywords

Automated Essay Scoring (AES), Arabic Dataset, Arabic, AraBERT, AI, Natural Language Processing (NLP)

Copyright owned by the Saudi Digital Library (SDL) © 2025