AUTOMATIC IDENTIFICATION OF ARABIC DIALECTS

Abstract

Dialect Identification (DID) is a Natural Language Processing task, and Arabic Dialect Identification (ADID) classifies Arabic texts according to their dialect. This dissertation investigates the impact of different pre-processing methods on the automatic classification of Arabic texts into four dialect categories, namely Levantine (LAV), North African (NOR), Egyptian (EGY), and Gulf (GLF), as well as Modern Standard Arabic (MSA). Two pre-processing methods are examined: 1) word segmentation using the Stanford Segmenter, and 2) text vocalization (adding diacritics to text) using Mishkal. A variety of features are used, including word n-grams, character n-grams, part-of-speech (POS) tags, and function word frequencies. Texts are classified into dialect categories by two supervised classifiers: Logistic Regression (LR) and Naïve Bayes (NB). Two Arabic dialect datasets are used: the Arabic Online Commentary (AOC) dataset released by Zaidan and Callison-Burch (2014) and modified by El-Haj et al. (2018), and the Dialectal Arabic Tweets (DARTS) dataset released by Alsarsour et al. (2018). The findings are: 1) with word n-gram features, classification of word-segmented text reaches accuracy close to that obtained on clean data; 2) text vocalization might improve dialect identification with character n-gram features by up to 4.2% on the AOC dataset and up to 8.37% on the DARTS dataset; 3) applying the Stanford Segmenter to Arabic text in combination with Stanford POS tags produced better classification accuracy.
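
To make the character n-gram and Naïve Bayes pipeline described in the abstract concrete, the following is a minimal sketch in plain Python. It is not the dissertation's implementation: the function `char_ngrams`, the class `NaiveBayesDialectClassifier`, the n-gram range of 1 to 3, the add-one smoothing, and the toy sentences are all assumptions chosen for illustration. The sentences are not drawn from AOC or DARTS, and no segmentation or vocalization step is applied here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max.

    The text is padded with one space on each side so that
    word-initial and word-final characters form their own n-grams.
    """
    padded = f" {text.strip()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NaiveBayesDialectClassifier:
    """Multinomial Naive Bayes over character n-gram counts (illustrative)."""

    def __init__(self, n_min=1, n_max=3, alpha=1.0):
        self.n_min = n_min
        self.n_max = n_max
        self.alpha = alpha  # additive (Laplace) smoothing, an assumption
        self.label_counts = Counter()
        self.ngram_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n_min, self.n_max)
            self.label_counts[label] += 1
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n_min, self.n_max)
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.ngram_counts[label]
            total = sum(counts.values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(doc_count / total_docs)
            for gram in grams:
                score += math.log(
                    (counts[gram] + self.alpha)
                    / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical examples, not from AOC or DARTS).
train_texts = [
    "انا عايز اروح البيت دلوقتي",   # Egyptian
    "ابي اروح البيت الحين",          # Gulf
    "بدي روح عالبيت هلق",            # Levantine
    "أريد أن أذهب إلى البيت الآن",   # Modern Standard Arabic
]
train_labels = ["EGY", "GLF", "LAV", "MSA"]

model = NaiveBayesDialectClassifier().fit(train_texts, train_labels)
print(model.predict("عايز اكل دلوقتي"))
```

With only one sentence per class, the prediction above is a demonstration of the mechanics rather than a meaningful result. A realistic experiment would train on a full dataset and could substitute Logistic Regression for the Naïve Bayes scoring step.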

Copyright owned by the Saudi Digital Library (SDL) © 2025