A CODE-MIXING TRANSLITERATION MODEL TO IMPROVE HATE SPEECH DETECTION IN THE SAUDI DIALECT TWEETS
No Thumbnail Available
Date
2024
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Universiti Malaya
Abstract
Technological developments over the past few decades have changed the way people
communicate, with platforms like social media and blogs becoming vital channels for
international conversation. Even though hate speech is vigorously suppressed on social
media, it is still a concern that needs to be constantly recognized and observed. Although
great efforts have been made in this area for English-language social media content, but
for Arabic language, the detection of hate speech still has many specific difficulties.
Arabic calls for particular consideration when it comes to hate speech detection, because
of its many dialects and linguistic nuances. Another degree of complication is added by
the widespread practice of "code-mixing," in which users merge various languages
smoothly. Recognizing this research vacuum, the study aims to close it by examining how
well machine learning models containing variation features can detect hate speech,
especially when it comes to Arabic tweets featuring code-mixing. Therefore, the objective
of this study is to assess and compare the effectiveness of different features and machine
learning models for hate speech detection on Arabic hate speech emoji, and code-mixing
hate speech datasets. To achieve the objectives, the methodology used includes data
collection, data pre-processing, feature extraction, the construction of classification
models, and the evaluation of the constructed classification models. The findings from
the analysis revealed that the Term Frequency-Inverse Document Frequency (TF-IDF)
feature, when employed with the Stochastic Gradient Descent (SGD) model, attained the
highest accuracy, reaching 98.21% on code-mixing transliteration dataset. The findings
from the analysis also revealed that the highest accuracy of 99% was attained on emoji
transliteration dataset. Subsequently, these results were contrasted with outcomes from
three baseline studies, and the proposed transliteration learning model on both the code
mixing and emoji outperformed them, underscoring the significance of the proposed
models. Consequently, this study carries practical implications and serves as a
foundational exploration in the realm of automated hate speech detection in text.
Description
Keywords
Hate speech, Natural language processing, Arabic language, Code-mixing, Machine learning models.