Arabic Short Texts Authorship Verification
Date
2023-11-07
Publisher
Saudi Digital Library
Abstract
Authorship verification is the process of determining whether or not two pieces of
writing are written by the same author by comparing their writing styles. Technically,
it is a branch of the authorship analysis problem, and is considered to be a text
classification task that produces a binary (yes or no) output. Despite the widespread
usage of Twitter in the Arab world, short text research has so far focused on
authorship verification in languages other than Arabic, such as English, Spanish, and
Greek. Arabic, with its complex morphology, lack of capitalisation, and usually unwritten short vowels,
presents unique linguistic challenges to verifying authorship. This thesis seeks to
address that issue by applying different machine learning and deep learning
techniques, with a focus on extracting the most effective features to solve the problem
of authorship verification for short Arabic texts. Due to the lack of publicly
available data for this task, an Arabic Twitter corpus was compiled for 100 users, with
a minimum of 1,000 tweets and a maximum of 3,000 tweets per user. Several feature sets
were compared in order to identify the most predictive features for authorship
verification of Arabic short texts (specifically tweets). This study explores the
impact of different textual features, such as stylometric features, Term
Frequency-Inverse Document Frequency (TF-IDF), Bag of Words (BoW), and n-grams.
A novel Arabic knowledge-base model (AraKB) was created to enhance the
authorship verification of challenging Arabic short texts, and it yielded promising
results. In addition, different deep learning techniques were tested to assess their
impact on authorship verification. Long Short-Term Memory (LSTM) and Arabic Bidirectional
Encoder Representations from Transformers (AraBERT) were applied separately, and
produced different performance outcomes. An analysis was also carried out
to see how metadata from Twitter posts, such as time and device source,
can help to verify users more reliably. The experiments were conducted using four
machine learning algorithms: Gradient Boosting, Random Forest, Support Vector
Machine, and k-Nearest Neighbour. The performance was measured using the
metrics most commonly used for authorship analysis tasks: accuracy,
precision, recall, and F1 score. The results provide evidence of the importance of
choosing features suited to the given texts, and indicate that no single feature can
be generalised to all types of data. To the best of the researcher’s knowledge, no previous study
has addressed authorship verification of Arabic social media texts. This study suggests that
the ability to verify users on social media platforms offers solutions to various
forensic and safety issues, and helps prevent the use of fake identities to
commit fraud, bullying, terrorism, and violence. This research is significant for
digital forensics investigation and cyber safety.
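
As a rough illustration of the task framing described in the abstract, and not the pipeline used in the thesis, the sketch below treats verification as a yes/no decision on a pair of short texts and then scores the decisions with accuracy, precision, recall, and F1. It swaps the trained classifiers named above for a simple cosine-similarity threshold over character n-gram counts. The function names (char_ngrams, cosine, same_author, binary_metrics), the threshold of 0.5, and the example strings are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def same_author(text_a, text_b, threshold=0.5, n=3):
    """Binary decision for one pair; the threshold is an arbitrary assumption."""
    return cosine(char_ngrams(text_a, n), char_ngrams(text_b, n)) >= threshold


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = same author)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Invented toy pairs with made-up labels; not data from the thesis corpus.
    toy_pairs = [
        ("صباح الخير يا جماعة", "صباح الخير يا شباب", 1),
        ("صباح الخير يا جماعة", "الطقس اليوم جميل جدا", 0),
    ]
    labels = [label for _, _, label in toy_pairs]
    predictions = [int(same_author(a, b)) for a, b, _ in toy_pairs]
    print(binary_metrics(labels, predictions))
```

A fixed threshold on a single similarity score is far weaker than the feature sets and learning algorithms compared in the study; it is used here only to keep the example short and dependency-free.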
Keywords
Arabic language, Authorship Verification, NLP