Arabic Short Texts Authorship Verification
Saudi Digital Library
Authorship verification is the process of determining whether two pieces of writing were written by the same author by comparing their writing styles. Technically, it is a branch of the authorship analysis problem and is treated as a text classification task with a binary (Yes or No) output. Despite the widespread use of Twitter in the Arab world, short-text research on authorship verification has so far focused on languages other than Arabic, such as English, Spanish, and Greek. Arabic, with its complex morphology, lack of capitalisation, and short vowels, presents unique linguistic challenges for verifying authorship. This thesis seeks to address that gap by applying different machine learning and deep learning techniques, with a focus on extracting the most effective features, to solve the problem of authorship verification for Arabic short texts. Due to the lack of publicly available data for this task, an Arabic Twitter corpus was compiled for 100 users, with a minimum of 1,000 tweets and a maximum of 3,000 tweets per user. Different features were used in order to identify the most predictive ones for authorship verification of Arabic short texts (specifically tweets). This study explores the impact of using different textual features, such as stylometric features, Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words (BoW), and n-grams. A novel Arabic knowledge-base model (AraKB) was created to enhance authorship verification for challenging Arabic short texts, and it yielded promising results. In addition, different deep learning techniques were tested to assess their impact on authorship verification. Long Short-Term Memory (LSTM) and Arabic Bidirectional Encoder Representations from Transformers (AraBERT) were applied separately and resulted in different performance outcomes.
In addition, an analysis was carried out to see how metadata from Twitter posts, such as time and device source, can help to verify users more accurately. The experiments were conducted using four machine learning algorithms: Gradient Boosting, Random Forest, Support Vector Machine, and k-Nearest Neighbour. Performance was measured using the metrics most commonly used for authorship analysis tasks: accuracy, precision, recall, and F1 score. The results provide evidence of the importance of choosing the right features for the given texts, and indicate that no single feature generalises to all types of data. To the best of the researcher’s knowledge, no previous study has addressed authorship verification of Arabic social media texts. This study suggests that the ability to verify users on social media platforms provides solutions to various forensic and safety issues, and helps to prevent the use of fake identities for fraud, bullying, terrorism, and violence. This research is significant for digital forensic investigation and cyber safety.
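As an illustration only, the verification task and the four reported metrics can be sketched in a few lines of Python. This is not the pipeline used in the thesis: raw character n-gram counts with a cosine-similarity threshold stand in for the feature sets and classifiers listed above, and every name here (char_ngrams, cosine, same_author, binary_metrics, the 0.5 threshold) is a hypothetical choice made for the example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (works on any Unicode text)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def same_author(text_a, text_b, threshold=0.5, n=3):
    """Binary (Yes/No) decision; the threshold is an assumed, untuned value."""
    return cosine(char_ngrams(text_a, n), char_ngrams(text_b, n)) >= threshold


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive (same-author) class."""
    pairs = [(bool(t), bool(p)) for t, p in zip(y_true, y_pred)]
    tp = sum(t and p for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    tn = sum((not t) and (not p) for t, p in pairs)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

In practice the threshold would be replaced by a trained classifier, and the toy similarity by richer features; the sketch only shows how a pair of texts maps to a Yes/No decision and how such decisions are scored.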
Arabic language, Authorship Verification, NLP