AI-GENERATED TEXT DETECTOR FOR ARABIC LANGUAGE
Date
2024-08
Publisher
University of Bridgeport
Abstract
The rise of AI-generated texts (AIGTs), particularly with the arrival of advanced language models like ChatGPT, has spurred a growing need for effective detection methods. While these models offer various beneficial applications, their potential for misuse, such as facilitating plagiarism and generating fake textual content, raises significant ethical concerns. These concerns have sparked extensive academic research into detecting AIGTs, and efforts to mitigate misuse include commercial platforms such as Turnitin and GPTZero. Notably, most evaluations of current AI detectors have focused on English or other languages written in Latin-derived scripts. The effectiveness of existing AI detectors is notably hampered when processing Arabic texts because of the unique challenges posed by the language's diacritics, which are small marks placed above or below letters to indicate pronunciation. These diacritics can cause human-written texts (HWTs) to be misclassified as AIGTs. Recognizing the limitations of current detectors, this research first established a baseline performance assessment by evaluating existing detection systems, such as OpenAI Text Classifier and GPTZero, on a newly developed benchmark dataset of Arabic texts containing both HWTs and AIGTs. This evaluation highlighted critical weaknesses in existing detectors' ability to handle diacritics and to differentiate between HWTs and AIGTs, particularly in essay-length texts. To address these limitations, this research introduces a novel AI text detector designed explicitly for Arabic, leveraging transformer-based pre-trained models trained on several novel datasets. The resulting detector significantly outperforms existing detection models in accurately identifying both HWTs and AIGTs in Arabic. Although the research focused on Arabic because of its unique writing challenges, the detector architecture is adaptable to other languages.
Keywords
AI, AI-GENERATED TEXTS, ARABIC DETECTOR, AI DETECTOR, SYNTHETIC TEXTS DETECTOR