GPT-4 attempting to attack AI-text detectors
Date
2024-07-10
Authors
Alshehri, Nojoud
Publisher
University of Adelaide
Abstract
Recent large language models (LLMs) generate machine-written content across a wide range of channels, including news, social media, and educational settings. The difficulty of differentiating AI-generated content from human-written content has raised concerns about the potential misuse of LLMs. Academic integrity risks have become a growing concern because these models can be used to complete assignments and write essays. Therefore, many detection tools have been developed to distinguish AI-generated from human-generated texts. However, the effectiveness of these tools against attack strategies and adversarial perturbations has not been adequately validated, particularly in the context of student essay writing. In this work, we use the GPT-4 model to apply a series of perturbations to an essay originally generated by GPT-4 in order to confuse three AI detectors: GPTZero, DetectGPT, and ZeroGPT. The proposed attack technique produces a text that serves as an adversarial sample for examining its effect on the detection accuracy of AI detectors. The results demonstrate that using GPT-4 to rephrase text and apply perturbations at the sentence and word level can confuse the detection models and reduce their prediction probabilities. Moreover, the final essay, after the series of perturbations, retains a reasonable degree of both writing quality and semantic similarity with the original GPT-generated essay. This project provides insights for improving the robustness of AI detectors and for future AI-generated text classification studies.
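The attack workflow the abstract describes can be sketched in miniature: apply a word-level perturbation to a text, then verify that the perturbed version stays semantically close to the original. This is a toy illustration only; the thesis performs the rephrasing by prompting GPT-4 and queries commercial detectors, whereas the substitution map, example sentence, and bag-of-words similarity below are illustrative stand-ins.

```python
from collections import Counter
import math

def word_level_perturb(text, substitutions):
    """Toy word-level perturbation: replace words per a substitution map.
    (Stands in for GPT-4-driven rephrasing in the actual study.)"""
    return " ".join(substitutions.get(w, w) for w in text.split())

def cosine_similarity(a, b):
    """Bag-of-words cosine similarity as a crude semantic-preservation check."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative essay fragment and substitution map (not from the thesis).
essay = "The rapid growth of language models raises academic integrity concerns"
subs = {"rapid": "swift", "growth": "expansion", "raises": "poses"}

perturbed = word_level_perturb(essay, subs)
similarity = cosine_similarity(essay, perturbed)
```

In the full attack, the loop would continue perturbing (at the sentence and word level) while a detector's prediction probability remains high, stopping once the probability drops below a threshold and the similarity check still passes.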
Keywords
LLM, AI-generated text, AI-text detectors