GPT-4 attempting to attack AI-text detectors

Date

2024-07-10

Authors

Alshehri, Nojoud

Publisher

University of Adelaide

Abstract

Recent large language models (LLMs) generate machine-written content across a wide range of channels, including news, social media, and educational settings. The challenge of differentiating AI-generated content from human-written content has raised concerns about the potential misuse of LLMs. Academic integrity risks have become a growing concern because these models can be used to complete assignments and write essays. Therefore, many detection tools have been developed to distinguish AI-generated from human-generated texts. However, the effectiveness of these tools against attack strategies and adversarial perturbations has not been adequately validated, particularly in the context of student essay writing. In this work, we use the GPT-4 model to apply a series of perturbations to an essay originally generated by GPT-4 in order to confuse three AI detectors: GPTZero, DetectGPT, and ZeroGPT. The proposed attack technique produces an adversarial text sample used to examine the effect on the detectors' accuracy. The results demonstrate that using GPT-4 to rephrase text and apply perturbations at the sentence and word levels can confuse the detection models and reduce their prediction probabilities. Moreover, the final essay, after the series of perturbations, retains a reasonable degree of both writing quality and semantic similarity with the original GPT-generated essay. This project provides insights for improving the robustness of AI detectors and for future studies on AI-generated text classification.
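The attack loop the abstract describes can be sketched in outline: apply perturbations in sequence, reject any step that drifts too far from the original essay, and stop once the detector's AI-probability falls below a threshold. The sketch below is illustrative only; `perturb_fns` and `detector` are hypothetical stand-ins for the GPT-4 rephrasing steps and the GPTZero/DetectGPT/ZeroGPT scores used in the actual work, and token-level Jaccard overlap stands in for the semantic-similarity measure, which the abstract does not specify.

```python
import re


def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard overlap: a crude, self-contained stand-in
    for the semantic-similarity check described in the abstract."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def attack(essay, perturb_fns, detector, threshold=0.5, min_similarity=0.7):
    """Apply perturbations in sequence until the detector's AI-probability
    drops below `threshold`, rejecting any candidate that loses too much
    similarity with the original essay."""
    current = essay
    for perturb in perturb_fns:
        candidate = perturb(current)
        if jaccard_similarity(essay, candidate) < min_similarity:
            continue  # perturbation changed the text too much; skip it
        current = candidate
        if detector(current) < threshold:
            break  # detector now scores the text as likely human-written
    return current
```

With toy perturbations (word substitutions) and a toy detector that keys on a single word, the loop stops as soon as the detector score drops, leaving later perturbations unapplied.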

Keywords

LLM, AI-generated text, AI-text detectors.

Copyright owned by the Saudi Digital Library (SDL) © 2024