Saudi Cultural Missions Theses & Dissertations

Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10

  • Restricted item
    English-Arabic Cross-Language Plagiarism Detection
    (2022) Alotaibi, Naif; Joy, Mike
    Advances in information technology have contributed to the rapid growth of digital text libraries and automatic machine translation systems. Machine translation tools make it easy to translate text from one language into another. Together, these developments have increased the amount of content accessible in different languages, which in turn makes it easy to commit translated plagiarism, known as “cross-language plagiarism”. Identifying plagiarism between texts in different languages is more challenging than recognizing it within a corpus written in a single language. This research proposes a new framework for enhancing English-Arabic cross-language plagiarism detection at the sentence level. The framework comprises two phases: the first is feature extraction, and the second is plagiarism detection based on a supervised machine learning classification model. Phase one is concerned with extracting features from English-Arabic cross-language sentence pairs, for which we propose approaches to extracting sets of features at the lexical, semantic and syntactic levels. This phase involves two components. The first relies on translation plus a monolingual pre-trained word embedding model, integrated with term frequency-inverse document frequency (TF-IDF) and part-of-speech (POS) schemes, as well as word order information. The second component employs a pre-trained multilingual model to determine semantic relatedness between cross-language sentence pairs. In the second phase, we apply and examine various supervised machine learning classifiers, using the extracted features and combinations of those features, to classify sentences as either plagiarized or non-plagiarized. Each phase was assessed using different datasets.
    The experimental results for phase one on different benchmark datasets, such as SemEval-2017, show that the proposed feature extraction methods improve on the baselines and other methods. Analysis of the experimental data for phase two demonstrates that using the extracted features and their combinations with various supervised machine learning classification methods achieves promising results. Ultimately, combining the extracted features with a supervised ensemble machine learning classifier achieves the best classification results.
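    The first feature-extraction component described in the abstract (translation, then a monolingual word embedding model weighted by TF-IDF) can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes the Arabic sentence has already been machine translated into English and tokenized, and it uses a hypothetical TOY_VECTORS table in place of a real pre-trained embedding model. The function names and the smoothed IDF formula are likewise assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional "embeddings"; a real system would load a
# pre-trained monolingual word embedding model instead.
TOY_VECTORS = {
    "student": [0.9, 0.1, 0.0],
    "pupil": [0.85, 0.15, 0.05],
    "copied": [0.1, 0.9, 0.1],
    "plagiarized": [0.15, 0.85, 0.2],
    "essay": [0.0, 0.2, 0.9],
    "report": [0.1, 0.25, 0.85],
    "weather": [-0.8, 0.1, -0.3],
    "sunny": [-0.7, -0.2, -0.4],
    "today": [-0.5, 0.0, -0.6],
}


def idf_weights(corpus):
    """Smoothed inverse document frequency for every token in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for sent in corpus for tok in set(sent))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def sentence_vector(tokens, idf):
    """TF-IDF weighted sum of the word vectors of a tokenized sentence."""
    dim = len(next(iter(TOY_VECTORS.values())))
    vec = [0.0] * dim
    for tok, count in Counter(tokens).items():
        if tok not in TOY_VECTORS:
            continue  # skip out-of-vocabulary tokens
        weight = count * idf.get(tok, 1.0)
        for i, x in enumerate(TOY_VECTORS[tok]):
            vec[i] += weight * x
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_similarity(tokens_a, tokens_b, idf):
    """One semantic feature for a sentence pair."""
    return cosine(sentence_vector(tokens_a, idf), sentence_vector(tokens_b, idf))


# Assumed input: an English sentence, the English translation of a
# suspicious Arabic sentence, and an unrelated sentence.
source = ["student", "copied", "essay"]
translated_suspect = ["pupil", "plagiarized", "report"]
unrelated_sentence = ["weather", "sunny", "today"]

idf = idf_weights([source, translated_suspect, unrelated_sentence])
related = semantic_similarity(source, translated_suspect, idf)
unrelated = semantic_similarity(source, unrelated_sentence, idf)
```

    In the framework the abstract describes, a score like this would be only one of several lexical, semantic and syntactic features, and the decision between plagiarized and non-plagiarized would be made by a trained supervised classifier rather than by inspecting a single score.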

Copyright owned by the Saudi Digital Library (SDL) © 2025