Machine learning for spam classification on Stack Exchange (Industry project with Charcoal)
Publisher
Saudi Digital Library
Abstract
This project develops a comprehensive AI-based model to identify spam and irrelevant questions. Questions are classified by their features: ham posts contain programming-related terms, whereas spam posts contain inappropriate words. An extensive dataset containing spam questions was extracted from Stack Exchange by Charcoal, an all-volunteer organisation that focuses on detecting and removing spam and other kinds of online abuse across the Internet and social media networks. To address the spam-detection problem, seven ML classifiers, namely Decision Tree (DT), Support Vector Classifier (SVC), Random Forest (RF), Extra Tree Classifier (ETC), Logistic Regression (LR), K-Nearest Neighbor (KNN), and Gaussian Naïve Bayes (GNB), were employed in combination with three feature extraction techniques: Bag of Words (BoW), Term Frequency–Inverse Document Frequency (TF-IDF), and Word2Vec. All questions were first preprocessed to eliminate unwanted and noisy content: punctuation, numbers, tags, and links were removed, the text was converted to lower case, stemming and lemmatization were applied, and stopwords and words shorter than two characters were dropped. These preprocessing steps significantly improved the quality of the data and considerably reduced computation time by eliminating unnecessary information. Subsequently, the feature extraction techniques were applied to the cleaned data to produce the features. Finally, these features were passed to the aforementioned classifiers, and the results were recorded and analyzed.
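The cleaning steps and the Bag of Words representation described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function names `preprocess` and `bag_of_words`, the regular expressions, and the tiny stopword list are all assumptions, and stemming and lemmatization are left out because they need an external NLP library.

```python
import re
import string
from collections import Counter

# Tiny illustrative stopword list (assumption); the list used in the study is not given.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "and", "how", "do"}

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, min_len=2):
    """Clean one question and return its remaining tokens."""
    text = text.lower()                                  # lower-case conversion
    text = re.sub(r"<[^>]+>", " ", text)                 # remove HTML-style tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove links
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = text.translate(_PUNCT_TO_SPACE)               # remove punctuation
    # Drop stopwords and words shorter than min_len characters.
    return [t for t in text.split() if len(t) >= min_len and t not in STOPWORDS]


def bag_of_words(docs):
    """Return a sorted vocabulary and one count vector per document."""
    token_lists = [preprocess(d) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


if __name__ == "__main__":
    docs = ["Parse JSON in Python", "Buy cheap watches, cheap!!!"]
    vocab, vectors = bag_of_words(docs)
    print(vocab)
    print(vectors)
```

Each count vector has one entry per vocabulary word, so a classifier sees every question as a fixed-length numeric feature vector. TF-IDF would reweight these counts by how rare each word is across the dataset.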
In addition to the seven ML models, a Deep Learning (DL) model, a Recurrent Neural Network (RNN), was applied to the dataset, and its results were compared with those of the ML models. For this study, the dataset was split into 75% for training and 25% for testing. The results showed that two classifiers, Random Forest (RF) and Extra Tree Classifier (ETC), achieved the highest accuracy, 84%, using the Bag of Words (BoW) and Term Frequency–Inverse Document Frequency (TF-IDF) feature extraction methods, respectively.
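The 75%/25% split and the accuracy metric mentioned above can be illustrated with a short standard-library sketch. The abstract does not say whether the split was shuffled, stratified, or seeded, so the shuffling, the `seed` parameter, and the helper names `split_train_test` and `accuracy` are assumptions made only for illustration.

```python
import random


def split_train_test(samples, labels, test_ratio=0.25, seed=0):
    """Shuffle the data and hold out test_ratio of it for testing."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle (assumed)
    n_test = int(len(indices) * test_ratio)       # size of the held-out test set
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [samples[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


if __name__ == "__main__":
    questions = [f"question {i}" for i in range(8)]
    labels = [0, 1, 0, 1, 0, 1, 0, 1]             # 0 = ham, 1 = spam (toy labels)
    x_train, x_test, y_train, y_test = split_train_test(questions, labels)
    print(len(x_train), len(x_test))
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

A reported accuracy of 84% corresponds to `accuracy` returning 0.84 on the held-out 25% of questions.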