Evaluation of different feature selection methods for Author profiling task: identification of the gender

University of Sheffield
Introduction: In the current technological era, numerous digital platforms, including social media and email, facilitate text-based communication. While these technologies simplify daily life for some, they can also be misused in various contexts, including hate speech and phishing attacks. Consequently, there is a need to identify the authors involved in such activities. Within authorship attribution, ongoing research focuses on author profiling tasks, which aim to discern author attributes such as gender, age, and language. Several tools have been proposed in this field to identify demographic characteristics of an author, such as gender.
Objectives: The goal of this dissertation is to evaluate feature selection methods and identify the most suitable one for gender identification within author profiling, using the widely recognised PAN18 dataset.
Methods: Three feature selection techniques, namely the Pearson correlation coefficient as a filter method, forward feature selection as a wrapper method, and embedded methods, were applied alongside six supervised machine learning algorithms (Logistic Regression, Support Vector Machine, K-Nearest Neighbour, Naive Bayes, Decision Tree, and Random Forest) and evaluated in terms of performance, generalisability, dimensionality, and computational time.
Results: The primary finding is that the Logistic Regression (LR) embedded method achieved 75% accuracy in 1.8 seconds using 368 features, and demonstrated strong generalisation performance on unseen data points.
Conclusion: The different feature selection methods all attained considerable results. However, the LR embedded method registered better accuracy in a short computational time, with considerable generalisability and dimensionality. The LR embedded method may therefore improve accuracy, computational time, generalisability, and dimensionality, making it a suitable feature selection method for the gender identification task.
Limitations: The key limitations of this dissertation are the lack of computational resources to handle the large volume of the textual dataset and the short duration of the study.
Feature selection, Author profiling, Gender identification
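
The filter method named in the Methods section can be illustrated with a minimal sketch, assuming each document has already been converted to a row of numeric features and gender is encoded as 0 or 1. The helper names `pearson` and `select_top_k`, the toy data, and the choice to rank features by absolute correlation are assumptions made for illustration; this is not the dissertation's actual implementation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no signal; score it as 0 (assumption).
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_top_k(rows, labels, k):
    """Return the sorted indices of the k features with the largest |r|."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest absolute correlation first; ties broken by feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(j for _, j in scores[:k])


# Hypothetical toy data: 4 documents, 3 numeric features each.
rows = [
    [3, 1, 5],
    [4, 0, 5],
    [0, 1, 5],
    [1, 0, 5],
]
labels = [1, 1, 0, 0]  # hypothetical binary gender encoding

selected = select_top_k(rows, labels, k=1)
print(selected)
```

In this toy data only the first feature varies with the label, so it is the one retained; the second feature is uncorrelated and the third is constant, so both score zero.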