AN EXPERIMENTAL STUDY OF SUPERVISED MACHINE LEARNING TECHNIQUES FOR MINOR CLASS PREDICTION UTILIZING KERNEL DENSITY ESTIMATION: FACTORS IMPACTING MODEL PERFORMANCE
Date
2024-06-29
Authors
Alfarwan, Abdullah
Publisher
Western Michigan University
Abstract
This dissertation examined classification outcome differences among four popular
individual supervised machine learning (ISML) models (logistic regression, decision tree,
support vector machine, and multilayer perceptron) when predicting minor class membership
within imbalanced datasets. The study context and the theoretical population both center on
one aspect of the larger problem of student retention and dropout prediction in higher education
(HE): the identification of at-risk students.
This study differs from current literature by implementing an experimental design
approach with simulated student data that closely mirrors HE situational and student data.
Specifically, this study tested the predictive ability of the four ISML classification models (CLS)
under experimentally manipulated conditions. These included total sample size (TS), minor class
proportion (MCP), training-to-testing sample size ratios (TTSS), and the application of bagging
techniques during model training (BAG). Using this 4-between, 1-within mixed design, five
different outcome measures (precision, recall/sensitivity, specificity, F1-score, and AUC) were
examined and analyzed individually.
For each outcome measure, findings revealed multiple statistically significant interactions
among classifier models and design variables. Simple effect analyses of these interactions
highlighted how TS, MCP, TTSS, and BAG differentially affect each measure of
classification performance (precision, recall/sensitivity, specificity, F1-score, and AUC).
For instance, the presence of interactions involving MCP underscores the importance of
informed modeling of class distribution for enhancing overall model predictive capability and
performance.
Such insights into how the experimental variables can critically affect different
measures of classification success advance our understanding of how these four ISML models
might be optimized for the prediction of student-at-risk status within imbalanced datasets. This
dissertation provides a framework for using these or similar ISML models more effectively in
HE. It points toward the development of predictive modeling methods that are more useful and
perhaps more equitable by demonstrating empirically the impact of one of the most challenging
aspects of implementing machine learning in HE: maximizing the accurate identification of the
minority class. This work contributes to the use of machine learning in HE and will help inform
its use in smaller and larger educational research communities by providing strategies for
improving the prediction of student dropout.
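
As an illustration of the five outcome measures named above, the following is a minimal sketch (not the dissertation's code) of how precision, recall/sensitivity, specificity, F1-score, and AUC could be computed for a binary classifier. It assumes labels are coded 0/1 with 1 as the minor class, a 0.5 decision threshold, and AUC taken as the rank-based probability that a minor-class case scores higher than a major-class case. The function names and the ten-case toy dataset (minor class proportion of 20%) are hypothetical and invented for illustration only.

```python
def safe_div(num, den):
    # Return 0.0 when a measure is undefined (e.g. no predicted positives).
    return num / den if den else 0.0


def auc_from_scores(y_true, scores):
    # Rank-based AUC: share of (minor, major) pairs in which the minor-class
    # case receives the higher score; ties count as half a win.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def outcome_measures(y_true, scores, threshold=0.5):
    # Assumption: label 1 is the minor class, label 0 the major class.
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)          # also called sensitivity
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "auc": auc_from_scores(y_true, scores),
    }


# Hypothetical toy data: 2 minor-class cases out of 10 (MCP = 20%).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.35, 0.05, 0.15, 0.25]
measures = outcome_measures(y_true, scores)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

On this toy data, specificity is high while precision and recall are only moderate. That pattern is common with imbalanced classes and is one reason for examining the measures individually rather than relying on a single summary.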
Keywords
Supervised machine learning model, Imbalanced datasets, Dropout prediction, Higher education, Predictive modeling, Classification performance metrics, Sample size effect, Class proportion, Minority class identification.