Classification with missing data
Abstract
Missing data is a significant challenge for classification on many real-world datasets, and imputing
the missing values is one of the most common approaches to dealing with incomplete datasets. This
research analyses the impact of missing values on the performance of six classification models
under four dataset scenarios: the original dataset and versions of it with 5%, 15% and 30% missing
values. The data is a real COVID-19 dataset from the King Abdulaziz Medical City, Riyadh, Saudi
Arabia. Four performance metrics were applied: accuracy, sensitivity, specificity and ROC-AUC
score. The study found that the imputation methods improved the accuracy, specificity and ROC-AUC
score of the classification models. In contrast, in practically every case, the sensitivity of the
classification models decreased significantly with imputation. Furthermore, with imputation, the
Naive Bayes model and the ensemble learning models (Random Forest and XGBoost) significantly
enhanced performance relative to Logistic Regression, SVM and Decision Trees. However, across the
results of 72 classifiers, no imputation method consistently outperformed the others, although in
most cases LOCF slightly outperformed KNN and Miceforest.
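
The quantities named above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the pipeline used in this study: the helper names (`mask_values`, `locf`, `confusion_metrics`, `roc_auc`) are hypothetical, missing values are assumed to be injected completely at random, and LOCF is assumed to carry values forward in row order within each column. It shows why accuracy, sensitivity and specificity can move in different directions: each is computed from a different part of the confusion matrix.

```python
import random


def mask_values(rows, rate, seed=0):
    """Return a copy of `rows` with roughly `rate` of the cells set to None.

    Assumption: cells are dropped completely at random (MCAR).
    """
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def locf(rows):
    """Last observation carried forward, column by column, in row order.

    A gap with no earlier observation in its column is left as None.
    """
    last = [None] * len(rows[0])
    filled = []
    for row in rows:
        new_row = []
        for j, value in enumerate(row):
            if value is None:
                value = last[j]      # reuse the most recent observed value
            else:
                last[j] = value      # remember this observation
            new_row.append(value)
        filled.append(new_row)
    return filled


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = positive).

    Assumes both classes occur in y_true; otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half. Assumes both classes occur in y_true.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `confusion_metrics([1, 1, 0, 0], [1, 0, 0, 0])` gives an accuracy of 0.75 and a specificity of 1.0 but a sensitivity of only 0.5, the same pattern of high specificity with reduced sensitivity that the abstract reports.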