Classification with missing data

Publisher

Saudi Digital Library

Abstract

Missing data is a significant challenge for classification on many real-world datasets, and imputing the missing values is one of the most common ways to handle incomplete data. This research analysed the impact of missing values on the performance of six classification models under four dataset scenarios (the original dataset and versions with 5%, 15% and 30% missing values), using a real COVID-19 dataset from the King Abdulaziz Medical City, Riyadh, Saudi Arabia. Four performance metrics were applied: accuracy, sensitivity, specificity and ROC-AUC score. The study found that imputation improved the accuracy, specificity and ROC-AUC score of the classification models. In contrast, in practically every case, imputation significantly decreased the sensitivity of the classification models. Furthermore, with imputation, the Naive Bayes model and the ensemble learning models (Random Forest and XGBoost) showed significantly larger performance gains than Logistic Regression, SVM and Decision Trees. However, across the results of 72 classifiers, no imputation method consistently outperformed the others, although in most cases LOCF slightly outperformed KNN and miceforest.
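The kind of experiment the abstract describes can be illustrated with a minimal Python sketch. It is not the study's code, and everything in it is an assumption made for illustration: the data is synthetic (generated with scikit-learn's `make_classification`) rather than the COVID-19 dataset, values are removed completely at random, a single Gaussian Naive Bayes model stands in for the six classifiers, and only LOCF and KNN imputation are shown (miceforest is omitted to keep dependencies small). The helper names `mask_at_random`, `impute_locf`, `impute_knn`, `evaluate` and `run_experiment` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def mask_at_random(X, rate, rng):
    """Return a copy of X with roughly `rate` of its cells set to NaN."""
    X_missing = X.copy()
    X_missing[rng.random(X.shape) < rate] = np.nan
    return X_missing


def impute_locf(X):
    """Last observation carried forward per column; bfill covers leading gaps."""
    return pd.DataFrame(X).ffill().bfill().to_numpy()


def impute_knn(X, k=5):
    """Fill each missing cell from the k nearest rows (k=5 is an assumed default)."""
    return KNNImputer(n_neighbors=k).fit_transform(X)


def evaluate(X, y, seed=0):
    """Train Gaussian Naive Bayes and report the four metrics named in the abstract."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    model = GaussianNB().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_te, pred),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "roc_auc": roc_auc_score(y_te, prob),
    }


def run_experiment(seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic stand-in for the real dataset (assumption).
    X, y = make_classification(n_samples=500, n_features=8, random_state=seed)
    rows = []
    for rate in (0.05, 0.15, 0.30):
        X_missing = mask_at_random(X, rate, rng)
        for name, impute in (("LOCF", impute_locf), ("KNN", impute_knn)):
            # Simplification: imputing before the split leaks test information;
            # a careful study would fit the imputer on the training fold only.
            metrics = evaluate(impute(X_missing), y, seed)
            rows.append({"missing_rate": rate, "imputer": name, **metrics})
    return pd.DataFrame(rows)


results = run_experiment()
print(results.round(3))
```

Each row of `results` corresponds to one missing-rate and imputer combination. Extending the two loops with more classifiers and a third imputer would give a grid of the same shape as the 72 classifiers mentioned above, although the numbers from synthetic data say nothing about the study's findings.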

Copyright owned by the Saudi Digital Library (SDL) © 2025