Classification With Missing Data

Abstract

Pattern classification has played a fundamental role in solving problems in fields such as computer science and medicine. However, missing values are an increasingly challenging problem in many real-world classification applications and cause errors in classification results. Handling missing data is therefore a prerequisite for pattern classification. Methods for dealing with missing data originate in statistical learning theory and have been extensively investigated and applied. They range from simple methods, such as deletion and single imputation, to more sophisticated methods, such as multiple imputation. Each method has its own benefits and drawbacks. This study aims to determine the most accurate classification algorithms for datasets that include missing observations. It comprises a theoretical section and an analysis section. The theoretical section reviews the background needed to understand the topic in depth, drawing on journal articles and books. It presents the basic concepts of classification and missingness mechanisms, explains the available approaches for data with missing values, and includes examples that illustrate the related theory. The analysis section performs predictive analysis of component failure on Scania Trucks’ air pressure system data using Support Vector Machine, Naive Bayes, and Random Forest. Approaches to dealing with missing values and class imbalance are covered because they are crucial parts of the machine learning process. Seven performance measures are used to compare the results of all the classification algorithms under different imputation approaches. The results showed that the classification technique has a greater effect than the imputation technique.
The best imputation technique depends on the classification technique employed, so no single imputation approach is best for every classification technique. The optimal combination of classification technique and imputation method depends on the metrics used and on whether the training dataset is balanced or imbalanced. In real-life applications, however, each dataset requires in-depth investigation to establish which missing data estimation approaches will improve classification accuracy.
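
As a rough illustration of the two simple approaches named in the abstract, deletion and single imputation, the following minimal Python sketch applies listwise deletion and mean imputation to a small table. The data, the use of None to mark a missing value, and the function names listwise_deletion and mean_imputation are assumptions made for this example; they are not taken from the study or from the Scania dataset.

```python
from statistics import mean

# Toy data (hypothetical): each row is one observation, None marks a missing value.
rows = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [4.0, 40.0],
]

def listwise_deletion(data):
    """Drop every row that contains at least one missing value."""
    return [row for row in data if all(v is not None for v in row)]

def mean_imputation(data):
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(data[0])
    col_means = [
        mean(row[j] for row in data if row[j] is not None)
        for j in range(n_cols)
    ]
    return [
        [col_means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in data
    ]

deleted = listwise_deletion(rows)   # keeps only fully observed rows
imputed = mean_imputation(rows)     # keeps all rows, fills the gaps
```

Deletion discards half of the toy rows here, while mean imputation keeps every row at the cost of filling gaps with a column average. This is one instance of the trade-off between methods that the abstract refers to.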

Copyright owned by the Saudi Digital Library (SDL) © 2025