Classification With Missing Data
Abstract
Pattern classification has played a fundamental role in solving problems in
various fields, such as computer science and medicine. However, missing values have become
an increasingly challenging problem in many real-world classification applications, causing errors in classification results. Handling missing
data is therefore a prerequisite for pattern classification. Methods for dealing with
missing data originate in statistical learning theory and have been extensively investigated and
applied. They range from simple methods, such as deletion and single imputation, to more
sophisticated methods such as multiple imputation. Each method has its own set of benefits
and drawbacks. This study aims to determine the most accurate classification algorithms for
datasets that include missing observations. It comprises a theoretical section and an analysis
section. The theoretical section reviews the background needed to understand the topic
in depth, drawing on journal articles and books. The basic concepts of classification and missingness mechanisms are introduced, and the available approaches for data with
missing values are explained. This section also includes examples to aid
understanding of the related theory. The analysis section performs predictive analysis on Scania
Trucks’ air pressure system data for component failure using Support Vector Machine, Naive
Bayes, and Random Forest. Approaches to dealing with missing values and class imbalance are
covered because they are crucial parts of the machine learning process. Seven performance
metrics are used to compare the results of all the classification algorithms with different imputation approaches. The results showed that the effect of the classification technique
is greater than the effect of the imputation technique. The best imputation technique depends
on the classification technique employed, so no single imputation approach is best
for every classifier. The optimal combination of classification technique and imputation method depends on the metric used and on whether the training dataset is
balanced or imbalanced.
In real-life applications, however, each dataset requires in-depth investigation to establish which missing data estimation approaches will improve classification accuracy.