Weighted Classification Tree-based Ensemble Methods
Abstract
This thesis explores the role and selection of weighted data in classification
trees, including ensemble applications that tackle weighted datasets.
To achieve this goal, we combine decision trees as base classifiers
with several ensemble methods. Decision trees are easy to interpret,
but their predictive accuracy is usually not the best among machine
learning algorithms. One way to overcome this weakness is to combine
many decision trees, i.e. an ensemble of trees. Ensemble methods are
popular in statistical computing for their strong performance.
To focus our methods, we first investigated the effect on the tree's
first splitting variable and split point when the resampling method
changes from bootstrapping to reweighting the observations according
to a distribution. We found that this change has no major impact on
the distribution of the first splitting variable and split point in
most of the cases we investigated. We then introduced two methods:
the “bagging with stumps method” and the “Gini stumps method”. The
bagging with stumps method fits decision stumps on samples of
different sizes, while the Gini stumps method is a new way to generate
split points with probabilities proportional to the Gini gain. The
Gini stumps method has two variants: the Gini-sampled stumps method
and the Gini-midpoints stumps method. These methods, combined with two
aggregation schemes (majority vote and weighted vote), are then
applied to simulated and real-world datasets to measure their
predictive accuracy. The Gini stumps method is computationally faster
than bagging with stumps when an ordinary tree is used as the base
classifier, and the Gini stumps method with the weighted vote in
particular gives promising results on most simulated and real-world
datasets. The Gini stumps method was then adjusted to handle weight
updates and combined with AdaBoost; this combination is especially
accurate with discrete AdaBoost.
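The Gini-sampled stumps idea described above can be sketched as
follows. This is a minimal illustration, not the thesis's exact
implementation: the function names and the candidate-threshold
convention (splitting at observed values, excluding the maximum) are
assumptions made for the example.

```python
import numpy as np

def gini_gain(y, w, mask):
    """Weighted Gini impurity reduction for the binary split given by `mask`."""
    def gini(yy, ww):
        tot = ww.sum()
        if tot == 0:
            return 0.0
        p = np.array([ww[yy == c].sum() / tot for c in np.unique(y)])
        return 1.0 - np.sum(p ** 2)

    tot = w.sum()
    wl, wr = w[mask].sum(), w[~mask].sum()
    return (gini(y, w)
            - (wl / tot) * gini(y[mask], w[mask])
            - (wr / tot) * gini(y[~mask], w[~mask]))

def gini_sampled_stump(x, y, w, rng):
    """Draw one split point with probability proportional to its Gini gain.

    Candidate thresholds here are the observed values of `x` (excluding
    the maximum); the Gini-midpoints variant would use midpoints between
    consecutive sorted values instead.
    """
    candidates = np.unique(x)[:-1]
    gains = np.clip([gini_gain(y, w, x <= c) for c in candidates], 0.0, None)
    total = gains.sum()
    # Fall back to a uniform draw if no candidate improves impurity.
    probs = gains / total if total > 0 else np.full(len(candidates), 1 / len(candidates))
    return rng.choice(candidates, p=probs)
```

Because the split point is sampled rather than optimised, repeated
calls produce a diverse collection of stumps that can then be
aggregated by majority or weighted vote.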
We also derived the expected value of the Gini index theoretically in
the two-class case: the resulting formula is the squared difference
between the two cumulative distribution functions divided by the
variance of the weighted sum of the two cumulative functions. This
result coincides with the distance on which the Anderson-Darling test
is based.
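Under assumed notation (the abstract does not fix symbols), with
$F_1$, $F_2$ the class-conditional cumulative distribution functions
and $\pi_1$, $\pi_2$ the class proportions, the result described above
might be written as:

```latex
\mathbb{E}\bigl[\Delta G(t)\bigr] \;\propto\;
\frac{\bigl(F_1(t) - F_2(t)\bigr)^{2}}{H(t)\bigl(1 - H(t)\bigr)},
\qquad H(t) = \pi_1 F_1(t) + \pi_2 F_2(t).
```

The denominator $H(t)(1 - H(t))$ is the variance of a Bernoulli
variable with success probability $H(t)$, i.e. of the weighted sum of
the two cumulative functions, which is the same weighting that appears
in the two-sample Anderson-Darling statistic.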