Weighted Classification Tree-based Ensemble Methods

Abstract

This thesis explores the role and selection of weighted data in classification trees, including ensemble methods applied to weighted datasets. To this end, we combine decision trees as base classifiers with several ensemble methods. Decision trees are easy to interpret, but they are rarely the most accurate machine learning algorithm; a standard way to overcome this weakness is to combine many trees into an ensemble, and ensemble methods are popular in statistical computing for their strong predictive performance. To focus our methods, we first investigated how the tree's first splitting variable and split point change when the resampling method switches from bootstrapping to reweighting the observations according to a distribution. In most cases examined, this change had no major impact on the distribution of the first split variable and split point. We then introduce two methods: the “bagging with stumps” method and the “Gini stumps” method. Bagging with stumps fits decision stumps on samples of different sizes, while Gini stumps is a new way to generate split points with probabilities proportional to Gini gain. Gini stumps has two variants: the Gini-sampled stumps method and the Gini-midpoints stumps method. These methods, combined with two aggregation schemes (majority vote and weighted vote), are then applied to simulated and real-world datasets to measure their predictive accuracy. Gini stumps is computationally faster than bagging with stumps when an ordinary tree is used as the base classifier, and Gini stumps, especially with weighted vote, gives promising results on most simulated and real-world datasets. The Gini stumps method was then adapted to handle weight updates and combined with AdaBoost methods; this combination gives accurate results, particularly with discrete AdaBoost. Finally, we derive the expected value of the Gini index theoretically in the two-class case: the resulting formula is the squared difference between the two cumulative distribution functions divided by the variance of their weighted sum, which is the same distance that the Anderson-Darling test is based on.
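The abstract gives no pseudocode, so the following is only a minimal sketch of one plausible reading of the “Gini-sampled stumps” idea: candidate split points are the midpoints between consecutive sorted feature values, one split is drawn with probability proportional to its Gini gain, and the resulting stumps are aggregated by majority vote. All function names here (`gini_sampled_stump`, `majority_vote`) are hypothetical, not taken from the thesis.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_sampled_stump(x, y, rng):
    """Fit one stump on feature x, sampling the split point with
    probability proportional to its Gini gain (an assumed reading
    of the 'Gini-sampled stumps' method described in the abstract)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(xs)
    parent = gini(ys)
    mids, gains = [], []
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no midpoint between tied values
        gain = parent - (i / n) * gini(ys[:i]) - ((n - i) / n) * gini(ys[i:])
        mids.append(0.5 * (xs[i] + xs[i - 1]))
        gains.append(max(gain, 0.0))
    gains = np.asarray(gains)
    probs = (np.full(len(mids), 1.0 / len(mids)) if gains.sum() == 0
             else gains / gains.sum())
    split = rng.choice(mids, p=probs)
    # Predict the majority class on each side of the sampled split.
    left_class = np.bincount(y[x <= split]).argmax()
    right_class = np.bincount(y[x > split]).argmax()
    return split, left_class, right_class

def majority_vote(stumps, x):
    """Aggregate stump predictions by (unweighted) majority vote."""
    preds = np.array([np.where(x <= s, lc, rc) for s, lc, rc in stumps])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)

# Toy demonstration on simulated binary-class data.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)
stumps = [gini_sampled_stump(x, y, rng) for _ in range(25)]
print(f"training accuracy: {(majority_vote(stumps, x) == y).mean():.2f}")
```

The weighted-vote variant mentioned in the abstract would replace the uniform vote above with per-stump weights; the thesis itself specifies how those weights are chosen.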
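The closing claim can be made concrete. One plausible rendering of the verbal formula, with symbols assumed here rather than taken from the thesis ($F_1, F_2$ the class-conditional cumulative distribution functions, $\pi_1, \pi_2$ the class proportions, and $H$ their weighted sum), is

\[
\mathbb{E}\bigl[\Delta G(x)\bigr] \;\propto\; \frac{\bigl(F_1(x) - F_2(x)\bigr)^2}{H(x)\,\bigl(1 - H(x)\bigr)},
\qquad H(x) = \pi_1 F_1(x) + \pi_2 F_2(x).
\]

Here $H(x)\bigl(1 - H(x)\bigr)$ is the variance of a Bernoulli indicator with success probability $H(x)$, matching “the variance of the weighted sum of the two cumulative functions,” and the two-sample Anderson-Darling statistic integrates this same ratio against $dH(x)$, which is the connection the abstract states.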

