Ensemble Learning
Abstract
In this dissertation, we take a deep look at ensemble learning, a machine learning technique developed over the last few decades that has led to remarkable advances in supervised learning. We further explore the ensemble technique known as gradient boosting in thorough detail, and examine its performance on real datasets using one of its most popular software implementations, the XGBoost Python library.
The first part of the project is a thorough literature review, while the second part applies the theoretical background to several real supervised learning problems.
First, we review the theoretical background of the supervised learning problem, motivating ensemble learning through the bias-variance trade-off. We then take a closer look at the two dominant paradigms in ensemble learning, parallel and sequential ensembling, before exploring the sequential paradigm in depth through a thorough treatment of gradient boosting.
We then survey the history and evolution of ensemble learning. Our primary focus is AdaBoost, the first and for a time the most successful boosting algorithm, which we examine in detail. We then turn to one of the most successful software implementations of gradient boosting, XGBoost, which achieves impressive results on large datasets thanks to several novel algorithmic and engineering innovations.
Next, we take a short look at another dominant machine learning paradigm, deep learning, and compare the XGBoost library against deep learning on two well-known supervised learning tasks from the machine learning competition website Kaggle: the Adult Income and Fashion MNIST datasets. The Adult Income dataset is used in a binary classification task with mixed data types, while the Fashion MNIST dataset is used in a multiclass image classification task.
In particular, we present evidence that gradient boosting may perform better on structured tabular data, while deep learning may perform better on unstructured data such as images.
Finally, we look at more advanced ensembling techniques, in particular the use of combination methods, and close by considering the current state of the art in ensemble learning.
Keywords
Ensemble Learning, Machine Learning, Data Science and Analysis