Machine Learning Techniques for Financial Loan Default Prediction in UK: A Comparative Analysis of Decision Tree and Random Forest Models
Date
2025
Publisher
Saudi Digital Library
Abstract
This dissertation proposes a comprehensive approach to variable selection and model
comparison for credit scoring, based on the Lending Club 2016–2018 dataset.
The methodology combines an initial manual selection, based on completeness and
business logic, with an automatic selection via RFECV (Recursive Feature
Elimination with Cross-Validation) using a Random Forest. A permutation
importance analysis and an ablation experiment (Top 10 variables) complete the
evaluation.
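The automatic selection step can be illustrated with a minimal sketch, assuming scikit-learn is the toolkit (the abstract names RFECV but not the library). The synthetic data, the scoring metric, the fold count, and every hyperparameter below are illustrative assumptions, not the dissertation's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in for the loan data (hypothetical: 21 features,
# roughly 20% positives). The real Lending Club columns are not reproduced.
X, y = make_classification(
    n_samples=600,
    n_features=21,
    n_informative=8,
    n_redundant=4,
    weights=[0.8, 0.2],
    random_state=0,
)

# Random Forest whose feature importances drive the elimination.
forest = RandomForestClassifier(n_estimators=30, random_state=0)

# RFECV removes `step` features per round and keeps the feature count with the
# best mean cross-validated score (ROC-AUC here, an assumed choice).
selector = RFECV(
    estimator=forest,
    step=3,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="roc_auc",
    min_features_to_select=3,
)
selector.fit(X, y)

n_selected = selector.n_features_  # number of features retained
support_mask = selector.support_   # boolean mask over the 21 columns
ranking = selector.ranking_        # 1 = retained; larger = eliminated earlier
print(n_selected, ranking)
```

A larger `step` shortens the search at the price of a coarser answer, which is one way to contain the computational cost of this stage.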
The results show that RFECV retains all 21 selected variables as relevant, but
that most of the predictive power is concentrated in a subset of about 15 variables.
The model comparison shows the clear superiority of Random Forest (AUC ≈
0.713; PR-AUC ≈ 0.437) over Decision Tree (AUC ≈ 0.594; PR-AUC ≈ 0.319).
Permutation importance analysis confirms business intuition: interest rate, credit
sub-grade, and residential status appear to be the main explanatory factors,
supplemented by financial indicators (debt ratio, loan amount, FICO score). The
ablation experiment shows that the ten most important variables are sufficient to
preserve almost all of the Random Forest's performance (AUC = 0.708), while
reducing training time by approximately 40%.
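The comparison, permutation importance, and Top 10 ablation described above can be sketched as follows, again assuming scikit-learn and synthetic data. Average precision is used here as the PR-AUC estimate, which is an assumption about how the metric was computed. The helper `evaluate`, the split ratio, and all hyperparameters are hypothetical, so the scores printed will not match the figures reported in the abstract.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data (hypothetical).
X, y = make_classification(
    n_samples=1500, n_features=21, n_informative=8, n_redundant=4,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def evaluate(model, cols):
    """Fit on the given columns; return (ROC-AUC, PR-AUC) on the test split."""
    model.fit(X_train[:, cols], y_train)
    scores = model.predict_proba(X_test[:, cols])[:, 1]
    return roc_auc_score(y_test, scores), average_precision_score(y_test, scores)


all_cols = list(range(X.shape[1]))

# Model comparison on the full feature set.
tree_auc, tree_pr = evaluate(DecisionTreeClassifier(random_state=0), all_cols)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest_auc, forest_pr = evaluate(forest, all_cols)

# Permutation importance: mean drop in test ROC-AUC when one column is shuffled.
result = permutation_importance(
    forest, X_test, y_test, scoring="roc_auc", n_repeats=5, random_state=0
)
top_k = 10
top_cols = sorted(result.importances_mean.argsort()[::-1][:top_k].tolist())

# Ablation: refit a fresh forest on the top-k columns only.
ablated_auc, ablated_pr = evaluate(
    RandomForestClassifier(n_estimators=100, random_state=0), top_cols
)

print(f"Decision Tree      AUC={tree_auc:.3f}  PR-AUC={tree_pr:.3f}")
print(f"Random Forest      AUC={forest_auc:.3f}  PR-AUC={forest_pr:.3f}")
print(f"Random Forest top{top_k} AUC={ablated_auc:.3f}  PR-AUC={ablated_pr:.3f}")
```

Computing the importances on the held-out split rather than the training split keeps the ranking tied to generalisation. Correlated columns can still share or mask each other's importance.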
These results highlight two major points: (i) Random Forest is robust and can
effectively exploit a small core of variables, but its performance remains below the
level typically expected of a production model (AUC > 0.80); (ii) the hierarchy of
variables reveals both the relevance of the expected indicators and the redundancy
between certain correlated measures. The limitations identified concern the
sensitivity of the importance measures to correlated variables, the temporal
restriction of the sample (2016–2018), and the computational cost of certain
steps (RFECV).
In conclusion, this project validates the feasibility of a robust and parsimonious model
based on Random Forest, while opening up prospects for improvement: use of
boosting algorithms, calibration of decision thresholds according to the economic
stakes, temporal robustness tests, and pipeline optimization.
Keywords
Financial Loan Default Prediction, Credit Risk Assessment, Machine Learning, Random Forest, Decision Tree, Credit Scoring, Feature Selection, Recursive Feature Elimination (RFE/RFECV), Class Imbalance, SMOTENC, Ensemble Learning, Financial Risk Modelling, LendingClub Dataset, Model Evaluation Metrics (ROC-AUC, PR-AUC, F1-score), Feature Importance Analysis, Model Interpretability, Supervised Learning, Borrower Behaviour Analysis, Credit Risk Management (CRM), UK Financial Institutions
