Machine Learning Techniques for Financial Loan Default Prediction in UK: A Comparative Analysis of Decision Tree and Random Forest Models

Date

2025

Publisher

Saudi Digital Library

Abstract

This dissertation proposes a comprehensive approach to variable selection and model comparison for credit scoring, based on a Lending Club dataset covering 2016–2018. The methodology combines an initial manual selection, based on completeness and business logic, with an automatic selection via RFECV (Recursive Feature Elimination with Cross-Validation) using a Random Forest. A permutation importance analysis and an ablation experiment (top 10 variables) complete the evaluation. The results show that RFECV retains all 21 selected variables as relevant, but that most of the predictive power is concentrated in a subset of about 15 variables. The model comparison shows the clear superiority of Random Forest (AUC ≈ 0.713; PR-AUC ≈ 0.437) over Decision Tree (AUC ≈ 0.594; PR-AUC ≈ 0.319). Permutation importance analysis confirms business intuition: interest rate, credit sub-grade, and residential status appear to be the main explanatory factors, supplemented by financial indicators (debt ratio, loan amount, FICO score). The ablation experiment shows that the ten main variables are sufficient to preserve almost all of the Random Forest's performance (AUC = 0.708 versus ≈ 0.713), while reducing training time by approximately 40%. These results highlight two major points: (i) Random Forest is robust and can effectively exploit a small core of variables, but its performance remains below the standard expected of an industrial model (AUC > 0.80); (ii) the hierarchy of variables reveals both the relevance of expected indicators and the redundancy between certain correlated measures. The limitations identified concern sensitivity to correlations between variables, the temporal restriction of the sample (2016–2018), and the computational cost of certain steps (RFECV).
In conclusion, this project validates the feasibility of a robust and parsimonious model based on Random Forest, while opening up prospects for improvement: use of boosting algorithms, calibration of decision thresholds according to economic considerations, temporal robustness tests, and pipeline optimization.
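
The workflow summarized in the abstract (RFECV with a Random Forest, a Decision Tree versus Random Forest comparison on ROC-AUC and PR-AUC, permutation importance, then a top-k ablation) can be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's actual pipeline: the data are synthetic (generated with `make_classification`, not the Lending Club dataset), and the sample size, number of trees, fold count, and `top_k` of 3 are all placeholder choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.inspection import permutation_importance
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for loan data (assumption: NOT Lending Club).
X, y = make_classification(
    n_samples=1500, n_features=12, n_informative=5, n_redundant=2,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def evaluate(model, X_tr, X_te):
    """Fit on the training split; return (ROC-AUC, PR-AUC) on the test split."""
    model.fit(X_tr, y_train)
    scores = model.predict_proba(X_te)[:, 1]
    return roc_auc_score(y_test, scores), average_precision_score(y_test, scores)


# Step 1: recursive feature elimination with cross-validation,
# using a Random Forest to rank the features.
selector = RFECV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    step=1, cv=StratifiedKFold(n_splits=3), scoring="roc_auc",
)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Step 2: compare a single tree with a forest on the retained features.
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
tree_auc, tree_pr = evaluate(tree, X_train_sel, X_test_sel)
forest_auc, forest_pr = evaluate(forest, X_train_sel, X_test_sel)

# Step 3: permutation importance of the fitted forest on held-out data.
perm = permutation_importance(
    forest, X_test_sel, y_test, scoring="roc_auc", n_repeats=5, random_state=0
)
ranking = np.argsort(perm.importances_mean)[::-1]  # most important first

# Step 4: ablation, retraining the forest on the top-k ranked features only.
top_k = ranking[:3]
ablated = RandomForestClassifier(n_estimators=100, random_state=0)
ablated_auc, ablated_pr = evaluate(
    ablated, X_train_sel[:, top_k], X_test_sel[:, top_k]
)

print(f"Features kept by RFECV: {selector.n_features_}")
print(f"Decision Tree   AUC={tree_auc:.3f}  PR-AUC={tree_pr:.3f}")
print(f"Random Forest   AUC={forest_auc:.3f}  PR-AUC={forest_pr:.3f}")
print(f"Top-{len(top_k)} forest    AUC={ablated_auc:.3f}  PR-AUC={ablated_pr:.3f}")
```

Scores from this sketch reflect the synthetic data only and are not comparable to the figures reported above. Computing permutation importance on the held-out split, rather than on the training split, is a design choice made here to limit the influence of overfitting on the ranking.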

Keywords

Financial Loan Default Prediction, Credit Risk Assessment, Machine Learning, Random Forest, Decision Tree, Credit Scoring, Feature Selection, Recursive Feature Elimination (RFE/RFECV), Class Imbalance, SMOTENC, Ensemble Learning, Financial Risk Modelling, LendingClub Dataset, Model Evaluation Metrics (ROC-AUC, PR-AUC, F1-score), Feature Importance Analysis, Model Interpretability, Supervised Learning, Borrower Behaviour Analysis, Credit Risk Management (CRM), UK Financial Institutions


Copyright owned by the Saudi Digital Library (SDL) © 2025