Combining Traditional and Machine Learning Approaches to Predict TCGA Colon Cancer Outcomes

Date

2025

Publisher

Saudi Digital Library

Abstract

This study uses standardised clinical data from The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD) cohort to perform a comparative survival analysis of colorectal cancer (CRC). Three modelling approaches were evaluated: the Cox Proportional Hazards (Cox PH) model, LASSO-penalised Cox regression (LASSO-Cox), and the Gradient Boosted Survival Model (GBSM). Data preprocessing comprised missing-value imputation, outlier removal, and Kaplan–Meier-based variable stratification; the models were then trained and evaluated using the concordance index (C-index) and the time-dependent area under the curve (AUC). Cox PH consistently identified clinically established predictors and offered strong interpretability (C-index = 0.76), while LASSO-Cox improved model sparsity and feature selection (C-index = 0.80). GBSM achieved the highest predictive performance (C-index = 0.87; AUC = 0.841) by modelling complex non-linear relationships. SHAP values were used to enhance model interpretability; they highlighted key prognostic factors, including tumour staging components (T4, N2, M1), as well as underexplored but clinically meaningful variables such as residual tumour status (R2), age at diagnosis, and ethnicity. These findings demonstrate the potential of interpretable machine learning models to improve survival prediction and feature discovery in colorectal cancer. The study also highlights the need for external validation and multimodal data integration to improve generalisability and translational relevance in precision oncology.
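
For readers unfamiliar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index for right-censored data. It is illustrative only and is not taken from the study: the function name `concordance_index`, its arguments, and the toy data are assumptions, and pairs with tied follow-up times are simply skipped, which is a simplification. A value of 1.0 means the predicted risks rank every comparable pair of patients correctly, and 0.5 corresponds to random ranking.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data (simplified sketch).

    times       -- observed follow-up time for each patient
    events      -- 1 if the event was observed, 0 if the patient was censored
    risk_scores -- model output; a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the shorter
        # time is an observed event (assumption: tied times are skipped).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher risk failed earlier: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the second one censored.
toy_times = [5, 10, 15, 20]
toy_events = [1, 0, 1, 1]
toy_risk = [0.9, 0.4, 0.1, 0.6]
toy_c_index = concordance_index(toy_times, toy_events, toy_risk)
```

In this toy example four pairs are comparable and three of them are ranked correctly, so `toy_c_index` is 0.75. On that scale, the reported values of 0.76, 0.80, and 0.87 correspond to the share of comparable patient pairs that each model orders correctly.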

Description

This MSc project presents a comparative survival analysis of colorectal cancer using traditional statistical models and modern machine learning approaches. The study focuses on balancing predictive accuracy with clinical interpretability by integrating Cox-based models with gradient boosting and explainable AI techniques.

Keywords

Colorectal Cancer, Survival Analysis, TCGA-COAD, Cox Proportional Hazards, LASSO-Cox Regression, Gradient Boosted Survival Model, Machine Learning, Explainable AI, SHAP, Precision Oncology

Copyright owned by the Saudi Digital Library (SDL) © 2026