Early Prediction of Cancer Using Supervised Machine Learning: A Study of Electronic Health Records From The Ministry of National Guard Health Affairs
Date
2024-08
Publisher
University College London (UCL)
Abstract
Early detection and treatment of cancer can save lives; however, identifying those most at risk of developing cancer remains challenging. Electronic health records (EHR) provide a rich source of "big" data on large patient numbers. I hypothesised that in the period preceding a definitive cancer diagnosis, there exist healthcare events, such as a history of disease, captured within EHR data that characterise cancer progression and can be exploited to predict future cancer occurrence. Using longitudinal phenotype data from the EHR of the Ministry of National Guard Health Affairs, a large healthcare provider in Saudi Arabia, I aimed to discover health event patterns in EHR data that predict cancer development in the period prior to diagnosis, by developing predictive models using supervised machine learning (ML) algorithms. I used two prediction periods: six months and one year prior to cancer diagnosis. Initially, the thesis focused on the prediction of both malignant and benign neoplasms, before moving on to predicting the future risk of malignant neoplasms (cancer), since predicting life-threatening illness remains the most important clinical challenge. To refine the approach for specific cancer types, predictive models were built for the three most common malignancies in this population: breast, colon, and thyroid cancers. ML predictive models were developed using the following algorithms: (1) logistic regression; (2) penalised logistic regression; (3) decision trees; (4) random forests; (5) gradient boosting; (6) extreme gradient boosting; (7) k-nearest neighbours; and (8) support vector machines. Model performance was assessed using k-fold cross-validation and the area under the receiver operating characteristic curve (AUC-ROC). After the models were developed, their performance was compared with and without hyperparameter tuning, which used the Tree-based Pipeline Optimization Tool (TPOT) and grid search (GridSearch).
This study provides novel proof-of-principle that ML algorithms can be applied to EHR data to develop models that can be used to predict future cancer occurrence.
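The evaluation workflow named in the abstract (k-fold cross-validation scored by AUC-ROC, then a comparison against grid-search tuning) can be illustrated with a minimal sketch. It assumes scikit-learn and uses a synthetic dataset in place of the EHR data. The feature matrix, the three models shown, the fold count, and the parameter grid are all illustrative assumptions, not the thesis's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an EHR feature matrix (NOT the thesis data):
# rows play the role of patients, columns of hypothetical pre-diagnosis
# event features, and y marks a later diagnosis (about 20% positive).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Stratified 5-fold split; the number of folds is an assumption.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Three of the eight algorithm families listed in the abstract.
models = {
    # LogisticRegression applies an L2 penalty by default.
    "penalised_logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Baseline: mean cross-validated AUC-ROC with untuned settings.
baseline_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

# Tuning: exhaustive grid search over a small, hypothetical grid,
# scored with the same folds and the same metric.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=cv,
    scoring="roc_auc",
)
search.fit(X, y)

for name, auc in baseline_auc.items():
    print(f"{name}: mean AUC-ROC = {auc:.3f}")
print(f"tuned random forest: mean AUC-ROC = {search.best_score_:.3f}, "
      f"params = {search.best_params_}")
```

Because the grid includes the random forest's default settings and reuses the same folds, the tuned score here cannot fall below the baseline. On real data, tuning would normally be scored on a held-out set to avoid an optimistic estimate.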
Description
PhD thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy from University College London (UCL)
Keywords
artificial intelligence, machine learning, data science, big data, algorithms, prediction, data mining