Early Prediction of Cancer Using Supervised Machine Learning: A Study of Electronic Health Records From The Ministry of National Guard Health Affairs

dc.contributor.advisorLai, Alvina
dc.contributor.advisorKunz, Holger
dc.contributor.authorAlfayez, Asma
dc.date.accessioned2024-10-29T16:31:41Z
dc.date.issued2024-08
dc.descriptionPhD thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy from University College London (UCL)
dc.description.abstractEarly detection and treatment of cancer can save lives; however, identifying those most at risk of developing cancer remains challenging. Electronic health records (EHR) provide a rich source of "big" data on large patient numbers. I hypothesised that in the period preceding a definitive cancer diagnosis, there exist healthcare events, such as a history of disease, captured within EHR data that characterise cancer progression and can be exploited to predict future cancer occurrence. Using longitudinal phenotype data from the EHR of the Ministry of National Guard Health Affairs, a large healthcare provider in Saudi Arabia, I aimed to discover health event patterns in EHR data that predict cancer development in the period before diagnosis by developing predictive models using supervised machine learning (ML) algorithms. I used two prediction periods: six months and one year prior to cancer diagnosis. Initially, the thesis focused on the prediction of both malignant and benign neoplasms, before moving on to predicting the future risk of malignant neoplasms (cancer), since predicting life-threatening illness remains the most important clinical challenge. To refine the approach for specific cancer types, predictive models were built for the three most common malignancies in this population: breast, colon, and thyroid cancers. ML predictive models were developed using the following algorithms: (1) logistic regression; (2) penalised logistic regression; (3) decision trees; (4) random forests; (5) gradient boosting; (6) extreme gradient boosting; (7) k-nearest neighbours; and (8) support vector machines. Model performance was assessed using k-fold cross-validation and the area under the receiver operating characteristic curve (AUC-ROC). Once the models had been developed, their performance was compared with and without hyperparameter tuning, which was carried out using the Tree-based Pipeline Optimization Tool (TPOT) and grid search.
This study provides novel proof-of-principle that ML algorithms can be applied to EHR data to develop models that can be used to predict future cancer occurrence.
dc.format.extent71
dc.identifier.urihttps://hdl.handle.net/20.500.14154/73372
dc.language.isoen
dc.publisherUniversity College London (UCL)
dc.subjectartificial intelligence
dc.subjectmachine learning
dc.subjectdata science
dc.subjectbig data
dc.subjectalgorithms
dc.subjectprediction
dc.subjectdata mining
dc.titleEarly Prediction of Cancer Using Supervised Machine Learning: A Study of Electronic Health Records From The Ministry of National Guard Health Affairs
dc.typeThesis
sdl.degree.departmentInstitute of Health Informatics
sdl.degree.disciplineArtificial Intelligence and Big Data Science
sdl.degree.grantorUniversity College London (UCL)
sdl.degree.nameDoctor of Philosophy
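
The abstract states that model performance was assessed with k-fold cross-validation and AUC-ROC. The following is a minimal sketch of that evaluation protocol only, not the thesis's implementation. It uses synthetic data rather than EHR data, and the toy centroid-based scorer and all function names (`auc_roc`, `k_fold_indices`, `fit_centroid_scorer`, `cross_validated_auc`) are illustrative assumptions; the scorer is not one of the eight algorithms listed in the abstract.

```python
import random


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive case scores
    higher than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroid_scorer(X, y):
    """Hypothetical stand-in model: score a record by projecting it onto
    the difference between the positive and negative class means."""
    d = len(X[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(d)]

    mu_pos = mean([x for x, t in zip(X, y) if t == 1])
    mu_neg = mean([x for x, t in zip(X, y) if t == 0])
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def cross_validated_auc(X, y, k=5, seed=0):
    """Fit on k-1 folds, score the held-out fold, and average the fold AUCs."""
    aucs = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        scorer = fit_centroid_scorer([X[i] for i in train_idx],
                                     [y[i] for i in train_idx])
        aucs.append(auc_roc([y[i] for i in test_idx],
                            [scorer(X[i]) for i in test_idx]))
    return sum(aucs) / len(aucs)


# Synthetic data (assumption): one informative feature, one noise feature.
rng = random.Random(42)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(1.0 if t else 0.0, 1.0), rng.gauss(0.0, 1.0)] for t in y]

mean_auc = cross_validated_auc(X, y, k=5)
print(f"Mean 5-fold AUC-ROC: {mean_auc:.3f}")
```

An AUC-ROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why it suits comparing several classifiers on the same folds. In practice a library such as scikit-learn would normally replace these hand-written helpers.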

Copyright owned by the Saudi Digital Library (SDL) © 2024