Predicting Biochemical Recurrence of Prostate Cancer Using Genetic Sequence Data and Clinical Variables Using High Dimensional Multivariate Models
Abstract
Prostate cancer is a major cause of death in men worldwide. Although well-established
predictors of disease progression and death have been identified, it remains a challenge to
predict outcomes accurately due to disease heterogeneity. Biochemical recurrence (BCR) is
an early surrogate endpoint, and identifying key predictors of BCR may lead to better and
earlier treatment of patients who are at high risk.
The overall aim of this study was to identify clinical and genetic predictors of BCR in
prostate cancer patients after radical prostatectomy using statistical techniques, and to
develop signatures to predict time to BCR. The study used prostate cancer data (n=495
subjects) from The Cancer Genome Atlas (TCGA); patients' demographic and clinical
characteristics and high-dimensional gene expression variables were used in the analyses.
Principal component analysis (PCA) was applied to the high-dimensional gene expression
data (57,251 genes) to understand the overall expression pattern in patients after
treatment. Shrinkage approaches (Lasso and Elastic-net) were then used to predict BCR
from gene expression. Before these prediction methods were applied to the real dataset,
their performance was assessed using simulated data. Overall, both techniques performed
similarly. However, the ratio of events to controls was highly imbalanced
(19.2% events vs. 80.8% controls), which had a negative impact on the sensitivity/specificity
but not on the overall classification accuracy of the methods. These methods were then
applied to predict BCR (as a binary outcome) using TCGA gene expression data, and both
techniques again performed similarly, although the genes selected by Lasso were a subset
of the genes selected by Elastic-net.
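
The effect of class imbalance on accuracy versus sensitivity can be illustrated with a small worked example. The sketch below is not the thesis analysis: it assumes a hypothetical cohort of 1,000 subjects with the same 19.2% event rate and a trivial classifier that labels every subject as a control. The function name `confusion_metrics` and the cohort size are illustrative assumptions.

```python
def confusion_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels (1 = event)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Guard against empty classes so the ratios are always defined.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Hypothetical cohort: 19.2% events (BCR) and 80.8% controls.
y_true = [1] * 192 + [0] * 808
# Trivial classifier (an assumption for illustration): predict "control" for everyone.
y_pred = [0] * 1000

acc, sens, spec = confusion_metrics(y_true, y_pred)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Under these assumptions the classifier reaches an accuracy equal to the control proportion while detecting no events, which is why overall accuracy alone can look acceptable on imbalanced data even when sensitivity is poor.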
Furthermore, a Cox survival model (using a shrinkage approach) was applied to predict time
to BCR from the most significant genes (n=743 genes), followed by stepwise variable
selection. Eight genes were selected, and the corresponding genetic
signature scores (with and without clinical characteristics) were generated. The novel score
had a strong positive association with high risk of BCR. Finally, scores based on the 31
Cell-Cycle Progression (CCP) genes (with and without clinical characteristics) were validated
in TCGA and showed strong positive associations with high risk of BCR. The newly
developed score in this thesis has several advantages over the widely used CCP score:
earlier identification of high-risk patients for better management, shorter
processing time, and reduced expense.