Browsing by Author "Yamani, Wid"

Now showing 1 - 1 of 1

Restricted
EXPERIMENTAL STUDY OF THE IMPORTANCE OF DATA FOR MACHINE LEARNING-BASED BREAST CANCER OUTCOME PREDICTION
(Saudi Digital Library, 2024) Yamani, Wid; Wojtusaik, Janusz
EXPERIMENTAL STUDY OF THE IMPORTANCE OF DATA FOR MACHINE LEARNING-BASED BREAST CANCER OUTCOME PREDICTION Wid Yamani, Ph.D. George Mason University, 2025 Dissertation Director: Dr. Janusz Wojtusiak Researchers have used various large-scale datasets to develop and validate predictive models in breast cancer outcome prediction. However, a notable gap exists due to the lack of a systematic comparison among these datasets regarding predictive performance, feature availability, and suitability for different analytical objectives. While each dataset has unique strengths and limitations, no comprehensive studies evaluate how these differences impact model performance, particularly across diverse timeframes, survival, and recurrence outcomes. This gap limits researchers in making informed choices about the most appropriate dataset for specific research questions. Effective modeling and prediction of breast cancer outcomes (such as cancer survival and recurrence) rely on the dataset's quality, the pre-processing techniques used to clean and transform data, and the choice of predictive models. Therefore, selecting a suitable dataset and identifying relevant variables are as crucial as the choice of the model itself. This thesis addresses this gap by systematically comparing five prominent datasets for predicting breast cancer outcomes. This dissertation compares five datasets—SEER Research 8, SEER Research 17, SEER Research Plus, SEER-Medicare, and Medicare Claims data—focusing on breast cancer survival and recurrence. It evaluates the predictive performance of each dataset using supervised machine learning methods, including logistic regression, random forest, and gradient boosting. The models were tested on metrics such as AUC, accuracy, recall, and precision, with gradient boosting delivering the most accurate results. The findings indicate that SEER-Medicare, which integrates cancer registry data with three years of retrospective claims, outperformed the other datasets, achieving AUCs of 0.891 for 5-year survival and 0.942 for 10-year survival. This dataset's inclusion of comprehensive health information, including pre-existing conditions and other claims data, makes it particularly valuable for outcome prediction. However, a drawback of SEER-Medicare is that it primarily includes patients aged 65 and older, as it is based on Medicare data. This limitation reduces its suitability for predicting outcomes in younger breast cancer patients, a significant subgroup with distinct risk factors and treatment responses. SEER Research Plus ranked second, offering data on patient demographics, breast cancer characteristics, staging, outcomes, and treatment, with AUC values of 0.877, 0.901, and 0.937 for 5-year, 10-year, and 15-year survival, respectively. SEER Research 17 and SEER Research 8 include patient demographics, breast cancer characteristics, and staging information but lack treatment details. SEER Research 17, which covers a larger population with more variables, yielded AUC values of 0.870 for 5-year survival, 0.897 for 10-year survival, and 0.920 for 15-year survival. SEER Research 8, which covers a smaller population over a more extended period, yielded slightly lower AUC values of 0.857, 0.868, and 0.880 for 5-year, 10-year, and 15-year survival, respectively. Results indicate that including treatment and additional variables significantly enhances prediction accuracy while the data size is less critical. This thesis is the first study that compares SEER datasets and provides a groundbreaking, comprehensive evaluation of these datasets, providing crucial insights into how data characteristics influence breast cancer outcome modeling.
12 0