Saudi Cultural Missions Theses & Dissertations
Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10
4 results
Search Results
Item Restricted
Scalable Human Mobility Prediction: Integrating Clustering and Parallel Processing (Saudi Digital Library, 2024) Alhomidan, Suliman; Chen, Zexun
Human mobility modelling is essential for various applications, including urban planning, transportation logistics, and public health. Traditional algorithms for predicting human movement patterns face significant computational challenges, particularly with large-scale datasets. This dissertation addresses these challenges by introducing an optimised approach that leverages parallel computing and machine learning techniques. We refactored the existing human mobility prediction algorithm to utilise Dask, a parallel computing library that enables distributed processing. This modification enhanced the algorithm's scalability and computational efficiency, making it suitable for big data environments. Additionally, we incorporated clustering as a preprocessing step to group similar users, significantly reducing the number of pairwise comparisons required for trajectory analysis. We evaluated eight clustering algorithms: K-means, Gaussian Mixture Models (GMM), DBSCAN, MeanShift, Agglomerative Clustering, OPTICS, Birch, and HDBSCAN. Each algorithm was tested with various hyperparameters and clustering approaches. Performance metrics, including execution time, Adjusted Rand Index (ARI), and Normalised Mutual Information (NMI), were used to assess the computational efficiency and clustering accuracy of each algorithm. Our findings indicate that the “mean” and “std” aggregation methods consistently provide the best performance in terms of ARI and NMI. The “std” method demonstrated the lowest execution times, highlighting its computational efficiency. The results underscore the importance of selecting appropriate clustering algorithms and parameter values to optimise performance.
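As a rough illustration of why grouping users first cuts the pairwise work described above, the sketch below counts unordered user pairs with and without clustering. The function names and the 12-user example are hypothetical and are not taken from the dissertation's code; the sketch only counts comparisons and performs no trajectory analysis.

```python
from collections import Counter


def pairs(n: int) -> int:
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def comparisons_without_clustering(num_users: int) -> int:
    """Every user is compared with every other user."""
    return pairs(num_users)


def comparisons_with_clustering(labels: list[int]) -> int:
    """Users are compared only with others that share their cluster label."""
    return sum(pairs(size) for size in Counter(labels).values())


# Hypothetical example: 12 users split into 3 equal clusters of 4.
labels = [0] * 4 + [1] * 4 + [2] * 4
print(comparisons_without_clustering(len(labels)))  # 66
print(comparisons_with_clustering(labels))          # 18
```

With equal-sized clusters the pair count shrinks roughly by a factor of k, the number of clusters; uneven clusters reduce the saving.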
The improved approach was validated through practical examples, demonstrating substantial reductions in computational complexity compared to the original algorithm. For instance, clustering reduced the complexity from O(n^2 · m^2) to O(t · n · k) + O((n^2 / k) · m^2), where n is the number of users, m is the number of records per user, k is the number of clusters, and t is the number of iterations for clustering convergence. The practical implications of this research are significant, offering improved computational efficiency for applications in urban planning, public health, and commercial sectors. However, challenges such as real-time processing, adaptive clustering methodologies, and ethical considerations remain. Future research should address these challenges to further enhance the algorithm's applicability and performance. This dissertation presents a robust and scalable solution for human mobility modelling, integrating parallel computing and clustering techniques to significantly improve computational efficiency and accuracy. The flexibility of the implemented code allows users to tailor the clustering approach to their specific needs, ensuring optimal performance for various applications.
Item Restricted
EXPERIMENTAL STUDY OF THE IMPORTANCE OF DATA FOR MACHINE LEARNING-BASED BREAST CANCER OUTCOME PREDICTION (Saudi Digital Library, 2024) Yamani, Wid; Wojtusiak, Janusz
Wid Yamani, Ph.D. George Mason University, 2025. Dissertation Director: Dr. Janusz Wojtusiak.
Researchers have used various large-scale datasets to develop and validate predictive models in breast cancer outcome prediction. However, a notable gap exists due to the lack of a systematic comparison among these datasets regarding predictive performance, feature availability, and suitability for different analytical objectives.
While each dataset has unique strengths and limitations, no comprehensive studies evaluate how these differences impact model performance, particularly across diverse timeframes, survival, and recurrence outcomes. This gap limits researchers in making informed choices about the most appropriate dataset for specific research questions. Effective modeling and prediction of breast cancer outcomes (such as cancer survival and recurrence) rely on the dataset's quality, the pre-processing techniques used to clean and transform data, and the choice of predictive models. Therefore, selecting a suitable dataset and identifying relevant variables are as crucial as the choice of the model itself. This thesis addresses this gap by systematically comparing five prominent datasets for predicting breast cancer outcomes. This dissertation compares five datasets—SEER Research 8, SEER Research 17, SEER Research Plus, SEER-Medicare, and Medicare Claims data—focusing on breast cancer survival and recurrence. It evaluates the predictive performance of each dataset using supervised machine learning methods, including logistic regression, random forest, and gradient boosting. The models were tested on metrics such as AUC, accuracy, recall, and precision, with gradient boosting delivering the most accurate results. The findings indicate that SEER-Medicare, which integrates cancer registry data with three years of retrospective claims, outperformed the other datasets, achieving AUCs of 0.891 for 5-year survival and 0.942 for 10-year survival. This dataset's inclusion of comprehensive health information, including pre-existing conditions and other claims data, makes it particularly valuable for outcome prediction. However, a drawback of SEER-Medicare is that it primarily includes patients aged 65 and older, as it is based on Medicare data. 
This limitation reduces its suitability for predicting outcomes in younger breast cancer patients, a significant subgroup with distinct risk factors and treatment responses. SEER Research Plus ranked second, offering data on patient demographics, breast cancer characteristics, staging, outcomes, and treatment, with AUC values of 0.877, 0.901, and 0.937 for 5-year, 10-year, and 15-year survival, respectively. SEER Research 17 and SEER Research 8 include patient demographics, breast cancer characteristics, and staging information but lack treatment details. SEER Research 17, which covers a larger population with more variables, yielded AUC values of 0.870 for 5-year survival, 0.897 for 10-year survival, and 0.920 for 15-year survival. SEER Research 8, which covers a smaller population over a more extended period, yielded slightly lower AUC values of 0.857, 0.868, and 0.880 for 5-year, 10-year, and 15-year survival, respectively. Results indicate that including treatment and additional variables significantly enhances prediction accuracy, while the data size is less critical. This thesis is the first study that compares SEER datasets and provides a groundbreaking, comprehensive evaluation of them, offering crucial insights into how data characteristics influence breast cancer outcome modeling.
Item Restricted
Advancing Emergency Department Efficiency, Infectious Disease Management at Mass Gatherings, and Self-Efficacy Through Data Science and Dynamic Modeling (Virginia Polytechnic Institute and State University, 2024-02-27) Ba-Aoum, Mohammed; Hosseinichimeh, Niyousha; Triantis, Konstantinos
This dissertation employs management systems engineering principles, data science, and industrial systems engineering techniques to address pressing challenges in emergency department (ED) efficiency, infectious disease management at mass gatherings, and student self-efficacy.
It is structured into three essays, each contributing to a distinct domain of research, and utilizes industrial and systems engineering approaches to provide data-driven insights and recommend solutions. The first essay used data analytics and regression analysis to understand how patient length of stay (LOS) in EDs could be influenced by multi-level variables integrating patient, service, and organizational factors. The findings suggested that specific demographic variables, the complexity of service provided, and staff-related variables significantly impacted LOS, offering guidance for operational improvements and better resource allocation. The second essay utilized system dynamics simulations to develop a modified SEIR model for modeling infectious diseases during mass gatherings and assessing the effectiveness of commonly implemented policies. The results demonstrated the significant collective impact of interventions such as visitor limits, vaccination mandates, and mask wearing, emphasizing their role in preventing health crises. The third essay applied machine learning methods to predict student self-efficacy in Muslim societies, revealing the importance of socio-emotional traits, cognitive abilities, and regulatory competencies. It provided a basis for identifying students with varying levels of self-efficacy and developing tailored strategies to enhance their academic and personal success. Collectively, these essays underscore the value of data-driven and evidence-based decision-making. The dissertation’s broader impact lies in its contribution to optimizing healthcare operations, informing public health policy, and shaping educational strategies to be more culturally sensitive and psychologically informed.
It provides a roadmap for future research and practical applications across the healthcare, public health, and education sectors, fostering advancements that could significantly benefit society.
Item Restricted
Predicting Paid Certification in Massive Open Online Courses (Durham University, 2024-02-08) Alshehri, Mohammad Abdullah; Cristea, Alexandra
Massive open online courses (MOOCs) have been proliferating because of the free or low-cost offering of content for learners, attracting the attention of many stakeholders across the entire educational landscape. Since 2012, coined as “the Year of the MOOCs”, several platforms have gathered millions of learners in just a decade. Nevertheless, the certification rate of both free and paid courses has been low, and only about 4.5–13% and 1–3%, respectively, of the total number of enrolled learners obtain a certificate at the end of their courses. Still, most research concentrates on completion, ignoring the certification problem, and especially its financial aspects. Thus, the research described in the present thesis aimed to investigate paid certification in MOOCs, for the first time, in a comprehensive way, and as early as the first week of the course, by exploring its various levels. First, the latent correlation between learner activities and their paid certification decisions was examined by (1) statistically comparing the activities of non-paying learners with course purchasers and (2) predicting paid certification using different machine learning (ML) techniques. Our temporal (weekly) analysis showed statistical significance at various levels when comparing the activities of non-paying learners with those of the certificate purchasers across the five courses analysed.
Furthermore, we used the learner’s activities (number of step accesses, attempts, correct and wrong answers, and time spent on learning steps) to build our paid certification predictor, which achieved promising balanced accuracies (BAs), ranging from 0.77 to 0.95. Having employed simple predictions based on a few clickstream variables, we then analysed in more depth what other information can be extracted from MOOC interaction (namely discussion forums) for paid certification prediction. However, to better explore the learners’ discussion forums, we built, as an original contribution, MOOCSent, a cross-platform review-based sentiment classifier, using over 1.2 million MOOC sentiment-labelled reviews. MOOCSent addresses various limitations of the current sentiment classifiers, including (1) using a single source of data (previous literature on sentiment classification in MOOCs was based on single platforms only, and hence less generalisable, with a relatively low number of instances compared to our obtained dataset); (2) lower model outputs, where most of the current models are based on 2-polar classifiers (positive or negative only); (3) disregarding important sentiment indicators, such as emojis and emoticons, during text embedding; and (4) reporting average performance metrics only, preventing the evaluation of model performance at the level of class (sentiment). Finally, and with the help of MOOCSent, we used the learners’ discussion forums to predict paid certification after annotating learners’ comments and replies with the sentiment using MOOCSent. This multi-input model contains raw data (learner textual inputs), sentiment classification generated by MOOCSent, computed features (number of likes received for each textual input), and several features extracted from the texts (character counts, word counts, and part of speech (POS) tags for each textual instance).
This experiment adopted various deep predictive approaches (specifically those that allow a multi-input architecture) to investigate early (i.e., weekly) whether data obtained from MOOC learners’ interaction in discussion forums can predict learners’ purchase decisions (certification). Considering the staggeringly low rate of paid certification in MOOCs, this present thesis contributes to the knowledge and field of MOOC learner analytics by predicting paid certification, for the first time, at such a comprehensive (with data from over 200 thousand learners from 5 different discipline courses), actionable (analysing learners’ decisions from the first week of the course) and longitudinal (with 23 runs from 2013 to 2017) scale. The present thesis contributes by (1) investigating various conventional and deep ML approaches for predicting paid certification in MOOCs using learner clickstreams (Chapter 5) and course discussion forums (Chapter 7); (2) building the largest MOOC sentiment classifier (MOOCSent), which is based on learners’ reviews of the courses from the leading MOOC platforms, namely Coursera, FutureLearn and Udemy, and which handles emojis and emoticons using dedicated lexicons that contain over three thousand corresponding explanatory words/phrases; and (3) proposing and developing, for the first time, a multi-input model for predicting certification based on the data from discussion forums, which synchronously processes the textual (comments and replies) and numerical (number of likes posted and received, sentiments) data from the forums, adapting the suitable classifier for each type of data as explained in detail in Chapter 7.
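The balanced accuracy (BA) figures quoted in the abstract above are, under the usual definition, the mean of the per-class recalls. The minimal sketch below shows why that metric suits a problem where only a small share of learners pay. The function name and the toy labels are hypothetical, and the abstract does not state how the thesis itself computed BA.

```python
def balanced_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Mean of per-class recall, computed over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Toy data (assumption): 8 non-paying learners (0) and 2 purchasers (1).
y_true = [0] * 8 + [1] * 2

# A predictor that always says "non-paying" is 80% accurate but has BA 0.5.
always_zero = [0] * 10
print(balanced_accuracy(y_true, always_zero))  # 0.5

# A predictor that finds one of the two purchasers at the cost of one false alarm.
mixed = [0] * 7 + [1] + [1, 0]
print(balanced_accuracy(y_true, mixed))  # 0.6875
```

Plain accuracy would rank the always-negative predictor higher (0.8 against 0.8 for the mixed one), whereas BA separates them because each class contributes equally.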
