Machine Learning Based Prediction of Diabetes
Date
2024
Publisher
University of Nottingham
Abstract
Diabetes mellitus, commonly known as diabetes, is a chronic disease related to the metabolic
system of humans. It is part of a broader category of chronic diseases, including cardiovascular
disease, acute kidney infection, eye problems, and foot ulcers. Currently, 537 million people
worldwide are living with diabetes, a figure expected to rise to 643 million by 2030. Given the
limited availability of medical professionals, there is an increasing need to develop automated tools
to assist decision-making for various diseases using prevalence datasets.
This dissertation focuses on the implementation of both deterministic models, such as decision
trees, random forests, support vector machines, and neural networks, and probabilistic models,
including logistic regression, Naïve Bayes, Gaussian Naïve Bayes, and nonparametric Naïve
Bayes, for binary diabetes classification. Seven input features—age, gender, BMI, blood glucose
level, HbA1C level, hypertension (yes/no), and heart disease (yes/no)—along with the binary
response variable (diabetes), are utilized to develop these classification models.
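As a rough illustration of one of the probabilistic models named above, the following is a minimal Gaussian Naïve Bayes sketch in plain Python. It is not the dissertation's implementation: the field names in FEATURES, the six toy records, and the function names are all assumptions invented for this example, and treating the binary inputs (gender, hypertension, heart disease) as Gaussian is a simplification.

```python
import math
from collections import defaultdict

# Hypothetical field order for one patient record; the names are assumptions
# chosen to mirror the seven inputs listed in the abstract.
FEATURES = ["age", "gender", "bmi", "blood_glucose", "hba1c",
            "hypertension", "heart_disease"]


def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means, and feature variances per class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance positive for constant columns.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one record."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # Log of the Gaussian density, summed under the independence assumption.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy records, invented for illustration only (not taken from the dataset).
X = [
    [45, 0, 24.0, 100, 5.2, 0, 0],
    [50, 1, 26.5, 110, 5.6, 1, 0],
    [38, 1, 22.1, 95, 5.0, 0, 1],
    [62, 0, 31.0, 190, 7.4, 1, 0],
    [58, 1, 33.5, 210, 8.1, 0, 1],
    [70, 0, 29.8, 180, 7.0, 1, 1],
]
y = [0, 0, 0, 1, 1, 1]  # 1 = diabetes, 0 = no diabetes

model = fit_gaussian_nb(X, y)
low_risk = predict_gaussian_nb(model, [40, 0, 23.0, 98, 5.1, 0, 0])
high_risk = predict_gaussian_nb(model, [65, 1, 32.0, 200, 7.8, 1, 1])
```

The same seven-value record layout could be passed to any of the other classifiers; only the fit and predict functions would change.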
The dataset comprises 100,000 patients and eight variables (the seven input features plus the response), with a significant class imbalance:
91.5% do not have diabetes. Among the models, the decision tree exhibited the highest balanced
accuracy of 98.48%, with a sensitivity of 100% and a specificity of 96.95%. The decision tree
outperformed all other models when applied to the imbalanced data. For the balanced data, the
random forest model outperformed all other models except logistic regression, with a
balanced accuracy of 92.42%, sensitivity of 92%, and specificity of 92.85%.
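The balanced accuracy figures above are the mean of sensitivity and specificity. The short sketch below, with function names chosen here for illustration, reproduces that arithmetic from the reported values and shows why plain accuracy is a weak metric at this level of imbalance. The baseline counts are approximations derived from the stated 91.5% of 100,000 patients.

```python
def sensitivity(tp, fn):
    """True positive rate: share of diabetic patients correctly identified."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of non-diabetic patients correctly identified."""
    return tn / (tn + fp)


def balanced_accuracy(sens, spec):
    """Mean of sensitivity and specificity."""
    return (sens + spec) / 2


# Sensitivity and specificity as reported in the abstract.
dt_balanced = balanced_accuracy(1.0, 0.9695)   # decision tree, imbalanced data
rf_balanced = balanced_accuracy(0.92, 0.9285)  # random forest, balanced data

# Trivial baseline that always predicts "no diabetes".
# Counts are approximate, derived from 91.5% of 100,000 patients.
tp, fn, tn, fp = 0, 8_500, 91_500, 0
baseline_accuracy = (tp + tn) / (tp + fn + tn + fp)
baseline_balanced = balanced_accuracy(sensitivity(tp, fn), specificity(tn, fp))
```

The two averages come to 98.475% and 92.425%, consistent with the reported 98.48% and 92.42% up to rounding of the underlying values. The always-negative baseline reaches 91.5% plain accuracy but only 50% balanced accuracy, which is why balanced accuracy is the more informative metric here.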
These models can be further refined by considering additional relevant variables and applying
advanced deep-learning models.
Keywords
Diabetes Prediction, Machine Learning Models, Deterministic Models, Probabilistic Models, Decision Trees, Random Forests, Support Vector Machines (SVM), Naïve Bayes, Artificial Neural Networks (ANN), Logistic Regression, Classification Algorithms, Synthetic Minority Over-sampling Technique (SMOTE), Diabetes Mellitus