Model Card for Model ID

This model is designed to predict, from a fixed set of inputs, whether a person has diabetes or is at risk of developing it.


Model Description

This model is designed to predict, from a fixed set of inputs, whether a person has diabetes or is at risk of developing it. It takes 21 input features and is designed to work within form-based applications, i.e. software applications that collect user input.

NOTE: This model is meant as an assistive tool and must NOT be used directly to produce the final verdict on a person's or patient's condition. Its predictions are meant to prompt further evaluation.

Developed by: DeepNeural
Model type: Tabular Classifier
Language(s): English
License: MIT

Model Inputs

Variable Name	Type	Description	Encoding
HighBP	Binary	Does the patient have high blood pressure?	0 = no, 1 = yes
HighChol	Binary	Does the patient have high cholesterol?	0 = no, 1 = yes
CholCheck	Binary	Has the patient had a cholesterol check in the past 5 years?	0 = no, 1 = yes
BMI	Integer	Body Mass Index	Numeric value
Smoker	Binary	Has the patient smoked at least 100 cigarettes (5 packs) in their lifetime?	0 = no, 1 = yes
Stroke	Binary	Has the patient suffered from a stroke?	0 = no, 1 = yes
HeartDiseaseAttack	Binary	Coronary heart disease or myocardial infarction?	0 = no, 1 = yes
PhysActivity	Binary	Physical activity in the past 30 days?	0 = no, 1 = yes
Fruits	Binary	Does the patient consume one or more fruits per day?	0 = no, 1 = yes
Veggies	Binary	Does the patient consume vegetables one or more times per day?	0 = no, 1 = yes
HvyAlcoholConsump	Binary	Heavy drinker (14 drinks per week for men, 7 for women)?	0 = no, 1 = yes
AnyHealthcare	Binary	Does the patient have healthcare coverage?	0 = no, 1 = yes
NoDocbcCost	Binary	Difficulty reaching a doctor due to cost in the past 12 months?	0 = no, 1 = yes
GenHlth	Integer	How good is the patient's general health?	1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor
MenHlth	Integer	Days in the past 30 when mental health was not good?	Scale 1-30
PhysHlth	Integer	Days in the past 30 when physical health was poor?	Scale 1-30
DiffWalk	Binary	Does the patient have difficulty walking?	0 = no, 1 = yes
Sex	Binary	What is the patient's sex?	0 = female, 1 = male
Age	Integer	What is the patient's age?	1 = 18-24, 9 = 60-64, 13 = 80 or older
Education	Integer	Maximum education reached	1 = never attended school, 2 = grades 1-8, 3 = grades 9-11, 4 = grade 12 or GED, 5 = college (1-3 years), 6 = college (4+ years)
Income	Integer	Income level	1 = less than $10,000, 5 = less than $35,000, 8 = $75,000 or more
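
The encodings in the table above can be checked before a form submission reaches the model. The following is a minimal sketch and not part of the released model: `FEATURE_RANGES` and `validate_answers` are hypothetical names, the bounds are copied from the table, the BMI bounds are an assumed sanity range, and 0 is accepted for MenHlth and PhysHlth on the assumption that "no such days" is encoded as 0.

```python
# Hypothetical helper: checks form answers against the encodings in the table.
BINARY = (0, 1)

# Inclusive (min, max) bounds per feature, in the order listed in the table.
FEATURE_RANGES = {
    "HighBP": BINARY, "HighChol": BINARY, "CholCheck": BINARY,
    "BMI": (10, 100),  # assumption: sanity bounds, not stated in the model card
    "Smoker": BINARY, "Stroke": BINARY, "HeartDiseaseAttack": BINARY,
    "PhysActivity": BINARY, "Fruits": BINARY, "Veggies": BINARY,
    "HvyAlcoholConsump": BINARY, "AnyHealthcare": BINARY, "NoDocbcCost": BINARY,
    "GenHlth": (1, 5),
    "MenHlth": (0, 30),   # assumption: 0 allowed for "no such days"
    "PhysHlth": (0, 30),  # assumption: 0 allowed for "no such days"
    "DiffWalk": BINARY, "Sex": BINARY,
    "Age": (1, 13), "Education": (1, 6), "Income": (1, 8),
}


def validate_answers(answers):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, (low, high) in FEATURE_RANGES.items():
        if name not in answers:
            problems.append(f"missing value for {name}")
            continue
        value = answers[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{name} must be an integer, got {value!r}")
        elif not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    unknown = set(answers) - set(FEATURE_RANGES)
    problems.extend(f"unexpected field {name}" for name in sorted(unknown))
    return problems
```

A form handler could call `validate_answers(answers)` and only pass the submission on to the model when the returned list is empty.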

Model Sources

Dataset: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

Uses

This model is primarily designed for data scientists, software engineers and machine learning engineers who are developing diabetes-related software applications for healthcare institutions, ranging from hospitals to clinics. It is also intended for educational use within academia, where diabetes risk analysis is a focus of study.

Foreseeable end users of applications built with this model include doctors and nurses, who would use it with respect to their patients.

Bias, Risks, and Limitations

Please be advised that the training dataset was imbalanced, so the model was adjusted to place greater emphasis on the minority class (a positive result). This makes it better suited to real-life data, where the minority class is the more important one (see the results for metrics). The model may still misclassify some cases, so users are advised to remember that it is meant as an assistive tool that aids faster diagnostics.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More research is needed before further recommendations can be made. The model will continuously undergo improvement and testing to address the limitations mentioned in the previous section.

How to Get Started with the Model

To use this model, refer to the code below, which shows how it can be loaded directly into an application. Because the model was built with the scikit-learn machine learning library, it has been saved as a .joblib file. Copy the following code into your Python environment.

Install Joblib
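```
# Notebook syntax; drop the leading "!" when running in a terminal.
# scikit-learn is also needed to load a model that was built with it.
!pip install joblib scikit-learn
```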

Load the model after installation
```
import joblib

my_model = joblib.load('diabetes_health_indicators_classifier_v1.joblib')
```

Make predictions (binary or probability)
```
# X_test is your own 2-dimensional input data (see the note below)
my_model.predict(X_test)

# For probability-based outputs
my_model.predict_proba(X_test)
```

NOTE: This model requires input data in a 2-dimensional format (a Pandas DataFrame, not a Series) with the column names listed under Model Inputs, since the model is intended for form-based applications.
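
As a minimal sketch of preparing such input, the snippet below turns one form submission into a single-row table. `FEATURE_COLUMNS` and `form_to_frame` are hypothetical names; the column names are copied from the Model Inputs table, and their order is an assumption that should be checked against the order used during training.

```python
import pandas as pd

# Column names copied from the Model Inputs table. The order is an assumption
# and should match the order used when the model was trained.
FEATURE_COLUMNS = [
    "HighBP", "HighChol", "CholCheck", "BMI", "Smoker", "Stroke",
    "HeartDiseaseAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GenHlth",
    "MenHlth", "PhysHlth", "DiffWalk", "Sex", "Age", "Education", "Income",
]


def form_to_frame(answers):
    """Turn one form submission (a dict) into a single-row, 2-D DataFrame."""
    missing = [name for name in FEATURE_COLUMNS if name not in answers]
    if missing:
        raise ValueError(f"missing form fields: {missing}")
    # Wrapping the dict in a list yields one row; `columns` fixes the order.
    return pd.DataFrame([answers], columns=FEATURE_COLUMNS)
```

The returned frame has shape (1, 21) and can be passed to `my_model.predict(...)` or `my_model.predict_proba(...)` in place of `X_test`.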

Metrics

We trained several machine learning models on our dataset, namely logistic regression, a Stochastic Gradient Descent classifier, and Support Vector Machines, and evaluated each of them on new (unseen) test data. All three models performed well, with promising accuracy, recall, and AUC scores, which we considered the most trustworthy given that the dataset was originally imbalanced. We then applied several types of imbalance adjustment to place greater emphasis on the minority class, which is the more important class, and retrained all of the models on the adjusted dataset before drawing a conclusion. After hyperparameter tuning, the best performing model was the SGDClassifier. The primary metrics used were accuracy, recall, AUC, precision and F1-score. Please refer to the results section for the scores.
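
This card does not state which imbalance adjustments were applied. One common option is class weighting, sketched below as an illustration rather than as the method actually used here. The helper `balanced_class_weights` is a hypothetical name, the label counts are invented, and the formula n_samples / (n_classes * class_count) is the heuristic scikit-learn documents for `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels only (not the real dataset): 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# The rarer positive class receives the larger weight.
print(weights)
```

A dictionary like this could be passed to scikit-learn's `SGDClassifier` through its `class_weight` parameter, or `class_weight="balanced"` could be used to let the library compute the same weights.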

Results (best and final scores after correcting for class imbalance)

Accuracy: 73%
Precision: 31%
Recall: 78%
AUC: 75%
F1-Score: 44%
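
As a quick consistency check, the reported F1-score follows from the reported precision and recall, since F1 is their harmonic mean. The short calculation below also shows what a precision of 31% implies in practice: roughly two false positives for every true positive among the cases the model flags. The variable names are illustrative.

```python
precision = 0.31
recall = 0.78

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Expected false positives per true positive among positive predictions.
false_pos_per_true_pos = (1 - precision) / precision

print(f"F1 = {f1:.2f}")  # F1 = 0.44
print(f"False positives per true positive = {false_pos_per_true_pos:.1f}")  # 2.2
```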

Environmental Impact

Hardware Type: T4 (for training)
Hours used: < 20hr
Cloud Provider: Google Cloud
Compute Region: Europe
Carbon Emitted: 1.02