This project aims to build a machine learning model to classify individuals as diabetic or non-diabetic based on various health indicators. The dataset used is the "Diabetes Binary Health Indicators BRFSS 2015" from the CDC.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
df = pd.read_csv("diabetes_binary_health_indicators_BRFSS2015.csv")
profile = ProfileReport(df, title="Profiling Report")
profile.to_file("analysis_report.html")
The generated profiling report gives an overview of the dataset, including feature distributions, missing values, and correlations.
print("First few rows of the dataset:")
df.head()
print("Columns in the dataset:")
df.columns
print("Statistical summary of the dataset:")
df.describe().T
print("Information about the dataset:")
df.info()
print("Number of missing values in each column:")
df.isnull().sum()
print("Number of duplicated rows in the dataset:")
df.duplicated().sum()
print("Number of unique values in each column:")
df.nunique()
print("Correlation matrix:")
df.corr(numeric_only=True)
plt.figure(figsize=(16, 10))
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
sns.countplot(x='Diabetes_binary', data=df)
plt.title("Class Distribution of Diabetes_binary")
plt.show()
plt.figure(figsize=(12, 8))
df.corr()['Diabetes_binary'].sort_values().plot(kind='bar')
plt.title('Correlation with Diabetes_binary')
plt.show()
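corr = df.corr(numeric_only=True)
# Mask the upper triangle so each pair of features appears only once
mask = np.triu(np.ones_like(corr, dtype=bool))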
cmap = sns.diverging_palette(100, 7, s=75, l=40, n=5, center="light", as_cmap=True)
plt.figure(figsize=(15, 12))
sns.heatmap(corr, mask=mask, center=0, annot=True, fmt='.2f', square=True, cmap=cmap)
plt.show()
The dataset contains no missing values, but it does contain duplicated rows, which were removed.
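The code above does not show how the duplicated rows were handled. A minimal sketch of one way to do it with pandas follows; the small `toy` DataFrame is a hypothetical stand-in for the real data, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the BRFSS data.
toy = pd.DataFrame({
    "Diabetes_binary": [0, 1, 0, 0, 1],
    "BMI": [25.0, 31.0, 25.0, 28.0, 31.0],
    "HighBP": [0, 1, 0, 1, 1],
})

n_before = len(toy)
# Keep the first occurrence of each fully identical row.
deduped = toy.drop_duplicates().reset_index(drop=True)
n_removed = n_before - len(deduped)

print(f"Removed {n_removed} duplicated rows; {len(deduped)} remain.")
```

On the real data the equivalent call would be `df = df.drop_duplicates()`, applied before the train/test split.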
X = df.drop(columns='Diabetes_binary')
y = df['Diabetes_binary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
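Because the target is imbalanced, a plain random split can leave the training and test sets with slightly different class ratios. The sketch below uses synthetic labels (an assumption, not the BRFSS data) to show how the `stratify` argument of `train_test_split` keeps the positive rate the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in: 1000 samples, 3 features, exactly 15% positives.
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 150 + [0] * 850)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification both subsets keep the same positive rate.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```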
The features were scaled with the following scalers:
- StandardScaler
- MinMaxScaler
- RobustScaler
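To make the difference between the three scalers concrete, here is a small sketch on a hypothetical single feature with one large outlier. The values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical single feature with one large outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    # Zero mean, unit variance.
    "StandardScaler": StandardScaler().fit_transform(values),
    # Rescaled to the range [0, 1].
    "MinMaxScaler": MinMaxScaler().fit_transform(values),
    # Centered on the median, scaled by the interquartile range.
    "RobustScaler": RobustScaler().fit_transform(values),
}

for name, arr in scaled.items():
    print(name, np.round(arr.ravel(), 3))
```

The outlier compresses the other values under `MinMaxScaler` and `StandardScaler`, while `RobustScaler` keeps them evenly spaced because the median and interquartile range are barely affected by a single extreme value.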
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
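SMOTE balances the classes by creating synthetic minority samples rather than copying existing ones. The sketch below illustrates the core idea only: pick a minority sample, pick one of its nearest minority neighbours, and interpolate between them. It is a simplified illustration, not the imbalanced-learn implementation, and the helper `smote_like_sample` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like_sample(minority, k=3, n_new=5, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))
        neigh = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neigh] - minority[base])
    return synthetic

# Hypothetical 2-D minority-class points.
minority_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_pts = smote_like_sample(minority_pts)
print(new_pts)
```

Every synthetic point lies on a line segment between two real minority samples, which is why resampling is applied to the training data only: the test set should contain real observations.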
Models were built with the following classifiers:
- LogisticRegression
- RandomForestClassifier
- GradientBoostingClassifier
- KNeighborsClassifier
- GaussianNB
- DecisionTreeClassifier
- XGBClassifier
- CatBoostClassifier
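One way to compare several classifiers on equal terms is to score each with the same cross-validation procedure. The sketch below does this for a subset of the listed models on synthetic data generated with `make_classification`. The data, the choice of F1 as the score, and the model settings are assumptions for illustration, and `XGBClassifier` and `CatBoostClassifier` are left out to keep the example dependent on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training data (not the BRFSS set).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=42
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=42),
    "GaussianNB": GaussianNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

# Mean 5-fold F1 score per model.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```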
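from sklearn.linear_model import LogisticRegression

# Scale the features, then fit a logistic regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])
# Grid over the regularization strength C with an L2 penalty
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__penalty': ['l2']
}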
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train_res, y_train_res)
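After fitting, a `GridSearchCV` object exposes the winning parameter combination and its mean cross-validated score. The sketch below shows how these can be inspected, using synthetic data and a `LogisticRegression` classifier, to which the `C` parameter applies. The `demo_` names and the F1 scoring choice are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the BRFSS set).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
# The "classifier__" prefix routes each parameter to the pipeline step of that name.
demo_grid = {"classifier__C": [0.1, 1, 10]}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring="f1")
demo_search.fit(X_demo, y_demo)

print("best parameters:", demo_search.best_params_)
print("best mean CV F1:", round(demo_search.best_score_, 3))
```

One caveat when the grid search is fitted on data that was resampled beforehand: synthetic samples can end up in the validation folds, so the cross-validated scores may be optimistic. Placing the resampling step inside an `imblearn.pipeline.Pipeline` restricts it to the training part of each fold.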
Models were evaluated using metrics such as:
- Accuracy
- Precision
- Recall
- F1 Score
from sklearn.metrics import classification_report, confusion_matrix
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
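The four metrics listed above can all be derived from the entries of the confusion matrix. The sketch below works through the arithmetic on a small set of hypothetical labels and predictions, invented for illustration, and cross-checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical labels and predictions, for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The manual values agree with scikit-learn's own functions.
assert np.isclose(accuracy, accuracy_score(y_true_demo, y_pred_demo))
assert np.isclose(f1, f1_score(y_true_demo, y_pred_demo))

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On an imbalanced target, accuracy alone can look good for a model that rarely predicts the minority class, which is why precision, recall, and F1 for the diabetic class are worth reading alongside it.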
- Data Quality: The dataset had a significant number of duplicated rows which needed to be removed.
- Feature Importance: Features such as BMI, HighBP, and Age showed a higher correlation with diabetes than the other indicators.
- Class Imbalance: The target variable was imbalanced, necessitating the use of techniques like SMOTE to handle it.
- Model Performance: Ensemble models like Random Forest and Gradient Boosting performed better compared to simpler models like Logistic Regression and Naive Bayes.
- Hyperparameter Tuning: GridSearchCV was effective in tuning the hyperparameters and improving model performance.
The project successfully classified individuals as diabetic or non-diabetic using various machine learning models. Ensemble methods proved to be the most effective, and handling class imbalance was crucial for improving model performance.