Stroke Predictions
Group Green
Dataset
Our goal for this project was to answer two research questions:
Which variables are most useful for determining whether an individual is likely to have a stroke?
How can we train a model on these variables to predict stroke likelihood?
Predictor variables
Our group chose to use the “Stroke Prediction” dataset, which observes 12 stroke-related factors for 5110 individuals. The categorical variables include:
-
id
-
gender
-
hypertension
-
heart disease
-
ever married
-
work type
-
residence type
-
smoking status
-
stroke
The quantitative variables include:
-
age
-
average glucose level
-
bmi
Selection
We immediately eliminated a few of the variables that we thought were unlikely to have an effect on stroke likelihood, including id, marriage status, work type, residence type, and smoking status. For the rest of the variables, we created a logistical regression model to narrow the list down further by their p-values, which were shown in the summary statistics for the model.
A low p-value means that we are closer to rejecting our null hypothesis that the variable has no effect on stroke likelihood, whereas a high p-value allows us to assume that the variable has little effect on stroke likelihood. Based on these statistics, we eliminated gender, leaving us with age, average glucose level, hypertension, and heart disease as our four predictor variables.
Model
We chose to use a logistical regression model as opposed to a linear regression model because the stroke variable is categorical, meaning there are only two outcomes: either the individual had a stroke or did not have a stroke. A logistical regression fits this type of variable much better than a linear regression.
Creation
After eliminating predictor variables with a high p-value, we trained a new logistical regression model using only the remaining predictor variables on the training data.
Evaluation
Training data
In order to evaluate the performance of the model, we performed 5-fold cross-validation on the training data and got an accuracy value of 0.9535595 and a ROC area under the curve value of 0.8525241.
The high accuracy value indicates that the model is a good fit to the data, and the relatively high ROC-AUC value means that the distribution of the positive curve (individuals who had a stroke) has very little overlap with the negative curve (individuals wo did not have a stroke).
Visually, when plotting the ROC curve, we can see that the model attempts to maximize both sensitivity and specificity whenever possible from the high curve and separation from the mid line.
Testing data
After confirming that the model is a good fit for our training data and can predict stroke likelihood reasonably well, we evaluated the performance of the model on our training data by creating another ROC-AUC curve. The AUC value for this curve is 0.813, which is slightly lower than the AUC value for the training data, but still relatively high. This indicates that our model is a reasonable fit for the training data as well and that it can predict stroke likelihood with a reasonably high degree of accuracy.
Application
With our newly updated and trained model, we can input a set of different values for age, average glucose level, hypertension, and heart disease that represents a single individual and use the predict() function to determine the likelihood that an individual will have a stroke. This type of model could be extremely valuable in the healthcare realm, especially for diseases that occur later in life and are difficult to predict.
Presentation
Our presentation can be found here.
Data
Soriano, F 2021, Stroke Prediction Dataset, electronic dataset, Kaggle, viewed 1 April 2021, https://www.kaggle.com/fedesoriano/stroke-prediction-dataset.
References
Soriano, F 2021, Stroke Prediction Dataset, electronic dataset, Kaggle, viewed 1 April 2021, https://www.kaggle.com/fedesoriano/stroke-prediction-dataset.
Executive Summary
Our project examined what variables are important when it comes to predicting the likelihood of stroke in individuals. Our data came from a public database from a online user named Fedesoriano. We are not certain where the sample was taken or if was randomized. The origin of the data was unknown, yet it seems to be a frequently used data set within the medical field. The reason why the source of the data is hard to find might be explained by the fact that medical data is a sensitive topic when it comes to protecting people’s privacy protection. Many people do not want their personal health information being used for studies or being publicized. Our project created a model that allows you to predict the likelihood of someone having stroke based off a few critical variables that can have a high correlation of whether someone will have a stroke or not.It was fairly simple to put in our own information, if you have it, and see your probability of having a stroke. This kind of model has been used by insurance companies for many years, in order to optimize their investments and cut their losses when it comes to issuing their customers insurance rates. With a just a little bit of medical history and information, insurance companies can predict whether you are at a high risk of stroke, and charge you a higher fee to cover for the higher opportunity cost. This model is not just used for stroke, but also for many other types of diseases in the medical field. With this type of modeling, there is an ethical question that must be raised that looks at if obtaining people’s heath information to use to predict their chance of stroke, or even death, should be allowed. Insurance companies profit off their ability to charge people higher rates according to their medical risk. Should this private health information be used by these insurance companies? This question has no simple answer, and therefore our group needs to be cautious if we ever choose to implement this model, or general idea in the future. When it comes to the accuracy of our model, we have to look back to the beginning of the project when we made the decision to eliminate variables from the beginning. Our methodology involved removing the variables that were categorical, in order to facilitate the construction of the model. It may have been ignorant of our group to remove these variables as they may have had a significant correlation with the chance of stroke in patients. The variables we removed were gender, working status, residence type, smoking status, and ever-married. If we redid this project, it would interesting to see if our model would have any significant change. For the final selection of our variables we calculated the p-values of the variables, and choose those that best predicted the probability of stroke. While this may have been a clever technique that allowed us to save the time of trial and error, there is always the chance that using the variables we choose to omit might be crucial in our model. Overall, our group was pleased with our project, and we think that the applicability of our models provides some interesting ideas to using this type of modeling in the future. We recognize how there are improvement that can be made, but it is very satisfying to watch a model perform as intended with real world results.