class: center, middle, inverse, title-slide

# Predicting stroke probability

### Team Green
Matthew Leblanc, Sarika Rau, Martin Zakoian, Josias Zongo

### 2021-05-12

---
class: center, middle

## Which variables are most useful for determining whether an individual is likely to have a stroke?

<br>
<br>

How can we train a model on these variables to predict stroke likelihood?

---

# The dataset

.pull-left[
9 categorical variables:

- id
- gender
- hypertension
- heart disease
- ever married
- work type
- residence type
- smoking status
- stroke
]

.pull-right[
3 quantitative variables:

- age
- average glucose level
- bmi
]

---

# Choosing potential predictor variables

We used a logistic regression model to narrow down the predictor variables by their p-values, starting from the following candidates:

- gender
- hypertension
- heart disease
- age
- average glucose level

---

# Choosing potential predictor variables

```r
stroke_model_prelim <- glm(stroke ~ gender + age + avg_glucose_level +
                             hypertension + heart_disease,
                           data = train_data, family = binomial)
summary(stroke_model_prelim)
```

```
## 
## Call:
## glm(formula = stroke ~ gender + age + avg_glucose_level + hypertension + 
##     heart_disease, family = binomial, data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2159  -0.3075  -0.1601  -0.0759   3.2214  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.803983   0.440518 -17.715  < 2e-16 ***
## genderMale         0.096937   0.164841   0.588  0.55649    
## genderOther       -8.332742 535.411238  -0.016  0.98758    
## age                0.069014   0.006200  11.132  < 2e-16 ***
## avg_glucose_level  0.005416   0.001356   3.995 6.46e-05 ***
## hypertension       0.453510   0.191232   2.372  0.01772 *  
## heart_disease      0.573123   0.209305   2.738  0.00618 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1446.4  on 3832  degrees of freedom
## Residual deviance: 1131.8  on 3826  degrees of freedom
## AIC: 1145.8
## 
## Number of Fisher Scoring iterations: 12
```

---
class: middle, center

# Why a logistic model?
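
Because `stroke` is a binary outcome, a linear model could produce fitted values outside $[0, 1]$. Logistic regression instead models the log-odds as a linear function of the predictors, so every fitted probability stays between 0 and 1:

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$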
---

# Creating the model

- eliminated predictor variables with a high p-value, including gender
- trained the updated model on the training data

---

# Evaluating the model: training data

```
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.954     5 0.00397 Preprocessor1_Model1
## 2 roc_auc  binary     0.853     5 0.0117  Preprocessor1_Model1
```

---

# Evaluating the model: training data

```
## Setting levels: control = 0, case = 1
```

```
## Setting direction: controls < cases
```

<img src="presentation_files/figure-html/fit1-1.png" width="80%" />

---

# Evaluating the model: testing data

```
## Setting levels: control = 0, case = 1
```

```
## Setting direction: controls < cases
```

<img src="presentation_files/figure-html/fit2-1.png" width="80%" />

---
class: middle, center

# How can we use this in the real world?

---

# Sample prediction

```r
predict(stroke_model_updated,
        newdata = data.frame(age = 30, avg_glucose_level = 130,
                             hypertension = 0, heart_disease = 1),
        type = "response")
```

```
##          1 
## 0.01216023
```

---
class: inverse, middle, center

# Thank you
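
---

# Appendix: model-refit sketch

A minimal sketch of how the updated model can be refit and evaluated. The model and data names `stroke_model_updated` and `train_data` come from the deck; `test_data` and the use of the pROC package (suggested by the "Setting levels/direction" messages on the evaluation slides) are assumptions.

```r
library(pROC)  # assumed; source of the ROC curve and its console messages

# Refit after dropping gender, which had a high p-value in the
# preliminary model
stroke_model_updated <- glm(
  stroke ~ age + avg_glucose_level + hypertension + heart_disease,
  data = train_data, family = binomial
)

# Predicted stroke probabilities on the held-out test set (assumed name)
test_probs <- predict(stroke_model_updated, newdata = test_data,
                      type = "response")

# ROC curve and AUC; roc() prints the levels/direction messages shown
# on the evaluation slides
roc_obj <- roc(test_data$stroke, test_probs)
plot(roc_obj)
auc(roc_obj)
```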