class: center, middle, inverse, title-slide

# Predicting stroke probability

### Team Green
Matthew Leblanc, Sarika Rau, Martin Zakoian, Josias Zongo

### 2021-05-12

---
class: center, middle

## Which variables are most useful for determining whether an individual is likely to have a stroke?

<br>
<br>

How can we train a model on these variables to predict stroke likelihood?

---

# The dataset

.pull-left[
9 categorical variables:

- id
- gender
- hypertension
- heart disease
- ever married
- work type
- residence type
- smoking status
- stroke
]

.pull-right[
3 quantitative variables:

- age
- average glucose level
- bmi
]

---

# Choosing potential predictor variables

We used a logistic regression model to narrow down the predictor variables by their p-values, starting from the following candidates:

- gender
- hypertension
- heart disease
- age
- average glucose level

---

# Choosing potential predictor variables

```r
stroke_model_prelim <- glm(stroke ~ gender + age + avg_glucose_level +
                             hypertension + heart_disease,
                           data = train_data, family = binomial)
summary(stroke_model_prelim)
```

```
## 
## Call:
## glm(formula = stroke ~ gender + age + avg_glucose_level + hypertension + 
##     heart_disease, family = binomial, data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2159  -0.3075  -0.1601  -0.0759   3.2214  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.803983   0.440518 -17.715  < 2e-16 ***
## genderMale         0.096937   0.164841   0.588  0.55649    
## genderOther       -8.332742 535.411238  -0.016  0.98758    
## age                0.069014   0.006200  11.132  < 2e-16 ***
## avg_glucose_level  0.005416   0.001356   3.995 6.46e-05 ***
## hypertension       0.453510   0.191232   2.372  0.01772 *  
## heart_disease      0.573123   0.209305   2.738  0.00618 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1446.4  on 3832  degrees of freedom
## Residual deviance: 1131.8  on 3826  degrees of freedom
## AIC: 1145.8
## 
## Number of Fisher Scoring iterations: 12
```

---
class: middle, center

# Why a logistic model?
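
Because `stroke` is a binary outcome, a linear model could produce fitted values outside $[0, 1]$. Logistic regression instead models the log-odds as a linear function of the predictors, so every fitted probability stays between 0 and 1:

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$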
---

# Creating the model

- eliminated predictor variables with a high p-value, including gender
- trained the updated model on the training data

---

# Evaluating the model: training data

```
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.954     5 0.00397 Preprocessor1_Model1
## 2 roc_auc  binary     0.853     5 0.0117  Preprocessor1_Model1
```

---

# Evaluating the model: training data

```
## Setting levels: control = 0, case = 1
```

```
## Setting direction: controls < cases
```

<img src="presentation_files/figure-html/fit1-1.png" width="80%" />

---

# Evaluating the model: testing data

```
## Setting levels: control = 0, case = 1
```

```
## Setting direction: controls < cases
```

<img src="presentation_files/figure-html/fit2-1.png" width="80%" />

---
class: middle, center

# How can we use this in the real world?

---

# Sample prediction

```r
predict(stroke_model_updated,
        newdata = data.frame(age = 30, avg_glucose_level = 130,
                             hypertension = 0, heart_disease = 1),
        type = "response")
```

```
##          1 
## 0.01216023
```

---
class: inverse, middle, center

# Thank you
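
---

# Appendix: model-refit sketch

A minimal sketch of how the updated model can be refit and evaluated. The model and data names `stroke_model_updated` and `train_data` come from the deck; `test_data` and the use of the pROC package (suggested by the "Setting levels/direction" messages on the evaluation slides) are assumptions.

```r
library(pROC)  # assumed; source of the ROC curve and its console messages

# Refit after dropping gender, which had a high p-value in the
# preliminary model
stroke_model_updated <- glm(
  stroke ~ age + avg_glucose_level + hypertension + heart_disease,
  data = train_data, family = binomial
)

# Predicted stroke probabilities on the held-out test set (assumed name)
test_probs <- predict(stroke_model_updated, newdata = test_data,
                      type = "response")

# ROC curve and AUC; roc() prints the levels/direction messages shown
# on the evaluation slides
roc_obj <- roc(test_data$stroke, test_probs)
plot(roc_obj)
auc(roc_obj)
```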