MachineLearningWithTreeBasedModelsInR-SC/01-MLTB.qmd at main · STAT-ATA-ASU/MachineLearningWithTreeBasedModelsInR-SC · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
# Classification Trees {#intro}

```{r include=FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = NA, fig.align = "center", fig.width = 4, fig.height = 4, message = FALSE, warning = FALSE)
library(tidyverse)
library(tidymodels)
library(dplyr)
library(knitr)
```

Ready to build a real machine learning pipeline? Complete step-by-step exercises to learn how to create decision trees, split your data, and predict which patients are most likely to suffer from diabetes. Last but not least, you'll build performance measures to assess your models and judge your predictions.

## Welcome to the course! - (video) {.unnumbered}

<iframe src="https://drive.google.com/file/d/1X-Cb_8ocpXezZPLk-6o8IcP860m5_av5/preview" width="640" height="480" allow="autoplay">

</iframe>

## Why tree-based methods?

Tree-based models are one class of methods used in machine learning. They are superior in many ways, but also have their drawbacks.

Which of these statements are true and which are false?

```{r, echo = FALSE}
include_graphics("./pics/trees.png")
```

## Specify that tree

In order to build models and use them to solve real-world problems, you first need to lay the foundations of your model by creating a model specification. This is the very first step in every machine learning pipeline that you will ever build.

You are going to load the relevant packages and design the specification for your classification tree in just a few steps.

A magical moment, enjoy!

### Instructions {.unnumbered}

-   Load the `tidymodels` package.

```{r}
library(tidymodels)
```

-   Pick a model class for decision trees, save it as `tree_spec`, and print it.

```{r}
# Pick a model class
tree_spec <- decision_tree()

# Print the result
tree_spec
```

-   Set the engine to `"rpart"` and print the result.

```{r}
# Pick a model class
tree_spec <- decision_tree() |>
  # Set the engine
  set_engine("rpart")

# Print the result
tree_spec
```

-   Set the mode to `"classification"` and print the result.

```{r}
# Pick a model class
tree_spec <- decision_tree() |>
  # Set the engine
  set_engine("rpart") |>
  # Set the mode
  set_mode("classification")

# Print the result
tree_spec
```

:::{.callout-note icon=false}
You created a decision tree model class, used an `rpart` engine, and set the mode to `"classification"`. Remember, you will need to perform similar steps every time you design a new model. Come back anytime if you need a reminder!
:::

## Train that model

A model specification is a good start, just like the canvas for a painter. But just as a painter needs color, the specification needs data. Only the final model is able to make predictions:

*Model specification + data = model*

In this exercise, you will train a decision tree that models the risk of diabetes using health variables as predictors. The response variable, `outcome`, indicates whether the patient has diabetes or not, which means this is a binary classification problem (there are just two classes). The dataset also contains health variables of patients like `blood_pressure`, `age`, and `bmi`.

For the rest of the course, the `tidymodels` package will always be pre-loaded. In this exercise, the `diabetes` dataset is also available in your workspace.

```{r}
diabetes <- read_csv("./data/diabetes_tibble.csv")
# Change character outcome to a factor
diabetes <- diabetes |>  mutate_if(is.character, as.factor)
```

### Instructions {.unnumbered}

-   Create `tree_spec`, a specification for a decision tree with an `rpart` engine.

```{r}
# Create the specification
tree_spec <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification")
```

-   Train a model `tree_model_bmi`, where the `outcome` depends only on the `bmi` predictor by fitting the `diabetes` dataset to the specification.

```{r}
# Train the model
tree_model_bmi <- tree_spec |>
  fit(outcome ~ bmi, data = diabetes)
```

-   Print the model to the console.

```{r}
# Print the model
tree_model_bmi
```

-   Graph the model with `rpart.plot()`

```{r}
tree_model_bmi |>
  extract_fit_engine() |>
  rpart.plot::rpart.plot()
```

:::{.callout-tip icon=false}
Each node shows:

  - the predicted class (`no` or `yes` diabetes),

  - the predicted probability of diabetes,

  - the percentage of observations in the node.

:::

:::{.callout-note icon=false}
You have defined your model with `decision_tree()` and trained it to predict outcome using `bmi` like a professional coach! Printing the model displays useful information, such as the training time, the model formula used during training, and the node details. Remember, to fit a model to data is just a different phrase for training it. Don't worry about the precise output too much, you'll cover that later!
:::

## How to grow your tree - (video) {.unnumbered}

<iframe src="https://drive.google.com/file/d/1mG-kkJlnt4akhVJtCPbSc3Bm0L2k8ZEZ/preview" width="640" height="480" allow="autoplay">

</iframe>

## Train/test split

In order to test your models, you need to build and test the model on two different parts of the data - otherwise, it's like cheating on an exam (as you already know the answers).

The data split is an integral part of the modeling process. You will dive into this by splitting the diabetes data and confirming the split proportions.

The `diabetes` data from the last exercise is pre-loaded in your workspace.

### Instructions {.unnumbered}

-   Split the `diabetes` tibble into `diabetes_split`, a split of 80% training and 20% test data.

```{r}
set.seed(123)
# Create the split
diabetes_split <- initial_split(diabetes, prop = 0.80)
```

-   Print the resulting object.

```{r}
# Print the data split
diabetes_split
```

-   Extract the training and test sets and save them as `diabetes_train` and `diabetes_test`.

```{r}
# Extract the training and test set
diabetes_train <- training(diabetes_split)
diabetes_test  <- testing(diabetes_split)
```

-   Verify the correct row proportion in both datasets compared to the `diabetes` tibble.

```{r}
# Verify the proportions of both sets
round(nrow(diabetes_train) / nrow(diabetes), 2) == 0.80
round(nrow(diabetes_test) / nrow(diabetes), 2) == 0.20
```

:::{.callout-note icon=false}
Using `training()` and `testing()` after the split ensures that you save your working datasets.
:::

## Avoiding class imbalances

Some data contains very imbalanced outcomes - like a rare disease dataset. When splitting randomly, you might end up with a very unfortunate split. Imagine all the rare observations are in the test and none in the training set. That would ruin your whole training process!

Fortunately, the `initial_split()` function provides a remedy. You are going to observe and solve these so-called class imbalances in this exercise.

There is already code provided to create a split object `diabetes_split` with a 75% training and 25% test split.

```{r}
# Preparation
set.seed(9888)
diabetes_split <- initial_split(diabetes, prop = 0.75)
```

### Instructions {.unnumbered}

-   Count the proportion of `"yes"` outcomes in the training and test sets of `diabetes_split`.

```{r}
# Proportion of 'yes' outcomes in the training data
counts_train <- table(training(diabetes_split)$outcome)
prop_yes_train <- counts_train["yes"] / sum(counts_train)

# Proportion of 'yes' outcomes in the test data
counts_test <- table(testing(diabetes_split)$outcome)
prop_yes_test <- counts_test["yes"] / sum(counts_test)

paste("Proportion of positive outcomes in training set:", round(prop_yes_train, 2))
paste("Proportion of positive outcomes in test set:", round(prop_yes_test, 2))
```

-   Redesign `diabetes_split` using the same training/testing proportion, but with the `outcome` variable being equally distributed in both sets. Count the proportion of `yes` outcomes in both datasets.

```{r}
set.seed(123)
# Create a split with a constant outcome distribution
diabetes_split <- initial_split(diabetes, strata = outcome)

# Proportion of 'yes' outcomes in the training data
counts_train <- table(training(diabetes_split)$outcome)
prop_yes_train <- counts_train["yes"] / sum(counts_train)

# Proportion of 'yes' outcomes in the test data
counts_test <- table(testing(diabetes_split)$outcome)
prop_yes_test <- counts_test['yes'] / sum(counts_test)

paste("Proportion of positive outcomes in training set:", round(prop_yes_train, 2))
paste("Proportion of positive outcomes in test set:", round(prop_yes_test, 2))
```

:::{.callout-note icon=false}
Impressive - from 31% vs. 46% positive outcomes to 35% in both sets. This was a tough one, but now you know how simple it is to avoid class imbalances! This is even more important in a large dataset with a very imbalanced target variable.
:::

## From zero to hero

You mastered the skills of creating a model specification and splitting the data into training and test sets. You also know how to avoid class imbalances in the split. It's now time to combine what you learned in the preceding lesson and build your model using only the training set!

You are going to build a proper *machine learning pipeline*. This is comprised of creating a model specification, splitting your data into training and test sets, and last but not least, fitting the training data to a model. Enjoy!

### Instructions {.unnumbered}

-   Create `diabetes_split`, a split where the training set contains three-quarters of all diabetes rows and where training and test sets have a similar distribution in the outcome variable.

```{r}
set.seed(9)
# Create the balanced data split
diabetes_split <- initial_split(diabetes,
                                prop = 0.75,
                                strata = outcome)
```

-   Build a decision tree specification for your model using the `rpart` engine and save it as `tree_spec`.

```{r}
# Build the specification of the model
tree_spec <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification")
```

-   Fit a model `model_trained` using the training data of `diabetes_split` with `outcome` as the target variable and `bmi` and `skin_thickness` as the predictors.

```{r, fig.width = 8, fig.height = 6}
# Train the model
model_trained <- tree_spec |>
  fit(outcome ~ bmi + skin_thickness,
      data = training(diabetes_split))

model_trained
# Graph the model
model_trained |>
  extract_fit_engine() |>
  rpart.plot::rpart.plot()
```

:::{.callout-tip icon=false}
Each node shows:

  - the predicted class (`no` or `yes` diabetes),

  - the predicted probability of diabetes,

  - the percentage of observations in the node.

:::

:::{.callout-note icon=false}
That pipeline was perfectly handcrafted!f Let's head over to the engine room to check your model's performance.
:::

## Predict and evaluate - (video) {.unnumbered}

<iframe src="https://drive.google.com/file/d/1tgR_xrdkzi_hq8_wNO0Sm6TtGysU9Xtb/preview" width="640" height="480" allow="autoplay">

</iframe>

## Make predictions

Making predictions with data is one of the fundamental goals of machine learning. Now that you know how to split the data and fit a model, it's time to make predictions about unseen samples with your models.

You are going to make predictions about your test set using a model obtained by fitting the training data to a tree specification.

Available in your workspace are the datasets that you generated previously (`diabetes_train` and `diabetes_test`) and a decision tree specification `tree_spec`, which was generated using the following code:

```{r}
tree_spec <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification")
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)
```

### Instructions {.unnumbered}

-   Fit your specification to the training data using `outcome` as the target variable and all predictors to create model.

```{r}
# Train your model
model <- tree_spec |>
  fit(outcome ~ ., data = diabetes_train)
```

-   Use your `model` to predict the outcome of diabetes for every observation in the test set and assign the result to `predictions`.

```{r}
# Generate predictions
predictions <- predict(model,
                       new_data = diabetes_test,
                       type = "class")
```

-   Add the true test set outcome to `predictions` as a column named `true_class` and save the result as `predictions_combined`.

```{r}
# Add the true outcomes
predictions_combined <- predictions |>
  mutate(true_class = diabetes_test$outcome)
```

-   Use the `head()` function to print the first rows of the result.

```{r}
# Print the first  6 lines of the result
predictions_combined |>
  head() |>
  kable()
```

:::{.callout-note icon=false}
Now every predicted `.pred_class` has its `true_class` counterpart. The natural next step would be to compare these two and see how many are correct. You are about to find out in the next exercise.
:::

## Crack the matrix

Visual representations are a great and intuitive way to assess results. One way to visualize and assess the performance of your model is by using a confusion matrix. In this exercise, you will create the confusion matrix of your predicted values to see in which cases it performs well and in which cases it doesn't.

The result of the previous exercise, `predictions_combined`, is still loaded.

### Instructions {.unnumbered}

-   Calculate the confusion matrix of the `predictions_combined` tibble and save it as `diabetes_matrix`. Print the result to the console.

```{r}
# The confusion matrix
diabetes_matrix <- conf_mat(predictions_combined,
                            truth = true_class,
                            estimate = .pred_class)

# Print the matrix
diabetes_matrix
# Print metrics
diabetes_matrix |>
  summary() |>
  kable()
```

-   Out of all true `no` outcomes, what percent did your model correctly predict?

```{r}
99/(99 + 26)
sens(predictions_combined,
     truth = true_class,
     estimate = .pred_class) -> M1
M1
```

:::{.callout-note icon=false}
Your model found `r M1$.estimate*100`% of all positive (`no` diabetes) outcomes. This measure is called ***sensitivity***.
:::

## Are you predicting correctly?

Your model should be as good as possible, right? One way you can assess this is by counting how often it predicted the correct classes compared to the total number of predictions it made. As discussed in the video, we call this performance measure accuracy. You can either calculate this manually or by using a handy shortcut. Both obtain the same result.

The confusion matrix `diabetes_matrix` and the tibble `predictions_combined` are loaded.

### Instructions {.unnumbered}

-   Print `diabetes_matrix` to the console and use its entries to directly calculate `correct_predictions`, the number of correct predictions. Save the total number of predictions to `all_predictions`. Calculate and the accuracy, save it to `acc_manual`, and print it.

```{r}
diabetes_matrix
# Calculate the number of correctly predicted classes
correct_predictions <- 99 + 31

# Calculate the number of all predicted classes
all_predictions <- 99 + 36 + 26 + 31

# Calculate and print the accuracy
acc_manual <- correct_predictions / all_predictions
acc_manual
```

-   Calculate the accuracy using a `yardstick` function and store the result in `acc_auto`. Print the accuracy estimate.

```{r}
# The accuracy calculated by a function
acc_auto <- accuracy(predictions_combined,
                     truth = true_class,
                     estimate = .pred_class)
acc_auto$.estimate
```

-   Accuracy is very intuitive but also has its limitations. Imagine we have a naive model that always predicts `no`, regardless of the input. What would the accuracy be for that model?

```{r}
HDF <- data.frame(true_class = diabetes_test$outcome,
                  .pred_class = factor(rep("no", 192),
                                       levels = c("no", "yes")))

accuracy(HDF, truth = true_class, estimate = .pred_class)
```

For the naive model, it would be accurate `r round(accuracy(HDF, truth = true_class, estimate = .pred_class)$.estimate*100, 2)`% of the time.

:::{.callout-note icon=false}
A naive model always predicting no is almost as good as our model. Luckily there are more useful performance metrics which we'll cover later in the course. Stay tuned for Chapter 3!
:::