
Commit b0fc33c

add a couple of exercises
1 parent 9c5cbcd commit b0fc33c

File tree

1 file changed (+31, -12 lines)

_episodes/03-variance.md

Lines changed: 31 additions & 12 deletions
@@ -43,15 +43,25 @@ Image(graph.create_png())

## Overfitting

-Looking at the tree, we can see that there are some very specific rules. Consider a patient aged 45 years with an acute physiology score of 100. From the top of the tree, we would work our way down:
-
-- acutePhysiologyScore <= 78.5? No.
-- acutePhysiologyScore <= 104.5? Yes.
-- age <= 76.5? Yes
-- age <= 55.5. Yes.
-- acutePhysiologyScore <= 96.5? No.
-
-This leads us to our single node with a gini impurity of 0. Having an entire rule based upon this one observation seems silly, but it is perfectly logical at the moment. The only objective the algorithm cares about is minimizing the gini impurity.
+Looking at the tree, we can see that there are some very specific rules.
+
+> ## Question
+> a) Consider a patient aged 45 years with an acute physiology score of 100. Using the image of the tree, work through the nodes until you can make a prediction. What outcome does your model predict?
+> b) What is the gini impurity of the final node, and why?
+> c) Does the decision that led to this final node seem sensible to you? Why?
+> > ## Answer
+> > a) From the top of the tree, we would work our way down:
+> >
+> > - acutePhysiologyScore <= 78.5? No.
+> > - acutePhysiologyScore <= 104.5? Yes.
+> > - age <= 76.5? Yes.
+> > - age <= 55.5? Yes.
+> > - acutePhysiologyScore <= 96.5? No.
+> >
+> > b) This leads us to a single node with a gini impurity of 0. The node contains a single class (i.e. it is completely "pure").
+> > c) Having an entire rule based upon this one observation seems silly, but it is perfectly logical at the moment. The only objective the algorithm cares about is minimizing the gini impurity.
+> {: .solution}
+{: .challenge}

Overfitting is a problem that occurs when our algorithm is too closely aligned to our training data. The result is that the model may not generalise well to "unseen" data, such as observations for new patients entering a critical care unit. This is where "pruning" comes in.

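For reference, the gini impurity of a node is 1 − Σₖ pₖ², where pₖ is the fraction of the node's samples in class k. Below is a minimal sketch in plain Python (an illustration only; the `gini_impurity` helper is not part of the lesson's code) showing why a node containing a single observation, or a single class, scores 0:

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 - sum(p_k ** 2) over the classes k."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_impurity([1, 0]))  # 0.0 -> node is completely "pure"
print(gini_impurity([5, 5]))  # 0.5 -> worst case for two classes
```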
@@ -61,7 +71,7 @@ Let's prune the model and look again.

```python
mdl = glowyr.prune(mdl, min_samples_leaf = 10)
-graph = glowyr.create_graph(mdl,feature_names=features)
+graph = glowyr.create_graph(mdl, feature_names=features)
Image(graph.create_png())
```

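The `glowyr` functions above are wrappers provided by the lesson. As a rough equivalent (an assumption about tooling: scikit-learn, with `x_train`, `y_train` and `features` standing in for the data and feature names prepared earlier in the episode), a similar effect can be had by growing the tree with a minimum leaf size:

```python
# Sketch only: uses scikit-learn directly rather than the lesson's glowyr helper.
# x_train, y_train and features are placeholders for the training data and
# feature names prepared earlier in the episode.
from sklearn.tree import DecisionTreeClassifier, export_text

mdl = DecisionTreeClassifier(min_samples_leaf=10)  # each leaf must cover >= 10 patients
mdl.fit(x_train, y_train)

# Print the tree as text: rules built around a single observation are gone.
print(export_text(mdl, feature_names=features))
```

Strictly speaking this is pre-pruning (the constraint is applied while the tree is grown) rather than pruning an already-grown tree, but both aim to stop leaves forming around a handful of observations.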
@@ -112,11 +122,20 @@ for i in range(3):

![Simple tree (depth 5)](../fig/section3-fig5.png){: width="900px"}

-Above we can see that we are using random subsets of data, and as a result, our decision boundary can change quite a bit. As you could guess, we actually don't want a model that randomly works well and randomly works poorly, so you may wonder why this is useful.
+Above we can see that we are using random subsets of data, and as a result, our decision boundary can change quite a bit. As you could guess, we actually don't want a model that randomly works well and randomly works poorly.

-The trick is that by combining many of instances of "high variance" classifiers (decision trees), we can end up with a single classifier with low variance. There is an old joke: two farmers and a statistician go hunting. They see a deer: the first farmer shoots, and misses to the left. The next farmer shoots, and misses to the right. The statistician yells "We got it!!".
+There is an old joke: two farmers and a statistician go hunting. They see a deer: the first farmer shoots, and misses to the left. The next farmer shoots, and misses to the right. The statistician yells "We got it!!".

While it doesn't quite hold in real life, it turns out that this principle does hold for decision trees. Combining them in the right way ends up building powerful models.

+> ## Question
+> a) Why are decision trees considered to have high variance?
+> b) An "ensemble" is the name used for a machine learning model that aggregates the decisions of multiple sub-models. Why might creating ensembles of decision trees be a good idea?
+> > ## Answer
+> > a) Minor changes in the data used to train decision trees can lead to very different model performance.
+> > b) By combining many instances of "high variance" classifiers (decision trees), we can end up with a single classifier with low variance.
+> {: .solution}
+{: .challenge}
+
{% include links.md %}

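To make the idea of combining high-variance classifiers concrete, here is a small self-contained sketch (an illustration with synthetic data, not the lesson's code) that fits many trees on bootstrap samples and takes a majority vote:

```python
# Sketch of bagging: combine many high-variance trees into one lower-variance classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data used in the lesson.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample: rows drawn with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote across the individual trees.
votes = np.mean([tree.predict(X) for tree in trees], axis=0)
ensemble_prediction = (votes >= 0.5).astype(int)
print("ensemble accuracy on the training data:", (ensemble_prediction == y).mean())
```

This is essentially bagging; a random forest adds one more trick, considering only a random subset of the features at each split, which decorrelates the trees further.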
0 commit comments
