_episodes/05-bagging.md: 8 additions & 4 deletions
@@ -16,11 +16,15 @@ keypoints:
## Bootstrap aggregation ("Bagging")
- Bootstrap aggregation, or "Bagging", is another form of ensemble learning where we aim to build a single good model by combining many models together. With AdaBoost, we modified the data to focus on hard to classify observations. We can imagine this as a form of resampling the data for each new tree. For example, say we have three observations: A, B, and C, [A, B, C]. If we correctly classify observations [A, B], but incorrectly classify C, then AdaBoost involves building a new tree that focuses on C. Equivalently, we could say AdaBoost builds a new tree using the dataset [A, B, C, C, C], where we have intentionally repeated observation C 3 times so that the algorithm thinks it is 3 times as important as the other observations. Makes sense?
+ Bootstrap aggregation, or "Bagging", is another form of ensemble learning.
- Bagging involves the same approach, except we don't selectively choose which observations to focus on, but rather we randomly select subsets of data each time. As you can see, while this is a similar process to AdaBoost, the concept is quite different. Whereas before we aimed to iteratively improve our overall model with new trees, we now build trees on what we hope are independent datasets.
+ With boosting, we iteratively changed the dataset so that new trees would focus on the "difficult" observations. Bagging takes a similar resampling approach, except that instead of selectively choosing which observations to focus on, we randomly select subsets of the data each time.
- Let's take a step back, and think about a practical example. Say we wanted a good model of heart disease. If we saw researchers build a model from a dataset of patients from their hospital, we would be happy. If they then acquired a new dataset from new patients, and built a new model, we'd be inclined to feel that the combination of the two models would be better than any one individually. This exact scenario is what bagging aims to replicate, except instead of actually going out and collecting new datasets, we instead use bootstrapping to create new sets of data from our current dataset. If you are unfamiliar with bootstrapping, you can treat it as "magic" for now (and if you are familiar with the bootstrap, you already know that it is magic).
+ Boosting aimed to iteratively improve our overall model with new trees. With bagging, we now build trees on what we hope are independent datasets.
+
+ Let's take a step back, and think about a practical example. Say we wanted a good model of heart disease. If we saw researchers build a model from a dataset of patients from their hospital, we might think this would be sufficient. If the researchers were able to acquire a new dataset from new patients, and built a new model, we'd be inclined to feel that the combination of the two models would be better than either one individually.
+
+ This is the scenario that bagging aims to replicate, except that instead of actually going out and collecting new datasets, we use "bootstrapping" to create new sets of data from our current dataset. If you are unfamiliar with bootstrapping, you can treat it as magic for now (and if you are familiar with the bootstrap, you already know that it is magic).
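To make the bootstrapping step concrete, here is a minimal sketch that is not part of the lesson's code; the toy dataframe and variable names are invented for illustration. A bootstrap sample is simply the original rows drawn with replacement, so some patients appear several times and others are left out:

```python
import numpy as np
import pandas as pd

# Toy data: six hypothetical patients (illustrative only).
df = pd.DataFrame({
    'age': [54, 61, 47, 70, 58, 66],
    'heart_disease': [0, 1, 0, 1, 0, 1],
})

rng = np.random.default_rng(seed=0)

# Draw len(df) row indices *with replacement* to form one bootstrap sample.
boot_idx = rng.integers(0, len(df), size=len(df))
boot_sample = df.iloc[boot_idx]

print(boot_sample)  # some rows repeat, others are missing entirely
```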
Let's take a look at a simple bootstrap model.
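The model code itself falls outside this diff, so it is not reproduced here. Judging from the `mdl.estimators_` loop referenced in the hunk below, it is presumably a scikit-learn bagging ensemble of decision trees; the following is only a hedged sketch of what such a model might look like, with `x_train` and `y_train` assumed to be defined earlier in the episode:

```python
# Illustrative sketch only; not the lesson's actual code.
# Assumes `x_train` (features) and `y_train` (labels) exist from earlier steps.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

mdl = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5),  # base learner for each bootstrap sample
    n_estimators=6,                       # number of bootstrapped trees
    random_state=321,
)
mdl = mdl.fit(x_train, y_train)

# Each fitted tree can be inspected (or plotted) individually.
for i, estimator in enumerate(mdl.estimators_):
    print(f"Tree {i}: {estimator.get_n_leaves()} leaves")
```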
@@ -39,7 +43,7 @@ for i, estimator in enumerate(mdl.estimators_):
{: width="900px"}
- We can see that each individual tree is quite variable. This is a result of using a random set of data to train the classifier.
+ We can see that each individual tree varies considerably. This is a result of each classifier being trained on a different random subset of the data.
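This variability is exactly what aggregation is meant to smooth out. As a rough, hedged illustration (again assuming the hypothetical `mdl` fitted above and a held-out feature matrix `x_test`, with binary 0/1 labels), individual trees can disagree on a patient while their averaged vote is more stable:

```python
import numpy as np

# Illustrative sketch; assumes the hypothetical `mdl` above and an `x_test` array.
# Predicted probability of the positive class (class 1) from each individual tree.
per_tree = np.column_stack(
    [tree.predict_proba(x_test)[:, 1] for tree in mdl.estimators_]
)

print("First patient, per-tree probabilities:", per_tree[0])
print("First patient, bagged (averaged) probability:", per_tree[0].mean())
```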