Commit 9c5cbcd

add simple exercises to lesson 01.
1 parent a9572da commit 9c5cbcd

File tree

1 file changed (+23, -4 lines)

_episodes/01-introduction.md

Lines changed: 23 additions & 4 deletions
@@ -25,7 +25,7 @@ We will use decision trees for this task. Decision trees are a family of intuiti
 
 ## Load the patient cohort
 
-We will begin by extracting a set of observations from our critical care dataset. To help us visualise our models, we will include only two variables in our models: age and acute physiology score.
+We will begin by loading a set of observations from our critical care dataset. The data includes variables collected on Day 1 of the stay, along with outcomes such as length of stay and in-hospital mortality.
 
 ```python
 # import libraries
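The code block in this hunk is truncated after the `# import libraries` comment, so the actual load step is not visible here. As a rough illustration only, loading such a cohort might look like the sketch below; the file name `eicu_cohort.csv` is a placeholder assumption, not the lesson's actual data source.

```python
# Illustrative sketch only: the lesson's real load step is not shown in this
# hunk, and the file name here is a placeholder assumption.
import pandas as pd

# read the Day 1 observations and outcomes into a dataframe called `cohort`
cohort = pd.read_csv("eicu_cohort.csv")

# inspect the first rows and the available columns
print(cohort.head())
print(cohort.columns.tolist())
```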
@@ -51,7 +51,6 @@ The data has been assigned to a dataframe called `cohort`. Let's take a look at
 |3|Female|51|77\.1|0\.1986|19|24|ALIVE|122\.0|73\.0|-1\.0|36\.8|26\.0|-1\.0|160\.0|
 |4|Female|48|63\.4|1\.7285|25|30|ALIVE|130\.0|68\.0|1\.1|-1\.0|29\.0|7\.6|172\.7|
 
-
 ## Preparing the data for analysis
 
 We first need to do some basic data preparation.
@@ -63,7 +62,7 @@ encoder = LabelEncoder()
 cohort['actualhospitalmortality_enc'] = encoder.fit_transform(cohort['actualhospitalmortality'])
 ```
 
-In the eICU Collaborative Research Database, ages >89 years have been removed to comply with data sharing regulations. We will need to decide how to handle these ages. For simplicity, we will assign an age of 91.5 years to these patients.
+In the eICU Collaborative Research Database, ages over 89 years are recorded as ">89" to comply with US data privacy laws. For simplicity, we will assign an age of 91.5 years to these patients (this is the approximate average age of patients over 89 in the dataset).
 
 ```python
 # Handle the deidentified ages
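The diff cuts off after the `# Handle the deidentified ages` comment, so the lesson's own replacement code is not shown here. A minimal sketch of one way to perform this step is given below; the exact string used for the deidentified entries (`"> 89"`) is an assumption.

```python
# Minimal sketch, not the lesson's actual code: replace the deidentified age
# entries with 91.5 and convert the column to numeric. The exact string used
# in the raw data ("> 89") is an assumption, as is the `cohort` dataframe
# loaded earlier in the lesson.
import pandas as pd

cohort['age'] = cohort['age'].replace('> 89', 91.5)
cohort['age'] = pd.to_numeric(cohort['age'], errors='coerce')
```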
@@ -105,9 +104,20 @@ The table below shows summary characteristics of our dataset:
 | actualhospitalmortality_enc, n (%) | 0 | 0 | 488 (91.0) | 488 (100.0) | |
 | | 1 | | 48 (9.0) | | 48 (100.0) |
 
+> ## Question
+> a) What proportion of patients survived their hospital stay?
+> b) What is the "apachescore" variable? Hint: see the [Wikipedia entry for the APACHE II score](https://en.wikipedia.org/wiki/APACHE_II).
+> c) What is the average age of patients?
+> > ## Answer
+> > a) 91% of patients survived their stay. There is 9% in-hospital mortality.
+> > b) APACHE II ("Acute Physiology and Chronic Health Evaluation II") is a severity-of-disease classification system. It is applied within 24 hours of admission of a patient to an intensive care unit. Higher scores correspond to more severe disease and a higher risk of death.
+> > c) The median age is 64 years. Remember that the exact age of patients above 89 years is unknown, so the median is a better measure of central tendency than the mean. The median age can be calculated with `cohort['age'].median()`.
+> {: .solution}
+{: .challenge}
+
 ## Creating train and test sets
 
-We only focus on two variables for our analysis, age and acute physiology score. Limiting ourselves to two variables will make it easier to visualize our models.
+We will only focus on two variables for our analysis, age and acute physiology score. Limiting ourselves to two variables (or "features") will make it easier to visualize our models.
 
 ```python
 from sklearn.model_selection import train_test_split
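For readers who want to check the answers in the exercise above, a short sketch might look like the following; it assumes the `cohort` dataframe loaded earlier and the column names (`actualhospitalmortality`, `age`) that appear elsewhere in the lesson.

```python
# Sketch for checking the exercise answers, assuming the `cohort` dataframe
# and the column names used elsewhere in the lesson.
survived = (cohort['actualhospitalmortality'] == 'ALIVE').mean()
print(f"Proportion surviving to hospital discharge: {survived:.1%}")  # ~91%

# the median is preferred because ages above 89 are censored (set to 91.5)
print(f"Median age: {cohort['age'].median():.0f} years")  # ~64
```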
@@ -121,5 +131,14 @@ y = cohort[outcome]
 x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.7, random_state = 42)
 ```
 
+> ## Question
+> a) Why did we split our data into training and test sets?
+> b) What is the effect of setting a random state in the splitting algorithm?
+> > ## Answer
+> > a) We want to be able to evaluate our model on data that it has not seen before. If we evaluate our model on the data that it was trained on, we will overestimate its performance.
+> > b) Setting the random state means that the split will be deterministic (i.e. we will all see the same "random" split). This helps to ensure our analysis is reproducible.
+> {: .solution}
+{: .challenge}
+
 {% include links.md %}
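To illustrate the point about the random state, a small sketch (reusing the `x` and `y` defined in the lesson code above) shows that repeating the split with the same seed returns identical partitions:

```python
# Sketch: with the same random_state, train_test_split returns the same
# "random" partition every time, which keeps the analysis reproducible.
from sklearn.model_selection import train_test_split

split_a = train_test_split(x, y, train_size=0.7, random_state=42)
split_b = train_test_split(x, y, train_size=0.7, random_state=42)

print(split_a[0].equals(split_b[0]))  # True: identical training sets
```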
