_episodes/01-introduction.md: 23 additions & 4 deletions
@@ -25,7 +25,7 @@ We will use decision trees for this task. Decision trees are a family of intuiti
## Load the patient cohort
-We will begin by extracting a set of observations from our critical care dataset. To help us visualise our models, we will include only two variables in our models: age and acute physiology score.
+We will begin by loading a set of observations from our critical care dataset. The data includes variables collected on Day 1 of the stay, along with outcomes such as length of stay and in-hospital mortality.
```python
# import libraries
@@ -51,7 +51,6 @@ The data has been assigned to a dataframe called `cohort`. Let's take a look at
-In the eICU Collaborative Research Database, ages >89 years have been removed to comply with data sharing regulations. We will need to decide how to handle these ages. For simplicity, we will assign an age of 91.5 years to these patients.
+In the eICU Research Database, ages over 89 years are recorded as ">89" to comply with US data privacy laws. For simplicity, we will assign an age of 91.5 years to these patients (this is the approximate average age of patients over 89 in the dataset).
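A minimal sketch of one way to apply this substitution with pandas, assuming the `age` column arrives as strings and that deidentified patients carry the literal value `">89"`. The toy dataframe is hypothetical and only stands in for `cohort`; the exact marker in the real data may be spelled differently.

```python
import pandas as pd

# Hypothetical stand-in for the `cohort` dataframe: ages are strings, and
# deidentified patients are recorded as ">89" (assumed spelling).
cohort = pd.DataFrame({"age": ["45", ">89", "72", ">89"]})

# Swap the deidentified marker for the chosen placeholder age, then convert
# the whole column to a numeric dtype.
cohort["age"] = pd.to_numeric(cohort["age"].replace(">89", "91.5"))

print(cohort["age"].tolist())
```

Converting to a numeric dtype after the replacement means later calls such as `cohort['age'].median()` operate on numbers rather than strings.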
```python
# Handle the deidentified ages
@@ -105,9 +104,20 @@ The table below shows summary characteristics of our dataset:
> a) What proportion of patients survived their hospital stay?
+> b) What is the "apachescore" variable? Hint: see the [Wikipedia entry for the APACHE score](https://en.wikipedia.org/wiki/APACHE_II).
+> c) What is the average age of patients?
+> > ## Answer
+> > a) 91% of patients survived their stay. There is 9% in-hospital mortality.
+> > b) APACHE II ("Acute Physiology and Chronic Health Evaluation II") is a severity-of-disease classification system. It is applied within 24 hours of a patient's admission to an intensive care unit. Higher scores correspond to more severe disease and a higher risk of death.
+> > c) The median age is 64 years. Because the exact age of patients above 89 years is unknown, the median is a better measure of central tendency than the mean. The median age can be calculated with `cohort['age'].median()`.
+> {: .solution}
+{: .challenge}
+
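To illustrate why the median is the safer summary for answer (c), the sketch below compares the mean and median on a small, made-up set of ages in which two patients over 89 carry the placeholder value 91.5. The numbers are hypothetical and are not drawn from the dataset.

```python
import pandas as pd

# Hypothetical ages; the two 91.5 values are placeholders for patients whose
# true age (somewhere above 89) is unknown.
ages = pd.Series([34.0, 58.0, 64.0, 91.5, 91.5])

# The mean shifts if a different placeholder is chosen for the >89 group...
mean_age = ages.mean()
mean_if_100 = ages.replace(91.5, 100.0).mean()

# ...but the median stays put as long as the placeholder remains above 89.
median_age = ages.median()
median_if_100 = ages.replace(91.5, 100.0).median()

print(mean_age, mean_if_100)
print(median_age, median_if_100)
```

The mean depends on the arbitrary placeholder, while the median only depends on the ordering of the values.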
## Creating train and test sets
-We only focus on two variables for our analysis, age and acute physiology score. Limiting ourselves to two variables will make it easier to visualize our models.
+We will focus on only two variables for our analysis: age and acute physiology score. Limiting ourselves to two variables (or "features") will make it easier to visualize our models.
```python
from sklearn.model_selection import train_test_split
@@ -121,5 +131,14 @@ y = cohort[outcome]
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=42)
```
+> ## Question
+> a) Why did we split our data into training and test sets?
+> b) What is the effect of setting a random state in the splitting algorithm?
+> > ## Answer
+> > a) We want to be able to evaluate our model on data that it has not seen before. If we evaluate our model on the data it was trained on, we will overestimate its performance.
+> > b) Setting the random state makes the split deterministic (i.e. we will all see the same "random" split). This helps to ensure our analysis is reproducible.
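The effect described in answer (b) can be checked directly. The sketch below is an illustration on hypothetical toy arrays rather than the `cohort` data: it calls `train_test_split` twice with the same `random_state` and confirms that both calls select the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 observations, two features, and a binary outcome.
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Two independent calls with the same random_state.
x_train_a, x_test_a, y_train_a, y_test_a = train_test_split(
    x, y, train_size=0.7, random_state=42
)
x_train_b, x_test_b, y_train_b, y_test_b = train_test_split(
    x, y, train_size=0.7, random_state=42
)

# Identical seeds should yield identical train and test partitions.
same_split = np.array_equal(x_train_a, x_train_b) and np.array_equal(x_test_a, x_test_b)
print(same_split)
```

Omitting `random_state` would let each call shuffle differently, so two learners running the same lesson could end up with different splits and different results.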