Commit e10ce96

add more tidy data info
1 parent c7417b2 commit e10ce96

1 file changed

+41
-17
lines changed

_episodes/03-data-wrangling.md

Lines changed: 41 additions & 17 deletions
@@ -21,25 +21,45 @@ Data visualization libraries often expect data to be in a certain format so that
2121

2222
We want to visualize the data in `gapminder_all.csv`. However, this dataset is in a "wide" format - it has many columns, with each year + metric value in its own column. The unit of observation is the "country" - each country has its own single row.
2323

24-
You can click on the `Data` folder and double click on `gapminder_all.csv` to view this file within Jupyter Lab.
25-
26-
We are going to take this very wide dataset and make it very long, so the unit of observation will be each country + year + metric combination, rather than just the country. This process is made much simpler by a couple of functions in the `pandas` library.
27-
28-
> ## Tidy Data
29-
>
30-
> The term "tidy data" may be most popular in the R ecosystem (the "tidyverse" is a collection of R packages designed around the tidy data philosophy), but it is applicable to all tabular datasets, no matter what programming language you are using to wrangle your data.
31-
>
32-
> You can read more about the tidy data philosophy in Hadley Wickham's 2014 paper, "Tidy Data", available [here](https://vita.had.co.nz/papers/tidy-data.pdf).
33-
>
34-
> Wickham later refined and revised the tidy data philosophy, and published it in the 12th chapter of his open access textbook "R for Data Science" - available [here](https://r4ds.had.co.nz/tidy-data.html).
24+
> ## Open the CSV file within Jupyter Lab
3525
>
36-
> The revised rules are:
26+
> Click on the `Data` folder in the left-hand navigation pane and then double click on `gapminder_all.csv` to view this file within Jupyter Lab.
3727
>
38-
> 1. Each variable must have its own column
39-
> 2. Each observation must have its own row
40-
> 3. Each value must have its own cell
28+
> Explore the dataset visually. What does each row represent? What does each column represent? About how many rows and columns are there?
4129
{: .callout}
4230

31+
We are going to take this wide dataset and make it long, so the unit of observation will be each country + year + metric combination, rather than just the country. This process is made much simpler by a couple of functions in the `pandas` library.
32+
33+
## Tidy Data
34+
35+
The term "tidy data" may be most popular in the R ecosystem (the "tidyverse" is a collection of R packages designed around the tidy data philosophy), but it is applicable to all tabular datasets, no matter what programming language you are using to wrangle your data.
36+
37+
You can read more about the tidy data philosophy in Hadley Wickham's 2014 paper, "Tidy Data", available [here](https://vita.had.co.nz/papers/tidy-data.pdf).
38+
39+
Wickham later refined and revised the tidy data philosophy, and published it in the 12th chapter of his open access textbook "R for Data Science" - available [here](https://r4ds.had.co.nz/tidy-data.html).
40+
41+
The revised rules are:
42+
43+
1. Each variable must have its own column
44+
2. Each observation must have its own row
45+
3. Each value must have its own cell
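As a rough illustration of the three rules, here is a minimal sketch of a small table that already follows them. The cities and temperatures are invented for this example and are not part of the lesson data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
tidy = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima"],
    "year": [2000, 2001, 2000, 2001],
    "temperature": [5.1, 5.4, 19.2, 19.0],
})

# Rule 1: each variable (city, year, temperature) has its own column.
# Rule 2: each observation (one city in one year) has its own row.
# Rule 3: each cell holds exactly one value.
print(tidy)

# One quick sanity check: no city + year combination appears twice.
assert not tidy.duplicated(subset=["city", "year"]).any()
```

A check like the last line does not prove a table is tidy, but it can catch observations that were accidentally spread over several rows.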
46+
47+
It might be difficult at first to identify what makes a dataset "untidy", and therefore what you will need to change in order to wrangle the dataset into a tidy shape.
48+
49+
Here are the five most common problems with untidy datasets (identified in ["Tidy Data"](https://vita.had.co.nz/papers/tidy-data.pdf)):
50+
51+
1. Column headers are values, not variable names
52+
2. Multiple variables are stored in one column
53+
3. Variables are stored in both rows and columns
54+
4. Multiple types of observational units are stored in the same table
55+
5. A single observational unit is stored in multiple tables
56+
57+
> ## Discuss: how is our dataset untidy?
58+
>
59+
> Look again at the file `gapminder_all.csv` you opened in Jupyter Lab.
60+
> Which of the 5 most common problems with untidy datasets applies to this dataset?
61+
{: .discussion}
62+
4363
## Getting Started
4464

4565
Let's go ahead and get started by opening a Jupyter Notebook with the `dataviz` kernel. If you navigated to the `Data` folder to look at the CSV file, navigate back to the root before opening the new notebook.
@@ -70,6 +90,8 @@ df
7090

7191
## Melting the dataframe from wide to long
7292

93+
One problem with our dataset is that "column headers are values, not variable names". The type of metric and the year are stuck in our column headers, and we want that information to be stored in rows.
94+
7395
The first function we are going to use to wrangle this dataset is `pd.melt()`. This function's entire purpose is to make wide dataframes into long dataframes.
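To get a feel for what `pd.melt()` does before applying it to the real data, here is a minimal sketch on a tiny made-up table. The countries, continents, and values are hypothetical; only the `metric_year` style of column name is meant to resemble our dataset.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per metric + year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "continent": ["X", "Y"],
    "pop_1952": [100, 200],
    "pop_1957": [110, 210],
})

# Keep the identifier columns as they are; every other column header becomes
# a value in a new "variable" column, and its cells go into a "value" column.
long = pd.melt(wide, id_vars=["country", "continent"])
print(long)
```

The two rows and two metric columns of the wide table become four rows in the long table, one per country + column combination.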
7496

7597
> ## Check out the documentation
@@ -117,7 +139,9 @@ Just look at that beautiful, long dataframe! Take a closer look to understand ex
117139

118140
## Splitting a column
119141

120-
But we're not done yet! Take a closer look at the `variable` column. This column contains two pieces of information - the metric and the year. Thankfully, these former column names have a consistent naming scheme, so we can easily split these two pieces of information into two different columns.
142+
Now that we have melted our dataset, we can address another untidy problem: "Multiple variables are stored in one column".
143+
144+
Take a closer look at the `variable` column. This column contains two pieces of information - the metric and the year. Thankfully, these former column names have a consistent naming scheme, so we can easily split these two pieces of information into two different columns.
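Here is a minimal sketch of the splitting step on a small made-up column, assuming every entry follows the `metric_year` pattern. The dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical melted data with a combined "variable" column.
melted = pd.DataFrame({
    "variable": ["pop_1952", "pop_1957", "lifeExp_1952"],
    "value": [100.0, 110.0, 43.1],
})

# expand=True returns a dataframe with one column per piece of the split.
parts = melted["variable"].str.split("_", expand=True)
parts.columns = ["metric", "year"]

# The pieces are still strings; convert the year if you want numeric sorting.
parts["year"] = parts["year"].astype(int)
print(parts)
```

This only works cleanly because each name contains exactly one underscore. A name with more underscores would produce extra columns.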
121145

122146
~~~
123147
df_melted[['metric', 'year']] = df_melted['variable'].str.split("_", expand=True)
@@ -134,7 +158,7 @@ df_melted
134158

135159
## Saving the final dataframe
136160

137-
Now that all of our columns contain the appropriate information, in a tidy/long format, it's time to save our dataframe back to a CSV file. But first, we're going to re-order our columns (and remove the now extra `variable` column) and sort the rows.
161+
Now that all of our columns contain the appropriate information, in a tidy/long format, it's time to save our dataframe back to a CSV file. But first, let's clean up our dataset: we're going to re-order our columns (and remove the now extra `variable` column) and sort the rows.
138162

139163
~~~
140164
df_final = df_melted[['country', 'continent', 'year', 'metric', 'value']]
