`_episodes/01-create-new-environment.md` (96 additions & 6 deletions)
exercises: 0
questions:
- "How can I create a new conda environment?"
objectives:
- "Create a new environment from an environment.yml file"
- "Add this environment to Jupyter's kernel list"
keypoints:
- "use `conda env create --file environment.yml` to create a new environment from a YAML file"
- "see a list of all environments with `conda env list`"
- "activate the new environment with `conda activate <NAME>`"
- "see a list of all installed packages with `conda list`"
---

This workshop uses some Python packages (such as Plotly) that cannot be installed in Anaconda's base environment, because they would cause dependency conflicts. To avoid these conflicts, we will create a new environment with only the packages we need for this workshop. These packages are:

* streamlit
* plotly
* plotly-geo
* jupyterlab

## Create an environment from the `environment.yml` file

The necessary packages are specified in the `environment.yml` file.
Open your terminal, and navigate to the project directory. Then, take a look at the contents.

~~~
cd ~/Desktop/data_viz_workshop
ls
~~~
{: .language-bash}

You should now see an `environment.yml` file and a `Data` directory.

Make sure that conda is working on your machine. You can verify this with:

~~~
conda env list
~~~
{: .language-bash}

This will list all of your conda environments. Make sure that you do not already have an environment called `dataviz`, or the new environment will clash with it. If you do already have one, you can change the environment name by editing the first line of the `environment.yml` file.
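
For reference, the contents of `environment.yml` likely look something like the sketch below. The exact channels and version pins here are an assumption based on the manual installation steps later in this episode, so check the actual file (for example with `cat environment.yml`) rather than relying on this:

~~~
# Hypothetical sketch of the workshop's environment.yml - verify against the real file
name: dataviz
channels:
  - conda-forge
  - plotly
dependencies:
  - python=3.9
  - jupyterlab
  - streamlit
  - plotly=5.1.0
  - plotly-geo=1.0.0
~~~
{: .language-yaml}

The `name:` entry on the first line determines the environment name, which is why editing that line changes the name of the environment that `conda env create` produces.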

Now, you need to create a new environment using this `environment.yml` file. To do this, type in the command line:

~~~
conda env create --file environment.yml
~~~
{: .language-bash}

This process can take a while - about 2-3 minutes.

After the environment is created, go ahead and activate it. You can then see for yourself the packages that have been installed - both those listed in the file and all of their dependencies.

~~~
conda activate dataviz
conda list
~~~
{: .language-bash}

Now we will need to tell Jupyter that this environment exists and should be made available as a kernel in Jupyter Lab.

~~~
python -m ipykernel install --user --name dataviz
~~~
{: .language-bash}

Finally, we can go ahead and start Jupyter Lab:

~~~
jupyter lab
~~~
{: .language-bash}

## Create the environment from scratch

If for some reason you are unable to create the environment from the `environment.yml` file, or you simply wish to go through the process yourself, you can follow these steps. These steps replace the `conda env create --file environment.yml` step in the instructions above.

First, create a new environment named `dataviz` and specify the Python version.
Then, you will need to activate it and add the conda-forge channel.
Note that you can use any name you want for this new environment, but make sure to keep using that name in the steps that follow.

~~~
conda create --name dataviz python=3.9
conda activate dataviz
conda config --add channels conda-forge
~~~
{: .language-bash}

Next, you will need to install the top-level packages we will need for the workshop. Installing these packages will also install their dependencies.

~~~
conda install -c conda-forge streamlit
conda install -c plotly plotly=5.1.0
conda install -c plotly plotly-geo=1.0.0
conda install -c conda-forge jupyterlab
~~~
{: .language-bash}

Note that this process will take a lot longer than installing from `environment.yml`, and you will also need to type `y` and press enter when prompted to complete the installation.
> ## Learn more about using Anaconda to manage your environments
> This episode covers only the bare minimum we need to get set up with this new environment.
>
> To learn more, please refer to the lesson [Introduction to Conda for (Data) Scientists](https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/).
{: .callout}


questions:
- "What format should my data be in for Plotly Express?"
objectives:
- "Learn useful pandas functions for wrangling data into a tidy format"
keypoints:
- "Import your CSV using `pd.read_csv('<FILEPATH>')`"
- "Transform your dataframe from wide to long with `pd.melt()`"
- "Split column values with `df['<COLUMN>'].str.split('<DELIM>')`"
- "Sort rows using `df.sort_values()`"
- "Export your dataframe to CSV using `df.to_csv('<FILEPATH>')`"
---

Data visualization libraries often expect data to be in a certain format so that the functions can correctly interpret and present the data. We will be using the Plotly Express library for visualizing data, which works best when data is in a tidy, or "long" format.
We want to visualize the data in `gapminder_all.csv`. However, this dataset is in a "wide" format - it has many columns, with each year + metric combination in its own column. The unit of observation is the country - each country has its own single row.

You can click on the `Data` folder and double click on `gapminder_all.csv` to view this file within Jupyter Lab.

We are going to take this very wide dataset and make it very long, so the unit of observation will be each country + year + metric combination, rather than just the country. This process is made much simpler by a couple of functions in the `pandas` library.
27
+
28
+
## Getting Started
29
+
30
+
Let's go ahead and get started by opening a Jupyter Notebook with the `dataviz` kernel. If you navigated to the `Data` folder to look at the CSV file, navigate back to the root before opening the new notebook.
We are also going to rename this new notebook to `data_wrangling.ipynb`.
32
+
33
+
Jupyter Notebooks are very handy because we can combine documentation (markdown cells) with our program (code cells) in a reader-friendly way.
34
+
Let's make our first cell into a markdown cell, and give this notebook a title:
35
+
36
+
~~~
# Data Wrangling
~~~
{: .source}

You can then add basic metadata like your name, the current date, and the purpose of this notebook.
42
+
43
+
## Read in the data
44
+
45
+
We will start by importing pandas and reading in our data file. We can call the `df` variable to display it.
~~~
import pandas as pd
df = pd.read_csv("data/gapminder_all.csv")
df
~~~
{: .language-python}

## Melting the dataframe from wide to long
55
+
56
+
The first function we are going to use to wrangle this dataset is `pd.melt()`. This function's entire purpose is to turn wide dataframes into long dataframes.
57
+
58
+
> ## Check out the documentation
59
+
> To learn more about `pd.melt()`, you can look at the function's [documentation](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)
60
+
> To see this documentation within Jupyter Lab, you can type `pd.melt()` in a cell and then hold down the shift + tab keys.
61
+
> You can also open a "Show Contextual Help" window from the Launcher.
62
+
{: .callout}
63
+
64
+
Let's take a look at all of the columns with:
65
+
66
+
~~~
df.columns
~~~
{: .language-python}

We need to give `pd.melt()` at least 3 arguments: the dataframe (`frame`), the "id" columns (`id_vars`) - that is, the columns that won't be "melted" - and the "value" columns (`value_vars`) - the columns that will be melted.

Our "id" columns are `country` and `continent`. Our "value" columns are all of the rest. That's a lot of columns! But no worries - we can programmatically make a list of all of these columns.

~~~
cols = list(df.columns)
cols.remove('continent')
cols.remove('country')
cols
~~~
{: .language-python}

Now, we can call `pd.melt()` and pass `cols` rather than typing out the whole list.
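
The call might look something like the following sketch. Here `df_long` is a variable name we are choosing for the result, and the toy dataframe only keeps the snippet self-contained - in your notebook, you would pass the real `df` and `cols` you already have:

~~~
import pandas as pd

# Toy wide dataframe standing in for gapminder_all.csv
df = pd.DataFrame({
    'country': ['New Zealand', 'Australia'],
    'continent': ['Oceania', 'Oceania'],
    'gdpPercap_1952': [10556.6, 10039.6],
    'gdpPercap_1957': [12247.4, 10949.6],
})
cols = list(df.columns)
cols.remove('continent')
cols.remove('country')

# The id columns stay as-is; every value column collapses
# into a pair of (variable, value) columns
df_long = pd.melt(frame=df, id_vars=['country', 'continent'], value_vars=cols)
df_long
~~~
{: .language-python}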

> When wrangling a dataframe in a Jupyter notebook, it's a good idea to assign transformed dataframes to a new variable name.
> You don't have to do this with every transformation, but do try to do this with every substantial transformation.
> This way, we don't have to re-run the entire notebook when we are experimenting with transformations on a dataframe.
{: .callout}

Just look at that beautiful, long dataframe! Take a closer look to understand exactly what `pd.melt()` did. The `variable` column has all of our former column names, and the `value` column has all of the values that used to belong in those columns.
99
+
100
+
## Splitting a column
101
+
102
+
But we're not done yet! Take a closer look at the `variable` column. This column contains two pieces of information - the metric and the year. Thankfully, these former column names have a consistent naming scheme, so we can easily split these two pieces of information into two different columns.
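
A sketch of that split, using pandas' `str.split()` with `expand=True`. The new column names `metric` and `year` are our choice, and the toy column below only keeps the snippet self-contained:

~~~
import pandas as pd

# Toy stand-in for the melted dataframe's `variable` column
df_long = pd.DataFrame({'variable': ['gdpPercap_1952', 'pop_1957', 'lifeExp_2007']})

# expand=True returns a dataframe with one column per piece,
# which we assign to two new columns at once
df_long[['metric', 'year']] = df_long['variable'].str.split('_', expand=True)
df_long
~~~
{: .language-python}

Note that the resulting `year` column still holds strings; you can convert it to integers with `astype(int)` if you need numeric years.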
Now that all of our columns contain the appropriate information, in a tidy/long format, it's time to save our dataframe back to a CSV file. But first, we're going to re-order our columns (and remove the now extra `variable` column) and sort the rows.
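
Those last steps might look something like this sketch. The exact column order is our choice, and the toy dataframe only keeps the snippet self-contained - in the notebook you would apply this to the real melted-and-split dataframe:

~~~
import os
import pandas as pd

# Toy stand-in for the melted-and-split dataframe
df_long = pd.DataFrame({
    'country': ['New Zealand', 'Australia'],
    'continent': ['Oceania', 'Oceania'],
    'variable': ['gdpPercap_1952', 'gdpPercap_1952'],
    'metric': ['gdpPercap', 'gdpPercap'],
    'year': [1952, 1952],
    'value': [10556.6, 10039.6],
})

# Keep only the columns we want, in order - this also drops `variable`
df_tidy = df_long[['country', 'continent', 'year', 'metric', 'value']]

# Sort the rows, then export without pandas' numeric index
df_tidy = df_tidy.sort_values(by=['country', 'year'])
os.makedirs('data', exist_ok=True)  # the workshop folder already has data/
df_tidy.to_csv('data/gapminder_tidy.csv', index=False)
~~~
{: .language-python}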

questions:
- "How can I create an interactive visualization using Plotly Express?"
objectives:
- "Learn how to create and modify an interactive line plot using the px.line() function"
keypoints:
- "Before visualizing your dataframe, make sure it only includes the rows you want to visualize. You can use pandas' `query()` function to easily accomplish this"
- "To make a line plot with `px.line`, you need to specify the dataframe, X axis, and Y axis"
- "If you want to have multiple lines, you also need to specify what column determines the line color"
- "In a Jupyter Notebook, you need to call `fig.show()` to display the chart"
---

Now that our data is in a tidy format, we can start creating some visualizations. Let's start by creating a new notebook (make sure to select the `dataviz` kernel in the Launcher) and renaming it `data_visualizations.ipynb`.

## Import our newly tidy data

First, we need to import pandas and Plotly Express, and then read in our dataframe.
~~~
import pandas as pd
import plotly.express as px

df = pd.read_csv("data/gapminder_tidy.csv")
df
~~~
{: .language-python}

## Creating our first plot

Our first plot is going to be relatively simple. Let's plot the GDP of New Zealand over time. First, let's figure out what our X and Y axis will need to be.

The X axis is typically used for time, so that will be our `year` column.
36
+
The Y axis will be the GDP amount, which is kept in the `value` column.

However, this dataframe has a lot of extra information in it. We want to create a new dataframe with only the rows we need for the visualization.
That means we need to filter for rows where the `country` is "New Zealand" and the `metric` is "gdpPercap".
We can do this with the `query()` function.

~~~
df.query("country=='New Zealand'")
~~~
{: .language-python}

This will select all of the rows where `country` is "New Zealand". We can add our second condition by either chaining another `query()` function or specifying the additional condition in the same `query()` function.
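
Both options might look like this sketch, with the result assigned to `df_gdp_nz`. The toy dataframe below only keeps the snippet self-contained - in the notebook you would query the `df` we just read in:

~~~
import pandas as pd

# Toy stand-in for the tidy gapminder dataframe
df = pd.DataFrame({
    'country': ['New Zealand', 'New Zealand', 'Australia'],
    'metric': ['gdpPercap', 'pop', 'gdpPercap'],
    'year': [1952, 1952, 1952],
    'value': [10556.6, 1994794.0, 10039.6],
})

# Option 1: chain a second query() call
df_gdp_nz = df.query("country=='New Zealand'").query("metric=='gdpPercap'")

# Option 2: the same filter in a single query()
df_gdp_nz = df.query("country=='New Zealand' and metric=='gdpPercap'")
~~~
{: .language-python}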
Now we can pass this dataframe to the `px.line()` function. At a minimum, we need to tell the function what dataframe to use, what column should be the X axis, and what column should be the Y axis.

~~~
fig = px.line(df_gdp_nz, x = "year", y = "value")
fig.show()
~~~
{: .language-python}

There it is! Our first line plot.

## When you want multiple lines

By itself, this plot of New Zealand's GDP isn't especially interesting. Let's add another line, to compare it to Australia.
First, we need to define a new dataframe to select the rows we need. This time, we will specify the `continent` as "Oceania".