`_episodes/01-create-new-environment.md` (96 additions & 6 deletions)
exercises: 0
questions:
- "How can I create a new conda environment?"
objectives:
- "Create a new environment from an environment.yml file"
- "Add this environment to Jupyter's kernel list"
keypoints:
- "use `conda env create --file environment.yml` to create a new environment from a YAML file"
- "see a list of all environments with `conda env list`"
- "activate the new environment with `conda activate <NAME>`"
- "see a list of all installed packages with `conda list`"
---

This workshop uses some Python packages (such as Plotly) that cannot be installed in Anaconda's base environment, because they would cause dependency conflicts. To avoid these conflicts, we will create a new environment with only the packages we need for this workshop. These packages are:

* streamlit
* plotly
* plotly-geo
* jupyterlab

## Create an environment from the `environment.yml` file

The necessary packages are specified in the `environment.yml` file.
Open your terminal, and navigate to the project directory. Then, take a look at the contents.

~~~
cd ~/Desktop/data_viz_workshop
ls
~~~
{: .language-bash}

You should now see an `environment.yml` file and a `Data` directory.

Make sure that conda is working on your machine. You can verify this with:

~~~
conda env list
~~~
{: .language-bash}

This will list all of your conda environments. Make sure that you do not already have an environment called `dataviz`, or the new environment will clash with it. If you do already have one, you can change the environment name by editing the first line of the `environment.yml` file.
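
For reference, the contents of `environment.yml` likely look something like the sketch below. The exact channels and version pins here are an assumption based on the manual installation steps later in this episode, so check the actual file (for example with `cat environment.yml`) rather than relying on this:

~~~
# Hypothetical sketch of the workshop's environment.yml - verify against the real file
name: dataviz
channels:
  - conda-forge
  - plotly
dependencies:
  - python=3.9
  - jupyterlab
  - streamlit
  - plotly=5.1.0
  - plotly-geo=1.0.0
~~~
{: .language-yaml}

The `name:` entry on the first line determines the environment name, which is why editing that line changes the name of the environment that `conda env create` produces.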

Now, you need to create a new environment using this `environment.yml` file. To do this, type in the command line:

~~~
conda env create --file environment.yml
~~~
{: .language-bash}

This process can take a while - about 2-3 minutes.

After the environment is created, go ahead and activate it. You can then see for yourself the packages that have been installed - both those listed in the file and all of their dependencies.

~~~
conda activate dataviz
conda list
~~~
{: .language-bash}

Now we will need to tell Jupyter that this environment exists and should be made available as a kernel in Jupyter Lab.

~~~
python -m ipykernel install --user --name dataviz
~~~
{: .language-bash}

Finally, we can go ahead and start Jupyter Lab:

~~~
jupyter lab
~~~
{: .language-bash}

## Create the environment from scratch

If for some reason you are unable to create the environment from the `environment.yml` file, or you simply wish to go through the process yourself, you can follow these steps. These steps replace the `conda env create --file environment.yml` step in the instructions above.

First, create a new environment named `dataviz` and specify the Python version.
Then, you will need to activate it and add the conda-forge channel.
Note that you can use any name you want for this new environment, but make sure to keep using that name in the steps that follow.

~~~
conda create --name dataviz python=3.9
conda activate dataviz
conda config --add channels conda-forge
~~~
{: .language-bash}

Next, you will need to install the top-level packages we will need for the workshop. Installing these packages will also install their dependencies.

~~~
conda install -c conda-forge streamlit
conda install -c plotly plotly=5.1.0
conda install -c plotly plotly-geo=1.0.0
conda install -c conda-forge jupyterlab
~~~
{: .language-bash}

Note that this process will take a lot longer than installing from `environment.yml`, and you will also need to type `y` and press enter when prompted to complete the installation.
> ## Learn more about using Anaconda to manage your environments
> This episode covers only the bare minimum we need to get set up with this new environment.
>
> To learn more, please refer to the lesson [Introduction to Conda for (Data) Scientists](https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/).
{: .callout}


questions:
- "What format should my data be in for Plotly Express?"
objectives:
- "Learn useful pandas functions for wrangling data into a tidy format"
keypoints:
- "Import your CSV using `pd.read_csv('<FILEPATH>')`"
- "Transform your dataframe from wide to long with `pd.melt()`"
- "Split column values with `df['<COLUMN>'].str.split('<DELIM>')`"
- "Sort rows using `df.sort_values()`"
- "Export your dataframe to CSV using `df.to_csv('<FILEPATH>')`"
---

Data visualization libraries often expect data to be in a certain format so that the functions can correctly interpret and present the data. We will be using the Plotly Express library for visualizing data, which works best when data is in a tidy, or "long" format.
We want to visualize the data in `gapminder_all.csv`. However, this dataset is in a "wide" format - it has many columns, with each year + metric combination in its own column. The unit of observation is the country - each country has its own single row.

You can click on the `Data` folder and double click on `gapminder_all.csv` to view this file within Jupyter Lab.

We are going to take this very wide dataset and make it very long, so the unit of observation will be each country + year + metric combination, rather than just the country. This process is made much simpler by a couple of functions in the `pandas` library.
27
+
28
+
## Getting Started
29
+
30
+
Let's go ahead and get started by opening a Jupyter Notebook with the `dataviz` kernel. If you navigated to the `Data` folder to look at the CSV file, navigate back to the root before opening the new notebook.
We are also going to rename this new notebook to `data_wrangling.ipynb`.
32
+
33
+
Jupyter Notebooks are very handy because we can combine documentation (markdown cells) with our program (code cells) in a reader-friendly way.
34
+
Let's make our first cell into a markdown cell, and give this notebook a title:
35
+
36
+
~~~
# Data Wrangling
~~~
{: .source}

You can then add basic metadata like your name, the current date, and the purpose of this notebook.
42
+
43
+
## Read in the data
44
+
45
+
We will start by importing pandas and reading in our data file. We can call the `df` variable to display it.
~~~
import pandas as pd
df = pd.read_csv("data/gapminder_all.csv")
df
~~~
{: .language-python}

## Melting the dataframe from wide to long
55
+
56
+
The first function we are going to use to wrangle this dataset is `pd.melt()`. This function's entire purpose is to turn wide dataframes into long dataframes.
57
+
58
+
> ## Check out the documentation
59
+
> To learn more about `pd.melt()`, you can look at the function's [documentation](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)
60
+
> To see this documentation within Jupyter Lab, you can type `pd.melt()` in a cell and then hold down the shift + tab keys.
61
+
> You can also open a "Show Contextual Help" window from the Launcher.
62
+
{: .callout}
63
+
64
+
Let's take a look at all of the columns with:
65
+
66
+
~~~
df.columns
~~~
{: .language-python}

We need to give `pd.melt()` at least 3 arguments: the dataframe (`frame`), the "id" columns (`id_vars`) - that is, the columns that won't be "melted" - and the "value" columns (`value_vars`) - the columns that will be melted.

Our "id" columns are `country` and `continent`. Our "value" columns are all of the rest. That's a lot of columns! But no worries - we can programmatically make a list of all of these columns.

~~~
cols = list(df.columns)
cols.remove('continent')
cols.remove('country')
cols
~~~
{: .language-python}

Now, we can call `pd.melt()` and pass `cols` rather than typing out the whole list.
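
The call might look something like the following sketch. Here `df_long` is a variable name we are choosing for the result, and the toy dataframe only keeps the snippet self-contained - in your notebook, you would pass the real `df` and `cols` you already have:

~~~
import pandas as pd

# Toy wide dataframe standing in for gapminder_all.csv
df = pd.DataFrame({
    'country': ['New Zealand', 'Australia'],
    'continent': ['Oceania', 'Oceania'],
    'gdpPercap_1952': [10556.6, 10039.6],
    'gdpPercap_1957': [12247.4, 10949.6],
})
cols = list(df.columns)
cols.remove('continent')
cols.remove('country')

# The id columns stay as-is; every value column collapses
# into a pair of (variable, value) columns
df_long = pd.melt(frame=df, id_vars=['country', 'continent'], value_vars=cols)
df_long
~~~
{: .language-python}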

> When wrangling a dataframe in a Jupyter notebook, it's a good idea to assign transformed dataframes to a new variable name.
> You don't have to do this with every transformation, but do try to do this with every substantial transformation.
> This way, we don't have to re-run the entire notebook when we are experimenting with transformations on a dataframe.
{: .callout}

Just look at that beautiful, long dataframe! Take a closer look to understand exactly what `pd.melt()` did. The `variable` column has all of our former column names, and the `value` column has all of the values that used to belong in those columns.
99
+
100
+
## Splitting a column
101
+
102
+
But we're not done yet! Take a closer look at the `variable` column. This column contains two pieces of information - the metric and the year. Thankfully, these former column names have a consistent naming scheme, so we can easily split these two pieces of information into two different columns.
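
A sketch of that split, using pandas' `str.split()` with `expand=True`. The new column names `metric` and `year` are our choice, and the toy column below only keeps the snippet self-contained:

~~~
import pandas as pd

# Toy stand-in for the melted dataframe's `variable` column
df_long = pd.DataFrame({'variable': ['gdpPercap_1952', 'pop_1957', 'lifeExp_2007']})

# expand=True returns a dataframe with one column per piece,
# which we assign to two new columns at once
df_long[['metric', 'year']] = df_long['variable'].str.split('_', expand=True)
df_long
~~~
{: .language-python}

Note that the resulting `year` column still holds strings; you can convert it to integers with `astype(int)` if you need numeric years.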
Now that all of our columns contain the appropriate information, in a tidy/long format, it's time to save our dataframe back to a CSV file. But first, we're going to re-order our columns (and remove the now extra `variable` column) and sort the rows.
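
Those last steps might look something like this sketch. The exact column order is our choice, and the toy dataframe only keeps the snippet self-contained - in the notebook you would apply this to the real melted-and-split dataframe:

~~~
import os
import pandas as pd

# Toy stand-in for the melted-and-split dataframe
df_long = pd.DataFrame({
    'country': ['New Zealand', 'Australia'],
    'continent': ['Oceania', 'Oceania'],
    'variable': ['gdpPercap_1952', 'gdpPercap_1952'],
    'metric': ['gdpPercap', 'gdpPercap'],
    'year': [1952, 1952],
    'value': [10556.6, 10039.6],
})

# Keep only the columns we want, in order - this also drops `variable`
df_tidy = df_long[['country', 'continent', 'year', 'metric', 'value']]

# Sort the rows, then export without pandas' numeric index
df_tidy = df_tidy.sort_values(by=['country', 'year'])
os.makedirs('data', exist_ok=True)  # the workshop folder already has data/
df_tidy.to_csv('data/gapminder_tidy.csv', index=False)
~~~
{: .language-python}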

questions:
- "How can I create an interactive visualization using Plotly Express?"
objectives:
- "Learn how to create and modify an interactive line plot using the px.line() function"
keypoints:
- "Before visualizing your dataframe, make sure it only includes the rows you want to visualize. You can use pandas' `query()` function to easily accomplish this"
- "To make a line plot with `px.line`, you need to specify the dataframe, X axis, and Y axis"
- "If you want to have multiple lines, you also need to specify what column determines the line color"
- "In a Jupyter Notebook, you need to call `fig.show()` to display the chart"
---

Now that our data is in a tidy format, we can start creating some visualizations. Let's start by creating a new notebook (make sure to select the `dataviz` kernel in the Launcher) and renaming it `data_visualizations.ipynb`.

## Import our newly tidy data

First, we need to import pandas and Plotly Express, and then read in our dataframe.
~~~
import pandas as pd
import plotly.express as px

df = pd.read_csv("data/gapminder_tidy.csv")
df
~~~
{: .language-python}

## Creating our first plot

Our first plot is going to be relatively simple. Let's plot the GDP of New Zealand over time. First, let's figure out what our X and Y axis will need to be.

The X axis is typically used for time, so that will be our `year` column.
36
+
The Y axis will be the GDP amount, which is kept in the `value` column.

However, this dataframe has a lot of extra information in it. We want to create a new dataframe with only the rows we need for the visualization.
That means we need to filter for rows where the `country` is "New Zealand" and the `metric` is "gdpPercap".
We can do this with the `query()` function.

~~~
df.query("country=='New Zealand'")
~~~
{: .language-python}

This will select all of the rows where `country` is "New Zealand". We can add our second condition by either chaining another `query()` function or specifying the additional condition in the same `query()` function.
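
Both options might look like this sketch, with the result assigned to `df_gdp_nz`. The toy dataframe below only keeps the snippet self-contained - in the notebook you would query the `df` we just read in:

~~~
import pandas as pd

# Toy stand-in for the tidy gapminder dataframe
df = pd.DataFrame({
    'country': ['New Zealand', 'New Zealand', 'Australia'],
    'metric': ['gdpPercap', 'pop', 'gdpPercap'],
    'year': [1952, 1952, 1952],
    'value': [10556.6, 1994794.0, 10039.6],
})

# Option 1: chain a second query() call
df_gdp_nz = df.query("country=='New Zealand'").query("metric=='gdpPercap'")

# Option 2: the same filter in a single query()
df_gdp_nz = df.query("country=='New Zealand' and metric=='gdpPercap'")
~~~
{: .language-python}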
Now we can pass this dataframe to the `px.line()` function. At a minimum, we need to tell the function what dataframe to use, what column should be the X axis, and what column should be the Y axis.

~~~
fig = px.line(df_gdp_nz, x = "year", y = "value")
fig.show()
~~~
{: .language-python}

There it is! Our first line plot.

## When you want multiple lines

By itself, this plot of New Zealand's GDP isn't especially interesting. Let's add another line, to compare it to Australia.
First, we need to define a new dataframe to select the rows we need. This time, we will specify the `continent` as "Oceania".