Introduction

Part of the Udacity Data Engineering Nanodegree, this ETL project collects and presents user activity information for a fictional music streaming service called Sparkify. The data comes from song metadata and application log files in .json format, which were provided to us and were generated from the Million Song Dataset and from eventsim respectively.

These files are stored in two Amazon S3 directories and are loaded into an Amazon EMR Spark cluster for processing. The etl.py script reads the files from S3, transforms them into five tables in Spark, and writes each table as partitioned parquet files to its own table directory on S3.
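The read, transform, and write steps for one table might look roughly like the sketch below. It is not the code from etl.py: the function name, the de-duplication on song_id, and the choice of year and artist_id as partition columns are all assumptions made for illustration. The spark argument is expected to be an existing SparkSession.

```python
def write_songs_table(spark, song_data_path, output_root):
    """Read song JSON files and write a songs table as partitioned parquet.

    Hypothetical helper: `spark` is an existing pyspark.sql.SparkSession.
    """
    # Read every JSON file matched by the path (glob patterns are allowed).
    song_df = spark.read.json(song_data_path)

    # Keep only the columns listed for the songs table, one row per song.
    songs = (
        song_df
        .select("song_id", "title", "artist_id", "year", "duration")
        .dropDuplicates(["song_id"])
    )

    # Assumption: partition by year and artist_id; etl.py may choose differently.
    songs_path = output_root.rstrip("/") + "/songs"
    songs.write.mode("overwrite").partitionBy("year", "artist_id").parquet(songs_path)
    return songs_path
```

On a cluster, the session would typically come from SparkSession.builder.getOrCreate(), and the two paths would point at your own S3 locations.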

Because the data is stored as .parquet files, it can be loaded into Hadoop for analysis whenever required, without reprocessing the .json files.

Files

- README.md -- this file
- etl.py -- the main ETL script that interacts with Spark to create the parquet files
- dl.cfg -- configuration file where AWS connection details need to be entered

Setup

Set up server

To run the Python script, you will first need to install Python 3 and Apache Spark on your server, and then install the following Python module via pip or anaconda:

pyspark - the Python API for Apache Spark

To install it via pip, you can run:

pip install pyspark

Spinning up an Apache Spark cluster on Amazon EMR may be a quicker way to set all this up.

Move ETL files over to Spark cluster

Please move all files in this folder over to your Spark cluster. You can use SCP with your my-key-pair.pem file to do this.

Thereafter, you will need to fill out the empty fields in the dl.cfg configuration file with your AWS access key that allows you to connect to your S3 buckets.
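As an illustration of how a script could pick up these credentials, here is a minimal sketch using Python's standard configparser module. The section name AWS and the key names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are assumptions; use whatever names your dl.cfg actually contains. The function name is hypothetical as well.

```python
import configparser
import os


def load_aws_credentials(path="dl.cfg"):
    """Read AWS credentials from a config file and export them as env vars."""
    config = configparser.ConfigParser()
    if not config.read(path):
        raise FileNotFoundError(f"Could not read config file: {path}")

    # Assumption: an [AWS] section holding these two keys.
    creds = {
        "AWS_ACCESS_KEY_ID": config["AWS"]["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": config["AWS"]["AWS_SECRET_ACCESS_KEY"],
    }

    # Exporting the values lets libraries that read the standard AWS
    # environment variables find them.
    os.environ.update(creds)
    return creds
```

Under these assumptions, a matching dl.cfg would contain an [AWS] header followed by one line per key, with no quotes around the values.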

Testing

Running the ETL process on the full dataset can take a long time. As suggested by Tran Nguyen, it can be useful to test the script on a smaller subset of the data first.

If you would like to run the etl.py script on a subset of the data, you can do this by running:

spark-submit etl.py --run-subset

This will only load and transform a subset of the data and can help isolate issues in your ETL process without having to process all the data every single time.
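One way a script could support such a flag is sketched below with argparse. This is an assumption about how the option might be wired up, not a copy of etl.py: the helper names and the directory patterns (a single A/A/A subfolder for the subset, a wildcard over all subfolders otherwise) are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; --run-subset is False unless given."""
    parser = argparse.ArgumentParser(description="Sparkify ETL")
    parser.add_argument(
        "--run-subset",
        action="store_true",
        help="process only a small subset of the input files",
    )
    return parser.parse_args(argv)


def song_data_glob(input_root, run_subset):
    """Return the input path pattern for the song files.

    Assumption: song files are nested three directory levels deep.
    """
    pattern = "song_data/A/A/A/*.json" if run_subset else "song_data/*/*/*/*.json"
    return input_root.rstrip("/") + "/" + pattern
```

Arguments placed after the script name in a spark-submit command are passed through to the script, so argparse sees them as usual.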

Production

When you are ready to run the ETL process on all your data, you can just execute the whole etl.py file via:

spark-submit etl.py

Schema

Fact Table

songplays - a list of times users played a song, extracted from both the song_data and log_data JSON files: songplay_id, start_time, month, year, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

- users - a list of users, extracted from the log_data JSON files: user_id, first_name, last_name, gender, level
- songs - a list of songs, extracted from the song_data JSON files: song_id, title, artist_id, year, duration
- artists - a list of artists, extracted from the song_data JSON files: artist_id, name, location, lattitude, longitude
- time - a list of timestamps, extracted from the log_data JSON files: start_time, hour, day, week, month, year, weekday
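To show how the columns of the time table relate to a single timestamp, here is a small plain-Python sketch. It assumes the log timestamps are epoch milliseconds, that times are interpreted in UTC, that week means the ISO week number, and that weekday counts from Monday as 0. etl.py may derive these columns with Spark functions that follow different conventions, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def time_row(ts_ms):
    """Break an epoch-millisecond timestamp into the time table columns."""
    # Assumption: timestamps are milliseconds since the epoch, read as UTC.
    start = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start,
        "hour": start.hour,
        "day": start.day,
        "week": start.isocalendar()[1],  # ISO week number
        "month": start.month,
        "year": start.year,
        "weekday": start.weekday(),      # Monday is 0, Sunday is 6
    }
```

For example, time_row(1541106106796) describes a moment on 1 November 2018, a Thursday, so its weekday value is 3.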
