lastnode/nanodeg_p4_spark

Introduction

Part of the Udacity Data Engineering Nanodegree, this ETL project collects and presents user activity information for a fictional music streaming service called Sparkify. The data comes from two sets of .json files that were provided to us: song information generated from the Million Song Dataset, and application logs generated by eventsim.

These files are stored in two Amazon S3 directories and are loaded into an Amazon EMR Spark cluster for processing. The etl.py script reads the files from S3, transforms them into five tables in Spark, and writes each table as partitioned parquet files to its own table directory on S3.
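To illustrate what "partitioned parquet files in table directories" looks like on disk: when Spark writes with `partitionBy`, it creates one subdirectory per distinct value of each partition column, named `column=value`. The sketch below reproduces that naming in plain Python. The bucket name and the choice of `year` and `month` as partition columns are assumptions for illustration, not taken from etl.py.

```python
def partition_dir(table_root, **partitions):
    """Return the Hive-style directory path for one partition of a table."""
    # Each partition column becomes a "column=value" path segment,
    # in the order the columns are given.
    parts = [f"{column}={value}" for column, value in partitions.items()]
    return "/".join([table_root.rstrip("/")] + parts)


# Hypothetical output location; replace with your own bucket.
print(partition_dir("s3a://my-output-bucket/songplays/", year=2018, month=11))
```

A layout like this lets a later query that filters on the partition columns read only the matching directories.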

Storing the data as .parquet files means it can be loaded into Hadoop for analysis whenever required, without reprocessing the .json files.

Files

- README.md -- this file
- etl.py -- the main ETL script that interacts with Spark to create the parquet files
- dl.cfg -- configuration file where AWS connection details need to be entered

Setup

Set up server

In order to run these Python scripts, you will first need to install Python 3 and Apache Spark on your server, and then install the following Python module via pip or anaconda:

  • pyspark - the Python API for Apache Spark

To install it via pip you can run:

pip install pyspark

Spinning up an Apache Spark cluster on Amazon EMR may be a quicker way to set all this up.

Move ETL files over to Spark cluster

Please move all files in this folder over to your Spark cluster. You can use SCP with your my-key-pair.pem file to do this.

Thereafter, you will need to fill out the empty fields in the dl.cfg configuration file with your AWS access key that allows you to connect to your S3 buckets.
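As a rough sketch of how a script might pick up the values from dl.cfg, the snippet below parses the file with Python's standard `configparser` and exports the keys as environment variables. The `[AWS]` section name and the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` field names are assumptions; use whatever names the empty fields in your copy of dl.cfg actually have.

```python
import configparser
import os


def parse_aws_credentials(cfg_text):
    """Extract the AWS key pair from the text of a config file."""
    # interpolation=None so that a '%' inside a secret is read literally.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(cfg_text)
    # Assumed layout: an [AWS] section holding both keys.
    aws = config["AWS"]
    return aws["AWS_ACCESS_KEY_ID"], aws["AWS_SECRET_ACCESS_KEY"]


def export_aws_credentials(path="dl.cfg"):
    """Read the config file and expose the keys as environment variables."""
    with open(path) as handle:
        key_id, secret = parse_aws_credentials(handle.read())
    os.environ["AWS_ACCESS_KEY_ID"] = key_id
    os.environ["AWS_SECRET_ACCESS_KEY"] = secret
```

Since dl.cfg holds credentials once it is filled in, it is worth keeping the completed file out of version control.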

Testing

Running the ETL process for this full dataset can take time. Thus, as suggested by Tran Nguyen, it can be useful to test out the script on a smaller subset of data first.

If you would like to run the etl.py script on a subset of the data, you can do this by running:

spark-submit etl.py --run-subset

This will only load and transform a subset of the data and can help isolate issues in your ETL process without having to process all the data every single time.
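For readers curious how a flag like `--run-subset` can be handled, here is a minimal sketch using `argparse`. Arguments placed after the script name in `spark-submit` are passed through to the script. The S3 globs and the helper names `build_parser` and `song_input_glob` are hypothetical; the actual paths and logic are defined in etl.py.

```python
import argparse

# Hypothetical input locations; the real ones are defined in etl.py.
FULL_SONG_GLOB = "s3a://my-input-bucket/song_data/*/*/*/*.json"
SUBSET_SONG_GLOB = "s3a://my-input-bucket/song_data/A/A/A/*.json"


def build_parser():
    """Create a command-line parser with an optional --run-subset switch."""
    parser = argparse.ArgumentParser(description="Sparkify ETL")
    parser.add_argument(
        "--run-subset",
        action="store_true",
        help="process only a small subset of the input data",
    )
    return parser


def song_input_glob(run_subset):
    """Pick the narrower glob when only a subset should be processed."""
    return SUBSET_SONG_GLOB if run_subset else FULL_SONG_GLOB


# Parse an explicit argument list here so the sketch runs anywhere.
args = build_parser().parse_args(["--run-subset"])
print(song_input_glob(args.run_subset))
```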

Production

When you are ready to run the ETL process on all your data, you can just execute the whole etl.py file via:

spark-submit etl.py

Schema

Fact Table

  • songplays - a list of times users played a song, extracted from both the song_data and log_data JSON files: songplay_id, start_time, month, year, user_id, level, song_id, artist_id, session_id, location, user_agent
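Because songplays draws on both sources, building it involves matching each log event to a song record. The plain-Python sketch below shows one way such a match could work. It is not the Spark code in etl.py. The input field names (`page`, `song`, `artist`, `ts`, `userId`, `sessionId`, `userAgent`, `artist_name`), the `NextSong` filter, and the choice to keep unmatched plays are all assumptions; `month` and `year` are left out for brevity.

```python
def build_songplays(log_events, songs):
    """Join song-play log events to song metadata on title and artist name."""
    # Index songs by (title, artist_name) for constant-time lookup.
    song_index = {(s["title"], s["artist_name"]): s for s in songs}
    songplays = []
    for event in log_events:
        if event.get("page") != "NextSong":
            continue  # assumed: only these events represent a song play
        # Unmatched plays are kept, with song_id and artist_id set to None.
        match = song_index.get((event["song"], event["artist"]), {})
        songplays.append({
            "songplay_id": len(songplays),
            "start_time": event["ts"],
            "user_id": event["userId"],
            "level": event["level"],
            "song_id": match.get("song_id"),
            "artist_id": match.get("artist_id"),
            "session_id": event["sessionId"],
            "location": event["location"],
            "user_agent": event["userAgent"],
        })
    return songplays
```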

Dimension Tables

  • users - a list of users, extracted from the log_data JSON files: user_id, first_name, last_name, gender, level

  • songs - a list of songs, extracted from the song_data JSON files: song_id, title, artist_id, year, duration

  • artists - a list of artists, extracted from the song_data JSON files: artist_id, name, location, latitude, longitude

  • time - a list of timestamps, extracted from the log_data JSON files: start_time, hour, day, week, month, year, weekday
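The columns of the time table can all be derived from a single timestamp. The sketch below shows that derivation in plain Python, assuming the log timestamp is a Unix epoch value in milliseconds and should be interpreted as UTC. Both are assumptions about the log format, and `time_row` is a hypothetical helper rather than a function from etl.py.

```python
from datetime import datetime, timezone


def time_row(ts_ms):
    """Break an epoch-millisecond timestamp into the time table's columns."""
    start = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start,
        "hour": start.hour,
        "day": start.day,
        "week": start.isocalendar()[1],  # ISO week number
        "month": start.month,
        "year": start.year,
        "weekday": start.weekday(),  # Python convention: Monday=0 ... Sunday=6
    }


print(time_row(1541894400000))
```

Weekday numbering differs between tools, so if the values are compared with ones produced by Spark, check which convention each side uses.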

About

Udacity Data Eng Nano Degree Project 4
