Part of the Udacity Data Engineering Nanodegree, this ETL project collects and presents user activity information for a fictional music streaming service called Sparkify. The data comes from song metadata and application log files in .json format, which were given to us and were generated from the Million Song Dataset and by eventsim, respectively.
These files are stored in two Amazon S3 directories and are loaded into an Amazon EMR Spark cluster for processing. The etl.py script reads the files from S3, transforms them into five tables in Spark, and writes each table as partitioned Parquet files to its own table directory on S3.
Storing the data as .parquet files means it can be loaded into Hadoop for analysis whenever required, without reprocessing the .json files.
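The write step described above could look roughly like the sketch below. The helper names `table_path` and `write_table`, the partition columns per table, and the overwrite mode are assumptions for illustration and are not taken from etl.py; `df` is expected to be a Spark DataFrame.

```python
# Hypothetical partition columns per table. This README only states that the
# output is partitioned, so adjust these to match what etl.py actually does.
PARTITION_COLUMNS = {
    "songplays": ["year", "month"],
    "time": ["year", "month"],
    "songs": ["year", "artist_id"],
    "users": [],
    "artists": [],
}


def table_path(output_root, table):
    """Build the directory for one table, e.g. <output_root>/songs."""
    return output_root.rstrip("/") + "/" + table


def write_table(df, output_root, table):
    """Write a Spark DataFrame as Parquet files into its table directory."""
    writer = df.write.mode("overwrite")  # assumption: replace earlier output
    columns = PARTITION_COLUMNS.get(table, [])
    if columns:
        # One sub-directory per distinct value of each partition column.
        writer = writer.partitionBy(*columns)
    path = table_path(output_root, table)
    writer.parquet(path)
    return path
```

Keeping the partition columns in one dictionary makes it easy to see and change how each of the five tables is laid out on S3.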
- README.md -- this file
- etl.py -- the main ETL script that interacts with Spark to create the Parquet files
- dl.cfg -- configuration file where AWS connection details need to be entered
To run these Python scripts, you will first need to install Python 3 and Apache Spark on your server, and then install the following Python module via pip or Anaconda:
- pyspark - the Python API for Apache Spark
To install it via pip you can run:
pip install pyspark
Spinning up an Apache Spark cluster on Amazon EMR may be a quicker way to set all this up.
Please copy all files in this folder to your Spark cluster. You can use SCP with your my-key-pair.pem file to do this.
Thereafter, you will need to fill out the empty fields in the dl.cfg configuration file with your AWS access key that allows you to connect to your S3 buckets.
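As a rough illustration of how a script might pick up these values, the sketch below parses a config in the style of dl.cfg with Python's standard `configparser`. The `[AWS]` section name, the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` key names, and both helper functions are assumptions; check them against your own dl.cfg and etl.py.

```python
import configparser
import os

# Placeholder content in the assumed dl.cfg layout (not real credentials).
EXAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = EXAMPLE_KEY_ID
AWS_SECRET_ACCESS_KEY = EXAMPLE_SECRET
"""


def load_aws_credentials(config_text):
    """Parse the config text and return the two AWS values as a dict."""
    # interpolation=None so a literal '%' in a secret is not misread.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(config_text)
    section = config["AWS"]  # assumed section name
    return {
        "AWS_ACCESS_KEY_ID": section["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": section["AWS_SECRET_ACCESS_KEY"],
    }


def export_credentials(credentials, environ=os.environ):
    """Copy the credentials into the environment of the current process."""
    environ.update(credentials)
```

With a real file you would read its text first, for example `load_aws_credentials(open("dl.cfg").read())`, and then call `export_credentials` before creating the Spark session.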
Running the ETL process for this full dataset can take time. Thus, as suggested by Tran Nguyen, it can be useful to test out the script on a smaller subset of data first.
If you would like to run the etl.py script on a subset of the data, you can do this by running:
spark-submit etl.py --run-subset
This will only load and transform a subset of the data and can help isolate issues in your ETL process without having to process all the data every single time.
When you are ready to run the ETL process on all your data, you can just execute the whole etl.py file via:
spark-submit etl.py
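One way a script could support the `--run-subset` flag shown above is with `argparse`, as in this minimal sketch. The S3 paths, the bucket name, and the function names are hypothetical and may differ from what etl.py does.

```python
import argparse

# Hypothetical input locations; replace with the buckets you actually use.
FULL_SONG_DATA = "s3a://your-bucket/song_data/*/*/*/*.json"
SUBSET_SONG_DATA = "s3a://your-bucket/song_data/A/A/A/*.json"


def parse_args(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Sparkify ETL")
    parser.add_argument(
        "--run-subset",
        action="store_true",
        help="process only a small subset of the input files",
    )
    return parser.parse_args(argv)


def song_data_path(run_subset):
    """Pick the narrower glob when only a subset should be processed."""
    return SUBSET_SONG_DATA if run_subset else FULL_SONG_DATA
```

Arguments placed after the script name in `spark-submit etl.py --run-subset` are passed through to the script, so `parse_args()` sees the flag as `args.run_subset`.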
- songplays - a list of times users played a song, extracted from both the song_data and log_data JSON files: songplay_id, start_time, month, year, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - a list of users, extracted from the log_data JSON files: user_id, first_name, last_name, gender, level
- songs - a list of songs, extracted from the song_data JSON files: song_id, title, artist_id, year, duration
- artists - a list of artists, extracted from the song_data JSON files: artist_id, name, location, lattitude, longitude
- time - a list of timestamps, extracted from the log_data JSON files: start_time, hour, day, week, month, year, weekday
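To illustrate how the columns of the time table relate to one another, here is a plain-Python sketch that derives them from a single timestamp. It assumes the log files store timestamps as milliseconds since the Unix epoch, uses UTC, and numbers weeks and weekdays in the ISO style (Monday = 1); etl.py may use different conventions, and `time_row` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def time_row(ts_ms):
    """Derive the time table columns from an epoch timestamp in milliseconds."""
    start_time = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start_time,
        "hour": start_time.hour,
        "day": start_time.day,
        "week": start_time.isocalendar()[1],  # ISO week number
        "month": start_time.month,
        "year": start_time.year,
        "weekday": start_time.isoweekday(),  # Monday = 1 ... Sunday = 7
    }
```

In Spark the same columns would typically be computed with column functions rather than row by row, but the relationship between start_time and the derived fields is the same.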