Reddit ELT Pipeline

This repo implements an ELT pipeline for Reddit data using Airflow and AWS cloud services.

Overview

What the pipeline does:

  • Extracts Reddit data through the Reddit API.
  • Loads the data into an S3 bucket.
  • Transforms the data using AWS Glue.
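
The extract and load steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this repo: it assumes the extraction step receives a standard Reddit listing payload, and the column selection, the function names, and the bucket and key arguments are all hypothetical. The Glue transformation is not shown.

```python
import csv
import io
from datetime import datetime, timezone

# Columns kept from each post. This selection is an assumption for
# illustration, not the schema used by the repo.
FIELDS = ["id", "title", "author", "score", "num_comments", "created_utc"]


def listing_to_rows(listing):
    """Flatten a Reddit listing payload into one dict per post."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        row = {name: post.get(name) for name in FIELDS}
        # Reddit reports creation time as epoch seconds; store ISO 8601 UTC.
        row["created_utc"] = datetime.fromtimestamp(
            post["created_utc"], tz=timezone.utc
        ).isoformat()
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def upload_csv(body, bucket, key):
    """Write the CSV text to S3 (hypothetical bucket and key)."""
    import boto3  # imported lazily so the helpers above work without AWS set up

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
```

In an Airflow DAG, each of these functions would typically sit behind its own task, so a failed upload can be retried without calling the Reddit API again.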

Installation

  1. Clone the repository:
git clone https://github.com/manuelandersen/reddit-pipeline.git
  2. Create a virtual environment (optional but recommended):
python3 -m venv venv
source venv/bin/activate
  3. Install the dependencies:
pip install -r requirements.txt
  4. Rename the configuration file:
mv config/config.conf.example config/config.conf

Warning

Make sure to put the credentials you need in the new config.conf file.

  5. Build and run the Docker containers:
docker compose up -d --build
  6. Open the Airflow web UI:

In your browser, go to http://localhost:8080. You will see the DAGs, and you can run them from there.
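
Before starting the containers, it can help to confirm that config.conf actually contains the credentials mentioned in the warning above. The sketch below is one way to do that with Python's standard configparser. The section and key names in REQUIRED are hypothetical placeholders; replace them with the ones that appear in config/config.conf.example.

```python
import configparser

# Hypothetical section and key names. Replace them with the ones
# found in config/config.conf.example.
REQUIRED = {
    "reddit": ["client_id", "client_secret"],
    "aws": ["access_key_id", "secret_access_key", "bucket_name"],
}


def load_config(path="config/config.conf"):
    """Read the config file, failing loudly if it does not exist."""
    # Interpolation is disabled so a '%' inside a secret is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(f"could not read {path}")
    return parser


def missing_settings(parser):
    """Return 'section.key' for every required value that is absent or empty."""
    missing = []
    for section, keys in REQUIRED.items():
        for key in keys:
            value = parser.get(section, key, fallback="").strip()
            if not value:
                missing.append(f"{section}.{key}")
    return missing
```

Calling missing_settings(load_config()) returns an empty list when every required value is filled in, so it can serve as a quick check before running docker compose.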
