To reproduce this project you will need:
- Google Cloud account
- Docker with docker-compose
- Git (a GitHub account is optional, for forking)
Note
You can use either your local machine or a virtual machine on Google Cloud; the local machine was chosen here to reduce cloud costs. If you prefer to run it on a virtual machine, please refer to the video below:
- Create an account with your Google email ID
- Set up your first project if you haven't already
- e.g., "truck-logistics"; note down the "Project ID" (we'll use it later when deploying infrastructure with Terraform)
- Create a service account
- Add a service account name and click 'Create and continue'.
- Grant the Viewer role to begin with.
- Create a service account key
- Under 'Actions' click on the 3 dots and 'Manage Keys'
- Click 'Add key' and 'Create new key', choosing the 'JSON' key type. The key will download to your local machine; move it to a safe directory.
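Once downloaded, it's worth locking the key file down. A minimal sketch, assuming the paths below (both `KEY_SRC` and `KEY_DIR` are hypothetical; adjust `KEY_SRC` to wherever your browser saved the file):

```shell
# Hypothetical paths; adjust KEY_SRC to where the key was downloaded.
KEY_SRC="$HOME/Downloads/truck-logistics-key.json"
KEY_DIR="$HOME/.gcp"

mkdir -p "$KEY_DIR"
chmod 700 "$KEY_DIR"                           # only the owner may enter
if [ -f "$KEY_SRC" ]; then
  mv "$KEY_SRC" "$KEY_DIR/"
  chmod 600 "$KEY_DIR/$(basename "$KEY_SRC")"  # owner read/write only
  echo "key moved to $KEY_DIR"
else
  echo "key not found at $KEY_SRC; adjust KEY_SRC"
fi
```

Restricting permissions matters because this key grants programmatic access to your cloud project.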
IAM Roles for the service account:
- Go to the IAM section of IAM & Admin https://console.cloud.google.com/iam-admin/iam
- Click the Edit principal icon for your service account.
- Add these roles in addition to Viewer: Storage Admin, Storage Object Admin, and BigQuery Admin.
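The same grants can also be scripted with gcloud; a hedged sketch (the project ID and service-account email are placeholders, and the commands are skipped with a message if gcloud isn't installed):

```shell
# Placeholder IDs; substitute your own project and service-account email.
PROJECT_ID="truck-logistics"
SA_EMAIL="my-service-account@${PROJECT_ID}.iam.gserviceaccount.com"

# Grant each role from the list above.
for ROLE in roles/storage.admin roles/storage.objectAdmin roles/bigquery.admin; do
  if command -v gcloud >/dev/null 2>&1; then
    gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member="serviceAccount:${SA_EMAIL}" --role="$ROLE"
  else
    echo "would grant $ROLE to $SA_EMAIL"
  fi
done
```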
Enable these APIs for your project:
Please ensure the `GOOGLE_APPLICATION_CREDENTIALS` environment variable is set:

```shell
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token/session, and verify authentication
gcloud auth application-default login
```
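After setting the variable, a quick sanity check that it points at a readable JSON file can save a confusing failure later (this helper is not part of the project; python3 is used only because it's widely available):

```shell
# Check that GOOGLE_APPLICATION_CREDENTIALS points at a parseable JSON key.
KEY_FILE="${GOOGLE_APPLICATION_CREDENTIALS:-}"
if [ -n "$KEY_FILE" ] && [ -f "$KEY_FILE" ] \
   && python3 -m json.tool "$KEY_FILE" >/dev/null 2>&1; then
  STATUS="ok"
else
  STATUS="missing"
fi
echo "credentials check: $STATUS"
```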
```shell
git clone https://github.com/dieegogutierrez/Data-Engineering-Capstone-Project.git
cd mage-zoomcamp
```
- Rename the file `dev.env` to simply `.env`.
- Update the variables with your information, especially `LOCAL_PATH_SERVICE_ACCOUNT` (the path to your local service account file) and the `TF_VAR` variables (your cloud project information).
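For reference, the renamed `.env` might look something like the following. The exact variable names beyond the two mentioned above, and all the values, are hypothetical placeholders; use the keys that `dev.env` actually defines:

```shell
# .env — placeholder values only
LOCAL_PATH_SERVICE_ACCOUNT=/home/you/.gcp/service-account-key.json
TF_VAR_project=your-project-id
TF_VAR_region=us-central1
```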
```shell
./start.sh
```
- The script runs Terraform in Docker and creates the infrastructure in Google Cloud: a storage bucket and a BigQuery dataset.
- It then starts the orchestrator, MAGE, which loads local data, transforms it, and exports it to Google Cloud. Afterward, DBT builds models that produce the final table used by the dashboard.
- Access the orchestrator at http://localhost:6789/ and run the pipeline yourself.
- After completion, a table named 'trips_gross_revenue' is created in BigQuery, which can be used in Looker Studio to build a dashboard.
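To spot-check the final table from the command line, something like the following could work. The project and dataset names are placeholders (match them to your `TF_VAR` settings), and the query is skipped with a hint if the `bq` CLI isn't installed:

```shell
# Placeholder project/dataset; match them to your TF_VAR settings.
PROJECT_ID="your-project-id"
DATASET="your_dataset"
TABLE="${PROJECT_ID}.${DATASET}.trips_gross_revenue"

if command -v bq >/dev/null 2>&1; then
  # Preview the first rows of the table DBT built.
  bq query --use_legacy_sql=false "SELECT * FROM \`${TABLE}\` LIMIT 10"
else
  echo "bq CLI not found; preview ${TABLE} in the BigQuery console instead"
fi
```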