Data Engineering Team Activity

This project is a simple end-to-end system built for Data Engineering students. The project demonstrates how to:

Train a basic logistic regression model using scikit-learn.
Serve the model via a Flask REST API.
Log API request parameters and model predictions into a PostgreSQL database.
Containerize the application using Docker and Docker Compose.
Provide a static HTML form to test the API.
Run a simple unit test for the API.
(Optional) Deploy the entire pipeline to Google Cloud Platform (GCP) using GitHub Actions.

Project Structure

The project is organized into the following key directories and files:

├── model
│   └── train_model.py         # Train and save the logistic regression model
│   └── test_model.py          # Unit tests for the model
│   └── logistic_model.pkl     # trained logistic regression model (needs to be trained first)
├── api
│   ├── app.py                 # Flask API to serve predictions and log data to PostgreSQL
│   └── test_app.py            # Unit tests for the API
├── db/
│   └── init.sql               # SQL script to create the prediction_logs table
├── static
│   └── index.html             # Static HTML form to call the API
├── Dockerfile                 # Dockerfile for containerizing the Flask API
├── docker-compose.yml         # Docker Compose configuration (starts API and PostgreSQL containers)
├── requirements.txt           # Python dependencies
└── .github
│    └── workflows
│       └── deploy.yml                 # GitHub Actions workflows for deploying to cloud

Project Brief: Data Engineering Team Activity

Overview

Your task is to develop a simple end-to-end system that:

Trains a basic logistic regression model (using scikit-learn)
Serves the model via a Flask API
Logs API calls and predictions in a PostgreSQL database
Containerizes the components with Docker
Deploys the pipeline to a cloud platform (using free credits)
Develops API documentation (and optionally a Postman script) so other groups can use your API

Repository and Collaboration

Forking the Base Repository
- One group should fork the provided base repository. This repo will serve as your project starting point.
- The main group member (team lead) should then add the other team members as collaborators to the group repository.
Cloning and Contributing
- Each team member should clone the main group repository to their local machines.
- All initial contributions should be made directly into this single repository.
- Once you become comfortable with the workflow, experiment by having individual team members fork the group repository into their own personal repos.
- Practice making pull requests (PRs) from these forks. Each PR must be reviewed and approved by at least two team members before merging into the main repository.

Team Roles

Assign roles based on the following suggestions:

Model Developer: Implements and trains the logistic regression model.
API Developer: Develops the Flask API (including a static HTML form and unit tests).
Database Engineer: Sets up and integrates PostgreSQL for logging.
DevOps/CI-CD Engineer: Handles Docker containerization, GitHub Actions for deployment, and ensures proper cloud deployment.
Documentation Specialist (optional): Develops API documentation and a Postman script to help other groups use your API.

Workflow

Initial Development: Work on the main repository by committing and pushing your changes.
Experiment with Forking: Once comfortable, try forking the repo, making changes, and submitting pull requests. This helps simulate a real-world multi-repository workflow.
Code Review: Every pull request should be reviewed by at least two team members. Use the review process to discuss improvements and ensure code quality.

Cloud Deployment

Platform Choice: Choose a cloud platform for deployment (GCP, AWS, or any platform offering free credits).
Free Credits: Use the free credits available, but be cautious of cost overruns.
Security: Make sure to follow best practices regarding credentials and resource permissions.
Deployment Automation: Use GitHub Actions (or another CI/CD tool) to automate the deployment process.

API Documentation and Postman Script

Documentation: Develop comprehensive API documentation that explains:
- Available endpoints (e.g., /predict)
- Expected input data and output format
- How to authenticate (if applicable)
- Example requests and responses
Postman Script: Optionally, create a Postman collection that contains:
- A script for testing the API endpoints.
- Pre-configured requests so other groups can easily import the collection and test your API.

Final Deliverables

A working system deployed on the cloud platform.
A fully documented API (with a README and additional documentation if needed).
A Postman collection file (optional) for easy API testing.
A summary of your workflow, including how you used forks, pull requests, and reviews to merge changes.

Remember:

The main repository is your base. Use forks and PRs to experiment with collaborative workflows.
Ensure that you keep track of costs and follow best security practices during cloud deployment.
Good documentation is key—make it easy for other groups to understand and use your API.

Good luck, and happy coding!

Setup Instructions

clone this repository

Prerequisites

Docker installed on your machine and running

How to Run Locally

Prerequisites

Git installed on your machine
Docker installed on your machine and running
Python 3.9+ installed (if running locally without Docker)

Clone the repository:

git clone https://github.com/niallroche/data-engineering-mlops.git

Navigate to the project directory:

cd data-engineering-mlops

Setup Dependencies

If running without Docker, install the required Python packages: create a virtual environment and activate it

python -m venv de-venv
source de-venv/bin/activate

install the required Python packages:

pip install -r requirements.txt

1. Train the Model

Navigate to the model folder.
Run the training script to build the logistic regression model and save it to model/logistic_model.pkl.

cd model
python train_model.py

Run Unit Tests
Navigate to the root directory of the project.
Run the following command to run the unit tests for the model.

cd ..
python -m unittest model/test_model.py

2. Build the Docker image:

Navigate to the root directory of the project.
Run the following command to build the Docker image. (Note that this command might require admin rights such as sudo)

docker-compose build

3. Start the API

Navigate to the root directory of the project.
Run the following command to start the API and optionally the PostgreSQL containers.

docker-compose up -d

to run the server locally outside of docker run the following in a dedicated terminal window

python3 api/app.py

4. Run Unit Test

Navigate to the root directory of the project.
Run the following command to run the unit tests for the API.

python -m unittest api/test_app.py

5. Test the API

set the USE_DATABASE environment variable to false

USE_DATABASE=false

Open your web browser and navigate to http://localhost:5000.
Fill in the form with the required fields and click the "Predict" button to see the model's prediction.

6. Test the API using curl

Navigate to the root directory of the project.
Run the following command to test the API using curl.

curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{"features": [5.1, 3.5, 1.4, 0.2]}'

7. Shutdown the containers

Navigate to the root directory of the project.
Run the following command to shutdown the containers.

docker-compose down

Database Setup

Schema

The application uses PostgreSQL to store prediction logs with the following schema:

CREATE TABLE prediction_logs (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    input_features JSONB NOT NULL,
    prediction INTEGER NOT NULL,
    model_version TEXT DEFAULT '1.0',
    confidence FLOAT,
    processing_time FLOAT,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Key Fields:

id: Unique identifier for each prediction
timestamp: When the prediction was made
input_features: JSON object containing the input parameters
prediction: The model's output prediction
model_version: Version of the model used
confidence: Confidence score of the prediction (if available)
processing_time: Time taken to process the request
created_at: Record creation timestamp

Setup Instructions

Using Docker Compose (Recommended):

docker-compose up -d

This will automatically:

Create the PostgreSQL database
Initialize the schema
Set up the required users and permissions

Note, the use_database environment variable is set to false, so the database will not be created and the logs will not be saved to the database

to run the server with the database enabled, run the following command

# find the container id running the api
docker ps
# stop the container
docker stop <container_id>
# run the container again with the database enabled
docker run -p 5000:5000 -e DB_ENABLED=true <container_id>
# or just change the value of the use_database environment variable to true in the docker-compose.yml file

Manual Setup:

# Connect to PostgreSQL in the container
# if using docker desktop then connect to a terminal for the running container

# find the container id running the api
docker ps
docker exec -it <container_id> psql -U postgres


# Create database
CREATE DATABASE flask_logs;

# Connect to the new database
\c flask_logs

# Run the schema initialization script
\i db/init.sql

Connect to PostgreSQL in Docker:

# find the container id running PostgreSQL
docker ps

# connect to the container
docker exec -it <container_id>

# connect to the database
psql -U postgres -d flask_logs

# list the database tables
\dt

# select recordfrom the table
select * from prediction_logs;

# exit the database
\q

if the database needs to be created first, execute the following command when logged in to the container

psql -U postgres -d flask_logs -f /docker-entrypoint-initdb.d/init.sql

Environment Variables

Configure the following environment variables for database connection:

USE_DATABASE=true
POSTGRES_DB=flask_logs
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_password
POSTGRES_HOST=db
POSTGRES_PORT=5432

Accessing Logs

Query recent predictions:

SELECT * FROM recent_predictions;

Monitor database size:

SELECT pg_size_pretty(pg_database_size('flask_logs'));

Running the Application

...

publish the image to a container registry

(to a GCP registry)

docker tag data-engineering-mlops:latest gcr.io/data-engineering-mlops/data-engineering-mlops:latest
docker push gcr.io/data-engineering-mlops/data-engineering-mlops:latest

(to an AWS registry) change the tag to your own aws account id

docker tag data-engineering-mlops:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/data-engineering-mlops:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/data-engineering-mlops:latest

to docker hub

docker tag data-engineering-mlops:latest niallroche/data-engineering-mlops:latest
docker push niallroche/data-engineering-mlops:latest

How to Deploy to Cloud Platform

Prerequisites

A Google Cloud Platform (GCP) account.
The Google Cloud SDK installed on your machine.
A GCP project.
A service account key for the GCP project.

Deploying to Google Cloud Run (Web Console)

This section explains how to deploy the ML model API to Google Cloud Run using the Google Cloud Web Console.

Part 1 — Create a Google Cloud Account

Open the Google Cloud Console: https://console.cloud.google.com

Sign in using a Google account. If this is your first time using Google Cloud you will be prompted to start the free trial and enable billing. The free trial provides credits which are sufficient for this lab.

Part 2 — Create a New Project

Projects are the top-level container for all Google Cloud resources.

Click the Project Selector in the top navigation bar
Click New Project

Enter:

Project name: mlops-lab-group1
Organization: leave blank
Location: No organization

Click Create. After creation, make sure your new project is selected in the top menu.

Part 3 — Enable Required APIs

Navigate to APIs & Services → Library and enable the following APIs:

API	Purpose
Cloud Run API	Runs the containerised service
Artifact Registry API	Stores container images
Cloud Build API	Builds container images from source
Secret Manager API (optional)	Secure storage for credentials/API keys
Vertex AI API (optional)	Advanced ML model hosting

Part 4 — Create a Service Account

Navigate to IAM & Admin → Service Accounts and click Create Service Account.

Enter:

Name: mlops-lab-sa
Description: Service account for MLOps lab deployment

Click Create and assign the following roles:

Cloud Run Admin
Artifact Registry Writer
Cloud Build Editor
Service Account User

Click Done.

Part 5 — Create an Artifact Registry

Navigate to Artifact Registry → Repositories and click Create Repository.

Fill in:

Name: ml-models
Format: Docker
Mode: Standard
Region: europe-west2 (London)

Click Create. The registry is now ready to store container images built by Cloud Build.

Part 6 — Deploy the Service to Cloud Run

Navigate to Cloud Run → Create Service and choose Deploy from source repository.

If GitHub is not yet connected, click Set up Cloud Build and follow the prompts to connect your GitHub account.

Once connected:

Select your repository (e.g. data-engineering-mlops)
Choose the branch: main
Select the folder containing the API (e.g. api/)

Part 7 — Configure the Cloud Run Service

Setting	Value
Service name	`ensemble-gateway`
Region	`europe-west2`
Authentication	Allow unauthenticated
Port	`8080`
Min instances	`0`
Max instances	`5`
CPU	`1`
Memory	`512 MB`

Click Create. Cloud Run will build the container image, push it to Artifact Registry, and deploy a new service revision. Deployment usually takes 1–2 minutes.

Part 8 — Retrieve the Public API URL

After deployment finishes, Cloud Run displays a Service URL, for example:

https://ensemble-gateway-abc123-ew2.a.run.app

If you are using FastAPI, the Swagger docs are available at:

https://SERVICE_URL/docs

Part 9 — Test the API Endpoint

export SERVICE_URL=https://ensemble-gateway-abc123-ew2.a.run.app

curl -X POST "$SERVICE_URL/predict" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'

Expected response:

{
  "prediction": 1,
  "confidence": 0.83
}

Troubleshooting

Container failed to start

Ensure the application listens on 0.0.0.0:8080. Cloud Run sets the PORT environment variable automatically:

import os
port = int(os.environ.get("PORT", 8080))
app.run(host="0.0.0.0", port=port)

Model file not found

If you see FileNotFoundError: model/logistic_model.pkl, ensure the model file exists inside the container image. Your project structure should look like:

repo/
├── api/
│   └── app.py
├── model/
│   └── logistic_model.pkl
└── Dockerfile

And the Dockerfile must copy the model directory:

COPY . /app

Verify application startup locally

Before deploying, test on the Cloud Run port:

PORT=8080 python api/app.py
# or
PORT=8080 uvicorn api.app:app --host 0.0.0.0 --port 8080

scikit-learn version mismatch

If you see 'LogisticRegression' object has no attribute 'multi_class', the model was pickled with a different scikit-learn version than the one in requirements.txt. Retrain the model after installing the pinned version:

pip install scikit-learn==1.6.1
python model/train_model.py

Commit the updated logistic_model.pkl and redeploy.

How to Deploy to AWS

Prerequisites

An AWS account.
The AWS CLI installed on your machine.
A service account key for the AWS project.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.github/workflows		.github/workflows
api		api
db		db
model		model
static		static
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Data Engineering Team Activity

Project Structure

Project Brief: Data Engineering Team Activity

Overview

Repository and Collaboration

Team Roles

Workflow

Cloud Deployment

API Documentation and Postman Script

Final Deliverables

Setup Instructions

Prerequisites

How to Run Locally

Prerequisites

Setup Dependencies

1. Train the Model

2. Build the Docker image:

3. Start the API

4. Run Unit Test

5. Test the API

6. Test the API using curl

7. Shutdown the containers

Database Setup

Schema

Key Fields:

Setup Instructions

Environment Variables

Accessing Logs

Running the Application

How to Deploy to Cloud Platform

Prerequisites

Deploying to Google Cloud Run (Web Console)

Part 1 — Create a Google Cloud Account

Part 2 — Create a New Project

Part 3 — Enable Required APIs

Part 4 — Create a Service Account

Part 5 — Create an Artifact Registry

Part 6 — Deploy the Service to Cloud Run

Part 7 — Configure the Cloud Run Service

Part 8 — Retrieve the Public API URL

Part 9 — Test the API Endpoint

Troubleshooting

How to Deploy to AWS

Prerequisites

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages