This project is a simple end-to-end system built for Data Engineering students. The project demonstrates how to:
- Train a basic logistic regression model using scikit-learn.
- Serve the model via a Flask REST API.
- Log API request parameters and model predictions into a PostgreSQL database.
- Containerize the application using Docker and Docker Compose.
- Provide a static HTML form to test the API.
- Run a simple unit test for the API.
- (Optional) Deploy the entire pipeline to Google Cloud Platform (GCP) using GitHub Actions.
The project is organized into the following key directories and files:
├── model
│   ├── train_model.py       # Train and save the logistic regression model
│   ├── test_model.py        # Unit tests for the model
│   └── logistic_model.pkl   # Trained logistic regression model (needs to be trained first)
├── api
│   ├── app.py               # Flask API to serve predictions and log data to PostgreSQL
│   └── test_app.py          # Unit tests for the API
├── db
│   └── init.sql             # SQL script to create the prediction_logs table
├── static
│   └── index.html           # Static HTML form to call the API
├── Dockerfile               # Dockerfile for containerizing the Flask API
├── docker-compose.yml       # Docker Compose configuration (starts API and PostgreSQL containers)
├── requirements.txt         # Python dependencies
└── .github
    └── workflows
        └── deploy.yml       # GitHub Actions workflow for deploying to the cloud
Your task is to develop a simple end-to-end system that:
- Trains a basic logistic regression model (using scikit-learn)
- Serves the model via a Flask API
- Logs API calls and predictions in a PostgreSQL database
- Containerizes the components with Docker
- Deploys the pipeline to a cloud platform (using free credits)
- Develops API documentation (and optionally a Postman script) so other groups can use your API
- Forking the Base Repository
  - One group member should fork the provided base repository. This repo will serve as your project starting point.
  - The main group member (team lead) should then add the other team members as collaborators to the group repository.
- Cloning and Contributing
  - Each team member should clone the main group repository to their local machine.
  - All initial contributions should be made directly into this single repository.
  - Once you become comfortable with the workflow, experiment by having individual team members fork the group repository into their own personal repos.
  - Practice making pull requests (PRs) from these forks. Each PR must be reviewed and approved by at least two team members before merging into the main repository.
Assign roles based on the following suggestions:
- Model Developer: Implements and trains the logistic regression model.
- API Developer: Develops the Flask API (including a static HTML form and unit tests).
- Database Engineer: Sets up and integrates PostgreSQL for logging.
- DevOps/CI-CD Engineer: Handles Docker containerization, GitHub Actions for deployment, and ensures proper cloud deployment.
- Documentation Specialist (optional): Develops API documentation and a Postman script to help other groups use your API.
- Initial Development: Work on the main repository by committing and pushing your changes.
- Experiment with Forking: Once comfortable, try forking the repo, making changes, and submitting pull requests. This helps simulate a real-world multi-repository workflow.
- Code Review: Every pull request should be reviewed by at least two team members. Use the review process to discuss improvements and ensure code quality.
- Platform Choice: Choose a cloud platform for deployment (GCP, AWS, or any platform offering free credits).
- Free Credits: Use the free credits available, but be cautious of cost overruns.
- Security: Make sure to follow best practices regarding credentials and resource permissions.
- Deployment Automation: Use GitHub Actions (or another CI/CD tool) to automate the deployment process.
- Documentation: Develop comprehensive API documentation that explains:
  - Available endpoints (e.g., /predict)
  - Expected input data and output format
  - How to authenticate (if applicable)
  - Example requests and responses
- Postman Script: Optionally, create a Postman collection that contains:
  - A script for testing the API endpoints.
  - Pre-configured requests so other groups can easily import the collection and test your API.
- A working system deployed on the cloud platform.
- A fully documented API (with a README and additional documentation if needed).
- A Postman collection file (optional) for easy API testing.
- A summary of your workflow, including how you used forks, pull requests, and reviews to merge changes.
Remember:
- The main repository is your base. Use forks and PRs to experiment with collaborative workflows.
- Ensure that you keep track of costs and follow best security practices during cloud deployment.
- Good documentation is key—make it easy for other groups to understand and use your API.
Good luck, and happy coding!
- Git installed on your machine
- Docker installed on your machine and running
- Python 3.9+ installed (if running locally without Docker)
- Clone the repository:
git clone https://github.com/niallroche/data-engineering-mlops.git
- Navigate to the project directory:
cd data-engineering-mlops
- If running without Docker, create a virtual environment and activate it:
python -m venv de-venv
source de-venv/bin/activate
- Install the required Python packages:
pip install -r requirements.txt
- Navigate to the model folder.
- Run the training script to build the logistic regression model and save it to model/logistic_model.pkl.
cd model
python train_model.py
- Run Unit Tests
- Navigate to the root directory of the project.
- Run the following command to run the unit tests for the model.
cd ..
python -m unittest model/test_model.py
- Navigate to the root directory of the project.
- Run the following command to build the Docker image. (Note that this command might require admin rights such as sudo.)
docker-compose build
- Navigate to the root directory of the project.
- Run the following command to start the API and, optionally, the PostgreSQL containers.
docker-compose up -d
- To run the server locally outside of Docker, run the following in a dedicated terminal window:
python3 api/app.py
- Navigate to the root directory of the project.
- Run the following command to run the unit tests for the API.
python -m unittest api/test_app.py
- Set the USE_DATABASE environment variable to false (exported so that the Python process can see it):
export USE_DATABASE=false
- Open your web browser and navigate to http://localhost:5000.
- Fill in the form with the required fields and click the "Predict" button to see the model's prediction.
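The form sends a JSON body to the /predict endpoint. As a rough Python equivalent, the sketch below builds and sends the same POST request using only the standard library. The helper names build_predict_request and predict are hypothetical, and the payload shape (a features list of four numbers) matches the curl example in this README. Sending the request assumes the API is running on http://localhost:5000.

```python
import json
import urllib.request


def build_predict_request(base_url, features):
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps({"features": list(features)}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url, features, timeout=10):
    """Send the request and return the decoded JSON response (needs a running API)."""
    request = build_predict_request(base_url, features)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the containers running, calling predict("http://localhost:5000", [5.1, 3.5, 1.4, 0.2]) should return the same JSON that the form displays.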
- Navigate to the root directory of the project.
- Run the following command to test the API using curl.
curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
- Navigate to the root directory of the project.
- Run the following command to shut down the containers.
docker-compose down
The application uses PostgreSQL to store prediction logs with the following schema:
CREATE TABLE prediction_logs (
id SERIAL PRIMARY KEY,
timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
input_features JSONB NOT NULL,
prediction INTEGER NOT NULL,
model_version TEXT DEFAULT '1.0',
confidence FLOAT,
processing_time FLOAT,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
- id: Unique identifier for each prediction
- timestamp: When the prediction was made
- input_features: JSON object containing the input parameters
- prediction: The model's output prediction
- model_version: Version of the model used
- confidence: Confidence score of the prediction (if available)
- processing_time: Time taken to process the request
- created_at: Record creation timestamp
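To illustrate how a single request could map onto this table, the sketch below builds a parameterized INSERT statement together with its values. It is not the code in api/app.py: the function name build_log_insert is hypothetical, and the %s placeholders assume a psycopg2-style driver. The id, timestamp and created_at columns are left out so that their defaults apply.

```python
import json

INSERT_SQL = (
    "INSERT INTO prediction_logs "
    "(input_features, prediction, model_version, confidence, processing_time) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def build_log_insert(features, prediction, confidence=None,
                     processing_time=None, model_version="1.0"):
    """Return (sql, params) describing one prediction_logs row."""
    params = (
        json.dumps({"features": list(features)}),  # serialized for the JSONB column
        int(prediction),
        model_version,
        None if confidence is None else float(confidence),
        None if processing_time is None else float(processing_time),
    )
    return INSERT_SQL, params
```

With a psycopg2 cursor, the pair would be passed on as cursor.execute(sql, params), which keeps user input out of the SQL string itself.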
- Using Docker Compose (Recommended):
docker-compose up -d
This will automatically:
- Create the PostgreSQL database
- Initialize the schema
- Set up the required users and permissions
Note: the USE_DATABASE environment variable is set to false, so the database will not be created and the logs will not be saved to the database.
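A flag like this is usually read once at startup and then used to skip the logging step. The sketch below shows one way to parse it. The helper name env_flag and the accepted truthy spellings are assumptions and are not taken from api/app.py.

```python
import os


def env_flag(name, default=False, env=None):
    """Interpret an environment variable as a boolean switch."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Evaluated once at import time; database logging would be skipped when this is False.
USE_DATABASE = env_flag("USE_DATABASE")
```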
To run the server with the database enabled, run the following commands:
# find the container id running the api
docker ps
# stop the container
docker stop <container_id>
# run the API image again with the database enabled
# (docker run takes an image name, shown in the IMAGE column of docker ps)
docker run -p 5000:5000 -e USE_DATABASE=true <image_name>
# or just change the value of the USE_DATABASE environment variable to true in the docker-compose.yml file
- Manual Setup:
# Connect to PostgreSQL in the container
# if using Docker Desktop, connect to a terminal for the running container
# find the container id running PostgreSQL
docker ps
docker exec -it <container_id> psql -U postgres
# Create database
CREATE DATABASE flask_logs;
# Connect to the new database
\c flask_logs
# Run the schema initialization script (path inside the container)
\i /docker-entrypoint-initdb.d/init.sql
- Connect to PostgreSQL in Docker:
# find the container id running PostgreSQL
docker ps
# open a shell in the container
docker exec -it <container_id> bash
# connect to the database
psql -U postgres -d flask_logs
# list the database tables
\dt
# select records from the table
select * from prediction_logs;
# exit the database
\q
If the prediction_logs table needs to be created first, execute the following command when logged in to the container:
psql -U postgres -d flask_logs -f /docker-entrypoint-initdb.d/init.sql
Configure the following environment variables for database connection:
USE_DATABASE=true
POSTGRES_DB=flask_logs
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_password
POSTGRES_HOST=db
POSTGRES_PORT=5432
Query recent predictions:
SELECT * FROM recent_predictions;
Monitor database size:
SELECT pg_size_pretty(pg_database_size('flask_logs'));
...
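The connection variables listed above can be collected into a single settings object at startup. The following is a sketch only: load_db_config is a hypothetical helper, the fallback values mirror the examples in this README, and the key names (dbname, user, password, host, port) assume a psycopg2-style connect call.

```python
import os


def load_db_config(env=None):
    """Collect PostgreSQL connection settings from environment variables."""
    env = os.environ if env is None else env
    return {
        "dbname": env.get("POSTGRES_DB", "flask_logs"),
        "user": env.get("POSTGRES_USER", "postgres"),
        "password": env.get("POSTGRES_PASSWORD", ""),
        "host": env.get("POSTGRES_HOST", "db"),
        "port": int(env.get("POSTGRES_PORT", "5432")),  # env values are strings
    }
```

Under that assumption, the dictionary could be unpacked directly into the driver, for example psycopg2.connect(**load_db_config()).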
Publish the image to a container registry:
- (to a GCP registry)
docker tag data-engineering-mlops:latest gcr.io/data-engineering-mlops/data-engineering-mlops:latest
docker push gcr.io/data-engineering-mlops/data-engineering-mlops:latest
- (to an AWS registry) change the tag to your own AWS account id
docker tag data-engineering-mlops:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/data-engineering-mlops:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/data-engineering-mlops:latest
- (to Docker Hub)
docker tag data-engineering-mlops:latest niallroche/data-engineering-mlops:latest
docker push niallroche/data-engineering-mlops:latest
- A Google Cloud Platform (GCP) account.
- The Google Cloud SDK installed on your machine.
- A GCP project.
- A service account key for the GCP project.
This section explains how to deploy the ML model API to Google Cloud Run using the Google Cloud Web Console.
Open the Google Cloud Console: https://console.cloud.google.com
Sign in using a Google account. If this is your first time using Google Cloud you will be prompted to start the free trial and enable billing. The free trial provides credits which are sufficient for this lab.
Projects are the top-level container for all Google Cloud resources.
- Click the Project Selector in the top navigation bar
- Click New Project
Enter:
- Project name: mlops-lab-group1
- Organization: leave blank
- Location: No organization
Click Create. After creation, make sure your new project is selected in the top menu.
Navigate to APIs & Services → Library and enable the following APIs:
| API | Purpose |
|---|---|
| Cloud Run API | Runs the containerised service |
| Artifact Registry API | Stores container images |
| Cloud Build API | Builds container images from source |
| Secret Manager API (optional) | Secure storage for credentials/API keys |
| Vertex AI API (optional) | Advanced ML model hosting |
Navigate to IAM & Admin → Service Accounts and click Create Service Account.
Enter:
- Name: mlops-lab-sa
- Description: Service account for MLOps lab deployment
Click Create and assign the following roles:
- Cloud Run Admin
- Artifact Registry Writer
- Cloud Build Editor
- Service Account User
Click Done.
Navigate to Artifact Registry → Repositories and click Create Repository.
Fill in:
- Name: ml-models
- Format: Docker
- Mode: Standard
- Region: europe-west2 (London)
Click Create. The registry is now ready to store container images built by Cloud Build.
Navigate to Cloud Run → Create Service and choose Deploy from source repository.
If GitHub is not yet connected, click Set up Cloud Build and follow the prompts to connect your GitHub account.
Once connected:
- Select your repository (e.g. data-engineering-mlops)
- Choose the branch: main
- Select the folder containing the API (e.g. api/)
| Setting | Value |
|---|---|
| Service name | ensemble-gateway |
| Region | europe-west2 |
| Authentication | Allow unauthenticated |
| Port | 8080 |
| Min instances | 0 |
| Max instances | 5 |
| CPU | 1 |
| Memory | 512 MB |
Click Create. Cloud Run will build the container image, push it to Artifact Registry, and deploy a new service revision. Deployment usually takes 1–2 minutes.
After deployment finishes, Cloud Run displays a Service URL, for example:
https://ensemble-gateway-abc123-ew2.a.run.app
If you are using FastAPI, the Swagger docs are available at:
https://SERVICE_URL/docs
export SERVICE_URL=https://ensemble-gateway-abc123-ew2.a.run.app
curl -X POST "$SERVICE_URL/predict" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
-d '{"features": [5.1, 3.5, 1.4, 0.2]}'
Expected response:
{
"prediction": 1,
"confidence": 0.83
}
Container failed to start
Ensure the application listens on 0.0.0.0:8080. Cloud Run sets the PORT environment variable automatically:
import os
port = int(os.environ.get("PORT", 8080))
app.run(host="0.0.0.0", port=port)
Model file not found
If you see FileNotFoundError: model/logistic_model.pkl, ensure the model file exists inside the container image. Your project structure should look like:
repo/
├── api/
│ └── app.py
├── model/
│ └── logistic_model.pkl
└── Dockerfile
And the Dockerfile must copy the model directory:
COPY . /app
Verify application startup locally
Before deploying, test on the Cloud Run port:
PORT=8080 python api/app.py
# or
PORT=8080 uvicorn api.app:app --host 0.0.0.0 --port 8080
scikit-learn version mismatch
If you see 'LogisticRegression' object has no attribute 'multi_class', the model was pickled with a different scikit-learn version than the one in requirements.txt. Retrain the model after installing the pinned version:
pip install scikit-learn==1.6.1
python model/train_model.py
Commit the updated logistic_model.pkl and redeploy.
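One way to catch this mismatch before deploying is to compare the installed scikit-learn version with the pin in requirements.txt. The sketch below only extracts the pin. pinned_version is a hypothetical helper, and it assumes the simple package==version form used in the pip command above.

```python
import re


def pinned_version(requirements_text, package):
    """Return the version pinned with '==' for package, or None if there is no pin."""
    pattern = re.compile(
        r"^\s*" + re.escape(package) + r"\s*==\s*([^\s#;]+)",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(requirements_text)
    return match.group(1) if match else None
```

The result can then be compared with sklearn.__version__ in a unit test or at application startup, so that a stale pickle is flagged before it reaches Cloud Run.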
- An AWS account.
- The AWS CLI installed on your machine.
- AWS access credentials (for example, an IAM access key) for your AWS account.