At B-CITI Solutions, our data platform ingests and transforms data from multiple operational systems — including NoSQL sources like MongoDB — into relational models for analytics and downstream services. We value engineers who can build solid data pipelines and think about how to operationalize them.
In this exercise, you'll work with MongoDB's public sample_supplies.sales dataset, which contains e-commerce sales transactions. You'll design and implement a containerized ETL pipeline that extracts and transforms this nested data into a relational schema suitable for analytics, and then perform a small analysis or exploration on the result.
Beyond the core ETL work, we'd like to see your awareness of operational best practices by making this pipeline:
- Containerized — basic Docker setup so it runs consistently anywhere
- Configurable — externalized config (env vars or config file)
- Re-runnable — can execute multiple times without breaking
This gives us insight into how you think about operationalizing data pipelines.
By the end, a reviewer should be able to run your pipeline with minimal setup:
```bash
# Clone and run
git clone <your-repo>
cd data-engineering-assignment

# Run via Docker
docker-compose up

# Or with a simple script/Makefile
make run
```

The key is that it should be easy to run, repeatable, and not require manual setup of Python environments or database installations.
You’re provided with the sample_supplies.sales dataset — a collection of documents representing e-commerce transactions. Each record includes customer information, store location, purchased items, and payment details.
Your goal is to:
1. Extract the JSON data.
2. Transform the nested JSON into a relational model suitable for analytics. At minimum, include the following tables:
   - sales — main transaction table
   - customers — customer attributes (e.g., age, gender, email)
   - items — flattened item-level details
   - (Optional) stores — distinct store locations or metadata
3. Load the transformed data into a SQL database (SQLite or Postgres).
4. Handle nested arrays (e.g., items) and embedded objects (e.g., customer) cleanly — avoid redundancy and ensure referential integrity.
5. Make the pipeline reproducible (e.g., runnable end to end from a single main.py).
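As one possible sketch of the transform step (not a required design), each document could be split into sales, customers, and items rows; the column names below are illustrative assumptions based on the document shape shown in this brief:

```python
import pandas as pd

def flatten_sales(docs):
    """Split raw sale documents into sales, customers, and items rows.

    Field names assume the sample_supplies.sales document shape; the
    chosen keys (sale_id, customer_email, line_no) are illustrative.
    """
    sales, customers, items = [], [], []
    for doc in docs:
        sale_id = str(doc["_id"])
        cust = doc.get("customer", {})
        sales.append({
            "sale_id": sale_id,
            "sale_date": doc.get("saleDate"),
            "store_location": doc.get("storeLocation"),
            "customer_email": cust.get("email"),
            "coupon_used": doc.get("couponUsed"),
            "purchase_method": doc.get("purchaseMethod"),
        })
        customers.append({
            "email": cust.get("email"),
            "gender": cust.get("gender"),
            "age": cust.get("age"),
        })
        # Nested items array becomes one row per line item,
        # linked back to the sale by sale_id.
        for i, item in enumerate(doc.get("items", [])):
            items.append({
                "sale_id": sale_id,
                "line_no": i,
                "name": item.get("name"),
                "price": item.get("price"),
                "quantity": item.get("quantity"),
            })
    return (
        pd.DataFrame(sales),
        pd.DataFrame(customers).drop_duplicates(subset="email"),
        pd.DataFrame(items),
    )
```

Deduplicating customers by email is one simple way to avoid redundancy; you may prefer a surrogate key instead.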
Once the data is loaded into your relational tables, perform one or two small analyses or investigations to demonstrate the value of your transformation.
Choose one or more of the following examples — or create your own analysis if you prefer:
- Top 10 products by total sales revenue
- Average basket size (number of items per order)
- Total revenue per store location
- Purchase methods breakdown (Online vs. In-store)
- Coupon usage rate by customer gender or age group
- Detect missing or inconsistent customer data
- Check for negative or zero prices or quantities
- Validate that total order value equals the sum of item-level prices × quantities
If you wish, include some visualisations (matplotlib/seaborn/pandas) — e.g., a bar chart of revenue by store or a histogram of customer age distribution.
Visuals are optional, but they help illustrate insights clearly.
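For instance, the top-products analysis can be a single SQL query; the sketch below assumes an items(name, price, quantity) table matching the suggested schema, which is an assumption rather than a prescribed design:

```python
import sqlite3

def top_products(conn, limit=10):
    """Top products by total revenue, assuming an
    items(name, price, quantity) table exists in the database."""
    query = """
        SELECT name, ROUND(SUM(price * quantity), 2) AS revenue
        FROM items
        GROUP BY name
        ORDER BY revenue DESC
        LIMIT ?;
    """
    return conn.execute(query, (limit,)).fetchall()
```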
Each sales document looks roughly like this:
```json
{
  "_id": {"$oid": "5bd761dcae323e45a93ccfe8"},
  "saleDate": {"$date": "2017-11-28T05:00:00Z"},
  "items": [
    {"name": "envelopes", "tags": ["stationary"], "price": 3.19, "quantity": 3},
    {"name": "notepads", "tags": ["office"], "price": 12.95, "quantity": 1}
  ],
  "storeLocation": "Denver",
  "customer": {"gender": "F", "age": 36, "email": "someone@example.com"},
  "couponUsed": false,
  "purchaseMethod": "Online"
}
```

This dataset contains roughly 5,000 documents, but you can work with a smaller subset (e.g., the first 500 records) for simplicity.
We'd like to see how you think about running this pipeline as a repeatable service. Include the following:
1. Docker Setup
- Create a Dockerfile and docker-compose.yml
- Include both your ETL app and the database
- Should run with docker-compose up
2. Configuration
- Externalize configuration (database connection, file paths, etc.)
- Use environment variables or a config file
- No hardcoded credentials
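A minimal way to externalize configuration in Python is environment variables with local defaults; the variable names (ETL_DB_URL, etc.) and defaults below are illustrative, not prescribed:

```python
import os

# Read settings from the environment, falling back to safe local
# defaults so the pipeline runs out of the box. No credentials are
# hardcoded; in Docker these would come from docker-compose.yml.
DB_URL = os.getenv("ETL_DB_URL", "sqlite:///supplies.db")
INPUT_PATH = os.getenv("ETL_INPUT_PATH", "data/sample_supplies.json")
LOG_LEVEL = os.getenv("ETL_LOG_LEVEL", "INFO")
```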
3. Re-runnability
- Pipeline should handle being run multiple times
- Choose an approach: truncate-and-reload, upserts, or incremental processing
- Briefly document your choice in the README
4. Basic Logging
- Use Python's logging module
- Log key steps: records processed, any errors encountered
5. Simple Automation (Optional)
- A Makefile or shell script to make running easier
- Example: make run to execute the whole pipeline
- Language: Python 3
- Libraries: Pandas, SQLAlchemy, or similar tools of your choice
- Database: SQLite (for simplicity) or Postgres
- Structure: Modular, reproducible code (e.g., extract.py, transform.py, load.py, analyze.py)
- Documentation: Include a README.md describing:
  - How to run your pipeline
  - Your schema and design choices
  - Key assumptions and data-quality considerations
  - The analysis you performed and what insights you found
```
supplies_data_etl/
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── README.md
├── src/
│   ├── extract.py
│   ├── transform.py
│   ├── load.py
│   └── main.py
└── data/
    └── sample_supplies.json
```
Please submit either:
- A zipped project folder, or
- A GitHub repository containing:
- Source code
- A small sample of the dataset (subset of JSON if needed)
- A README.md with setup and instructions
| Category | What We're Looking For |
|---|---|
| Data Modeling | Logical relational schema; handling of nested JSON; data integrity |
| Code Quality | Clean, modular, well-documented Python code |
| Transformation Logic | Correct flattening and joining of nested data |
| Analytical Thinking | Clear and meaningful insights drawn from transformed data |
| Containerization | Basic Docker setup; easy to run; configuration externalized |
| Re-runnability | Pipeline can execute multiple times; documented approach |
| Documentation & Communication | Clarity of explanation; ability to discuss design trade-offs |
- Expected time: up to you, but we expect around 3 hours
- Presentation: ~30 minutes live walkthrough
- Focus: Solid fundamentals, clear thinking, and basic operational awareness
🗣️ Live Discussion
During the follow-up interview, we'll ask you to:
- Walk us through your ETL pipeline design
- Explain schema and transformation decisions
- Share your analytical findings
- Demo running the pipeline via Docker (just show it works!)
- Discuss your re-runnability approach (why did you choose it?)
- Talk about how you'd scale or improve this with more time (Airflow, dbt, cloud deployment, etc.)
Build an ETL pipeline that transforms MongoDB's sample_supplies.sales dataset into relational SQL tables, containerize it with Docker for easy deployment, make it re-runnable, and demonstrate some analytics on the transformed data. This shows both your data engineering skills and your awareness of operational best practices.