At B-CITI Solutions, our data platform ingests and transforms data from multiple operational systems — including NoSQL sources like MongoDB — into relational models for analytics and downstream services. We value engineers who can build solid data pipelines and think about how to operationalize them.
In this exercise, you'll work with MongoDB's public sample_supplies.sales dataset, which contains e-commerce sales transactions. You'll design and implement a containerized ETL pipeline that extracts and transforms this nested data into a relational schema suitable for analytics, and then perform a small analysis or exploration on the result.
Beyond the core ETL work, we'd like to see your awareness of operational best practices by making this pipeline:
- Containerized — basic Docker setup so it runs consistently anywhere
- Configurable — externalized config (env vars or config file)
- Re-runnable — can execute multiple times without breaking
This gives us insight into how you think about operationalizing data pipelines.
By the end, a reviewer should be able to run your pipeline with minimal setup:
```bash
# Clone and run
git clone <your-repo>
cd data-engineering-assignment

# Run via Docker
docker-compose up

# Or with a simple script/Makefile
make run
```

The key is that it should be easy to run, repeatable, and not require manual setup of Python environments or database installations.
You’re provided with the sample_supplies.sales dataset — a collection of documents representing e-commerce transactions. Each record includes customer information, store location, purchased items, and payment details.
Your goal is to:
1. Extract the JSON data.
2. Transform the nested JSON into a relational model suitable for analytics. At minimum, include the following tables:
   - sales — main transaction table
   - customers — customer attributes (e.g., age, gender, email)
   - items — flattened item-level details
   - (Optional) stores — distinct store locations or metadata
3. Load the transformed data into a SQL database (SQLite or Postgres).
4. Handle nested arrays (e.g., items) and embedded objects (e.g., customer) cleanly — avoid redundancy and ensure referential integrity.
5. Make the pipeline reproducible (e.g., runnable end to end from a single main.py).
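As one possible sketch of the transform step (not a required design), each document could be split into sales, customers, and items rows; the column names below are illustrative assumptions based on the document shape shown in this brief:

```python
import pandas as pd

def flatten_sales(docs):
    """Split raw sale documents into sales, customers, and items rows.

    Field names assume the sample_supplies.sales document shape; the
    chosen keys (sale_id, customer_email, line_no) are illustrative.
    """
    sales, customers, items = [], [], []
    for doc in docs:
        sale_id = str(doc["_id"])
        cust = doc.get("customer", {})
        sales.append({
            "sale_id": sale_id,
            "sale_date": doc.get("saleDate"),
            "store_location": doc.get("storeLocation"),
            "customer_email": cust.get("email"),
            "coupon_used": doc.get("couponUsed"),
            "purchase_method": doc.get("purchaseMethod"),
        })
        customers.append({
            "email": cust.get("email"),
            "gender": cust.get("gender"),
            "age": cust.get("age"),
        })
        # Nested items array becomes one row per line item,
        # linked back to the sale by sale_id.
        for i, item in enumerate(doc.get("items", [])):
            items.append({
                "sale_id": sale_id,
                "line_no": i,
                "name": item.get("name"),
                "price": item.get("price"),
                "quantity": item.get("quantity"),
            })
    return (
        pd.DataFrame(sales),
        pd.DataFrame(customers).drop_duplicates(subset="email"),
        pd.DataFrame(items),
    )
```

Deduplicating customers by email is one simple way to avoid redundancy; you may prefer a surrogate key instead.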
Once the data is loaded into your relational tables, perform one or two small analyses or investigations to demonstrate the value of your transformation.
Choose one or more of the following examples — or create your own analysis if you prefer:
- Top 10 products by total sales revenue
- Average basket size (number of items per order)
- Total revenue per store location
- Purchase methods breakdown (Online vs. In-store)
- Coupon usage rate by customer gender or age group
- Detect missing or inconsistent customer data
- Check for negative or zero prices or quantities
- Validate that total order value equals the sum of item-level prices × quantities
If you wish, include some visualisations (matplotlib/seaborn/pandas) — e.g., a bar chart of revenue by store or a histogram of customer age distribution.
Visuals are optional, but they help illustrate insights clearly.
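For instance, the top-products analysis can be a single SQL query; the sketch below assumes an items(name, price, quantity) table matching the suggested schema, which is an assumption rather than a prescribed design:

```python
import sqlite3

def top_products(conn, limit=10):
    """Top products by total revenue, assuming an
    items(name, price, quantity) table exists in the database."""
    query = """
        SELECT name, ROUND(SUM(price * quantity), 2) AS revenue
        FROM items
        GROUP BY name
        ORDER BY revenue DESC
        LIMIT ?;
    """
    return conn.execute(query, (limit,)).fetchall()
```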
Each sales document looks roughly like this:
```json
{
  "_id": {"$oid": "5bd761dcae323e45a93ccfe8"},
  "saleDate": {"$date": "2017-11-28T05:00:00Z"},
  "items": [
    {"name": "envelopes", "tags": ["stationary"], "price": 3.19, "quantity": 3},
    {"name": "notepads", "tags": ["office"], "price": 12.95, "quantity": 1}
  ],
  "storeLocation": "Denver",
  "customer": {"gender": "F", "age": 36, "email": "someone@example.com"},
  "couponUsed": false,
  "purchaseMethod": "Online"
}
```

This dataset contains roughly 5,000 documents, but you can work with a smaller subset (e.g., the first 500 records) for simplicity.
We'd like to see how you think about running this pipeline as a repeatable service. Include the following:
1. Docker Setup
- Create a Dockerfile and docker-compose.yml
- Include both your ETL app and the database
- Should run with docker-compose up
2. Configuration
- Externalize configuration (database connection, file paths, etc.)
- Use environment variables or a config file
- No hardcoded credentials
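A minimal way to externalize configuration in Python is environment variables with local defaults; the variable names (ETL_DB_URL, etc.) and defaults below are illustrative, not prescribed:

```python
import os

# Read settings from the environment, falling back to safe local
# defaults so the pipeline runs out of the box. No credentials are
# hardcoded; in Docker these would come from docker-compose.yml.
DB_URL = os.getenv("ETL_DB_URL", "sqlite:///supplies.db")
INPUT_PATH = os.getenv("ETL_INPUT_PATH", "data/sample_supplies.json")
LOG_LEVEL = os.getenv("ETL_LOG_LEVEL", "INFO")
```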
3. Re-runnability
- Pipeline should handle being run multiple times
- Choose an approach: truncate-and-reload, upserts, or incremental processing
- Briefly document your choice in the README
4. Basic Logging
- Use Python's logging module
- Log key steps: records processed, any errors encountered
5. Simple Automation (Optional)
- A Makefile or shell script to make running easier
- Example: make run to execute the whole pipeline
- Language: Python 3
- Libraries: Pandas, SQLAlchemy, or similar tools of your choice
- Database: SQLite (for simplicity) or Postgres
- Structure: Modular, reproducible code (e.g., extract.py, transform.py, load.py, analyze.py)
- Documentation: Include a README.md describing:
  - How to run your pipeline
  - Your schema and design choices
  - Key assumptions and data-quality considerations
  - The analysis you performed and what insights you found
```
supplies_data_etl/
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── README.md
├── src/
│   ├── extract.py
│   ├── transform.py
│   ├── load.py
│   └── main.py
└── data/
    └── sample_supplies.json
```
Please submit either:
- A zipped project folder, or
- A GitHub repository containing:
- Source code
- A small sample of the dataset (subset of JSON if needed)
- A README.md with setup and instructions
| Category | What We're Looking For |
|---|---|
| Data Modeling | Logical relational schema; handling of nested JSON; data integrity |
| Code Quality | Clean, modular, well-documented Python code |
| Transformation Logic | Correct flattening and joining of nested data |
| Analytical Thinking | Clear and meaningful insights drawn from transformed data |
| Containerization | Basic Docker setup; easy to run; configuration externalized |
| Re-runnability | Pipeline can execute multiple times; documented approach |
| Documentation & Communication | Clarity of explanation; ability to discuss design trade-offs |
- Expected time: up to you, but we expect around 3 hours
- Presentation: ~30 minutes live walkthrough
- Focus: Solid fundamentals, clear thinking, and basic operational awareness
🗣️ Live Discussion
During the follow-up interview, we'll ask you to:
- Walk us through your ETL pipeline design
- Explain schema and transformation decisions
- Share your analytical findings
- Demo running the pipeline via Docker (just show it works!)
- Discuss your re-runnability approach (why did you choose it?)
- Talk about how you'd scale or improve this with more time (Airflow, dbt, cloud deployment, etc.)
Build an ETL pipeline that transforms MongoDB's sample_supplies.sales dataset into relational SQL tables, containerize it with Docker for easy deployment, make it re-runnable, and demonstrate some analytics on the transformed data. This shows both your data engineering skills and your awareness of operational best practices.