100 changes: 100 additions & 0 deletions PROBLEMS.md


29 changes: 29 additions & 0 deletions Problem 10: Slowly Changing Dimensions/question.md
## Problem 10: Slowly Changing Dimensions

**Scenario:**
A billing team runs a report that says "total revenue per region per month." A customer who lived in Singapore last year moved to Malaysia in March. The report now shows all of their historical revenue under Malaysia, including invoices from when they were still in Singapore. The finance lead is upset because the regional numbers for 2024 just silently changed.

This is the classic slowly changing dimension problem.

In the interview, the question is:

> What is a slowly changing dimension and why does it matter when the business asks "what did this customer look like last year"?

---

### Your Task:

1. Explain what a slowly changing dimension (SCD) is.
2. Describe the common SCD types (Type 1, Type 2, Type 3) in plain words.
3. Show a small example table for each.
4. Explain which type you would pick for a customer's address and why.

---

### What a Good Answer Covers:

* The difference between a fact and a dimension (briefly).
* Why "current state" is not enough.
* The trade-off between storage and history.
* Type 2 as the most common real-world answer.
* The "as-of" join pattern.
141 changes: 141 additions & 0 deletions Problem 10: Slowly Changing Dimensions/solution.md
## Solution 10: Slowly Changing Dimensions

### Short version you can say out loud

> A slowly changing dimension is a dimension table whose values change occasionally, not constantly. The big question is what you do when one of those values changes: do you overwrite, or do you keep history? The answer depends on whether reports need to know what the value used to be. For things like a customer's address, where past invoices need to stay correct, you almost always keep history. That is called SCD Type 2, and it is the most common pattern in real warehouses.

### Why this matters

The story above is a real failure mode. If you store only the current address, every historical query silently rewrites the past. Yesterday's report no longer matches today's, even though you only loaded new data and never touched the old rows. That destroys trust in the warehouse.

```
Without history (broken)
─────────────────────────
customer_id │ name │ country
1001 │ Alice │ MY ← changed from SG in March

Old invoice from January says customer 1001.
Report joins to dimension, gets MY.
The January invoice now appears under MY.
Finance: "Why did SG's January number drop?"


With history (SCD Type 2)
─────────────────────────
customer_id │ name │ country │ valid_from │ valid_to │ is_current
1001 │ Alice │ SG │ 2023-01-01 │ 2025-03-15 │ false
1001 │ Alice │ MY │ 2025-03-15 │ 9999-12-31 │ true

January invoice (date = 2025-01-20) joins to the row valid that day.
It correctly appears under SG.
```

### The classic SCD types in plain words

**Type 0 — never change.** The value is set once and frozen. Used for things like a customer's original signup country, or the date they joined. Rare but useful.

**Type 1 — overwrite.** When the value changes, you replace it. No history kept. Used for things where history does not matter, like a typo fix in a name.

**Type 2 — add a new row.** When the value changes, you keep the old row and add a new one. You add columns like `valid_from`, `valid_to`, and `is_current`. Every fact joins to the dimension row that was valid at the fact's timestamp. This is by far the most common.

**Type 3 — add a column.** You keep one extra column like `previous_country`. Useful when you only care about the most recent change, not the full history. Rare in modern warehouses.

There are higher types (Type 4 with mini-dimensions, Type 6 hybrid), but in interviews, knowing Types 1, 2, and 3 well is enough.

### Concrete shapes

**Type 1 (overwrite)**

```
customers
─────────────────────────────────────
customer_id │ name │ country
1001 │ Alice Lee │ MY ← overwritten
1002 │ Bob Khan │ SG
```

After Alice moves, the old `SG` value is gone. Cheap to store. History lost.

**Type 2 (add row, version with dates)**

```
customers_history
──────────────────────────────────────────────────────────────────────
customer_id │ name │ country │ valid_from │ valid_to │ is_current
1001 │ Alice Lee │ SG │ 2023-01-01 │ 2025-03-15 │ false
1001 │ Alice Lee │ MY │ 2025-03-15 │ 9999-12-31 │ true
1002 │ Bob Khan │ SG │ 2024-05-10 │ 9999-12-31 │ true
```

Every change adds a new row. `valid_from` / `valid_to` mark the period it was true. `is_current` is a convenience flag so queries that want "right now" do not have to use `9999-12-31`.
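
The maintenance step for Type 2 is always the same two moves: close the currently open row, then open a new one starting on the change date. A minimal sketch of that logic in Python with SQLite (table and column names follow the example above; `apply_change` is an illustrative helper, not production MERGE logic, and it assumes a current row already exists):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers_history (
        customer_id INTEGER, name TEXT, country TEXT,
        valid_from TEXT, valid_to TEXT, is_current INTEGER
    )
""")
conn.execute(
    "INSERT INTO customers_history VALUES "
    "(1001, 'Alice Lee', 'SG', '2023-01-01', '9999-12-31', 1)"
)

def apply_change(conn, customer_id, new_country, change_date):
    # Look up the open row so the new row can carry the unchanged columns.
    row = conn.execute(
        "SELECT name FROM customers_history "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    # 1. Close the currently open row on the change date.
    conn.execute(
        "UPDATE customers_history SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    # 2. Open a new row starting on the change date.
    conn.execute(
        "INSERT INTO customers_history VALUES (?, ?, ?, ?, '9999-12-31', 1)",
        (customer_id, row[0], new_country, change_date),
    )

apply_change(conn, 1001, "MY", "2025-03-15")
for r in conn.execute(
    "SELECT country, valid_from, valid_to, is_current "
    "FROM customers_history ORDER BY valid_from"
):
    print(r)
# ('SG', '2023-01-01', '2025-03-15', 0)
# ('MY', '2025-03-15', '9999-12-31', 1)
```

Note that the old row's `valid_to` equals the new row's `valid_from`; the half-open join pattern later in this solution depends on that.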

**Type 3 (one extra column)**

```
customers
─────────────────────────────────────────────────
customer_id │ name │ country │ previous_country
1001 │ Alice Lee │ MY │ SG
1002 │ Bob Khan │ SG │ NULL
```

Tracks one prior value only. Useful for "did this customer recently move."

### How to query Type 2 (the as-of join)

This is the join pattern you will draw on the whiteboard:

```sql
SELECT
i.invoice_id,
i.amount,
c.country AS country_at_invoice_time
FROM invoices i
LEFT JOIN customers_history c
ON c.customer_id = i.customer_id
AND i.invoice_date >= c.valid_from
AND i.invoice_date < c.valid_to;
```

Two important details:

1. The interval is `[valid_from, valid_to)`. Half-open. This avoids the row appearing in both the old and the new period on the exact change date.
2. You join on **date inside range**, not on `is_current`. Using `is_current` would re-introduce the original bug.
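
A quick way to convince yourself the half-open interval works: join one invoice from before the move and one dated exactly on the change date, and check that each matches exactly one dimension row. A self-contained sketch (names follow the tables above, run here with SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_history (
        customer_id INT, country TEXT, valid_from TEXT, valid_to TEXT);
    INSERT INTO customers_history VALUES
        (1001, 'SG', '2023-01-01', '2025-03-15'),
        (1001, 'MY', '2025-03-15', '9999-12-31');
    CREATE TABLE invoices (invoice_id INT, customer_id INT, invoice_date TEXT);
    INSERT INTO invoices VALUES
        (1, 1001, '2025-01-20'),   -- before the move
        (2, 1001, '2025-03-15');   -- exactly on the change date
""")

# ISO-8601 strings compare correctly as text, so the range join works as-is.
rows = conn.execute("""
    SELECT i.invoice_id, c.country
    FROM invoices i
    LEFT JOIN customers_history c
      ON c.customer_id = i.customer_id
     AND i.invoice_date >= c.valid_from
     AND i.invoice_date <  c.valid_to
    ORDER BY i.invoice_id
""").fetchall()
print(rows)  # [(1, 'SG'), (2, 'MY')] — each invoice matches exactly one row
```

Invoice 2 lands on the boundary date and matches only the new `MY` row, because `<` (not `<=`) excludes it from the old row.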

### Which type to pick

| Field | Likely type | Why |
| ---------------------------- | ----------- | --------------------------------------------------------- |
| Customer address / country | Type 2 | Past invoices must keep their original region |
| Customer's display name | Type 1 | A typo fix should fix old reports too |
| Subscription plan tier | Type 2 | Revenue reporting depends on which plan they had when |
| Currency code of an account | Type 2 | Historical balances were stored in that currency |
| Most recent campaign source | Type 3 | We only care about "the one before this one" |
| Their original signup date | Type 0 | Set once, never changes |

### Common mistakes interviewers want you to name

1. **Storing only current state.** The "moving customer breaks historical region revenue" story.
2. **Type 2 done wrong** by joining on `is_current` instead of date ranges. Same bug, fancier table.
3. **Overlapping validity ranges** because two updates landed in the same second and you forgot to close the previous row before opening a new one.
4. **Type 2 explosion** when you flag too many columns as "track history." Pick the few that really need it. Otherwise the dimension grows to billions of rows for no reason.
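
Mistake 3 is cheap to catch with a self-join data-quality check: flag any two rows for the same key whose validity ranges intersect. A sketch under the same table shape as above (SQLite via Python; the seeded data deliberately contains a one-day overlap):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_history (
        customer_id INT, country TEXT, valid_from TEXT, valid_to TEXT);
    INSERT INTO customers_history VALUES
        (1001, 'SG', '2023-01-01', '2025-03-16'),  -- bug: closed a day too late
        (1001, 'MY', '2025-03-15', '9999-12-31');
""")

# Two half-open ranges [a_from, a_to) and [b_from, b_to) overlap
# exactly when a_from < b_to AND b_from < a_to.
overlaps = conn.execute("""
    SELECT a.customer_id, a.valid_from, b.valid_from
    FROM customers_history a
    JOIN customers_history b
      ON a.customer_id = b.customer_id
     AND a.valid_from < b.valid_from   -- count each pair once
     AND b.valid_from < a.valid_to     -- b starts before a ends
""").fetchall()
print(overlaps)  # one offending pair for customer 1001
```

Run as a scheduled test, this should return zero rows; anything else means the load job opened a row without closing the previous one correctly.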

### Bonus follow-up the interviewer might throw

> *"How would you actually build a Type 2 table in dbt?"*

dbt has a built-in `snapshot` materialisation that does exactly this. You declare:

```sql
{% snapshot customers_history %}
{{ config(
target_schema='snapshots',
unique_key='customer_id',
strategy='check',
check_cols=['country', 'plan_tier']
) }}
SELECT * FROM {{ source('app', 'customers') }}
{% endsnapshot %}
```

Every time it runs, dbt diffs the source against the snapshot. New or changed rows get a fresh `dbt_valid_from` and the previous row's `dbt_valid_to` is closed. It is the easiest way to get Type 2 in a modern warehouse without writing the MERGE logic by hand.
29 changes: 29 additions & 0 deletions Problem 11: Data Contracts in Plain Words/question.md
## Problem 11: Data Contracts in Plain Words

**Scenario:**
A producer team renames a column from `user_id` to `userId` in their event stream as part of a refactor. They do not tell anyone. Three downstream pipelines break overnight, including the daily revenue report. After the postmortem, leadership asks: how do we stop this from happening every quarter?

The answer everyone keeps mentioning is "data contracts."

In the interview, the question is:

> What is a data contract, in plain words, and why are companies suddenly talking about them?

---

### Your Task:

1. Explain what a data contract is and what it is not.
2. Explain why this conversation is happening now.
3. Sketch what a real data contract looks like (columns, types, rules).
4. Explain how it gets enforced in practice.

---

### What a Good Answer Covers:

* The shift from "data is a side effect" to "data is a product."
* The contract as an agreement between a producer and a consumer.
* Schema, semantics, freshness, ownership.
* Where it gets enforced: producer side, ingest side, CI checks.
* Why it usually fails when it's only a document.
123 changes: 123 additions & 0 deletions Problem 11: Data Contracts in Plain Words/solution.md
## Solution 11: Data Contracts in Plain Words

### Short version you can say out loud

> A data contract is an explicit agreement between the team that produces data and the teams that consume it. It says what fields will be there, what types they will be, what they mean, how fresh they will be, and who owns them. It is the same idea as an API contract between two services, just applied to data. People are talking about it now because data has become a product, and treating it like a side effect of the app keeps breaking downstream systems.

### Why now

For most of the last 20 years, data was a by-product of the application. Engineers built features and the data team scraped whatever ended up in the database. When the app team changed a column, the data team found out by the dashboard breaking the next morning. That worked when there were two analysts and one report. It does not work now, because data feeds machine learning models, billing, regulators, and live customer features. The cost of a breaking change is much higher.

Data contracts are the industry trying to apply software engineering discipline (interfaces, versioning, tests, ownership) to data the same way we did to microservice APIs ten years ago.

### What a contract actually contains

```
┌────────────────────────────────────────┐
│ DATA CONTRACT │
│ │
│ Schema fields, types, nullability │
│ Semantics what each field means │
│ Quality rules and SLAs │
│ Freshness how often, how late │
│ Owner team and on-call │
│ Version semver, deprecation policy │
└─────────────┬──────────────────────────┘
┌────────────────┴────────────────┐
▼ ▼
Producer team Consumer teams
(app backend) (analytics, ML, finance)
```

A typical YAML contract might look like:

```yaml
name: orders
version: 1.3.0
owner: checkout-team
sla:
freshness: 5 minutes from event time
availability: 99.9%
schema:
- name: order_id
type: string
required: true
description: A unique id for the order. Stable across retries.
- name: customer_id
type: int64
required: true
- name: amount_cents
type: int64
required: true
description: Charged amount in the smallest unit of the currency.
- name: currency
type: string
required: true
constraints:
enum: [SGD, MYR, IDR, USD]
- name: created_at
type: timestamp
required: true
quality:
- rule: amount_cents > 0
- rule: order_id is unique
- rule: no more than 0.01% of rows missing currency
breaking_changes_policy: 6 months deprecation window
```

It is the same shape as a Protobuf schema, an OpenAPI spec, or an Avro schema, plus extra metadata about ownership and SLA.

### What a contract is NOT

* It is **not** just a document on Confluence. A document does not catch a renamed column at 3 AM.
* It is **not** the same as a schema. A schema only describes shape. A contract also covers meaning, ownership and freshness.
* It is **not** a one-way wish list from the consumer. Both sides have to agree, because the producer takes on the cost of stability.

### Where it gets enforced

The whole point is that the contract is **machine readable** and **checked automatically**. Three common enforcement points:

```
┌─────────┐ 1 ┌──────────┐ 2 ┌──────────┐ 3 ┌──────────┐
│Producer │──────▶│ Kafka / │──────▶│ Warehouse│─────▶│ Consumer │
│ code │ │ S3 │ │ │ │ code │
└─────────┘ └──────────┘ └──────────┘ └──────────┘
│ │ │
▼ ▼ ▼
1. CI check in 2. Schema 3. dbt tests against
producer repo registry the contract on
(a renamed (Avro / Protobuf, every model run.
column fails rejects messages
the build) that don't match
the registered
schema)
```

* **Producer side.** A CI test fails the build if a code change would break the contract. This is the most valuable spot, because it catches the issue before it leaves the producer team.
* **Ingest side.** A schema registry (Confluent, Apicurio, Glue Schema Registry) rejects events that don't match the registered schema. This catches drift between code and reality.
* **Consumer side.** dbt tests or Great Expectations checks validate the data on arrival. Last line of defence.
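
The producer-side check can be as small as diffing the schema the code actually emits against the contract file. A hypothetical minimal sketch (the contract fields follow the `orders` YAML above, parsed here as a dict to stay self-contained; in CI you would load the YAML, and `emitted_schema` stands in for whatever the producer's serializer reports):

```python
# Parsed form of the orders contract above (load from YAML in a real CI job).
contract = {
    "order_id":     {"type": "string",    "required": True},
    "customer_id":  {"type": "int64",     "required": True},
    "amount_cents": {"type": "int64",     "required": True},
    "currency":     {"type": "string",    "required": True},
    "created_at":   {"type": "timestamp", "required": True},
}

# What the producer's code would emit after a careless rename refactor.
emitted_schema = {
    "orderId":      "string",    # renamed — breaking change!
    "customer_id":  "int64",
    "amount_cents": "int64",
    "currency":     "string",
    "created_at":   "timestamp",
}

def contract_violations(contract, emitted):
    """Return a list of human-readable contract breaches."""
    errors = []
    for field, spec in contract.items():
        if field not in emitted:
            errors.append(f"missing required field: {field}")
        elif emitted[field] != spec["type"]:
            errors.append(
                f"type change on {field}: {spec['type']} -> {emitted[field]}"
            )
    return errors

violations = contract_violations(contract, emitted_schema)
print(violations)  # ['missing required field: order_id'] — CI fails the build
```

The point is not this exact code but the enforcement position: the rename is rejected in the producer's own pull request, before any event leaves their system.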

### How a real change happens with contracts

The producer team wants to rename `user_id` to `userId`. With a contract in place:

1. They open a pull request that changes the contract: `user_id` is now deprecated in version 1.4, `userId` is added.
2. CI runs the consumer test list against the new contract. It tells them which downstream models reference `user_id` (8 of them).
3. The contract says the deprecation window is 6 months. They cannot remove `user_id` for 6 months. They keep emitting both fields during that window.
4. Consumers migrate at their own pace. After 6 months, the field is removed.

The dashboard never breaks at 3 AM, because the system enforced the agreement.
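
Step 1 of that flow can be sketched as a contract change. A hypothetical fragment of the events contract at version 1.4 (field and key names here are illustrative, in the same style as the YAML contract above):

```yaml
# events contract, version 1.4.0 — rename with a deprecation window
version: 1.4.0
schema:
  - name: user_id
    type: string
    required: true
    deprecated: true            # still emitted during the window
    removal_after: 2026-06-01   # 6-month deprecation policy
    replaced_by: userId
  - name: userId
    type: string
    required: true
```

Because both fields are declared, the producer must emit both until the removal date, and consumers can migrate on their own schedule.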

### Common mistakes

1. **"We have a contract" but it lives in a Google Doc.** Not enforced is not a contract.
2. **The contract is owned by the data team, not the producer team.** Producers will not feel responsible, so it drifts.
3. **No deprecation window.** Producers will still break consumers because they can change the schema instantly.
4. **Treating semantic changes as non-breaking.** Changing the *meaning* of `amount` from "net" to "gross" is a breaking change even if the type and name stay the same.

### Bonus follow-up the interviewer might throw

> *"What is the difference between a data contract and a schema registry?"*

A schema registry enforces shape. The data contract is the bigger agreement that *contains* a schema and adds meaning, ownership, freshness and quality rules. In practice you usually have both: the contract lives in source control, and at runtime, the registry enforces the schema piece of it.
27 changes: 27 additions & 0 deletions Problem 12: Parquet vs CSV vs JSON/question.md
## Problem 12: Parquet vs CSV vs JSON

**Scenario:**
Your team is choosing the storage format for a 5 TB events archive in S3. One engineer wants CSV "because everything reads it." Another wants JSON "because that's how the events arrive." You suggest Parquet. The team has not used it before and asks you to explain.

In the interview, the question is:

> When would you use Parquet, CSV, or JSON for storing data, and how would you explain Parquet to someone who has never heard of it?

---

### Your Task:

1. Explain the three formats in plain words.
2. Compare them on size, query speed, schema, and tooling.
3. Explain why Parquet is so popular for analytics.
4. Say when you would actually pick CSV or JSON over Parquet.

---

### What a Good Answer Covers:

* Row-oriented vs column-oriented storage.
* Compression and predicate pushdown.
* The schema-on-write vs schema-on-read difference.
* Real numbers: same dataset in CSV vs Parquet sizes.
* Honest cases where Parquet is the wrong call.