100 changes: 100 additions & 0 deletions PROBLEMS.md


29 changes: 29 additions & 0 deletions Problem 10: Slowly Changing Dimensions/question.md
## Problem 10: Slowly Changing Dimensions

**Scenario:**
A billing team runs a report that says "total revenue per region per month." A customer who lived in Singapore last year moved to Malaysia in March. The report now shows all of their historical revenue under Malaysia, including invoices from when they were still in Singapore. The finance lead is upset because the regional numbers for 2024 just silently changed.

This is the classic slowly changing dimension problem.

In the interview, the question is:

> What is a slowly changing dimension and why does it matter when the business asks "what did this customer look like last year"?

---

### Your Task:

1. Explain what a slowly changing dimension (SCD) is.
2. Describe the common SCD types (Type 1, Type 2, Type 3) in plain words.
3. Show a small example table for each.
4. Explain which type you would pick for a customer's address and why.

---

### What a Good Answer Covers:

* The difference between a fact and a dimension (briefly).
* Why "current state" is not enough.
* The trade-off between storage and history.
* Type 2 as the most common real-world answer.
* The "as-of" join pattern.
141 changes: 141 additions & 0 deletions Problem 10: Slowly Changing Dimensions/solution.md
## Solution 10: Slowly Changing Dimensions

### Short version you can say out loud

> A slowly changing dimension is a dimension table whose values change occasionally, not constantly. The big question is what you do when one of those values changes: do you overwrite, or do you keep history? The answer depends on whether reports need to know what the value used to be. For things like a customer's address, where past invoices need to stay correct, you almost always keep history. That is called SCD Type 2, and it is the most common pattern in real warehouses.

### Why this matters

The story above is a real failure mode. If you store only the current address, every historical query silently rewrites the past. Yesterday's report no longer matches today's, even though you only loaded new data and never touched the old rows. That destroys trust in the warehouse.

```
Without history (broken)
─────────────────────────
customer_id │ name │ country
1001 │ Alice │ MY ← changed from SG in March

Old invoice from January says customer 1001.
Report joins to dimension, gets MY.
The January invoice now appears under MY.
Finance: "Why did SG's January number drop?"


With history (SCD Type 2)
─────────────────────────
customer_id │ name │ country │ valid_from │ valid_to │ is_current
1001 │ Alice │ SG │ 2023-01-01 │ 2025-03-15 │ false
1001 │ Alice │ MY │ 2025-03-15 │ 9999-12-31 │ true

January invoice (date = 2025-01-20) joins to the row valid that day.
It correctly appears under SG.
```

### The classic SCD types in plain words

**Type 0 — never change.** The value is set once and frozen. Used for things like a customer's original signup country, or the date they joined. Rare but useful.

**Type 1 — overwrite.** When the value changes, you replace it. No history kept. Used for things where history does not matter, like a typo fix in a name.

**Type 2 — add a new row.** When the value changes, you keep the old row and add a new one. You add columns like `valid_from`, `valid_to`, and `is_current`. Every fact joins to the dimension row that was valid at the fact's timestamp. This is by far the most common.

**Type 3 — add a column.** You keep one extra column like `previous_country`. Useful when you only care about the most recent change, not the full history. Rare in modern warehouses.

There are higher types (Type 4 with mini-dimensions, Type 6 hybrid), but in interviews, knowing Types 1, 2, and 3 well is enough.

### Concrete shapes

**Type 1 (overwrite)**

```
customers
─────────────────────────────────────
customer_id │ name │ country
1001 │ Alice Lee │ MY ← overwritten
1002 │ Bob Khan │ SG
```

After Alice moves, the old `SG` value is gone. Cheap to store. History lost.

**Type 2 (add row, version with dates)**

```
customers_history
──────────────────────────────────────────────────────────────────────
customer_id │ name │ country │ valid_from │ valid_to │ is_current
1001 │ Alice Lee │ SG │ 2023-01-01 │ 2025-03-15 │ false
1001 │ Alice Lee │ MY │ 2025-03-15 │ 9999-12-31 │ true
1002 │ Bob Khan │ SG │ 2024-05-10 │ 9999-12-31 │ true
```

Every change adds a new row. `valid_from` / `valid_to` mark the period it was true. `is_current` is a convenience flag so queries that want "right now" do not have to use `9999-12-31`.
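
The maintenance step for Type 2 is always the same two moves: close the currently open row, then open a new one starting on the change date. A minimal sketch of that logic in Python with SQLite (table and column names follow the example above; `apply_change` is an illustrative helper, not production MERGE logic, and it assumes a current row already exists):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers_history (
        customer_id INTEGER, name TEXT, country TEXT,
        valid_from TEXT, valid_to TEXT, is_current INTEGER
    )
""")
conn.execute(
    "INSERT INTO customers_history VALUES "
    "(1001, 'Alice Lee', 'SG', '2023-01-01', '9999-12-31', 1)"
)

def apply_change(conn, customer_id, new_country, change_date):
    # Look up the open row so the new row can carry the unchanged columns.
    row = conn.execute(
        "SELECT name FROM customers_history "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    # 1. Close the currently open row on the change date.
    conn.execute(
        "UPDATE customers_history SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    # 2. Open a new row starting on the change date.
    conn.execute(
        "INSERT INTO customers_history VALUES (?, ?, ?, ?, '9999-12-31', 1)",
        (customer_id, row[0], new_country, change_date),
    )

apply_change(conn, 1001, "MY", "2025-03-15")
for r in conn.execute(
    "SELECT country, valid_from, valid_to, is_current "
    "FROM customers_history ORDER BY valid_from"
):
    print(r)
# ('SG', '2023-01-01', '2025-03-15', 0)
# ('MY', '2025-03-15', '9999-12-31', 1)
```

Note that the old row's `valid_to` equals the new row's `valid_from`; the half-open join pattern later in this solution depends on that.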

**Type 3 (one extra column)**

```
customers
─────────────────────────────────────────────────
customer_id │ name │ country │ previous_country
1001 │ Alice Lee │ MY │ SG
1002 │ Bob Khan │ SG │ NULL
```

Tracks one prior value only. Useful for "did this customer recently move."

### How to query Type 2 (the as-of join)

This is the join pattern you will draw on the whiteboard:

```sql
SELECT
i.invoice_id,
i.amount,
c.country AS country_at_invoice_time
FROM invoices i
LEFT JOIN customers_history c
ON c.customer_id = i.customer_id
AND i.invoice_date >= c.valid_from
AND i.invoice_date < c.valid_to;
```

Two important details:

1. The interval is `[valid_from, valid_to)`. Half-open. This avoids the row appearing in both the old and the new period on the exact change date.
2. You join on **date inside range**, not on `is_current`. Using `is_current` would re-introduce the original bug.
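
A quick way to convince yourself the half-open interval works: join one invoice from before the move and one dated exactly on the change date, and check that each matches exactly one dimension row. A self-contained sketch (names follow the tables above, run here with SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_history (
        customer_id INT, country TEXT, valid_from TEXT, valid_to TEXT);
    INSERT INTO customers_history VALUES
        (1001, 'SG', '2023-01-01', '2025-03-15'),
        (1001, 'MY', '2025-03-15', '9999-12-31');
    CREATE TABLE invoices (invoice_id INT, customer_id INT, invoice_date TEXT);
    INSERT INTO invoices VALUES
        (1, 1001, '2025-01-20'),   -- before the move
        (2, 1001, '2025-03-15');   -- exactly on the change date
""")

# ISO-8601 strings compare correctly as text, so the range join works as-is.
rows = conn.execute("""
    SELECT i.invoice_id, c.country
    FROM invoices i
    LEFT JOIN customers_history c
      ON c.customer_id = i.customer_id
     AND i.invoice_date >= c.valid_from
     AND i.invoice_date <  c.valid_to
    ORDER BY i.invoice_id
""").fetchall()
print(rows)  # [(1, 'SG'), (2, 'MY')] — each invoice matches exactly one row
```

Invoice 2 lands on the boundary date and matches only the new `MY` row, because `<` (not `<=`) excludes it from the old row.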

### Which type to pick

| Field | Likely type | Why |
| ---------------------------- | ----------- | --------------------------------------------------------- |
| Customer address / country | Type 2 | Past invoices must keep their original region |
| Customer's display name | Type 1 | A typo fix should fix old reports too |
| Subscription plan tier | Type 2 | Revenue reporting depends on which plan they had when |
| Currency code of an account | Type 2 | Historical balances were stored in that currency |
| Most recent campaign source | Type 3 | We only care about "the one before this one" |
| Their original signup date | Type 0 | Set once, never changes |

### Common mistakes interviewers want you to name

1. **Storing only current state.** The "moving customer breaks historical region revenue" story.
2. **Type 2 done wrong** by joining on `is_current` instead of date ranges. Same bug, fancier table.
3. **Overlapping validity ranges** because two updates landed in the same second and you forgot to close the previous row before opening a new one.
4. **Type 2 explosion** when you flag too many columns as "track history." Pick the few that really need it. Otherwise the dimension grows to billions of rows for no reason.
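
Mistake 3 is cheap to catch with a self-join data-quality check: flag any two rows for the same key whose validity ranges intersect. A sketch under the same table shape as above (SQLite via Python; the seeded data deliberately contains a one-day overlap):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_history (
        customer_id INT, country TEXT, valid_from TEXT, valid_to TEXT);
    INSERT INTO customers_history VALUES
        (1001, 'SG', '2023-01-01', '2025-03-16'),  -- bug: closed a day too late
        (1001, 'MY', '2025-03-15', '9999-12-31');
""")

# Two half-open ranges [a_from, a_to) and [b_from, b_to) overlap
# exactly when a_from < b_to AND b_from < a_to.
overlaps = conn.execute("""
    SELECT a.customer_id, a.valid_from, b.valid_from
    FROM customers_history a
    JOIN customers_history b
      ON a.customer_id = b.customer_id
     AND a.valid_from < b.valid_from   -- count each pair once
     AND b.valid_from < a.valid_to     -- b starts before a ends
""").fetchall()
print(overlaps)  # one offending pair for customer 1001
```

Run as a scheduled test, this should return zero rows; anything else means the load job opened a row without closing the previous one correctly.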

### Bonus follow-up the interviewer might throw

> *"How would you actually build a Type 2 table in dbt?"*

dbt has a built-in `snapshot` materialisation that does exactly this. You declare:

```sql
{% snapshot customers_history %}
{{ config(
target_schema='snapshots',
unique_key='customer_id',
strategy='check',
check_cols=['country', 'plan_tier']
) }}
SELECT * FROM {{ source('app', 'customers') }}
{% endsnapshot %}
```

Every time it runs, dbt diffs the source against the snapshot. New or changed rows get a fresh `dbt_valid_from` and the previous row's `dbt_valid_to` is closed. It is the easiest way to get Type 2 in a modern warehouse without writing the MERGE logic by hand.
29 changes: 29 additions & 0 deletions Problem 11: Data Contracts in Plain Words/question.md
## Problem 11: Data Contracts in Plain Words

**Scenario:**
A producer team renames a column from `user_id` to `userId` in their event stream as part of a refactor. They do not tell anyone. Three downstream pipelines break overnight, including the daily revenue report. After the postmortem, leadership asks: how do we stop this from happening every quarter?

The answer everyone keeps mentioning is "data contracts."

In the interview, the question is:

> What is a data contract, in plain words, and why are companies suddenly talking about them?

---

### Your Task:

1. Explain what a data contract is and what it is not.
2. Explain why this conversation is happening now.
3. Sketch what a real data contract looks like (columns, types, rules).
4. Explain how it gets enforced in practice.

---

### What a Good Answer Covers:

* The shift from "data is a side effect" to "data is a product."
* The contract as an agreement between a producer and a consumer.
* Schema, semantics, freshness, ownership.
* Where it gets enforced: producer side, ingest side, CI checks.
* Why it usually fails when it's only a document.
123 changes: 123 additions & 0 deletions Problem 11: Data Contracts in Plain Words/solution.md
## Solution 11: Data Contracts in Plain Words

### Short version you can say out loud

> A data contract is an explicit agreement between the team that produces data and the teams that consume it. It says what fields will be there, what types they will be, what they mean, how fresh they will be, and who owns them. It is the same idea as an API contract between two services, just applied to data. People are talking about it now because data has become a product, and treating it like a side effect of the app keeps breaking downstream systems.

### Why now

For most of the last 20 years, data was a by-product of the application. Engineers built features and the data team scraped whatever ended up in the database. When the app team changed a column, the data team found out by the dashboard breaking the next morning. That worked when there were two analysts and one report. It does not work now, because data feeds machine learning models, billing, regulators, and live customer features. The cost of a breaking change is much higher.

Data contracts are the industry trying to apply software engineering discipline (interfaces, versioning, tests, ownership) to data the same way we did to microservice APIs ten years ago.

### What a contract actually contains

```
┌────────────────────────────────────────┐
│ DATA CONTRACT │
│ │
│ Schema fields, types, nullability │
│ Semantics what each field means │
│ Quality rules and SLAs │
│ Freshness how often, how late │
│ Owner team and on-call │
│ Version semver, deprecation policy │
└─────────────┬──────────────────────────┘
┌────────────────┴────────────────┐
▼ ▼
Producer team Consumer teams
(app backend) (analytics, ML, finance)
```

A typical YAML contract might look like:

```yaml
name: orders
version: 1.3.0
owner: checkout-team
sla:
freshness: 5 minutes from event time
availability: 99.9%
schema:
- name: order_id
type: string
required: true
description: A unique id for the order. Stable across retries.
- name: customer_id
type: int64
required: true
- name: amount_cents
type: int64
required: true
description: Charged amount in the smallest unit of the currency.
- name: currency
type: string
required: true
constraints:
enum: [SGD, MYR, IDR, USD]
- name: created_at
type: timestamp
required: true
quality:
- rule: amount_cents > 0
- rule: order_id is unique
- rule: no more than 0.01% of rows missing currency
breaking_changes_policy: 6 months deprecation window
```

It is the same shape as a Protobuf schema, an OpenAPI spec, or an Avro schema, plus extra metadata about ownership and SLA.

### What a contract is NOT

* It is **not** just a document on Confluence. A document does not catch a renamed column at 3 AM.
* It is **not** the same as a schema. A schema only describes shape. A contract also covers meaning, ownership and freshness.
* It is **not** a one-way wish list from the consumer. Both sides have to agree, because the producer takes on the cost of stability.

### Where it gets enforced

The whole point is that the contract is **machine readable** and **checked automatically**. Three common enforcement points:

```
┌─────────┐ 1 ┌──────────┐ 2 ┌──────────┐ 3 ┌──────────┐
│Producer │──────▶│ Kafka / │──────▶│ Warehouse│─────▶│ Consumer │
│ code │ │ S3 │ │ │ │ code │
└─────────┘ └──────────┘ └──────────┘ └──────────┘
│ │ │
▼ ▼ ▼
1. CI check in 2. Schema 3. dbt tests against
producer repo registry the contract on
(a renamed (Avro / Protobuf, every model run.
column fails rejects messages
the build) that don't match
the registered
schema)
```

* **Producer side.** A CI test fails the build if a code change would break the contract. This is the most valuable spot, because it catches the issue before it leaves the producer team.
* **Ingest side.** A schema registry (Confluent, Apicurio, Glue Schema Registry) rejects events that don't match the registered schema. This catches drift between code and reality.
* **Consumer side.** dbt tests or Great Expectations checks validate the data on arrival. Last line of defence.
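
The producer-side check can be as small as diffing the schema the code actually emits against the contract file. A hypothetical minimal sketch (the contract fields follow the `orders` YAML above, parsed here as a dict to stay self-contained; in CI you would load the YAML, and `emitted_schema` stands in for whatever the producer's serializer reports):

```python
# Parsed form of the orders contract above (load from YAML in a real CI job).
contract = {
    "order_id":     {"type": "string",    "required": True},
    "customer_id":  {"type": "int64",     "required": True},
    "amount_cents": {"type": "int64",     "required": True},
    "currency":     {"type": "string",    "required": True},
    "created_at":   {"type": "timestamp", "required": True},
}

# What the producer's code would emit after a careless rename refactor.
emitted_schema = {
    "orderId":      "string",    # renamed — breaking change!
    "customer_id":  "int64",
    "amount_cents": "int64",
    "currency":     "string",
    "created_at":   "timestamp",
}

def contract_violations(contract, emitted):
    """Return a list of human-readable contract breaches."""
    errors = []
    for field, spec in contract.items():
        if field not in emitted:
            errors.append(f"missing required field: {field}")
        elif emitted[field] != spec["type"]:
            errors.append(
                f"type change on {field}: {spec['type']} -> {emitted[field]}"
            )
    return errors

violations = contract_violations(contract, emitted_schema)
print(violations)  # ['missing required field: order_id'] — CI fails the build
```

The point is not this exact code but the enforcement position: the rename is rejected in the producer's own pull request, before any event leaves their system.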

### How a real change happens with contracts

The producer team wants to rename `user_id` to `userId`. With a contract in place:

1. They open a pull request that changes the contract: `user_id` is now deprecated in version 1.4, `userId` is added.
2. CI runs the consumer test list against the new contract. It tells them which downstream models reference `user_id` (8 of them).
3. The contract says the deprecation window is 6 months. They cannot remove `user_id` for 6 months. They keep emitting both fields during that window.
4. Consumers migrate at their own pace. After 6 months, the field is removed.

The dashboard never breaks at 3 AM, because the system enforced the agreement.
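
Step 1 of that flow can be sketched as a contract change. A hypothetical fragment of the events contract at version 1.4 (field and key names here are illustrative, in the same style as the YAML contract above):

```yaml
# events contract, version 1.4.0 — rename with a deprecation window
version: 1.4.0
schema:
  - name: user_id
    type: string
    required: true
    deprecated: true            # still emitted during the window
    removal_after: 2026-06-01   # 6-month deprecation policy
    replaced_by: userId
  - name: userId
    type: string
    required: true
```

Because both fields are declared, the producer must emit both until the removal date, and consumers can migrate on their own schedule.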

### Common mistakes

1. **"We have a contract" but it lives in a Google Doc.** Not enforced is not a contract.
2. **The contract is owned by the data team, not the producer team.** Producers will not feel responsible, so it drifts.
3. **No deprecation window.** Producers will still break consumers because they can change the schema instantly.
4. **Treating semantic changes as non-breaking.** Changing the *meaning* of `amount` from "net" to "gross" is a breaking change even if the type and name stay the same.

### Bonus follow-up the interviewer might throw

> *"What is the difference between a data contract and a schema registry?"*

A schema registry enforces shape. The data contract is the bigger agreement that *contains* a schema and adds meaning, ownership, freshness and quality rules. In practice you usually have both: the contract lives in source control, and at runtime, the registry enforces the schema piece of it.
27 changes: 27 additions & 0 deletions Problem 12: Parquet vs CSV vs JSON/question.md
## Problem 12: Parquet vs CSV vs JSON

**Scenario:**
Your team is choosing the storage format for a 5 TB events archive in S3. One engineer wants CSV "because everything reads it." Another wants JSON "because that's how the events arrive." You suggest Parquet. The team has not used it before and asks you to explain.

In the interview, the question is:

> When would you use Parquet, CSV, or JSON for storing data, and how would you explain Parquet to someone who has never heard of it?

---

### Your Task:

1. Explain the three formats in plain words.
2. Compare them on size, query speed, schema, and tooling.
3. Explain why Parquet is so popular for analytics.
4. Say when you would actually pick CSV or JSON over Parquet.

---

### What a Good Answer Covers:

* Row-oriented vs column-oriented storage.
* Compression and predicate pushdown.
* The schema-on-write vs schema-on-read difference.
* Real numbers: same dataset in CSV vs Parquet sizes.
* Honest cases where Parquet is the wrong call.