diff --git a/PROBLEMS.md b/PROBLEMS.md
new file mode 100644
index 0000000..38df9e6
--- /dev/null
+++ b/PROBLEMS.md
@@ -0,0 +1,100 @@
+# Problems Index
+
+A quick overview of every problem in this repo. Use the **Category** and **Topics** columns to filter by what you want to practice. Each row links to the problem statement and the reference solution.
+
+| # | Problem | Category | Topics | Difficulty | Question | Solution |
+|----|------------------------------------------------------------------------|---------------------|---------------------------------------------------------|------------|----------|----------|
+| 1 | Log File Error Analysis | Logs and Monitoring | file streaming, counters, top-N, IoT logs | Easy | [Question](Problem%201%3A%20Log%20File%20Error%20Analysis/question.md) | [Solution](Problem%201%3A%20Log%20File%20Error%20Analysis/solution.py) |
+| 2 | Rolling Average of Sensor Readings | Streaming | rolling window, deque, IoT sensors, real-time | Easy | [Question](Problem%202%3A%20Rolling%20Average%20of%20Sensor%20Readings/question.md) | [Solution](Problem%202%3A%20Rolling%20Average%20of%20Sensor%20Readings/solution.py) |
+| 3 | Transform and Clean Raw Data for Analytics | Data Cleaning | CSV, validation, regex, date checks | Medium | [Question](Problem%203%3A%20Transform%20and%20Clean%20Raw%20Data%20for%20Analytics/question.md) | [Solution](Problem%203%3A%20Transform%20and%20Clean%20Raw%20Data%20for%20Analytics/solution.py) |
+| 4 | Schema Evolution and Validation for Streaming Events | Schema Validation | JSON, schema evolution, type coercion, pydantic | Medium | [Question](Problem%204%3A%20Schema%20Evolution%20%26%20Validation%20for%20Streaming%20Events/question.md) | [Solution](Problem%204%3A%20Schema%20Evolution%20%26%20Validation%20for%20Streaming%20Events/solution.py) |
+| 5 | Merging Messy CSVs from Multiple Partners | Data Integration | CSV, column mapping, date parsing, file walk | Medium | [Question](Problem%205%3A%20Merging%20Messy%20CSVs%20from%20Multiple%20Partners/question.md) | [Solution](Problem%205%3A%20Merging%20Messy%20CSVs%20from%20Multiple%20Partners/solution.py) |
+| 6 | Partitioning vs Clustering in BigQuery | Fundamentals | BigQuery, partitioning, clustering, cost | Easy | [Question](Problem%206%3A%20Partitioning%20vs%20Clustering%20in%20BigQuery/question.md) | [Solution](Problem%206%3A%20Partitioning%20vs%20Clustering%20in%20BigQuery/solution.md) |
+| 7 | ETL vs ELT and Why ELT Won | Fundamentals | ETL, ELT, dbt, warehouse | Easy | [Question](Problem%207%3A%20ETL%20vs%20ELT%20and%20Why%20ELT%20Won/question.md) | [Solution](Problem%207%3A%20ETL%20vs%20ELT%20and%20Why%20ELT%20Won/solution.md) |
+| 8 | OLTP vs OLAP | Fundamentals | OLTP, OLAP, column store, row store | Easy | [Question](Problem%208%3A%20OLTP%20vs%20OLAP/question.md) | [Solution](Problem%208%3A%20OLTP%20vs%20OLAP/solution.md) |
+| 9 | Idempotency in Data Pipelines | Fundamentals | idempotency, retries, MERGE, partitions | Medium | [Question](Problem%209%3A%20Idempotency%20in%20Data%20Pipelines/question.md) | [Solution](Problem%209%3A%20Idempotency%20in%20Data%20Pipelines/solution.md) |
+| 10 | Slowly Changing Dimensions | Fundamentals | SCD, dimensions, history, dbt snapshot | Medium | [Question](Problem%2010%3A%20Slowly%20Changing%20Dimensions/question.md) | [Solution](Problem%2010%3A%20Slowly%20Changing%20Dimensions/solution.md) |
+| 11 | Data Contracts in Plain Words | Fundamentals | data contracts, schema registry, ownership | Medium | [Question](Problem%2011%3A%20Data%20Contracts%20in%20Plain%20Words/question.md) | [Solution](Problem%2011%3A%20Data%20Contracts%20in%20Plain%20Words/solution.md) |
+| 12 | Parquet vs CSV vs JSON | Fundamentals | Parquet, CSV, JSON, columnar storage | Easy | [Question](Problem%2012%3A%20Parquet%20vs%20CSV%20vs%20JSON/question.md) | [Solution](Problem%2012%3A%20Parquet%20vs%20CSV%20vs%20JSON/solution.md) |
+| 13 | Data Lake vs Warehouse vs Lakehouse | Fundamentals | lake, warehouse, lakehouse, Iceberg, Delta | Medium | [Question](Problem%2013%3A%20Data%20Lake%20vs%20Warehouse%20vs%20Lakehouse/question.md) | [Solution](Problem%2013%3A%20Data%20Lake%20vs%20Warehouse%20vs%20Lakehouse/solution.md) |
+| 14 | Exactly Once Delivery | Fundamentals | exactly once, idempotency, Kafka, streaming | Medium | [Question](Problem%2014%3A%20Exactly%20Once%20Delivery/question.md) | [Solution](Problem%2014%3A%20Exactly%20Once%20Delivery/solution.md) |
+| 15 | Teaching SQL Performance to a Junior | SQL Thinking | EXPLAIN, performance, mentoring, optimization | Medium | [Question](Problem%2015%3A%20Teaching%20SQL%20Performance%20to%20a%20Junior/question.md) | [Solution](Problem%2015%3A%20Teaching%20SQL%20Performance%20to%20a%20Junior/solution.md) |
+| 16 | SELECT DISTINCT Hiding Join Bugs | SQL Thinking | DISTINCT, joins, grain, semi-join | Medium | [Question](Problem%2016%3A%20SELECT%20DISTINCT%20Hiding%20Join%20Bugs/question.md) | [Solution](Problem%2016%3A%20SELECT%20DISTINCT%20Hiding%20Join%20Bugs/solution.md) |
+| 17 | Reading an EXPLAIN Plan | SQL Thinking | EXPLAIN, query plan, joins, sort spill | Medium | [Question](Problem%2017%3A%20Reading%20an%20EXPLAIN%20Plan/question.md) | [Solution](Problem%2017%3A%20Reading%20an%20EXPLAIN%20Plan/solution.md) |
+| 18 | CTE vs Subquery | SQL Thinking | CTE, subquery, materialization, recursion | Medium | [Question](Problem%2018%3A%20CTE%20vs%20Subquery/question.md) | [Solution](Problem%2018%3A%20CTE%20vs%20Subquery/solution.md) |
+| 19 | Same Query Different Answers | SQL Thinking | time zones, RLS, session settings, debugging | Medium | [Question](Problem%2019%3A%20Same%20Query%20Different%20Answers/question.md) | [Solution](Problem%2019%3A%20Same%20Query%20Different%20Answers/solution.md) |
+| 20 | Window Functions vs GROUP BY | SQL Thinking | window functions, GROUP BY, running totals, ranking | Medium | [Question](Problem%2020%3A%20Window%20Functions%20vs%20GROUP%20BY/question.md) | [Solution](Problem%2020%3A%20Window%20Functions%20vs%20GROUP%20BY/solution.md) |
+| 21 | Data Platform for an Electricity Retailer | System Design | smart meter, IoT, warehouse, batch | Hard | [Question](Problem%2021%3A%20Data%20Platform%20for%20an%20Electricity%20Retailer/question.md) | [Solution](Problem%2021%3A%20Data%20Platform%20for%20an%20Electricity%20Retailer/solution.md) |
+| 22 | Banking App Monthly Spending Widget | System Design | streaming, CDC, serving store, low latency | Hard | [Question](Problem%2022%3A%20Banking%20App%20Monthly%20Spending%20Widget/question.md) | [Solution](Problem%2022%3A%20Banking%20App%20Monthly%20Spending%20Widget/solution.md) |
+| 23 | Ride Hailing Surge Pricing | System Design | streaming, H3, real-time, pricing | Hard | [Question](Problem%2023%3A%20Ride%20Hailing%20Surge%20Pricing/question.md) | [Solution](Problem%2023%3A%20Ride%20Hailing%20Surge%20Pricing/solution.md) |
+| 24 | Spotify Minutes Listened This Week | System Design | streaming aggregation, KV store, watermarks | Hard | [Question](Problem%2024%3A%20Spotify%20Minutes%20Listened%20This%20Week/question.md) | [Solution](Problem%2024%3A%20Spotify%20Minutes%20Listened%20This%20Week/solution.md) |
+| 25 | Smart Meter to Monthly Bill PDF | System Design | billing, SCD2, idempotency, audit | Hard | [Question](Problem%2025%3A%20Smart%20Meter%20to%20Monthly%20Bill%20PDF/question.md) | [Solution](Problem%2025%3A%20Smart%20Meter%20to%20Monthly%20Bill%20PDF/solution.md) |
+| 26 | Delivery Idle Driver Tracking | System Design | streaming, H3, TTL, geospatial | Hard | [Question](Problem%2026%3A%20Delivery%20Idle%20Driver%20Tracking/question.md) | [Solution](Problem%2026%3A%20Delivery%20Idle%20Driver%20Tracking/solution.md) |
+| 27 | Year in Review Recap | System Design | batch, KV store, CDN, image render | Medium | [Question](Problem%2027%3A%20Year%20in%20Review%20Recap/question.md) | [Solution](Problem%2027%3A%20Year%20in%20Review%20Recap/solution.md) |
+| 28 | Low Balance Notification Pipeline | System Design | batch, idempotency, time zones, notifications | Medium | [Question](Problem%2028%3A%20Low%20Balance%20Notification%20Pipeline/question.md) | [Solution](Problem%2028%3A%20Low%20Balance%20Notification%20Pipeline/solution.md) |
+| 29 | Daily Report Quietly Wrong for Two Weeks | Scenarios | incident, postmortem, comms, data quality | Medium | [Question](Problem%2029%3A%20Daily%20Report%20Quietly%20Wrong%20for%20Two%20Weeks/question.md) | [Solution](Problem%2029%3A%20Daily%20Report%20Quietly%20Wrong%20for%20Two%20Weeks/solution.md) |
+| 30 | Warehouse Cost Doubled in Two Months | Scenarios | cost, governance, comms, INFORMATION_SCHEMA | Medium | [Question](Problem%2030%3A%20Warehouse%20Cost%20Doubled%20in%20Two%20Months/question.md) | [Solution](Problem%2030%3A%20Warehouse%20Cost%20Doubled%20in%20Two%20Months/solution.md) |
+| 31 | The Dashboard is Wrong | Scenarios | trust, comms, vague reports | Easy | [Question](Problem%2031%3A%20The%20Dashboard%20is%20Wrong/question.md) | [Solution](Problem%2031%3A%20The%20Dashboard%20is%20Wrong/solution.md) |
+| 32 | Inheriting a Pipeline No One Owns | Scenarios | ownership, judgement, rewrite-or-not | Medium | [Question](Problem%2032%3A%20Inheriting%20a%20Pipeline%20No%20One%20Owns/question.md) | [Solution](Problem%2032%3A%20Inheriting%20a%20Pipeline%20No%20One%20Owns/solution.md) |
+| 33 | Executive Needs a Number Tomorrow | Scenarios | comms, exec, caveats, prioritization | Medium | [Question](Problem%2033%3A%20Executive%20Needs%20a%20Number%20Tomorrow/question.md) | [Solution](Problem%2033%3A%20Executive%20Needs%20a%20Number%20Tomorrow/solution.md) |
+| 34 | Three Days of Data Lost | Scenarios | Kafka retention, replay, recovery, postmortem | Hard | [Question](Problem%2034%3A%20Three%20Days%20of%20Data%20Lost/question.md) | [Solution](Problem%2034%3A%20Three%20Days%20of%20Data%20Lost/solution.md) |
+| 35 | Lambda vs Cloud Function vs Cloud Run | Cloud Decisions | serverless, AWS, GCP, runtime limits | Medium | [Question](Problem%2035%3A%20Lambda%20vs%20Cloud%20Function%20vs%20Cloud%20Run/question.md) | [Solution](Problem%2035%3A%20Lambda%20vs%20Cloud%20Function%20vs%20Cloud%20Run/solution.md) |
+| 36 | Scheduled Pipeline Pay Only When Run | Cloud Decisions | scheduled jobs, Cloud Run Jobs, AWS Batch | Easy | [Question](Problem%2036%3A%20Scheduled%20Pipeline%20Pay%20Only%20When%20Run/question.md) | [Solution](Problem%2036%3A%20Scheduled%20Pipeline%20Pay%20Only%20When%20Run/solution.md) |
+| 37 | BigQuery vs Snowflake for New Team | Cloud Decisions | BigQuery, Snowflake, pricing model | Medium | [Question](Problem%2037%3A%20BigQuery%20vs%20Snowflake%20for%20New%20Team/question.md) | [Solution](Problem%2037%3A%20BigQuery%20vs%20Snowflake%20for%20New%20Team/solution.md) |
+| 38 | Store Partner Files in S3 or Warehouse | Cloud Decisions | S3, raw layer, audit, schema evolution | Easy | [Question](Problem%2038%3A%20Store%20Partner%20Files%20in%20S3%20or%20Warehouse/question.md) | [Solution](Problem%2038%3A%20Store%20Partner%20Files%20in%20S3%20or%20Warehouse/solution.md) |
+| 39 | Managed Airflow vs Self Hosted | Cloud Decisions | Airflow, MWAA, Composer, Astronomer, Dagster | Medium | [Question](Problem%2039%3A%20Managed%20Airflow%20vs%20Self%20Hosted/question.md) | [Solution](Problem%2039%3A%20Managed%20Airflow%20vs%20Self%20Hosted/solution.md) |
+| 40 | BigQuery Access Control for 50 Person Company | Cloud Decisions | IAM, datasets, groups, RLS, audit | Medium | [Question](Problem%2040%3A%20BigQuery%20Access%20Control%20for%2050%20Person%20Company/question.md) | [Solution](Problem%2040%3A%20BigQuery%20Access%20Control%20for%2050%20Person%20Company/solution.md) |
+| 41 | Tables for an Airbnb Like App | Data Modeling | star schema, SCD2, multi-currency, reviews | Medium | [Question](Problem%2041%3A%20Tables%20for%20an%20Airbnb%20Like%20App/question.md) | [Solution](Problem%2041%3A%20Tables%20for%20an%20Airbnb%20Like%20App/solution.md) |
+| 42 | Tracking Subscription Plan History | Data Modeling | history, valid_from/to, billing, SCD2 | Medium | [Question](Problem%2042%3A%20Tracking%20Subscription%20Plan%20History/question.md) | [Solution](Problem%2042%3A%20Tracking%20Subscription%20Plan%20History/solution.md) |
+| 43 | Mixing Facts and Dimensions | Data Modeling | star schema, SCD2, views, history | Medium | [Question](Problem%2043%3A%20Mixing%20Facts%20and%20Dimensions/question.md) | [Solution](Problem%2043%3A%20Mixing%20Facts%20and%20Dimensions/solution.md) |
+| 44 | Explaining Fact Table Grain | Data Modeling | grain, facts, dimensions, aggregations | Easy | [Question](Problem%2044%3A%20Explaining%20Fact%20Table%20Grain/question.md) | [Solution](Problem%2044%3A%20Explaining%20Fact%20Table%20Grain/solution.md) |
+| 45 | Current State and Full History | Data Modeling | event sourcing, projections, MV, audit | Medium | [Question](Problem%2045%3A%20Current%20State%20and%20Full%20History/question.md) | [Solution](Problem%2045%3A%20Current%20State%20and%20Full%20History/solution.md) |
+| 46 | Region Suddenly Shows Zero Revenue | Debugging | dashboard, joins, SCD, time zones | Medium | [Question](Problem%2046%3A%20Region%20Suddenly%20Shows%20Zero%20Revenue/question.md) | [Solution](Problem%2046%3A%20Region%20Suddenly%20Shows%20Zero%20Revenue/solution.md) |
+| 47 | Airflow Green but Output Empty | Debugging | silent success, idempotency, anomaly checks | Medium | [Question](Problem%2047%3A%20Airflow%20Green%20but%20Output%20Empty/question.md) | [Solution](Problem%2047%3A%20Airflow%20Green%20but%20Output%20Empty/solution.md) |
+| 48 | Query Suddenly 80x Slower | Debugging | EXPLAIN, statistics, plan flip, join strategy | Medium | [Question](Problem%2048%3A%20Query%20Suddenly%2080x%20Slower/question.md) | [Solution](Problem%2048%3A%20Query%20Suddenly%2080x%20Slower/solution.md) |
+| 49 | User Says Data Is Wrong | Debugging | comms, vague reports, triage | Easy | [Question](Problem%2049%3A%20User%20Says%20Data%20Is%20Wrong/question.md) | [Solution](Problem%2049%3A%20User%20Says%20Data%20Is%20Wrong/solution.md) |
+| 50 | Partition Always Ten Percent Smaller | Debugging | anomaly, baselines, patterns, judgement | Medium | [Question](Problem%2050%3A%20Partition%20Always%20Ten%20Percent%20Smaller/question.md) | [Solution](Problem%2050%3A%20Partition%20Always%20Ten%20Percent%20Smaller/solution.md) |
+| 51 | BigQuery Bill Eight Times Higher | Cost & Performance | INFORMATION_SCHEMA, top queries, slot reservation | Medium | [Question](Problem%2051%3A%20BigQuery%20Bill%20Eight%20Times%20Higher/question.md) | [Solution](Problem%2051%3A%20BigQuery%20Bill%20Eight%20Times%20Higher/solution.md) |
+| 52 | Four Hour Spark Job Under One Hour | Cost & Performance | Spark UI, skew, AQE, broadcast joins | Medium | [Question](Problem%2052%3A%20Four%20Hour%20Spark%20Job%20Under%20One%20Hour/question.md) | [Solution](Problem%2052%3A%20Four%20Hour%20Spark%20Job%20Under%20One%20Hour/solution.md) |
+| 53 | Hourly Scan on Daily Data | Cost & Performance | summary tables, MV, refresh, BI tool | Easy | [Question](Problem%2053%3A%20Hourly%20Scan%20on%20Daily%20Data/question.md) | [Solution](Problem%2053%3A%20Hourly%20Scan%20on%20Daily%20Data/solution.md) |
+| 54 | Just Throw More Memory At It | Cost & Performance | upsize, plan inspection, optimization | Medium | [Question](Problem%2054%3A%20Just%20Throw%20More%20Memory%20At%20It/question.md) | [Solution](Problem%2054%3A%20Just%20Throw%20More%20Memory%20At%20It/solution.md) |
+| 55 | Partitioning Clustering Materialized Views | Cost & Performance | partitioning, clustering, MV, BigQuery | Easy | [Question](Problem%2055%3A%20Partitioning%20Clustering%20Materialized%20Views/question.md) | [Solution](Problem%2055%3A%20Partitioning%20Clustering%20Materialized%20Views/solution.md) |
+| 56 | Watermarks in Plain Words | Streaming | watermarks, event time, allowed lateness | Medium | [Question](Problem%2056%3A%20Watermarks%20in%20Plain%20Words/question.md) | [Solution](Problem%2056%3A%20Watermarks%20in%20Plain%20Words/solution.md) |
+| 57 | Kafka Ordering Guarantee | Streaming | Kafka, partition key, ordering, idempotent producer | Medium | [Question](Problem%2057%3A%20Kafka%20Ordering%20Guarantee/question.md) | [Solution](Problem%2057%3A%20Kafka%20Ordering%20Guarantee/solution.md) |
+| 58 | Streaming Consumer Lag Diagnosis | Streaming | lag, back-pressure, skew, Flink UI | Medium | [Question](Problem%2058%3A%20Streaming%20Consumer%20Lag%20Diagnosis/question.md) | [Solution](Problem%2058%3A%20Streaming%20Consumer%20Lag%20Diagnosis/solution.md) |
+| 59 | Onboarding a New Analyst | People & Process | onboarding, mentoring, pairing | Easy | [Question](Problem%2059%3A%20Onboarding%20a%20New%20Analyst/question.md) | [Solution](Problem%2059%3A%20Onboarding%20a%20New%20Analyst/solution.md) |
+| 60 | Metric by Tomorrow vs Doing It Right | People & Process | comms, prioritization, metrics | Easy | [Question](Problem%2060%3A%20Metric%20by%20Tomorrow%20vs%20Doing%20It%20Right/question.md) | [Solution](Problem%2060%3A%20Metric%20by%20Tomorrow%20vs%20Doing%20It%20Right/solution.md) |
+| 61 | Two Teams Disagree on Active User | People & Process | metric ownership, comms, metrics layer | Medium | [Question](Problem%2061%3A%20Two%20Teams%20Disagree%20on%20Active%20User/question.md) | [Solution](Problem%2061%3A%20Two%20Teams%20Disagree%20on%20Active%20User/solution.md) |
+| 62 | Postmortem After a Bad Day | People & Process | postmortem, blameless, action items | Medium | [Question](Problem%2062%3A%20Postmortem%20After%20a%20Bad%20Day/question.md) | [Solution](Problem%2062%3A%20Postmortem%20After%20a%20Bad%20Day/solution.md) |
+| 63 | Inherited Pipeline No Docs No Tests | People & Process | ownership, docs, tests, expectations | Medium | [Question](Problem%2063%3A%20Inherited%20Pipeline%20No%20Docs%20No%20Tests/question.md) | [Solution](Problem%2063%3A%20Inherited%20Pipeline%20No%20Docs%20No%20Tests/solution.md) |
+| 64 | Breaking Change in dbt Model 200 Consumers | People & Process | dbt, deprecation, comms, rollout | Medium | [Question](Problem%2064%3A%20Breaking%20Change%20in%20dbt%20Model%20200%20Consumers/question.md) | [Solution](Problem%2064%3A%20Breaking%20Change%20in%20dbt%20Model%20200%20Consumers/solution.md) |
+| 65 | 4000 DAG Airflow at 90 Percent CPU | People & Process | Airflow, scheduler, parsing, scale-out | Medium | [Question](Problem%2065%3A%204000%20DAG%20Airflow%20at%2090%20Percent%20CPU/question.md) | [Solution](Problem%2065%3A%204000%20DAG%20Airflow%20at%2090%20Percent%20CPU/solution.md) |
+
+---
+
+### Category Legend
+
+| Category | What you practice |
+|---------------------|-------------------------------------------------------------------------|
+| Logs and Monitoring | Parsing and analyzing large log files, counting events, ranking |
+| Streaming | Continuous data, rolling stats, watermarks, ordering, lag |
+| Data Cleaning | Validating, normalizing and rejecting bad rows from raw files |
+| Schema Validation | Handling evolving JSON or Avro schemas without breaking consumers |
+| Data Integration | Combining data from many sources with different shapes and conventions |
+| Fundamentals | Core concepts every data engineer should be able to explain plainly |
+| SQL Thinking | Writing, reading and reasoning about SQL like a senior engineer |
+| System Design | End-to-end pipelines for real consumer and energy-sector products |
+| Scenarios | Tricky real-life situations that test judgement and communication |
+| Cloud Decisions | Picking between AWS / GCP services with clear trade-offs |
+| Data Modeling | Star schemas, history tracking, grain, dimensions |
+| Debugging | Step-by-step investigation of "the number is wrong" style problems |
+| Cost & Performance | Finding waste in queries, jobs, and infrastructure |
+| People & Process | Mentoring, comms, postmortems, ownership, rollouts |
+
+### Difficulty Guide
+
+* **Easy** — A focused warm-up. Solvable or explainable in under an hour.
+* **Medium** — Realistic interview question. Has edge cases that matter.
+* **Hard** — Multi-step or system-design heavy. Closer to a take-home task.
+
+> New problems are added regularly. If you want to contribute, see the [Contribution Guide](CONTRIBUTION.md).
diff --git a/Problem 10: Slowly Changing Dimensions/question.md b/Problem 10: Slowly Changing Dimensions/question.md
new file mode 100644
index 0000000..8c5e7f9
--- /dev/null
+++ b/Problem 10: Slowly Changing Dimensions/question.md
@@ -0,0 +1,29 @@
+## Problem 10: Slowly Changing Dimensions
+
+**Scenario:**
+A billing team runs a report that says "total revenue per region per month." A customer who lived in Singapore last year moved to Malaysia in March. The report now shows all of their historical revenue under Malaysia, including invoices from when they were still in Singapore. The finance lead is upset because the regional numbers for 2024 just silently changed.
+
+This is the classic slowly changing dimension problem.
+
+In the interview, the question is:
+
+> What is a slowly changing dimension and why does it matter when the business asks "what did this customer look like last year"?
+
+---
+
+### Your Task:
+
+1. Explain what a slowly changing dimension (SCD) is.
+2. Describe the common SCD types (Type 1, Type 2, Type 3) in plain words.
+3. Show a small example table for each.
+4. Explain which type you would pick for a customer's address and why.
+
+---
+
+### What a Good Answer Covers:
+
+* The difference between a fact and a dimension (briefly).
+* Why "current state" is not enough.
+* The trade-off between storage and history.
+* Type 2 as the most common real-world answer.
+* The "as-of" join pattern.
diff --git a/Problem 10: Slowly Changing Dimensions/solution.md b/Problem 10: Slowly Changing Dimensions/solution.md
new file mode 100644
index 0000000..a7f81f8
--- /dev/null
+++ b/Problem 10: Slowly Changing Dimensions/solution.md
@@ -0,0 +1,141 @@
+## Solution 10: Slowly Changing Dimensions
+
+### Short version you can say out loud
+
+> A slowly changing dimension is a dimension table whose values change occasionally, not constantly. The big question is what you do when one of those values changes: do you overwrite, or do you keep history? The answer depends on whether reports need to know what the value used to be. For things like a customer's address, where past invoices need to stay correct, you almost always keep history. That is called SCD Type 2, and it is the most common pattern in real warehouses.
+
+### Why this matters
+
+The story above is a real failure mode. If you store only the current address, every historical query silently rewrites the past. Yesterday's report no longer matches today's, even though you only loaded new data and never touched the old rows. That destroys trust in the warehouse.
+
+```
+Without history (broken)
+─────────────────────────
+customer_id │ name │ country
+1001 │ Alice │ MY ← changed from SG in March
+
+Old invoice from January says customer 1001.
+Report joins to dimension, gets MY.
+The January invoice now appears under MY.
+Finance: "Why did SG's January number drop?"
+
+
+With history (SCD Type 2)
+─────────────────────────
+customer_id │ name │ country │ valid_from │ valid_to │ is_current
+1001 │ Alice │ SG │ 2023-01-01 │ 2025-03-15 │ false
+1001 │ Alice │ MY │ 2025-03-15 │ 9999-12-31 │ true
+
+January invoice (date = 2025-01-20) joins to the row valid that day.
+It correctly appears under SG.
+```
+
+### The classic SCD types in plain words
+
+**Type 0 — never change.** The value is set once and frozen. Used for things like a customer's original signup country, or the date they joined. Rare but useful.
+
+**Type 1 — overwrite.** When the value changes, you replace it. No history kept. Used for things where history does not matter, like a typo fix in a name.
+
+**Type 2 — add a new row.** When the value changes, you keep the old row and add a new one. You add columns like `valid_from`, `valid_to`, and `is_current`. Every fact joins to the dimension row that was valid at the fact's timestamp. This is by far the most common.
+
+**Type 3 — add a column.** You keep one extra column like `previous_country`. Useful when you only care about the most recent change, not the full history. Rare in modern warehouses.
+
+There are higher types (Type 4 with mini-dimensions, Type 6 hybrid) but in interviews, knowing 1, 2 and 3 well is enough.
+
+### Concrete shapes
+
+**Type 1 (overwrite)**
+
+```
+customers
+─────────────────────────────────────
+customer_id │ name │ country
+1001 │ Alice Lee │ MY ← overwritten
+1002 │ Bob Khan │ SG
+```
+
+After Alice moves, the old `SG` value is gone. Cheap to store. History lost.
+
+**Type 2 (add row, version with dates)**
+
+```
+customers_history
+──────────────────────────────────────────────────────────────────────
+customer_id │ name │ country │ valid_from │ valid_to │ is_current
+1001 │ Alice Lee │ SG │ 2023-01-01 │ 2025-03-15 │ false
+1001 │ Alice Lee │ MY │ 2025-03-15 │ 9999-12-31 │ true
+1002 │ Bob Khan │ SG │ 2024-05-10 │ 9999-12-31 │ true
+```
+
+Every change adds a new row. `valid_from` / `valid_to` mark the period it was true. `is_current` is a convenience flag so queries that want "right now" do not have to use `9999-12-31`.
+
+**Type 3 (one extra column)**
+
+```
+customers
+─────────────────────────────────────────────────
+customer_id │ name │ country │ previous_country
+1001 │ Alice Lee │ MY │ SG
+1002 │ Bob Khan │ SG │ NULL
+```
+
+Tracks one prior value only. Useful for "did this customer recently move."
+
+### How to query Type 2 (the as-of join)
+
+This is the join pattern you will draw on the whiteboard:
+
+```sql
+SELECT
+ i.invoice_id,
+ i.amount,
+ c.country AS country_at_invoice_time
+FROM invoices i
+LEFT JOIN customers_history c
+ ON c.customer_id = i.customer_id
+ AND i.invoice_date >= c.valid_from
+ AND i.invoice_date < c.valid_to;
+```
+
+Two important details:
+
+1. The interval is `[valid_from, valid_to)`. Half open. This avoids the row appearing in both the old and the new period on the exact change date.
+2. You join on **date inside range**, not on `is_current`. Using `is_current` would re-introduce the original bug.
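+
+If you need the same as-of logic outside the warehouse, for example in a quick validation script, pandas can express it too. A minimal sketch using `pandas.merge_asof`, assuming the history is contiguous and non-overlapping and the column names mirror the example above:
+
+```python
+import pandas as pd
+
+invoices = pd.DataFrame({
+    "invoice_id": [1, 2],
+    "customer_id": [1001, 1001],
+    "invoice_date": pd.to_datetime(["2025-01-20", "2025-04-02"]),
+    "amount": [120.0, 80.0],
+})
+
+history = pd.DataFrame({
+    "customer_id": [1001, 1001],
+    "country": ["SG", "MY"],
+    "valid_from": pd.to_datetime(["2023-01-01", "2025-03-15"]),
+})
+
+# merge_asof picks, per invoice, the latest history row whose
+# valid_from <= invoice_date, i.e. "the row valid that day",
+# which is correct when the validity ranges do not overlap.
+joined = pd.merge_asof(
+    invoices.sort_values("invoice_date"),
+    history.sort_values("valid_from"),
+    left_on="invoice_date",
+    right_on="valid_from",
+    by="customer_id",
+)
+print(joined[["invoice_id", "amount", "country"]])  # invoice 1 -> SG, invoice 2 -> MY
+```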
+
+### Which type to pick
+
+| Field | Likely type | Why |
+| ---------------------------- | ----------- | --------------------------------------------------------- |
+| Customer address / country | Type 2 | Past invoices must keep their original region |
+| Customer's display name | Type 1 | A typo fix should fix old reports too |
+| Subscription plan tier | Type 2 | Revenue reporting depends on which plan they had when |
+| Currency code of an account | Type 2 | Historical balances were stored in that currency |
+| Most recent campaign source | Type 3 | We only care about "the one before this one" |
+| Their original signup date | Type 0 | Set once, never changes |
+
+### Common mistakes interviewers want you to name
+
+1. **Storing only current state.** The "moving customer breaks historical region revenue" story.
+2. **Type 2 done wrong** by joining on `is_current` instead of date ranges. Same bug, fancier table.
+3. **Overlapping validity ranges** because two updates landed in the same second and you forgot to close the previous row before opening a new one.
+4. **Type 2 explosion** when you flag too many columns as "track history." Pick the few that really need it. Otherwise the dimension grows to billions of rows for no reason.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you actually build a Type 2 table in dbt?"*
+
+dbt has a built-in `snapshot` materialisation that does exactly this. You declare:
+
+```sql
+{% snapshot customers_history %}
+ {{ config(
+ target_schema='snapshots',
+ unique_key='customer_id',
+ strategy='check',
+ check_cols=['country', 'plan_tier']
+ ) }}
+ SELECT * FROM {{ source('app', 'customers') }}
+{% endsnapshot %}
+```
+
+Every time it runs, dbt diffs the source against the snapshot. New or changed rows get a fresh `dbt_valid_from` and the previous row's `dbt_valid_to` is closed. It is the easiest way to get Type 2 in a modern warehouse without writing the MERGE logic by hand.
diff --git a/Problem 11: Data Contracts in Plain Words/question.md b/Problem 11: Data Contracts in Plain Words/question.md
new file mode 100644
index 0000000..a2e9920
--- /dev/null
+++ b/Problem 11: Data Contracts in Plain Words/question.md
@@ -0,0 +1,29 @@
+## Problem 11: Data Contracts in Plain Words
+
+**Scenario:**
+A producer team renames a column from `user_id` to `userId` in their event stream as part of a refactor. They do not tell anyone. Three downstream pipelines break overnight, including the daily revenue report. After the postmortem, leadership asks: how do we stop this from happening every quarter?
+
+The answer everyone keeps mentioning is "data contracts."
+
+In the interview, the question is:
+
+> What is a data contract, in plain words, and why are companies suddenly talking about them?
+
+---
+
+### Your Task:
+
+1. Explain what a data contract is and what it is not.
+2. Explain why this conversation is happening now.
+3. Sketch what a real data contract looks like (columns, types, rules).
+4. Explain how it gets enforced in practice.
+
+---
+
+### What a Good Answer Covers:
+
+* The shift from "data is a side effect" to "data is a product."
+* The contract as an agreement between a producer and a consumer.
+* Schema, semantics, freshness, ownership.
+* Where it gets enforced: producer side, ingest side, CI checks.
+* Why it usually fails when it's only a document.
diff --git a/Problem 11: Data Contracts in Plain Words/solution.md b/Problem 11: Data Contracts in Plain Words/solution.md
new file mode 100644
index 0000000..c5692f7
--- /dev/null
+++ b/Problem 11: Data Contracts in Plain Words/solution.md
@@ -0,0 +1,123 @@
+## Solution 11: Data Contracts in Plain Words
+
+### Short version you can say out loud
+
+> A data contract is an explicit agreement between the team that produces data and the teams that consume it. It says what fields will be there, what types they will be, what they mean, how fresh they will be, and who owns them. It is the same idea as an API contract between two services, just applied to data. People are talking about it now because data has become a product, and treating it like a side effect of the app keeps breaking downstream systems.
+
+### Why now
+
+For most of the last 20 years, data was a by-product of the application. Engineers built features and the data team scraped whatever ended up in the database. When the app team changed a column, the data team found out by the dashboard breaking the next morning. That worked when there were two analysts and one report. It does not work now, because data feeds machine learning models, billing, regulators, and live customer features. The cost of a breaking change is much higher.
+
+Data contracts are the industry trying to apply software engineering discipline (interfaces, versioning, tests, ownership) to data the same way we did to microservice APIs ten years ago.
+
+### What a contract actually contains
+
+```
+ ┌────────────────────────────────────────┐
+ │ DATA CONTRACT │
+ │ │
+ │ Schema fields, types, nullability │
+ │ Semantics what each field means │
+ │ Quality rules and SLAs │
+ │ Freshness how often, how late │
+ │ Owner team and on-call │
+ │ Version semver, deprecation policy │
+ └─────────────┬──────────────────────────┘
+ │
+ ┌────────────────┴────────────────┐
+ ▼ ▼
+ Producer team Consumer teams
+ (app backend) (analytics, ML, finance)
+```
+
+A typical YAML contract might look like:
+
+```yaml
+name: orders
+version: 1.3.0
+owner: checkout-team
+sla:
+ freshness: 5 minutes from event time
+ availability: 99.9%
+schema:
+ - name: order_id
+ type: string
+ required: true
+ description: A unique id for the order. Stable across retries.
+ - name: customer_id
+ type: int64
+ required: true
+ - name: amount_cents
+ type: int64
+ required: true
+ description: Charged amount in the smallest unit of the currency.
+ - name: currency
+ type: string
+ required: true
+ constraints:
+ enum: [SGD, MYR, IDR, USD]
+ - name: created_at
+ type: timestamp
+ required: true
+quality:
+ - rule: amount_cents > 0
+ - rule: order_id is unique
+ - rule: no more than 0.01% of rows missing currency
+breaking_changes_policy: 6 months deprecation window
+```
+
+It is the same shape as a Protobuf schema, an OpenAPI spec, or an Avro schema, plus extra metadata about ownership and SLA.
+
+### What a contract is NOT
+
+* It is **not** just a document on Confluence. A document does not catch a renamed column at 3 AM.
+* It is **not** the same as a schema. A schema only describes shape. A contract also covers meaning, ownership and freshness.
+* It is **not** a one-way wish list from the consumer. Both sides have to agree, because the producer takes on the cost of stability.
+
+### Where it gets enforced
+
+The whole point is that the contract is **machine readable** and **checked automatically**. Three common enforcement points:
+
+```
+┌─────────┐ 1 ┌──────────┐ 2 ┌──────────┐ 3 ┌──────────┐
+│Producer │──────▶│ Kafka / │──────▶│ Warehouse│─────▶│ Consumer │
+│ code │ │ S3 │ │ │ │ code │
+└─────────┘ └──────────┘ └──────────┘ └──────────┘
+ │ │ │
+ ▼ ▼ ▼
+1. CI check in 2. Schema 3. dbt tests against
+ producer repo registry the contract on
+ (a renamed (Avro / Protobuf, every model run.
+ column fails rejects messages
+ the build) that don't match
+ the registered
+ schema)
+```
+
+* **Producer side.** A CI test fails the build if a code change would break the contract. This is the most valuable spot, because it catches the issue before it leaves the producer team.
+* **Ingest side.** A schema registry (Confluent, Apicurio, Glue Schema Registry) rejects events that don't match the registered schema. This catches drift between code and reality.
+* **Consumer side.** dbt tests or Great Expectations checks validate the data on arrival. Last line of defence.
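+
+The producer-side check can be very small. A minimal sketch of the CI idea, assuming the contract YAML from above is checked into the repo and that the build can dump the schema the producer currently emits as a `{field: type}` dict (the file path and helper names here are hypothetical):
+
+```python
+import sys
+import yaml  # assumes PyYAML is available in the CI image
+
+def load_contract_fields(path):
+    with open(path) as f:
+        contract = yaml.safe_load(f)
+    return {field["name"]: field["type"] for field in contract["schema"]}
+
+def check(contract_path, emitted_schema):
+    """Collect every contracted field that is missing or changed type."""
+    problems = []
+    for name, typ in load_contract_fields(contract_path).items():
+        if name not in emitted_schema:
+            problems.append(f"missing field: {name}")
+        elif emitted_schema[name] != typ:
+            problems.append(f"type change on {name}: {typ} -> {emitted_schema[name]}")
+    return problems
+
+if __name__ == "__main__":
+    # In a real setup this comes from the producer's own serializer.
+    # Here order_id was renamed to orderId, so the check reports it.
+    emitted = {"orderId": "string", "customer_id": "int64", "amount_cents": "int64",
+               "currency": "string", "created_at": "timestamp"}
+    issues = check("contracts/orders.yaml", emitted)
+    if issues:
+        print("\n".join(issues))
+        sys.exit(1)  # the rename fails the build here, not a dashboard at 3 AM
+```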
+
+### How a real change happens with contracts
+
+The producer team wants to rename `user_id` to `userId`. With a contract in place:
+
+1. They open a pull request that changes the contract: `user_id` is now deprecated in version 1.4, `userId` is added.
+2. CI runs the consumer test list against the new contract. It tells them which downstream models reference `user_id` (8 of them).
+3. The contract says the deprecation window is 6 months. They cannot remove `user_id` for 6 months. They keep emitting both fields during that window.
+4. Consumers migrate at their own pace. After 6 months, the field is removed.
+
+The dashboard never breaks at 3 AM, because the system enforced the agreement.
+
+### Common mistakes
+
+1. **"We have a contract" but it lives in a Google Doc.** Not enforced is not a contract.
+2. **The contract is owned by the data team, not the producer team.** Producers will not feel responsible, so it drifts.
+3. **No deprecation window.** Producers will still break consumers because they can change the schema instantly.
+4. **Treating semantic changes as non-breaking.** Changing the *meaning* of `amount` from "net" to "gross" is a breaking change even if the type and name stay the same.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What is the difference between a data contract and a schema registry?"*
+
+A schema registry enforces shape. The data contract is the bigger agreement that *contains* a schema and adds meaning, ownership, freshness and quality rules. In practice you usually have both: the contract lives in source control, and at runtime, the registry enforces the schema piece of it.
diff --git a/Problem 12: Parquet vs CSV vs JSON/question.md b/Problem 12: Parquet vs CSV vs JSON/question.md
new file mode 100644
index 0000000..64c3e50
--- /dev/null
+++ b/Problem 12: Parquet vs CSV vs JSON/question.md
@@ -0,0 +1,27 @@
+## Problem 12: Parquet vs CSV vs JSON
+
+**Scenario:**
+Your team is choosing the storage format for a 5 TB events archive in S3. One engineer wants CSV "because everything reads it." Another wants JSON "because that's how the events arrive." You suggest Parquet. The team has not used it before and asks you to explain.
+
+In the interview, the question is:
+
+> When would you use Parquet, CSV, or JSON for storing data, and how would you explain Parquet to someone who has never heard of it?
+
+---
+
+### Your Task:
+
+1. Explain the three formats in plain words.
+2. Compare them on size, query speed, schema, and tooling.
+3. Explain why Parquet is so popular for analytics.
+4. Say when you would actually pick CSV or JSON over Parquet.
+
+---
+
+### What a Good Answer Covers:
+
+* Row-oriented vs column-oriented storage.
+* Compression and predicate pushdown.
+* The schema-on-write vs schema-on-read difference.
+* Real numbers: same dataset in CSV vs Parquet sizes.
+* Honest cases where Parquet is the wrong call.
diff --git a/Problem 12: Parquet vs CSV vs JSON/solution.md b/Problem 12: Parquet vs CSV vs JSON/solution.md
new file mode 100644
index 0000000..b7bcb01
--- /dev/null
+++ b/Problem 12: Parquet vs CSV vs JSON/solution.md
@@ -0,0 +1,106 @@
+## Solution 12: Parquet vs CSV vs JSON
+
+### Short version you can say out loud
+
+> CSV is a plain text grid. Easy to read for humans, easy to share with anyone, but slow to query and large on disk. JSON is good for nested or schema-changing data, but even larger on disk than CSV and just as slow to scan. Parquet is a columnar binary format used for analytics: it stores each column together with its own compression, so when your query reads 3 columns out of 50, it reads only those 3 columns from disk. For a 5 TB archive that gets queried a lot, Parquet usually ends up about 5 to 10x smaller and 10 to 100x faster to scan than CSV.
+
+### Picture it
+
+```
+CSV (row by row, text)
+──────────────────────
+id,name,country,amount
+1,Alice,SG,100.00
+2,Bob,MY,250.00
+3,Carol,SG,75.50
+…
+
+A query "SELECT SUM(amount) WHERE country='SG'"
+must read every column of every row.
+
+
+JSON (row by row, with structure)
+──────────────────────────────────
+{"id":1,"name":"Alice","country":"SG","amount":100.00,"tags":["new"]}
+{"id":2,"name":"Bob","country":"MY","amount":250.00}
+{"id":3,"name":"Carol","country":"SG","amount":75.50,"meta":{"src":"web"}}
+
+Same problem as CSV for analytics, plus parsing overhead.
+Good for nested fields and schema that wiggles.
+
+
+Parquet (column by column, binary, compressed)
+───────────────────────────────────────────────
+id : [1, 2, 3, ...] (compressed integers)
+name : [Alice, Bob, Carol, ...] (dictionary encoded)
+country : [SG, MY, SG, ...] (run length encoded)
+amount : [100.00, 250.00, 75.50, ...] (compressed floats)
+
+Same query reads only `country` and `amount` columns.
+Skips reading `id`, `name`, and everything else.
+```
+
+### How each one is laid out
+
+**CSV** stores data row by row, as plain text. Every column is a string, even when it represents a number or a date. No types, no compression by default, no schema. Universal: every tool on earth reads CSV.
+
+**JSON** (here meaning JSON Lines, one object per line) is also row by row, also plain text, but each row carries its own structure. You can have nested objects, arrays, missing fields, and types are implicit per value. Great for "this row might have new fields tomorrow" use cases.
+
+**Parquet** is a binary, columnar format. Inside one Parquet file, data is grouped into "row groups," and inside each row group, each column is stored separately with its own compression and statistics (min, max, null count). This layout gives you three big wins for analytics:
+
+1. **Read fewer columns.** A query that needs 3 of 50 columns reads about 6 percent of the file.
+2. **Better compression.** Values in the same column are similar (same type, often repeating). Compression ratios of 5 to 10x over CSV are normal.
+3. **Predicate pushdown.** Each column has min and max per row group. The query engine can skip whole row groups when the filter range cannot match. This is why BigQuery, Snowflake, Athena, Spark all love Parquet.
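+
+You can see the first two wins on a laptop. A minimal sketch with pandas and pyarrow on synthetic data, so the exact ratios will differ from the real-dataset numbers further down:
+
+```python
+import os
+import numpy as np
+import pandas as pd
+
+# Synthetic events with a few repetitive columns, like real logs.
+n = 1_000_000
+df = pd.DataFrame({
+    "id": np.arange(n),
+    "name": np.random.choice(["Alice", "Bob", "Carol"], n),
+    "country": np.random.choice(["SG", "MY", "ID"], n),
+    "amount": np.round(np.random.exponential(100, n), 2),
+})
+
+df.to_csv("events.csv", index=False)
+df.to_parquet("events.parquet", compression="snappy")  # needs pyarrow installed
+
+print("csv bytes    ", os.path.getsize("events.csv"))
+print("parquet bytes", os.path.getsize("events.parquet"))
+
+# Columnar read: touch only the two columns the query needs.
+subset = pd.read_parquet("events.parquet", columns=["country", "amount"])
+print(subset.loc[subset["country"] == "SG", "amount"].sum())
+```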
+
+### Side by side comparison
+
+| Aspect | CSV | JSON | Parquet |
+| --------------------- | ----------------- | ------------------- | ------------------------ |
+| Format | Text, row-based | Text, row-based | Binary, column-based |
+| Schema | None (all string) | Implicit per row | Strict, embedded |
+| Compression | None by default | None by default | Built in (snappy, zstd) |
+| Size on disk          | Large             | Largest (repeated keys) | 5 to 10x smaller than CSV |
+| Query 1 column out of 50 | Reads all | Reads all | Reads ~2% |
+| Nested data | Awkward | Native | Native (structs, arrays) |
+| Human readable | Yes | Yes | No |
+| Streaming friendly | Yes (append) | Yes (append) | Not really (block based) |
+| Best for | Exchange, small files | Streaming events with changing shape | Analytics archive |
+
+### Real numbers (typical)
+
+A 100 GB CSV of NYC taxi trips often becomes around 10-15 GB in Parquet with snappy compression and 6-8 GB with zstd. A query that filters by date and selects two columns typically goes from minutes on the CSV to seconds on Parquet on the same Athena or BigQuery setup.
+
+### When to actually pick each one
+
+**Pick CSV when:**
+* You are exchanging data with a non-technical partner who opens files in Excel.
+* The data is small (under a few hundred MB) and human review matters.
+* A tool you cannot change only accepts CSV.
+
+**Pick JSON (or JSON Lines) when:**
+* Events have a flexible or evolving structure with many optional fields.
+* You need to keep the original payload exactly as it arrived (audit, replay).
+* It is the native shape that an API emits, and you want to land raw before transforming.
+
+**Pick Parquet when:**
+* The data lives in a data lake and gets queried by analytics engines (BigQuery, Athena, Spark, DuckDB, Snowflake).
+* The dataset is large (anything above a few GB).
+* You have a stable schema.
+* You care about cost (in BigQuery and Athena you pay per byte scanned, and Parquet directly reduces that).
+
+### Common mistakes interviewers want you to name
+
+1. **Storing analytical data as CSV in S3.** The bill is silently 5x higher than it needs to be.
+2. **Writing tiny Parquet files** (a few KB each). The columnar advantage is lost, and the engine spends most of its time opening files. Aim for files in the 128 MB to 1 GB range.
+3. **Parquet for tiny dataframes** in code. The reader and writer overhead can make it slower than CSV at small sizes.
+4. **Choosing JSON for "future flexibility"** and never actually using the flexibility. Pay the cost forever for no benefit.
+5. **Mixing types across rows in JSON.** A field that is sometimes a string and sometimes a number breaks every consumer that tries to declare a schema.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How do Parquet and ORC compare? Or Avro?"*
+
+* **ORC** is also columnar, came from the Hive ecosystem. Very similar to Parquet, slightly better compression sometimes. Parquet won on adoption outside Hadoop.
+* **Avro** is row-based binary with an explicit schema. Excellent for streaming (Kafka loves Avro) and for write-heavy workloads, but bad for "read 2 columns out of 50."
+
+A common pattern in real teams: **Avro on the wire** (Kafka events), **Parquet at rest** (data lake). You convert as data lands.
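+
+That landing-time conversion is usually just a few lines. A minimal sketch with `fastavro` and `pyarrow`, assuming a batch of Avro files has already arrived (the file names are hypothetical):
+
+```python
+import fastavro
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+# Read the Avro rows; the schema travels inside the file itself.
+with open("events-2025-05-14.avro", "rb") as f:
+    records = list(fastavro.reader(f))
+
+# Pivot rows into an in-memory columnar table and write it out compressed.
+table = pa.Table.from_pylist(records)
+pq.write_table(table, "events-2025-05-14.parquet", compression="zstd")
+```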
diff --git a/Problem 13: Data Lake vs Warehouse vs Lakehouse/question.md b/Problem 13: Data Lake vs Warehouse vs Lakehouse/question.md
new file mode 100644
index 0000000..b5094ac
--- /dev/null
+++ b/Problem 13: Data Lake vs Warehouse vs Lakehouse/question.md
@@ -0,0 +1,26 @@
+## Problem 13: Data Lake vs Warehouse vs Lakehouse
+
+**Scenario:**
+A product manager joins your team and asks why your stack has three different storage layers: raw files in S3, tables in BigQuery, and something the team calls "the lakehouse." They want to understand the point of each.
+
+In the interview, the question is:
+
+> Explain the difference between a data lake, a data warehouse, and a lakehouse. Pretend you are explaining it to a product manager who has never worked with data infrastructure.
+
+---
+
+### Your Task:
+
+1. Define each in one sentence a non-technical person can follow.
+2. Show a small diagram of the three.
+3. Explain what each one is good at and bad at.
+4. Explain where the "lakehouse" idea actually came from and why it gets debated.
+
+---
+
+### What a Good Answer Covers:
+
+* Lake as raw files (cheap, flexible, hard to govern).
+* Warehouse as managed tables (governed, fast, more expensive).
+* Lakehouse as "warehouse features on top of lake files," using formats like Delta, Iceberg, Hudi.
+* The honest take: most companies have all three.
diff --git a/Problem 13: Data Lake vs Warehouse vs Lakehouse/solution.md b/Problem 13: Data Lake vs Warehouse vs Lakehouse/solution.md
new file mode 100644
index 0000000..ff750b2
--- /dev/null
+++ b/Problem 13: Data Lake vs Warehouse vs Lakehouse/solution.md
@@ -0,0 +1,105 @@
+## Solution 13: Data Lake vs Warehouse vs Lakehouse
+
+### Short version you can say out loud
+
+> A data lake is a big folder of files. You can dump anything in cheaply, but you have to bring your own tools to make sense of it. A warehouse is a managed database for analytics. It is fast, governed, has SQL out of the box, but costs more and is less flexible about file formats. A lakehouse is the newer idea of putting warehouse features (tables, transactions, schemas) on top of lake files, so you get most of the warehouse experience without paying warehouse prices. In real life most companies use all three at the same time, for different jobs.
+
+### Picture it
+
+```
+DATA LAKE DATA WAREHOUSE
+(S3, GCS, ADLS) (BigQuery, Snowflake, Redshift)
+
+s3://bucket/raw/ ┌─────────────────────────────┐
+ events/ │ orders customers prices │
+ year=2025/ │ ───── ───────── ────── │
+ month=05/ │ ... ... ... │
+ day=14/ │ │
+ part-001.parquet │ managed tables, SQL, │
+ part-002.parquet │ governance, ACID, │
+ logs/ │ time travel, RBAC │
+ invoices/ └─────────────────────────────┘
+ random_csv_someone_uploaded/
+ Fast for analytics.
+Anything goes. Costs more.
+Cheap. Hard to govern.
+
+ LAKEHOUSE
+ (Delta Lake, Iceberg, Hudi on top of S3/GCS)
+
+ s3://bucket/curated/
+ orders/ ── managed by Iceberg/Delta
+ data/*.parquet
+ _metadata/ ── transactions, schema, snapshots
+
+ Warehouse-like behaviour (ACID, schema, time travel)
+ on top of cheap lake files. Query with Spark,
+ Trino, Databricks, Snowflake, BigQuery external.
+```
+
+### One paragraph each
+
+**Data lake.** A data lake is just object storage (S3, GCS, Azure Data Lake) holding files in any format you like: CSV, JSON, Parquet, images, PDFs. It is cheap, infinitely scalable, and accepts anything. The downside is that there is no schema, no transactions, and no built-in query engine. If you want to query it, you bring your own tool (Athena, Spark, Trino). And because anyone can drop anything in, lakes can become "data swamps" without strict folder conventions and governance.
+
+**Data warehouse.** A warehouse is a managed analytical database. You define tables, you load data in, you run SQL. The engine knows your schema, manages compression, builds statistics, handles indexes (or partitioning + clustering), enforces access control, and gives you ACID transactions. BigQuery, Snowflake, Redshift, Synapse are all in this category. Faster and friendlier than a lake for analytics, but you pay for the management.
+
+**Lakehouse.** A lakehouse is a layer on top of a data lake that makes those files behave like warehouse tables. Three open formats lead the space: **Delta Lake**, **Apache Iceberg**, and **Apache Hudi**. They store data as Parquet files but add a metadata layer that gives you transactions, schema evolution, time travel (query the table as of last Tuesday), and updates and deletes. You get most of the warehouse experience while keeping the files open and queryable by many engines. Spark, Trino, Snowflake, BigQuery and Databricks can all read Iceberg now.
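+
+To make "warehouse behaviour on plain files" concrete, here is a minimal sketch using the `deltalake` Python package (delta-rs); treat the exact API as an assumption to verify against its docs, the point is the versioned, transactional table it builds on top of Parquet files:
+
+```python
+import pandas as pd
+from deltalake import DeltaTable, write_deltalake  # pip install deltalake
+
+path = "orders_delta"  # a local folder while experimenting; an s3:// URI in production
+
+# Version 0: first load creates the table and its transaction log.
+write_deltalake(path, pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 25.0]}))
+
+# Version 1: a transactional append. Readers never see a half-written state.
+write_deltalake(path, pd.DataFrame({"order_id": [3], "amount": [7.5]}), mode="append")
+
+print(DeltaTable(path).to_pandas())             # current state: 3 rows
+print(DeltaTable(path, version=0).to_pandas())  # time travel: the table as of version 0
+```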
+
+### What each one is good at
+
+| Need | Lake | Warehouse | Lakehouse |
+| ---------------------------------------------- | ---- | --------- | --------- |
+| Store anything cheaply, including non-tabular | Yes | No | No |
+| Run SQL with sub-second latency | No | Yes | Sometimes |
+| ACID transactions on tables | No | Yes | Yes |
+| Time travel (query as of a past time) | No | Yes (limited) | Yes |
+| Schema enforcement and evolution | No | Yes | Yes |
+| Multi-engine read (Spark, Trino, BigQuery, …) | Yes | No | Yes |
+| Built-in governance, access control | DIY | Yes | Partial |
+| Cost | Lowest | Highest | Middle |
+
+### Where each fits in a real pipeline
+
+```
+ Raw events / files
+ │
+ ▼
+ ┌───────────┐
+ │ LAKE │ raw, unprocessed, "everything that ever happened"
+ └─────┬─────┘
+ │
+ ▼
+ ┌───────────┐
+ │ LAKEHOUSE │ cleaned, conformed, queryable, transactional
+ └─────┬─────┘
+ │
+ ▼
+ ┌───────────┐
+ │ WAREHOUSE │ business marts, dashboards, BI
+ └───────────┘
+```
+
+This three-layer setup is so common it has names: bronze (lake), silver (lakehouse), gold (warehouse), or just raw / staging / marts.
+
+### The honest take on the "lakehouse" debate
+
+The term is partly marketing. Databricks pushed it heavily. Critics point out that "putting a metadata layer on top of files" was already what Hive did 15 years ago. What is genuinely new is:
+
+* The open table formats (Iceberg, Delta, Hudi) handle real ACID transactions, not just metadata.
+* Cloud engines like BigQuery and Snowflake now read these formats directly, breaking the old "you have to use Spark on Databricks" lock-in.
+* The economics actually work because object storage is so cheap.
+
+So the idea is real, even if the word is overused. The practical question for a team is "do we keep growing the warehouse, or do we move some workloads onto Iceberg in our lake?" The honest answer is "depends on cost, scale, and how mixed your tooling is."
+
+### Common confusions interviewers want you to clear up
+
+1. **"Lake means unstructured."** Wrong. A lake can hold tidy Parquet tables. The point is *where* and *how managed*, not how messy.
+2. **"Lakehouse replaces the warehouse."** Rarely true in practice. Warehouses are still better for sub-second BI on curated marts. Most teams use both.
+3. **"Lake is free."** Storage is cheap, but query cost (Athena, Spark) can rival warehouse cost if you scan badly.
+4. **"Iceberg is a database."** No, it is a table format spec. You still need an engine to read or write it.
+
+### Bonus follow-up the interviewer might throw
+
+> *"If you were starting from scratch today for a small startup, what would you pick?"*
+
+For a small startup, the simplest answer is: land raw in cheap object storage, use a managed warehouse (BigQuery or Snowflake) for everything analytical, and only adopt Iceberg or Delta when warehouse costs become painful or you have multi-engine needs. Going lakehouse-first adds operational complexity that a five-person team usually does not need.
diff --git a/Problem 14: Exactly Once Delivery/question.md b/Problem 14: Exactly Once Delivery/question.md
new file mode 100644
index 0000000..a1ac1c1
--- /dev/null
+++ b/Problem 14: Exactly Once Delivery/question.md
@@ -0,0 +1,27 @@
+## Problem 14: Exactly Once Delivery
+
+**Scenario:**
+A payments engineer says their Kafka topic provides "exactly once" so the downstream job does not need any deduplication logic. The job processes a payment confirmation, and one day, a customer is charged twice. The engineer is surprised. You explain that exactly once is more subtle than it sounds.
+
+In the interview, the question is:
+
+> What does "exactly once" delivery mean and why is it so hard to actually guarantee in real systems?
+
+---
+
+### Your Task:
+
+1. Define at most once, at least once, and exactly once in one sentence each.
+2. Explain why exactly once is so hard at the system boundary.
+3. Show the trick that makes "effective exactly once" possible.
+4. Give a real example of where this goes wrong.
+
+---
+
+### What a Good Answer Covers:
+
+* The producer / broker / consumer distinction.
+* Why network and crashes force you to pick: drop or duplicate.
+* The role of idempotent consumers.
+* Kafka's "exactly once semantics" and what it actually covers.
+* End-to-end exactly once vs in-broker exactly once.
diff --git a/Problem 14: Exactly Once Delivery/solution.md b/Problem 14: Exactly Once Delivery/solution.md
new file mode 100644
index 0000000..8269fd3
--- /dev/null
+++ b/Problem 14: Exactly Once Delivery/solution.md
@@ -0,0 +1,108 @@
+## Solution 14: Exactly Once Delivery
+
+### Short version you can say out loud
+
+> Three delivery guarantees exist: at most once (might lose messages, never duplicate), at least once (never lose, might duplicate), and exactly once (never lose, never duplicate). At least once is what real networks naturally give you. Exactly once is hard because the only way to guarantee no loss is to retry on uncertainty, and retrying creates duplicates. The practical fix is to make the consumer idempotent, so duplicates from the wire stop mattering. When people say "exactly once," they usually mean "at least once delivery plus an idempotent consumer." Kafka's exactly once semantics covers Kafka to Kafka, but does not magically extend to your database or third party API.
+
+### Picture the three modes
+
+```
+ Producer ────▶ Broker ────▶ Consumer
+
+At most once │ Fire and forget. If the message is lost in the wire, it is gone.
+ │ Used for telemetry where losing a tiny percent is fine.
+
+At least once │ The producer retries until it sees an ack. The broker retries
+ │ delivery until the consumer commits. Result: never lost,
+ │ but can be delivered more than once. Default for Kafka.
+
+Exactly once │ Each message effects the consumer exactly one time.
+ │ Hard, because retries are the only safe way to avoid loss,
+ │ and retries cause duplicates. So "exactly once" really means
+ │ "duplicate-aware consumer."
+```
+
+### Why it is hard
+
+Networks fail in a specific way: when you ask "did the other side receive my message?", there are three possible answers: yes, no, or *I don't know*. The "I don't know" case is the whole problem. If the producer sent a message and never got an ack, two things could have happened:
+
+1. The broker never received it.
+2. The broker received it, but the ack got lost.
+
+The producer cannot tell the difference. So it retries. If case 2 was the real one, the broker now has the message twice.
+
+The same thing happens between broker and consumer: the consumer processes the message, then has to commit "I am done with it." If the commit ack is lost, the consumer (after a restart) re-reads the same message and processes it again.
+
+You cannot have both "never lose" and "never duplicate" with only network calls. You have to add something else.
+
+### The trick: idempotent consumers
+
+If your consumer can recognize a duplicate and ignore it, you do not need exactly-once delivery from the broker. You only need at-least-once delivery, plus a consumer that does not care when it sees a duplicate.
+
+The most common patterns:
+
+**1. Idempotency key**
+Every message carries a unique id. The consumer keeps a small store of "ids I have already processed" (a database table, Redis, a Bloom filter for big scale). Before doing the work, it checks. If already done, skip.
+
+```python
+def handle(msg):
+ if seen.exists(msg.id):
+ return # duplicate, ignore
+ do_the_work(msg)
+ seen.insert(msg.id) # remember it
+```
+
+The catch is the order: if you do the work first and crash before recording the id, you will do the work twice on redelivery. If you record the id first and crash before doing the work, you have lost the message. The fix is to put both into a single transaction, which leads to the next pattern.
+
+**2. Transactional outbox / transactional consumer**
+The consumer wraps both the work and the "seen" record in one database transaction. Either both commit, or neither does. If the transaction is in your application database, you can do this with normal SQL transactions.
+
+```sql
+BEGIN;
+INSERT INTO payments (id, amount) VALUES (...);
+INSERT INTO processed_ids (id) VALUES (...);
+COMMIT;
+```
+
+Now there is no window where the work happened but the id was not recorded.
+
+**3. Stateful processing with checkpointed state (Flink-style)**
+A stream processor like Flink or Beam keeps a snapshot of its state. When it commits to Kafka, it commits its state snapshot at the same time, using a two-phase commit. On restart, it resumes from the last snapshot, so retried messages do not double-count. This is the cleanest version of "real" exactly once, but it only works when both ends speak the protocol (Kafka <-> Flink <-> Kafka).
+
+### Kafka exactly once: what it actually covers
+
+Kafka's "Exactly Once Semantics" (EOS) does three things:
+
+1. The producer dedupes its own retries using a sequence number, so the broker stores each message once per partition.
+2. Transactional writes let a producer write to multiple partitions atomically.
+3. Read-process-write loops inside Kafka can commit consumer offsets and produced messages in one transaction (see the sketch after this list).
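+
+A rough sketch of that loop with the confluent-kafka Python client, under the assumption that both input and output live in Kafka (topic names, config values, and the `enrich` step are illustrative):
+
+```python
+from confluent_kafka import Consumer, Producer
+
+def enrich(value: bytes) -> bytes:
+    return value.upper()                  # stand-in for the real transformation
+
+producer = Producer({"bootstrap.servers": "localhost:9092",
+                     "transactional.id": "orders-enricher-1"})
+producer.init_transactions()
+
+consumer = Consumer({"bootstrap.servers": "localhost:9092",
+                     "group.id": "orders-enricher",
+                     "enable.auto.commit": False,
+                     "isolation.level": "read_committed"})
+consumer.subscribe(["orders"])
+
+while True:
+    msg = consumer.poll(1.0)
+    if msg is None or msg.error():
+        continue
+    producer.begin_transaction()
+    producer.produce("orders-enriched", enrich(msg.value()))
+    # The consumed offset commits in the same transaction as the produced message.
+    producer.send_offsets_to_transaction(
+        consumer.position(consumer.assignment()),
+        consumer.consumer_group_metadata(),
+    )
+    producer.commit_transaction()
+```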
+
+What it does NOT cover:
+
+* Writing to your **Postgres** or your **payment gateway**. Those are outside the Kafka transaction.
+* Effects in any external system, including sending an email or invoking an API.
+
+So the payments engineer in the scenario was technically right that Kafka delivered exactly once, but the consumer charged the customer via Stripe, which is outside Kafka. The retry happened at the Stripe call level after a timeout, and Stripe saw two charge requests with no idempotency key.
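+
+The shape of the fix is small: derive an idempotency key from the order, not from the attempt, and send it with the charge request. A hedged sketch against Stripe's REST API (endpoint, fields, and key format shown for illustration; the official client libraries accept the same key):
+
+```python
+import requests
+
+def charge(order_id: str, amount_cents: int, api_key: str) -> dict:
+    # Same order -> same key, so a retried request is recognized, not charged twice.
+    resp = requests.post(
+        "https://api.stripe.com/v1/charges",
+        auth=(api_key, ""),
+        headers={"Idempotency-Key": f"charge-{order_id}"},
+        data={"amount": amount_cents, "currency": "usd", "source": "tok_visa"},
+        timeout=10,
+    )
+    resp.raise_for_status()
+    return resp.json()
+```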
+
+### The whole picture
+
+```
+End-to-end exactly once = at-least-once delivery + an idempotent receiving side
+
+Even with Kafka EOS, the moment you write to anything outside Kafka,
+you are back to needing idempotency in that external write.
+```
+
+### Common mistakes interviewers want you to name
+
+1. **Believing exactly once is a property of the transport.** It is really a property of the whole loop, including the consumer's side effects.
+2. **No idempotency key on external API calls.** This is the Stripe-charged-twice bug.
+3. **Recording "I am done" outside the transaction with the work.** Crash window.
+4. **Using a small in-memory dedup set** that resets on restart. Duplicates that arrive after the restart slip through.
+5. **Assuming Kafka EOS makes the database consistent.** It does not. The database is not in the Kafka transaction.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How do you size the deduplication store? You cannot keep every id forever."*
+
+You usually only need to dedupe within a window: the maximum time a duplicate could plausibly arrive. For Kafka with normal retry policies, that is minutes, not hours. So a TTL of a few hours on the dedup store is enough in practice, and you do not need infinite storage. If your idempotency key is content-based (a hash of the payload), you do not even need to keep ids: the same payload always produces the same key.
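+
+A minimal sketch of a windowed dedup store on Redis with a TTL (key prefix, TTL, and the `do_the_work` stand-in from earlier are illustrative). The first caller claims the id, later duplicates inside the window are ignored, and the key expires on its own:
+
+```python
+import redis
+
+r = redis.Redis()
+DEDUP_TTL_SECONDS = 6 * 3600         # longer than any plausible duplicate window
+
+def seen_before(message_id: str) -> bool:
+    # SET ... NX EX: succeeds only if the key did not exist yet, and sets its expiry.
+    claimed = r.set(f"dedup:{message_id}", 1, nx=True, ex=DEDUP_TTL_SECONDS)
+    return not claimed
+
+def handle(msg):
+    if seen_before(msg.id):
+        return                       # duplicate within the window, ignore
+    do_the_work(msg)
+```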
diff --git a/Problem 15: Teaching SQL Performance to a Junior/question.md b/Problem 15: Teaching SQL Performance to a Junior/question.md
new file mode 100644
index 0000000..6340af3
--- /dev/null
+++ b/Problem 15: Teaching SQL Performance to a Junior/question.md
@@ -0,0 +1,29 @@
+## Problem 15: Teaching SQL Performance to a Junior
+
+**Scenario:**
+A junior engineer on your team writes a query that joins three tables and uses a window function. The query is correct. The result is right. But it takes 8 minutes to run on a table the rest of the team queries in 4 seconds. They ask you for help, and the manager asks you to mentor them.
+
+In the interview, the question is:
+
+> A junior wrote a correct query that runs slow. How do you teach them to make it faster, step by step?
+
+This is a teaching question. The interviewer is checking whether you can mentor, not just whether you know optimization tricks.
+
+---
+
+### Your Task:
+
+1. Walk through the mental model you would teach.
+2. Show the actual checklist you would have them apply.
+3. Give a real before-and-after example.
+4. Mention the soft skills (how to help them learn, not just copy).
+
+---
+
+### What a Good Answer Covers:
+
+* Read the EXPLAIN plan first.
+* Filter early, project narrow.
+* Avoid functions on indexed columns.
+* Push aggregates before joins where possible.
+* Be honest that "correct first, fast second" is a culture, not a rule.
diff --git a/Problem 15: Teaching SQL Performance to a Junior/solution.md b/Problem 15: Teaching SQL Performance to a Junior/solution.md
new file mode 100644
index 0000000..f54934c
--- /dev/null
+++ b/Problem 15: Teaching SQL Performance to a Junior/solution.md
@@ -0,0 +1,165 @@
+## Solution 15: Teaching SQL Performance to a Junior
+
+### Short version you can say out loud
+
+> I would not start by rewriting their query. I would sit with them, run EXPLAIN, and let them see where the time is going. Most slow queries come from one of four causes: scanning too much data, joining too early, filtering with a function on the indexed column, or asking the database to sort more than it needs to. I walk them through those four, in that order, and they almost always find it themselves. The job is to give them the lens, not the answer.
+
+### The mental model I would teach them
+
+```
+A SQL query has two costs:
+ 1. How much data the engine touches.
+ 2. How much work it does per row.
+
+Optimization is mostly about touching less data.
+The rest is helping the engine pick the right plan.
+```
+
+Then I would teach four habits, in this order.
+
+### Habit 1: Read EXPLAIN before guessing
+
+The first lesson: do not optimize from intuition. Read the plan.
+
+```sql
+EXPLAIN ANALYZE
+SELECT ...
+```
+
+What I would show them on the plan:
+
+* **The biggest box on the left.** That is where the time went. Look there first.
+* **"Seq Scan" or "Full Table Scan."** The engine read the whole table. Why? Often there is no index, or the WHERE used a function that hid the index.
+* **The estimated vs actual row counts.** If they are wildly different, the optimizer is making bad decisions because its stats are wrong. Often fixed by `ANALYZE`.
+* **Nested loop join over millions of rows.** Almost always wrong. Should be hash join or merge join.
+
+### Habit 2: Filter early, project narrow
+
+Most beginners write queries like this:
+
+```sql
+-- Slow
+SELECT *
+FROM orders o
+JOIN customers c ON c.id = o.customer_id
+WHERE o.created_at > '2025-01-01';
+```
+
+And then complain that it is slow.
+
+I rewrite it with them like this:
+
+```sql
+-- Fast
+SELECT o.id, o.amount, c.name
+FROM orders o
+JOIN customers c ON c.id = o.customer_id
+WHERE o.created_at > '2025-01-01';
+```
+
+The two lessons:
+
+1. `SELECT *` reads every column. In a column store like BigQuery, this can easily scan 10x more data than you need.
+2. Always think about which side of the join is filterable. The engine can push the date filter down on `orders`, shrink that side, then join. If you only filter after the join, both sides get fully read first.
+
+### Habit 3: Do not wrap indexed columns in functions
+
+This is the single most common slow query in a junior's code:
+
+```sql
+-- Slow
+WHERE DATE(created_at) = '2025-05-14'
+
+-- Fast
+WHERE created_at >= '2025-05-14' AND created_at < '2025-05-15'
+```
+
+Why the first is slow: the index on `created_at` is on the raw timestamp values. `DATE(created_at)` is a function applied to every row, so the engine cannot use the index. It has to read every row, compute `DATE(...)`, and compare. The second version uses range comparisons directly on the indexed column.
+
+The same lesson hits with `LOWER(email) = '...'`, `CAST(id AS TEXT) = '...'`, `SUBSTR(name, 1, 3) = '...'`. Functions on indexed columns kill the index.
+
+### Habit 4: Aggregate before you join, when you can
+
+```sql
+-- Slow: joins billions, then aggregates
+SELECT c.country, SUM(o.amount)
+FROM orders o
+JOIN customers c ON c.id = o.customer_id
+GROUP BY c.country;
+
+-- Sometimes faster: aggregate first, then join the small result
+WITH per_customer AS (
+ SELECT customer_id, SUM(amount) AS total
+ FROM orders
+ GROUP BY customer_id
+)
+SELECT c.country, SUM(p.total)
+FROM per_customer p
+JOIN customers c ON c.id = p.customer_id
+GROUP BY c.country;
+```
+
+This is not always faster, and modern optimizers may do this automatically. But teaching the pattern is useful, because it makes the junior think about *what shape of data* they are passing to each step.
+
+### A real before / after I would walk them through
+
+```sql
+-- Original (8 minutes)
+SELECT
+ o.id,
+ c.name,
+ ROW_NUMBER() OVER (PARTITION BY c.country ORDER BY o.amount DESC) AS rn
+FROM orders o
+JOIN customers c ON c.id = o.customer_id
+JOIN regions r ON r.country = c.country
+WHERE EXTRACT(YEAR FROM o.created_at) = 2025
+ AND r.active = true;
+
+-- After teaching the habits (4 seconds)
+WITH active_orders AS (
+ SELECT id, customer_id, amount
+ FROM orders
+ WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'
+)
+SELECT
+ ao.id,
+ c.name,
+ ROW_NUMBER() OVER (PARTITION BY c.country ORDER BY ao.amount DESC) AS rn
+FROM active_orders ao
+JOIN customers c ON c.id = ao.customer_id
+JOIN regions r ON r.country = c.country AND r.active = true;
+```
+
+What changed:
+
+* `EXTRACT(YEAR FROM created_at)` became a range. The index on `created_at` is used.
+* `r.active = true` moved into the JOIN, so the filter happens at the same time as the join.
+* The CTE narrows `orders` to only the columns we need before joining.
+
+### The soft side of the lesson
+
+The harder part is the conversation, not the SQL. Things I keep in mind:
+
+* **Don't rewrite for them.** They will not learn. Let them read EXPLAIN. Ask "where do you think the time is going?" before pointing.
+* **Praise correctness first.** They got the right answer. That is harder than making it fast.
+* **Show one fix at a time.** If you change five things in one pass, they cannot tell which one mattered.
+* **Time each change.** Before and after on the same data. Numbers stick.
+* **Tell them what NOT to do.** Premature optimization. Adding indexes "just in case." Rewriting working queries because they look ugly.
+
+### A checklist I would actually leave with them
+
+When a query is slow:
+
+1. Run `EXPLAIN ANALYZE`. Find the biggest box.
+2. Look for full table scans. Add a filter, or check if a function is hiding an index.
+3. Look at the order of joins. The smallest filtered set should be on one side of the first join.
+4. Select only the columns you need.
+5. If using window functions, make sure the input is already as small as possible.
+6. Try aggregating before joining if both sides are huge.
+7. After each change, re-run EXPLAIN and the query. Note the time.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the query is on BigQuery, where there are no indexes?"*
+
+The habits transfer, but the levers are different. On BigQuery: think about **partition pruning** (does the WHERE hit the partition column?), **clustering** (does it hit clustered columns?), **selected columns** (BigQuery charges by bytes scanned), and **broadcast joins** (force the small side to be broadcast if the optimizer doesn't). The mental model "touch less data" stays the same, but "use the index" becomes "use the partition and the cluster."
diff --git a/Problem 16: SELECT DISTINCT Hiding Join Bugs/question.md b/Problem 16: SELECT DISTINCT Hiding Join Bugs/question.md
new file mode 100644
index 0000000..126561a
--- /dev/null
+++ b/Problem 16: SELECT DISTINCT Hiding Join Bugs/question.md
@@ -0,0 +1,27 @@
+## Problem 16: SELECT DISTINCT Hiding Join Bugs
+
+**Scenario:**
+An analyst on the team writes `SELECT DISTINCT` on almost every query. When you ask why, they say "because the joins keep duplicating rows, and DISTINCT cleans it up." The numbers in their dashboards mostly look right, but every few weeks something is off by a small amount and nobody can explain why.
+
+In the interview, the question is:
+
+> An analyst is using SELECT DISTINCT everywhere because joins keep producing duplicates. What is actually going wrong, and how would you fix it without DISTINCT?
+
+---
+
+### Your Task:
+
+1. Explain what causes the duplicates in the first place.
+2. Show the bug with a small example.
+3. Show two ways to fix it without DISTINCT.
+4. Explain why DISTINCT is a dangerous habit, not just a slow one.
+
+---
+
+### What a Good Answer Covers:
+
+* The cardinality (grain) of each table in the join.
+* "Many-to-many" joins exploding row counts.
+* Aggregating before joining.
+* Using EXISTS or a semi-join.
+* Why DISTINCT can also collapse rows that *should* be different.
diff --git a/Problem 16: SELECT DISTINCT Hiding Join Bugs/solution.md b/Problem 16: SELECT DISTINCT Hiding Join Bugs/solution.md
new file mode 100644
index 0000000..27a16e0
--- /dev/null
+++ b/Problem 16: SELECT DISTINCT Hiding Join Bugs/solution.md
@@ -0,0 +1,149 @@
+## Solution 16: SELECT DISTINCT Hiding Join Bugs
+
+### Short version you can say out loud
+
+> The duplicates are not really duplicates. They are real rows produced by joining a table to another table that has multiple matching rows. DISTINCT silently collapses them, which feels like a fix, but it also collapses things that genuinely should be separate. Two rows that look the same in the selected columns might come from different orders, different timestamps, different reasons. The right fix is to think about the **grain** of each table in the join, and either filter, aggregate, or use a semi-join so each row appears at most once on purpose.
+
+### Where the duplicates come from
+
+```
+orders
+─────────────────
+id │ customer_id
+1  │ 100
+2  │ 100
+3  │ 200
+
+order_items
+──────────────────
+order_id │ product
+1        │ apple
+1        │ banana
+2        │ apple
+3        │ apple
+3        │ apple
+
+
+SELECT o.id, o.customer_id
+FROM orders o
+JOIN order_items i ON i.order_id = o.id;
+
+Result:
+id │ customer_id
+1 │ 100 ← order 1 has 2 items, so it shows twice
+1 │ 100
+2 │ 100
+3 │ 200 ← order 3 has 2 items of the same product
+3 │ 200
+```
+
+The analyst sees order 1 twice and thinks "I need DISTINCT." But the row repetition is correct given the join. It is just that they joined a one-row-per-order table to a one-row-per-item table, so the result has one row per item.
+
+The bug is the *choice of join*, not a duplicate problem.
+
+### Why DISTINCT is the dangerous fix
+
+```sql
+SELECT DISTINCT o.id, o.customer_id
+FROM orders o
+JOIN order_items i ON i.order_id = o.id;
+```
+
+This returns:
+```
+id │ customer_id
+1 │ 100
+2 │ 100
+3 │ 200
+```
+
+Looks clean. But three real risks:
+
+1. **It hides intent.** Three months later, someone adds `i.product` to the SELECT. Now DISTINCT does not collapse the rows anymore, the numbers explode, nobody understands why.
+2. **It can collapse rows that should differ.** If you wrote `SELECT DISTINCT customer_id, country` and a customer has two countries in your history (they moved), you silently lose the second one.
+3. **It hides bugs.** Imagine the join condition was wrong. Without DISTINCT, you would have seen huge duplication and known something was off. With DISTINCT, the numbers just quietly drift.
+
+### The right way to think about it
+
+Every table has a **grain**: the level at which one row means one thing. Examples:
+
+* `customers` grain: one row per customer.
+* `orders` grain: one row per order.
+* `order_items` grain: one row per item within an order.
+* `meter_reads` grain: one row per meter per 15-minute interval.
+
+When you join, the result has the grain of the **finest** table in the join. If you want one row per customer in the final result, you cannot join straight to `order_items`. You have to **aggregate first** or use a **semi-join**.
+
+### Fix 1: Aggregate before joining
+
+If the analyst wanted "one row per order with the number of items," aggregate `order_items` first:
+
+```sql
+WITH items_per_order AS (
+ SELECT order_id, COUNT(*) AS item_count
+ FROM order_items
+ GROUP BY order_id
+)
+SELECT o.id, o.customer_id, COALESCE(i.item_count, 0) AS item_count
+FROM orders o
+LEFT JOIN items_per_order i ON i.order_id = o.id;
+```
+
+Now the join is one to one. No duplicates. No DISTINCT.
+
+### Fix 2: Use EXISTS (semi-join)
+
+If the analyst wanted "all orders that have at least one item," they do not need to join at all:
+
+```sql
+SELECT o.id, o.customer_id
+FROM orders o
+WHERE EXISTS (
+ SELECT 1
+ FROM order_items i
+ WHERE i.order_id = o.id
+);
+```
+
+`EXISTS` stops at the first match. It does not multiply rows. This is called a semi-join. It is exactly the right tool when the question is "does any matching row exist," and most engines optimize it well.
+
+The opposite, `NOT EXISTS`, finds orders with no items at all.
+
+### Fix 3: Window or row_number filter
+
+Sometimes you genuinely want "one row per group, but I need columns from the child table too." Use `ROW_NUMBER`:
+
+```sql
+SELECT id, customer_id, product
+FROM (
+ SELECT o.id, o.customer_id, i.product,
+ ROW_NUMBER() OVER (PARTITION BY o.id ORDER BY i.created_at) AS rn
+ FROM orders o
+ JOIN order_items i ON i.order_id = o.id
+) ranked
+WHERE rn = 1;
+```
+
+This says "for each order, keep only the first item." Explicit, intentional, and the next reader knows exactly what you meant.
+
+### A useful sanity check
+
+If you claimed the result has one row per X, the total row count should equal the number of distinct X values. If it doesn't, something is wrong. Before sharing a result, do a quick row count check:
+
+```sql
+SELECT COUNT(*), COUNT(DISTINCT id) FROM result;
+```
+
+If these differ, you have duplicates you did not expect. Either fix the query or be explicit about why duplicates are intended.
+
+### Common mistakes interviewers want you to name
+
+1. **DISTINCT as a habit, not a decision.** Sprinkled because the result looked off, not because the question asked for unique rows.
+2. **DISTINCT on a query you later edit.** It only worked because the SELECT list was narrow. Add a column, it stops working.
+3. **DISTINCT to "fix" a many-to-many join.** The real fix is to aggregate or filter.
+4. **Counting users with DISTINCT user_id** in a query that already has duplicates from joins. The count is right, but the rest of the query may be carrying wrong totals from the same duplication.
+5. **Not understanding grain.** This is the deeper issue. Once you can say the grain of each table out loud, the DISTINCT problem disappears.
+
+### Bonus follow-up the interviewer might throw
+
+> *"When IS DISTINCT actually the right tool?"*
+
+When you genuinely have a set, not a multiset, and you want to deduplicate it. For example:
+
+* Listing the unique countries that appeared in a column.
+* Building a deduplicated list of email addresses from multiple source files where the same address really might appear twice and you want it once.
+
+The key is intent: you should be able to say out loud "I expect duplicates here and I want one of each." If the answer is "I just don't want the join to multiply rows," DISTINCT is the wrong fix.
diff --git a/Problem 17: Reading an EXPLAIN Plan/question.md b/Problem 17: Reading an EXPLAIN Plan/question.md
new file mode 100644
index 0000000..de3dffc
--- /dev/null
+++ b/Problem 17: Reading an EXPLAIN Plan/question.md
@@ -0,0 +1,27 @@
+## Problem 17: Reading an EXPLAIN Plan
+
+**Scenario:**
+A query is slow. You ask the engineer "what does the EXPLAIN plan say?" and they shrug. Most engineers know `EXPLAIN ANALYZE` exists but freeze when they see the actual output. The interviewer wants to know if you can read it confidently and use it to diagnose.
+
+In the interview, the question is:
+
+> You see an EXPLAIN plan for the first time. Talk me through what you actually look at, and in what order.
+
+---
+
+### Your Task:
+
+1. Explain what EXPLAIN tells you, in plain words.
+2. Walk through the order in which you would scan a plan.
+3. Show a sample plan and point out what would jump out to you.
+4. List the four or five things that consistently cause slow queries.
+
+---
+
+### What a Good Answer Covers:
+
+* The difference between EXPLAIN and EXPLAIN ANALYZE.
+* Reading a plan bottom up (or innermost out).
+* Row estimate vs actual row count: the biggest hint.
+* Join types and what they mean.
+* Common red flags: Seq Scan on big tables, Nested Loop with millions of rows, Sort spilling to disk.
diff --git a/Problem 17: Reading an EXPLAIN Plan/solution.md b/Problem 17: Reading an EXPLAIN Plan/solution.md
new file mode 100644
index 0000000..57837a5
--- /dev/null
+++ b/Problem 17: Reading an EXPLAIN Plan/solution.md
@@ -0,0 +1,120 @@
+## Solution 17: Reading an EXPLAIN Plan
+
+### Short version you can say out loud
+
+> EXPLAIN shows the query engine's plan for executing my query. I scan it from the inside out, because that is the order it runs. The four things I always look at first are: where it scans the biggest tables (Seq Scan vs Index Scan), the join types it picked, the gap between estimated rows and actual rows, and any Sort or Hash that spilled to disk. Those four explain probably 80 percent of slow queries.
+
+### EXPLAIN vs EXPLAIN ANALYZE
+
+`EXPLAIN` shows the planner's guess of what it will do. Cheap, returns instantly.
+
+`EXPLAIN ANALYZE` actually runs the query and gives you real timing and real row counts. This is the one you almost always want when debugging.
+
+Be careful: `EXPLAIN ANALYZE` runs the query, including any side effects. Don't `EXPLAIN ANALYZE` a `DELETE` unless you mean it.
+
+### A sample plan to read together
+
+```
+EXPLAIN ANALYZE
+SELECT c.country, SUM(o.amount)
+FROM orders o
+JOIN customers c ON c.id = o.customer_id
+WHERE o.created_at >= '2025-01-01'
+GROUP BY c.country;
+```
+
+Output (Postgres-style, simplified):
+
+```
+HashAggregate (cost=120000.00..120004.00 rows=4 width=40)
+ (actual time=8421.5..8421.7 rows=4 loops=1)
+ Group Key: c.country
+ -> Hash Join (cost=2500.00..115000.00 rows=2000000 width=18)
+ (actual time=180.2..7900.0 rows=1850000 loops=1)
+ Hash Cond: (o.customer_id = c.id)
+ -> Seq Scan on orders o (cost=0.00..100000.00 rows=2000000 width=12)
+ (actual time=0.05..3500 rows=1850000 loops=1)
+ Filter: (created_at >= '2025-01-01')
+ Rows Removed by Filter: 6150000
+ -> Hash (cost=2000.00..2000.00 rows=200000 width=14)
+ (actual time=160 rows=200000 loops=1)
+ -> Seq Scan on customers c (rows=200000)
+Planning Time: 0.5 ms
+Execution Time: 8422 ms
+```
+
+Don't panic. Read it inside out.
+
+### How to read it, step by step
+
+**1. Find the innermost steps first.**
+The plan is a tree. The innermost nodes (deepest indentation) run first. In our plan, that is the two `Seq Scan` nodes. Reading inside out:
+
+```
+Seq Scan on orders → filter by date (3.5 seconds, returns 1.85M rows)
+Seq Scan on customers → loaded into a Hash (0.16 seconds)
+Hash Join the two (7.9 seconds total)
+HashAggregate the result (8.42 seconds total)
+```
+
+**2. Look at the biggest time.**
+The biggest jump is the join itself: 3.5 seconds for the orders scan, but 7.9 seconds total at the join level means the join phase itself ate ~4 seconds. That is where to focus.
+
+**3. Compare estimated rows to actual rows.**
+The planner estimated 2,000,000 rows from the orders scan and got 1,850,000. Close enough. If those numbers were off by 100x (e.g., estimate 20,000, actual 2,000,000) the optimizer was probably picking a bad plan. Fix with `ANALYZE` on the table or by updating stats.
+
+**4. Check for the red flags.**
+
+| Red flag | What it usually means |
+| --------------------------------- | ------------------------------------------------------ |
+| `Seq Scan` on a huge table | Missing index, function on indexed column, or `SELECT *` |
+| `Nested Loop` over millions | Optimizer thought one side was tiny. It wasn't. |
+| `Hash` with a huge build side | The "smaller" side isn't small. Memory pressure. |
+| `Sort` writing to disk | Working set didn't fit in memory. Slow. |
+| Estimate vs actual off by 10x+ | Bad statistics, query planner is flying blind. |
+| `Rows Removed by Filter` very high | You read 8M rows to keep 100k. Filter earlier. |
+
+### Join types you will see, in plain words
+
+* **Nested Loop**: for each row on the left, look up matching rows on the right. Cheap when one side is tiny and the other side has an index. Disastrous when both sides are big.
+* **Hash Join**: build a hash table on one side (the "build" side), probe it from the other. Best when both sides are medium to large. Cost is roughly read both sides once.
+* **Merge Join**: both sides are sorted on the join key, then merged. Used when the optimizer already has both sides sorted, or for very large joins where hashing would not fit in memory.
+
+The most common bad plan you will see: a Nested Loop that the optimizer chose because it thought the outer side was tiny, but it wasn't. The fix is usually to update stats or rewrite the filter.
+
+### Things to do once you spot a problem
+
+* **Seq Scan on a filtered query**: add an index on the filter column, or rewrite to remove a function on the indexed column.
+* **Bad estimate**: run `ANALYZE` on the table to refresh stats. Sometimes increase `default_statistics_target` for skewed columns.
+* **Nested Loop bug**: try forcing a hash join with hints (`pg_hint_plan` in Postgres) or rewrite to give the optimizer a clearer shape.
+* **Sort to disk**: increase `work_mem` for that session, or add an index that produces presorted output, or reduce the data you sort.
+
+### EXPLAIN on BigQuery, Snowflake, etc.
+
+Each engine shows it differently:
+
+* **BigQuery** shows an execution plan in the job details: stages, slot time, rows in / out, bytes shuffled. The first things to scan are `bytes processed` (cost) and `slot time` (the actual work). Big shuffle stages are the equivalent of a sort spill.
+* **Snowflake** shows a graph in the UI. Click the slowest box. Look for "remote disk I/O" (bad), "spilling to local storage" (bad), and "broadcast vs hash partitioned" joins.
+* **Spark** has `df.explain(True)` for the logical plan and the Spark UI for the physical execution.
+
+Different vocabulary, same lens: find the biggest box, check estimate vs actual, look for red flags.
+
+### A 30-second checklist to keep in your head
+
+When you see a plan:
+
+1. Look at the **total time** at the top.
+2. Find the **biggest single step**.
+3. Is it a **Seq Scan on a big table**? Why? Can you add an index or filter earlier?
+4. Are the **row estimates** off?
+5. Is anything **spilling to disk** (Sort, Hash)?
+6. What **join types** were chosen? Do they make sense for the sizes?
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the plan looks fine but the query is still slow?"*
+
+Two common causes:
+
+1. **Lock contention.** The query is fast in isolation but waits behind other transactions. Check `pg_stat_activity` or your engine's equivalent.
+2. **Cold cache.** First run hits disk, second run hits memory. Run the query twice and compare. If the second run is much faster, you are I/O bound. Either accept the cold cost, warm the cache deliberately, or rebuild the table to be smaller (drop unused columns, partition).
diff --git a/Problem 18: CTE vs Subquery/question.md b/Problem 18: CTE vs Subquery/question.md
new file mode 100644
index 0000000..2168a81
--- /dev/null
+++ b/Problem 18: CTE vs Subquery/question.md
@@ -0,0 +1,27 @@
+## Problem 18: CTE vs Subquery
+
+**Scenario:**
+A teammate refactors a long query, replacing every subquery with a CTE (`WITH ... AS`) "for readability." The next day, the query is 4x slower in production. They are confused, because in other databases CTEs are usually at least as fast as subqueries. You explain that this is a Postgres-specific (and historically common) gotcha that newer versions changed.
+
+In the interview, the question is:
+
+> When would you choose a CTE over a subquery, and when does it actually matter for performance?
+
+---
+
+### Your Task:
+
+1. Explain what a CTE is, and what a subquery is, in one line each.
+2. Explain the readability case for CTEs.
+3. Explain the historical performance trap.
+4. Give a clear rule of thumb for which to use today.
+
+---
+
+### What a Good Answer Covers:
+
+* CTEs as "named temporary results."
+* Postgres < 12 "optimization fence" behavior.
+* Modern engines mostly inline CTEs.
+* When you actually want a *materialized* CTE.
+* Recursive CTEs, which only CTEs can do.
diff --git a/Problem 18: CTE vs Subquery/solution.md b/Problem 18: CTE vs Subquery/solution.md
new file mode 100644
index 0000000..099ee61
--- /dev/null
+++ b/Problem 18: CTE vs Subquery/solution.md
@@ -0,0 +1,128 @@
+## Solution 18: CTE vs Subquery
+
+### Short version you can say out loud
+
+> A CTE is a named result you build with `WITH ... AS (...)` and then reference by name later. A subquery is an inline `SELECT` nested inside another query. They are usually equivalent in meaning. The reason CTEs matter is mostly readability: you can name a step, reuse it, and read the query top to bottom like a paragraph. The reason they were a performance trap is that older Postgres treated CTEs as an "optimization fence," meaning it would materialize them before using them, even when a subquery would have been folded into the main plan. Modern Postgres (12+), BigQuery, Snowflake and most others inline CTEs by default, so the difference is mostly stylistic again.
+
+### Same query, two ways
+
+```sql
+-- Subquery
+SELECT c.country, COUNT(*) AS big_orders
+FROM customers c
+JOIN (
+ SELECT customer_id, amount
+ FROM orders
+ WHERE amount > 1000
+) big ON big.customer_id = c.id
+GROUP BY c.country;
+
+
+-- CTE
+WITH big_orders AS (
+ SELECT customer_id, amount
+ FROM orders
+ WHERE amount > 1000
+)
+SELECT c.country, COUNT(*) AS big_orders
+FROM customers c
+JOIN big_orders b ON b.customer_id = c.id
+GROUP BY c.country;
+```
+
+Same result, same intent. Different shape.
+
+### Why I usually reach for a CTE
+
+* **Naming a step.** "These are the big orders" reads better than a nested SELECT.
+* **Reusing a step.** I can reference `big_orders` twice in the same query. A subquery would have to be repeated.
+* **Top-to-bottom flow.** You build up small intermediate results, then combine them. Same idea as breaking a Python function into smaller helpers.
+
+For long queries (5+ joins, multiple aggregates), this matters a lot. A 100-line query with five CTEs reads like five short paragraphs. The same logic with nested subqueries reads like one paragraph with five embedded clauses.
+
+### The historical trap (Postgres < 12, and a few others)
+
+Older Postgres treated every CTE as a "materialization fence." It computed the CTE once, stored the result in a temporary structure, and then read from that structure in the main query. That sounds fine, but it killed plan optimization. Consider:
+
+```sql
+WITH recent AS (
+ SELECT * FROM orders
+)
+SELECT * FROM recent WHERE created_at > '2025-05-01';
+```
+
+In old Postgres, the CTE materialized the entire `orders` table first, then filtered. A naked subquery would have pushed the WHERE down into the scan. So the CTE version could read 10x more data.
+
+The fix in modern Postgres: CTEs are inlined by default if they are referenced once and have no side effects. You can force the old behavior with `WITH recent AS MATERIALIZED (...)`. You can force inlining with `WITH recent AS NOT MATERIALIZED (...)`.
+
+In BigQuery, Snowflake, Redshift, DuckDB, the optimizer almost always treats CTEs as views (inlined). The trap mostly does not apply there.
+
+### When do you WANT a materialized CTE?
+
+Sometimes you specifically want the CTE computed once and reused. Use cases:
+
+1. **The CTE is expensive and referenced many times.** Inlining means running it every time. Materializing means once.
+2. **The CTE has side effects** (DML inside a CTE, like `INSERT ... RETURNING`). It has to run exactly once.
+3. **You want to force a specific join order** that the optimizer keeps getting wrong. Materializing creates a hard boundary.
+
+If your engine does not auto-detect this, you can be explicit:
+
+```sql
+WITH expensive AS MATERIALIZED (
+ SELECT ... -- referenced 5 times below
+)
+SELECT ...
+```
+
+### What CTEs can do that subqueries cannot
+
+**Recursive queries.** Only CTEs support recursion:
+
+```sql
+WITH RECURSIVE org_tree AS (
+ SELECT id, manager_id, name, 1 AS level
+ FROM employees
+ WHERE manager_id IS NULL
+ UNION ALL
+ SELECT e.id, e.manager_id, e.name, t.level + 1
+ FROM employees e
+ JOIN org_tree t ON e.manager_id = t.id
+)
+SELECT * FROM org_tree;
+```
+
+Walking trees, graph traversal, parent-child unfolding. No subquery form exists for this.
+
+### My rule of thumb today
+
+* Default to CTEs for **anything over 20 lines** or **anything with more than one logical step**. Readability is the bigger long-term cost.
+* Use a subquery when the inner part is **tiny and used exactly once**, like `WHERE id IN (SELECT customer_id FROM banned_users)`.
+* On Postgres < 12, watch for materialization. If a query gets slow when you switch from subquery to CTE, that is the cause.
+* On modern engines, write whichever reads better. The plan will be the same.
+* Use `MATERIALIZED` or a real temp table when the CTE is expensive and reused, and the optimizer is not picking that up.
+
+### A note on temp tables vs CTEs
+
+If a "CTE" is reused across multiple queries (not just inside one query), it should not be a CTE. It should be a temp table:
+
+```sql
+CREATE TEMP TABLE big_orders AS
+SELECT customer_id, amount FROM orders WHERE amount > 1000;
+
+-- Now reuse big_orders in many queries this session.
+```
+
+This is a clearer signal to anyone reading the code, and to the engine, that this is a real intermediate result.
+
+### Common mistakes interviewers want you to name
+
+1. **Switching everything to CTEs "for readability"** on an old Postgres and watching plans get worse.
+2. **Using `WITH RECURSIVE` for things that have a known finite depth** (3-level org chart) when a self-join would have been clearer.
+3. **Paying for the same CTE at every reference.** If the planner inlines a CTE you reference twice, you compute it twice. Add `MATERIALIZED` when it is expensive and reused.
+4. **Treating CTEs as variables.** They are not. You cannot assign to them. They are query-level views.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What about CTEs in BigQuery specifically? Is there a cost difference?"*
+
+In BigQuery, CTEs are inlined into the SQL, so they do not change the bytes scanned. But if you reference the same CTE multiple times, the underlying query runs each time. For very expensive CTEs referenced many times, write the intermediate to a real table or use a `TEMP TABLE` in a multi-statement query, so the work happens once.
diff --git a/Problem 19: Same Query Different Answers/question.md b/Problem 19: Same Query Different Answers/question.md
new file mode 100644
index 0000000..b006553
--- /dev/null
+++ b/Problem 19: Same Query Different Answers/question.md
@@ -0,0 +1,30 @@
+## Problem 19: Same Query, Different Answers
+
+**Scenario:**
+You have one SQL query. You run it in development against a copy of yesterday's production data. It returns 142,500 rows. You run the exact same SQL in production. It returns 138,920 rows. The team is told "the data is the same." Something is off, and it is not obvious where.
+
+In the interview, the question is:
+
+> The same query gives one answer in dev and a different answer in production, even though the data is supposed to be identical. What kinds of bugs would you check for?
+
+This is a "be systematic" question. The interviewer wants to see your debugging instincts.
+
+---
+
+### Your Task:
+
+1. List the categories of reasons this can happen.
+2. Walk through how you would actually check each one.
+3. Explain why this kind of bug is so common.
+
+---
+
+### What a Good Answer Covers:
+
+* The data is not actually the same.
+* Time zone differences.
+* Late-arriving rows.
+* Different SQL dialects / casing / collation.
+* Session settings (date format, time zone, character set).
+* Sampling, table partitions, materialized views.
+* Permission filtering on production (row-level security).
diff --git a/Problem 19: Same Query Different Answers/solution.md b/Problem 19: Same Query Different Answers/solution.md
new file mode 100644
index 0000000..5a72d54
--- /dev/null
+++ b/Problem 19: Same Query Different Answers/solution.md
@@ -0,0 +1,161 @@
+## Solution 19: Same Query, Different Answers
+
+### Short version you can say out loud
+
+> When a query gives different numbers in two environments and the data is "supposed to be the same," the data is almost never actually the same. I check seven things, in this order: time zones, row-level security or filtered views, late-arriving rows, exact table identity, session settings, collation and case sensitivity, and whether one environment has materialized views or aggregates that the other doesn't. About 90 percent of the time it is one of the first three.
+
+### My mental order of checks
+
+```
+1. Time zone difference? ← starts here most often
+2. Row-level security / views the user can see?
+3. Late-arriving rows: when was each snapshot taken?
+4. Are you really reading the same table?
+5. Session settings (date format, locale, NULLs ordering)?
+6. Collation / case sensitivity / trim differences?
+7. Materialized views or aggregated layers?
+```
+
+Let me walk through each.
+
+### 1. Time zones
+
+The single most common cause. The query has a date filter like:
+
+```sql
+WHERE created_at >= '2025-05-14'
+```
+
+In dev, the database time zone is UTC. In production, it is America/New_York. The exact same data, the exact same filter, returns different rows because midnight is at a different moment.
+
+Or worse, the data was inserted with timestamps in one zone, and the query interprets them in another:
+
+```sql
+-- dev session
+SHOW TIMEZONE; -- UTC
+SELECT NOW(); -- 2025-05-14 23:50:00+00
+
+-- production session
+SHOW TIMEZONE; -- America/New_York
+SELECT NOW(); -- 2025-05-14 19:50:00-04
+```
+
+How I check: run `SHOW TIMEZONE` and `SELECT CURRENT_SETTING('TimeZone')` in both. Compare. Or even simpler: rerun the query with explicit UTC bounds, like `created_at >= '2025-05-14 00:00:00 UTC'`, in both environments.
+
+### 2. Row-level security or views the user can see
+
+In dev I often log in as a power user or admin. In production I might be running as a service account with row-level security policies. The same query then returns a subset.
+
+Examples:
+
+* A view that filters `WHERE tenant_id = current_tenant()` exists in prod but not in dev.
+* A row-level security policy hides rows from non-owner users.
+* The table in production is actually a sharded view, and the user only sees some shards.
+
+How I check: run a quick `SELECT COUNT(*) FROM the_table` as both users on production. If they differ, I'm hitting RLS or a filtered view.
+
+### 3. Late-arriving rows
+
+You took a dev snapshot at 6 AM. Production keeps getting new rows during the day. When you run the query in production at 3 PM, the result includes 9 more hours of data.
+
+How I check: pick a stable lower bound. `WHERE created_at < '2025-05-14 06:00:00 UTC'` in both. If now they match, the difference was just freshness.
+
+### 4. Are you really reading the same table?
+
+It sounds silly but it bites people often.
+
+* In dev, the analyst pointed at `analytics.orders`.
+* In prod, "analytics.orders" is actually a view over `raw.orders` joined to `customers` with an inner join, which drops orders for deleted customers.
+* Or the search path is different and you are reading a totally different schema.
+
+How I check:
+
+```sql
+SELECT pg_get_viewdef('analytics.orders', true); -- Postgres
+-- Or in BigQuery
+SELECT ddl FROM `project.dataset.INFORMATION_SCHEMA.VIEWS`
+WHERE table_name = 'orders';
+```
+
+Compare the DDL. The view definitions sometimes differ between environments.
+
+### 5. Session settings
+
+A few that quietly change query results:
+
+* `lc_collate` and `lc_ctype` affect string sorting and comparison.
+* `default_text_search_config` affects `LIKE` and full-text search behavior.
+* Default NULL ordering in ORDER BY (`NULLS FIRST` vs `NULLS LAST`) can swap row order, which matters if you use `LIMIT N`.
+* In BigQuery and Snowflake, default time zone for unparsed timestamps.
+
+How I check: dump the relevant session settings. `SHOW ALL` in Postgres. `SHOW PARAMETERS` in Snowflake. Compare.
+
+### 6. Collation and case sensitivity
+
+This one is sneaky. A WHERE clause like:
+
+```sql
+WHERE country = 'sg'
+```
+
+* On a case-insensitive collation, matches `SG`, `sg`, `Sg`.
+* On a case-sensitive collation, matches only `sg`.
+
+Or trailing whitespace. `'SG'` and `'SG '` look the same on screen but compare unequal.
+
+How I check: run `SELECT DISTINCT country, LENGTH(country) FROM orders WHERE LOWER(country) = 'sg'` and look for surprises.
+
+### 7. Materialized views or aggregated layers
+
+You think you queried the raw table, but in prod the query is silently rewritten to use a materialized view that lags behind. Some warehouses do this automatically (BigQuery materialized views, Oracle's query rewrite). The MV may have been built at 4 AM and is missing 11 hours of data.
+
+How I check: check the query plan in prod. Look for a different table name than expected. In BigQuery, look at "referenced tables" in the job details.
+
+### A short script I would run in both environments
+
+```sql
+-- Where are we?
+SELECT CURRENT_DATABASE(), CURRENT_SCHEMA(), CURRENT_USER, CURRENT_TIMEZONE();  -- exact function names vary by engine
+
+-- What does the table say?
+SELECT COUNT(*) AS total,
+ MIN(created_at) AS first_row,
+ MAX(created_at) AS last_row
+FROM the_table;
+
+-- How does it look in UTC?
+SELECT COUNT(*)
+FROM the_table
+WHERE created_at AT TIME ZONE 'UTC' >= '2025-05-14 00:00:00'
+ AND created_at AT TIME ZONE 'UTC' < '2025-05-15 00:00:00';
+```
+
+Comparing these three between dev and prod will catch the cause in almost every case I have seen.
+
+### Why this kind of bug is so common
+
+Because "the data is the same" is almost always wrong in subtle ways. Dev environments are usually:
+
+* Refreshed at a fixed time, not continuously.
+* Anonymized, which can subtly change row counts or distributions.
+* Run with different user permissions.
+* Configured with different session defaults.
+
+The fix is not paranoia in queries. The fix is to make environment differences observable. A simple "where am I" header at the top of every notebook saves hours of confusion.
+
+### Common mistakes interviewers want you to name
+
+1. **Assuming dev = prod.** It almost never is.
+2. **Trusting LIMIT in dev.** Ties in the ORDER BY make LIMIT non-deterministic. Counts can match while the actual rows differ.
+3. **Running `EXPLAIN` only in dev.** A different plan in prod can mean a different result if the query has subtle ordering bugs.
+4. **Ignoring NULL handling.** `COUNT(col)` vs `COUNT(*)` behaves differently with NULLs. If a column has more NULLs in one environment, the counts differ.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you make this kind of bug less likely going forward?"*
+
+Three habits help:
+
+1. **Always include a deterministic time bound.** No `WHERE date >= CURRENT_DATE`. Make the bound explicit and absolute.
+2. **Use the same time zone in all environments.** UTC is the cheap default.
+3. **Snapshot dev data with a recorded `as_of` timestamp** and include that timestamp in the query when comparing. You will catch freshness mismatches in the first 30 seconds.
diff --git a/Problem 20: Window Functions vs GROUP BY/question.md b/Problem 20: Window Functions vs GROUP BY/question.md
new file mode 100644
index 0000000..47752d1
--- /dev/null
+++ b/Problem 20: Window Functions vs GROUP BY/question.md
@@ -0,0 +1,27 @@
+## Problem 20: Window Functions vs GROUP BY
+
+**Scenario:**
+A teammate is writing a query for a marketing dashboard. They want each row to show the order along with "this customer's total spend so far." They keep getting stuck because `GROUP BY` collapses the rows. They ask why their query keeps disappearing the order detail.
+
+In the interview, the question is:
+
+> Explain when you would reach for a window function instead of GROUP BY. Use an example you would actually draw on a whiteboard.
+
+---
+
+### Your Task:
+
+1. Explain GROUP BY in one line.
+2. Explain a window function in one line.
+3. Show one small example where each is the right tool.
+4. Cover the three things window functions can do that GROUP BY cannot.
+
+---
+
+### What a Good Answer Covers:
+
+* GROUP BY collapses rows; window functions keep them.
+* PARTITION BY vs GROUP BY.
+* Running totals, ranking, lag and lead.
+* The performance side: window functions are not free.
+* The "use both" pattern.
diff --git a/Problem 20: Window Functions vs GROUP BY/solution.md b/Problem 20: Window Functions vs GROUP BY/solution.md
new file mode 100644
index 0000000..6399f7b
--- /dev/null
+++ b/Problem 20: Window Functions vs GROUP BY/solution.md
@@ -0,0 +1,150 @@
+## Solution 20: Window Functions vs GROUP BY
+
+### Short version you can say out loud
+
+> GROUP BY collapses many rows into one row per group. Window functions compute a value across a group but keep every row. If the question is "give me one number per customer," I use GROUP BY. If the question is "for each order, also show this customer's running total," I use a window function. The key tell is the word "for each row also" in the requirement.
+
+### The whiteboard picture
+
+```
+Same data, two shapes.
+
+orders
+─────────────────────────────────────────
+order_id │ customer_id │ amount │ created_at
+1 │ A │ 100 │ Jan 1
+2 │ A │ 50 │ Jan 5
+3 │ B │ 200 │ Jan 3
+4 │ A │ 70 │ Jan 10
+
+
+GROUP BY (collapses to one row per group)
+─────────────────────────────────────────
+SELECT
+  customer_id,
+  SUM(amount) AS total
+FROM orders
+GROUP BY customer_id;
+
+Result:
+customer │ total
+A        │ 220
+B        │ 200
+
+
+WINDOW (keeps every row)
+─────────────────────────────────────────
+SELECT
+  order_id, customer_id, amount,
+  SUM(amount) OVER (
+    PARTITION BY customer_id
+    ORDER BY created_at
+  ) AS running_total
+FROM orders;
+
+Result:
+order │ customer │ amount │ running
+1     │ A        │ 100    │ 100
+2     │ A        │ 50     │ 150
+3     │ B        │ 200    │ 200
+4     │ A        │ 70     │ 220
+```
+
+Notice: GROUP BY returned 2 rows. WINDOW returned 4 rows. Same SUM, different result shape.
+
+### What each one is, in one line
+
+* **GROUP BY**: collapse rows that share group-key values into one row per group, with aggregates over the group.
+* **Window function**: compute an aggregate (or ranking, or lag) for each row, looking at a window of related rows around it, without collapsing anything.
+
+### When to reach for each
+
+| Need | Use |
+| --------------------------------------------------- | ----------- |
+| Total revenue per customer (one row per customer) | GROUP BY |
+| Count of orders per country | GROUP BY |
+| For each order, show running total per customer | Window |
+| Rank customers by spend within each country | Window |
+| Find each customer's previous order date | Window (LAG)|
+| For each row, compare the value to the group avg | Window |
+| Pick the top 1 row per group | Window (ROW_NUMBER)|
+
+The shortcut: if the answer should have **more rows than the number of groups**, you need a window.
+
+### Three things window functions do that GROUP BY cannot
+
+**1. Running totals and moving averages.**
+
+```sql
+SELECT
+ date,
+ amount,
+ SUM(amount) OVER (ORDER BY date) AS cumulative,
+ AVG(amount) OVER (
+ ORDER BY date
+ ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
+ ) AS rolling_7day_avg
+FROM daily_sales;
+```
+
+GROUP BY cannot do this because it would force one row per group.
+
+**2. Ranking inside a group.**
+
+```sql
+SELECT
+ customer_id, country, total_spend,
+ RANK() OVER (PARTITION BY country ORDER BY total_spend DESC) AS rank_in_country
+FROM customer_totals;
+```
+
+`ROW_NUMBER`, `RANK`, `DENSE_RANK` give you the row's position inside its partition. Indispensable for "top N per group" queries.
+
+**3. Lookups to previous / next rows.**
+
+```sql
+SELECT
+ meter_id, reading_time, value,
+ LAG(value) OVER (PARTITION BY meter_id ORDER BY reading_time) AS prev,
+ value - LAG(value) OVER (PARTITION BY meter_id ORDER BY reading_time) AS delta
+FROM meter_reads;
+```
+
+`LAG` and `LEAD` make "compare this row to the next/previous one" trivial. Without windows, you would need a self-join.
+
+### The "top N per group" pattern
+
+This is so common it is worth memorizing:
+
+```sql
+WITH ranked AS (
+ SELECT
+ *,
+ ROW_NUMBER() OVER (PARTITION BY country ORDER BY total_spend DESC) AS rn
+ FROM customer_totals
+)
+SELECT * FROM ranked WHERE rn <= 3;
+```
+
+This gives you the top 3 customers in each country. With GROUP BY alone, you would have to do tricks with self-joins or LATERAL queries.
+
+### The PARTITION BY vs GROUP BY confusion
+
+A common bug: writing `GROUP BY country` in a query that also has window functions, and being surprised the windows behave oddly.
+
+* `GROUP BY country` collapses to one row per country. Window functions then operate on those collapsed rows.
+* `PARTITION BY country` inside an OVER clause divides the rows into groups *without* collapsing.
+
+These are different. They can coexist, but you have to be deliberate.
+
+### Performance notes
+
+Window functions are not free. They typically require a sort by the `PARTITION BY` columns, then by the `ORDER BY` columns. For very large tables, that sort can be the bottleneck.
+
+Two practical things:
+
+1. **Filter early.** Run the window only on the rows you need.
+2. **Use the right window.** `ROWS BETWEEN ... AND CURRENT ROW` is much cheaper than `RANGE` for time-based windows on some engines.
+
+### Common mistakes interviewers want you to name
+
+1. **Trying to GROUP BY when the answer needs per-row detail.** "Show me each order with the customer's lifetime total." GROUP BY cannot keep the order row.
+2. **Forgetting `ORDER BY` inside the window.** `SUM(...) OVER (PARTITION BY x)` without ORDER BY is the same total on every row. Often someone meant a running total but got the grand total.
+3. **Filtering on a window column in WHERE.** You cannot. WHERE runs before windows. Use a subquery or CTE and filter on the alias in the outer SELECT.
+4. **PARTITION BY too coarse.** "Top 5 products overall" needs no partition. "Top 5 products in each country" needs `PARTITION BY country`. Easy to mix up.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What is the difference between RANK, DENSE_RANK and ROW_NUMBER?"*
+
+* `ROW_NUMBER`: every row gets a unique number, ties broken arbitrarily.
+* `RANK`: ties share a number, then it skips. `1, 2, 2, 4`.
+* `DENSE_RANK`: ties share a number, no skip. `1, 2, 2, 3`.
+
+Use `ROW_NUMBER` when you want exactly one row per "first place" (top-N-per-group). Use `RANK` or `DENSE_RANK` when ties matter, like leaderboards.
diff --git a/Problem 21: Data Platform for an Electricity Retailer/question.md b/Problem 21: Data Platform for an Electricity Retailer/question.md
new file mode 100644
index 0000000..4d80829
--- /dev/null
+++ b/Problem 21: Data Platform for an Electricity Retailer/question.md
@@ -0,0 +1,35 @@
+## Problem 21: Data Platform for an Electricity Retailer
+
+**Scenario:**
+A small electricity retailer has 200,000 residential customers. Every customer has a smart meter that sends one reading every 15 minutes. The business needs:
+
+* Accurate monthly bills.
+* A "your usage" page in the customer app that loads in under 2 seconds.
+* Daily reports for the operations team.
+* Forecasts to bid into the wholesale market.
+
+You are the first data engineer they hire. They have AWS, a small Postgres database, and no data platform yet.
+
+In the interview, the question is:
+
+> Design the data platform for a small electricity retailer with 200,000 smart meters reporting every 15 minutes.
+
+---
+
+### Your Task:
+
+1. Estimate the data volume so you know what you are dealing with.
+2. Sketch the architecture end to end.
+3. Pick the storage and processing layers, and defend each choice.
+4. Walk through how each use case is served.
+5. Call out two or three risks you would design for.
+
+---
+
+### What a Good Answer Covers:
+
+* Back-of-envelope volume math.
+* A clear ingestion path, a storage layer, a serving layer.
+* Choice between batch and stream for each use case.
+* How billing remains correct in the face of late data.
+* How the customer app stays fast without scanning raw readings.
diff --git a/Problem 21: Data Platform for an Electricity Retailer/solution.md b/Problem 21: Data Platform for an Electricity Retailer/solution.md
new file mode 100644
index 0000000..f1df448
--- /dev/null
+++ b/Problem 21: Data Platform for an Electricity Retailer/solution.md
@@ -0,0 +1,186 @@
+## Solution 21: Data Platform for an Electricity Retailer
+
+### Volume math first (always start here)
+
+> 200,000 meters × 96 readings per day = **19.2 million readings/day**
+> About **7 billion readings/year**, or roughly **0.5 TB/year** raw (assuming ~70 bytes per reading row).
+
+This is large but very tractable. It is not "needs Spark" territory. A modest warehouse handles this comfortably. Smart meter data is heavy in row count but small in payload, so we will lean on columnar storage and time partitioning.
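+
+The arithmetic, spelled out (the ~70 bytes per stored reading is the same assumption as above):
+
+```python
+METERS = 200_000
+READS_PER_DAY = 24 * 4                                 # one reading every 15 minutes
+BYTES_PER_READING = 70                                 # rough stored row size (assumption)
+
+per_day = METERS * READS_PER_DAY                       # 19,200,000
+per_year = per_day * 365                               # ~7.0 billion
+raw_tb = per_year * BYTES_PER_READING / 1e12           # ~0.49 TB
+
+print(f"{per_day:,} readings/day, {per_year / 1e9:.1f}B/year, ~{raw_tb:.2f} TB/year raw")
+```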
+
+### The shape of the platform
+
+```
+ ┌──────────────────────┐
+ Smart meters (200k) │ HEAD-END SYSTEM │
+ send every 15 min via │ (vendor system that │
+ the utility's protocol ────▶│ collects from MDM, │
+ (DLMS, NB-IoT, etc.) │ emits files / API) │
+ └──────────┬───────────┘
+ │
+ JSON / CSV per device per hour
+ │
+ ▼
+ ┌──────────────────────┐
+ │ S3 (RAW LAYER) │
+ │ s3://meter-raw/ │
+ │ yyyy/mm/dd/hh/ │
+ └──────────┬───────────┘
+ │
+ Triggered by S3 event
+ │
+ ▼
+ ┌──────────────────────┐
+ │ AWS Lambda / │
+ │ Glue parser │
+ │ validate, normalize │
+ └──────────┬───────────┘
+ │ Parquet
+ ▼
+ ┌──────────────────────┐
+ │ S3 (CURATED LAYER) │
+ │ s3://meter-curated/ │
+ │ date=YYYY-MM-DD/ │
+ └──────────┬───────────┘
+ │
+ dbt + Airflow (hourly)
+ │
+ ▼
+ ┌──────────────────────────────┐
+ │ WAREHOUSE (Redshift / │
+ │ Snowflake / BigQuery) │
+ │ │
+ │ reads_15min (raw fact) │
+ │ reads_hourly (rollup) │
+ │ reads_daily (rollup) │
+ │ customers (dim) │
+ │ tariffs (SCD2 dim) │
+ │ bills (mart) │
+ └──────────┬───────────────────┘
+ │
+ ┌────────────────────────────┼────────────────────┐
+ ▼ ▼ ▼
+ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
+ │ Customer app │ │ Ops dashboards │ │ Forecasting │
+ │ reads from a │ │ (Looker, Metab) │ │ model (Python, │
+ │ small Postgres │ │ query warehouse │ │ reads warehouse │
+ │ serving table │ │ directly │ │ history) │
+ │ (last 90 days) │ │ │ │ │
+ └──────────────────┘ └──────────────────┘ └──────────────────┘
+```
+
+### Why each piece
+
+**S3 raw layer.** Cheap, append-only, every original file is preserved. If a downstream bug is discovered, we can reprocess everything. This is the most underrated part of the design.
+
+**Lambda (or Glue) parser.** The volume is large but each file is small. Lambda triggered by S3 events scales naturally to 200k devices and is pay-per-invocation. The parser does three things: validate the row schema, normalize units (kWh, kW), and write Parquet to the curated layer partitioned by date.
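+
+A sketch of that parser as a Lambda handler, assuming the head-end drops CSV files and the function has pandas and pyarrow available (bucket names, column names, and the event wiring are illustrative):
+
+```python
+import io
+import urllib.parse
+
+import boto3
+import pandas as pd
+
+s3 = boto3.client("s3")
+CURATED_BUCKET = "meter-curated"                        # illustrative
+EXPECTED = {"meter_id", "read_at", "kwh"}
+
+def handler(event, context):
+    # Fired by the S3 ObjectCreated event on the raw bucket.
+    rec = event["Records"][0]["s3"]
+    bucket = rec["bucket"]["name"]
+    key = urllib.parse.unquote_plus(rec["object"]["key"])
+
+    df = pd.read_csv(s3.get_object(Bucket=bucket, Key=key)["Body"])
+
+    # Strict schema check: reject the file rather than corrupt the curated layer.
+    missing = EXPECTED - set(df.columns)
+    if missing:
+        raise ValueError(f"{key}: missing columns {missing}")
+
+    # Normalize: timestamps to UTC (unit fix-ups would go here too).
+    df["read_at"] = pd.to_datetime(df["read_at"], utc=True)
+
+    # Write Parquet into the date partition of the curated bucket.
+    day = df["read_at"].dt.date.min().isoformat()
+    buf = io.BytesIO()
+    df.to_parquet(buf, index=False)
+    s3.put_object(Bucket=CURATED_BUCKET,
+                  Key=f"date={day}/{key.rsplit('/', 1)[-1]}.parquet",
+                  Body=buf.getvalue())
+```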
+
+**S3 curated layer (Parquet, partitioned).** Columnar, compressed, queryable directly by Athena if needed. Partition by `date` so backfills and date filters are cheap.
+
+**Warehouse with three rollup tables.** The trick is to never let dashboards or the app touch the raw 15-minute fact directly. We pre-aggregate.
+
+* `reads_15min` keeps the source of truth at full granularity. Used for billing and forecasting.
+* `reads_hourly` is built nightly from `reads_15min`. Used for ops dashboards.
+* `reads_daily` is built from `reads_hourly`. Used for the customer app and monthly bills.
+
+A query that asks "show me my last 90 days of usage" hits 90 rows in `reads_daily`, not 8,640 rows in `reads_15min`.
+
+**Customer app serving layer.** This is the one place I would not go straight to the warehouse. Warehouse queries are seconds, not milliseconds, and they are not built for thousands of concurrent users. I would push the last 90 days of `reads_daily` per customer into a small Postgres or DynamoDB serving table, refreshed nightly. The app reads from there in under 50 ms.
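+
+The nightly refresh of that serving table is a small job. A sketch of the write side with DynamoDB (table name, key schema, and row shape are illustrative; the warehouse query that produces `rows` is not shown):
+
+```python
+import boto3
+
+table = boto3.resource("dynamodb").Table("usage_daily")     # illustrative table
+
+def refresh(rows):
+    # rows: dicts like {"customer_id": "C123", "day": "2025-05-14", "kwh": "12.4"},
+    # one per customer per day for the last 90 days, pulled from reads_daily.
+    with table.batch_writer(overwrite_by_pkeys=["customer_id", "day"]) as batch:
+        for row in rows:
+            batch.put_item(Item=row)
+```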
+
+### Walking through each use case
+
+**1. Monthly bills.**
+Run a monthly job that reads `reads_15min` for the billing period for each customer, multiplies by the tariff in effect (using an SCD Type 2 `tariffs` table to handle mid-month price changes), and writes one bill row per customer per month. Bills are idempotent: rerunning the job for the same period produces the same number.
+
+**2. The customer app "your usage" page.**
+Reads from the serving table mentioned above. Last 90 days at daily grain. Loads in well under 2 seconds.
+
+**3. Daily ops reports.**
+Run on `reads_hourly` and `reads_daily`. Total energy by region, top 100 highest consuming customers, devices with no reading in the last 24 hours. All under 5 seconds in the warehouse.
+
+**4. Forecasting for wholesale market bids.**
+Read several years of `reads_15min` aggregated to hourly, train a model (Prophet, gradient boosting, or a small neural network), forecast next day. Runs nightly. Forecaster reads warehouse directly because it is a single process, not many users.
+
+### Handling late and corrected data
+
+Smart meter data is notoriously messy. Devices go offline, send readings late, sometimes resend corrected values. The platform has to absorb this without producing wrong bills.
+
+Three rules:
+
+1. **Raw layer is append-only.** Late files just land in their original date partition or a "late" partition.
+2. **Curated and rollup layers are rebuilt by partition, not by row.** A late file for May 10 triggers a rebuild of May 10 in `reads_15min`, `reads_hourly`, `reads_daily`. The MERGE / overwrite pattern from Problem 9.
+3. **Bills are sealed.** Once a bill is sent, the bill row is frozen. Subsequent corrections produce an *adjustment* row, not an edit. This is how regulators expect it.
+
+```
+Raw : May 10 partition gains 32 new late readings
+Curated: May 10 rebuilt (idempotent overwrite)
+Rollups: May 10 hourly + daily rebuilt
+Bill : May was already sent → adjustment row
+ May was not yet billed → next bill picks up the change
+```
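+
+For the curated layer, rule 2 is the usual MERGE sketch (BigQuery-style syntax; `late_reads_staging` is a hypothetical staging table holding the newly parsed late file):
+
+```sql
+-- Absorb late or corrected readings into the curated fact, idempotently.
+MERGE reads_15min t
+USING late_reads_staging s
+ON t.meter_id = s.meter_id AND t.read_at = s.read_at
+WHEN MATCHED THEN
+  UPDATE SET kwh = s.kwh, quality_flag = s.quality_flag, ingested_at = s.ingested_at
+WHEN NOT MATCHED THEN
+  INSERT (meter_id, read_at, kwh, quality_flag, ingested_at)
+  VALUES (s.meter_id, s.read_at, s.kwh, s.quality_flag, s.ingested_at);
+```
+
+The hourly and daily rollups for that date are then rebuilt with the same delete-and-insert-by-partition pattern as any other day.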
+
+### Schema sketch
+
+```
+reads_15min (partition by date)
+─────────────────────────────────────────
+meter_id STRING
+read_at TIMESTAMP
+kwh NUMERIC
+quality_flag STRING -- "good", "estimated", "missing"
+ingested_at TIMESTAMP
+
+reads_hourly (partition by date)
+─────────────────────────────────────────
+meter_id STRING
+hour TIMESTAMP
+kwh NUMERIC
+
+reads_daily (partition by date)
+─────────────────────────────────────────
+meter_id STRING
+day DATE
+kwh NUMERIC
+
+customers (dim)
+─────────────────────────────────────────
+customer_id STRING (PK)
+name, email, address, plan_id
+joined_on DATE
+
+tariffs (SCD Type 2)
+─────────────────────────────────────────
+plan_id, kwh_price, fixed_charge, valid_from, valid_to, is_current
+
+bills (mart)
+─────────────────────────────────────────
+bill_id, customer_id, period_start, period_end, kwh_total, total_due, issued_at
+```
+
+### Risks I would call out
+
+**1. Meter clock drift.** Devices report local time, sometimes wrong. The parser must normalize to UTC and treat the device time as a hint, not truth. Bills cross day boundaries if you trust device clocks blindly.
+
+**2. The 1-second blast at the top of every hour.** When meters dump readings on the hour, traffic spikes. S3 + Lambda absorb this, but a direct database ingestion path would not. The design is built around that spike.
+
+**3. Estimated readings.** When a meter is offline, the head-end system fills the gap with an estimate. The bill must distinguish estimated from measured (the `quality_flag` column). Regulators care about this.
+
+**4. Schema changes from the head-end vendor.** They update their format every year. A data contract and a strict schema check at the Lambda parser stops this from corrupting the curated layer.
+
+### What I would NOT build on day one
+
+* A streaming pipeline. The use cases all tolerate hourly latency.
+* A separate Spark cluster. The volume does not need it. The warehouse handles aggregation.
+* A real-time customer-facing dashboard. Nice to have, expensive to build, not asked for.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What changes when you grow from 200,000 to 5 million meters?"*
+
+The shape stays the same, but a few pieces get pressure:
+
+* Lambda parsing gets expensive. Move to a Kinesis Firehose or a small Spark Streaming job.
+* `reads_15min` is now 175 billion rows/year. Stays fine in BigQuery or Snowflake, but Redshift may struggle. Time partitioning + clustering on `meter_id` is now non-negotiable.
+* The serving Postgres becomes a load problem. Shard by customer prefix, or move to DynamoDB.
+* Forecasting moves from one nightly model to per-region models running in parallel.
+
+The architecture survives the growth because we kept the raw layer untouched and the rollups well defined.
diff --git a/Problem 22: Banking App Monthly Spending Widget/question.md b/Problem 22: Banking App Monthly Spending Widget/question.md
new file mode 100644
index 0000000..b1bae25
--- /dev/null
+++ b/Problem 22: Banking App Monthly Spending Widget/question.md
@@ -0,0 +1,28 @@
+## Problem 22: Banking App Monthly Spending Widget
+
+**Scenario:**
+A retail bank wants a "your spending this month" widget on the home screen of its mobile app. When the user opens the app, they should see their total spend so far this month, broken down into categories like groceries, transport, dining. It needs to feel instant. The bank has 5 million active customers and millions of transactions per day.
+
+In the interview, the question is:
+
+> Sketch what sits behind a "your monthly spending" widget in a banking app, where users expect it to feel instant.
+
+---
+
+### Your Task:
+
+1. Decide what "feels instant" actually means and design to that.
+2. Pick a storage choice for the read path and explain why.
+3. Show how the data flows from a card swipe to the widget.
+4. Cover the merchant-to-category problem.
+5. Mention how you would handle refunds and corrections.
+
+---
+
+### What a Good Answer Covers:
+
+* Read-optimized serving store, not the warehouse.
+* Incremental updates per transaction (CDC or stream).
+* Pre-aggregated monthly totals per (user, category).
+* Idempotency and refund handling.
+* Cold cache vs warm cache, and the first-of-the-month problem.
diff --git a/Problem 22: Banking App Monthly Spending Widget/solution.md b/Problem 22: Banking App Monthly Spending Widget/solution.md
new file mode 100644
index 0000000..5c6bcfd
--- /dev/null
+++ b/Problem 22: Banking App Monthly Spending Widget/solution.md
@@ -0,0 +1,172 @@
+## Solution 22: Banking App Monthly Spending Widget
+
+### What "instant" means
+
+"Instant" on a phone means the widget renders in under 200 ms after the screen opens. The data fetch itself has maybe 80-100 ms of that budget after we account for TLS, app rendering, and the rest. So the backend has to answer in roughly 50 ms at p95.
+
+That number drives every design choice. A warehouse query takes seconds. A well-tuned relational database can manage it. A key-value store does it easily. So we are not going to query the warehouse from the phone, ever. We are going to maintain a small pre-computed answer in a fast store.
+
+### The shape of the system
+
+```
+ Card swipe / digital payment
+ │
+ ▼
+ ┌──────────────────────┐
+ │ Core banking │ (source of truth, OLTP, Postgres or
+ │ transactions DB │ a mainframe-equivalent)
+ └─────────┬────────────┘
+ │ CDC (Debezium / Striim)
+ ▼
+ ┌──────────────────────┐
+ │ Kafka topic │
+ │ "transactions" │
+ └─────────┬────────────┘
+ │
+ ▼
+ ┌─────────────────────────────┐
+ │ Stream processor (Flink / │
+ │ Kafka Streams) │
+ │ │
+ │ - look up category for │
+ │ merchant_id │
+ │ - apply +/- (refund vs │
+ │ purchase) │
+ │ - keyed by (user, month, │
+ │ category) │
+ │ - emit updated totals │
+ └────────────┬────────────────┘
+ │
+ ▼
+ ┌─────────────────────────────┐
+ │ Fast serving store │
+ │ (DynamoDB / Bigtable / │
+ │ Redis / Aerospike) │
+ │ │
+ │ Key: user_id|YYYY-MM │
+ │ Value: { category -> sum, │
+ │ total, updated_at}│
+ └────────────┬────────────────┘
+ │
+ │ <50 ms point read
+ ▼
+ ┌──────────────────────┐
+ │ Mobile app widget │
+ └──────────────────────┘
+
+ (in parallel, for analytics / dispute / reporting)
+ ┌─────────────────────────────────┐
+ │ Same Kafka topic also drains │
+ │ to S3 → Warehouse (BigQuery / │
+ │ Snowflake) for non realtime │
+ │ use cases │
+ └─────────────────────────────────┘
+```
+
+### Data shape in the serving store
+
+```
+Key: user_id | YYYY-MM
+Value: {
+ total: 4321.50,
+ categories: {
+ groceries: 850.20,
+ transport: 130.00,
+ dining: 295.75,
+ bills: 1820.00,
+ other: 1225.55
+ },
+ txn_count: 78,
+ updated_at: "2025-05-14T08:21:12Z",
+ last_seq: 991283
+}
+```
+
+One record per user per month. Reading it is a single point lookup. ~5 KB per user. 5 million users × 12 months would be a few hundred GB at most. Cheap.
+
+### How a transaction flows
+
+1. Customer pays at a merchant. The core banking system writes the transaction row. This is the moment of truth.
+2. CDC (Debezium reading the Postgres WAL, or its mainframe equivalent) emits the row to Kafka within a second or two.
+3. A stream processor reads the Kafka topic. For each transaction:
+ * Look up the merchant category from a small in-memory map (categories rarely change, refreshed every hour).
+ * Compute the signed amount: positive for purchases, negative for refunds.
+ * Update the (user, year-month, category) total.
+4. The new total is written to the fast serving store.
+5. Next time the user opens the app, the widget reads the row in one call.
+
+End to end latency from swipe to widget refresh: typically 2-5 seconds. The user does not see latency because they were not staring at the app at the moment of swipe.
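+
+A plain-Python sketch of step 3's per-event logic (in reality this runs inside Flink or Kafka Streams); the merchant map is assumed to be already loaded, `store_write` is the serving-store upsert sketched under Idempotency below, and field names are illustrative:
+
+```python
+from decimal import Decimal
+
+# Refreshed hourly from the merchant_categories table in reality.
+MERCHANT_CATEGORIES = {"m_tesco_sg_421": "groceries"}
+
+def handle_txn(txn: dict, store_write) -> None:
+    """Turn one CDC transaction event into a serving-store update (sketch)."""
+    category = MERCHANT_CATEGORIES.get(txn["merchant_id"], "other")
+    amount = Decimal(str(txn["amount"]))
+    if txn["type"] == "refund":        # refunds carry a negative sign
+        amount = -amount
+    month = txn["posted_at"][:7]       # "2025-05" -- post-date convention
+    store_write(
+        user_id=txn["user_id"],
+        month=month,
+        category=category,
+        amount=amount,
+        seq=txn["seq"],
+    )
+```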
+
+### The merchant-to-category problem
+
+The biggest soft problem in this pipeline. A swipe at "TESCO SG #421" should become "groceries." A swipe at "GRAB *XR23F" should become "transport." This mapping is messy and changes over time.
+
+How I would handle it:
+
+* **A `merchant_categories` table** keyed by merchant id, maintained by a small team. Loaded into the stream processor as a broadcast state, refreshed hourly.
+* **A fallback rule engine** for unknown merchants: pattern match on the descriptor string ("UBER" → transport, "STARBUCKS" → dining).
+* **An "other" bucket** for genuinely unmapped transactions. The widget shows them but they get reviewed weekly.
+
+When a category mapping is corrected, I would NOT retroactively change past months. The widget shows what was said at the time. Otherwise the user sees their March number change in May, which feels wrong.
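+
+A minimal sketch of the fallback rule engine; the patterns and categories are illustrative, not a real ruleset:
+
+```python
+import re
+
+FALLBACK_RULES = [
+    (re.compile(r"\bUBER\b|\bGRAB\b", re.I), "transport"),
+    (re.compile(r"\bSTARBUCKS\b|\bMCDONALD", re.I), "dining"),
+    (re.compile(r"\bTESCO\b|\bNTUC\b", re.I), "groceries"),
+]
+
+def categorize(merchant_id: str, descriptor: str, known: dict) -> str:
+    """Curated mapping first, descriptor patterns second, 'other' last (sketch)."""
+    if merchant_id in known:
+        return known[merchant_id]
+    for pattern, category in FALLBACK_RULES:
+        if pattern.search(descriptor):
+            return category
+    return "other"                     # reviewed weekly
+```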
+
+### Handling refunds and corrections
+
+Refunds are just negative transactions. The stream processor adds them and the totals go down. Important: if the original purchase happened in April and the refund happens in May, which month does the refund belong to?
+
+Two conventions, pick one:
+
+* **Banking convention (post date):** the refund hits whatever month it cleared in. So the May refund reduces the May total.
+* **Spending convention (transaction date):** the refund reduces the month the original purchase was in. So the May refund reduces the April total retroactively.
+
+Banks usually go with post date, because that's how the statement reads. The widget should match the statement.
+
+### Cold start and the first-of-the-month problem
+
+On the first of the month, every user has a fresh empty record. The very first read after midnight needs to materialize the empty row. Two ways:
+
+1. **Lazy:** the widget service checks the store, finds nothing, returns zeros, and the row gets created on the first transaction. Simple but produces a brief "your spending: 0" until the first txn arrives, which is correct anyway.
+2. **Eager:** a scheduled job at midnight of the user's local time zone creates an empty record. Slightly nicer UX but more code.
+
+Lazy is fine for a v1.
+
+### Idempotency
+
+CDC can deliver the same transaction twice (Kafka retries, processor restarts). The stream processor must dedupe by `transaction_id`. Two patterns:
+
+* Keep the last processed sequence number (`last_seq`) in the same record. Skip anything ≤ that.
+* Use Flink exactly-once with checkpointing.
+
+Either way, the rule is: applying the same transaction twice produces the same total once. This is the same idempotency rule from Problem 9, just inside a streaming processor.
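+
+A minimal sketch of the first pattern as a conditional write, assuming a DynamoDB table named `monthly_spend` keyed by `user_id|YYYY-MM` with flattened category attributes (illustrative names, not a fixed schema):
+
+```python
+from decimal import Decimal
+
+import boto3
+from botocore.exceptions import ClientError
+
+table = boto3.resource("dynamodb").Table("monthly_spend")   # hypothetical table
+
+def apply_txn(user_id: str, month: str, category: str, amount: Decimal, seq: int) -> None:
+    """Apply one signed amount to the (user, month) record; replays are no-ops."""
+    try:
+        table.update_item(
+            Key={"pk": f"{user_id}|{month}"},
+            UpdateExpression=(
+                "SET #cat = if_not_exists(#cat, :zero) + :amt, "
+                "#tot = if_not_exists(#tot, :zero) + :amt, "
+                "last_seq = :seq"
+            ),
+            # Reject anything we have already applied (CDC retry, replay).
+            ConditionExpression="attribute_not_exists(last_seq) OR last_seq < :seq",
+            ExpressionAttributeNames={"#cat": f"cat_{category}", "#tot": "total"},
+            ExpressionAttributeValues={":amt": amount, ":zero": Decimal(0), ":seq": seq},
+        )
+    except ClientError as e:
+        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
+            raise          # duplicates are silently skipped, real errors surface
+```
+
+A replayed transaction fails the condition and is skipped, so the total never double-counts.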
+
+### Why not query the warehouse directly?
+
+* Warehouse query latency is seconds, not milliseconds.
+* Warehouse is not built for thousands of concurrent point reads.
+* Cost scales with reads. Each widget render would cost real money at 5 million daily active users.
+
+The warehouse is still part of the picture, but for analytics, disputes, and historical reports, not for the live widget.
+
+### What the warehouse layer does
+
+The same Kafka topic drains to S3 (Firehose) and then to BigQuery or Snowflake. The warehouse hosts:
+
+* Full transaction history (years).
+* Analytics: "average spend per customer per category."
+* Risk and fraud models.
+* Customer support: "show me this user's full statement."
+
+The widget never touches it.
+
+### Common mistakes interviewers want you to name
+
+1. **Pointing the widget at the warehouse.** Lots of teams try this. Latency, cost, and concurrency all bite.
+2. **Recomputing the monthly total on every read** ("aggregate the last 30 days from a transactions cache"). Even on a fast store, this is expensive at the first-of-month-just-after-payday spike.
+3. **Forgetting to dedupe.** CDC retries are normal. A doubled total is a customer complaint.
+4. **Mapping changes that rewrite history.** Surprising the user is worse than a bad mapping.
+5. **No fallback for when the serving store is down.** If the widget is critical, fall back to a stale cached number, not a spinner forever.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you make this work offline, when the phone has no network?"*
+
+Cache the last successful response on the phone with a short TTL (an hour). Show it with a small "as of 8:21" timestamp. The user sees their last known number instead of an error, and the freshness label is honest. When the phone reconnects, refresh in the background.
diff --git a/Problem 23: Ride Hailing Surge Pricing/question.md b/Problem 23: Ride Hailing Surge Pricing/question.md
new file mode 100644
index 0000000..2c3e643
--- /dev/null
+++ b/Problem 23: Ride Hailing Surge Pricing/question.md
@@ -0,0 +1,28 @@
+## Problem 23: Ride Hailing Surge Pricing
+
+**Scenario:**
+A ride hailing company wants to set a surge multiplier (1.0x, 1.5x, 2x and so on) for each small geographic area, in real time. The multiplier should rise when there are many more ride requests than available drivers in that area, and fall when supply catches up. The product team wants prices to update at most every 30 seconds, and the area should be roughly a neighborhood.
+
+In the interview, the question is:
+
+> Design how a ride hailing company calculates surge pricing in real time.
+
+---
+
+### Your Task:
+
+1. Decide what data is needed, and how often.
+2. Pick a geospatial bucketing strategy.
+3. Sketch the streaming pipeline.
+4. Define the pricing function.
+5. Cover how the price is delivered to riders and drivers.
+
+---
+
+### What a Good Answer Covers:
+
+* H3 or geohash for area buckets.
+* Sliding windows on demand and supply streams.
+* The smoothing problem (avoid jumpy prices).
+* A pricing store keyed by area and time.
+* Fairness, abuse, and the "is this legal here" angle.
diff --git a/Problem 23: Ride Hailing Surge Pricing/solution.md b/Problem 23: Ride Hailing Surge Pricing/solution.md
new file mode 100644
index 0000000..b2cfe14
--- /dev/null
+++ b/Problem 23: Ride Hailing Surge Pricing/solution.md
@@ -0,0 +1,170 @@
+## Solution 23: Ride Hailing Surge Pricing
+
+### What we are really computing
+
+> A real-time ratio between **demand** (ride requests being created) and **supply** (available drivers) inside a small area, smoothed over a few minutes, mapped to a price multiplier between 1.0x and some cap like 3.0x.
+
+The math is simple. The hard parts are: bucketing the map into "areas," keeping the price stable enough that it doesn't whiplash, and getting it to riders and drivers fast enough.
+
+### The architecture
+
+```
+ ┌───────────────────┐ ┌────────────────────┐
+ │ Rider app │ │ Driver app │
+ │ "request ride" │ │ "online / idle / │
+ │ events │ │ on trip" events │
+ └─────────┬─────────┘ └──────────┬─────────┘
+ │ Kafka topic │ Kafka topic
+ │ ride_requests │ driver_status
+ ▼ ▼
+ ┌─────────────────────────────────────────────────┐
+ │ Stream processor (Flink) │
+ │ │
+ │ 1. Map GPS → H3 hex (resolution 8 or 9) │
+ │ 2. Per hex, 3-minute sliding window: │
+ │ requests = count of ride_requests │
+ │ supply = unique drivers idle in window │
+ │ 3. Compute raw_multiplier = f(requests, supply)│
+ │ 4. Smooth with EMA over last 3 windows │
+ │ 5. Snap to allowed price tiers (1.0, 1.25, …) │
+ └─────────────┬───────────────────────────────────┘
+ │
+ ▼
+ ┌─────────────────────────────────────────────────┐
+ │ Pricing store (Redis or Bigtable) │
+ │ Key: h3_hex │
+ │ Value: { multiplier, valid_until, updated_at }│
+ │ TTL: 60 seconds (fail-safe to 1.0x) │
+ └─────────────┬───────────────────────────────────┘
+ │
+ ┌────────┴───────────────┐
+ ▼ ▼
+ ┌─────────────┐ ┌────────────────────┐
+ │ Pricing API │ │ Driver heatmap API │
+ │ (rider quote│ │ (where to drive) │
+ │ + final │ │ │
+ │ price) │ │ │
+ └─────────────┘ └────────────────────┘
+```
+
+### Bucketing the map: H3 hexagons
+
+I would use Uber's open-source H3 library to convert latitude and longitude into a hex cell. H3 is what Uber actually uses for this. Two reasons:
+
+1. **Hexagons have equal distance to all neighbors**, unlike squares. So "the cell next door" means the same thing in every direction. This matters when you're balancing supply across cells.
+2. **Multiple resolutions** are built in. Resolution 8 is about 0.74 km² per cell (a neighborhood). Resolution 9 is about 0.10 km² (a few blocks). I would start at 8.
+
+```python
+import h3
+hex_id = h3.geo_to_h3(latitude, longitude, resolution=8)
+# Returns a string like '88283082a3fffff'
+```
+
+Back-of-envelope math: ~10,000 active hexes in a big city × one update every 30 seconds ≈ 20,000 writes per minute. Trivial for Redis or Bigtable.
+
+### The pricing function
+
+A simple, defensible formula:
+
+```
+raw_ratio = max(1.0, requests / max(supply, 1))
+
+raw_mult = 1.0 + α * (raw_ratio - 1) (α tunes aggressiveness)
+
+smoothed = 0.5 * smoothed_prev + 0.5 * raw_mult (exponential smoothing)
+
+final_mult = nearest_allowed_tier(smoothed) (1.0, 1.25, 1.5, 2.0, 3.0)
+final_mult = min(final_mult, cap) (cap = 3.0x in most markets)
+```
+
+Four things this gives us:
+
+* If demand equals supply, multiplier is 1.0x. No surge.
+* If demand is 2x supply, multiplier rises but not violently.
+* Smoothing prevents the price from jumping every 30 seconds.
+* Snapping to allowed tiers makes the UI honest. We never show "1.317x" to a rider.
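+
+The formula above as runnable Python; `ALPHA`, the tier list, and the cap are illustrative values that would come from the per-city config:
+
+```python
+ALPHA, CAP = 0.75, 3.0
+TIERS = [1.0, 1.25, 1.5, 2.0, 2.5, 3.0]
+
+def surge_multiplier(requests: int, supply: int, smoothed_prev: float) -> tuple[float, float]:
+    """Return (final multiplier, new smoothed state) for one window (sketch)."""
+    raw_ratio = max(1.0, requests / max(supply, 1))
+    raw_mult = 1.0 + ALPHA * (raw_ratio - 1)
+    smoothed = 0.5 * smoothed_prev + 0.5 * raw_mult        # exponential smoothing
+    snapped = min(TIERS, key=lambda t: abs(t - smoothed))  # nearest allowed tier
+    return min(snapped, CAP), smoothed
+
+# Example: 40 requests, 20 idle drivers, previous smoothed value 1.2 -> about 1.5x
+mult, state = surge_multiplier(40, 20, 1.2)
+```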
+
+### The smoothing problem
+
+Without smoothing, the price flickers between tiers as drivers come and go. Riders and drivers both hate this. Two mechanisms:
+
+1. **Exponential moving average** of the raw multiplier over the last few windows.
+2. **Hysteresis**: require the change to persist for two windows before applying it. If it drops below the threshold in one window but comes back, keep the higher tier.
+
+With both, prices are stable enough to be trustworthy and still responsive to real changes.
+
+### Delivery to riders and drivers
+
+* **Rider price quote**: when the rider opens the app or taps "request," the app calls the pricing API. The API looks up the hex for the pickup location and returns the multiplier. The full quote also locks the price for ~5 minutes to avoid the "price changed while I was looking" problem.
+* **Driver heatmap**: drivers see a color overlay showing high-surge areas. This data is read from the same store, but they get a 1-minute-lagged version. (Showing live surge to drivers leads to them all rushing in and crashing the surge, which is bad for both sides.)
+
+### Late and missing data
+
+Driver status events can arrive late. A driver who went idle 90 seconds ago should count toward supply. Two protections:
+
+* Use **event time**, not processing time, with a small watermark allowance (60 seconds).
+* If a hex has near-zero supply data for more than 2 minutes (a vendor outage?), the system **fails safe to 1.0x**. We do not surge based on missing data.
+
+### The cap and the human side
+
+A cap (often 3x or 4x) is in the system not because of math but because regulators or politics demand it. Some cities ban surge altogether during emergencies. The pricing function should be configurable by city: cap, allowed tiers, even disabled. This config lives in a small table the stream processor refreshes hourly.
+
+There is also a fairness layer: the surge multiplier in a hex is the same for everyone in that hex, regardless of who they are. No personalized surge. This is a policy choice that the system enforces, not a math choice.
+
+### Hex edge effects
+
+A pickup that is 5 meters inside hex A versus 5 meters inside hex B may see two very different prices. To smooth this, the pricing API can blend the rider's hex with its immediate neighbors:
+
+```
+multiplier = weighted_average(
+ this_hex × 0.5,
+ 6 neighbor hexes × 0.5 / 6
+)
+```
+
+This makes the price feel continuous across the map.
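+
+A small sketch of that blend, using the same h3-py v3 calls as the snippet above (v4 renames them to `latlng_to_cell` / `grid_disk`); multipliers missing from the store default to 1.0x:
+
+```python
+import h3
+
+def blended_multiplier(lat: float, lng: float, multipliers: dict) -> float:
+    """Blend a hex's multiplier with its 6 neighbors, 50/50 (sketch)."""
+    center = h3.geo_to_h3(lat, lng, 8)
+    neighbors = [h for h in h3.k_ring(center, 1) if h != center]
+    own = multipliers.get(center, 1.0)
+    ring = sum(multipliers.get(h, 1.0) for h in neighbors) / len(neighbors)
+    return 0.5 * own + 0.5 * ring
+```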
+
+### Schema sketch in the pricing store
+
+```
+key: h3_8 hex id
+value: {
+ multiplier: 1.50,
+ raw_demand: 42,
+ supply: 28,
+ updated_at: "2025-05-14T20:14:30Z",
+ valid_until: "2025-05-14T20:15:30Z",
+ reason: "high_demand"
+}
+```
+
+TTL of 60 seconds. If the stream processor stops writing, prices fall back to 1.0x within a minute. That is a safe failure mode.
+
+### Risks I would call out
+
+1. **Surge as gaming target.** Drivers go offline to shrink the measured supply and push the multiplier up. We measure "available drivers in area," and we should also keep "active drivers in area" for cross-checks.
+2. **Storms / events.** Sudden demand spikes from a sports game. The cap protects users, but the system also needs to recognize patterns and not panic.
+3. **Latency.** Pricing API must respond in under 50 ms or the rider sees a spinner. Redis or Bigtable point reads are the right tool. No SQL warehouse here.
+4. **Data freshness.** A 3-minute window plus 1-minute driver-side lag means the price is reacting to ~4-minute-old data. Acceptable. Less than that creates instability.
+
+### What does not need to be real time
+
+* Pay-out calculations.
+* Customer support showing historical surge for a complaint.
+* Pricing analytics ("how much surge happened yesterday").
+
+All of these go through the warehouse, fed from the same Kafka topics.
+
+### Common mistakes interviewers want you to name
+
+1. **Using lat/lon directly as a bucket.** Cells are arbitrary, neighbor relationships break.
+2. **No smoothing.** Prices oscillate every 30 seconds, riders complain.
+3. **Pricing built on the warehouse.** Far too slow.
+4. **No cap or fail-safe.** A pipeline glitch leaves a 7.5x surge stuck for an hour.
+5. **Showing live surge to drivers.** Causes oscillation in supply too.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you handle a city with very different supply/demand dynamics, like a small town?"*
+
+Pricing parameters should be per city, possibly per hex cluster. A small town has fewer drivers and fewer requests, so the ratio is noisier. I would increase the window size (5 to 10 minutes), require larger absolute changes before triggering surge, and lower the cap. The architecture stays the same. Only the config differs.
diff --git a/Problem 24: Spotify Minutes Listened This Week/question.md b/Problem 24: Spotify Minutes Listened This Week/question.md
new file mode 100644
index 0000000..44ae464
--- /dev/null
+++ b/Problem 24: Spotify Minutes Listened This Week/question.md
@@ -0,0 +1,28 @@
+## Problem 24: Spotify Minutes Listened This Week
+
+**Scenario:**
+A music streaming app shows "minutes listened this week" on every user's profile. Hundreds of millions of users open the app each day. Scanning the full play event history every time someone opens their profile would be impossibly expensive. The product team wants the number to be at most 5 minutes stale.
+
+In the interview, the question is:
+
+> Design how Spotify might compute "minutes listened this week" without scanning all play events every time someone opens the app.
+
+---
+
+### Your Task:
+
+1. Estimate the read and write volumes at the scale described.
+2. Pick a storage and update pattern.
+3. Decide what "this week" means and how it rolls over.
+4. Cover correctness in the presence of pauses, skips, and replays.
+5. Mention how it stays fast at 200 million daily users.
+
+---
+
+### What a Good Answer Covers:
+
+* A pre-aggregated counter per (user, week), updated incrementally.
+* Streaming aggregation over play heartbeat events.
+* The "what counts as a minute listened" definition.
+* Week boundary handling and time zones.
+* The serving store choice.
diff --git a/Problem 24: Spotify Minutes Listened This Week/solution.md b/Problem 24: Spotify Minutes Listened This Week/solution.md
new file mode 100644
index 0000000..e8b32d5
--- /dev/null
+++ b/Problem 24: Spotify Minutes Listened This Week/solution.md
@@ -0,0 +1,183 @@
+## Solution 24: Spotify Minutes Listened This Week
+
+### Volume math first
+
+> ~500 million users total, ~200 million daily active. Average maybe 30 minutes a day listened. Play events with heartbeats every ~30 seconds gives roughly:
+>
+> 200M users × 60 heartbeats/day = **12 billion events per day**.
+>
+> Profile loads: maybe 1 billion per day. That's the read side.
+
+Two takeaways:
+
+* We cannot scan play history per request. A week of heartbeats is roughly 84 billion rows across all users; touching raw events on every one of a billion profile loads a day is a non-starter.
+* We must pre-aggregate. The "minutes listened this week" must be a single value already sitting somewhere, updated in the background.
+
+### The architecture
+
+```
+ ┌────────────────────────┐
+ │ Client (apps, web, │
+ │ speakers, etc.) │
+ │ emits play heartbeats │
+ │ every ~30 sec while │
+ │ audio is playing │
+ └───────────┬────────────┘
+ │
+ ▼
+ ┌────────────────────────┐
+ │ Kafka topic │
+ │ play_heartbeats │
+ │ (12B events/day) │
+ └───────────┬────────────┘
+ │
+ ▼
+ ┌────────────────────────────────────────────┐
+ │ Stream processor (Flink) │
+ │ │
+ │ Per heartbeat: │
+ │ - Validate (real play, not seek/scrub) │
+ │ - Compute elapsed seconds since previous │
+ │ heartbeat for this user │
+ │ - Key by (user_id, week_id) │
+ │ - Increment counter │
+ │ │
+ │ Emit updates to serving store every │
+ │ N seconds (e.g. 30s window). │
+ └───────────┬────────────────────────────────┘
+ │
+ ▼
+ ┌────────────────────────────────────────────┐
+ │ Serving store (Bigtable / Cassandra / │
+ │ DynamoDB / Aerospike) │
+ │ │
+ │ Row key: user_id|week_id │
+ │ Value: { minutes, updated_at, version } │
+ └───────────┬────────────────────────────────┘
+ │
+ │ <20ms point read
+ ▼
+ ┌────────────────────────┐
+ │ Profile service │
+ │ "minutes this week" │
+ └────────────────────────┘
+
+ (parallel branch)
+ Kafka → S3 (Firehose) → Warehouse (BigQuery)
+ Used for analytics, Year in Review, ML, audit.
+```
+
+### What counts as a "minute listened"
+
+This sounds obvious but it is the whole pipeline's correctness problem. We need a clear rule:
+
+* A heartbeat fires every 30 seconds during active playback.
+* If the next heartbeat for the same user arrives within 60 seconds and is on the same content, count the elapsed real time between them.
+* If more than 60 seconds passed, the user paused, switched device, or lost network. Do not count the gap.
+* Skips and seeks emit different events and never count toward minutes.
+
+This logic lives in the stream processor. The output of the processor is "this heartbeat contributed N seconds." It is N, not 30, exactly because of pause/gap handling.
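+
+A minimal sketch of that rule as a pure function, assuming the processor keeps the previous heartbeat's timestamp and track in keyed state:
+
+```python
+from typing import Optional
+
+MAX_GAP_SECONDS = 60
+
+def seconds_credited(prev_ts: Optional[float], curr_ts: float,
+                     prev_track: Optional[str], curr_track: str) -> int:
+    """How many seconds this heartbeat contributes to the weekly total (sketch)."""
+    if prev_ts is None or prev_track != curr_track:
+        return 0                       # first heartbeat of a listening session
+    gap = curr_ts - prev_ts
+    if gap <= 0 or gap > MAX_GAP_SECONDS:
+        return 0                       # pause, device switch, or network dropout
+    return int(gap)                    # credit elapsed real time, usually ~30
+```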
+
+### The week boundary
+
+"This week" is ambiguous. We pick:
+
+* **ISO week**, starts Monday 00:00 in the user's local time zone.
+* `week_id` is `2025-W20` style.
+
+Why local time zone, not UTC? Because the user looks at the profile in their own day. A user in Singapore at 1 AM Monday should see a fresh week. With UTC weeks they would still see last week.
+
+The stream processor knows each user's time zone (from a small user table broadcast to the job). It computes the right `week_id` per event.
+
+On the boundary moment, the user has both `2025-W20` and `2025-W21` records in the store. The profile reads only the current one.
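+
+A sketch of the `week_id` computation with Python's `zoneinfo`, assuming event timestamps are timezone-aware UTC:
+
+```python
+from datetime import datetime
+from zoneinfo import ZoneInfo
+
+def week_id(event_ts_utc: datetime, user_tz: str) -> str:
+    """ISO week id in the user's local time zone, e.g. '2025-W20' (sketch)."""
+    local = event_ts_utc.astimezone(ZoneInfo(user_tz))
+    iso = local.isocalendar()          # (ISO year, ISO week, weekday)
+    return f"{iso.year}-W{iso.week:02d}"
+
+# A play at 17:30 UTC on Sunday is already Monday morning in Singapore,
+# so it lands in the new week for that user but not for one in New York.
+ts = datetime(2025, 5, 11, 17, 30, tzinfo=ZoneInfo("UTC"))
+print(week_id(ts, "Asia/Singapore"))   # 2025-W20
+print(week_id(ts, "America/New_York")) # 2025-W19
+```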
+
+### Why the key is (user, week) and not just user
+
+* Reads only one row. A single point read. Fast.
+* On Monday the row resets naturally (a new key exists).
+* Old weeks remain queryable: "compared to last week, you listened 12% more."
+* TTL the old rows (90 days) so the store stays small.
+
+The store ends up roughly: 200M active users × ~12 weeks of retention = ~2.4 billion rows. Each row is small (~80 bytes). Around 200 GB. Trivial for Bigtable.
+
+### Update path
+
+```
+Every 30 sec, a play heartbeat arrives.
+ │
+ ▼
+Flink computes "this heartbeat = +28 seconds for user 12345, week 2025-W20"
+ │
+ ▼
+Flink keeps a per-key running total in state.
+ │
+ ▼
+Every 30 seconds (a small commit window), Flink flushes the new value to
+the serving store with a versioned write (compare-and-set on version).
+ │
+ ▼
+Profile reads see a number that is at most ~30 seconds behind reality.
+```
+
+We do not write to the store on every heartbeat (that would be 12B writes a day). We flush at most once per commit window per user, so the write rate is bounded by how many users are actively listening in a window, not by the heartbeat rate, and the window can be widened if the store needs fewer writes. Idle users produce no writes at all.
+
+### Idempotency
+
+Heartbeats can be duplicated (network retries). Each heartbeat carries a unique `heartbeat_id`. The processor dedupes within a short window (a few minutes) using Flink keyed state.
+
+If Flink restarts, it resumes from checkpoint. The serving store write is a CAS on `version`, so a replayed write does not double-count.
+
+End result: the same user activity always produces the same minute count.
+
+### The "first profile load on Monday" problem
+
+When a new week starts, the store does not have a row yet. The profile service:
+
+* Reads the (user, current week) key.
+* If missing, return `0` minutes (correct).
+* The row appears the moment the user plays anything.
+
+No special migration needed.
+
+### The "user plays nothing this week" case
+
+There is no row at all. The profile shows 0. Correct.
+
+### Why a key-value store, not a SQL warehouse
+
+* 1B reads/day at sub-50ms requires a single-row point read store.
+* Writes are also high. Bigtable handles 100k+ writes/sec per cluster.
+* No JOINs are needed at read time. The row already has the answer.
+* Cost: a SQL warehouse like BigQuery is built for scans, not for billions of point reads.
+
+A relational database (Postgres) would actually work for this if sharded well, but Bigtable/DynamoDB/Cassandra is the easier scale story.
+
+### What about the warehouse layer?
+
+It still exists, but for different reasons:
+
+* Year in Review and the more sophisticated personalised statistics.
+* ML features: who listened to what, when, where.
+* Auditing the streamed minutes (for royalty calculations).
+* Reprocessing a week if a bug in the rule was discovered.
+
+The warehouse never sees a profile load. The warehouse is for analytics and offline processing.
+
+### Common mistakes interviewers want you to name
+
+1. **Scanning play events at read time.** Even with partitioning, 12B rows/day cannot serve 1B reads.
+2. **Using UTC weeks.** Users hate that.
+3. **Counting all heartbeats as 30 seconds.** Pauses and gaps must be detected, otherwise minutes look inflated.
+4. **No idempotency.** Heartbeat retries double the minutes.
+5. **One global Redis.** A single cluster cannot do this; you need a partitioned store.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the business asks for 'minutes listened today,' down to the hour?"*
+
+Two ways:
+
+1. Add another key, `(user_id, day_id)`. Same approach. Reset at midnight local time.
+2. If they want hourly granularity, store a small dictionary in the row: `{ '14': 12, '15': 28, ... }`. Updated by the same stream processor.
+
+The architecture does not change. The data model just gets a finer grain.
diff --git a/Problem 25: Smart Meter to Monthly Bill PDF/question.md b/Problem 25: Smart Meter to Monthly Bill PDF/question.md
new file mode 100644
index 0000000..a6d0730
--- /dev/null
+++ b/Problem 25: Smart Meter to Monthly Bill PDF/question.md
@@ -0,0 +1,28 @@
+## Problem 25: Smart Meter to Monthly Bill PDF
+
+**Scenario:**
+You work for an energy retailer. Every month, you need to produce a bill PDF for each of 200,000 customers. The bill must reflect the customer's actual consumption (every 15 minutes), the tariff in effect for each period (which can change mid-month), taxes, fees, and any account adjustments. Regulators require the bill to be reproducible and auditable: if a customer disputes their bill, you must be able to show how each number was calculated.
+
+In the interview, the question is:
+
+> Design the pipeline that turns raw smart meter readings into a monthly bill PDF.
+
+---
+
+### Your Task:
+
+1. Sketch the end-to-end flow.
+2. Cover billing correctness in the presence of late, missing or corrected readings.
+3. Show how mid-month tariff changes are handled.
+4. Explain the audit trail.
+5. Cover what happens when generation goes wrong.
+
+---
+
+### What a Good Answer Covers:
+
+* The fact and dimension shape (Problem 10 SCD2 plays into this).
+* Idempotent monthly job, by billing period.
+* The "estimated reading" handling.
+* PDF rendering as a separate, deterministic step.
+* Sealed bills and adjustments.
diff --git a/Problem 25: Smart Meter to Monthly Bill PDF/solution.md b/Problem 25: Smart Meter to Monthly Bill PDF/solution.md
new file mode 100644
index 0000000..fac3f32
--- /dev/null
+++ b/Problem 25: Smart Meter to Monthly Bill PDF/solution.md
@@ -0,0 +1,245 @@
+## Solution 25: Smart Meter to Monthly Bill PDF
+
+### The whole flow at a glance
+
+```
+ ┌──────────────────────────┐
+ │ Meter readings (15-min) │
+ │ s3://meter-raw/... │
+ └────────────┬─────────────┘
+ │ parser, validation
+ ▼
+ ┌──────────────────────────┐
+ │ Curated 15-min fact │ (reads_15min, partitioned by date)
+ └────────────┬─────────────┘
+ │
+ ▼
+ ┌──────────────────────────────────────────────┐
+ │ Billing calc job │
+ │ (runs once per customer per billing period) │
+ │ │
+ │ Inputs: │
+ │ - reads_15min for the period │
+ │ - tariffs (SCD2) │
+ │ - customers (dim) │
+ │ - account_adjustments │
+ │ │
+ │ Output: one bill row + bill_lines │
+ └────────────┬─────────────────────────────────┘
+ │
+ ▼
+ ┌──────────────────────────┐
+ │ bills (sealed) │ immutable once issued
+ │ bill_lines │
+ └────────────┬─────────────┘
+ │
+ ▼
+ ┌──────────────────────────┐
+ │ PDF rendering service │ pure function of bill_id
+ │ (deterministic) │
+ └────────────┬─────────────┘
+ │
+ ▼
+ ┌──────────────────────────┐
+ │ Customer portal + │
+ │ email delivery │
+ └──────────────────────────┘
+```
+
+The principle: each step is a function of the layer beneath it. Re-running the same step on the same inputs always gives the same output. The PDF is the last and least interesting step.
+
+### The data model
+
+```
+reads_15min (fact, partition by date)
+─────────────────────────────────────────
+meter_id, read_at, kwh, quality_flag, ingested_at
+
+customers (dim)
+─────────────────────────────────────────
+customer_id, meter_id, plan_id, billing_address, ...
+
+tariffs (SCD Type 2 dim)
+─────────────────────────────────────────
+plan_id, energy_rate_per_kwh, fixed_daily_charge,
+valid_from, valid_to, is_current
+
+account_adjustments (fact)
+─────────────────────────────────────────
+adjustment_id, customer_id, amount, reason, applied_on
+
+bills (mart, sealed)
+─────────────────────────────────────────
+bill_id (PK), customer_id, period_start, period_end,
+kwh_total, energy_charge, fixed_charge, adjustments,
+tax, total_due, issued_at, status
+
+bill_lines (mart)
+─────────────────────────────────────────
+bill_id, line_no, description, quantity, unit_price, amount
+```
+
+The `bill_lines` table is the audit trail. Every cent on the bill has a corresponding row showing exactly how it was computed.
+
+### The billing calculation, step by step
+
+For one customer for one billing period:
+
+**1. Pull readings.**
+
+```sql
+SELECT read_at, kwh, quality_flag
+FROM reads_15min
+WHERE meter_id = @meter
+ AND read_at >= @period_start
+ AND read_at < @period_end
+ORDER BY read_at;
+```
+
+Expect 4 × 24 × 30 = 2880 rows for a 30-day month.
+
+**2. Fill gaps.**
+
+If any interval has no reading or a `missing` flag, fill with an estimate. The estimation rule is regulator-defined, usually "average of the same interval in the previous N days." Mark each filled row with `quality_flag = 'estimated'`.
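+
+A sketch of the fill step, assuming BigQuery-style SQL; the 14-day lookback is illustrative and the real rule comes from the regulator:
+
+```sql
+WITH expected_slots AS (
+  -- Every 15-minute slot the period should contain.
+  SELECT slot_start
+  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
+         @period_start,
+         TIMESTAMP_SUB(@period_end, INTERVAL 15 MINUTE),
+         INTERVAL 15 MINUTE)) AS slot_start
+),
+history AS (
+  -- Average of the same time-of-day over the previous 14 days.
+  SELECT TIME(read_at) AS slot_time, AVG(kwh) AS avg_kwh
+  FROM reads_15min
+  WHERE meter_id = @meter
+    AND read_at >= TIMESTAMP_SUB(@period_start, INTERVAL 14 DAY)
+    AND read_at <  @period_start
+  GROUP BY slot_time
+)
+SELECT
+  s.slot_start                                   AS read_at,
+  COALESCE(r.kwh, h.avg_kwh, 0)                  AS kwh,
+  IF(r.kwh IS NULL, 'estimated', r.quality_flag) AS quality_flag
+FROM expected_slots s
+LEFT JOIN reads_15min r
+  ON r.meter_id = @meter AND r.read_at = s.slot_start
+LEFT JOIN history h
+  ON TIME(s.slot_start) = h.slot_time;
+```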
+
+**3. Join to the tariff in effect at each interval.**
+
+```sql
+SELECT r.read_at, r.kwh, t.energy_rate_per_kwh
+FROM readings r
+JOIN tariffs t
+  ON t.plan_id = @plan_id
+  AND r.read_at >= t.valid_from
+  -- valid_to may be NULL on the current tariff row
+  AND r.read_at < COALESCE(t.valid_to, TIMESTAMP '9999-12-31');
+```
+
+Notice: this is the SCD2 "as-of" join from Problem 10. If the tariff changed mid-month, half the rows match the old rate and half match the new rate, automatically.
+
+**4. Compute the energy charge.**
+
+```
+energy_charge = sum( kwh * rate ) over all 15-min intervals
+```
+
+**5. Add the fixed daily charge.**
+
+```
+fixed_charge = days_in_period * fixed_daily_charge
+```
+
+(If the fixed daily charge changed mid-month, the same SCD2 join handles it per day.)
+
+**6. Apply adjustments.**
+
+```sql
+SELECT SUM(amount) FROM account_adjustments
+WHERE customer_id = @cust
+ AND applied_on BETWEEN @period_start AND @period_end;
+```
+
+Examples: refunds, goodwill credits, late fees.
+
+**7. Apply tax.**
+
+Usually a flat percentage of subtotal. Some markets have different rates for residential vs commercial.
+
+**8. Write bill and bill_lines, sealed.**
+
+A `bills` row plus one `bill_lines` row per category (energy, fixed, each adjustment, tax). After this, the bill is **frozen**. Any later correction is a new adjustment in the next billing period, not an edit.
+
+### Idempotency
+
+The job is keyed by `(customer_id, period_start, period_end)`. Re-running it for the same key:
+
+* DELETEs any draft bill_lines for that key.
+* Recomputes from current `reads_15min` and `tariffs`.
+* If a bill with that key is `status = 'issued'`, the job refuses to overwrite. You can only correct via an adjustment.
+
+Pseudocode:
+
+```python
+def compute_bill(customer_id, period_start, period_end):
+ existing = bills.get(customer_id, period_start)
+ if existing and existing.status == 'issued':
+ raise BillAlreadyIssued(existing.bill_id)
+
+ # idempotent re-compute
+ delete_lines_for(customer_id, period_start)
+ lines = build_lines(customer_id, period_start, period_end)
+ write_lines(lines)
+ write_bill(summary_of(lines), status='draft')
+```
+
+### Late and corrected readings
+
+This is where the system pays off.
+
+* If readings for May 10 arrive late on May 14, the curated layer rebuilds the May 10 partition.
+* If the bill for May has not been issued yet, the next run of the billing job picks up the new numbers naturally. No special code.
+* If the bill was already issued, the change creates an `account_adjustment` row to be applied in the next billing cycle. The regulator sees a clear audit trail: "your May bill was X. On June 5 we received corrected readings for May 10. We credited Y dollars on your June bill."
+
+This is much cleaner than editing a sealed bill, which would break audits.
+
+### Mid-month tariff changes
+
+These happen often (price reviews, plan switches by the customer). Handled by:
+
+* `tariffs` table is SCD Type 2.
+* The as-of join in step 3 matches each 15-minute reading to the rate that was in force at that exact moment.
+* Bill lines show "Energy 1-15 May at 0.18/kWh" and "Energy 16-31 May at 0.22/kWh" separately, so the customer can see what changed.
+
+### Auditability
+
+The audit trail is the `bill_lines` table plus the snapshot of inputs. Three pieces:
+
+1. **The bill itself** (issued, sealed).
+2. **bill_lines** showing every component.
+3. **Reading snapshot** — at bill time, we record a `reads_snapshot_id` pointing to which version of the curated reads was used. If a later correction changes the readings, the snapshot still shows what we billed on.
+
+A dispute conversation looks like: "Your May energy charge of $XX comes from these 2,880 readings, totaling YYY kWh, at the rates shown in lines 1 and 2. The estimated readings are flagged on line 3."
+
+### PDF rendering
+
+Important: the PDF service is a **pure function of the bill_id**. It reads the bill and bill_lines and renders. It does not compute anything. If we change the PDF template, every old bill can be re-rendered to the new look without changing any numbers.
+
+Two practical patterns:
+
+* PDFs are written to S3 with a deterministic name: `bills/{customer_id}/{period}/{bill_id}.pdf`.
+* The PDF carries a small footer like "Bill ID: B-2025-05-12345" so the support team can find the source data instantly.
+
+### What happens when generation fails
+
+Failure modes I would design for:
+
+* **Missing readings for the period.** Job pauses, alerts ops. Bill is not generated.
+* **PDF rendering crashes.** Bill row exists, but no PDF. PDF service retries; if it never works, ops gets paged.
+* **Wrong tariff applied** (config error). All bills for the period must be re-generated. Since they are not yet `issued`, the job can do this idempotently.
+* **Adjustment job double-fires.** Each adjustment has a unique id; the bill query deduplicates on `adjustment_id` before summing amounts.
+
+### Why split the job per customer
+
+Two reasons:
+
+* **Parallelism.** 200,000 customers × ~1 second per bill = 56 hours sequentially. In parallel with 200 workers, ~17 minutes.
+* **Isolation.** One customer's bug does not poison the rest. The bill job for customer 12345 fails, the other 199,999 still go out.
+
+A simple orchestrator (Airflow with dynamic task mapping, or a queue of customer ids) is enough.
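+
+A minimal sketch of that fan-out with Airflow dynamic task mapping (Airflow 2.4+ TaskFlow API); DAG and task bodies are placeholders:
+
+```python
+from datetime import datetime
+
+from airflow.decorators import dag, task
+
+
+@dag(schedule="@monthly", start_date=datetime(2025, 1, 1), catchup=False)
+def monthly_billing():
+
+    @task
+    def list_customers() -> list[str]:
+        # In reality: query the customers dim for active accounts.
+        return ["C-0001", "C-0002"]
+
+    @task
+    def compute_bill(customer_id: str) -> None:
+        # Calls the idempotent billing job sketched above for one customer.
+        ...
+
+    # One mapped task instance per customer; failures stay isolated.
+    compute_bill.expand(customer_id=list_customers())
+
+
+monthly_billing()
+```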
+
+### Common mistakes interviewers want you to name
+
+1. **Editing a sealed bill** when correction arrives. Use adjustments.
+2. **Not handling SCD2 tariffs.** Tariff changes mid-month get billed wrong.
+3. **Computing during PDF rendering.** A template change later silently changes old totals.
+4. **No quality flag for estimates.** Auditors require "what was measured vs estimated."
+5. **Per-customer time zones** ignored when defining "billing period."
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you parallelize this for 5 million customers?"*
+
+Same architecture, more parallelism. Two changes worth mentioning:
+
+* Run the billing job on Spark or BigQuery rather than Python-per-customer. The "per customer" loop becomes a SQL with GROUP BY customer.
+* The PDF render service stays per-customer but moves behind a queue with a thousand workers.
+
+Five million bills at ~0.5 seconds each, 1,000 workers, is roughly 40 minutes wall clock.
diff --git a/Problem 26: Delivery Idle Driver Tracking/question.md b/Problem 26: Delivery Idle Driver Tracking/question.md
new file mode 100644
index 0000000..0f413dd
--- /dev/null
+++ b/Problem 26: Delivery Idle Driver Tracking/question.md
@@ -0,0 +1,28 @@
+## Problem 26: Delivery Idle Driver Tracking
+
+**Scenario:**
+A food delivery company has 30,000 active drivers in a city. The dispatch system needs to know, in near real time, which drivers are currently idle and where they are, so it can offer them the next order. "Idle" means online, not on a trip, and not on a break. The data must be no more than 10 seconds stale, and dispatch must be able to query "show me all idle drivers within 1.5 km of this restaurant" in well under a second.
+
+In the interview, the question is:
+
+> Design how a delivery company knows in near real time which drivers are idle and where they are.
+
+---
+
+### Your Task:
+
+1. Decide what state needs to be tracked, and where.
+2. Pick a streaming pipeline.
+3. Pick a spatial index for "drivers near point X."
+4. Cover the failure modes: app crash, no GPS, off-trip-but-not-really.
+5. Sketch how dispatch queries this.
+
+---
+
+### What a Good Answer Covers:
+
+* A driver state machine and how it changes.
+* Last-known-location store with TTL.
+* Geospatial query (geohash or H3 neighbor lookup).
+* The "stale driver" problem and how to expire safely.
+* Why you do not put this in your warehouse.
diff --git a/Problem 26: Delivery Idle Driver Tracking/solution.md b/Problem 26: Delivery Idle Driver Tracking/solution.md
new file mode 100644
index 0000000..ebe1d7f
--- /dev/null
+++ b/Problem 26: Delivery Idle Driver Tracking/solution.md
@@ -0,0 +1,182 @@
+## Solution 26: Delivery Idle Driver Tracking
+
+### The shape of the system
+
+```
+Driver app Trip system
+sends ping every 5 sec emits trip events
+ │ │
+ ▼ ▼
+ ─────────────────────────────────────────────
+ Kafka
+ ┌─────────────┐ ┌──────────────────┐
+ │ driver_ping │ │ trip_status │
+ │ topic │ │ topic │
+ └──────┬──────┘ └────────┬─────────┘
+ │ │
+ └────────┬───────────┘
+ ▼
+ ┌────────────────────────────────────────┐
+ │ Stream processor (Flink) │
+ │ │
+ │ Keyed by driver_id, holds the state: │
+ │ online | on_trip | break | offline │
+ │ Updated by pings + trip events. │
+ │ │
+ │ For idle drivers, emits to │
+ │ the "idle drivers" store keyed by H3. │
+ └────────────────┬───────────────────────┘
+ │
+ ▼
+ ┌─────────────────────────────────────────┐
+ │ Live driver store │
+ │ Redis / DragonflyDB / Aerospike │
+ │ │
+ │ Per-driver hash: { state, lat, lng, │
+ │ h3, last_ping } │
+ │ TTL: 60 sec (auto-expire stale) │
+ │ │
+ │ Index: H3 -> set of driver_ids │
+ └────────────────┬────────────────────────┘
+ │
+ ▼
+ ┌─────────────────────────────────────────┐
+ │ Dispatch service │
+ │ Query: "idle drivers within 1.5 km │
+ │ of (lat, lng)" │
+ │ → resolves to hex + neighbors, │
+ │ reads sets, returns up to N drivers │
+ └─────────────────────────────────────────┘
+```
+
+### The driver state machine
+
+Drivers move between a few states:
+
+```
+ login start_trip
+offline ────▶ online ────────▶ on_trip
+ ▲ ▲ ▲ │
+ │ │ │ end_trip │
+ │ logout │ └──────────────┘
+ └───────────┘
+ break_start ┌─▶ break ─┐ break_end
+ │ ▼
+ └─── back to online
+```
+
+Only `online` is idle. The stream processor maintains this state per driver from two streams:
+
+* `driver_ping` (every 5 sec, with location, app says "online or break")
+* `trip_status` (`trip_started`, `trip_completed`)
+
+The processor's keyed state per driver holds: `state`, `last_lat`, `last_lng`, `last_ping_at`. Each event updates the state.
+
+### Why streaming, not a database
+
+You could have the app write directly to Postgres. Don't.
+
+* 30,000 drivers × 1 write per 5 sec = 6,000 writes/sec. That's hot.
+* Trips happen continuously. Writes amplify.
+* Locking and ACID overhead destroys throughput.
+
+A stream + an in-memory store handles this comfortably.
+
+### The location index
+
+For "drivers within 1.5 km of point X," we need a spatial index. Two practical choices:
+
+**H3 hex grid (recommended).** Resolution 9 (~0.10 km² hexes) or 8 (~0.74 km²). Each driver's location maps to a hex. A query for "within 1.5 km" expands to "this hex and its neighbors up to N rings."
+
+```python
+import h3
+center = h3.geo_to_h3(lat, lng, resolution=9)
+nearby = h3.k_ring(center, 5)  # this hex and rings 1..5 (roughly 1.5 km at res 9)
+```
+
+Then we read all drivers whose stored hex is in `nearby` and filter by exact distance.
+
+**Geohash.** Same idea with rectangular cells. Works fine for moderate scale. H3 wins on neighbor math.
+
+### The store layout
+
+In Redis terms:
+
+```
+# per-driver state
+HSET driver:42 state online lat 1.3245 lng 103.8512 h3 89...3fff last_ping 1715680000
+EXPIRE driver:42 60
+
+# per-hex index of idle drivers
+SADD idle_drivers:h3:89...3fff 42 91 132
+EXPIRE idle_drivers:h3:89...3fff 60
+```
+
+The TTL is the key safety. If a driver disappears (app crash, dead battery), their entries expire within 60 seconds. We do not need a separate cleanup job.
+
+When the stream processor updates a driver:
+
+* Move them out of the old hex set if they moved.
+* Add them to the new hex set if they are still idle.
+* Refresh the per-driver hash.
+
+If the driver starts a trip, remove them from any hex set. They are no longer idle.
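+
+A sketch of the per-ping update against the layout above, using redis-py and the same h3-py v3 calls as earlier; key names match the example keys shown:
+
+```python
+import time
+
+import h3
+import redis
+
+r = redis.Redis(decode_responses=True)
+
+def on_ping(driver_id: str, lat: float, lng: float, state: str) -> None:
+    """Apply one driver ping to the live store (sketch)."""
+    new_hex = h3.geo_to_h3(lat, lng, 9)
+    key = f"driver:{driver_id}"
+    old_hex = r.hget(key, "h3")
+
+    pipe = r.pipeline()
+    if old_hex and old_hex != new_hex:
+        pipe.srem(f"idle_drivers:h3:{old_hex}", driver_id)    # left the old cell
+    pipe.hset(key, mapping={"state": state, "lat": lat, "lng": lng,
+                            "h3": new_hex, "last_ping": int(time.time())})
+    pipe.expire(key, 60)                                      # stale drivers self-expire
+    if state == "online":                                     # only idle drivers are indexed
+        pipe.sadd(f"idle_drivers:h3:{new_hex}", driver_id)
+        pipe.expire(f"idle_drivers:h3:{new_hex}", 60)
+    else:
+        pipe.srem(f"idle_drivers:h3:{new_hex}", driver_id)
+    pipe.execute()
+```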
+
+### Query path for dispatch
+
+When a new order comes in at the restaurant location:
+
+1. Compute the restaurant's H3 hex.
+2. Get the `k_ring` of hexes covering the search radius (at resolution 9, covering 1.5 km takes roughly k = 5 rings).
+3. Read the `idle_drivers:h3:*` sets for each hex (one Redis MGET).
+4. Hydrate each driver's lat/lng from `driver:` hashes.
+5. Compute exact distance, filter by 1.5 km, sort by distance.
+6. Return the top N.
+
+Total latency: usually 5-20 ms.
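+
+A sketch of that read path, assuming the Redis layout above; the haversine function is the exact-distance check in step 5:
+
+```python
+import math
+
+import h3
+import redis
+
+r = redis.Redis(decode_responses=True)
+
+def haversine_km(lat1, lng1, lat2, lng2):
+    p = math.pi / 180
+    a = (0.5 - math.cos((lat2 - lat1) * p) / 2
+         + math.cos(lat1 * p) * math.cos(lat2 * p)
+         * (1 - math.cos((lng2 - lng1) * p)) / 2)
+    return 12742 * math.asin(math.sqrt(a))    # Earth diameter in km
+
+def idle_drivers_near(lat, lng, radius_km=1.5, rings=5, limit=10):
+    """Dispatch read path: candidate hexes, hydrate, filter, sort (sketch)."""
+    hexes = h3.k_ring(h3.geo_to_h3(lat, lng, 9), rings)
+    candidates = set()
+    for hx in hexes:
+        candidates |= r.smembers(f"idle_drivers:h3:{hx}")
+    nearby = []
+    for driver_id in candidates:
+        row = r.hgetall(f"driver:{driver_id}")
+        if not row:
+            continue                          # TTL expired between the two reads
+        d = haversine_km(lat, lng, float(row["lat"]), float(row["lng"]))
+        if d <= radius_km:
+            nearby.append((d, driver_id))
+    return [driver_id for _, driver_id in sorted(nearby)[:limit]]
+```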
+
+### Handling failure modes
+
+**App crashes.** Pings stop. After 60 seconds the TTL fires and the driver disappears from queries. No bad dispatch.
+
+**No GPS / bad GPS.** Pings still arrive with a quality flag. The stream processor flags poor-quality positions and either holds the last good location for up to 30 seconds or drops the driver out of idle.
+
+**Driver says "online" but they're actually on a trip.** The trip system is the source of truth for `on_trip`. The processor uses trip events to override the app's claim. If a trip event says `trip_started`, we set the driver to `on_trip` regardless of what the ping says.
+
+**Network split between app and dispatch.** The driver shows online for the duration of the TTL, then disappears. Dispatch sees a smaller pool and offers fewer matches, which is the right failure mode.
+
+**Stream processor crash.** Flink resumes from checkpoint. State is reconstructed. The store may briefly hold stale records, but the TTL bounds it.
+
+### Why not just query the warehouse?
+
+* Latency. Warehouse queries are seconds.
+* Cost. Per-query cost adds up at thousands of dispatches per second.
+* Concurrency. Warehouses are not built for the read pattern.
+
+The warehouse is still in the picture for analytics ("how many idle drivers did we have at 6 PM yesterday?"), fed from the same Kafka topics.
+
+### Capacity estimate
+
+30,000 drivers × 1 ping/5 sec = 6,000 writes/sec.
+Read side: dispatch queries during peak meals, maybe 200 orders/sec, each doing a small read. Trivial.
+Store size: 30,000 entries × ~150 bytes = ~5 MB. Plus a hex index over maybe 5,000 hexes. Tiny.
+
+All this fits on a single Redis cluster easily. For a 100x larger company, the same shape holds, you just shard by city or driver_id.
+
+### What about long-running idle drivers parking somewhere?
+
+Some drivers park and wait. They keep sending pings but never move. The system correctly counts them as idle. The dispatch service may want to prefer drivers who have been idle longer (fairness) or shorter (faster pickup). Either is a small change in the query: include `last_ping` and `idle_since` in the response, let dispatch's algorithm pick.
+
+### Common mistakes interviewers want you to name
+
+1. **Storing driver state in the OLTP database.** Hot row contention, slow reads.
+2. **No TTL.** A crashed app means the driver is "online forever" in the store.
+3. **Reading from the warehouse.** Latency and cost both blow up.
+4. **Trusting the app's state claim.** Trip events must override.
+5. **Spatial query that scans all drivers.** Cell-based index is essential.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you change this if you wanted to predict idle drivers, not just track them?"*
+
+Same data flows, plus the warehouse historical view. A model can learn that drivers in hex X tend to be idle at 2 PM on weekdays. Dispatch can pre-warm offers there. The live system stays the same; the prediction is just an extra hint, never the source of truth.
diff --git a/Problem 27: Year in Review Recap/question.md b/Problem 27: Year in Review Recap/question.md
new file mode 100644
index 0000000..7c5a4c2
--- /dev/null
+++ b/Problem 27: Year in Review Recap/question.md
@@ -0,0 +1,28 @@
+## Problem 27: Year in Review Recap
+
+**Scenario:**
+Every December, almost every consumer app sends out a "your year in review" recap: top songs, favorite artists, total kilometers driven, money saved, hours focused. They roll out to hundreds of millions of users in a short window and need to feel personal and beautiful, but be cheap to generate at scale.
+
+In the interview, the question is:
+
+> Sketch the architecture behind those "your year in review" recaps almost every consumer app sends in December.
+
+---
+
+### Your Task:
+
+1. Decide whether this is real-time or batch.
+2. Sketch the pipeline from raw event history to a personalized recap.
+3. Decide the storage and serving model.
+4. Cover the social-share angle (deep link, image generation).
+5. Mention the rollout strategy.
+
+---
+
+### What a Good Answer Covers:
+
+* This is a batch problem, with a serving cache.
+* Aggregation over a year of events, per user.
+* The pre-rendered image / video for sharing.
+* Personalisation as a deterministic function of features.
+* The "everyone opens it within 48 hours" load spike.
diff --git a/Problem 27: Year in Review Recap/solution.md b/Problem 27: Year in Review Recap/solution.md
new file mode 100644
index 0000000..92dcfe1
--- /dev/null
+++ b/Problem 27: Year in Review Recap/solution.md
@@ -0,0 +1,190 @@
+## Solution 27: Year in Review Recap
+
+### The first thing to realize
+
+This looks fancy. It is actually one of the simplest patterns in data engineering: **a huge batch job that pre-computes a small JSON per user, dropped into a fast key-value store, served as static content with personalization on top.**
+
+There is no real-time anything. The "magic" is in the design, copywriting, and image rendering, not the data pipeline. The pipeline just runs once.
+
+### The architecture
+
+```
+ ┌───────────────────────────────────┐
+ │ Warehouse (full year of events) │
+ │ plays / orders / activity rows │
+ │ already in BigQuery / Snowflake │
+ └────────────────┬──────────────────┘
+ │
+ ▼
+ ┌────────────────────────────────────────────────┐
+ │ Yearly aggregation job (SQL + dbt) │
+ │ │
+ │ Per user, compute ~20-50 features: │
+ │ - total minutes / km / spend │
+ │ - top 5 things by count │
+ │ - "biggest day", "longest streak" │
+ │ - novelty / rank vs other users │
+ │ │
+ │ Output: one big table user_recap_features │
+ └────────────────┬───────────────────────────────┘
+ │
+ ▼
+ ┌────────────────────────────────────────────────┐
+ │ Story template selection (deterministic) │
+ │ │
+ │ Map features → which "story cards" each user │
+ │ gets, in what order. Pure function of │
+ │ features. Same user, same year, same cards. │
+ └────────────────┬───────────────────────────────┘
+ │
+ ▼
+ ┌────────────────────────────────────────────────┐
+ │ Pre-rendered share images │
+ │ (one PNG per user, generated headless) │
+ │ Stored on S3 / CDN │
+ └────────────────┬───────────────────────────────┘
+ │
+ ▼
+ ┌────────────────────────────────────────────────┐
+ │ Serving store (DynamoDB / Bigtable / Memcache)│
+ │ Key: user_id │
+ │ Value: { cards: [...], share_image_url, hash }│
+ └────────────────┬───────────────────────────────┘
+ │
+ ┌───────────┴────────────┐
+ ▼ ▼
+ ┌──────────────┐ ┌────────────────────┐
+ │ Mobile app │ │ Social share image │
+ │ requests │ │ served from CDN │
+ │ recap │ │ (no compute) │
+ └──────────────┘ └────────────────────┘
+```
+
+### Why batch is fine
+
+The recap covers a year. Events from December 31 do not change "top artist of 2025." A weekly recompute in late December nails the freshness. There is no business value in real-time here.
+
+The only realtime piece is "when the user opens the app, fetch their card list." That is a single point read.
+
+### The aggregation job
+
+The job runs entirely in SQL inside the warehouse. A simplified shape:
+
+```sql
+WITH per_user AS (
+  SELECT
+    user_id,
+    SUM(seconds_played) / 60.0 AS minutes_total,
+    COUNT(DISTINCT track_id) AS unique_tracks,
+    APPROX_TOP_COUNT(artist, 5) AS top_artists,
+    APPROX_TOP_COUNT(genre, 3) AS top_genres,
+    MAX_BY(play_date, seconds_played) AS biggest_day,
+    COUNT(DISTINCT play_date) AS days_active
+  FROM plays
+  WHERE play_date BETWEEN '2025-01-01' AND '2025-12-31'
+  GROUP BY user_id
+)
+SELECT
+  *,
+  -- a few percentile features
+  PERCENT_RANK() OVER (ORDER BY minutes_total) AS minutes_percentile
+FROM per_user;
+```
+
+The output is one row per user with 20-50 columns. For a 500M user platform, this is a few hundred GB. Trivial for a modern warehouse.
+
+### The story template engine
+
+A user does not see raw numbers. They see a narrative: "You listened to 8,432 minutes this year. That's more than 98% of listeners." The mapping from features to cards is a small rule engine:
+
+```python
+def build_cards(features):
+ cards = []
+ cards.append(intro_card(features.minutes_total))
+ if features.minutes_percentile > 0.95:
+ cards.append(top_listener_card(features.minutes_percentile))
+ cards.append(top_artists_card(features.top_artists))
+ if features.unique_tracks > 1000:
+ cards.append(explorer_card(features.unique_tracks))
+ # …and so on
+ return cards
+```
+
+This runs as part of the same batch job (typically in dbt's macros, or a Spark/BigQuery Python step). The output is per-user JSON.
+
+The critical property: **the output is a deterministic function of the features**. Same features in, same JSON out. So if the job re-runs, users do not see a different story tomorrow.
+
+### Pre-rendered share images
+
+The shareable image (Instagram story format, 1080×1920, with the user's stats overlaid) cannot be rendered on demand. Two reasons:
+
+1. At launch, hundreds of millions of users open the app within 48 hours. On-demand rendering would melt any render farm.
+2. Predictable storage is cheap; bursty CPU is expensive.
+
+So we pre-render. The batch job, after producing the JSON, fans out to a small image renderer (often headless Chromium with a templated HTML page, or Skia, or a custom rasterizer). Output: one PNG per user, written to S3 / GCS / a CDN bucket.
+
+URL pattern: `https://cdn.example.com/recap/2025/{user_id}.png`. The URL is the only thing the app needs to know. The CDN handles the load.
+
+Approximate footprint: 500M users × ~200 KB per image = 100 TB of objects on the CDN for the season. Reasonable.
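+
+The fan-out step itself is unremarkable. A sketch of the per-user upload, assuming boto3 and a hypothetical bucket; the bytes come from whatever renderer the job uses, and the key layout matches the URL pattern above:
+
+```python
+import boto3
+
+s3 = boto3.client("s3")
+
+def publish_share_image(user_id: str, png_bytes: bytes, year: int = 2025) -> str:
+    # Hypothetical bucket and key layout; one object per user per year.
+    key = f"recap/{year}/{user_id}.png"
+    s3.put_object(
+        Bucket="recap-share-images",           # assumed bucket name
+        Key=key,
+        Body=png_bytes,
+        ContentType="image/png",
+        CacheControl="public, max-age=86400",  # let the CDN cache hard
+    )
+    return f"https://cdn.example.com/{key}"
+```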
+
+### Serving
+
+When a user opens the recap, the app calls a small API:
+
+```
+GET /recap/2025
+→ { cards: [...], share_image_url: "..." }
+```
+
+The API is a thin layer over a KV store. Sub-50 ms latency. Same point-read pattern as the banking widget in Problem 22.
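+
+A sketch of that thin layer, assuming DynamoDB and a table keyed on `user_id`; the table name and item shape are illustrative:
+
+```python
+import boto3
+
+table = boto3.resource("dynamodb").Table("user_recap_2025")  # assumed table
+
+def get_recap(user_id: str):
+    # Single point read by partition key; no scans, no joins.
+    resp = table.get_item(Key={"user_id": user_id})
+    item = resp.get("Item")
+    if item is None:
+        return None  # no recap generated: new account, opted out, etc.
+    return {
+        "cards": item["cards"],
+        "share_image_url": item["share_image_url"],
+    }
+```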
+
+### The traffic spike
+
+When the recap drops, every active user opens it within a day or two. That's a 10-30x spike over normal traffic. Strategies:
+
+* **Pre-render everything.** No live rendering.
+* **CDN for images.** Static assets handle a million requests per second on a CDN with no engineering effort.
+* **API caches the JSON aggressively.** TTL of an hour is fine because the data does not change.
+* **Gradual rollout.** Launch to 10% of users, then 30%, then 100% over a week. Smooths the spike.
+* **Pre-warm the cache.** Just before launch, run a background sweep that touches every user's record, so caches are hot.
+
+### Privacy and the "no, please don't" user
+
+Some users do not want to be confronted with their year. The app should:
+
+* Make the recap opt-in to push notifications.
+* Allow opt-out before generation. If a user opts out, their record is filtered from the batch job.
+* Never include features they would find uncomfortable (highly listened-to ex's playlists, etc.). The story engine includes a rules layer to skip cards that are likely to feel invasive.
+
+### Rebuild and corrections
+
+If a bug in the aggregation is discovered the day after launch, the recap can be rebuilt:
+
+* Re-run the aggregation job.
+* Same deterministic story engine produces new JSON.
+* Re-render images.
+* Push new JSON into the serving store.
+
+Because the whole thing is a pure function of warehouse data, rebuilds are safe. The user might briefly see "their recap changed," which is acceptable in the first day.
+
+### Common mistakes interviewers want you to name
+
+1. **Trying to make this real-time.** No business value, huge cost.
+2. **Rendering images on demand.** The first hour of launch dies.
+3. **Non-deterministic story selection** (random emoji, random card order). Refreshing the recap shows a different version. Users hate that.
+4. **No opt-out / privacy controls.** PR risk.
+5. **No gradual rollout.** Launches at midnight, infra cooks.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you decide what makes a 'good' card to include?"*
+
+A few principles I would design around, mostly product not engineering:
+
+* **The card celebrates the user**, never makes them feel bad.
+* **Relative metrics beat absolute** ("more than 87% of listeners" beats "8,432 minutes").
+* **A surprise** in each card, not just the obvious top-N list.
+* **Tested.** A/B test cards on a small slice before the full launch.
+
+Engineering enables this by making story templates pluggable and feature counts easy to add. The hard part is product judgment.
diff --git a/Problem 28: Low Balance Notification Pipeline/question.md b/Problem 28: Low Balance Notification Pipeline/question.md
new file mode 100644
index 0000000..1dea760
--- /dev/null
+++ b/Problem 28: Low Balance Notification Pipeline/question.md
@@ -0,0 +1,30 @@
+## Problem 28: Low Balance Notification Pipeline
+
+**Scenario:**
+A neobank wants to send a push notification "your balance is below $50" to each customer whose balance drops under that threshold. The notification should fire at most once per day per customer, must never wake up the wrong customer, and must respect the customer's local time zone (no push at 3 AM). They have 8 million customers.
+
+In the interview, the question is:
+
+> Design a "low balance" notification pipeline that runs daily and absolutely must not wake the wrong customer.
+
+The phrase "must not wake the wrong customer" is the whole point of the question. The interviewer is testing how you think about correctness under retries, idempotency, and personalization.
+
+---
+
+### Your Task:
+
+1. Sketch the daily pipeline.
+2. Show how you guarantee "at most once" per customer per day.
+3. Cover the local time zone and quiet-hours rules.
+4. Cover what happens when the job retries.
+5. Cover privacy and opt-out.
+
+---
+
+### What a Good Answer Covers:
+
+* Batch eligibility computation in the warehouse.
+* A notifications-sent log used as a dedup gate.
+* Idempotent push with a unique key.
+* Schedule per time zone, not one global schedule.
+* What happens when the pipeline crashes mid-run.
diff --git a/Problem 28: Low Balance Notification Pipeline/solution.md b/Problem 28: Low Balance Notification Pipeline/solution.md
new file mode 100644
index 0000000..a394aa6
--- /dev/null
+++ b/Problem 28: Low Balance Notification Pipeline/solution.md
@@ -0,0 +1,188 @@
+## Solution 28: Low Balance Notification Pipeline
+
+### The pipeline
+
+```
+ ┌──────────────────────────────────────────────┐
+ │ Warehouse (BigQuery / Snowflake) │
+ │ - accounts (current balance, tz, opt-in) │
+ │ - notifications_sent (audit log) │
+ └────────────────┬─────────────────────────────┘
+ │
+ Scheduled per time zone
+ │
+ ▼
+ ┌──────────────────────────────────────────────┐
+ │ Eligibility query (SQL) │
+ │ │
+ │ WHERE balance < threshold │
+ │ AND user is opted in │
+ │ AND user's local time is within │
+ │ send-window (9 AM .. 7 PM) │
+ │ AND no notification sent today already │
+ │ for this user (anti-join) │
+ │ │
+ │ → produces a small "candidates" set │
+ └────────────────┬─────────────────────────────┘
+ │
+ ▼
+ ┌──────────────────────────────────────────────┐
+ │ Notification job │
+ │ For each candidate: │
+ │ 1. Insert into notifications_sent with │
+ │ unique key (user_id, date, type) │
+ │ 2. If insert succeeds (no duplicate), │
+ │ call push API with idempotency_key │
+ │ = the same (user_id, date, type) hash │
+ │ 3. If insert fails (already there), skip │
+ └────────────────┬─────────────────────────────┘
+ │
+ ▼
+ ┌──────────────────────────────────────────────┐
+ │ Push provider (APNs / FCM / Twilio) │
+ │ Receives idempotency_key, dedupes on it │
+ └──────────────────────────────────────────────┘
+```
+
+The trick: **the dedup gate is a row in a real table with a unique constraint**. The order is "claim the right to send" first, then "send." If two workers try the same user at the same instant, only one wins the INSERT, only one calls the push API.
+
+### The eligibility query
+
+```sql
+WITH local_now AS (
+ SELECT
+ a.user_id,
+ a.balance,
+ a.timezone,
+ a.threshold,
+ DATE(CURRENT_TIMESTAMP() AT TIME ZONE a.timezone) AS local_date,
+ EXTRACT(HOUR FROM CURRENT_TIMESTAMP() AT TIME ZONE a.timezone) AS local_hour
+ FROM accounts a
+ WHERE a.notifications_opt_in = TRUE
+ AND a.low_balance_enabled = TRUE
+)
+SELECT n.user_id
+FROM local_now n
+LEFT JOIN notifications_sent s
+ ON s.user_id = n.user_id
+ AND s.notification_type = 'low_balance'
+ AND s.local_date = n.local_date
+WHERE n.balance < n.threshold
+ AND n.local_hour BETWEEN 9 AND 18
+ AND s.user_id IS NULL;
+```
+
+This finds users who:
+
+* Are opted in.
+* Are below their threshold (note: per-user threshold, not a global $50).
+* Are inside their local 9 AM to 7 PM window.
+* Have not yet received the notification today.
+
+The `LEFT JOIN ... IS NULL` is the dedup gate at query time. It is fast and avoids sending to anyone we already sent.
+
+### The unique constraint is the real safety
+
+The query is fast but not safe by itself. If the job runs twice (Airflow retry), both runs could see the same eligible user. The unique constraint in `notifications_sent` is what makes this idempotent.
+
+```sql
+CREATE TABLE notifications_sent (
+ user_id STRING,
+ notification_type STRING,
+ local_date DATE,
+ sent_at TIMESTAMP,
+ push_status STRING,
+ PRIMARY KEY (user_id, notification_type, local_date)
+);
+```
+
+The notification worker pattern:
+
+```python
+import hashlib
+
+def send_one(user_id, local_date):
+    key = (user_id, "low_balance", local_date)
+    try:
+        # Claim the right to send first. The unique constraint on
+        # (user_id, notification_type, local_date) is the real gate.
+        notifications_sent.insert(
+            user_id=user_id,
+            notification_type="low_balance",
+            local_date=local_date,
+            sent_at=now(),
+            push_status="pending",
+        )
+    except UniqueViolation:
+        return  # someone else already sent
+
+    push_provider.send(
+        user_id=user_id,
+        body="Your balance is low",
+        # A stable hash (not Python's per-process hash()) so every retry,
+        # on any worker, produces the same idempotency key.
+        idempotency_key=hashlib.sha256("|".join(map(str, key)).encode()).hexdigest(),
+    )
+
+    notifications_sent.update(key, push_status="delivered")
+```
+
+The unique constraint guarantees "at most once per user per day, per notification type." Even if 50 workers retry the same user, only one row gets inserted, only one push goes out.
+
+### Quiet hours and time zones
+
+This is the part most teams get wrong. They run one global "8 AM" cron and surprise users in other time zones.
+
+Right approach: the eligibility query knows each user's time zone. Run the job often (every hour), and let the WHERE clause filter to users whose **local hour** is currently in the allowed window. A user in Singapore is checked when it is 9 AM in Singapore; a user in San Francisco is checked when it is 9 AM in SF.
+
+```sql
+WHERE EXTRACT(HOUR FROM CURRENT_TIMESTAMP() AT TIME ZONE a.timezone) BETWEEN 9 AND 18
+```
+
+This produces a rolling wave through the day rather than one global send.
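+
+The same rule is worth re-checking in the worker right before the push goes out (the mitigation mentioned under "Time zone update" below). A minimal sketch using only the standard library; the window bounds are the same 9-18 local hours as the query:
+
+```python
+from datetime import datetime, timezone
+from zoneinfo import ZoneInfo
+
+def in_send_window(user_tz: str, start_hour: int = 9, end_hour: int = 18) -> bool:
+    # Evaluate the user's *local* hour at send time, not only at query time.
+    local_now = datetime.now(timezone.utc).astimezone(ZoneInfo(user_tz))
+    return start_hour <= local_now.hour <= end_hour
+```
+
+The worker calls this just before `push_provider.send`; if the answer is no (the job ran late, or the user's time zone changed), it skips and the next hourly run picks them up.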
+
+### What about a user with no time zone set?
+
+Two policies, pick consciously:
+
+* Conservative: skip users with no time zone set. They get nothing, but no one gets woken.
+* Best guess: infer from the country code on their account.
+
+I would default to conservative for the first launch.
+
+### Privacy and opt-out
+
+The `notifications_opt_in` and feature-specific flags (`low_balance_enabled`) live in the user profile. The query checks both. Two protections:
+
+* A user who opts out today does not receive the message even if they were below the threshold this morning. The opt-out check is in the live query, not a snapshot.
+* A user who deletes their account flows through CDC into the warehouse and is marked inactive. They are filtered out.
+
+### What can go wrong
+
+**Crash mid-run.** Half the inserts happened, half did not. Re-running the job sees the half that did, skips them, sends to the rest. No duplicates. The unique constraint is doing the work.
+
+**Push provider double-sends.** Most push gateways support some form of dedup or collapse identifier. We pass the same key for the same (user, type, date), so repeated sends coalesce on their side.
+
+**The balance flickers.** A user dropped to $49, the job sent the message, then their salary hit and the balance jumped to $5,000. We do not "take back" the notification. The day's row exists. They will not get the message again tomorrow unless they drop again.
+
+**Time zone update.** A user changed their time zone after we evaluated them. We sent at 9 AM old time zone, which is now 3 AM in their new one. Sad but unavoidable. Mitigation: re-evaluate the row's `local_hour` at send time, not just query time.
+
+**The threshold changes.** A user lowered their threshold from $100 to $20. Tomorrow's query naturally picks them up at the new threshold. No backfill needed.
+
+### Why I would NOT do this in a streaming job
+
+Tempting design: "send the notification the moment the balance drops below the threshold." Three problems:
+
+* Need to suppress duplicates across a day, including across pauses in the stream. Stream state must persist for 24 hours per user.
+* Quiet hours mean the event needs to be delayed until morning. Now you need a "send later" queue.
+* A failed pipeline silently drops notifications.
+
+A daily batch job, run hourly per time zone, with a unique constraint dedup, is simpler and equally good for this use case. Real-time would matter if the SLA were "within 5 minutes of the balance dropping," but it is not.
+
+### Common mistakes interviewers want you to name
+
+1. **No unique constraint.** Retries duplicate notifications.
+2. **One global cron.** Wakes up users at 3 AM.
+3. **Insert row AFTER push.** Crash window: push sent, no record, retry sends again.
+4. **Inferring time zone from the device IP.** Inaccurate and creepy.
+5. **Same template for every user.** Should be localized.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you handle a customer who responds 'stop' to the notification?"*
+
+The response is a customer-care channel signal. Route it back into the account record as an opt-out on this specific notification type. The next day's eligibility query filters them out. The opt-out is per notification type so they can still get fraud alerts even if they muted low-balance notifications.
diff --git a/Problem 29: Daily Report Quietly Wrong for Two Weeks/question.md b/Problem 29: Daily Report Quietly Wrong for Two Weeks/question.md
new file mode 100644
index 0000000..9c23596
--- /dev/null
+++ b/Problem 29: Daily Report Quietly Wrong for Two Weeks/question.md
@@ -0,0 +1,27 @@
+## Problem 29: Daily Report Quietly Wrong for Two Weeks
+
+**Scenario:**
+The daily revenue report has been wrong for the last two weeks. Nobody noticed because the numbers looked plausible. The finance team caught it during a month-end reconciliation when their hand-computed total did not match what the dashboard had been claiming. By the time you find out, decisions have already been made on the wrong data, including an exec presentation last Wednesday.
+
+In the interview, the question is:
+
+> A daily report has been wrong for two weeks but nobody noticed because the numbers looked plausible. How do you investigate, and what do you change so it never happens again?
+
+---
+
+### Your Task:
+
+1. Walk through how you would actually investigate.
+2. Explain what you would communicate, to whom, and when.
+3. Describe what changes after this incident.
+4. Cover the postmortem mindset.
+
+---
+
+### What a Good Answer Covers:
+
+* The investigation itself: stop the bleeding first.
+* Identifying the change that caused the divergence.
+* Telling stakeholders early, even before you know the cause.
+* Data quality checks that would have caught it.
+* The cultural shift from "tests pass" to "data is right."
diff --git a/Problem 29: Daily Report Quietly Wrong for Two Weeks/solution.md b/Problem 29: Daily Report Quietly Wrong for Two Weeks/solution.md
new file mode 100644
index 0000000..bd051be
--- /dev/null
+++ b/Problem 29: Daily Report Quietly Wrong for Two Weeks/solution.md
@@ -0,0 +1,141 @@
+## Solution 29: Daily Report Quietly Wrong for Two Weeks
+
+### Short version you can say out loud
+
+> The first 30 minutes are about telling people and stopping further bad decisions. The next few hours are about finding the cause and the size of the error. The rest of the week is about correction and rebuilding trust. The real work begins after, with the question "why did this go undetected for two weeks." That answer is almost never one bug. It is a missing check.
+
+### Hour 1: communicate, don't investigate alone
+
+Before I open a query editor, I post in the team channel and the analytics channel:
+
+> "Heads up — finance found a mismatch on the daily revenue numbers, possibly going back two weeks. I am investigating. Please pause any decisions based on the daily revenue dashboard while I confirm what's going on. I will post an update by 2 PM."
+
+Three reasons:
+
+1. People who used the number this morning need to know.
+2. It looks bad to investigate quietly for 6 hours and then post a bombshell.
+3. Others on the team might already know something useful.
+
+I also tag the analytics lead and let my manager know one-to-one.
+
+### Hour 2-3: triage
+
+Two things in parallel:
+
+**A. How big is the error?**
+
+Compare three numbers for the last 30 days:
+
+* The dashboard's claimed daily revenue.
+* The source-of-truth system's daily revenue (the payment gateway, the OLTP database, the partner statement).
+* The warehouse's raw fact, recomputed from scratch.
+
+A simple SQL query or notebook puts the three numbers side by side, one row per day. The day the mismatch starts is usually obvious.
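+
+A sketch of that comparison; the table names are placeholders for wherever each of the three versions of the number lives:
+
+```sql
+SELECT
+  d.date,
+  d.dashboard_revenue,
+  s.source_revenue,
+  w.recomputed_revenue,
+  d.dashboard_revenue - s.source_revenue AS dashboard_vs_source
+FROM dashboard_daily d
+JOIN source_daily s USING (date)
+JOIN warehouse_recomputed w USING (date)
+WHERE d.date >= CURRENT_DATE - 30
+ORDER BY d.date;
+```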
+
+**B. When did the divergence start?**
+
+The first day where the warehouse number stops matching the source is the inflection point. I look at:
+
+* Git log of dbt models touched that week.
+* Airflow / Dagster runs and their durations on those days.
+* Schema change events on the source.
+* Any product release that might have changed event emission.
+
+Three out of four times, the cause is something in that list. The first cause I find may be the only cause, or there may be two combining. Be honest about which.
+
+### Hour 4-5: confirm root cause
+
+A few classic culprits:
+
+* **A source schema change.** A new column was added or a value's meaning changed. The transform silently dropped or miscategorized rows.
+* **A renamed column** killed a join key, producing NULL where there used to be a value.
+* **A new product feature** generates events with a different shape. The transform never accounted for them.
+* **A time zone change** in some source system, off by one day.
+* **A unit change**, like a price moving from cents to dollars (this one I've seen).
+
+I want to be able to say "the divergence is N% per day and is caused by X." If I cannot, I keep digging.
+
+### Hour 6 onwards: tell stakeholders the real story
+
+A second message, this time concrete:
+
+> "Update: the daily revenue dashboard has been undercounting by about 4.2% since May 1. Cause: on April 30 we added a new product line whose events come through a different topic, and the transform that builds the revenue fact was not updated to include them. The amount is real revenue, not lost — it just was not showing in the dashboard. I have a fix in PR review and will backfill the last 14 days by tomorrow morning. Decisions made on this data should be revisited; I am compiling a list of which dashboards were affected."
+
+The tone matters. Specific. Not blaming a person. Names the impact.
+
+### The fix and the backfill
+
+The fix lands in dbt. The backfill runs the corrected model for the affected partitions. Because dbt models are idempotent by partition (as in Problem 9), the backfill is safe to rerun.
+
+After the backfill, I refresh the comparison and confirm the numbers match the source of truth.
+
+### Now the real work: why did nobody notice
+
+This is what the postmortem is actually about. The answer is usually one of:
+
+* **No tolerance check.** Nobody had wired "warehouse daily revenue should be within 1% of payments source daily revenue." If they had, a page would have fired on day one.
+* **No anomaly check.** Daily revenue dropped 4.2% with no explanation. A simple "compare to 28-day moving average" would have flagged it.
+* **The dashboard owner moved teams** and the new owner did not know to monitor it.
+* **People don't believe the dashboard anymore**, so they don't look at it carefully. This one is cultural and it's the worst.
+
+### The changes I would propose
+
+Four layers of defense:
+
+**1. Source-of-truth reconciliation.** A dbt test that compares the daily revenue fact to the source system's daily total within a tolerance. Runs after every load. Pages on mismatch larger than 1%.
+
+```yaml
+- name: revenue_matches_source
+ test: assert_within_tolerance
+ expression: |
+ SELECT
+ ABS(SUM(warehouse_revenue) - SUM(source_revenue)) /
+ NULLIF(SUM(source_revenue), 0) AS diff_pct
+ FROM revenue_reconciliation_view
+ WHERE date = CURRENT_DATE - 1
+ threshold: 0.01
+```
+
+**2. Anomaly check.** A check that compares yesterday's value to the trailing 28-day average. Pages on >3 standard deviations. Catches even smaller drifts.
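+
+A sketch of that check in SQL; the table and column names are assumed, and the thresholds are the ones from the text:
+
+```sql
+WITH history AS (
+  SELECT date, revenue
+  FROM marts.daily_revenue
+  WHERE date BETWEEN CURRENT_DATE - 29 AND CURRENT_DATE - 2
+),
+stats AS (
+  SELECT AVG(revenue) AS mean_rev, STDDEV(revenue) AS sd_rev
+  FROM history
+)
+SELECT
+  y.revenue,
+  stats.mean_rev,
+  ABS(y.revenue - stats.mean_rev) > 3 * stats.sd_rev AS should_page
+FROM marts.daily_revenue AS y
+CROSS JOIN stats
+WHERE y.date = CURRENT_DATE - 1;
+```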
+
+**3. Schema drift alert.** Any column added or removed in a source feeds a notification to the data team. Even if the transform survives, someone looks at the change.
+
+**4. Ownership.** Every important dashboard has a named owner. The owner gets a weekly "your dashboard updated successfully" digest. If the digest stops, they know to ask why.
+
+### What I would NOT do
+
+* Blame an individual. Two weeks of silence is a system problem.
+* Add a check for this specific schema change. The bug will not repeat. The next one will be different.
+* Promise "this will never happen again." It will. Promise to detect it within a day.
+
+### How to talk about this in a real interview
+
+If the interviewer wants you to actually walk it through, the spine of the answer is:
+
+1. Communicate first.
+2. Quantify the error.
+3. Find the change that caused it.
+4. Confirm size, tell stakeholders, fix and backfill.
+5. Postmortem on the *detection failure*, not the bug itself.
+6. Add structural checks so the next divergence is caught fast.
+
+If you can hit those six points, you are answering the question they are really asking, which is: "are you the kind of engineer I want around when something goes wrong?"
+
+### Common mistakes interviewers want you to name
+
+1. **Investigating quietly.** Other people are still using the wrong number while you dig.
+2. **Focusing on the bug, not the detection gap.** Catching the next one matters more than this one.
+3. **Adding 50 alerts after the incident.** Alert fatigue. Three high-signal checks beat 50 noisy ones.
+4. **Not actually backfilling.** Some teams "fix forward" only. Old reports remain wrong.
+5. **Forgetting to tell the people who used the bad data.** Decisions they made may need revisiting.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you build a culture where teams trust the data enough to actually report mismatches early?"*
+
+Two things help:
+
+1. **Make data quality visible.** A "freshness and accuracy" status page that updates automatically. People learn the system has health.
+2. **Reward early reporting.** When finance reports a mismatch, thank them in public. Make it clear that catching things is good, not annoying.
+
+Trust is built slowly and broken fast. The two-week silence above destroyed some of it; rebuilding takes a few quarters of clean operation and visible checks.
diff --git a/Problem 30: Warehouse Cost Doubled in Two Months/question.md b/Problem 30: Warehouse Cost Doubled in Two Months/question.md
new file mode 100644
index 0000000..bed9be2
--- /dev/null
+++ b/Problem 30: Warehouse Cost Doubled in Two Months/question.md
@@ -0,0 +1,29 @@
+## Problem 30: Warehouse Cost Doubled in Two Months
+
+**Scenario:**
+Your finance team forwards a bill. The Snowflake or BigQuery bill is up 100% over two months, from $14,000/month to $28,000/month. They want an explanation by Friday, and they want it not to grow further. The engineering manager wants no breaking changes for the existing analytics team.
+
+In the interview, the question is:
+
+> Your team's warehouse cost doubled in two months. Walk through the conversation you would have with finance and with engineering.
+
+This is a hybrid: technical investigation plus stakeholder management.
+
+---
+
+### Your Task:
+
+1. Walk through your investigation in the warehouse.
+2. Show how you would frame the answer for finance vs engineering.
+3. Sketch the immediate cost reductions you would propose.
+4. Cover the longer-term governance.
+
+---
+
+### What a Good Answer Covers:
+
+* Reading the warehouse usage tables (INFORMATION_SCHEMA, ACCOUNT_USAGE).
+* Splitting cost by query, by user, by warehouse, by table.
+* The 80/20 rule of warehouse cost: a handful of queries.
+* The conversation: finance wants a number, engineering wants a plan.
+* Tagging, alerting, and budgets.
diff --git a/Problem 30: Warehouse Cost Doubled in Two Months/solution.md b/Problem 30: Warehouse Cost Doubled in Two Months/solution.md
new file mode 100644
index 0000000..7515c45
--- /dev/null
+++ b/Problem 30: Warehouse Cost Doubled in Two Months/solution.md
@@ -0,0 +1,148 @@
+## Solution 30: Warehouse Cost Doubled in Two Months
+
+### Short version you can say out loud
+
+> I would get the data before the conversation. Almost always, 5 to 10 queries or jobs account for most of the increase. Then I would talk to finance with a number and a plan, and to engineering with options that don't break anyone's work. The biggest cost win is usually scheduled queries that scan more than they need to, plus one or two new dashboards that someone built without realising they would run hourly.
+
+### Step 1: pull the actual numbers (the same day)
+
+The warehouse usage tables tell you exactly what is happening.
+
+In Snowflake:
+
+```sql
+-- Cost by warehouse by day, last 90 days
+SELECT
+ warehouse_name,
+ DATE_TRUNC('day', start_time) AS day,
+ SUM(credits_used) AS credits
+FROM snowflake.account_usage.warehouse_metering_history
+WHERE start_time > DATEADD(day, -90, CURRENT_DATE)
+GROUP BY 1, 2
+ORDER BY day, credits DESC;
+```
+
+In BigQuery:
+
+```sql
+-- Cost by user, by day, last 90 days
+SELECT
+ user_email,
+ DATE(creation_time) AS day,
+ SUM(total_bytes_billed) / POW(2,40) AS tib_billed,
+ SUM(total_bytes_billed) / POW(2,40) * 6.25 AS approx_usd -- on-demand pricing
+FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
+WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
+GROUP BY 1, 2
+ORDER BY 2, 4 DESC;
+```
+
+I run three slices:
+
+1. **By day.** Where did the increase start?
+2. **By user / service account.** Which team or pipeline is responsible?
+3. **By query hash or destination table.** Which specific jobs are expensive?
+
+Almost always, a chart shows the inflection point and a single user or service account explains most of the increase.
+
+### Step 2: find the top offenders
+
+The 80/20 rule holds in warehouse cost. Five to ten queries usually explain 70 to 90 percent of the growth.
+
+```sql
+-- BigQuery: top jobs by cost, last 30 days
+SELECT
+ query,
+ user_email,
+ COUNT(*) AS runs,
+ SUM(total_bytes_billed) / POW(2,40) AS tib_billed
+FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
+WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
+ AND job_type = 'QUERY'
+GROUP BY 1, 2
+ORDER BY tib_billed DESC
+LIMIT 20;
+```
+
+For each top offender, I want three answers:
+
+* What is it for?
+* Does it need to run as often as it does?
+* Can it scan less?
+
+The investigation usually finds:
+
+1. **A new scheduled query** runs every hour, scans 500 GB of unpartitioned data, returns 10 rows. Easy win.
+2. **A dashboard refresh** every 5 minutes, even though the data is daily. Easy win.
+3. **A `SELECT *` from a 3 TB table** in a notebook somebody forgot about. Easy win.
+4. **A model rebuild** that grew from 100 GB to 800 GB because the data grew. This is real growth, not waste.
+
+### Step 3: framing for finance
+
+Finance wants a number and a plan, not a journey.
+
+> "The increase from $14k to $28k breaks down as: 60% (~$8.4k) is three scheduled queries that scan full table partitions when they only need yesterday. 20% (~$2.8k) is a dashboard that auto-refreshes 12 times an hour. 20% (~$2.8k) is real growth as our user base grew 30% this quarter.
+>
+> The first 80% can be addressed this week without breaking any user experience. I expect to land back around $16-17k next month.
+>
+> The remaining $3k is real growth; if revenue tracks, it should not concern us, but we will keep watching."
+
+Three properties of a good finance message: a specific decomposition, a clear "what is fixable," and a forward number.
+
+### Step 4: framing for engineering
+
+Engineering does not want a number; they want options and impact.
+
+> "Here are five changes. Each one I have estimated the savings, the risk, and the work.
+>
+> 1. Partition-prune the `daily_metrics_rollup` scheduled query (5 lines change). Save ~$3k/mo. Zero downstream impact.
+> 2. Cap the dashboard auto-refresh to once per hour (dashboard config). Save ~$1.5k/mo. Users will notice "last refreshed an hour ago" but data is daily anyway.
+> 3. Replace the abandoned notebook's nightly run by stopping the schedule. Save ~$1k/mo. Confirmed with the owner — they forgot it existed.
+> 4. Materialize the heaviest dashboard query into a daily table (~$0.5k/mo savings, modest dbt work).
+> 5. (Longer term) Adopt a query-tagging convention and per-team budgets.
+>
+> Total savings: about $6k/mo by next Wednesday. I will need ~half a day."
+
+Note that I'm not asking for permission. I am proposing changes with their impact named. Engineering can push back on the dashboard refresh policy if they want to, but the proposal is on the table.
+
+### The 80/20 list of fixes I would always check
+
+1. **Unpartitioned scans on partitioned tables.** The `WHERE date >= ...` filter is missing, or it wraps the partition column in a function so pruning never happens (see the sketch after this list).
+2. **`SELECT *` from wide tables.** Especially in dashboards and notebooks. Costs you bytes-billed for every column.
+3. **Dashboard auto-refresh** at higher frequency than the data changes.
+4. **No materialised views or summary tables** for queries that run every hour and re-aggregate the same data.
+5. **Massive joins where one side is huge and unfiltered.** A small WHERE moved before the join saves 90 percent.
+6. **Compute warehouses left running** with too high a size, or auto-suspend turned off (Snowflake-specific).
+7. **dbt `--full-refresh` running daily by accident.** Should be incremental.
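+
+For item 1, the fix is usually a one-line change. A sketch with assumed table and column names; on a table partitioned by `event_date`, the two queries return the same answer but bill very differently:
+
+```sql
+-- Scans every partition: the partition column is hidden inside a function
+SELECT SUM(revenue)
+FROM analytics.events
+WHERE DATE(event_ts) = '2025-05-08';
+
+-- Prunes to a single partition: filter directly on the partition column
+SELECT SUM(revenue)
+FROM analytics.events
+WHERE event_date = '2025-05-08';
+```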
+
+### Long term: governance, not just optimization
+
+After the immediate fixes, the goal is for this not to recur. I would propose:
+
+* **Query tags.** Every dbt model, every dashboard, every scheduled query carries a tag (`team:`, `purpose:`, `dataset:`). Cost reports group by tag.
+* **A weekly cost report.** Distributed automatically. Each team sees their own number.
+* **A budget per team.** Soft for now: a Slack alert when a team is on pace to exceed the monthly budget. Hard later if needed.
+* **A "before merging an expensive query" review rule.** dbt and the warehouse make this easy with dry runs.
+* **Reserved capacity or commitments** if costs are stable enough. BigQuery slots or Snowflake commitments can save 30-50 percent vs on-demand once usage is predictable.
+
+### The conversation with finance going forward
+
+The relationship matters as much as the numbers. After this incident:
+
+* Send finance a one-paragraph monthly summary, even when costs are flat. They do not have to chase me.
+* Pre-warn them before a known cost-up event ("we are loading historical data next week, this will be a one-time spike").
+* Have a rough rule of thumb for "cost per active user" or "cost per revenue dollar." That makes the conversation about value, not just spend.
+
+### Common mistakes interviewers want you to name
+
+1. **Adding more compute.** Some teams react to high cost by buying reserved capacity that locks in the inflated usage.
+2. **Optimizing in the dark.** Without the usage tables, you guess.
+3. **Killing a query without finding its owner.** Outages incoming.
+4. **One-time fix, no follow-up.** Same problem in 3 months.
+5. **Not telling the team what was fixed.** Bills drop, no one knows why, behavior repeats.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if cost is up but utilization is up too, and the team genuinely needs the data?"*
+
+Then the conversation is about value, not cost. Show finance the metric: "cost per active user," "cost per dollar of revenue analyzed," whatever fits the business. If costs are growing slower than the business, you are doing fine. If they are growing faster, the optimisation work above is still useful, but the bigger question is "can we get to per-slot or reserved pricing now that we have a steady baseline."
diff --git a/Problem 31: The Dashboard is Wrong/question.md b/Problem 31: The Dashboard is Wrong/question.md
new file mode 100644
index 0000000..cf568cb
--- /dev/null
+++ b/Problem 31: The Dashboard is Wrong/question.md
@@ -0,0 +1,29 @@
+## Problem 31: "The Dashboard Is Wrong"
+
+**Scenario:**
+A senior analyst pings you on Slack: "the dashboard is wrong." When you ask which number, they cannot say exactly, just that "it doesn't feel right." This is a real thing that happens a lot. You have to handle it without making them feel dismissed, and without spending two days chasing a ghost.
+
+In the interview, the question is:
+
+> A senior analyst says "the dashboard is wrong" but cannot tell you which number. How do you handle that conversation?
+
+This is testing communication and structured debugging under ambiguity, not SQL.
+
+---
+
+### Your Task:
+
+1. Describe how you would respond in the first message.
+2. Explain how you turn vague concern into a specific question.
+3. Cover what you do if there really is a bug, and what you do if there is not.
+4. Talk about the trust dynamic.
+
+---
+
+### What a Good Answer Covers:
+
+* Take the concern seriously, even when vague.
+* Open-ended questions that elicit the specific number.
+* Showing your investigation as you go.
+* Confirming the bug or confirming the data (with humility).
+* Building a habit so this is easier next time.
diff --git a/Problem 31: The Dashboard is Wrong/solution.md b/Problem 31: The Dashboard is Wrong/solution.md
new file mode 100644
index 0000000..5449058
--- /dev/null
+++ b/Problem 31: The Dashboard is Wrong/solution.md
@@ -0,0 +1,132 @@
+## Solution 31: "The Dashboard Is Wrong"
+
+### Short version you can say out loud
+
+> I take it seriously and I do not argue with the gut feeling. I ask one open question to find the specific number, then I show my work as I check. Usually the analyst has noticed a real thing — they just have not localized it yet. If it turns out the data is right, the conversation ends with "thanks for catching this" and a small change so the next person trusts the dashboard faster.
+
+### The first reply
+
+```
+Don't say: "The data is correct, I checked yesterday."
+Don't say: "The pipeline is healthy, no errors."
+Don't say: "Which exact number?" (too blunt)
+
+Do say:
+"I'll take a look. To help me find it fast — what were you
+expecting to see, and what looks off? Even rough numbers are fine."
+```
+
+This does three things:
+
+1. Takes them seriously instantly.
+2. Asks for their expectation, not the actual number. This is the key move. "It feels off" is hard; "I expected revenue around $X" is concrete.
+3. Lowers the cost for them to engage. They do not have to be precise.
+
+### Turning vague into specific
+
+Senior analysts almost always have a number in mind. A few questions that get there:
+
+* "Roughly when did it look off? Today, this week?"
+* "Is one chart wrong, or does it ripple through several?"
+* "Compared to what — last week, last quarter, your forecast?"
+
+Within two messages, you usually have something concrete: "Revenue for last Thursday showed $182k, I expected closer to $220k."
+
+Now I have something to check.
+
+### My investigation, out loud
+
+I share my screen or paste queries as I go. Three reasons:
+
+1. The analyst can correct me if I am looking at the wrong thing.
+2. They learn the data better. Next time they can self-serve a little more.
+3. It builds trust: I am not deciding alone behind a curtain.
+
+The checks:
+
+```
+1. Does the dashboard match the warehouse table it claims to read from?
+ SELECT SUM(revenue) FROM marts.daily_revenue WHERE date = '2025-05-08';
+
+2. Does the warehouse table match its upstream source?
+ Compare against payments source / OLTP / partner statement.
+
+3. Did the pipeline run fully and on time for that date?
+ Check Airflow / dbt / data freshness.
+
+4. Is there something special about that date?
+ Holiday, deploy, migration, source outage.
+
+5. Has the dashboard's underlying query changed recently?
+ git blame on the SQL.
+```
+
+In 80 percent of cases, one of these surfaces the answer in 15 minutes.
+
+### Three possible outcomes
+
+**Outcome A: there really is a bug.**
+
+Now treat it like Problem 29. Communicate to the wider team, find the size of the error, fix and backfill, and add a check so the next time it shows up automatically.
+
+**Outcome B: the data is right, the analyst's expectation was wrong.**
+
+This is the delicate one. Don't say "the data is fine, you were wrong." Instead:
+
+> "Confirmed against the payments source: Thursday was $182,144. The dip was a Singapore public holiday — the same drop appears in last year's data. I will add a holiday flag to the dashboard so this is obvious next time."
+
+The analyst's instinct was good. They saw an unusual number. The data is correct, but the dashboard could have made it easier to read. The fix on our side is a small UX improvement.
+
+**Outcome C: the number is right, but the metric definition surprised them.**
+
+Common case: "revenue" includes refunds, or excludes them, and the analyst assumed the opposite. The dashboard is technically correct but doesn't show the assumption.
+
+> "Found it. The number is net of refunds. You were expecting gross. Both are valid views, but the dashboard does not say which. I will add a tooltip and rename the column."
+
+### The trust dynamic
+
+The hardest part of this conversation is not the SQL. It is that an analyst who flags a "wrong dashboard" is risking their reputation. If I argue back, they will stop telling me. The next time something is actually broken, I will not hear about it.
+
+So the rule is: never make them feel small. Even when the data is right, their instinct caught something — usually a clarity gap.
+
+Equally, there is a rule for them: they do not get to route around it with side queries that diverge from the dashboard. We fix the dashboard itself. That keeps a single source of truth.
+
+### Making this easier next time
+
+After the conversation, a few small changes pay off forever:
+
+* **Tooltips on every metric** explaining the definition. "Revenue = gross sales net of refunds, in local currency."
+* **A "data freshness as of" stamp** on the dashboard. Stale data is the second most common cause of "this feels wrong."
+* **A "compare to last week" or "compare to last year"** toggle. The analyst's instinct usually came from a comparison.
+* **A "report a problem" button** that captures the dashboard, the filters, the user, and a short note. Half of these messages would be specific from the start.
+
+### What I would NOT do
+
+* Reply "the pipeline succeeded, the data is correct." It is dismissive even when literally true.
+* Demand they file a ticket before I look. They might never come back.
+* Investigate in silence for 4 hours and then announce a conclusion. Bring them along.
+* Apologize for a bug that does not exist. If the data is right, say so plainly, while still thanking them.
+
+### How to talk about this in an interview
+
+The interviewer is checking three things:
+
+1. **Empathy.** Do you take the analyst seriously?
+2. **Structure.** Do you have a checklist when the question is vague?
+3. **Honesty.** Can you say "the data is right" without being defensive?
+
+If you can show all three in three minutes, you are answering the question.
+
+### Common mistakes interviewers want you to name
+
+1. **Defending the pipeline before checking.** "The pipeline is fine" is not the same as "the dashboard is fine."
+2. **Asking for too much detail up front.** "Which date, which filter, which row" feels like an interrogation.
+3. **Fixing only the symptom.** The analyst gets a new dashboard; the next analyst will be confused too.
+4. **Not closing the loop.** They reported it; tell them what you found.
+5. **Marking the issue "won't fix"** because the data was technically right. It cost trust; that is not "won't fix."
+
+### Bonus follow-up the interviewer might throw
+
+> *"How do you handle it when the analyst is wrong twice in a week and the team starts treating them as 'that person who cries wolf'?"*
+
+Take them aside privately and offer a hand. "I noticed both reports last week turned out to be misunderstanding the definition. Want me to walk through the metric layer with you in 30 minutes? It'll save us both time." This protects them, builds their understanding, and reduces noise without ever shaming them in public.
diff --git a/Problem 32: Inheriting a Pipeline No One Owns/question.md b/Problem 32: Inheriting a Pipeline No One Owns/question.md
new file mode 100644
index 0000000..ecefd2b
--- /dev/null
+++ b/Problem 32: Inheriting a Pipeline No One Owns/question.md
@@ -0,0 +1,29 @@
+## Problem 32: Inheriting a Pipeline No One Owns
+
+**Scenario:**
+You inherit a pipeline that has been running for two years. The engineer who built it has left. The team has shrunk. A new requirement comes in that touches a transform deep inside the pipeline, and you realise no one alive understands how it works. The schedule says it runs daily, and most days it succeeds. Sometimes it fails and someone restarts it. Now you need to change something and you do not know what you will break.
+
+In the interview, the question is:
+
+> A pipeline you built two years ago is still running, but the original team is gone. A new requirement breaks it. What is the first thing you do?
+
+This is testing your judgment under ownership transfer. Lots of engineers get this wrong by either rewriting too eagerly or being too scared to change anything.
+
+---
+
+### Your Task:
+
+1. Resist the urge to start coding immediately.
+2. Walk through how you would learn the pipeline before changing it.
+3. Sketch a minimal first change that proves you understand the system.
+4. Cover how you would document as you go.
+
+---
+
+### What a Good Answer Covers:
+
+* Read the code and the schedule before touching anything.
+* Run it locally or in a sandbox to feel its behavior.
+* Find the seams: where could you change without rippling?
+* Test coverage as the first investment.
+* Make the new requirement a small, reversible change.
diff --git a/Problem 32: Inheriting a Pipeline No One Owns/solution.md b/Problem 32: Inheriting a Pipeline No One Owns/solution.md
new file mode 100644
index 0000000..3d352fc
--- /dev/null
+++ b/Problem 32: Inheriting a Pipeline No One Owns/solution.md
@@ -0,0 +1,129 @@
+## Solution 32: Inheriting a Pipeline No One Owns
+
+### Short version you can say out loud
+
+> I do not start coding. I spend the first day reading the pipeline end to end, running it in a sandbox, and writing down what I observe. I look for the seam where the new requirement is supposed to land. I add the smallest possible change with a way to roll back, and I add a test for the change so that even if I do not understand the rest of the pipeline yet, I understand the part I touched. The instinct to "just rewrite it" is the most expensive mistake here.
+
+### Day 1: read, do not code
+
+Things I would read in this order:
+
+1. **The schedule and the recent run history.** When does it run, how often does it fail, how long does each step take?
+2. **The DAG / orchestrator config.** Airflow, Dagster, whatever it is. Draw the dependency graph on paper.
+3. **The code, top down.** Start from the entry point, not from the new requirement's spot.
+4. **The destination tables.** What does the output look like, what is its grain, which columns are populated and which look untouched?
+5. **The last 3 months of incident notes.** Slack, on-call channel, anything. People often leave breadcrumbs.
+
+By end of day 1 I should be able to draw the pipeline on a whiteboard and explain each box. If I cannot, I keep reading.
+
+### Day 2: run it in a sandbox
+
+A real-data sandbox is the second most valuable investment. I would:
+
+* Spin up a copy that points at a non-production target.
+* Trigger one historical day's run.
+* Diff the output of my sandbox against the real production output for the same day.
+* They should match exactly (a sketch of the diff is below). If they do not, my sandbox setup is wrong, or the pipeline has hidden state.
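+
+A minimal version of that diff, assuming the prod and sandbox outputs land in differently named datasets; run it in both directions:
+
+```sql
+-- Rows present in the prod output but missing from the sandbox output.
+-- Swap the two tables to check the other direction; both should be empty.
+-- EXCEPT DISTINCT is BigQuery syntax; plain EXCEPT / MINUS elsewhere.
+SELECT * FROM prod_marts.daily_revenue WHERE date = '2025-05-08'
+EXCEPT DISTINCT
+SELECT * FROM sandbox_marts.daily_revenue WHERE date = '2025-05-08';
+```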
+
+Running it teaches things reading does not: which steps are actually slow, which call external APIs, where retries happen, what the failure modes look like.
+
+### What I am looking for
+
+While reading and running, I am hunting for:
+
+* **Pure functions vs side effects.** Pure transforms are easy to change. The dangerous code is the part that talks to external systems.
+* **Idempotency.** Can I run it twice for the same day? If yes, I can experiment without fear.
+* **Seams.** Places where the change is supposed to land. Usually it is one transform out of many, not a sprawling rewrite.
+* **Surprises.** "Why is this transform here?" If I cannot explain it after reading, I do not delete it. Chesterton's fence.
+
+### A note on Chesterton's fence
+
+There is always code in an inherited pipeline that looks pointless. Maybe a strange filter, a magic constant, a weird CASE WHEN. The instinct is to delete it.
+
+Don't.
+
+It was put there for a reason, usually a bug or a stakeholder request. Until I know the reason, I leave it alone. If I really need to remove it, I ask first: "Anyone remember why this filter exists?" Half the time someone does. The other half, I add a comment "removed 2025-05, reason unknown, no apparent downstream impact" and watch for two weeks.
+
+### Day 3: the smallest change that does the new requirement
+
+Now I implement the change. The discipline:
+
+* **Single PR.** No "while we are at it" cleanup.
+* **Tests for the new behavior.** Even just a small fixture and an assertion in the same PR. It may be the only test in the pipeline, but it is more than there was yesterday.
+* **A way to roll back.** Either a feature flag, or a small migration that is reversible. Idempotent partition replace is gold here: I can rerun yesterday's data with the old code if the new one is wrong.
+* **Production parity check.** Run the modified pipeline against the same day, compare the output against production. The new column or new rule should appear; everything else should be identical.
+
+### Documentation as I learn
+
+Every time I figure something out, I write it down. Not a beautiful doc. A `README.md` next to the code, with sections like:
+
+```
+# Pipeline: revenue-rollup
+
+## Schedule
+Runs at 03:00 UTC daily. Backfills allowed for last 14 days.
+
+## Inputs
+- raw.orders (partitioned by event_date)
+- raw.refunds (partitioned by event_date)
+- ref.tariffs (full table, SCD2)
+
+## Outputs
+- marts.daily_revenue (one row per region per day)
+
+## Known quirks
+- The `is_test_account` filter exists because of a 2023 incident
+ with the QA team's traffic appearing in revenue. Do not remove.
+- The Athena step in stage 3 is much slower than the rest. ~25 min.
+ Not yet investigated.
+
+## Failure modes I have observed
+- Source partition arrives late. Pipeline succeeds on empty input.
+ This is the silent failure mode; we have not added a check yet.
+```
+
+Three months from now, the next person inheriting this will thank me. So will I, when I have forgotten.
+
+### What I would NOT do in the first month
+
+* **Rewrite the pipeline.** It works, and as the new owner I am the worst person to rewrite it: I have the least context of anyone who has ever touched it.
+* **Migrate it to a new orchestrator.** Even if Airflow is "old," the migration is the wrong first move.
+* **Add 20 new dbt tests.** Add 2, on the things I touched. More can come later.
+* **Remove the strange CASE WHEN.** Chesterton.
+* **Restart failed runs without reading the error.** That is how the previous team got here.
+
+### What I would do in the first month, besides the change
+
+After the immediate change is live, I would:
+
+* Add a freshness check on the output table: "yesterday's row exists, and the row count is within 30 percent of the trailing 7-day average" (a sketch follows this list).
+* Tag the pipeline with my name as owner so future questions come to me, not nobody.
+* Schedule a 15-minute "data team handover" where I walk a colleague through what I know. Now we have two people who know it.
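+
+A sketch of the freshness-and-volume check from the first bullet, using the output table from the README above; names and the 30 percent tolerance are assumptions to adjust:
+
+```sql
+WITH yesterday AS (
+  SELECT COUNT(*) AS row_cnt
+  FROM marts.daily_revenue
+  WHERE date = CURRENT_DATE - 1
+),
+baseline AS (
+  SELECT AVG(daily_cnt) AS avg_cnt
+  FROM (
+    SELECT date, COUNT(*) AS daily_cnt
+    FROM marts.daily_revenue
+    WHERE date BETWEEN CURRENT_DATE - 8 AND CURRENT_DATE - 2
+    GROUP BY date
+  ) AS per_day
+)
+SELECT
+  yesterday.row_cnt,
+  baseline.avg_cnt,
+  yesterday.row_cnt = 0
+    OR ABS(yesterday.row_cnt - baseline.avg_cnt) / baseline.avg_cnt > 0.30
+    AS should_alert
+FROM yesterday
+CROSS JOIN baseline;
+```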
+
+### The conversation with the requester
+
+The new requirement came from somewhere. The conversation should be honest:
+
+> "I will deliver the change by Friday. While I am there, I'm going to spend two days getting comfortable with the pipeline because the previous owner left and the code is dense. After Friday, I will know the pipeline well enough to commit to faster turnaround on future requests."
+
+This sets expectations and earns time for the careful read. Most stakeholders are fine with this if you are clear.
+
+### Common mistakes interviewers want you to name
+
+1. **The big rewrite reflex.** Cost is high, risk is enormous, value is unclear.
+2. **Touching prod first.** Always sandbox.
+3. **Removing "obviously dead" code** without finding out why it was there.
+4. **Not adding documentation** while the system is fresh in your head. It will not be fresh in a month.
+5. **Working alone.** Pair the read with a teammate; the second pair of eyes catches surprises.
+
+### Bonus follow-up the interviewer might throw
+
+> *"At what point would you rewrite it?"*
+
+Three signals together:
+
+1. The code is genuinely hard to change safely (no tests, hidden state, fragile dependencies) AND a major requirement is coming that forces a deep change.
+2. The cost of operating it (incidents, debugging time) is high enough to make rewriting cheaper over 6 months.
+3. The team has the bandwidth to maintain both old and new for the transition.
+
+If only one is true, do not rewrite. Patch carefully. The rewrite is justified only when the alternative is more painful, not when the existing code is ugly.
diff --git a/Problem 33: Executive Needs a Number Tomorrow/question.md b/Problem 33: Executive Needs a Number Tomorrow/question.md
new file mode 100644
index 0000000..d7b5306
--- /dev/null
+++ b/Problem 33: Executive Needs a Number Tomorrow/question.md
@@ -0,0 +1,27 @@
+## Problem 33: Executive Needs a Number Tomorrow
+
+**Scenario:**
+The CFO calls at 4 PM. They are preparing a board update for tomorrow morning. They need "total customer acquisition cost by channel for the last quarter." You know the data has quality issues: there were ingestion gaps in marketing data last month, and the cost attribution model has a known disagreement with finance. You have a few hours. They want a number.
+
+In the interview, the question is:
+
+> An executive asks for a number by tomorrow morning, but the data has known quality issues. What do you tell them, and what do you ship?
+
+---
+
+### Your Task:
+
+1. Frame the conversation with the exec.
+2. Decide what to ship.
+3. Cover what notes go with the number.
+4. Plan the follow-up after the meeting.
+
+---
+
+### What a Good Answer Covers:
+
+* You ship a number. "We don't have data" is not an answer.
+* You make the caveats clear and short.
+* You decide what to caveat and what to silently fix.
+* You document the source so it can be reproduced.
+* You ask one question to scope the answer.
diff --git a/Problem 33: Executive Needs a Number Tomorrow/solution.md b/Problem 33: Executive Needs a Number Tomorrow/solution.md
new file mode 100644
index 0000000..8da56e0
--- /dev/null
+++ b/Problem 33: Executive Needs a Number Tomorrow/solution.md
@@ -0,0 +1,116 @@
+## Solution 33: Executive Needs a Number Tomorrow
+
+### Short version you can say out loud
+
+> Executives in a time crunch need a defensible number plus the smallest possible set of caveats. I always ship a number. I never ship a wall of disclaimers. I ask one quick question to make sure I am answering the right one, I run the cleanest version of the analysis I can in the time I have, and I deliver the number with two or three sentences of context. After the meeting, I follow up with a clean rebuild and tell them if it changes.
+
+### The conversation, before any SQL
+
+When the CFO calls, the first move is to scope the question, not to start coding.
+
+> "Quick clarification before I run this. By 'CAC by channel,' do you want it gross (all marketing spend per acquired customer) or net of partner reimbursements? And by 'last quarter,' Jan-Mar or your fiscal Q1?"
+
+Two questions, one minute. Saves an hour later.
+
+I would also ask the level of confidence they need:
+
+> "I can give you a directional number by 7 PM that is good to ±5 percent, or a higher-confidence number tomorrow at noon. Which one do you need?"
+
+Most of the time, "directional tonight" is the answer. The exec needs a number to talk to the board, not to file with the SEC.
+
+### What I ship
+
+I ship one number per channel, plus a comparison metric they will want next.
+
+```
+Channel CAC last quarter vs last year
+─────────────────────────────────────────────
+Paid Search $42 +12%
+Social $61 +18%
+Affiliate $28 -4%
+Referral $9 +2%
+Organic $3 -1%
+ ─────────
+Blended $34 +9%
+```
+
+The "vs last year" is unsolicited but always asked next. Saves the second round.
+
+### The caveats, kept short
+
+I write a separate short note alongside the number. Two or three bullets. No more.
+
+> Method: total spend in channel / new customers attributed to that channel, last-touch model. Source: marketing.spend_daily and analytics.customer_acquisition.
+>
+> Two things to know:
+>
+> 1. April had a 3-day gap in social ad spend ingestion. I have filled it using the daily run rate average. Impact is probably ±$2 on the social CAC. The underlying issue is being addressed.
+> 2. Finance attributes a few large referral bonuses on a different basis. The number above is the analytics-team view; finance's version of "referral CAC" is ~$11. We are aligning the methodology next week.
+
+The principle: enough caveats that the CFO does not get blindsided when finance pushes back, not so many that the number looks unreliable.
+
+### What I would NOT do
+
+* **Refuse to give a number.** "The data is messy" is true and useless to a CFO at 4 PM.
+* **Bury caveats inside the analysis.** They never get read.
+* **Make up a clean methodology I have not used before.** No surprises today. Use the existing model, flag where it disagrees with finance.
+* **Send the SQL and let them figure it out.** Send a number with one clean table.
+* **Promise a more precise number than I can deliver.** A wrong number tomorrow is worse than an honest one tonight.
+
+### How I think about "the data is wrong"
+
+Two categories of data quality issues:
+
+* **Quantitative.** Missing rows, broken ingestion, late data. Often fixable with simple imputation or backfill. I disclose the imputation but I still ship.
+* **Methodological.** Different teams compute the same metric differently. Last-touch vs multi-touch attribution. Net vs gross spend. I cannot fix this in an evening. I disclose the methodology I used, name the alternative, and offer to align it next week.
+
+Both get a one-line caveat. Neither stops me from shipping.
+
+### The follow-up after the meeting
+
+The next morning, after the board update is over, two things happen:
+
+1. **I clean up the analysis** properly. Rerun against the corrected data, document the methodology, write it up.
+2. **I tell the CFO if anything changed.** "Yesterday's number was $42 for paid search. After the proper backfill, it is $41.20. The story is the same." This builds trust for the next time.
+
+Equally important, I file a ticket for the longer-term fix on the data quality issue. The April ingestion gap probably has a root cause; we should find it.
+
+### What if the data really cannot give a defensible answer
+
+Sometimes the gap is too large. The honest move is to say so, but constructively.
+
+> "I cannot give you a CAC by channel with confidence tonight because the April spend ingestion has a 14-day gap. I can give you:
+>
+> - CAC by channel for the previous quarter, which is clean.
+> - A blended CAC for the current quarter, which I am confident in (channels aside, the totals are right).
+>
+> Which would be more useful for the board?"
+
+This is still shipping a number, just a different one. It also surfaces the underlying issue without sounding like an excuse.
+
+### The conversation if finance challenges the number
+
+In the meeting, finance may say "our CAC for referrals is $11, not $28." Two outcomes:
+
+1. **They are using the same data with a different methodology.** Both are defensible. The next slide should reconcile.
+2. **One of us is on broken data.** Probably the team that has not reconciled in a while. We compare.
+
+Either way, the conversation is calmer because I already flagged this in the caveats. The CFO is not surprised.
+
+### Common mistakes interviewers want you to name
+
+1. **Sending raw SQL or notebooks to an exec.** They will not read it.
+2. **Wall of disclaimers.** Makes the number look untrustworthy.
+3. **Refusing to provide a number.** Looks like obstruction to the business.
+4. **Overpromising precision.** "$41.78 paid-search CAC" implies a precision the data does not have.
+5. **No follow-up.** The exec uses the number forever; you never came back to check.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the CFO uses the number in a board document and then a more accurate number turns out to be different?"*
+
+I would tell them as soon as I know, not let them find out from someone else. The script:
+
+> "Quick correction: the CAC number I sent you Tuesday was $42 for paid search. The cleaner rebuild this morning gives $39. The change is within the ±5% I quoted, and the channel rankings are unchanged. If anyone asks, the numbers are now in [link]."
+
+Owning the change builds credibility. Hoping nobody notices destroys it.
diff --git a/Problem 34: Three Days of Data Lost/question.md b/Problem 34: Three Days of Data Lost/question.md
new file mode 100644
index 0000000..af0f28b
--- /dev/null
+++ b/Problem 34: Three Days of Data Lost/question.md
@@ -0,0 +1,29 @@
+## Problem 34: Three Days of Data Lost
+
+**Scenario:**
+Three days of production data has been lost. A Kafka topic was misconfigured during a deploy: the retention was set to 1 hour instead of 7 days, and the consumer group that lands events into the data lake fell behind. By the time anyone noticed, the missing events had aged out of Kafka. They are gone from there. The source systems may still have them, but only some do.
+
+In the interview, the question is:
+
+> Production lost three days of data because a Kafka topic was misconfigured. How do you recover, and what do you change after?
+
+This is a recovery scenario plus a postmortem scenario. The interviewer wants to see calm, structured thinking, plus the harder lesson at the end.
+
+---
+
+### Your Task:
+
+1. Sketch the immediate recovery plan.
+2. Cover what is recoverable and what is permanently lost.
+3. Plan the postmortem.
+4. Propose what changes to prevent this class of failure.
+
+---
+
+### What a Good Answer Covers:
+
+* Stop the bleeding: fix the config first.
+* Recover from upstream sources where possible.
+* Be honest about what is gone.
+* Idempotent replay, no double counting.
+* Guards: schema registry, config review, monitoring, retention floors.
diff --git a/Problem 34: Three Days of Data Lost/solution.md b/Problem 34: Three Days of Data Lost/solution.md
new file mode 100644
index 0000000..1138d1c
--- /dev/null
+++ b/Problem 34: Three Days of Data Lost/solution.md
@@ -0,0 +1,152 @@
+## Solution 34: Three Days of Data Lost
+
+### Short version you can say out loud
+
+> First I fix the config so today's data is safe. Then I find out which producers still have the original events and replay from them, idempotently. Anything that only lived in Kafka and aged out is permanently gone; I write that down honestly. After recovery, the real change is structural: a retention floor that cannot be lowered below the consumer lag SLA, plus monitoring on consumer lag versus retention.
+
+### Hour 1: stop the bleeding
+
+The first thing I do is not investigate. It is to stop today's events from also being lost.
+
+1. **Restore the retention setting** on the Kafka topic to its proper value (7 days, or whatever the standard is).
+2. **Catch up the consumer group** that fell behind. If it is still running but lagged, scale it up or unblock it.
+3. **Page the right people.** Producer team owners, data team lead, on-call.
+
+While that is happening, I post in the incident channel:
+
+> "Incident: kafka topic `app_events` had retention misconfigured (1h instead of 7d). Consumer lag caused 3 days of events to age out before being landed. Retention has been restored. Investigating recovery options now. Will update at the top of every hour."
+
+### Hour 2: scope the loss
+
+For each topic involved, I want three numbers:
+
+1. **First and last lost event time.** Exactly the window we are recovering.
+2. **Approximate count of lost events.** From producer logs or upstream metrics.
+3. **Downstream tables that depend on this.** Which marts, which dashboards, which models.
+
+```sql
+-- in the warehouse, where the consumer would have landed events
+SELECT MIN(event_time) AS first_ok, MAX(event_time) AS last_ok
+FROM raw.app_events
+WHERE event_time > NOW() - INTERVAL 10 DAYS;
+```
+
+The gap between the producer-side timeline and the warehouse-side timeline is the lost window.
+
+### Hour 3: recovery options, source by source
+
+Now the recovery thinking. For each producer that fed the lost topic, ask: **does the source system still have the originals?**
+
+Three possible answers per producer.
+
+**1. Yes, still in the OLTP / source database.**
+
+Replay from there. CDC can usually be rewound to a starting point in time. Or a one-off SQL extract can produce the same events with the same shape. Land them into the lost window, idempotently.
+
+```sql
+INSERT INTO raw.app_events
+SELECT
+ event_id,
+ user_id,
+ event_type,
+ event_time,
+ payload
+FROM source.events
+WHERE event_time BETWEEN @lost_start AND @lost_end;
+```
+
+Because rows have an `event_id`, we can dedupe naturally on insert (MERGE / upsert). Idempotent. If we accidentally replay events that did make it into the warehouse, no harm done.
+
+**2. Producer kept an outgoing log.**
+
+Some teams write the same event to both Kafka and a local file or DynamoDB stream "for audit." Check. If found, replay from the audit.
+
+**3. Producer kept nothing. The event was emit-and-forget.**
+
+This is the part that is gone. There is nothing to recover. Tell people. Do not paper over it.
+
+### Idempotent replay
+
+The replay is essentially a backfill of the lost window. The pattern from Problem 9:
+
+* Treat the window as a partition.
+* Delete what is in that window in the destination (any partial events that did sneak through), then re-insert.
+* Or, if every event has a stable `event_id`, MERGE on the id.
+
+```sql
+MERGE INTO raw.app_events AS dst
+USING staging.replayed AS src
+ON dst.event_id = src.event_id
+WHEN NOT MATCHED THEN INSERT (event_id, ...)
+WHEN MATCHED THEN UPDATE SET ...;
+```
+
+Once raw is restored, every downstream model is rebuilt for that window. Because every model is partition-keyed and idempotent, this is mechanical.
+
+### What is permanently gone
+
+Be honest about this in the incident channel and the postmortem. Examples of unrecoverable categories:
+
+* **Client-side events** sent only to Kafka without a server-side mirror. Browser pings, app analytics events that never went through a server log.
+* **Third-party webhook events** if the third party does not allow replay.
+* **Derived events** computed in the stream itself and not stored anywhere upstream.
+
+Quantify the loss. "We lost approximately 1.2 million events out of 38 million for the window, all client-side analytics. They are not recoverable. Server-side events (checkout, payments) are fully recovered."
+
+### The communication thread
+
+The incident channel gets four updates that day:
+
+1. Initial post: "incident, retention misconfig, investigating."
+2. After scoping: "lost window is X to Y, downstream tables affected: A, B, C."
+3. After recovery plan: "recovering from sources for 95% of events. Client-side telemetry is unrecoverable."
+4. After recovery: "raw is rebuilt, marts being rebuilt now, ETA tomorrow morning. Final loss estimate: ~5% of telemetry events, no impact on financial data."
+
+Stakeholders get a separate, shorter version aimed at their concerns. Finance only cares about "do our money numbers move?" If not, that's the message they need.
+
+### The postmortem, after the dust settles
+
+A blameless postmortem within a week. Honest about what went wrong.
+
+**Timeline.** Exactly when each event happened.
+
+**What went well.** Recovery started within an hour. Most data was recovered. Comms were clear.
+
+**What went badly.**
+
+* The retention change shipped without any review that considered data-team consumers.
+* Consumer lag had been growing silently for a day before retention expired.
+* No alert fired when retention was set below the lag SLA.
+* The list of "downstream owners of this topic" did not exist; nobody knew who to page.
+
+**Action items.** Each one assigned to a person with a date.
+
+* Add an admission check on Kafka config changes: retention cannot be set below twice the worst consumer-group lag (in time) on the topic.
+* Alert: consumer lag exceeds 50% of retention. Pages the data team. (Both guards are sketched after this list.)
+* Alert: producer-side event rate vs warehouse-side landing rate divergence. Pages the data team.
+* Topic owners are recorded in a registry; deploy reviews include checking "does this topic have warehouse consumers."
+* Critical client-side events get a server-side mirror.
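+
+A minimal sketch of those two guards in Python, assuming lag and retention are already exported in seconds by whatever monitoring the team runs; function names and thresholds are illustrative, not an existing library:
+
+```python
+def admit_retention_change(proposed_retention_s: float, worst_consumer_lag_s: float) -> bool:
+    """Reject any config change that would set retention below 2x the slowest consumer's lag."""
+    return proposed_retention_s >= 2 * worst_consumer_lag_s
+
+
+def lag_alert_needed(consumer_lag_s: float, retention_s: float) -> bool:
+    """Page once a consumer group's lag has eaten more than half the retention window."""
+    return consumer_lag_s > 0.5 * retention_s
+
+
+# Example: 1 hour retention, consumer 50 minutes behind
+assert not admit_retention_change(proposed_retention_s=3600, worst_consumer_lag_s=3000)
+assert lag_alert_needed(consumer_lag_s=3000, retention_s=3600)
+```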
+
+### The bigger lesson
+
+The technical fix is easy. The harder lesson is that a deploy in one part of the system can silently destroy data in another. This calls for:
+
+* **Data contracts** that include retention requirements.
+* **Cross-team review** for changes to topics that have data-team consumers.
+* **A defense-in-depth mindset** that does not rely on any one team to do the right thing.
+
+If I had to pick one change, it would be the retention floor: simply make it impossible to lower retention below the SLA. That removes the failure class permanently.
+
+### Common mistakes interviewers want you to name
+
+1. **Investigating before fixing the config.** Today is still leaking.
+2. **Trying to recover from Kafka after retention has expired.** The data is gone.
+3. **No quantification of the loss.** Saying "some data was lost" is unactionable.
+4. **Blaming the engineer who shipped the deploy.** The system allowed it.
+5. **No structural change.** "We will be more careful" never works.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the lost data drives a financial number that has already been published?"*
+
+That is a regulatory and legal event, not just an engineering one. Bring in legal and the finance lead immediately. The technical recovery is the same. The harder part is the disclosure: the published number was wrong by X%. Whether it is material enough to restate depends on the jurisdiction and the size. Do not make that decision alone.
diff --git a/Problem 35: Lambda vs Cloud Function vs Cloud Run/question.md b/Problem 35: Lambda vs Cloud Function vs Cloud Run/question.md
new file mode 100644
index 0000000..5396e27
--- /dev/null
+++ b/Problem 35: Lambda vs Cloud Function vs Cloud Run/question.md
@@ -0,0 +1,27 @@
+## Problem 35: Lambda vs Cloud Function vs Cloud Run
+
+**Scenario:**
+You need to deploy a small Python service. It reads a file from S3 (or GCS), validates and transforms it, and writes the result to BigQuery. It runs maybe 200 times a day. The team is on Google Cloud, but the file lands in an AWS bucket owned by a partner. The hiring manager wants to know if you can defend a cloud choice.
+
+In the interview, the question is:
+
+> Lambda vs Cloud Function vs Cloud Run for a small Python job. Talk through what would push you toward each.
+
+---
+
+### Your Task:
+
+1. Explain the three options in plain words.
+2. Walk through the dimensions you would compare them on.
+3. Pick one for the scenario, and defend it.
+4. Mention when you would actually pick the other two.
+
+---
+
+### What a Good Answer Covers:
+
+* Event source matters: where does the trigger come from?
+* Container vs zip / image vs source bundle.
+* Cold starts and concurrency.
+* Cost shape (per invocation, per second).
+* Runtime limits (15 min Lambda, 60 min Cloud Function gen2, basically unbounded Cloud Run).
diff --git a/Problem 35: Lambda vs Cloud Function vs Cloud Run/solution.md b/Problem 35: Lambda vs Cloud Function vs Cloud Run/solution.md
new file mode 100644
index 0000000..f1781ed
--- /dev/null
+++ b/Problem 35: Lambda vs Cloud Function vs Cloud Run/solution.md
@@ -0,0 +1,156 @@
+## Solution 35: Lambda vs Cloud Function vs Cloud Run
+
+### Short version you can say out loud
+
+> All three run a small piece of code on demand and bill only when it runs. The choice usually comes down to three things: where the trigger lives, how long the job runs, and whether I want to ship a container or just code. For the scenario in the question, a file lands in an S3 bucket owned by a partner, so the natural trigger lives in AWS. I would put a Lambda there to consume the S3 event, validate, and then push to GCP. If the work were heavier or longer than 15 minutes, I would use Cloud Run on the GCP side instead.
+
+### The three in one sentence each
+
+* **AWS Lambda.** Run a function in response to an event. Zip or container. Max 15 minutes. Most mature in the AWS event ecosystem.
+* **Google Cloud Function.** Same idea on GCP. Gen 2 is built on Cloud Run under the hood. Max 60 minutes. Tight integration with GCP triggers.
+* **Google Cloud Run.** Run a container. HTTP, Pub/Sub, scheduler, or event triggers. Can run far longer than functions (effectively unlimited with jobs). Closer to "small service" than "function."
+
+### The dimensions that actually matter
+
+| Dimension | Lambda | Cloud Function | Cloud Run |
+| ---------------------------------- | ---------------- | ---------------- | --------------------- |
+| Cloud | AWS | GCP | GCP |
+| Code shape | Zip or container | Source or container | Container |
+| Max runtime | 15 minutes | 9 min (gen1) / 60 min (gen2) | ~60 min HTTP, unlimited as Job |
+| Cold start | ~100ms-1s | ~1-3s | ~1-3s, can be 0 with min instances |
+| Concurrency per instance | 1 | 1 | Configurable, up to ~1000 |
+| Memory cap | 10 GB | 16 GB (gen2) | 32 GB |
+| Cost | Per invocation + GB-sec | Per invocation + GB-sec | Per second of CPU/memory while running |
+| Native triggers | All AWS services | All GCP services | HTTP + Pub/Sub + Scheduler + Eventarc |
+| When you want a container | Yes (since 2020) | Yes (gen2) | Yes (always) |
+
+### How I would pick for the scenario
+
+The file lands in an AWS bucket. The S3 event is the trigger. Three reasonable patterns:
+
+**Pattern A: Lambda in AWS, push to GCP from there.**
+
+```
+S3 (partner) ──ObjectCreated──▶ Lambda (Python)
+ │
+ ▼
+ validate, transform
+ │
+ ▼
+ BigQuery load API (cross-cloud HTTPS)
+```
+
+* Trigger lives where the data lives. No polling, no cross-cloud event copy.
+* Small Python function. Zip or image.
+* The cross-cloud call is one HTTPS write to BigQuery's load API. ~1 second.
+* Total cost: pennies per day at 200 invocations.
+
+This is what I would actually ship for this scenario.
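+
+For concreteness, a minimal sketch of what that handler could look like, assuming boto3 and the google-cloud-bigquery client are packaged with the function and GCP credentials are available to the Lambda; bucket, dataset, table, and column names are placeholders, not the repo's reference solution:
+
+```python
+import csv
+import io
+import json
+
+import boto3
+from google.cloud import bigquery
+
+s3 = boto3.client("s3")
+bq = bigquery.Client(project="my-gcp-project")  # placeholder project
+
+
+def handler(event, context):
+    # The S3 ObjectCreated event carries the bucket and key of the new file
+    record = event["Records"][0]["s3"]
+    bucket, key = record["bucket"]["name"], record["object"]["key"]
+
+    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
+    rows = [r for r in csv.DictReader(io.StringIO(body)) if r.get("id")]  # toy validation
+
+    # One cross-cloud HTTPS call: load the validated rows into BigQuery as NDJSON
+    ndjson = io.BytesIO("\n".join(json.dumps(r) for r in rows).encode("utf-8"))
+    job_config = bigquery.LoadJobConfig(
+        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
+        write_disposition="WRITE_APPEND",
+    )
+    bq.load_table_from_file(ndjson, "raw.partner_files", job_config=job_config).result()
+    return {"file": key, "loaded_rows": len(rows)}
+```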
+
+**Pattern B: copy file to GCS first, then Cloud Function.**
+
+```
+S3 (partner) ──ObjectCreated──▶ Lambda copies to GCS
+ │
+ ▼
+ GCS ObjectFinalized
+ │
+ ▼
+ Cloud Function (Python)
+ │
+ ▼
+ validate, transform, load BQ
+```
+
+Two hops, more moving parts. The reason to do this is if the team has strict "all data must enter GCP first for audit" rules, or if the validation is complex and the team's library is GCP-only.
+
+**Pattern C: Cloud Run with a scheduled poll.**
+
+If the partner cannot fire an event when they drop the file, polling is the fallback. Cloud Scheduler hits a Cloud Run endpoint every 5 minutes, the service lists the S3 bucket, picks up new files, processes them.
+
+I would only do this if S3 event delivery is impossible. Polling wastes invocations and adds latency.
+
+### When I would actually pick each one
+
+**Pick Lambda when:**
+
+* The trigger source is in AWS (S3, DynamoDB, EventBridge, SQS, Kinesis).
+* The job is short (< 15 minutes) and stateless.
+* You already have AWS IAM and observability set up.
+
+**Pick Cloud Function when:**
+
+* The trigger source is in GCP (GCS, Pub/Sub, Firestore, Eventarc).
+* The job is short to moderate (< 60 minutes on gen 2).
+* You want the cheapest, simplest deploy.
+
+**Pick Cloud Run when:**
+
+* The job needs to run longer than function limits.
+* You need higher concurrency per instance (a web API that handles 100 concurrent users on one container).
+* You want a full container, with system dependencies, drivers, larger libraries.
+* You need a small HTTP API, not just a one-shot handler.
+* You want to keep some instances warm (`min-instances`) to eliminate cold starts.
+
+### Three things people get wrong
+
+**1. Picking Cloud Run when a function is fine.**
+
+Cloud Run takes more operational effort: build an image, manage the Dockerfile, deal with image registry. For a 30-line Python script, a function is faster to ship and equally cheap.
+
+**2. Picking Lambda when the trigger lives in GCP.**
+
+You can do it (EventBridge can subscribe to Pub/Sub), but you are routing data around for no reason. Use the native event source.
+
+**3. Forgetting the 15-minute Lambda limit.**
+
+A "small job" that scans a big bucket can drift past 15 minutes once a file is unusually large. Either bound the work (process up to N files per invocation), or use Cloud Run or Step Functions for the workflow.
+
+### Cold start, in practice
+
+Cold starts matter when the trigger is user-facing. For a file-drop pipeline, a 1-2 second cold start is fine. For an API behind a mobile app, cold starts hurt.
+
+Mitigations:
+
+* **Provisioned concurrency** on Lambda or **min-instances** on Cloud Run / Cloud Function gen2 keeps instances warm. Costs more.
+* **Smaller runtime** (avoid large dependencies, use slim base images, lazy-import heavy modules).
+* **Smaller package** (avoid bundling unused libraries).
+
+For 200 invocations a day, none of this matters. For 200 per second, it does.
+
+### Cost shape
+
+* **Lambda and Cloud Function** charge per invocation plus per GB-second.
+* **Cloud Run** charges per second of CPU and memory while requests are being handled; in the default request-based billing mode, idle time between requests is not charged.
+
+For a small 1-second job at 200 invocations a day:
+
+* Lambda: well under a dollar a month.
+* Cloud Function: same.
+* Cloud Run: same.
+
+The cost choice does not matter until you are running tens of millions of invocations.
+
+### Observability is the deciding factor sometimes
+
+For a Python script you wrote, observability is what saves your week. Each has good logs (CloudWatch, Cloud Logging) and metrics. What I would ask the team:
+
+* Where do we already send logs?
+* Where do we have on-call dashboards?
+* Where do alerts route?
+
+Putting the function in a cloud where you already have observability set up beats picking the "best" tool but having to wire it from scratch.
+
+### Common mistakes interviewers want you to name
+
+1. **Default to whatever I used last time** without thinking about trigger source.
+2. **Picking the wrong runtime cap.** Lambda for a 22-minute job will be a nightmare.
+3. **Ignoring cold starts on user-facing paths.**
+4. **Over-engineering with Cloud Run** when a function would work fine.
+5. **Forgetting cross-cloud egress costs.** Pulling a file out of S3 to process in GCP costs egress.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if you needed to add a second step that takes the BigQuery load result and emails a report?"*
+
+I would not make the function bigger. I would publish a message ("BQ load completed for file X") to Pub/Sub or EventBridge, and a second function subscribes and sends the email. Two small, single-purpose functions are easier to debug and rerun than one that does both. This is the cloud version of "do one thing well."
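+
+A minimal sketch of the publishing side on GCP, assuming the google-cloud-pubsub client; project and topic names are placeholders:
+
+```python
+# Publisher side: the load function emits a small event instead of sending the email itself.
+from google.cloud import pubsub_v1
+
+publisher = pubsub_v1.PublisherClient()
+topic = publisher.topic_path("my-gcp-project", "bq-load-completed")  # placeholder names
+
+
+def notify_load_complete(file_key: str, table: str) -> None:
+    # Attributes must be strings; the email function subscribes to this topic and formats the report.
+    publisher.publish(topic, b"bq-load-completed", file=file_key, table=table).result()
+```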
diff --git a/Problem 36: Scheduled Pipeline Pay Only When Run/question.md b/Problem 36: Scheduled Pipeline Pay Only When Run/question.md
new file mode 100644
index 0000000..7f95f60
--- /dev/null
+++ b/Problem 36: Scheduled Pipeline Pay Only When Run/question.md
@@ -0,0 +1,27 @@
+## Problem 36: Scheduled Pipeline, Pay Only When It Runs
+
+**Scenario:**
+A small team has a daily Python ETL that takes about 15 minutes to run. They want it on a schedule (daily at 4 AM), but they want to pay only when it runs. They are price-sensitive and the script today runs on a small VM that sits idle 23 hours a day.
+
+In the interview, the question is:
+
+> A team wants their pipeline on a schedule but only wants to pay when it runs. What options do you give them, and what are the trade offs?
+
+---
+
+### Your Task:
+
+1. List the practical options on AWS and GCP.
+2. Compare them on cost, simplicity, and limits.
+3. Pick one for the scenario.
+4. Mention when the picture changes (longer jobs, more frequent runs, more steps).
+
+---
+
+### What a Good Answer Covers:
+
+* EventBridge + Lambda; Cloud Scheduler + Cloud Function; Cloud Run Jobs.
+* The 15-minute Lambda limit.
+* AWS Batch / Step Functions; GCP Workflows.
+* When to step up to Airflow / Dagster.
+* Idle-cost math: the VM was costing $20/month; the function will cost cents.
diff --git a/Problem 36: Scheduled Pipeline Pay Only When Run/solution.md b/Problem 36: Scheduled Pipeline Pay Only When Run/solution.md
new file mode 100644
index 0000000..721667f
--- /dev/null
+++ b/Problem 36: Scheduled Pipeline Pay Only When Run/solution.md
@@ -0,0 +1,115 @@
+## Solution 36: Scheduled Pipeline, Pay Only When It Runs
+
+### Short version you can say out loud
+
+> A scheduled function or a scheduled container job. For a 15-minute Python ETL, my default is Cloud Run Job triggered by Cloud Scheduler on GCP, or AWS Batch triggered by EventBridge on AWS. Both bill only when the job runs. Lambda is tempting but is capped at 15 minutes, which is right on the edge for this script. The team's VM probably costs $20-50 a month sitting idle; the serverless version should be cents.
+
+### The options that actually fit
+
+```
+GOAL: every day at 4 AM, run this Python ETL, pay nothing in between
+
+ AWS GCP
+ ──────────────────────── ─────────────────────────
+ EventBridge → Lambda Cloud Scheduler → Cloud Function (gen 2)
+ ↓ time limit 15 min ↓ time limit 60 min
+ (works if job stays short) (works for this scenario)
+
+ EventBridge → AWS Batch Cloud Scheduler → Cloud Run Job
+ ↓ unlimited runtime ↓ unlimited runtime
+ (good for longer jobs) (my default for this scenario)
+
+ EventBridge → Step Functions Cloud Scheduler → Workflows
+ ↓ multi step orchestration ↓ multi step orchestration
+ (good when more than one step) (good when more than one step)
+
+ Managed Airflow / Dagster cloud Managed Airflow / Dagster cloud
+ ↓ "I'll have many DAGs" ↓ "I'll have many DAGs"
+```
+
+### Picking one for the scenario
+
+15-minute Python ETL, one step, daily. My pick is **Cloud Run Job + Cloud Scheduler** on GCP, or **AWS Batch + EventBridge** on AWS.
+
+Why not Lambda or Cloud Function:
+
+* The job takes 15 minutes. Lambda's hard cap is 15 minutes. One slow day and the job dies mid-way.
+* Cloud Function gen 2 has a 60-minute cap, so it would technically fit, but if the script grows in scope, you hit the cap.
+
+Why Cloud Run Job:
+
+* Pay-per-second, only while running. ~15 min × pennies/min = pennies per run.
+* No runtime limit that the team will hit.
+* The Python code runs in a container they control; system dependencies (parquet, gdal, oracle clients) are easy to bake in.
+* Triggered by Cloud Scheduler with a cron expression.
+
+Why AWS Batch on AWS:
+
+* Same idea: a managed compute environment that spins up on demand.
+* Slight upfront setup (compute environment, job queue, job definition).
+* For a single job, AWS Step Functions calling ECS Fargate may be simpler.
+
+### The actual setup, in shape
+
+```bash
+# Cloud Run Job
+gcloud run jobs create daily-etl \
+ --image gcr.io/proj/etl:v1 \
+ --region asia-southeast1 \
+ --memory 2Gi --cpu 1 --task-timeout 30m
+
+# Cloud Scheduler
+gcloud scheduler jobs create http etl-trigger \
+ --schedule "0 4 * * *" \
+ --uri "https://...projects.../jobs/daily-etl:run" \
+ --oidc-service-account-email ... \
+ --location asia-southeast1
+```
+
+Cost for 15-minute runs daily: under a dollar per month for the compute, plus very small Scheduler fees. The VM the team is moving away from probably costs more than that idle.
+
+### When the choice changes
+
+**If the job is short (< 5 min) and single-step:**
+Lambda or Cloud Function is the cheapest, simplest option. Just a function + EventBridge / Scheduler rule.
+
+**If the job grows to multiple steps with branching or retries:**
+Step Functions on AWS, Workflows on GCP. They handle orchestration, retries and conditional logic.
+
+**If the team has many jobs (10+) and wants dependencies between them:**
+This is where managed Airflow (MWAA, Cloud Composer) or Dagster Cloud earns its cost. A single scheduled cron is no longer enough.
+
+**If the job needs GPUs or large memory:**
+Cloud Run Jobs supports up to 32 GB and high CPU. For bigger workloads, AWS Batch with spot instances, or GKE/EKS.
+
+**If the team needs strict cron semantics with missed-run replay:**
+Airflow / Dagster handle this with catchup and backfill semantics. Cloud Scheduler retries a failed invocation, but it will not go back and replay schedules that were missed while the target was down.
+
+### What about the existing VM?
+
+The VM is a comfortable place for people new to cloud, but it has hidden costs:
+
+* Paying 24/7 for ~3% utilization.
+* Manual OS patching, dependency upgrades, log rotation.
+* Single point of failure if the VM dies.
+
+Cost math, rough:
+
+* Small VM (t3.small or e2-small), $15-25/month, year-round.
+* Cloud Run Job same workload: ~$0.50-$1/month.
+
+The savings are real, but the bigger win is one less server to babysit.
+
+### Common mistakes interviewers want you to name
+
+1. **Lambda for a 15-minute job.** Will time out occasionally.
+2. **Spinning up an EC2 / VM** and scheduling with cron because "that's how Linux does it." Loses the cost benefit.
+3. **Managed Airflow for one job.** Overkill, expensive idle cost.
+4. **Cloud Run service instead of Cloud Run Job.** Service is for HTTP; Job is the one-shot runner.
+5. **Forgetting timezone of the cron.** Cloud Scheduler defaults can surprise you.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the job needs to read from a Postgres in the same VPC?"*
+
+Both Cloud Run Job and AWS Batch can run inside a VPC. You attach a VPC connector (GCP) or a VPC subnet (AWS Batch). The job gets a private IP and can reach internal services. Slightly more setup, but the trade-off is worth keeping private databases off the public internet.
diff --git a/Problem 37: BigQuery vs Snowflake for New Team/question.md b/Problem 37: BigQuery vs Snowflake for New Team/question.md
new file mode 100644
index 0000000..17f5d7e
--- /dev/null
+++ b/Problem 37: BigQuery vs Snowflake for New Team/question.md
@@ -0,0 +1,27 @@
+## Problem 37: BigQuery vs Snowflake for a New Team
+
+**Scenario:**
+A startup is choosing a warehouse for a brand-new analytics team. They have ~50 engineers, no existing data warehouse, and they want to be productive within a month. Budget is reasonable but not unlimited. The team will start with simple dashboards and grow toward ML and reverse ETL.
+
+In the interview, the question is:
+
+> Why might you choose BigQuery over Snowflake, or the other way around, for a brand new analytics team?
+
+---
+
+### Your Task:
+
+1. Compare on the things that actually matter at this stage.
+2. Pick one for the scenario, defend it.
+3. Mention when the other would have been the better choice.
+4. Cover the things that look big in marketing but matter less in practice.
+
+---
+
+### What a Good Answer Covers:
+
+* Pricing model: on-demand bytes vs warehouse compute time.
+* Time to first query.
+* Ecosystem in your existing stack (GCP-native vs cloud-agnostic).
+* Concurrency and isolation.
+* Migration cost later.
diff --git a/Problem 37: BigQuery vs Snowflake for New Team/solution.md b/Problem 37: BigQuery vs Snowflake for New Team/solution.md
new file mode 100644
index 0000000..430e50a
--- /dev/null
+++ b/Problem 37: BigQuery vs Snowflake for New Team/solution.md
@@ -0,0 +1,106 @@
+## Solution 37: BigQuery vs Snowflake for a New Team
+
+### Short version you can say out loud
+
+> Both are excellent. For a brand-new team that just wants to be productive, the choice usually comes down to which cloud you are already on, and which pricing model fits your usage shape. BigQuery is best when you are on GCP and your usage is bursty (pay per byte scanned). Snowflake is best when you want cloud-agnostic, multiple tightly isolated compute environments, and a steady predictable workload. For a 50-person startup with no warehouse today, I would default to BigQuery if they are on GCP, Snowflake if they are on AWS or multi-cloud. Either choice is hard to regret.
+
+### The honest comparison
+
+| Dimension | BigQuery | Snowflake |
+| ------------------------------- | --------------------------------------- | --------------------------------------- |
+| Cloud | GCP only | AWS, GCP, Azure |
+| Pricing model (default) | On-demand bytes scanned | Per-second of warehouse uptime |
+| Reserved option | Slots (committed capacity) | Capacity contracts |
+| Time to first query | Minutes (just create a dataset) | Minutes |
+| Compute isolation | Reservations + workload management | Multiple virtual warehouses (best in class) |
+| Concurrency for many users | Good (with slots) | Excellent (size each warehouse independently) |
+| Scaling on huge scans | Excellent (auto) | Excellent (resize warehouse) |
+| Storage | Cheap, separate billing | Cheap, separate billing |
+| ML in-database | BigQuery ML, very integrated | Snowpark, Snowflake Cortex |
+| Streaming ingest | Streaming inserts, Storage Write API | Snowpipe, dynamic tables |
+| Marketplace and sharing | Analytics Hub | Data Marketplace, native sharing |
+| Lock-in | Tied to GCP | Portable across clouds |
+| dbt support | Excellent | Excellent |
+
+The features list is almost a draw. The real differences are the **pricing shape** and the **operational model**.
+
+### Pricing shape, in plain words
+
+**BigQuery on-demand.** You pay for the bytes the query scans. A 100 GB scan costs about $0.625 (US on-demand). Idle costs $0 for compute. Storage is separate, ~$0.02/GB/mo for active, less for long-term.
+
+**Snowflake.** You pay for the seconds a virtual warehouse is running. A Small warehouse burns 2 credits per hour, and a credit is roughly $2-4 depending on edition; the warehouse auto-suspends after a configurable idle window (often set to a minute or two). Storage is separate, ~$0.023/GB/mo.
+
+Two different cost shapes:
+
+* **BigQuery rewards spiky usage.** If your team queries actively for 2 hours a day and nothing else, you pay only for those queries.
+* **Snowflake rewards steady usage.** A warehouse sized appropriately and used continuously gives predictable cost.
+
+For a small new team that does not yet know its query pattern, **BigQuery on-demand has lower commitment**. You truly pay per query. Snowflake is fine but you need to size warehouses and tune auto-suspend.
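+
+To make the shapes concrete, a toy comparison under assumed usage; every number below is an assumption to swap for your own workload, not a quote:
+
+```python
+# Toy cost-shape comparison. All inputs are assumptions.
+TB_SCANNED_PER_MONTH = 5          # assumed BigQuery on-demand scan volume
+BQ_USD_PER_TB = 6.25              # US on-demand list price per TB scanned
+bq_monthly = TB_SCANNED_PER_MONTH * BQ_USD_PER_TB
+
+WAREHOUSE_HOURS_PER_DAY = 2       # assumed Snowflake Small warehouse uptime
+CREDITS_PER_HOUR_SMALL = 2        # a Small warehouse burns 2 credits/hour
+USD_PER_CREDIT = 3.0              # varies by edition and contract
+sf_monthly = WAREHOUSE_HOURS_PER_DAY * 30 * CREDITS_PER_HOUR_SMALL * USD_PER_CREDIT
+
+print(f"BigQuery on-demand: ~${bq_monthly:.0f}/month")   # ~$31 for this bursty shape
+print(f"Snowflake Small:    ~${sf_monthly:.0f}/month")   # ~$360 if the warehouse is up 2h/day
+```
+
+Flip the assumptions (heavy, steady scanning all day) and the comparison flips with them; that is the whole point of matching the pricing model to the usage shape.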
+
+### Time to first useful dashboard
+
+Both can get you there in a week. Slight edge to BigQuery if you are already on GCP, because IAM, projects, and identity are wired in. Snowflake is fast too but needs role configuration and a virtual warehouse plan from day one.
+
+### What pushes me to Snowflake
+
+* You are already on AWS or Azure. BigQuery is awkward outside GCP (egress costs, identity bridging).
+* You expect many teams with very different workloads (heavy analysts, ML training, ad-hoc) and you want them on separate compute. Snowflake's per-warehouse model is purpose-built for this.
+* You care about data sharing across organizations. Snowflake's native sharing is best in class.
+* You want a single warehouse story across multiple clouds (some companies refuse vendor lock-in to one cloud).
+* The team prefers a more conventional, predictable warehouse experience.
+
+### What pushes me to BigQuery
+
+* You are on GCP. Identity, billing, networking are all native.
+* Storage and compute are decoupled cleanly. Idle is truly free.
+* You like SQL-first ML (BQ ML is well integrated).
+* The team will start small and might stay small. The on-demand model is forgiving.
+* You need to query large public datasets (BigQuery has many; cost is just the scan).
+* You like the no-cluster operational model: there is nothing to size, suspend, or resume.
+
+### The pick for the scenario
+
+A 50-person startup, no warehouse today. If they are on GCP, I would default to BigQuery. The on-demand pricing forgives the team's early experimentation, and the operational model is the simplest (no clusters, no warehouse sizes, no auto-suspend tuning).
+
+If they are on AWS or undecided, Snowflake. It does not pull them deeper into one cloud, and the per-warehouse model scales nicely as the team grows past the early phase.
+
+### Things that look big but matter less than people think
+
+* **"Snowflake supports semi-structured better."** True five years ago. Both have first-class JSON and arrays now.
+* **"BigQuery has BQ ML."** True, but most ML in production is not run inside the warehouse.
+* **"Snowflake has Snowpark."** Useful for Python-in-warehouse, but if you really want Python, dbt + BigQuery or external compute is fine.
+* **"BigQuery is cheaper because storage is cheap."** Both have cheap storage. The compute is where the bill is.
+
+### When you might regret BigQuery later
+
+* The on-demand bill becomes unpredictable as the team grows. The fix is to switch to slot commitments. This is fine, but it is a project.
+* The team needs strong workload isolation (one heavy job that ruins everyone else). Slots help, but Snowflake's per-warehouse isolation is more straightforward.
+
+### When you might regret Snowflake later
+
+* You ended up on GCP for everything else. Now your data plane is in a different cloud.
+* Warehouse sizing turns into a small ops job. Not big, but non-zero.
+* Idle cost from many small warehouses adds up if not watched.
+
+### What does NOT change the answer
+
+* "We need ML." Both have ML stories.
+* "We need streaming." Both have streaming ingest.
+* "We need data sharing." Both have it.
+* "We need dbt." Both have it.
+
+The deciding factors are pricing model fit and existing cloud.
+
+### Common mistakes interviewers want you to name
+
+1. **Picking based on marketing.** Both vendors have great marketing.
+2. **Ignoring the existing cloud.** Cross-cloud egress is real money.
+3. **Comparing pricing without knowing the workload.** Bytes scanned vs compute-second-up is apples and oranges without a usage pattern.
+4. **Picking the warehouse before the team can use it.** dbt, BI tool, and orchestrator matter just as much.
+5. **Refusing to commit.** "We'll start with both" is a way to never finish either setup.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What about Databricks?"*
+
+Different shape. Databricks is great when you have heavy Python / Spark / ML / streaming workloads alongside SQL. For a 50-person team starting with simple dashboards, it can feel heavy. BigQuery or Snowflake are simpler to onboard. Databricks shines when ML and data engineering dominate, not when reporting does. Many companies end up running Databricks plus a warehouse rather than choosing.
diff --git a/Problem 38: Store Partner Files in S3 or Warehouse/question.md b/Problem 38: Store Partner Files in S3 or Warehouse/question.md
new file mode 100644
index 0000000..ecf0a9a
--- /dev/null
+++ b/Problem 38: Store Partner Files in S3 or Warehouse/question.md
@@ -0,0 +1,29 @@
+## Problem 38: Store Partner Files in S3 or the Warehouse?
+
+**Scenario:**
+A partner sends 50 large CSV files every day, totaling about 30 GB. They arrive on SFTP. A team member is pushing to load all of them straight into the warehouse. Another team member says "keep them in S3, load only what we need." You are asked to mediate.
+
+In the interview, the question is:
+
+> A team is debating whether to store partner files in S3 or load them straight into the warehouse. What questions do you ask before answering?
+
+This is a "how do you think" question. The right answer is rarely one-or-the-other; it's usually "both, in a sensible order."
+
+---
+
+### Your Task:
+
+1. List the questions you would ask first.
+2. Explain the standard pattern (land in object storage, then load).
+3. Pick the right answer for the scenario.
+4. Cover the cases where you would deviate from the default.
+
+---
+
+### What a Good Answer Covers:
+
+* Raw vs curated layering.
+* Schema evolution and reprocessing.
+* Cost of warehouse storage vs object storage.
+* Audit and immutability.
+* When you really do load straight to the warehouse.
diff --git a/Problem 38: Store Partner Files in S3 or Warehouse/solution.md b/Problem 38: Store Partner Files in S3 or Warehouse/solution.md
new file mode 100644
index 0000000..6aace99
--- /dev/null
+++ b/Problem 38: Store Partner Files in S3 or Warehouse/solution.md
@@ -0,0 +1,140 @@
+## Solution 38: Store Partner Files in S3 or the Warehouse?
+
+### Short version you can say out loud
+
+> Both. The standard pattern is "land raw files in S3, load curated tables into the warehouse." S3 is the audit trail and the safety net for reprocessing. The warehouse is where analytics actually happens. Loading directly into the warehouse without keeping the raw file is the failure mode you want to avoid, because it makes reprocessing painful and audit impossible. The cases where you skip S3 are rare and small-volume.
+
+### Questions I would ask first
+
+Before answering, I want to know:
+
+1. **Do we keep the raw file for audit or compliance?** If yes, S3 is non-negotiable.
+2. **How often will we need to reprocess?** If schemas evolve or transforms change, having the raw file saves hours.
+3. **What does "the warehouse" cost vs S3?** Big difference at scale.
+4. **Who else will read the raw file?** Other teams, ML pipelines, debugging.
+5. **How structured is the data?** Clean tabular CSV is easy to load directly; nested or messy JSON is not.
+
+For the 30 GB / 50 files / daily scenario, the answer is clearly **land in S3 first, then load curated parts into the warehouse**.
+
+### The standard pattern
+
+```
+Partner SFTP
+ │
+ │ scheduled pull (or partner pushes)
+ ▼
+┌──────────────────────────┐
+│ S3 (raw zone) │ immutable, append-only,
+│ s3://partner-raw/ │ partitioned by date
+│ date=2025-05-14/ │
+│ file_001.csv │
+│ file_002.csv │
+│ ... │
+└──────────┬───────────────┘
+ │
+ │ scheduled load job
+ ▼
+┌──────────────────────────┐
+│ Warehouse (raw layer) │ one-to-one mirror,
+│ raw.partner_xxx │ schema enforced, partitioned
+└──────────┬───────────────┘
+ │
+ │ dbt transforms
+ ▼
+┌──────────────────────────┐
+│ Warehouse (marts) │ business-ready, joined,
+│ marts.customers, etc. │ aggregated
+└──────────────────────────┘
+```
+
+Three layers. Each one has a clear purpose, and you can rebuild every layer below from the layer above.
+
+### Why S3 first
+
+* **Cheap.** ~$0.023/GB/mo on S3, with lifecycle rules to push older files to colder tiers. 30 GB/day × 365 days ≈ 11 TB accumulated per year, which works out to roughly $250/month once a full year is stored. Raw storage in many warehouses costs noticeably more for the same bytes, especially when columnar compression cannot help the raw shape.
+* **Reprocessable.** If a transform bug ships, we can reload from S3. If the file landed only in the warehouse and the load mangled it, the original is gone.
+* **Auditable.** "Show me exactly what the partner sent us on May 14" is a one-line answer from S3.
+* **Multi-consumer.** Other teams (ML, ad-hoc Python) can read S3 directly with Athena, BigQuery external tables, or Spark.
+
+### Why also load into the warehouse
+
+* SQL is the lingua franca of analysts.
+* The warehouse is built for fast scans, joins, and aggregates.
+* dbt models, dashboards, and metric definitions live there.
+* External tables (BigQuery external, Athena, Snowflake external stages) work but are slower and less optimized than native warehouse tables.
+
+The raw warehouse table is a near-copy of the file, with proper types and a partition column. It is what dbt builds models on top of.
+
+### Schema evolution: the S3-first payoff
+
+A few months in, the partner adds a new column. With raw files in S3:
+
+1. Update the warehouse raw table schema to include the new column.
+2. Backfill from S3 (the historical files have the new column once it started appearing).
+3. Existing transforms continue to work because the column is nullable for older dates.
+
+With warehouse-only:
+
+1. New column appears. Load might fail silently or coerce. You realize a week later.
+2. The old data does not have the column at all. No way to backfill.
+
+The S3 raw layer is the only reason schema evolution feels manageable in real pipelines.
+
+### When I would consider loading directly to the warehouse
+
+There are a few cases where the S3 step is overkill:
+
+* **Tiny files** (a few MB), low frequency (weekly), and you have a clear lineage in the warehouse itself.
+* **Highly structured database extracts** (a daily CDC dump from Postgres) where the warehouse load tool already keeps a snapshot.
+* **Sensitive data** that legally cannot land in object storage until cleaned. Then a transient stream loads, cleans, and writes both raw-cleaned and curated.
+
+For 30 GB/day of partner CSV, none of these apply.
+
+### File layout I would actually use
+
+```
+s3://partner-raw/source=partner_a/date=2025-05-14/
+ ingested_at=2025-05-14T05-12-03Z/
+ customers.csv
+ orders.csv
+ products.csv
+ _manifest.json ← list of files in this batch, with sizes/hashes
+ _SUCCESS ← written last, only when all files landed
+```
+
+Two practical details, sketched in code after this list:
+
+* `_SUCCESS` marker so consumers only process complete batches.
+* `_manifest.json` with file hashes catches partial or corrupt uploads.
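+
+A minimal sketch of the landing step, assuming boto3; the bucket and prefix match the layout above, and the function name is illustrative:
+
+```python
+import hashlib
+import json
+
+import boto3
+
+s3 = boto3.client("s3")
+BUCKET = "partner-raw"
+PREFIX = "source=partner_a/date=2025-05-14/ingested_at=2025-05-14T05-12-03Z/"
+
+
+def land_batch(local_files: list[str]) -> None:
+    manifest = []
+    for path in local_files:
+        with open(path, "rb") as f:
+            data = f.read()
+        name = path.rsplit("/", 1)[-1]
+        s3.put_object(Bucket=BUCKET, Key=PREFIX + name, Body=data)
+        manifest.append({"file": name, "bytes": len(data), "sha256": hashlib.sha256(data).hexdigest()})
+
+    # Manifest first, _SUCCESS last: consumers treat the marker as "this batch is complete"
+    s3.put_object(Bucket=BUCKET, Key=PREFIX + "_manifest.json", Body=json.dumps(manifest).encode())
+    s3.put_object(Bucket=BUCKET, Key=PREFIX + "_SUCCESS", Body=b"")
+```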
+
+### Rough cost math
+
+For the scenario: 30 GB/day × 365 ≈ 11 TB accumulated per year. At ~$0.023/GB-month that is roughly $250/month once the full year is stored, and lifecycle tiering brings it down further. Small money either way.
+
+Loaded into the warehouse: depends on the warehouse and pricing model, but raw storage there is often several times the S3 price, and messy CSVs stored as raw strings compress poorly, so the gap tends to widen.
+
+Plus, warehouse storage is much harder to delete safely. S3 has clear lifecycle rules: after N months, transition to cheaper tiers; after Y months, delete.
+
+### What about CSV-specific concerns
+
+CSVs are the worst format for the warehouse:
+
+* No types until parsed.
+* Quoting rules vary.
+* Different files may have different headers.
+
+The standard pattern: land CSV in S3. The load step converts to Parquet on the way to the warehouse, or uses the warehouse's CSV load path (`bq load`, `COPY INTO`, and so on) with an explicit schema. Either way, the warehouse never deals with raw CSV strings at query time.
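+
+A minimal sketch of the conversion step, assuming pandas with pyarrow and s3fs installed; paths and column names are placeholders:
+
+```python
+# Convert one raw CSV into typed Parquet on the way into the warehouse (illustrative names).
+import pandas as pd
+
+df = pd.read_csv(
+    "s3://partner-raw/source=partner_a/date=2025-05-14/customers.csv",
+    dtype={"customer_id": "string"},  # parse IDs as strings, never floats
+)
+df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")  # bad dates become NaT
+df.to_parquet("s3://partner-staging/date=2025-05-14/customers.parquet", index=False)
+```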
+
+### Common mistakes interviewers want you to name
+
+1. **Skipping S3 to "save time."** Saves hours, costs days when reprocessing.
+2. **Putting raw in the warehouse and considering it the source of truth.** It is a derivative.
+3. **Deleting raw S3 files after loading.** Loses the audit trail.
+4. **No partitioning on the S3 path.** Future reprocess scans all data.
+5. **No `_SUCCESS` marker.** Consumer reads partial uploads.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the partner only sends parquet files, beautifully partitioned?"*
+
+Even better. Skip the conversion step. Land in S3, then use external tables in the warehouse if the warehouse supports them well. You may not even need to "load" into the warehouse for some use cases; external tables on parquet can be fine. The raw-in-S3 principle still applies.
diff --git a/Problem 39: Managed Airflow vs Self Hosted/question.md b/Problem 39: Managed Airflow vs Self Hosted/question.md
new file mode 100644
index 0000000..545b99b
--- /dev/null
+++ b/Problem 39: Managed Airflow vs Self Hosted/question.md
@@ -0,0 +1,26 @@
+## Problem 39: Managed Airflow vs Self Hosted
+
+**Scenario:**
+Your team is starting to run more than a handful of scheduled jobs and the cron-on-a-VM pattern is starting to crack. Someone proposes Airflow. The team can either self-host it on Kubernetes, or use a managed service like MWAA (AWS), Cloud Composer (GCP), or Astronomer. Some engineers say "managed is overkill, we know Kubernetes." Others say "let's not run database migrations in production." You are asked to call it.
+
+In the interview, the question is:
+
+> When does managed Airflow (MWAA or Cloud Composer) start to make sense over running your own, and when is it overkill?
+
+---
+
+### Your Task:
+
+1. Describe the dimensions that matter.
+2. Sketch the breakeven point.
+3. Mention the alternatives that are not Airflow.
+4. Give the rule of thumb for a small/medium team.
+
+---
+
+### What a Good Answer Covers:
+
+* Operational overhead of Airflow: database, scheduler, workers, web server, secrets, upgrades.
+* The cost of managed vs the cost of an on-call rotation.
+* When Dagster or Prefect or even just Cloud Workflows might fit.
+* Team size and DAG count thresholds.
diff --git a/Problem 39: Managed Airflow vs Self Hosted/solution.md b/Problem 39: Managed Airflow vs Self Hosted/solution.md
new file mode 100644
index 0000000..7ce5032
--- /dev/null
+++ b/Problem 39: Managed Airflow vs Self Hosted/solution.md
@@ -0,0 +1,114 @@
+## Solution 39: Managed Airflow vs Self Hosted
+
+### Short version you can say out loud
+
+> Managed Airflow makes sense the moment Airflow stops being a side project for someone. Self-hosted is cheap until you count the engineer hours spent on the database, the scheduler, the upgrades, and the inevitable 3 AM scheduler hang. For a team with fewer than ~50 DAGs and no dedicated infra person, managed wins almost every time. For a large platform team that already runs Kubernetes well and has Airflow as part of their core product, self-hosted is reasonable. And for a small team starting today, I would consider not using Airflow at all.
+
+### What "running Airflow" actually involves
+
+Airflow is more than a Python library. It is a multi-process system:
+
+```
+Web server Scheduler Workers (Celery / Kubernetes)
+ \ | /
+ \ v /
+ ▶ Metadata database (Postgres) ◀
+ |
+ v
+ Logs (filesystem / S3)
+ Secrets backend
+ Plugins, providers, Python deps
+ Upgrades, every few months
+```
+
+Each of these is a thing that breaks at some point. The scheduler in particular has a long history of needing tuning at scale (parsing performance, executor restarts, DB connection limits).
+
+When the team is small, this is "someone's side job." That works for a while, then breaks at the worst time.
+
+### When self-hosted is fine
+
+* You already run Kubernetes and have an SRE team.
+* You have 100+ DAGs and a custom plugin ecosystem.
+* You need control over Python versions, providers, networking.
+* You can afford one engineer's time on Airflow ops continuously.
+* You want full IAM and network customization that managed services don't allow.
+
+This is real for big companies. Airbnb runs it themselves. So do many large data platform teams.
+
+### When managed is the right call
+
+* The team is small (less than 10 data engineers).
+* You have 5-100 DAGs.
+* You do not have an SRE team that wants to own this.
+* You want to focus on data, not infra.
+* You can absorb the cost (typically $300-$1,500/month for small managed Airflow on MWAA / Composer).
+
+For most companies under 200 people, this is the right answer.
+
+### A rough cost picture
+
+```
+Self-hosted on Kubernetes (small)
+ Cluster cost : $300-700/mo
+ Postgres for metadata : $50-200/mo
+ Engineer hours : 5-20 hours/month at incident time
+ Engineer hourly cost : $50-150 (loaded)
+ Real total : ~$1,000-3,000/mo
+
+MWAA / Composer (small)
+ Service cost : $300-700/mo
+ Engineer hours : 1-3 hours/month
+ Real total : ~$500-1,000/mo
+```
+
+The hidden cost of self-hosted is the engineer hours, especially during upgrades and incidents. Most teams underestimate this by 3-5x when they pitch self-hosted.
+
+### When I would not use Airflow at all
+
+Airflow is the default, but it is not always the right tool. For a small team starting today:
+
+* **Cloud Workflows + Cloud Scheduler (GCP)** or **Step Functions + EventBridge (AWS)** for under 20 simple pipelines. No Airflow at all.
+* **Dagster** if asset-centric thinking suits the team and they like declarative data lineage.
+* **Prefect** if Python-first scripting suits the team.
+* **dbt Cloud** for SQL-only workloads; it has its own scheduler.
+
+For 5-20 DAGs with simple dependencies, Cloud Workflows or Step Functions cost almost nothing and have no operational burden.
+
+### The deciding question
+
+I would ask: "Do we want to run Airflow, or do we want our DAGs to run?"
+
+If the answer is "we want our DAGs to run," managed Airflow or one of the alternatives is the right call. Self-hosting Airflow is a choice to invest in the tool, not just use it.
+
+### A breakeven sketch
+
+| DAGs | Team size | Recommendation |
+| ---- | --------- | -------------- |
+| 1-10 | 2-5 | Cloud Workflows / Step Functions, or dbt Cloud. Skip Airflow. |
+| 10-50 | 5-15 | Managed Airflow (MWAA / Composer / Astronomer). |
+| 50-200 | 10-30 | Managed Airflow, possibly evaluate Dagster. |
+| 200-1000 | 20-50 | Managed if cost ok; consider self-hosted with dedicated owner. |
+| 1000+ | 50+ | Self-hosted on Kubernetes, dedicated platform team. |
+
+### What I would do for the scenario in the question
+
+A team that has outgrown cron-on-a-VM. I would:
+
+* Estimate the DAG count realistically. "How many in 6 months?"
+* If under 50, recommend managed Airflow.
+* If over 50, recommend managed Airflow plus a clear plan to keep DAG counts sane (use task groups, dynamic DAGs sparingly).
+* Either way, push back against "let's self-host because we know Kubernetes." Knowing Kubernetes is not the same as wanting to own Airflow's quirks.
+
+### Common mistakes interviewers want you to name
+
+1. **Pitching self-hosted because the team likes infrastructure.** Hobby vs job.
+2. **Picking Airflow when a few Step Functions would do.** Operational overhead for nothing.
+3. **Not counting upgrade cost.** Airflow 1 to 2 to 2.x has been painful.
+4. **Forgetting the metadata DB needs backups and tuning.**
+5. **"Managed is too expensive."** When the engineer time is counted, often the other way around.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What is the case for Dagster over Airflow today?"*
+
+Dagster is asset-centric: you define the data you want to exist, and Dagster figures out how to keep it fresh. Airflow is task-centric: you define tasks and their dependencies. The Dagster model often matches how teams actually think about data ("the orders mart should be fresh by 8 AM") more cleanly than Airflow's "run this task." For a new project, Dagster is genuinely worth evaluating. For a team migrating away from existing Airflow pipelines, the cost of switching paradigms is usually higher than the benefit.
diff --git a/Problem 40: BigQuery Access Control for 50 Person Company/question.md b/Problem 40: BigQuery Access Control for 50 Person Company/question.md
new file mode 100644
index 0000000..a353a5c
--- /dev/null
+++ b/Problem 40: BigQuery Access Control for 50 Person Company/question.md
@@ -0,0 +1,28 @@
+## Problem 40: BigQuery Access Control for a 50 Person Company
+
+**Scenario:**
+A 50-person company has been giving everyone "BigQuery Data Viewer" on the entire project. Some data is sensitive (HR, finance, customer PII). The CTO asks you to fix the permissions before the next audit, without breaking anyone's daily work.
+
+In the interview, the question is:
+
+> How would you design access control across BigQuery datasets in a 50 person company with both sensitive and non sensitive data?
+
+---
+
+### Your Task:
+
+1. Lay out the access model.
+2. Show how it maps to BigQuery's actual roles and IAM concepts.
+3. Cover the rollout: how do you not break anyone in week one.
+4. Mention auditing.
+
+---
+
+### What a Good Answer Covers:
+
+* Datasets as the unit of access, not the whole project.
+* Groups, not individual users.
+* Sensitive datasets isolated.
+* Row-level and column-level security for fine cases.
+* Service accounts for pipelines.
+* Audit logs and access reviews.
diff --git a/Problem 40: BigQuery Access Control for 50 Person Company/solution.md b/Problem 40: BigQuery Access Control for 50 Person Company/solution.md
new file mode 100644
index 0000000..93603c9
--- /dev/null
+++ b/Problem 40: BigQuery Access Control for 50 Person Company/solution.md
@@ -0,0 +1,182 @@
+## Solution 40: BigQuery Access Control for a 50 Person Company
+
+### Short version you can say out loud
+
+> Three principles. One: grant access to datasets, not to projects. Two: grant to Google groups, never to individual users. Three: put sensitive data (PII, finance, HR) in dedicated datasets with their own groups. On top of that, use service accounts for every pipeline, never personal credentials, and turn on audit logs so the auditor can see who looked at what. For 50 people, this is a one-week project, and it scales to 500.
+
+### The dataset-as-unit model
+
+BigQuery's natural permission boundary is the **dataset**. You can grant `roles/bigquery.dataViewer` on `analytics_marts` without exposing `hr_raw`. This is the building block.
+
+```
+Project: company-data
+│
+├── Dataset: raw_events ── group: data-engineering@
+├── Dataset: analytics_marts ── group: analysts@
+├── Dataset: customer_pii ── group: pii-cleared@
+├── Dataset: finance ── group: finance-team@
+├── Dataset: hr ── group: hr-team@
+└── Dataset: sandbox ── group: all-employees@ (read/write)
+```
+
+Each dataset gets a set of groups attached to it via IAM. Adding or removing a user is then a single change in the group.
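+
+A minimal sketch of attaching a group at the dataset level with the google-cloud-bigquery client; project, dataset, and group names are placeholders:
+
+```python
+# Grant a Google group read access on one dataset (not the whole project).
+from google.cloud import bigquery
+
+client = bigquery.Client(project="company-data")
+dataset = client.get_dataset("company-data.analytics_marts")
+
+entries = list(dataset.access_entries)
+entries.append(
+    bigquery.AccessEntry(
+        role="READER",                  # dataset-level equivalent of dataViewer
+        entity_type="groupByEmail",
+        entity_id="analysts@example.com",
+    )
+)
+dataset.access_entries = entries
+client.update_dataset(dataset, ["access_entries"])
+```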
+
+### Why groups, never individuals
+
+Imagine someone leaves the company. If you granted access to 12 individual users, you have to find every place that user appears. If you granted to groups, removing them from the group is one change.
+
+Equally, hiring is one change: add the new analyst to `analysts@`.
+
+The rule: **no `user:` bindings in IAM**. Only `group:` and `serviceAccount:`.
+
+### The role tiers
+
+For each dataset, the standard roles in BigQuery cover most cases:
+
+| Role | What it does |
+| ---------------------------------- | ------------------------------------------- |
+| `bigquery.dataViewer` | Read tables |
+| `bigquery.dataEditor` | Read + write tables |
+| `bigquery.dataOwner` | All of the above + manage permissions |
+| `bigquery.jobUser` (project-level) | Run queries (separate from data access) |
+| `bigquery.user` (project-level) | Run queries + create datasets |
+
+A typical analyst gets:
+
+* Project-level: `bigquery.jobUser` (so they can run queries).
+* Dataset-level: `bigquery.dataViewer` on `analytics_marts`, `bigquery.dataEditor` on `sandbox`.
+
+That's it. No project-wide read.
+
+### The groups, mapped to teams
+
+```
+data-engineering@ - all data engineers, dataEditor on raw_*, marts_*
+analysts@ - all analysts, dataViewer on analytics_marts
+pii-cleared@ - small subset who have signed PII training, dataViewer on customer_pii
+finance-team@ - finance team, dataViewer on finance
+hr-team@ - HR team, dataOwner on hr
+all-employees@ - everyone, dataEditor on sandbox only
+exec-team@ - executives, dataViewer on marts and exec-only datasets
+admins@ - very small, project owners
+service-accounts-prod@ - all production service accounts (kept separate)
+```
+
+Groups overlap. An analyst on the finance team is in both `analysts@` and `finance-team@`.
+
+### Sensitive data: separate datasets, separate groups
+
+Three categories of sensitive data:
+
+1. **Customer PII.** Names, emails, phone numbers, addresses. Lives in `customer_pii` dataset. Access only via `pii-cleared@`, which requires a quick training and a manager request.
+2. **Finance.** Revenue at line-item detail, payroll, contracts. In `finance` dataset. Access only via `finance-team@` and `exec-team@`.
+3. **HR.** Compensation, performance reviews, personal data. In `hr` dataset. Access only via `hr-team@` and a few execs.
+
+Each of these datasets is in the same project for convenience, but in a separate project is fine too if you want even harder isolation.
+
+### Column and row-level security for the cases between
+
+Sometimes you need "the analyst can see the customers table but not the email column." BigQuery has column-level security:
+
+```sql
+ALTER TABLE customer_pii.customers
+ALTER COLUMN email
+SET OPTIONS (
+ policy_tags = [
+ "projects/.../taxonomies/.../policyTags/pii_email"
+ ]
+);
+```
+
+Users without the `Fine-Grained Reader` role on that policy tag see `NULL` in that column.
+
+Row-level security:
+
+```sql
+CREATE ROW ACCESS POLICY country_filter
+ON sales.orders
+GRANT TO ('group:apac-analysts@example.com')
+FILTER USING (country IN ('SG','MY','ID','TH','VN'));
+```
+
+The APAC analyst only ever sees rows for those countries.
+
+Use these sparingly. Coarse-grained dataset access is easier to reason about and easier to audit. Fine-grained access tags add complexity that pays off only for genuinely shared tables with sensitive columns.
+
+### Pipelines and service accounts
+
+Every production pipeline runs as a service account, never as a person. A few patterns:
+
+* `sa-etl-prod@` — production ETL service account.
+* `sa-ml-training@` — for model training.
+* `sa-bi-tool@` — for the BI tool's queries.
+
+Service accounts are members of relevant groups, just like users. The principle of least privilege still applies. The ETL service account does not need access to HR.
+
+When a human queries the warehouse, they use their own identity. When a pipeline queries, it uses its own service account. A person should never borrow a shared service account key to query as the pipeline.
+
+### The rollout: how not to break anyone in week one
+
+The danger is removing access too fast. People notice when their dashboards stop working. Plan:
+
+**Week 1: discover.**
+
+Pull the last 30 days of query history. The `INFORMATION_SCHEMA` jobs view is the quickest source (the audit log export works too). Find every user and the datasets their queries touched.
+
+```sql
+SELECT
+  user_email,
+  ref.dataset_id AS dataset,
+  COUNT(*) AS queries
+FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
+  UNNEST(referenced_tables) AS ref
+WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
+  AND job_type = 'QUERY'
+GROUP BY user_email, dataset
+ORDER BY user_email, queries DESC;
+```
+
+Now I know who actually needs access to what. Often you find that 80% of users only ever touch 2-3 datasets.
+
+**Week 2: shadow.**
+
+Create the new groups and dataset-level grants. Do NOT remove project-wide access yet. Now everyone has both project-wide and dataset-specific access. Nothing breaks.
+
+**Week 3: cut over.**
+
+Remove project-wide access in one stroke. Monitor audit logs. If someone screams, you know exactly who needs an additional group.
+
+The "shadow then cut" is the trick. It is much safer than the reverse.
+
+### Audit and access reviews
+
+* Turn on **BigQuery Data Access** audit logs. They record every read of sensitive tables.
+* Set up a saved query that reports "who read which sensitive table this week." Send it to a security channel.
+* Run a **quarterly access review**: each group owner confirms the current member list.
+
+For a 50-person company this is small overhead. For larger ones it becomes essential.
+
+### Common mistakes interviewers want you to name
+
+1. **Granting at project level.** Cannot revoke fine-grained access without breaking everything.
+2. **Granting to individuals.** Permission sprawl.
+3. **Same service account for many pipelines.** Hard to audit, blast radius too big.
+4. **No audit log review.** Sensitive data access invisible.
+5. **PII in shared datasets.** Once it's there, narrowing access later is much harder than starting clean.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you handle a contractor who needs access for two weeks?"*
+
+Add them to a dedicated contractors group that carries only the dataset access they need, and put an automatic expiration on that group's binding. BigQuery has **conditional IAM bindings** that can include a time condition:
+
+```yaml
+binding:
+ role: roles/bigquery.dataViewer
+ members:
+    - group:contractors@example.com
+ condition:
+ title: "Two-week access"
+ expression: "request.time < timestamp('2025-06-01T00:00:00Z')"
+```
+
+After June 1, the binding becomes inert automatically. No one has to remember to revoke.
diff --git a/Problem 41: Tables for an Airbnb Like App/question.md b/Problem 41: Tables for an Airbnb Like App/question.md
new file mode 100644
index 0000000..e3f9640
--- /dev/null
+++ b/Problem 41: Tables for an Airbnb Like App/question.md
@@ -0,0 +1,27 @@
+## Problem 41: Tables for an Airbnb Like App
+
+**Scenario:**
+You are designing the data model for an app like Airbnb. Hosts list properties. Guests search, book, pay and review. There are calendars, prices that change by date, cancellations, refunds, and multiple guests per booking. The interviewer wants to see you reason about which tables exist, what is a fact and what is a dimension, and where the trade-offs hide.
+
+In the interview, the question is:
+
+> Walk me through how you would design tables for an app like Airbnb. Start from the obvious entities and tell me where the trade offs hide.
+
+---
+
+### Your Task:
+
+1. List the entities and their grain.
+2. Draw the relationships.
+3. Cover the trade-offs around bookings, prices and reviews.
+4. Mention the warehouse layer on top.
+
+---
+
+### What a Good Answer Covers:
+
+* Users, listings, calendars, bookings, payments, reviews.
+* OLTP shape vs warehouse star schema.
+* Pricing as a separate, time-varying table.
+* Booking as the central fact.
+* Slowly changing dimensions (listing details, host details).
diff --git a/Problem 41: Tables for an Airbnb Like App/solution.md b/Problem 41: Tables for an Airbnb Like App/solution.md
new file mode 100644
index 0000000..344b2c4
--- /dev/null
+++ b/Problem 41: Tables for an Airbnb Like App/solution.md
@@ -0,0 +1,205 @@
+## Solution 41: Tables for an Airbnb Like App
+
+### Short version you can say out loud
+
+> Two layers. The OLTP layer is normalized: one row per real-world thing, foreign keys between them. The warehouse layer is a star schema: bookings are the main fact, and users, listings, dates and locations are dimensions. The trade-offs all show up around things that change over time: prices, listing descriptions, host status. Those need a date dimension or an SCD2 history table so historical reports stay correct.
+
+### Entities in the OLTP layer
+
+```
+users
+─────────────────────────────────
+user_id (PK)
+email, name, joined_at, country
+is_host (bool)
+
+listings
+─────────────────────────────────
+listing_id (PK)
+host_id (FK → users.user_id)
+title, description, city, country, lat, lng
+property_type, max_guests, num_bedrooms
+created_at
+
+listing_amenities (one row per amenity per listing)
+─────────────────────────────────
+listing_id, amenity (composite PK)
+
+calendar
+─────────────────────────────────
+listing_id, date (composite PK)
+is_available (bool)
+nightly_price (cents)
+minimum_stay
+updated_at
+
+bookings
+─────────────────────────────────
+booking_id (PK)
+listing_id (FK)
+guest_id (FK)
+checkin_date, checkout_date
+num_guests
+total_price_cents
+status (requested, confirmed, cancelled, completed)
+created_at, cancelled_at
+
+payments
+─────────────────────────────────
+payment_id (PK)
+booking_id (FK)
+amount_cents, currency
+type (charge, refund, payout)
+status (pending, succeeded, failed)
+created_at
+
+reviews
+─────────────────────────────────
+review_id (PK)
+booking_id (FK)
+reviewer_id, reviewee_id
+rating, body
+created_at
+```
+
+Notice three things:
+
+1. **The calendar is one row per (listing, date).** Each night has its own price and availability. This is the "varies over time" pattern.
+2. **A booking can have multiple payments.** One charge, one refund, one host payout. So `payments` is a child of `bookings`, not 1:1.
+3. **Reviews go both ways.** Guest reviews host, host reviews guest. One booking can produce two reviews.
+
+### Why the calendar is separate
+
+The first instinct is to put `price` and `available` on `listings`. That breaks the moment a host wants different prices on different nights. So the calendar is its own table, grain = (listing, date). This is the most common modeling decision around Airbnb-like apps. Get this wrong and pricing reports lie forever.
+
+The price the guest pays is **frozen at booking time** into `bookings.total_price_cents`. If the host changes the price later, old bookings keep their original total. The calendar is "what is for sale today," not "history of prices charged."
+
+If you also need a price history (e.g., to analyze price changes), keep a `calendar_history` table or use SCD2 on `calendar`.
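+
+As a minimal DDL sketch of that grain (column names follow the diagram above; types are generic SQL):
+
+```sql
+CREATE TABLE calendar (
+    listing_id    BIGINT    NOT NULL,
+    date          DATE      NOT NULL,
+    is_available  BOOLEAN   NOT NULL DEFAULT TRUE,
+    nightly_price INTEGER   NOT NULL,              -- cents
+    minimum_stay  INTEGER   NOT NULL DEFAULT 1,
+    updated_at    TIMESTAMP NOT NULL,
+    PRIMARY KEY (listing_id, date)                 -- the grain: one row per listing per night
+);
+```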
+
+### The warehouse layer
+
+```
+ ┌──────────────────┐
+ │ dim_date │
+ │ date, dow, ... │
+ └────────┬─────────┘
+ │
+ │
+ ┌─────────────┐ │ ┌──────────────────┐
+ │ dim_user │ │ │ dim_listing │
+ │ (guest + │ │ │ (SCD2: title, │
+ │ host) │ │ │ type, location) │
+ └─────┬───────┘ │ └────────┬─────────┘
+ │ ▼ │
+ │ ┌──────────────────────┐ │
+ └────▶│ fact_booking │◀────┘
+ │ ────────────────── │
+ │ booking_id (DD) │
+ │ guest_key (FK) │
+ │ host_key (FK) │
+ │ listing_key (FK) │
+ │ booked_date_key │
+ │ checkin_date_key │
+ │ checkout_date_key │
+ │ nights, guests │
+ │ total_price_cents │
+ │ status │
+ └──────────────────────┘
+ │
+ ┌────────────┴────────────┐
+ ▼ ▼
+ ┌────────────────┐ ┌──────────────────┐
+ │ fact_payment │ │ fact_review │
+ │ booking_key, │ │ booking_key, │
+ │ amount, type, │ │ rating, body, │
+ │ date_key │ │ direction │
+ └────────────────┘ └──────────────────┘
+```
+
+* `fact_booking` is the grain "one row per booking." This is the most queried table.
+* `dim_listing` is SCD2: when a listing's title or property type changes, the old version is kept so old reports show the listing as it was.
+* `dim_user` is one table for both hosts and guests. A single user can be both. The `is_host` flag is a label, not a separate entity.
+* `dim_date` is the conventional date dimension with year, quarter, month, day-of-week, is-holiday, etc.
+* `fact_payment` is at the grain "one row per payment event." Cancellations and refunds are negative entries.
+* `fact_review` is at the grain "one row per review."
+
+### Trade-offs
+
+**1. Is a cancelled booking still a row in fact_booking?**
+
+Yes. It is a booking that existed. The `status` field tells you it was cancelled. If you remove it, queries like "how many cancellations this month" stop working. Aggregations on `status = 'completed'` filter out the cancellations naturally.
+
+**2. Should the warehouse store the calendar or just the booking?**
+
+Both. The booking is what happened. The calendar is what was offered. Analysts want both: "what was the average listed price last summer" needs the calendar; "what was the average paid price last summer" needs the booking.
+
+I would model `fact_calendar_day` at grain (listing, date) with `is_available`, `nightly_price`. It is large but extremely useful for pricing analytics.
+
+**3. Multi-currency.**
+
+`bookings.total_price_cents` is in the guest's local currency. To compare across countries, you need conversion. Two options:
+
+* Store `total_usd_cents` too, computed at booking time using that day's FX rate.
+* Store an `fx_rates` daily table and join when reporting.
+
+The first is simpler for reporting; the second is more accurate over long histories. I would do both: freeze the USD value at booking time for fast queries, and keep the FX rate for audit.
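+
+A sketch of the reporting-time conversion in the second option, assuming `fact_booking` carries a `currency` code and `fx_rates` holds one row per (currency, date):
+
+```sql
+SELECT
+  b.booking_id,
+  b.total_price_cents * fx.usd_per_unit / 100.0 AS total_usd
+FROM fact_booking b
+JOIN dim_date d
+  ON d.date_key = b.booked_date_key
+JOIN fx_rates fx
+  ON fx.currency = b.currency
+ AND fx.rate_date = d.date;
+```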
+
+**4. Reviews from both directions.**
+
+Two reviews per booking (guest of host, host of guest) are different rows in `fact_review`. The `direction` column distinguishes them. Average ratings differ by direction; treat them separately.
+
+### Where the SCD2 hides
+
+Three places change over time and need SCD2:
+
+* **Listings.** Hosts edit the title, photos, even the property type. Old bookings should still show the listing as the guest saw it at booking time.
+* **Listing prices.** The calendar already gives a per-date price, so SCD2 is less critical here. But the price the host *had set* before the guest booked is sometimes useful.
+* **Users (hosts).** Verified status, business account status changes. Reports about "verified hosts" should know whether the host was verified at the time of the booking, not now.
+
+### What an analyst query looks like
+
+```sql
+-- Average revenue per night by city, last quarter
+SELECT
+ l.city,
+ SUM(b.total_price_cents) / SUM(b.nights) / 100.0 AS revenue_per_night
+FROM fact_booking b
+JOIN dim_listing l
+ ON l.listing_key = b.listing_key
+JOIN dim_date d
+ ON d.date_key = b.checkin_date_key
+WHERE d.date >= '2025-01-01'
+ AND d.date < '2025-04-01'
+ AND b.status IN ('completed')
+GROUP BY l.city
+ORDER BY revenue_per_night DESC;
+```
+
+The dimensions (`dim_listing`, `dim_date`) provide the filter and grouping columns. The fact (`fact_booking`) provides the measurement.
+
+### Patterns I would not regret
+
+* **One booking, one row.** Never split a single booking across many rows in the booking fact.
+* **Surrogate keys.** Every dim has an integer surrogate. SCD2 versions use different surrogates for the same natural id.
+* **Always keep a `created_at` and an `updated_at`.** Backfills depend on these.
+* **Soft delete.** Bookings are never hard-deleted. They go to `cancelled` status.
+
+### Common mistakes interviewers want you to name
+
+1. **Price as a column on `listings`.** Breaks the moment per-date pricing arrives.
+2. **Storing `total_price` as a float.** Use integer cents. Always.
+3. **Mixing facts and dimensions** (Problem 43).
+4. **Forgetting reviews can be both ways.**
+5. **Not using SCD2 on `listings` when title or category changes.**
+6. **No FX rate snapshot.** Multi-currency analytics drift over years.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you redesign this if Airbnb added 'experiences' (tours and activities) alongside stays?"*
+
+Two paths:
+
+1. **Same fact table, polymorphic.** Add `booking_type` (stay vs experience) and an `experience_key` next to `listing_key`. Half the columns will be null for the other type. Simple but messy.
+2. **Separate fact tables.** `fact_stay_booking` and `fact_experience_booking`, each with its own dimensions. Cleaner per use case but harder for "total revenue across all booking types" queries.
+
+I would go with the separate-fact path and create a `fact_booking_unified` view that UNIONs them with shared columns for the cross-type queries. Best of both.
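+
+A sketch of that unified view, trimmed to the columns the two facts share (names assumed):
+
+```sql
+CREATE VIEW fact_booking_unified AS
+SELECT booking_id, guest_key, booked_date_key, total_price_cents,
+       'stay' AS booking_type
+FROM fact_stay_booking
+UNION ALL
+SELECT booking_id, guest_key, booked_date_key, total_price_cents,
+       'experience' AS booking_type
+FROM fact_experience_booking;
+```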
diff --git a/Problem 42: Tracking Subscription Plan History/question.md b/Problem 42: Tracking Subscription Plan History/question.md
new file mode 100644
index 0000000..3ca1a21
--- /dev/null
+++ b/Problem 42: Tracking Subscription Plan History/question.md
@@ -0,0 +1,27 @@
+## Problem 42: Tracking Subscription Plan History
+
+**Scenario:**
+A SaaS company has customers who change plans constantly: upgrade, downgrade, pause, switch billing cycle. The finance team disputes a customer's bill almost every month, and the support team needs to answer "what plan was this customer on three months ago" routinely. The current `customers.plan_id` column only stores the latest plan, which is no help.
+
+In the interview, the question is:
+
+> You need to track every change to a customer's subscription plan because billing disputes are common. How do you model this?
+
+---
+
+### Your Task:
+
+1. Show the table design.
+2. Walk through the events that update it.
+3. Cover the query patterns: "what plan now," "what plan at this date," "list all changes for this customer."
+4. Mention the trade-offs vs an event-only table.
+
+---
+
+### What a Good Answer Covers:
+
+* A subscription-period table with valid_from / valid_to.
+* The current row vs historical rows.
+* The as-of join.
+* The difference between this and an audit log.
+* Pause and resume handling.
diff --git a/Problem 42: Tracking Subscription Plan History/solution.md b/Problem 42: Tracking Subscription Plan History/solution.md
new file mode 100644
index 0000000..63d76eb
--- /dev/null
+++ b/Problem 42: Tracking Subscription Plan History/solution.md
@@ -0,0 +1,185 @@
+## Solution 42: Tracking Subscription Plan History
+
+### Short version you can say out loud
+
+> A `subscription_periods` table. Each row is one continuous stretch of time during which the customer was on one specific plan, with `valid_from` and `valid_to` columns. When the plan changes, we close the current row and open a new one. The "current" plan is just the row whose `valid_to` is in the far future. Disputes get resolved by joining the bill date against this table. Audit events also exist separately, but they record "who did what" rather than "what was true when."
+
+### The table
+
+```sql
+CREATE TABLE subscription_periods (
+ period_id UUID PRIMARY KEY,
+ customer_id UUID NOT NULL,
+ plan_id UUID NOT NULL,
+ billing_cycle TEXT NOT NULL, -- monthly, yearly
+ status TEXT NOT NULL, -- active, paused
+ valid_from TIMESTAMP NOT NULL,
+ valid_to TIMESTAMP NOT NULL, -- '9999-12-31' for current
+ created_by TEXT, -- user / system
+ created_at TIMESTAMP NOT NULL DEFAULT NOW(),
+
+ CONSTRAINT period_is_well_formed CHECK (valid_from < valid_to)
+);
+
+CREATE INDEX ix_sub_customer_valid
+ ON subscription_periods (customer_id, valid_from, valid_to);
+```
+
+Example rows:
+
+```
+customer_id │ plan │ cycle │ status │ valid_from │ valid_to
+1001 │ basic │ monthly │ active │ 2024-08-15 10:00:00 │ 2025-01-12 14:30:00
+1001 │ pro │ monthly │ active │ 2025-01-12 14:30:00 │ 2025-04-01 09:00:00
+1001 │ pro │ yearly │ active │ 2025-04-01 09:00:00 │ 2025-05-10 11:00:00
+1001 │ pro │ yearly │ paused │ 2025-05-10 11:00:00 │ 2025-05-22 09:30:00
+1001 │ pro │ yearly │ active │ 2025-05-22 09:30:00 │ 9999-12-31 00:00:00
+```
+
+You can read this row by row and understand the customer's life: started on basic, upgraded to pro, switched to yearly billing, paused for 12 days, resumed.
+
+### How events update the table
+
+A sketch of the plan change, shown psycopg-style (`gen_random_uuid()` needs Postgres 13+; adapt to your driver):
+
+```python
+def change_plan(conn, customer_id, new_plan_id, billing_cycle, when, who):
+    # One transaction: the connection block commits on success and rolls
+    # back on error, so both writes land together or not at all.
+    with conn, conn.cursor() as cur:
+        # Close the current period
+        cur.execute(
+            """
+            UPDATE subscription_periods
+            SET valid_to = %(when)s
+            WHERE customer_id = %(customer_id)s
+              AND valid_to = '9999-12-31'
+            """,
+            {"when": when, "customer_id": customer_id},
+        )
+        if cur.rowcount != 1:
+            raise RuntimeError("expected exactly one open period to close")
+
+        # Open a new one starting at the same instant
+        cur.execute(
+            """
+            INSERT INTO subscription_periods
+                (period_id, customer_id, plan_id, billing_cycle,
+                 status, valid_from, valid_to, created_by)
+            VALUES (gen_random_uuid(), %(customer_id)s, %(plan_id)s, %(cycle)s,
+                    'active', %(when)s, '9999-12-31', %(who)s)
+            """,
+            {"customer_id": customer_id, "plan_id": new_plan_id,
+             "cycle": billing_cycle, "when": when, "who": who},
+        )
+```
+
+The two writes happen in one transaction. The intervals are half-open: `valid_from` inclusive, `valid_to` exclusive. The new period starts at the exact instant the old one ends. No gaps, no overlaps.
+
+Pause and resume are the same shape, just with a `status` change instead of a `plan_id` change.
+
+### The three classic queries
+
+**"What plan is this customer on right now?"**
+
+```sql
+SELECT *
+FROM subscription_periods
+WHERE customer_id = :id
+ AND valid_to = '9999-12-31';
+```
+
+Or equivalently, `WHERE NOW() >= valid_from AND NOW() < valid_to`.
+
+**"What plan was this customer on on a specific date?"**
+
+```sql
+SELECT *
+FROM subscription_periods
+WHERE customer_id = :id
+ AND :date >= valid_from
+ AND :date < valid_to;
+```
+
+This is the as-of join from Problem 10. It answers the billing dispute in seconds.
+
+**"List every change this customer made."**
+
+```sql
+SELECT *
+FROM subscription_periods
+WHERE customer_id = :id
+ORDER BY valid_from;
+```
+
+Customer support reads this top-to-bottom. The whole history is right there.
+
+### Joining to bills
+
+A monthly bill for May covers a window of dates. The customer may have been on different plans in different parts of the month. The bill builds itself from this table:
+
+```sql
+SELECT
+ p.plan_id,
+ p.billing_cycle,
+ GREATEST(p.valid_from, :period_start) AS effective_start,
+ LEAST(p.valid_to, :period_end) AS effective_end
+FROM subscription_periods p
+WHERE p.customer_id = :id
+ AND p.valid_to > :period_start
+ AND p.valid_from < :period_end;
+```
+
+This returns one row per plan-segment within the billing window. Each segment gets its own line on the bill ("Pro plan May 1-9: $X. Paused May 10-21: $0. Pro plan May 22-31: $Y"). This is exactly how the smart meter bill in Problem 25 handled tariffs.
+
+### What goes in here vs in an audit table
+
+This table is **what was true when**. It is the source of truth for billing and reporting.
+
+An audit log table is separate. It records **who did what**:
+
+```
+audit_events
+─────────────────────────────────────
+event_id, customer_id, action_type, actor (user or system),
+before_state, after_state, occurred_at
+```
+
+The two tables answer different questions. The periods table answers "what was the plan." The audit table answers "who changed it." For disputes you usually want both.
+
+### Pause and resume
+
+Two ways to model:
+
+**Option A: status column.** A row with `status = 'paused'` indicates the customer was paused during that period. Pricing logic treats paused periods as $0. The schema does not need new tables.
+
+**Option B: separate "active vs paused" intervals.** More normalized: one table tracks plan, another tracks paused/active. More complex to join.
+
+I prefer Option A. It is simple and it composes well with billing logic.
+
+### Trade-offs vs an event-only table
+
+Some teams only store events:
+
+```
+events_only
+─────────────────────────────────────
+event_id, customer_id, action ('upgraded', 'downgraded', 'paused'),
+new_plan_id, occurred_at
+```
+
+To answer "what plan was the customer on on May 10," you replay events from the start. This is correct but slow. For frequent dispute lookups, recomputing the state is wasteful.
+
+The period table is the **materialized** form of the same information. It is precomputed, indexed, and fast. Best practice: keep both. Events drive updates to periods. Periods are the read model.
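+
+If you ever need to rebuild the periods from the events (a backfill, or a bug in the update logic), the derivation is one window function. A sketch against the `events_only` columns above, ignoring pause/resume status for brevity:
+
+```sql
+SELECT
+  customer_id,
+  new_plan_id AS plan_id,
+  occurred_at AS valid_from,
+  COALESCE(
+    LEAD(occurred_at) OVER (PARTITION BY customer_id ORDER BY occurred_at),
+    TIMESTAMP '9999-12-31'
+  ) AS valid_to
+FROM events_only;
+```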
+
+### Common mistakes interviewers want you to name
+
+1. **Overlapping intervals.** Two rows both claim to cover the same instant. Bug somewhere in the update logic. Add a constraint or a check.
+2. **Using `NULL` for `valid_to` on the current row.** Then "as-of date" queries need `COALESCE`. `9999-12-31` is simpler.
+3. **Closing the old row but not opening a new one.** Customer has a gap; their plan looks "deleted" for an instant. The transaction must do both.
+4. **Updating the period table in place without history.** Then a dispute six months later cannot be answered.
+5. **Storing the period in the user row** (`plan_id`, `plan_started_at`). The history is gone.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How do you avoid race conditions when two plan changes hit at the same time?"*
+
+Two protections:
+
+1. The whole "close old + open new" runs in a single transaction with a row lock on the customer's current period.
+2. A unique constraint that ensures at most one row per customer has `valid_to = '9999-12-31'`. The second concurrent insert fails, the application retries.
+
+Together these guarantee a consistent timeline even with simultaneous writes.
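+
+The second guard can be a partial unique index (Postgres-style sketch):
+
+```sql
+-- At most one open ("current") period per customer.
+CREATE UNIQUE INDEX ux_one_current_period_per_customer
+    ON subscription_periods (customer_id)
+    WHERE valid_to = '9999-12-31';
+```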
diff --git a/Problem 43: Mixing Facts and Dimensions/question.md b/Problem 43: Mixing Facts and Dimensions/question.md
new file mode 100644
index 0000000..25c1c26
--- /dev/null
+++ b/Problem 43: Mixing Facts and Dimensions/question.md
@@ -0,0 +1,28 @@
+## Problem 43: Mixing Facts and Dimensions
+
+**Scenario:**
+A team has a single "orders" table in their warehouse that includes the order details, the customer name and address, the product name and category, and the warehouse-of-origin name. Every analyst query reads from that one table, and "joins" are never needed. The team's reasoning: "it's easier to query."
+
+A new requirement: when a customer changes their address, all the old orders in the table now show the new address, so historical reports change. The team is asked to "fix" this and they consider rebuilding the table daily as a snapshot.
+
+In the interview, the question is:
+
+> A team is mixing facts and dimensions in the same table because "it is easier to query." Explain why that quietly hurts them later.
+
+---
+
+### Your Task:
+
+1. Explain the symptom and the underlying cause.
+2. Show the fix.
+3. Address the team's "but it's easier" argument honestly.
+4. Cover when a wide denormalized table really IS the right call.
+
+---
+
+### What a Good Answer Covers:
+
+* The grain trap.
+* History rewriting silently.
+* The storage cost vs query simplicity trade-off.
+* The compromise: a wide reporting view on top of a clean star schema.
diff --git a/Problem 43: Mixing Facts and Dimensions/solution.md b/Problem 43: Mixing Facts and Dimensions/solution.md
new file mode 100644
index 0000000..5ba769e
--- /dev/null
+++ b/Problem 43: Mixing Facts and Dimensions/solution.md
@@ -0,0 +1,140 @@
+## Solution 43: Mixing Facts and Dimensions
+
+### Short version you can say out loud
+
+> Two problems. First, when a customer's address changes, every old order silently shows the new address. History is rewritten. Second, when a customer or product gets updated in any way, every row referencing them has to change, which is slow and locks the table. The fix is the boring textbook one: put the order facts in a fact table and the customer and product attributes in dimension tables, joined by surrogate keys. The "it is easier to query" argument is real but solvable with a single view on top of the star schema.
+
+### The symptom, drawn out
+
+```
+The single wide orders table
+
+order_id │ order_date │ customer_name │ customer_address │ product_name │ amount
+1001 │ Jan 2 │ Alice Tan │ 12 Bukit Rd, SG │ Widget A │ 50
+1002 │ Jan 5 │ Bob Khan │ 5 Orchard Rd, SG │ Widget B │ 30
+1003 │ Mar 1 │ Alice Tan │ 12 Bukit Rd, SG │ Widget A │ 50
+
+Now Alice moves to Malaysia. They update the row in the app DB.
+The ETL re-runs the orders table from scratch.
+
+order_id │ order_date │ customer_name │ customer_address │ product_name │ amount
+1001 │ Jan 2 │ Alice Tan │ 8 Jalan Ipoh, MY │ Widget A │ 50 ← changed
+1002 │ Jan 5 │ Bob Khan │ 5 Orchard Rd, SG │ Widget B │ 30
+1003 │ Mar 1 │ Alice Tan │ 8 Jalan Ipoh, MY │ Widget A │ 50 ← changed
+
+The January revenue by region report now shifts $50 from SG to MY.
+But that order was placed and shipped in SG. The history is now wrong.
+```
+
+This is the classic SCD problem (Problem 10) in disguise. By denormalizing customer attributes onto the fact table, every change to a customer mutates the past.
+
+### The fix
+
+```
+fact_orders dim_customer (SCD2)
+───────────────────────────────── ───────────────────────────────────
+order_id (PK) customer_key (PK, surrogate)
+order_date customer_id (natural)
+customer_key (FK to dim_customer) name, address, country
+product_key (FK to dim_product) valid_from, valid_to, is_current
+amount
+
+dim_product
+─────────────────────────────────
+product_key (PK, surrogate)
+product_id (natural)
+name, category, brand
+valid_from, valid_to, is_current
+```
+
+When Alice moves, dim_customer gets a new row (Type 2). The old order rows still point to her old customer_key. The January revenue by region report is correct.
+
+### Joining is the easy part
+
+```sql
+SELECT
+ c.country,
+ SUM(o.amount) AS revenue
+FROM fact_orders o
+JOIN dim_customer c
+ ON c.customer_key = o.customer_key
+WHERE o.order_date BETWEEN '2025-01-01' AND '2025-01-31'
+GROUP BY c.country;
+```
+
+Two table join. Indexed. Fast. The result is the customer state **at the time of the order**, because that's what the surrogate key locked.
+
+### The team's "but it's easier to query" argument
+
+This is the part to address honestly, not dismiss.
+
+The argument: analysts do not want to write joins. They want `SELECT * FROM orders` and be done.
+
+The fix: **build a wide reporting view on top of the star schema.**
+
+```sql
+CREATE VIEW v_orders AS
+SELECT
+ o.order_id,
+ o.order_date,
+ c.name AS customer_name,
+ c.address AS customer_address,
+ c.country AS customer_country,
+ p.name AS product_name,
+ p.category AS product_category,
+ o.amount
+FROM fact_orders o
+JOIN dim_customer c ON c.customer_key = o.customer_key
+JOIN dim_product p ON p.product_key = o.product_key;
+```
+
+Analysts query `v_orders`. It looks exactly like the old wide table. But the data underneath is correct: each row shows the customer and product **as they were at order time**.
+
+You get the simplicity AND the correctness.
+
+### Storage and write performance
+
+The wide table is also worse for storage and writes, even before correctness comes up:
+
+* **Storage.** Every order row repeats the customer's full name, address, and country. Millions of orders, gigabytes of redundancy.
+* **Writes.** When a customer's email changes, you update one dimension row, not every fact row that references them. On large tables this is the difference between seconds and minutes.
+
+In a column-store, the redundant fields compress well, so storage matters less. Writes still matter.
+
+### When wide IS the right call
+
+A few cases where I would intentionally denormalize:
+
+* **Reporting marts** at the very last layer, where the join overhead matters and the dimensions are stable enough. The trick is to snapshot the dimension state at the time the fact is loaded, so history does not rewrite.
+* **Single-purpose extract for a downstream tool** that does not understand joins (a legacy BI tool, an external partner).
+* **Search indexes** (Elasticsearch, OpenSearch). Wide documents are the norm there.
+* **Aggregation tables** that are themselves derivative ("daily revenue by country and product"). These do not have user-level identity, so SCD is not an issue.
+
+In all of these, the rule is: the wide table is built **from** the star schema, not instead of it.
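+
+A sketch of that rule in SQL: the wide mart is loaded from the star schema one day at a time, so each row freezes the dimension values as of load time (mart name and columns assumed):
+
+```sql
+INSERT INTO mart_orders_wide
+SELECT
+  o.order_id,
+  o.order_date,
+  c.name    AS customer_name,
+  c.country AS customer_country,
+  p.name    AS product_name,
+  o.amount
+FROM fact_orders o
+JOIN dim_customer c ON c.customer_key = o.customer_key
+JOIN dim_product  p ON p.product_key  = o.product_key
+WHERE o.order_date = CURRENT_DATE - 1;   -- append yesterday only; never rebuild old days
+```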
+
+### The migration path
+
+The team has a wide table. How do you fix it without breaking everyone:
+
+1. **Build the star schema next to it.** dim_customer with SCD2, dim_product, fact_orders pointing at surrogate keys.
+2. **Build `v_orders`** as a view on top, with the same column names as the old wide table.
+3. **Sanity-check** that `v_orders` matches the old wide table for current customers and products. It should, except for some historical rows where the dimensions have changed since.
+4. **Repoint analyst queries** at `v_orders`. Most do not have to change SQL.
+5. **Deprecate the old wide table.** Drop it once you are sure no one reads it directly.
+
+The whole thing takes a week or two for a real warehouse and pays off forever.
+
+### Common mistakes interviewers want you to name
+
+1. **Snapshot-of-the-day denormalization** (rebuild the wide table every night from current dimensions). Looks fine until someone asks "what about three months ago," and the answer is "today's customer attributes."
+2. **One mega dimension.** Trying to fit every attribute on one user-dimension. Split if dimensions vary at different rates (a frequently-changing "preferences" dimension separate from a stable "demographics" dimension).
+3. **Surrogate keys for fact tables.** Facts use their natural id (`order_id`). Surrogates are for dimensions.
+4. **Joining on natural keys instead of surrogates.** Then SCD2 history breaks.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the dimension is small (say, 100 rows) and rarely changes?"*
+
+Then denormalizing onto the fact is fine, with one rule: do it at fact-load time, snapshotting the dimension value, and never recompute. That way history is preserved even though you skipped the join.
+
+In practice, very few dimensions are actually stable. Customer and product almost always change. So I would still keep them as dimensions and use a view if joins feel painful.
diff --git a/Problem 44: Explaining Fact Table Grain/question.md b/Problem 44: Explaining Fact Table Grain/question.md
new file mode 100644
index 0000000..b54bdc7
--- /dev/null
+++ b/Problem 44: Explaining Fact Table Grain/question.md
@@ -0,0 +1,27 @@
+## Problem 44: Explaining Fact Table Grain
+
+**Scenario:**
+You are mentoring an analyst who keeps producing dashboards that don't quite match. The cause is the same every time: they don't have a clear sense of what each row in the fact table represents. They ask you to explain "grain" in a way they can use the next time they build a model.
+
+In the interview, the question is:
+
+> What is the grain of a fact table, and how would you explain it to a non-technical stakeholder?
+
+---
+
+### Your Task:
+
+1. Define grain in one sentence.
+2. Give three different grains for the same business.
+3. Show why getting it wrong produces bad numbers.
+4. Walk through how to choose grain for a new fact.
+
+---
+
+### What a Good Answer Covers:
+
+* Grain as "one row means one X."
+* Picking grain by what the dashboard asks.
+* Mixing grains is the bug.
+* Coarser is faster; finer is more flexible.
+* Bridge tables for many-to-many.
diff --git a/Problem 44: Explaining Fact Table Grain/solution.md b/Problem 44: Explaining Fact Table Grain/solution.md
new file mode 100644
index 0000000..453e283
--- /dev/null
+++ b/Problem 44: Explaining Fact Table Grain/solution.md
@@ -0,0 +1,156 @@
+## Solution 44: Explaining Fact Table Grain
+
+### Short version you can say out loud
+
+> Grain is the answer to the question "what does one row in this table mean?" If you cannot finish the sentence "one row in this table is one ____," you do not have grain pinned down yet. Every aggregation, every join, every dashboard depends on it. Get grain wrong and your numbers are off by exactly the right amount to look plausible but be misleading.
+
+### Explaining grain to a non-technical stakeholder
+
+The way I would say it in a meeting:
+
+> "Imagine each table is a stack of paper. Grain is what is printed on one sheet. If the sheet is 'one customer,' then there is one sheet per customer and we can count sheets to count customers. If the sheet is 'one order,' there is one per order. If the sheet is 'one item on one order,' there are many more sheets, because each order has multiple items. We have to know what one sheet is before we know what summing them up means."
+
+A stakeholder who understands "one row = one X" is suddenly ready to ask the right questions.
+
+### Three grains for the same business
+
+Imagine a coffee shop chain.
+
+**Grain: one sale.**
+
+```
+sale_id │ store │ date │ total_amount
+1 │ A │ 2025-05-14 │ 12.50
+2 │ A │ 2025-05-14 │ 7.00
+```
+
+Useful for "total revenue today per store." Not useful for "how many large lattes were sold."
+
+**Grain: one item on one sale.**
+
+```
+sale_id │ store │ date │ product │ qty │ amount
+1 │ A │ 2025-05-14 │ Latte L │ 1 │ 6.50
+1 │ A │ 2025-05-14 │ Croissant│ 2 │ 6.00
+2 │ A │ 2025-05-14 │ Coffee │ 2 │ 7.00
+```
+
+Useful for everything the first grain can do, plus "how many lattes did we sell." More rows, more flexibility.
+
+**Grain: one minute of sales per store.**
+
+```
+store │ minute │ revenue │ items_sold
+A │ 2025-05-14 09:01:00 │ 23.50 │ 4
+A │ 2025-05-14 09:02:00 │ 0.00 │ 0
+```
+
+Useful for "what minute are we busiest." Not useful for "what was sale 1234."
+
+Same business, three legitimate grains. Each one answers a different question.
+
+### How wrong grain produces wrong numbers
+
+Imagine the team mixes grains accidentally:
+
+```
+SELECT
+ s.store,
+ SUM(s.total_amount) AS revenue,
+ COUNT(*) AS line_items_sold
+FROM fact_sale_lines s
+GROUP BY s.store;
+```
+
+`fact_sale_lines` is at the **line item** grain. If every line carries the order's total, `SUM(s.total_amount)` counts that total once per line item and inflates revenue. The COUNT is right (it counts line items), but the SUM is wrong.
+
+This is the most common modeling bug. The reason it goes unnoticed: the numbers are usually "in the right ballpark," so nothing screams. You only catch it by reconciling against the source system.
+
+The fix: each fact table should carry **only** measurements at its own grain. Order total goes on `fact_orders`, not `fact_order_lines`. Line price goes on `fact_order_lines`, not on `fact_orders`.
+
+### Choosing grain for a new fact
+
+Three questions:
+
+1. **What does the dashboard or report want?** This is the only honest answer. Build to the finest grain you actually need.
+2. **Is there a natural "thing that happens once" in the business?** That is usually your grain. A sale, an order, a payment, a shipment, a meter reading.
+3. **What is the right time grain?** Per-event is most flexible. Per-minute or per-day is a pre-aggregation, useful when the finer grain is too big.
+
+The default for transactional facts is "one row per event." Pre-aggregations sit on top for performance.
+
+### Mixed grain in one table is the bug
+
+Sometimes a fact table tries to be both order-grain and line-grain. The smell: most columns are filled in, but a few columns (like `line_id` or `quantity`) are null on some rows.
+
+The fix is to split into two tables, each with one clear grain:
+
+```
+fact_orders : one row per order (order_total, customer_key, ...)
+fact_order_lines : one row per line (order_id, product_key, qty, price)
+```
+
+If you need both in one query, join them. Do not collapse them.
+
+### Coarser is faster, finer is more flexible
+
+The trade-off:
+
+* **Coarser grain** (one row per day per store) means fewer rows, faster queries. But you can never answer questions below that grain ("what was sold at 9:42 AM").
+* **Finer grain** (one row per second, or per individual event) keeps every question open. But the table is huge.
+
+The common pattern: keep the finest grain in `fact_*`, build pre-aggregates `agg_*` for the questions you ask most.
+
+### Bridge tables for many-to-many
+
+What if one fact has many "many-to-many" relationships? An order can have many promotions applied, a promotion can apply to many orders.
+
+You cannot put both on the same row without violating grain. The fix:
+
+```
+fact_orders one row per order
+fact_order_promotions one row per (order, promotion) pair
+```
+
+The bridge table has its own grain. You join through it when you want "promotions per order" or "orders per promotion."
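+
+For example, "orders and revenue per promotion," joined through the bridge (column names assumed; an order with several promotions counts toward each of them):
+
+```sql
+SELECT
+  bp.promotion_id,
+  COUNT(DISTINCT bp.order_id) AS orders,
+  SUM(o.order_total)          AS revenue
+FROM fact_order_promotions bp
+JOIN fact_orders o ON o.order_id = bp.order_id
+GROUP BY bp.promotion_id;
+```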
+
+### Common mistakes interviewers want you to name
+
+1. **No grain documented.** The new team member has to figure it out from the data.
+2. **Mixed grain.** Some rows are orders, some are line items, in the same table.
+3. **SUM across the wrong grain.** Counts items but sums order totals; gets a multiplied number.
+4. **Building one giant table** with every possible measurement at every grain. Always wrong somewhere.
+5. **Confusing time grain with entity grain.** "Per day" is a time grain. "Per order" is an entity grain. Both can apply at once.
+
+### How I would teach this
+
+When mentoring, I would write the grain in the first line of the table's docs:
+
+> `fact_orders`: one row per order. An order is everything a single customer placed in one checkout, including all its line items.
+
+And on `fact_order_lines`:
+
+> `fact_order_lines`: one row per item on an order. A single product appears once per order, with a quantity column.
+
+Two sentences. The whole team now reads tables the same way.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the business asks 'show me both order count and line item count in one query'? Won't joining force a grain choice?"*
+
+You join, but you have to aggregate to one grain on each side before joining. For example:
+
+```sql
+WITH orders_per_day AS (
+ SELECT order_date, COUNT(*) AS orders
+ FROM fact_orders GROUP BY order_date
+),
+lines_per_day AS (
+ SELECT order_date, SUM(qty) AS items
+ FROM fact_order_lines GROUP BY order_date
+)
+SELECT o.order_date, o.orders, l.items
+FROM orders_per_day o
+JOIN lines_per_day l USING (order_date);
+```
+
+Each CTE has its own grain (per-day), then you join at the common grain. No row-multiplication.
diff --git a/Problem 45: Current State and Full History/question.md b/Problem 45: Current State and Full History/question.md
new file mode 100644
index 0000000..6881b1a
--- /dev/null
+++ b/Problem 45: Current State and Full History/question.md
@@ -0,0 +1,26 @@
+## Problem 45: Current State and Full History
+
+**Scenario:**
+The reporting team has a hot debate. Some queries want "the current state of every order" (latest status, shipped or cancelled, current refund amount). Other queries want "the full history" (every state change, when, who, why). Right now the team is duplicating the data: one table with current state, a parallel table with events. Storage is doubled. It is the wrong shape.
+
+In the interview, the question is:
+
+> A reporting team wants both "current state" and "full history" of every order. How do you build that without doubling storage?
+
+---
+
+### Your Task:
+
+1. Show the right shape.
+2. Sketch how to derive current state from history cheaply.
+3. Cover the query patterns.
+4. Address the team's worry about query speed.
+
+---
+
+### What a Good Answer Covers:
+
+* One source-of-truth events table.
+* A derived "current state" view or materialized view.
+* Indexes / clustering on the event table for fast latest-state lookups.
+* When you really do need a separate current-state table.
diff --git a/Problem 45: Current State and Full History/solution.md b/Problem 45: Current State and Full History/solution.md
new file mode 100644
index 0000000..c415c91
--- /dev/null
+++ b/Problem 45: Current State and Full History/solution.md
@@ -0,0 +1,185 @@
+## Solution 45: Current State and Full History
+
+### Short version you can say out loud
+
+> Keep one source of truth: the events table. Derive the current state as a view or a small materialized view on top. You never duplicate the data, you just have one base and one derived. Storage is not doubled. The events table is the history; the view is the latest snapshot. If the view is too slow to compute on the fly, materialize it — but the materialization is much smaller than the events table, so storage cost is barely affected.
+
+### The shape
+
+```
+fact_order_events
+─────────────────────────────────────────
+event_id (PK)
+order_id
+event_type ('created', 'paid', 'shipped', 'cancelled', 'refunded', ...)
+event_payload (JSON: amount, address, etc.)
+actor (user_id or system)
+occurred_at
+ingested_at
+
+Indexes / clustering:
+ (order_id, occurred_at DESC) ← fast "last event per order"
+ (event_type, occurred_at) ← time-based queries
+```
+
+This is the only table that holds raw data. Every state change of an order produces one row.
+
+Then a view:
+
+```sql
+CREATE OR REPLACE VIEW v_order_current_state AS
+SELECT *
+FROM (
+ SELECT
+ order_id,
+ event_type,
+ event_payload,
+ occurred_at,
+ ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY occurred_at DESC) AS rn
+ FROM fact_order_events
+)
+WHERE rn = 1;
+```
+
+Or, depending on the engine, a `QUALIFY` clause:
+
+```sql
+SELECT *
+FROM fact_order_events
+QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY occurred_at DESC) = 1;
+```
+
+`v_order_current_state` has one row per order: the latest known state.
+
+Both "current state" and "full history" come from one source. No duplication.
+
+### How big is each side
+
+* `fact_order_events`: many events per order. For an order with 6 state changes, 6 rows. Storage is real.
+* `v_order_current_state`: as a view, no storage. As a materialized table, one row per order. Much smaller than the events table.
+
+So even when you materialize the "current state," storage is not doubled. The materialized snapshot is a fraction of the event log size.
+
+### The query patterns
+
+**"What is the current state of order 5001?"**
+
+```sql
+SELECT * FROM v_order_current_state WHERE order_id = 5001;
+```
+
+One row. Fast.
+
+**"Show me the full history of order 5001."**
+
+```sql
+SELECT *
+FROM fact_order_events
+WHERE order_id = 5001
+ORDER BY occurred_at;
+```
+
+All rows. Fast with the right clustering or index.
+
+**"How many orders are in 'shipped' state today?"**
+
+```sql
+SELECT COUNT(*)
+FROM v_order_current_state
+WHERE event_type = 'shipped';
+```
+
+Fast, and each order is counted exactly once, because the current-state view has one row per order.
+
+**"How long does it take orders to go from 'paid' to 'shipped' on average?"**
+
+```sql
+WITH paid AS (
+ SELECT order_id, occurred_at AS paid_at FROM fact_order_events WHERE event_type = 'paid'
+),
+shipped AS (
+ SELECT order_id, occurred_at AS shipped_at FROM fact_order_events WHERE event_type = 'shipped'
+)
+SELECT AVG(EXTRACT(EPOCH FROM (s.shipped_at - p.paid_at)) / 3600.0) AS hours_to_ship
+FROM paid p
+JOIN shipped s USING (order_id);
+```
+
+This needs the history, not the current state. The events table is the only place that has it.
+
+### When the view is too slow
+
+For a small table (millions of rows), the view is fast. For a huge table (billions), the `ROW_NUMBER` over the whole thing can be slow on cold cache.
+
+Two ways to speed it up:
+
+**1. Materialize.**
+
+```sql
+CREATE MATERIALIZED VIEW mv_order_current_state AS
+SELECT ...
+```
+
+Refreshed every N minutes, or on demand. Rows are pre-computed.
+
+**2. Maintain an incremental snapshot table.**
+
+A separate table `dim_order_current_state` updated by a small pipeline. When a new event arrives, MERGE into this table:
+
+```sql
+MERGE INTO dim_order_current_state AS dst
+USING (
+ SELECT * FROM fact_order_events
+ WHERE event_id = :new_event_id
+) AS src
+ON dst.order_id = src.order_id
+WHEN MATCHED THEN UPDATE SET
+ event_type = src.event_type,
+ event_payload = src.event_payload,
+ occurred_at = src.occurred_at
+WHEN NOT MATCHED THEN INSERT VALUES (...);
+```
+
+`dim_order_current_state` is much smaller than `fact_order_events`. Storage is barely affected.
+
+### The team's worry: "querying the events table is too slow"
+
+This is the team's real concern. The answer:
+
+* For point lookups by `order_id`, with clustering on `(order_id, occurred_at)`, the events table is fast even at billion rows.
+* For aggregates ("how many orders are shipped"), the materialized current-state table is the right answer.
+* For history-aware queries ("time from paid to shipped"), nothing beats the events table.
+
+So the right setup is: events as base + a thin current-state materialized table. Both serve their queries fast.
+
+### When you might really need two physical tables
+
+Three real cases:
+
+1. **Different consumers, very different access patterns.** A customer-facing API reads `current_state` thousands of times per second; the warehouse reads events in batch. The API gets its own table, optimized for point reads, possibly in a different store entirely (Postgres, DynamoDB).
+2. **Compliance retention differences.** Events kept for 7 years (audit); current state kept indefinitely. Same data, different lifecycles.
+3. **The current state has computed fields** that are expensive to recompute (a status that depends on rules, not just the last event). Worth materializing.
+
+In all three, the rule is: events are still the source of truth, and the second table is derived.
+
+### What I would NOT do
+
+* **Build current state as a primary table** and bolt on events as an afterthought. The team usually starts here and regrets it.
+* **Update current state in place without writing an event.** Audit gap.
+* **Keep "old current state" history by overwriting the same row.** Lose the history. Cannot recover.
+
+### Common mistakes interviewers want you to name
+
+1. **Duplicating data instead of deriving.** "Two tables that should always agree" never do.
+2. **Computing current state from events with no clustering.** Slow at scale.
+3. **Forgetting that events are append-only.** Updating event rows breaks audit.
+4. **No event types beyond "updated."** Then you do not know what changed; only that something did. Use specific types.
+5. **Storing the whole new state in every event.** Storage explodes. Store either the delta or just enough to reconstruct.
+
+### Bonus follow-up the interviewer might throw
+
+> *"This looks a lot like event sourcing. Are you proposing event sourcing for the warehouse?"*
+
+It is event-sourced thinking, applied to data modeling, yes. The principle is the same: the events are the truth, the current state is a projection. The difference is that we are not running our application off the event log; we are running our analytics on top of it. That makes the trade-offs gentler. No need for event-by-event replay; we just SELECT.
+
+For domains where the audit story matters (finance, billing, healthcare, energy markets), this pattern is almost always worth the small extra effort.
diff --git a/Problem 46: Region Suddenly Shows Zero Revenue/question.md b/Problem 46: Region Suddenly Shows Zero Revenue/question.md
new file mode 100644
index 0000000..e2406c0
--- /dev/null
+++ b/Problem 46: Region Suddenly Shows Zero Revenue/question.md
@@ -0,0 +1,27 @@
+## Problem 46: Region Suddenly Shows Zero Revenue
+
+**Scenario:**
+The daily revenue dashboard shows total revenue across regions. Today, one region — say, Indonesia — shows zero. The other regions look normal. Yesterday Indonesia showed $X. The business hasn't shut down operations there.
+
+In the interview, the question is:
+
+> Your daily revenue dashboard suddenly shows zero for one region. Walk me through your investigation step by step.
+
+---
+
+### Your Task:
+
+1. Resist the urge to guess. Walk through the actual diagnostic steps.
+2. Show what you check first, second, third.
+3. Cover the most common causes.
+4. Mention the communication step.
+
+---
+
+### What a Good Answer Covers:
+
+* Bottom-up: did the data even arrive?
+* Middle: did the transforms include it?
+* Top: did the dashboard's query filter it out?
+* Source vs warehouse vs dashboard.
+* The common culprits: filter typo, dimension change, source outage, time zone.
diff --git a/Problem 46: Region Suddenly Shows Zero Revenue/solution.md b/Problem 46: Region Suddenly Shows Zero Revenue/solution.md
new file mode 100644
index 0000000..d173494
--- /dev/null
+++ b/Problem 46: Region Suddenly Shows Zero Revenue/solution.md
@@ -0,0 +1,126 @@
+## Solution 46: Region Suddenly Shows Zero Revenue
+
+### Short version you can say out loud
+
+> I work bottom up. Did the data even land for that region? If yes, did the transforms include it? If yes, did the dashboard's filter accidentally drop it? Nine times out of ten, one of three things: the source feed for that region was late or empty, the join to the dimension table dropped them because a code changed, or someone added a filter to the dashboard that excludes the new value. I check each layer in order and the bug shows up by elimination.
+
+### The four layers I check, in order
+
+```
+1. Raw layer : did the data arrive?
+2. Curated : did the data survive the transform?
+3. Marts : did the data survive joins to dimensions?
+4. Dashboard : did the data survive the dashboard's query/filter?
+```
+
+If a layer shows zero but the layer below it has data, the bug lives in that layer. If the layer below is empty too, drop down another level and repeat.
+
+### Step 1: did the data land?
+
+```sql
+SELECT COUNT(*), SUM(amount) FROM raw.orders
+WHERE region = 'ID'
+ AND created_at >= CURRENT_DATE - INTERVAL '2 days';
+```
+
+Three outcomes:
+
+* **Zero rows.** The data did not arrive. Check the source feed: is the partner sending, is the SFTP empty, is the Kafka topic dry? This is now a "data is missing" investigation, not a "dashboard is wrong" one. Most likely cause.
+* **Normal row count.** The data is there. Move up.
+* **Reduced row count.** Partial outage. Dig into when it stopped.
+
+In about half of incidents, the answer is here. The data did not arrive.
+
+### Step 2: did the curated layer include it?
+
+```sql
+SELECT COUNT(*), SUM(amount) FROM curated.orders
+WHERE region = 'ID'
+ AND order_date = CURRENT_DATE - 1;
+```
+
+If raw has data but curated does not, the transform dropped it. Common causes:
+
+* The transform has a hard-coded list of regions and Indonesia is not in it.
+* A new region code (`ID` becomes `IDN` or `ID-JK` for Jakarta) and the transform's CASE WHEN doesn't recognize the new value.
+* The transform filters out test data and Indonesia's rows accidentally got flagged as test.
+
+Look at the transform code (dbt model, SQL, whatever it is) and find any place that filters or maps the `region` column.
+
+### Step 3: did the marts layer include it?
+
+```sql
+SELECT r.name AS region, SUM(o.amount) AS revenue
+FROM marts.fact_orders o
+JOIN marts.dim_region r ON r.region_key = o.region_key
+WHERE o.order_date = CURRENT_DATE - 1
+GROUP BY r.name;
+```
+
+If curated has Indonesia but marts does not, the join to `dim_region` dropped them. This is the SCD trap. Common causes:
+
+* `dim_region` has been re-keyed and the old surrogate keys don't match.
+* Indonesia's row was accidentally deleted or marked `is_active = false`.
+* A new region was added but the SCD2 valid_to window expired for the old one.
+
+Check `dim_region` directly:
+
+```sql
+SELECT * FROM marts.dim_region WHERE code = 'ID';
+```
+
+### Step 4: is the dashboard's filter wrong?
+
+If marts is correct but the dashboard shows zero, the bug is in the dashboard's SQL or filter.
+
+* Did someone add `WHERE region != 'ID'` while debugging?
+* Is the dashboard filtering on a stale dropdown value?
+* Is the dashboard reading from a different table than you think?
+
+Inspect the dashboard's underlying query. Often there are extra filters injected by the BI tool that you don't see in the model.
+
+### The most common causes I have seen
+
+Ranked by how often they bite:
+
+1. **Source feed late or empty.** Partner outage, scheduled job slipped, SFTP credentials rotated.
+2. **A code value changed at the source.** Two letter to three letter ISO code. Filter drops them all.
+3. **A new product launched that doesn't emit the region field correctly.** Region is null for those rows, dropped by the join.
+4. **A time zone effect.** Indonesia's day rolled over differently than expected; "yesterday in UTC" misses them.
+5. **A dimension table got rebuilt** and the surrogate keys changed underneath the facts.
+6. **Someone changed the dashboard yesterday.** Check git history.
+7. **Test data filter that accidentally matches real data.** "Filter where email like '%@test.com'" matches a real customer.
+
+### The communication step
+
+While investigating, I post:
+
+> "Looking into the Indonesia revenue showing zero on the dashboard. Tracking it down now, will update in 30 min."
+
+Even if I find it in 5 minutes, the early message lowers the panic. If 30 minutes pass and I still don't know, I post again with what I have ruled out so far.
+
+### How I would set up to catch this faster next time
+
+* **A per-region freshness/volume check** on the marts table. Yesterday should have at least N% of the last week's average per region. Pages on big drops (a sketch follows this list).
+* **A schema drift detector** that flags new values in low-cardinality columns. Helps with the "two letter to three letter" failure.
+* **Dashboard "as of" stamp** so I know if the dashboard is stale vs zero.
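+
+A sketch of that per-region guard, using the marts tables from the steps above (the 50% threshold and the 8-day window are assumptions; wire the failing rows into your alerting):
+
+```sql
+WITH daily AS (
+  SELECT r.code AS region, o.order_date, SUM(o.amount) AS revenue
+  FROM marts.fact_orders o
+  JOIN marts.dim_region r ON r.region_key = o.region_key
+  WHERE o.order_date >= CURRENT_DATE - INTERVAL '8 days'
+  GROUP BY r.code, o.order_date
+)
+SELECT
+  region,
+  MAX(CASE WHEN order_date = CURRENT_DATE - 1 THEN revenue ELSE 0 END) AS yesterday,
+  AVG(CASE WHEN order_date <  CURRENT_DATE - 1 THEN revenue END)       AS trailing_avg
+FROM daily
+GROUP BY region
+-- alert on any region whose yesterday fell below half of its trailing average
+HAVING MAX(CASE WHEN order_date = CURRENT_DATE - 1 THEN revenue ELSE 0 END)
+     < 0.5 * AVG(CASE WHEN order_date < CURRENT_DATE - 1 THEN revenue END);
+```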
+
+### What I would NOT do
+
+* **Guess and fix in production.** Without confirming the layer, you might "fix" the dashboard while the real bug is in the source.
+* **Add `OR region IS NULL`** as a quick patch. Hides the real issue.
+* **Tell the business "it's a known issue, we'll look at it tomorrow."** A zero region looks bad. Take it seriously.
+
+### Common mistakes interviewers want you to name
+
+1. **Starting from the dashboard.** Bottom-up is faster.
+2. **Trusting "the pipeline succeeded."** Tasks pass with empty input.
+3. **Not checking source freshness.** Often the cause and the cheapest check.
+4. **Editing dashboards without recording what changed.**
+5. **No alert that would catch this automatically.**
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if Indonesia is consistently 'zero' for the first few hours of every day, and then catches up?"*
+
+That is a "late data" pattern, not a bug. The source for Indonesia is shipping data after midnight Jakarta time, so by your dashboard's "yesterday" cutoff in UTC, their data has not arrived yet. The fix is either to wait until Indonesia's data lands before computing the dashboard, or to mark the dashboard as "still loading for Indonesia" until the freshness check passes. Either is correct; the wrong move is to say "the data is wrong" because the timing is wrong.
diff --git a/Problem 47: Airflow Green but Output Empty/question.md b/Problem 47: Airflow Green but Output Empty/question.md
new file mode 100644
index 0000000..9e8c5a9
--- /dev/null
+++ b/Problem 47: Airflow Green but Output Empty/question.md
@@ -0,0 +1,27 @@
+## Problem 47: Airflow Green but Output Empty
+
+**Scenario:**
+The pipeline ran. Airflow shows all tasks green. The downstream dashboard says yesterday's data is missing. The on-call engineer is confused: "everything succeeded, but the output table didn't update."
+
+In the interview, the question is:
+
+> An Airflow task is green but the output table did not update. What are the first three things you check?
+
+---
+
+### Your Task:
+
+1. Explain why "task succeeded" is not the same as "data is correct."
+2. Walk through the three quickest checks.
+3. List the most common silent-success patterns.
+4. Mention how to prevent this class of failure.
+
+---
+
+### What a Good Answer Covers:
+
+* Success means the code did not throw, not that data is present.
+* Empty input still produces empty output successfully.
+* Row count checks and freshness checks as guardrails.
+* Idempotent overwrite vs append: which makes silent failures invisible.
+* dbt tests, Great Expectations, or simple SQL guards.
diff --git a/Problem 47: Airflow Green but Output Empty/solution.md b/Problem 47: Airflow Green but Output Empty/solution.md
new file mode 100644
index 0000000..f84b41b
--- /dev/null
+++ b/Problem 47: Airflow Green but Output Empty/solution.md
@@ -0,0 +1,125 @@
+## Solution 47: Airflow Green but Output Empty
+
+### Short version you can say out loud
+
+> "Task succeeded" just means the code did not throw an exception. It does not mean data is present. The three things I check, in order: did the source have data, did the transform actually write rows, and was the output overwritten with empty. About 80 percent of "green but empty" incidents come from one of those three. The bigger lesson is that the pipeline should fail the task when the data is wrong, not just when the code crashes.
+
+### The three checks, in order
+
+**1. Did the source have data?**
+
+```sql
+SELECT COUNT(*), MIN(event_time), MAX(event_time)
+FROM raw.source_table
+WHERE event_date = CURRENT_DATE - 1;
+```
+
+If this is zero, the source did not send. The task ran, read an empty source, wrote nothing, succeeded. This is the most common cause.
+
+**2. Did the transform actually write rows?**
+
+```sql
+SELECT
+ table_name,
+ row_count,
+ last_modified
+FROM information_schema.tables  -- column names vary by warehouse; Snowflake, for example, exposes ROW_COUNT and LAST_ALTERED
+WHERE table_name = 'curated_orders';
+```
+
+If `last_modified` is recent but `row_count` is the same as yesterday, the transform ran but the SQL produced no new rows. Maybe a filter condition is wrong, maybe the join key changed.
+
+**3. Was the output overwritten with empty?**
+
+Look at yesterday's partition in the output table, then run the transform's main query by hand for that date. If the query returns 0 rows, the transform is "succeeding" by producing nothing.
+
+```sql
+SELECT * FROM curated_orders WHERE event_date = CURRENT_DATE - 1 LIMIT 10;
+```
+
+If empty, the pipeline cheerfully overwrote yesterday's partition with zero rows. This is the worst silent failure: it wipes existing data.
+
+### Why "green" lies
+
+Airflow decides success from the exit code alone. A Python script that does `print("done")` and exits cleanly is green. A SQL job that runs `INSERT INTO ... SELECT ... WHERE 1=0` is green. A `bq load` of an empty file is green.
+
+The orchestrator does not know what "right" looks like. It is your job to tell it.
+
+### Common patterns that produce silent success
+
+1. **Empty source, idempotent overwrite.** The classic one. Source was late, transform ran on empty, output was overwritten with empty. Yesterday's data is now gone in the destination too.
+2. **Filter on a column that has changed.** A new region code (Problem 46), a renamed event type, all rows filtered out.
+3. **Join to a dimension that lost a row.** Inner join drops everything. Outer join would have shown nulls; the team uses inner because "it has always worked."
+4. **Time zone or date math drift.** "Yesterday" is computed in a different time zone than the source uses.
+5. **A `LIMIT 0` left in code** from someone's debugging session.
+6. **A failed upstream that the task "handles gracefully."** Should have failed loudly.
+7. **dbt model with `materialized: incremental` and a broken `is_incremental()` check.** Nothing inserted.
+
+### What I would change to prevent silent success
+
+The fix is to make the pipeline fail when the data is wrong, not when code crashes. A few simple guards:
+
+**Row count assertions** as part of every load step:
+
+```sql
+-- in dbt, or as a separate guard task
+SELECT
+ CASE
+ WHEN COUNT(*) = 0 THEN ERROR('No rows for yesterday')
+ WHEN COUNT(*) < (SELECT AVG(cnt) * 0.5 FROM daily_counts_last_7) THEN ERROR('Half-empty')
+ ELSE 1
+ END
+FROM curated_orders
+WHERE event_date = CURRENT_DATE - 1;
+```
+
+**Freshness checks**: max(event_time) should be within an expected window of now.
+
+**Source-of-truth reconciliation**: yesterday's row count in the destination should be within X% of yesterday's row count in the source.
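+
+A sketch of those two guards, BigQuery-flavored so `ERROR()` makes the task fail loudly; the 2-hour window and 95% threshold are assumptions:
+
+```sql
+-- Freshness guard: fail if the newest event is more than 2 hours old.
+SELECT IF(
+  MAX(event_time) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR),
+  ERROR('curated_orders is stale'),
+  'fresh'
+)
+FROM curated_orders;
+
+-- Reconciliation guard: fail if the destination has under 95% of the source's rows for yesterday.
+SELECT IF(
+  (SELECT COUNT(*) FROM curated_orders WHERE event_date = CURRENT_DATE - 1)
+    < 0.95 * (SELECT COUNT(*) FROM raw.source_table WHERE event_date = CURRENT_DATE - 1),
+  ERROR('curated_orders is missing rows versus raw.source_table'),
+  'reconciled'
+);
+```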
+
+These are cheap, run in seconds, and convert a "green task with empty output" into a "red task that gets investigated."
+
+### Tools
+
+* **dbt has built-in tests**: `not_null`, `unique`, custom assertions. The `dbt_expectations` package has many more. Failed tests fail the model.
+* **Great Expectations** for more elaborate validation.
+* **Custom Airflow `PythonOperator` or `BashOperator` tasks** that run a SQL check and fail the task if the count is wrong.
+
+In Dagster, the same idea is "asset checks." They are first-class.
+
+### When the silent overwrite is the real culprit
+
+The most painful version is: the transform did write, but it wrote empty rows, overwriting yesterday's good data. Recovery requires:
+
+* Reading the source (or backup) again.
+* Re-running the transform with correct logic.
+* If the source is gone, you may have permanent data loss in the destination.
+
+The protection against this is to never overwrite blindly. Either:
+
+* Check row count before overwriting (refuse to write 0 if yesterday had millions); a sketch follows this list.
+* Snapshot the destination partition to an archive before overwriting.
+* Use Delta/Iceberg time travel so you can restore.
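+
+One way to express the first option, sketched in BigQuery scripting; `staging.curated_orders_new` is a hypothetical staging table holding the freshly computed partition:
+
+```sql
+DECLARE new_rows INT64 DEFAULT (
+  SELECT COUNT(*) FROM staging.curated_orders_new WHERE event_date = CURRENT_DATE - 1
+);
+DECLARE old_rows INT64 DEFAULT (
+  SELECT COUNT(*) FROM curated_orders WHERE event_date = CURRENT_DATE - 1
+);
+
+-- Refuse to replace a healthy partition with a suspiciously small one.
+IF new_rows < 0.5 * old_rows THEN
+  RAISE USING MESSAGE = 'Refusing to overwrite: new partition is less than half of the existing one';
+END IF;
+
+-- Only reached when the new partition looks sane.
+DELETE FROM curated_orders WHERE event_date = CURRENT_DATE - 1;
+INSERT INTO curated_orders
+SELECT * FROM staging.curated_orders_new WHERE event_date = CURRENT_DATE - 1;
+```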
+
+### The lesson
+
+The team's mental model has to shift from "the task succeeded" to "the data is right." Every important transform should publish two facts at the end:
+
+* The task succeeded.
+* The output passes its data checks.
+
+A green task without data checks is a half-deployed pipeline.
+
+### Common mistakes interviewers want you to name
+
+1. **Trusting Airflow green.** Code-level success is not data-level success.
+2. **Idempotent overwrite without count protection.** Silent erasure.
+3. **No anomaly alert on row count.** The smallest signal that catches most of this.
+4. **Inner join in places that should be outer.** Drops "missing" silently.
+5. **Letting "no data is fine" be the default.** It is not. Make it a failure.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you design the pipeline so a missing source delays the run rather than producing empty output?"*
+
+Use a sensor (Airflow Sensor, Dagster `auto_observe`). The pipeline waits for the source to arrive before it starts the transform. If the source does not arrive within a timeout, the sensor fails the DAG with a clear "source missing" message. Now the failure is loud and actionable, not silent.
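+
+A minimal Airflow 2.x sketch of that pattern, assuming the common-sql provider is installed and a `warehouse` connection exists; DAG and table names are made up:
+
+```python
+from datetime import datetime
+
+from airflow import DAG
+from airflow.operators.empty import EmptyOperator
+from airflow.providers.common.sql.sensors.sql import SqlSensor
+
+with DAG(
+    dag_id="orders_daily",
+    start_date=datetime(2025, 1, 1),
+    schedule="0 6 * * *",
+    catchup=False,
+) as dag:
+    # Blocks until yesterday's rows exist in the source; fails the DAG loudly on timeout.
+    wait_for_source = SqlSensor(
+        task_id="wait_for_source_rows",
+        conn_id="warehouse",
+        sql="SELECT COUNT(*) FROM raw.source_table WHERE event_date = CURRENT_DATE - 1",
+        poke_interval=600,       # re-check every 10 minutes
+        timeout=6 * 60 * 60,     # give up after 6 hours with a clear "source missing" failure
+        mode="reschedule",       # free the worker slot between checks
+    )
+
+    run_transform = EmptyOperator(task_id="run_transform")  # stand-in for the real transform
+
+    wait_for_source >> run_transform
+```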
diff --git a/Problem 48: Query Suddenly 80x Slower/question.md b/Problem 48: Query Suddenly 80x Slower/question.md
new file mode 100644
index 0000000..9851c62
--- /dev/null
+++ b/Problem 48: Query Suddenly 80x Slower/question.md
@@ -0,0 +1,28 @@
+## Problem 48: Query Suddenly 80x Slower
+
+**Scenario:**
+A SQL query that the team has run daily for a year used to finish in 30 seconds. Today, on identical-looking data and the same warehouse, it takes 40 minutes. Nothing about the query has changed. Nothing about the destination has changed. Someone wants to "throw more compute at it."
+
+In the interview, the question is:
+
+> A query that used to run in 30 seconds last week takes 40 minutes today. How do you find out what changed?
+
+---
+
+### Your Task:
+
+1. Resist "add more compute."
+2. Walk through the diagnostic order.
+3. Cover the most common causes when the query itself hasn't changed.
+4. Mention the prevention layer.
+
+---
+
+### What a Good Answer Covers:
+
+* EXPLAIN / query plan comparison.
+* Data volume changes.
+* Statistics drift.
+* A join's broadcast vs hash flip due to size growth.
+* Storage layout (partitioning, clustering, fragmentation).
+* Concurrent workloads on the same warehouse.
diff --git a/Problem 48: Query Suddenly 80x Slower/solution.md b/Problem 48: Query Suddenly 80x Slower/solution.md
new file mode 100644
index 0000000..4583cb5
--- /dev/null
+++ b/Problem 48: Query Suddenly 80x Slower/solution.md
@@ -0,0 +1,138 @@
+## Solution 48: Query Suddenly 80x Slower
+
+### Short version you can say out loud
+
+> First thing I do is compare the query plan today against the plan from when it was fast. Most of these incidents have the same shape: a join changed strategy because one side crossed a size threshold. The query did not change, but the data did, and the optimizer made a different choice. The cure is almost never "more compute." It is finding which join flipped and helping the optimizer pick the right plan, often by updating statistics or adding a small hint.
+
+### The diagnostic order
+
+**1. Get the plan from today.**
+
+```sql
+EXPLAIN ANALYZE
+SELECT ...;  -- the slow query goes here
+```
+
+**2. Get the plan from when it was fast.**
+
+* If the team uses an APM tool that records plans, retrieve last week's.
+* In BigQuery, look up the past job in the console or `INFORMATION_SCHEMA.JOBS` and inspect its execution details.
+* In Snowflake, the query history and Query Profile keep past executions around for a while.
+* Otherwise, try to reconstruct: run the same query on a snapshot of last week's data if you have one.
+
+**3. Diff the two plans.**
+
+Look for:
+
+* Join type changed (Hash → Nested Loop, or broadcast → hash).
+* Scan changed from index/cluster prune to full scan.
+* Sort that now spills to disk (writes to temporary storage).
+* Estimated rows vs actual rows now wildly off.
+
+The diff almost always tells you the bug in two minutes.
+
+### The classic causes when the query didn't change
+
+**1. A join flipped from broadcast to hash because the small side grew.**
+
+A query that used to broadcast a "small dimension" of 10 MB now broadcasts 5 GB because the dimension grew. The optimizer either picks a slow plan or runs out of memory. Fix: either cap the broadcast size, force a hash join, or shrink the dimension.
+
+**2. Statistics are stale.**
+
+The table grew 10x but `ANALYZE` was never re-run. The optimizer still thinks the table has 100k rows, decides a nested loop over the "small" side is cheap, and that loop now runs over 10M rows.
+
+Fix:
+
+```sql
+-- Postgres
+ANALYZE my_table;
+
+-- BigQuery / Snowflake usually keep stats current automatically, but
+-- if a CTAS just created the table, stats may not be present yet.
+```
+
+**3. A new column or change made a filter unselective.**
+
+A `WHERE status = 'active'` used to match 5% of rows. After a backfill, 95% of rows are 'active'. The optimizer's choices invert: what was a good index is now close to useless.
+
+Fix: either change the query, add a better partition / cluster, or use a different filter.
+
+**4. Storage layout changed.**
+
+In BigQuery: someone deleted and reinserted the partition with a different cluster ordering. In Postgres: heavy churn caused table bloat. In Snowflake: clustering depth got bad and the table needs reclustering.
+
+Fix: rebuild the table or trigger reclustering.
+
+**5. Resource contention from a noisy neighbor.**
+
+Another team launched a daily 6 PM job that uses the same warehouse. Your 6 PM query now waits for slots.
+
+Fix: separate warehouses or schedule shift.
+
+**6. A new index or constraint slowed writes** (and the query happens to involve a temp table with an unexpected lock).
+
+This is the OLTP version. Less common in warehouses.
+
+### A real diagnostic walkthrough
+
+Yesterday's plan:
+
+```
+HashJoin
+ ├ HashAggregate over fact (returns ~10k rows)
+ └ SeqScan over dim_region (5 rows)
+```
+
+Today's plan:
+
+```
+NestedLoop
+ ├ SeqScan over fact (returns ~10M rows)
+ └ IndexScan over dim_region
+```
+
+The reading: `dim_region` did not change, but `fact` did. The optimizer thinks `fact` is small now (statistics are stale) and picked a nested loop. Each `fact` row triggers a lookup in `dim_region`. With 10M rows in fact, that is 10M index lookups.
+
+Fix: `ANALYZE fact`. The optimizer now sees the real row count and re-picks the hash join. Query goes back to 30 seconds.
+
+### What I would NOT do first
+
+* **Throw more compute at it.** Sometimes works, but you do not learn anything, and the problem grows back.
+* **Rewrite the query.** Not yet. The query was fine last week.
+* **Restart the warehouse.** Cargo-cult fix; rarely solves anything but feels productive.
+
+### Other things to check quickly
+
+* **Did the underlying tables grow a lot?** 10x growth often crosses an optimizer threshold.
+* **Was there a schema change overnight?** Adding a column can change row width and storage.
+* **Are there new materialized views or indexes** that the optimizer might be choosing badly?
+* **Is the warehouse the same size?** A teammate accidentally downsized the Snowflake warehouse.
+* **Are you actually reading the same table?** A view definition changed, or a search path changed.
+
+### Prevention
+
+Three habits help:
+
+1. **Tag and time critical queries.** A daily job that times itself and alerts on >2x slowdown catches this on day one (sketched below).
+2. **Periodic `ANALYZE`** on growing tables. Most warehouses do this automatically, but some configurations require it.
+3. **Pin the plan when it really matters.** Some engines support plan hints (Postgres has `pg_hint_plan`; most cloud warehouses expose few or none). Where it is available, locking the plan for a critical job removes the surprise.
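+
+A sketch of the first habit, BigQuery-flavored; it assumes the scheduler attaches a `job_name:daily_revenue` label to the monitored job:
+
+```sql
+-- A row here means "today's run is more than 2x slower than its trailing average: page someone."
+WITH runs AS (
+  SELECT
+    DATE(creation_time) AS day,
+    TIMESTAMP_DIFF(end_time, start_time, SECOND) AS seconds
+  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, UNNEST(labels) AS l
+  WHERE l.key = 'job_name' AND l.value = 'daily_revenue'
+    AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 15 DAY)
+)
+SELECT *
+FROM (
+  SELECT
+    MAX(CASE WHEN day = CURRENT_DATE THEN seconds END) AS today_seconds,
+    AVG(CASE WHEN day < CURRENT_DATE THEN seconds END) AS baseline_seconds
+  FROM runs
+)
+WHERE today_seconds > 2 * baseline_seconds;
+```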
+
+### Common mistakes interviewers want you to name
+
+1. **Reaching for more compute first.** Wastes money and time.
+2. **Not capturing the plan when it was fast.** Then you cannot compare.
+3. **Skipping ANALYZE / statistics update.** Often the cure.
+4. **Blaming the database.** Usually the data changed.
+5. **Not setting up alerting on query duration.** This incident should have been caught at 1 minute, not 40.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the data didn't change either? Same query, same data, suddenly 80x slower."*
+
+Then the cause is environmental. Likely:
+
+* Warehouse downsized.
+* Heavy concurrent workload from another team or job.
+* A new query has poisoned the result cache and is forcing recomputation.
+* Storage backend issue (rare but real).
+
+Check the warehouse load metrics for that timestamp and see if there's a spike of concurrent jobs. If yes, the problem is contention, not the query.
diff --git a/Problem 49: User Says Data Is Wrong/question.md b/Problem 49: User Says Data Is Wrong/question.md
new file mode 100644
index 0000000..411af9f
--- /dev/null
+++ b/Problem 49: User Says Data Is Wrong/question.md
@@ -0,0 +1,28 @@
+## Problem 49: User Says "The Data Is Wrong"
+
+**Scenario:**
+A user reports "the data is wrong" through a generic support form. There is no screenshot, no specific number, no filter context. Just the words. You have to turn this complaint into an actual bug report you can act on.
+
+In the interview, the question is:
+
+> A user says "the data is wrong." How do you turn that vague complaint into a real bug report?
+
+This is the cousin of Problem 31 (analyst case), but from a user-support angle, often with less context to work with.
+
+---
+
+### Your Task:
+
+1. Describe the questions you would ask.
+2. Walk through how to triage the complaint.
+3. Show how to handle it if the user disappears mid-conversation.
+4. Cover the broader pattern: what tells you whether to investigate.
+
+---
+
+### What a Good Answer Covers:
+
+* The minimum information needed to start.
+* Common categories of "data is wrong" complaints.
+* When to escalate vs when to ask a follow-up.
+* Templates and self-service tools that reduce this load.
diff --git a/Problem 49: User Says Data Is Wrong/solution.md b/Problem 49: User Says Data Is Wrong/solution.md
new file mode 100644
index 0000000..6f14afd
--- /dev/null
+++ b/Problem 49: User Says Data Is Wrong/solution.md
@@ -0,0 +1,121 @@
+## Solution 49: User Says "The Data Is Wrong"
+
+### Short version you can say out loud
+
+> I treat "the data is wrong" as the start of a conversation, not the end. The first response acknowledges them and asks for the minimum information to act: which screen, which number, what they expected. From there I triage. If they can give me concrete details, I investigate. If they cannot, I either move on with a "we will check what we can" message, or I try to reproduce from common patterns. The trick is to be respectful of their time without spending a whole afternoon on something that may not be a real bug.
+
+### The first response
+
+A short, structured reply:
+
+> "Thanks for letting us know. To track this down, can you share:
+>
+> 1. Which page or report?
+> 2. Which specific number looks wrong?
+> 3. What did you expect it to be?
+> 4. Any filters you had applied?
+>
+> A screenshot helps a lot if you can."
+
+Four questions, no jargon. Most users answer 2-3 of them, which is enough.
+
+### Why these four questions
+
+* **Which page** narrows the problem to a specific table or query.
+* **Which number** localizes it inside the page.
+* **What they expected** is the most valuable piece. Often the data is correct and the user's expectation is wrong, or vice versa. Either way, the gap tells you what to investigate.
+* **Filters** matter because the same number can be different under different filter combinations.
+
+### Triage when the user has given useful detail
+
+I run through the same hierarchy as Problem 46:
+
+1. Did the data arrive in the source?
+2. Did the transform include it correctly?
+3. Did the mart serve it correctly?
+4. Is the dashboard's query / filter doing what the user thinks?
+
+For a user-reported issue, step 4 is the most common cause. The user is hitting an edge: a filter combination that excludes legitimate rows, a metric that includes refunds when they expected gross, a time zone different than they assumed.
+
+### Categories of "the data is wrong"
+
+After enough complaints, you start to see patterns:
+
+| Complaint | Most common cause |
+| ---------------------------------------- | -------------------------------------------------- |
+| "The number is too low" | Filter excludes legitimate rows; late data |
+| "The number is too high" | Refund logic; double-counted join; duplicates |
+| "It doesn't match what I see elsewhere" | Different metric definition; different time zone |
+| "It is showing zero" | Source outage; bad join; permission filter |
+| "It changed since yesterday" | History was rewritten (Problem 43); recompute |
+| "Some users are missing" | RLS or filter; deletion; classification logic |
+
+A rough mental classifier saves you time: read the complaint, pick the likely category, jump to the most likely cause.
+
+### What if the user can't or won't give details?
+
+Sometimes you get a one-line complaint and then silence. Two options:
+
+**Option A: respectful close-out.**
+
+> "Thanks for flagging this. Without more detail, we cannot reproduce the issue. If it happens again, the screenshot and the page name will help us look into it quickly. Closing this for now."
+
+This is not dismissive when phrased right. You are being honest.
+
+**Option B: try common patterns first.**
+
+If the user is important (a senior stakeholder), run through the top 3 likely complaints on the page they last viewed. Pre-compute the answers. Reply with something like:
+
+> "I checked the three most-viewed numbers on this page for yesterday and they reconcile to source. If you can point me at a specific one, I can dig further."
+
+This shows effort and often surfaces an issue without their input.
+
+### When to escalate
+
+Three signals push me to escalate the investigation, not the user:
+
+1. **Multiple users report the same vague thing.** That is a real bug, even if no one has the detail.
+2. **The user is in finance or another sensitive function.** Numbers there really do need to be right.
+3. **My own anomaly checks are red.** They are reporting something I already know is off.
+
+In all three cases, I do not wait for clarification. I dig and tell them what I find.
+
+### The communication while investigating
+
+If I am investigating without clear detail, I post an interim:
+
+> "I'm checking the page you mentioned. If I find the issue, I'll share what I found. If I don't, I'll let you know what I ruled out, in case it helps you reproduce."
+
+This sets expectations and protects me from "you said you would look into this, what happened?" three days later.
+
+### How to reduce these reports
+
+The systemic fix is to make "what is this number" easier to self-answer.
+
+* **Tooltips on every metric** explaining the definition (gross vs net, time zone, refresh schedule).
+* **"Last refreshed" stamp.** Half of "data is wrong" complaints are "data is stale."
+* **A "what does this number mean" link** to a doc that explains the metric, the source, the refresh cadence.
+* **A "report a problem" button** that captures the page, filter, user, and time automatically.
+
+The button alone often turns a vague complaint into a complete bug report.
+
+### Common mistakes interviewers want you to name
+
+1. **Asking the user to "explain the bug clearly."** Comes across as dismissive.
+2. **Treating one vague report as a major incident.** Wastes hours; reduces credibility for the next real one.
+3. **Treating one vague report as nothing.** Especially when the user is senior, this is risky.
+4. **No template for the follow-up.** Each engineer asks different questions, the user gets confused.
+5. **Closing without a clear status.** "Cannot reproduce, please re-open with details" is fine; silence is not.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How would you build a self-service tool that helps users investigate themselves?"*
+
+A few small things go a long way:
+
+* A page that says "the data was last refreshed at HH:MM" and "next refresh expected at HH:MM."
+* A "compare to last week" toggle on every dashboard. Users often spot drift before we do.
+* A small "show source query" link that displays the SQL behind the chart. Tech-savvy users can investigate themselves.
+* A "data quality status" page that aggregates the most recent test failures and ingestion lags.
+
+Most users do not want to bother us. The reason they file vague complaints is that they have no other tool. Give them one.
diff --git a/Problem 50: Partition Always Ten Percent Smaller/question.md b/Problem 50: Partition Always Ten Percent Smaller/question.md
new file mode 100644
index 0000000..62fae90
--- /dev/null
+++ b/Problem 50: Partition Always Ten Percent Smaller/question.md
@@ -0,0 +1,29 @@
+## Problem 50: One Partition Always Ten Percent Smaller
+
+**Scenario:**
+You notice that one of the 200 daily partitions of an event table consistently has about 10% fewer rows than the others. The pattern repeats every week. Some teammates say "that's just normal variation, ignore it." You are not sure.
+
+In the interview, the question is:
+
+> One out of 200 daily partitions is always 10 percent smaller than the rest. How do you decide if it is a bug?
+
+This is a "do not chase ghosts, but do not ignore patterns" question. The interviewer is testing your sense of when to investigate vs when to let it be.
+
+---
+
+### Your Task:
+
+1. List the questions you would ask.
+2. Walk through how to investigate cheaply.
+3. Cover the most common real causes.
+4. Decide when "normal variation" really is the answer.
+
+---
+
+### What a Good Answer Covers:
+
+* Confirm the pattern is real (day of week effect, holidays, time zone).
+* Look at the missing rows: are they a category?
+* Check ingestion lag and source freshness.
+* Statistical baseline.
+* When to investigate and when to leave it.
diff --git a/Problem 50: Partition Always Ten Percent Smaller/solution.md b/Problem 50: Partition Always Ten Percent Smaller/solution.md
new file mode 100644
index 0000000..822b523
--- /dev/null
+++ b/Problem 50: Partition Always Ten Percent Smaller/solution.md
@@ -0,0 +1,131 @@
+## Solution 50: One Partition Always Ten Percent Smaller
+
+### Short version you can say out loud
+
+> The first question is "is this a pattern that maps to reality?" If the small partition is always a Sunday, that is probably just user behavior. If it's a random-looking Tuesday, that is suspicious. The second question is "what is missing — a category of events, a region, a producer?" If you can identify a missing slice, you have a bug. If the loss is even across all dimensions, you are probably looking at noise. Ten percent is large enough to be worth one hour of investigation.
+
+### Step 1: confirm the pattern is real
+
+Pull the per-partition row counts for the last 60 days:
+
+```sql
+SELECT
+ event_date,
+  COUNT(*) AS row_count,
+ EXTRACT(DOW FROM event_date) AS day_of_week
+FROM events
+WHERE event_date > CURRENT_DATE - 60
+GROUP BY 1, 3
+ORDER BY 1;
+```
+
+Plot it. Three possible shapes:
+
+* **A clean weekly dip on a specific day.** Day-of-week effect. Likely real user behavior.
+* **A dip that wanders across days.** Suspicious. Patterns that move tell a different story.
+* **Every day is off by roughly the same percentage.** Then it is not a single-partition issue at all; the whole table is light, which is a different (and bigger) investigation.
+
+The "always 10% smaller" framing suggests a fixed cadence. Confirm: is it always the same day?
+
+### Step 2: account for the obvious
+
+Things to rule out:
+
+* **Day of week.** Many businesses have lower activity on weekends. A Sunday partition being 30% smaller is normal.
+* **Public holidays.** Specific dates have lower volume. Map the dip dates to holiday calendars.
+* **Time zone effect.** "Day" in the warehouse may not align with "day" at the source. If the source closes at 8 PM local, "their day" ends earlier than UTC midnight.
+* **Recurring maintenance.** Some sources do scheduled maintenance windows on a specific weekday.
+
+If the dip aligns with any of these, it is not a bug. Document it ("Sunday partitions average 30% lower due to weekend traffic patterns") so the next person does not chase it.
+
+### Step 3: find what is missing
+
+If the dip survives the explanations above, look for which slice of the data is short:
+
+```sql
+SELECT
+ event_type,
+  COUNT(*) AS row_count
+FROM events
+WHERE event_date = '2025-05-11' -- the small day
+GROUP BY event_type
+ORDER BY row_count DESC;
+```
+
+Compare to a normal day. If `event_type = 'page_view'` is missing 30% and other types look normal, you have a clue: one producer is misbehaving on that day.
+
+Same exercise across other dimensions: by region, by app version, by source, by hour. The missing rows are usually concentrated, not evenly thinned.
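+
+One way to make that comparison concrete, with the small day lined up against the same weekday one week earlier (dates are illustrative):
+
+```sql
+WITH small_day AS (
+  SELECT event_type, COUNT(*) AS cnt
+  FROM events
+  WHERE event_date = '2025-05-11'
+  GROUP BY event_type
+),
+normal_day AS (
+  SELECT event_type, COUNT(*) AS cnt
+  FROM events
+  WHERE event_date = '2025-05-04'
+  GROUP BY event_type
+)
+SELECT
+  n.event_type,
+  n.cnt AS normal_cnt,
+  COALESCE(s.cnt, 0) AS small_cnt,
+  ROUND(100.0 * (n.cnt - COALESCE(s.cnt, 0)) / n.cnt, 1) AS pct_drop
+FROM normal_day n
+LEFT JOIN small_day s USING (event_type)
+ORDER BY pct_drop DESC;
+```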
+
+### Common real causes
+
+When the dip turns out to be a bug, it is usually one of:
+
+1. **A weekly job at the source pauses ingestion** for a maintenance window. Their "data missing for 4 hours" shows up as your "10% smaller partition."
+2. **A specific service has a weekly deploy** that takes minutes, during which events drop.
+3. **A scheduled batch in the source is competing for the same Kafka topic**, causing brief backpressure.
+4. **A weekly partner upload runs late**, beyond your partition boundary, and lands the following day's partition instead.
+5. **A daylight-saving-time edge** that shifts an hour out of one partition into the next, twice a year.
+
+Each of these is fixable, but the fix is on a different team than yours.
+
+### Step 4: judge if it matters
+
+Even if it is a bug, you have to decide whether to spend time on it. The honest test:
+
+* Does the missing 10% impact any downstream business decision?
+* Does the smaller partition cause downstream errors (failed joins, broken counts)?
+* Does the pattern hide a worse failure that could grow?
+
+If the answer is "no, no, no," document it and move on. If the answer is "the missing rows are a category that finance cares about," investigate properly.
+
+### A useful statistical anchor
+
+10% is suspicious because it is too large for random noise on a high-volume table, but small enough that "normal variation" is plausible. A rough rule:
+
+* If daily counts have a standard deviation of `s`, a partition that is more than `3*s` below the mean is unusual.
+* If the dips are systematically at the same level (clear 10% every Tuesday), that is not noise, it is signal.
+
+A quick way to get that baseline:
+
+```sql
+WITH daily AS (
+ SELECT event_date, COUNT(*) AS cnt FROM events
+ WHERE event_date > CURRENT_DATE - 90
+ GROUP BY event_date
+)
+SELECT
+ AVG(cnt) AS avg_rows,
+ STDDEV(cnt) AS stddev_rows,
+ AVG(cnt) - 3 * STDDEV(cnt) AS lower_3sigma
+FROM daily;
+```
+
+If the dip partitions are below the lower-3-sigma line, they are unusual.
+
+### What to do once I know
+
+If the dip is **real and bothersome**: file a ticket with the team owning the source. "Every Tuesday between 2-3 AM UTC, the `events` topic has reduced throughput by ~10%. Can you confirm a maintenance window?"
+
+If the dip is **real but harmless**: write it down in the table's documentation. "Tuesdays are typically 10% lower due to source maintenance. Not a bug."
+
+If the dip is **noise**: do nothing, but add an anomaly check so a *real* drop would page.
+
+### Common mistakes interviewers want you to name
+
+1. **Ignoring patterns because "data is noisy."** A repeated pattern is not noise.
+2. **Spending a week investigating a 10% dip on Sundays.** Day of week, ignore.
+3. **Assuming a fix is on your side.** Often the bug is at the source.
+4. **No documentation.** The next engineer chases the same ghost.
+5. **No alert for a worse version.** Today it's 10%; tomorrow it could be 90%.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the partition is exactly 10% smaller, never more, never less? Almost too consistent."*
+
+That is a much stronger signal. Random data losses are rarely exactly the same percentage. A consistent 10% smells like:
+
+* A specific producer is offline on that day (a known set of devices, services, or geographies).
+* A scheduled job at the source skips that day (a planned partial outage).
+* A filter at the ingest is dropping a specific category every Tuesday.
+
+Investigate by slicing the data. The missing 10% is almost certainly a clean category you can name. Once named, the fix is obvious.
diff --git a/Problem 51: BigQuery Bill Eight Times Higher/question.md b/Problem 51: BigQuery Bill Eight Times Higher/question.md
new file mode 100644
index 0000000..4c9657b
--- /dev/null
+++ b/Problem 51: BigQuery Bill Eight Times Higher/question.md
@@ -0,0 +1,29 @@
+## Problem 51: BigQuery Bill Eight Times Higher
+
+**Scenario:**
+The monthly BigQuery bill jumped from $500 to $4,000 between April and May. Nobody on the team can immediately say why. Finance wants an answer this week. You have full access to billing data and INFORMATION_SCHEMA.
+
+This is the more focused cousin of Problem 30. Here the question is the investigation method itself, top to bottom.
+
+In the interview, the question is:
+
+> Your BigQuery bill jumped from 500 to 4000 dollars in one month. Walk me through how you would find what caused it.
+
+---
+
+### Your Task:
+
+1. List the queries you would run first.
+2. Walk through how to split cost by user, by query, by table.
+3. Cover what the answer usually is.
+4. Propose the immediate fixes and the longer term governance.
+
+---
+
+### What a Good Answer Covers:
+
+* INFORMATION_SCHEMA.JOBS_BY_PROJECT.
+* Bytes billed as the cost driver.
+* Partitioning that is not used, clustering that is not used, SELECT * patterns.
+* Scheduled queries running too often.
+* Reservation vs on-demand decision.
diff --git a/Problem 51: BigQuery Bill Eight Times Higher/solution.md b/Problem 51: BigQuery Bill Eight Times Higher/solution.md
new file mode 100644
index 0000000..a9ed1e5
--- /dev/null
+++ b/Problem 51: BigQuery Bill Eight Times Higher/solution.md
@@ -0,0 +1,212 @@
+## Solution 51: BigQuery Bill Eight Times Higher
+
+### Short version you can say out loud
+
+> I would not guess. I would run six queries against INFORMATION_SCHEMA in the first hour. Almost every BigQuery cost explosion has the same shape: a handful of scheduled queries that scan unpartitioned data, or a SELECT * on a huge table, or a dashboard refreshing too often. Once I find the top 10 queries by bytes billed, the answer is usually obvious in 30 minutes.
+
+### The six queries I would run
+
+**1. Daily cost trend, last 90 days.**
+
+```sql
+SELECT
+ DATE(creation_time) AS day,
+ SUM(total_bytes_billed) / POW(2, 40) AS tib,
+ SUM(total_bytes_billed) / POW(2, 40) * 6.25 AS approx_usd_on_demand
+FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
+WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
+GROUP BY day
+ORDER BY day;
+```
+
+This plots the inflection point. The day the bill started rising is the clue.
+
+**2. By user / service account.**
+
+```sql
+SELECT
+ user_email,
+ SUM(total_bytes_billed) / POW(2,40) AS tib,
+ COUNT(*) AS jobs
+FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
+WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
+GROUP BY user_email
+ORDER BY tib DESC
+LIMIT 20;
+```
+
+Usually one or two accounts dominate. If a service account is at the top, the answer is a pipeline. If a human is at the top, the answer is an analyst's notebook or dashboard.
+
+**3. Top queries by total bytes billed.**
+
+```sql
+SELECT
+ SUBSTR(query, 1, 100) AS query_snippet,
+ user_email,
+ COUNT(*) AS runs,
+ SUM(total_bytes_billed) / POW(2,40) AS tib_total,
+ AVG(total_bytes_billed) / POW(2,30) AS gib_avg
+FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
+WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
+ AND job_type = 'QUERY'
+ AND statement_type = 'SELECT'
+GROUP BY query_snippet, user_email
+ORDER BY tib_total DESC
+LIMIT 30;
+```
+
+Sort by `tib_total`. The top 5-10 queries usually explain 70-90 percent of the bill.
+
+**4. Scheduled queries by frequency.**
+
+```sql
+SELECT
+  SUBSTR(query, 1, 100) AS query_snippet,
+ COUNT(*) AS runs_in_30d
+FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
+WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
+ AND statement_type = 'SELECT'
+GROUP BY query_snippet
+HAVING runs_in_30d > 100
+ORDER BY runs_in_30d DESC;
+```
+
+Anything running thousands of times a month is likely a dashboard refresh or a scheduled query. Often that is the culprit.
+
+**5. Top tables scanned.**
+
+```sql
+SELECT
+ ref.dataset_id,
+ ref.table_id,
+ SUM(j.total_bytes_billed) / POW(2,40) AS tib
+FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT j,
+ UNNEST(j.referenced_tables) ref
+WHERE j.creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
+GROUP BY ref.dataset_id, ref.table_id
+ORDER BY tib DESC
+LIMIT 20;
+```
+
+Identifies the tables that drive the bill. Knowing the table makes the fix obvious (partition it, cluster it, archive old data).
+
+**6. Comparison: April vs May, same queries.**
+
+```sql
+WITH monthly AS (
+ SELECT
+ EXTRACT(MONTH FROM creation_time) AS m,
+ SUBSTR(query, 1, 100) AS q,
+ SUM(total_bytes_billed) / POW(2,40) AS tib
+ FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
+ WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 DAY)
+ GROUP BY 1, 2
+)
+SELECT q, MAX(IF(m=4, tib, 0)) AS april, MAX(IF(m=5, tib, 0)) AS may
+FROM monthly
+GROUP BY q
+ORDER BY (may - april) DESC
+LIMIT 20;
+```
+
+The biggest month-over-month deltas are the queries that grew. That is where new cost came from.
+
+### What the answer usually is
+
+In real incidents, the top of that list is one of:
+
+1. **A scheduled query that scans an unpartitioned table** every hour. 500 GB scan × 720 hours = 360 TB billed, easily $1,000+.
+2. **A new dashboard** that refreshes every 5 minutes against a multi-TB raw events table.
+3. **A `SELECT *`** on a huge table inside a dbt model or a notebook.
+4. **A backfill that nobody knew was running.** Sometimes a `dbt run --full-refresh` on a 50-model project rebuilds everything every night.
+5. **A noisy ML pipeline** scanning the same training data 20 times for feature engineering.
+6. **Real growth.** The business doubled its data; the bill doubled. This is rare for 8x, but possible.
+
+### Walking the fix
+
+Once I find the top 3 queries, the fixes are:
+
+**Fix the scheduled scan.**
+
+```sql
+-- before: scans all partitions
+SELECT ... FROM analytics.events;
+
+-- after: partition pruned
+SELECT ... FROM analytics.events
+WHERE event_date = CURRENT_DATE - 1;
+```
+
+Often a 99% cost reduction on that query.
+
+**Reduce dashboard refresh.**
+
+Most dashboards on daily data only need a daily refresh. Drop the auto-refresh from 12x/hour to 1x/hour. 12x cost reduction, no user impact for daily data.
+
+**Replace SELECT * with named columns.**
+
+```sql
+-- before: scans 50 columns
+SELECT * FROM analytics.fact_orders WHERE date = ...;
+
+-- after: scans 4 columns
+SELECT order_id, customer_id, amount, date
+FROM analytics.fact_orders WHERE date = ...;
+```
+
+The bill is bytes-billed. Selecting fewer columns reduces bytes proportionally.
+
+**Materialize hot summaries.**
+
+If 10 queries scan the same daily fact for the same aggregations, build a materialized view or a summary table. Each query then reads a few rows instead of millions.
+
+### Long-term: governance
+
+After the immediate fixes:
+
+* **Query tags.** Every job carries a label (`team:`, `dataset:`, `purpose:`). Cost reports by tag (sketched below).
+* **Per-team budgets** with Slack alerts when on pace to exceed.
+* **A weekly "top 10 queries" report** so the team sees what is expensive.
+* **Reservation slots** if usage is predictable. BigQuery slots can save 30-50% over on-demand for steady workloads.
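+
+A minimal sketch of query tagging from the Python client, assuming the `google-cloud-bigquery` library; label keys and the query are illustrative:
+
+```python
+from google.cloud import bigquery
+
+client = bigquery.Client()
+
+# Labels show up in INFORMATION_SCHEMA.JOBS*, so cost can be grouped by team and purpose.
+job_config = bigquery.QueryJobConfig(
+    labels={"team": "growth", "purpose": "daily_revenue_mart"}
+)
+client.query(
+    "SELECT COUNT(*) FROM analytics.fact_orders WHERE date = CURRENT_DATE - 1",
+    job_config=job_config,
+).result()
+```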
+
+### Reservation vs on-demand: a quick decision
+
+| Usage shape | Better choice |
+| ------------------------------------------ | ---------------------- |
+| Spiky, unpredictable, occasional heavy day | On-demand |
+| Steady-state, high daily usage | Slot commitment |
+| Multiple teams sharing a workload | Slots with reservations|
+| Just starting out, no idea yet | On-demand |
+
+For this scenario, $4k/month is still small. I would keep on-demand and just fix the queries. At $20k/month, the slot reservation conversation becomes worth having.
+
+### The conversation with finance
+
+> "The increase from $500 to $4,000 is driven by three queries:
+>
+> 1. A daily aggregation that was changed to scan the full events table instead of just yesterday's partition. ~$2,000 of the increase.
+> 2. A new ML feature pipeline scanning 800 GB twelve times a day. ~$1,000.
+> 3. A dashboard auto-refresh that was set to 5 minutes during launch and never lowered. ~$500.
+>
+> I have fixes ready for all three. Expected June bill is ~$700-900. The remaining $200-400 over April is real growth from the new pipeline, which is delivering business value."
+
+Specific numbers, named queries, a plan, a future estimate. That is the right shape.
+
+### Common mistakes interviewers want you to name
+
+1. **Investigating from intuition** instead of from INFORMATION_SCHEMA.
+2. **Switching to slots** before fixing wasteful queries. You lock in the inflated usage.
+3. **Killing queries without finding their owner.**
+4. **No follow-up alert.** Bills will rise again.
+5. **Not communicating savings.** Team forgets the work happened; behaviour repeats.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the team genuinely needs the data and there is no obvious waste?"*
+
+Then the conversation shifts. Two moves:
+
+1. Talk to finance about reserved slots for the steady portion of the workload. Long-term cost reduction without changing behaviour.
+2. Look at storage as well as compute. Sometimes the cost is in long-term storage of data that nobody queries. Lifecycle rules and archive tables can save a real fraction.
+
+If both have been exhausted and the workload is justified, the cost is the cost. Show the business "cost per user" or "cost per revenue dollar" and let them decide.
diff --git a/Problem 52: Four Hour Spark Job Under One Hour/question.md b/Problem 52: Four Hour Spark Job Under One Hour/question.md
new file mode 100644
index 0000000..e82aa08
--- /dev/null
+++ b/Problem 52: Four Hour Spark Job Under One Hour/question.md
@@ -0,0 +1,30 @@
+## Problem 52: Four Hour Spark Job Under One Hour
+
+**Scenario:**
+A nightly Spark job takes 4 hours. The team wants it under 1 hour without rewriting the business logic. The job reads several large tables, joins them, aggregates, and writes results. You are asked what you would check first.
+
+In the interview, the question is:
+
+> A nightly Spark job runs for 4 hours and the team needs it under 1 hour. Without rewriting logic, what do you check first?
+
+This is testing your sense of where time hides in distributed jobs and what you can change without touching the SQL or DataFrame logic.
+
+---
+
+### Your Task:
+
+1. List the things you would inspect, in order.
+2. Explain how to use the Spark UI.
+3. Cover the most common 4x wins.
+4. Mention what does NOT count as "rewriting logic."
+
+---
+
+### What a Good Answer Covers:
+
+* Spark UI stages and tasks.
+* Skew detection.
+* Partition count and shuffle size.
+* Reading too much data (column pruning, partition pruning).
+* Broadcast joins vs sort-merge joins.
+* Output file size and small-file problem.
diff --git a/Problem 52: Four Hour Spark Job Under One Hour/solution.md b/Problem 52: Four Hour Spark Job Under One Hour/solution.md
new file mode 100644
index 0000000..d4bf2f5
--- /dev/null
+++ b/Problem 52: Four Hour Spark Job Under One Hour/solution.md
@@ -0,0 +1,167 @@
+## Solution 52: Four Hour Spark Job Under One Hour
+
+### Short version you can say out loud
+
+> I would open the Spark UI and look for three things: one stage that takes most of the time, skew where a few tasks are much longer than the rest, and shuffle volume that is way bigger than the input. Most 4x wins come from one of: skewed joins, reading more data than needed, too few or too many partitions, and a sort-merge join that should have been broadcast. None of those require touching the business logic.
+
+### The inspection order
+
+```
+1. Spark UI → Jobs → which stage is the slowest?
+2. That stage → Tasks → are some tasks 10x longer than median? (skew)
+3. Shuffle Read / Write columns → how much data is being moved?
+4. Input → are we reading whole files when we only need columns?
+5. Final write → are we producing 50,000 tiny files?
+```
+
+### Step 1: find the slow stage
+
+In the Spark UI, the Jobs page lists stages with duration. One or two stages usually dominate. Click into the slowest.
+
+The Tasks page for that stage shows the distribution of task durations. Three patterns:
+
+* **Even.** All tasks take similar time. You are CPU- or I/O-bound across the cluster. Scale-up helps.
+* **Skewed.** A few tasks are 10x longer than the rest. Skew. This is the most common reason a Spark job is slow.
+* **Mostly fast with a few slow stragglers.** Skew, but milder. Still worth fixing.
+
+### Step 2: fix skew
+
+Skew means data is unevenly distributed across partitions. Usually because of a join key with one or two huge values.
+
+Detecting:
+
+* Sort tasks by duration. The slowest tasks are processing the huge keys.
+* Sort by shuffle read size. Same answer.
+
+Fixing without changing logic:
+
+* **Salting.** Add a random suffix to the hot join key and replicate the matching rows on the other side across the salt values. In Spark 3+, AQE (Adaptive Query Execution) gets much of the same effect automatically by splitting oversized partitions on the fly.
+* **Enable AQE.**
+
+```python
+spark.conf.set("spark.sql.adaptive.enabled", "true")
+spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
+```
+
+If AQE is off, turn it on. This single change often shaves 50 percent off skewed jobs.
+
+### Step 3: reduce shuffle
+
+The Spark UI shows shuffle read and write per stage. If shuffle is bigger than the input data, the job is moving more than it needs to.
+
+Common causes:
+
+* **Two large tables joined on a high-cardinality key.** Required, but expensive.
+* **A `groupBy` immediately followed by a wide aggregation.** Sometimes can be reorganized; not always.
+* **A `repartition()` followed by a small filter.** Repartitioning before filtering is wasteful.
+
+Without changing the logic, you can:
+
+* **Filter earlier.** If the query filters after a join, move the filter before it. The optimizer pushes many filters down on its own, but UDFs and some expressions block it.
+* **Tune `spark.sql.shuffle.partitions`.** The default of 200 is often wrong. Too few means giant partitions that spill; too many means scheduling overhead. A rule of thumb: target ~128 MB per shuffle partition. If shuffle is 100 GB, aim for ~800 partitions (sketched below).
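+
+The sizing arithmetic, as a rough sketch against an existing `spark` session (the shuffle size comes from the Spark UI):
+
+```python
+# ~100 GB of shuffle read observed in the UI, targeting ~128 MB per shuffle partition.
+shuffle_bytes = 100 * 1024**3
+target_bytes_per_partition = 128 * 1024**2
+num_partitions = max(200, shuffle_bytes // target_bytes_per_partition)  # ~800 here
+
+spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
+spark.conf.set("spark.sql.adaptive.enabled", "true")  # AQE can coalesce if this overshoots
+```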
+
+### Step 4: read less
+
+Spark reading from columnar storage (Parquet, ORC, Delta) should only read the columns it needs. If you have:
+
+```python
+df = spark.read.parquet("s3://...")
+df.select("a", "b", "c").filter(...).join(...)
+```
+
+Spark should push down both the column selection and the filter. But sometimes a downstream operation breaks this (a `.collect()`, a `.cache()`, a UDF before the filter).
+
+Check the physical plan:
+
+```python
+df.explain(True)
+```
+
+Look for:
+
+* **PartitionFilters.** Make sure the filter on the partition column appears here. If it does not, the job is reading every partition.
+* **PushedFilters.** The other filters should show up. If not, predicate pushdown is broken.
+* **PrunedColumns.** Should show only the columns you actually use.
+
+Reading 100 GB vs 5 GB makes a 20x difference in I/O time alone.
+
+### Step 5: check the partition count of the input
+
+If a 1 TB table is stored as 5 huge non-splittable files (gzipped CSV, for example), Spark only gets 5 input partitions, no matter how many executors you have. Most of the cluster sits idle while 5 tasks grind. Splittable columnar formats like Parquet avoid this because Spark slices each file into ~128 MB chunks.
+
+Fix without changing logic:
+
+* `spark.read.csv(...).repartition(N)` to spread the data across the cluster right after the read. Sometimes a `.repartition()` before the heavy work is a 2x win even though it triggers a shuffle.
+* The read itself still runs on a handful of tasks; the lasting fix is to change the file layout or format upstream.
+
+### Step 6: check the output
+
+Sometimes the 4-hour job is fast on the compute part but spends an hour writing tiny files. If your DataFrame has 5,000 partitions and you `write` to a partitioned destination with 100 partitions, you create 500,000 files.
+
+Fix:
+
+```python
+df.repartition(100).write.partitionBy("date").parquet(out)
+```
+
+Or use `coalesce(n)` if you do not need a full shuffle.
+
+### Step 7: cluster shape
+
+Without rewriting logic, you can still:
+
+* **Increase executors.** More parallelism, faster per-stage.
+* **Right-size executor memory.** Too small causes spills to disk (very slow). Too big wastes memory and drags out garbage collection. A sweet spot is usually 4-8 cores per executor and around 4 GB per core.
+* **Spot instances** for cost, not speed. Saves money, can hurt reliability if not careful.
+
+If the job is I/O bound, more compute does not help. Look at network read instead.
+
+### What does NOT count as "rewriting logic"
+
+The team asked to keep the business logic the same. That still allows:
+
+* Enabling AQE.
+* Tuning shuffle partition count.
+* Adding `repartition` or `coalesce` calls.
+* Selecting columns before the heavy work.
+* Filter pushdown.
+* Broadcast hints (`broadcast(df)`).
+* Cluster sizing.
+* Output write strategy.
+
+It does not allow rewriting the SQL or DataFrame transformations themselves. That is the line.
+
+### A real fix walkthrough
+
+A team's 4-hour job is mostly one stage with a sort-merge join. Spark UI shows:
+
+* Input: 500 GB of orders + 50 MB of dim_country.
+* Shuffle: 500 GB on each side of the join.
+* Total time: 3.5 hours.
+
+The dim_country is small but Spark is treating it like another large table. Fix:
+
+```python
+from pyspark.sql.functions import broadcast
+joined = orders.join(broadcast(dim_country), "country_code")
+```
+
+Now Spark broadcasts the small side. No shuffle. The join completes in minutes instead of hours. Total job goes from 4 hours to ~45 minutes.
+
+One line. No logic change. 4x.
+
+### Common mistakes interviewers want you to name
+
+1. **Adding more nodes** before checking the plan. Often the bottleneck is one stage, not capacity.
+2. **Disabling AQE** because of an old bug. AQE in Spark 3.2+ is solid.
+3. **Default `spark.sql.shuffle.partitions=200`** for a 1 TB job.
+4. **Tiny output files** that take longer to write than the compute itself.
+5. **Caching everything.** Sometimes the cache is the bottleneck.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if you actually do need to rewrite logic to hit one hour?"*
+
+Then the conversation changes. Most often, the win is reducing what you compute, not how. For example, if 80 percent of the join output is then aggregated to 1 percent of the rows, do the aggregation first. Or, pre-compute features in a daily smaller table so the big nightly job has less to do.
+
+But that should be a second conversation, after the no-logic-change wins are taken. They usually get you within 2x of the target by themselves.
diff --git a/Problem 53: Hourly Scan on Daily Data/question.md b/Problem 53: Hourly Scan on Daily Data/question.md
new file mode 100644
index 0000000..c42883a
--- /dev/null
+++ b/Problem 53: Hourly Scan on Daily Data/question.md
@@ -0,0 +1,26 @@
+## Problem 53: Hourly Scan on Daily Data
+
+**Scenario:**
+A dashboard refreshes every hour. The underlying data only changes once a day, at 6 AM. The query scans 5 TB each hour, so 120 TB per day, roughly $750 a day (over $20,000 a month) at on-demand rates. The team built it that way "because the dashboard tool's default is hourly." They do not want to break the dashboard.
+
+In the interview, the question is:
+
+> A team is scanning a 5 TB table every hour for a dashboard that only changes once a day. How do you fix this without disturbing their workflow?
+
+---
+
+### Your Task:
+
+1. List the three or four fix options.
+2. Pick one, defend.
+3. Cover how to keep the user experience identical.
+4. Mention longer term checks.
+
+---
+
+### What a Good Answer Covers:
+
+* Materialized view or scheduled summary table.
+* Caching at the BI layer.
+* Daily refresh schedule.
+* Storage savings vs compute savings.
diff --git a/Problem 53: Hourly Scan on Daily Data/solution.md b/Problem 53: Hourly Scan on Daily Data/solution.md
new file mode 100644
index 0000000..9d3af96
--- /dev/null
+++ b/Problem 53: Hourly Scan on Daily Data/solution.md
@@ -0,0 +1,134 @@
+## Solution 53: Hourly Scan on Daily Data
+
+### Short version you can say out loud
+
+> The dashboard reads the same answer 24 times a day. The cheapest fix is to compute that answer once and cache it. The cleanest way is a daily summary table or a materialized view, refreshed once after the source data lands. The dashboard's query stays the same, just pointed at the small table instead of the 5 TB raw. User experience is unchanged. Cost drops from roughly $750 a day to about $30 a day for the single rebuild, and to pennies if the rebuild only touches new partitions.
+
+### Four options, ranked
+
+**1. Daily summary table (my pick).**
+
+Compute the dashboard's exact query into a small table once a day, right after the data lands. The dashboard reads the small table.
+
+```sql
+CREATE OR REPLACE TABLE marts.daily_dashboard_metrics AS
+SELECT
+ region,
+ product_category,
+ DATE_TRUNC(order_date, DAY) AS day,
+ COUNT(*) AS orders,
+ SUM(amount) AS revenue
+FROM raw.orders
+WHERE order_date BETWEEN '2025-01-01' AND CURRENT_DATE
+GROUP BY 1, 2, 3;
+```
+
+The dashboard's query becomes:
+
+```sql
+SELECT * FROM marts.daily_dashboard_metrics WHERE ...;
+```
+
+A few thousand rows. Sub-second scan. Pennies per month.
+
+The dashboard's URL, layout, filters, and look are unchanged. Users do not even know.
+
+**2. Materialized view.**
+
+If the warehouse supports them well (BigQuery, Snowflake), define a materialized view that captures the dashboard's query. The view is refreshed on a schedule or on insert. Queries against the source automatically rewrite to use the MV.
+
+```sql
+CREATE MATERIALIZED VIEW marts.dashboard_mv AS
+SELECT region, product_category, ...
+FROM raw.orders
+GROUP BY ...;
+```
+
+Less control than option 1, but you do not have to maintain the refresh job yourself.
+
+**3. BI tool extract / cache.**
+
+Tools like Looker, Tableau, Power BI can build their own extract — basically option 1 but inside the BI tool. Refreshed daily, served from cache.
+
+Less portable, less SQL-visible, but no warehouse-side change.
+
+**4. Schedule change at the BI tool.**
+
+Just change the dashboard refresh from hourly to daily. Same data, 24x cheaper.
+
+This is the simplest one and the team did not consider it. Worth asking: do users really need hourly? If the data only changes daily, the hourly refresh shows the same answer 23 times in a row.
+
+### My pick for this scenario
+
+**Option 1 (daily summary table) plus Option 4 (daily refresh)** together.
+
+* The summary table makes the query cheap.
+* The daily refresh stops re-running the same answer 24 times.
+
+Cost goes from roughly $750 a day to about $30 a day, or to pennies if the summary rebuild is made incremental. The dashboard renders faster, too, because the underlying query is smaller.
+
+### Keeping the user experience identical
+
+The user-facing pieces are:
+
+* The dashboard URL.
+* The chart names.
+* The filters available.
+* The numbers shown.
+
+None of those need to change. The change is invisible to the user: only the underlying table swap. If anything, performance improves because the query is now reading a few thousand rows instead of 5 TB.
+
+If there is a worry about new bugs, run both old and new for a few days. Compare row counts and sums. If they match, swap.
+
+### What about hourly users?
+
+If even one user needs hourly data, the picture changes. Two paths:
+
+1. **Cap the hourly query to today's partition only.** Yesterday is fixed, today is live. Hourly is now scanning a few GB, not 5 TB.
+
+```sql
+WHERE date = CURRENT_DATE -- only today's partition, ~200 GB instead of 5 TB
+```
+
+Cost drops from roughly $750 a day to about $30 a day.
+
+2. **Two layers**: a daily summary for historical, a small "today" table refreshed hourly. The dashboard UNIONs them (sketched below). Adds a little complexity for the team that wants up-to-the-hour data.
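+
+A sketch of the two-layer shape, with hypothetical table names; the view keeps the dashboard pointed at a single object:
+
+```sql
+CREATE OR REPLACE VIEW marts.dashboard_metrics_combined AS
+SELECT * FROM marts.daily_dashboard_metrics   -- history, rebuilt once a day after the 6 AM load
+UNION ALL
+SELECT * FROM marts.today_dashboard_metrics;  -- today only, rebuilt hourly from one small partition
+```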
+
+For this scenario, neither is needed because the source only updates daily.
+
+### What about the storage cost of the summary table?
+
+A daily summary table for a 5 TB source is typically megabytes, not gigabytes. The grain is much smaller than the raw events. Storage cost is negligible compared to the scan savings.
+
+### How to talk to the team
+
+The conversation matters:
+
+> "I noticed the dashboard scans 5 TB every hour to show numbers that only change at 6 AM. I can change two things: rebuild the dashboard query into a small daily summary table, and switch the dashboard's refresh from hourly to daily. The dashboard itself does not change. Users see the same numbers, faster. Saves about $740 a month. Half a day of work, easy to roll back if something looks off."
+
+Concrete, specific, and not preachy. The team will say yes.
+
+### Longer term: how to catch this
+
+Set up a weekly automated report:
+
+* Top 10 queries by bytes-billed in the past week.
+* Each one annotated with: how often it runs, what table it scans, who owns it.
+
+Anyone glancing at the report will see "daily query running 24x/day on 5 TB" the next time it pops up. Catches the next instance early.
+
+### Common mistakes interviewers want you to name
+
+1. **Killing the dashboard refresh** without telling the team.
+2. **Building a materialized view** but forgetting to refresh it (some engines require explicit refresh schedules).
+3. **Picking the BI tool extract** when warehouse-side is simpler. Couples the fix to one tool.
+4. **Not validating** that the summary matches the original. Different aggregation orders can produce subtle differences.
+5. **One mega summary table** that tries to cover every dashboard. Becomes its own slow query.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the team also has 30 other dashboards on the same table, all scanning hourly?"*
+
+Then the fix scales naturally. Build a slightly wider summary table that covers the common columns. The 30 dashboards all read from it. Now you have replaced 30 dashboards' worth of redundant 5 TB scans, well over $20,000 a day at this rate, with one summary table that costs a few dollars a month. The win is bigger when the source is shared.
+
+Or, more structurally, treat this as a sign that the team needs a metrics layer (like dbt's metrics, or a semantic model). One place to define the metric, all dashboards read consistent numbers without scanning raw.
diff --git a/Problem 54: Just Throw More Memory At It/question.md b/Problem 54: Just Throw More Memory At It/question.md
new file mode 100644
index 0000000..5b0f9b8
--- /dev/null
+++ b/Problem 54: Just Throw More Memory At It/question.md
@@ -0,0 +1,29 @@
+## Problem 54: "Just Throw More Memory at It"
+
+**Scenario:**
+A senior engineer hits a slow query in Snowflake (or a Spark job that's running out of memory) and proposes "let's just throw more memory at it. Upsize the warehouse to a 2X-Large, or bump the executor memory to 32 GB." The fix is real and works. But it triples the cost.
+
+In the interview, the question is:
+
+> A senior engineer says "just throw more memory at it." What questions would you ask before agreeing?
+
+This is testing your reflex to dig before scaling.
+
+---
+
+### Your Task:
+
+1. Acknowledge the move; it's not always wrong.
+2. List the questions you would ask.
+3. Walk through the cheaper alternatives.
+4. Cover when you would actually agree.
+
+---
+
+### What a Good Answer Covers:
+
+* Why upsizing works (less spill, more parallel).
+* The cost trade.
+* The cheaper alternatives almost always exist.
+* When upsize is the right call (one-time, large, no time to optimize).
+* The cultural side: don't dunk on the senior engineer.
diff --git a/Problem 54: Just Throw More Memory At It/solution.md b/Problem 54: Just Throw More Memory At It/solution.md
new file mode 100644
index 0000000..a27462b
--- /dev/null
+++ b/Problem 54: Just Throw More Memory At It/solution.md
@@ -0,0 +1,128 @@
+## Solution 54: "Just Throw More Memory at It"
+
+### Short version you can say out loud
+
+> Upsizing is real, it works. But it locks in cost forever, and the same query usually has a cheaper fix that has not been tried. I would agree to upsize as a temporary measure to unblock today, while we look for the underlying cause. The questions I would ask are: what is the plan showing, what is the table layout, and have we tried the four or five common optimizations before tripling the bill. Eight out of ten times, one of those wins gets us the same result without the cost.
+
+### Acknowledging it can be the right move
+
+It would be wrong to dismiss the suggestion. "Throw more memory" is sometimes correct:
+
+* The job is one-time (a backfill, a migration). Optimization time costs more than the extra compute.
+* The data is genuinely big and the query is already efficient.
+* The deadline is in 2 hours.
+
+So the answer is not "no." The answer is "let's verify."
+
+### The questions I would ask
+
+1. **What does the query plan look like?** Are we spilling to disk? Reading 10x more than we need? Picking a nested-loop join? Each of those has a cheaper fix than more memory.
+2. **Is the table partitioned and clustered well?** A 5 TB scan on a poorly clustered table is often a 50 GB scan after re-clustering.
+3. **Have we tried the obvious filter and join improvements?** `SELECT *` pulling every column, functions wrapped around filter columns that block pruning, a missed broadcast join.
+4. **Is this query going to run once or every day?** Once: just upsize. Every day: optimize.
+5. **What was the size before this got slow?** If the table doubled, the optimizer may have flipped a join. Plan diff (Problem 48).
+
+If none of those have been investigated, I would ask for an hour to look before agreeing to the upsize.
+
+### The cheaper alternatives I would check
+
+Before agreeing to upsize, in order of effort:
+
+**1. Reduce data read.**
+
+Column pruning, partition pruning, filter pushdown. Often a 10x reduction in bytes scanned with no logic change.
+
+**2. Force a better join.**
+
+Broadcast hints, MERGE JOIN with sorted inputs. A wrong join strategy is the most common cause of memory pressure.
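+
+A hedged PySpark sketch of the broadcast case (table names are made up; the point is the explicit `broadcast` hint on the small side of the join):
+
+```python
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import broadcast
+
+spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()
+
+orders = spark.table("analytics.orders")        # large fact table (assumed name)
+regions = spark.table("analytics.dim_regions")  # small dimension table (assumed name)
+
+# Without the hint the planner may pick a shuffle join that spills both sides to disk;
+# broadcasting the small table keeps the memory pressure on the small side only.
+enriched = orders.join(broadcast(regions), on="region_id", how="left")
+enriched.explain()  # confirm the plan shows a broadcast join before trusting the fix
+```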
+
+**3. Use a temp table.**
+
+Sometimes the optimizer cannot figure out a path. Materialize the intermediate result to a small temp table, then join. This breaks the planner's bad assumptions.
+
+**4. Re-cluster the table.**
+
+In Snowflake, set or adjust the clustering key (`ALTER TABLE ... CLUSTER BY (...)`) and let automatic clustering reorganize the data. In BigQuery, recreate the table with proper clustering. Queries slowed down by clustering drift get a big win.
+
+**5. Update statistics.**
+
+Often the answer (Problem 48). Free, fast, sometimes magic.
+
+**6. Cache or summary.**
+
+If the query runs often, build a summary table or materialized view (Problem 53).
+
+**7. Then upsize.**
+
+If none of those work and the job is critical, upsize. But document why, so the next person knows.
+
+### When upsize IS the right answer
+
+Three cases:
+
+1. **One-time big work.** A monthly reconciliation, a quarterly data migration. Two hours of upsized compute is cheaper than two days of optimization.
+2. **The query is already optimal.** You have tried all of the above, plans look clean, the data is just big.
+3. **You are time-pressured.** The dashboard has to render in 5 minutes for the board meeting. Upsize now, optimize Monday.
+
+The honest move in case 3 is to upsize *and* file a ticket for Monday. Otherwise the temporary fix becomes permanent.
+
+### The cost math
+
+```
+Snowflake warehouse sizes (rough)
+─────────────────────────────────
+X-Small → 1 credit/hour
+Small → 2
+Medium → 4
+Large → 8
+X-Large → 16
+2X-Large → 32
+3X-Large → 64
+
+Going from Large to 2X-Large is 4x cost.
+Going from X-Small to 2X-Large is 32x cost.
+```
+
+A query that runs once a day for 30 minutes:
+
+* Large: ~$15 per day = ~$450/month.
+* 2X-Large: ~$60 per day = ~$1,800/month.
+
+If the optimization would have taken 4 hours and saved you the upsize, that's $1,350/month saved for the price of an afternoon's work.
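+
+The same arithmetic as a tiny sketch, to make the hidden assumptions explicit: the per-credit price (here ~$3.75, which varies by edition and contract) and the simplification that the runtime does not shrink after upsizing.
+
+```python
+CREDIT_PRICE_USD = 3.75              # assumed price per credit; check your contract
+CREDITS_PER_HOUR = {"Large": 8, "2X-Large": 32}
+RUNTIME_HOURS = 0.5                  # the query runs once a day for 30 minutes
+DAYS_PER_MONTH = 30
+
+for size, credits in CREDITS_PER_HOUR.items():
+    per_day = credits * RUNTIME_HOURS * CREDIT_PRICE_USD
+    print(f"{size}: ~${per_day:.0f}/day, ~${per_day * DAYS_PER_MONTH:,.0f}/month")
+
+# Large:    ~$15/day, ~$450/month
+# 2X-Large: ~$60/day, ~$1,800/month
+```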
+
+### The cultural side
+
+When a senior engineer suggests something, the worst move is "that's wrong." Instead:
+
+> "That would work. Before we commit to the cost, can we spend an hour checking the plan? If there's a free fix, we save the upsize. If not, we upsize and move on."
+
+You are agreeing to upsize as the fallback. You are also buying time to investigate without making it look like you are blocking them.
+
+If the investigation finds an obvious fix, the senior engineer learns something. If it doesn't, you upsize together and have shared confidence in the decision.
+
+### What I would NOT do
+
+* **Refuse outright.** Sometimes upsize is right.
+* **Spend two days optimizing.** Not worth it for a small win.
+* **Upsize permanently** to make a sporadic problem go away. The cost compounds.
+* **Hide the upsize.** Document it so it gets reviewed.
+
+### Common mistakes interviewers want you to name
+
+1. **Upsize first, ask later.** Triples the bill, often without reason.
+2. **Optimize forever** for a one-time job.
+3. **No follow-up.** Upsize becomes permanent.
+4. **Treating "the senior engineer said so" as proof.** Senior engineers are often right, but not always.
+5. **No comparison.** Without a before/after measurement, you cannot tell if the upsize even helped.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the job is genuinely too big for the smaller warehouse, but we cannot rewrite the logic?"*
+
+Then the upsize is correct, but consider:
+
+* **Use a larger warehouse only for that job.** Snowflake lets you assign warehouses per task. Other jobs stay on the smaller warehouse.
+* **Schedule it off-peak.** A larger warehouse for 30 minutes at 2 AM costs less impact than fighting for slots at 9 AM.
+* **Use a reservation or commitment** if the upsized warehouse becomes a regular pattern. Pre-purchased capacity is typically meaningfully cheaper than on-demand.
+
+Treat the upsize as a tool, not a default.
diff --git a/Problem 55: Partitioning Clustering Materialized Views/question.md b/Problem 55: Partitioning Clustering Materialized Views/question.md
new file mode 100644
index 0000000..bd3cde2
--- /dev/null
+++ b/Problem 55: Partitioning Clustering Materialized Views/question.md
@@ -0,0 +1,29 @@
+## Problem 55: Partitioning, Clustering, Materialized Views
+
+**Scenario:**
+A junior engineer asks you to explain the three main BigQuery cost levers and when to use each. They have heard the words, used them inconsistently, and want a clear picture.
+
+In the interview, the question is:
+
+> Explain how partitioning, clustering, and materialized views each save money in BigQuery, and when each one is the right tool.
+
+This is a "do you actually understand these" question, with the test that your answer can be used by a junior the next day.
+
+---
+
+### Your Task:
+
+1. Explain each one in one sentence.
+2. Show a small example of when each saves money.
+3. Walk through the combinations.
+4. Mention the order to think about them.
+
+---
+
+### What a Good Answer Covers:
+
+* Partitioning prunes whole partitions before scan.
+* Clustering prunes blocks inside a partition.
+* Materialized views compute the aggregate once and reuse it.
+* When two or three work together.
+* Common mistakes.
diff --git a/Problem 55: Partitioning Clustering Materialized Views/solution.md b/Problem 55: Partitioning Clustering Materialized Views/solution.md
new file mode 100644
index 0000000..d00f5cc
--- /dev/null
+++ b/Problem 55: Partitioning Clustering Materialized Views/solution.md
@@ -0,0 +1,149 @@
+## Solution 55: Partitioning, Clustering, Materialized Views
+
+### Short version you can say out loud
+
+> Partitioning lets BigQuery skip whole chunks of the table when your WHERE filters on the partition column. Clustering lets it skip storage blocks inside a partition when your WHERE filters on the cluster columns. Materialized views let it skip the whole computation when you ask for the same aggregate again. They stack: partition first, then cluster, then materialize the heavy aggregates that read this layout. Most cost savings come from getting these three things right.
+
+### One sentence each
+
+* **Partitioning** physically splits the table into pieces (usually by date), so queries that filter on the partition column read only the relevant pieces.
+* **Clustering** sorts rows inside each partition by chosen columns, so queries that filter on those columns skip blocks that cannot match.
+* **Materialized view** stores the result of an expensive aggregation once and updates it as new data arrives, so repeated queries cost pennies instead of scanning terabytes.
+
+### Where they save money
+
+```
+Table: events (8 billion rows, 4 TB)
+Query: SELECT COUNT(*) FROM events WHERE event_date = '2025-05-14';
+
+Without anything                      Cost: 4 TB scanned = expensive
+Partitioned by event_date             Cost: ~1/365 of the table = $X / 365
+Clustered by event_type, after part.  Cost: even smaller for event_type-filtered queries
+Materialized view of daily counts     Cost: <1 GB, pennies
+```
+
+Each layer adds a multiplier of savings, applied to different shapes of query.
+
+### When to use partitioning
+
+Partition by the column that almost every query filters on. For event/log tables that is usually `event_date` or `created_at`. Rule: at most ~4,000 partitions in BigQuery, so do NOT partition by a high-cardinality column like `customer_id`.
+
+```sql
+CREATE TABLE analytics.events (
+ event_date DATE,
+ user_id INT64,
+ event_type STRING,
+ payload JSON
+)
+PARTITION BY event_date;
+```
+
+After partitioning, queries that include `WHERE event_date = ...` prune to one partition. Queries that omit the partition filter still scan the whole table.
+
+### When to use clustering
+
+Cluster on columns that you filter or join on, that are NOT the partition column. Up to four columns. Order matters: most-filtered column first.
+
+```sql
+CREATE TABLE analytics.events (
+ event_date DATE,
+ user_id INT64,
+ event_type STRING,
+ payload JSON
+)
+PARTITION BY event_date
+CLUSTER BY user_id, event_type;
+```
+
+Now `WHERE event_date = '2025-05-14' AND user_id = 1234` is cheap: BigQuery prunes to one partition, then uses block metadata to skip blocks that cannot contain `user_id = 1234`.
+
+Clustering is not an index. It is a sort order plus per-block min/max. So it helps for **equality and range filters** on the clustered columns, and for joins on those columns. It does not help for arbitrary substring searches.
+
+### When to use materialized views
+
+When the same expensive aggregate is asked for repeatedly. Examples:
+
+* Daily revenue by region.
+* Active users per day.
+* Top products by month.
+
+```sql
+CREATE MATERIALIZED VIEW analytics.daily_revenue_by_region AS
+SELECT
+ event_date,
+ region,
+ SUM(amount) AS revenue,
+ COUNT(*) AS orders
+FROM analytics.events
+WHERE event_type = 'purchase'
+GROUP BY event_date, region;
+```
+
+BigQuery maintains the MV automatically as new rows arrive. Queries against the base table that match the MV's pattern are automatically rewritten to read the MV. Queries that only partially match can sometimes still be answered partly from it.
+
+Storage of the MV is usually small (rolled-up data). Cost is the small storage plus the maintenance, which is far less than the repeated full scans.
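+
+A quick way to verify the rewrite actually happens is to run the aggregate query and check bytes billed. A sketch with the Python client, reusing the dataset and column names from the example above:
+
+```python
+from google.cloud import bigquery
+
+client = bigquery.Client()
+
+sql = """
+SELECT event_date, region, SUM(amount) AS revenue
+FROM analytics.events
+WHERE event_type = 'purchase'
+GROUP BY event_date, region
+"""
+
+job = client.query(sql)
+job.result()  # wait for completion
+print(f"bytes billed: {job.total_bytes_billed:,}")
+# Terabytes here means the MV was not used for this query shape;
+# megabytes or a few gigabytes means the rewrite kicked in.
+```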
+
+### When to use which
+
+| Situation | Tool |
+| --------- | ---- |
+| Query always filters on a date column | Partition by that date |
+| Query filters on a high-cardinality column | Cluster on that column |
+| Query joins on a column | Cluster on the join column (both sides) |
+| Same aggregate asked many times | Materialized view |
+| Mix of filtered + aggregated queries on the same table | Partition + cluster + MV |
+
+### The stacking principle
+
+These compose:
+
+* Partition first. Pick the column you filter on in 80 percent of queries.
+* Then cluster. Pick up to four columns you filter or join on next.
+* Then materialize. Only the aggregates that are run often enough to justify the maintenance.
+
+A typical large analytics table ends up with all three:
+
+```sql
+CREATE TABLE analytics.events ...
+PARTITION BY event_date
+CLUSTER BY user_id, event_type;
+
+CREATE MATERIALIZED VIEW analytics.daily_user_counts AS
+SELECT event_date, COUNT(DISTINCT user_id) AS dau
+FROM analytics.events
+GROUP BY event_date;
+```
+
+A query for "active users by day last quarter" hits the MV. Cost: cents.
+A query for "all events for user 1234 last week" hits the partition + cluster. Cost: a few MB scan.
+An ad-hoc query on a column we did not cluster on falls back to the partition. Cost: more, but still bounded to one day.
+
+### Things that look like savings but are not
+
+* **Adding indexes.** BigQuery does not have traditional indexes. Clustering is the closest thing.
+* **Caching.** BigQuery caches query results for 24 hours by default, but only for identical queries. Useful, not a substitute.
+* **Wide pre-joined tables** without thinking. Sometimes save cost, sometimes hurt because rows are bigger.
+
+### Common mistakes interviewers want you to name
+
+1. **Partitioning by a high-cardinality column** like `user_id`. Hits the 4000-partition limit fast.
+2. **Querying without the partition filter.** The partition does nothing.
+3. **Clustering on too many columns.** After the first 2-3, returns diminish, and the order may be wrong for new queries.
+4. **Materialized view without checking it actually gets used.** BigQuery only auto-rewrites if the query matches the MV pattern. Otherwise you pay the MV maintenance for nothing.
+5. **Trying to retrofit on a huge table.** Repartitioning a 50 TB table takes hours. Plan from day one.
+
+### Quick rule of thumb
+
+If your team is asking "should we partition or cluster?" the answer is almost always **both**. They serve different purposes. Materialized views come third, when a specific aggregate justifies the maintenance.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the team doesn't know yet what columns to cluster on?"*
+
+Three steps:
+
+1. Look at INFORMATION_SCHEMA.JOBS_BY_PROJECT for the past 30 days. Find the top 10 queries against that table.
+2. Look at their WHERE clauses and JOIN conditions. The columns that appear most often are your cluster candidates.
+3. Recreate the table with `PARTITION BY date CLUSTER BY top_columns`. Measure scan size before vs after.
+
+Empirical, not guessed. The data tells you the right clustering.
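+
+A rough sketch of steps 1 and 2, assuming the Python BigQuery client, a US-region project, and a crude regex heuristic for "appears in a WHERE or JOIN clause" (good enough to rank candidates, not a SQL parser):
+
+```python
+import re
+from collections import Counter
+from google.cloud import bigquery
+
+client = bigquery.Client()
+
+sql = """
+SELECT query, total_bytes_billed
+FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
+WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
+  AND job_type = 'QUERY'
+  AND query LIKE '%analytics.events%'
+ORDER BY total_bytes_billed DESC
+LIMIT 10
+"""
+
+candidate_cols = ["user_id", "event_type", "region", "session_id"]  # columns you suspect
+counts = Counter()
+for row in client.query(sql).result():
+    text = row.query.lower()
+    for col in candidate_cols:
+        # crude heuristic: the column is mentioned somewhere after WHERE / JOIN / ON / AND
+        if re.search(rf"(where|join|on|and)\s+[^;]*\b{col}\b", text):
+            counts[col] += 1
+
+print(counts.most_common())  # the most-filtered columns are your clustering candidates
+```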
diff --git a/Problem 56: Watermarks in Plain Words/question.md b/Problem 56: Watermarks in Plain Words/question.md
new file mode 100644
index 0000000..efe170a
--- /dev/null
+++ b/Problem 56: Watermarks in Plain Words/question.md
@@ -0,0 +1,27 @@
+## Problem 56: Watermarks in Plain Words
+
+**Scenario:**
+The team is building a streaming pipeline that computes 5-minute windowed aggregates. Late events are common. The pipeline is using Flink, and the engineer setting it up keeps asking how to choose the watermark. They have read the docs but it has not clicked yet.
+
+In the interview, the question is:
+
+> What is a watermark in streaming, in plain words, and what goes wrong when you set it too tight or too loose?
+
+---
+
+### Your Task:
+
+1. Explain a watermark in plain English.
+2. Show what setting it too tight does.
+3. Show what setting it too loose does.
+4. Cover how to choose in practice.
+
+---
+
+### What a Good Answer Covers:
+
+* Event time vs processing time.
+* The watermark as "the system's belief about how complete a window is."
+* Tight watermark: low latency but drops late events.
+* Loose watermark: catches late events but delays output.
+* Allowed lateness and side outputs.
diff --git a/Problem 56: Watermarks in Plain Words/solution.md b/Problem 56: Watermarks in Plain Words/solution.md
new file mode 100644
index 0000000..766b5b1
--- /dev/null
+++ b/Problem 56: Watermarks in Plain Words/solution.md
@@ -0,0 +1,123 @@
+## Solution 56: Watermarks in Plain Words
+
+### Short version you can say out loud
+
+> A watermark is the system saying "I am willing to bet that no events older than this timestamp will arrive from now on." It tells windowed aggregations when they are allowed to close and emit a result. Set it too tight and you close windows before late events arrive — those events get dropped. Set it too loose and the output is correct but late, because you keep waiting for stragglers that may never come. The right value is a guess based on how late events realistically arrive.
+
+### Picture event time vs processing time
+
+```
+Event time : when the event actually happened (in the device, app, etc.)
+Processing time: when the event is processed by your pipeline
+
+In a perfect world they are equal. In real life:
+ - Network delays
+ - Mobile clients with bad connections
+ - Cross-region delivery
+ - Backpressure in Kafka
+
+So events from 12:05:00 might arrive at the processor at 12:05:01 (a 1-second lag),
+or at 12:09:30 (a 4.5-minute lag), or at 13:00:00 (an hour late).
+```
+
+The pipeline wants to compute "events that happened between 12:00 and 12:05." It has to decide when to stop waiting and emit the answer.
+
+### What a watermark actually is
+
+A watermark is a piece of timing metadata. When the processor sees a watermark of `12:08:00`, it is asserting: **"I think I have all events with event time ≤ 12:08:00."**
+
+This belief drives window closure:
+
+* The window `[12:00, 12:05)` closes when the watermark crosses `12:05`.
+* After it closes, any late event for that window is by default dropped.
+
+### Tight watermark: low latency, dropped events
+
+If you set the watermark to `current_event_time - 1 second`, you are saying "all events arrive within 1 second of their event time."
+
+* **Pro**: windows close fast. Result emitted within seconds.
+* **Con**: any event that is slow (mobile retry, network blip) is dropped because it arrives after its window has closed.
+
+In real-life pipelines, even "fast" streams have a tail of events arriving 30-60 seconds late, often more. A tight watermark loses them.
+
+### Loose watermark: correct but slow
+
+If you set the watermark to `current_event_time - 10 minutes`, you are saying "events can arrive up to 10 minutes late."
+
+* **Pro**: late events are still captured.
+* **Con**: results take an extra 10 minutes to emit. A window covering 12:00-12:05 does not produce its first result until around 12:15.
+
+For some use cases (a dashboard, a billing roll-up), this lag is fine. For others (fraud detection, real-time pricing), it is not.
+
+### How to choose
+
+There is no universal value. The right watermark is **a guess at how late events realistically arrive in your system**. Steps:
+
+1. **Measure your tail.** Look at the difference between event time and processing time for a representative sample. Plot the distribution. Pick a percentile that matches your appetite.
+2. **A common starting point**: watermark = current event time - the 99th percentile of lag.
+3. **For most systems**: somewhere between 30 seconds and 5 minutes is typical.
+
+You are explicitly trading off latency for completeness. Most teams converge on "a few minutes" because the cost of a slightly late result is much less than the cost of losing data.
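+
+A small sketch of steps 1 and 2, assuming you have collected per-event lag (processing time minus event time) from a representative sample; the numbers below are illustrative:
+
+```python
+import statistics
+from datetime import timedelta
+
+# Lag in seconds for a sample of recent events; in practice this comes from pipeline metrics.
+lags_seconds = [0.4, 0.9, 1.2, 2.5, 3.1, 4.8, 7.0, 12.0, 31.0, 95.0]
+
+p99 = statistics.quantiles(lags_seconds, n=100)[98]  # 99th percentile of observed lag
+watermark_delay = timedelta(seconds=p99)
+print(f"suggested watermark delay ≈ {watermark_delay}")
+```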
+
+### Allowed lateness
+
+A separate knob from the watermark. The watermark closes the window. Allowed lateness says "even after the window closes, accept events older than the watermark for this much longer, and re-emit the result."
+
+```
+Window [12:00, 12:05)
+Watermark says "complete" at 12:05 + 1 min → window closes, first result emitted
+Allowed lateness = 10 min → updates re-emitted for late events until 12:16
+After 12:16, late events go to "side output" (drop or audit)
+```
+
+This is the production sweet spot: get a fast first answer, refine it as stragglers arrive, eventually finalize.
+
+### Per-source watermarks
+
+Some pipelines have event sources with very different lag profiles. A web event has different lag than a batch upload from a vendor. You can:
+
+* **Per-source watermarks**: each Kafka topic or producer has its own watermark, the processor takes the minimum across all sources.
+* **Idle source handling**: if one source goes idle, don't let it freeze the watermark. Flink has `withIdleness` for this.
+
+### Side outputs for late events
+
+Don't just drop late events. Route them to a "late" sink:
+
+* Useful for diagnosing whether your watermark is too tight.
+* Useful for audits.
+* Useful for compensating updates downstream (re-bill, re-flag).
+
+The cost is tiny; the visibility is great.
+
+### A concrete recipe
+
+For a typical app event stream with the goal "5-minute windows, results within 2-3 minutes, no event loss":
+
+```
+watermark = current event time - 2 minutes
+allowed lateness = 5 minutes
+side output = anything later than that
+
+Result: first emission ~2 minutes after the window ends (once the watermark passes it).
+Updates for the next 5 minutes as stragglers arrive.
+After that, events go to the late sink for analysis.
+```
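+
+A minimal pure-Python sketch of that recipe, not a Flink API example: given the current watermark, each arriving event is either on time, a late update, or routed to the side output. (A real engine also fires the window once when the watermark first crosses the window end; the sketch only shows how a single event is classified.)
+
+```python
+from datetime import datetime, timedelta
+
+ALLOWED_LATENESS = timedelta(minutes=5)
+
+def classify(event_time: datetime, watermark: datetime, window_end: datetime) -> str:
+    """Where does one arriving event go, given the current watermark?"""
+    if watermark < window_end:
+        return "on_time"      # window still open; included in the first result
+    if watermark < window_end + ALLOWED_LATENESS:
+        return "late_update"  # window already fired; re-emit a refined result
+    return "side_output"      # too late; send to the late sink for analysis
+
+# Example: 5-minute window [12:00, 12:05), event from 12:03 arriving at different times
+window_end = datetime(2025, 5, 14, 12, 5)
+event_time = datetime(2025, 5, 14, 12, 3)
+print(classify(event_time, datetime(2025, 5, 14, 12, 4, 30), window_end))  # on_time
+print(classify(event_time, datetime(2025, 5, 14, 12, 7, 0), window_end))   # late_update
+print(classify(event_time, datetime(2025, 5, 14, 12, 30, 0), window_end))  # side_output
+```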
+
+### Common mistakes interviewers want you to name
+
+1. **Watermark = current processing time.** No allowance for lateness. Drops the tail.
+2. **Watermark = current event time - 1 hour.** Output is too slow to be useful.
+3. **No side output for late events.** Silent loss.
+4. **Same watermark across very different sources.** The slowest source dominates.
+5. **Confusing watermark with allowed lateness.** Both knobs exist for a reason.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if you need 'exactly correct' aggregates at the end of the day, but also fast early results?"*
+
+Two-tier approach:
+
+1. **Stream layer** with a moderately tight watermark gives near-real-time results (good enough for dashboards).
+2. **Batch layer** (typically a nightly warehouse job) recomputes the same aggregates over the day's full event log, with effectively infinite "watermark." This is the audited number.
+
+If the two diverge (they will, slightly), the batch one wins for reporting. This is the original "lambda architecture" pattern. It is more operational work but gives you the guarantees both audiences want.
diff --git a/Problem 57: Kafka Ordering Guarantee/question.md b/Problem 57: Kafka Ordering Guarantee/question.md
new file mode 100644
index 0000000..d1e1694
--- /dev/null
+++ b/Problem 57: Kafka Ordering Guarantee/question.md
@@ -0,0 +1,27 @@
+## Problem 57: Kafka Ordering Guarantee
+
+**Scenario:**
+A teammate says Kafka "guarantees message ordering," and your pipeline depends on that. You have noticed messages occasionally appearing out of order downstream. The teammate is sure Kafka could not be at fault.
+
+In the interview, the question is:
+
+> Kafka is said to "guarantee ordering." What does that actually mean, and what can quietly break it in practice?
+
+---
+
+### Your Task:
+
+1. State the exact guarantee.
+2. Explain why it is conditional.
+3. List the realistic ways the guarantee breaks.
+4. Cover what to do when ordering matters.
+
+---
+
+### What a Good Answer Covers:
+
+* Ordering per partition, not per topic.
+* Partitioning key choice.
+* Producer retries and idempotence.
+* Multiple producers, consumer groups, reprocessing.
+* When you actually need global ordering.
diff --git a/Problem 57: Kafka Ordering Guarantee/solution.md b/Problem 57: Kafka Ordering Guarantee/solution.md
new file mode 100644
index 0000000..9fc69da
--- /dev/null
+++ b/Problem 57: Kafka Ordering Guarantee/solution.md
@@ -0,0 +1,138 @@
+## Solution 57: Kafka Ordering Guarantee
+
+### Short version you can say out loud
+
+> Kafka guarantees ordering only within a single partition. Across the whole topic, messages can interleave any way. So the question becomes "do all messages that need to be ordered land in the same partition?" That depends on the partition key. If two related events use different keys, they can land in different partitions and arrive out of order downstream. Even with the right key, producer retries without idempotence can reorder messages, and changing the partition count silently breaks the key-to-partition mapping. Most "Kafka is reordering my data" cases are really "my partition key is wrong."
+
+### The exact guarantee
+
+> "Within a single partition of a single topic, messages are stored in the order they were written, and consumers see them in that order."
+
+That's the entire promise. Notice what is NOT promised:
+
+* Order across partitions.
+* Order across producers.
+* Order after compaction.
+* Order after a topic reshuffle.
+
+### Why ordering is per-partition
+
+Kafka scales by splitting a topic into partitions, each handled by a different broker. If Kafka tried to guarantee global ordering, it could only have one partition, and one broker, and one consumer at a time. The whole reason it scales is that partitions are independent.
+
+So giving up global ordering is the price of scalability. If you need ordering for a set of messages, those messages must land in one partition.
+
+### How to make messages land in the same partition
+
+The producer specifies a **partition key**. Kafka hashes the key and assigns to a partition deterministically. Messages with the same key always go to the same partition.
+
+```python
+producer.send(
+ topic='orders',
+ key=str(order_id).encode(), # ← partition key
+ value=event_payload,
+)
+```
+
+If you key by `order_id`, all events for the same order land in the same partition and stay ordered. Different orders may interleave across partitions, which is fine because they are independent.
+
+The wrong move:
+
+* No key: Kafka picks a partition round-robin. Same-order events can land anywhere.
+* Wrong key: keying by something that varies (`timestamp`, `random_id`) puts related events in different partitions.
+
+The most common bug: keying by `user_id` when the ordering that matters is per `order_id`. If some events for the same order carry a different or missing `user_id` (guest checkout, events emitted by a backend service), they land in different partitions and the per-order ordering is gone.
+
+### What can quietly break ordering within a partition
+
+Even when you key correctly, ordering can still go wrong:
+
+**1. Producer retries without idempotence.**
+
+If the producer sends message A, times out, sends message B, then retries A, the partition can end up with B before A (and sometimes a duplicate A). Enable the idempotent producer (Kafka >= 0.11):
+
+```
+enable.idempotence = true
+acks = all
+```
+
+With idempotence on, Kafka deduplicates retries and preserves order. This is on by default in recent versions.
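+
+A hedged sketch of that configuration with the `confluent-kafka` client (these are librdkafka option names; kafka-python exposes similar settings under different parameter names, and the broker address and payload here are made up):
+
+```python
+from confluent_kafka import Producer
+
+producer = Producer({
+    "bootstrap.servers": "broker:9092",  # assumed address
+    "enable.idempotence": True,          # dedupes retries, preserves per-partition order
+    "acks": "all",                       # required by (and implied with) idempotence
+})
+
+order_id = 12345                          # the entity whose events must stay ordered
+event_payload = b'{"status": "shipped"}'  # example message body
+
+producer.produce("orders", key=str(order_id).encode(), value=event_payload)
+producer.flush()
+```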
+
+**2. `max.in.flight.requests.per.connection > 1` without idempotence.**
+
+The producer sends multiple batches in parallel. If batch 1 fails and is retried, batch 2 lands first. Idempotence fixes this; without it, lower this to 1.
+
+**3. Multiple producers writing to the same partition.**
+
+Each producer's messages are ordered, but Kafka does not order across producers in the same partition. If two services produce events for the same order from different machines, the order in the partition is "first to arrive at the broker." Usually fine, sometimes not.
+
+**4. Topic reshuffling (partition count changes).**
+
+If you increase the number of partitions, the hash mapping changes. New messages with the same key may now land in a different partition. Old messages stay where they were. Consumers see "old key X messages in partition 3, new key X messages in partition 7." Ordering across that boundary is gone.
+
+This is the biggest gotcha. Once you've keyed for ordering, do not change partition counts.
+
+**5. Compacted topics.**
+
+A compacted topic keeps only the latest message per key. Older messages disappear. Ordering of the surviving messages is preserved, but you no longer see the full history.
+
+**6. Consumer reprocessing.**
+
+If a consumer resets and re-reads from an earlier offset, it sees the same messages again in order — but downstream side effects (database writes) may already exist from the first pass. Order is preserved on the wire; the downstream effect order depends on idempotency.
+
+### When you need global ordering
+
+If you truly need global ordering — every event in the topic seen in one true order:
+
+* Use a single-partition topic. Throughput is bounded by one broker.
+* Or, route through a sequencer (a single service that assigns monotonic ids before producing).
+
+This is rare. Most "I need ordering" actually means "I need ordering per-entity," which the right key fixes.
+
+### Production checklist when ordering matters
+
+1. **Key by the entity that needs ordering** (`order_id`, `user_id`, `meter_id`, whatever the unit is).
+2. **Enable idempotent producer.** Default in recent Kafka.
+3. **`acks=all`** so retries don't lose messages.
+4. **Document the partition key** in your data contract.
+5. **Never change partition count** once keyed for ordering. If you must, create a new topic and migrate.
+6. **Build idempotent consumers anyway**, because of dedup needs (Problem 14).
+
+### A diagnostic for the scenario
+
+The teammate is sure Kafka is innocent. To diagnose:
+
+```sql
+-- in your warehouse, after landing
+SELECT event_id, event_time, ingest_time, partition
+FROM raw.events
+WHERE entity_id = '12345'
+ORDER BY ingest_time
+LIMIT 50;
+```
+
+Look at:
+
+* Are messages from the same entity in different partitions? If yes, partition key is wrong.
+* Are messages within one partition out of event-time order? Late delivery, not partition reorder. Check producer settings.
+* Did the topic's partition count change recently? That is the boundary effect.
+
+The answer is almost always in the first or third bullet.
+
+### Common mistakes interviewers want you to name
+
+1. **"Kafka guarantees ordering."** Without "per partition" it is wrong.
+2. **No partition key.** Round-robin scatters related events.
+3. **Wrong partition key.** Order is preserved per the wrong entity.
+4. **Increasing partition count** breaks key-based ordering.
+5. **Trusting wire ordering** without idempotent consumers. Reprocessing replays messages, so downstream effects can be applied twice or out of order.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if you want both high throughput AND strict ordering across the whole topic?"*
+
+You can't, fundamentally. But you can simulate:
+
+* Use a single-partition topic for the messages that need strict order, and a separate high-throughput topic for everything else.
+* Use a "router" service that sequences events through a single point and emits to many partitions with a global sequence number, then a consumer re-sorts using the sequence number.
+
+Both are operational complexity. Usually the right answer is to design the workflow so global ordering is not required, and per-entity ordering (via key) is enough.
diff --git a/Problem 58: Streaming Consumer Lag Diagnosis/question.md b/Problem 58: Streaming Consumer Lag Diagnosis/question.md
new file mode 100644
index 0000000..23697b6
--- /dev/null
+++ b/Problem 58: Streaming Consumer Lag Diagnosis/question.md
@@ -0,0 +1,30 @@
+## Problem 58: Streaming Consumer Lag Diagnosis
+
+**Scenario:**
+A Flink job consumes from Kafka and writes to a sink. The consumer lag has been growing steadily over the last 4 hours. By the time you check, it is 90 minutes behind. The on-call engineer wants to scale the job up immediately.
+
+In the interview, the question is:
+
+> A streaming job's consumer lag keeps growing. Walk through the questions you would ask before touching any code.
+
+This is testing whether you can resist the "scale it up" reflex and find the real cause.
+
+---
+
+### Your Task:
+
+1. List the questions you would ask first.
+2. Explain how to read the lag pattern.
+3. Cover the most common causes.
+4. Decide when scaling is right and when it is the wrong move.
+
+---
+
+### What a Good Answer Covers:
+
+* Is throughput slower than the source rate?
+* Is the source rate suddenly higher than usual?
+* Is the sink slow (back-pressure)?
+* Hot key / skew.
+* Resource constraints inside the job.
+* When scaling helps; when it just hides the problem.
diff --git a/Problem 58: Streaming Consumer Lag Diagnosis/solution.md b/Problem 58: Streaming Consumer Lag Diagnosis/solution.md
new file mode 100644
index 0000000..2e12fb4
--- /dev/null
+++ b/Problem 58: Streaming Consumer Lag Diagnosis/solution.md
@@ -0,0 +1,145 @@
+## Solution 58: Streaming Consumer Lag Diagnosis
+
+### Short version you can say out loud
+
+> Before scaling, I would answer four questions: is the source producing more than usual, is the sink slower than usual, is the processing stage itself stuck on a hot key, and is there a resource limit inside the job. Each of those has a different fix. Scaling the job up helps for one of them and is wasted on the rest.
+
+### The four questions
+
+```
+1. Is the source rate higher than normal? (input side)
+2. Is the sink slower than normal? (output side, back-pressure)
+3. Is the job's processing logic stuck somewhere? (skew, hot key)
+4. Is the job hitting a resource limit? (CPU, memory, network)
+```
+
+I check them in this order because each has a cheaper fix than the next.
+
+### Question 1: source rate
+
+Look at the producer-side metrics. Is the topic's input rate higher than usual?
+
+```
+Producer rate, last 6 hours
+─────────────────────────────
+typical : 5,000 msgs/sec
+now : 28,000 msgs/sec
+```
+
+If yes: the source is producing faster than usual. Probably a marketing campaign, a scheduled batch dump, a new feature went live. The job is doing what it should — it just cannot keep up with the new rate.
+
+**The fix is scaling.** This is the case where the senior engineer's "scale up" is right. But understand the why, because tomorrow you might need to scale back down.
+
+### Question 2: sink rate (back-pressure)
+
+Look at the sink. If the job writes to a database, is the database saturated? If it writes to another Kafka topic, is the downstream consumer keeping up?
+
+Flink's UI shows back-pressure per operator. A red "high back-pressure" on the sink operator means the sink is the bottleneck. The job is processing fast but cannot push fast enough.
+
+If yes: scaling the job up does nothing. The sink can't take more. Fixes:
+
+* Scale the sink (more database connections, bigger instance, partition the write).
+* Batch sink writes (fewer larger inserts instead of many small ones).
+* Asynchronous sink with proper back-pressure (Flink has an async I/O operator).
+
+### Question 3: a stuck stage (hot key or skew)
+
+Flink's UI shows per-task metrics. If one task is 100% busy and the others are idle, you have skew.
+
+```
+Operator: keyed aggregation
+ Task 0: 100% busy (processing user_id="superuser")
+ Task 1: 10% busy
+ Task 2: 9% busy
+ Task 3: 11% busy
+```
+
+One key is dominating. The job cannot scale by adding workers because the work is bottlenecked on one slot.
+
+Fixes:
+
+* Identify the hot key and decide how to handle it (rate-limit, route to a dedicated stream).
+* Use key salting: append a random suffix to the hot key and aggregate in two phases (partial, then final).
+* Some engines mitigate skew automatically for parts of the plan (for example, Spark's adaptive query execution for skewed joins).
+
+### Question 4: resource limits
+
+Look at the cluster metrics for the job:
+
+* CPU at 100% across all workers? Genuinely compute-bound. Scale up.
+* Memory pressure with GC pauses? Increase memory or reduce state size.
+* Network saturated? Probably reading large messages; consider compression.
+
+These call for scaling, but the scaling target depends on the constraint. Adding workers does not help if memory per worker is the issue.
+
+### Reading the lag pattern
+
+The shape of the lag over time tells you which case it is:
+
+* **Lag has been steady, then started growing 4 hours ago.** Something changed at that point. Look at deploys, source-side changes, sink changes. Question 1, 2, or 4.
+* **Lag is growing at a constant rate.** The job's processing rate is just below the source rate. Scale up.
+* **Lag growth is bursty.** Source has spikes the job can't smooth. Either scale to peak, or accept some lag.
+* **Lag is huge on one partition only.** That partition is hot or its keys are skewed. Question 3.
+
+### A diagnostic walkthrough
+
+The scenario: lag is 90 minutes after 4 hours of growth. Steps:
+
+1. Source rate: check producer dashboard. Up 3x for the last 4 hours. Found: a vendor restarted and is replaying their buffered data.
+2. Sink rate: nominal.
+3. Skew: none.
+4. Cluster: CPU at 65%, headroom available.
+
+Conclusion: the source is dumping a replay. The job has room to scale but is not maxed. The lag will recover once the replay finishes. Two options:
+
+* Scale the job up temporarily to catch up faster.
+* Do nothing; the lag will recover naturally in ~6 hours.
+
+Either is correct. Picking depends on whether downstream consumers can tolerate 90-min lag for that long.
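+
+One small calculation worth doing before choosing: how long the backlog takes to drain at current rates. A sketch with illustrative numbers (not taken from the scenario):
+
+```python
+def catch_up_minutes(backlog_msgs: float, process_rate: float, input_rate: float) -> float:
+    """Minutes until the consumer catches up, assuming steady rates."""
+    headroom = process_rate - input_rate  # msgs/sec available to spend on the backlog
+    if headroom <= 0:
+        return float("inf")               # lag keeps growing; scaling or a fix is required
+    return backlog_msgs / headroom / 60
+
+# 90 minutes of backlog at 15,000 msgs/sec input, job processing 18,000 msgs/sec
+backlog = 90 * 60 * 15_000
+print(f"{catch_up_minutes(backlog, process_rate=18_000, input_rate=15_000):.0f} minutes")
+```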
+
+### When scaling is the right call
+
+* The source rate has genuinely increased and will stay there.
+* The job is CPU- or network-bound and has headroom on the cluster.
+* You need to drain a backlog quickly, even if it is temporary.
+
+### When scaling is the wrong call
+
+* Sink is the bottleneck. Scaling the job adds back-pressure without throughput.
+* Hot-key skew. Adding workers does nothing for the slot that is overloaded.
+* Memory-bound. More workers with the same per-worker memory hits the same wall.
+
+The danger of scaling first: it sometimes masks the real problem long enough for it to compound later.
+
+### Tools
+
+* **Kafka lag metric**: the consumer group lag (from `kafka-consumer-groups.sh --describe` or your metrics exporter) is the most direct signal.
+* **Flink UI**: per-operator back-pressure indicator, per-task busy-time, watermark progress.
+* **Cluster metrics**: CPU, memory, network on each worker.
+* **Source-side metrics**: producer rate, message size, errors.
+
+### What I would also do besides scaling
+
+* **Alerts on lag.** Page when lag exceeds N minutes for M minutes. This incident should have woken someone at the 15-minute mark.
+* **Capacity planning.** Know the maximum throughput of the job in messages/sec. Compare against expected peak.
+* **Replay protocol.** When a vendor replays, the consumer side often experiences exactly this. Have a "scale to 2x during expected replays" runbook.
+
+### Common mistakes interviewers want you to name
+
+1. **Scale before diagnosing.** Wastes money and hides the real cause.
+2. **Trust "the job looks fine."** The Flink UI's back-pressure indicator is your friend.
+3. **No alerting on lag.** Always discovered too late.
+4. **Confusing source spike with sink slowdown.** Different fixes.
+5. **Forgetting hot keys.** A single skewed key undoes any scaling.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if scaling does fix it for a few hours, then the lag grows again?"*
+
+That is a sign of a hidden steady-state problem: the source rate is creeping up, or the job's per-message work is growing (state size growing, garbage collection getting worse). Don't keep scaling. Profile the job:
+
+* State size growing unbounded? Add TTL or smaller state.
+* CPU time per message growing? A library update introduced overhead.
+* Network or sink saturated at higher scale?
+
+Scale gives you breathing room. The fix to a creeping problem is upstream of scaling.
diff --git a/Problem 59: Onboarding a New Analyst/question.md b/Problem 59: Onboarding a New Analyst/question.md
new file mode 100644
index 0000000..1eeda90
--- /dev/null
+++ b/Problem 59: Onboarding a New Analyst/question.md
@@ -0,0 +1,27 @@
+## Problem 59: Onboarding a New Analyst
+
+**Scenario:**
+A new analyst joins next Monday. They are technically strong but new to your company. The data team's reputation depends on how fast they can ship trustworthy dashboards. You have one month to set them up so they can be productive without breaking anything.
+
+In the interview, the question is:
+
+> A new analyst joins next week. How do you onboard them so they can ship a dashboard in their first month without breaking anything?
+
+---
+
+### Your Task:
+
+1. Lay out the first 30 days.
+2. Cover what they read, what they shadow, what they build.
+3. Mention the systemic side: what you have already built that makes this faster.
+4. Cover failure modes.
+
+---
+
+### What a Good Answer Covers:
+
+* The first day, week, month progression.
+* Pairing on real work, not just docs.
+* A small protected sandbox before production.
+* Code review on dashboards / models as a learning loop.
+* Metrics layer / source of truth.
diff --git a/Problem 59: Onboarding a New Analyst/solution.md b/Problem 59: Onboarding a New Analyst/solution.md
new file mode 100644
index 0000000..6e81b1a
--- /dev/null
+++ b/Problem 59: Onboarding a New Analyst/solution.md
@@ -0,0 +1,113 @@
+## Solution 59: Onboarding a New Analyst
+
+### Short version you can say out loud
+
+> I would mix three things: read, shadow, and build, in that order. Week one is reading and access setup. Week two is shadowing me on real tickets. Week three is them shipping a small ticket end-to-end with review. Week four is independence. The thing that makes this fast is not the onboarding plan; it is what was built before they arrived: documented sources, a metrics layer, a sandbox they cannot break, and a friendly code review culture. Without those, no onboarding plan saves you.
+
+### The 30-day plan
+
+**Day 1: setup and welcome.**
+
+* Accounts (warehouse, BI tool, GitHub, Slack).
+* Pair on one real, low-stakes dashboard for 90 minutes. They watch, ask anything.
+* End of day: they have run a SELECT, opened a dashboard, and edited a dbt model in their fork.
+
+**Week 1: reading and exploring.**
+
+* Read the "metrics" doc: definitions of revenue, active user, churn, anything the business uses.
+* Read the high-level data flow diagram.
+* Explore the warehouse with a guided tour: which dataset is what, which tables are stable, which to avoid.
+* Daily 30-min check-in with me. They ask questions; I answer or point them at the doc.
+
+**Week 2: shadowing.**
+
+* They sit with me on real tickets. They write the SQL, I review and explain.
+* One small ticket they own end-to-end with heavy pairing.
+* By end of week 2, they have shipped one dashboard change with my help.
+
+**Week 3: independence with safety net.**
+
+* They pick up a small ticket on their own.
+* Every PR they open gets reviewed by me before merge.
+* They start joining standups, asking questions in the team channel themselves.
+
+**Week 4: full ramp.**
+
+* They own a small recurring report or dashboard.
+* PRs reviewed by the team, not just me.
+* By the end of week 4, they can answer the question "where do I find X" without asking.
+
+### What I would NOT do
+
+* **Dump a list of 30 docs to read.** They will skim and remember nothing.
+* **Ship them straight to production with no review.** Trust them, but check the work.
+* **Make them watch me code all month.** Boring and they do not learn by watching alone.
+* **Assume the docs are correct.** They are not. Use the onboarding to fix what you find broken.
+
+### What makes this work: the systemic side
+
+The onboarding plan is the easy part. The hard part is what is already in place:
+
+* **Metrics layer.** A single source for "active user" and "revenue" definitions. The analyst can find it in 30 seconds.
+* **Documented tables.** Every important table has a one-paragraph description. Grain is named explicitly.
+* **A sandbox dataset** they can write to without affecting production.
+* **Friendly code review.** Reviewers explain, not just block.
+* **Slack channel for data questions** where junior questions are welcome.
+
+If any of these are missing, onboarding takes 3x as long because the analyst has to discover everything by asking.
+
+### Pair, don't dump
+
+The cheapest move is a 30-minute pairing session per day for the first two weeks. Those few hours of pairing prevent a week of confusion. It is also where you find broken docs, weird table names, and questions you forgot you had.
+
+### A "first ticket" that always exists
+
+Have a small, real, low-stakes ticket ready for every new analyst:
+
+* "Add a new dimension to the daily revenue dashboard."
+* "Build a simple breakdown of orders by payment type for last 30 days."
+
+Something where the right answer is known, the data exists, the impact is small, and there is no urgency. Their first end-to-end build is a contained success.
+
+### Failure modes
+
+* **They are afraid to ship.** Make the first PR small and pair-reviewed. Get them across the line.
+* **They ask questions in DM.** Redirect to the channel. Helps them, helps everyone.
+* **They get pulled into "urgent" requests too early.** Protect them. Their first month is not for fire-fighting.
+* **They build their own off-warehouse pipeline.** Why? Because they could not find what they needed in the warehouse. That is your bug, not theirs.
+
+### Letting them break things
+
+Some breakage is fine. A bad PR caught in review is a great learning moment. A bad PR that hit production teaches them and the team that we need a better check. Both are healthy if the broken thing was small and recoverable.
+
+What I would not let them do: drop a production table, send a wrong email to customers, push to the `main` branch directly. Those are guarded by access control, not by hoping they remember.
+
+### Tracking success
+
+After 30 days, the bar:
+
+* Can they answer a stakeholder question without my help?
+* Have they shipped at least one production dashboard?
+* Are they comfortable opening PRs?
+* Do they know who to ask for what?
+
+If yes to all four, the onboarding worked. If not, identify which broke and fix the system, not the person.
+
+### Common mistakes interviewers want you to name
+
+1. **Document-heavy onboarding.** Most documents are stale; people learn by doing.
+2. **No buddy.** Lonely week-one analysts disengage fast.
+3. **Throwing them into prod day one.** Builds fear, not skill.
+4. **Not telling them how to ask.** "Ask in the channel, here's the template."
+5. **Skipping the "first ticket."** Without a small win, they feel lost.
+
+### Bonus follow-up the interviewer might throw
+
+> *"How do you scale this when you are hiring an analyst every two weeks?"*
+
+You can't pair-onboard at that rate. Two changes:
+
+1. **An onboarding squad rotation.** A different team member runs the onboarding each cycle. Everyone gets practice; no one is overloaded.
+2. **A self-serve "first week" curriculum.** Pre-built sandbox, pre-built first ticket, pre-recorded walkthroughs for the basics.
+
+The pairing parts are still needed but become 30 minutes a day instead of half a day. The systemic investment pays off as you grow.
diff --git a/Problem 5: Merging Messy CSVs from Multiple Partners/question.md b/Problem 5: Merging Messy CSVs from Multiple Partners/question.md
new file mode 100644
index 0000000..1a93f45
--- /dev/null
+++ b/Problem 5: Merging Messy CSVs from Multiple Partners/question.md
@@ -0,0 +1,91 @@
+## Problem 5: Merging Messy CSVs from Multiple Partners
+
+**Scenario:**
+Every Monday morning, your team receives a folder of CSV files from different partners. Each file contains the same kind of data (customer signups), but every partner names their columns differently. Some files have extra columns you don’t care about, some have missing values, and the date format is never consistent.
+
+Here is what three example files might look like:
+
+```
+# partner_a.csv
+customer_id,full_name,email,signup_date
+201,Alice Lee,alice@a.com,2025-10-01
+202,Bob Khan,bob@a.com,2025-10-02
+```
+
+```
+# partner_b.csv
+CustomerID,Name,Email,SignupDate,Country
+301,Carol Tan,carol@b.com,2025-10-01,SG
+302,,daniel@b.com,2025-10-04,MY
+```
+
+```
+# partner_c.csv
+cust_id,name,email_addr,joined_on
+401,Eve Patel,eve@c.com,01/10/2025
+402,Frank Wu,frank@c.com,02/10/2025
+```
+
+Typical issues you will see:
+
+* Same field has different names (`customer_id`, `CustomerID`, `cust_id`)
+* Date formats differ (`2025-10-01` vs `01/10/2025`)
+* Some files have extra columns (like `Country`) that you don’t need
+* Some rows have missing values
+* The folder may contain hundreds of files
+
+The warehouse team wants a single clean CSV they can load straight into BigQuery.
+
+---
+
+### Your Task:
+
+Write a Python program that:
+
+1. Reads every CSV file inside a folder called `partner_csvs/`.
+2. Maps the different column names into one standard schema:
+
+| Standard column | Possible source names |
+| --------------- | ---------------------------------- |
+| customer_id | customer_id, CustomerID, cust_id |
+| name | name, Name, full_name |
+| email | email, Email, email_addr |
+| signup_date | signup_date, SignupDate, joined_on |
+
+3. Converts `signup_date` to `YYYY-MM-DD`.
+4. Skips rows that are missing `email` or `customer_id`.
+5. Replaces a missing `name` with `"Unknown"`.
+6. Adds a `source_file` column so you can trace which file each row came from.
+7. Writes everything into a single output file called `all_customers.csv`.
+
+**Example Output (all_customers.csv):**
+
+```
+customer_id,name,email,signup_date,source_file
+201,Alice Lee,alice@a.com,2025-10-01,partner_a.csv
+202,Bob Khan,bob@a.com,2025-10-02,partner_a.csv
+301,Carol Tan,carol@b.com,2025-10-01,partner_b.csv
+302,Unknown,daniel@b.com,2025-10-04,partner_b.csv
+401,Eve Patel,eve@c.com,2025-10-01,partner_c.csv
+402,Frank Wu,frank@c.com,2025-10-02,partner_c.csv
+```
+
+---
+
+### Bonus Challenges:
+
+* Print a small summary at the end: how many files were read, total rows in, rows written, rows skipped.
+* Move the column mapping into a small config dict (or YAML file) so a new partner can be added without touching the code.
+* Handle GZIP compressed files (`.csv.gz`) too.
+* Stream the writing so that even with 500 files you never hold everything in memory.
+
+---
+
+💡 **Hints:**
+
+* Use `pathlib.Path.glob` to walk the folder.
+* The `csv.DictReader` and `csv.DictWriter` make column renaming much easier than positional indexes.
+* Build a reverse lookup table from partner column name to standard column name once, then reuse it.
+* Keep the date parsing in its own small function so adding a new format later is easy.
+
+---
diff --git a/Problem 5: Merging Messy CSVs from Multiple Partners/solution.py b/Problem 5: Merging Messy CSVs from Multiple Partners/solution.py
new file mode 100644
index 0000000..737a02a
--- /dev/null
+++ b/Problem 5: Merging Messy CSVs from Multiple Partners/solution.py
@@ -0,0 +1,121 @@
+#!/usr/bin/env python3
+"""
+Partner CSV Merger: combine customer CSVs from many partners
+into a single clean file with a standard schema.
+Author: Amirul Islam
+"""
+
+import csv
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, Iterable, Optional
+
+INPUT_FOLDER = "../data/partner_csvs"
+OUTPUT_FILE = "all_customers.csv"
+
+# Standard column -> possible source names from different partners
+COLUMN_MAP = {
+ "customer_id": ["customer_id", "CustomerID", "cust_id"],
+ "name": ["name", "Name", "full_name"],
+ "email": ["email", "Email", "email_addr"],
+ "signup_date": ["signup_date", "SignupDate", "joined_on"],
+}
+
+DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]
+
+STANDARD_COLUMNS = list(COLUMN_MAP.keys()) + ["source_file"]
+
+
+def build_reverse_map() -> Dict[str, str]:
+ """Map any known partner column name back to our standard column name."""
+ reverse = {}
+ for standard, options in COLUMN_MAP.items():
+ for option in options:
+ reverse[option.lower()] = standard
+ return reverse
+
+
+def parse_date(value: str) -> Optional[str]:
+ """Try a few common date formats and return ISO date if one matches."""
+ for fmt in DATE_FORMATS:
+ try:
+ return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
+ except ValueError:
+ continue
+ return None
+
+
+def normalize_row(
+ row: Dict[str, str],
+ reverse_map: Dict[str, str],
+ source: str,
+) -> Optional[Dict[str, str]]:
+ """Map one raw row to the standard schema. Return None if the row is unusable."""
+ clean = {col: "" for col in STANDARD_COLUMNS}
+
+ for raw_col, value in row.items():
+ if raw_col is None:
+ continue
+ standard = reverse_map.get(raw_col.lower())
+ if standard:
+ clean[standard] = (value or "").strip()
+
+ if not clean["customer_id"] or not clean["email"]:
+ return None
+
+ if not clean["name"]:
+ clean["name"] = "Unknown"
+
+ if clean["signup_date"]:
+ parsed = parse_date(clean["signup_date"])
+ if parsed is None:
+ return None
+ clean["signup_date"] = parsed
+
+ clean["source_file"] = source
+ return clean
+
+
+def iter_csv_files(folder: str) -> Iterable[Path]:
+ return sorted(Path(folder).glob("*.csv"))
+
+
+def merge_partner_csvs(folder: str, output_file: str) -> Dict[str, int]:
+ """Read every CSV in the folder, normalize, and write one combined CSV."""
+ reverse_map = build_reverse_map()
+ stats = {"files": 0, "rows_in": 0, "rows_out": 0, "rows_skipped": 0}
+
+ with open(output_file, "w", newline="") as out:
+ writer = csv.DictWriter(out, fieldnames=STANDARD_COLUMNS)
+ writer.writeheader()
+
+ for path in iter_csv_files(folder):
+ stats["files"] += 1
+ try:
+ with open(path, "r", newline="") as f:
+ reader = csv.DictReader(f)
+ for raw in reader:
+ stats["rows_in"] += 1
+ clean = normalize_row(raw, reverse_map, path.name)
+ if clean is None:
+ stats["rows_skipped"] += 1
+ continue
+ writer.writerow(clean)
+ stats["rows_out"] += 1
+ except FileNotFoundError:
+ continue
+
+ return stats
+
+
+def main():
+ stats = merge_partner_csvs(INPUT_FOLDER, OUTPUT_FILE)
+ print("Merge complete")
+ print(f" files read: {stats['files']}")
+ print(f" rows in: {stats['rows_in']}")
+ print(f" rows written: {stats['rows_out']}")
+ print(f" rows skipped: {stats['rows_skipped']}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/Problem 60: Metric by Tomorrow vs Doing It Right/question.md b/Problem 60: Metric by Tomorrow vs Doing It Right/question.md
new file mode 100644
index 0000000..fac7a26
--- /dev/null
+++ b/Problem 60: Metric by Tomorrow vs Doing It Right/question.md
@@ -0,0 +1,27 @@
+## Problem 60: "I Need This Metric by Tomorrow"
+
+**Scenario:**
+A business stakeholder asks for a new metric "by tomorrow." The metric is non-trivial: it requires a new transformation, agreement on a definition, and a small dashboard. Doing it well takes a week. Doing it "now" means a one-off query in a spreadsheet with no documentation. The stakeholder is impatient.
+
+In the interview, the question is:
+
+> A business stakeholder wants a new metric "by tomorrow." How do you balance moving fast with doing it properly?
+
+---
+
+### Your Task:
+
+1. Show how you would handle the conversation.
+2. Decide what you ship tomorrow and what you ship in the proper version.
+3. Cover how to not let "tomorrow" become forever.
+4. Mention the cultural side.
+
+---
+
+### What a Good Answer Covers:
+
+* Ship a number tomorrow, ship a model next week.
+* Caveats kept short.
+* Re-confirm the definition before shipping.
+* Schedule the proper version.
+* Avoid promising it will never need maintenance.
diff --git a/Problem 60: Metric by Tomorrow vs Doing It Right/solution.md b/Problem 60: Metric by Tomorrow vs Doing It Right/solution.md
new file mode 100644
index 0000000..8a9935e
--- /dev/null
+++ b/Problem 60: Metric by Tomorrow vs Doing It Right/solution.md
@@ -0,0 +1,125 @@
+## Solution 60: "I Need This Metric by Tomorrow"
+
+### Short version you can say out loud
+
+> I ship a number tomorrow and a proper version next week. The tomorrow version is a one-off query, clearly marked as draft, with the caveats I would share in an exec briefing. The proper version goes through the regular metric flow: definition reviewed, model in dbt, tests, dashboard. The stakeholder gets what they need, the team does not accumulate undocumented one-offs. The only thing I refuse to do is ship a one-off and call it done.
+
+### The conversation, tomorrow morning
+
+> "I can have a number for you by end of day. I want to confirm the definition with you in 10 minutes so we are computing the same thing. Then I will share the number with the caveats. Next week, I will productize it: in the dashboard, with proper tests, so it does not need me to rerun it. Sound good?"
+
+Three things in that message:
+
+1. **Commits to today.** Stakeholder gets unblocked.
+2. **Locks the definition.** Cheap conversation now, expensive surprise later if skipped.
+3. **Promises the proper version.** Not "if I get time," not "we'll see." A date.
+
+### Locking the definition before computing
+
+The 10-minute definition conversation is the single most important step. The question I always ask:
+
+> "If we measure this metric tomorrow and the number is X, what would make you say 'that's wrong'?"
+
+Their answer tells me what they actually mean. Often it is not what the requested wording said.
+
+Concrete examples that catch real ambiguity:
+
+* "Active user" — does that include trial users? Bot accounts? Mobile-only?
+* "Revenue" — gross or net of refunds? In what currency? Include subscription pro-rations?
+* "Conversion rate" — over what funnel step? Over what time window?
+
+Settling these before computing saves multiple rounds of "actually, I meant..."
+
+### What I ship tomorrow
+
+A single Slack message or document with:
+
+* The number.
+* One paragraph on how it was computed (named tables, named filters).
+* Two or three caveats that affect interpretation.
+* The note "This is a draft for tomorrow's meeting. Full productized version coming by [date]."
+
+That's it. Not a spreadsheet flying around. A Slack message in their team channel, archived, readable.
+
+Example:
+
+> Adoption of new feature in Q1: **12.4%**.
+>
+> Computed as users who triggered `feature_unlocked` at least once between Jan 1 and Mar 31, divided by all active users in that period. Sources: `events.feature_events`, `marts.daily_active_users`.
+>
+> Caveats:
+> 1. "Active" here means at least one login in the period. Stricter definitions of "active" change the number by 1-2 pp.
+> 2. Users on legacy mobile clients (~3% of base) don't emit the event. The true rate is slightly higher.
+>
+> A productized dashboard with these numbers will be ready by Wednesday.
+
+### What I ship next week
+
+The proper version goes through the regular flow:
+
+1. Add the metric to the metrics layer / dbt project.
+2. Build the dashboard.
+3. Add data tests.
+4. Document the definition in the metrics doc.
+5. Share with the stakeholder; have them confirm the dashboard matches what they expected.
+
+This takes a couple of days, not a heroic week. The hard part was agreeing on the definition, and that is already done.
+
+### Why I would not just ship the one-off and stop
+
+Two real costs of "one-off as final":
+
+1. **Stakeholders ask for the same number again next quarter.** I re-run the one-off and possibly get a slightly different number because the data changed or my SQL was different. Now I look unreliable.
+2. **The one-off has no tests, no documentation, no lineage.** When the source schema changes, my one-off breaks silently and the stakeholder uses a wrong number for a quarter.
+
+The productized version is the protection.
+
+### Avoiding "tomorrow" becoming forever
+
+The "by next week" promise has to be real. Two protections:
+
+* **Put it in the team's planning.** A ticket, a date, an owner. Not a Slack promise.
+* **The stakeholder is in the loop on the proper version.** They will ask if it slips. Their pressure replaces yours.
+
+If you skip both, three months later the stakeholder is still pasting your one-off number into board decks.
+
+### How to handle "tomorrow's tomorrow"
+
+If they need the number again next week and the proper version is not ready, you might be tempted to quietly re-run the one-off and move on. Don't do it quietly. Re-run it, but keep it tied to the same thread and the same deadline:
+
+> "Same number for last week, computed the same way: 14.1%. Productized dashboard with this and weekly refresh is on track for Wednesday."
+
+The interim re-run is documented, attached to the same Slack thread, and the proper version's ETA stays visible.
+
+### The cultural side
+
+The stakeholder is asking "tomorrow" because past experience says data takes too long. If you keep that promise and follow through on the productized version, the next ask is easier:
+
+> "Hey, can we add a new metric? I know it will take a week to do properly."
+
+That is the goal. They have learned that "tomorrow" gets a number and "next week" gets a real thing, and that both are reliable.
+
+### What I would NOT do
+
+* **Refuse the tomorrow ask.** Looks obstructive.
+* **Quietly ship something half-baked** that they then quote as final.
+* **Promise it will be ready tomorrow when it won't.** Tells them I do not estimate well.
+* **Wing the definition.** Ten minutes saves ten rounds of revision.
+
+### Common mistakes interviewers want you to name
+
+1. **No definition lock.** Number ships, stakeholder reads it differently, conflict in two weeks.
+2. **Ship one-off and stop.** Stakeholder bases decisions on un-tested SQL.
+3. **Make them wait a week.** Looks like the data team is slow.
+4. **Skip the caveats.** Avoid friction now, cause bigger friction later.
+5. **No tracking** of pending productizations.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if you also have three other tomorrow asks competing for the same day?"*
+
+That is a prioritization conversation, not a data conversation. Push it to whoever owns the team's priorities. Be transparent:
+
+> "I have three other asks for tomorrow. I can do all four with quick one-off numbers, or two of them properly. Which two are most important?"
+
+Most stakeholders, when forced to triage, drop the lower-priority ones. The ones that survive get done well.
diff --git a/Problem 61: Two Teams Disagree on Active User/question.md b/Problem 61: Two Teams Disagree on Active User/question.md
new file mode 100644
index 0000000..37671ca
--- /dev/null
+++ b/Problem 61: Two Teams Disagree on Active User/question.md
@@ -0,0 +1,29 @@
+## Problem 61: Two Teams Disagree on "Active User"
+
+**Scenario:**
+The product team's dashboard shows 1.2M active users last month. The finance team's deck says 980K. Both are looking at the same company. Each team thinks the other is wrong. The CEO asks the data team to "fix this."
+
+In the interview, the question is:
+
+> Two teams disagree on the definition of "active user." How do you settle it without it becoming a political fight?
+
+This is a "metric ownership" question. The interviewer wants to see how you navigate organizational disagreement.
+
+---
+
+### Your Task:
+
+1. Acknowledge that both teams are usually right by their own definitions.
+2. Walk through how you would investigate.
+3. Propose the resolution.
+4. Cover what happens after.
+
+---
+
+### What a Good Answer Covers:
+
+* The definitions are different on purpose, not by accident.
+* The job is to surface both definitions clearly, not to pick one.
+* Get both teams in one room with the data person.
+* Document the agreement.
+* Build a metrics layer so this stops happening.
diff --git a/Problem 61: Two Teams Disagree on Active User/solution.md b/Problem 61: Two Teams Disagree on Active User/solution.md
new file mode 100644
index 0000000..546c591
--- /dev/null
+++ b/Problem 61: Two Teams Disagree on Active User/solution.md
@@ -0,0 +1,117 @@
+## Solution 61: Two Teams Disagree on "Active User"
+
+### Short version you can say out loud
+
+> The two numbers are almost certainly both right. Each team has a definition that fits their work, and neither knows the other's definition. My job is not to pick a winner. It is to surface both definitions side by side, get them in one room, and help them agree on what each one means and when each one should be used. The longer-term fix is a metrics layer where every metric has a name, an owner, and a definition that everyone can see.
+
+### Step 1: get both numbers and both definitions
+
+Before any meeting, I gather the raw material:
+
+* Product team's query: "users who logged in at least once in the last 30 days."
+* Finance team's query: "users with a paid event in the last 30 days, excluding trial-only accounts and accounts marked as test."
+
+The numbers diverge because the definitions diverge. The product number is bigger because its definition counts a broader set of users.
+
+I write both definitions in plain English and put them side by side:
+
+| Metric | Product's count | Finance's count |
+| ------ | --------------- | --------------- |
+| Users counted | Logged in ≥1 time in the last 30 days | Had a paid event in the last 30 days, AND not trial, AND not a test account |
+| Number | 1,200,000 | 980,000 |
+| Source | `events.logins` | `marts.fact_payments` + `dim_users` filters |
+
+Already visible: the two are not measuring the same thing.
+
+### Step 2: get both teams in one room
+
+Don't email. Don't Slack. Schedule a 30-minute meeting with one representative from each team plus you. Bring the table above and a shared document.
+
+The meeting is not "who is right." It is "what is each one for."
+
+Likely outcome of the conversation:
+
+* Product's definition is right for product engagement.
+* Finance's definition is right for revenue-related reporting.
+* They are not the same metric. They should not be called the same name.
+
+### Step 3: rename and document
+
+In the meeting, propose new names:
+
+* "Monthly Active Users (MAU)" for product's number.
+* "Paying Users" or "Active Paying Customers" for finance's number.
+
+Now there is no ambiguity. The CEO sees 1.2M MAU and 980K paying customers, both true, both useful, different meanings.
+
+Get verbal agreement from both teams in the meeting. Write it up in the doc. Each definition has:
+
+* A name (unique).
+* An owner (the team that defines it).
+* A precise definition.
+* The source query or model.
+* A version date.
+
+### Step 4: implement the agreement
+
+Within a week:
+
+* Both dashboards relabeled to use the new names.
+* Definitions added to the metrics doc / metrics layer.
+* If the warehouse has a metrics layer (dbt metrics, Cube, etc.), encode the metrics there so everyone reads from one source.
+
+The CEO now sees both numbers, both correctly labeled, no confusion.
+
+### What if the teams disagree on which number gets the name "active user"?
+
+This happens. The product team feels "active user" is theirs because they own engagement. Finance feels "active user" is the meaningful business number, so it should mean paying.
+
+The resolution:
+
+* Neither metric gets the bare name "active user." Both get specific names ("MAU" vs "paying users").
+* In any executive dashboard, both numbers are shown with clear labels.
+
+Refusing to give either team the generic name is the cleanest move. It forces precision in conversations.
+
+### Step 5: the systemic fix
+
+This problem happens because metrics are defined in dashboards, not in a shared layer. To stop it from recurring:
+
+* **Metrics layer.** Every important metric is defined once, in code, with tests. Dashboards consume the metric, they don't redefine it.
+* **Metric registry.** A page that lists every metric, its definition, and its owner. New metrics go through a quick review before being added.
+* **PR review for metrics.** When someone adds a new metric, both data team and the business owner sign off.
+
+With this in place, the next "active user" question never happens because there is no "active user" — only "MAU" and "paying users," each with a defined home.
+
+### What I would NOT do
+
+* **Pick a winner without the teams in the room.** Whoever loses feels overruled. Now you have an enemy.
+* **Try to combine the two definitions into one.** Frankenstein metrics please no one.
+* **Just tell the CEO "they're both right."** That is true but unhelpful. They need a resolution.
+* **Spend a quarter building a metrics layer** without addressing this week's question. Ship the rename now; build the layer over the next month.
+
+### The cultural side
+
+The data team gains credibility by being the neutral arbiter. We do not pick teams. We make the data visible and help people see what they are arguing about. After this conversation, both teams trust us a bit more.
+
+A failure mode: getting drawn into being "the team that decides who is right." That role poisons every later disagreement. Stay neutral. Surface, document, codify.
+
+### Common mistakes interviewers want you to name
+
+1. **Telling the CEO one number is right.** Then you have to defend a definition you do not own.
+2. **Letting each team keep their own private definition.** Same metric, two numbers, forever.
+3. **Naming both metrics "active user."** The confusion is in the name.
+4. **No follow-up in code.** Verbal agreement, no metrics layer entry, repeats in six months.
+5. **Treating it as a math problem.** It is a definition problem.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if neither team will budge and the CEO sides with one of them?"*
+
+Then you implement what the CEO decides, but you still create the second metric with a clear name and add it to the dashboard. The losing team's number does not disappear; it is just no longer called "active user." Six months later, the people who needed it will still ask for it under its new name, and the company will use both. Quiet wins.
diff --git a/Problem 62: Postmortem After a Bad Day/question.md b/Problem 62: Postmortem After a Bad Day/question.md
new file mode 100644
index 0000000..317c996
--- /dev/null
+++ b/Problem 62: Postmortem After a Bad Day/question.md
@@ -0,0 +1,27 @@
+## Problem 62: Postmortem After a Bad Day
+
+**Scenario:**
+A data incident happened. Numbers were wrong for a full day. The wrong numbers were quoted by sales, by finance, and in one PR communication. The business is upset. Leadership has asked the data team to run a postmortem and "make sure this never happens again."
+
+In the interview, the question is:
+
+> There has been a data incident, numbers were wrong for a day, and the business is upset. How do you run the postmortem?
+
+---
+
+### Your Task:
+
+1. Describe the structure of the postmortem.
+2. Cover the cultural rules (blameless).
+3. List the sections of the document.
+4. Cover the action items and follow-up.
+
+---
+
+### What a Good Answer Covers:
+
+* Timeline.
+* What went well, what went badly.
+* Root cause, not blame.
+* Action items with owners and dates.
+* "Never again" is unrealistic; "detect faster" is real.
diff --git a/Problem 62: Postmortem After a Bad Day/solution.md b/Problem 62: Postmortem After a Bad Day/solution.md
new file mode 100644
index 0000000..444337f
--- /dev/null
+++ b/Problem 62: Postmortem After a Bad Day/solution.md
@@ -0,0 +1,153 @@
+## Solution 62: Postmortem After a Bad Day
+
+### Short version you can say out loud
+
+> A postmortem has two goals: learn from what happened, and rebuild trust. The structure is timeline, root cause, what went well, what went badly, and action items with owners. The cultural rule is blameless: we never name a person as the cause. The deliverable is a short document, three to five pages, that anyone in the company can read in 15 minutes. The promise we make at the end is not "this will never happen again" — it is "we will detect faster and respond better."
+
+### The structure I would use
+
+```
+Postmortem:
+Date of incident: 2025-05-14
+Duration of bad data: ~22 hours
+Author:
+
+1. Summary (one paragraph)
+2. Impact (numbers wrong, who used them)
+3. Timeline (what happened and when)
+4. Root cause (technical, not blame)
+5. What went well (don't skip this)
+6. What went badly (honestly)
+7. Action items (owners and dates)
+```
+
+### Section by section
+
+**1. Summary.**
+
+One paragraph. What broke, what the impact was, how it was found, when it was fixed.
+
+> On May 14, the daily revenue dashboard reported numbers approximately 18% below actual revenue. The cause was a transform update on May 13 that filtered out a new product line. Detected at 09:40 by a finance review. Fixed and backfilled by 14:20 the same day. Approximately $X of revenue was misreported in three downstream communications.
+
+**2. Impact.**
+
+Concrete. Not "some data was wrong." How much, where, who used it.
+
+* Revenue dashboard wrong by ~18% for one day.
+* Sales used the number in a customer call.
+* Finance shared it in a Tuesday business review.
+* A PR draft cited the lower number; PR team caught it before send.
+
+**3. Timeline.**
+
+Hour by hour, what happened. From the original commit through the fix. Specific timestamps and actions.
+
+```
+May 13, 16:42 PR #2891 merged, adds filter excluding test products.
+ The filter accidentally matches a new real product line.
+May 14, 02:15 Daily ETL runs with new logic; revenue computed wrong.
+May 14, 09:40 Finance analyst flags discrepancy in Slack.
+May 14, 10:05 Data engineer investigates, finds the filter.
+May 14, 11:30 Fix PR opened.
+May 14, 12:15 Fix merged.
+May 14, 13:40 Backfill of May 14 partition complete.
+May 14, 14:20 Dashboard refreshed; downstream consumers notified.
+```
+
+The timeline is the spine. Everything else hangs off it.
+
+**4. Root cause.**
+
+Technical. Specific. No names.
+
+> The filter in the `revenue_clean.sql` model uses `WHERE product_id NOT IN (SELECT id FROM test_products)`. The new product line had been added to `test_products` for QA on May 12 and not yet removed when launched on May 13. The transform inherited the QA flag without revalidating.
+>
+> Contributing factor: no data test verified that the day's row count was within tolerance of the previous week. A check would have flagged the issue at 02:30, seven hours before manual detection.
+
+The contributing factor is as important as the cause. The cause is a single bug; the contributing factor is the missing safety net.
+
+**5. What went well.**
+
+Three to five bullets. Skipping this section is a mistake. It makes the document feel like punishment.
+
+* Finance caught the issue manually and reported it immediately.
+* Root cause was identified within 25 minutes of report.
+* Fix was developed, reviewed, and shipped within 2.5 hours.
+* PR team caught the bad number before the external communication went out.
+* Communication during the incident was clear; stakeholders were updated hourly.
+
+**6. What went badly.**
+
+Honest. Specific. Still no names.
+
+* No automated check flagged the row count anomaly.
+* The dashboard had no "freshness and quality" indicator; consumers could not see it was suspect.
+* Three communications were drafted on the wrong number before detection.
+* The QA-flag-then-launch sequence is not documented as a known risk in the model's README.
+* It took just over two hours from "fix merged" (12:15) to "consumers notified" (14:20). Communication should have started earlier.
+
+**7. Action items.**
+
+The output everyone reads. Each item has an owner, a date, and a measurable target.
+
+| # | Action | Owner | Due |
+| - | ------ | ----- | --- |
+| 1 | Add row-count anomaly test on `revenue_clean` (3σ threshold) | | May 21 |
+| 2 | Add freshness/quality badge to dashboard | | May 28 |
+| 3 | Move `test_products` to a separate "QA isolation" pattern; document in README | | May 30 |
+| 4 | Create incident playbook for "wrong data" with comms checklist | | June 7 |
+| 5 | Pre-launch checklist for new product lines: includes data team review | PM team | June 7 |
+
+Five items. Not 50. Each one ships within four weeks. They are the actual prevention of the next incident.
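+
+To make action item 1 concrete, here is a minimal sketch of the kind of row-count anomaly check it describes, assuming the daily counts have already been pulled from the warehouse; in practice this would live in a dbt test or an orchestrator check task rather than a standalone script:
+
+```python
+# Hedged sketch: 3-sigma row-count anomaly check for a daily table.
+# The trailing counts and threshold are illustrative, not the team's real values.
+import statistics
+
+def row_count_is_anomalous(today: int, trailing: list[int], sigmas: float = 3.0) -> bool:
+    """True if today's count falls outside mean +/- sigmas * stdev of the trailing window."""
+    mean = statistics.mean(trailing)
+    stdev = statistics.stdev(trailing)  # sample standard deviation over the trailing days
+    return abs(today - mean) > sigmas * stdev
+
+trailing_7d = [101_200, 99_800, 100_450, 102_000, 98_900, 100_700, 101_100]
+if row_count_is_anomalous(82_000, trailing_7d):
+    raise ValueError("revenue_clean row count looks anomalous; failing the run")
+```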
+
+### The cultural rules
+
+**Blameless.** Never write "Alice committed PR #2891 without enough review." Write "PR #2891 was merged. The review process did not catch the filter interaction." The mistake belongs to the system, not the person.
+
+This is not about being soft. It is about accuracy. If Alice were not here, the same bug would have happened, because the system did not catch it. The system is the thing to fix.
+
+**Honest, including the slow parts.** Write the actual times. "Two hours from fix merged to consumer notification" is uncomfortable but it is what happened. Owning it earns the right to fix it.
+
+**Specific.** "We should be more careful" is not an action item. "Add automated tests" is closer, but still vague. "Add row-count anomaly test on revenue_clean, threshold 3σ, by May 21" is real.
+
+### The communication around it
+
+The postmortem is shared with:
+
+* The team.
+* The directly affected stakeholders (finance, sales, PR).
+* Leadership.
+
+Optional, but a good idea: a public version (with sensitive details removed) in the data team's wiki. Other teams reading "what we learned" makes the data team look more professional, not less.
+
+### What I would NOT do
+
+* **Promise it will never happen again.** It will. Promise to detect faster.
+* **Add 30 new tests.** Alert fatigue. Add the three that would have caught this.
+* **Name the person who wrote the PR.** Demotivating and inaccurate.
+* **Skip the "what went well."** Looks like punishment, leaves the team demoralized.
+* **Lock up the postmortem.** Sharing builds trust.
+
+### Following up
+
+Two weeks later: a quick check.
+
+* Are the action items on track?
+* Did the new test fire on any partition? (If yes, was it a real catch?)
+* Is the team still feeling the incident, or have we moved on?
+
+Three months later, look at the action items list. Anything still open? Why? Either close it or push it to closure.
+
+### Common mistakes interviewers want you to name
+
+1. **Blame.** Postmortem becomes a thing nobody wants to attend.
+2. **Promises with no actions.** "We'll be careful" disappears.
+3. **Too many action items.** Five focused ones beat 30 vague ones.
+4. **Hiding the document.** Trust comes from transparency.
+5. **No follow-up.** Action items rot. Schedule a check.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the same team has a second similar incident a month later?"*
+
+That is a real conversation. The pattern across two incidents tells you something the first one alone did not. Maybe the test you added did not cover the right thing, or maybe the deploy review process has a structural gap. The second postmortem includes "What did the first postmortem miss?" as a section. This is healthier than pretending it is unrelated. Two of the same kind of incident means a class of problem; that needs a deeper structural fix, not another patch.
diff --git a/Problem 63: Inherited Pipeline No Docs No Tests/question.md b/Problem 63: Inherited Pipeline No Docs No Tests/question.md
new file mode 100644
index 0000000..270cb45
--- /dev/null
+++ b/Problem 63: Inherited Pipeline No Docs No Tests/question.md
@@ -0,0 +1,29 @@
+## Problem 63: Inherited Pipeline With No Docs, No Tests
+
+**Scenario:**
+You join a team and inherit a critical pipeline. The previous owner left. There is no documentation, no tests, and the code is dense. You are expected to keep it running and eventually own it. The pipeline runs nightly and the business depends on it.
+
+This is the more extreme cousin of Problem 32, framed here around your first-month plan.
+
+In the interview, the question is:
+
+> You inherit a pipeline with no documentation, no tests, and the only person who knew it just left. What is your first month?
+
+---
+
+### Your Task:
+
+1. Resist the urge to rewrite.
+2. Walk through how you would learn it.
+3. Cover what you ship in week 1, week 2, week 3, week 4.
+4. Mention what you would say to your manager about expectations.
+
+---
+
+### What a Good Answer Covers:
+
+* Read, run, document.
+* Tests before changes.
+* Smallest useful change first.
+* Setting expectations on velocity.
+* When to think about a rewrite later.
diff --git a/Problem 63: Inherited Pipeline No Docs No Tests/solution.md b/Problem 63: Inherited Pipeline No Docs No Tests/solution.md
new file mode 100644
index 0000000..e4d03d9
--- /dev/null
+++ b/Problem 63: Inherited Pipeline No Docs No Tests/solution.md
@@ -0,0 +1,160 @@
+## Solution 63: Inherited Pipeline With No Docs, No Tests
+
+### Short version you can say out loud
+
+> My first month is read, run, document, test, then change. In week one I read the code and the schedule. In week two I run it in a sandbox and confirm parity with production. In week three I add the smallest meaningful tests, and I write a README as I go. In week four I make the smallest useful change someone has been asking for. I tell my manager up front that the first month is mostly about not breaking things and understanding what I have inherited. If they expect heroic delivery in the first month, that is a different conversation.
+
+### Week 1: read the code and the run history
+
+Tasks:
+
+* Read every Python file or SQL model in the pipeline.
+* Look at the orchestrator graph (Airflow / Dagster) and draw it on paper.
+* Read the last 90 days of run history. Note which tasks fail or slow down.
+* Read the last 90 days of Slack messages about this pipeline.
+* Find the output tables and look at what they produce.
+
+By the end of week 1, I should be able to:
+
+* Draw the data flow on a whiteboard.
+* Name every input source.
+* Name every output table.
+* Roughly explain what the pipeline does to anyone who asks.
+
+I do not change anything yet.
+
+### Week 2: run it in a sandbox
+
+Set up a clone of the pipeline that points to a non-production target. Pick one historical day and run the full pipeline against it. Compare every output table row by row, sum by sum, against the production output for the same day. They should match exactly.
+
+If they don't match, my sandbox setup is wrong (more common) or the pipeline has hidden state (less common). I keep going until parity is real.
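+
+A minimal sketch of what that parity check can look like, assuming a BigQuery warehouse, the `google-cloud-bigquery` and `pandas` libraries, and hypothetical project, table, and column names (the real ones come from the pipeline's outputs):
+
+```python
+# Hedged sketch: compare one historical day of an output table between
+# the production project and the sandbox project.
+import pandas as pd
+from google.cloud import bigquery
+
+client = bigquery.Client()
+
+def load(table: str, day: str) -> pd.DataFrame:
+    sql = f"SELECT * FROM `{table}` WHERE day = '{day}' ORDER BY region"
+    return client.query(sql).to_dataframe()
+
+prod = load("prod-project.marts.daily_revenue", "2025-04-01")
+sandbox = load("sandbox-project.marts.daily_revenue", "2025-04-01")
+
+# Row-by-row comparison; raises with a readable diff on any mismatch.
+pd.testing.assert_frame_equal(
+    prod.reset_index(drop=True),
+    sandbox.reset_index(drop=True),
+    check_like=True,  # ignore column order
+)
+print(f"parity OK: {len(prod)} rows match")
+```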
+
+At the end of week 2:
+
+* I can reproduce a historical day's output.
+* I have a working sandbox to experiment in.
+* I have a feel for which steps are slow, which are fragile, which are clean.
+
+### Week 3: tests and a README
+
+Now I start protecting the pipeline.
+
+Add a small README at the root of the pipeline's code:
+
+```
+# Revenue Rollup Pipeline
+
+## What it does
+One sentence. What goes in, what comes out, on what schedule.
+
+## Inputs
+- raw.events (partitioned by date, written by event-ingest team)
+- ref.products (full table, refreshed daily at 03:00)
+- ref.tariffs (SCD2 dim)
+
+## Outputs
+- marts.daily_revenue (one row per region per day)
+- marts.product_summary (one row per product per week)
+
+## Schedule
+Daily at 04:00 UTC. Backfills allowed for last 14 days.
+
+## Known quirks
+- The is_test_account filter exists because of a 2023 incident.
+ Do not remove.
+- Stage 3 (Athena step) is ~25 min, slowest part. Not yet investigated.
+
+## How to backfill
+dagster job backfill --partition YYYY-MM-DD ...
+```
+
+Then I add the minimum useful tests:
+
+1. **Row count tolerance**: today's row count is within 30% of the trailing 7-day average. Catches "silent zero" days.
+2. **Source freshness**: the input partition for today exists and has >0 rows.
+3. **Output uniqueness**: no duplicate primary keys.
+4. **Source-of-truth reconciliation**: today's sum matches an independent source within 1%.
+
+Four tests. Each prevents a class of incident. None of them touch the business logic.
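+
+As a rough illustration, two of those checks (output uniqueness and source freshness) can be written as plain SQL assertions run right after the nightly job. The table and column names below reuse the README above but are still assumptions about this particular pipeline, and in a dbt project the same checks would be schema tests instead:
+
+```python
+# Hedged sketch: post-run assertions against the warehouse (BigQuery client assumed).
+from google.cloud import bigquery
+
+client = bigquery.Client()
+
+def scalar(sql: str) -> int:
+    return list(client.query(sql).result())[0][0]
+
+# Output uniqueness: no duplicate (region, day) keys in today's partition.
+dupes = scalar("""
+    SELECT COUNT(*) FROM (
+      SELECT region, day
+      FROM marts.daily_revenue
+      WHERE day = CURRENT_DATE()
+      GROUP BY region, day
+      HAVING COUNT(*) > 1
+    ) AS dup
+""")
+assert dupes == 0, f"{dupes} duplicate keys in marts.daily_revenue"
+
+# Source freshness: today's input partition exists and is non-empty.
+rows = scalar("SELECT COUNT(*) FROM raw.events WHERE event_date = CURRENT_DATE()")
+assert rows > 0, "raw.events has no rows for today; upstream ingest is late"
+```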
+
+By end of week 3:
+
+* The README exists.
+* The first round of tests is wired in.
+* My understanding is solid enough to talk about the pipeline confidently.
+
+### Week 4: the first real change
+
+Pick the smallest, most useful change someone has been asking for. Make it. Use the sandbox for parity check. Open a PR with tests. Merge. Watch the next run.
+
+This first change is the proof that I can change the pipeline without breaking it. Once it's shipped, my confidence and the team's confidence in me both go up.
+
+### Setting expectations with my manager
+
+Day one, I would say:
+
+> "I will keep the pipeline running and start learning it deeply in the first month. I expect to ship one meaningful change by end of week four, with tests. Bigger changes or a rewrite, if needed, would come later, after I am sure I understand what I am touching. Is that pace acceptable, or do you need something faster from me in the first month?"
+
+Three properties of this:
+
+* Honest about velocity.
+* Promises a deliverable (the week-four change).
+* Opens space for negotiation.
+
+If they say "we need feature X in week one," it is a conversation, not a fight. Either feature X is simple enough that I can do it, or they need to know the risk.
+
+### When I would consider a rewrite
+
+Not in month one. Not in month three. Maybe in month six, and only if:
+
+* I have shipped multiple small changes safely.
+* The current code blocks a major requirement.
+* The team has bandwidth.
+* I can explain to my manager why it is cheaper to rewrite than to keep patching.
+
+Most pipelines don't need rewriting. They need documentation, tests, and small clear changes.
+
+### The Chesterton's fence rule
+
+Through the whole month, when I find code that looks unnecessary, I do not delete it. I ask "why is this here?" If nobody knows, I leave it and write a comment:
+
+> // This filter was added in 2023 commit abc1234. Reason unclear. Investigated 2025-05; nobody on the team remembers. Leaving in until impact of removal can be measured.
+
+The comment is a placeholder for the next person. It is honest. It does not break anything.
+
+### What I would NOT do
+
+* **Rewrite anything in month one.** Even if it is ugly.
+* **Add 30 tests at once.** Four good ones beat 30 noisy ones.
+* **Promise the moon to my manager.** It will hurt me later.
+* **Restart failed runs without reading the error.** That is how the previous owner got here.
+* **Delete the "weird" code.** Chesterton.
+
+### After the first month
+
+By month two and three:
+
+* I am the owner. Pages come to me.
+* I have shipped 3-5 small improvements.
+* I have added a few more tests, one at a time, each motivated by a near-miss.
+* I have started a "future improvements" list with a rough order.
+
+By month six:
+
+* I know the pipeline as well as anyone ever did.
+* I can talk about rewriting, with evidence.
+* The pipeline is in better shape than when I inherited it.
+
+### Common mistakes interviewers want you to name
+
+1. **Big-bang rewrite.** Highest risk move for the least context.
+2. **No README.** Next person inheriting hits the same wall.
+3. **No sandbox.** All changes go straight to production.
+4. **Removing "obviously dead" code.** Bites you.
+5. **Setting unrealistic expectations.** Disappoints everyone, you most of all.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the pipeline is actively broken when you arrive?"*
+
+Different problem. First, stop the bleeding: identify the immediate failure and patch it minimally to get it running again. Then revert to the month-one plan. The patch you applied is in the README under "known issues to investigate." Do not let the urgency tempt you into a rewrite while you still do not understand the pipeline.
diff --git a/Problem 64: Breaking Change in dbt Model 200 Consumers/question.md b/Problem 64: Breaking Change in dbt Model 200 Consumers/question.md
new file mode 100644
index 0000000..50deafe
--- /dev/null
+++ b/Problem 64: Breaking Change in dbt Model 200 Consumers/question.md
@@ -0,0 +1,27 @@
+## Problem 64: Breaking Change in a dbt Model With 200 Consumers
+
+**Scenario:**
+A column in a core dbt model needs to be renamed. The model has 200 downstream consumers: other dbt models, dashboards, scheduled queries, Reverse ETL jobs. Changing the column directly will break all of them at once. The team has done this before and it was a disaster. You are asked to plan the rollout.
+
+In the interview, the question is:
+
+> How would you safely roll out a change to a dbt model that has 200 downstream consumers?
+
+---
+
+### Your Task:
+
+1. Show the high-level rollout strategy.
+2. Walk through the deprecation window.
+3. Cover communication with consumer teams.
+4. Mention the tests that catch breakage.
+
+---
+
+### What a Good Answer Covers:
+
+* Additive changes are safe; replacements are not.
+* A deprecation window where both old and new exist.
+* Communicating to consumer owners.
+* dbt's `exposures` / dependency graph.
+* A clear "you have to update by" deadline.
diff --git a/Problem 64: Breaking Change in dbt Model 200 Consumers/solution.md b/Problem 64: Breaking Change in dbt Model 200 Consumers/solution.md
new file mode 100644
index 0000000..2e2badf
--- /dev/null
+++ b/Problem 64: Breaking Change in dbt Model 200 Consumers/solution.md
@@ -0,0 +1,184 @@
+## Solution 64: Breaking Change in a dbt Model With 200 Consumers
+
+### Short version you can say out loud
+
+> I treat it like an API change in software. Add the new column first, leave the old one in place, communicate the deprecation with a clear deadline, give consumers time to migrate, then remove the old one. Never remove and rename in the same release. The whole rollout takes 4 to 8 weeks for 200 consumers. The discipline is what saves you, not the cleverness of the change.
+
+### The four phases
+
+```
+Phase 1 Add the new column alongside the old one Week 0
+Phase 2 Communicate deprecation with a deadline Week 0
+Phase 3 Wait while consumers migrate Weeks 1 - 6
+Phase 4 Remove the old column Week 6 - 8
+```
+
+### Phase 1: add, don't replace
+
+The model gets both columns. The new one is populated correctly; the old one is computed for compatibility, possibly as an alias.
+
+```sql
+-- in the dbt model
+SELECT
+ ...
+ amount_cents AS amount_cents, -- new column, the right name
+ amount_cents AS amount, -- old column, deprecated alias
+ ...
+FROM ...
+```
+
+A single dbt run, no consumer breaks. Consumers can use the new name when ready.
+
+If the change is more than a rename — say, a units change from dollars to cents — the alias is more complex but still possible:
+
+```sql
+SELECT
+ ...
+  amount_cents,                     -- new column (integer cents)
+  amount_cents / 100.0 AS amount,   -- old column (float dollars), deprecated alias
+ ...
+```
+
+Both columns are correct, and the old name keeps working. Consumers can move at their pace.
+
+### Phase 2: communicate
+
+Three things happen the day the alias ships:
+
+**1. Mark the column as deprecated in dbt.**
+
+```yaml
+models:
+ - name: fact_orders
+ columns:
+ - name: amount
+ description: |
+ DEPRECATED. Use `amount_cents` instead.
+ Will be removed on 2025-07-01.
+ meta:
+ deprecated: true
+ deprecated_at: '2025-05-15'
+ remove_at: '2025-07-01'
+```
+
+dbt's docs page now shows the deprecation. Anyone hovering over the column sees the message.
+
+**2. Send a clear announcement.**
+
+```
+Subject: `fact_orders.amount` will be removed on July 1, 2025.
+
+Hi all,
+
+We're cleaning up `fact_orders`. Today's release adds `amount_cents`
+(integer cents, the correct unit), and marks the existing
+`amount` (float dollars) as deprecated.
+
+Action you need to take:
+- Switch any code reading `amount` to `amount_cents` before July 1.
+- The old column will be removed on that date.
+- I have generated a list of consumers that reference `amount` and
+ will follow up with the owners individually.
+
+Timeline:
+- May 15: both columns available.
+- June 1: friendly reminder, list of remaining users.
+- June 22: final 10-day warning.
+- July 1: `amount` removed.
+
+Questions: drop them in #data-platform.
+```
+
+Specific, dated, action-oriented.
+
+**3. Identify the consumers and their owners.**
+
+dbt's `exposures` plus column-level lineage tooling (dbt Cloud's lineage view, or catalog tools like Castor, Select Star, Atlan) tell you which models, dashboards, and pipelines read `amount`; plain `dbt ls --select fact_orders+` at least gives you the downstream models. Make a list. Find the owner of each one. Either notify the owner directly or open a small PR / ticket for them.
+
+For 200 consumers, this list will probably reveal:
+
+* 30 dbt models — easy to migrate, possibly one PR.
+* 80 dashboards — usually a quick rename per dashboard.
+* 30 reverse ETL jobs — straightforward config update.
+* 60 ad-hoc reports / scheduled queries / one-off bookmarks — the long tail.
+
+The first three groups migrate in week one. The long tail is what the deprecation window protects.
+
+### Phase 3: wait, nudge, monitor
+
+For four to six weeks, both columns exist. Things to do:
+
+* **Monitor usage.** Run a periodic check of `INFORMATION_SCHEMA.JOBS_BY_PROJECT` (BigQuery) or query history (Snowflake) for references to `amount`. Send a personalized reminder to anyone still using it.
+
+```sql
+SELECT user_email, COUNT(*) AS uses
+FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
+WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
+ AND REGEXP_CONTAINS(query, r'\bamount\b')
+ AND NOT REGEXP_CONTAINS(query, r'amount_cents')
+GROUP BY 1
+ORDER BY 2 DESC;
+```
+
+* **Send the reminders on schedule.** Week 2, week 4, week 5, last week. Each reminder includes the remaining list.
+* **Help the slow migrators.** Open the PR for them if they need help.
+
+### Phase 4: remove
+
+When the deadline arrives:
+
+* Remove the old column from the dbt model.
+* Run a final usage check. If anyone is still using it, decide: extend by a week, or break them with notice. The right call depends on who is still using it.
+* Merge, deploy, watch for breakage.
+
+If you have done the previous phases well, removal day is uneventful. That is the goal.
+
+### Tests that catch breakage
+
+In the dbt project, you can include a schema test that confirms the new column exists and is populated:
+
+```yaml
+- name: fact_orders
+ columns:
+ - name: amount_cents
+ tests:
+ - not_null
+```
+
+For consumers, the protection is theirs to add. Encourage them to add explicit column-existence tests on their critical dashboards.
+
+### What if a senior stakeholder says "we cannot break the dashboard"?
+
+Then you do not break the dashboard. The deprecation window exists for a reason. If after 6 weeks they still cannot migrate, you negotiate: maybe a quick PR you write for them, maybe a 2-week extension, or maybe they accept a brief, scheduled outage.
+
+The leverage is the deadline. Without one, consumers never migrate.
+
+### Why "rename and remove in the same release" fails
+
+Doing it in one shot has two failure modes:
+
+1. **You break consumers all at once.** Dashboards, jobs, reports go red on Monday morning.
+2. **You undo the change in panic.** Now the dbt history has both versions and nobody trusts the change.
+
+The deprecation pattern is annoying because it requires patience, but it is the only safe way at this scale.
+
+### Tools that help
+
+* **dbt's exposures**: declare downstream consumers (dashboards, reports) in the dbt project. Lineage now includes them.
+* **dbt's docs page**: shows deprecation info if you put it in the schema YAML.
+* **Column-level lineage tools** (Castor, SelectStar, Atlan): find every place a column is used.
+* **Query history mining**: a script reading INFORMATION_SCHEMA.JOBS_BY_PROJECT to find references.
+
+### Common mistakes interviewers want you to name
+
+1. **Rename in one PR.** Breaks everyone simultaneously.
+2. **No deadline.** Migrations drag on forever.
+3. **No owner per consumer.** Reminders go nowhere.
+4. **No usage monitoring.** Deadline arrives, surprises everyone.
+5. **Removing without final notice.** Even with notice, the last warning matters.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if you cannot keep the old column as an alias because the new logic is fundamentally different?"*
+
+Then you create a new model, not a new column. `fact_orders_v2`. Both `fact_orders` and `fact_orders_v2` exist for the deprecation window. The communication is the same shape: announce, deadline, migrate, deprecate, remove. The cost is two models in parallel for a few weeks; the safety is that nothing breaks unexpectedly.
diff --git a/Problem 65: 4000 DAG Airflow at 90 Percent CPU/question.md b/Problem 65: 4000 DAG Airflow at 90 Percent CPU/question.md
new file mode 100644
index 0000000..9f24e8a
--- /dev/null
+++ b/Problem 65: 4000 DAG Airflow at 90 Percent CPU/question.md
@@ -0,0 +1,29 @@
+## Problem 65: 4000 DAG Airflow at 90 Percent CPU
+
+**Scenario:**
+Your team's Airflow has grown to 4,000 DAGs over time. The scheduler is running at 90% CPU and missing schedules occasionally. Adding a bigger machine is the obvious fix, but the team has done that twice already and it always grows back.
+
+In the interview, the question is:
+
+> Your team's Airflow has 4000 DAGs and the scheduler is at 90 percent CPU. What is the first thing you would try, before scaling out?
+
+This is the cost-vs-real-fix kind of question.
+
+---
+
+### Your Task:
+
+1. Explain what the scheduler is doing.
+2. Walk through the cheaper fixes.
+3. Mention when scaling out is the right call.
+4. Cover the structural issue.
+
+---
+
+### What a Good Answer Covers:
+
+* DAG parsing as the main scheduler cost.
+* Reducing DAG count or DAG complexity.
+* Dynamic DAGs and parsing performance.
+* min_file_process_interval and other tuning.
+* When you have outgrown a single Airflow.
diff --git a/Problem 65: 4000 DAG Airflow at 90 Percent CPU/solution.md b/Problem 65: 4000 DAG Airflow at 90 Percent CPU/solution.md
new file mode 100644
index 0000000..3d96710
--- /dev/null
+++ b/Problem 65: 4000 DAG Airflow at 90 Percent CPU/solution.md
@@ -0,0 +1,148 @@
+## Solution 65: 4000 DAG Airflow at 90 Percent CPU
+
+### Short version you can say out loud
+
+> Before scaling out, I check what the scheduler is actually busy with. In most overloaded Airflows, the cost is DAG parsing, not task scheduling. So the first wins are tuning the parse interval, simplifying or removing expensive DAGs, and using dynamic DAGs more carefully. If after that the scheduler is still hot, then yes, scale the scheduler or split into multiple environments. But scaling first hides a structural problem that will grow back.
+
+### What the scheduler is actually doing
+
+The Airflow scheduler does two main loops:
+
+1. **Parse every DAG file on disk** to understand the dependencies and schedules.
+2. **Look at every DAG's run state** and decide which tasks should start now.
+
+For most overloaded Airflows, parsing dominates. The scheduler re-parses each DAG file every few seconds (default 30 seconds) to pick up changes. With 4,000 DAG files, that is a lot of Python imports and constructor calls per minute.
+
+Two practical signals:
+
+* `dag_processor_manager.log` shows how long each file takes to parse.
+* `scheduler.heartbeat` lag tells you if the scheduler is keeping up.
+
+If parse time per file is high (multi-second) or total parse cycle is longer than a few minutes, parsing is the problem.
+
+### The first wins (no scaling)
+
+**1. Increase `min_file_process_interval`.**
+
+The default is 30 seconds. If you don't need DAG changes picked up that fast, raise it to 120 seconds or more.
+
+```ini
+[scheduler]
+min_file_process_interval = 120
+```
+
+This single setting can drop scheduler CPU by 50% on large Airflows.
+
+**2. Reduce expensive imports at the top of DAG files.**
+
+Many teams put heavy code at module level:
+
+```python
+# bad: runs every parse
+import pandas as pd
+from some_heavy_library import expensive_thing
+
+big_config = expensive_thing()
+```
+
+This executes on every parse cycle. Move the heavy work inside the operators or the Python callables they run. The parser only needs to know the DAG's shape, not its data.
+
+```python
+# better: parser sees a fast import
+from airflow import DAG
+from airflow.operators.python import PythonOperator
+
+def my_task():
+ import pandas as pd
+ from some_heavy_library import expensive_thing
+ return expensive_thing()
+```
+
+A surprising fraction of overloaded Airflows are fixed by this one habit.
+
+**3. Audit dynamic DAGs.**
+
+A team builds a "dynamic DAG factory" that generates 500 DAGs from a config. Every parse cycle re-reads the config and constructs 500 DAG objects. Expensive.
+
+Fixes:
+
+* Cache the config so it loads once per process, not once per DAG.
+* Reduce the number of dynamically generated DAGs (use task groups within fewer DAGs).
+* Use Airflow's dynamic task mapping for "do the same thing 500 times" patterns, instead of 500 separate DAGs (see the sketch after this list).
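+
+A minimal sketch of that third fix, assuming Airflow 2.4+ with the TaskFlow API; `load_all_partners`, `list_partners`, and the partner names are illustrative:
+
+```python
+# Hedged sketch: one DAG with dynamic task mapping instead of a factory
+# that generates one DAG per partner.
+from datetime import datetime
+from airflow.decorators import dag, task
+
+@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
+def load_all_partners():
+    @task
+    def list_partners() -> list[str]:
+        # Previously: this config drove a loop that emitted ~500 DAG objects per parse.
+        return ["partner_a", "partner_b", "partner_c"]  # ~500 entries in reality
+
+    @task
+    def load_partner(partner: str) -> None:
+        print(f"loading {partner}")  # real extract/load logic goes here
+
+    # One mapped task per partner, expanded at run time instead of parse time.
+    load_partner.expand(partner=list_partners())
+
+load_all_partners()
+```
+
+The scheduler now parses one small DAG file; the fan-out to hundreds of partners happens when the DAG runs, not on every parse cycle.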
+
+**4. Remove dead DAGs.**
+
+Often 10-30% of DAGs are abandoned: nobody uses them, they fail silently, nobody notices. Audit by:
+
+* Which DAGs have not run successfully in 60 days?
+* Which DAGs nobody owns?
+* Which DAGs produce tables nobody queries?
+
+For each abandoned DAG, find an owner or delete it. Deleting 1,000 DAGs cuts parse load proportionally.
+
+**5. Move rarely-run DAGs out of the hot path.**
+
+A DAG that runs once a month doesn't need to be parsed every 30 seconds, but there is no built-in way to vary parse frequency per DAG, and pausing a DAG does not stop its file from being parsed. The realistic options:
+
+* Move it to a separate Airflow deployment if you have one.
+* Hand it to a lighter scheduler (cron, a managed cloud scheduler) so it stops contributing to parse load entirely.
+
+**6. Tune the scheduler's parsing parallelism.**
+
+```ini
+[scheduler]
+parsing_processes = 4   # parallel DAG-file parsing processes (was max_threads before Airflow 2.0)
+```
+
+Tune to the machine. More parsing processes if you have spare CPU; fewer if you don't.
+
+### When scaling out is the right call
+
+After the above:
+
+* If parse cycle is still slow, run multiple scheduler processes (Airflow 2.0+ supports HA scheduler).
+* If task throughput is the bottleneck (you have many tasks queued but workers idle), you might be under-resourced on workers. Scale workers.
+* If the DAG count keeps growing past a point (5,000+), consider splitting into multiple Airflow environments, by team or by domain.
+
+Multi-scheduler is the cheapest of these. Two schedulers on the same metadata DB roughly halve the load.
+
+### The structural issue
+
+Airflow with 4,000 DAGs and growing has outgrown its sweet spot. Some honest questions:
+
+* Are 1,000 of those DAGs really one-line tasks that should be tasks within other DAGs?
+* Should some workflows move to a lighter scheduler (Cloud Workflows, simple cron) and free Airflow for the complex DAGs?
+* Does this team need a self-service "I want to schedule this" tool that doesn't dump every job into Airflow?
+
+This is a longer conversation but worth raising. Otherwise the scale-up cycle continues every six months.
+
+### A migration option: Dagster
+
+If the team is doing a rebuild anyway, Dagster's asset-centric model often handles "many small things" better. You describe the data you want to exist, and Dagster figures out the dependencies. It scales differently than Airflow's DAG-list approach. Not a quick switch, but worth considering when planning the next phase.
+
+### What I would actually do this week
+
+1. Audit parse times per DAG. Identify the top 20 slowest (see the sketch below).
+2. Fix module-level imports in those 20. Quick wins.
+3. Bump `min_file_process_interval` to 120 seconds.
+4. Identify and delete abandoned DAGs.
+5. Add a "DAG count" tracking metric so the team sees growth pressure.
+
+If those don't drop scheduler CPU below 60%, then I would propose scaling (multi-scheduler).
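+
+For step 1, one way to get per-file parse times without digging through scheduler logs is to load the DAG folder locally with `DagBag` (roughly the data `airflow dags report` prints); a sketch, assuming `dags/` is the DAG folder:
+
+```python
+# Hedged sketch: rank DAG files by parse time using DagBag's per-file stats.
+from airflow.models.dagbag import DagBag
+
+bag = DagBag(dag_folder="dags/", include_examples=False)
+slowest = sorted(bag.dagbag_stats, key=lambda s: s.duration, reverse=True)
+
+for stat in slowest[:20]:
+    # Each entry carries the file path, parse duration, and how many DAGs it defines.
+    print(stat.duration, stat.file, f"{stat.dag_num} DAGs")
+```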
+
+### Common mistakes interviewers want you to name
+
+1. **Scale-up first.** Hides the parsing problem.
+2. **No ownership audit.** Abandoned DAGs accumulate.
+3. **Module-level heavy imports.** Single biggest perf trap.
+4. **Aggressive dynamic DAG factories.** Often the entire problem.
+5. **Default `min_file_process_interval`.** Fine for 100 DAGs, painful for 4000.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if your team owns 2,000 of those 4,000 DAGs and the other 2,000 are from another team you can't change?"*
+
+Split Airflows. Two environments. Yours is now 2,000 DAGs and you control the practices. Theirs is their problem. Shared metadata DB is a no, separate clusters with their own DBs is a yes. This also gives you isolation in the failure case: if their team breaks their scheduler, yours keeps running.
+
+This is a non-trivial migration, weeks of work. But it solves the structural issue, not just the symptom.
diff --git a/Problem 6: Partitioning vs Clustering in BigQuery/question.md b/Problem 6: Partitioning vs Clustering in BigQuery/question.md
new file mode 100644
index 0000000..87ac003
--- /dev/null
+++ b/Problem 6: Partitioning vs Clustering in BigQuery/question.md
@@ -0,0 +1,27 @@
+## Problem 6: Partitioning vs Clustering in BigQuery
+
+**Scenario:**
+You join a small analytics team and notice every important query runs against one giant `events` table with about 8 billion rows. The team has heard about partitioning and clustering, but they are not sure which one to use, or whether to use both.
+
+In the interview, the question is short:
+
+> Explain partitioning and clustering in BigQuery in plain words. When would you pick one over the other? When would you use both?
+
+---
+
+### Your Task:
+
+1. Explain in plain English what partitioning is and what clustering is.
+2. Give a real example for each.
+3. Explain when you would pick one, when you would pick the other, and when you would use both.
+4. Mention at least one common mistake people make.
+
+---
+
+### What a Good Answer Covers:
+
+* The physical difference between partitions (separate storage units) and clustering (sort order inside a partition).
+* How partitioning helps **pruning** at scan time.
+* How clustering helps **filtering** and **joins** on high cardinality columns.
+* The "partition first, cluster second" rule of thumb.
+* Cost implications, since BigQuery charges by bytes scanned.
diff --git a/Problem 6: Partitioning vs Clustering in BigQuery/solution.md b/Problem 6: Partitioning vs Clustering in BigQuery/solution.md
new file mode 100644
index 0000000..d8d97af
--- /dev/null
+++ b/Problem 6: Partitioning vs Clustering in BigQuery/solution.md
@@ -0,0 +1,115 @@
+## Solution 6: Partitioning vs Clustering in BigQuery
+
+### Short version you can say out loud
+
+> Partitioning splits a table into separate physical buckets, usually by date. Clustering sorts the rows inside each bucket by one or more columns. Partitioning helps BigQuery skip whole chunks of the table, clustering helps it skip blocks inside a chunk. In practice you partition by a low cardinality column you almost always filter on (like date), and you cluster by the high cardinality columns you filter or join on (like `customer_id` or `event_type`).
+
+### Picture it
+
+```
+Unpartitioned, unclustered table (8B rows)
+┌──────────────────────────────────────────────┐
+│ row row row row row row row row row row … │
+└──────────────────────────────────────────────┘
+A query with WHERE event_date = '2025-05-01' scans ALL of it.
+
+
+Partitioned by event_date
+┌──── 2025-04-30 ────┐ ┌──── 2025-05-01 ────┐ ┌──── 2025-05-02 ────┐
+│ rows for that day │ │ rows for that day │ │ rows for that day │
+└────────────────────┘ └────────────────────┘ └────────────────────┘
+Same query now scans ONE partition.
+
+
+Partitioned by event_date AND clustered by customer_id
+┌──── 2025-05-01 ────────────────────────────────────────────┐
+│ customer_id=1001..1500 block A │
+│ customer_id=1501..2000 block B │
+│ customer_id=2001..2500 block C │
+└─────────────────────────────────────────────────────────────┘
+WHERE event_date = '2025-05-01' AND customer_id = 1750
+→ BigQuery jumps to block B only.
+```
+
+### What each one actually is
+
+**Partitioning** is a physical split. BigQuery stores each partition as a separate unit. When your `WHERE` clause filters on the partition column, BigQuery prunes the other partitions before it even starts scanning. This is the single biggest cost lever in BigQuery, because BigQuery charges by bytes scanned.
+
+You can partition by:
+* Ingestion time (the default, `_PARTITIONTIME`)
+* A date or timestamp column (`event_date`, `created_at`)
+* An integer range (rare, but useful for tenant ids in some setups)
+
+**Clustering** is a sort order. Inside each partition, rows are kept sorted by the cluster columns. BigQuery also keeps lightweight metadata about the value ranges in each storage block. When your `WHERE` filters on a clustered column, BigQuery uses that metadata to skip blocks that cannot match.
+
+You can cluster on up to four columns. Order matters: the first clustering column gives the biggest skip, then the next, and so on.
+
+### Concrete example
+
+```sql
+-- A typical pattern for an events table
+CREATE TABLE analytics.events (
+ event_date DATE,
+ customer_id INT64,
+ event_type STRING,
+ payload JSON
+)
+PARTITION BY event_date
+CLUSTER BY customer_id, event_type;
+```
+
+Query A:
+```sql
+SELECT COUNT(*)
+FROM analytics.events
+WHERE event_date = '2025-05-01';
+```
+BigQuery scans one partition, ignores the other 364.
+
+Query B:
+```sql
+SELECT *
+FROM analytics.events
+WHERE event_date = '2025-05-01'
+ AND customer_id = 1750
+ AND event_type = 'purchase';
+```
+BigQuery scans one partition, then uses the clustering metadata to read only the storage blocks that contain `customer_id = 1750` and inside those, the blocks that contain `event_type = 'purchase'`. The cost can drop by orders of magnitude.
+
+Query C (the trap):
+```sql
+SELECT *
+FROM analytics.events
+WHERE customer_id = 1750;
+```
+No partition filter, so BigQuery scans every partition. Clustering still helps, but the partition prune is gone. This is the most common mistake.
+
+### When to use which
+
+| Situation | What to use |
+| ---------------------------------------------------------- | ----------------------------------------------------- |
+| Almost every query filters on a date | **Partition** by that date column |
+| Queries filter on a high cardinality column (`customer_id`, `device_id`) | **Cluster** on that column |
+| Both of the above | **Partition + cluster** (the most common pattern) |
+| The table is small (under ~1 GB) | Skip both. The overhead is not worth it. |
+| You filter by a low cardinality column with only 2 to 10 values (`country`) | **Cluster**, not partition. Partitioning by something with 5 values is wasteful. |
+
+### Rule of thumb to remember
+
+* **Partition first.** Pick the column you filter on in 80 percent of queries. Usually a date.
+* **Cluster second.** Pick up to four columns you filter or join on, in order of how often they appear in `WHERE` and `JOIN ON`.
+* **Always include the partition filter in your queries.** If you forget, you pay for the full table.
+
+### Common mistakes interviewers want you to name
+
+1. Partitioning by a column that does not appear in most `WHERE` clauses. The table is just split for no benefit.
+2. Forgetting the partition filter in the query. Cost stays the same.
+3. Partitioning by a high cardinality column like `customer_id`. You can hit BigQuery's 4,000 partition limit very fast.
+4. Clustering by too many columns. After four, you get nothing extra, and the order you picked may no longer be the most selective.
+5. Thinking clustering is "an index." It is not. There is no random lookup, only block pruning.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if my queries change over time and the clustering becomes useless?"*
+
+You can change a table's clustering specification in place (for example with `bq update --clustering_fields`), but BigQuery only applies the new order as new data is written and during background re-clustering. For a heavily queried table, you may want to use `CREATE TABLE ... AS SELECT` into a freshly clustered copy and swap names, which guarantees the new order from day one.
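+
+A rough sketch of that copy-and-swap, using the `google-cloud-bigquery` client and the table from the examples above; the new clustering order and dataset names are illustrative, and the rename step is not atomic, so you would schedule it in a quiet window:
+
+```python
+# Hedged sketch: rebuild the events table with a new clustering order, then swap names.
+from google.cloud import bigquery
+
+client = bigquery.Client()
+
+client.query("""
+    CREATE TABLE analytics.events_reclustered
+    PARTITION BY event_date
+    CLUSTER BY event_type, customer_id   -- the new order
+    AS SELECT * FROM analytics.events
+""").result()
+
+client.query("ALTER TABLE analytics.events RENAME TO events_old").result()
+client.query("ALTER TABLE analytics.events_reclustered RENAME TO events").result()
+# Keep events_old around until queries against the new table look healthy.
+```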
diff --git a/Problem 7: ETL vs ELT and Why ELT Won/question.md b/Problem 7: ETL vs ELT and Why ELT Won/question.md
new file mode 100644
index 0000000..347c319
--- /dev/null
+++ b/Problem 7: ETL vs ELT and Why ELT Won/question.md
@@ -0,0 +1,29 @@
+## Problem 7: ETL vs ELT and Why ELT Won
+
+**Scenario:**
+You walk into a team that still runs every nightly transform on a separate ETL server before loading the cleaned data into the warehouse. The server is creaking, scaling it up is expensive, and adding a new transform requires a release. Someone asks the obvious question:
+
+> Why do most modern teams do this the other way around now?
+
+In the interview, this question is short and conversational:
+
+> What is the difference between ETL and ELT, and why has the industry mostly moved to ELT?
+
+---
+
+### Your Task:
+
+1. Explain ETL and ELT in plain English, with one sentence each.
+2. Draw a small diagram of the two flows.
+3. Explain what changed in the world that made ELT possible.
+4. Give two situations where ETL is still the right choice.
+
+---
+
+### What a Good Answer Covers:
+
+* The basic order: where the transform happens.
+* Why cheap, scalable warehouses (BigQuery, Snowflake, Redshift) changed the game.
+* The role of tools like dbt.
+* The trade-offs: data freshness, cost predictability, governance, PII handling.
+* When you would still pick ETL on purpose.
diff --git a/Problem 7: ETL vs ELT and Why ELT Won/solution.md b/Problem 7: ETL vs ELT and Why ELT Won/solution.md
new file mode 100644
index 0000000..5325de0
--- /dev/null
+++ b/Problem 7: ETL vs ELT and Why ELT Won/solution.md
@@ -0,0 +1,75 @@
+## Solution 7: ETL vs ELT and Why ELT Won
+
+### Short version you can say out loud
+
+> ETL means you extract data from the source, transform it on a separate machine, then load the clean result into the warehouse. ELT means you extract the raw data, load it straight into the warehouse first, and let the warehouse do the transform. The switch happened because warehouses got cheap, fast and elastic, so it stopped making sense to maintain a separate transform server.
+
+### The two flows side by side
+
+```
+ ETL (the old way)
+┌────────┐ ┌──────────────────┐ ┌──────────┐
+│ Source │───▶│ Transform server │───▶│Warehouse │
+└────────┘ │ (Python, Spark, │ │(clean) │
+ │ Informatica…) │ └──────────┘
+ └──────────────────┘
+
+ ELT (the modern way)
+┌────────┐ ┌──────────┐ ┌─────────────────────┐
+│ Source │───▶│Warehouse │───▶│ Same warehouse runs │
+└────────┘ │ (raw) │ │ transforms (dbt, │
+ └──────────┘ │ SQL, scheduled) │
+ └─────────────────────┘
+```
+
+### One line definitions
+
+* **ETL**: Extract from source, Transform on a separate engine, Load into warehouse.
+* **ELT**: Extract from source, Load raw into warehouse, Transform inside the warehouse.
+
+### What actually changed
+
+Ten years ago, the warehouse was a precious, expensive box. You did not want to land raw, messy, oversized data inside it because storage cost real money and compute was fixed. So teams ran a transform layer outside the warehouse to make the data small and clean before it ever touched the box.
+
+Then three things changed:
+
+1. **Storage got cheap.** Object storage and columnar warehouses made it cost almost nothing to keep raw data.
+2. **Compute got elastic.** BigQuery, Snowflake, Redshift Serverless, Databricks. You only pay for what you scan or use.
+3. **SQL on warehouses got really powerful.** Window functions, JSON parsing, arrays, geography, machine learning. You can do almost everything the old Python ETL did, in SQL, in the warehouse.
+
+That made it cheaper and simpler to dump raw data in and transform with SQL. dbt arrived and gave teams a way to organize that SQL like real software (tests, dependencies, documentation), and ELT became the default.
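+
+As a tiny illustration of "organize that SQL like real software": a dbt model is just a SELECT in a file, and dbt wires up the dependencies, tests and docs around it (the model and column names here are made up for the example):
+
+```sql
+-- models/marts/daily_revenue.sql   (hypothetical model)
+SELECT
+  DATE(created_at) AS order_date,
+  SUM(amount)      AS revenue
+FROM {{ ref('stg_orders') }}         -- dbt resolves this to the upstream staging model
+GROUP BY 1
+```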
+
+### Why teams prefer ELT today
+
+| Reason | What it means in practice |
+| ------------------------------- | ------------------------------------------------------------------ |
+| Raw data is preserved | You can always rebuild a transform if the business logic changes |
+| Transforms are just SQL | Analysts can read and change them, not just engineers |
+| One platform, one bill | No separate Spark cluster to babysit |
+| Easy to version with dbt | Tests, lineage and docs out of the box |
+| Faster iteration | A new column is a pull request, not a deploy |
+
+### When ETL is still the right call
+
+ELT is the default, but it is not always correct.
+
+1. **Sensitive data that must never enter the warehouse.** Card numbers, raw health records, regulated PII. You strip or tokenize these *before* loading. The transform has to happen outside the warehouse.
+2. **Huge volumes of useless raw data.** If 90 percent of the source is junk you will never query, you can save real money by filtering at ingest. For example, dropping debug log lines before loading.
+3. **Strict compliance boundaries.** Some regulators want only the cleaned, approved version of the data in the analytical store. Raw stays in the source system.
+4. **Real time enrichment.** Stream processing (Flink, Kafka Streams) transforms on the fly because waiting for a batch warehouse run is too slow.
+
+### The hybrid that most teams actually run
+
+In real life, it is rarely pure ELT. Most teams use:
+
+* **Light ETL at ingest** for PII masking and obvious filtering (a small sketch follows this list).
+* **ELT inside the warehouse** for business logic, joins, aggregates and modeling.
+* **Stream processing on the side** for the few use cases that need sub-minute latency.
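+
+A minimal sketch of that "light ETL at ingest" step, hashing an identifier so the raw value never lands in the warehouse (BigQuery-style functions, illustrative names):
+
+```sql
+SELECT
+  order_id,
+  TO_HEX(SHA256(LOWER(email))) AS email_hash,   -- raw email is dropped before loading
+  amount,
+  created_at
+FROM external_orders_feed;
+```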
+
+If the interviewer pushes you, this is the honest answer: "we say ELT, we mean ELT for the analytical layer, but we still do some transforms before landing for privacy and cost reasons."
+
+### Bonus follow-up the interviewer might throw
+
+> *"If raw data is in the warehouse, isn't that a security risk?"*
+
+Yes. That is why mature ELT setups land raw data into a **restricted dataset** with row and column level access controls. Only a small group can query the raw layer. The downstream "clean" and "marts" layers are what most analysts touch. Tokenization or hashing happens at the ingest step for the truly sensitive fields, so they never appear in plaintext anywhere in the warehouse.
diff --git a/Problem 8: OLTP vs OLAP/question.md b/Problem 8: OLTP vs OLAP/question.md
new file mode 100644
index 0000000..ae68816
--- /dev/null
+++ b/Problem 8: OLTP vs OLAP/question.md
@@ -0,0 +1,26 @@
+## Problem 8: OLTP vs OLAP
+
+**Scenario:**
+An analyst on your team writes a SQL query against the production Postgres database. The query is simple: count orders by day for the last year. It takes 90 seconds and the on-call engineer pings you because it briefly slowed down the checkout API. You explain that this kind of query does not belong on the production database. They ask you why.
+
+In the interview, the question is:
+
+> What is OLTP, what is OLAP, and why does the same query feel fast in one and slow in the other?
+
+---
+
+### Your Task:
+
+1. Define OLTP and OLAP in one sentence each.
+2. Explain how the underlying storage and query engine differ.
+3. Give an example of a typical OLTP query and a typical OLAP query.
+4. Explain why running the wrong query on the wrong system is bad for both speed and the business.
+
+---
+
+### What a Good Answer Covers:
+
+* Row-oriented vs columnar storage.
+* Indexes and point lookups vs scans and aggregates.
+* Transactions and ACID guarantees vs append-only analytical workloads.
+* Why we have a separate warehouse at all.
diff --git a/Problem 8: OLTP vs OLAP/solution.md b/Problem 8: OLTP vs OLAP/solution.md
new file mode 100644
index 0000000..a2234b7
--- /dev/null
+++ b/Problem 8: OLTP vs OLAP/solution.md
@@ -0,0 +1,89 @@
+## Solution 8: OLTP vs OLAP
+
+### Short version you can say out loud
+
+> OLTP is the database behind the app. It is built to handle many small reads and writes very fast, like "fetch this user" or "insert this order." OLAP is the database behind reporting. It is built to scan huge amounts of data and compute aggregates, like "revenue by region by month." They store data in completely different ways, which is why the same query can take 50ms in one and 50 seconds in the other.
+
+### Picture it
+
+```
+ OLTP OLAP
+ (Postgres, MySQL, (BigQuery, Snowflake,
+ SQL Server, Oracle) Redshift, ClickHouse)
+
+ Row-oriented storage Column-oriented storage
+
+ ┌──────────────────┐ ┌─────┬──────┬─────┬──────┐
+ │ id│name│email│… │ │ id │ name │email│ … │
+ │ 1 │A │a@..│.. │ ├─────┼──────┼─────┼──────┤
+ │ 2 │B │b@..│.. │ │ 1 │ A │ a@. │ .. │
+ │ 3 │C │c@..│.. │ │ 2 │ B │ b@. │ .. │
+ └──────────────────┘ │ 3 │ C │ c@. │ .. │
+ └─────┴──────┴─────┴──────┘
+ Whole row stored together. Each column stored together.
+
+ Best at: Best at:
+ - "Give me row id=42" - "SUM(amount) by month"
+ - "Insert this new order" - "COUNT DISTINCT customer"
+ - Transactional updates - Reading 5 columns out of 100
+```
+
+### What each one is, in one paragraph
+
+**OLTP (Online Transaction Processing)** powers the running application. It handles thousands of tiny operations per second: insert this order, update this balance, fetch this user. The data is stored row by row, so it can quickly grab or update a complete record. It uses indexes (usually B-tree) for fast lookups by id, and it enforces ACID transactions so the business never sees a half-applied change.
+
+**OLAP (Online Analytical Processing)** powers reporting, dashboards and analytics. It handles a small number of huge queries: total revenue by country last year, daily active users for the last 12 months. The data is stored column by column, so when you select 3 columns out of 100, it only reads those 3. It is optimized for big scans and aggregates, not for updating one row at a time.
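+
+A pair of example queries makes the contrast concrete (table and column names are illustrative, and dialects vary slightly):
+
+```sql
+-- Typical OLTP work: touch one row at a time, thousands of times per second
+SELECT * FROM users WHERE id = 42;
+UPDATE accounts SET balance = balance - 25 WHERE id = 42;
+
+-- Typical OLAP work: scan millions of rows, return a small summary
+SELECT country, SUM(amount) AS revenue
+FROM orders
+WHERE created_at >= '2024-01-01'
+GROUP BY country;
+```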
+
+### The same data, two storage formats
+
+Imagine an `orders` table with columns: `id, customer_id, status, amount, country, created_at`.
+
+A query like `SELECT SUM(amount) FROM orders WHERE country = 'SG'` does this:
+
+* On an **OLTP row store**: read every row on disk, all six columns of each row, even though we only care about two of them. Slow.
+* On an **OLAP column store**: read only the `country` and `amount` columns. The other four never touch the disk. Fast.
+
+Now flip it. A query like `SELECT * FROM orders WHERE id = 12345`:
+
+* On an **OLTP row store**: the B-tree index jumps straight to the row, one disk seek. Done in milliseconds.
+* On an **OLAP column store**: there is no row index. The engine may scan a partition or block range. Easily 1000x slower than the OLTP version.
+
+So neither one is "better." They are tuned for opposite workloads.
+
+### Side by side
+
+| Aspect | OLTP | OLAP |
+| --------------------- | ----------------------------------- | ------------------------------------- |
+| Typical user | The application | Analysts, BI tools, ML pipelines |
+| Typical query | Read or update one row | Aggregate millions of rows |
+| Storage layout | Row-oriented | Column-oriented |
+| Indexes | Many (primary key, foreign keys) | Few or none (uses partition + cluster)|
+| Updates and deletes | Frequent | Rare, often batch |
+| Concurrency | Thousands of small transactions | Few large queries |
+| ACID                  | Strict                              | Varies; tuned for bulk loads, not row-level transactions |
+| Data freshness | Real time | Minutes to hours behind |
+| Examples | Postgres, MySQL, SQL Server, DynamoDB | BigQuery, Snowflake, Redshift, ClickHouse |
+
+### Why we keep them separate
+
+The story at the top of the problem is the answer: if you run a long analytical query on the production database, you slow down the app for real users. Even when it does not visibly slow the app, it evicts the database's cache, hurting every other small query. The whole reason warehouses exist is so analytical work has its own home.
+
+A typical company therefore runs:
+
+```
+ ┌─────────┐ sync ┌──────────┐
+ │ OLTP DB │ ──────────▶ │ OLAP DW │
+ │(Postgres│ (CDC, daily │(BigQuery │
+ │ app DB)│ batch, │ Snowflake│
+ └─────────┘ Fivetran) └──────────┘
+ ▲ ▲
+ │ │
+ The app reads Analysts and
+ and writes here. dashboards read here.
+```
+
+### Bonus follow-up the interviewer might throw
+
+> *"What about HTAP, like CockroachDB or TiDB? Aren't they both?"*
+
+HTAP (Hybrid Transactional and Analytical Processing) systems try to do both in one box, usually by keeping a row store for writes and a column store replica for analytics inside the same cluster. They work well for moderate scale and reduce operational complexity, but for very large analytical workloads most companies still keep a dedicated warehouse, because a purpose built engine like BigQuery or Snowflake is still cheaper and faster at that job.
diff --git a/Problem 9: Idempotency in Data Pipelines/question.md b/Problem 9: Idempotency in Data Pipelines/question.md
new file mode 100644
index 0000000..1a9c48e
--- /dev/null
+++ b/Problem 9: Idempotency in Data Pipelines/question.md
@@ -0,0 +1,27 @@
+## Problem 9: Idempotency in Data Pipelines
+
+**Scenario:**
+A scheduled job that loads daily orders into your warehouse failed at 2 AM, retried automatically, and finished successfully on the second attempt. The next morning, the revenue dashboard shows yesterday's number is exactly double. The team is confused. You explain that the job was not idempotent.
+
+In the interview, the question is:
+
+> What does idempotency mean for a data pipeline, and why do interviewers ask about it so often?
+
+---
+
+### Your Task:
+
+1. Define idempotency in plain English.
+2. Show a small example of a non-idempotent pipeline and the same pipeline made idempotent.
+3. Explain three real patterns to make a pipeline idempotent.
+4. Explain why this matters more in batch pipelines than people think.
+
+---
+
+### What a Good Answer Covers:
+
+* The link between idempotency and retries.
+* The MERGE / UPSERT pattern.
+* The "delete and reinsert by partition" pattern.
+* The idempotency key pattern from APIs.
+* Why "the job succeeded" is not enough.
diff --git a/Problem 9: Idempotency in Data Pipelines/solution.md b/Problem 9: Idempotency in Data Pipelines/solution.md
new file mode 100644
index 0000000..61739a7
--- /dev/null
+++ b/Problem 9: Idempotency in Data Pipelines/solution.md
@@ -0,0 +1,127 @@
+## Solution 9: Idempotency in Data Pipelines
+
+### Short version you can say out loud
+
+> Idempotent means you can run the same job twice and the result is the same as running it once. In a data pipeline, this matters because retries are normal. Network blips, crashed workers, slow APIs, all of those cause a step to run again. If the pipeline is not idempotent, retrying it corrupts your data. The whole reason interviewers ask about this is that production systems retry constantly, and any pipeline that cannot survive a retry will eventually produce a doubled or missing number.
+
+### The story above, drawn out
+
+```
+Non idempotent (the bug)
+─────────────────────────
+Run 1: INSERT INTO orders SELECT * FROM source WHERE date = '2025-05-14';
+ ✗ Crashes halfway through.
+Retry: INSERT INTO orders SELECT * FROM source WHERE date = '2025-05-14';
+ ✓ Success. But the first INSERT had already written rows.
+Result: Some rows appear twice. Yesterday's revenue is doubled.
+
+
+Idempotent (the fix)
+────────────────────
+Run 1: DELETE FROM orders WHERE date = '2025-05-14';
+ INSERT INTO orders SELECT * FROM source WHERE date = '2025-05-14';
+ ✗ Crashes halfway through.
+Retry: DELETE FROM orders WHERE date = '2025-05-14';
+ INSERT INTO orders SELECT * FROM source WHERE date = '2025-05-14';
+ ✓ Success. The DELETE wipes anything from the bad run.
+Result: Exactly one set of rows for that date, every time.
+```
+
+### What "idempotent" means in this context
+
+If `f(x)` is your pipeline step and you run it once, you get a certain state in the warehouse. If you then run it again with the same input, the warehouse state should be unchanged. `f(f(x)) == f(x)`. That is the whole definition.
+
+Notice it is about the **end state**, not the side effects along the way. The second run can do work (queries, writes), but the data that ends up in the destination is the same as if it had only run once.
+
+### Three patterns that make a step idempotent
+
+**1. MERGE / UPSERT by a stable key**
+
+Every row has a stable primary key. Instead of `INSERT`, you `MERGE` (or `INSERT ... ON CONFLICT UPDATE`):
+
+```sql
+MERGE INTO orders AS target
+USING staging AS source
+ON target.order_id = source.order_id
+WHEN MATCHED THEN UPDATE SET ...
+WHEN NOT MATCHED THEN INSERT (...);
+```
+
+Running this twice does not duplicate rows. The second run finds the same keys and updates in place. Works well in BigQuery, Snowflake, Postgres, almost everywhere.
+
+**2. Delete and reinsert by partition**
+
+For partitioned tables, treat the partition as the unit of work:
+
+```sql
+DELETE FROM orders WHERE event_date = @target_date;
+INSERT INTO orders
+SELECT * FROM source WHERE event_date = @target_date;
+```
+
+The job "owns" that day. Whatever was there from a previous attempt is wiped first. Safe to retry as many times as you want. This is the most common pattern in daily batch pipelines.
+
+In BigQuery you can make this even cleaner by overwriting the single partition in one shot: write the day's result to the partition decorator (`orders$20250514`) with a query or load job set to replace the destination, for example via the `bq` CLI:
+
+```
+bq query --use_legacy_sql=false --replace \
+  --destination_table 'my_dataset.orders$20250514' \
+  'SELECT * FROM my_dataset.source WHERE event_date = "2025-05-14"'
+```
+
+**3. Idempotency key**
+
+Common in streaming and API style pipelines. Every event carries a unique id (`event_id`, `message_id`, or a hash of its content). On insert, you skip rows where the id already exists.
+
+```sql
+INSERT INTO events (event_id, ...)
+SELECT event_id, ...
+FROM staging s
+WHERE NOT EXISTS (
+ SELECT 1 FROM events e WHERE e.event_id = s.event_id
+);
+```
+
+Or with a unique constraint, you let the database reject the duplicate.
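+
+With a unique constraint, the Postgres-style version of the same idea is a one-liner (a sketch, assuming `event_id` is declared UNIQUE):
+
+```sql
+INSERT INTO events (event_id, payload, created_at)
+SELECT event_id, payload, created_at
+FROM staging
+ON CONFLICT (event_id) DO NOTHING;   -- rows from a retried run are silently skipped
+```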
+
+### A non-obvious pattern: idempotent file writes
+
+Writing to S3 or GCS feels safe, but it is easy to get wrong. If your job appends a file with a timestamp name and crashes between writing the file and committing the metadata, a retry creates a second file. Two patterns fix this:
+
+* **Deterministic file names.** `orders/date=2025-05-14/part-001.parquet`. A retry overwrites the same file.
+* **Manifest files.** Write all data files first, then write a tiny `_SUCCESS` or `manifest.json` last. Downstream readers ignore everything that is not in the manifest. A crashed run leaves orphan files that a cleanup step removes later.
+
+### Side effects that are quietly NOT idempotent
+
+These are the ones that bite you in real life:
+
+* **Sending notifications or emails.** A retried job that sends "your bill is ready" can send it three times.
+* **Calling third party APIs that have their own state.** A retried Stripe charge can charge the customer twice.
+* **Auto incrementing surrogate keys.** A retry may produce a row with a new key, so the "same row" looks different.
+* **Appending to a CSV.** If your job does `>>` append instead of write-then-rename, retries grow the file.
+
+For these, you usually need an idempotency key passed to the external system, or a "did I already do this" check before doing it.
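+
+For the "did I already do this" check, one common shape is a small run ledger that the job writes to exactly once per logical run. A Postgres-style sketch (table and names are illustrative):
+
+```sql
+-- inserts one row the first time, zero rows on any retry
+INSERT INTO job_runs (job_name, run_date)
+SELECT 'send_invoices', DATE '2025-05-14'
+WHERE NOT EXISTS (
+  SELECT 1 FROM job_runs
+  WHERE job_name = 'send_invoices' AND run_date = DATE '2025-05-14'
+);
+-- only fire the emails or API calls if this INSERT actually wrote a row
+```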
+
+### Why this matters more than people expect
+
+Three reasons:
+
+1. **Retries are not exceptional, they are normal.** Airflow, Dagster, every orchestrator retries on failure by default. A 1 percent flake rate means every long pipeline retries something every day.
+2. **The data lies quietly when it goes wrong.** A doubled INSERT does not throw an error. Numbers just look "a bit high." You may not notice for weeks.
+3. **Backfills depend on it.** When the business says "rerun the last 30 days," you can only do it safely if every step is idempotent. Otherwise you have to manually clean up first, every time.
+
+### Rule of thumb
+
+> "I should be able to rerun this task for the same partition at any time, and the table looks the same after."
+
+If you cannot say that, the task is not done.
+
+### Bonus follow-up the interviewer might throw
+
+> *"What if the upstream data changes between runs? Then the same partition gives a different answer."*
+
+Good question. There are two stances:
+
+1. **Source of truth wins.** The destination always reflects the latest upstream. This is fine for reporting where you want the freshest picture.
+2. **Snapshot wins.** The destination is frozen as of the first run. You record what was true at that moment. This is what you need for audit, billing, or regulatory data.
+
+Pick one consciously, and document it. Most bugs in this area come from the team being unsure which one they wanted.
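+
+If you pick the snapshot stance, one minimal way to encode it is to load a partition only if it has never been loaded before, and record when that happened (a sketch, illustrative names):
+
+```sql
+INSERT INTO orders_snapshot
+SELECT s.*, CURRENT_TIMESTAMP AS loaded_at     -- freeze what was true at first load
+FROM source s
+WHERE s.event_date = '2025-05-14'
+  AND NOT EXISTS (
+    SELECT 1 FROM orders_snapshot t
+    WHERE t.event_date = '2025-05-14'          -- already loaded once: do nothing
+  );
+```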