Central operations hub for all REALM data work — from MongoDB extraction through public data enrichment to Scoop domain intelligence development.
- Public Data ETL Framework (`data/etl/`) — Download, transform, validate, and load 12 external data sources into Scoop on automated cadences
- MongoDB Export Utility (root, Java) — Two-phase MongoDB-to-CSV export with intelligent field discovery and relationship expansion
- Scoop Semantics Extraction (`scoop/`) — Repeatable script to pull dataset definitions and domain intelligence configs from Scoop production
All sources feed into Scoop workspace W17008 and power the `realm_agent_report` investigation (v3.1 — 23 probes, 13 datasets, 7 sections).
| Source | Cadence | Rows | What It Adds |
|---|---|---|---|
| Redfin | Weekly | 5.6M | Housing market metrics by region |
| IRS SOI | Annual | 223K | ZIP-level affluence and income brackets |
| FHFA HPI | Quarterly | 123K | House price appreciation, state + MSA |
| FRED | Monthly | 134 | National macro: mortgage rates, CPI, unemployment |
| Census ACS | Annual | 34K | ZCTA demographics: income, education, homeownership |
| BEA | Quarterly | 28K | County-level income components |
| HMDA | Annual | 5.2M | Luxury mortgage lending ($500K+) |
| SEC EDGAR | Quarterly | 52K | Form 4 insider transactions ($1M+) |
| FBI Crime | Annual | 651 | State-level crime rates |
| Geocoding | Quarterly | 21K | Listing coordinates (Census batch geocoder) |
| FEMA Flood | Quarterly | 9.8K | Per-listing flood zone risk |
| NCES Schools | Quarterly | 9.8K | Per-listing school district |
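The Cadence column is what scheduled mode acts on: a source only runs when its interval has elapsed since the last successful run. The check can be sketched in a few lines of stdlib Python. The interval values and the state format (a mapping of source name to ISO last-run timestamp) are illustrative assumptions, not the framework's actual internals:

```python
import datetime as dt

# Hypothetical cadence intervals; the framework's actual thresholds may differ.
CADENCE_DAYS = {"weekly": 7, "monthly": 30, "quarterly": 91, "annual": 365}

def is_due(source: str, cadence: str, state: dict, now: dt.datetime) -> bool:
    """True if `source` has never run or its cadence interval has elapsed."""
    last_run = state.get(source)
    if last_run is None:
        return True  # never run -> always due
    elapsed = now - dt.datetime.fromisoformat(last_run)
    return elapsed >= dt.timedelta(days=CADENCE_DAYS[cadence])

# Example: FHFA HPI (quarterly) last ran 100 days ago -> due again.
state = {"fhfa-hpi": "2026-01-01T00:00:00"}
now = dt.datetime(2026, 4, 11)
print(is_due("fhfa-hpi", "quarterly", state, now))  # True (100 days >= 91)
print(is_due("redfin", "weekly", state, now))       # True (never ran)
```

A daily cron invoking `run --scheduled --load` then becomes safe to fire every day: sources that are not yet due are simply skipped.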
```bash
# List configured sources
./realm-etl list

# Run all sources (API keys needed for FRED, Census, BEA)
FRED_API_KEY=... CENSUS_API_KEY=... BEA_API_KEY=... ./realm-etl run --all --force --load

# Run a single source
./realm-etl run --source fhfa-hpi --load

# Scheduled mode (daily cron — checks cadences, skips if not due)
./realm-etl run --scheduled --load
```

```bash
# Discover fields (Phase 1)
./gradlew discover -Pcollection=listings

# Export using config (Phase 2)
./gradlew configExport -Pcollection=listings
```

```bash
cd scoop && ./extract-realm-semantics.sh  # VPN required
```

| Document | What It Covers |
|---|---|
| `CLAUDE.md` | Master project reference — architecture, all source configs, investigation contexts, current status |
| `REALM_INVESTIGATION_STRATEGY.md` | Strategic document (v2.0) — luxury intelligence platform vision, corridor-first execution, monetization |
| `data/DATASET_CATALOG.md` | All 15 datasets: schemas, join keys, query examples, cross-dataset matrix |
| `data/etl/README.md` | ETL operator guide: commands, flags, cadence behavior, state semantics |
| `Julie_Public_Data_Update_20260225.md` | Stakeholder update on all 12 public data sources |
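Phase 1 of the MongoDB export (`./gradlew discover`) infers the field set before any CSV is written. The core idea is to flatten sampled documents into dotted field paths and union them across the sample. The sketch below illustrates that concept in Python; it is not the Java utility's actual code, and the sample documents are invented:

```python
def flatten_fields(doc: dict, prefix: str = "") -> set:
    """Recursively collect dotted field paths from a nested document."""
    fields = set()
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            fields |= flatten_fields(value, path)  # recurse into subdocuments
        else:
            fields.add(path)
    return fields

def discover(sample_docs: list) -> list:
    """Union the field paths seen across a sample of documents."""
    fields = set()
    for doc in sample_docs:
        fields |= flatten_fields(doc)
    return sorted(fields)

sample = [
    {"_id": 1, "address": {"city": "Aspen", "zip": "81611"}, "price": 5_200_000},
    {"_id": 2, "address": {"city": "Vail"}, "agent": "J. Doe"},
]
print(discover(sample))
# ['_id', 'address.city', 'address.zip', 'agent', 'price']
```

Unioning across documents matters because MongoDB collections are schemaless: no single document is guaranteed to carry every field. Phase 2 then exports against the discovered (and hand-edited) field config.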
```
Realm/
├── realm-etl                    # ETL CLI entry point
├── data/
│   ├── etl/                     # Framework: orchestrator, loader, config, state
│   │   ├── sources.json         # Master config (all 12 sources)
│   │   └── README.md            # Operator guide
│   ├── redfin/source.py         # Source modules (one per data source)
│   ├── irs-soi/source.py
│   ├── fhfa-hpi/source.py
│   ├── fred/source.py
│   ├── census-acs/source.py
│   ├── bea/source.py
│   ├── hmda/source.py
│   ├── sec-edgar/source.py
│   ├── fbi-crime/source.py
│   ├── geocoding/source.py
│   ├── fema-flood/source.py
│   ├── nces-schools/source.py
│   └── DATASET_CATALOG.md       # Schema reference for all datasets
├── src/main/java/               # MongoDB export utility (Java)
├── scoop/                       # Semantics extraction scripts + snapshots
├── config/                      # MongoDB field configs
├── output/                      # Exported CSVs
└── archive/                     # Historical session docs
```
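The one-module-per-source layout above suggests a simple dispatch pattern: the orchestrator reads `sources.json` and routes each configured source to its `source.py`. A stdlib-only sketch of the config-parsing half, with an invented config shape — the real `sources.json` schema is documented in `data/etl/README.md`, not here:

```python
import json

# Illustrative config shaped like the repo layout; NOT the real sources.json schema.
EXAMPLE_CONFIG = """
{
  "sources": [
    {"name": "fhfa-hpi", "cadence": "quarterly", "module": "data/fhfa-hpi/source.py"},
    {"name": "fred",     "cadence": "monthly",   "module": "data/fred/source.py"}
  ]
}
"""

def load_sources(raw: str) -> dict:
    """Parse the master config into a {source_name: settings} map."""
    return {entry["name"]: entry for entry in json.loads(raw)["sources"]}

sources = load_sources(EXAMPLE_CONFIG)
print(sources["fred"]["cadence"])  # monthly
```

Keyed by name, the map supports both `run --source fhfa-hpi` (single lookup) and `run --all` (iterate everything) without special-casing either path.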
- Python stdlib only — no pip dependencies in the ETL framework
- Scoop CLI bulkload is the default load method (not S3+Lambda)
- VPN required for all Scoop operations (mobileAPI, bulkload, investigations)
- Java 21 + Gradle for MongoDB export utility
- API keys via env vars: `FRED_API_KEY`, `CENSUS_API_KEY`, `BEA_API_KEY`
- Remote: https://github.com/scoopeng/Realm
- Branch: master
- GitHub account: `scoopeng` (HTTPS only, no SSH)
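Because the three API keys arrive via environment variables, a run touching FRED, Census, or BEA can fail fast with a clear message instead of dying mid-download. A minimal sketch of such a preflight check — the actual CLI's behavior and source names are not guaranteed to match:

```python
import os

# Which sources need which key (illustrative mapping).
REQUIRED_KEYS = {
    "fred": "FRED_API_KEY",
    "census-acs": "CENSUS_API_KEY",
    "bea": "BEA_API_KEY",
}

def missing_keys(sources: list) -> list:
    """Return env var names required by `sources` but absent from the environment."""
    return [var for src, var in REQUIRED_KEYS.items()
            if src in sources and not os.environ.get(var)]

os.environ["FRED_API_KEY"] = "demo"
os.environ.pop("BEA_API_KEY", None)
print(missing_keys(["fred", "bea"]))  # ['BEA_API_KEY']
```

Checking before any network I/O keeps a cron-driven `--scheduled` run from partially completing and leaving state inconsistent just because one key was unset.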