scoopeng/Realm

Realm Data Tooling & Intelligence Hub

Central operations hub for all REALM data work — from MongoDB extraction through public data enrichment to Scoop domain intelligence development.

What This Repo Does

  1. Public Data ETL Framework (data/etl/) — Download, transform, validate, and load 12 external data sources into Scoop on automated cadences
  2. MongoDB Export Utility (root, Java) — Two-phase MongoDB to CSV export with intelligent field discovery and relationship expansion
  3. Scoop Semantics Extraction (scoop/) — Repeatable script to pull dataset definitions and domain intelligence configs from Scoop production

Public Data Sources (12 live)

All sources feed into Scoop workspace W17008 and power the realm_agent_report investigation (v3.1 — 23 probes, 13 datasets, 7 sections).

| Source | Cadence | Rows | What It Adds |
|---|---|---|---|
| Redfin | Weekly | 5.6M | Housing market metrics by region |
| IRS SOI | Annual | 223K | ZIP-level affluence and income brackets |
| FHFA HPI | Quarterly | 123K | House price appreciation, state + MSA |
| FRED | Monthly | 134 | National macro: mortgage rates, CPI, unemployment |
| Census ACS | Annual | 34K | ZCTA demographics: income, education, homeownership |
| BEA | Quarterly | 28K | County-level income components |
| HMDA | Annual | 5.2M | Luxury mortgage lending ($500K+) |
| SEC EDGAR | Quarterly | 52K | Form 4 insider transactions ($1M+) |
| FBI Crime | Annual | 651 | State-level crime rates |
| Geocoding | Quarterly | 21K | Listing coordinates (Census batch geocoder) |
| FEMA Flood | Quarterly | 9.8K | Per-listing flood zone risk |
| NCES Schools | Quarterly | 9.8K | Per-listing school district |

Quick Start

ETL (primary workflow)

# List configured sources
./realm-etl list

# Run all sources (API keys needed for FRED, Census, BEA)
FRED_API_KEY=... CENSUS_API_KEY=... BEA_API_KEY=... ./realm-etl run --all --force --load

# Run a single source
./realm-etl run --source fhfa-hpi --load

# Scheduled mode (daily cron — checks cadences, skips if not due)
./realm-etl run --scheduled --load
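
The scheduled mode above is described as checking each source's cadence and skipping it when it is not yet due. A minimal sketch of that kind of check follows. The function name `is_due`, the `CADENCE_INTERVALS` mapping, and the day counts are assumptions made for illustration and are not taken from the framework; data/etl/README.md documents the actual cadence behavior and state semantics.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed minimum interval per cadence label (hypothetical values;
# the real framework may define cadences differently).
CADENCE_INTERVALS = {
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),
    "quarterly": timedelta(days=90),
    "annual": timedelta(days=365),
}


def is_due(cadence: str, last_run: Optional[datetime], now: datetime) -> bool:
    """Return True when a source has never run or its interval has elapsed."""
    if last_run is None:
        return True
    return now - last_run >= CADENCE_INTERVALS[cadence.lower()]


if __name__ == "__main__":
    now = datetime(2026, 3, 1)
    # Last run 4 days ago on a weekly cadence: not due yet.
    print(is_due("Weekly", datetime(2026, 2, 25), now))
    # Last run 120 days ago on a quarterly cadence: due.
    print(is_due("Quarterly", datetime(2025, 11, 1), now))
```

Under these assumptions, a daily cron job can call the check for every source and run only those that return True, which is consistent with the "skips if not due" behavior noted above.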

MongoDB Export

# Discover fields (Phase 1)
./gradlew discover -Pcollection=listings

# Export using config (Phase 2)
./gradlew configExport -Pcollection=listings
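
The two phases above (discover fields, then export using a config) can be illustrated with a small sketch. This is written in Python for brevity and is not the Java utility's code. The helper names `flatten`, `discover_fields`, and `export_csv`, the dotted-path convention, and the sample documents are all hypothetical, and relationship expansion is not covered.

```python
import csv
import io
from typing import Any, Dict, Iterable, List


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted field paths, e.g. address.city."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def discover_fields(docs: Iterable[Dict[str, Any]]) -> List[str]:
    """Phase 1: collect every dotted field path seen across the documents."""
    fields = set()
    for doc in docs:
        fields.update(flatten(doc))
    return sorted(fields)


def export_csv(docs: Iterable[Dict[str, Any]], fields: List[str]) -> str:
    """Phase 2: write one CSV row per document, using only the chosen fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # fields outside the config are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    for doc in docs:
        writer.writerow(flatten(doc))
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample documents, for illustration only.
    sample = [
        {"_id": 1, "price": 500000, "address": {"city": "Aspen"}},
        {"_id": 2, "price": 750000, "agent": "a-17"},
    ]
    print(discover_fields(sample))
    print(export_csv(sample, ["_id", "price", "address.city"]))
```

Separating discovery from export lets the discovered field list be reviewed and trimmed before any CSV is written, which is presumably the role of the files under config/.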

Scoop Semantics Extraction

cd scoop && ./extract-realm-semantics.sh    # VPN required

Key Documentation

| Document | What It Covers |
|---|---|
| CLAUDE.md | Master project reference: architecture, all source configs, investigation contexts, current status |
| REALM_INVESTIGATION_STRATEGY.md | Strategic document (v2.0): luxury intelligence platform vision, corridor-first execution, monetization |
| data/DATASET_CATALOG.md | All 15 datasets: schemas, join keys, query examples, cross-dataset matrix |
| data/etl/README.md | ETL operator guide: commands, flags, cadence behavior, state semantics |
| Julie_Public_Data_Update_20260225.md | Stakeholder update on all 12 public data sources |

Project Structure

Realm/
├── realm-etl                    # ETL CLI entry point
├── data/
│   ├── etl/                     # Framework: orchestrator, loader, config, state
│   │   ├── sources.json         # Master config (all 12 sources)
│   │   └── README.md            # Operator guide
│   ├── redfin/source.py         # Source modules (one per data source)
│   ├── irs-soi/source.py
│   ├── fhfa-hpi/source.py
│   ├── fred/source.py
│   ├── census-acs/source.py
│   ├── bea/source.py
│   ├── hmda/source.py
│   ├── sec-edgar/source.py
│   ├── fbi-crime/source.py
│   ├── geocoding/source.py
│   ├── fema-flood/source.py
│   ├── nces-schools/source.py
│   └── DATASET_CATALOG.md       # Schema reference for all datasets
├── src/main/java/               # MongoDB export utility (Java)
├── scoop/                       # Semantics extraction scripts + snapshots
├── config/                      # MongoDB field configs
├── output/                      # Exported CSVs
└── archive/                     # Historical session docs
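
The layout above places one source.py in each source directory under data/. A minimal sketch of how that convention could be enumerated follows. The function name `find_source_modules` is hypothetical, and the actual framework may instead read its source list from data/etl/sources.json.

```python
from pathlib import Path
from typing import List
import tempfile


def find_source_modules(data_dir: Path) -> List[str]:
    """Return sorted names of subdirectories that contain a source.py file."""
    if not data_dir.is_dir():
        return []
    return sorted(
        child.name
        for child in data_dir.iterdir()
        if child.is_dir() and (child / "source.py").is_file()
    )


if __name__ == "__main__":
    # Build a throwaway tree that mimics part of the layout, then scan it.
    root = Path(tempfile.mkdtemp())
    for name in ("fred", "bea"):
        (root / name).mkdir()
        (root / name / "source.py").write_text("")
    (root / "etl").mkdir()  # framework directory without a source.py
    print(find_source_modules(root))
```

Directories such as etl/, which hold framework code rather than a source module, are skipped because they contain no source.py.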

Technical Notes

  • Python stdlib only — no pip dependencies in the ETL framework
  • Scoop CLI bulkload is the default load method (not S3+Lambda)
  • VPN required for all Scoop operations (mobileAPI, bulkload, investigations)
  • Java 21 + Gradle for MongoDB export utility
  • API keys via env vars: FRED_API_KEY, CENSUS_API_KEY, BEA_API_KEY
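
Since the API keys are read from environment variables, a pre-flight check can catch a missing key before a run starts. The sketch below uses only the standard library, in keeping with the stdlib-only note above. The function name `missing_api_keys` is hypothetical and is not part of the realm-etl CLI.

```python
import os
from typing import List, Mapping

# Environment variable names listed in the Technical Notes.
REQUIRED_KEYS = ("FRED_API_KEY", "CENSUS_API_KEY", "BEA_API_KEY")


def missing_api_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required key names that are unset or blank in env."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys: " + ", ".join(missing))
    else:
        print("All API keys are set.")
```

Passing the environment as a mapping keeps the check easy to test with a plain dictionary.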
