Central operations hub for all REALM data work — from MongoDB extraction through public data enrichment to Scoop domain intelligence development.
- Public Data ETL Framework (`data/etl/`) — Download, transform, validate, and load 12 external data sources into Scoop on automated cadences
- MongoDB Export Utility (root, Java) — Two-phase MongoDB-to-CSV export with intelligent field discovery and relationship expansion
- Scoop Semantics Extraction (`scoop/`) — Repeatable script to pull dataset definitions and domain intelligence configs from Scoop production
All sources feed into Scoop workspace W17008 and power the `realm_agent_report` investigation (v3.1 — 23 probes, 13 datasets, 7 sections).
| Source | Cadence | Rows | What It Adds |
|---|---|---|---|
| Redfin | Weekly | 5.6M | Housing market metrics by region |
| IRS SOI | Annual | 223K | ZIP-level affluence and income brackets |
| FHFA HPI | Quarterly | 123K | House price appreciation, state + MSA |
| FRED | Monthly | 134 | National macro: mortgage rates, CPI, unemployment |
| Census ACS | Annual | 34K | ZCTA demographics: income, education, homeownership |
| BEA | Quarterly | 28K | County-level income components |
| HMDA | Annual | 5.2M | Luxury mortgage lending ($500K+) |
| SEC EDGAR | Quarterly | 52K | Form 4 insider transactions ($1M+) |
| FBI Crime | Annual | 651 | State-level crime rates |
| Geocoding | Quarterly | 21K | Listing coordinates (Census batch geocoder) |
| FEMA Flood | Quarterly | 9.8K | Per-listing flood zone risk |
| NCES Schools | Quarterly | 9.8K | Per-listing school district |
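The Cadence column is what scheduled mode acts on: a source only runs when its interval has elapsed since the last successful run. The check can be sketched in a few lines of stdlib Python. The interval values and the state format (a mapping of source name to ISO last-run timestamp) are illustrative assumptions, not the framework's actual internals:

```python
import datetime as dt

# Hypothetical cadence intervals; the framework's actual thresholds may differ.
CADENCE_DAYS = {"weekly": 7, "monthly": 30, "quarterly": 91, "annual": 365}

def is_due(source: str, cadence: str, state: dict, now: dt.datetime) -> bool:
    """True if `source` has never run or its cadence interval has elapsed."""
    last_run = state.get(source)
    if last_run is None:
        return True  # never run -> always due
    elapsed = now - dt.datetime.fromisoformat(last_run)
    return elapsed >= dt.timedelta(days=CADENCE_DAYS[cadence])

# Example: FHFA HPI (quarterly) last ran 100 days ago -> due again.
state = {"fhfa-hpi": "2026-01-01T00:00:00"}
now = dt.datetime(2026, 4, 11)
print(is_due("fhfa-hpi", "quarterly", state, now))  # True (100 days >= 91)
print(is_due("redfin", "weekly", state, now))       # True (never ran)
```

A daily cron invoking `run --scheduled --load` then becomes safe to fire every day: sources that are not yet due are simply skipped.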
```bash
# List configured sources
./realm-etl list

# Run all sources (API keys needed for FRED, Census, BEA)
FRED_API_KEY=... CENSUS_API_KEY=... BEA_API_KEY=... ./realm-etl run --all --force --load

# Run a single source
./realm-etl run --source fhfa-hpi --load

# Scheduled mode (daily cron — checks cadences, skips if not due)
./realm-etl run --scheduled --load
```

```bash
# Discover fields (Phase 1)
./gradlew discover -Pcollection=listings

# Export using config (Phase 2)
./gradlew configExport -Pcollection=listings
```

```bash
cd scoop && ./extract-realm-semantics.sh  # VPN required
```

| Document | What It Covers |
|---|---|
| `CLAUDE.md` | Master project reference — architecture, all source configs, investigation contexts, current status |
| `REALM_INVESTIGATION_STRATEGY.md` | Strategic document (v2.0) — luxury intelligence platform vision, corridor-first execution, monetization |
| `data/DATASET_CATALOG.md` | All 15 datasets: schemas, join keys, query examples, cross-dataset matrix |
| `data/etl/README.md` | ETL operator guide: commands, flags, cadence behavior, state semantics |
| `Julie_Public_Data_Update_20260225.md` | Stakeholder update on all 12 public data sources |
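Phase 1 of the MongoDB export (`./gradlew discover`) infers the field set before any CSV is written. The core idea is to flatten sampled documents into dotted field paths and union them across the sample. The sketch below illustrates that concept in Python; it is not the Java utility's actual code, and the sample documents are invented:

```python
def flatten_fields(doc: dict, prefix: str = "") -> set:
    """Recursively collect dotted field paths from a nested document."""
    fields = set()
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            fields |= flatten_fields(value, path)  # recurse into subdocuments
        else:
            fields.add(path)
    return fields

def discover(sample_docs: list) -> list:
    """Union the field paths seen across a sample of documents."""
    fields = set()
    for doc in sample_docs:
        fields |= flatten_fields(doc)
    return sorted(fields)

sample = [
    {"_id": 1, "address": {"city": "Aspen", "zip": "81611"}, "price": 5_200_000},
    {"_id": 2, "address": {"city": "Vail"}, "agent": "J. Doe"},
]
print(discover(sample))
# ['_id', 'address.city', 'address.zip', 'agent', 'price']
```

Unioning across documents matters because MongoDB collections are schemaless: no single document is guaranteed to carry every field. Phase 2 then exports against the discovered (and hand-edited) field config.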
```
Realm/
├── realm-etl                    # ETL CLI entry point
├── data/
│   ├── etl/                     # Framework: orchestrator, loader, config, state
│   │   ├── sources.json         # Master config (all 12 sources)
│   │   └── README.md            # Operator guide
│   ├── redfin/source.py         # Source modules (one per data source)
│   ├── irs-soi/source.py
│   ├── fhfa-hpi/source.py
│   ├── fred/source.py
│   ├── census-acs/source.py
│   ├── bea/source.py
│   ├── hmda/source.py
│   ├── sec-edgar/source.py
│   ├── fbi-crime/source.py
│   ├── geocoding/source.py
│   ├── fema-flood/source.py
│   ├── nces-schools/source.py
│   └── DATASET_CATALOG.md       # Schema reference for all datasets
├── src/main/java/               # MongoDB export utility (Java)
├── scoop/                       # Semantics extraction scripts + snapshots
├── config/                      # MongoDB field configs
├── output/                      # Exported CSVs
└── archive/                     # Historical session docs
```
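The one-module-per-source layout above suggests a simple dispatch pattern: the orchestrator reads `sources.json` and routes each configured source to its `source.py`. A stdlib-only sketch of the config-parsing half, with an invented config shape — the real `sources.json` schema is documented in `data/etl/README.md`, not here:

```python
import json

# Illustrative config shaped like the repo layout; NOT the real sources.json schema.
EXAMPLE_CONFIG = """
{
  "sources": [
    {"name": "fhfa-hpi", "cadence": "quarterly", "module": "data/fhfa-hpi/source.py"},
    {"name": "fred",     "cadence": "monthly",   "module": "data/fred/source.py"}
  ]
}
"""

def load_sources(raw: str) -> dict:
    """Parse the master config into a {source_name: settings} map."""
    return {entry["name"]: entry for entry in json.loads(raw)["sources"]}

sources = load_sources(EXAMPLE_CONFIG)
print(sources["fred"]["cadence"])  # monthly
```

Keyed by name, the map supports both `run --source fhfa-hpi` (single lookup) and `run --all` (iterate everything) without special-casing either path.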
- Python stdlib only — no pip dependencies in the ETL framework
- Scoop CLI bulkload is the default load method (not S3+Lambda)
- VPN required for all Scoop operations (mobileAPI, bulkload, investigations)
- Java 21 + Gradle for MongoDB export utility
- API keys via env vars: `FRED_API_KEY`, `CENSUS_API_KEY`, `BEA_API_KEY`
- Remote: https://github.com/scoopeng/Realm
- Branch: master
- GitHub account: `scoopeng` (HTTPS only, no SSH)
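Because the three API keys arrive via environment variables, a run touching FRED, Census, or BEA can fail fast with a clear message instead of dying mid-download. A minimal sketch of such a preflight check — the actual CLI's behavior and source names are not guaranteed to match:

```python
import os

# Which sources need which key (illustrative mapping).
REQUIRED_KEYS = {
    "fred": "FRED_API_KEY",
    "census-acs": "CENSUS_API_KEY",
    "bea": "BEA_API_KEY",
}

def missing_keys(sources: list) -> list:
    """Return env var names required by `sources` but absent from the environment."""
    return [var for src, var in REQUIRED_KEYS.items()
            if src in sources and not os.environ.get(var)]

os.environ["FRED_API_KEY"] = "demo"
os.environ.pop("BEA_API_KEY", None)
print(missing_keys(["fred", "bea"]))  # ['BEA_API_KEY']
```

Checking before any network I/O keeps a cron-driven `--scheduled` run from partially completing and leaving state inconsistent just because one key was unset.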