71 changes: 71 additions & 0 deletions claude-md-updates/README.md
@@ -0,0 +1,71 @@
# CLAUDE.md Updates

This directory contains proposed `CLAUDE.md` files for the five repositories where the following team members merged PRs into `main` during May–June 2026: SayaliPat, shrivastavakapil2000, JoeVsVolcano, mike-brant, nathan-resonate.

## Files to Apply

Each subdirectory contains a `CLAUDE.md` to be committed to the root of the corresponding repository:

| Directory | Target Repository | Action |
|-----------|------------------|--------|
| `step-function-workflow-orchestrator/` | `resonate/step-function-workflow-orchestrator` | **Create** new CLAUDE.md |
| `batch-expression-modeling/` | `resonate/batch-expression-modeling` | **Replace** existing CLAUDE.md |
| `identity-graph/` | `resonate/identity-graph` | **Create** new CLAUDE.md |
| `batch-audience-delivery-syndication/` | `resonate/batch-audience-delivery-syndication` | **Create** new CLAUDE.md |
| `dos-data-pipeline/` | `resonate/dos-data-pipeline` | **Create** new CLAUDE.md |

## How to Apply

For each repository, create a branch and open a PR:

```bash
# 1. Check out the target repo
cd /path/to/step-function-workflow-orchestrator
git checkout -b chore/add-claude-md

# 2. Copy the file from this repo (resonate/.github)
# Assumes resonate/.github is cloned alongside the target repo
cp ../resonate-.github/claude-md-updates/step-function-workflow-orchestrator/CLAUDE.md ./CLAUDE.md
# Or download directly from GitHub:
# curl -o CLAUDE.md https://raw.githubusercontent.com/resonate/.github/main/claude-md-updates/step-function-workflow-orchestrator/CLAUDE.md

git add CLAUDE.md
git commit -m "chore: add CLAUDE.md with project guidance for Claude Code"
git push -u origin chore/add-claude-md
# Then open PR via GitHub UI or: gh pr create --title "chore: add CLAUDE.md"
```

## What's Covered in Each File

### step-function-workflow-orchestrator
- Pipeline inventory (12 active pipelines)
- EMR 5→7 migration notes (Spark 2→3, yarn vcore fix)
- Integration test patterns
- Dynamic dates lambda (directory mode vs flat-file sentinel mode)
- Recent changes: EMR migrations, QA envs, fusion-behavior-preprocess removal, geo district namespace fixes

### batch-expression-modeling (UPDATE)
- All existing content preserved
- Added: formatter metrics lambda, stitch throttle protection (MaxConcurrency=2)
- Added: `delta_with_full_fallback` refresh type handling
- Added: Formatter output path layout (post CDP-118857 partition order change)

### identity-graph
- 11 Spark pipeline jobs and their purposes
- Shared utilities (HashUtils, StagingWriter, AddressNormalizer, IpFilter, ScoringConfig)
- PRISM design overview and 6 tracks
- All jobs use scopt CLI args
- Recent changes: port from resonate-research, ExperianDataProcessor, PRISM docs

### batch-audience-delivery-syndication
- Supported vendors (OpenX, Experian, Viant, BlockGraph)
- openx-publish-data-files: hardcoded .csv.gz extension (DO NOT revert to dynamic parsing)
- blockgraph-create-taxonomy-file: taxonomy generation, SPI=N constant
- Source path partition order (vendor=*/method=av/)

### dos-data-pipeline
- district_source provenance (L2_CONFIRMED, L2_UNCONFIRMED, IP_INFERRED)
- IP-inferred district fallback via 4 ZIP→district CSVs
- ToBitmap gating on L2_CONFIRMED
- ZIP→district namespace requirements (L2 canonical vs floterial)
- GeoLocationFullBackfill: always re-derives all 4 districts
113 changes: 113 additions & 0 deletions claude-md-updates/batch-audience-delivery-syndication/CLAUDE.md
@@ -0,0 +1,113 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This repository contains **Batch Audience Delivery — Syndication** workflows: the Lambda functions and Step Function state machines that publish formatted audience data files to syndicated delivery vendors (OpenX, Experian, Viant, BlockGraph). It consumes the output of `batch-expression-modeling`'s formatter step.

### Supported Vendors

| Vendor | Lambda | State Machine |
|--------|--------|--------------|
| OpenX | `openx-publish-data-files` | `batch-audience-delivery-syndication` |
| Experian | `experian-publish-data-files` | `experian-syndication-workflow` |
| Viant | `viant-publish-files` | (see viant workflow) |
| BlockGraph | `blockgraph-create-taxonomy-file` | BlockGraph delivery workflow |

## Repository Structure

```
├── workflows/
│ ├── lambdas/
│ │ ├── openx-publish-data-files/ # Renames + copies CSV.gz files to OpenX S3
│ │ ├── blockgraph-create-taxonomy-file/ # Generates BlockGraph metadata CSV(s)
│ │ └── ...
│ └── step-functions/ # ASL state machine definitions
├── terraform/
│ └── workflows/lambdas/ # Terraform per-lambda per-env
└── .github/workflows/ # CI/CD
```

## Common Development Tasks

### Python Lambda Development

Each Lambda has its own directory under `workflows/lambdas/<name>/`:

```bash
# Run tests
cd workflows/lambdas/<lambda-name>
pytest test/ -v

# Example
cd workflows/lambdas/openx-publish-data-files
pytest test/ -v # 6 tests
```

### Deploying Lambdas

Lambdas are deployed via GitHub Actions (see `.github/workflows/`). To deploy to a specific environment:

```bash
gh workflow run <workflow>.yml \
  -f environment=dev \
  --ref <your-branch>
```

### Terraform / Terragrunt

```bash
cd terraform/workflows/lambdas/<lambda-name>/<env>
aws sso login
terragrunt plan
terragrunt apply
```

## Key Lambda: openx-publish-data-files

Copies Spark-formatted `.csv.gz` files from the S3 source to the OpenX destination, renaming each file to the pattern `resonate_syndication_{date}_Data_{N}.csv.gz`.

**Important:** The extension is hardcoded as `.csv.gz` — do NOT revert to dynamic `split('.', 1)` extension parsing. This was fixed in CDP-118955 because the upstream `batch-expression-modeling` formatter (CDP-118857) changed Spark's codec-suffix separator from `-c000.csv.gz` to `.c000.csv.gz`, which broke dynamic parsing and silently stopped all OpenX audience refreshes.

**Part number extraction:** Uses regex `part-(\d+)-.*\.csv` to find the part number, then increments by 1 for 1-indexed output filenames.
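The renaming rule above can be sketched roughly as follows. This is a minimal illustration, not the Lambda's actual code: the helper name `openx_filename`, the sample key, and the `YYYYMMDD` date value are assumptions; only the regex and the output pattern come from this file.

```python
import re

# Regex quoted above; the capture group is Spark's zero-based part number.
PART_RE = re.compile(r"part-(\d+)-.*\.csv")

# Hardcoded on purpose (CDP-118955); do not derive it from the source filename.
OUTPUT_EXT = ".csv.gz"


def openx_filename(source_key: str, date: str) -> str:
    """Hypothetical helper: map a Spark part-file key to an OpenX delivery filename."""
    match = PART_RE.search(source_key)
    if match is None:
        raise ValueError(f"no Spark part number found in {source_key!r}")
    part_number = int(match.group(1)) + 1  # zero-based part -> one-based output
    return f"resonate_syndication_{date}_Data_{part_number}{OUTPUT_EXT}"


# Both the legacy "-c000" and the current ".c000" separators yield a clean name.
print(openx_filename("akey=1/part-00000-1a2b-c000.csv.gz", "20260601"))
print(openx_filename("akey=1/part-00003-1a2b.c000.csv.gz", "20260601"))
```

Because the extension is a constant, the output name does not depend on how Spark punctuates the codec suffix.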

## Key Lambda: blockgraph-create-taxonomy-file

Generates BlockGraph metadata CSVs for a delivery run (CDP-118915):

- **Initial format** (13 fields) — net-new segments not yet in BlockGraph
- **Refresh format** (8 fields) — segments already known to BlockGraph

**Resolution:**
- Syndicated mode: queries BlockGraph ADS group to resolve audience set
- Custom mode: uses `audience_key_list` from event, filters to BlockGraph group hierarchy

**SPI field:** Constant `N` — we deliver BGIDs (not SPI source data), so the "Created using Sensitive Personal Information" flag is `N` taxonomy-wide (Q5 resolved, CDP-118915). Env var `spi_value` can override if needed.

**Output path:** `<prefix>/batch-delivery-payload/metadata/resonate_metadata_{initial,refresh}_{ts}.csv`

## Key Concepts

### Source Path Layout (Post CDP-118857)

The `batch-expression-modeling` formatter outputs data partitioned as:
```
<prefix>/date=<YYYYMMDD>/vendor=<vendor>/method=<method>/akey=<key>/part-*.csv.gz
```

The `source_prefix` used by syndication lambdas must order the partitions as `vendor=*/method=av/` (NOT the legacy `method=av/vendor=*/`). This was fixed in CDP-118937 for Experian, Yahoo, and custom delivery.
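As a rough sketch of the partition order described above, a prefix builder might look like this. The function name `build_source_prefix`, its parameters, and the sample values are hypothetical; the real lambdas may assemble the prefix differently.

```python
def build_source_prefix(base: str, date: str, vendor: str, method: str = "av") -> str:
    """Hypothetical helper: build a prefix in the post CDP-118857 partition order."""
    base = base.rstrip("/")
    # vendor comes BEFORE method; the legacy layout had them the other way round.
    return f"{base}/date={date}/vendor={vendor}/method={method}/"


# Sample values are illustrative only.
print(build_source_prefix("formatter-output", "20260601", "openx"))
```

Keeping the ordering in one helper means a future layout change touches a single place instead of every lambda.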

### File Extension Warning

Spark's `.partitionBy()` changes the codec suffix separator from `-c000.csv.gz` to `.c000.csv.gz`. Any lambda that dynamically parses file extensions from Spark output filenames will break after a `partitionBy` change. Always hardcode the expected extension (`.csv.gz`) after confirming `list_csv_files` already filters to that extension.
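The failure mode is easy to reproduce in isolation. The snippet below is a small demonstration, assuming filenames shaped like the two suffix styles named above (the `1a2b` segment stands in for Spark's job UUID); `dynamic_extension` mimics the fragile `split('.', 1)` approach and is not code from this repository.

```python
def dynamic_extension(filename: str) -> str:
    """Fragile approach: treat everything after the first dot as the extension."""
    return filename.split(".", 1)[1]


legacy_name = "part-00000-1a2b-c000.csv.gz"   # separator before c000 is "-"
current_name = "part-00000-1a2b.c000.csv.gz"  # separator before c000 is "."

print(dynamic_extension(legacy_name))   # csv.gz
print(dynamic_extension(current_name))  # c000.csv.gz (wrong extension)

# A suffix check is unaffected by the separator change.
print(current_name.endswith(".csv.gz"))  # True
```

With the current naming, the dynamically parsed "extension" picks up the `c000` codec segment, which is how malformed output filenames arise.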

### Delivery State Marker

`blockgraph-create-taxonomy-file` reads the Delivery State marker from `state/known-segments/` to identify which PSIDs have already been delivered (refresh vs. initial routing).
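The initial-versus-refresh routing can be pictured as a simple set-membership split. This is a sketch under assumptions: the marker's on-disk format is not described here, so the example assumes the known PSIDs have already been loaded into a set, and `route_segments` is a hypothetical name.

```python
from typing import Iterable, List, Set, Tuple


def route_segments(psids: Iterable[str], known_psids: Set[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split PSIDs into (initial, refresh) lists.

    known_psids is assumed to hold the PSIDs recorded by the Delivery State marker.
    """
    initial: List[str] = []
    refresh: List[str] = []
    for psid in psids:
        if psid in known_psids:
            refresh.append(psid)  # already delivered -> refresh format
        else:
            initial.append(psid)  # net-new -> initial format
    return initial, refresh


initial, refresh = route_segments(["p1", "p2", "p3"], {"p2"})
print(initial, refresh)
```

Segments in the first list would go into the initial-format file and those in the second into the refresh-format file.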

## Recent Changes (May–June 2026)

- **openx-publish-data-files** (PR #46, CDP-118955): Hardcoded `.csv.gz` extension — fixes broken OpenX syndication after CDP-118857 Spark `partitionBy` change caused malformed filenames (last good delivery was 2026-04-26)
- **blockgraph-create-taxonomy-file** (PR #48, CDP-118915): New Lambda implementing BlockGraph taxonomy file generation, with 18 unit tests and support for both syndicated and custom audience modes; SPI defaults to the constant `N` (overridable via the `spi_value` env var)
- **source_prefix path order** (PR #42, CDP-118937): Swapped `method=av/vendor=*/` → `vendor=*/method=av/` to match new formatter output layout