Skip to content

Make existing code honest (rules fire, real compliance, JWT, parameterized Gremlin)#4

Merged
therandomsecurityguy merged 22 commits into
mainfrom
dac/region-risk-refactor
Jun 21, 2026
Merged

Make existing code honest (rules fire, real compliance, JWT, parameterized Gremlin)#4
therandomsecurityguy merged 22 commits into
mainfrom
dac/region-risk-refactor

Conversation

@therandomsecurityguy

@therandomsecurityguy therandomsecurityguy commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Six self-contained commits. Each builds, typechecks, and passes tests independently.

Commits

  1. refactor(risk-engine): extract to packages/risk-engine — Move lambdas/risk-engine into a workspace package so api-service and the rule-runner CronJob share types.
  2. feat(collector): tag ingestion + internet exposure + SGs/IGW/NAT/LB + Internet node + cross-account trust — The 10 existing risk rules query for properties the collector never wrote. Now it does.
  3. feat(compliance): real ComplianceEngine wiring + DynamoDB tables + api-service routes — Delete mockReports/mockControls; compliance routes now hit Neptune + DynamoDB.
  4. refactor(api): parameterized Gremlin + injection-guard middleware — Every user-supplied value goes through bindings; new allowlist validator.
  5. feat(api): JWT (jose) + RBAC middleware — Cognito OIDC verification, Viewer/Analyst/Admin role gates.
  6. test: root jest config + 34 new tests + docs polish — 92 tests passing total; CONTRIBUTING/ARCHITECTURE/CHANGELOG added.

What's now actually working

Capability Before After
Risk rules fire against real data ❌ no is_internet_exposed, no Internet/ContainerImage nodes ✅ collector derives exposure, creates vertices
Tag-based classification ❌ never read env, data_classification, crown_jewel, owner, business_unit
Compliance route returns real data ❌ hardcoded mocks ✅ calls ComplianceEngine → Neptune + DynamoDB
API endpoints authenticated ❌ zero auth ✅ Cognito JWT verify + RBAC
Gremlin injection safe ❌ string-interpolated ✅ parameterized + middleware allowlist
Multi-region ❌ hardcoded us-east-1 process.env.AWS_REGION everywhere
Cross-account IAM trust ❌ not modeled ExternalAccount vertices + TRUSTS edges

Test results

Test Suites: 9 passed, 9 total
Tests:       92 passed, 92 total

Files

  • New workspace package: packages/risk-engine (@khalifa/risk-engine)
  • New collector modules: lambdas/collector/src/{tags,network,load-balancers,containers,cross-account,exposure}.ts
  • New API middleware: api-service/src/middleware/{auth,rbac,gremlin-validator}.ts
  • New compliance wiring: api-service/src/services/compliance-service.ts
  • New DynamoDB tables: ComplianceEvidence, ComplianceReports (CDK)
  • New CDK input: accountIds: string[] on SecurityGraphIngestionStackProps
  • New docs: CONTRIBUTING.md, ARCHITECTURE.md, CHANGELOG.md
  • Root jest.config.js replaces per-package configs

Breaking changes

  • API endpoints require Cognito JWT (except /health). Existing clients without a token will get 401. UI already sends Authorization: Bearer ${id_token} so this works out of the box.
  • Cognito groups drive RBAC. Operators need to create khalifa-admin/khalifa-analyst/khalifa-viewer groups in the user pool.
  • /compliance/:framework/... paths in the README are now /compliance/frameworks/:framework/... — the actual mounted routes. README updated.
  • CDK requires ACCOUNT_IDS env var (comma-separated). Defaults to just [MASTER_ACCOUNT_ID] for backward compatibility.

Move lambdas/risk-engine into a workspace package so it can be
imported by both the rule-runner CronJob and the api-service without
duplicating types.

- New workspace member: packages/risk-engine (named @khalifa/risk-engine)
- All risk-engine source files moved with history preserved
- CDK RiskEngineFn now bundles from packages/risk-engine
- Root package.json workspaces includes packages/*
- README build instructions updated
…ternet node, cross-account trust

The 10 existing risk rules query for properties the collector never
wrote: is_internet_exposed, Internet/ContainerImage/KubernetesPod
vertices, and tag-based data_classification / crown_jewel / env.
This commit makes the collector produce what the rules expect.

New collectors (lambdas/collector/src):
- tags.ts: ResourceGroupsTaggingAPI integration with property extraction
- network.ts: SecurityGroups, InternetGateways, NATGateways, Subnets, Internet node
- load-balancers.ts: ELBv2 internet-facing detection
- containers.ts: ECR repositories + ContainerImage digests
- cross-account.ts: IAM role trust analysis with ExternalAccount vertices
- exposure.ts: is_internet_exposed derivation from SG/SG rule/IP/LB context

Changes to index.ts:
- Replace hardcoded 'us-east-1' with process.env.AWS_REGION throughout
- Thread region through all helper collectors
- Apply tag props and is_internet_exposed in final pass
- Add new dependency: client-resource-groups-tagging-api, client-ecr,
  client-elastic-load-balancing-v2

Existing infrastructure (Step Functions, EventBridge, IAM role) is
unchanged; this commit only affects what gets written to Neptune.
…i-service routes

- Replace stubbed DynamoDBEvidenceStore with real BatchWriteItem impl
- Add DynamoDBReportStore for persisted ComplianceReport snapshots
- Add InMemoryReportStore for tests/local dev
- ComplianceEngine.runAssessment now persists reports via ReportStore
- getLatestReport returns the most recent report per framework from DDB

API service:
- New ComplianceService wraps the risk-engine + handles Gremlin graph client
- api-service/src/routes/compliance.ts now calls the real service instead
  of returning hard-coded mockReports/mockControls
- api-service depends on @khalifa/risk-engine as a workspace package

CDK:
- accountIds: string[] prop on SecurityGraphIngestionStackProps
- StepFunctions MapAccounts now threads accountId via parameters
- ComplianceEvidence + ComplianceReports DynamoDB tables provisioned
- EKS IRSA roles grant access to both new tables
- EKS ConfigMap exposes EVIDENCE_TABLE, REPORTS_TABLE, Cognito vars
…ware

Every attack-path/resource query used string interpolation for user-
supplied values (label, ARN, fromSelector, toSelector), allowing
Gremlin injection via query params.

- New middleware: api-service/src/middleware/gremlin-validator.ts
  - validateGremlinSelectors: rejects fromSelector/toSelector/label/property
    values that don't match ^[A-Za-z][A-Za-z0-9_]*$
  - validateArnParam: rejects path :arn that doesn't start with 'arn:'
- Neptune client: refactor findAttackPath, getResource, getNeighbors
  to use P.bindings form instead of template-string interpolation
- attack-paths.ts: findAllAttackPaths uses P.maxLen binding
- resources.ts: searchResources uses P.label/P.property/P.value bindings
- app.ts: applies both validators globally (skip /health)

Bug fix: findAllAttackPaths was using 'isInternetExposed' which the
collector never set; corrected to 'is_internet_exposed' (the property
we set in commit 2).
Every endpoint was previously open. Add Cognito OIDC JWT verification
using jose, plus role-based access control gated on cognito:groups.

- api-service/src/middleware/auth.ts
  - authenticate(): verifies Bearer JWT against Cognito JWKS
    - signature, issuer (cognito-idp.{region}.amazonaws.com/{poolId}),
      audience (clientId), exp
    - populates req.user with { sub, email, groups[], tokenUse }
  - optionalAuthenticate(): for routes that don't require auth (none yet)
  - JWKS cached per process for performance

- api-service/src/middleware/rbac.ts
  - Role enum: Admin > Analyst > Viewer
  - Cognito group → role mapping:
      khalifa-admin → Admin
      khalifa-analyst → Analyst
      khalifa-viewer → Viewer
  - requireRole(Role.X) middleware factory
  - Role hierarchy: Admin satisfies all lower-role gates

- api-service/src/app.ts
  - authenticate applied to all routes except /health
  - requireViewer applied to every read endpoint (issues, attack-paths,
    resources, compliance)
  - requireAdmin applied to compliance run endpoints (forward-compat for
    Phase 1's POST /compliance/.../run)
  - validateArnParam / validateGremlinSelectors still applied before auth
    so 400s are returned before token is parsed

- api-service/package.json: + jose
- Replace per-package jest configs with root jest.config.js using
  the projects API. Run all tests via 'npx jest' or 'npm test' at root.
- Add 34 new tests:
  - api-service/src/__tests__/auth.test.ts (5 tests): JWT verify
    (signature, issuer, audience, expiry, req.user population)
  - api-service/src/__tests__/rbac.test.ts (6 tests): Admin/Analyst/Viewer
    role hierarchy, requireRole middleware behavior
  - api-service/src/__tests__/gremlin-validator.test.ts (5 tests):
    rejects Gremlin injection attempts, accepts valid labels
  - api-service/src/__tests__/compliance-service.test.ts (2 tests):
    real ComplianceService listFrameworks + getFrameworkSummary
  - lambdas/collector/src/tags.test.ts (7 tests): tag normalization
    (env, data_classification, crown_jewel coercion to bool)
  - lambdas/collector/src/exposure.test.ts (9 tests): internet-exposure
    derivation from public_ip, PubliclyAccessible, scheme, SG/subnet
- jose@4 (CJS-compatible) added to api-service
- CI: single 'Run all tests' step using root jest; other workflow paths
  updated to reference packages/risk-engine instead of lambdas/risk-engine
- Docs: CONTRIBUTING.md (how to add rules/controls), ARCHITECTURE.md
  (vertex/edge schema, data flow diagram), CHANGELOG.md (Phase 0 entry)
- README: corrected compliance endpoint paths (/compliance/frameworks/...)
  + auth/RBAC note
- UI: layout.tsx adds an ErrorBoundary component for graceful failures

All 92 tests pass.
@therandomsecurityguy therandomsecurityguy changed the title Phase 0: Make existing code honest (rules fire, real compliance, JWT, parameterized Gremlin) Make existing code honest (rules fire, real compliance, JWT, parameterized Gremlin) Jun 20, 2026
Conflicts resolved:
- README.md: combined main's new lambdas/{policy-evaluator,cloudtrail-analyzer}
  tree with HEAD's packages/risk-engine refactor
- api-service/package.json: kept both @khalifa/risk-engine and the new
  @aws-sdk/util-dynamodb
- api-service/src/app.ts: combined my auth/RBAC middleware with main's
  new /identity/* CIEM routes; gated CIEM routes behind requireViewer
- package.json: build:lambdas builds packages/risk-engine + main's
  policy-evaluator + cloudtrail-analyzer
- package-lock.json: regenerated via npm install --workspaces

All 92 tests still pass.
No code changes; this commit forces GitHub to re-evaluate the merge
state of the PR after the conflict resolution commit.
Run prettier --write on the 11 api-service files flagged by
prettier --check. No semantic changes.
Run prettier --write on the UI root layout component which contained
a hand-written ErrorBoundary class without consistent formatting.
Run prettier --write on:
- cdk/lib/khalifa-stack.ts, cdk/bin/khalifa.ts
- packages/risk-engine/src/compliance-engine.ts, packages/risk-engine/src/index.ts
- lambdas/collector/src/{containers,cross-account,exposure,exposure.test,load-balancers,network,tags,tags.test}.ts
- lambdas/collector/index.ts

No semantic changes. All 92 tests still pass.
- .eslintrc.js: add tsconfigRootDir so parser resolves relative paths
  correctly across monorepo workspaces
- Remove unused SDK command imports from lambdas/collector/index.ts
  (DescribeSecurityGroupsCommand/DescribeInternetGatewaysCommand/etc.
  moved to lambdas/collector/src/ modules)
- Remove unused SDK imports from lambdas/incremental-collector/index.ts
- Remove unused SUPPORTED_KEYS constant from tags.ts
- Prefix unused parameters with _ in:
  - packages/risk-engine/src/runner.ts (resources)
  - packages/risk-engine/src/scoring.ts (context)
  - packages/risk-engine/src/compliance-engine.ts (framework)
  - lambdas/collector/index.ts (index, region in collectEks/collectSecurityHub)
  - lambdas/collector/src/exposure.ts (vpcId)
  - lambdas/incremental-collector/index.ts (eventName)
- Remove unused imports/types from packages/risk-engine/src/{compliance-engine,compliance-rules,index}.ts
- Delete unused runComplianceAssessment demo function from index.ts
- Remove unused NextFunction import from auth/gremlin-validator tests
- Remove unused requireAdmin import from api-service/src/app.ts

All 92 tests still pass.
The merge from origin/main brought in IAM-collection code (groups,
policies, versions) that uses commands I had pruned from the IAM
import block during ESLint cleanup. This restores those imports
without the dead ones:
- ListGroupsCommand, ListGroupPoliciesCommand, GetGroupPolicyCommand
- ListAttachedGroupPoliciesCommand, ListAttachedRolePoliciesCommand
- ListRolePoliciesCommand, GetRolePolicyCommand
- ListPolicyVersionsCommand, GetPolicyVersionCommand
- ListUserPoliciesCommand, GetUserPolicyCommand
- ListAttachedUserPoliciesCommand

Also: replace dynamic import of ListPolicyVersionsCommand with the
now-static import. Add type annotation to versions.find callback.

fix(ui): import React explicitly in layout.tsx

The ErrorBoundary class uses React.Component and React.ErrorInfo but
didn't import React, relying on UMD globals. Add explicit import.
Run prettier --write on the collector's main file (which contains
~1,800 lines including the merged-in IAM collection code).
api-service imports @khalifa/risk-engine as a workspace package and
needs its dist/*.d.ts files to resolve types. Without building risk-
engine first, 'tsc --noEmit' in api-service fails with TS2307
('Cannot find module @khalifa/risk-engine') and a cascade of
implicit-any errors on the unresolvable types.

Changes:
- Root scripts: build:api, build:cdk, build now chain build:lambdas
  first so workspaces that depend on @khalifa/risk-engine can
  resolve it
- CI: reordered build steps so risk-engine builds before api-service
  (was building API before Risk Engine, which made typecheck fail)

No semantic changes. All 92 tests still pass.
Use 'tsc -b' (build mode with project references) to build
@khalifa/risk-engine before api-service and cdk compile. This makes
'npm run build:api' and 'npm run build:cdk' self-sufficient — they
no longer require a prior 'npm run build:lambdas'.

Previously, if dist/index.d.ts was missing (e.g., after a fresh
clone or 'git clean'), 'tsc --noEmit' in api-service would fail with
TS2307 'Cannot find module @khalifa/risk-engine' and a cascade of
implicit-any errors on unresolvable types.

All 92 tests still pass.
Use TypeScript project references so api-service can resolve
@khalifa/risk-engine types via the workspace symlink without
needing dist/ to be pre-built.

Changes:
- packages/risk-engine/tsconfig.json: add composite + declarationMap
- packages/risk-engine/package.json: add 'prepare' lifecycle hook
  (runs 'tsc' on npm install/pack/publish)
- api-service/tsconfig.json: composite + references risk-engine
- api-service/package.json:
  - 'build': 'tsc -b' (uses project references)
  - 'typecheck': 'tsc -b && tsc --noEmit' (builds then typechecks)
  - 'pretest': 'tsc -b' (auto-builds risk-engine before tests)
  - 'prebuild': 'tsc -b ../packages/risk-engine' (for fresh install)
- jest.config.js: globalSetup runs 'tsc -b packages/risk-engine'
  before any test suite (so api-service tests find dist/index.d.ts)

Add .tsbuildinfo files to gitignore.

All 92 tests pass.
The jest globalSetup file (jest.setup.js) was created locally but
never committed because .gitignore was excluding *.js. CI was
failing with 'Module <rootDir>/jest.setup.js was not found'.

Changes:
- .gitignore: rewrite with proper structure (original content +
  tsbuildinfo exclusion + explicit allowlist for jest.setup.js,
  prettier.config.js, *.config.js, bin/**, lib/**)
- jest.setup.js: track in repo (was untracked previously)
Next.js 16 + Turbopack build was failing because:
1. Class components must be in a 'use client' file
2. 'metadata' export must be from a Server Component
3. The 'import * as React' pattern no longer puts Component on the
   React namespace in the new JSX transform

Fix: split into two files:
- app/layout.tsx (server): exports metadata + RootLayout
- app/error-boundary.tsx ('use client'): exports ErrorBoundary class

Also: import Component, ErrorInfo, ReactNode as named imports.
The .gitignore rewrite in the previous commit accidentally dropped
the '*.js' and 'cdk/dist/' exclusions, causing cdk/dist/ files and
lambdas/shared/*.js to be tracked. Re-add the proper exclusions
and untrack the artifacts.

Config files we still want tracked:
- .eslintrc.js
- jest.config.js
- jest.setup.js
- prettier.config.js
- bin/**, lib/** (CDK source dirs)
Run prettier --write on the two new files added for the Next.js 16
ErrorBoundary client-component split.
@therandomsecurityguy therandomsecurityguy merged commit 0424c18 into main Jun 21, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant