Skip to content

fix: KEEP-468 fail closed on unresolved template references#1187

Merged
joelorzet merged 6 commits into
stagingfrom
fix/KEEP-468-template-fail-closed
May 9, 2026
Merged

fix: KEEP-468 fail closed on unresolved template references#1187
joelorzet merged 6 commits into
stagingfrom
fix/KEEP-468-template-fail-closed

Conversation

@joelorzet
Copy link
Copy Markdown

@joelorzet joelorzet commented May 8, 2026

Summary

  • Runtime: resolver helpers now track unresolved references and the executor aborts the action with a structured TemplateResolutionError instead of silently substituting empty string or letting the literal {{...}} token flow through. The hackathon corruption case (an n8n-style ref that wrote {{$trigger.input.ts}} to an ENS text record) now fails before the on-chain write. KEEPERHUB_TEMPLATE_RESOLVE_MODE=legacy opt-out preserves historical behaviour with a structured warn for the transition window.
  • Per-node failure persisted: when processActionConfig aborts before executeActionStep runs, the catch block routes through triggerStep's new _recordStepFailure mode to write a workflow_execution_logs row. Without this, the KEEP-1549/KEEP-431 reconciler saw "no node failed" and silently flipped workflow_executions.status to 'success', hiding the abort entirely.
  • PATCH-time validator: workflow save now parses every {{...}} token against the three supported grammars and returns 400 INVALID_TEMPLATE_SYNTAX with a list of offending tokens for typo classes the runtime cannot catch syntactically (empty bodies, stored format missing the colon, bare $).
  • Edge cases (PR review): discriminated *Checked resolvers + 'key' in obj walk distinguish "leaf is null" from "path doesn't exist", so a SQL parameter that legitimately resolves to NULL no longer false-trips the strict gate. Condition expressions also get a post-substitution leftover-scan: display- and legacy-format tokens that don't substitute now throw a clear "unresolved template reference" error instead of an opaque JS syntax error.
  • Docs: new public reference page at /workflows/templating documents the three grammars, path syntax, what fails at save time vs runtime, and the runtime mode env var.
  • Local-dev: docker-compose forwards TURNKEY_API_* to the executor service so the executor doesn't crash on the Turnkey startup gate when active wallets exist locally.

Notes

  • Default mode is strict. Workflows whose existing references resolve cleanly are unaffected.
  • Verified end-to-end against a Sepolia write-contract workflow on the dev stack: processActionConfig aborts before submission, workflow_executions.status='error' with the full message, per-node workflow_execution_logs row recorded, no on-chain transaction.
  • Legacy-mode follow-up deferred: PR review noted that KEEPERHUB_TEMPLATE_RESOLVE_MODE=legacy still re-opens the corruption window if flipped. Holding off on hardening the flag (per-execution dedupe, write-action override, removal) until we can quantify exposure post-deploy.
  • 191 unit tests pass across the related suites (template, db-template-params, http-request-template-substitution, condition-executor, for-each-executor, template-fail-closed, template-fail-closed-persistence, template-fail-closed-edge-cases, validate-template-syntax).
  • Full repo pnpm type-check clean.
  • Closes KEEP-468.

joelorzet added 6 commits May 8, 2026 22:36
When a `{{...}}` template failed to resolve, the resolver silently
substituted an empty string or the literal `{{...}}` token and let
the corrupted value flow into downstream actions, including on-chain
writes. Hackathon teams hit this in production: ENS text records
written with the literal `{{$trigger.input.ts}}`, webhook payloads
with empty `workflowRunId`, run-code bodies that crashed because a
URL got inlined as an unquoted JS identifier.

Resolver helpers now thread an optional `TemplateResolutionTracker`
that records each empty-string substitution categorized as no-node,
no-data, or no-path. `processActionConfig` aggregates tracker entries
plus a post-scan for leftover `{{...}}` literals (the displayPattern
fallback path) and calls `assertResolved`. In strict mode (default)
the action aborts with a `TemplateResolutionError` carrying the list
of unresolved tokens. The For Each array-resolution path gets the
same gate so a mistyped iterable cannot iterate zero times silently.

`KEEPERHUB_TEMPLATE_RESOLVE_MODE=legacy` opt-out preserves the
historical silent-substitute behaviour but emits a structured warn
so operators can quantify exposure during the transition window.

Tests cover each unresolved category, both modes, and the literal
hackathon corruption scenario.
Adds a PATCH-time syntax validator that walks every node config and parses
each `{{...}}` token against the three supported grammars (stored
`{{@nodeid:Label}}`, legacy `{{$nodeId.path}}`, and display `{{Label.path}}`).
Tokens that don't parse - empty bodies, stored references missing the colon
(the `{{@trigger.input.ts}}` typo class that produced the Tradewise on-chain
corruption), bare `$`-only bodies - return 400 with `INVALID_TEMPLATE_SYNTAX`
and a list of offending tokens so the editor can surface inline errors.

Validator mirrors `findBareAtLiterals` for safety: bounded depth, string
length, and finding cap so a malicious DB-resident workflow row can't pin
a worker.

Display-format tokens with arbitrary labels are accepted at the syntax
layer; unresolved references are caught at runtime by the strict-mode
fail-closed path shipped in the previous commit.
Documents the three supported template grammars (stored, display, legacy
`$`), the path syntax, what does NOT parse, what fails at runtime instead
of save time, where templates can appear, and the runtime mode env var.

The page is the public-facing answer to the hackathon feedback that
builders had to discover the grammar by trial and error. Pairs with the
strict-mode runtime gate and the PATCH-time syntax validator so authors
who hit either failure mode can self-serve a fix from one document.
…tion aborts

The runtime fail-closed introduced earlier in this branch threw
`TemplateResolutionError` from `processActionConfig` before
`executeActionStep` ran, so `withStepLogging` never wrote a
`workflow_execution_logs` row for the failing node. The
KEEP-1549/KEEP-431 reconciler in `logWorkflowCompleteDb` then read
the per-node aggregate, found no failed nodes, and CAS-flipped
`workflow_executions.status` from 'error' to 'success' as a
"spurious SDK retry." End result: the on-chain corruption was
prevented, but the run looked successful with no visible failure.

Add a `_recordStepFailure` mode to `triggerStep` that mirrors the
existing `_workflowComplete` pattern: the workflow body forbids
Node.js modules, so the DB write must route through a `"use step"`
boundary. The catch block in `executeNode` invokes it only when
the error is `TemplateResolutionError` (the pre-step abort case);
the existing KEEP-398/KEEP-431 spurious-max-retries recovery branch
is left untouched. The DB write is wrapped in try/catch so a logging
failure can never block the workflow from aborting.

After this change, a TemplateResolutionError run leaves:
  - workflow_executions.status='error' with the full message
  - one workflow_execution_logs row per failing node, status='error'
  - the run panel in the editor surfaces the action's failure
The executor's `assertTurnkeyEnvForActiveWallets` startup gate
crashes the container in a restart loop when any active Turnkey
wallet exists in the local DB and the API keys aren't on the
process env. The vars were already in `.env`/`.env.local`, but the
executor service in `docker-compose.yml` had no `env_file` and no
explicit Turnkey entries, so they never reached the container.

Add the three vars under the executor's `environment:` block using
`${VAR:-}` interpolation so compose pulls them from the
project-root `.env` automatically, matching the pattern used for
`INTEGRATION_ENCRYPTION_KEY`. Empty defaults preserve current
behaviour for environments where the vars are intentionally unset
(no active Turnkey wallets, gate falls through to the no-op path).
#1 Legitimate NULL upstream values were treated as unresolved.
   `extractTemplateParameters` previously called `resolveTemplateToRawValue`,
   which collapses both "no path" and "leaf is null" to `null`. The strict
   gate then false-tripped on workflows that legitimately passed NULL through
   a Database Query parameter (e.g. a SQL column that returned NULL).

   Add discriminated `*Checked` resolvers and a `resolveStrictPath` helper
   that uses `'key' in obj` to distinguish "the leaf key exists with a null
   value" from "the leaf key does not exist." The SQL parameterizer now
   pushes the real null and leaves the tracker empty in the legitimate case;
   genuine no-node / no-data / no-path still record the right reason.

#3 Condition expressions accepted display- and legacy-format leftovers
   silently. Only stored-format `{{@nodeid:Label}}` is substituted in
   `evaluateConditionExpression`; any other shape that survived the
   substitution loop fed straight into the JS evaluator and surfaced as a
   misleading "Unexpected token '{'" syntax error. Add a post-substitution
   leftover-scan that throws a clear "unresolved template reference(s):
   {{...}}. Use stored format `{{@nodeid:Label.field}}`" before the
   evaluator runs.

Tests cover both edge cases plus the discriminator distinction across
no-node / no-data / no-path, and confirm a real null in a SQL parameter
does not light up the tracker.
@joelorzet joelorzet force-pushed the fix/KEEP-468-template-fail-closed branch from b3d5d9e to f49983c Compare May 9, 2026 01:39
@joelorzet joelorzet merged commit a7979c6 into staging May 9, 2026
40 checks passed
@joelorzet joelorzet deleted the fix/KEEP-468-template-fail-closed branch May 9, 2026 01:51
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-1187
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

  • The PR never had the deploy-pr-environment label
  • The environment was already cleaned up
  • The deployment never completed successfully

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant