fix: KEEP-468 fail closed on unresolved template references#1187
Merged
Conversation
When a `{{...}}` template failed to resolve, the resolver silently
substituted an empty string or the literal `{{...}}` token and let
the corrupted value flow into downstream actions, including on-chain
writes. Hackathon teams hit this in production: ENS text records
written with the literal `{{$trigger.input.ts}}`, webhook payloads
with empty `workflowRunId`, run-code bodies that crashed because a
URL got inlined as an unquoted JS identifier.
Resolver helpers now thread an optional `TemplateResolutionTracker`
that records each empty-string substitution categorized as no-node,
no-data, or no-path. `processActionConfig` aggregates tracker entries
plus a post-scan for leftover `{{...}}` literals (the displayPattern
fallback path) and calls `assertResolved`. In strict mode (default)
the action aborts with a `TemplateResolutionError` carrying the list
of unresolved tokens. The For Each array-resolution path gets the
same gate so a mistyped iterable cannot iterate zero times silently.
`KEEPERHUB_TEMPLATE_RESOLVE_MODE=legacy` opt-out preserves the
historical silent-substitute behaviour but emits a structured warn
so operators can quantify exposure during the transition window.
Tests cover each unresolved category, both modes, and the literal
hackathon corruption scenario.
Adds a PATCH-time syntax validator that walks every node config and parses
each `{{...}}` token against the three supported grammars (stored
`{{@nodeid:Label}}`, legacy `{{$nodeId.path}}`, and display `{{Label.path}}`).
Tokens that don't parse - empty bodies, stored references missing the colon
(the `{{@trigger.input.ts}}` typo class that produced the Tradewise on-chain
corruption), bare `$`-only bodies - return 400 with `INVALID_TEMPLATE_SYNTAX`
and a list of offending tokens so the editor can surface inline errors.
Validator mirrors `findBareAtLiterals` for safety: bounded depth, string
length, and finding cap so a malicious DB-resident workflow row can't pin
a worker.
Display-format tokens with arbitrary labels are accepted at the syntax
layer; unresolved references are caught at runtime by the strict-mode
fail-closed path shipped in the previous commit.
Documents the three supported template grammars (stored, display, legacy `$`), the path syntax, what does NOT parse, what fails at runtime instead of save time, where templates can appear, and the runtime mode env var. The page is the public-facing answer to the hackathon feedback that builders had to discover the grammar by trial and error. Pairs with the strict-mode runtime gate and the PATCH-time syntax validator so authors who hit either failure mode can self-serve a fix from one document.
…tion aborts The runtime fail-closed introduced earlier in this branch threw `TemplateResolutionError` from `processActionConfig` before `executeActionStep` ran, so `withStepLogging` never wrote a `workflow_execution_logs` row for the failing node. The KEEP-1549/KEEP-431 reconciler in `logWorkflowCompleteDb` then read the per-node aggregate, found no failed nodes, and CAS-flipped `workflow_executions.status` from 'error' to 'success' as a "spurious SDK retry." End result: the on-chain corruption was prevented, but the run looked successful with no visible failure. Add a `_recordStepFailure` mode to `triggerStep` that mirrors the existing `_workflowComplete` pattern: the workflow body forbids Node.js modules, so the DB write must route through a `"use step"` boundary. The catch block in `executeNode` invokes it only when the error is `TemplateResolutionError` (the pre-step abort case); the existing KEEP-398/KEEP-431 spurious-max-retries recovery branch is left untouched. The DB write is wrapped in try/catch so a logging failure can never block the workflow from aborting. After this change, a TemplateResolutionError run leaves: - workflow_executions.status='error' with the full message - one workflow_execution_logs row per failing node, status='error' - the run panel in the editor surfaces the action's failure
The executor's `assertTurnkeyEnvForActiveWallets` startup gate
crashes the container in a restart loop when any active Turnkey
wallet exists in the local DB and the API keys aren't on the
process env. The vars were already in `.env`/`.env.local`, but the
executor service in `docker-compose.yml` had no `env_file` and no
explicit Turnkey entries, so they never reached the container.
Add the three vars under the executor's `environment:` block using
`${VAR:-}` interpolation so compose pulls them from the
project-root `.env` automatically, matching the pattern used for
`INTEGRATION_ENCRYPTION_KEY`. Empty defaults preserve current
behaviour for environments where the vars are intentionally unset
(no active Turnkey wallets, gate falls through to the no-op path).
#1 Legitimate NULL upstream values were treated as unresolved. `extractTemplateParameters` previously called `resolveTemplateToRawValue`, which collapses both "no path" and "leaf is null" to `null`. The strict gate then false-tripped on workflows that legitimately passed NULL through a Database Query parameter (e.g. a SQL column that returned NULL). Add discriminated `*Checked` resolvers and a `resolveStrictPath` helper that uses `'key' in obj` to distinguish "the leaf key exists with a null value" from "the leaf key does not exist." The SQL parameterizer now pushes the real null and leaves the tracker empty in the legitimate case; genuine no-node / no-data / no-path still record the right reason. #3 Condition expressions accepted display- and legacy-format leftovers silently. Only stored-format `{{@nodeid:Label}}` is substituted in `evaluateConditionExpression`; any other shape that survived the substitution loop fed straight into the JS evaluator and surfaced as a misleading "Unexpected token '{'" syntax error. Add a post-substitution leftover-scan that throws a clear "unresolved template reference(s): {{...}}. Use stored format `{{@nodeid:Label.field}}`" before the evaluator runs. Tests cover both edge cases plus the discriminator distinction across no-node / no-data / no-path, and confirm a real null in a SQL parameter does not light up the tracker.
b3d5d9e to
f49983c
Compare
🧹 PR Environment Cleaned UpThe PR environment has been successfully deleted. Deleted Resources:
All resources have been cleaned up and will no longer incur costs. |
ℹ️ No PR Environment to Clean UpNo PR environment was found for this PR. This is expected if:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
TemplateResolutionErrorinstead of silently substituting empty string or letting the literal{{...}}token flow through. The hackathon corruption case (an n8n-style ref that wrote{{$trigger.input.ts}}to an ENS text record) now fails before the on-chain write.KEEPERHUB_TEMPLATE_RESOLVE_MODE=legacyopt-out preserves historical behaviour with a structured warn for the transition window.processActionConfigaborts beforeexecuteActionStepruns, the catch block routes throughtriggerStep's new_recordStepFailuremode to write aworkflow_execution_logsrow. Without this, the KEEP-1549/KEEP-431 reconciler saw "no node failed" and silently flippedworkflow_executions.statusto'success', hiding the abort entirely.{{...}}token against the three supported grammars and returns 400INVALID_TEMPLATE_SYNTAXwith a list of offending tokens for typo classes the runtime cannot catch syntactically (empty bodies, stored format missing the colon, bare$).*Checkedresolvers +'key' in objwalk distinguish "leaf is null" from "path doesn't exist", so a SQL parameter that legitimately resolves to NULL no longer false-trips the strict gate. Condition expressions also get a post-substitution leftover-scan: display- and legacy-format tokens that don't substitute now throw a clear "unresolved template reference" error instead of an opaque JS syntax error./workflows/templatingdocuments the three grammars, path syntax, what fails at save time vs runtime, and the runtime mode env var.TURNKEY_API_*to the executor service so the executor doesn't crash on the Turnkey startup gate when active wallets exist locally.Notes
processActionConfigaborts before submission,workflow_executions.status='error'with the full message, per-nodeworkflow_execution_logsrow recorded, no on-chain transaction.KEEPERHUB_TEMPLATE_RESOLVE_MODE=legacystill re-opens the corruption window if flipped. Holding off on hardening the flag (per-execution dedupe, write-action override, removal) until we can quantify exposure post-deploy.template,db-template-params,http-request-template-substitution,condition-executor,for-each-executor,template-fail-closed,template-fail-closed-persistence,template-fail-closed-edge-cases,validate-template-syntax).pnpm type-checkclean.