feat: failure classification taxonomy with circuit breaker and intelligent retry #579

@nextlevelshit

Context

Fabro classifies failures into six categories and uses normalized failure fingerprints to detect repeated failures: when the same failure signature repeats N times, execution terminates automatically (circuit breaker). Fabro also retries at three layers (LLM call, turn, node), with a different policy per failure class.

Wave currently has basic on_failure actions (fail/skip/continue/rework/retry) without failure classification or intelligent circuit breaking.

Design Goals — Best of Both Worlds

Combine Fabro's failure intelligence with Wave's contract validation to create a more resilient execution model.

Failure Classification

Classify every step failure into one of these categories:

| Class | Meaning | Retryable? | Example |
|---|---|---|---|
| transient | Rate limits, timeouts, network | Yes (auto-retry) | API 429, connection timeout |
| deterministic | Auth errors, bad config | No | Invalid API key, missing binary |
| budget_exhausted | Context/token limits | No (trigger fallback) | Context window exceeded |
| contract_failure | Output doesn't match schema | Yes (rework) | JSON schema mismatch |
| test_failure | Tests/validation failed | Yes (fix loop) | go test exit code 1 |
| canceled | User/system cancellation | No | SIGINT, timeout |
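
As a rough sketch of how the table could map onto code, the block below defines the six classes and a default action per class that mirrors the "Retryable?" column. The names `FailureClass`, `Action`, `DefaultAction`, and `Retryable` are hypothetical and not part of Wave today; the action names are placeholders for whatever the extended on_failure handling ends up calling them.

```go
package main

// FailureClass names one of the six categories from the table above.
type FailureClass string

const (
	Transient       FailureClass = "transient"
	Deterministic   FailureClass = "deterministic"
	BudgetExhausted FailureClass = "budget_exhausted"
	ContractFailure FailureClass = "contract_failure"
	TestFailure     FailureClass = "test_failure"
	Canceled        FailureClass = "canceled"
)

// Action is a hypothetical name for what the executor does after a failure.
type Action string

const (
	ActionRetry    Action = "retry"    // auto-retry with backoff
	ActionRework   Action = "rework"   // re-run the step with validation feedback
	ActionFixLoop  Action = "fix_loop" // feed test output back and try again
	ActionFallback Action = "fallback" // no retry; trigger a fallback
	ActionFail     Action = "fail"     // no retry; surface the failure
)

// DefaultAction mirrors the "Retryable?" column of the table.
func DefaultAction(c FailureClass) Action {
	switch c {
	case Transient:
		return ActionRetry
	case ContractFailure:
		return ActionRework
	case TestFailure:
		return ActionFixLoop
	case BudgetExhausted:
		return ActionFallback
	default: // deterministic, canceled, and anything unrecognized
		return ActionFail
	}
}

// Retryable reports whether another attempt at the same step makes sense.
func Retryable(c FailureClass) bool {
	switch DefaultAction(c) {
	case ActionRetry, ActionRework, ActionFixLoop:
		return true
	}
	return false
}
```

Keeping the class-to-action mapping in one function would make it easy for a step's explicit on_failure setting to override the default.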

The executor should attempt to classify failures based on:

  1. Exit codes from adapter/commands
  2. Error message patterns (regex matching)
  3. Contract validation results
  4. Adapter-specific error parsing
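
A minimal sketch of a classifier that combines the first three signals is shown below. Everything in it is an assumption made for illustration: the `StepResult` struct, the regex patterns, the use of exit code 130 for SIGINT (a shell convention), and the choice to fall back to deterministic for unrecognized failures so they are not retried blindly. Adapter-specific parsing (signal 4) is left out.

```go
package main

import "regexp"

type FailureClass string

const (
	Transient       FailureClass = "transient"
	Deterministic   FailureClass = "deterministic"
	BudgetExhausted FailureClass = "budget_exhausted"
	ContractFailure FailureClass = "contract_failure"
	TestFailure     FailureClass = "test_failure"
	Canceled        FailureClass = "canceled"
)

// StepResult is a hypothetical summary of a failed step.
type StepResult struct {
	ExitCode       int
	Stderr         string
	ContractFailed bool // contract validation rejected the output
	IsTestCommand  bool // the step ran a test/validation command
	Canceled       bool // user or system cancellation was observed
}

// patterns are illustrative only; a real list would be tuned per adapter.
var patterns = []struct {
	re    *regexp.Regexp
	class FailureClass
}{
	{regexp.MustCompile(`(?i)\b429\b|rate limit|timed? ?out|connection (refused|reset)`), Transient},
	{regexp.MustCompile(`(?i)context (window|length)|token limit`), BudgetExhausted},
	{regexp.MustCompile(`(?i)invalid api key|unauthorized|executable file not found`), Deterministic},
}

// Classify checks the signals in a fixed order and returns the first match.
func Classify(r StepResult) FailureClass {
	// Exit code 130 is 128 + SIGINT by shell convention.
	if r.Canceled || r.ExitCode == 130 {
		return Canceled
	}
	if r.ContractFailed {
		return ContractFailure
	}
	for _, p := range patterns {
		if p.re.MatchString(r.Stderr) {
			return p.class
		}
	}
	if r.IsTestCommand && r.ExitCode != 0 {
		return TestFailure
	}
	// Assumed conservative default: unknown failures are not auto-retried.
	return Deterministic
}
```

The order of the checks is a design choice: cancellation and contract results are unambiguous, so they are tested before the fuzzier pattern matching.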

Circuit Breaker

After each failure, create a normalized fingerprint: step_name + failure_class + normalized_error. When the same fingerprint has occurred limit times (runtime.circuit_breaker.limit, default: 3), terminate the step/pipeline.

By default, only deterministic, contract_failure, and test_failure failures are tracked (see tracked_classes below). Transient failures are excluded from fingerprinting.

runtime:
  circuit_breaker:
    limit: 3                    # same failure 3x = terminate
    tracked_classes: [deterministic, contract_failure, test_failure]
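
One way the fingerprinting and counting could look is sketched below. The normalization rules (lowercasing, masking hex values and numbers, collapsing whitespace) are assumptions about what "normalized" should mean, and `Fingerprint`, `CircuitBreaker`, and `Record` are hypothetical names. Counts are held in memory here for brevity, whereas the proposal stores failure signatures in the SQLite state DB.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"regexp"
	"strings"
)

var (
	hexRun   = regexp.MustCompile(`0x[0-9a-f]+`)
	digitRun = regexp.MustCompile(`\d+`)
	spaceRun = regexp.MustCompile(`\s+`)
)

// normalizeError masks volatile details so repeated failures compare equal.
func normalizeError(msg string) string {
	s := strings.ToLower(msg)
	s = hexRun.ReplaceAllString(s, "<hex>")
	s = digitRun.ReplaceAllString(s, "<n>")
	s = spaceRun.ReplaceAllString(s, " ")
	return strings.TrimSpace(s)
}

// Fingerprint hashes step_name + failure_class + normalized_error.
func Fingerprint(step, class, errMsg string) string {
	sum := sha256.Sum256([]byte(step + "\x00" + class + "\x00" + normalizeError(errMsg)))
	return hex.EncodeToString(sum[:8])
}

// CircuitBreaker counts identical fingerprints for tracked classes only.
type CircuitBreaker struct {
	limit   int
	tracked map[string]bool
	counts  map[string]int
}

func NewCircuitBreaker(limit int, trackedClasses []string) *CircuitBreaker {
	tracked := make(map[string]bool, len(trackedClasses))
	for _, c := range trackedClasses {
		tracked[c] = true
	}
	return &CircuitBreaker{limit: limit, tracked: tracked, counts: map[string]int{}}
}

// Record registers a failure and reports whether the breaker has tripped,
// that is, whether this fingerprint has now been seen limit times.
func (b *CircuitBreaker) Record(step, class, errMsg string) bool {
	if !b.tracked[class] {
		return false // untracked classes (e.g. transient) never trip the breaker
	}
	fp := Fingerprint(step, class, errMsg)
	b.counts[fp]++
	return b.counts[fp] >= b.limit
}
```

With limit 3, the third call to `Record` with an equivalent error for the same step returns true, matching the "same failure 3x = terminate" comment in the config.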

Retry Policies

Per-step retry policy configuration:

steps:
  - name: implement
    retry:
      policy: standard          # none | standard | aggressive | patient
      max_attempts: 5
      backoff: exponential      # exponential | linear | constant
      initial_delay_ms: 200
      max_delay_ms: 30000

Built-in policies:

  • none: 1 attempt, immediate failure
  • standard: 3 attempts, 1s initial, 2x exponential (default)
  • aggressive: 5 attempts, 200ms initial, 2x exponential
  • patient: 3 attempts, 5s initial, 3x exponential
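
The built-in policies could be expressed as data plus a small delay function, roughly as follows. This is a sketch under assumptions: the `Policy` struct and its methods are hypothetical, only exponential backoff is shown, and the 30s cap is borrowed from max_delay_ms in the example above because the list does not state a cap per policy. Whether explicit per-step fields such as max_attempts override the named policy's defaults is also left open here.

```go
package main

import "time"

// Policy is a hypothetical in-memory form of a retry policy.
type Policy struct {
	MaxAttempts  int           // total attempts, including the first
	InitialDelay time.Duration // delay before the first retry
	Multiplier   float64       // exponential growth factor
	MaxDelay     time.Duration // cap on any single delay (assumed 30s)
}

// builtin holds the four policies listed above.
var builtin = map[string]Policy{
	"none":       {MaxAttempts: 1},
	"standard":   {MaxAttempts: 3, InitialDelay: time.Second, Multiplier: 2, MaxDelay: 30 * time.Second},
	"aggressive": {MaxAttempts: 5, InitialDelay: 200 * time.Millisecond, Multiplier: 2, MaxDelay: 30 * time.Second},
	"patient":    {MaxAttempts: 3, InitialDelay: 5 * time.Second, Multiplier: 3, MaxDelay: 30 * time.Second},
}

// Delay returns the wait before the given retry (1 = first retry),
// computed as InitialDelay * Multiplier^(retry-1) and capped at MaxDelay.
func (p Policy) Delay(retry int) time.Duration {
	d := float64(p.InitialDelay)
	for i := 1; i < retry; i++ {
		d *= p.Multiplier
	}
	if p.MaxDelay > 0 && d > float64(p.MaxDelay) {
		return p.MaxDelay
	}
	return time.Duration(d)
}

// Schedule lists the delays between attempts; it is empty when no retry is allowed.
func (p Policy) Schedule() []time.Duration {
	var out []time.Duration
	for retry := 1; retry < p.MaxAttempts; retry++ {
		out = append(out, p.Delay(retry))
	}
	return out
}
```

Under these assumptions, standard waits 1s then 2s, aggressive waits 200ms, 400ms, 800ms, 1.6s, and patient waits 5s then 15s. Adding jitter would be a natural extension but is not part of the proposal.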

Stall Watchdog

If a step produces no progress events for stall_timeout (default: 30min), terminate it:

runtime:
  stall_timeout: 1800s

What Wave Keeps

  • Contract validation (enriches failure classification — contract failures are a distinct class)
  • SQLite state (failure signatures stored for circuit breaker)
  • Existing on_failure actions (extended with classification awareness)

What Wave Gains

  • Intelligent retry — don't retry deterministic failures, do retry transient ones
  • Circuit breaker — prevents infinite loops on repeated identical failures
  • Failure analytics — classification feeds into retrospectives
  • Stall detection — catches hung adapters

Implementation Scope

  1. internal/pipeline/failure.go — failure classification and fingerprinting
  2. Extend executor retry logic with policies and circuit breaker
  3. Add stall watchdog to step execution
  4. Store failure signatures in state DB
  5. Update event emitter with failure class in events
  6. Add retry policy to manifest schema

Research Sources

  • Fabro failure handling: https://docs.fabro.sh/execution/failures
  • Fabro 6 failure classes: transient_infra, deterministic, budget_exhausted, compilation_loop, canceled, structural
  • Fabro circuit breaker: normalized fingerprint + 3-repeat threshold

Labels: enhancement