
Make worker startup timing configurable and harden marker correlation #27

Draft
jacoblyles wants to merge 1 commit into main from fix/startup-timeout-marker-fallback

@jacoblyles

Summary

This PR addresses reliability issues observed in OpenClaw -> Claude -> Maniple -> worker flows.

1) Configurable startup timing

Adds a new startup config section:

  • startup.agent_ready_timeout_seconds (default 30)
  • startup.marker_poll_timeout_seconds (default 30)

These values are now used by spawn_workers for:

  • agent startup readiness timeout
  • marker correlation polling timeout
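Both settings bound a wait loop. As a rough illustration (not the actual spawn_workers code), here is a deadline-based poller that a value such as marker_poll_timeout_seconds could feed. The names poll_until, check, and interval_seconds are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    check: Callable[[], Optional[T]],
    timeout_seconds: float,
    interval_seconds: float = 0.5,
) -> Optional[T]:
    """Call `check` until it returns a non-None value or the timeout expires.

    Returns the first non-None result, or None if the deadline passes first.
    """
    # Use a monotonic clock so wall-clock adjustments cannot shorten or extend the wait.
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = check()
        if result is not None:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        # Never sleep past the deadline.
        time.sleep(min(interval_seconds, remaining))
```

With this shape, raising a timeout only lengthens the worst-case wait. A check that succeeds early still returns immediately, which is why keeping the 30 s default costs nothing for healthy workers.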

Also adds config CLI support and environment-variable overrides:

  • MANIPLE_AGENT_READY_TIMEOUT_SECONDS
  • MANIPLE_MARKER_POLL_TIMEOUT_SECONDS

The legacy CLAUDE_TEAM_* environment variables are still accepted as fallbacks.
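Here is a minimal sketch of how one timeout might be resolved from these sources. The helper name resolve_timeout and the exact precedence (current env var, then legacy env var, then config value, then the 30 s default) are assumptions, not code from this PR.

```python
import os
from typing import Mapping, Optional

DEFAULT_TIMEOUT_SECONDS = 30.0


def resolve_timeout(
    config_value: Optional[float],
    env_name: str,
    legacy_env_name: str,
    env: Mapping[str, str] = os.environ,
    default: float = DEFAULT_TIMEOUT_SECONDS,
) -> float:
    """Resolve a startup timeout: env var, then legacy env var, then config value, then default."""
    for name in (env_name, legacy_env_name):
        raw = env.get(name)
        # Treat unset or blank variables as "not provided".
        if raw is not None and raw.strip():
            value = float(raw)  # raises ValueError on non-numeric input
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {raw!r}")
            return value
    if config_value is not None:
        return float(config_value)
    return default
```

For example, resolve_timeout(cfg_value, "MANIPLE_AGENT_READY_TIMEOUT_SECONDS", "CLAUDE_TEAM_AGENT_READY_TIMEOUT_SECONDS") would let either variable override the config file. The CLAUDE_TEAM_ name in that call is a guess at the legacy spelling. Passing env explicitly keeps the function easy to test.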

2) Marker correlation fallback (Claude/tmux)

When direct correlation via await_marker_in_jsonl(...) fails for a Claude worker, spawn_workers now attempts recovery by scanning the worker's tmux pane for the marker (find_jsonl_by_tmux_id).

This removes the hard dependency on the worker replying "Identified!", which breaks when a worker model treats the marker message as suspicious or as a prompt injection.
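The control flow can be sketched as follows. This PR does not show the real signatures of await_marker_in_jsonl and find_jsonl_by_tmux_id, so both are modeled here as zero-argument callables that return a JSONL path or None. The primary path is also assumed to possibly raise TimeoutError. The name correlate_with_fallback is hypothetical.

```python
from typing import Callable, Optional


def correlate_with_fallback(
    await_marker: Callable[[], Optional[str]],
    scan_tmux_pane: Callable[[], Optional[str]],
) -> Optional[str]:
    """Return a session JSONL path: try marker correlation first, then the tmux scan."""
    try:
        path = await_marker()
    except TimeoutError:
        # Treat a timed-out wait the same as "no match found".
        path = None
    if path is not None:
        return path
    # Direct correlation failed, so try to locate the session via the tmux pane instead.
    return scan_tmux_pane()
```

The fallback only runs when the primary path comes up empty. Workers that do reply normally are unaffected.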

3) Better startup failure diagnostics

On a tmux startup timeout, the raised error message now includes a tail excerpt of recent pane output. This makes failures (e.g. auth/model errors, update screens) actionable without requiring a manual pane capture.
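Here is a sketch of the error formatting, assuming the pane text has already been captured (for example with tmux capture-pane -p -t <pane>). The function name, the 20-line default, and the message wording are illustrative, not taken from the PR.

```python
def format_startup_timeout_error(
    pane_id: str,
    timeout_seconds: float,
    pane_output: str,
    tail_lines: int = 20,
) -> str:
    """Build a timeout message that ends with the last few non-empty pane lines."""
    # Drop blank lines so the excerpt is not padded with empty screen rows.
    lines = [line for line in pane_output.splitlines() if line.strip()]
    tail = lines[-tail_lines:]
    message = f"Worker in tmux pane {pane_id} not ready after {timeout_seconds:g}s."
    if tail:
        message += "\nRecent pane output:\n" + "\n".join(tail)
    return message
```

Filtering blank lines matters because a captured pane is often mostly empty rows. Without the filter, the meaningful output could fall outside the tail window.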

4) Marker prompt wording hardening

Updates marker instruction text to explicitly identify the marker block as an orchestrator system handshake.

Why

In real runs, worker startup failures were hard to diagnose, and marker correlation was brittle for some Claude workers. These changes keep the existing defaults but add:

  • explicit tuning knobs
  • robust fallback behavior
  • clearer failure output

Validation

Ran targeted tests:

  • uv run pytest -q tests/test_config.py tests/test_config_cli.py tests/test_spawn_workers_defaults.py tests/test_session_state.py
  • Result: 124 passed

Also checked lint for touched files:

  • uv run ruff check ...
  • Result: clean

Notes

Defaults remain unchanged (30s) to preserve current behavior unless users opt in to different values.
