Skip to content

fix(broker): clean up error output, add failure-mode tests, harden socket validation (#539, #486, #485)#732

Open
oboehmer wants to merge 14 commits intomainfrom
feat/539-486-585-broker-errors-hardening
Open

fix(broker): clean up error output, add failure-mode tests, harden socket validation (#539, #486, #485)#732
oboehmer wants to merge 14 commits intomainfrom
feat/539-486-585-broker-errors-hardening

Conversation

@oboehmer
Copy link
Copy Markdown
Collaborator

@oboehmer oboehmer commented Mar 30, 2026

Description

Addresses three related issues in the ConnectionBroker / BrokerClient / SSHTestBase stack that together made D2D test failures difficult to diagnose and the broker path unreliable.

Noisy connection error output (#539): When a device was unreachable, users saw ~100 lines of Unicon internal state machine tracebacks, a generic "Unknown broker error" message, and the same failure logged 4 times across different call layers. This was introduced by #467, which reordered the broker/testbed priority so the broker path started being exercised — activating exc_info=True in _create_connection for the first time. Fixed by: dropping exc_info=True, propagating the real exception message through the broker response (was returning {"status": "error", "result": false} with no "error" key, causing the fallback to "Unknown broker error"), and reducing intermediate log levels to debug so the final user-facing ERROR appears exactly once.

Broker failure-mode test coverage (#486): The existing broker tests only covered the happy path. This PR adds integration tests that exercise the broker against mocked pyATS devices at the device.connect() boundary, covering: broker unreachable, device unreachable, disconnect-on-error, and full shutdown. Using a real Unix socket and real async protocol means these tests catch bugs that pure unit tests cannot reach.

Deadlock in BrokerClient.connect() on failure (discovered via #486): Writing the failure-mode tests uncovered a deadlock: when the broker returned a connection error, BrokerClient.connect() did not release internal state correctly, leaving the client permanently stuck. The test_ensure_connection_returns_false_when_broker_not_running test would have hung indefinitely without the fix.

Broker socket validation and graceful fallback (#485): Previously, the broker path was taken whenever NAC_TEST_BROKER_SOCKET was set (presence-only). This PR validates that the env var points to a real Unix socket via Path.is_socket(). If it’s set but invalid (missing path, regular file, directory), a warning is logged and we fall back to direct testbed connection.

Tests added/expanded

Unit tests (tests/unit/pyats_core/test_connection_broker.py)

Covers broker-side request processing and health checks, including:

  • _is_connection_healthy() edge cases (attribute access errors / missing attrs)
  • _process_request() missing-field error messages are specific (e.g., mentions hostname / cmd)

Unit tests (tests/unit/test_ssh_base_test_validation.py)

Covers broker socket env handling:

  • Use broker when NAC_TEST_BROKER_SOCKET points to a valid Unix socket
  • Warn + fall back when it points to missing path / regular file / directory
  • Raise when broker socket is invalid and no testbed device exists

Integration tests (tests/integration/test_connection_broker_failure_modes.py)

Covers socket protocol + client/broker interaction, including:

  • Broker unavailable: missing socket, stale socket, socket path is dir/file
  • Device connect failure: error message preserved (not “Unknown broker error”)
  • Socket lifecycle: unlink socket mid-run (existing connection works; new client fails)
  • Connection health: reconnect when cached connection unhealthy
  • Concurrency: concurrent connect requests dedupe to one device.connect (single-process + multi-process)
  • Cleanup: disconnect cleanup even when device.disconnect raises; shutdown disconnects devices + removes socket
  • Protocol/request hardening:
    • Unknown/missing command errors
    • Missing required parameters (parametrized)
    • Invalid protocol closes connection (parametrized): truncated frame, non-JSON, bad UTF-8, zero-length, absurd length prefix
    • Valid JSON but wrong type returns framed JSON error response
    • Sequential requests over one connection: handled error does not poison the connection; invalid protocol closes it

Timeout enhancements

Follow-up work tracked in #735 (BrokerClient timeouts + detailed errors).

#486 scenarios not covered in this PR (and why)

Some requested scenarios require product-level behavior decisions and/or timeouts to be non-flaky:

To ensure these remain tracked after closing #486, I created: #737 “test: cover remaining ConnectionBroker failure modes (crash + timeouts)”.

Closes

Related Issue(s)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactoring / Technical debt (internal improvements with no user-facing changes)
  • Documentation update
  • Chore (build process, CI, tooling, dependencies)
  • Other (please describe):

Test Framework Affected

  • PyATS
  • Robot Framework
  • Both
  • N/A (not test-framework specific)

Network as Code (NaC) Architecture Affected

  • ACI (APIC)
  • NDO (Nexus Dashboard Orchestrator)
  • NDFC / VXLAN-EVPN (Nexus Dashboard Fabric Controller)
  • Catalyst SD-WAN (SDWAN Manager / vManage)
  • Catalyst Center (DNA Center)
  • ISE (Identity Services Engine)
  • FMC (Firepower Management Center)
  • Meraki (Cloud-managed)
  • NX-OS (Nexus Direct-to-Device)
  • IOS-XE (Direct-to-Device)
  • IOS-XR (Direct-to-Device)
  • Hyperfabric
  • All architectures
  • N/A (architecture-agnostic)

Platform Tested

  • macOS (version tested: 15.x)
  • Linux (distro/version tested: )

Key Changes

  • connection_broker.py: drop exc_info=True from _create_connection, raise on failure instead of returning None, propagate the real error message in the broker response
  • broker_client.py: fix deadlock in connect() on connection failure; downgrade intermediate error logs to debug to eliminate duplicate console output
  • ssh_base_test.py: validate NAC_TEST_BROKER_SOCKET contents with Path.is_socket()
  • Tests: see above

Testing Done

  • Unit tests added/updated
  • Integration tests performed
  • Manual testing performed:
    • PyATS tests executed successfully
    • Robot Framework tests executed successfully
    • D2D/SSH tests executed successfully (if applicable)
    • HTML reports generated correctly
  • All existing tests pass (pytest / pre-commit run -a)

Test Commands Used

pytest tests/unit/ -q
pre-commit run -a

Checklist

  • Code follows project style guidelines (pre-commit run -a passes)
  • Self-review of code completed
  • Code is commented where necessary (especially complex logic)
  • Documentation updated (if applicable)
  • No new warnings introduced
  • Changes work on both macOS and Linux
  • CHANGELOG.md updated (if applicable)

Screenshots (if applicable)

Additional Notes

The BrokerClient.connect() deadlock was not directly reported — it was found while writing the failure-mode test coverage requested in #486. That test (test_ensure_connection_returns_false_when_broker_not_running) would have hung indefinitely without the fix.

oboehmer and others added 14 commits March 30, 2026 09:23
Replace verbose multi-layer error logging with a single actionable
message at the source. Removes exc_info=True from _create_connection
(full trace is in the Unicon log file), propagates the actual error
message in the broker's connect response instead of "Unknown broker
error", and demotes intermediate broker_client logs to debug level.
In ssh_base_test, ConnectionErrors are re-raised silently since the
broker already logged them.
…lure

When connect() failed while holding _connection_lock, the exception
handler called disconnect() which also tried to acquire the same lock,
causing a deadlock. Inline the cleanup instead.
Cover the error paths that e2e happy-path tests cannot reach: broker
unreachable, device connect failure, disconnect raising, and shutdown
with active connections. Shared fixtures (socket_dir, good_device,
bad_device, make_broker) live in a new conftest.py; health-check states
are parametrized.
Switch NAC_TEST_BROKER_SOCKET detection from Path.exists() to
Path.is_socket(), so regular files and directories are treated the same
as a missing path — all three trigger the fallback warning and fall
through to the direct testbed connection instead of attempting a
broker connect against a non-socket path.

Test changes:
- Replace sock.touch() with a real AF_UNIX socket bind so the
  "happy path" test exercises is_socket() correctly.
- Collapse the single missing-path fallback test into a parametrized
  test_falls_back_for_non_socket_paths covering all three rejection
  cases: missing path, regular file, and directory.
- Add socket_dir fixture to tests/unit/conftest.py (short-path temp
  dir that stays within macOS's 104-char Unix socket path limit),
  available to all unit tests.
…n duplication

The connect() error path and disconnect() previously contained identical
inline teardown logic (close writer, null reader/writer, reset _connected),
with the subtle difference that disconnect() also called wait_closed().
Any future additions to teardown logic would have needed to be mirrored
in both places.

Extract _teardown_connection() as a private, lock-free helper called by
both sites. disconnect() saves the writer reference before calling it,
then calls wait_closed() on the saved reference — preserving the drain
behaviour on clean disconnects without adding it to the connect() failure
path where there is nothing buffered to flush.
Previously, the broker path was taken whenever NAC_TEST_BROKER_SOCKET was
present in the environment. This could route execution into the broker
even when the value pointed at a non-socket path.

Parse the env var as a Path and require Path.is_socket(); otherwise log a
warning and fall back to direct testbed connection.
Group broker failure-mode integration tests into focused classes for readability.
Add integration coverage for “stale socket file exists but no broker listening” using a real AF_UNIX socket file and a bounded asyncio.wait_for to prevent hangs.
Add failure-mode coverage for when the broker socket is unlinked while the broker is running:\n- existing client connection continues to work\n- new connection attempts fail cleanly\n\nCo-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an integration test that seeds ConnectionBroker.connected_devices with an
unhealthy connection and asserts _get_connection() evicts it and reconnects
to a fresh device.
Add an integration test that runs two BrokerClient instances concurrently
requesting ensure_connection() for the same hostname and asserts the broker
only calls device.connect() once.
test(broker): add socket-path validation + multi-process connect coverage

Add integration coverage for invalid BrokerClient socket paths (dir/file) and
verify multiple subprocesses can concurrently ensure a connection without
hanging; assert the broker only calls device.connect() once.
test(broker): add malformed/invalid protocol integration coverage

Add integration tests that exercise broker behavior for malformed requests
(unknown/missing command) and invalid wire protocol cases (truncated frame,
non-JSON, zero-length frame, bad UTF-8, absurd length prefix). Also validate
that well-framed but wrong-typed JSON (non-object) returns an error response.
Add integration tests for broker command validation (execute/connect/disconnect
missing required fields), disconnect of non-connected devices, framed JSON error
responses over the raw socket protocol, and multi-request behavior on a single
connection (handled error keeps connection usable; invalid protocol closes it).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oboehmer oboehmer marked this pull request as ready for review March 31, 2026 22:04
@oboehmer oboehmer requested a review from aitestino March 31, 2026 22:05
@oboehmer oboehmer self-assigned this Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

1 participant