fix(broker): clean up error output, add failure-mode tests, harden socket validation (#539, #486, #485) by oboehmer · Pull Request #732 · netascode/nac-test

oboehmer · 2026-03-30T09:53:33Z

Description

Addresses three related issues in the ConnectionBroker / BrokerClient / SSHTestBase stack that together made D2D test failures difficult to diagnose and the broker path unreliable.

Noisy connection error output (#539): When a device was unreachable, users saw ~100 lines of Unicon internal state machine tracebacks, a generic "Unknown broker error" message, and the same failure logged 4 times across different call layers. This was introduced by #467, which reordered the broker/testbed priority so the broker path started being exercised — activating exc_info=True in _create_connection for the first time. Fixed by: dropping exc_info=True, propagating the real exception message through the broker response (was returning {"status": "error", "result": false} with no "error" key, causing the fallback to "Unknown broker error"), and reducing intermediate log levels to debug so the final user-facing ERROR appears exactly once.

Broker failure-mode test coverage (#486): The existing broker tests only covered the happy path. This PR adds integration tests that exercise the broker against mocked pyATS devices at the device.connect() boundary, covering: broker unreachable, device unreachable, disconnect-on-error, and full shutdown. Using a real Unix socket and real async protocol means these tests catch bugs that pure unit tests cannot reach.

Deadlock in BrokerClient.connect() on failure (discovered via #486): Writing the failure-mode tests uncovered a deadlock: when the broker returned a connection error, BrokerClient.connect() did not release internal state correctly, leaving the client permanently stuck. The test_ensure_connection_returns_false_when_broker_not_running test would have hung indefinitely without the fix.

Broker socket validation and graceful fallback (#485): Previously, the broker path was taken whenever NAC_TEST_BROKER_SOCKET was set (presence-only). This PR validates that the env var points to a real Unix socket via Path.is_socket(). If it’s set but invalid (missing path, regular file, directory), a warning is logged and we fall back to direct testbed connection.

Tests added/expanded

Unit tests (`tests/unit/pyats_core/test_connection_broker.py`)

Covers broker-side request processing and health checks, including:

_is_connection_healthy() edge cases (attribute access errors / missing attrs)
_process_request() missing-field error messages are specific (e.g., mentions hostname / cmd)

Unit tests (`tests/unit/test_ssh_base_test_validation.py`)

Covers broker socket env handling:

Use broker when NAC_TEST_BROKER_SOCKET points to a valid Unix socket
Warn + fall back when it points to missing path / regular file / directory
Raise when broker socket is invalid and no testbed device exists

Integration tests (`tests/integration/test_connection_broker_failure_modes.py`)

Covers socket protocol + client/broker interaction, including:

Broker unavailable: missing socket, stale socket, socket path is dir/file
Device connect failure: error message preserved (not “Unknown broker error”)
Socket lifecycle: unlink socket mid-run (existing connection works; new client fails)
Connection health: reconnect when cached connection unhealthy
Concurrency: concurrent connect requests dedupe to one device.connect (single-process + multi-process)
Cleanup: disconnect cleanup even when device.disconnect raises; shutdown disconnects devices + removes socket
Protocol/request hardening:
- Unknown/missing command errors
- Missing required parameters (parametrized)
- Invalid protocol closes connection (parametrized): truncated frame, non-JSON, bad UTF-8, zero-length, absurd length prefix
- Valid JSON but wrong type returns framed JSON error response
- Sequential requests over one connection: handled error does not poison the connection; invalid protocol closes it

Timeout enhancements

Follow-up work tracked in #735 (BrokerClient timeouts + detailed errors).

#486 scenarios not covered in this PR (and why)

Some requested scenarios require product-level behavior decisions and/or timeouts to be non-flaky:

Broker crash mid-run (undefined desired behavior: abort vs fallback vs restart) → Define behavior when ConnectionBroker crashes mid-run #736
Command execution / socket IO timeout behavior (needs product timeout semantics, not just wait_for in tests) → BrokerClient: add detailed ensure_connection errors and timeouts #735

To ensure these remain tracked after closing #486, I created: #737 “test: cover remaining ConnectionBroker failure modes (crash + timeouts)”.

Closes

Related Issue(s)

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Refactoring / Technical debt (internal improvements with no user-facing changes)
Documentation update
Chore (build process, CI, tooling, dependencies)
Other (please describe):

Test Framework Affected

PyATS
Robot Framework
Both
N/A (not test-framework specific)

Network as Code (NaC) Architecture Affected

Platform Tested

macOS (version tested: 15.x)
Linux (distro/version tested: )

Key Changes

connection_broker.py: drop exc_info=True from _create_connection, raise on failure instead of returning None, propagate the real error message in the broker response
broker_client.py: fix deadlock in connect() on connection failure; downgrade intermediate error logs to debug to eliminate duplicate console output
ssh_base_test.py: validate NAC_TEST_BROKER_SOCKET contents with Path.is_socket()
Tests: see above

Testing Done

Unit tests added/updated
Integration tests performed
Manual testing performed:
- PyATS tests executed successfully
- Robot Framework tests executed successfully
- D2D/SSH tests executed successfully (if applicable)
- HTML reports generated correctly
All existing tests pass (pytest / pre-commit run -a)

Test Commands Used

pytest tests/unit/ -q
pre-commit run -a

Checklist

Code follows project style guidelines (pre-commit run -a passes)
Self-review of code completed
Code is commented where necessary (especially complex logic)
Documentation updated (if applicable)
No new warnings introduced
Changes work on both macOS and Linux
CHANGELOG.md updated (if applicable)

Screenshots (if applicable)

Additional Notes

The BrokerClient.connect() deadlock was not directly reported — it was found while writing the failure-mode test coverage requested in #486. That test (test_ensure_connection_returns_false_when_broker_not_running) would have hung indefinitely without the fix.

Replace verbose multi-layer error logging with a single actionable message at the source. Removes exc_info=True from _create_connection (full trace is in the Unicon log file), propagates the actual error message in the broker's connect response instead of "Unknown broker error", and demotes intermediate broker_client logs to debug level. In ssh_base_test, ConnectionErrors are re-raised silently since the broker already logged them.

…lure When connect() failed while holding _connection_lock, the exception handler called disconnect() which also tried to acquire the same lock, causing a deadlock. Inline the cleanup instead.

Cover the error paths that e2e happy-path tests cannot reach: broker unreachable, device connect failure, disconnect raising, and shutdown with active connections. Shared fixtures (socket_dir, good_device, bad_device, make_broker) live in a new conftest.py; health-check states are parametrized.

Switch NAC_TEST_BROKER_SOCKET detection from Path.exists() to Path.is_socket(), so regular files and directories are treated the same as a missing path — all three trigger the fallback warning and fall through to the direct testbed connection instead of attempting a broker connect against a non-socket path. Test changes: - Replace sock.touch() with a real AF_UNIX socket bind so the "happy path" test exercises is_socket() correctly. - Collapse the single missing-path fallback test into a parametrized test_falls_back_for_non_socket_paths covering all three rejection cases: missing path, regular file, and directory. - Add socket_dir fixture to tests/unit/conftest.py (short-path temp dir that stays within macOS's 104-char Unix socket path limit), available to all unit tests.

…n duplication The connect() error path and disconnect() previously contained identical inline teardown logic (close writer, null reader/writer, reset _connected), with the subtle difference that disconnect() also called wait_closed(). Any future additions to teardown logic would have needed to be mirrored in both places. Extract _teardown_connection() as a private, lock-free helper called by both sites. disconnect() saves the writer reference before calling it, then calls wait_closed() on the saved reference — preserving the drain behaviour on clean disconnects without adding it to the connect() failure path where there is nothing buffered to flush.

Previously, the broker path was taken whenever NAC_TEST_BROKER_SOCKET was present in the environment. This could route execution into the broker even when the value pointed at a non-socket path. Parse the env var as a Path and require Path.is_socket(); otherwise log a warning and fall back to direct testbed connection.

Group broker failure-mode integration tests into focused classes for readability. Add integration coverage for “stale socket file exists but no broker listening” using a real AF_UNIX socket file and a bounded asyncio.wait_for to prevent hangs.

Add failure-mode coverage for when the broker socket is unlinked while the broker is running:\n- existing client connection continues to work\n- new connection attempts fail cleanly\n\nCo-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add an integration test that seeds ConnectionBroker.connected_devices with an unhealthy connection and asserts _get_connection() evicts it and reconnects to a fresh device.

Add an integration test that runs two BrokerClient instances concurrently requesting ensure_connection() for the same hostname and asserts the broker only calls device.connect() once.

test(broker): add socket-path validation + multi-process connect coverage Add integration coverage for invalid BrokerClient socket paths (dir/file) and verify multiple subprocesses can concurrently ensure a connection without hanging; assert the broker only calls device.connect() once.

test(broker): add malformed/invalid protocol integration coverage Add integration tests that exercise broker behavior for malformed requests (unknown/missing command) and invalid wire protocol cases (truncated frame, non-JSON, zero-length frame, bad UTF-8, absurd length prefix). Also validate that well-framed but wrong-typed JSON (non-object) returns an error response.

Add integration tests for broker command validation (execute/connect/disconnect missing required fields), disconnect of non-connected devices, framed JSON error responses over the raw socket protocol, and multi-request behavior on a single connection (handled error keeps connection usable; invalid protocol closes it). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

oboehmer and others added 14 commits March 30, 2026 09:23

fix(broker): fix deadlock in BrokerClient.connect() on connection fai…

21a239d

…lure When connect() failed while holding _connection_lock, the exception handler called disconnect() which also tried to acquire the same lock, causing a deadlock. Inline the cleanup instead.

move broker failure-mode tests to integration and tighten assertions

9406e3d

test(broker): cover socket deletion mid-run

fb08d8b

Add failure-mode coverage for when the broker socket is unlinked while the broker is running:\n- existing client connection continues to work\n- new connection attempts fail cleanly\n\nCo-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

test(broker): cover reconnect when cached connection is unhealthy

f7b407b

Add an integration test that seeds ConnectionBroker.connected_devices with an unhealthy connection and asserts _get_connection() evicts it and reconnects to a fresh device.

test(broker): add concurrent connect race coverage

b955af1

Add an integration test that runs two BrokerClient instances concurrently requesting ensure_connection() for the same hostname and asserts the broker only calls device.connect() once.

oboehmer mentioned this pull request Mar 31, 2026

test: cover remaining ConnectionBroker failure modes (crash + timeouts) #737

Open

oboehmer marked this pull request as ready for review March 31, 2026 22:04

oboehmer requested a review from aitestino March 31, 2026 22:05

oboehmer added the prio: high label Mar 31, 2026

oboehmer self-assigned this Apr 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(broker): clean up error output, add failure-mode tests, harden socket validation (#539, #486, #485)#732

fix(broker): clean up error output, add failure-mode tests, harden socket validation (#539, #486, #485)#732
oboehmer wants to merge 14 commits intomainfrom
feat/539-486-585-broker-errors-hardening

oboehmer commented Mar 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

oboehmer commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Tests added/expanded

Unit tests (tests/unit/pyats_core/test_connection_broker.py)

Unit tests (tests/unit/test_ssh_base_test_validation.py)

Integration tests (tests/integration/test_connection_broker_failure_modes.py)

Timeout enhancements

#486 scenarios not covered in this PR (and why)

Closes

Related Issue(s)

Type of Change

Test Framework Affected

Network as Code (NaC) Architecture Affected

Platform Tested

Key Changes

Testing Done

Test Commands Used

Checklist

Screenshots (if applicable)

Additional Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

oboehmer commented Mar 30, 2026 •

edited

Loading

Unit tests (`tests/unit/pyats_core/test_connection_broker.py`)

Unit tests (`tests/unit/test_ssh_base_test_validation.py`)

Integration tests (`tests/integration/test_connection_broker_failure_modes.py`)