fix(broker): clean up error output, add failure-mode tests, harden socket validation (#539, #486, #485)#732
Open
fix(broker): clean up error output, add failure-mode tests, harden socket validation (#539, #486, #485)#732
Conversation
Replace verbose multi-layer error logging with a single actionable message at the source. Removes exc_info=True from _create_connection (full trace is in the Unicon log file), propagates the actual error message in the broker's connect response instead of "Unknown broker error", and demotes intermediate broker_client logs to debug level. In ssh_base_test, ConnectionErrors are re-raised silently since the broker already logged them.
…lure When connect() failed while holding _connection_lock, the exception handler called disconnect() which also tried to acquire the same lock, causing a deadlock. Inline the cleanup instead.
Cover the error paths that e2e happy-path tests cannot reach: broker unreachable, device connect failure, disconnect raising, and shutdown with active connections. Shared fixtures (socket_dir, good_device, bad_device, make_broker) live in a new conftest.py; health-check states are parametrized.
Switch NAC_TEST_BROKER_SOCKET detection from Path.exists() to Path.is_socket(), so regular files and directories are treated the same as a missing path — all three trigger the fallback warning and fall through to the direct testbed connection instead of attempting a broker connect against a non-socket path. Test changes: - Replace sock.touch() with a real AF_UNIX socket bind so the "happy path" test exercises is_socket() correctly. - Collapse the single missing-path fallback test into a parametrized test_falls_back_for_non_socket_paths covering all three rejection cases: missing path, regular file, and directory. - Add socket_dir fixture to tests/unit/conftest.py (short-path temp dir that stays within macOS's 104-char Unix socket path limit), available to all unit tests.
…n duplication The connect() error path and disconnect() previously contained identical inline teardown logic (close writer, null reader/writer, reset _connected), with the subtle difference that disconnect() also called wait_closed(). Any future additions to teardown logic would have needed to be mirrored in both places. Extract _teardown_connection() as a private, lock-free helper called by both sites. disconnect() saves the writer reference before calling it, then calls wait_closed() on the saved reference — preserving the drain behaviour on clean disconnects without adding it to the connect() failure path where there is nothing buffered to flush.
Previously, the broker path was taken whenever NAC_TEST_BROKER_SOCKET was present in the environment. This could route execution into the broker even when the value pointed at a non-socket path. Parse the env var as a Path and require Path.is_socket(); otherwise log a warning and fall back to direct testbed connection.
Group broker failure-mode integration tests into focused classes for readability. Add integration coverage for “stale socket file exists but no broker listening” using a real AF_UNIX socket file and a bounded asyncio.wait_for to prevent hangs.
Add failure-mode coverage for when the broker socket is unlinked while the broker is running:\n- existing client connection continues to work\n- new connection attempts fail cleanly\n\nCo-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an integration test that seeds ConnectionBroker.connected_devices with an unhealthy connection and asserts _get_connection() evicts it and reconnects to a fresh device.
Add an integration test that runs two BrokerClient instances concurrently requesting ensure_connection() for the same hostname and asserts the broker only calls device.connect() once.
test(broker): add socket-path validation + multi-process connect coverage Add integration coverage for invalid BrokerClient socket paths (dir/file) and verify multiple subprocesses can concurrently ensure a connection without hanging; assert the broker only calls device.connect() once.
test(broker): add malformed/invalid protocol integration coverage Add integration tests that exercise broker behavior for malformed requests (unknown/missing command) and invalid wire protocol cases (truncated frame, non-JSON, zero-length frame, bad UTF-8, absurd length prefix). Also validate that well-framed but wrong-typed JSON (non-object) returns an error response.
Add integration tests for broker command validation (execute/connect/disconnect missing required fields), disconnect of non-connected devices, framed JSON error responses over the raw socket protocol, and multi-request behavior on a single connection (handled error keeps connection usable; invalid protocol closes it). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Addresses three related issues in the
ConnectionBroker/BrokerClient/SSHTestBasestack that together made D2D test failures difficult to diagnose and the broker path unreliable.Noisy connection error output (#539): When a device was unreachable, users saw ~100 lines of Unicon internal state machine tracebacks, a generic "Unknown broker error" message, and the same failure logged 4 times across different call layers. This was introduced by #467, which reordered the broker/testbed priority so the broker path started being exercised — activating
exc_info=Truein_create_connectionfor the first time. Fixed by: droppingexc_info=True, propagating the real exception message through the broker response (was returning{"status": "error", "result": false}with no"error"key, causing the fallback to "Unknown broker error"), and reducing intermediate log levels todebugso the final user-facing ERROR appears exactly once.Broker failure-mode test coverage (#486): The existing broker tests only covered the happy path. This PR adds integration tests that exercise the broker against mocked pyATS devices at the
device.connect()boundary, covering: broker unreachable, device unreachable, disconnect-on-error, and full shutdown. Using a real Unix socket and real async protocol means these tests catch bugs that pure unit tests cannot reach.Deadlock in
BrokerClient.connect()on failure (discovered via #486): Writing the failure-mode tests uncovered a deadlock: when the broker returned a connection error,BrokerClient.connect()did not release internal state correctly, leaving the client permanently stuck. Thetest_ensure_connection_returns_false_when_broker_not_runningtest would have hung indefinitely without the fix.Broker socket validation and graceful fallback (#485): Previously, the broker path was taken whenever
NAC_TEST_BROKER_SOCKETwas set (presence-only). This PR validates that the env var points to a real Unix socket viaPath.is_socket(). If it’s set but invalid (missing path, regular file, directory), a warning is logged and we fall back to direct testbed connection.Tests added/expanded
Unit tests (
tests/unit/pyats_core/test_connection_broker.py)Covers broker-side request processing and health checks, including:
_is_connection_healthy()edge cases (attribute access errors / missing attrs)_process_request()missing-field error messages are specific (e.g., mentionshostname/cmd)Unit tests (
tests/unit/test_ssh_base_test_validation.py)Covers broker socket env handling:
NAC_TEST_BROKER_SOCKETpoints to a valid Unix socketIntegration tests (
tests/integration/test_connection_broker_failure_modes.py)Covers socket protocol + client/broker interaction, including:
Timeout enhancements
Follow-up work tracked in #735 (BrokerClient timeouts + detailed errors).
#486 scenarios not covered in this PR (and why)
Some requested scenarios require product-level behavior decisions and/or timeouts to be non-flaky:
wait_forin tests) → BrokerClient: add detailed ensure_connection errors and timeouts #735To ensure these remain tracked after closing #486, I created: #737 “test: cover remaining ConnectionBroker failure modes (crash + timeouts)”.
Closes
Related Issue(s)
Type of Change
Test Framework Affected
Network as Code (NaC) Architecture Affected
Platform Tested
Key Changes
connection_broker.py: dropexc_info=Truefrom_create_connection, raise on failure instead of returningNone, propagate the real error message in the broker responsebroker_client.py: fix deadlock inconnect()on connection failure; downgrade intermediate error logs todebugto eliminate duplicate console outputssh_base_test.py: validateNAC_TEST_BROKER_SOCKETcontents withPath.is_socket()Testing Done
pytest/pre-commit run -a)Test Commands Used
Checklist
pre-commit run -apasses)Screenshots (if applicable)
Additional Notes
The
BrokerClient.connect()deadlock was not directly reported — it was found while writing the failure-mode test coverage requested in #486. That test (test_ensure_connection_returns_false_when_broker_not_running) would have hung indefinitely without the fix.