
Fix/streaming timestamps#105

Open
mcurrier2 wants to merge 6 commits into main from fix/streaming-timestamps

Conversation


@mcurrier2 mcurrier2 commented Mar 11, 2026

Description

Fix streaming output timestamps (timestamp_ns in JSON/CSV per-message records) across all code paths; see the detailed summary below.

Type of Change

  • [x] Bug fix
  • [ ] New feature
  • [ ] Breaking change
  • [ ] Documentation update

Testing

  • [x] Tests pass locally
  • [x] Added tests for new functionality
  • [x] Updated documentation

Checklist

  • [x] Code follows style guidelines
  • [x] Self-review completed
  • [ ] Comments added for complex code
  • [x] Documentation updated
  • [ ] No breaking changes (or marked as breaking)

PR: Fix Streaming Output Timestamps Across All Code Paths

Summary

Streaming output timestamps (timestamp_ns in JSON/CSV per-message
records) were inaccurate across multiple code paths. Instead of
recording when each message was actually sent, timestamps were captured
at record-creation time — either post-test during batch iteration or
at file-read time. This caused all timestamps within a test run to
cluster into the same second, making time-series analysis of streaming
data meaningless.

Because this tool's primary purpose is precision measurement, incorrect
timestamps in the streaming output undermine the validity of per-message
latency records for any downstream analysis or visualization.

Branch: fix/streaming-timestamps
Base: main
Files changed: 11 (496 insertions, 60 deletions)


Root Cause

Commit e0ff7fa (Oct 8, 2025, PR #82) refactored the benchmark runner
to collect latencies inside spawned async futures, then batch-create
MessageLatencyRecord objects after the futures completed. This moved
the SystemTime::now() call from inside the hot loop (where it
reflected each message's send time) to the post-test iteration (where
it reflected the moment the record was created — all within the same
second).

The initial fix on this branch (68afe07) changed the
MessageLatencyRecord API to accept send_timestamp_ns as a
parameter, but only the blocking round-trip path was actually capturing
timestamps at send time. Three other code path families remained broken.
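The shape of that API change can be sketched as follows. This is a minimal reduction for illustration only: the real MessageLatencyRecord in src/results.rs also carries mechanism, message-size, latency-type, and latency fields.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Illustrative reduced record type; the real struct has more fields.
struct MessageLatencyRecord {
    message_id: u64,
    timestamp_ns: u64, // wall-clock send time, now supplied by the caller
}

impl MessageLatencyRecord {
    // After the fix: the caller captures and passes the send timestamp,
    // instead of the constructor calling SystemTime::now() internally.
    fn new(message_id: u64, send_timestamp_ns: u64) -> Self {
        Self { message_id, timestamp_ns: send_timestamp_ns }
    }

    // Helper for call sites that need the current wall-clock time.
    fn current_timestamp_ns() -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap_or_default()
            .as_nanos() as u64
    }
}

fn main() {
    let ts = MessageLatencyRecord::current_timestamp_ns();
    let rec = MessageLatencyRecord::new(0, ts);
    assert_eq!(rec.timestamp_ns, ts);
    println!("record 0 stamped at {} ns", rec.timestamp_ns);
}
```

The key property is that the constructor is now a pure function of its inputs: identical calls made in a tight post-test loop can no longer produce fresh, clustered timestamps.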


Problems and Fixes

1. Async round-trip timestamps captured post-test

Problem: The client future returned Vec<Duration>. After the
future completed, the post-test loop called current_timestamp_ns()
at record-creation time. All records received the same timestamp.

Fix: Changed the future to capture current_timestamp_ns() before
each send() and return Vec<(Duration, u64)>. The post-test loop
uses the captured timestamp.

// Inside the client future, before each send:
let wall_ts = MessageLatencyRecord::current_timestamp_ns();
let send_time = Instant::now();
client_transport.send(&message).await?;
// ...
latencies.push((send_time.elapsed(), wall_ts));

// Post-test loop:
for (i, (latency, wall_ts)) in latencies.iter().enumerate() {
    let record = MessageLatencyRecord::new(
        i as u64, mechanism, msg_size,
        LatencyType::RoundTrip, *latency, *wall_ts,
    );
    // record is then written to the streaming output (elided)
}

Applies to: Both duration-based and count-based async round-trip
loops in src/benchmark.rs.

2. Async combined timestamps captured post-test

Problem: Same pattern as round-trip — the combined test (one-way +
round-trip in a single run) collected latency durations inside the
spawned future and created records post-test with stale timestamps.

Fix: Changed one_way_latencies to Vec<(Duration, u64)> to
carry the wall-clock timestamp alongside each measurement. The
timestamp is captured before each send() inside the future.

Applies to: Both duration-based and count-based async combined
loops in src/benchmark.rs.
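The combined-path change follows the same pattern as the round-trip fix. A simplified, self-contained sketch (the real loop performs the send over client_transport inside a spawned async future, elided here):

```rust
use std::time::{Duration, Instant, SystemTime, UNIX_EPOCH};

fn now_ns() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_nanos() as u64
}

fn main() {
    // Each entry carries its own wall-clock send time, captured before
    // the (here simulated) send, not at record-creation time.
    let mut one_way_latencies: Vec<(Duration, u64)> = Vec::new();
    for _ in 0..3 {
        let wall_ts = now_ns();          // captured before send()
        let send_time = Instant::now();
        // client_transport.send(&message).await?  // real send elided
        one_way_latencies.push((send_time.elapsed(), wall_ts));
    }
    // Post-test: records use the captured per-message timestamp.
    for (i, (latency, wall_ts)) in one_way_latencies.iter().enumerate() {
        println!("msg {}: {} ns latency, sent at {} ns", i, latency.as_nanos(), wall_ts);
    }
}
```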

3. One-way timestamps from file-read time instead of send time

Problem: One-way tests use a server process that measures receive
latency and writes results to a temporary file. The original format
was one latency_ns value per line. The client read this file
post-test and called current_timestamp_ns() at file-read time —
all records received the same timestamp.

Fix (server side): Both async and blocking server loops now write
wall_send_ns,latency_ns per line. The wall-clock send time is
computed as SystemTime::now() - latency_ns, approximating when the
message entered the IPC channel.

// Server, after measuring latency:
let wall_now_ns = SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .unwrap_or_default()
    .as_nanos() as u64;
let wall_send_ns = wall_now_ns.saturating_sub(latency_ns);
writeln!(file, "{},{}", wall_send_ns, latency_ns).ok();

Fix (client side): A new parse_latency_file_line() function
parses the two-field format. Both async and blocking file readers
use the parsed wall_send_ns as send_timestamp_ns.

Applies to: src/main.rs (both server loops), src/benchmark.rs
(async reader), src/benchmark_blocking.rs (blocking reader).
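A plausible reconstruction of the parser behavior implied by the unit tests below (the actual function in src/benchmark.rs may differ in signature and error type; Option is used here for brevity):

```rust
// Hypothetical reconstruction of parse_latency_file_line().
// Parses "wall_send_ns,latency_ns" into (wall_send_ns, latency_ns).
fn parse_latency_file_line(line: &str) -> Option<(u64, u64)> {
    let mut parts = line.split(',');
    let wall_send_ns = parts.next()?.trim().parse::<u64>().ok()?;
    let latency_ns = parts.next()?.trim().parse::<u64>().ok()?;
    // Reject extra fields such as "1,2,3".
    if parts.next().is_some() {
        return None;
    }
    Some((wall_send_ns, latency_ns))
}

fn main() {
    assert_eq!(parse_latency_file_line("1700000000000000000,42000"),
               Some((1_700_000_000_000_000_000, 42_000)));
    assert_eq!(parse_latency_file_line("0,0"), Some((0, 0)));
    assert_eq!(parse_latency_file_line("42000"), None);   // missing comma
    assert_eq!(parse_latency_file_line(""), None);        // empty string
    assert_eq!(parse_latency_file_line("abc,789"), None); // non-numeric first
    assert_eq!(parse_latency_file_line("123,xyz"), None); // non-numeric second
    assert_eq!(parse_latency_file_line("1,2,3"), None);   // extra commas
}
```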


Known Limitations

Wall-clock / monotonic clock mixing in one-way path

For one-way tests, the server computes the send timestamp by
subtracting the measured monotonic latency from its current wall-clock
time. This mixes two clock domains:

  • latency_ns = monotonic_receive - monotonic_send
    (from message timestamp)
  • wall_send_ns = SystemTime::now() - latency_ns

If NTP adjusts the system clock between message send and receive, the
computed wall_send_ns will be slightly off. This is the best
approximation available without clock synchronization between the
client and server processes. The error is bounded by the magnitude
of any NTP adjustment during the test (typically microseconds).

Timestamp capture ordering

In the round-trip and combined futures, current_timestamp_ns() is
captured one instruction before Instant::now(). The wall-clock
timestamp therefore slightly predates the monotonic measurement start.
The gap is single-digit nanoseconds — orders of magnitude below the
IPC latencies being measured.


Tests Added

Unit tests for parse_latency_file_line (7 tests)

| Test | Coverage |
| --- | --- |
| test_parse_latency_file_line_valid | Happy path: "170...,42000" |
| test_parse_latency_file_line_zeros | Edge case: "0,0" |
| test_parse_latency_file_line_missing_comma | Error: single value |
| test_parse_latency_file_line_empty | Error: empty string |
| test_parse_latency_file_line_non_numeric_first | Error: "abc,789" |
| test_parse_latency_file_line_non_numeric_second | Error: "123,xyz" |
| test_parse_latency_file_line_extra_commas | Error: "1,2,3" |

Enhanced end-to-end streaming tests (2 tests)

Both test_one_way_streaming_captures_send_timestamp and
test_round_trip_streaming_captures_send_timestamp now validate:

  1. All timestamp_ns values fall within the test execution window
    (before_ns <= ts <= after_ns)
  2. Timestamps are not all identical (which was the original bug)
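Assuming before_ns and after_ns are wall-clock readings taken just before and after the benchmark run, the two checks reduce to something like this sketch (validate is a hypothetical helper, not the actual test code):

```rust
// Sketch of the two end-to-end validations on collected timestamp_ns values.
// Assumes at least two messages were streamed during the run.
fn validate(timestamps: &[u64], before_ns: u64, after_ns: u64) {
    // 1. Every timestamp falls within the test execution window.
    assert!(timestamps.iter().all(|&ts| before_ns <= ts && ts <= after_ns));
    // 2. Timestamps are not all identical (the original bug).
    assert!(timestamps.windows(2).any(|w| w[0] != w[1]));
}

fn main() {
    validate(&[100, 150, 200], 90, 210);
    println!("streaming timestamp checks passed");
}
```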

Pre-existing timestamp API tests (3 tests in results.rs)

  • test_new_uses_provided_send_timestamp
  • test_new_combined_uses_provided_send_timestamp
  • test_current_timestamp_ns_returns_recent_value

Documentation Added

README.md: Streaming Output Columns

Added a new "Streaming Output Columns" section documenting all six
per-message streaming columns with types, descriptions, and nullable
semantics:

| Column | Type | Description |
| --- | --- | --- |
| timestamp_ns | u64 | Wall-clock time message was sent |
| message_id | u64 | Zero-based message identifier |
| mechanism | string | IPC mechanism name |
| message_size | u64 | Payload size in bytes |
| one_way_latency_ns | u64/null | One-way latency |
| round_trip_latency_ns | u64/null | Round-trip latency |

Includes a note on timestamp_ns accuracy for one-way tests
(wall-clock / monotonic clock mixing).


Validation

Unit and Integration Tests

$ cargo test
test result: ok. 265 passed; 0 failed; 1 ignored
(+ 42 doc tests, 27 integration tests — 334 total)

$ cargo clippy --all-targets -- -D warnings
# zero warnings

$ cargo fmt --check
# clean

$ scripts/msrv-check.sh
[MSRV] Rust 1.70 build/tests passed.

Functional Verification

3-second duration benchmarks across all mechanisms with streaming
output enabled. Timestamps validated to span the full test window:

| Mechanism | Mode | First Timestamp | Last Timestamp | Delta (s) | Messages |
| --- | --- | --- | --- | --- | --- |
| UDS | Round-Trip | 13:13:48.101628 | 13:13:51.101572 | 3.000 | 213,424 |
| TCP | Round-Trip | 13:13:55.154037 | 13:13:58.153894 | 3.000 | 132,423 |
| SHM | One-Way | 13:17:32.079542 | 13:17:35.079503 | 3.000 | 1,317,468 |
| PMQ | Round-Trip | 13:14:15.407160 | 13:14:18.362355 | 2.955 | 42,687 |

All mechanisms show timestamps distributed across the full test
duration. PMQ delta is 45ms short due to backpressure from shallow
system queue depth (typically 10 messages).

Before/After: Timestamp Distribution

Before fix (main branch): All timestamp_ns values in streaming
output were within the same second, regardless of test duration. A
3-second test with 200,000+ messages would show all timestamps
clustering around a single epoch second.

After fix: Timestamps span the full test duration. Each message's
timestamp_ns reflects the approximate wall-clock time it was sent,
enabling meaningful time-series analysis of per-message latency data.


Files Changed

| File | Change |
| --- | --- |
| src/benchmark.rs | Capture wall_ts inside async round-trip and combined futures before each send(); update one-way file reader to parse new wall_send_ns,latency_ns format; add parse_latency_file_line() with 7 unit tests; enhance 2 end-to-end streaming tests with timestamp validation |
| src/benchmark_blocking.rs | Update blocking one-way file reader to parse new format using parse_latency_file_line() |
| src/main.rs | Both async and blocking server loops write wall_send_ns,latency_ns per line (was latency_ns only) |
| README.md | Add "Streaming Output Columns" section with column definitions and accuracy note |
| src/results.rs | MessageLatencyRecord::new() and new_combined() accept send_timestamp_ns parameter; add current_timestamp_ns() helper; 3 API tests |
| src/ipc/tcp_socket.rs | Refactor is_some()/unwrap() to if let (clippy fix) |
| src/ipc/unix_domain_socket.rs | Same clippy fix |
| Cargo.toml | Pin uuid to <1.21 for MSRV compatibility |
| .cargo/audit.toml | Ignore known MSRV-pinned dependency advisories |
| CONFIG.md | Document sequential test execution |
| utils/dashboard/README.md | Note sequential test execution |

Risk Assessment

  • Low risk. The core latency measurement logic (monotonic clock,
    get_monotonic_time_ns()) is untouched. Only the wall-clock
    metadata timestamp in streaming records is changed.
  • Backward compatible. No CLI changes. Streaming output JSON/CSV
    schema is unchanged (same columns, same types). The internal
    server-to-client latency file format changed from latency_ns to
    wall_send_ns,latency_ns, but this file is ephemeral (created and
    deleted within a single benchmark run) and never exposed to users.
  • All 334 tests pass with zero clippy warnings.
  • MSRV 1.70 verified via containerized pre-commit check.

The .cargo/audit.toml entries ignore known vulnerabilities in dependencies
pinned for MSRV 1.70 compatibility:
- RUSTSEC-2026-0007 (bytes): integer overflow, fix requires >= 1.11.1
- RUSTSEC-2026-0009 (time): stack exhaustion, fix requires >= 0.3.47

Both are DoS-only risks, acceptable for a benchmark tool.

AI-assisted-by: Claude claude-4.6-opus-high-thinking (Anthropic)
Made-with: Cursor
…ages

Pass send timestamp to MessageLatencyRecord instead of capturing it when
the record is created. This makes the timestamp represent when the message
was sent, so gaps between timestamps now match the actual latency.

Changes:
- MessageLatencyRecord::new() and new_combined() now require send_timestamp_ns
  parameter instead of capturing SystemTime::now() internally
- Added MessageLatencyRecord::current_timestamp_ns() helper for call sites
  that need to capture the current wall-clock time
- Updated all streaming record creation sites in benchmark.rs and
  benchmark_blocking.rs to capture and pass send timestamps

Cherry-picked from container-to-container-ipc branch (6cb8f9b).
Host-container-specific changes excluded (host_container.rs not present).

AI-assisted-by: Claude claude-4.6-opus-high-thinking (Anthropic)
Made-with: Cursor
…n documentation

- Add 3 new unit tests for MessageLatencyRecord timestamp handling:
  test_new_uses_provided_send_timestamp, test_new_combined_uses_provided_send_timestamp,
  test_current_timestamp_ns_returns_recent_value
- Add 2 end-to-end streaming tests in benchmark.rs:
  test_one_way_streaming_captures_send_timestamp,
  test_round_trip_streaming_captures_send_timestamp
- Document sequential one-way/round-trip test execution in README.md
  (new "Test Execution Order" section), CONFIG.md, dashboard README,
  and run() doc comments in benchmark.rs and benchmark_blocking.rs
- All 332 tests pass (331 passed, 1 ignored), zero clippy warnings

AI-assisted-by: Claude claude-4.6-opus-high-thinking
Made-with: Cursor
- Refactor is_some()/unwrap() anti-pattern in tcp_socket.rs and
  unix_domain_socket.rs to use idiomatic `if let` destructuring,
  fixing clippy::unnecessary_unwrap errors on Rust 1.94.0
- Pin uuid to >=1.17, <1.22 to prevent MSRV CI from resolving
  uuid 1.22+ which depends on getrandom 0.4 (Rust 2024 edition,
  MSRV 1.85), incompatible with our Rust 1.70 MSRV
- All tests passing, clippy clean, formatting verified

AI-assisted-by: Claude claude-4.6-opus-high-thinking
Made-with: Cursor
- uuid 1.21+ (not 1.22+) switched to getrandom ^0.4 which requires
  Rust 2024 edition (MSRV 1.85), incompatible with our Rust 1.70 MSRV
- Pin uuid to >=1.17, <1.21 to ensure getrandom ^0.3 is used

AI-assisted-by: Claude claude-4.6-opus-high-thinking
Made-with: Cursor
…aths

The previous commit (68afe07) added a send_timestamp_ns parameter to
MessageLatencyRecord but only the blocking round-trip path was actually
capturing timestamps at send time. The async round-trip, async combined,
and all one-way paths were still using current_timestamp_ns() at
record-creation time (post-test), causing all timestamps within a run
to cluster into the same second.

Changes:
- Async round-trip: capture wall-clock timestamp inside the spawned
  client future before each send(), return Vec<(Duration, u64)>
- Async combined: same pattern for one-way latency vector
- One-way (async + blocking): server now writes "wall_send_ns,latency_ns"
  per line (wall_send_ns = wall_clock_now - latency); client readers
  parse and use the server-computed send timestamp
- Add parse_latency_file_line() with 7 unit tests covering valid input,
  missing commas, empty lines, non-numeric values, and extra commas
- Enhance existing end-to-end streaming tests to validate timestamps
  fall within the test execution window and are not all identical
- Document streaming output column definitions in README.md including
  timestamp_ns semantics and accuracy note for one-way clock mixing

All 265+ unit tests pass, clippy clean, no scope creep.

AI-assisted-by: Claude claude-4.6-opus-high-thinking (Anthropic)
Made-with: Cursor

@dustinblack dustinblack left a comment


I closed #92 because fresh testing showed the problem was no longer reproducible. Issue #106 was then created the following day as the basis for this PR, but it contains no evidence of an actual problem — no reproduction steps, no sample output showing incorrect timestamps, and no before/after data.

I'd suggest closing this PR and #106 unless concrete evidence can be provided in #106 demonstrating that the specific code paths being fixed here actually produce incorrect timestamps on the current main branch — similar to what I provided in #92.

@mcurrier2
Collaborator Author

These are valid issues/problems. I added to issue #106 showing the problems and before/after scenarios.


@dustinblack dustinblack left a comment


Thanks for adding the evidence to #106 — the before/after data clearly demonstrates the problem across round-trip, combined, and one-way code paths. The bugs are real and the fixes are technically sound. Withdrawing my earlier suggestion to close.

Technical assessment of the changes:

The fixes are correct. Moving SystemTime::now() capture from record-creation time to send time (via the send_timestamp_ns parameter on new()/new_combined()) is the right approach. The one-way server-side approximation (wall_now - latency) mixes clock domains but is clearly documented and is the best available without clock sync. The parse_latency_file_line() parser is clean, and the test coverage is thorough.

Before this can be approved, it needs:

  1. Rebase on main. This branch is based on e826b20, missing PRs #103 and #104. The clippy refactors and .cargo/audit.toml are already on main and will drop out. CI failures should resolve.

  2. Coordinate file format with PR #109. This PR changes the server latency file from latency_ns to wall_send_ns,latency_ns. PR #109's write_latency_buffer() writes the old single-field format. Whichever merges second needs to match. Since #109 is closer to approval, consider whether this PR should rebase on top of #109 or vice versa.

  3. Remove documentation that overlaps with PR #108. The "Streaming Output Columns" and "Test Execution Order" README sections appear in both PRs. Whichever PR owns those sections, the other should drop them to avoid merge conflicts. Since those sections are most relevant here, I'd suggest keeping them in this PR and removing them from #108.

mcurrier2 added a commit that referenced this pull request Mar 25, 2026
Remove sections that belong in the streaming-timestamps PR (#105),
not this buffer-sizing PR:

- "Streaming Output Columns" table and timestamp_ns accuracy note
- "Test Execution Order" section
- Streaming JSON/CSV description rewording
- Round-trip CLI example comment expansion

README diff now only contains buffer-sizing documentation changes:
auto-sizing table, error prevention updates, and example buffer
size correction.

AI-assisted-by: Claude claude-4.6-opus-high-thinking
Made-with: Cursor
ewchong pushed a commit that referenced this pull request Mar 31, 2026
…Hangs (#108)

* fix: Use fixed 64KB buffer for standalone SHM to enable streaming

Previously, standalone SHM mode calculated buffer size to fit ALL
messages (e.g., 6.4MB for 50k messages). This caused the writer to
dump everything instantly while the reader slowly drained using
pthread_cond_timedwait with 500us polling timeouts, leading to huge
accumulated latencies (~489ms for 64B).

The fix uses a fixed 64KB buffer (or 2x message size if larger),
matching container behavior. This enables proper streaming where the
writer blocks when the buffer is full.

Before: 489ms mean latency for 64B/50k messages
After:  1.95ms mean latency for 64B/10k messages (blocking)
        15.85ms mean latency for 64B/10k messages (async)

Also updates test_transport_config_buffer_size_logic to expect the
new fixed-buffer behavior for SHM while keeping the message-count
sizing for TCP/UDS.

Cherry-picked from container-to-container-ipc branch (3b49877).

AI-assisted-by: Claude claude-4.6-opus-high-thinking (Anthropic)
Made-with: Cursor

* test/docs: Add buffer sizing and SHM backpressure tests, update documentation

- Add 9 new tests covering SHM buffer sizing, backpressure, and
  condvar timed-wait behavior:
  - test_shm_large_message_buffer_sizing: verifies 2x msg size path
    when messages exceed 32KB (async)
  - test_shm_duration_mode_uses_fixed_buffer: verifies SHM gets 64KB
    in duration mode, not 1GB TCP/UDS default (async)
  - test_blocking_transport_config_buffer_size_logic: full buffer
    sizing test for blocking mode (SHM, TCP, PMQ, duration)
  - test_blocking_shm_duration_mode_uses_fixed_buffer: SHM 64KB in
    blocking duration mode
  - test_blocking_shm_large_message_buffer_sizing: 2x path (blocking)
  - test_backpressure_with_small_buffer: exercises timed condvar wait
    with 1KB buffer and 20 messages
  - test_payload_integrity_under_backpressure: byte-level payload
    verification through backpressure-induced blocking writes
  - test_ring_buffer_wrap_around_under_backpressure: write_pos wraps
    the circular buffer multiple times under backpressure
  - test_shutdown_detected_during_blocked_write: server closes while
    client is blocked waiting for buffer space
- Update README.md Buffer Size Configuration with per-mechanism
  auto-sizing table and updated error prevention guidance
- Update CONFIG.md SHM defaults from 8192 to 64KB (auto) and add
  automatic buffer sizing explanation
- All tests passing, clippy clean

AI-assisted-by: Claude claude-4.6-opus-high-thinking (Anthropic)
Made-with: Cursor

* test/docs: Add coverage tests and fix stale buffer sizing documentation

- Add test_user_buffer_size_overrides_shm_default (async + blocking):
  verifies user-provided --buffer-size overrides SHM's 64KB default
- Add test_shm_buffer_sizing_at_32kb_boundary (async + blocking):
  tests exact transition where 2*(msg_size+64) crosses 64KB threshold
- Add test_high_volume_condvar_stress: 100 messages through 512-byte
  buffer to stress pthread_cond_timedwait retry loop
- Update create_transport_config_internal doc comment in benchmark.rs
  to describe per-mechanism buffer sizing (SHM, PMQ, TCP/UDS)
- Update Adaptive Buffer Sizing doc in benchmark_blocking.rs to
  describe per-mechanism behavior instead of vague description
- Fix README.md example output: SharedMemory buffer size 10240000 ->
  65536 to reflect new fixed 64KB auto-sizing
- All 340 tests passing, zero clippy warnings

AI-assisted-by: Claude claude-4.6-opus-high-thinking
Made-with: Cursor

* fix(msrv): Pin transitive dependencies to maintain Rust 1.70 compatibility

The CI MSRV job removes Cargo.lock and resolves fresh dependencies.
Several transitive dependencies recently bumped their MSRV above 1.70:

- uuid 1.21+ requires Rust 1.85 → pinned to <1.21
- tempfile 3.25+ pulls getrandom >=0.3,<0.5 which resolves to 0.4.x
  (edition 2024, unparseable by Rust 1.70's cargo) → pinned to <3.25
- zmij 1.0.20+ requires Rust 1.71 → pinned to =1.0.19
- quote 1.0.45+ requires Rust 1.71 → pinned to =1.0.44
- syn 2.0.115+ requires Rust 1.71 → pinned to =2.0.114
- unicode-ident 1.0.23+ requires Rust 1.71 → pinned to =1.0.22

Verified: MSRV builds and tests pass both with and without Cargo.lock
in a Rust 1.70 container. Local clippy, fmt, and tests all clean.

AI-assisted-by: Claude claude-4.6-opus-high-thinking
Made-with: Cursor

* ci: trigger CI rebuild with updated MSRV dependency pins

No code changes. Forces new CI run to pick up dependency pins
from commit cd28295 (uuid <1.21, tempfile <3.25, zmij =1.0.19,
quote =1.0.44, syn =2.0.114, unicode-ident =1.0.22).

AI-assisted-by: Claude claude-4.6-opus-high-thinking
Made-with: Cursor

* fix: Remove out-of-scope condvar/polling code per review

Remove container-IPC code that was out of scope for issue #107
(buffer sizing fix). This scopes the PR to items 4-5 only.

Removed:
- write_data_polling() and read_data_polling() fallback functions
- pthread_cond_timedwait (reverted to pthread_cond_wait)
- Broken-condvar detection (100-iteration/10ms heuristic)
- Mutex-lock-failure fallbacks to polling paths
- 30-second timeout counters (wait_count > 60000)
- test_high_volume_condvar_stress test

Restored from main (PR #104):
- write_data_blocking() signature with timestamp_offset parameter
  so latency measurement excludes backpressure wait time
- read_data_blocking() with clean pthread_cond_wait
- Proper pthread_cond_signal in both write and read paths

The two functional regressions cited in review are resolved:
1. Timestamp regression: write_data_polling() lacked timestamp_offset,
   but that function no longer exists. The only write path now refreshes
   the timestamp after the condvar wait.
2. Missing condvar signal: write_data_polling() never called
   pthread_cond_signal(&data_ready), but that function no longer exists.
   The only write path signals after every write.

All tests passing (42/42). No clippy warnings.

AI-assisted-by: Claude claude-4.6-opus-high-thinking
Made-with: Cursor

* docs: Remove out-of-scope README sections per review

Remove sections that belong in the streaming-timestamps PR (#105),
not this buffer-sizing PR:

- "Streaming Output Columns" table and timestamp_ns accuracy note
- "Test Execution Order" section
- Streaming JSON/CSV description rewording
- Round-trip CLI example comment expansion

README diff now only contains buffer-sizing documentation changes:
auto-sizing table, error prevention updates, and example buffer
size correction.

AI-assisted-by: Claude claude-4.6-opus-high-thinking
Made-with: Cursor

* fix: Unify message overhead constant across buffer sizing paths

Replace hardcoded 64 and 32 values with the MESSAGE_OVERHEAD
constant in both benchmark.rs and benchmark_blocking.rs:

- TCP/UDS msg-count sizing: was hardcoded 64, now MESSAGE_OVERHEAD
- SHM logging/validation: was hardcoded 32, now MESSAGE_OVERHEAD
- Add comment explaining what MESSAGE_OVERHEAD covers: 8 (id) +
  8 (timestamp) + 8 (bincode vec length) + 1 (message type) +
  4 (ring buffer length prefix) = 29 bytes, rounded to 64

Addresses review feedback about inconsistent overhead values.

AI-assisted-by: Claude claude-4.6-opus-high-thinking
Made-with: Cursor


Development

Successfully merging this pull request may close these issues.

Fix streaming output timestamps captured at record-creation time instead of message-send time
