Fix/buffer size overloading - Fix SHM Buffer Over-Sizing and Condvar Hangs by mcurrier2 · Pull Request #108 · redhat-performance/rusty-comms

mcurrier2 · 2026-03-16T16:02:08Z

Description

Brief description of changes

Type of Change

[ x] Bug fix
New feature
Breaking change
[x ] Documentation update

Testing

[ x] Tests pass locally
Added tests for new functionality
[ x] Updated documentation

Checklist

Code follows style guidelines
Self-review completed
Comments added for complex code
Documentation updated
No breaking changes (or marked as breaking)

PR: Fix SHM Buffer Over-Sizing and Condvar Hangs

Summary

Shared memory (SHM) benchmarks produced misleading latency measurements
because the auto-sized ring buffer was large enough to hold all messages
at once. The writer dumped everything instantly while the reader slowly
drained the buffer, causing artificially huge accumulated latencies for
later messages. Additionally, the blocking SHM ring buffer used
pthread_cond_wait (infinite wait) for synchronization, which could
hang indefinitely when condition variable signals were missed — a known
failure mode in containerized environments with glibc ABI mismatches.

This PR replaces the all-messages-at-once buffer sizing with a fixed
64KB streaming buffer for SHM, and replaces infinite condvar waits
with timed waits plus a polling fallback for robustness.

Branch: fix/buffer-size-overloading
Base: main
Files changed: 6 (802 insertions, 43 deletions)

Problems and Fixes

1. SHM buffer sized to fit all messages (batching instead of streaming)

Problem: When --buffer-size was omitted, both async and blocking
benchmark runners auto-sized the SHM ring buffer to fit every message:
msg_count * (msg_size + 64). For a 10,000-message × 1KB benchmark,
this created a ~10MB buffer. The writer filled it instantly without
ever blocking, while the reader drained messages one at a time. Later
messages sat in the buffer for seconds, accumulating artificial latency
that had nothing to do with IPC performance.

Before:

// Buffer size logic (same for all mechanisms):
let buffer_size = self.config.buffer_size.unwrap_or_else(|| {
    if is_pmq {
        PMQ_SAFE_DEFAULT_BUFFER_SIZE
    } else if self.config.duration.is_some() {
        DURATION_MODE_BUFFER_SIZE
    } else {
        // For message-count mode, size the buffer to fit all messages.
        self.get_msg_count() * (self.config.message_size + 64)
    }
});

After:

// Buffer size logic (per-mechanism):
let is_shm = self.mechanism == IpcMechanism::SharedMemory;
const SHM_DEFAULT_BUFFER_SIZE: usize = 65536; // 64KB
const MESSAGE_OVERHEAD: usize = 64;

let buffer_size = self.config.buffer_size.unwrap_or_else(|| {
    if is_pmq {
        PMQ_SAFE_DEFAULT_BUFFER_SIZE
    } else if is_shm {
        // Fixed buffer for SHM to enable streaming, not batching
        let min_for_message = (self.config.message_size + MESSAGE_OVERHEAD) * 2;
        std::cmp::max(SHM_DEFAULT_BUFFER_SIZE, min_for_message)
    } else if self.config.duration.is_some() {
        DURATION_MODE_BUFFER_SIZE
    } else {
        // TCP/UDS handle backpressure via kernel buffers
        self.get_msg_count() * (self.config.message_size + 64)
    }
});

Effect: SHM now uses a fixed 64KB ring buffer (or 2x message size
for messages larger than 32KB). The writer blocks when the buffer is
full, producing per-message latencies that reflect actual IPC
throughput rather than queue drain time.

Mechanism	Old Auto-Size	New Auto-Size
SHM	`msg_count * (msg_size + 64)`	Fixed 64 KB (or 2x msg size)
PMQ	8,192 bytes	Unchanged
TCP/UDS (msg-count)	`msg_count * (msg_size + 64)`	Unchanged
TCP/UDS (duration)	1 GB	Unchanged

Applies to: src/benchmark.rs (async) and
src/benchmark_blocking.rs (blocking) — identical logic in both.

2. SHM condvar infinite wait causing hangs

Problem: write_data_blocking() and read_data_blocking() used
pthread_cond_wait() with no timeout. If the condition variable signal
was missed (common in cross-process scenarios, especially in containers
with glibc ABI mismatches), the thread would hang indefinitely.

Before:

unsafe fn write_data_blocking(&self, data: &[u8]) -> Result<()> {
    // ...
    libc::pthread_mutex_lock(&self.mutex as *const _ as *mut _);

    while self.available_write_space() < required_space {
        if self.shutdown.load(Ordering::Acquire) {
            libc::pthread_mutex_unlock(&self.mutex as *const _ as *mut _);
            return Err(anyhow!("Connection closed"));
        }

        // Infinite wait — hangs if signal is missed
        libc::pthread_cond_wait(
            &self.space_ready as *const _ as *mut _,
            &self.mutex as *const _ as *mut _,
        );
    }
    // ...
}

After:

unsafe fn write_data_blocking(&self, data: &[u8]) -> Result<()> {
    // ...
    let lock_result = libc::pthread_mutex_lock(&self.mutex as *const _ as *mut _);
    if lock_result != 0 {
        return self.write_data_polling(data); // Fallback
    }

    let mut wait_count = 0;
    let loop_start = std::time::Instant::now();
    while self.available_write_space() < required_space {
        if self.shutdown.load(Ordering::Acquire) { /* ... */ }

        // Detect broken condvar (100+ iterations in <10ms)
        if wait_count >= 100 && loop_start.elapsed() < Duration::from_millis(10) {
            libc::pthread_mutex_unlock(&self.mutex as *const _ as *mut _);
            return self.write_data_polling(data);
        }

        // Timed wait (500µs) instead of infinite wait
        let mut timeout = libc::timespec { tv_sec: 0, tv_nsec: 0 };
        libc::clock_gettime(libc::CLOCK_REALTIME, &mut timeout);
        timeout.tv_nsec += 500_000; // 500µs
        if timeout.tv_nsec >= 1_000_000_000 {
            timeout.tv_sec += 1;
            timeout.tv_nsec -= 1_000_000_000;
        }

        libc::pthread_cond_timedwait(
            &self.space_ready as *const _ as *mut _,
            &self.mutex as *const _ as *mut _,
            &timeout,
        );

        wait_count += 1;
        if wait_count > 60000 { // 30s timeout (60000 * 500µs)
            libc::pthread_mutex_unlock(&self.mutex as *const _ as *mut _);
            return Err(anyhow!("Timeout waiting for buffer space"));
        }
    }
    // ...
}

The same pattern was applied to read_data_blocking().

Changes:

pthread_cond_wait → pthread_cond_timedwait with 500µs timeout
Broken-condvar detection (100 iterations in <10ms → polling fallback)
30-second overall timeout (60,000 × 500µs)
Mutex lock failure → polling fallback

3. No fallback when pthread primitives fail

Problem: If pthread_mutex_lock failed or pthread_cond_timedwait
malfunctioned (both possible in containerized environments with
mismatched glibc versions), there was no alternative code path. The
operation would either crash or hang.

Fix: Added write_data_polling() and read_data_polling() methods
that use thread::sleep(Duration::from_micros(100)) polling with a
30-second timeout. These are activated when:

pthread_mutex_lock returns non-zero
The broken-condvar detector fires (100+ iterations in <10ms)

unsafe fn write_data_polling(&self, data: &[u8]) -> Result<()> {
    let start = std::time::Instant::now();
    let timeout = Duration::from_secs(30);

    while self.available_write_space() < required_space {
        if self.shutdown.load(Ordering::Acquire) {
            return Err(anyhow!("Connection closed"));
        }
        if start.elapsed() > timeout {
            return Err(anyhow!(
                "Timeout waiting for buffer space (polling fallback)"
            ));
        }
        thread::sleep(Duration::from_micros(100));
    }
    // ... write data using atomics only (no mutex) ...
}

Applies to: src/ipc/shared_memory_blocking.rs — both
write_data_polling() and read_data_polling().

4. SHM buffer warning log was misleading

Problem: When the SHM buffer was smaller than total data volume,
the benchmark logged a warn! saying "This may cause backpressure,
which is a valid test scenario." With the new fixed 64KB buffer,
this warning fired on virtually every SHM run and was no longer
indicating an unusual condition — it was the intended design.

Before:

warn!(
    "Buffer size ({} bytes) is smaller than the total data size \
     ({} bytes). This may cause backpressure, which is a valid \
     test scenario.",
    buffer_size, total_message_data
);

After:

debug!(
    "SHM using fixed {}KB buffer for {} bytes of data \
     - streaming mode enabled",
    buffer_size / 1024, total_message_data
);

Applies to: Both src/benchmark.rs and
src/benchmark_blocking.rs.

Tests Added

Buffer sizing tests (5 tests)

Test	File	Coverage
`test_transport_config_buffer_size_logic` (updated)	`benchmark.rs`	Added Scenario 2a: SHM uses fixed 64KB, not total; split TCP/UDS into Scenario 2b
`test_shm_large_message_buffer_sizing`	`benchmark.rs`	SHM auto-size uses 2x msg size when message > 32KB (40KB message → 80,128 byte buffer)
`test_shm_duration_mode_uses_fixed_buffer`	`benchmark.rs`	SHM in duration mode gets 64KB, not the 1GB TCP/UDS default
`test_blocking_transport_config_buffer_size_logic`	`benchmark_blocking.rs`	5-scenario blocking test: user-provided, SHM fixed, TCP msg-count, TCP duration, PMQ safe default
`test_blocking_shm_duration_mode_uses_fixed_buffer`	`benchmark_blocking.rs`	Blocking SHM in duration mode gets 64KB
`test_blocking_shm_large_message_buffer_sizing`	`benchmark_blocking.rs`	Blocking SHM uses 2x msg size for large messages

SHM backpressure and robustness tests (4 tests)

Test	File	Coverage
`test_backpressure_with_small_buffer`	`shared_memory_blocking.rs`	20 messages × 200 bytes through 1KB buffer — writer must block repeatedly, exercises `pthread_cond_timedwait` path
`test_payload_integrity_under_backpressure`	`shared_memory_blocking.rs`	10 messages × 256 bytes through 512-byte buffer — verifies byte-level payload correctness after backpressure-induced blocking
`test_ring_buffer_wrap_around_under_backpressure`	`shared_memory_blocking.rs`	50 messages × 128 bytes through 1KB buffer — write position wraps the circular buffer multiple times
`test_shutdown_detected_during_blocked_write`	`shared_memory_blocking.rs`	Server closes after reading 1 of 2 messages while writer is blocked on full buffer — verifies shutdown/timeout error

Documentation Updated

CONFIG.md

--buffer-size default changed from 8192 to *auto*
SHM buffer_size default changed from 8192 to 64 KB (auto)
Added auto-sizing explanation paragraph
SHM optimal settings: concurrency → 1, buffer guidance →
"leave at automatic"

README.md

Replaced Buffer Size Configuration prose with per-mechanism
auto-sizing table
Updated Error Prevention troubleshooting for new SHM timeout
messages

Validation

Unit and Integration Tests

$ cargo test
test result: ok. 262 passed; 0 failed; 1 ignored  (unit)
+ 27 integration tests, 42 doc tests — 335 total

$ cargo clippy --all-targets -- -D warnings
# zero warnings

$ cargo fmt --check
# clean

Before/After: SHM Buffer Sizing (Actual Test Results)

Test command (identical on both branches):

./target/release/ipc-benchmark -m shm --blocking -i 1000 -s 1024 \
    --one-way -w 0 --streaming-output-csv output.csv

Before fix (main branch):

Buffer Size:  1,088,000 bytes (auto-sized to fit all 1000 messages)

Benchmark Results:
  One-Way Latency:
           Mean: 2.56 ms, P95: 2.78 ms, P99: 2.78 ms
           Min:  169.11 us, Max: 2.78 ms

Per-message latency from streaming CSV:
  Msg 1:        169 µs
  Msg 250:    2,768 µs  (16x worse than msg 1)
  Msg 500:    2,594 µs
  Msg 750:    2,635 µs
  Msg 1000:   2,592 µs

The writer dumps all 1000 messages into the 1MB buffer instantly.
The reader drains them one at a time. By message 250, latency has
climbed to 2.7ms and stays there — this is queue-drain time, not
IPC latency.

After fix (fix/buffer-size-overloading branch):

Buffer Size:  65,536 bytes (fixed 64KB streaming buffer)

Benchmark Results:
  One-Way Latency:
           Mean: 304.54 us, P95: 667.65 us, P99: 683.52 us
           Min:  10.03 us, Max: 687.00 us

Per-message latency from streaming CSV:
  Msg 1:        187 µs
  Msg 250:      354 µs  (1.9x msg 1)
  Msg 500:      282 µs
  Msg 750:       10 µs
  Msg 1000:      14 µs

The writer blocks when the 64KB buffer fills (~60 messages), waits
for the reader to drain space, then continues. Every message's
latency reflects actual IPC throughput.

Improvement:

Metric	Before (`main`)	After (fix)	Improvement
Mean latency	2,560 µs	304 µs	8.4x
Max latency	2,780 µs	687 µs	4.0x
Last-message latency	2,592 µs	14 µs	185x
Buffer size	1,088,000 bytes	65,536 bytes	16.6x smaller
Throughput	18.31 MB/s	17.11 MB/s	~same

Throughput is nearly identical because the total data transfer rate
is the same — the fix changes when messages are written, not how
fast the IPC channel moves data.

Before/After: Condvar Wait Behavior

Before fix (main branch): write_data_blocking() and
read_data_blocking() used pthread_cond_wait() with no timeout.
If a condition variable signal was missed (common in cross-process
scenarios, especially in containers with glibc ABI mismatches), the
thread would hang indefinitely — no timeout, no error, no recovery.

After fix: pthread_cond_timedwait with 500µs timeout means a
missed signal costs at most 500µs of latency, not an infinite hang.
If the condvar is completely broken (100 no-ops in <10ms), the code
falls back to polling. If nothing works, a 30-second timeout produces
a clear error message:

Error: Timeout waiting for buffer space
Error: Timeout waiting for data
Error: Timeout waiting for buffer space (polling fallback)
Error: Timeout waiting for data (polling fallback)

Files Changed

File	Change
`src/ipc/shared_memory_blocking.rs`	Replace `pthread_cond_wait` with `pthread_cond_timedwait` (500µs) in both `write_data_blocking` and `read_data_blocking`; add broken-condvar detection; add `write_data_polling()` and `read_data_polling()` fallbacks; add mutex lock failure handling; add 30-second overall timeouts; 4 new backpressure/robustness tests
`src/benchmark.rs`	Add SHM-specific fixed 64KB buffer sizing path; downgrade SHM buffer log from `warn!` to `debug!`; update existing buffer test; add 2 new SHM buffer sizing tests
`src/benchmark_blocking.rs`	Mirror all async buffer sizing changes for blocking mode; add `Duration` import; add 3 new blocking buffer sizing tests
`README.md`	Rewrite Buffer Size Configuration with per-mechanism auto-sizing table; update Error Prevention troubleshooting
`CONFIG.md`	Update `--buffer-size` default to auto; update SHM buffer_size default and optimal settings
`.cargo/audit.toml`	Ignore MSRV-pinned dependency advisories (RUSTSEC-2026-0007, RUSTSEC-2026-0009)

Risk Assessment

Low risk. Buffer sizing changes only affect the default when
--buffer-size is omitted. User-provided values are always
respected. TCP, UDS, and PMQ behavior is completely unchanged.
SHM condvar changes are additive. The primary path still uses
pthread condvars — just with a timeout instead of infinite wait.
The polling fallback only activates on detected failures.
Backward compatible. No CLI changes, no output format changes.
Users who were explicitly passing --buffer-size see no difference.
All 335 tests pass with zero clippy warnings.
9 new tests specifically cover the buffer sizing logic and SHM
backpressure behavior introduced by this branch.

dustinblack

PR #108 Review

Buffer sizing fix — looks good

The core change (fixed 64KB SHM buffer instead of sizing to fit all messages) is well-motivated and the before/after data is compelling. The buffer sizing logic in benchmark.rs and benchmark_blocking.rs is clean, the warn! → debug! downgrade makes sense, and the test coverage is thorough (boundary tests, large messages, duration mode, user override).

Condvar/polling changes must be removed

src/ipc/shared_memory_blocking.rs:286-432 (write path) and src/ipc/shared_memory_blocking.rs:437-598 (read path)

The pthread_cond_timedwait, broken-condvar detection, write_data_polling(), and read_data_polling() are the same container-IPC code that was removed from PR #104 at review (ba2e60c). The container-to-container IPC work is out of scope and will be re-architected in the future, so this code should not be introduced here.

Beyond the scoping concern, this code has two functional problems:

Regresses PR #104's timestamp fix. write_data_polling() has no timestamp_offset parameter, so when the polling fallback triggers, the timestamp is not updated before the write. This undoes the latency accuracy fix from #104.
Missing condvar signal after polling write. write_data_polling() writes data and updates write_pos but never calls pthread_cond_signal(&data_ready). A reader blocked on the condvar path won't wake until its 500µs timed wait expires. PR #103 fixed this same class of issue.

Please scope this PR to issue #107 items 4-5 (buffer sizing and tests). The condvar/polling work (items 1-3) should be excluded — those issue items should be moved to a future PR when the container-IPC architecture is ready.

Minor: overlapping files

.cargo/audit.toml, Cargo.lock, and Cargo.toml changes overlap with PRs #103, #104, and #105. Whichever merges last will need to resolve conflicts. Not a blocker, just a heads-up.

mcurrier2 · 2026-03-24T18:03:14Z

Issue #107 updated to reflect this PR only addresses:

Fix SHM automatic buffer sizing to use a fixed 64 KB buffer (or 2x message size for large messages) instead of sizing the ring buffer to fit all messages at once. A small buffer forces natural writer/reader pacing through backpressure, producing latencies representative of steady-state IPC performance.

Update test_transport_config_buffer_size_logic to validate the new SHM-specific fixed-buffer behavior while preserving message-count-based sizing for TCP/UDS.

Here's a summary of what was removed/restored in src/ipc/shared_memory_blocking.rs (net -270 lines):

Removed:

write_data_polling() — the polling fallback for writes (no mutex, no condvar signal)
read_data_polling() — the polling fallback for reads
pthread_cond_timedwait — replaced with pthread_cond_wait in both write and read paths
Broken-condvar detection logic (100-iteration / 10ms heuristic)
Mutex-lock-failure fallbacks (falling back to polling on pthread_mutex_lock != 0)
30-second timeout counters (wait_count > 60000)
test_high_volume_condvar_stress test

Restored from main:

write_data_blocking() signature with data: &mut [u8] and timestamp_offset: Option<Range> — timestamp is refreshed inside the function after the condvar wait, so latency excludes backpressure time
read_data_blocking() with clean pthread_cond_wait
Call site now passes &mut serialized and Some(Message::timestamp_offset())

All validation passes: cargo fmt, cargo check, cargo test (42 passed), cargo clippy (0 warnings).

Both functional problems are fully resolved by the removal. Here's why:

Timestamp regression: write_data_polling() was the code path that lacked the timestamp_offset parameter. We deleted that function entirely. The only remaining write path is write_data_blocking(), which now takes timestamp_offset: Option<Range> and refreshes the timestamp after the condvar wait, right before the memory write. There is no fallback path that could bypass it.
Missing condvar signal: write_data_polling() was the code path that wrote data without calling pthread_cond_signal(&data_ready). Again, that function no longer exists. The only remaining write path (write_data_blocking) calls pthread_cond_signal(&data_ready) after every write. And on the read side, read_data_polling() is also gone — read_data_blocking() always uses pthread_cond_wait, and always calls pthread_cond_signal(&space_ready) after consuming data.

Both problems existed exclusively in the polling fallback functions and the broken-condvar detection logic that routed into them. Since all of that code is removed, neither issue can occur. The write and read paths now match exactly what's on main (from PR #104), with clean pthread_cond_wait and proper signaling in both directions.

dustinblack

Thanks for removing the condvar/polling code. The buffer sizing fix looks good and is ready to go, but this branch needs a rebase on main before I can approve — it's based on e826b20, missing PRs #103 and #104.

After rebasing, please also clean up the duplication:

shared_memory_blocking.rs timestamp fix — already merged via #104. After rebase this should drop out automatically, but verify no duplicate changes remain.
.cargo/audit.toml — already on main via #103. Same.
README "Streaming Output Columns" and "Test Execution Order" sections — these are out of scope for a buffer-sizing PR. Please remove them; if they're needed, they belong in a separate PR with their own justification.

The MSRV pins in Cargo.toml (uuid <1.21, zmij, quote, syn, unicode-ident, tempfile <3.25) look like legitimate CI fixes and are fine to keep.

Previously, standalone SHM mode calculated buffer size to fit ALL messages (e.g., 6.4MB for 50k messages). This caused the writer to dump everything instantly while the reader slowly drained using pthread_cond_timedwait with 500us polling timeouts, leading to huge accumulated latencies (~489ms for 64B). The fix uses a fixed 64KB buffer (or 2x message size if larger), matching container behavior. This enables proper streaming where the writer blocks when the buffer is full. Before: 489ms mean latency for 64B/50k messages After: 1.95ms mean latency for 64B/10k messages (blocking) 15.85ms mean latency for 64B/10k messages (async) Also updates test_transport_config_buffer_size_logic to expect the new fixed-buffer behavior for SHM while keeping the message-count sizing for TCP/UDS. Cherry-picked from container-to-container-ipc branch (3b49877). AI-assisted-by: Claude claude-4.6-opus-high-thinking (Anthropic) Made-with: Cursor

…entation - Add 9 new tests covering SHM buffer sizing, backpressure, and condvar timed-wait behavior: - test_shm_large_message_buffer_sizing: verifies 2x msg size path when messages exceed 32KB (async) - test_shm_duration_mode_uses_fixed_buffer: verifies SHM gets 64KB in duration mode, not 1GB TCP/UDS default (async) - test_blocking_transport_config_buffer_size_logic: full buffer sizing test for blocking mode (SHM, TCP, PMQ, duration) - test_blocking_shm_duration_mode_uses_fixed_buffer: SHM 64KB in blocking duration mode - test_blocking_shm_large_message_buffer_sizing: 2x path (blocking) - test_backpressure_with_small_buffer: exercises timed condvar wait with 1KB buffer and 20 messages - test_payload_integrity_under_backpressure: byte-level payload verification through backpressure-induced blocking writes - test_ring_buffer_wrap_around_under_backpressure: write_pos wraps the circular buffer multiple times under backpressure - test_shutdown_detected_during_blocked_write: server closes while client is blocked waiting for buffer space - Update README.md Buffer Size Configuration with per-mechanism auto-sizing table and updated error prevention guidance - Update CONFIG.md SHM defaults from 8192 to 64KB (auto) and add automatic buffer sizing explanation - All tests passing, clippy clean AI-assisted-by: Claude claude-4.6-opus-high-thinking (Anthropic) Made-with: Cursor

- Add test_user_buffer_size_overrides_shm_default (async + blocking): verifies user-provided --buffer-size overrides SHM's 64KB default - Add test_shm_buffer_sizing_at_32kb_boundary (async + blocking): tests exact transition where 2*(msg_size+64) crosses 64KB threshold - Add test_high_volume_condvar_stress: 100 messages through 512-byte buffer to stress pthread_cond_timedwait retry loop - Update create_transport_config_internal doc comment in benchmark.rs to describe per-mechanism buffer sizing (SHM, PMQ, TCP/UDS) - Update Adaptive Buffer Sizing doc in benchmark_blocking.rs to describe per-mechanism behavior instead of vague description - Fix README.md example output: SharedMemory buffer size 10240000 -> 65536 to reflect new fixed 64KB auto-sizing - All 340 tests passing, zero clippy warnings AI-assisted-by: Claude claude-4.6-opus-high-thinking Made-with: Cursor

…ility The CI MSRV job removes Cargo.lock and resolves fresh dependencies. Several transitive dependencies recently bumped their MSRV above 1.70: - uuid 1.21+ requires Rust 1.85 → pinned to <1.21 - tempfile 3.25+ pulls getrandom >=0.3,<0.5 which resolves to 0.4.x (edition 2024, unparseable by Rust 1.70's cargo) → pinned to <3.25 - zmij 1.0.20+ requires Rust 1.71 → pinned to =1.0.19 - quote 1.0.45+ requires Rust 1.71 → pinned to =1.0.44 - syn 2.0.115+ requires Rust 1.71 → pinned to =2.0.114 - unicode-ident 1.0.23+ requires Rust 1.71 → pinned to =1.0.22 Verified: MSRV builds and tests pass both with and without Cargo.lock in a Rust 1.70 container. Local clippy, fmt, and tests all clean. AI-assisted-by: Claude claude-4.6-opus-high-thinking Made-with: Cursor

No code changes. Forces new CI run to pick up dependency pins from commit cd28295 (uuid <1.21, tempfile <3.25, zmij =1.0.19, quote =1.0.44, syn =2.0.114, unicode-ident =1.0.22). AI-assisted-by: Claude claude-4.6-opus-high-thinking Made-with: Cursor

Remove container-IPC code that was out of scope for issue #107 (buffer sizing fix). This scopes the PR to items 4-5 only. Removed: - write_data_polling() and read_data_polling() fallback functions - pthread_cond_timedwait (reverted to pthread_cond_wait) - Broken-condvar detection (100-iteration/10ms heuristic) - Mutex-lock-failure fallbacks to polling paths - 30-second timeout counters (wait_count > 60000) - test_high_volume_condvar_stress test Restored from main (PR #104): - write_data_blocking() signature with timestamp_offset parameter so latency measurement excludes backpressure wait time - read_data_blocking() with clean pthread_cond_wait - Proper pthread_cond_signal in both write and read paths The two functional regressions cited in review are resolved: 1. Timestamp regression: write_data_polling() lacked timestamp_offset, but that function no longer exists. The only write path now refreshes the timestamp after the condvar wait. 2. Missing condvar signal: write_data_polling() never called pthread_cond_signal(&data_ready), but that function no longer exists. The only write path signals after every write. All tests passing (42/42). No clippy warnings. AI-assisted-by: Claude claude-4.6-opus-high-thinking Made-with: Cursor

Remove sections that belong in the streaming-timestamps PR (#105), not this buffer-sizing PR: - "Streaming Output Columns" table and timestamp_ns accuracy note - "Test Execution Order" section - Streaming JSON/CSV description rewording - Round-trip CLI example comment expansion README diff now only contains buffer-sizing documentation changes: auto-sizing table, error prevention updates, and example buffer size correction. AI-assisted-by: Claude claude-4.6-opus-high-thinking Made-with: Cursor

mcurrier2 · 2026-03-25T14:54:09Z

PR #108 is updated with 7 clean commits rebased on main:

a7690b9 — Fix: Use fixed 64KB buffer for standalone SHM
c1e9537 — Tests/docs: Buffer sizing and SHM backpressure tests
cd879f4 — Tests/docs: Coverage tests and stale doc fixes
970e408 — Fix(msrv): Pin transitive dependencies for Rust 1.70
2ecf3e9 — CI: Trigger rebuild with MSRV pins
8925b1d — Fix: Remove out-of-scope condvar/polling code per review
c64e906 — Docs: Remove out-of-scope README sections per review

All items addressed:

No duplicate .cargo/audit.toml (dropped by rebase)
No duplicate shared_memory_blocking.rs timestamp/condvar code (dropped by rebase)
Out-of-scope README sections removed

dustinblack

Rebase looks clean — duplication from #103/#104 dropped out, out-of-scope README sections removed. The remaining diff is focused on buffer sizing with good test coverage.

The beta CI failure (tcp_one_way_blocking_smoke) is a flaky integration test unrelated to these changes — all other platforms and stable pass.

Approved.

github-actions · 2026-03-25T16:48:02Z

📈 Changed lines coverage: 83.33% (10/12)

🚨 Uncovered lines in this PR

src/benchmark.rs: 1681
src/benchmark_blocking.rs: 595

📊 Code Coverage Summary

File	Line Coverage	Uncovered Lines
`src/benchmark.rs`	81.12% (477/588)	`75, 78, 89, 93, 102, 105, 107, 124, 392, 397-402, 409-414, 481-484, 589, 673, 679-681, 685-687, 707, 776-778, 783, 804, 809, 827, 933, 937-940, 942, 951-954, 956, 1032, 1063, 1066, 1068-1069, 1078-1079, 1132-1136, 1138, 1221, 1234, 1241, 1250, 1270-1274, 1276, 1372, 1381, 1383-1384, 1387-1388, 1394-1395, 1400-1401, 1403, 1408-1409, 1413-1415, 1420-1421, 1424-1425, 1457, 1515-1518, 1520-1526, 1528, 1531, 1681, 1696`
`src/benchmark_blocking.rs`	74.24% (317/427)	`97, 111, 127, 263, 369, 375-377, 380-382, 402, 434, 488, 583, 595, 609, 639-642, 723-726, 745, 749, 764, 806-808, 811, 814-816, 818, 821, 823-827, 829-830, 838-842, 844-848, 851-852, 856-857, 892, 941, 1020, 1031, 1061, 1064, 1131-1135, 1137, 1192-1195, 1200, 1213, 1216-1219, 1223-1225, 1227, 1229-1230, 1232-1233, 1236, 1238-1242, 1244, 1248-1249, 1251, 1253, 1274, 1286-1290, 1292, 1312-1315`
`src/cli.rs`	92.39% (85/92)	`630, 729, 769, 771, 792-794`
`src/execution_mode.rs`	100.00% (14/14)	``
`src/ipc/mod.rs`	65.28% (47/72)	`115, 425, 427-430, 740-741, 756-757, 775-776, 807, 810, 813, 818, 845-846, 860, 862, 882, 884, 1007-1009`
`src/ipc/posix_message_queue.rs`	46.09% (59/128)	`139-140, 213-215, 217, 224, 229, 332-335, 337, 345, 437, 441-442, 446, 449-452, 454-458, 539, 679, 782, 789-790, 807-808, 819-820, 831-832, 849-850, 906, 910-911, 914-919, 921-923, 927, 929-931, 933, 935-937, 941-943, 945-947, 994-995, 1017`
`src/ipc/posix_message_queue_blocking.rs`	81.94% (127/155)	`172, 182, 221, 251-255, 274, 325, 368, 387-390, 416-418, 422-423, 425-426, 436, 455, 457-458, 460-461`
`src/ipc/shared_memory.rs`	69.36% (163/235)	`61, 141, 145, 246-247, 257-258, 262, 390-391, 417-419, 421, 439-441, 443-444, 446-450, 467, 474, 480, 483-484, 488, 492, 496-497, 502-503, 666-667, 670-671, 674, 676, 681-682, 709-710, 713-714, 721-723, 725, 727-732, 734-735, 738-739, 741-745, 752, 782, 784-785, 787, 791`
`src/ipc/shared_memory_blocking.rs`	77.74% (206/265)	`196-198, 200-201, 204-206, 209-210, 212, 217, 219, 223-225, 230, 238-240, 243-245, 248-249, 251, 254, 257-258, 261-262, 266-267, 269, 273-274, 276, 310-312, 318-319, 378-379, 403-407, 498, 506, 556, 573, 660, 726, 789, 798, 808, 830`
`src/ipc/shared_memory_direct.rs`	83.80% (150/179)	`372-375, 444-451, 455, 482, 506-509, 513-514, 556-557, 569, 598, 605-606, 629-630, 636`
`src/ipc/tcp_socket.rs`	59.43% (63/106)	`31-32, 61, 96, 113-114, 118, 124-125, 129, 136-137, 141, 147-148, 152, 171-172, 175-177, 184-185, 188, 362-363, 366-367, 370-371, 376-377, 422, 429, 447-449, 478, 480-482, 484, 487`
`src/ipc/tcp_socket_blocking.rs`	97.62% (82/84)	`134, 159`
`src/ipc/unix_domain_socket.rs`	59.43% (63/106)	`29-30, 58, 93, 103, 122-123, 127, 133-134, 138, 145-146, 150, 156-157, 161, 180-181, 184-186, 193-194, 197, 346-347, 350-351, 354-355, 360-361, 412-414, 443, 445-447, 449, 452, 468`
`src/ipc/unix_domain_socket_blocking.rs`	94.34% (100/106)	`276-277, 283-285, 287`
`src/logging.rs`	100.00% (13/13)	``
`src/main.rs`	42.39% (142/335)	83-85, 87, 124-125, 135-139, 143-145, 147-148, 150-151, 171-174, 198-202, 210, 216, 219, 224-227, 232-233, 239, 245, 247-249, 251, 257-258, 264, 269, 272-273, 277, 279-280, 284-285, 287, 293, 297-298, 300-305, 307-308, 311, 320, 323-324, 327, 374-377, 384, 386-390, 393-396, 398-399, 401-402, 404, 406-412, 416, 418-421, 424, 428-430, 434, 436, 439, 443, 448-451, 457-458, 464-465, 471, 473-474, 478, 480, 485-487, 491, 494-495, 497-498, 503, 505-507, 511-512, 514, 521, 526-527, 529-534, 536-537, 541, 550, 553-554, 557, 559, 578, 585, 589-591, 593, 623-624, 632, 665, 716, 720, 723-726, 775-778, 815-816, 823-824, 827, 854-855, 858, 912-913, 917-920, 960, 969, 974, 979-980
`src/metrics.rs`	79.79% (150/188)	`455-460, 493-494, 552, 558, 579-582, 732-734, 736, 768, 788, 833, 838, 881, 904, 923-924, 926-927, 930-932, 952, 980, 984, 1005, 1007-1008, 1013`
`src/results.rs`	56.92% (255/448)	717, 726-728, 730-731, 734-735, 738, 760, 763-764, 767, 769, 772, 776-781, 791-792, 795-800, 817, 829-830, 832, 834, 837-838, 840, 844, 871, 895-897, 900-901, 905-907, 910, 936, 941, 946, 952, 971, 973-974, 976, 978-982, 984, 986-987, 1021, 1062-1063, 1066, 1072-1073, 1077, 1081-1083, 1085-1086, 1110-1114, 1117-1120, 1123-1130, 1140-1141, 1160-1161, 1163-1167, 1169, 1186-1187, 1189-1194, 1196, 1214, 1216-1221, 1239, 1242, 1258-1259, 1274-1276, 1278-1280, 1282-1283, 1285-1286, 1288-1289, 1291, 1293-1294, 1296-1299, 1301-1303, 1305-1307, 1310, 1314-1315, 1323-1328, 1330-1331, 1335-1336, 1340-1342, 1344, 1348-1349, 1358-1361, 1365-1367, 1371, 1373, 1376, 1381-1382, 1387, 1394-1398, 1400, 1598-1599, 1819-1820, 1822-1823, 1828
`src/results_blocking.rs`	95.48% (296/310)	`489-490, 492-493, 544, 769, 774, 779, 815, 818-819, 827-828, 886`
`src/utils.rs`	70.73% (29/41)	`71, 143, 147-149, 153, 159, 198-202`
Total	72.92% (2838/3892)

sberg-rh · 2026-03-26T18:30:35Z

Question: Intentional difference in message overhead values?

I noticed that MESSAGE_OVERHEAD = 64 is defined and used for SHM buffer sizing in both benchmark.rs and benchmark_blocking.rs, but there are two other places that use different hardcoded values in the same function:

SHM sizing uses the constant:

const MESSAGE_OVERHEAD: usize = 64;
let min_for_message = (self.config.message_size + MESSAGE_OVERHEAD) * 2;

TCP/UDS sizing uses a hardcoded 64:

self.get_msg_count() * (self.config.message_size + 64)

Validation/logging check uses a hardcoded 32:

self.get_msg_count() * (self.config.message_size + 32)

Is the 32 in the validation path intentionally different from the 64 used elsewhere? I wasn't sure if that reflects a different overhead calculation for logging purposes or if it should be using MESSAGE_OVERHEAD as well. Might be worth a quick comment if it's intentional.

Applies to:

src/benchmark.rs (lines ~1658, ~1677)
src/benchmark_blocking.rs (lines ~574, ~591)

Replace hardcoded 64 and 32 values with the MESSAGE_OVERHEAD constant in both benchmark.rs and benchmark_blocking.rs: - TCP/UDS msg-count sizing: was hardcoded 64, now MESSAGE_OVERHEAD - SHM logging/validation: was hardcoded 32, now MESSAGE_OVERHEAD - Add comment explaining what MESSAGE_OVERHEAD covers: 8 (id) + 8 (timestamp) + 8 (bincode vec length) + 1 (message type) + 4 (ring buffer length prefix) = 29 bytes, rounded to 64 Addresses review feedback about inconsistent overhead values. AI-assisted-by: Claude claude-4.6-opus-high-thinking Made-with: Cursor

mcurrier2 · 2026-03-30T12:57:26Z

It should always be 64 bytes.
This has been changed to specify the 64 byte constant - MESSAGE_OVERHEAD, where it was being hardcoded as either 32 or 64. A comment is added describing what makes up the "overhead" in each program.

github-actions · 2026-03-30T13:07:44Z

📈 Changed lines coverage: 88.89% (16/18)

🚨 Uncovered lines in this PR

src/benchmark.rs: 1686
src/benchmark_blocking.rs: 600

📊 Code Coverage Summary

File	Line Coverage	Uncovered Lines
`src/benchmark.rs`	81.15% (478/589)	`75, 78, 89, 93, 102, 105, 107, 124, 392, 397-402, 409-414, 481-484, 589, 673, 679-681, 685-687, 707, 776-778, 783, 804, 809, 827, 933, 937-940, 942, 951-954, 956, 1032, 1063, 1066, 1068-1069, 1078-1079, 1132-1136, 1138, 1221, 1234, 1241, 1250, 1270-1274, 1276, 1372, 1381, 1383-1384, 1387-1388, 1394-1395, 1400-1401, 1403, 1408-1409, 1413-1415, 1420-1421, 1424-1425, 1457, 1515-1518, 1520-1526, 1528, 1531, 1686, 1701`
`src/benchmark_blocking.rs`	74.30% (318/428)	`97, 111, 127, 263, 369, 375-377, 380-382, 402, 434, 488, 587, 600, 614, 644-647, 728-731, 750, 754, 769, 811-813, 816, 819-821, 823, 826, 828-832, 834-835, 843-847, 849-853, 856-857, 861-862, 897, 946, 1025, 1036, 1066, 1069, 1136-1140, 1142, 1197-1200, 1205, 1218, 1221-1224, 1228-1230, 1232, 1234-1235, 1237-1238, 1241, 1243-1247, 1249, 1253-1254, 1256, 1258, 1279, 1291-1295, 1297, 1317-1320`
`src/cli.rs`	92.39% (85/92)	`630, 729, 769, 771, 792-794`
`src/execution_mode.rs`	100.00% (14/14)	``
`src/ipc/mod.rs`	65.28% (47/72)	`115, 425, 427-430, 740-741, 756-757, 775-776, 807, 810, 813, 818, 845-846, 860, 862, 882, 884, 1007-1009`
`src/ipc/posix_message_queue.rs`	46.09% (59/128)	`139-140, 213-215, 217, 224, 229, 332-335, 337, 345, 437, 441-442, 446, 449-452, 454-458, 539, 679, 782, 789-790, 807-808, 819-820, 831-832, 849-850, 906, 910-911, 914-919, 921-923, 927, 929-931, 933, 935-937, 941-943, 945-947, 994-995, 1017`
`src/ipc/posix_message_queue_blocking.rs`	81.94% (127/155)	`172, 182, 221, 251-255, 274, 325, 368, 387-390, 416-418, 422-423, 425-426, 436, 455, 457-458, 460-461`
`src/ipc/shared_memory.rs`	69.36% (163/235)	`61, 141, 145, 246-247, 257-258, 262, 390-391, 417-419, 421, 439-441, 443-444, 446-450, 467, 474, 480, 483-484, 488, 492, 496-497, 502-503, 666-667, 670-671, 674, 676, 681-682, 709-710, 713-714, 721-723, 725, 727-732, 734-735, 738-739, 741-745, 752, 782, 784-785, 787, 791`
`src/ipc/shared_memory_blocking.rs`	78.87% (209/265)	`196-198, 200-201, 204-206, 209-210, 212, 217, 219, 223-225, 230, 238-240, 243-245, 248-249, 251, 254, 257-258, 261-262, 266-267, 269, 273-274, 276, 311-312, 378-379, 403-407, 498, 506, 556, 573, 660, 726, 789, 798, 808, 830`
`src/ipc/shared_memory_direct.rs`	83.80% (150/179)	`372-375, 444-451, 455, 482, 506-509, 513-514, 556-557, 569, 598, 605-606, 629-630, 636`
`src/ipc/tcp_socket.rs`	59.43% (63/106)	`31-32, 61, 96, 113-114, 118, 124-125, 129, 136-137, 141, 147-148, 152, 171-172, 175-177, 184-185, 188, 362-363, 366-367, 370-371, 376-377, 422, 429, 447-449, 478, 480-482, 484, 487`
`src/ipc/tcp_socket_blocking.rs`	97.62% (82/84)	`134, 159`
`src/ipc/unix_domain_socket.rs`	59.43% (63/106)	`29-30, 58, 93, 103, 122-123, 127, 133-134, 138, 145-146, 150, 156-157, 161, 180-181, 184-186, 193-194, 197, 346-347, 350-351, 354-355, 360-361, 412-414, 443, 445-447, 449, 452, 468`
`src/ipc/unix_domain_socket_blocking.rs`	94.34% (100/106)	`276-277, 283-285, 287`
`src/logging.rs`	100.00% (13/13)	``
`src/main.rs`	42.39% (142/335)	83-85, 87, 124-125, 135-139, 143-145, 147-148, 150-151, 171-174, 198-202, 210, 216, 219, 224-227, 232-233, 239, 245, 247-249, 251, 257-258, 264, 269, 272-273, 277, 279-280, 284-285, 287, 293, 297-298, 300-305, 307-308, 311, 320, 323-324, 327, 374-377, 384, 386-390, 393-396, 398-399, 401-402, 404, 406-412, 416, 418-421, 424, 428-430, 434, 436, 439, 443, 448-451, 457-458, 464-465, 471, 473-474, 478, 480, 485-487, 491, 494-495, 497-498, 503, 505-507, 511-512, 514, 521, 526-527, 529-534, 536-537, 541, 550, 553-554, 557, 559, 578, 585, 589-591, 593, 623-624, 632, 665, 716, 720, 723-726, 775-778, 815-816, 823-824, 827, 854-855, 858, 912-913, 917-920, 960, 969, 974, 979-980
`src/metrics.rs`	79.79% (150/188)	`455-460, 493-494, 552, 558, 579-582, 732-734, 736, 768, 788, 833, 838, 881, 904, 923-924, 926-927, 930-932, 952, 980, 984, 1005, 1007-1008, 1013`
`src/results.rs`	56.92% (255/448)	717, 726-728, 730-731, 734-735, 738, 760, 763-764, 767, 769, 772, 776-781, 791-792, 795-800, 817, 829-830, 832, 834, 837-838, 840, 844, 871, 895-897, 900-901, 905-907, 910, 936, 941, 946, 952, 971, 973-974, 976, 978-982, 984, 986-987, 1021, 1062-1063, 1066, 1072-1073, 1077, 1081-1083, 1085-1086, 1110-1114, 1117-1120, 1123-1130, 1140-1141, 1160-1161, 1163-1167, 1169, 1186-1187, 1189-1194, 1196, 1214, 1216-1221, 1239, 1242, 1258-1259, 1274-1276, 1278-1280, 1282-1283, 1285-1286, 1288-1289, 1291, 1293-1294, 1296-1299, 1301-1303, 1305-1307, 1310, 1314-1315, 1323-1328, 1330-1331, 1335-1336, 1340-1342, 1344, 1348-1349, 1358-1361, 1365-1367, 1371, 1373, 1376, 1381-1382, 1387, 1394-1398, 1400, 1598-1599, 1819-1820, 1822-1823, 1828
`src/results_blocking.rs`	95.48% (296/310)	`489-490, 492-493, 544, 769, 774, 779, 815, 818-819, 827-828, 886`
`src/utils.rs`	70.73% (29/41)	`71, 143, 147-149, 153, 159, 198-202`
Total	73.01% (2843/3894)

sberg-rh

Looks good with latest updates! Approved

mcurrier2 requested review from dustinblack and sberg-rh March 16, 2026 16:02

mcurrier2 linked an issue Mar 16, 2026 that may be closed by this pull request

Fix SHM buffer sizing and condvar latency spikes causing unreliable shared memory benchmarks #107

Closed

dustinblack requested changes Mar 17, 2026

View reviewed changes

mcurrier2 mentioned this pull request Mar 24, 2026

perf/server-latency-buffering : Buffer latencies in memory instead of per-message file I/O #109

Merged

7 tasks

dustinblack requested changes Mar 25, 2026

View reviewed changes

dustinblack mentioned this pull request Mar 25, 2026

Fix/streaming timestamps #105

Open

5 tasks

mcurrier2 added 7 commits March 25, 2026 10:29

mcurrier2 force-pushed the fix/buffer-size-overloading branch from 44aaa3f to c64e906 Compare March 25, 2026 14:47

mcurrier2 requested a review from dustinblack March 25, 2026 15:04

dustinblack previously approved these changes Mar 25, 2026

View reviewed changes

mcurrier2 dismissed dustinblack’s stale review via 3d50d53 March 30, 2026 12:47

sberg-rh approved these changes Mar 30, 2026

View reviewed changes

mcurrier2 requested a review from ewchong March 30, 2026 13:39

ewchong approved these changes Mar 31, 2026

View reviewed changes

ewchong merged commit f7f8eb9 into main Mar 31, 2026
12 checks passed

ewchong deleted the fix/buffer-size-overloading branch March 31, 2026 13:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/buffer size overloading - Fix SHM Buffer Over-Sizing and Condvar Hangs#108

Fix/buffer size overloading - Fix SHM Buffer Over-Sizing and Condvar Hangs#108
ewchong merged 8 commits intomainfrom
fix/buffer-size-overloading

mcurrier2 commented Mar 16, 2026

Uh oh!

dustinblack left a comment

Uh oh!

mcurrier2 commented Mar 24, 2026

Uh oh!

dustinblack left a comment

Uh oh!

mcurrier2 commented Mar 25, 2026

Uh oh!

dustinblack left a comment

Uh oh!

github-actions bot commented Mar 25, 2026

Uh oh!

sberg-rh commented Mar 26, 2026

Uh oh!

mcurrier2 commented Mar 30, 2026

Uh oh!

github-actions bot commented Mar 30, 2026

Uh oh!

sberg-rh left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

mcurrier2 commented Mar 16, 2026

Description

Type of Change

Testing

Checklist

PR: Fix SHM Buffer Over-Sizing and Condvar Hangs

Summary

Problems and Fixes

1. SHM buffer sized to fit all messages (batching instead of streaming)

2. SHM condvar infinite wait causing hangs

3. No fallback when pthread primitives fail

4. SHM buffer warning log was misleading

Tests Added

Buffer sizing tests (5 tests)

SHM backpressure and robustness tests (4 tests)

Documentation Updated

CONFIG.md

README.md

Validation

Unit and Integration Tests

Before/After: SHM Buffer Sizing (Actual Test Results)

Before/After: Condvar Wait Behavior

Files Changed

Risk Assessment

Uh oh!

dustinblack left a comment

Choose a reason for hiding this comment

PR #108 Review

Buffer sizing fix — looks good

Condvar/polling changes must be removed

Minor: overlapping files

Uh oh!

mcurrier2 commented Mar 24, 2026

Uh oh!

dustinblack left a comment

Choose a reason for hiding this comment

Uh oh!

mcurrier2 commented Mar 25, 2026

Uh oh!

dustinblack left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 25, 2026

📈 Changed lines coverage: 83.33% (10/12)

🚨 Uncovered lines in this PR

📊 Code Coverage Summary

Uh oh!

sberg-rh commented Mar 26, 2026

Uh oh!

mcurrier2 commented Mar 30, 2026

Uh oh!

github-actions bot commented Mar 30, 2026

📈 Changed lines coverage: 88.89% (16/18)

🚨 Uncovered lines in this PR

📊 Code Coverage Summary

Uh oh!

sberg-rh left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants