encode: enforce the protobuf 2 GiB message-size limit by rpb-ant · Pull Request #271 · anthropics/buffa

rpb-ant · 2026-07-02T18:20:14Z

Encoding was infallible while decode has always rejected length-delimited payloads over 2 GiB−1 (DecodeError::MessageTooLarge): a writer whose message crossed 2^31−1 bytes silently serialized a blob that no conforming decoder (including buffa's own) will read back, and past u32::MAX the size arithmetic wrapped outright, producing corrupt output. In production the failure surfaces far from the cause — the writer keeps succeeding while every reader starts failing. buffa now refuses to produce bytes it wouldn't accept.

Design

"Wide inside, checked at the door" (protobuf C++'s shape: size_t ByteSizeLong() + an INT_MAX check at the serialize entry):

Generated compute_size accumulates in a u64 local (which can't overflow for any in-memory message) and saturates to u32 once at each node's return (buffa::saturate_size). The trait signature and SizeCache stay u32; the 2–4 GiB window stays byte-exact so the guard sees true values.
compute_size is bottom-up and monotone, so one check at the top-level entry points provably catches aggregation overflow, sub-message overflow, and the single-huge-field case. Existing entry points (encode, encode_to_vec, encode_to_bytes, encode_length_delimited, encoded_len, encode_with_cache) panic before writing anything; try_* twins return Err(EncodeError::MessageTooLarge), the first variant of the previously uninhabited enum.
The same rule applies everywhere size arithmetic lives: write_to's transient packed/map lengths, the map_codec runtime helpers, lazy-fragment sizing, DynamicMessage (usize-exact, own guard), SizeCachePool. A codegen test locks the rule by asserting emitted compute_size/write_to bodies contain no u32 arithmetic.
Belts: a debug-build assertion that write_to produced exactly the byte count compute_size declared (the write pass is wrap-free ground truth, so this converts future two-pass codegen bugs into CI failures), and SizeCache::consume_next rejects over-limit slots for callers driving the two passes directly.
Internal Result-returning pipelines that re-encode parsed messages (JSON/text Any and extension converters, pack_any, to_message, from_options) surface the condition through their existing error channels rather than panicking.

Structure

Four commits: (1) runtime guard + try_* surface on Message/ViewEncode/SizeCachePool; (2) u64 size arithmetic in codegen + map_codec + lazy views, with regenerated WKT/descriptor code and the codegen lock test; (3) DynamicMessage + fallible-pipeline migrations; (4) logging-example regen (which surfaced pre-existing buffa::Map drift in its main.rs) and a SECURITY.md markdownlint fix.

Notes for review

Behavior change, called out in the changelog fragment: over-limit messages now panic at the existing entry points instead of producing decoder-rejected/corrupt bytes. The panic message names the try_* escape. prost has no encode cap at all; protobuf C++ and Go both error here.
put_len_delimited_header now takes len: u64 (was u32) — bare literals still infer; u32 callers widen with u64::from.
encoded_len() panics rather than returning a saturated lie; try_encoded_len() is the non-panicking probe.
The reflection codec's usize sizing is exact on 64-bit; the 32-bit theoretical wrap (shared Bytes clones summing past usize::MAX) is documented in-code and unreachable in practice.
examples/bsr-quickstart remains stale (regen requires BSR auth), same as on main.

Verified: full workspace tests, clippy -D warnings, no_std (host + thumbv7em) and MSRV 1.75 checks, conformance 14/14, and a consumer-crate drive: a version-envelope message at exactly 2 GiB−1 encodes and decodes back; one byte over returns Err(MessageTooLarge) with nothing written (not even the length prefix); encode_to_vec panics with the documented message; a 4.4 GB payload (the old u32-wrap class) errors instead of wrapping.

…oint Encoding was infallible while decoding rejects length-delimited payloads over 2 GiB-1: a message that organically crossed 2^31-1 bytes serialized silently into a blob no conforming decoder - including buffa's own - will read back, and sizes past u32::MAX wrapped into corrupt output. Hoist the limit into pub const MAX_MESSAGE_BYTES (deduping the decode-side literals), inhabit the reserved EncodeError enum with MessageTooLarge, and check the computed size at every provided Message / ViewEncode / SizeCachePool encode entry point: the existing methods now panic on over-limit messages before writing anything, and new try_encode, try_encode_with_cache, try_encoded_len, try_encode_length_delimited, try_encode_to_vec, and try_encode_to_bytes twins return the error instead (pool: try_encode / try_encode_view / try_encoded_len). saturate_size, checked_encode_size, and assert_encode_size are exported so generated code and manual implementations share one funnel. Belts: encode_to_vec/encode_to_bytes gain a debug-build assertion that write_to produced exactly the byte count compute_size declared, and SizeCache::consume_next rejects over-limit slots - a backstop for callers driving compute_size/write_to directly.

Generated compute_size previously accumulated in u32: three 1.5 GiB sub-trees wrap to an under-limit value the new entry-point check cannot distinguish from a legitimate size, and a single huge bytes field wraps in its usize-to-u32 cast. Accumulate in a u64 local (which cannot overflow for any in-memory message) and saturate to u32 once at each node's return via buffa::saturate_size - the trait signature and SizeCache stay u32, and the byte-exact size survives the 2-4 GiB window so the entry-point guard sees the true value. The same discipline applies everywhere size arithmetic lives: write_to's transient packed-payload and map-entry lengths, the map_codec runtime helpers (MapCodec::encoded_len, field_len, message_field_len), the lazy-fragment emitter, and tag lengths. put_len_delimited_header takes a plain u64 length. Lazy views' inherent encode entry points route through the shared buffa funnel, gain try_encode / try_encoded_len / try_encode_to_vec / try_encode_to_bytes twins, and the debug two-pass byte-count assertion. A new codegen test locks the discipline by asserting the emitted compute_size/write_to bodies contain no u32 size arithmetic. Regenerates the checked-in WKT and bootstrap descriptor types.

DynamicMessage gains the same 2 GiB guard as generated types: encode and encode_to_vec panic on over-limit sizes (nested recursion uses an unchecked path so the check is one size walk per top-level encode), with try_encode / try_encode_to_vec twins returning EncodeError::MessageTooLarge. The library's own Result-returning pipelines that re-encode a parsed message stop panicking and surface the condition through their existing error channels instead: the JSON/text Any and extension converters (any_from_json, any_merge_text, message_merge_text, group_merge_text, message_from_json, the WKT from_json closures, reflective deserialize_any), pack_any (new AnyError::MessageTooLarge variant), to_message (DecodeError::MessageTooLarge), and from_options. set_extension documents that message-typed values are encoded on set and inherit the panic.

The checked-in logging example still carried u32 compute_size arithmetic from several codegen generations ago; regenerating also surfaced that main.rs predated the buffa::Map field type (std HashMap literals no longer infer), so the map literals now collect() into the generated type. bsr-quickstart's generated code needs BSR authentication to regenerate and is left as-is. Also strips trailing whitespace in SECURITY.md that failed markdownlint.

github-actions · 2026-07-02T18:20:27Z

All contributors have signed the CLA ✍️ ✅
_{Posted by the CLA Assistant Lite bot.}

…allible extension/Any paths Follow-up to the 2 GiB encode guard, addressing review findings: - Every panicking encode entry point (Message, ViewEncode, SizeCachePool, DynamicMessage, generated lazy views) now delegates to its try_* twin and routes the Err through one #[cold] encode_size_overflow shim, so the panicking and fallible paths cannot diverge. The thin wrappers are #[inline]; assert_encode_size (added earlier on this branch, now zero-caller) is removed. The shared panic message names the try_* escape generically since the shim now serves more than the encode family. - SizeCache::consume_next's over-limit slot check becomes a debug_assert: it is unreachable through every provided entry point (each child compute_size saturates, so an over-limit slot forces the checked total over the limit first), and it sat on the hottest write-pass loop. Matches the debug-only two-pass ledger philosophy. - OwnedView::from_owned maps an over-limit encode to Err(DecodeError::MessageTooLarge) instead of panicking inside a Result-returning API; DecodeError::MessageTooLarge's display text now covers both the protobuf maximum and configured decode limits. - The eager re-encode paths gain fallible twins: Any::try_pack and ExtensionSet::try_set_extension (staged, so Err leaves the extendee untouched - which also makes set_extension's panic path atomic). ExtensionCodec/SingularCodec swap their required method: try_encode / try_encode_one are required, the panicking encode / encode_one are provided wrappers, so a fallible codec cannot accidentally leave the try_ path panicking. - DynamicMessage's private guard adapts onto the shared checked_encode_size/saturate_size helpers (one comparison, one panic wording); pack_any documents its MessageTooLarge error; encoded_len documents that it is exact and unchecked; DynamicMessage::encode documents the size-walk cost. - The three per-file size-reporting test doubles consolidate into #[cfg(test)] test_doubles::SizedMsg; buffa-types keeps its own copy (cfg(test) items do not cross crates) with a cross-reference. - Docs: guide gains the extension/Any panic-vs-try section; both migration guides stop calling encode infallible. A Breaking changes fragment records the u32->u64 signature changes, the regenerate-generated-code requirement, and the codec-trait swap. Regenerated WKT/descriptor generated code (from_owned doc update only - no checked-in code emits lazy views).

Restore direct bodies for encode_to_vec / encode_to_bytes (Message, ViewEncode, generated lazy views) instead of delegating to the try_ twins: LLVM does not fold the Result<Vec<u8>>/Result<Bytes> niche away even under full LTO inlining, so the delegating form re-checked the capacity sentinel and round-tripped the payload through a Result temp in every caller. Measured on a quieted c7i.metal spot box with layout-normalized builds (lto, codegen-units=1, -align-all-nofallthru-blocks=6): - combined bench, delegating form: google_message1 encode +7.5%, api_response encode +2.1% vs the pre-delegation baseline; disassembly showed +6 instructions, +1 branch, and a second 16-byte stack round-trip of the Vec on the serialized chain in the bench loop, while write_to itself shrank (the removed consume_next branch). - with this change: api_response encode +0.26%; google_message1 encode -0.19% in the isolated per-shape harness (the residual combined-bench delta is the documented code-placement lottery of all-shapes binaries; the untouched decode control was flat). The unit- and scalar-returning entry points keep the try_ delegation - their Results stay in registers and fold cleanly. A comment on Message::encode_to_vec records the rule.

rpb-ant added 4 commits July 2, 2026 18:06

iainmcgin added 2 commits July 2, 2026 13:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

encode: enforce the protobuf 2 GiB message-size limit#271

encode: enforce the protobuf 2 GiB message-size limit#271
rpb-ant wants to merge 6 commits into
mainfrom
encode-size-guard

rpb-ant commented Jul 2, 2026

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

rpb-ant commented Jul 2, 2026

Design

Structure

Notes for review

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants