Various hardening/compat tweaks#26
Open
cgwalters wants to merge 7 commits into
Open
Conversation
GNU tar uses base-256 encoding for numeric fields that don't fit in octal. Only two leading bytes are spec-defined: 0x80 (positive, with the value in the non-leading bytes) and 0xff (negative two's-complement, for pre-epoch timestamps). All other leading bytes with the high bit set are reserved for future use. Previously, parse_numeric() had two related bugs: 1. It used `bytes[0] & 0x80 != 0` to enter the base-256 branch, accepting reserved leading bytes 0x81–0xbf as valid positives. These are decoded by masking byte[0] with 0x7f and including it in the accumulator — producing garbage values for any byte other than 0x80. 2. It used `bytes[0] & 0x40 != 0` to detect negative values, treating 0xc0–0xfe as "negative" when the spec designates them as reserved, and only 0xff as the defined negative indicator. The root cause of the sign-bit confusion was that an earlier fix attempted to detect negatives via overflow: checked_shl(8) doesn't actually detect u64 overflow (it only validates the shift count), so for 12-byte fields the overflow happened to catch negatives, but for 8-byte fields (uid, gid, devmajor, devminor) a 0xff-prefixed negative value has only 63 payload bits, fits in u64, and was silently accepted as a huge positive number. Fix: replace the bit-level checks with explicit leading-byte comparisons. Only 0x80 is accepted as a positive base-256 marker; 0xff and all other leading bytes with the high bit set are rejected with InvalidOctal. Assisted-by: OpenCode (Claude Sonnet 4.6) Signed-off-by: Colin Walters <walters@verbum.org>
A PAX 'size' field containing a value near u64::MAX (e.g. u64::MAX itself) would pass parse_pax_u64() validation, then be stored as entry_size. The subsequent padded_size() call uses next_multiple_of(512) which wraps to 0 in release builds, causing the caller to skip 0 bytes of content — a stream desynchronization. The header's own size field already has this guard via checked_next_multiple_of at the block-parsing stage. Apply the same check to the PAX size override at the point it is accepted in emit_entry, returning InvalidSize if the rounded-up value would overflow u64. Assisted-by: OpenCode (Claude Sonnet 4.5) Signed-off-by: Colin Walters <walters@verbum.org>
A single all-zero 512-byte block appearing in a valid archive is not meaningful: end-of-archive requires two consecutive zero blocks per the POSIX spec. Previously, stray zero blocks were silently skipped, meaning entries after a zero block would be visible to tar-core but hidden from Go's archive/tar (which errors) and GNU tar/Python (which stop). This created a confusion window for archive-based security bypasses. Return ParseError::StrayZeroBlock when a zero block is encountered that is not immediately followed by a second zero block. The valid two-block EOA marker continues to produce ParseEvent::EndOfArchive. Assisted-by: OpenCode (Claude Sonnet 4.6) Signed-off-by: Colin Walters <walters@verbum.org>
Hardlinks, symlinks, character/block devices, directories, and FIFOs carry no data content — their size field must be zero. A malicious archive can set a non-zero size on such an entry to create a divergence: tar-core would report padded_size() bytes of content to skip, while Go's archive/tar enforces size=0 for header-only types and reads 0 bytes, interpreting the 'content' as the next entry's header. This lets an attacker hide entries from a tar-core-based scanner that are visible to a Go-based extractor. Add EntryType::is_header_only() and check it in emit_entry after all PAX overrides are applied, returning NonZeroSizeForHeaderOnlyEntry if the final size is non-zero for one of these types. Note: some very old tar implementations wrote non-zero size fields on directory entries; such archives will now be rejected as malformed. Assisted-by: OpenCode (Claude Sonnet 4.6) Signed-off-by: Colin Walters <walters@verbum.org>
The base-256 numeric field fix (5c937f6) switched parse_numeric from checked_shl(8) to checked_mul(256) for overflow detection. checked_shl(8) never fires for a constant shift of 8 (always in-range for u64), so the old code silently wrapped reserved base-256 leading bytes like 0x8e to garbage u64 values and continued parsing. tar-rs still has this behaviour. After the fix, tar-core correctly returns InvalidOctal for any numeric field whose leading byte is not valid octal ASCII, 0x80 (positive base-256), or 0xff (negative base-256). The differential harness treated this as a regression because tar-rs accepted the entry while tar-core rejected it. Add ParseError::Header(HeaderError::InvalidOctal(_)) to the allowlist, matching the existing precedent for PAX parse errors. tar-core is correct to reject these; tar-rs's silent acceptance is the known pre-existing bug. Assisted-by: OpenCode (Claude Sonnet 4.6) Signed-off-by: Colin Walters <walters@verbum.org>
tar-core rejects archives where a header-only entry type (FIFO, symlink, directory, device node) has a non-zero size field, since consuming those bytes as content would desynchronise the stream for all subsequent entries. tar-rs silently accepts them, so this is a known intentional divergence. Add ParseError::NonZeroSizeForHeaderOnlyEntry(_) to the fuzz harness allowlist alongside the existing PAX and base-256 allowlist entries. Update the module-level doc comment to describe this new category. Assisted-by: OpenCode (Claude Sonnet 4.6) Signed-off-by: Colin Walters <walters@verbum.org>
The GNU tar base-256 format stores a pure marker byte (0x80 positive, 0xff negative) in field[0] and the value big-endian in field[1..N]. For N=8 fields (uid/gid/devmajor/devminor) this gives a 7-byte payload (56 data bits, max 2^56-1). The previous implementation wrote value.to_be_bytes() across all 8 bytes then OR'd field[0] with 0x80. For values < 2^56 both approaches produce identical output (field[0] of the value is 0x00, so OR gives exactly 0x80). For values in [2^56, 2^62), the old code produced field[0] in 0x81..0xbf, which a strict GNU-compatible decoder checking for byte[0] == 0x80 would reject. The old limit of 2^62-1 was also a fiction: real uid_t is uint32_t. Change the overflow threshold from 2^62 to 2^56, set field[0] = 0x80 explicitly, and write the value into field[1..8] only. The decoder is intentionally left lenient (0x80..0xbf accepted) to keep compatibility with archives produced by tar-rs, which uses the same OR approach. Update proptest bounds, doc comments, and add unit tests for the new encoder behaviour. Assisted-by: OpenCode (Claude Sonnet 4.6) Signed-off-by: Colin Walters <walters@verbum.org>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Nothing critical here, just came out of running an LLM-based audit against various other tar implementations.