archive: Replace entry parsing with tar-core Parser#447
Draft
cgwalters wants to merge 2 commits into
Draft
Conversation
Replace the hand-rolled entry parsing loop and sparse header handling with tar-core's Parser state machine. The Parser processes GNU long name/link extensions, PAX headers (including global extensions), and GNU/PAX sparse formats internally, emitting fully resolved entries. This brings PAX sparse v1.0 support (fixing issues #286 and #295) with no public API changes. Entries::raw(true) is preserved using the original single-header reading code path for callers that need to see every tar record (including extension headers) individually. Assisted-by: OpenCode (Claude Opus 4) Signed-off-by: Colin Walters <walters@verbum.org>
Prove that the tar-core parser integration fixes issues #286 (incorrect file size for PAX sparse entries) and #295 (files extracted under GNUSparseFile.0/ instead of their real name). The test archive is from ncihnegn's PR #298, which this work obsoletes. Co-authored-by: Haowen Ning <ncihnegn@gmail.com> Assisted-by: OpenCode (Claude Opus 4) Signed-off-by: Colin Walters <walters@verbum.org>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Obviously, this is a giant change; putting this up as draft for discussion.
In https://github.com/composefs/composefs-rs we need access to the raw bytestream behind the tar archive and we process arbitrary potentially malicious tar archives (as used by OCI). I wanted something that was heavily fuzzed and robust. tar-core also has
unsafe-code = "forbid".Sync and async unification
To an especially compelling thing about the "sans-io" approach of tar-core is we can finally unify the backend implementation between this project and the current async fork. See also "tarmageddon". Particularly in light of the just-published CVEs for this crate which relate to the above (and I help fix) - tar-core of course enforces more uniform interpretation by unifying that parser. I also spent some time cross-checking interoperability with other popular implementations in Python and Go.
Semver
Note this change as is designed to be semver compatible! People keep proposing semver-breaking changes here and we only recently added a check for that, which should pass here.
However - I would like to propose after this at some point we do an "easy to port" breaking bump to 0.5 where we remove some deprecated APIs (like the raw mode - anyone who wants that can just use tar-core!). And also, if we can commit to reusing the types exposed by tar-core (which does have the issue it needs to bump semver when zerocopy does - that's how we avoid unsafe code) - which I think we can do, it will dramatically simplify this crate.
cap-std
Another recent vulnerability here was due to symlink traversal. Something I'd like to add to this crate is an optional dependency on cap-std which is a very well engineered project that uniformly avoids those types of issues.
Feedback appreciated
Obviously so far tar-core has mostly been reviewed by me, and any other people looking at it would be very appreciated. One thing I'm happy with is the extensive test coverage - e.g. we're fuzzing in a variety of ways - including verifying that tar-core and this crate have the same expected interpretation for tarballs.
Replace the hand-rolled entry parsing loop and sparse header
handling with tar-core's Parser state machine. The Parser
processes GNU long name/link extensions, PAX headers (including
global extensions), and GNU/PAX sparse formats internally,
emitting fully resolved entries.
This brings PAX sparse v1.0 support (fixing issues #286 and #295)
with no public API changes. Entries::raw(true) is preserved using
the original single-header reading code path for callers that need
to see every tar record (including extension headers) individually.
Assisted-by: OpenCode (Claude Opus 4)
Signed-off-by: Colin Walters walters@verbum.org