Skip to content

archive: Replace entry parsing with tar-core Parser#447

Draft
cgwalters wants to merge 2 commits into
composefs:mainfrom
cgwalters:rebase-on-tar-core
Draft

archive: Replace entry parsing with tar-core Parser#447
cgwalters wants to merge 2 commits into
composefs:mainfrom
cgwalters:rebase-on-tar-core

Conversation

@cgwalters
Copy link
Copy Markdown
Collaborator

@cgwalters cgwalters commented Mar 20, 2026

Obviously, this is a giant change; putting this up as draft for discussion.

In https://github.com/composefs/composefs-rs we need access to the raw bytestream behind the tar archive and we process arbitrary potentially malicious tar archives (as used by OCI). I wanted something that was heavily fuzzed and robust. tar-core also has unsafe-code = "forbid".

Sync and async unification

To an especially compelling thing about the "sans-io" approach of tar-core is we can finally unify the backend implementation between this project and the current async fork. See also "tarmageddon". Particularly in light of the just-published CVEs for this crate which relate to the above (and I help fix) - tar-core of course enforces more uniform interpretation by unifying that parser. I also spent some time cross-checking interoperability with other popular implementations in Python and Go.

Semver

Note this change as is designed to be semver compatible! People keep proposing semver-breaking changes here and we only recently added a check for that, which should pass here.

However - I would like to propose after this at some point we do an "easy to port" breaking bump to 0.5 where we remove some deprecated APIs (like the raw mode - anyone who wants that can just use tar-core!). And also, if we can commit to reusing the types exposed by tar-core (which does have the issue it needs to bump semver when zerocopy does - that's how we avoid unsafe code) - which I think we can do, it will dramatically simplify this crate.

cap-std

Another recent vulnerability here was due to symlink traversal. Something I'd like to add to this crate is an optional dependency on cap-std which is a very well engineered project that uniformly avoids those types of issues.

Feedback appreciated

Obviously so far tar-core has mostly been reviewed by me, and any other people looking at it would be very appreciated. One thing I'm happy with is the extensive test coverage - e.g. we're fuzzing in a variety of ways - including verifying that tar-core and this crate have the same expected interpretation for tarballs.


Replace the hand-rolled entry parsing loop and sparse header
handling with tar-core's Parser state machine. The Parser
processes GNU long name/link extensions, PAX headers (including
global extensions), and GNU/PAX sparse formats internally,
emitting fully resolved entries.

This brings PAX sparse v1.0 support (fixing issues #286 and #295)
with no public API changes. Entries::raw(true) is preserved using
the original single-header reading code path for callers that need
to see every tar record (including extension headers) individually.

Assisted-by: OpenCode (Claude Opus 4)
Signed-off-by: Colin Walters walters@verbum.org

cgwalters and others added 2 commits March 20, 2026 12:55
Replace the hand-rolled entry parsing loop and sparse header
handling with tar-core's Parser state machine. The Parser
processes GNU long name/link extensions, PAX headers (including
global extensions), and GNU/PAX sparse formats internally,
emitting fully resolved entries.

This brings PAX sparse v1.0 support (fixing issues #286 and #295)
with no public API changes. Entries::raw(true) is preserved using
the original single-header reading code path for callers that need
to see every tar record (including extension headers) individually.

Assisted-by: OpenCode (Claude Opus 4)
Signed-off-by: Colin Walters <walters@verbum.org>
Prove that the tar-core parser integration fixes issues #286
(incorrect file size for PAX sparse entries) and #295 (files
extracted under GNUSparseFile.0/ instead of their real name).

The test archive is from ncihnegn's PR #298, which this work
obsoletes.

Co-authored-by: Haowen Ning <ncihnegn@gmail.com>
Assisted-by: OpenCode (Claude Opus 4)
Signed-off-by: Colin Walters <walters@verbum.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant