Skip to content

Add support for PAX Format, Version 1.0#298

Open
ncihnegn wants to merge 9 commits into
composefs:mainfrom
ncihnegn:pax
Open

Add support for PAX Format, Version 1.0#298
ncihnegn wants to merge 9 commits into
composefs:mainfrom
ncihnegn:pax

Conversation

@ncihnegn
Copy link
Copy Markdown
Contributor

@ncihnegn ncihnegn commented Aug 12, 2022

To fix #286 #295.

@ncihnegn
Copy link
Copy Markdown
Contributor Author

ncihnegn commented Aug 13, 2022

Windows CI is failing as #299.

@NobodyXu
Copy link
Copy Markdown

@ncihnegn Thank you very much for this PR!

@NobodyXu
Copy link
Copy Markdown

NobodyXu commented Sep 6, 2022

@alexcrichton Can you review this PR please?
It is really important for cargo-binstall as users of it encounters PAX format in practice and lack of PAX format means cargo-binstall gives errors even if a pre-built binary is available.

Comment thread src/archive.rs Outdated
let off = block.offset()?;
let len = block.length()?;
if len != 0 && (size - remaining) % 512 != 0 {
let mut add_block = |block: &SparseEntry| -> io::Result<_> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok to avoid a new SparseEntry type here and just take two parameters for offset/size

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread src/archive.rs
Some(gnu) => gnu,
None => return Err(other("sparse entry type listed but not GNU header")),
};
let mut sparse_map = Vec::<SparseEntry>::new();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the main goals I tried to keep for the tar crate is to minimize internal allocations. Instead of having a temporary list here could this be refactored to avoid the intermediate allocation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Here we need to convert strings to numbers and the number of pairs is not fixed.

Comment thread src/archive.rs Outdated
}
}

#[allow(unused_assignments)] // https://github.com/rust-lang/rust/issues/22630
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best to remove this or move the comment to where the warning is printed instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread src/archive.rs Outdated
Comment on lines +408 to +410
if is_recognized_header && fields.is_pax_sparse() {
gnu_longname = fields.pax_sparse_name();
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels different than the current organization. Instead of pretending that the gnu_longname field was present if a pax-specified field is present could the accessor which looks at long_pathname be updated to consult the pax extensions if they're present?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread src/archive.rs Outdated
Comment on lines +399 to +401
// Not an entry
// Keep pax_extensions for the next ustar header
processed -= 1;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this is doing? An entry was consumed here so I don't know why this value would be decremented?

Copy link
Copy Markdown
Contributor Author

@ncihnegn ncihnegn Sep 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PAX format, each entry has two headers: pax and ustar.

Comment thread src/archive.rs Outdated
Comment on lines +427 to +441
let mut reader = io::BufReader::with_capacity(BLOCK_SIZE, &self.archive.inner);
let mut read_decimal_line = || -> io::Result<u64> {
let mut str = String::new();
num_bytes_read += reader.read_line(&mut str)?;
str.strip_suffix("\n")
.and_then(|s| s.parse::<u64>().ok())
.ok_or_else(|| other("failed to read a decimal line"))
};

let num_entries = read_decimal_line()?;
for _ in 0..num_entries {
let offset = read_decimal_line()?;
let size = read_decimal_line()?;
sparse_map.push(SparseEntry { offset, size });
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand how this could work and pass tests. This is creating a temporary buffer to read from the inner underlying data stream but then the buffer is discarded outside of this scope. That means that more data than necessary could be consumed from the inner data stream and accidentally discarded.

I don't think that this should create a temporary buffer here but instead, if necessary, use a stack-local buffer and then do byte-searching within since presumably the entry here is typically small enough for that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will consume exactly a block of 512 bytes. After reading the necessary data, the remaining filler will be discarded.

Comment thread src/archive.rs
None => return Err(other("sparse entry type listed but not GNU header")),
};
let mut sparse_map = Vec::<SparseEntry>::new();
let mut real_size = 0;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something about this doesn't feel quite right because real_size isn't set in all branches of the if statement below whereas prior it was always set to a particular value.

Copy link
Copy Markdown
Contributor Author

@ncihnegn ncihnegn Sep 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean. It is set in both branches, line 428 and 452.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do though:
let real_size = if entry.is_pax_sparse() { ... } else { ... }

@akesson
Copy link
Copy Markdown

akesson commented Dec 16, 2022

Did this go stale?

@davidanthoff
Copy link
Copy Markdown

Is there any chance that this PR could be revived? Or maybe it is ready to be merged, given that it looks like @ncihnegn responded to all the change requests?

Presumably the first step would be to resolve the merge conflicts, @ncihnegn is that something you could do?

@ncihnegn
Copy link
Copy Markdown
Contributor Author

ncihnegn commented Jun 9, 2024

Updated.

@ararslan
Copy link
Copy Markdown

@alexcrichton, could you please re-review this? The Julia language's version multiplexer/installer uses this crate but we're blocked on porting the installer to all of Julia's supported platforms until this is merged.

@alexcrichton
Copy link
Copy Markdown
Collaborator

This is quite an old PR and I've unfortunately lost context on this. I'd also understand if @ncihnegn here wouldn't want to sheperd thing along after it's been 2 years since it's been opened. Despite that though this is somewhat tricky code and I'm hesitant to merge as-is. Merging this would require me to get a better understanding of PAX and how this crate works again (it's been awhile) so that's a fair bit of context to boot back up on. The test coverage also looks relatively light here and additionally there's not a ton of documentation internally about what's going on either.

There's also pieces I don't fully understand like creating intermediate BufReader values which feel like they can easily go wrong but I haven't fully investigated them yet.

Overall I unfortunately do not have the time to myself personally help push this across the finish line, but I can try to outline what would make this easier to land if others are interested in helping to contribute.

@cgwalters
Copy link
Copy Markdown
Collaborator

As I understand things, #375 added support for GNU sparse, but this would be the PAX version?

@ncihnegn
Copy link
Copy Markdown
Contributor Author

ncihnegn commented Oct 25, 2024

As I understand things, #375 added support for GNU sparse, but this would be the PAX version?

Yes. See
https://www.gnu.org/software/tar/manual/html_section/Sparse-Formats.html

@s-m-e
Copy link
Copy Markdown

s-m-e commented Feb 20, 2025

@ncihnegn @alexcrichton Hi there - any chance this gets finished / merged?

I am working my way through adding PAX archive read/write support to a Rust app. If you guys need help, feel free to ping me. I'd love to move/remove this logic out of my app (and have it in a crate instead, ideally this one).

@alexcrichton
Copy link
Copy Markdown
Collaborator

I'm strapped pretty thin right now, but @cgwalters or @xzfc would one of y'all be up for helping to push this over the finish line?

Copy link
Copy Markdown
Collaborator

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a mostly superficial pass so far, haven't looked at correctness of sparse handling (I did glance at the gnu tar docs a while ago and just recall it's a mess)

Comment thread tests/all.rs

#[test]
fn pax_sparse() {
let rdr = Cursor::new(tar!("pax_sparse.tar"));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Post the xz fiasco let's be a bit more sensitive about committing binary data to git. Can you add the script that generates this at least? Or probably better honestly for tests, just assume we have a working external tar binary that can generate the relevant data.

Comment thread tests/all.rs Outdated
Comment thread src/entry.rs
Comment thread src/entry.rs Outdated
Comment thread src/archive.rs
None => return Err(other("sparse entry type listed but not GNU header")),
};
let mut sparse_map = Vec::<SparseEntry>::new();
let mut real_size = 0;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do though:
let real_size = if entry.is_pax_sparse() { ... } else { ... }

Comment thread src/archive.rs
fields.long_linkname = gnu_longlink;
fields.pax_extensions = pax_extensions;
// False positive: unused assignment
// https://github.com/rust-lang/rust/issues/22630
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this has been fixed so we should be able to drop the assignment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have extra assignments here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we address this by doing:

fields.pax_extensions = pax_extensions.take();

?

Comment thread src/archive.rs
}
}

#[allow(unused_assignments)]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully we can drop this now

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, rustc 1.87.0-nightly still complains.

Comment thread src/entry.rs
pub fn pax_sparse_name(&mut self) -> Option<Vec<u8>> {
if let Some(ref pax) = self.pax_extensions {
return PaxExtensions::new(pax)
.filter_map(|f| f.ok())
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of "swallowing" errors like this, my preference would be to make this function return Result<Option<Vec<u8>>> and propagate this error instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. To propagate the error using Result we will be dropping the Vec.

Comment thread src/header.rs

/// Description of a spare entry.
pub struct SparseEntry {
pub offset: u64,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also document the fields please

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offset and size names are self explanatory.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's definitely true, but I think there's a general principle here that everything pub should have a docstring, even if trivial.

In some other crates I maintain we use deny(missing_docs).

Comment thread src/archive.rs
@ncihnegn ncihnegn requested a review from cgwalters February 28, 2025 09:05
@s-m-e
Copy link
Copy Markdown

s-m-e commented Apr 17, 2025

Ping.

@s-m-e
Copy link
Copy Markdown

s-m-e commented May 7, 2025

Yet another ping - what's still left to do to get this one over the finish line? I've wrapped good old GNU tar in my app, and its about as horrible as one might expect. I'd rather help to get this PR done.

@yjqg6666
Copy link
Copy Markdown

Any update for this PR?

@ararslan
Copy link
Copy Markdown

@cgwalters, are you able to take another look at some of the replies to and changes since your last review? @ncihnegn, I believe some of the comments from the initial review have not yet been addressed, e.g. the ones about documentation.

@cgwalters
Copy link
Copy Markdown
Collaborator

OK I rebased this, dropped the prebuilt binary from the tests and instead significantly expanded the tests (assisted by Claude): https://github.com/cgwalters/tar-rs/tree/pax - any objections to replacing this PR with the contents of that branch?

@cgwalters
Copy link
Copy Markdown
Collaborator

I happened to see golang/go#75677 go by. A key bit looks like https://github.com/golang/go/blob/e1ca1de1234aa0f6be85c97db5492a94b099a305/src/archive/tar/format.go#L149

@djc
Copy link
Copy Markdown

djc commented Mar 17, 2026

@cgwalters since it seems like the PRs have been flowing recently, would it be feasible to get something like this merged? If the OP is unresponsive, it's probably fine to submit a new PR with your changes (ideally maintaining authorship of the OP here).

@ncihnegn
Copy link
Copy Markdown
Contributor Author

I am following this thread and will do anything to get it merge. I don't mind if you choose to merge the PR from @cgwalters .

@djc
Copy link
Copy Markdown

djc commented Mar 18, 2026

OK I rebased this, dropped the prebuilt binary from the tests and instead significantly expanded the tests (assisted by Claude): https://github.com/cgwalters/tar-rs/tree/pax - any objections to replacing this PR with the contents of that branch?

Maybe look at incorporating these changes?

@cgwalters
Copy link
Copy Markdown
Collaborator

Hi sorry I meant to followup here, and I'll post more about this soon but basically I recently created https://github.com/composefs/tar-core and https://github.com/cgwalters/tar-rs/tree/main is PoC patches for rebasing this crate on it.

And I did verify that rebasing on tar-core fixes this issue for free. tar-core also has (made a bit easier by its strict sans-io architecture) metadata limits that fix #298 (comment)

cgwalters referenced this pull request in cgwalters/tar-rs-original-fork Mar 20, 2026
Prove that the tar-core parser integration fixes issues #286
(incorrect file size for PAX sparse entries) and #295 (files
extracted under GNUSparseFile.0/ instead of their real name).

The test archive is from ncihnegn's PR #298, which this work
obsoletes.

Co-authored-by: Haowen Ning <ncihnegn@gmail.com>
Assisted-by: OpenCode (Claude Opus 4)
Signed-off-by: Colin Walters <walters@verbum.org>
cgwalters referenced this pull request in cgwalters/tar-rs-original-fork Mar 20, 2026
Prove that the tar-core parser integration fixes issues #286
(incorrect file size for PAX sparse entries) and #295 (files
extracted under GNUSparseFile.0/ instead of their real name).

The test archive is from ncihnegn's PR #298, which this work
obsoletes.

Co-authored-by: Haowen Ning <ncihnegn@gmail.com>
Assisted-by: OpenCode (Claude Opus 4)
Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters
Copy link
Copy Markdown
Collaborator

Now up in #447

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unpacking file size is incorrect (with reproduced repo)