Add support for PAX Format, Version 1.0 #298
|
Windows CI is failing, as in #299. |
|
@ncihnegn Thank you very much for this PR! |
|
@alexcrichton Can you review this PR please? |
| let off = block.offset()?; | ||
| let len = block.length()?; | ||
| if len != 0 && (size - remaining) % 512 != 0 { | ||
| let mut add_block = |block: &SparseEntry| -> io::Result<_> { |
There was a problem hiding this comment.
I think it's ok to avoid a new SparseEntry type here and just take two parameters for offset/size
| Some(gnu) => gnu, | ||
| None => return Err(other("sparse entry type listed but not GNU header")), | ||
| }; | ||
| let mut sparse_map = Vec::<SparseEntry>::new(); |
There was a problem hiding this comment.
One of the main goals I tried to keep for the tar crate is to minimize internal allocations. Instead of having a temporary list here could this be refactored to avoid the intermediate allocation?
There was a problem hiding this comment.
I don't think so. Here we need to convert strings to numbers and the number of pairs is not fixed.
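For reference, one way the intermediate list might be avoided is to parse the pairs lazily. This is only a sketch, and it assumes the offset/size pairs arrive as a single comma-separated decimal string; the input layout in the PR may differ, and `sparse_pairs`, `parse_pair`, and `invalid` are hypothetical names, not part of the crate.

```rust
use std::io;

/// Hypothetical helper: builds an `InvalidData` error with a static message.
fn invalid(msg: &'static str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Parses one `(offset, size)` pair; `size` is `None` when the input
/// ended after an offset, which is reported as an error.
fn parse_pair(offset: &str, size: Option<&str>) -> io::Result<(u64, u64)> {
    let size = size.ok_or_else(|| invalid("offset without a size"))?;
    let offset = offset.parse::<u64>().map_err(|_| invalid("bad offset"))?;
    let size = size.parse::<u64>().map_err(|_| invalid("bad size"))?;
    Ok((offset, size))
}

/// Lazily yields pairs from input such as "0,512,4096,1024".
/// No `Vec` is built; each pair is parsed only when it is requested.
fn sparse_pairs(map: &str) -> impl Iterator<Item = io::Result<(u64, u64)>> + '_ {
    let mut fields = map.split(',');
    std::iter::from_fn(move || {
        let offset = fields.next()?;
        Some(parse_pair(offset, fields.next()))
    })
}
```

The caller can then feed each pair straight into something like `add_block` as it is produced, so the number of pairs does not need to be known up front.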
| } | ||
| } | ||
|
|
||
| #[allow(unused_assignments)] // https://github.com/rust-lang/rust/issues/22630 |
There was a problem hiding this comment.
I think it would be best to remove this or move the comment to where the warning is printed instead.
| if is_recognized_header && fields.is_pax_sparse() { | ||
| gnu_longname = fields.pax_sparse_name(); | ||
| } |
There was a problem hiding this comment.
This feels different than the current organization. Instead of pretending that the gnu_longname field was present if a pax-specified field is present could the accessor which looks at long_pathname be updated to consult the pax extensions if they're present?
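A rough sketch of what such an accessor could look like follows. The `Fields` struct here is a hypothetical stand-in, not the crate's real type, and the precedence order (PAX sparse name, then GNU long name, then the plain header path) is an assumption made for illustration.

```rust
/// Hypothetical stand-in for the per-entry fields discussed in the review.
struct Fields {
    /// Name taken from a GNU long-name entry, if one preceded this entry.
    long_pathname: Option<Vec<u8>>,
    /// Real file name taken from the PAX sparse extension, if present.
    pax_sparse_name: Option<Vec<u8>>,
    /// Path stored directly in the ustar header.
    header_path: Vec<u8>,
}

impl Fields {
    /// Resolves the path at lookup time instead of overwriting
    /// `long_pathname` when a PAX sparse name happens to be present.
    fn path_bytes(&self) -> &[u8] {
        self.pax_sparse_name
            .as_deref()
            .or(self.long_pathname.as_deref())
            .unwrap_or(self.header_path.as_slice())
    }
}
```

With this shape, the parsing loop stores each source of the name separately, and only the accessor decides which one wins.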
| // Not an entry | ||
| // Keep pax_extensions for the next ustar header | ||
| processed -= 1; |
There was a problem hiding this comment.
I'm not sure what this is doing? An entry was consumed here so I don't know why this value would be decremented?
There was a problem hiding this comment.
In PAX format, each entry has two headers: pax and ustar.
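The two-header layout described in this reply can be illustrated with a small model. This is a sketch with hypothetical types (`Header`, `Entry`, `collect_entries`), not the crate's iterator: a PAX extended header is remembered but not counted as an entry, and the next ustar header picks it up.

```rust
/// Hypothetical, simplified view of the headers found in an archive.
enum Header {
    /// A PAX extended header; its payload applies to the next ustar header.
    PaxExtended(Vec<u8>),
    /// A regular ustar header describing an actual entry.
    Ustar { name: String },
}

/// One logical entry as seen by the caller.
struct Entry {
    name: String,
    pax_extensions: Option<Vec<u8>>,
}

/// Folds the header stream into logical entries.
fn collect_entries(headers: Vec<Header>) -> Vec<Entry> {
    let mut pending = None;
    let mut entries = Vec::new();
    for header in headers {
        match header {
            // Not an entry by itself: keep the data for the next header.
            Header::PaxExtended(data) => pending = Some(data),
            // `take()` hands over the pending data and resets it to `None`.
            Header::Ustar { name } => entries.push(Entry {
                name,
                pax_extensions: pending.take(),
            }),
        }
    }
    entries
}
```

In this model only ustar headers produce entries, which is the behaviour the `processed -= 1` adjustment appears to be compensating for.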
| let mut reader = io::BufReader::with_capacity(BLOCK_SIZE, &self.archive.inner); | ||
| let mut read_decimal_line = || -> io::Result<u64> { | ||
| let mut str = String::new(); | ||
| num_bytes_read += reader.read_line(&mut str)?; | ||
| str.strip_suffix("\n") | ||
| .and_then(|s| s.parse::<u64>().ok()) | ||
| .ok_or_else(|| other("failed to read a decimal line")) | ||
| }; | ||
|
|
||
| let num_entries = read_decimal_line()?; | ||
| for _ in 0..num_entries { | ||
| let offset = read_decimal_line()?; | ||
| let size = read_decimal_line()?; | ||
| sparse_map.push(SparseEntry { offset, size }); | ||
| } |
There was a problem hiding this comment.
I don't think I understand how this could work and pass tests. This is creating a temporary buffer to read from the inner underlying data stream but then the buffer is discarded outside of this scope. That means that more data than necessary could be consumed from the inner data stream and accidentally discarded.
I don't think that this should create a temporary buffer here but instead, if necessary, use a stack-local buffer and then do byte-searching within since presumably the entry here is typically small enough for that.
There was a problem hiding this comment.
It will consume exactly a block of 512 bytes. After reading the necessary data, the remaining filler will be discarded.
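The stack-local alternative suggested in the review might look roughly like the sketch below. It assumes the whole sparse map fits in a single 512-byte block (a larger map would need further blocks), and `read_block`, `next_decimal`, and `invalid` are hypothetical helper names.

```rust
use std::io::{self, Read};

const BLOCK_SIZE: usize = 512;

/// Hypothetical helper: builds an `InvalidData` error with a static message.
fn invalid(msg: &'static str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Reads exactly one block into a stack buffer, so nothing past the
/// block is pulled from the underlying stream.
fn read_block<R: Read>(reader: &mut R) -> io::Result<[u8; BLOCK_SIZE]> {
    let mut block = [0u8; BLOCK_SIZE];
    reader.read_exact(&mut block)?;
    Ok(block)
}

/// Parses the newline-terminated decimal number starting at `*pos`
/// and advances `*pos` past the newline.
fn next_decimal(block: &[u8], pos: &mut usize) -> io::Result<u64> {
    let rest = &block[*pos..];
    let end = rest
        .iter()
        .position(|&b| b == b'\n')
        .ok_or_else(|| invalid("missing newline"))?;
    let text = std::str::from_utf8(&rest[..end]).map_err(|_| invalid("not UTF-8"))?;
    let value = text
        .parse::<u64>()
        .map_err(|_| invalid("not a decimal number"))?;
    *pos += end + 1;
    Ok(value)
}
```

Because `read_exact` consumes precisely `BLOCK_SIZE` bytes, the position of the inner stream after the call is unambiguous, which a `BufReader` with unspecified fill behaviour does not make as obvious.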
| None => return Err(other("sparse entry type listed but not GNU header")), | ||
| }; | ||
| let mut sparse_map = Vec::<SparseEntry>::new(); | ||
| let mut real_size = 0; |
There was a problem hiding this comment.
Something about this doesn't feel quite right because real_size isn't set in all branches of the if statement below whereas prior it was always set to a particular value.
There was a problem hiding this comment.
Not sure what you mean. It is set in both branches, line 428 and 452.
There was a problem hiding this comment.
We could do though:
let real_size = if entry.is_pax_sparse() { ... } else { ... }
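Spelled out a little further, the expression form could look like the sketch below. The function name and parameters are hypothetical; the point is only that binding `real_size` to the result of the `if` makes the compiler check that every branch yields a value.

```rust
use std::io;

/// Hypothetical helper: picks the real (expanded) size of a sparse file.
fn sparse_real_size(
    is_pax_sparse: bool,
    pax_real_size: Option<u64>,
    gnu_real_size: u64,
) -> io::Result<u64> {
    // No `let mut real_size = 0;` placeholder: each branch must produce a
    // value, so a branch that forgets to set it cannot compile.
    let real_size = if is_pax_sparse {
        pax_real_size.ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidData, "missing PAX real size")
        })?
    } else {
        gnu_real_size
    };
    Ok(real_size)
}
```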
|
Did this go stale? |
|
Updated. |
|
@alexcrichton, could you please re-review this? The Julia language's version multiplexer/installer uses this crate but we're blocked on porting the installer to all of Julia's supported platforms until this is merged. |
|
This is quite an old PR and I've unfortunately lost context on this. I'd also understand if @ncihnegn here wouldn't want to shepherd things along after it's been 2 years since it's been opened. Despite that though this is somewhat tricky code and I'm hesitant to merge as-is. Merging this would require me to get a better understanding of PAX and how this crate works again (it's been a while) so that's a fair bit of context to boot back up on. The test coverage also looks relatively light here and additionally there's not a ton of documentation internally about what's going on either. There are also pieces I don't fully understand like creating intermediate. Overall I unfortunately do not have the time to personally help push this across the finish line, but I can try to outline what would make this easier to land if others are interested in helping to contribute. |
|
As I understand things, #375 added support for GNU sparse, but this would be the PAX version? |
Yes. See |
|
@ncihnegn @alexcrichton Hi there - any chance this gets finished / merged? I am working my way through adding PAX archive read/write support to a Rust app. If you guys need help, feel free to ping me. I'd love to move/remove this logic out of my app (and have it in a crate instead, ideally this one). |
|
I'm strapped pretty thin right now, but @cgwalters or @xzfc would one of y'all be up for helping to push this over the finish line? |
cgwalters
left a comment
There was a problem hiding this comment.
I just did a mostly superficial pass so far, haven't looked at correctness of sparse handling (I did glance at the gnu tar docs a while ago and just recall it's a mess)
|
|
||
| #[test] | ||
| fn pax_sparse() { | ||
| let rdr = Cursor::new(tar!("pax_sparse.tar")); |
There was a problem hiding this comment.
Post the xz fiasco let's be a bit more sensitive about committing binary data to git. Can you add the script that generates this at least? Or probably better honestly for tests, just assume we have a working external tar binary that can generate the relevant data.
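A test helper along these lines could generate the archive at test time instead of committing it. This sketch assumes GNU tar is installed as `tar` on the `PATH`; the flags are GNU tar options and other implementations such as bsdtar may not accept them. The function name is hypothetical.

```rust
use std::path::Path;
use std::process::Command;

/// Builds (but does not run) a GNU tar invocation that archives `member`
/// from `dir` into `archive`, using the PAX format with sparse map 1.0.
fn sparse_pax_tar_command(dir: &Path, archive: &Path, member: &str) -> Command {
    let mut cmd = Command::new("tar");
    cmd.arg("-c") // create an archive
        .arg("--sparse") // detect holes in the input files
        .arg("--sparse-version=1.0") // sparse map stored in the data blocks
        .arg("--format=posix") // PAX interchange format
        .arg("-f")
        .arg(archive)
        .arg("-C")
        .arg(dir)
        .arg(member);
    cmd
}
```

A test would first create the input file (for example by extending an empty file with `File::set_len`, which yields a hole on filesystems that support sparse files), call `.status()` on the returned command, and then read the resulting archive back with the crate.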
| fields.long_linkname = gnu_longlink; | ||
| fields.pax_extensions = pax_extensions; | ||
| // False positive: unused assignment | ||
| // https://github.com/rust-lang/rust/issues/22630 |
There was a problem hiding this comment.
It looks like this has been fixed so we should be able to drop the assignment.
There was a problem hiding this comment.
We don't have extra assignments here.
There was a problem hiding this comment.
Can't we address this by doing:
fields.pax_extensions = pax_extensions.take();
?
| } | ||
| } | ||
|
|
||
| #[allow(unused_assignments)] |
There was a problem hiding this comment.
Hopefully we can drop this now
There was a problem hiding this comment.
No, rustc 1.87.0-nightly still complains.
| pub fn pax_sparse_name(&mut self) -> Option<Vec<u8>> { | ||
| if let Some(ref pax) = self.pax_extensions { | ||
| return PaxExtensions::new(pax) | ||
| .filter_map(|f| f.ok()) |
There was a problem hiding this comment.
I'm not a big fan of "swallowing" errors like this, my preference would be to make this function return Result<Option<Vec<u8>>> and propagate this error instead.
There was a problem hiding this comment.
I don't understand. To propagate the error using Result we will be dropping the Vec.
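To illustrate the suggested signature: wrapping the `Option` in a `Result` does not lose the `Vec`, since a successful lookup still returns `Ok(Some(vec))`. The sketch below is hypothetical and models each PAX record as an `io::Result<(&str, &[u8])>` rather than using the crate's `PaxExtensions` type; the key `GNU.sparse.name` in the usage is the GNU tar PAX keyword for the real file name.

```rust
use std::io;

/// Looks up `wanted` among parsed records, propagating the first parse
/// error instead of silently skipping it with `filter_map(|f| f.ok())`.
fn find_value<'a, I>(records: I, wanted: &str) -> io::Result<Option<Vec<u8>>>
where
    I: IntoIterator<Item = io::Result<(&'a str, &'a [u8])>>,
{
    for record in records {
        // `?` returns early with the error; valid records continue below.
        let (key, value) = record?;
        if key == wanted {
            return Ok(Some(value.to_vec()));
        }
    }
    // All records parsed cleanly, but the key was not present.
    Ok(None)
}
```

The three outcomes stay distinct for the caller: `Ok(Some(name))` when found, `Ok(None)` when absent, and `Err(e)` when a record was malformed.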
|
|
||
| /// Description of a sparse entry. | ||
| pub struct SparseEntry { | ||
| pub offset: u64, |
There was a problem hiding this comment.
Let's also document the fields please
There was a problem hiding this comment.
Offset and size names are self explanatory.
There was a problem hiding this comment.
That's definitely true, but I think there's a general principle here that everything pub should have a docstring, even if trivial.
In some other crates I maintain we use deny(missing_docs).
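As a small illustration of that principle, the struct could be documented as below. The wording of the doc comments is an assumption about what the fields mean, and the lint is scoped to a hypothetical `sparse` module here so the snippet stands alone; a crate would more typically put `#![deny(missing_docs)]` at the top of `lib.rs`.

```rust
/// Sparse-map types, with the `missing_docs` lint enforced for this module.
#[deny(missing_docs)]
pub mod sparse {
    /// Describes one data-bearing region of a sparse file.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct SparseEntry {
        /// Byte offset of the region within the expanded file.
        pub offset: u64,
        /// Length of the region in bytes.
        pub size: u64,
    }
}
```

With the lint set to `deny`, removing either field comment turns the omission into a compile error in a library crate rather than something a reviewer has to catch.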
|
Ping. |
|
Yet another ping - what's still left to do to get this one over the finish line? I've wrapped good old GNU tar in my app, and it's about as horrible as one might expect. I'd rather help to get this PR done. |
|
Any update for this PR? |
|
@cgwalters, are you able to take another look at some of the replies to and changes since your last review? @ncihnegn, I believe some of the comments from the initial review have not yet been addressed, e.g. the ones about documentation. |
|
OK I rebased this, dropped the prebuilt binary from the tests and instead significantly expanded the tests (assisted by Claude): https://github.com/cgwalters/tar-rs/tree/pax - any objections to replacing this PR with the contents of that branch? |
|
I happened to see golang/go#75677 go by. A key bit looks like https://github.com/golang/go/blob/e1ca1de1234aa0f6be85c97db5492a94b099a305/src/archive/tar/format.go#L149 |
|
@cgwalters since it seems like the PRs have been flowing recently, would it be feasible to get something like this merged? If the OP is unresponsive, it's probably fine to submit a new PR with your changes (ideally maintaining authorship of the OP here). |
|
I am following this thread and will do anything to get it merged. I don't mind if you choose to merge the PR from @cgwalters.
Maybe look at incorporating these changes? |
|
Hi, sorry, I meant to follow up here, and I'll post more about this soon, but basically I recently created https://github.com/composefs/tar-core, and https://github.com/cgwalters/tar-rs/tree/main has PoC patches for rebasing this crate on it. I did verify that rebasing on tar-core fixes this issue for free. tar-core also has (made a bit easier by its strict sans-io architecture) metadata limits that fix #298 (comment) |
Prove that the tar-core parser integration fixes issues #286 (incorrect file size for PAX sparse entries) and #295 (files extracted under GNUSparseFile.0/ instead of their real name). The test archive is from ncihnegn's PR #298, which this work obsoletes.

Co-authored-by: Haowen Ning <ncihnegn@gmail.com>
Assisted-by: OpenCode (Claude Opus 4)
Signed-off-by: Colin Walters <walters@verbum.org>
|
Now up in #447 |
To fix #286 #295.