Various features to support bootc image deltas#66
Conversation
In bootc images, the typical layout for a layer tar is: ``` sysroot/ostree/repo/objects/9f/a74817a833dd0b4cefd91da9072006dde770bff03166a75f8e0f2e6b795c9e.file usr/bin/bash link to sysroot/ostree/repo/objects/9f/a74817a833dd0b4cefd91da9072006dde770bff03166a75f8e0f2e6b795c9e.file ``` In the tar file this makes the sha256 name a "real" file object, and the actual file a hardlink referencing it. When diffing such a layer we're only looking at the path/basename of the "real" file, which means we will never find the right source to delta against. To fix this we record *all* the names for each file, and compare against them. Comparing an OCI layer with this gives a large boost: -rw-r--r--. 1 alex alex 17M 25 mar 10.58 image1-layer.tar -rw-r--r--. 1 alex alex 17M 25 mar 10.58 image2-layer.tar -rw-r--r--. 1 alex alex 17M 25 mar 10.59 old-result.tardiff -rw-r--r--. 1 alex alex 3,0M 25 mar 11.19 new-result.tardiff Signed-off-by: Alexander Larsson <alexl@redhat.com>
We need to use HasSuffix, not HasPrefix. Signed-off-by: Alexander Larsson <alexl@redhat.com>
Sometimes you have multiple tar files as source for delta information. In particular, this is common when you are diffing OCI container image layers. For example, when generating a delta for one layer in a new image you don't necessarily know what layer the has the original files, because layers index are not stable, especially with bootc style OCI images that get rechunked. This is mostly trivial code that makes oldTars an array, but there is some complexity in how you have to handle filenames that conflict in the old tars. We assume they have been extracted in the order given, so any files in an earlier tar-file that has been overwritten by a file from a later tar-file will be marked overwritten and not used as delta source. Signed-off-by: Alexander Larsson <alexl@redhat.com>
If ths is specified, only files with that prefix are used as sources for deltas. This can be useful if you only have a partially extracted version of the tar files on the system when applying the patch. This is particularly useful for bootc images, because only the files in /sysroot/ostree/repo/objects/ are easily available. Signed-off-by: Alexander Larsson <alexl@redhat.com>
We're getting lint errors like: cmd/tar-diff/main.go:71:3: exitAfterDefer: log.Fatalf will exit, and `defer file.Close()` will not run (gocritic) So, lets use a realMain() wrapper that use a return value, and then do the os.Exit() inside the real main(). This lets us safely use defer. Signed-off-by: Alexander Larsson <alexl@redhat.com>
These are not critical (the files are just read and will be closed at process termination anyway), so this is not fatal, but we log something so people are aware that something is weird. Signed-off-by: Alexander Larsson <alexl@redhat.com>
This makes no functional difference, but fixes this lint warning: pkg/tar-diff/analysis.go:353:1: cyclomatic complexity 31 of func `analyzeForDelta` is high (> 30) (gocyclo) Signed-off-by: Alexander Larsson <alexl@redhat.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request significantly enhances the tar-diff tool and library by introducing support for multiple 'old' tar files as delta sources, which is particularly beneficial for container image layers. The command-line tool's usage has been updated to reflect this, and a new --source-prefix flag allows filtering source files by path. Internal data structures and analysis logic have been refactored to accommodate these changes, including handling multiple paths for hardlinked files and correctly identifying overwritten files across layers. The README.md has been updated with examples for the new features, and new test cases have been added. There is no specific feedback to provide on the review comments, as they were filtered out.
| for _, basename := range file.basenames { | ||
| if strings.HasSuffix(basename, ".xz") || | ||
| strings.HasSuffix(basename, ".bz2") { |
There was a problem hiding this comment.
The change from strings.HasPrefix to strings.HasSuffix for checking file extensions like .xz or .bz2 is a crucial correction. HasPrefix would not correctly identify these extensions. Additionally, iterating over file.basenames ensures that all names associated with a hardlinked file are checked, which is a significant improvement for delta candidate identification.
Here are some changes that help when creating deltas for bootc images:
With these, I was able to create pretty small deltas for bootc oci images with my hacked up wip oci delta tool (https://github.com/alexlarsson/oci-delta-tool)
Note: All of these changes are generic and can be useful for other types of tar use as well.
(This is a new, rebased version of #64)