Various features to support bootc image deltas #64
In bootc images, the typical layout for a layer tar is:

```
sysroot/ostree/repo/objects/9f/a74817a833dd0b4cefd91da9072006dde770bff03166a75f8e0f2e6b795c9e.file
usr/bin/bash link to sysroot/ostree/repo/objects/9f/a74817a833dd0b4cefd91da9072006dde770bff03166a75f8e0f2e6b795c9e.file
```

In the tar file this makes the sha256 name a "real" file object, and the actual file a hardlink referencing it. When diffing such a layer we're only looking at the path/basename of the "real" file, which means we will never find the right source to delta against.

To fix this we record *all* the names for each file, and compare against them.

Comparing an OCI layer with this gives a large boost:

    -rw-r--r--. 1 alex alex  17M 25 mar 10.58 image1-layer.tar
    -rw-r--r--. 1 alex alex  17M 25 mar 10.58 image2-layer.tar
    -rw-r--r--. 1 alex alex  17M 25 mar 10.59 old-result.tardiff
    -rw-r--r--. 1 alex alex 3,0M 25 mar 11.19 new-result.tardiff

Signed-off-by: Alexander Larsson <alexl@redhat.com>
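The idea of recording every name for a file can be sketched roughly as follows. This is a minimal illustration built on Go's standard `archive/tar` package, not the code from this PR: `fileNames`, `collectNames`, and `buildExample` are hypothetical names, and the object path is shortened for readability.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// fileNames maps the path of each regular file to every name it is known by:
// its own path first, followed by any hardlinks that point at it.
// (Hypothetical type, for illustration only.)
type fileNames map[string][]string

// collectNames scans a tar stream and records all names for each regular file.
func collectNames(r io.Reader) (fileNames, error) {
	names := fileNames{}
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		switch hdr.Typeflag {
		case tar.TypeReg:
			names[hdr.Name] = append(names[hdr.Name], hdr.Name)
		case tar.TypeLink:
			// A hardlink is just another name for an earlier regular file.
			if _, ok := names[hdr.Linkname]; ok {
				names[hdr.Linkname] = append(names[hdr.Linkname], hdr.Name)
			}
		}
	}
}

// buildExample writes a tiny bootc-style layer: one object file plus a
// hardlink to it. The object name is a shortened stand-in for a sha256 name.
func buildExample() (*bytes.Buffer, error) {
	const object = "sysroot/ostree/repo/objects/9f/a748.file"
	data := []byte("fake bash contents\n")
	buf := &bytes.Buffer{}
	tw := tar.NewWriter(buf)
	if err := tw.WriteHeader(&tar.Header{
		Name: object, Typeflag: tar.TypeReg, Mode: 0755, Size: int64(len(data)),
	}); err != nil {
		return nil, err
	}
	if _, err := tw.Write(data); err != nil {
		return nil, err
	}
	if err := tw.WriteHeader(&tar.Header{
		Name: "usr/bin/bash", Typeflag: tar.TypeLink, Linkname: object, Mode: 0755,
	}); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	layer, err := buildExample()
	if err != nil {
		panic(err)
	}
	names, err := collectNames(layer)
	if err != nil {
		panic(err)
	}
	for object, all := range names {
		fmt.Println(object, "is also known as", all[1:])
	}
}
```

With both names recorded, a matcher can compare `usr/bin/bash` in the new layer against `usr/bin/bash` in the old one, even though the data is stored under the object name.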
We need to use HasSuffix, not HasPrefix. Signed-off-by: Alexander Larsson <alexl@redhat.com>
Sometimes you have multiple tar files as sources for delta information. In particular, this is common when you are diffing OCI container image layers. For example, when generating a delta for one layer in a new image you don't necessarily know which layer has the original files, because layer indexes are not stable, especially with bootc-style OCI images that get rechunked. This is mostly trivial code that makes oldTars an array, but there is some complexity in how you have to handle filenames that conflict in the old tars. We assume they have been extracted in the order given, so any file in an earlier tar file that has been overwritten by a file from a later tar file will be marked overwritten and not used as a delta source. Signed-off-by: Alexander Larsson <alexl@redhat.com>
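The overwrite rule can be illustrated with a small sketch. It assumes each old tar has already been reduced to an ordered list of paths; the `source` type and `markOverwritten` function are hypothetical and are not the data structures used in this PR.

```go
package main

import "fmt"

// source describes one candidate file from one of the old tars.
// (Hypothetical type, for illustration only.)
type source struct {
	tarIndex    int
	path        string
	overwritten bool
}

// markOverwritten walks the old tars in extraction order. When a path shows
// up again, the earlier entry is flagged as overwritten so that it is not
// used as a delta source.
func markOverwritten(oldTars [][]string) []source {
	var sources []source
	latest := map[string]int{} // path -> index in sources of the newest entry
	for i, paths := range oldTars {
		for _, p := range paths {
			if prev, ok := latest[p]; ok {
				sources[prev].overwritten = true
			}
			latest[p] = len(sources)
			sources = append(sources, source{tarIndex: i, path: p})
		}
	}
	return sources
}

func main() {
	oldTars := [][]string{
		{"usr/bin/bash", "etc/hosts"}, // first layer
		{"usr/bin/bash"},              // later layer replaces bash
	}
	for _, s := range markOverwritten(oldTars) {
		fmt.Printf("tar %d: %s overwritten=%v\n", s.tarIndex, s.path, s.overwritten)
	}
}
```

Only the entry from the last tar that provides a given path stays usable, which matches what would be on disk after extracting the tars in order.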
If --source-prefix is specified, only files with that prefix are used as sources for deltas. This can be useful if you only have a partially extracted version of the tar files on the system when applying the patch. This is particularly useful for bootc images, because only the files in /sysroot/ostree/repo/objects/ are easily available. Signed-off-by: Alexander Larsson <alexl@redhat.com>
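A prefix filter of this kind could look roughly like the sketch below. It assumes every source file carries the list of all its names (the real path plus hardlinks) and that tar paths have no leading slash; `pickSourceName` is a hypothetical helper, not the function added in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// pickSourceName returns the first name of a file that starts with prefix.
// An empty prefix accepts any name, so filtering is effectively disabled.
// (Hypothetical helper, for illustration only.)
func pickSourceName(names []string, prefix string) (string, bool) {
	for _, n := range names {
		if strings.HasPrefix(n, prefix) {
			return n, true
		}
	}
	return "", false
}

func main() {
	// All names of one file: a shortened object path and its hardlink.
	names := []string{
		"sysroot/ostree/repo/objects/9f/a748.file",
		"usr/bin/bash",
	}
	name, ok := pickSourceName(names, "sysroot/ostree/repo/objects/")
	fmt.Println(name, ok)

	_, ok = pickSourceName([]string{"usr/bin/bash"}, "sysroot/ostree/repo/objects/")
	fmt.Println(ok)
}
```

A file with no name under the prefix is simply skipped as a source, so the resulting delta only refers to data that is expected to be present when the patch is applied.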
Code Review
This pull request significantly enhances tar-diff by introducing support for multiple 'old' tar files as sources for delta generation, which is particularly useful for layered container images. It also adds a new --source-prefix command-line option to filter which files from the old tar archives can be considered as delta sources. This involved refactoring internal data structures like tarFileInfo and sourceInfo to accommodate multiple paths for hardlinks and track the origin tar file for each entry. The analyzeForDelta and extractDeltaData functions were updated to process these multiple inputs and apply the new prefix filtering logic. The README.md and command-line usage examples have been updated to reflect these new capabilities. A review comment suggests further improving delta efficiency by also excluding .gz files from delta candidates, similar to how .xz and .bz2 files are handled.
    if strings.HasSuffix(basename, ".xz") ||
        strings.HasSuffix(basename, ".bz2") {
        return false
For consistency and improved delta efficiency, consider also excluding .gz files from delta candidates. Binary diffing compressed data (even if rsyncable) is generally ineffective, as small changes in the uncompressed content can lead to large, unpredictable changes in the compressed output, making the delta larger than necessary or even larger than the original file. Excluding all common compressed formats (.xz, .bz2, .gz) would align with the goal of finding "non-delta-able files (currently just compression)".
Suggested change:

    - if strings.HasSuffix(basename, ".xz") ||
    -     strings.HasSuffix(basename, ".bz2") {
    -     return false
    + if strings.HasSuffix(basename, ".xz") ||
    +     strings.HasSuffix(basename, ".bz2") ||
    +     strings.HasSuffix(basename, ".gz") {
There is a comment about this in the code "NB: We explicitly don't have .gz here in case someone might be using --rsyncable for that". That was taken directly from the original code in ostree (ostree-repo-static-delta-compilation-analysis.c). I'm not sure if we agree with this still, but I didn't want to change it in this MR.
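For reference, the check under discussion amounts to something like the following sketch. `looksDeltaable` is a hypothetical name and the surrounding function in tar-diff may differ; the sketch keeps the current behaviour, where `.gz` is deliberately not excluded.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// looksDeltaable reports whether a file is worth binary-diffing, based only
// on the extension of its basename. (Hypothetical helper, illustration only.)
func looksDeltaable(name string) bool {
	basename := path.Base(name)
	// The extension is at the end of the name, so this has to be a suffix
	// match; a prefix match would look at the wrong end of the basename.
	// .gz is not listed, in case the file was compressed with --rsyncable.
	if strings.HasSuffix(basename, ".xz") ||
		strings.HasSuffix(basename, ".bz2") {
		return false
	}
	return true
}

func main() {
	for _, name := range []string{
		"usr/share/doc/notes.xz",
		"usr/share/doc/notes.gz",
		"usr/bin/bash",
	} {
		fmt.Println(name, looksDeltaable(name))
	}
}
```

Adopting the reviewer's suggestion would mean adding a third `strings.HasSuffix(basename, ".gz")` clause to the condition.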
For some reason it is not allowed to force push to a branch, so I couldn't push a new rebased version of this that fixes the lint issue. So, there is a new version in #65 which has the fixes.
Here are some changes that help when creating deltas for bootc images:
With these, I was able to create pretty small deltas for bootc oci images with my hacked up wip oci delta tool (https://github.com/alexlarsson/oci-delta-tool)
Note: All of these changes are generic and can be useful for other types of tar use as well.