
Various features to support bootc image deltas#64

Closed
alexlarsson wants to merge 4 commits into main from handle-bootc-layers

Conversation

@alexlarsson
Collaborator

Here are some changes that help when creating deltas for bootc images:

  • Properly handle the hardlinks between ostree repo objects and normal files when finding delta sources
  • Support multiple "old" tar files for finding deltas (we don't know which layer has the right files)
  • Add the ability to filter which files are used as delta sources (only the objects in the ostree repo are available on the target system)

With these, I was able to create pretty small deltas for bootc oci images with my hacked up wip oci delta tool (https://github.com/alexlarsson/oci-delta-tool)

Note: All of these changes are generic and can be useful for other types of tar use as well.

In bootc images, the typical layout for a layer tar is:

```
sysroot/ostree/repo/objects/9f/a74817a833dd0b4cefd91da9072006dde770bff03166a75f8e0f2e6b795c9e.file
usr/bin/bash link to sysroot/ostree/repo/objects/9f/a74817a833dd0b4cefd91da9072006dde770bff03166a75f8e0f2e6b795c9e.file
```

In the tar file this makes the sha256 name a "real" file object, and the actual file a hardlink referencing it.

When diffing such a layer we're only looking at the path/basename of
the "real" file, which means we will never find the right source to
delta against. To fix this we record *all* the names for each file,
and compare against them.

Comparing an OCI layer with this gives a large boost:

-rw-r--r--. 1 alex alex  17M 25 mar 10.58  image1-layer.tar
-rw-r--r--. 1 alex alex  17M 25 mar 10.58  image2-layer.tar
-rw-r--r--. 1 alex alex  17M 25 mar 10.59  old-result.tardiff
-rw-r--r--. 1 alex alex 3,0M 25 mar 11.19  new-result.tardiff

Signed-off-by: Alexander Larsson <alexl@redhat.com>
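
The idea of recording all names for each file, described in the commit message above, might look roughly like the following. This is a minimal sketch, not the tar-diff code: `collectNames`, `sampleLayer`, and the shortened object path `objPath` are hypothetical names made up for illustration; only the `archive/tar` calls are real standard library API.

```go
package main

import (
	"archive/tar"
	"bytes"
	"errors"
	"io"
)

// objPath is a made-up, shortened object name used only for this demo.
const objPath = "sysroot/ostree/repo/objects/9f/example.file"

// collectNames maps each regular file in a tar stream to every path that
// refers to it: its own path first, then any hardlinks pointing at it.
// Hypothetical helper; the real tar-diff data structures may differ.
func collectNames(r io.Reader) (map[string][]string, error) {
	names := make(map[string][]string)
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		switch hdr.Typeflag {
		case tar.TypeReg:
			names[hdr.Name] = append(names[hdr.Name], hdr.Name)
		case tar.TypeLink:
			// A hardlink entry carries no data; Linkname is the path of
			// the earlier entry that holds the content.
			if _, ok := names[hdr.Linkname]; ok {
				names[hdr.Linkname] = append(names[hdr.Linkname], hdr.Name)
			}
		}
	}
}

// sampleLayer builds a tiny in-memory tar shaped like a bootc layer:
// one object file plus a hardlink to it under usr/bin.
func sampleLayer() (*bytes.Buffer, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	data := []byte("#!/bin/sh\n")
	obj := &tar.Header{Name: objPath, Typeflag: tar.TypeReg, Mode: 0755, Size: int64(len(data))}
	if err := tw.WriteHeader(obj); err != nil {
		return nil, err
	}
	if _, err := tw.Write(data); err != nil {
		return nil, err
	}
	link := &tar.Header{Name: "usr/bin/bash", Typeflag: tar.TypeLink, Linkname: objPath, Mode: 0755}
	if err := tw.WriteHeader(link); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}
```

With the names collected this way, a matcher can compare a new file against `usr/bin/bash` as well as the sha256-named object, rather than only the path of the "real" entry.
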
We need to use HasSuffix, not HasPrefix.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Sometimes you have multiple tar files as sources of delta
information. In particular, this is common when you are diffing OCI
container image layers. For example, when generating a delta for one
layer in a new image you don't necessarily know which layer has the
original files, because layer indexes are not stable, especially with
bootc-style OCI images that get rechunked.

This is mostly trivial code that makes oldTars an array, but there is
some complexity in how you have to handle filenames that conflict in
the old tars. We assume they have been extracted in the order given, so
any file in an earlier tar file that has been overwritten by a file from
a later tar file is marked as overwritten and not used as a delta source.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
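
The overwrite rule from the commit message above can be illustrated with a small sketch. It assumes each old tar is reduced to a list of paths in extraction order; `sourceEntry` and `markOverwritten` are hypothetical names and do not reflect the actual tar-diff types.

```go
package main

// sourceEntry describes one candidate delta source.
// Hypothetical type for illustration only.
type sourceEntry struct {
	tarIndex    int    // which old tar the entry came from
	path        string // path of the entry inside that tar
	overwritten bool   // true if a later entry replaced this path
}

// markOverwritten takes the paths of each old tar, in the order the tars
// are assumed to have been extracted, and flags every entry whose path
// appears again later. Only the newest entry for a path stays usable.
func markOverwritten(oldTars [][]string) []sourceEntry {
	var entries []sourceEntry
	latest := make(map[string]int) // path -> index of newest entry so far
	for i, paths := range oldTars {
		for _, p := range paths {
			if prev, ok := latest[p]; ok {
				entries[prev].overwritten = true
			}
			latest[p] = len(entries)
			entries = append(entries, sourceEntry{tarIndex: i, path: p})
		}
	}
	return entries
}
```

For example, if `usr/bin/bash` appears in both the first and the second old tar, only the entry from the second tar would remain a usable delta source, since that is the content left on disk after extraction.
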
If this option is specified, only files with that prefix are used
as sources for deltas. This can be useful if you only have
a partially extracted version of the tar files on the system
when applying the patch.

This is particularly useful for bootc images, because only the
files in /sysroot/ostree/repo/objects/ are easily available.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
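
A prefix filter of the kind described in the commit message above could be sketched as follows. How the filter interacts with hardlinked names is an assumption here: the sketch picks the first of a file's recorded names that matches the prefix. `usableSourceName` is a hypothetical helper, and the paths are written without a leading slash on the assumption that tar entries are stored that way.

```go
package main

import "strings"

// usableSourceName returns the first of a file's names that starts with
// prefix, and false if none does. An empty prefix disables filtering and
// returns the first name. Hypothetical helper, not the tar-diff API.
func usableSourceName(names []string, prefix string) (string, bool) {
	for _, n := range names {
		if prefix == "" || strings.HasPrefix(n, prefix) {
			return n, true
		}
	}
	return "", false
}
```

With the prefix `sysroot/ostree/repo/objects/`, a file known as both `usr/bin/bash` and an object under that directory would be referenced through its object path, while a file that exists only outside the ostree repo would not be used as a source at all.
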
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the tar-diff utility by introducing advanced capabilities for generating binary deltas between tar archives. The primary motivation is to optimize delta creation for bootc OCI images, which often involve complex hardlinking patterns and layered file systems. By allowing multiple source tar files, improving hardlink recognition, and adding source path filtering, the changes aim to produce smaller and more efficient delta files, streamlining the update process for such images. These improvements are designed to be generic and beneficial for other tar-based delta use cases as well.

Highlights

  • Hardlink Handling: Improved the handling of hardlinks within tar archives, allowing the diffing logic to correctly identify and utilize files that have multiple paths (hardlinks) as potential delta sources. This is particularly useful for scenarios like ostree repositories where objects might be hardlinked to their 'real' file paths.
  • Multiple Old Tar Files Support: Enabled the tar-diff tool and library to accept multiple 'old' tar files as sources for generating a delta. This is crucial for layered systems like container images, where the new image might derive content from several preceding layers, and the tool can now intelligently search across these layers for optimal delta sources.
  • Source File Prefix Filtering: Added a new option (--source-prefix) to filter which files from the old tar archives can be considered as delta sources. This allows users to restrict the source pool to specific directories or patterns, ensuring that the delta only references files that will be available on the target system during patching, such as specific object directories in a bootc image.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly enhances tar-diff by introducing support for multiple 'old' tar files as sources for delta generation, which is particularly useful for layered container images. It also adds a new --source-prefix command-line option to filter which files from the old tar archives can be considered as delta sources. This involved refactoring internal data structures like tarFileInfo and sourceInfo to accommodate multiple paths for hardlinks and track the origin tar file for each entry. The analyzeForDelta and extractDeltaData functions were updated to process these multiple inputs and apply the new prefix filtering logic. The README.md and command-line usage examples have been updated to reflect these new capabilities. A review comment suggests further improving delta efficiency by also excluding .gz files from delta candidates, similar to how .xz and .bz2 files are handled.

Comment on lines +213 to +215
if strings.HasSuffix(basename, ".xz") ||
strings.HasSuffix(basename, ".bz2") {
return false

medium

For consistency and improved delta efficiency, consider also excluding .gz files from delta candidates. Binary diffing compressed data (even if rsyncable) is generally ineffective, as small changes in the uncompressed content can lead to large, unpredictable changes in the compressed output, making the delta larger than necessary or even larger than the original file. Excluding all common compressed formats (.xz, .bz2, .gz) would align with the goal of finding "non-delta-able files (currently just compression)".

Suggested change
- if strings.HasSuffix(basename, ".xz") ||
-     strings.HasSuffix(basename, ".bz2") {
-     return false
+ if strings.HasSuffix(basename, ".xz") ||
+     strings.HasSuffix(basename, ".bz2") ||
+     strings.HasSuffix(basename, ".gz") {

Collaborator Author

There is a comment about this in the code: "NB: We explicitly don't have .gz here in case someone might be using --rsyncable for that". It was taken directly from the original code in ostree (ostree-repo-static-delta-compilation-analysis.c). I'm not sure we still agree with it, but I didn't want to change it in this MR.
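
For readers following the thread, the check under discussion has roughly the following shape. This is a sketch only: `isDeltaCandidate` is a hypothetical name, and the surrounding tar-diff function may look different. It shows the suffix test from the reviewed lines and why matching on a prefix instead would misclassify files.

```go
package main

import (
	"path"
	"strings"
)

// isDeltaCandidate reports whether a file is worth considering for a
// binary delta, based only on its extension. Hypothetical helper name.
func isDeltaCandidate(name string) bool {
	basename := path.Base(name)
	// .gz is deliberately not listed, following the ostree comment quoted
	// above: the archive may have been built with --rsyncable.
	if strings.HasSuffix(basename, ".xz") ||
		strings.HasSuffix(basename, ".bz2") {
		return false
	}
	return true
}
```

With `HasPrefix` in place of `HasSuffix`, a file such as `docs.tar.xz` would wrongly be kept as a candidate and a dotfile such as `.xzrc` would wrongly be excluded, which is what the "use HasSuffix, not HasPrefix" commit addresses.
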

@alexlarsson
Collaborator Author

For some reason it is not allowed to force push to a branch, so I couldn't push a new rebased version of this that fixes the lint issue. So, there is a new version in #65 which has the fixes.
