Skip to content

Various features to support bootc image deltas#66

Open
alexlarsson wants to merge 7 commits intomainfrom
handle-bootc-layers3
Open

Various features to support bootc image deltas#66
alexlarsson wants to merge 7 commits intomainfrom
handle-bootc-layers3

Conversation

@alexlarsson
Copy link
Collaborator

Here are some changes that help when creating deltas for bootc images:

  • Properly handle the hardlinks between ostree repo objects and normal files when finding delta sources
  • Support multiple "old" tar files for finding deltas (we don't know which layers has the right files)
  • Add ability to filter what files are used for delta sources (we only have the objects in the ostree repo available on the target system.

With these, I was able to create pretty small deltas for bootc oci images with my hacked up wip oci delta tool (https://github.com/alexlarsson/oci-delta-tool)

Note: All of these changes are generic and can be useful for other types of tar use as well.

(This is a new, rebased version of #64)

In bootc images, the typical layout for a layer tar is:

```
sysroot/ostree/repo/objects/9f/a74817a833dd0b4cefd91da9072006dde770bff03166a75f8e0f2e6b795c9e.file
usr/bin/bash link to sysroot/ostree/repo/objects/9f/a74817a833dd0b4cefd91da9072006dde770bff03166a75f8e0f2e6b795c9e.file
```

In the tar file this makes the sha256 name a "real" file object, and the actual file a hardlink referencing it.

When diffing such a layer we're only looking at the path/basename of
the "real" file, which means we will never find the right source to
delta against. To fix this we record *all* the names for each file,
and compare against them.

Comparing an OCI layer with this gives a large boost:

-rw-r--r--. 1 alex alex  17M 25 mar 10.58  image1-layer.tar
-rw-r--r--. 1 alex alex  17M 25 mar 10.58  image2-layer.tar
-rw-r--r--. 1 alex alex  17M 25 mar 10.59  old-result.tardiff
-rw-r--r--. 1 alex alex 3,0M 25 mar 11.19  new-result.tardiff

Signed-off-by: Alexander Larsson <alexl@redhat.com>
We need to use HasSuffix, not HasPrefix.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Sometimes you have multiple tar files as source for delta
information. In particular, this is common when you are diffing OCI
container image layers. For example, when generating a delta for one
layer in a new image you don't necessarily know what layer the has the
original files, because layers index are not stable, especially with
bootc style OCI images that get rechunked.

This is mostly trivial code that makes oldTars an array, but there is
some complexity in how you have to handle filenames that conflict in
the old tars. We assume they have been extracted in the order given, so
any files in an earlier tar-file that has been overwritten by a file from
a later tar-file will be marked overwritten and not used as delta source.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
If ths is specified, only files with that prefix are used
as sources for deltas. This can be useful if you only have
a partially extracted version of the tar files on the system
when applying the patch.

This is particularly useful for bootc images, because only the
files in /sysroot/ostree/repo/objects/ are easily available.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
We're getting lint errors like:

cmd/tar-diff/main.go:71:3: exitAfterDefer: log.Fatalf will exit, and `defer file.Close()` will not run (gocritic)

So, lets use a realMain() wrapper that use a return value, and then
do the os.Exit() inside the real main(). This lets us safely use defer.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
These are not critical (the files are just read and will be closed at
process termination anyway), so this is not fatal, but we log
something so people are aware that something is weird.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
This makes no functional difference, but fixes this lint warning:

pkg/tar-diff/analysis.go:353:1: cyclomatic complexity 31 of func `analyzeForDelta` is high (> 30) (gocyclo)

Signed-off-by: Alexander Larsson <alexl@redhat.com>
@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the tar-diff utility by introducing features crucial for optimizing delta generation, particularly for bootc images and other layered tar-based systems. The changes enable more intelligent identification of source files by improving hardlink awareness, support for multiple historical tar archives as delta sources, and the ability to filter source files based on path prefixes. These improvements aim to produce smaller and more relevant deltas, leading to more efficient image updates and reduced storage requirements.

Highlights

  • Enhanced Hardlink Handling: The system now properly recognizes and utilizes hardlinks between ostree repository objects and regular files when determining delta sources, improving efficiency for bootc images.
  • Multi-Source Delta Support: The tar-diff tool can now accept multiple "old" tar files as sources for delta generation, allowing for more flexible and accurate delta creation, especially for layered image formats.
  • Source File Filtering: A new --source-prefix option has been introduced, enabling users to restrict which files from the old tar archives can be considered as delta sources, which is particularly useful for scenarios like bootc images where only specific parts of the old system are available.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly enhances the tar-diff tool and library by introducing support for multiple 'old' tar files as delta sources, which is particularly beneficial for container image layers. The command-line tool's usage has been updated to reflect this, and a new --source-prefix flag allows filtering source files by path. Internal data structures and analysis logic have been refactored to accommodate these changes, including handling multiple paths for hardlinked files and correctly identifying overwritten files across layers. The README.md has been updated with examples for the new features, and new test cases have been added. There is no specific feedback to provide on the review comments, as they were filtered out.

Comment on lines +212 to +214
for _, basename := range file.basenames {
if strings.HasSuffix(basename, ".xz") ||
strings.HasSuffix(basename, ".bz2") {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The change from strings.HasPrefix to strings.HasSuffix for checking file extensions like .xz or .bz2 is a crucial correction. HasPrefix would not correctly identify these extensions. Additionally, iterating over file.basenames ensures that all names associated with a hardlinked file are checked, which is a significant improvement for delta candidate identification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant