Skip to content

Conversation

@snazy
Copy link
Member

@snazy snazy commented Dec 12, 2025

API and implementations to perform long-running operations against object stores, mostly to purge files.

API and implementations to perform long-running operations against object stores, mostly to purge files.
Copy link
Contributor

@adam-christian-software adam-christian-software left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a fantastic PR! Thanks for pushing forward with this. I had some smaller questions, but this looks great.

/**
* The number of purged files.
*
* <p>The returned value may be wrong and include non-existing files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment for failedPurges. Is there a way that users can understand this number better? If it is wrong, then why is it helpful to return?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No way, because that information is not returned from the Iceberg APIs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if the number can be wrong, I'd just remove it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to leave it ;)
The reason is that this object-storage API doesn't require us to use Iceberg's FileIO. We could access the object-store SDKs directly, which would be more efficient and yield correct numbers including deletion-error information per file/object. But that would be some more work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AH OK. Sounds good. I'm good to resolve.

Copy link
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This seems to overlap with the existing TableCleanupTaskHandler. My understanding is that the direction is to invest in a more general task execution framework (e.g., a delegation service outside of Polaris), rather than re-implementing file-level operations in individual features. It would be great to align this work with that direction.

@snazy
Copy link
Member Author

snazy commented Dec 20, 2025

Thanks Adam for the thorough review! Pushed a commit to address the findings.

@snazy snazy changed the title Object storage operations (proposal) Object storage operations Dec 20, 2025
Comment on lines +37 to +50
import org.projectnessie.catalog.formats.iceberg.IcebergSpec;
import org.projectnessie.catalog.formats.iceberg.manifest.IcebergDataContent;
import org.projectnessie.catalog.formats.iceberg.manifest.IcebergDataFile;
import org.projectnessie.catalog.formats.iceberg.manifest.IcebergFileFormat;
import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestContent;
import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestEntry;
import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestEntryStatus;
import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestFile;
import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestFileWriter;
import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestFileWriterSpec;
import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestListWriter;
import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestListWriterSpec;
import org.projectnessie.catalog.formats.iceberg.meta.IcebergPartitionSpec;
import org.projectnessie.catalog.formats.iceberg.meta.IcebergSchema;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think Polaris should depend on the Nessie project for Iceberg metadata. What if Iceberg itself changed, while Nessie doesn't change them or haven't change yet? cc @dennishuo @singhpk234 @adam-christian-software

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see this before; it must have changed since I last looked it over. Personally, I'm not 100% sure what these classes do, however I agree that we should not be relying on Project Nessie for Iceberg-native classes. If these classes are needed, I would rather move them into Polaris rather than keep them in Project Nessie.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These classes are only needed for simpler tests, which are not possible or only possible with a lot of hacks with Iceberg Java code. These classes would only break, if the Iceberg spec changes in a breaking way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it must have changed since I last looked it over.

Nope. Was in there since the very first commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not fully agree with this.

These classes are not just test helpers. They introduce a parallel Iceberg metadata and behavior model, even if used only in tests today. That means they still encode assumptions about Iceberg semantics, which increases long term maintenance risk.

The idea that they would only break with a breaking Iceberg spec change is also optimistic. Iceberg often evolves through additive or clarifying changes that do not break the spec but can still invalidate assumptions in helper classes like these.

Finally, this creates an indirect dependency where Polaris has to wait for Nessie to catch up whenever Iceberg changes. That coupling feels unnecessary and makes it harder for Polaris to stay closely aligned with upstream Iceberg behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants