-
Notifications
You must be signed in to change notification settings - Fork 347
Object storage operations #3256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
API and implementations to perform long-running operations against object stores, mostly to purge files.
adam-christian-software
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a fantastic PR! Thanks for pushing forward with this. I had some smaller questions, but this looks great.
...iles/impl/src/main/java/org/apache/polaris/storage/files/impl/FileOperationsFactoryImpl.java
Outdated
Show resolved
Hide resolved
storage/files/impl/src/main/java/org/apache/polaris/storage/files/impl/FileOperationsImpl.java
Outdated
Show resolved
Hide resolved
storage/files/api/src/main/java/org/apache/polaris/storage/files/api/FileOperations.java
Show resolved
Hide resolved
storage/files/api/src/main/java/org/apache/polaris/storage/files/api/FileSpec.java
Show resolved
Hide resolved
storage/files/api/src/main/java/org/apache/polaris/storage/files/api/PurgeSpec.java
Show resolved
Hide resolved
storage/files/api/src/main/java/org/apache/polaris/storage/files/api/PurgeSpec.java
Show resolved
Hide resolved
| /** | ||
| * The number of purged files. | ||
| * | ||
| * <p>The returned value may be wrong and include non-existing files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment for failedPurges. Is there a way that users can understand this number better? If it is wrong, then why is it helpful to return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No way, because that information is not returned from the Iceberg APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if the number can be wrong, I'd just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to leave it ;)
The reason is that this object-storage API doesn't require us to use Iceberg's FileIO. We could access the object-store SDKs directly, which would be more efficient and yield correct numbers including deletion-error information per file/object. But that would be some more work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AH OK. Sounds good. I'm good to resolve.
flyrain
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! This seems to overlap with the existing TableCleanupTaskHandler. My understanding is that the direction is to invest in a more general task execution framework (e.g., a delegation service outside of Polaris), rather than re-implementing file-level operations in individual features. It would be great to align this work with that direction.
|
Thanks Adam for the thorough review! Pushed a commit to address the findings. |
| import org.projectnessie.catalog.formats.iceberg.IcebergSpec; | ||
| import org.projectnessie.catalog.formats.iceberg.manifest.IcebergDataContent; | ||
| import org.projectnessie.catalog.formats.iceberg.manifest.IcebergDataFile; | ||
| import org.projectnessie.catalog.formats.iceberg.manifest.IcebergFileFormat; | ||
| import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestContent; | ||
| import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestEntry; | ||
| import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestEntryStatus; | ||
| import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestFile; | ||
| import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestFileWriter; | ||
| import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestFileWriterSpec; | ||
| import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestListWriter; | ||
| import org.projectnessie.catalog.formats.iceberg.manifest.IcebergManifestListWriterSpec; | ||
| import org.projectnessie.catalog.formats.iceberg.meta.IcebergPartitionSpec; | ||
| import org.projectnessie.catalog.formats.iceberg.meta.IcebergSchema; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think Polaris should depend on the Nessie project for Iceberg metadata. What if Iceberg itself changed, while Nessie doesn't change them or haven't change yet? cc @dennishuo @singhpk234 @adam-christian-software
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see this before; it must have changed since I last looked it over. Personally, I'm not 100% sure what these classes do, however I agree that we should not be relying on Project Nessie for Iceberg-native classes. If these classes are needed, I would rather move them into Polaris rather than keep them in Project Nessie.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These classes are only needed for simpler tests, which are not possible or only possible with a lot of hacks with Iceberg Java code. These classes would only break, if the Iceberg spec changes in a breaking way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it must have changed since I last looked it over.
Nope. Was in there since the very first commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not fully agree with this.
These classes are not just test helpers. They introduce a parallel Iceberg metadata and behavior model, even if used only in tests today. That means they still encode assumptions about Iceberg semantics, which increases long term maintenance risk.
The idea that they would only break with a breaking Iceberg spec change is also optimistic. Iceberg often evolves through additive or clarifying changes that do not break the spec but can still invalidate assumptions in helper classes like these.
Finally, this creates an indirect dependency where Polaris has to wait for Nessie to catch up whenever Iceberg changes. That coupling feels unnecessary and makes it harder for Polaris to stay closely aligned with upstream Iceberg behavior.
API and implementations to perform long-running operations against object stores, mostly to purge files.