feat!(storage): Allow StorageFactory::build to take table metadata#2247
feat!(storage): Allow StorageFactory::build to take table metadata#2247CTTY wants to merge 4 commits intoapache:mainfrom
Conversation
| "Unable to load file io, neither warehouse nor metadata location is set!", | ||
| )); | ||
| } | ||
|
|
There was a problem hiding this comment.
I don't think we need these checks anymore since now we rely on a configured StorageFactory
crates/iceberg/src/io/storage/mod.rs
Outdated
| /// # Arguments | ||
| /// | ||
| /// * `metadata` - The table metadata to incorporate | ||
| fn with_metadata(&self, metadata: &TableMetadata) -> Result<Arc<dyn StorageFactory>>; |
There was a problem hiding this comment.
I think this should be part of build method?
There was a problem hiding this comment.
My concern about having build_with_metadata is that it means we will need to thread the metadata through FileIO / FileIOBuilder — something like FileIOBuilder::with_metadata(metadata) that stores it, and then FileIO::get_storage calls build_with_metadata when metadata is present, or build when it's not. I think metadata can be heavy to be passed around
I think just having with_metadata is much lighter and custom implementations can only extract what they need upfront rather than iceberg holding metadata info until build
There was a problem hiding this comment.
I'm not convinced, as a pub trait, providing two not compatible entrance method makes both the caller and callee difficult. As a caller, when should I call which method? What's the contract should we follow? As a callee, why I need to implement two incompatible methods?
As with storing table metadata, in fact it's caused by the optimization of building storage lazily. We need to choose to sacrifice at one of them, e.g. pay for storing table metadata or building storage eargly.
There was a problem hiding this comment.
I think removing lazy storage makes sense to me, right now I don't see it being useful. When users are using FileIO, they will want to initialize Storage anyway
There was a problem hiding this comment.
I've updated the API to be
fn build(
&self,
config: &StorageConfig,
metadata: Option<&TableMetadata>,
) -> Result<Arc<dyn Storage>>;at some place you will have to have a FileIO constructed before you have a metadata object, and I think we should at least allow users to pass None in that case. An alternative is to pass in "default" TableMetadata but it seems hacky
Which issue does this PR close?
What changes are included in this PR?
Are these changes tested?
Relying on the existing tests