Add BigQuery Metastore federation support #4050
joyhaldar wants to merge 8 commits into apache:main
Conversation
dimas-b
left a comment
Thanks for your contribution, @joyhaldar ! The PR LGTM 👍 Still, I believe we need to open a dev discussion for it before merging as it includes REST API changes.
```yaml
ICEBERG_REST: "#/components/schemas/IcebergRestConnectionConfigInfo"
HADOOP: "#/components/schemas/HadoopConnectionConfigInfo"
HIVE: "#/components/schemas/HiveConnectionConfigInfo"
BIGQUERY: "#/components/schemas/BigQueryMetastoreConnectionConfigInfo"
```
Please open a dev ML discussion for this (it's customary for all REST API changes).
Thank you for your review Dmitri.
I have an existing dev ML thread for this. Replied with the PR link: https://lists.apache.org/thread/n3hh5s1zxn6yn7cbowfgf8p029z6m7g1
Would you like me to start a new discussion?
I opened https://lists.apache.org/thread/m05xm7szd7znrm9yos1rnld5ljx04y0h specifically for community awareness of REST API changes.
```java
    connectionConfigInfoDpo.asIcebergCatalogProperties(polarisCredentialManager));

BigQueryMetastoreCatalog bigQueryMetastoreCatalog = new BigQueryMetastoreCatalog();
bigQueryMetastoreCatalog.initialize(warehouse, mergedProperties);
```
How / where does it get BigQuery connection credentials?
Thank you for your review Dmitri.
BigQuery Metastore uses Application Default Credentials via Iceberg's BigQueryProperties.
Credentials are resolved automatically from the environment, GOOGLE_APPLICATION_CREDENTIALS, gcloud auth, or attached Service Account.
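As a rough illustration of that resolution order, the ADC lookup can be modeled as below. This is a standalone sketch, not Polaris or google-auth-library code; the method and source names are invented for clarity, while the real lookup happens inside `GoogleCredentials.getApplicationDefault()`.

```java
import java.util.Map;

// Illustrative sketch of the Application Default Credentials (ADC)
// precedence order. All names here are hypothetical.
public class AdcResolutionSketch {
  public static String resolveSource(Map<String, String> env,
                                     boolean gcloudAdcFileExists,
                                     boolean attachedServiceAccountAvailable) {
    // 1. Explicit JSON key file pointed to by the environment variable.
    if (env.containsKey("GOOGLE_APPLICATION_CREDENTIALS")) {
      return "env:GOOGLE_APPLICATION_CREDENTIALS";
    }
    // 2. User credentials written by `gcloud auth application-default login`.
    if (gcloudAdcFileExists) {
      return "gcloud-adc-file";
    }
    // 3. The service account attached to the compute environment (GCE/GKE/...).
    if (attachedServiceAccountAvailable) {
      return "attached-service-account";
    }
    return "none";
  }

  public static void main(String[] args) {
    System.out.println(resolveSource(Map.of(), true, true));
  }
}
```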
So, if storage is also in GCP, the BigQuery connection and storage connections will have to use the same credentials, effectively (in current Polaris code).
It's not a big deal ATM, I think... just trying to understand the full picture :)
Yes, that's my understanding also. Both BigQuery Metastore and GCS storage will use the same ADC credentials, or impersonated SA if configured. Thank you for clarifying the full picture. Sorry for the late reply.
Thanks for the explanation, @joyhaldar. Can we call out this "hidden" behavior in a code comment somewhere?
Done. Added a comment explaining the ADC credential behavior in this commit.
Force-pushed 8f471ad to a3db29b
CI failure is unrelated to this PR; MongoDB testcontainer timeout, if I understand correctly.
Force-pushed a3db29b to c6980e7
Co-authored-by: Joy Haldar <Joy.Haldar@target.com>
Force-pushed c6980e7 to 07b00dc
Co-authored-by: Joy Haldar <joy.haldar@target.com>
jbonofre
left a comment
The PR is solid, but I have a concern about a potential NPE. Can you please clarify?
```java
.add("gcpProjectId", gcpProjectId)
.add("gcpLocation", gcpLocation)
.add("listAllTables", listAllTables)
.add("impersonateServiceAccount", impersonateServiceAccount)
```
Is it ok to include impersonateServiceAccount for toString() ?
It could include sensitive data I guess, so it could potentially land in log messages.
Maybe we should remove the account details or obfuscate them?
Thank you for your review JB.
Done. I have removed all impersonation fields from toString().
```java
// IMPLICIT authentication.
AuthenticationParametersDpo authenticationParametersDpo =
    connectionConfigInfoDpo.getAuthenticationParameters();
if (authenticationParametersDpo.getAuthenticationTypeCode()
```
Should we include a null check here ?
If connectionConfigInfoDpo.getAuthenticationParameters() returns null, then authenticationParametersDpo.getAuthenticationTypeCode() will throw NPE.
Since the constructor marks authenticationParameters as required = false I guess it can happen.
Thank you for your review JB.
I copied this pattern from HiveFederatedCatalogFactory and HadoopFederatedCatalogFactory which also don't have null checks. Should I add the null check here and open a separate PR to fix Hive and Hadoop as well? Or keep it consistent with existing code for now?
I'd say let's make new code robust immediately and fix old code later.
Note: IIRC, Hive federation is not even bundled by default:
polaris/runtime/server/build.gradle.kts
Lines 42 to 43 in b67e1a5
IIUC, this path seems to rely on authenticationParameters being present, while the public BIGQUERY contract still appears to allow it to be omitted. I wonder if that mismatch should be resolved at validation/schema time rather than here in the runtime path.
Done. Added null check here. If authenticationParameters is null, BigQuery uses ADC by default.
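A minimal sketch of that defaulting decision, with hypothetical class, method, and constant names rather than the actual Polaris code, might look like:

```java
// Hedged sketch: treat a missing auth block as IMPLICIT (ADC) instead of
// dereferencing null, and reject auth types the connector does not support.
public class AuthDefaultingSketch {
  public static final String IMPLICIT = "IMPLICIT";

  /** authTypeOrNull stands in for AuthenticationParametersDpo.getAuthenticationTypeCode(). */
  public static String effectiveAuthType(String authTypeOrNull) {
    // No authenticationParameters supplied: fall back to Application Default
    // Credentials, which corresponds to IMPLICIT authentication.
    return authTypeOrNull == null ? IMPLICIT : authTypeOrNull;
  }

  public static void validate(String authTypeOrNull) {
    String effective = effectiveAuthType(authTypeOrNull);
    if (!IMPLICIT.equals(effective)) {
      // The BigQuery connector currently supports only IMPLICIT auth.
      throw new IllegalArgumentException("Unsupported auth type: " + effective);
    }
  }

  public static void main(String[] args) {
    validate(null);        // defaults to IMPLICIT, no exception
    validate("IMPLICIT");  // explicitly IMPLICIT, no exception
  }
}
```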
```java
AuthenticationParametersDpo.fromAuthenticationParametersModelWithSecrets(
    bigqueryConfigModel.getAuthenticationParameters(), secretReferences);
String bigqueryUri =
    bigqueryConfigModel.getUri() != null
```
nit: If URI is null, we default to https://bigquery.googleapis.com. Maybe this default should live in DPO for visibility (or maybe OpenAPI as you started a discussion to update it 😄 ).
Done. I have moved the default URI to a public constant in BigQueryMetastoreConnectionConfigInfoDpo.
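The resulting pattern, a visible public constant with a null fallback, could be sketched like this (class and constant names are illustrative, not the actual Polaris identifiers):

```java
// Hedged sketch of defaulting a nullable URI to a public constant so the
// default is discoverable in one place.
public class DefaultUriSketch {
  // Hypothetical constant; the real one lives in
  // BigQueryMetastoreConnectionConfigInfoDpo.
  public static final String DEFAULT_BIGQUERY_URI = "https://bigquery.googleapis.com";

  public static String resolveUri(String uriOrNull) {
    return uriOrNull != null ? uriOrNull : DEFAULT_BIGQUERY_URI;
  }

  public static void main(String[] args) {
    System.out.println(resolveUri(null));
  }
}
```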
Co-authored-by: Joy Haldar <joy.haldar@target.com>
Co-authored-by: Joy Haldar <joy.haldar@target.com>
```java
}

@Override
public ConnectionConfigInfo asConnectionConfigInfoModel(
```
Do we have a test to validate no data is lost on round-trips through these different config objects?
Sorry about that. I have added testBigQueryMetastoreConnectionConfig in ConnectionConfigInfoDpoTest.
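For context, the shape of such a round-trip check, using hypothetical stand-in record types instead of the real Polaris model/DPO classes, is roughly:

```java
import java.util.Map;

// Hedged sketch of a round-trip property test: converting an API model to the
// stored form and back must preserve every field. Record names are invented.
public class RoundTripSketch {
  public record ApiModel(String uri, String projectId, Map<String, String> properties) {}
  public record StoredDpo(String uri, String projectId, Map<String, String> properties) {}

  public static StoredDpo toDpo(ApiModel m) {
    return new StoredDpo(m.uri(), m.projectId(), m.properties());
  }

  public static ApiModel toModel(StoredDpo d) {
    return new ApiModel(d.uri(), d.projectId(), d.properties());
  }

  public static void main(String[] args) {
    ApiModel original = new ApiModel(
        "https://bigquery.googleapis.com", "my-project", Map.of("gcpLocation", "us"));
    // Round trip: model -> DPO -> model must be lossless.
    ApiModel roundTripped = toModel(toDpo(original));
    System.out.println(original.equals(roundTripped)); // true
  }
}
```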
flyingImer
left a comment
I'm aligned with the direction here overall. I think there are two high-priority issues worth tightening before this moves forward:

1. The current BIGQUERY auth contract does not look aligned yet. IIUC, the API/model path still allows `authenticationParameters` to be omitted, while the runtime only supports `IMPLICIT` auth and later dereferences that field as if it were always present. That feels like more than a local null-safety issue; the boundary contract still looks looser than the implementation.
2. Since this adds a new federation connector across OpenAPI, model conversion, and runtime wiring, I think we should have at least one round-trip/wiring test for `BIGQUERY` before the contract settles.
2 cents: this PR also seems like a useful point to clarify whether federation here is aiming for a common Polaris contract over native Iceberg catalogs, or for backend-specific adapters behind one REST surface. I think that answer affects how strict/explicit the connector contract should be.
```yaml
items:
  type: string
description: Delegation chain for impersonation
required:
```
IIUC, BIGQUERY can still omit authenticationParameters at the API boundary, but the current implementation only supports IMPLICIT auth and later code paths assume the field is present. Would it make sense to tighten the BIGQUERY contract here instead, so the boundary reflects what the runtime actually supports today?
Thank you for the feedback. I have added null checks in both BigQueryMetastoreFederatedCatalogFactory and asConnectionConfigInfoModel. If authenticationParameters is null, BigQuery uses ADC by default. Please let me know if you'd like me to change the approach. Sorry for the late reply.
```java
.setImpersonateLifetimeSeconds(impersonateLifetimeSeconds)
.setImpersonateScopes(impersonateScopes)
.setImpersonateDelegates(impersonateDelegates)
.setAuthenticationParameters(
```
Related contract gap: the readback/model-conversion path also seems to assume authenticationParameters is non-null here. So even if the create/update boundary accepts a BIGQUERY payload without an auth block, we can still fail later when converting the stored config back to the API model.
I have added null check in asConnectionConfigInfoModel.
Co-authored-by: Joy Haldar <joy.haldar@target.com>
Thank you for the detailed review. I have tried to address both points.
Please let me know if you'd like any changes.
```java
 * org.apache.polaris.core.admin.model.BigQueryMetastoreConnectionConfigInfo} defined in the API
 * model.
 */
public class BigQueryMetastoreConnectionConfigInfoDpo extends ConnectionConfigInfoDpo {
```
Just wondering: could this class be @PolarisImmutable? I guess it might save a good chunk of boilerplate code 🤔
The existing ConnectionConfigInfoDpo classes (Hadoop, Hive) don't use @PolarisImmutable. I can try to refactor and use it if you think it's worthwhile. Would appreciate your inputs before I proceed.
No worries. Let's keep existing pattern.
```java
// IMPLICIT authentication.
AuthenticationParametersDpo authenticationParametersDpo =
    connectionConfigInfoDpo.getAuthenticationParameters();
if (authenticationParametersDpo != null
```
What's the use case for permitting federated catalogs without an AuthenticationParametersDpo object?
Same concern here. Should we fail fast when authenticationParametersDpo is null?
BigQuery always uses Application Default Credentials, or the GOOGLE_APPLICATION_CREDENTIALS environment variable, to automatically pick up the user's credentials or a JSON key file for their Service Account. Making authenticationParameters required would just force users to type IMPLICIT for no reason, IMO; but please let me know if you think I am incorrect.
I think we will need an AuthenticationType regardless of how BigQuery picks up credentials. In this case, it should be IMPLICIT (I agreed with line 58 here). However, from a syntax perspective, we may not want users to specify it explicitly. We could default it when the AuthenticationType is missing. That said, I would not recommend handling this defaulting logic here; the defaulting to IMPLICIT should happen elsewhere. WDYT?
I believe declaring the IMPLICIT authentication explicitly 😉 is actually good. It increases clarity about the intended catalog behaviour. It is also more robust in case of future API and backend code changes.
Agreed. The point is: even if we want a default value for better UX, this is not the place to handle the defaulting logic.
spec/polaris-management-service.yml (outdated)
```yaml
gcpLocation:
  type: string
  description: The GCP location (default = us)
listAllTables:
  type: boolean
  description: Whether to list all tables (default = true)
impersonateServiceAccount:
  type: string
  description: Service account email to impersonate
impersonateLifetimeSeconds:
  type: integer
  description: Token lifetime in seconds for impersonation (default = 3600)
impersonateScopes:
  type: array
  items:
    type: string
  description: OAuth scopes for impersonation (default = cloud-platform)
impersonateDelegates:
  type: array
  items:
    type: string
  description: Delegation chain for impersonation
```
How often does the BQMS config surface change? For example, when new fields are introduced or existing fields are updated. Locking every field into the spec creates a maintenance burden, since each upstream change requires a Polaris spec PR, client regeneration, and a new release.
Instead, we could consider using a bag of open properties to keep the model more flexible.
```yaml
properties:
  type: object
  additionalProperties:
    type: string
```
You are right, I will work on addressing this.
Done. Refactored to use a properties bag. Only warehouse and gcpProjectId remain as required fields, everything else goes into the map. Thank you for your help.
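The properties-bag approach can be sketched as follows; the key names and the helper are illustrative, not actual Polaris code:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of a "bag of open properties": only the stable, required
// fields are first-class, everything else passes through an
// additionalProperties map without needing a spec change.
public class PropertiesBagSketch {
  public static Map<String, String> toCatalogProperties(
      String warehouse, String gcpProjectId, Map<String, String> extraProperties) {
    // Optional/unknown keys flow through untouched...
    Map<String, String> props = new HashMap<>(extraProperties);
    // ...while the required, first-class fields are set explicitly.
    // (Key names below are illustrative stand-ins.)
    props.put("warehouse", warehouse);
    props.put("gcp.bigquery.project-id", gcpProjectId);
    return props;
  }

  public static void main(String[] args) {
    System.out.println(toCatalogProperties(
        "gs://bucket/warehouse", "my-project", Map.of("gcpLocation", "us")));
  }
}
```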
flyrain
left a comment
Thanks @joyhaldar for adding this. I think it would be really useful. The PR is in the right direction. Left some comments.
```java
private static final String IMPERSONATE_LIFETIME_SECONDS =
    "gcp.bigquery.impersonate.lifetime-seconds";
private static final String IMPERSONATE_SCOPES = "gcp.bigquery.impersonate.scopes";
private static final String IMPERSONATE_DELEGATES = "gcp.bigquery.impersonate.delegates";
```
For these property keys, should we just use the Iceberg SDK's properties?
The property keys in Iceberg's BigQueryProperties are package-private, so I had to duplicate them here. I can add a comment referencing the Iceberg source. Let me know what you think.
```java
private final String impersonateServiceAccount;
private final Integer impersonateLifetimeSeconds;
private final List<String> impersonateScopes;
private final List<String> impersonateDelegates;
```
These are authentication-related properties; wondering if we need to create a new authentication type for GCP?
Thank you for the review. These fields are now part of the properties bag.
XJDKC
left a comment
Side question: if we can federate to BigLake, we should be able to federate to BigQuery through BigLake, right?
jbonofre
left a comment
I have two major questions/"concerns":

- As we introduce new dependencies (for BigQuery and its transitive dependencies), the LICENSE/NOTICE should be updated, depending on the second point.
- The question is: do we ship BigQuery Metastore federation by default in the server? If yes, the LICENSE/NOTICE of the server should be updated. If no, no need to update it.
```java
.add("gcpProjectId", gcpProjectId)
.add("gcpLocation", gcpLocation)
.add("listAllTables", listAllTables)
.add("authenticationParameters", getAuthenticationParameters())
```
I know I had a comment in this toString() method, and you addressed it. Thanks a lot.
Sorry to be a pain, but I have a follow up question 😄
Here we directly call getAuthenticationParameters(). authenticationParameters is @Nullable.
It's similar to HiveConnectionConfigInfoDpo which has the same nullable pattern.
If other DPOs call toString() on the auth params, this could throw NPE.
The existing Hadoop DPO has authenticationParameters as @Nonnull and calls toString() explicitly.
Maybe we should do something similar here to prevent NPE ?
We can do something simple:

```java
// before:
.add("authenticationParameters", getAuthenticationParameters())
// after:
.add("authenticationParameters",
    getAuthenticationParameters() != null ? getAuthenticationParameters().toString() : "null")
```
Done. Added the null check as suggested. Thank you again JB!
Co-authored-by: Joy Haldar <joy.haldar@target.com>
flyrain
left a comment
Thanks for continuing to work on it, @joyhaldar! I think we are getting closer. Left some comments. Could you fix the CI failures as well?
A few follow-up items (not blockers):
- Document this new feature
- CLI changes to support BQMS.
Co-authored-by: Joy Haldar <joy.haldar@target.com>
Made it optional for now to unblock CI. Will follow up on LICENSE/NOTICE updates separately if we decide to ship it by default.
…logFactory Co-authored-by: Joy Haldar <joy.haldar@target.com>"
```java
    connectionConfigInfoDpo.asIcebergCatalogProperties(polarisCredentialManager));

// Credentials are resolved via Google Application Default Credentials (ADC).
// GCS storage operations use the same ADC credentials.
```
The second statement is not 100% accurate; in principle a user can override GCS storage credentials.
I'd propose adding "... by default".
flyingImer
left a comment
This is directionally better now, but I still think there are two blocker-level contract issues here.
1. BIGQUERY can still accept auth modes that the connector itself does not actually support. The connector is still IMPLICIT-only, but the create/update path is validating auth too generically, so an invalid BIGQUERY catalog can still be accepted and persisted, then fail later at catalog instantiation time.
2. The `authenticationParameters` contract is still inconsistent. The schema/DPO/factory path still treats it as effectively optional, but the create/admin validation path dereferences it as if it were required. So right now a missing auth block can still turn into a null-deref path instead of a deterministic validation/defaulting decision.
My preference would be to make this narrower and explicit before merge:
- validate BIGQUERY auth per connection type at create/update time, not only later in the connector, and
- pick one contract for missing `authenticationParameters` (required + `IMPLICIT`, or missing means `IMPLICIT`) and make the spec/validation/DPO path all match.
```yaml
properties:
  type: object
  additionalProperties:
    type: string
  description: Additional catalog properties
```
Thanks for adding this. Thinking more broadly, this `properties` field would be useful for other federated catalog types like HMS. I'd propose moving it up to its parent ConnectionConfigInfo, so that all subtypes can benefit from it. WDYT? cc @XJDKC @HonahX @singhpk234 @dimas-b
Yes, the more I think about it the more I like the idea of supporting general properties inside ConnectionConfigInfo. That REST API change should be able to support both BigQuery Catalog federation and #3729 (CC: @PhillHenry).
Here is another use case that justifies it. A client for the remote catalog may support more properties than those defined in the spec. Without a freeform properties field, we would need to keep updating the spec to accommodate properties already supported by existing clients (for example HMS or BQMS), which is both unnecessary and heavy. With properties, users can simply update their federated catalog in the metastore instead.
Adds federation support for BigQuery Metastore catalogs, enabling Polaris to serve as a unified REST catalog interface for Iceberg tables managed in BigQuery Metastore.
Changes
- `BigQueryMetastoreFederatedCatalogFactory`: factory for creating BigQuery Metastore catalog handles
- `BigQueryMetastoreConnectionConfigInfoDpo`: connection configuration with all `BigQueryProperties`
- `BIGQUERY` connection type in `ConnectionType`
- `BigQueryMetastoreConnectionConfigInfo`

Checklist
- `CHANGELOG.md` (if needed)
- `site/content/in-dev/unreleased` (if needed)