[#11486] refactor(catalog): Add shared Hadoop Kerberos auth module#11765
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.
Adds a new shared :catalogs:hadoop-auth module to centralize Kerberos/Hadoop auth utilities and refactors existing Kerberos clients (Iceberg, Hive, Hadoop FS, Paimon) to reuse it.
Changes:
- Introduced
KerberosAuthUtils(principal validation, keytab fetching, krb5.conf wiring, login modes, ticket refresh scheduler) in new:catalogs:hadoop-authmodule. - Refactored multiple
KerberosClientimplementations to delegate toKerberosAuthUtilsand removed duplicated logic. - Wired the new module into Gradle settings and dependent subprojects, and added initial unit tests for the shared utilities.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| settings.gradle.kts | Includes new :catalogs:hadoop-auth module in the build. |
| build.gradle.kts | Adds :catalogs:hadoop-auth to shared subproject config and excludes it from a task filter list. |
| catalogs/hadoop-auth/build.gradle.kts | Defines the new hadoop-auth module dependencies and test setup. |
| catalogs/hadoop-auth/src/main/java/.../KerberosAuthUtils.java | Adds shared Kerberos/Hadoop authentication utilities for reuse by other modules. |
| catalogs/hadoop-auth/src/test/java/.../TestKerberosAuthUtils.java | Adds initial unit tests for principal parsing and keytab URI handling. |
| iceberg/iceberg-common/build.gradle.kts | Adds dependency on :catalogs:hadoop-auth. |
| iceberg/iceberg-common/src/main/java/.../KerberosClient.java | Refactors Iceberg Kerberos client to use shared helpers for login/refresh/keytab fetch. |
| catalogs/hive-metastore-common/build.gradle.kts | Adds dependency on :catalogs:hadoop-auth. |
| catalogs/hive-metastore-common/src/main/java/.../KerberosClient.java | Refactors Hive metastore Kerberos client to use shared helpers and removes local thread factory/fetcher logic. |
| catalogs/hadoop-common/build.gradle.kts | Adds dependency on :catalogs:hadoop-auth. |
| catalogs/hadoop-common/src/main/java/.../KerberosClient.java | Refactors Hadoop FS Kerberos client to use shared helpers (including krb5.conf setup). |
| catalogs/catalog-lakehouse-paimon/build.gradle.kts | Adds dependency on :catalogs:hadoop-auth. |
| catalogs/catalog-lakehouse-paimon/src/main/java/.../KerberosClient.java | Refactors Paimon Kerberos client to use shared helpers for login/refresh/keytab fetch. |
…nt into shared hadoop-auth client Collapse the near-identical iceberg/paimon/hadoop-common KerberosClient classes into one shared org.apache.gravitino.catalog.hadoop.auth.KerberosClient (builder carries the per-module differences: login mode, refresh, thread name, krb5.conf wiring). Callers (ClosableHiveCatalog, paimon CatalogUtils, HDFSFileSystemProxy) read their own KerberosConfig and build the shared client; the duplicated keytab-save logic moves to KerberosAuthUtils.saveKeytabFromUri. The Hive KerberosClient stays separate (it carries Hive-specific proxy-user and HMS delegation-token logic) but already reuses KerberosAuthUtils primitives. Also bundle :catalogs:hadoop-auth into catalog-fileset's runtime libs: it depends on hadoop-common with exclude(group="*"), so the hadoop-auth classes were not on the catalog runtime classpath, causing NoClassDefFoundError on the Kerberos path (surfaced by the Kerberos/impersonation ITs).
Code Coverage Report
Files
|
|
Can we do this in hadoop-common module? |
I think keeping this as a separate hadoop-auth module is a cleaner boundary. hadoop-common is currently the filesystem/provider common module and is used by catalogs, connectors, and bundle paths. It also declares Hadoop dependencies as implementation, not compileOnly. If we put the shared Kerberos helpers there, non-filesystem modules such as hive-metastore-common and Paimon would need to depend on hadoop-common just to reuse Kerberos login/keytab utilities, and they would also inherit hadoop-common's broader runtime dependency surface. The code here is Hadoop/Kerberos auth related, but not HDFS filesystem specific. It is shared by Hive, Iceberg, Paimon, and hadoop-common, so a small hadoop-auth module keeps the dependency boundary narrower. One concern is that hadoop-auth currently uses compileOnly Hadoop dependencies, so each consumer must provide Hadoop runtime dependencies explicitly. I will double-check those consumer dependencies and make that contract clear, but I would prefer not to merge this into hadoop-common just to solve that. |
| List<String> principalComponents = Splitter.on('@').splitToList(catalogPrincipal); | ||
| Preconditions.checkArgument( | ||
| principalComponents.size() == 2, "The principal has the wrong format"); | ||
| KerberosAuthUtils.checkPrincipalAndGetRealm(catalogPrincipal); |
There was a problem hiding this comment.
KerberosAuthUtils.login call the function checkPrincipalAndGetRealm too
There was a problem hiding this comment.
Fixed in 699e38c. The Hive HMS login() path now relies on KerberosAuthUtils.login() for principal validation, and loginProxyUser() reuses checkPrincipalAndGetRealm() for realm parsing.
| File keytabFile = kerberosClient.saveKeyTabFileFromUri(Long.parseLong(catalogUUID)); | ||
| File keytabFile = | ||
| new File( | ||
| String.format(KerberosConfig.GRAVITINO_KEYTAB_FORMAT, Long.parseLong(catalogUUID))); |
There was a problem hiding this comment.
Long.parseLong(catalogUUID) is redundant.
There was a problem hiding this comment.
Fixed in 699e38c. The Iceberg keytab path now formats catalogUUID directly without Long.parseLong(catalogUUID).
| checkPrincipalAndGetRealm(principal); | ||
| UserGroupInformation.setConfiguration(hadoopConf); | ||
|
|
||
| if (loginMode == LoginMode.RETURN_UGI) { |
There was a problem hiding this comment.
Use a switch statement to handle the enum and avoid invalid fallbacks.
There was a problem hiding this comment.
Fixed in 699e38c. KerberosAuthUtils.login() now uses an explicit switch over LoginMode, including a null check before the switch.
There was a problem hiding this comment.
Why we keep this KerberosClient
There was a problem hiding this comment.
Kept this class as the HMS-specific adapter. The shared catalogs:hadoop-auth client now handles common keytab login and TGT refresh, while this Hive Metastore client still owns proxy-user creation, HMS delegation-token retrieval, HiveClient wiring, and local keytab cleanup. I added class Javadoc in 699e38c to make that boundary explicit.
There was a problem hiding this comment.
Rename to HmsKerberosClient is better
|
|
||
| ScheduledThreadPoolExecutor executor = | ||
| new ScheduledThreadPoolExecutor(1, daemonThreadFactory(threadNamePrefix)); | ||
| executor.scheduleAtFixedRate( |
There was a problem hiding this comment.
We should use a global thread pool to handle tick refreshes.
There was a problem hiding this comment.
Fixed in 699e38c. Ticket refresh now uses a shared daemon scheduler in KerberosAuthUtils; callers keep only their own ScheduledFuture and cancel that task on close or repeated login.
| return UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytabFilePath); | ||
| if (loginMode == null) { | ||
| throw new IllegalArgumentException("loginMode can't be null"); | ||
| } |
There was a problem hiding this comment.
Using Preconditions.checkArgument is better
| private static void refreshTicket( | ||
| UserGroupInformation loginUgi, String refreshTaskName, Logger log) { | ||
| Thread currentThread = Thread.currentThread(); | ||
| String originalThreadName = currentThread.getName(); |
There was a problem hiding this comment.
refreshTicket runs quickly, so switching the thread name is not practical.
There was a problem hiding this comment.
Do you mean that there is no need to switch thread name?
There was a problem hiding this comment.
Rename to HmsKerberosClient is better
…drop thread rename, rename HmsKerberosClient)
jerryshao
left a comment
There was a problem hiding this comment.
This PR cleanly centralizes four near-identical Kerberos auth implementations into a shared hadoop-auth module. The KerberosAuthUtils / KerberosClient split is a good abstraction and the builder pattern is easy to extend. Five issues below, ranked by severity.
|
|
||
| private static final String TICKET_REFRESH_THREAD_NAME_PREFIX = "kerberos-ticket-refresh-"; | ||
| private static final ScheduledThreadPoolExecutor TICKET_REFRESH_EXECUTOR = | ||
| newTicketRefreshExecutor(); |
There was a problem hiding this comment.
Regression: static single-threaded executor shared by all catalog instances
TICKET_REFRESH_EXECUTOR has corePoolSize=1 and is shared by every Kerberos-enabled catalog in the process. Before this PR each catalog had an isolated executor, so a slow or unreachable KDC only blocked that one catalog's refresh. Now, if any catalog's checkTGTAndReloginFromKeytab() blocks (e.g. a 30–90 s TCP timeout to the KDC), every other catalog's TGT renewal queues behind it on the same thread — potentially causing cascading auth failures across all catalogs simultaneously.
Consider either sizing the pool relative to the expected number of concurrent Kerberos catalogs, or keeping per-instance executors and using the shared class only for the refresh lambda.
There was a problem hiding this comment.
Modified. startTicketRefresh now creates a per-caller single-thread daemon scheduler and returns it.
| return keytabUri.trim().startsWith("hdfs"); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
TICKET_REFRESH_EXECUTOR is never shut down
setRemoveOnCancelPolicy(true) prevents queue memory growth when tasks are cancelled, but the backing daemon thread itself persists after all catalogs are closed. There is no shutdown hook or lifecycle method. In test JVMs, Gradle workers, or OSGi containers, the thread keeps static references alive and emits log errors for catalogs that no longer exist.
Consider Runtime.getRuntime().addShutdownHook(new Thread(TICKET_REFRESH_EXECUTOR::shutdownNow)), or a package-private shutdown() for test cleanup.
There was a problem hiding this comment.
Modified. Each client owns its executor and calls shutdown() in close()
| } catch (IOException e) { | ||
| LOG.warn("Failed to delete keytab file: {}", keytabFilePath, e); | ||
| } | ||
| } |
There was a problem hiding this comment.
cancel(true) interrupts a running TGT renewal mid-flight
The old close() called checkTgtExecutor.shutdown(), which lets an in-progress task run to completion. The new cancelTicketRefreshTask() calls checkTgtRefreshTask.cancel(true), which immediately sets the thread interrupt flag. If checkTGTAndReloginFromKeytab() is interrupted after the old TGT is marked for replacement but before the new one is committed to the UGI Subject, the process-wide loginUser could briefly be left without valid credentials. HMS uses LoginMode.LOGIN_USER (process-wide UGI), so the impact is not limited to a single catalog.
Consider cancel(false) to let an in-flight relogin complete gracefully before teardown.
There was a problem hiding this comment.
Modified. replaced cancel(true) with executor.shutdown()
| public String getRealm() { | ||
| return realm; | ||
| } | ||
|
|
There was a problem hiding this comment.
getRealm() and getLoginUser() can return null before login() is called; missing @Nullable
Both realm and loginUser are null until login() completes. ClosableHiveCatalog.doKerberosOperations() calls kerberosClient.getRealm() and feeds the result into a String.format to construct a proxy principal — producing "user@null" if there is any ordering issue (e.g. login failed silently or was skipped).
CLAUDE.md requires @Nullable on return types that can be null. Add @Nullable to both getRealm() and getLoginUser(), or add Preconditions.checkState(realm != null, "login() has not been called") guards.
There was a problem hiding this comment.
Done, added Preconditions.checkState(... != null, "login() has not been called")
| int fetchKeytabFileTimeout = kerberosConfig.getFetchTimeoutSec(); | ||
| FileFetcher.get() | ||
| .fetchFileFromUri(keyTabUri, keytabFile, fetchKeytabFileTimeout * 1000, hadoopConf); | ||
| KerberosAuthUtils.fetchKeytabFromUri( |
There was a problem hiding this comment.
Redundant keytabsDir.mkdir() — fetchKeytabFromUri already calls parentFile.mkdirs()
KerberosAuthUtils.fetchKeytabFromUri calls parentFile.mkdirs() before invoking FileFetcher, so the three lines creating keytabsDir here are dead code. The manual mkdir() also uses the single-level form, which would silently fail if any parent of keytabs/ was missing — the exact case that fetchKeytabFromUri's mkdirs() already handles correctly.
These three lines can be removed.
There was a problem hiding this comment.
Done. removed the 3 dead lines; fetchKeytabFromUri already does parentFile.mkdirs().
…d add null-state guards Address review comments: - Revert to per-instance refresh executor so a slow/unreachable KDC only blocks that catalog's refresh; shut it down gracefully on close. - Replace cancel(true) with executor.shutdown() to avoid interrupting an in-flight relogin (process-wide UGI for HMS LOGIN_USER mode). - Add checkState guards to KerberosClient.getRealm()/getLoginUser(). - Remove redundant keytabs dir creation already handled by fetchKeytabFromUri.
| ScheduledExecutorService executor = | ||
| Executors.newSingleThreadScheduledExecutor( | ||
| daemonThreadFactory(TICKET_REFRESH_THREAD_NAME_PREFIX)); | ||
| executor.scheduleAtFixedRate( |
There was a problem hiding this comment.
If you prefer to keep one Kerberos client per executor, I’d suggest moving the executor creation and shutdown logic into HmsKerberosClient. In general, it’s better to manage resource creation and cleanup in a single place.
There was a problem hiding this comment.
I'd like to keep the current code, because:
- Each catalog already has its own HmsKerberosClient, and the executor is a per-instance field. It's created in login() and shut down in close(), both inside the client. Nothing is shared across catalogs, so there's no leak.
- Only the construction (newSingleThreadScheduledExecutor + scheduleAtFixedRate) is in KerberosAuthUtils.startTicketRefresh, because it's shared by both KerberosClient and HmsKerberosClient. Moving it into HmsKerberosClient would duplicate this code in both clients.
So I prefer not to inline it. Let me know if you see a real lifecycle problem.
What changes were proposed in this pull request?
This PR adds a new
catalogs:hadoop-authmodule to centralize shared Hadoop Kerberos authentication logic, including keytab fetching, krb5.conf setup, principal validation, login, and ticket refresh.It updates the Kerberos clients in hadoop-common, Hive metastore common, Paimon, and Iceberg common to delegate common logic to the new module while preserving their existing catalog-specific behavior.
Why are the changes needed?
Multiple catalogs had duplicated Kerberos authentication logic. Centralizing the shared Hadoop-dependent logic reduces duplication and makes future fixes easier to apply consistently across catalogs.
Fix: #11486
Does this PR introduce any user-facing change?
No.
How was this patch tested?
./gradlew :catalogs:hadoop-auth:test :catalogs:hadoop-common:compileJava :iceberg:iceberg-common:compileJava :catalogs:catalog-lakehouse-paimon:compileJava :catalogs:hive-metastore-common:compileJava -PskipITs./gradlew compileDistribution -x testHDFSKerberosITgit diff --check