Skip to content

[#11486] refactor(catalog): Add shared Hadoop Kerberos auth module#11765

Merged
jerryshao merged 8 commits into
apache:mainfrom
yuqi1129:fix/hadoop-auth-kerberos
Jul 1, 2026
Merged

[#11486] refactor(catalog): Add shared Hadoop Kerberos auth module#11765
jerryshao merged 8 commits into
apache:mainfrom
yuqi1129:fix/hadoop-auth-kerberos

Conversation

@yuqi1129

@yuqi1129 yuqi1129 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR adds a new catalogs:hadoop-auth module to centralize shared Hadoop Kerberos authentication logic, including keytab fetching, krb5.conf setup, principal validation, login, and ticket refresh.

It updates the Kerberos clients in hadoop-common, Hive metastore common, Paimon, and Iceberg common to delegate common logic to the new module while preserving their existing catalog-specific behavior.

Why are the changes needed?

Multiple catalogs had duplicated Kerberos authentication logic. Centralizing the shared Hadoop-dependent logic reduces duplication and makes future fixes easier to apply consistently across catalogs.

Fix: #11486

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • ./gradlew :catalogs:hadoop-auth:test :catalogs:hadoop-common:compileJava :iceberg:iceberg-common:compileJava :catalogs:catalog-lakehouse-paimon:compileJava :catalogs:hive-metastore-common:compileJava -PskipITs
  • ./gradlew compileDistribution -x test
  • Kerberos integration tests:
    • Hive HMS Kerberos tests
    • Fileset HDFSKerberosIT
    • Hudi Kerberos Hive embedded/deploy
    • Paimon Kerberos filesystem embedded/deploy
    • Iceberg catalog Kerberos Hive embedded/deploy
    • Iceberg REST Kerberos Hive embedded tests, including user impersonation
  • git diff --check

Copilot AI review requested due to automatic review settings June 22, 2026 11:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Adds a new shared :catalogs:hadoop-auth module to centralize Kerberos/Hadoop auth utilities and refactors existing Kerberos clients (Iceberg, Hive, Hadoop FS, Paimon) to reuse it.

Changes:

  • Introduced KerberosAuthUtils (principal validation, keytab fetching, krb5.conf wiring, login modes, ticket refresh scheduler) in new :catalogs:hadoop-auth module.
  • Refactored multiple KerberosClient implementations to delegate to KerberosAuthUtils and removed duplicated logic.
  • Wired the new module into Gradle settings and dependent subprojects, and added initial unit tests for the shared utilities.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
settings.gradle.kts Includes new :catalogs:hadoop-auth module in the build.
build.gradle.kts Adds :catalogs:hadoop-auth to shared subproject config and excludes it from a task filter list.
catalogs/hadoop-auth/build.gradle.kts Defines the new hadoop-auth module dependencies and test setup.
catalogs/hadoop-auth/src/main/java/.../KerberosAuthUtils.java Adds shared Kerberos/Hadoop authentication utilities for reuse by other modules.
catalogs/hadoop-auth/src/test/java/.../TestKerberosAuthUtils.java Adds initial unit tests for principal parsing and keytab URI handling.
iceberg/iceberg-common/build.gradle.kts Adds dependency on :catalogs:hadoop-auth.
iceberg/iceberg-common/src/main/java/.../KerberosClient.java Refactors Iceberg Kerberos client to use shared helpers for login/refresh/keytab fetch.
catalogs/hive-metastore-common/build.gradle.kts Adds dependency on :catalogs:hadoop-auth.
catalogs/hive-metastore-common/src/main/java/.../KerberosClient.java Refactors Hive metastore Kerberos client to use shared helpers and removes local thread factory/fetcher logic.
catalogs/hadoop-common/build.gradle.kts Adds dependency on :catalogs:hadoop-auth.
catalogs/hadoop-common/src/main/java/.../KerberosClient.java Refactors Hadoop FS Kerberos client to use shared helpers (including krb5.conf setup).
catalogs/catalog-lakehouse-paimon/build.gradle.kts Adds dependency on :catalogs:hadoop-auth.
catalogs/catalog-lakehouse-paimon/src/main/java/.../KerberosClient.java Refactors Paimon Kerberos client to use shared helpers for login/refresh/keytab fetch.

…nt into shared hadoop-auth client

Collapse the near-identical iceberg/paimon/hadoop-common KerberosClient classes
into one shared org.apache.gravitino.catalog.hadoop.auth.KerberosClient (builder
carries the per-module differences: login mode, refresh, thread name, krb5.conf
wiring). Callers (ClosableHiveCatalog, paimon CatalogUtils, HDFSFileSystemProxy)
read their own KerberosConfig and build the shared client; the duplicated
keytab-save logic moves to KerberosAuthUtils.saveKeytabFromUri.

The Hive KerberosClient stays separate (it carries Hive-specific proxy-user and
HMS delegation-token logic) but already reuses KerberosAuthUtils primitives.

Also bundle :catalogs:hadoop-auth into catalog-fileset's runtime libs: it depends
on hadoop-common with exclude(group="*"), so the hadoop-auth classes were not on
the catalog runtime classpath, causing NoClassDefFoundError on the Kerberos path
(surfaced by the Kerberos/impersonation ITs).
@github-actions

github-actions Bot commented Jun 22, 2026

Copy link
Copy Markdown

Code Coverage Report

Overall Project 67.43% -0.05% 🟢
Files changed 62.42% 🟢

Module Coverage
aliyun 1.72% 🔴
api 46.82% 🟢
authorization-common 85.96% 🟢
aws 26.5% 🔴
azure 2.47% 🔴
catalog-common 10.4% 🔴
catalog-fileset 80.23% 🟢
catalog-glue 66.91% 🟢
catalog-hive 79.42% 🟢
catalog-jdbc-clickhouse 80.2% 🟢
catalog-jdbc-common 44.22% 🟢
catalog-jdbc-doris 80.28% 🟢
catalog-jdbc-hologres 54.03% 🟢
catalog-jdbc-mysql 79.23% 🟢
catalog-jdbc-oceanbase 80.91% 🟢
catalog-jdbc-postgresql 82.29% 🟢
catalog-jdbc-starrocks 78.51% 🟢
catalog-kafka 77.01% 🟢
catalog-lakehouse-generic 58.53% 🟢
catalog-lakehouse-hudi 79.1% 🟢
catalog-lakehouse-iceberg 85.86% 🟢
catalog-lakehouse-paimon 84.25% -2.98% 🟢
catalog-model 77.72% 🟢
cli 44.51% 🟢
client-java 78.01% 🟢
common 50.17% 🟢
core 82.59% 🟢
filesystem-hadoop3 77.3% 🟢
flink 0.0% 🔴
flink-common 47.12% 🟢
flink-runtime 0.0% 🔴
gcp 14.12% 🔴
hadoop-auth 66.67% 🟢
hadoop-common 12.7% -5.38% 🔴
hive-metastore-common 53.29% +3.35% 🟢
iceberg-common 60.41% -5.4% 🟢
iceberg-rest-server 73.98% +1.04% 🟢
idp-basic 85.71% 🟢
integration-test-common 0.0% 🔴
jobs 66.17% 🟢
lance-common 20.81% 🔴
lance-rest-server 64.84% 🟢
lineage 53.02% 🟢
optimizer 83.24% 🟢
optimizer-api 21.95% 🔴
server 85.96% 🟢
server-common 74.62% 🟢
spark 28.57% 🔴
spark-common 45.58% 🟢
trino-connector 40.29% 🟢
Files
Module File Coverage
catalog-lakehouse-paimon CatalogUtils.java 43.24% 🔴
hadoop-auth KerberosClient.java 67.24% 🟢
KerberosAuthUtils.java 66.18% 🟢
hadoop-common HDFSFileSystemProxy.java 0.0% 🔴
hive-metastore-common HiveClientFactory.java 85.09% 🟢
HmsKerberosClient.java 80.7% 🟢
iceberg-common MemoryCatalogWithMetadataLocationSupport.java 83.33% 🟢
JdbcCatalogWithMetadataLocationSupport.java 65.22% 🟢
HiveCatalogWithMetadataLocationSupport.java 42.86% 🔴
ClosableHiveCatalog.java 22.76% 🔴
iceberg-rest-server IcebergNamespaceHookDispatcher.java 91.94% 🟢
FederatedCatalogWrapper.java 86.96% 🟢

@yuqi1129 yuqi1129 requested a review from Copilot June 23, 2026 08:22

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.

@jerryshao

Copy link
Copy Markdown
Contributor

Can we do this in hadoop-common module?

@yuqi1129

Copy link
Copy Markdown
Contributor Author

Can we do this in hadoop-common module?

I think keeping this as a separate hadoop-auth module is a cleaner boundary.

hadoop-common is currently the filesystem/provider common module and is used by catalogs, connectors, and bundle paths. It also declares Hadoop dependencies as implementation, not compileOnly. If we put the shared Kerberos helpers there, non-filesystem modules such as hive-metastore-common and Paimon would need to depend on hadoop-common just to reuse Kerberos login/keytab utilities, and they would also inherit hadoop-common's broader runtime dependency surface.

The code here is Hadoop/Kerberos auth related, but not HDFS filesystem specific. It is shared by Hive, Iceberg, Paimon, and hadoop-common, so a small hadoop-auth module keeps the dependency boundary narrower.

One concern is that hadoop-auth currently uses compileOnly Hadoop dependencies, so each consumer must provide Hadoop runtime dependencies explicitly. I will double-check those consumer dependencies and make that contract clear, but I would prefer not to merge this into hadoop-common just to solve that.

List<String> principalComponents = Splitter.on('@').splitToList(catalogPrincipal);
Preconditions.checkArgument(
principalComponents.size() == 2, "The principal has the wrong format");
KerberosAuthUtils.checkPrincipalAndGetRealm(catalogPrincipal);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KerberosAuthUtils.login call the function checkPrincipalAndGetRealm too

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 699e38c. The Hive HMS login() path now relies on KerberosAuthUtils.login() for principal validation, and loginProxyUser() reuses checkPrincipalAndGetRealm() for realm parsing.

File keytabFile = kerberosClient.saveKeyTabFileFromUri(Long.parseLong(catalogUUID));
File keytabFile =
new File(
String.format(KerberosConfig.GRAVITINO_KEYTAB_FORMAT, Long.parseLong(catalogUUID)));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long.parseLong(catalogUUID) is redundant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 699e38c. The Iceberg keytab path now formats catalogUUID directly without Long.parseLong(catalogUUID).

checkPrincipalAndGetRealm(principal);
UserGroupInformation.setConfiguration(hadoopConf);

if (loginMode == LoginMode.RETURN_UGI) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a switch statement to handle the enum and avoid invalid fallbacks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 699e38c. KerberosAuthUtils.login() now uses an explicit switch over LoginMode, including a null check before the switch.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we keep this KerberosClient

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept this class as the HMS-specific adapter. The shared catalogs:hadoop-auth client now handles common keytab login and TGT refresh, while this Hive Metastore client still owns proxy-user creation, HMS delegation-token retrieval, HiveClient wiring, and local keytab cleanup. I added class Javadoc in 699e38c to make that boundary explicit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to HmsKerberosClient is better


ScheduledThreadPoolExecutor executor =
new ScheduledThreadPoolExecutor(1, daemonThreadFactory(threadNamePrefix));
executor.scheduleAtFixedRate(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use a global thread pool to handle tick refreshes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 699e38c. Ticket refresh now uses a shared daemon scheduler in KerberosAuthUtils; callers keep only their own ScheduledFuture and cancel that task on close or repeated login.

@yuqi1129 yuqi1129 requested a review from diqiu50 June 26, 2026 06:04
return UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytabFilePath);
if (loginMode == null) {
throw new IllegalArgumentException("loginMode can't be null");
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Preconditions.checkArgument is better

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

private static void refreshTicket(
UserGroupInformation loginUgi, String refreshTaskName, Logger log) {
Thread currentThread = Thread.currentThread();
String originalThreadName = currentThread.getName();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refreshTicket runs quickly, so switching the thread name is not practical.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that there is no need to switch thread name?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to HmsKerberosClient is better

…drop thread rename, rename HmsKerberosClient)
@yuqi1129 yuqi1129 requested a review from diqiu50 June 26, 2026 09:04
diqiu50
diqiu50 previously approved these changes Jun 26, 2026

@diqiu50 diqiu50 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jerryshao jerryshao left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR cleanly centralizes four near-identical Kerberos auth implementations into a shared hadoop-auth module. The KerberosAuthUtils / KerberosClient split is a good abstraction and the builder pattern is easy to extend. Five issues below, ranked by severity.


private static final String TICKET_REFRESH_THREAD_NAME_PREFIX = "kerberos-ticket-refresh-";
private static final ScheduledThreadPoolExecutor TICKET_REFRESH_EXECUTOR =
newTicketRefreshExecutor();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regression: static single-threaded executor shared by all catalog instances

TICKET_REFRESH_EXECUTOR has corePoolSize=1 and is shared by every Kerberos-enabled catalog in the process. Before this PR each catalog had an isolated executor, so a slow or unreachable KDC only blocked that one catalog's refresh. Now, if any catalog's checkTGTAndReloginFromKeytab() blocks (e.g. a 30–90 s TCP timeout to the KDC), every other catalog's TGT renewal queues behind it on the same thread — potentially causing cascading auth failures across all catalogs simultaneously.

Consider either sizing the pool relative to the expected number of concurrent Kerberos catalogs, or keeping per-instance executors and using the shared class only for the refresh lambda.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified. startTicketRefresh now creates a per-caller single-thread daemon scheduler and returns it.

return keytabUri.trim().startsWith("hdfs");
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TICKET_REFRESH_EXECUTOR is never shut down

setRemoveOnCancelPolicy(true) prevents queue memory growth when tasks are cancelled, but the backing daemon thread itself persists after all catalogs are closed. There is no shutdown hook or lifecycle method. In test JVMs, Gradle workers, or OSGi containers, the thread keeps static references alive and emits log errors for catalogs that no longer exist.

Consider Runtime.getRuntime().addShutdownHook(new Thread(TICKET_REFRESH_EXECUTOR::shutdownNow)), or a package-private shutdown() for test cleanup.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified. Each client owns its executor and calls shutdown() in close()

} catch (IOException e) {
LOG.warn("Failed to delete keytab file: {}", keytabFilePath, e);
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cancel(true) interrupts a running TGT renewal mid-flight

The old close() called checkTgtExecutor.shutdown(), which lets an in-progress task run to completion. The new cancelTicketRefreshTask() calls checkTgtRefreshTask.cancel(true), which immediately sets the thread interrupt flag. If checkTGTAndReloginFromKeytab() is interrupted after the old TGT is marked for replacement but before the new one is committed to the UGI Subject, the process-wide loginUser could briefly be left without valid credentials. HMS uses LoginMode.LOGIN_USER (process-wide UGI), so the impact is not limited to a single catalog.

Consider cancel(false) to let an in-flight relogin complete gracefully before teardown.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified. replaced cancel(true) with executor.shutdown()

public String getRealm() {
return realm;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getRealm() and getLoginUser() can return null before login() is called; missing @Nullable

Both realm and loginUser are null until login() completes. ClosableHiveCatalog.doKerberosOperations() calls kerberosClient.getRealm() and feeds the result into a String.format to construct a proxy principal — producing "user@null" if there is any ordering issue (e.g. login failed silently or was skipped).

CLAUDE.md requires @Nullable on return types that can be null. Add @Nullable to both getRealm() and getLoginUser(), or add Preconditions.checkState(realm != null, "login() has not been called") guards.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added Preconditions.checkState(... != null, "login() has not been called")

int fetchKeytabFileTimeout = kerberosConfig.getFetchTimeoutSec();
FileFetcher.get()
.fetchFileFromUri(keyTabUri, keytabFile, fetchKeytabFileTimeout * 1000, hadoopConf);
KerberosAuthUtils.fetchKeytabFromUri(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant keytabsDir.mkdir()fetchKeytabFromUri already calls parentFile.mkdirs()

KerberosAuthUtils.fetchKeytabFromUri calls parentFile.mkdirs() before invoking FileFetcher, so the three lines creating keytabsDir here are dead code. The manual mkdir() also uses the single-level form, which would silently fail if any parent of keytabs/ was missing — the exact case that fetchKeytabFromUri's mkdirs() already handles correctly.

These three lines can be removed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. removed the 3 dead lines; fetchKeytabFromUri already does parentFile.mkdirs().

…d add null-state guards

Address review comments:
- Revert to per-instance refresh executor so a slow/unreachable KDC only
  blocks that catalog's refresh; shut it down gracefully on close.
- Replace cancel(true) with executor.shutdown() to avoid interrupting an
  in-flight relogin (process-wide UGI for HMS LOGIN_USER mode).
- Add checkState guards to KerberosClient.getRealm()/getLoginUser().
- Remove redundant keytabs dir creation already handled by fetchKeytabFromUri.
ScheduledExecutorService executor =
Executors.newSingleThreadScheduledExecutor(
daemonThreadFactory(TICKET_REFRESH_THREAD_NAME_PREFIX));
executor.scheduleAtFixedRate(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you prefer to keep one Kerberos client per executor, I’d suggest moving the executor creation and shutdown logic into HmsKerberosClient. In general, it’s better to manage resource creation and cleanup in a single place.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to keep the current code, because:

  1. Each catalog already has its own HmsKerberosClient, and the executor is a per-instance field. It's created in login() and shut down in close(), both inside the client. Nothing is shared across catalogs, so there's no leak.
  2. Only the construction (newSingleThreadScheduledExecutor + scheduleAtFixedRate) is in KerberosAuthUtils.startTicketRefresh, because it's shared by both KerberosClient and HmsKerberosClient. Moving it into HmsKerberosClient would duplicate this code in both clients.

So I prefer not to inline it. Let me know if you see a real lifecycle problem.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's OK

@diqiu50 diqiu50 requested a review from jerryshao June 30, 2026 11:56
@jerryshao jerryshao merged commit 1729e33 into apache:main Jul 1, 2026
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[EPIC] Unify Kerberos keytab file fetching and configuration

4 participants