1 change: 1 addition & 0 deletions pom.xml
@@ -52,6 +52,7 @@
<module>xtable-core</module>
<module>xtable-utilities</module>
<module>xtable-aws</module>
<module>xtable-databricks</module>
<module>xtable-hive-metastore</module>
<module>xtable-service</module>
</modules>
130 changes: 117 additions & 13 deletions website/docs/unity-catalog.md
@@ -4,59 +4,68 @@ title: "Unity Catalog"
---

# Syncing to Unity Catalog
This document walks through the steps to register an Apache XTable™ (Incubating) synced Delta table in Unity Catalog on Databricks and open-source Unity Catalog.

This page covers **Databricks Unity Catalog** and **open-source Unity Catalog**. They are different systems:
Databricks Unity Catalog is a managed service in Databricks, while open-source Unity Catalog is a standalone server.
Both support **Delta external tables only** as Unity Catalog targets; Iceberg/Hudi are not supported as UC targets.

## Pre-requisites (for Databricks Unity Catalog)

1. Source table(s) (Hudi/Iceberg) already written to external storage locations like S3/GCS/ADLS.
If you don't have a source table written in S3/GCS/ADLS,
you can follow the steps in [this](/docs/hms) tutorial to set it up.
2. Set up connections to the external storage locations from Databricks.
- Follow the steps outlined [here](https://docs.databricks.com/en/storage/amazon-s3.html) for Amazon S3
- Follow the steps outlined [here](https://docs.databricks.com/en/storage/gcs.html) for Google Cloud Storage
- Follow the steps outlined [here](https://docs.databricks.com/en/storage/azure-storage.html) for Azure Data Lake Storage Gen2 and Blob Storage.
3. Create a Unity Catalog metastore in Databricks as outlined [here](https://docs.gcp.databricks.com/data-governance/unity-catalog/create-metastore.html#create-a-unity-catalog-metastore).
4. Create an external location in Databricks as outlined [here](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-location.html).
5. Clone the Apache XTable™ (Incubating) [repository](https://github.com/apache/incubator-xtable) and create the
`xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar` by following the steps on the [Installation page](/docs/setup)

## Pre-requisites (for open-source Unity Catalog)

1. Source table(s) (Hudi/Iceberg) already written to external storage locations like S3/GCS/ADLS or local.
In this guide, we will use the local file system.
For S3/GCS/ADLS, you must add the additional properties for the cloud object storage system you're working with, as described [here](https://github.com/unitycatalog/unitycatalog/blob/main/docs/server.md)
2. Clone the Unity Catalog repository from [here](https://github.com/unitycatalog/unitycatalog) and build the project by following the steps outlined [here](https://github.com/unitycatalog/unitycatalog?tab=readme-ov-file#prerequisites)

## Steps

### Running sync

Create `my_config.yaml` in the cloned Apache XTable™ (Incubating) directory.

```yaml md title="yaml"
sourceFormat: HUDI|ICEBERG # choose only one
targetFormats:
  - DELTA
datasets:
  - tableBasePath: s3://path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE # partitionSpec is only needed for the HUDI sourceFormat
```

:::note Note:

1. Replace `s3://path/to/source/data` with `gs://path/to/source/data` if your source table is in GCS,
or with `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>` if it is in ADLS.
2. Replace the `sourceFormat` and `tableName` fields with appropriate values.
:::

From your terminal, in the cloned Apache XTable™ (Incubating) directory, run the sync process with the following command.

```shell md title="shell"
java -jar xtable-utilities/target/xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml
```

:::tip Note:
At this point, if you check your bucket path, you will see a `_delta_log` directory containing
`00000000000000000000.json`, the commit log that lets query engines interpret the source table as a Delta table.
:::
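If the table lives on a local path, you can sanity-check that first commit file with a few lines of Python. This is just an illustrative helper, not part of XTable; for S3/GCS/ADLS you would use the corresponding filesystem client instead of `pathlib`.

```python
import json
from pathlib import Path


def first_delta_commit(table_base_path: str) -> dict:
    """Load the first Delta commit file (00000000000000000000.json).

    Each line of a Delta commit file is a separate JSON action
    (protocol, metaData, add, ...); they are returned as a list
    under the "actions" key.
    """
    commit = Path(table_base_path) / "_delta_log" / "00000000000000000000.json"
    actions = [json.loads(line) for line in commit.read_text().splitlines() if line.strip()]
    return {"path": str(commit), "actions": actions}
```

A non-empty `actions` list confirms the sync produced a readable Delta log.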

### Databricks Unity Catalog: manual registration (SQL)

After completing the pre-requisites for Databricks Unity Catalog above, open the SQL editor in your Databricks workspace and run the following queries.

```sql md title="SQL"
CREATE TABLE xtable.synced_delta_schema.<table_name>
USING DELTA
LOCATION 's3://path/to/source/data';
```

:::note Note:
Replace `s3://path/to/source/data` with `gs://path/to/source/data` if your source table is in GCS,
or with `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>` if it is in ADLS.
:::
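If you are registering many synced tables, the DDL above is easy to templatize. A minimal sketch; the function name is illustrative and not part of XTable or Databricks:

```python
def register_external_delta_sql(catalog: str, schema: str, table: str, location: str) -> str:
    """Build the CREATE TABLE statement that registers an external Delta table."""
    # Escape single quotes so the location stays a valid SQL string literal.
    loc = location.replace("'", "''")
    return (
        f"CREATE TABLE {catalog}.{schema}.{table}\n"
        f"USING DELTA\n"
        f"LOCATION '{loc}';"
    )
```

Each generated statement can then be pasted into the SQL editor or submitted through your preferred SQL client.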

### Validating the results
### Validating the results (Databricks)

You can now see the created Delta table in **Unity Catalog** under **Catalog** as `<table_name>` under
`synced_delta_schema`, and you can also query the table in the SQL editor:

```sql md title="SQL"
SELECT * FROM xtable.synced_delta_schema.<table_name>;
```

### Register the target table in open-source Unity Catalog using the CLI

After completing the pre-requisites for open-source Unity Catalog above, start the UC server in your terminal by following the steps outlined [here](https://github.com/unitycatalog/unitycatalog/tree/main?tab=readme-ov-file#quickstart---hello-uc)

In a different terminal, run the following commands to register the target table in Unity Catalog.
```shell md title="shell"
bin/uc table create --full_name unity.default.people --columns "id INT, name STR
```

### Validating the results (open-source UC)

You can now read the table registered in Unity Catalog using the following command.

```shell md title="shell"
bin/uc table read --full_name unity.default.people
```
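The `--columns` argument is a comma-separated list of `name TYPE` pairs. If you generate the CLI call from a schema, a small helper keeps the formatting consistent (a hypothetical helper, not part of the UC CLI):

```python
def uc_columns_arg(schema):
    """Render a [(name, type), ...] schema as the --columns value for `bin/uc table create`."""
    return ", ".join(f"{name} {dtype}" for name, dtype in schema)
```

For example, `uc_columns_arg([("id", "INT"), ("name", "STRING")])` yields the string to quote after `--columns`.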

### Databricks Unity Catalog: built-in catalog sync (XTable)

XTable can also register the Delta table directly in Databricks Unity Catalog using the catalog
sync configuration. This uses the Databricks Java SDK and issues DDL against a SQL Warehouse.

```yaml md title="yaml"
sourceCatalog:
  catalogId: source
  catalogType: STORAGE
  catalogProperties: {}
targetCatalogs:
  - catalogId: uc
    catalogType: DATABRICKS_UC
    catalogProperties:
      externalCatalog.uc.host: https://<workspace>
      externalCatalog.uc.warehouseId: <sql-warehouse-id>
      # OAuth M2M (recommended)
      externalCatalog.uc.authType: oauth-m2m
      externalCatalog.uc.clientId: <client-id>
      externalCatalog.uc.clientSecret: <client-secret>
datasets:
  - sourceCatalogTableIdentifier:
      tableIdentifier:
        hierarchicalId: <db>.<table>
        partitionSpec: partitionpath:VALUE
      storageIdentifier:
        tableFormat: HUDI
        tableBasePath: s3://path/to/source/data
        tableDataPath: s3://path/to/source/data
        tableName: <table>
        partitionSpec: partitionpath:VALUE
        namespace: <db>
    targetCatalogTableIdentifiers:
      - catalogId: uc
        tableFormat: DELTA
        tableIdentifier:
          hierarchicalId: <catalog>.<schema>.<table>
```
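A quick pre-flight check on the `externalCatalog.uc.*` properties catches missing keys before the sync starts. This is a sketch mirroring the documented property names, not code from XTable:

```python
REQUIRED_UC_PROPS = ("externalCatalog.uc.host", "externalCatalog.uc.warehouseId")


def validate_uc_properties(props):
    """Return the list of missing required Databricks UC catalog properties."""
    missing = [k for k in REQUIRED_UC_PROPS if not props.get(k)]
    # oauth-m2m additionally requires a client id and secret.
    if props.get("externalCatalog.uc.authType") == "oauth-m2m":
        missing += [
            k
            for k in ("externalCatalog.uc.clientId", "externalCatalog.uc.clientSecret")
            if not props.get(k)
        ]
    return missing
```

An empty return value means the `catalogProperties` block above is complete.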

### Authentication (Databricks UC)

**Supported now**

- OAuth M2M via:
- `externalCatalog.uc.authType: oauth-m2m`
- `externalCatalog.uc.clientId`
- `externalCatalog.uc.clientSecret`

**Not supported yet**

- PAT/token-based auth is intentionally not wired in the current XTable UC integration.

**Possible later**

- PAT or other auth flows could be added by extending the UC config and SDK wiring,
but they are out of scope for now.

### Implementation details (Databricks UC)

- XTable uses the Databricks SQL Statement Execution API (`StatementExecutionAPI`) to run DDL
against a SQL Warehouse.
- The built-in sync registers **external Delta tables** only:
- `CREATE TABLE IF NOT EXISTS <catalog>.<schema>.<table> USING DELTA LOCATION '<path>'`
- Schema evolution currently runs `MSCK REPAIR TABLE <table> SYNC METADATA` when a schema diff is
detected, to refresh catalog metadata without touching the Delta log.

### Schema evolution limitations (Databricks UC)

Unity Catalog does not provide a catalog-only schema evolution API for external tables.
While `ALTER TABLE ...` can update the catalog, it also assumes control over the Delta
transaction log. For external tables managed outside Databricks, this can be unsafe.

To avoid mutating the Delta log, XTable currently:

1. Detects any schema differences (new columns, dropped columns, type/comment changes).
2. Runs `MSCK REPAIR TABLE <table> SYNC METADATA` to refresh UC metadata.

This approach **does not delete data** and does not modify the Delta log; it only refreshes
catalog metadata in place. Schema evolution is usually rare in production pipelines, so this trade-off is considered acceptable.
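The trigger condition reduces to "any field-level difference". A sketch of that comparison, assuming schemas are simple name-to-type mappings (the shape is illustrative, not XTable's internal model):

```python
def schema_diff(catalog_schema, source_schema):
    """Describe differences between catalog and source schemas.

    Both inputs are {column_name: data_type} dicts; the result lists
    added, dropped, and re-typed columns. A non-empty diff is what
    would trigger `MSCK REPAIR TABLE <table> SYNC METADATA`.
    """
    added = sorted(set(source_schema) - set(catalog_schema))
    dropped = sorted(set(catalog_schema) - set(source_schema))
    changed = sorted(
        c
        for c in set(catalog_schema) & set(source_schema)
        if catalog_schema[c] != source_schema[c]
    )
    return {"added": added, "dropped": dropped, "changed": changed}
```

If all three lists come back empty, no repair statement is needed.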

### Databricks UC limitations and requirements

- Unity Catalog enforces **unique external locations**. A location used by another
table/volume cannot be reused. This means you cannot register multiple external tables
(e.g., Iceberg/Delta via Glue federation and Databricks UC external) pointing to the same location.
- Ensure your Unity Catalog storage credential has **write** access to the Delta
`_delta_log` directory at the table location.
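Because UC rejects overlapping external locations, it is worth scanning your dataset config for duplicate paths up front (a hypothetical pre-flight helper, not part of XTable):

```python
def duplicate_locations(table_paths):
    """Return table base paths that appear more than once in a config.

    Trailing slashes are stripped so `s3://b/a` and `s3://b/a/` count
    as the same location, matching UC's uniqueness requirement.
    """
    seen, dupes = set(), set()
    for path in table_paths:
        normalized = path.rstrip("/")
        if normalized in seen:
            dupes.add(normalized)
        seen.add(normalized)
    return dupes
```

Any path reported here would need to be removed or remapped before registration succeeds.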

## Conclusion

In this guide, we saw how to:

1. sync a source table to create metadata for the desired target table formats using Apache XTable™ (Incubating),
2. catalog the data in Delta format in Unity Catalog on Databricks and in open-source Unity Catalog, and
3. query the Delta table using the Databricks SQL editor and the open-source Unity Catalog CLI.
2 changes: 1 addition & 1 deletion website/sidebars.js
@@ -38,7 +38,7 @@ module.exports = {
items: [
'hms',
'glue-catalog',
'unity-catalog',
'databricks-unity-catalog',
'biglake-metastore',
],
},
@@ -27,4 +27,5 @@ public class CatalogType {
public static final String STORAGE = "STORAGE";
public static final String GLUE = "GLUE";
public static final String HMS = "HMS";
public static final String DATABRICKS_UC = "DATABRICKS_UC";
}
80 changes: 80 additions & 0 deletions xtable-databricks/pom.xml
@@ -0,0 +1,80 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
~ The ASF licenses this file to You under the Apache License, Version 2.0
~ (the "License"); you may not use this file except in compliance with
~ the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.apache.xtable</groupId>
<artifactId>xtable</artifactId>
<version>0.3.0-incubating</version>
</parent>

<artifactId>xtable-databricks</artifactId>
<name>XTable Databricks</name>

<dependencies>
<dependency>
<groupId>org.apache.xtable</groupId>
<artifactId>xtable-core_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>

<dependency>
<groupId>com.databricks</groupId>
<artifactId>databricks-sdk-java</artifactId>
<version>0.85.0</version>
</dependency>

<!-- Hadoop dependencies -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<scope>provided</scope>
</dependency>

<!-- Junit -->
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-api</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-params</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-engine</artifactId>
<scope>test</scope>
</dependency>

<!-- Mockito -->
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-junit-jupiter</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
</project>
@@ -0,0 +1,53 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.xtable.databricks;

import java.util.Map;

import lombok.Value;

import org.apache.xtable.conversion.ExternalCatalogConfig;

@Value
public class DatabricksUnityCatalogConfig {
public static final String HOST = "externalCatalog.uc.host";
public static final String WAREHOUSE_ID = "externalCatalog.uc.warehouseId";
public static final String AUTH_TYPE = "externalCatalog.uc.authType";
public static final String CLIENT_ID = "externalCatalog.uc.clientId";
public static final String CLIENT_SECRET = "externalCatalog.uc.clientSecret";
public static final String TOKEN = "externalCatalog.uc.token";

String host;
String warehouseId;
String authType;
String clientId;
String clientSecret;
String token;

public static DatabricksUnityCatalogConfig from(ExternalCatalogConfig catalogConfig) {
Map<String, String> props = catalogConfig.getCatalogProperties();
return new DatabricksUnityCatalogConfig(
props.get(HOST),
props.get(WAREHOUSE_ID),
props.get(AUTH_TYPE),
props.get(CLIENT_ID),
props.get(CLIENT_SECRET),
props.get(TOKEN));
}
}