feat: support ddl for branch operation by fangbo · Pull Request #576 · lance-format/lance-spark

fangbo · 2026-06-02T12:23:16Z

Summary

This PR adds DDL support for branch operations in Lance Spark.

Previously, branch-related operations did not support DDL workflows properly. With this change, users can perform DDL on branches more consistently, improving branch management usability and aligning branch behavior with expected table operation semantics.

Changes

Add DDL support for branch operations
Enable branch-related workflows to work with DDL statements
Improve consistency of branch management behavior in Lance Spark

Motivation

Branch operations are an important part of versioned data workflows. Supporting DDL on branches makes these workflows more complete and practical, especially for users who need to manage schema or table-related changes in isolated branches before merging.

LuciferYang · 2026-06-02T14:54:43Z

will review this one later

fangbo · 2026-06-03T02:36:26Z

Thanks very much.

I have an idea about grammar definitions that I'd like to discuss. There is another grammar definition like:

-- create branch from main's specific version
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF MAIN <main_version>

-- create branch from branch's specific version
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF BRANCH <source_branch_name> <branch_version>

-- create branch from tag
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF TAG <source_tag_name>

What do you think about these two choices?

cc @hamersaw

LuciferYang · 2026-06-03T05:53:33Z

Thanks @fangbo, the keyword form is a lot better than REF/MAIN/5 — the / separators aren't really SQL, and this grammar becomes a public contract once it ships, so worth settling now.

Since we already follow Iceberg's extension model closely, the cleanest option is probably to match its branch DDL where we can:

ALTER TABLE t CREATE BRANCH [IF NOT EXISTS] <name> [AS OF VERSION <id>] [RETAIN <n> DAYS]
ALTER TABLE t DROP BRANCH [IF EXISTS] <name>

The catch is that Iceberg has one global snapshot space, so AS OF VERSION <id> is enough. Our versions are per-ref, so we still have to say which ref the version belongs to. So something like:

ALTER TABLE t CREATE BRANCH [IF NOT EXISTS] <name>
   [ AS OF VERSION <v>                 -- main @ v
   | AS OF BRANCH <src> [VERSION <v>]
   | AS OF TAG <tag> ]

That keeps Iceberg's AS OF VERSION <v> verbatim for the common case and only adds AS OF BRANCH / AS OF TAG where our model actually needs it. Two small things: AS OF reads better first (matches Iceberg's CREATE BRANCH … AS OF VERSION), and AS OF TAG <tag> is cleaner than VERSION AS OF TAG since a tag isn't a version.

Couple of cheap wins worth folding in now, both already in Iceberg: IF NOT EXISTS / CREATE OR REPLACE, and — more important — declaring the new keywords as nonReserved so branch/tag/version/as/of don't become reserved inside every extension statement (today a column named version would fail to parse).

Iceberg also skips SHOW BRANCHES and just exposes a <table>.refs metadata table — not a blocker, just an option if we'd rather keep it queryable.

LuciferYang · 2026-06-03T05:56:02Z

+import java.util.stream.IntStream;
+
+/** Base tests for BRANCH DDL commands. */
+public abstract class BaseBranchDDLTest {


I think we should add the negative-path integration tests for the error cases the docs guarantee, a backtick-quoted branch-name test (locks in cleanIdentifier), and a few LanceSqlExtensionsAstBuilderTest cases for the new visit methods.

LuciferYang · 2026-06-03T06:03:53Z

+      case _ => throw new UnsupportedOperationException("CreateBranch only supports LanceDataset")
+    }
+
+    val dataset = Utils.openDatasetBuilder(lanceDataset.readOptions()).build()


The branch execs look like they'll miss credential-vended catalogs. CreateBranchExec (and DropBranchExec / ShowBranchesExec) open with a bare Utils.openDatasetBuilder(readOptions).build(), whereas OptimizeExec / AddIndexExec thread the catalog's vended options into the open:

// OptimizeExec
val initialStorageOpts = catalog match {
  case ns: BaseLanceNamespaceSparkCatalog =>
    Option(lanceDataset.getInitialStorageOptions).map(_.asScala.toMap)
  case _ => None
}
val dataset = Utils.openDatasetBuilder(readOptions)
  .initialStorageOptions(initialStorageOpts.map(_.asJava).orNull)
  .build()

So on a credential-vending namespace catalog over S3/GCS, the open (and the branch write) won't have the vended creds. It passes today because the dir catalog's options are non-empty and the Glue/S3 docker test is skipped, so CI never hits this path.

Related sharp edge: CreateBranchExec calls the 3-arg createBranch(name, ref, dataset.getLatestStorageOptions). getLatestStorageOptions is the manifest map (no creds), and the 3-arg overload throws on a null/empty map (checkArgument(opts != null && !opts.isEmpty(), ...)). The 2-arg createBranch(name, ref) has no such precondition.

So I'd thread initialStorageOptions into the open like above, then use the 2-arg createBranch(name, ref) — the write picks up creds from the open.

LuciferYang · 2026-06-03T06:06:07Z

+import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
+import org.lance.Ref
+
+case class CreateBranch(


Renaming this to LanceCreateBranch would proactively avoid potential naming conflicts.

LuciferYang · 2026-06-03T06:10:10Z

+    else CreateBranch(
+      table,
+      branchName,
+      org.lance.Ref.ofMain(java.lang.Long.valueOf(ctx.refMainVersion.getText)))


java.lang.Long.valueOf throws NumberFormatException when the version literal is outside the Long range. I'd recommend catching NumberFormatException and wrapping it in a ParseException.
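A minimal Java sketch of the suggestion above, assuming a stand-in exception type. `VersionParseException` and `VersionParser` are hypothetical names used only for illustration; they are not Spark's `ParseException` or anything from this PR.

```java
/** Hypothetical stand-in for a parser-level error; not Spark's ParseException. */
class VersionParseException extends RuntimeException {
    VersionParseException(String message, Throwable cause) {
        super(message, cause);
    }
}

class VersionParser {
    /** Parses a version literal, converting NumberFormatException into a parse error. */
    static long parseVersion(String text) {
        try {
            return Long.parseLong(text);
        } catch (NumberFormatException e) {
            // Covers both non-numeric text and values outside the Long range.
            throw new VersionParseException("Cannot parse '" + text + "' as a version", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseVersion("12345"));
        try {
            parseVersion("9223372036854775808"); // Long.MAX_VALUE + 1
        } catch (VersionParseException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the original NumberFormatException as the cause preserves the underlying detail while the caller sees a single parse-level error type.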

LuciferYang · 2026-06-03T06:12:57Z

+        branchName,
+        ref,
+        dataset.getLatestStorageOptions)
+      branchDs.close()


branchDs.close() sits in the try body, not protected by its own finally.
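One way to guarantee the close, sketched in Java with try-with-resources. `FakeHandle` and `CloseDemo` are hypothetical stand-ins that only record when `close()` runs; the real exec is Scala and would need an equivalent try/finally.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a dataset handle; records when close() runs. */
class FakeHandle implements AutoCloseable {
    private final String name;
    private final List<String> log;

    FakeHandle(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override
    public void close() {
        log.add("closed " + name);
    }
}

class CloseDemo {
    /** Opens two handles and closes both, even if the body throws. */
    static List<String> run(boolean failAfterCreate) {
        List<String> log = new ArrayList<>();
        try (FakeHandle dataset = new FakeHandle("dataset", log);
             FakeHandle branch = new FakeHandle("branch", log)) {
            if (failAfterCreate) {
                throw new IllegalStateException("simulated failure");
            }
            log.add("done");
        } catch (IllegalStateException e) {
            // Resources are already closed (in reverse order) when this runs.
            log.add("failed");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```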

LuciferYang · 2026-06-03T06:14:52Z

+| Column | Type | Description |
+|--------|------|-------------|
+| `name` | String | Branch name |
+| `parent_branch` | String | Source branch name if the branch was created from another branch; otherwise empty |


otherwise empty -> otherwise NULL?

LuciferYang · 2026-06-03T06:16:27Z

+      StructField("name", DataTypes.StringType, nullable = false),
+      StructField("parent_branch", DataTypes.StringType, nullable = true),
+      StructField("parent_version", DataTypes.LongType, nullable = false),
+      StructField("create_at", DataTypes.LongType, nullable = false),


LuciferYang · 2026-06-03T06:17:28Z

-    ;
-
-
+    ;


should restore the trailing newline

fangbo · 2026-06-03T12:15:36Z

@LuciferYang Thanks a lot for your detailed suggestions. I have made some modifications according to your advice.

hamersaw · 2026-06-03T18:56:11Z

Thanks for the PR! This is great. Overall looks great, my only comment is the main reason I never wrapped up the initial proposal for this - basically I didn't have time to drive consensus. There was quite a bit of conversation on #198 about how we wanted to handle branches / tags / versions in this connector, the two competing approaches:

New keywords: Use new AS OF BRANCH / AS OF TAG, etc. For example:

SELECT * FROM foo AS OF BRANCH 'b0';
ALTER TABLE foo CREATE BRANCH 'b0' AS OF VERSION 12345;

Bake the references directly into the table URI

SELECT * FROM foo/__ref/b0;
ALTER TABLE foo/__ref/12345 CREATE BRANCH 'b0';

The former is what is proposed in this PR, but IIRC we had loose consensus that the latter was a little cleaner?! Also, the latter means we can have a single style that is used across connectors (Spark / Presto / Trino / DuckDB), with the base parsing code in the lance core repo.

fangbo · 2026-06-04T02:42:59Z

@hamersaw Thanks for your feedback.

Branch/Tag operations involve DDL and DML.

DDL

This PR focuses on branch DDL. Branch/Tag is part of the Dataset, so from a Branch/Tag management perspective, it's more intuitive for the Dataset to be the target of ALTER operations. Therefore, I lean more towards:

ALTER TABLE foo CREATE BRANCH 'b0' AS OF VERSION 12345;

DML

As for DML, I believe one consensus reached in the previous PR #198 was to treat Branch/Tag as ordinary tables. For example:

SELECT * FROM <table_name>__branch/<branch_name>
UPDATE <table_name>__branch/<branch_name> SET ...
DELETE FROM <table_name>__branch/<branch_name> WHERE ...

or

SELECT * FROM <table_name>__branch__<branch_name>
UPDATE <table_name>__branch__<branch_name> SET ...
DELETE FROM <table_name>__branch__<branch_name> WHERE ...

By the way, I lean more towards <table_name>__branch__<branch_name>, because it avoids the extra '/' symbol.
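To make the `<table_name>__branch__<branch_name>` option concrete, here is a rough Java sketch of how such an identifier might be split. `BranchTableName` is a hypothetical helper, not code from this PR or from lance core, and splitting on the first occurrence of the separator is an assumption; the thread does not settle how a table name that itself contains `__branch__` should be handled.

```java
/** Hypothetical helper that splits "<table>__branch__<branch>" identifiers. */
class BranchTableName {
    static final String SEPARATOR = "__branch__";

    final String table;
    final String branch; // null means no branch suffix was present

    private BranchTableName(String table, String branch) {
        this.table = table;
        this.branch = branch;
    }

    /** Splits on the first separator (an assumption); plain names pass through. */
    static BranchTableName parse(String identifier) {
        int idx = identifier.indexOf(SEPARATOR);
        if (idx < 0) {
            return new BranchTableName(identifier, null);
        }
        String table = identifier.substring(0, idx);
        String branch = identifier.substring(idx + SEPARATOR.length());
        if (table.isEmpty() || branch.isEmpty()) {
            throw new IllegalArgumentException("Malformed branch identifier: " + identifier);
        }
        return new BranchTableName(table, branch);
    }

    public static void main(String[] args) {
        BranchTableName parsed = parse("foo__branch__b0");
        System.out.println(parsed.table + " / " + parsed.branch);
    }
}
```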

fangbo · 2026-06-05T04:02:20Z

cc @majin1102

fangbo · 2026-06-10T02:46:24Z

@LuciferYang @hamersaw Do you have any more suggestions about this PR?

LuciferYang · 2026-06-11T07:51:13Z

+      java.lang.Long.valueOf(value)
+    } catch {
+      case _: NumberFormatException =>
+        throw new ParseException("Can't parse value:" + value + " to version", ctx)


On every supported Spark version, the ParseException(String, ParserRuleContext) constructor's first parameter is an error class, not a message — it delegates to SparkThrowableHelper.getMessage(errorClass, Map.empty).

LuciferYang · 2026-06-11T07:53:38Z

+    else LanceCreateBranch(
+      table,
+      branchName,
+      org.lance.Ref.ofMain(java.lang.Long.valueOf(ctx.refMainVersion.getText)),


use org.lance.Ref.ofMain(_parseVersion(ctx, ctx.refMainVersion.getText))?

LuciferYang · 2026-06-11T07:53:45Z

+    else LanceCreateBranch(
+      table,
+      branchName,
+      org.lance.Ref.ofBranch(refBranchName, java.lang.Long.valueOf(ctx.refBranchVersion.getText)),


LuciferYang · 2026-06-11T07:54:43Z

 INDEX: 'INDEX';
 INDEXES: 'INDEXES';
 KEY: 'KEY';
+MAIN: 'MAIN';


MAIN: 'MAIN'; is declared but no parser rule references it.

LuciferYang · 2026-06-11T07:55:16Z

+import org.apache.spark.sql.catalyst.plans.logical.LanceCreateBranchOutputType
+import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}
+import org.apache.spark.unsafe.types.UTF8String
+import org.lance.spark.{BaseLanceNamespaceSparkCatalog, LanceDataset}


BaseLanceNamespaceSparkCatalog seems unused import

LuciferYang · 2026-06-11T07:57:49Z

+      if (!ifNotExists || !alreadyExists) {
+        var branchDs: org.lance.Dataset = null
+        try {
+          branchDs = dataset.createBranch(branchName, ref, dataset.getInitialStorageOptions)


val opts = dataset.getInitialStorageOptions
branchDs =
  if (opts == null || opts.isEmpty) dataset.createBranch(branchName, ref)
  else dataset.createBranch(branchName, ref, opts)

LuciferYang · 2026-06-11T07:59:51Z

+      .build()
+
+    try {
+      val alreadyExists = dataset.branches().list().asScala.exists(_.getName == branchName)


Both LanceCreateBranchExec and LanceDropBranchExec do branches().list() then create/delete. A concurrent CREATE BRANCH between the two calls still surfaces the native "already exists" error despite IF NOT EXISTS. For DDL this is acceptable, but catching the native already-exists/not-found error instead would be atomic and saves one branch-listing call per command (object-store IO against the table root — a read_dir over the refs path). Also: both execs compute the existence check eagerly (LanceCreateBranchExec.scala:46, LanceDropBranchExec.scala:45), so the list() call is issued even when the IF [NOT] EXISTS flag is absent and its result goes unused.

An exception is thrown when creating a branch that already exists:

java.lang.RuntimeException: Encountered internal error. Please file a bug report at https://github.com/lance-format/lance/issues. Clone operation should not enter build_manifest., /Users/runner/work/lance/lance/rust/lance/src/dataset/transaction.rs:1866:28
    at org.lance.Dataset.nativeCreateBranch(Native Method)
    at org.lance.Dataset.innerCreateBranch(Dataset.java:1761)
    at org.lance.Dataset.createBranch(Dataset.java:1753)

This means that existence cannot be determined from the error. I think it is reasonable to list branches to check whether the branch exists.
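If listing stays, the eager-evaluation point can still be addressed by only listing when the flag is present. A minimal Java sketch under that assumption; `BranchStore` and `CreateBranchCommand` are hypothetical in-memory stand-ins, not the Lance API, and the list-then-create sequence is still not atomic.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory branch store that counts list() calls. */
class BranchStore {
    private final Set<String> branches = new HashSet<>();
    int listCalls = 0;

    Set<String> list() {
        listCalls++;
        return new HashSet<>(branches);
    }

    void create(String name) {
        if (!branches.add(name)) {
            throw new IllegalStateException("branch already exists: " + name);
        }
    }
}

class CreateBranchCommand {
    /** Returns true if a branch was created, false if skipped by IF NOT EXISTS. */
    static boolean run(BranchStore store, String name, boolean ifNotExists) {
        // Short-circuit: list() runs only when the IF NOT EXISTS flag is present.
        if (ifNotExists && store.list().contains(name)) {
            return false;
        }
        store.create(name);
        return true;
    }

    public static void main(String[] args) {
        BranchStore store = new BranchStore();
        System.out.println(run(store, "b0", false));
        System.out.println(run(store, "b0", true));
        System.out.println(store.listCalls);
    }
}
```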

LuciferYang · 2026-06-11T08:01:17Z

+    }
+  }
+
+  def _parseVersion(ctx: ParserRuleContext, value: String): Long = {


Public def _parseVersion with a leading underscore is unidiomatic Scala

This method has been changed to private.

LuciferYang · 2026-06-11T08:05:05Z

+
+- `CREATE BRANCH` is implemented as a Spark SQL extension command.
+- The referenced table must be a Lance table.
+- Creating a branch from a non-existent branch, tag, or version returns an error.


Lacks corresponding test cases

Three corresponding test cases have been added to BaseBranchDDLTest:

testCreateBranchFailsWhenSourceBranchDoesNotExist

testCreateBranchFailsWhenSourceTagDoesNotExist

testCreateBranchFailsWhenSourceVersionDoesNotExist

fangbo · 2026-06-16T09:55:14Z

@LuciferYang Thanks for your detailed reviews. I have made some modifications. Could you please take another look?

LuciferYang

LGTM

github-actions Bot added the enhancement New feature or request label Jun 2, 2026

LuciferYang reviewed Jun 3, 2026

LuciferYang reviewed Jun 11, 2026

fangbo force-pushed the ddl-branch branch from 8a1bc89 to a7cfed8 Compare June 16, 2026 08:59

fangbo added 2 commits June 16, 2026 17:08

feat: support ddl for branch operation

e64b78a

rebase main

b5f0bfc

fangbo force-pushed the ddl-branch branch from a7cfed8 to b5f0bfc Compare June 16, 2026 09:24

make _parseVersion private

8b0b36b

LuciferYang approved these changes Jun 16, 2026

fangbo merged commit fbc11fc into lance-format:main Jun 17, 2026
17 checks passed

fangbo deleted the ddl-branch branch June 23, 2026 11:29

fangbo mentioned this pull request Jun 23, 2026

Support tag DDL #651

Open
