Skip to content

feat: support ddl for branch operation#576

Merged
fangbo merged 3 commits into
lance-format:mainfrom
fangbo:ddl-branch
Jun 17, 2026
Merged

feat: support ddl for branch operation#576
fangbo merged 3 commits into
lance-format:mainfrom
fangbo:ddl-branch

Conversation

@fangbo

@fangbo fangbo commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Summary

This PR adds DDL support for branch operations in Lance Spark.

Previously, branch-related operations did not support DDL workflows properly. With this change, users can perform DDL on branches more consistently, improving branch management usability and aligning branch behavior with expected table operation semantics.

Changes

  • Add DDL support for branch operations
  • Enable branch-related workflows to work with DDL statements
  • Improve consistency of branch management behavior in Lance Spark

Motivation

Branch operations are an important part of versioned data workflows. Supporting DDL on branches makes these workflows more complete and practical, especially for users who need to manage schema or table-related changes in isolated branches before merging.

@github-actions github-actions Bot added the enhancement New feature or request label Jun 2, 2026
@LuciferYang

Copy link
Copy Markdown
Collaborator

will review this one later

@fangbo

fangbo commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

will review this one later

Thanks very much.

I have an idea about grammar definitions that I'd like to discuss. There is another grammar definition like:

-- create branch from main's specific version
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF MAIN <main_version>

-- create branch from branch's specific version
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF BRANCH <source_branch_name>  <branch_version>

-- create branch from tag
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF TAG <source_tag_name> 

What do you think about these two choices ?

cc @hamersaw

@LuciferYang

Copy link
Copy Markdown
Collaborator

will review this one later

Thanks very much.

I have an idea about grammar definitions that I'd like to discuss. There is another grammar definition like:

-- create branch from main's specific version
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF MAIN <main_version>

-- create branch from branch's specific version
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF BRANCH <source_branch_name>  <branch_version>

-- create branch from tag
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF TAG <source_tag_name> 

What do you think about these two choices ?

cc @hamersaw

Thanks @fangbo, the keyword form is a lot better than REF/MAIN/5 — the / separators aren't really SQL, and this grammar becomes a public contract once it ships, so worth settling now.

Since we already follow Iceberg's extension model closely, the cleanest option is probably to match its branch DDL where we can:

ALTER TABLE t CREATE BRANCH [IF NOT EXISTS] <name> [AS OF VERSION <id>] [RETAIN <n> DAYS]
ALTER TABLE t DROP BRANCH [IF EXISTS] <name>

The catch is that Iceberg has one global snapshot space, so AS OF VERSION <id> is enough. Our versions are per-ref, so we still have to say which ref the version belongs to. So something like:

ALTER TABLE t CREATE BRANCH [IF NOT EXISTS] <name>
   [ AS OF VERSION <v>                 -- main @ v
   | AS OF BRANCH <src> [VERSION <v>]
   | AS OF TAG <tag> ]

That keeps Iceberg's AS OF VERSION <v> verbatim for the common case and only adds AS OF BRANCH / AS OF TAG where our model actually needs it. Two small things: AS OF reads better first (matches Iceberg's CREATE BRANCH … AS OF VERSION), and AS OF TAG <tag> is cleaner than VERSION AS OF TAG since a tag isn't a version.

Couple of cheap wins worth folding in now, both already in Iceberg: IF NOT EXISTS / CREATE OR REPLACE, and — more important — declaring the new keywords as nonReserved so branch/tag/version/as/of don't become reserved inside every extension statement (today a column named version would fail to parse).

Iceberg also skips SHOW BRANCHES and just exposes a <table>.refs metadata table — not a blocker, just an option if we'd rather keep it queryable.

import java.util.stream.IntStream;

/** Base tests for BRANCH DDL commands. */
public abstract class BaseBranchDDLTest {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add the negative-path integration tests for the error cases the docs guarantee, a backtick-quoted branch-name test (locks in cleanIdentifier), and a few LanceSqlExtensionsAstBuilderTest cases for the new visit methods.

case _ => throw new UnsupportedOperationException("CreateBranch only supports LanceDataset")
}

val dataset = Utils.openDatasetBuilder(lanceDataset.readOptions()).build()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The branch execs look like they'll miss credential-vended catalogs. CreateBranchExec (and DropBranchExec / ShowBranchesExec) open with a bare Utils.openDatasetBuilder(readOptions).build(), whereas OptimizeExec / AddIndexExec thread the catalog's vended options into the open:

// OptimizeExec
val initialStorageOpts = catalog match {
  case ns: BaseLanceNamespaceSparkCatalog =>
    Option(lanceDataset.getInitialStorageOptions).map(_.asScala.toMap)
  case _ => None
}
val dataset = Utils.openDatasetBuilder(readOptions)
  .initialStorageOptions(initialStorageOpts.map(_.asJava).orNull)
  .build()

So on a credential-vending namespace catalog over S3/GCS, the open (and the branch write) won't have the vended creds. It passes today because the dir catalog's options are non-empty and the Glue/S3 docker test is skipped, so CI never hits this path.

Related sharp edge: CreateBranchExec calls the 3-arg createBranch(name, ref, dataset.getLatestStorageOptions). getLatestStorageOptions is the manifest map (no creds), and the 3-arg overload throws on a null/empty map (checkArgument(opts != null && !opts.isEmpty(), ...)). The 2-arg createBranch(name, ref) has no such precondition.

So I'd thread initialStorageOptions into the open like above, then use the 2-arg createBranch(name, ref) — the write picks up creds from the open.

import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.lance.Ref

case class CreateBranch(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using LanceCreateBranch could proactively prevent potential naming conflicts.

else CreateBranch(
table,
branchName,
org.lance.Ref.ofMain(java.lang.Long.valueOf(ctx.refMainVersion.getText)))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a risk of out-of-bounds overflow. It is recommended to catch NumberFormatException and wrap it into a ParseException.

branchName,
ref,
dataset.getLatestStorageOptions)
branchDs.close()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

branchDs.close() sits in the try body, not protected by its own finally.

| Column | Type | Description |
|--------|------|-------------|
| `name` | String | Branch name |
| `parent_branch` | String | Source branch name if the branch was created from another branch; otherwise empty |

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise empty -> otherwise NULL ?

StructField("name", DataTypes.StringType, nullable = false),
StructField("parent_branch", DataTypes.StringType, nullable = true),
StructField("parent_version", DataTypes.LongType, nullable = false),
StructField("create_at", DataTypes.LongType, nullable = false),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created_at

;


; No newline at end of file

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should restore the trailing newline

@fangbo

fangbo commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

@LuciferYang Greatly thanks for your detailed suggestion. I have made some modification according to your advice.

@hamersaw

hamersaw commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Thanks for the PR! This is great. Overall looks great, my only comment is the main reason I never wrapped up the initial proposal for this - basically I didn't have time to drive consensus. There was quite a bit of conversation on #198 about how we wanted to handle branches / tags / versions in this connector, the two competing approaches:

  1. New keywords: Use new AS OF BRANCH / AS OF TAG, etc. For example:
SELECT * FROM foo AS OF BRANCH 'b0';
ALTER TABLE foo CREATE BRANCH 'b0' AS OF VERSION 12345;
  1. Bake the references directly into the table URI
SELECT * FROM foo/__ref/b0;
ALTER TABLE foo/__ref/12345 CREATE BRANCH 'b0';

The former is what is proposed in this PR, but IIR we had loose consensus that the latter was a little cleaner?! Also the latter means we can have a single style that is used across connectors (Spark / Presto / Trino / DuckDB) with the base parsing code in lance core repo.

@fangbo

fangbo commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

@hamersaw Thanks for your feedback.

Branch/Tag operations involve DDL and DML.

DDL

This PR focuses on branch DDL. Branch/Tag is part of the Dataset, so from a Branch/Tag management perspective, it's more intuitive for the Dataset to be the target of ALTER operations. Therefore, I lean more towards:

ALTER TABLE foo CREATE BRANCH 'b0' AS OF VERSION 12345;

DML

As for DML, I believe one consensus reached in the previous PR#198 was to treat Branch/Tag as ordinary tables. For example::

SELECT * FROM <table_name>__branch/<branch_name>
UPDATE <table_name>__branch/<branch_name> SET ...
DELETE <table_name>__branch/<branch_name> WHERE

or

SELECT * FROM <table_name>__branch__<branch_name>
UPDATE <table_name>__branch__<branch_name> SET ...
DELETE <table_name>__branch__<branch_name> WHERE

By the way, I lean more towards <table_name>__branch__<branch_name>, because it avoids the extra '/' symbol.

@fangbo

fangbo commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

cc @majin1102

@fangbo

fangbo commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

@LuciferYang @hamersaw Do you have some more suggestion about this PR ?

java.lang.Long.valueOf(value)
} catch {
case _: NumberFormatException =>
throw new ParseException("Can't parse value:" + value + " to version", ctx)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On every supported Spark version, the ParseException(String, ParserRuleContext) constructor's first parameter is an error class, not a message — it delegates to SparkThrowableHelper.getMessage(errorClass, Map.empty).

else LanceCreateBranch(
table,
branchName,
org.lance.Ref.ofMain(java.lang.Long.valueOf(ctx.refMainVersion.getText)),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use org.lance.Ref.ofMain(_parseVersion(ctx, ctx.refMainVersion.getText))?

else LanceCreateBranch(
table,
branchName,
org.lance.Ref.ofBranch(refBranchName, java.lang.Long.valueOf(ctx.refBranchVersion.getText)),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

INDEX: 'INDEX';
INDEXES: 'INDEXES';
KEY: 'KEY';
MAIN: 'MAIN';

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAIN: 'MAIN'; is declared but no parser rule references it.

import org.apache.spark.sql.catalyst.plans.logical.LanceCreateBranchOutputType
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}
import org.apache.spark.unsafe.types.UTF8String
import org.lance.spark.{BaseLanceNamespaceSparkCatalog, LanceDataset}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BaseLanceNamespaceSparkCatalog seems unused import

if (!ifNotExists || !alreadyExists) {
var branchDs: org.lance.Dataset = null
try {
branchDs = dataset.createBranch(branchName, ref, dataset.getInitialStorageOptions)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val opts = dataset.getInitialStorageOptions
branchDs =
  if (opts == null || opts.isEmpty) dataset.createBranch(branchName, ref)
  else dataset.createBranch(branchName, ref, opts)

.build()

try {
val alreadyExists = dataset.branches().list().asScala.exists(_.getName == branchName)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both LanceCreateBranchExec and LanceDropBranchExec do branches().list() then create/delete. A concurrent CREATE BRANCH between the two calls still surfaces the native "already exists" error despite IF NOT EXISTS. For DDL this is acceptable, but catching the native already-exists/not-found error instead would be atomic and saves one branch-listing call per command (object-store IO against the table root — a read_dir over the refs path). Also: both execs compute the existence check eagerly (LanceCreateBranchExec.scala:46, LanceDropBranchExec.scala:45), so the list() call is issued even when the IF [NOT] EXISTS flag is absent and its result goes unused.

@fangbo fangbo Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception throws when creating a existed branch:

java.lang.RuntimeException: Encountered internal error. Please file a bug report at https://github.com/lance-format/lance/issues. Clone operation should not enter build_manifest., /Users/runner/work/lance/lance/rust/lance/src/dataset/transaction.rs:1866:28

	at org.lance.Dataset.nativeCreateBranch(Native Method)
	at org.lance.Dataset.innerCreateBranch(Dataset.java:1761)
	at org.lance.Dataset.createBranch(Dataset.java:1753)

This means that existence can not be checked from the error. I think it is reasonable to list branches to check the branch exist or not.

}
}

def _parseVersion(ctx: ParserRuleContext, value: String): Long = {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public def _parseVersion with a leading underscore is unidiomatic Scala

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is changed to private


- `CREATE BRANCH` is implemented as a Spark SQL extension command.
- The referenced table must be a Lance table.
- Creating a branch from a non-existent branch, tag, or version returns an error.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lacks corresponding test cases

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BaseBranchDDLTest's three corresponding test cases are added.

  • testCreateBranchFailsWhenSourceBranchDoesNotExist
  • testCreateBranchFailsWhenSourceTagDoesNotExist
  • testCreateBranchFailsWhenSourceVersionDoesNotExist

@fangbo

fangbo commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

@LuciferYang Thanks for your more detailed reviews. I have made some modification. Could you please look at it ?

@LuciferYang LuciferYang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fangbo fangbo merged commit fbc11fc into lance-format:main Jun 17, 2026
17 checks passed
@fangbo fangbo deleted the ddl-branch branch June 23, 2026 11:29
@fangbo fangbo mentioned this pull request Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants