feat: support ddl for branch operation#576
Conversation
|
will review this one later |
Thanks very much. I have an idea about grammar definitions that I'd like to discuss. There is another grammar definition like: -- create branch from main's specific version
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF MAIN <main_version>
-- create branch from branch's specific version
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF BRANCH <source_branch_name> <branch_version>
-- create branch from tag
ALTER TABLE <table_name> CREATE BRANCH <branch_name> VERSION AS OF TAG <source_tag_name> What do you think about these two choices ? cc @hamersaw |
Thanks @fangbo, the keyword form is a lot better than Since we already follow Iceberg's extension model closely, the cleanest option is probably to match its branch DDL where we can: ALTER TABLE t CREATE BRANCH [IF NOT EXISTS] <name> [AS OF VERSION <id>] [RETAIN <n> DAYS]
ALTER TABLE t DROP BRANCH [IF EXISTS] <name>The catch is that Iceberg has one global snapshot space, so ALTER TABLE t CREATE BRANCH [IF NOT EXISTS] <name>
[ AS OF VERSION <v> -- main @ v
| AS OF BRANCH <src> [VERSION <v>]
| AS OF TAG <tag> ]That keeps Iceberg's Couple of cheap wins worth folding in now, both already in Iceberg: Iceberg also skips |
| import java.util.stream.IntStream; | ||
|
|
||
| /** Base tests for BRANCH DDL commands. */ | ||
| public abstract class BaseBranchDDLTest { |
There was a problem hiding this comment.
I think we should add the negative-path integration tests for the error cases the docs guarantee, a backtick-quoted branch-name test (locks in cleanIdentifier), and a few LanceSqlExtensionsAstBuilderTest cases for the new visit methods.
| case _ => throw new UnsupportedOperationException("CreateBranch only supports LanceDataset") | ||
| } | ||
|
|
||
| val dataset = Utils.openDatasetBuilder(lanceDataset.readOptions()).build() |
There was a problem hiding this comment.
The branch execs look like they'll miss credential-vended catalogs. CreateBranchExec (and DropBranchExec / ShowBranchesExec) open with a bare Utils.openDatasetBuilder(readOptions).build(), whereas OptimizeExec / AddIndexExec thread the catalog's vended options into the open:
// OptimizeExec
val initialStorageOpts = catalog match {
case ns: BaseLanceNamespaceSparkCatalog =>
Option(lanceDataset.getInitialStorageOptions).map(_.asScala.toMap)
case _ => None
}
val dataset = Utils.openDatasetBuilder(readOptions)
.initialStorageOptions(initialStorageOpts.map(_.asJava).orNull)
.build()So on a credential-vending namespace catalog over S3/GCS, the open (and the branch write) won't have the vended creds. It passes today because the dir catalog's options are non-empty and the Glue/S3 docker test is skipped, so CI never hits this path.
Related sharp edge: CreateBranchExec calls the 3-arg createBranch(name, ref, dataset.getLatestStorageOptions). getLatestStorageOptions is the manifest map (no creds), and the 3-arg overload throws on a null/empty map (checkArgument(opts != null && !opts.isEmpty(), ...)). The 2-arg createBranch(name, ref) has no such precondition.
So I'd thread initialStorageOptions into the open like above, then use the 2-arg createBranch(name, ref) — the write picks up creds from the open.
| import org.apache.spark.sql.types.{DataTypes, StructField, StructType} | ||
| import org.lance.Ref | ||
|
|
||
| case class CreateBranch( |
There was a problem hiding this comment.
Using LanceCreateBranch could proactively prevent potential naming conflicts.
| else CreateBranch( | ||
| table, | ||
| branchName, | ||
| org.lance.Ref.ofMain(java.lang.Long.valueOf(ctx.refMainVersion.getText))) |
There was a problem hiding this comment.
There is a risk of out-of-bounds overflow. It is recommended to catch NumberFormatException and wrap it into a ParseException.
| branchName, | ||
| ref, | ||
| dataset.getLatestStorageOptions) | ||
| branchDs.close() |
There was a problem hiding this comment.
branchDs.close() sits in the try body, not protected by its own finally.
| | Column | Type | Description | | ||
| |--------|------|-------------| | ||
| | `name` | String | Branch name | | ||
| | `parent_branch` | String | Source branch name if the branch was created from another branch; otherwise empty | |
There was a problem hiding this comment.
otherwise empty -> otherwise NULL ?
| StructField("name", DataTypes.StringType, nullable = false), | ||
| StructField("parent_branch", DataTypes.StringType, nullable = true), | ||
| StructField("parent_version", DataTypes.LongType, nullable = false), | ||
| StructField("create_at", DataTypes.LongType, nullable = false), |
| ; | ||
|
|
||
|
|
||
| ; No newline at end of file |
There was a problem hiding this comment.
should restore the trailing newline
|
@LuciferYang Greatly thanks for your detailed suggestion. I have made some modification according to your advice. |
|
Thanks for the PR! This is great. Overall looks great, my only comment is the main reason I never wrapped up the initial proposal for this - basically I didn't have time to drive consensus. There was quite a bit of conversation on #198 about how we wanted to handle branches / tags / versions in this connector, the two competing approaches:
The former is what is proposed in this PR, but IIR we had loose consensus that the latter was a little cleaner?! Also the latter means we can have a single style that is used across connectors (Spark / Presto / Trino / DuckDB) with the base parsing code in lance core repo. |
|
@hamersaw Thanks for your feedback. Branch/Tag operations involve DDL and DML. DDLThis PR focuses on branch DDL. Branch/Tag is part of the Dataset, so from a Branch/Tag management perspective, it's more intuitive for the Dataset to be the target of ALTER operations. Therefore, I lean more towards: DMLAs for DML, I believe one consensus reached in the previous PR#198 was to treat Branch/Tag as ordinary tables. For example:: or
|
|
cc @majin1102 |
|
@LuciferYang @hamersaw Do you have some more suggestion about this PR ? |
| java.lang.Long.valueOf(value) | ||
| } catch { | ||
| case _: NumberFormatException => | ||
| throw new ParseException("Can't parse value:" + value + " to version", ctx) |
There was a problem hiding this comment.
On every supported Spark version, the ParseException(String, ParserRuleContext) constructor's first parameter is an error class, not a message — it delegates to SparkThrowableHelper.getMessage(errorClass, Map.empty).
| else LanceCreateBranch( | ||
| table, | ||
| branchName, | ||
| org.lance.Ref.ofMain(java.lang.Long.valueOf(ctx.refMainVersion.getText)), |
There was a problem hiding this comment.
use org.lance.Ref.ofMain(_parseVersion(ctx, ctx.refMainVersion.getText))?
| else LanceCreateBranch( | ||
| table, | ||
| branchName, | ||
| org.lance.Ref.ofBranch(refBranchName, java.lang.Long.valueOf(ctx.refBranchVersion.getText)), |
| INDEX: 'INDEX'; | ||
| INDEXES: 'INDEXES'; | ||
| KEY: 'KEY'; | ||
| MAIN: 'MAIN'; |
There was a problem hiding this comment.
MAIN: 'MAIN'; is declared but no parser rule references it.
| import org.apache.spark.sql.catalyst.plans.logical.LanceCreateBranchOutputType | ||
| import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog} | ||
| import org.apache.spark.unsafe.types.UTF8String | ||
| import org.lance.spark.{BaseLanceNamespaceSparkCatalog, LanceDataset} |
There was a problem hiding this comment.
BaseLanceNamespaceSparkCatalog seems unused import
| if (!ifNotExists || !alreadyExists) { | ||
| var branchDs: org.lance.Dataset = null | ||
| try { | ||
| branchDs = dataset.createBranch(branchName, ref, dataset.getInitialStorageOptions) |
There was a problem hiding this comment.
val opts = dataset.getInitialStorageOptions
branchDs =
if (opts == null || opts.isEmpty) dataset.createBranch(branchName, ref)
else dataset.createBranch(branchName, ref, opts)| .build() | ||
|
|
||
| try { | ||
| val alreadyExists = dataset.branches().list().asScala.exists(_.getName == branchName) |
There was a problem hiding this comment.
Both LanceCreateBranchExec and LanceDropBranchExec do branches().list() then create/delete. A concurrent CREATE BRANCH between the two calls still surfaces the native "already exists" error despite IF NOT EXISTS. For DDL this is acceptable, but catching the native already-exists/not-found error instead would be atomic and saves one branch-listing call per command (object-store IO against the table root — a read_dir over the refs path). Also: both execs compute the existence check eagerly (LanceCreateBranchExec.scala:46, LanceDropBranchExec.scala:45), so the list() call is issued even when the IF [NOT] EXISTS flag is absent and its result goes unused.
There was a problem hiding this comment.
Exception throws when creating a existed branch:
java.lang.RuntimeException: Encountered internal error. Please file a bug report at https://github.com/lance-format/lance/issues. Clone operation should not enter build_manifest., /Users/runner/work/lance/lance/rust/lance/src/dataset/transaction.rs:1866:28
at org.lance.Dataset.nativeCreateBranch(Native Method)
at org.lance.Dataset.innerCreateBranch(Dataset.java:1761)
at org.lance.Dataset.createBranch(Dataset.java:1753)
This means that existence can not be checked from the error. I think it is reasonable to list branches to check the branch exist or not.
| } | ||
| } | ||
|
|
||
| def _parseVersion(ctx: ParserRuleContext, value: String): Long = { |
There was a problem hiding this comment.
Public def _parseVersion with a leading underscore is unidiomatic Scala
There was a problem hiding this comment.
This method is changed to private
|
|
||
| - `CREATE BRANCH` is implemented as a Spark SQL extension command. | ||
| - The referenced table must be a Lance table. | ||
| - Creating a branch from a non-existent branch, tag, or version returns an error. |
There was a problem hiding this comment.
Lacks corresponding test cases
There was a problem hiding this comment.
BaseBranchDDLTest's three corresponding test cases are added.
- testCreateBranchFailsWhenSourceBranchDoesNotExist
- testCreateBranchFailsWhenSourceTagDoesNotExist
- testCreateBranchFailsWhenSourceVersionDoesNotExist
|
@LuciferYang Thanks for your more detailed reviews. I have made some modification. Could you please look at it ? |
Summary
This PR adds DDL support for branch operations in Lance Spark.
Previously, branch-related operations did not support DDL workflows properly. With this change, users can perform DDL on branches more consistently, improving branch management usability and aligning branch behavior with expected table operation semantics.
Changes
Motivation
Branch operations are an important part of versioned data workflows. Supporting DDL on branches makes these workflows more complete and practical, especially for users who need to manage schema or table-related changes in isolated branches before merging.