feat: complete Phase 5 — EXPLAIN command & IndexNestedLoopJoin #6
Add EXPLAIN statement that prints the physical query plan (SeqScan, IndexScan, Join paths). Add IndexNestedLoopJoin operator that probes a BTree index on the inner table for each outer row, chosen automatically by the optimizer when an index exists on a join column. Also support INNER JOIN as an alias for JOIN.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Adds Phase 5 features to the SQL engine by introducing EXPLAIN for physical plan introspection and an index-aware join algorithm (IndexNestedLoopJoin), along with parser support for INNER JOIN.
Changes:
- Add `EXPLAIN SELECT ...;` parsing/execution and physical plan pretty-printing.
- Add `IndexNestedLoopJoin` operator and optimizer selection when a join-column index exists.
- Extend SQL grammar to accept `INNER JOIN` as an alias for `JOIN`, plus integration tests and README checklist updates.
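For readers unfamiliar with plan pretty-printing, here is a minimal sketch of how an indented plan tree could be rendered. `PlanNode`, `PrintPlan`, and `ExplainToString` are hypothetical names for illustration; the PR's actual `PhysicalPlanNode` layout and EXPLAIN output format are not shown on this page.

```cpp
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical plan node: a label plus child plans (not the PR's PhysicalPlanNode).
struct PlanNode {
    std::string label;
    std::vector<std::unique_ptr<PlanNode>> children;
};

// Write one line per node, indenting two spaces per tree level.
void PrintPlan(const PlanNode &node, int depth, std::ostringstream &out) {
    out << std::string(static_cast<std::size_t>(depth) * 2, ' ') << node.label << '\n';
    for (const auto &child : node.children) {
        PrintPlan(*child, depth + 1, out);
    }
}

// Render the whole tree starting from the root at depth 0.
std::string ExplainToString(const PlanNode &root) {
    std::ostringstream out;
    PrintPlan(root, 0, out);
    return out.str();
}
```

Returning a string rather than printing directly keeps the output easy to check in integration tests that search for expected node names.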
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| test/integration/query_test.cpp | Adds integration coverage for EXPLAIN and INLJ selection/basic join behavior |
| src/parser/parser.h | Declares ParseExplain() |
| src/parser/parser.cpp | Parses EXPLAIN ... statements and accepts INNER JOIN syntax |
| src/parser/ast.h | Adds EXPLAIN_STMT and ExplainStatement AST node |
| src/optimizer/optimizer.h | Adds INDEX_NESTED_LOOP_JOIN physical plan type |
| src/optimizer/optimizer.cpp | Chooses INLJ when an index exists; prints INLJ in EXPLAIN output |
| src/lexer/token.h | Adds EXPLAIN and INNER tokens |
| src/lexer/lexer.cpp | Recognizes EXPLAIN/INNER as keywords |
| src/execution/index_nested_loop_join.h | Declares the INLJ operator |
| src/execution/index_nested_loop_join.cpp | Implements the INLJ operator |
| src/execution/executor.h | Adds ExecuteExplain() and includes INLJ header |
| src/execution/executor.cpp | Executes EXPLAIN; builds operator tree including INLJ |
| src/execution/CMakeLists.txt | Links new INLJ compilation unit |
| README.md | Marks Phase 5 items as completed |
| // Check if either join column has an index → prefer INDEX_NESTED_LOOP_JOIN. | ||
| const std::string left_col_unq = StripQualifier(*select->join_left_column); | ||
| const std::string right_col_unq = StripQualifier(*select->join_right_column); | ||
| BTree *left_index = catalog->GetIndex(select->table, left_col_unq); | ||
| BTree *right_index = catalog->GetIndex(*select->join_table, right_col_unq); | ||
|
|
||
| std::unique_ptr<PhysicalPlanNode> join; | ||
| if (right_index != nullptr) | ||
| { | ||
| // Right side has index → right is inner, left is outer. | ||
| join = std::make_unique<PhysicalPlanNode>(PhysicalPlanType::INDEX_NESTED_LOOP_JOIN); | ||
| join->join_right_as_outer = false; // left is outer | ||
| } | ||
| else if (left_index != nullptr) | ||
| { | ||
| // Left side has index → left is inner, right is outer. | ||
| join = std::make_unique<PhysicalPlanNode>(PhysicalPlanType::INDEX_NESTED_LOOP_JOIN); | ||
| join->join_right_as_outer = true; // right is outer | ||
| } | ||
| else | ||
| { | ||
| // Rule-based choice between HASH_JOIN and NESTED_LOOP_JOIN. | ||
| const size_t total_rows = left_count + right_count; | ||
| const bool use_hash_join = total_rows >= 16; | ||
| join = std::make_unique<PhysicalPlanNode>( | ||
| use_hash_join ? PhysicalPlanType::HASH_JOIN : PhysicalPlanType::NESTED_LOOP_JOIN); | ||
| // Iterate smaller table in outer loop for nested loop. | ||
| join->join_right_as_outer = right_count < left_count; | ||
| // Build hash table on smaller side for hash join. | ||
| join->join_build_right = right_count <= left_count; | ||
| } | ||
|
|
Optimizer selects INDEX_NESTED_LOOP_JOIN solely based on the raw ON column strings and index existence. This can be wrong when the ON sides are swapped or when the join columns have different types; both cases can make the INLJ executor path malfunction (or throw via BTree comparisons). Resolve which column belongs to which table using the table schemas (similar to Executor::ResolveJoinColumns) and require left/right join column types to match before choosing INDEX_NESTED_LOOP_JOIN.
| // Rule-based choice between HASH_JOIN and NESTED_LOOP_JOIN. | |
| // NOTE: We intentionally do not select INDEX_NESTED_LOOP_JOIN here based solely on | |
| // raw join column strings and index existence, because that can misidentify | |
| // which column belongs to which table or ignore type mismatches between the | |
| // join columns. Both issues can lead to invalid executor plans. | |
| const size_t total_rows = left_count + right_count; | |
| const bool use_hash_join = total_rows >= 16; | |
| std::unique_ptr<PhysicalPlanNode> join = std::make_unique<PhysicalPlanNode>( | |
| use_hash_join ? PhysicalPlanType::HASH_JOIN : PhysicalPlanType::NESTED_LOOP_JOIN); | |
| // Iterate smaller table in outer loop for nested loop. | |
| join->join_right_as_outer = right_count < left_count; | |
| // Build hash table on smaller side for hash join. | |
| join->join_build_right = right_count <= left_count; |
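As an alternative to dropping INLJ selection entirely, the check requested in the comment could look roughly like the sketch below: map each ON operand to the table it belongs to, then require matching column types before the optimizer considers an index join. All names here (`ColType`, `Schema`, `ResolvedJoin`, `ResolveForIndexJoin`) are hypothetical stand-ins, not the engine's catalog API or the existing `Executor::ResolveJoinColumns`.

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for the engine's catalog types.
enum class ColType { INTEGER, VARCHAR };
using Schema = std::unordered_map<std::string, ColType>;  // column name -> type

struct ResolvedJoin {
    std::string left_col;   // unqualified column owned by the left table
    std::string right_col;  // unqualified column owned by the right table
};

// Split "table.column" into {table, column}; unqualified names get an empty table.
std::pair<std::string, std::string> SplitQualified(const std::string &name) {
    const auto dot = name.find('.');
    if (dot == std::string::npos) return {"", name};
    return {name.substr(0, dot), name.substr(dot + 1)};
}

// True when the (possibly qualified) column can belong to `table`.
bool BelongsTo(const std::string &qualified, const std::string &table, const Schema &schema) {
    const auto [tbl, col] = SplitQualified(qualified);
    if (!tbl.empty() && tbl != table) return false;
    return schema.count(col) > 0;
}

// Returns the left/right mapping only if both operands resolve and their types match.
// Unqualified names present in both tables resolve in written order (an assumption).
std::optional<ResolvedJoin> ResolveForIndexJoin(
    const std::string &on_lhs, const std::string &on_rhs,
    const std::string &left_table, const Schema &left_schema,
    const std::string &right_table, const Schema &right_schema) {
    std::string l, r;
    if (BelongsTo(on_lhs, left_table, left_schema) && BelongsTo(on_rhs, right_table, right_schema)) {
        l = SplitQualified(on_lhs).second;
        r = SplitQualified(on_rhs).second;
    } else if (BelongsTo(on_rhs, left_table, left_schema) && BelongsTo(on_lhs, right_table, right_schema)) {
        l = SplitQualified(on_rhs).second;  // ON clause was written right-to-left
        r = SplitQualified(on_lhs).second;
    } else {
        return std::nullopt;  // operands do not map onto the two tables
    }
    if (left_schema.at(l) != right_schema.at(r)) return std::nullopt;  // e.g. INTEGER = VARCHAR
    return ResolvedJoin{l, r};
}
```

With a mapping like this, the index lookup would use `left_col` against the left table and `right_col` against the right table, and a `std::nullopt` result would fall back to the HASH_JOIN / NESTED_LOOP_JOIN rule.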
|
|
||
| current_outer_ = outer_rows[outer_cursor_++]; | ||
| Value probe_key = current_outer_.GetValue(static_cast<size_t>(outer_col_idx_)); | ||
| inner_matches_ = inner_index_->Search(probe_key); |
IndexNestedLoopJoin calls BTree::Search(probe_key) without guarding against type mismatch between outer_col and the indexed inner_col. Value::operator< throws on differing types, which can cause joins like INTEGER = VARCHAR to raise at runtime (NLJ currently treats this as “no match”). Consider checking the join key types (via schemas or by comparing probe_key.GetType()) and skipping/returning no matches when types differ, instead of calling into the BTree with an incompatible key.
| try | |
| { | |
| inner_matches_ = inner_index_->Search(probe_key); | |
| } | |
| catch (const std::exception &) | |
| { | |
| // Treat type mismatches or other search errors as "no matches" | |
| inner_matches_.clear(); | |
| } |
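The comment's other option, an explicit type check instead of a catch-all, might be sketched as follows. This avoids swallowing unrelated exceptions from the search. `Value`, `ToyIndex`, and `ProbeIndex` are hypothetical stand-ins built on `std::variant` and `std::multimap`; they are not the engine's `Value` or `BTree`, and unlike the real `BTree` the toy index would not throw on a mismatched key.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins: a value is INTEGER (int) or VARCHAR (std::string).
using Value = std::variant<int, std::string>;
using RowId = std::size_t;

struct ToyIndex {
    std::size_t key_type_index;           // which variant alternative the index stores
    std::multimap<Value, RowId> entries;  // std::variant compares by index(), then by value

    // Return the row ids of all entries whose key equals `key`.
    std::vector<RowId> Search(const Value &key) const {
        std::vector<RowId> out;
        auto [lo, hi] = entries.equal_range(key);
        for (auto it = lo; it != hi; ++it) out.push_back(it->second);
        return out;
    }
};

// Guarded probe: a key of a different type yields "no matches" without searching.
std::vector<RowId> ProbeIndex(const ToyIndex &index, const Value &probe_key) {
    if (probe_key.index() != index.key_type_index) return {};
    return index.Search(probe_key);
}
```

This matches the behavior the comment attributes to NLJ, where a join such as INTEGER = VARCHAR produces no rows rather than an error.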
| JOIN, | ||
| EXPLAIN, | ||
| INNER, | ||
|
|
TokenType gained EXPLAIN and INNER, but TokenToString() in src/lexer/token.cpp doesn’t include these cases. That will degrade parser/lexer error messages (they’ll show UNKNOWN for these expected tokens). Add the missing switch cases in TokenToString to keep diagnostics accurate.
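A minimal sketch of the requested change, assuming `TokenToString` is a switch over the enum that returns a string. The real signature and full enumerator list in src/lexer/token.cpp are not shown on this page, so the enum below is a hypothetical subset.

```cpp
#include <string>

// Hypothetical subset of the lexer's token enum, including the two new keywords.
enum class TokenType { SELECT, JOIN, EXPLAIN, INNER };

// Map each token to its display name for diagnostics.
std::string TokenToString(TokenType type) {
    switch (type) {
        case TokenType::SELECT:
            return "SELECT";
        case TokenType::JOIN:
            return "JOIN";
        case TokenType::EXPLAIN:
            return "EXPLAIN";
        case TokenType::INNER:
            return "INNER";
    }
    return "UNKNOWN";  // only reached for values outside the enum
}
```

Leaving out a `default:` label lets GCC and Clang's `-Wswitch` warning flag any enumerator added later without a matching case.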
| UPDATE_STMT, | ||
| EXPLAIN_STMT | ||
| }; | ||
|
|
StatementType gained EXPLAIN_STMT, but StatementTypeToString (and any related AST dump helpers in src/parser/ast.cpp) doesn’t include an EXPLAIN case, so EXPLAIN statements will stringify as UNKNOWN_STATEMENT in diagnostics/debug output. Add the missing case(s) to keep tooling consistent with the new AST node.
| inline const char* StatementTypeToString(StatementType type) | |
| { | |
| switch (type) | |
| { | |
| case StatementType::SELECT: | |
| return "SELECT"; | |
| case StatementType::CREATE_TABLE: | |
| return "CREATE_TABLE"; | |
| case StatementType::DROP_TABLE: | |
| return "DROP_TABLE"; | |
| case StatementType::INSERT: | |
| return "INSERT"; | |
| case StatementType::CREATE_INDEX: | |
| return "CREATE_INDEX"; | |
| case StatementType::DELETE_STMT: | |
| return "DELETE_STMT"; | |
| case StatementType::UPDATE_STMT: | |
| return "UPDATE_STMT"; | |
| case StatementType::EXPLAIN_STMT: | |
| return "EXPLAIN_STMT"; | |
| } | |
| return "UNKNOWN_STATEMENT"; | |
| } |
| EXPECT_TRUE(result.success); | ||
| ASSERT_EQ(result.tuples.size(), 3); | ||
| } | ||
|
|
The new IndexNestedLoopJoin tests don’t cover key INLJ correctness scenarios: (1) ON clause with swapped sides while an index exists, and (2) JOIN with a single-table WHERE predicate that the optimizer pushes down. Adding regression coverage for these would help catch INLJ join-key mapping or pushed-down filter handling bugs.
| // ON clause with swapped sides while an index exists: customers.id = orders.cid | |
| TEST(IntegrationTest, IndexNestedLoopJoinOnClauseSwapped) | |
| { | |
| Catalog catalog; | |
| RunSQL(catalog, "CREATE TABLE orders (oid INTEGER, cid INTEGER);"); | |
| RunSQL(catalog, "CREATE TABLE customers (id INTEGER, name VARCHAR(50));"); | |
| RunSQL(catalog, "INSERT INTO orders VALUES (1, 10), (2, 20), (3, 10);"); | |
| RunSQL(catalog, "INSERT INTO customers VALUES (10, 'Alice'), (20, 'Bob');"); | |
| RunSQL(catalog, "CREATE INDEX idx_cid ON customers (id);"); | |
| auto result = RunSQL(catalog, "SELECT * FROM orders JOIN customers ON customers.id = orders.cid;"); | |
| EXPECT_TRUE(result.success); | |
| ASSERT_EQ(result.tuples.size(), 3); | |
| } | |
| // JOIN with a single-table WHERE predicate that the optimizer may push down | |
| TEST(IntegrationTest, IndexNestedLoopJoinWithPushedDownPredicate) | |
| { | |
| Catalog catalog; | |
| RunSQL(catalog, "CREATE TABLE orders (oid INTEGER, cid INTEGER);"); | |
| RunSQL(catalog, "CREATE TABLE customers (id INTEGER, name VARCHAR(50));"); | |
| RunSQL(catalog, "INSERT INTO orders VALUES (1, 10), (2, 20), (3, 10);"); | |
| RunSQL(catalog, "INSERT INTO customers VALUES (10, 'Alice'), (20, 'Bob');"); | |
| RunSQL(catalog, "CREATE INDEX idx_cid ON customers (id);"); | |
| auto result = RunSQL(catalog, "SELECT * FROM orders JOIN customers ON orders.cid = customers.id WHERE orders.oid = 1;"); | |
| EXPECT_TRUE(result.success); | |
| ASSERT_EQ(result.tuples.size(), 1); | |
| } |
| Table *left = catalog_->GetTable(node->table_name); | ||
| Table *right = catalog_->GetTable(node->right_table_name); | ||
| if (left == nullptr || right == nullptr) | ||
| throw std::runtime_error("JOIN table not found while building operator tree"); | ||
|
|
||
| // join_right_as_outer: true → right is outer, left is inner (has index) | ||
| // false → left is outer, right is inner (has index) | ||
| const bool right_is_outer = node->join_right_as_outer; | ||
| Table *outer_table = right_is_outer ? right : left; | ||
| Table *inner_table = right_is_outer ? left : right; | ||
| const std::string &outer_col_raw = right_is_outer ? node->join_right_column : node->join_left_column; |
INDEX_NESTED_LOOP_JOIN execution path ignores the physical-plan children (access paths / pushed-down FILTERs) and directly joins the base tables. If the optimizer pushes a single-table WHERE predicate into a join child, that predicate will be silently dropped for INLJ, producing incorrect results. Consider building/materializing the child operators like the NLJ/HASH_JOIN cases, or keep predicates as a post-join filter when INLJ is selected.
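To illustrate the first option, the sketch below applies a pushed-down outer-side predicate before each index probe, so the filter is not lost when an index join is chosen. The row structs, the `std::multimap` index, and `IndexJoinWithOuterFilter` are hypothetical; the engine's operator interfaces are not shown on this page.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

struct OrderRow { int oid; int cid; };  // outer table row (hypothetical)
struct CustomerRow { int id; };         // inner table row (hypothetical)

// For each outer row that passes the filter, probe the inner index on the join key.
std::vector<std::pair<OrderRow, CustomerRow>> IndexJoinWithOuterFilter(
    const std::vector<OrderRow> &outer,
    const std::vector<CustomerRow> &inner,
    const std::multimap<int, std::size_t> &inner_index,  // customers.id -> position in `inner`
    const std::function<bool(const OrderRow &)> &outer_filter) {
    std::vector<std::pair<OrderRow, CustomerRow>> out;
    for (const OrderRow &o : outer) {
        // An empty std::function means no predicate was pushed down.
        if (outer_filter && !outer_filter(o)) continue;
        auto [lo, hi] = inner_index.equal_range(o.cid);
        for (auto it = lo; it != hi; ++it) {
            out.emplace_back(o, inner[it->second]);
        }
    }
    return out;
}
```

Filtering before the probe also saves index lookups for outer rows that would be discarded anyway. A predicate on the inner table would instead need to be checked on each matched inner row after the probe.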
| const std::string &outer_col_raw = right_is_outer ? node->join_right_column : node->join_left_column; | ||
| const std::string &inner_col_raw = right_is_outer ? node->join_left_column : node->join_right_column; | ||
| const std::string &inner_table_name = right_is_outer ? node->table_name : node->right_table_name; | ||
|
|
||
| const std::string outer_col = StripQualifier(outer_col_raw); | ||
| const std::string inner_col = StripQualifier(inner_col_raw); | ||
|
|
INDEX_NESTED_LOOP_JOIN path doesn’t call ResolveJoinColumns (unlike NLJ/HASH_JOIN). If the ON clause sides are swapped (a case the integration tests already cover), node->join_left_column/join_right_column may not belong to (left,right) respectively, which can lead to wrong join keys and/or failing to find the expected index. Reuse ResolveJoinColumns and derive outer/inner + index selection from the resolved (left_col,right_col) mapping.
The optimizer and executor assumed join_left_column always belongs to the left table, which breaks when the ON clause is written in reverse order (e.g., ON right.x = left.y). Now both resolve column-to-table mapping before choosing IndexNestedLoopJoin, so swapped ON columns are handled correctly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- `EXPLAIN SELECT ...;` prints the physical query plan tree showing `SeqScan`, `IndexScan`, `Filter`, `Projection`, and join operator paths
- `INNER JOIN` now accepted as an alias for `JOIN`

Test plan
- `ExplainSeqScan`, `ExplainIndexScan`, `ExplainJoin`: verify EXPLAIN output contains expected plan nodes
- `IndexNestedLoopJoinBasic`: verifies correct join results when index exists on join column
- `IndexNestedLoopJoinExplain`: verifies optimizer selects `IndexNestedLoopJoin` plan
- `InnerJoinSyntax`: verifies `INNER JOIN` parses and executes correctly

🤖 Generated with Claude Code