
feat: complete Phase 5 — EXPLAIN command & IndexNestedLoopJoin#6

Merged
snigenigmatic merged 2 commits into master from explain
Mar 26, 2026

@snigenigmatic

Owner

Summary

  • EXPLAIN command: EXPLAIN SELECT ...; prints the physical query plan tree showing SeqScan, IndexScan, Filter, Projection, and join operator paths
  • IndexNestedLoopJoin: new join operator that probes a BTree index on the inner table for each outer row — chosen automatically by the optimizer when an index exists on a join column
  • INNER JOIN syntax: INNER JOIN now accepted as an alias for JOIN
  • Phase 5 (JOIN operations) is now fully complete
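
The index-probe idea behind IndexNestedLoopJoin can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: `std::multimap` stands in for the BTree index, and `Row`, `Index`, `BuildIndex`, and `IndexNestedLoopJoin` are hypothetical names chosen for the sketch.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical row type: a flat list of integer column values.
using Row = std::vector<int>;

// Stand-in for a BTree index: maps a key to the positions of matching inner rows.
using Index = std::multimap<int, std::size_t>;

// Build an index over one column of the inner table.
Index BuildIndex(const std::vector<Row> &inner, std::size_t col)
{
    Index index;
    for (std::size_t i = 0; i < inner.size(); ++i)
        index.emplace(inner[i][col], i);
    return index;
}

// For each outer row, probe the index instead of scanning the inner table.
// Each output row is the outer columns followed by the inner columns.
std::vector<Row> IndexNestedLoopJoin(const std::vector<Row> &outer, std::size_t outer_col,
                                     const std::vector<Row> &inner, const Index &inner_index)
{
    std::vector<Row> out;
    for (const Row &o : outer)
    {
        auto range = inner_index.equal_range(o[outer_col]);
        for (auto it = range.first; it != range.second; ++it)
        {
            Row joined = o;
            const Row &i = inner[it->second];
            joined.insert(joined.end(), i.begin(), i.end());
            out.push_back(std::move(joined));
        }
    }
    return out;
}
```

With an ordered index, each probe costs a logarithmic lookup plus the number of matches, so the join avoids rescanning the inner table for every outer row.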

Test plan

  • All 6 existing test suites pass (value, schema, tuple, lexer, parser, query)
  • New tests: ExplainSeqScan, ExplainIndexScan, ExplainJoin — verify EXPLAIN output contains expected plan nodes
  • New tests: IndexNestedLoopJoinBasic — verifies correct join results when index exists on join column
  • New tests: IndexNestedLoopJoinExplain — verifies optimizer selects IndexNestedLoopJoin plan
  • New tests: InnerJoinSyntax — verifies INNER JOIN parses and executes correctly

🤖 Generated with Claude Code

Add EXPLAIN statement that prints the physical query plan (SeqScan,
IndexScan, Join paths). Add IndexNestedLoopJoin operator that probes
a BTree index on the inner table for each outer row, chosen
automatically by the optimizer when an index exists on a join column.
Also support INNER JOIN as an alias for JOIN.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 26, 2026 06:17

Copilot AI left a comment

Pull request overview

Adds Phase 5 features to the SQL engine by introducing EXPLAIN for physical plan introspection and an index-aware join algorithm (IndexNestedLoopJoin), along with parser support for INNER JOIN.

Changes:

  • Add EXPLAIN SELECT ...; parsing/execution and physical plan pretty-printing.
  • Add IndexNestedLoopJoin operator and optimizer selection when a join-column index exists.
  • Extend SQL grammar to accept INNER JOIN as an alias for JOIN, plus integration tests and README checklist updates.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| test/integration/query_test.cpp | Adds integration coverage for EXPLAIN and INLJ selection/basic join behavior |
| src/parser/parser.h | Declares ParseExplain() |
| src/parser/parser.cpp | Parses EXPLAIN ... statements and accepts INNER JOIN syntax |
| src/parser/ast.h | Adds EXPLAIN_STMT and ExplainStatement AST node |
| src/optimizer/optimizer.h | Adds INDEX_NESTED_LOOP_JOIN physical plan type |
| src/optimizer/optimizer.cpp | Chooses INLJ when an index exists; prints INLJ in EXPLAIN output |
| src/lexer/token.h | Adds EXPLAIN and INNER tokens |
| src/lexer/lexer.cpp | Recognizes EXPLAIN/INNER as keywords |
| src/execution/index_nested_loop_join.h | Declares the INLJ operator |
| src/execution/index_nested_loop_join.cpp | Implements the INLJ operator |
| src/execution/executor.h | Adds ExecuteExplain() and includes INLJ header |
| src/execution/executor.cpp | Executes EXPLAIN; builds operator tree including INLJ |
| src/execution/CMakeLists.txt | Links new INLJ compilation unit |
| README.md | Marks Phase 5 items as completed |


Comment thread src/optimizer/optimizer.cpp Outdated
Comment on lines +389 to +420
// Check if either join column has an index → prefer INDEX_NESTED_LOOP_JOIN.
const std::string left_col_unq = StripQualifier(*select->join_left_column);
const std::string right_col_unq = StripQualifier(*select->join_right_column);
BTree *left_index = catalog->GetIndex(select->table, left_col_unq);
BTree *right_index = catalog->GetIndex(*select->join_table, right_col_unq);

std::unique_ptr<PhysicalPlanNode> join;
if (right_index != nullptr)
{
    // Right side has index → right is inner, left is outer.
    join = std::make_unique<PhysicalPlanNode>(PhysicalPlanType::INDEX_NESTED_LOOP_JOIN);
    join->join_right_as_outer = false; // left is outer
}
else if (left_index != nullptr)
{
    // Left side has index → left is inner, right is outer.
    join = std::make_unique<PhysicalPlanNode>(PhysicalPlanType::INDEX_NESTED_LOOP_JOIN);
    join->join_right_as_outer = true; // right is outer
}
else
{
    // Rule-based choice between HASH_JOIN and NESTED_LOOP_JOIN.
    const size_t total_rows = left_count + right_count;
    const bool use_hash_join = total_rows >= 16;
    join = std::make_unique<PhysicalPlanNode>(
        use_hash_join ? PhysicalPlanType::HASH_JOIN : PhysicalPlanType::NESTED_LOOP_JOIN);
    // Iterate smaller table in outer loop for nested loop.
    join->join_right_as_outer = right_count < left_count;
    // Build hash table on smaller side for hash join.
    join->join_build_right = right_count <= left_count;
}

Copilot AI Mar 26, 2026

Optimizer selects INDEX_NESTED_LOOP_JOIN solely based on the raw ON column strings and index existence. This can be wrong when the ON sides are swapped or when the join columns have different types; both cases can make the INLJ executor path malfunction (or throw via BTree comparisons). Resolve which column belongs to which table using the table schemas (similar to Executor::ResolveJoinColumns) and require left/right join column types to match before choosing INDEX_NESTED_LOOP_JOIN.

Suggested change (replaces the commented lines above)
// Rule-based choice between HASH_JOIN and NESTED_LOOP_JOIN.
// NOTE: We intentionally do not select INDEX_NESTED_LOOP_JOIN here based solely on
// raw join column strings and index existence, because that can misidentify
// which column belongs to which table or ignore type mismatches between the
// join columns. Both issues can lead to invalid executor plans.
const size_t total_rows = left_count + right_count;
const bool use_hash_join = total_rows >= 16;
std::unique_ptr<PhysicalPlanNode> join = std::make_unique<PhysicalPlanNode>(
    use_hash_join ? PhysicalPlanType::HASH_JOIN : PhysicalPlanType::NESTED_LOOP_JOIN);
// Iterate smaller table in outer loop for nested loop.
join->join_right_as_outer = right_count < left_count;
// Build hash table on smaller side for hash join.
join->join_build_right = right_count <= left_count;
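
An alternative to dropping INLJ selection entirely is to resolve the ON columns against the table schemas and check their types first, as the comment recommends. The sketch below shows one way that check could look. `Schema`, `ColType`, `ResolvedJoin`, `SplitQualified`, `BelongsTo`, and `ResolveJoin` are all hypothetical names for illustration; the project's `Executor::ResolveJoinColumns` is not shown on this page and may work differently.

```cpp
#include <map>
#include <string>

enum class ColType { INTEGER, VARCHAR };
using Schema = std::map<std::string, ColType>; // hypothetical: column name -> type

struct ResolvedJoin
{
    bool ok = false;       // false when columns cannot be mapped or types differ
    std::string left_col;  // unqualified column that belongs to the left table
    std::string right_col; // unqualified column that belongs to the right table
};

// Split "table.column" into its parts; an unqualified name yields an empty table.
static void SplitQualified(const std::string &name, std::string &table, std::string &column)
{
    const std::string::size_type dot = name.find('.');
    table = (dot == std::string::npos) ? "" : name.substr(0, dot);
    column = (dot == std::string::npos) ? name : name.substr(dot + 1);
}

// True when `name` can refer to a column of the given table.
static bool BelongsTo(const std::string &name, const std::string &table, const Schema &schema)
{
    std::string qual, col;
    SplitQualified(name, qual, col);
    return (qual.empty() || qual == table) && schema.count(col) > 0;
}

// Map the two ON columns (in either order) onto (left, right) and require equal types.
// An unqualified name present in both tables resolves to the first orientation that fits.
ResolvedJoin ResolveJoin(const std::string &a, const std::string &b,
                         const std::string &left_table, const Schema &left,
                         const std::string &right_table, const Schema &right)
{
    ResolvedJoin r;
    std::string qual;
    if (BelongsTo(a, left_table, left) && BelongsTo(b, right_table, right))
    {
        SplitQualified(a, qual, r.left_col);
        SplitQualified(b, qual, r.right_col);
    }
    else if (BelongsTo(b, left_table, left) && BelongsTo(a, right_table, right))
    {
        SplitQualified(b, qual, r.left_col);
        SplitQualified(a, qual, r.right_col);
    }
    else
    {
        return r; // neither orientation fits
    }
    r.ok = left.at(r.left_col) == right.at(r.right_col);
    return r;
}
```

An optimizer could look up an index only on the resolved `left_col` / `right_col` and fall back to the hash or nested-loop choice when `ok` is false.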

Copilot uses AI. Check for mistakes.

current_outer_ = outer_rows[outer_cursor_++];
Value probe_key = current_outer_.GetValue(static_cast<size_t>(outer_col_idx_));
inner_matches_ = inner_index_->Search(probe_key);

Copilot AI Mar 26, 2026

IndexNestedLoopJoin calls BTree::Search(probe_key) without guarding against type mismatch between outer_col and the indexed inner_col. Value::operator< throws on differing types, which can cause joins like INTEGER = VARCHAR to raise at runtime (NLJ currently treats this as “no match”). Consider checking the join key types (via schemas or by comparing probe_key.GetType()) and skipping/returning no matches when types differ, instead of calling into the BTree with an incompatible key.

Suggested change (replaces the Search call above)
try
{
    inner_matches_ = inner_index_->Search(probe_key);
}
catch (const std::exception &)
{
    // Treat type mismatches or other search errors as "no matches"
    inner_matches_.clear();
}
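
The comment also mentions an explicit type check as an option. Unlike a catch-all handler, a type guard does not hide unrelated search errors. The sketch below illustrates that variant under stated assumptions: `Value`, `ValueType`, `TypedIndex`, and `SafeProbe` are hypothetical stand-ins, and `std::multimap` replaces the BTree.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

enum class ValueType { INTEGER, VARCHAR };

// Hypothetical value type whose operator< throws on mixed types,
// mirroring the behaviour the review describes for Value::operator<.
struct Value
{
    ValueType type;
    int int_val;
    std::string str_val;
};

bool operator<(const Value &a, const Value &b)
{
    if (a.type != b.type)
        throw std::runtime_error("cannot compare values of different types");
    return a.type == ValueType::INTEGER ? a.int_val < b.int_val : a.str_val < b.str_val;
}

// Stand-in for the BTree index: key -> row id, plus the indexed column's type.
struct TypedIndex
{
    ValueType key_type;
    std::multimap<Value, std::size_t> entries;
};

// Probe the index only when the key type matches the indexed column type.
std::vector<std::size_t> SafeProbe(const TypedIndex &index, const Value &probe_key)
{
    std::vector<std::size_t> matches;
    if (probe_key.type != index.key_type)
        return matches; // values of different types can never be equal
    auto range = index.entries.equal_range(probe_key);
    for (auto it = range.first; it != range.second; ++it)
        matches.push_back(it->second);
    return matches;
}
```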

Comment thread src/lexer/token.h
Comment on lines 35 to 38
JOIN,
EXPLAIN,
INNER,

Copilot AI Mar 26, 2026

TokenType gained EXPLAIN and INNER, but TokenToString() in src/lexer/token.cpp doesn’t include these cases. That will degrade parser/lexer error messages (they’ll show UNKNOWN for these expected tokens). Add the missing switch cases in TokenToString to keep diagnostics accurate.
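
The fix amounts to two extra switch cases. Since `src/lexer/token.cpp` is not shown here, the sketch below uses a deliberately reduced, hypothetical `TokenType` enum; only the `EXPLAIN` and `INNER` cases and the `UNKNOWN` fallback correspond to what the comment describes.

```cpp
#include <string>

// Reduced, hypothetical enum; the real TokenType has many more members.
enum class TokenType { SELECT, JOIN, EXPLAIN, INNER };

const char *TokenToString(TokenType type)
{
    switch (type)
    {
    case TokenType::SELECT:
        return "SELECT";
    case TokenType::JOIN:
        return "JOIN";
    case TokenType::EXPLAIN: // newly added token
        return "EXPLAIN";
    case TokenType::INNER:   // newly added token
        return "INNER";
    }
    return "UNKNOWN"; // fallback for values without a case
}
```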

Comment thread src/parser/ast.h
UPDATE_STMT,
EXPLAIN_STMT
};

Copilot AI Mar 26, 2026

StatementType gained EXPLAIN_STMT, but StatementTypeToString (and any related AST dump helpers in src/parser/ast.cpp) doesn’t include an EXPLAIN case, so EXPLAIN statements will stringify as UNKNOWN_STATEMENT in diagnostics/debug output. Add the missing case(s) to keep tooling consistent with the new AST node.

Suggested change
inline const char* StatementTypeToString(StatementType type)
{
    switch (type)
    {
    case StatementType::SELECT:
        return "SELECT";
    case StatementType::CREATE_TABLE:
        return "CREATE_TABLE";
    case StatementType::DROP_TABLE:
        return "DROP_TABLE";
    case StatementType::INSERT:
        return "INSERT";
    case StatementType::CREATE_INDEX:
        return "CREATE_INDEX";
    case StatementType::DELETE_STMT:
        return "DELETE_STMT";
    case StatementType::UPDATE_STMT:
        return "UPDATE_STMT";
    case StatementType::EXPLAIN_STMT:
        return "EXPLAIN_STMT";
    }
    return "UNKNOWN_STATEMENT";
}

    EXPECT_TRUE(result.success);
    ASSERT_EQ(result.tuples.size(), 3);
}

Copilot AI Mar 26, 2026

The new IndexNestedLoopJoin tests don’t cover key INLJ correctness scenarios: (1) ON clause with swapped sides while an index exists, and (2) JOIN with a single-table WHERE predicate that the optimizer pushes down. Adding regression coverage for these would help catch INLJ join-key mapping or pushed-down filter handling bugs.

Suggested change
// ON clause with swapped sides while an index exists: customers.id = orders.cid
TEST(IntegrationTest, IndexNestedLoopJoinOnClauseSwapped)
{
    Catalog catalog;
    RunSQL(catalog, "CREATE TABLE orders (oid INTEGER, cid INTEGER);");
    RunSQL(catalog, "CREATE TABLE customers (id INTEGER, name VARCHAR(50));");
    RunSQL(catalog, "INSERT INTO orders VALUES (1, 10), (2, 20), (3, 10);");
    RunSQL(catalog, "INSERT INTO customers VALUES (10, 'Alice'), (20, 'Bob');");
    RunSQL(catalog, "CREATE INDEX idx_cid ON customers (id);");
    auto result = RunSQL(catalog, "SELECT * FROM orders JOIN customers ON customers.id = orders.cid;");
    EXPECT_TRUE(result.success);
    ASSERT_EQ(result.tuples.size(), 3);
}

// JOIN with a single-table WHERE predicate that the optimizer may push down
TEST(IntegrationTest, IndexNestedLoopJoinWithPushedDownPredicate)
{
    Catalog catalog;
    RunSQL(catalog, "CREATE TABLE orders (oid INTEGER, cid INTEGER);");
    RunSQL(catalog, "CREATE TABLE customers (id INTEGER, name VARCHAR(50));");
    RunSQL(catalog, "INSERT INTO orders VALUES (1, 10), (2, 20), (3, 10);");
    RunSQL(catalog, "INSERT INTO customers VALUES (10, 'Alice'), (20, 'Bob');");
    RunSQL(catalog, "CREATE INDEX idx_cid ON customers (id);");
    auto result = RunSQL(catalog, "SELECT * FROM orders JOIN customers ON orders.cid = customers.id WHERE orders.oid = 1;");
    EXPECT_TRUE(result.success);
    ASSERT_EQ(result.tuples.size(), 1);
}

Comment thread src/execution/executor.cpp Outdated
Comment on lines +395 to +405
Table *left = catalog_->GetTable(node->table_name);
Table *right = catalog_->GetTable(node->right_table_name);
if (left == nullptr || right == nullptr)
    throw std::runtime_error("JOIN table not found while building operator tree");

// join_right_as_outer: true  → right is outer, left is inner (has index)
//                      false → left is outer, right is inner (has index)
const bool right_is_outer = node->join_right_as_outer;
Table *outer_table = right_is_outer ? right : left;
Table *inner_table = right_is_outer ? left : right;
const std::string &outer_col_raw = right_is_outer ? node->join_right_column : node->join_left_column;

Copilot AI Mar 26, 2026

INDEX_NESTED_LOOP_JOIN execution path ignores the physical-plan children (access paths / pushed-down FILTERs) and directly joins the base tables. If the optimizer pushes a single-table WHERE predicate into a join child, that predicate will be silently dropped for INLJ, producing incorrect results. Consider building/materializing the child operators like the NLJ/HASH_JOIN cases, or keep predicates as a post-join filter when INLJ is selected.
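
One way to keep pushed-down predicates effective with an index probe is to apply the outer-side filter before probing and the inner-side filter to each fetched match. The sketch below is an assumption-laden illustration of that idea, not the executor's code: `Row`, `Predicate`, and `FilteredIndexJoin` are hypothetical, and `std::multimap` stands in for the BTree index.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

using Row = std::vector<int>;
using Predicate = std::function<bool(const Row &)>; // an empty predicate means "no filter"

std::vector<Row> FilteredIndexJoin(const std::vector<Row> &outer, std::size_t outer_col,
                                   const std::vector<Row> &inner,
                                   const std::multimap<int, std::size_t> &inner_index,
                                   const Predicate &outer_pred, const Predicate &inner_pred)
{
    std::vector<Row> out;
    for (const Row &o : outer)
    {
        if (outer_pred && !outer_pred(o))
            continue; // filter that was pushed into the outer child
        auto range = inner_index.equal_range(o[outer_col]);
        for (auto it = range.first; it != range.second; ++it)
        {
            const Row &i = inner[it->second];
            if (inner_pred && !inner_pred(i))
                continue; // filter that was pushed into the inner child
            Row joined = o;
            joined.insert(joined.end(), i.begin(), i.end());
            out.push_back(joined);
        }
    }
    return out;
}
```

Filtering the outer side first also reduces the number of index probes.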

Comment thread src/execution/executor.cpp Outdated
Comment on lines +405 to +411
const std::string &outer_col_raw = right_is_outer ? node->join_right_column : node->join_left_column;
const std::string &inner_col_raw = right_is_outer ? node->join_left_column : node->join_right_column;
const std::string &inner_table_name = right_is_outer ? node->table_name : node->right_table_name;

const std::string outer_col = StripQualifier(outer_col_raw);
const std::string inner_col = StripQualifier(inner_col_raw);

Copilot AI Mar 26, 2026

INDEX_NESTED_LOOP_JOIN path doesn’t call ResolveJoinColumns (unlike NLJ/HASH_JOIN). If the ON clause sides are swapped (a case the integration tests already cover), node->join_left_column/join_right_column may not belong to (left,right) respectively, which can lead to wrong join keys and/or failing to find the expected index. Reuse ResolveJoinColumns and derive outer/inner + index selection from the resolved (left_col,right_col) mapping.

The optimizer and executor assumed join_left_column always belongs to
the left table, which breaks when the ON clause is written in reverse
order (e.g., ON right.x = left.y). Now both resolve column-to-table
mapping before choosing IndexNestedLoopJoin, so swapped ON columns
are handled correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@snigenigmatic snigenigmatic merged commit 6bcc1c1 into master Mar 26, 2026
1 check passed