Skip to content

feat(clp-s::search): Scan ERT tables using column-based filtering#2281

Open
ShangDanLuXian wants to merge 2 commits into
y-scope:mainfrom
ShangDanLuXian:column_based_scan
Open

feat(clp-s::search): Scan ERT tables using column-based filtering#2281
ShangDanLuXian wants to merge 2 commits into
y-scope:mainfrom
ShangDanLuXian:column_based_scan

Conversation

@ShangDanLuXian
Copy link
Copy Markdown
Contributor

@ShangDanLuXian ShangDanLuXian commented May 14, 2026

Description

This PR adds an initial column-scan filter execution path for clp_s search.

The implementation is intentionally kept small and focused. QueryRunner now prepares the schema-table filter and attempts to use ColumnScan for supported predicates. If column scan cannot handle the query expression, the existing row-scan path is used as a fallback.

Main changes:

  • Add ColumnScan under components/core/src/clp_s/search/.
  • Support building per-schema match bitmaps for supported filter ASTs.
  • Support column scan for simple scalar/string filter predicates and AND / OR expression trees.
  • Integrate column scan through QueryRunner::prepare_filter.
  • Add CMake wiring for the new column-scan source files.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Tested representative MongoDB search queries locally.
  • For query id: 22419, observed total search time reduced from roughly 30s to roughly 15s on the test dataset, with table scan costs 1-2s.
  • Sanity checked all queries for the MongoDB dataset in the paper.

Summary by CodeRabbit

  • New Features

    • Enhanced search filtering with optimized message evaluation using column-based scanning.
  • Chores

    • Updated build system to include new search components.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

Walkthrough

This PR introduces ColumnScan, a bitmap-based filter class for evaluating search AST expressions over indexed columns. It integrates into QueryRunner via a prepare_filter factory method and refactors Output to consume the prepared filter, enabling efficient per-message filtering without rebuilding evaluation state.

Changes

ColumnScan bitmap-based filtering

Layer / File(s) Summary
ColumnScan class contract and interface
components/core/src/clp_s/search/ColumnScan.hpp
ColumnScan defines a FilterClass-derived component with a static try_create factory, init and filter overrides, and private helpers for AST validation and bitmap construction from expression trees and reader/query/match lookups.
ColumnScan implementation—validation and bitmap evaluation
components/core/src/clp_s/search/ColumnScan.cpp
Implements construction, per-message filtering via bitmap lookup, iterative AST validation, filter buildability checks, and bitmap evaluation for AND/OR compositions and typed filter operations (numeric, boolean, CLP string, variable string, EXISTS/NEXISTS).
QueryRunner prepare_filter integration
components/core/src/clp_s/search/QueryRunner.hpp, components/core/src/clp_s/search/QueryRunner.cpp
QueryRunner adds prepare_filter public method, ColumnScan member, and includes; prepare_filter conditionally constructs ColumnScan and returns either the scan or QueryRunner itself as the active filter for a schema.
Output filter consumption
components/core/src/clp_s/search/Output.cpp
Output refactors filter() to call prepare_filter once per schema and pass the returned filter into get_next_message calls, replacing the prior initialize_filter call.
Build system compilation setup
components/core/src/clp_s/search/CMakeLists.txt
CMakeLists adds ColumnScan.cpp and ColumnScan.hpp to CLP_S_SEARCH_SOURCES for compilation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately describes the primary change: introducing column-based filtering for scanning ERT tables in the clp-s search component.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ShangDanLuXian ShangDanLuXian changed the title feat(clp-s::search): Add column-scan filter evaluation eat(clp-s::search): Scan ERT tables using column-based filtering May 14, 2026
@ShangDanLuXian ShangDanLuXian changed the title eat(clp-s::search): Scan ERT tables using column-based filtering (clp-s::search): Scan ERT tables using column-based filtering May 14, 2026
@ShangDanLuXian ShangDanLuXian marked this pull request as ready for review May 14, 2026 20:51
@ShangDanLuXian ShangDanLuXian requested review from a team and gibber9809 as code owners May 14, 2026 20:51
@ShangDanLuXian ShangDanLuXian changed the title (clp-s::search): Scan ERT tables using column-based filtering feat(clp-s::search): Scan ERT tables using column-based filtering May 15, 2026
Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving an initial set of comments -- sorry have been busy with other reviews. Will aim to give a full review tomorrow.

Comment on lines +65 to +70
std::shared_ptr<ast::Expression> m_expression;
BasicReaderMap const* m_basic_readers;
ClpStringReaderMap const* m_clp_string_readers;
VarStringReaderMap const* m_var_string_readers;
ClpQueryMap const* m_clp_queries;
VarMatchMap const* m_var_matches;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, all of these members are only used to avoid passing these around during initialization.

Could we get rid of these members and just pass them around during initialization instead? I know it will make some of these function signatures longer, but I think it will make this class easier to read, since when a reader sees these pointers they have to start wondering about the lifetimes & ownership of these pointers after initialization.

namespace clp_s::search {
class ColumnScan : public FilterClass {
public:
using Bitmap = std::vector<uint8_t>;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we're using this instead of a packed bitmap since there wasn't much of a performance difference during benchmarking? If so, that makes sense for now.

return value >= operand;
case FilterOperation::EXISTS:
case FilterOperation::NEXISTS:
return false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return false;
return true;

EXISTS/NEXISTS are considered always matching at scan time because schema matching has previously evaluated whether they match or not.

Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a few more comments -- only file I still need to review in my first pass after this is ColumnScan.cpp.

Comment on lines +38 to +42
ColumnScan(ColumnScan&&) = default;
ColumnScan(ColumnScan const&) = delete;
auto operator=(ColumnScan&&) -> ColumnScan& = delete;
auto operator=(ColumnScan const&) -> ColumnScan& = delete;
virtual ~ColumnScan() = default;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should group these together with the standard set of comments for defaulted/deleted constructors

using ClpQueryMap = std::unordered_map<ast::Expression*, clp::Query*>;
using VarMatchMap = std::unordered_map<ast::Expression*, std::unordered_set<int64_t>*>;

[[nodiscard]] static auto try_create(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing docstring.

Comment on lines +59 to +63
[[nodiscard]] auto can_build_node(ast::Expression* expr) const -> bool;
[[nodiscard]] auto can_build_filter(ast::FilterExpr* filter) const -> bool;

[[nodiscard]] auto build_node(ast::Expression* expr) const -> Bitmap;
[[nodiscard]] auto build_filter(ast::FilterExpr* filter) const -> Bitmap;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty straightforward to understand what these do in context, but I think it's still worth adding docstrings here.

auto schema_init(int32_t schema_id) -> EvaluatedValue;

/**
* Initializes readers for the schema table and returns the filter implementation to use.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Initializes readers for the schema table and returns the filter implementation to use.
* Selects a filtering implementation, and prepares a filter on a given ERT.
*
* Note: This method must be called after schema_init.
*
* @param reader A reader for an ERT.
* @return The filtering implementation selected by QueryRunner.

Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look through ColumnScan.cpp as well now -- left some initial comments, but will take a closer look after the first round of review comments have been addressed.

namespace clp_s::search {
namespace {
template <typename T>
[[nodiscard]] auto compare(FilterOperation operation, T value, T operand) -> bool {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cases like this where we declare and define a function in a .cpp file, our guideline is to separate the declaration from the definition, and have all of the declarations (with docstrings) come before the definitions. It makes it easier to understand the code without having to scroll past all of the implementations.

In this instance that applies to this compare function, as well as is_equality_operation, invert, build_basic_filter, clp_string_matches, build_clp_string_filter, and build_var_string_filter.

You can take a look at clp_s/timestamp_parser/TimestampParser.cpp for a somewhat recent example of this in the codebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants