Skip to content

Refactor the current search functionality into a search module#781

Merged
jviotti merged 3 commits intomainfrom
search-module
Mar 25, 2026
Merged

Refactor the current search functionality into a search module#781
jviotti merged 3 commits intomainfrom
search-module

Conversation

@jviotti
Copy link
Member

@jviotti jviotti commented Mar 25, 2026

Signed-off-by: Juan Cruz Viotti jv@jviotti.com

@jviotti jviotti force-pushed the search-module branch 3 times, most recently from 89fcfba to 8676745 Compare March 25, 2026 19:39
Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>
@jviotti jviotti marked this pull request as ready for review March 25, 2026 19:42
@jviotti jviotti changed the title Introduce a search module Refactor the current search functionality into a search module Mar 25, 2026
@augmentcode
Copy link

augmentcode bot commented Mar 25, 2026

🤖 Augment PR Summary

Summary: This PR introduces a reusable search module for building and querying the Explorer search index, and wires it into both the indexer and the HTTP server.

Changes:

  • Adds a new src/search library exposing SearchEntry, make_search(), search(), and a memory-mapped SearchView.
  • Moves the server-side schema search implementation to the shared module and switches the endpoint to use SearchView.
  • Updates the indexer’s Explorer search-index generation to produce the JSONL payload via make_search().
  • Removes the dedicated metapack_write_jsonl helper and uses metapack_write_text for writing the JSONL payload.
  • Adds unit tests covering payload creation, matching behavior, case-insensitivity, and the 10-result cap.
  • Integrates the new library into the index and server build targets via CMake.

Technical Notes: Search is implemented as a line-by-line scan over a JSONL payload stored in a metapack file, with results returned as a JSON array of objects.

🤖 Was this summary useful? React with 👍 or 👎

Copy link

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 5 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

const std::string_view query) -> sourcemeta::core::JSON {
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-reinterpret-cast)
const std::string_view data{reinterpret_cast<const char *>(payload),
payload_size};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

search() constructs a std::string_view from payload even when payload may be nullptr (tests call search(nullptr, 0, ...)), which can violate string_view's pointer preconditions even for size 0. Consider guarding the empty-payload case before building data.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

}

if (std::ranges::search(line, query, [](const auto left, const auto right) {
return std::tolower(left) == std::tolower(right);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case-insensitive predicate calls std::tolower(left/right) on a char value; if char is signed and the byte is >= 0x80, this is undefined behavior. This should cast to unsigned char (or otherwise avoid passing negative values) before calling std::tolower.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

assert(payload_start_option.has_value());
const auto &payload_start{payload_start_option.value()};
this->payload_size_ = this->view_->size() - payload_start;
this->payload_ = this->view_->as<std::uint8_t>(payload_start);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the metapack payload is empty, payload_start can equal view_->size(), making view_->as<std::uint8_t>(payload_start) trip its bounds assert (offset + 1 <= size). This can crash for an empty search index (e.g., when make_search() returns an empty payload).

Severity: high

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

action.destination, result, "application/jsonl",
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-reinterpret-cast)
const std::string_view payload_view{
reinterpret_cast<const char *>(payload.data()), payload.size()};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When payload is empty, payload.data() may be nullptr; constructing payload_view from that pointer (even with size 0) can be undefined and may break metapack_write_text downstream. It’d be safer to handle the empty-payload case explicitly before creating the std::string_view.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

} else {
auto result{sourcemeta::one::search(
base / "explorer" / SENTINEL / "search.metapack", query)};
auto result{search_view.search(query)};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

search_view.search(query) can throw (e.g., FileViewError if the index is missing/unreadable, or JSON parse errors if corrupted), but this handler doesn’t catch exceptions anymore (previously it returned a controlled error). Unhandled exceptions here could crash the server or abort the request instead of returning a JSON error response.

Severity: high

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 issues found across 13 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="test/unit/search/search_test.cc">

<violation number="1" location="test/unit/search/search_test.cc:30">
P2: Avoid passing `nullptr` here: this test currently relies on undefined behavior in `search()`'s `std::string_view` construction.</violation>
</file>

<file name="src/search/search.cc">

<violation number="1" location="src/search/search.cc:31">
P2: Zero-metadata entries are never ordered by path, so their output order is arbitrary instead of lexicographic.</violation>

<violation number="2" location="src/search/search.cc:74">
P2: Cast both characters to `unsigned char` before calling `std::tolower` to avoid undefined behavior on non-ASCII bytes.</violation>

<violation number="3" location="src/search/search.cc:112">
P1: When the metapack payload section is empty, `payload_start` equals `view_->size()`, so `view_->as<std::uint8_t>(payload_start)` will trip the bounds check (`offset + 1 <= size`). Guard this access with a check for `payload_size_ == 0` before dereferencing.</violation>
</file>

<file name="src/server/action_schema_search.h">

<violation number="1" location="src/server/action_schema_search.h:26">
P1: This now routes search requests through an assert-based open path, so a missing `search.metapack` can terminate the server instead of producing the 500 response the old code returned.</violation>
</file>

<file name="src/server/server.cc">

<violation number="1" location="src/server/server.cc:159">
P1: Avoid a shared static `SearchView` here. Its lazy `ensure_open()` mutates internal state without synchronization, so concurrent search requests can race on first use.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

} else {
auto result{sourcemeta::one::search(
base / "explorer" / SENTINEL / "search.metapack", query)};
auto result{search_view.search(query)};
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: This now routes search requests through an assert-based open path, so a missing search.metapack can terminate the server instead of producing the 500 response the old code returned.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/server/action_schema_search.h, line 26:

<comment>This now routes search requests through an assert-based open path, so a missing `search.metapack` can terminate the server instead of producing the 500 response the old code returned.</comment>

<file context>
@@ -91,8 +23,7 @@ static auto action_schema_search(const std::filesystem::path &base,
     } else {
-      auto result{sourcemeta::one::search(
-          base / "explorer" / SENTINEL / "search.metapack", query)};
+      auto result{search_view.search(query)};
       response.write_status(sourcemeta::one::STATUS_OK);
       response.write_header("Access-Control-Allow-Origin", "*");
</file context>
Fix with Cubic

sourcemeta::one::HTTPRequest &request,
sourcemeta::one::HTTPResponse &response) -> void {
action_schema_search(base, request, response);
static sourcemeta::one::SearchView search_view{base / "explorer" / SENTINEL /
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Avoid a shared static SearchView here. Its lazy ensure_open() mutates internal state without synchronization, so concurrent search requests can race on first use.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/server/server.cc, line 159:

<comment>Avoid a shared static `SearchView` here. Its lazy `ensure_open()` mutates internal state without synchronization, so concurrent search requests can race on first use.</comment>

<file context>
@@ -156,7 +156,9 @@ static auto handle_self_v1_api_schemas_search(
     sourcemeta::one::HTTPRequest &request,
     sourcemeta::one::HTTPResponse &response) -> void {
-  action_schema_search(base, request, response);
+  static sourcemeta::one::SearchView search_view{base / "explorer" / SENTINEL /
+                                                 "search.metapack"};
+  action_schema_search(search_view, request, response);
</file context>
Fix with Cubic

}

TEST(Search, search_empty_payload) {
const auto result{sourcemeta::one::search(nullptr, 0, "anything")};
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Avoid passing nullptr here: this test currently relies on undefined behavior in search()'s std::string_view construction.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At test/unit/search/search_test.cc, line 30:

<comment>Avoid passing `nullptr` here: this test currently relies on undefined behavior in `search()`'s `std::string_view` construction.</comment>

<file context>
@@ -0,0 +1,177 @@
+}
+
+TEST(Search, search_empty_payload) {
+  const auto result{sourcemeta::one::search(nullptr, 0, "anything")};
+  EXPECT_TRUE(result.is_array());
+  EXPECT_EQ(result.size(), 0);
</file context>
Fix with Cubic

Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmark Index (community)

Details
Benchmark suite Current: 39142ca Previous: 1275841 Ratio
Add one schema (0 existing) 20 ms 21 ms 0.95
Add one schema (100 existing) 26 ms 28 ms 0.93
Add one schema (1000 existing) 86 ms 85 ms 1.01
Add one schema (10000 existing) 729 ms 809 ms 0.90
Update one schema (1 existing) 18 ms 19 ms 0.95
Update one schema (101 existing) 28 ms 26 ms 1.08
Update one schema (1001 existing) 87 ms 87 ms 1
Update one schema (10001 existing) 744 ms 743 ms 1.00
Cached rebuild (1 existing) 10 ms 10 ms 1
Cached rebuild (101 existing) 13 ms 13 ms 1
Cached rebuild (1001 existing) 38 ms 38 ms 1
Cached rebuild (10001 existing) 302 ms 297 ms 1.02
Index 100 schemas 117 ms 125 ms 0.94
Index 1000 schemas 1050 ms 989 ms 1.06
Index 10000 schemas 13860 ms 14453 ms 0.96

This comment was automatically generated by workflow using github-action-benchmark.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmark Index (enterprise)

Details
Benchmark suite Current: 39142ca Previous: 1275841 Ratio
Add one schema (0 existing) 22 ms 22 ms 1
Add one schema (100 existing) 28 ms 29 ms 0.97
Add one schema (1000 existing) 84 ms 85 ms 0.99
Add one schema (10000 existing) 690 ms 720 ms 0.96
Update one schema (1 existing) 20 ms 21 ms 0.95
Update one schema (101 existing) 27 ms 28 ms 0.96
Update one schema (1001 existing) 86 ms 85 ms 1.01
Update one schema (10001 existing) 701 ms 734 ms 0.96
Cached rebuild (1 existing) 11 ms 12 ms 0.92
Cached rebuild (101 existing) 14 ms 14 ms 1
Cached rebuild (1001 existing) 38 ms 38 ms 1
Cached rebuild (10001 existing) 298 ms 305 ms 0.98
Index 100 schemas 127 ms 124 ms 1.02
Index 1000 schemas 1081 ms 1094 ms 0.99
Index 10000 schemas 13183 ms 14706 ms 0.90

This comment was automatically generated by workflow using github-action-benchmark.

@jviotti jviotti merged commit dd5d964 into main Mar 25, 2026
6 checks passed
@jviotti jviotti deleted the search-module branch March 25, 2026 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant