Turn the search index into a binary file by jviotti · Pull Request #784 · sourcemeta/one

jviotti · 2026-03-25T20:41:24Z

Signed-off-by: Juan Cruz Viotti jv@jviotti.com

Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>

cubic-dev-ai

1 issue found across 7 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/search/search.cc">

<violation number="1" location="src/search/search.cc:70">
P1: String lengths are truncated to `std::uint16_t` without validation. If any field exceeds 65535 bytes, the length is silently truncated, corrupting the binary index and causing incorrect reads. Consider adding validation or using larger length fields.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

src/search/search.cc

augmentcode · 2026-03-25T20:45:50Z

🤖 Augment PR Summary

Summary: Converts the Explorer search index payload from JSONL text into a compact binary format.
Changes:

Introduce packed binary headers (SearchIndexHeader, SearchRecordHeader) in sourcemeta/one/search.h
Rewrite make_search() to emit: header + offset table + variable-length records (path/title/description)
Update search() to parse the binary payload and run case-insensitive matching per field
Switch generated search index MIME type to application/octet-stream
Update CLI integration test to validate schema paths extracted from the binary payload
Split/expand unit tests into build-format tests and query/robustness tests for malformed payloads

Technical Notes: Result ordering is preserved by pre-sorting entries (metadata completeness, then health, then path) and search results are still capped at 10. Considerations: The binary layout uses an offset table to enable O(1) record lookup and bounded parsing during streaming.

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 3 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2026-03-25T20:45:51Z

src/index/explorer.h

                               payload.size()}};
    sourcemeta::one::metapack_write_text(
-        action.destination, payload_view, "application/jsonl",
+        action.destination, payload_view, "application/octet-stream",


metapack_write_text() appends a trailing \n (and copies into a std::string), so using it for this binary search index will mutate the payload bytes (and even an “empty” index becomes a 1-byte payload). For a binary format this can be surprising and may break any consumer expecting the payload to end exactly at the last record.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-03-25T20:45:51Z

src/search/search.cc

+    std::memcpy(offset_table + entry_index * sizeof(std::uint32_t),
+                &record_offset, sizeof(std::uint32_t));
+
+    // Write record header


The record field lengths are truncated to std::uint16_t without checking entry.*.size() <= UINT16_MAX, which will silently corrupt/truncate large paths/titles/descriptions. Because the writer still advances by the full std::string::size(), this can also make the on-disk format internally inconsistent for long fields.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-03-25T20:45:51Z

src/search/search.cc

-  if (payload_size == 0) {
+  if (payload == nullptr || payload_size < sizeof(SearchIndexHeader)) {
    return result;
  }


payload is raw bytes from a file, but it’s being parsed via reinterpret_cast<...*>(payload) and dereferenced (header, offset_table, record_header), which relies on type-punning/object-lifetime assumptions and can be undefined behavior under optimization. Consider decoding these fixed-size structs/integers via std::memcpy into local trivially-copyable values instead.

Severity: medium

Other Locations

src/search/search.cc:125

src/search/search.cc:136

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>

github-actions

Benchmark Index (community)

Details

Benchmark suite	Current: `7f0b3f1`	Previous: `00a0767`	Ratio
`Add one schema (0 existing)`	`21` ms	`19` ms	`1.11`
`Add one schema (100 existing)`	`27` ms	`24` ms	`1.13`
`Add one schema (1000 existing)`	`86` ms	`78` ms	`1.10`
`Add one schema (10000 existing)`	`723` ms	`819` ms	`0.88`
`Update one schema (1 existing)`	`21` ms	`17` ms	`1.24`
`Update one schema (101 existing)`	`27` ms	`24` ms	`1.13`
`Update one schema (1001 existing)`	`93` ms	`77` ms	`1.21`
`Update one schema (10001 existing)`	`748` ms	`673` ms	`1.11`
`Cached rebuild (1 existing)`	`11` ms	`10` ms	`1.10`
`Cached rebuild (101 existing)`	`13` ms	`12` ms	`1.08`
`Cached rebuild (1001 existing)`	`38` ms	`33` ms	`1.15`
`Cached rebuild (10001 existing)`	`299` ms	`285` ms	`1.05`
`Index 100 schemas`	`121` ms	`109` ms	`1.11`
`Index 1000 schemas`	`1119` ms	`888` ms	`1.26`
`Index 10000 schemas`	`14037` ms	`13507` ms	`1.04`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

Benchmark Index (enterprise)

Details

Benchmark suite	Current: `7f0b3f1`	Previous: `00a0767`	Ratio
`Add one schema (0 existing)`	`24` ms	`22` ms	`1.09`
`Add one schema (100 existing)`	`27` ms	`28` ms	`0.96`
`Add one schema (1000 existing)`	`84` ms	`84` ms	`1`
`Add one schema (10000 existing)`	`765` ms	`805` ms	`0.95`
`Update one schema (1 existing)`	`21` ms	`20` ms	`1.05`
`Update one schema (101 existing)`	`27` ms	`27` ms	`1`
`Update one schema (1001 existing)`	`86` ms	`87` ms	`0.99`
`Update one schema (10001 existing)`	`720` ms	`705` ms	`1.02`
`Cached rebuild (1 existing)`	`19` ms	`11` ms	`1.73`
`Cached rebuild (101 existing)`	`14` ms	`14` ms	`1`
`Cached rebuild (1001 existing)`	`38` ms	`39` ms	`0.97`
`Cached rebuild (10001 existing)`	`317` ms	`300` ms	`1.06`
`Index 100 schemas`	`124` ms	`124` ms	`1`
`Index 1000 schemas`	`992` ms	`1065` ms	`0.93`
`Index 10000 schemas`	`13976` ms	`14304` ms	`0.98`

This comment was automatically generated by workflow using github-action-benchmark.

jviotti marked this pull request as ready for review March 25, 2026 20:42

Turn the search index into a binary file

d107485

Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>

jviotti force-pushed the search-binary branch from d316a28 to d107485 Compare March 25, 2026 20:42

cubic-dev-ai bot reviewed Mar 25, 2026

View reviewed changes

src/search/search.cc Show resolved Hide resolved

augmentcode bot reviewed Mar 25, 2026

View reviewed changes

Review

7f0b3f1

Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>

github-actions bot reviewed Mar 25, 2026

View reviewed changes

jviotti merged commit aa4b1dd into main Mar 26, 2026
6 checks passed

jviotti deleted the search-binary branch March 26, 2026 12:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Turn the search index into a binary file#784

Turn the search index into a binary file#784
jviotti merged 2 commits intomainfrom
search-binary

jviotti commented Mar 25, 2026

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

Uh oh!

augmentcode bot commented Mar 25, 2026

Uh oh!

augmentcode bot left a comment

Uh oh!

augmentcode bot Mar 25, 2026

Uh oh!

augmentcode bot Mar 25, 2026

Uh oh!

augmentcode bot Mar 25, 2026

Uh oh!

github-actions bot left a comment

Uh oh!

github-actions bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jviotti commented Mar 25, 2026

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

augmentcode bot commented Mar 25, 2026

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Benchmark Index (community)

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Benchmark Index (enterprise)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant