Migrate .metapack headers into a binary format#751
Conversation
Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>
There was a problem hiding this comment.
Benchmark Index (enterprise)
Details
| Benchmark suite | Current: 5124c39 | Previous: 9a6fa19 | Ratio |
|---|---|---|---|
Add one schema (0 existing) |
23 ms |
25 ms |
0.92 |
Add one schema (100 existing) |
35 ms |
40 ms |
0.88 |
Add one schema (1000 existing) |
154 ms |
201 ms |
0.77 |
Add one schema (10000 existing) |
1665 ms |
2022 ms |
0.82 |
Update one schema (1 existing) |
21 ms |
23 ms |
0.91 |
Update one schema (101 existing) |
35 ms |
42 ms |
0.83 |
Update one schema (1001 existing) |
155 ms |
198 ms |
0.78 |
Update one schema (10001 existing) |
1602 ms |
2055 ms |
0.78 |
Cached rebuild (1 existing) |
11 ms |
12 ms |
0.92 |
Cached rebuild (101 existing) |
14 ms |
14 ms |
1 |
Cached rebuild (1001 existing) |
32 ms |
33 ms |
0.97 |
Cached rebuild (10001 existing) |
245 ms |
245 ms |
1 |
Index 100 schemas |
138 ms |
156 ms |
0.88 |
Index 1000 schemas |
1338 ms |
1419 ms |
0.94 |
Index 10000 schemas |
16127 ms |
16567 ms |
0.97 |
This comment was automatically generated by workflow using github-action-benchmark.
Signed-off-by: Juan Cruz Viotti <jv@jviotti.com>
🤖 Augment PR SummarySummary: This PR migrates the Changes:
Technical Notes: The new header stores magic/version, encoding (identity/gzip), timestamps, uncompressed byte count, duration, SHA-256 checksum, MIME length, and a length-prefixed extension region before the payload. 🤖 Was this summary useful? React with 👍 or 👎 |
| } | ||
|
|
||
| auto payload_offset{sizeof(MetapackHeader) + header->mime_length}; | ||
| const auto *extension_size{view.as<std::uint32_t>(payload_offset)}; |
There was a problem hiding this comment.
metapack_read_json reads the extension_size field using payload_offset before checking that header->mime_length keeps that offset within view.size(), so a malformed header could cause an out-of-bounds read/crash. Consider validating mime_length (and extension_size) against the file size before any view.as<...>(payload_offset) dereference.
Severity: high
Other Locations
src/metapack/metapack.cc:278src/metapack/metapack.cc:261
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| return 0; | ||
| } | ||
|
|
||
| const auto *extension_size_pointer{ |
There was a problem hiding this comment.
metapack_extension_offset returns a non-zero offset without verifying that the declared extension_size actually fits in the file, which can later make metapack_extension<T> return pointers into invalid memory. Consider ensuring offset_of_extension_size + sizeof(uint32_t) + *extension_size_pointer <= view.size() before returning an offset.
Severity: high
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| return; | ||
| } | ||
|
|
||
| const auto info{sourcemeta::one::metapack_info(view)}; |
There was a problem hiding this comment.
This calls metapack_info(view) after only a minimal size check; if the target file is corrupted or not a metapack, metapack_info currently relies on assert and may crash/UB in release builds. Consider validating magic/format_version (and header lengths) before calling into metapack_info and returning a clean HTTP error for invalid files.
Severity: high
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| write_link_header(response, dialect.to_string()); | ||
| const auto *dialect_ext{ | ||
| sourcemeta::one::metapack_extension<MetapackDialectExtension>(view)}; | ||
| const std::string_view dialect = |
There was a problem hiding this comment.
The dialect_length is trusted to build a std::string_view into the extension, but there’s no check that the extension is at least sizeof(MetapackDialectExtension) + dialect_length bytes long. A malformed header could make this read past the mapped file; consider bounds-checking against metapack_extension_size(view) (and the file size) before constructing the view.
Severity: high
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| const std::size_t field_length) | ||
| -> std::string_view { | ||
| // NOLINTNEXTLINE(cppcoreguidelines-pro-type-reinterpret-cast) | ||
| return {reinterpret_cast<const char *>( |
There was a problem hiding this comment.
explorer_extension_string constructs a std::string_view from field_offset/field_length without validating that the requested slice stays within the metapack extension buffer. If any length fields are corrupted, this can produce out-of-bounds views and undefined behavior when copied into std::string.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| header.health = health; | ||
| header.bytes = bytes; | ||
| header.dependencies = dependencies; | ||
| header.path_length = static_cast<std::uint16_t>(path.size()); |
There was a problem hiding this comment.
make_explorer_schema_extension truncates string sizes into std::uint16_t fields via static_cast, which will silently wrap for inputs > 65535 bytes and desync offsets on read. Consider validating each field length fits in 16 bits (or using a wider length type) before serializing.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| std::vector<std::uint8_t> result; | ||
| result.resize(sizeof(MetapackDialectExtension) + dialect.size()); | ||
| MetapackDialectExtension header{}; | ||
| header.dialect_length = static_cast<std::uint16_t>(dialect.size()); |
There was a problem hiding this comment.
make_dialect_extension truncates dialect.size() into a std::uint16_t without a bounds check, which could corrupt the extension if a very long dialect identifier ever appears. Consider asserting/validating dialect.size() <= UINT16_MAX before writing.
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
10 issues found across 28 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/server/action_schema_search.h">
<violation number="1" location="src/server/action_schema_search.h:34">
P1: Validate `payload_start` before subtracting from `view.size()`. Without a bounds check, malformed metapack headers can cause size underflow and out-of-bounds reads.</violation>
</file>
<file name="src/index/explorer.h">
<violation number="1" location="src/index/explorer.h:235">
P2: Silent data corruption if any string exceeds 65535 bytes. The `static_cast<std::uint16_t>(size)` truncates without checking bounds, but the buffer is allocated using the real (untruncated) sizes. On read-back, the truncated lengths produce wrong offsets for every field after the overflowed one. Add an assertion or saturating check, e.g. `assert(path.size() <= UINT16_MAX)`.</violation>
</file>
<file name="src/metapack/metapack.cc">
<violation number="1" location="src/metapack/metapack.cc:180">
P1: `metapack_extension_offset` returns a non-zero offset without verifying that `offset + *extension_size_pointer` fits within `view.size()`. Callers like `metapack_extension<T>` will then dereference pointers into invalid memory if the declared extension size exceeds the actual file. Add a bounds check: `offset_of_extension_size + sizeof(uint32_t) + *extension_size_pointer <= view.size()`.</violation>
<violation number="2" location="src/metapack/metapack.cc:218">
P1: Validate `payload_offset + sizeof(uint32_t)` before reading `extension_size`; current ordering can dereference out of bounds on malformed headers.</violation>
<violation number="3" location="src/metapack/metapack.cc:243">
P1: `metapack_info` relies on `assert` to validate magic number and format version. In release builds (`NDEBUG` defined), these asserts are compiled out, so a corrupted file will cause undefined behavior instead of a clean error. Replace the asserts with runtime checks (throw or return an error), especially since this is called from the server where input may be untrusted.</violation>
<violation number="4" location="src/metapack/metapack.cc:266">
P1: Validate that `mime_length` fits in the `FileView` before constructing `std::string`; otherwise malformed headers can over-read memory.</violation>
<violation number="5" location="src/metapack/metapack.cc:279">
P1: Add bounds checks before and after reading `extension_size` in `metapack_payload_offset`; malformed headers can currently trigger out-of-bounds reads.</violation>
</file>
<file name="src/server/action_serve_metapack_file.h">
<violation number="1" location="src/server/action_serve_metapack_file.h:120">
P2: Validate `dialect_length` against the extension size before creating the string_view to avoid out-of-bounds reads on malformed metapack headers.</violation>
<violation number="2" location="src/server/action_serve_metapack_file.h:135">
P1: Check that `payload_start` is within the file size before computing `payload_size` to avoid underflow and out-of-bounds reads on malformed metapack headers.</violation>
</file>
<file name="src/index/generators.h">
<violation number="1" location="src/index/generators.h:53">
P2: Guard against dialect identifiers longer than 65,535 bytes before casting to `std::uint16_t`; otherwise the length is truncated and readers can mis-parse the extension payload.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| } | ||
| sourcemeta::core::FileView view{search_index}; | ||
| const auto payload_start{metapack_payload_offset(view)}; | ||
| const auto payload_size{view.size() - payload_start}; |
There was a problem hiding this comment.
P1: Validate payload_start before subtracting from view.size(). Without a bounds check, malformed metapack headers can cause size underflow and out-of-bounds reads.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/server/action_schema_search.h, line 34:
<comment>Validate `payload_start` before subtracting from `view.size()`. Without a bounds check, malformed metapack headers can cause size underflow and out-of-bounds reads.</comment>
<file context>
@@ -27,18 +29,29 @@ static auto search(const std::filesystem::path &search_index,
- }
+ sourcemeta::core::FileView view{search_index};
+ const auto payload_start{metapack_payload_offset(view)};
+ const auto payload_size{view.size() - payload_start};
+ // NOLINTNEXTLINE(cppcoreguidelines-pro-type-reinterpret-cast)
+ const std::string_view payload{
</file context>
| const auto payload_size{view.size() - payload_start}; | |
| if (payload_start > view.size()) { | |
| throw std::runtime_error("Invalid search index metapack header"); | |
| } | |
| const auto payload_size{view.size() - payload_start}; |
|
|
||
| return MetapackInfo{.checksum_hex = std::move(checksum_hex), | ||
| .last_modified = time_point, | ||
| .mime = std::string{mime_data, header->mime_length}, |
There was a problem hiding this comment.
P1: Validate that mime_length fits in the FileView before constructing std::string; otherwise malformed headers can over-read memory.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/metapack/metapack.cc, line 266:
<comment>Validate that `mime_length` fits in the `FileView` before constructing `std::string`; otherwise malformed headers can over-read memory.</comment>
<file context>
@@ -0,0 +1,285 @@
+
+ return MetapackInfo{.checksum_hex = std::move(checksum_hex),
+ .last_modified = time_point,
+ .mime = std::string{mime_data, header->mime_length},
+ .encoding = header->encoding,
+ .content_bytes = header->content_bytes,
</file context>
| assert(header->magic == METAPACK_MAGIC); | ||
|
|
||
| auto offset{sizeof(MetapackHeader) + header->mime_length}; | ||
| const auto *extension_size{view.as<std::uint32_t>(offset)}; |
There was a problem hiding this comment.
P1: Add bounds checks before and after reading extension_size in metapack_payload_offset; malformed headers can currently trigger out-of-bounds reads.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/metapack/metapack.cc, line 279:
<comment>Add bounds checks before and after reading `extension_size` in `metapack_payload_offset`; malformed headers can currently trigger out-of-bounds reads.</comment>
<file context>
@@ -0,0 +1,285 @@
+ assert(header->magic == METAPACK_MAGIC);
+
+ auto offset{sizeof(MetapackHeader) + header->mime_length};
+ const auto *extension_size{view.as<std::uint32_t>(offset)};
+ offset += sizeof(std::uint32_t) + *extension_size;
+
</file context>
| } | ||
|
|
||
| auto payload_offset{sizeof(MetapackHeader) + header->mime_length}; | ||
| const auto *extension_size{view.as<std::uint32_t>(payload_offset)}; |
There was a problem hiding this comment.
P1: Validate payload_offset + sizeof(uint32_t) before reading extension_size; current ordering can dereference out of bounds on malformed headers.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/metapack/metapack.cc, line 218:
<comment>Validate `payload_offset + sizeof(uint32_t)` before reading `extension_size`; current ordering can dereference out of bounds on malformed headers.</comment>
<file context>
@@ -0,0 +1,285 @@
+ }
+
+ auto payload_offset{sizeof(MetapackHeader) + header->mime_length};
+ const auto *extension_size{view.as<std::uint32_t>(payload_offset)};
+ payload_offset += sizeof(std::uint32_t) + *extension_size;
+
</file context>
| std::ostringstream contents; | ||
| contents << file.value().data.rdbuf(); | ||
| const auto payload_start{sourcemeta::one::metapack_payload_offset(view)}; | ||
| const auto payload_size{view.size() - payload_start}; |
There was a problem hiding this comment.
P1: Check that payload_start is within the file size before computing payload_size to avoid underflow and out-of-bounds reads on malformed metapack headers.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/server/action_serve_metapack_file.h, line 135:
<comment>Check that `payload_start` is within the file size before computing `payload_size` to avoid underflow and out-of-bounds reads on malformed metapack headers.</comment>
<file context>
@@ -98,20 +115,34 @@ static auto action_serve_metapack_file(
- std::ostringstream contents;
- contents << file.value().data.rdbuf();
+ const auto payload_start{sourcemeta::one::metapack_payload_offset(view)};
+ const auto payload_size{view.size() - payload_start};
+ // NOLINTNEXTLINE(cppcoreguidelines-pro-type-reinterpret-cast)
+ const std::string contents{
</file context>
| const auto payload_size{view.size() - payload_start}; | |
| if (payload_start > view.size()) { | |
| json_error(request, response, sourcemeta::one::STATUS_NOT_FOUND, | |
| "not-found", "There is nothing at this URL"); | |
| return; | |
| } | |
| const auto payload_size{view.size() - payload_start}; |
| } | ||
|
|
||
| auto metapack_info(const sourcemeta::core::FileView &view) -> MetapackInfo { | ||
| assert(view.size() >= sizeof(MetapackHeader) + sizeof(std::uint32_t)); |
There was a problem hiding this comment.
P1: metapack_info relies on assert to validate magic number and format version. In release builds (NDEBUG defined), these asserts are compiled out, so a corrupted file will cause undefined behavior instead of a clean error. Replace the asserts with runtime checks (throw or return an error), especially since this is called from the server where input may be untrusted.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/metapack/metapack.cc, line 243:
<comment>`metapack_info` relies on `assert` to validate magic number and format version. In release builds (`NDEBUG` defined), these asserts are compiled out, so a corrupted file will cause undefined behavior instead of a clean error. Replace the asserts with runtime checks (throw or return an error), especially since this is called from the server where input may be untrusted.</comment>
<file context>
@@ -0,0 +1,285 @@
+}
+
+auto metapack_info(const sourcemeta::core::FileView &view) -> MetapackInfo {
+ assert(view.size() >= sizeof(MetapackHeader) + sizeof(std::uint32_t));
+ const auto *header{view.as<MetapackHeader>()};
+ assert(header->magic == METAPACK_MAGIC);
</file context>
| return 0; | ||
| } | ||
|
|
||
| return offset_of_extension_size + sizeof(std::uint32_t); |
There was a problem hiding this comment.
P1: metapack_extension_offset returns a non-zero offset without verifying that offset + *extension_size_pointer fits within view.size(). Callers like metapack_extension<T> will then dereference pointers into invalid memory if the declared extension size exceeds the actual file. Add a bounds check: offset_of_extension_size + sizeof(uint32_t) + *extension_size_pointer <= view.size().
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/metapack/metapack.cc, line 180:
<comment>`metapack_extension_offset` returns a non-zero offset without verifying that `offset + *extension_size_pointer` fits within `view.size()`. Callers like `metapack_extension<T>` will then dereference pointers into invalid memory if the declared extension size exceeds the actual file. Add a bounds check: `offset_of_extension_size + sizeof(uint32_t) + *extension_size_pointer <= view.size()`.</comment>
<file context>
@@ -0,0 +1,285 @@
+ return 0;
+ }
+
+ return offset_of_extension_size + sizeof(std::uint32_t);
+}
+
</file context>
| header.health = health; | ||
| header.bytes = bytes; | ||
| header.dependencies = dependencies; | ||
| header.path_length = static_cast<std::uint16_t>(path.size()); |
There was a problem hiding this comment.
P2: Silent data corruption if any string exceeds 65535 bytes. The static_cast<std::uint16_t>(size) truncates without checking bounds, but the buffer is allocated using the real (untruncated) sizes. On read-back, the truncated lengths produce wrong offsets for every field after the overflowed one. Add an assertion or saturating check, e.g. assert(path.size() <= UINT16_MAX).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/index/explorer.h, line 235:
<comment>Silent data corruption if any string exceeds 65535 bytes. The `static_cast<std::uint16_t>(size)` truncates without checking bounds, but the buffer is allocated using the real (untruncated) sizes. On read-back, the truncated lengths produce wrong offsets for every field after the overflowed one. Add an assertion or saturating check, e.g. `assert(path.size() <= UINT16_MAX)`.</comment>
<file context>
@@ -107,6 +109,159 @@ inflate_metadata(const sourcemeta::one::Configuration &configuration,
+ header.health = health;
+ header.bytes = bytes;
+ header.dependencies = dependencies;
+ header.path_length = static_cast<std::uint16_t>(path.size());
+ header.identifier_length = static_cast<std::uint16_t>(identifier.size());
+ header.base_dialect_length = static_cast<std::uint16_t>(base_dialect.size());
</file context>
| const std::string_view dialect = | ||
| (dialect_ext != nullptr && dialect_ext->dialect_length > 0) | ||
| ? std::string_view{reinterpret_cast< | ||
| const char *>(view.as<std::uint8_t>( | ||
| sourcemeta::one::metapack_extension_offset( | ||
| view) + | ||
| sizeof(MetapackDialectExtension))), | ||
| dialect_ext->dialect_length} | ||
| : std::string_view{}; |
There was a problem hiding this comment.
P2: Validate dialect_length against the extension size before creating the string_view to avoid out-of-bounds reads on malformed metapack headers.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/server/action_serve_metapack_file.h, line 120:
<comment>Validate `dialect_length` against the extension size before creating the string_view to avoid out-of-bounds reads on malformed metapack headers.</comment>
<file context>
@@ -98,20 +115,34 @@ static auto action_serve_metapack_file(
- write_link_header(response, dialect.to_string());
+ const auto *dialect_ext{
+ sourcemeta::one::metapack_extension<MetapackDialectExtension>(view)};
+ const std::string_view dialect =
+ (dialect_ext != nullptr && dialect_ext->dialect_length > 0)
+ ? std::string_view{reinterpret_cast<
</file context>
| const std::string_view dialect = | |
| (dialect_ext != nullptr && dialect_ext->dialect_length > 0) | |
| ? std::string_view{reinterpret_cast< | |
| const char *>(view.as<std::uint8_t>( | |
| sourcemeta::one::metapack_extension_offset( | |
| view) + | |
| sizeof(MetapackDialectExtension))), | |
| dialect_ext->dialect_length} | |
| : std::string_view{}; | |
| const auto extension_size{sourcemeta::one::metapack_extension_size(view)}; | |
| const std::string_view dialect = | |
| (dialect_ext != nullptr && dialect_ext->dialect_length > 0 && | |
| extension_size >= sizeof(MetapackDialectExtension) && | |
| dialect_ext->dialect_length <= | |
| extension_size - sizeof(MetapackDialectExtension)) | |
| ? std::string_view{reinterpret_cast<const char *>( | |
| view.as<std::uint8_t>( | |
| sourcemeta::one::metapack_extension_offset( | |
| view) + | |
| sizeof(MetapackDialectExtension))), | |
| dialect_ext->dialect_length} | |
| : std::string_view{}; |
| std::vector<std::uint8_t> result; | ||
| result.resize(sizeof(MetapackDialectExtension) + dialect.size()); | ||
| MetapackDialectExtension header{}; | ||
| header.dialect_length = static_cast<std::uint16_t>(dialect.size()); |
There was a problem hiding this comment.
P2: Guard against dialect identifiers longer than 65,535 bytes before casting to std::uint16_t; otherwise the length is truncated and readers can mis-parse the extension payload.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/index/generators.h, line 53:
<comment>Guard against dialect identifiers longer than 65,535 bytes before casting to `std::uint16_t`; otherwise the length is truncated and readers can mis-parse the extension payload.</comment>
<file context>
@@ -37,6 +39,23 @@
+ std::vector<std::uint8_t> result;
+ result.resize(sizeof(MetapackDialectExtension) + dialect.size());
+ MetapackDialectExtension header{};
+ header.dialect_length = static_cast<std::uint16_t>(dialect.size());
+ std::memcpy(result.data(), &header, sizeof(header));
+ std::memcpy(result.data() + sizeof(header), dialect.data(), dialect.size());
</file context>
There was a problem hiding this comment.
Benchmark Index (community)
Details
| Benchmark suite | Current: 5124c39 | Previous: 9a6fa19 | Ratio |
|---|---|---|---|
Add one schema (0 existing) |
21 ms |
22 ms |
0.95 |
Add one schema (100 existing) |
32 ms |
40 ms |
0.80 |
Add one schema (1000 existing) |
159 ms |
197 ms |
0.81 |
Add one schema (10000 existing) |
1670 ms |
1987 ms |
0.84 |
Update one schema (1 existing) |
19 ms |
20 ms |
0.95 |
Update one schema (101 existing) |
33 ms |
37 ms |
0.89 |
Update one schema (1001 existing) |
154 ms |
188 ms |
0.82 |
Update one schema (10001 existing) |
1886 ms |
1955 ms |
0.96 |
Cached rebuild (1 existing) |
11 ms |
10 ms |
1.10 |
Cached rebuild (101 existing) |
12 ms |
13 ms |
0.92 |
Cached rebuild (1001 existing) |
31 ms |
30 ms |
1.03 |
Cached rebuild (10001 existing) |
238 ms |
232 ms |
1.03 |
Index 100 schemas |
161 ms |
157 ms |
1.03 |
Index 1000 schemas |
1241 ms |
1327 ms |
0.94 |
Index 10000 schemas |
15878 ms |
16889 ms |
0.94 |
This comment was automatically generated by workflow using github-action-benchmark.
.metapack headers into a memory format.metapack headers into a binary format
Signed-off-by: Juan Cruz Viotti jv@jviotti.com