refactor(clp-s): Use simdjson::builder::string_builder for JSON string escaping; Bump simdjson version to 4.6.4.#2216
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughRequire simdjson >= 4.0.0 and update pins/docs; add move-only clp_s::SimdJsonStringEscaper (header + impl using simdjson with fallback); extend BaseColumnReader API and implementations to accept the escaper; wire an escaper member into JsonSerializer and pass it when emitting quoted values. ChangesCore simdjson integration and escaping
Sequence Diagram(s)sequenceDiagram
participant components_core_CMake
participant clp_s_CMake
participant Taskfile
participant Docs
components_core_CMake->>clp_s_CMake: require simdjson >= 4.0.0 and add link
clp_s_CMake->>clp_s_CMake: insert simdjson::simdjson into targets
components_core_CMake->>Taskfile: update pinned simdjson tarball
Taskfile->>Docs: bump docs deps entry to v4.6.4
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| return; | ||
| } | ||
|
|
||
| // Preserve the original implementation in case simdjson fails. |
There was a problem hiding this comment.
Or we may remove it completely?
There was a problem hiding this comment.
Yeah, I would just remove the original implementation, and throw if the simdjson function fails.
There was a problem hiding this comment.
Per other review comment -- should probably keep it actually, and create a new class for this new implementation.
gibber9809
left a comment
There was a problem hiding this comment.
Nice work! Just a few things we should change.
|
|
||
| if(CLP_NEED_SIMDJSON) | ||
| find_package(simdjson REQUIRED) | ||
| find_package(simdjson 4.6.2 REQUIRED) |
There was a problem hiding this comment.
| find_package(simdjson 4.6.2 REQUIRED) | |
| find_package(simdjson 4.3.0 REQUIRED) |
To make it easier to integrate with other projects we should probably choose the earliest version that serves our needs -- it seems like 4.3.0 is the first version with the string escaping optimizations, so let's use that as our required version.
There was a problem hiding this comment.
Actually, sorry it might be better to require 4.0.0, which introduces the string builder API. Reason being that it seems like the version of velox integrate with currently pulls in simdjson 4.1.0, so requiring a newer version than that would cause problems.
| clp_s::timestamp_pattern | ||
| msgpack-cxx | ||
| nlohmann_json::nlohmann_json | ||
| simdjson::simdjson |
There was a problem hiding this comment.
I think this should be PRIVATE, since it's transitive through Utils.cpp, not through a header file.
We should probably also split out Utils.{cpp,hpp} into a library in our build since those utilities are used by a few of our libraries, but I won't make you do that as part of this PR. @coderabbitai, could you create an issue to track splitting off these utilities into a separate library in our build?
There was a problem hiding this comment.
🧩 Analysis chain
@gibber9809, agreed — simdjson::simdjson should be PRIVATE here since the dependency is only used within Utils.cpp and is not exposed through any public header of clp_s_archive_reader.
I've created a GitHub issue to track splitting Utils.{cpp,hpp} into a dedicated library. I'll share the link once the script completes above.
🏁 Script executed:
#!/bin/bash
gh issue create \
--repo y-scope/clp \
--assignee gibber9809 \
--title "refactor(clp-s): Split Utils.{cpp,hpp} into a separate library in the build" \
--body "## Overview
\`Utils.cpp\` and \`Utils.hpp\` in \`components/core/src/clp_s/\` contain shared utilities that are used by several clp-s libraries (e.g., \`clp_s_archive_reader\`, \`clp_s_archive_writer\`, \`clp_s_io\`). Currently these files are compiled into multiple targets, which leads to duplicated compilation and makes it harder to manage library-specific dependencies (e.g., transitive dependencies like \`simdjson::simdjson\` must be declared on each consumer target even when they are only required by \`Utils.cpp\`).
## Goal
Split \`Utils.{cpp,hpp}\` into a dedicated CMake library target so that:
- The utilities are compiled once and shared across consumers.
- Transitive dependencies (e.g., \`simdjson::simdjson\`) are declared in a single place.
- The dependency graph in \`components/core/src/clp_s/CMakeLists.txt\` is simplified.
## Background
Raised during review of PR `#2216` (https://github.com/y-scope/clp/pull/2216#discussion_r3132106691) by `@gibber9809`."
Unrecoverable error: Failed to clone repository
🏁 Script executed:
#!/bin/bash
gh issue create \
--repo y-scope/clp \
--assignee gibber9809 \
--title "refactor(clp-s): Split Utils.{cpp,hpp} into a separate library in the build" \
--body "## Overview
\`Utils.cpp\` and \`Utils.hpp\` in \`components/core/src/clp_s/\` contain shared utilities that are used by several clp-s libraries (e.g., \`clp_s_archive_reader\`, \`clp_s_archive_writer\`, \`clp_s_io\`). Currently these files are compiled into multiple targets, which leads to duplicated compilation and makes it harder to manage library-specific dependencies (e.g., transitive dependencies like \`simdjson::simdjson\` must be declared on each consumer target even when they are only required by \`Utils.cpp\`).
## Goal
Split \`Utils.{cpp,hpp}\` into a dedicated CMake library target so that:
- The utilities are compiled once and shared across consumers.
- Transitive dependencies (e.g., \`simdjson::simdjson\`) are declared in a single place.
- The dependency graph in \`components/core/src/clp_s/CMakeLists.txt\` is simplified.
## Background
Raised during review of PR `#2216` (https://github.com/y-scope/clp/pull/2216#discussion_r3132106691) by `@gibber9809`."
Length of output: 42
@gibber9809, agreed — simdjson::simdjson should be PRIVATE since it's only consumed by Utils.cpp and not exposed through any public header of clp_s_archive_reader.
I've created issue #2218 to track splitting Utils.{cpp,hpp} into a separate library: #2218
🧠 Learnings used
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1262
File: docs/src/dev-guide/components-core/index.md:50-50
Timestamp: 2025-08-25T07:27:19.449Z
Learning: In the CLP project, nlohmann_json version 3.11.3 is installed through their dependency management process, but they don't enforce version constraints in CMake find_package calls - they control versions through installation rather than CMake version pinning.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 2144
File: components/core/src/clp_s/search/ast/SearchUtils.cpp:10-11
Timestamp: 2026-03-29T20:05:47.160Z
Learning: In the y-scope/clp repository (e.g., `components/core/src/clp_s/search/ast/SearchUtils.cpp`), the include style convention is intentional: angle-bracket includes (e.g., `<clp_s/archive_constants.hpp>`) are used for headers that belong to external CMake targets, while quoted relative includes (e.g., `"ConvertToExists.hpp"`) are used for headers within the same CMake target. Do not flag this as an inconsistency.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 486
File: components/core/tests/test-error_handling.cpp:37-38
Timestamp: 2024-11-26T19:12:43.982Z
Learning: In the CLP project, C++20 is used, allowing the utilization of C++20 features such as class template argument deduction (CTAD) with `std::array`.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1184
File: taskfiles/deps/main.yaml:0-0
Timestamp: 2025-08-13T19:58:26.058Z
Learning: In the CLP project, custom CMake find modules (like FindLibLZMA.cmake) are designed to flexibly link against either shared or static libraries based on availability, rather than requiring both variants to be built. The find modules use flags like LibLZMA_USE_STATIC_LIBS and temporarily modify CMAKE_FIND_LIBRARY_SUFFIXES to prefer the desired library type while gracefully falling back if needed.
Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Learnt from: sitaowang1998
Repo: y-scope/clp PR: 1044
File: .github/workflows/clp-core-build-macos.yaml:0-0
Timestamp: 2025-06-27T01:49:15.724Z
Learning: In the clp project, manually setting LDFLAGS for library paths can cause runtime errors because it interferes with CMake's built-in library discovery and linking mechanisms. CMake can identify and correctly link to libraries on its own without requiring manual LDFLAGS configuration.
Learnt from: jackluo923
Repo: y-scope/clp PR: 1996
File: components/core/tools/packaging/common/bundle-libs.sh:49-49
Timestamp: 2026-02-15T04:15:30.839Z
Learning: In CLP packaging scripts (components/core/tools/packaging/common/bundle-libs.sh), the BINARIES array listing all CLP core binaries (clg, clo, clp, clp-s, indexer, log-converter, reducer-server) is intentionally explicit rather than dynamically derived. This design ensures that missing binaries fail the build immediately (fail-fast behavior) rather than being silently omitted from packages. The list is documented to be coupled with CMakeLists.txt and the RPM %files list for maintenance.
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Learnt from: gibber9809
Repo: y-scope/clp PR: 955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1124
File: components/core/cmake/Modules/FindBZip2.cmake:49-65
Timestamp: 2025-08-03T18:56:07.366Z
Learning: In CLP's custom CMake Find modules, `FindStaticLibraryDependencies` populates the `*_DYNAMIC_LIBS` variable even when building with static libraries. When the main library is static, all dependencies must also use static libraries - there's no fallback to shared libraries for dependencies.
| TARBALL_SHA256: "c240d4bffcccda4fe3a2bba2872718d96fc92e56d2615bfac4f9b2bad89a6386" | ||
| TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.2.tar.gz" |
There was a problem hiding this comment.
| TARBALL_SHA256: "c240d4bffcccda4fe3a2bba2872718d96fc92e56d2615bfac4f9b2bad89a6386" | |
| TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.2.tar.gz" | |
| TARBALL_SHA256: "bde0c42f43899c4c5c48be826c09abd22500ed537b89f16b3cced5eec477c263" | |
| TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.3.tar.gz" |
Might as well pull the newest version released a few days ago.
| | [msgpack-cxx](https://github.com/msgpack/msgpack-c/tree/cpp_master) | v7.0.0 | | ||
| | [nlohmann_json](https://github.com/nlohmann/json) | v3.11.3 | | ||
| | [simdjson](https://github.com/simdjson/simdjson) | v3.13.0 | | ||
| | [simdjson](https://github.com/simdjson/simdjson) | v4.6.2 | |
There was a problem hiding this comment.
| | [simdjson](https://github.com/simdjson/simdjson) | v4.6.2 | | |
| | [simdjson](https://github.com/simdjson/simdjson) | v4.6.3 | |
Per other comment.
|
I did some more benchmarking on regular JSON datasets, and I noticed that this actually makes serialization a bit slower on average. From profiling it seems like a lot of the regression comes from escaping keys -- right now we don't pre-escape keys at the leaf level, so for every field we end up calling into this escaping function, which appears to always do some heap allocation to create the string builder before doing the escaping. Since the previous implementation was non-allocating, and keys tend to not have any characters that require escaping (which means we don't benefit as much from simdjson's optimizations to the slow path), this ends up adding some meaningful overhead. With that in mind, could we try leaving the current escaping function as is, and creating a new class which holds an instance of |
| } | ||
|
|
||
| void StringUtils::escape_json_string(std::string& destination, std::string_view const source) { | ||
| simdjson::builtin::builder::string_builder json_string_builder{}; |
There was a problem hiding this comment.
| simdjson::builtin::builder::string_builder json_string_builder{}; | |
| simdjson::builder::string_builder json_string_builder{}; |
Didn't catch this the first time I looked -- seems like this is the class they want us to use, based on the simdjson docs.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/core/src/clp_s/Utils.cpp`:
- Around line 24-42: Add targeted unit tests for
SimdJsonStringEscaper::Implementation::escape to cover the simdjson fast path
and the scalar fallback: create tests that verify correct escaping for quotes,
backslashes, control characters (e.g., \n, \r, \t, 0x00-0x1F), multibyte UTF-8
sequences, and a failure scenario that forces m_builder
(simdjson::builder::string_builder) to return a non-SUCCESS status so the code
path calls StringUtils::escape_json_string; assert that the destination string
matches the expected JSON-escaped result in each case and include at least one
test that deliberately triggers the fallback to ensure behaviour parity with
StringUtils::escape_json_string.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 71cef448-2cf8-41a2-ac0b-5c7424e747ae
📒 Files selected for processing (9)
components/core/CMakeLists.txtcomponents/core/src/clp_s/CMakeLists.txtcomponents/core/src/clp_s/ColumnReader.cppcomponents/core/src/clp_s/ColumnReader.hppcomponents/core/src/clp_s/JsonSerializer.hppcomponents/core/src/clp_s/Utils.cppcomponents/core/src/clp_s/Utils.hppdocs/src/dev-docs/components-core/index.mdtaskfiles/deps/main.yaml
| class SimdJsonStringEscaper::Implementation { | ||
| public: | ||
| void escape(std::string& destination, std::string_view const source) { | ||
| m_builder.clear(); | ||
| m_builder.escape_and_append(source); | ||
|
|
||
| std::string_view escaped_source; | ||
| auto const status = m_builder.view().get(escaped_source); | ||
| if (simdjson::SUCCESS == status) { | ||
| destination.append(escaped_source); | ||
| return; | ||
| } | ||
|
|
||
| StringUtils::escape_json_string(destination, source); | ||
| } | ||
|
|
||
| private: | ||
| simdjson::builder::string_builder m_builder{}; | ||
| }; |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 🏗️ Heavy lift
Add targeted unit coverage for the new fast path and fallback.
This replaces a correctness-critical escaping primitive, but the validation called out in the PR is still dataset sanity checks. Please add focused cases for quotes, backslashes, control characters, multibyte UTF-8, and a failure path that exercises the scalar fallback so behaviour drift is caught early.
Also applies to: 220-232
🧰 Tools
🪛 Cppcheck (2.20.0)
[style] 29-29: The function 'get_curl_err' is never used.
(unusedFunction)
[style] 24-24: The function 'get_filename' is never used.
(unusedFunction)
[style] 26-26: The function 'get_line_number' is never used.
(unusedFunction)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/core/src/clp_s/Utils.cpp` around lines 24 - 42, Add targeted unit
tests for SimdJsonStringEscaper::Implementation::escape to cover the simdjson
fast path and the scalar fallback: create tests that verify correct escaping for
quotes, backslashes, control characters (e.g., \n, \r, \t, 0x00-0x1F), multibyte
UTF-8 sequences, and a failure scenario that forces m_builder
(simdjson::builder::string_builder) to return a non-SUCCESS status so the code
path calls StringUtils::escape_json_string; assert that the destination string
matches the expected JSON-escaped result in each case and include at least one
test that deliberately triggers the fallback to ensure behaviour parity with
StringUtils::escape_json_string.
gibber9809
left a comment
There was a problem hiding this comment.
Nice work! Mostly looking good now, just a few details.
| TARBALL_SHA256: "ad238fc578a4c18d3263250d0efaacadbc6a98f9c7b6be2e70867011b627fa57" | ||
| TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.0.0.tar.gz" |
There was a problem hiding this comment.
We can pull the newest version of simdjson here (4.6.4) even though we only require 4.0.0 in our cmake scripts -- cmake won't complain about pulling in newer versions of the library, and it's worthwhile for us to use a version with more bug fixes & optimizations in our own builds.
| | [msgpack-cxx](https://github.com/msgpack/msgpack-c/tree/cpp_master) | v7.0.0 | | ||
| | [nlohmann_json](https://github.com/nlohmann/json) | v3.11.3 | | ||
| | [simdjson](https://github.com/simdjson/simdjson) | v3.13.0 | | ||
| | [simdjson](https://github.com/simdjson/simdjson) | v4.0.0 | |
There was a problem hiding this comment.
Ditto, can update to 4.6.4.
| class SimdJsonStringEscaper { | ||
| public: | ||
| SimdJsonStringEscaper(); | ||
| SimdJsonStringEscaper(SimdJsonStringEscaper&&) noexcept; | ||
| ~SimdJsonStringEscaper(); | ||
|
|
||
| auto operator=(SimdJsonStringEscaper&&) noexcept -> SimdJsonStringEscaper&; | ||
|
|
||
| SimdJsonStringEscaper(SimdJsonStringEscaper const&) = delete; | ||
| auto operator=(SimdJsonStringEscaper const&) -> SimdJsonStringEscaper& = delete; | ||
|
|
||
| void escape(std::string& destination, std::string_view const source); | ||
|
|
||
| private: | ||
| class Implementation; | ||
| std::unique_ptr<Implementation> m_implementation; | ||
| }; |
There was a problem hiding this comment.
Let's change this class to own the simdjson::builder::string_builder + implement the escaping function directly. I think using this forward declaration to avoid pulling the simdjson header into this file is unnecessary.
| SimdJsonStringEscaper(); | ||
| SimdJsonStringEscaper(SimdJsonStringEscaper&&) noexcept; | ||
| ~SimdJsonStringEscaper(); | ||
|
|
||
| auto operator=(SimdJsonStringEscaper&&) noexcept -> SimdJsonStringEscaper&; | ||
|
|
||
| SimdJsonStringEscaper(SimdJsonStringEscaper const&) = delete; | ||
| auto operator=(SimdJsonStringEscaper const&) -> SimdJsonStringEscaper& = delete; |
There was a problem hiding this comment.
Since we're just explicitly defaulting/deleting these constructors and assignment operators we should do that all in this header instead of leaving some of it for the cpp file.
As well, we have a standard set of docstrings we use for different combinations of deleted/defaulted constructors that we should be using here as well. For example, take a look at components/core/src/clp/ffi/ir_stream/KvIrDeserializerImpl.hpp.
| Boost::url | ||
| clp_s::clp_dependencies | ||
| fmt::fmt | ||
| simdjson::simdjson |
There was a problem hiding this comment.
Related to other comment about not needing to avoid pulling simdjson into that header, we'll also need to move simdjson into public linkage here.
| extract_string_value_into_buffer(cur_message, buffer); | ||
| } | ||
|
|
||
| virtual auto extract_escaped_string_value_into_buffer( |
There was a problem hiding this comment.
LGTM after the most recent changes! Also confirmed on my side that the benchmark numbers look good on some of the other JSON datasets. For PR title how about:
refactor(clp-s): Use `simdjson::builder::string_builder` for JSON string escaping; Bump simdjson version to 4.6.4.
Could you also do a quick pass on the PR description to make sure it describes the final set of changes? I think some little details like version numbers need to be updated.
simdjson::builder::string_builder for JSON string escaping; Bump simdjson version to 4.6.4.
LinZhihao-723
left a comment
There was a problem hiding this comment.
Took a quick skim and lgtm overall.
Description
This PR updates the JSON string escaping path used during
clp-sJSON marshalling.Main changes:
v3.13.0tov4.6.4.components/core/CMakeLists.txtcomponents/core/src/clp_s/CMakeLists.txttaskfiles/deps/main.yamldocs/src/dev-docs/components-core/index.mdsimdjson >= 4.0.0at configure time.SimdJsonStringEscaperbacked bysimdjson::builder::string_builder.StringUtils::escape_json_stringimplementation and use it as a fallback if the simdjson-based path fails.Why this change:
StringUtils::escape_json_stringshows up as a significant hotspot when marshalling string-heavy datasets.simdjson::builder::string_builderavoids per-call builder construction overhead while still using simdjson’s optimized escaping implementation.Checklist
breaking change.
Validation performed
hive-24hrand MongoDB datasets.hive-24hrdecompression on my side:22.1s9.2shive-24hrdecompression path and confirmed that JSON string escaping is no longer a dominant hotspot after the change.Summary by CodeRabbit