Skip to content

refactor(clp-s): Use simdjson::builder::string_builder for JSON string escaping; Bump simdjson version to 4.6.4.#2216

Merged
gibber9809 merged 7 commits into
y-scope:mainfrom
ShangDanLuXian:simdjson_marshalling
May 21, 2026
Merged

refactor(clp-s): Use simdjson::builder::string_builder for JSON string escaping; Bump simdjson version to 4.6.4.#2216
gibber9809 merged 7 commits into
y-scope:mainfrom
ShangDanLuXian:simdjson_marshalling

Conversation

@ShangDanLuXian
Copy link
Copy Markdown
Contributor

@ShangDanLuXian ShangDanLuXian commented Apr 22, 2026

Description

This PR updates the JSON string escaping path used during clp-s JSON marshalling.

Main changes:

  • Bump the simdjson dependency used by our local deps flow from v3.13.0 to v4.6.4.
  • Update build/dependency wiring for simdjson:
    • components/core/CMakeLists.txt
    • components/core/src/clp_s/CMakeLists.txt
    • taskfiles/deps/main.yaml
    • docs/src/dev-docs/components-core/index.md
  • Require simdjson >= 4.0.0 at configure time.
  • Add a reusable SimdJsonStringEscaper backed by simdjson::builder::string_builder.
  • Update the marshalling path to use the reusable simdjson-backed escaper for string values.
  • Keep the existing scalar StringUtils::escape_json_string implementation and use it as a fallback if the simdjson-based path fails.

Why this change:

  • StringUtils::escape_json_string shows up as a significant hotspot when marshalling string-heavy datasets.
  • Reusing a simdjson::builder::string_builder avoids per-call builder construction overhead while still using simdjson’s optimized escaping implementation.
  • Keeping the existing scalar implementation as a fallback preserves the previous behavior if the simdjson-based escape path returns an error.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Performed output sanity checks on hive-24hr and MongoDB datasets.
  • Compared scalar vs simdjson-backed outputs for representative queries and confirmed matching results.
  • Benchmarked hive-24hr decompression on my side:
    • scalar average: about 22.1s
    • simdjson average: about 9.2s
  • Generated flame graphs for the hive-24hr decompression path and confirmed that JSON string escaping is no longer a dominant hotspot after the change.

Summary by CodeRabbit

  • Chores
    • Require simdjson >= 4.0.0 at configure time and updated dependency download metadata to v4.6.4.
  • Enhancements
    • Added a simdjson-backed JSON string escaper and switched serializers/readers to use it, improving JSON escaping and output reliability.
    • Ensured string-processing components link against simdjson when enabled.
  • Documentation
    • Updated components-core dependency table to reflect the simdjson version bump.

Review Change Stack

@ShangDanLuXian ShangDanLuXian requested review from a team and gibber9809 as code owners April 22, 2026 16:35
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 97e7e76a-928a-41c3-93d1-bb2b49ac74ab

📥 Commits

Reviewing files that changed from the base of the PR and between 3f82489 and ca99882.

📒 Files selected for processing (1)
  • components/core/src/clp_s/CMakeLists.txt

Walkthrough

Require simdjson >= 4.0.0 and update pins/docs; add move-only clp_s::SimdJsonStringEscaper (header + impl using simdjson with fallback); extend BaseColumnReader API and implementations to accept the escaper; wire an escaper member into JsonSerializer and pass it when emitting quoted values.

Changes

Core simdjson integration and escaping

Layer / File(s) Summary
Build and dependency pins
components/core/CMakeLists.txt, components/core/src/clp_s/CMakeLists.txt, docs/src/dev-docs/components-core/index.md, taskfiles/deps/main.yaml
Require simdjson >= 4.0.0, insert/reorder simdjson::simdjson into clp_s targets, update docs dependency version and pinned tarball SHA/URL.
SimdJsonStringEscaper declaration
components/core/src/clp_s/Utils.hpp
Add move-only clp_s::SimdJsonStringEscaper type with void escape(std::string& destination, std::string_view const source); and include <simdjson.h>.
SimdJsonStringEscaper implementation
components/core/src/clp_s/Utils.cpp
Include simdjson.h and implement SimdJsonStringEscaper::escape using m_builder.clear(), escape_and_append, and view().get with fallback to StringUtils::escape_json_string.
ColumnReader API updates
components/core/src/clp_s/ColumnReader.hpp
Add virtual overload extract_escaped_string_value_into_buffer(uint64_t, std::string&, SimdJsonStringEscaper&) to BaseColumnReader and corresponding overrides in ClpStringColumnReader and VariableStringColumnReader.
ColumnReader runtime changes
components/core/src/clp_s/ColumnReader.cpp
Update implementations to accept SimdJsonStringEscaper& and call escaper.escape(...) for non-array escaping paths instead of StringUtils::escape_json_string.
JsonSerializer wiring
components/core/src/clp_s/JsonSerializer.hpp
Add SimdJsonStringEscaper m_string_value_escaper member and pass it into extract_escaped_string_value_into_buffer when building quoted JSON values.

Sequence Diagram(s)

sequenceDiagram
  participant components_core_CMake
  participant clp_s_CMake
  participant Taskfile
  participant Docs
  components_core_CMake->>clp_s_CMake: require simdjson >= 4.0.0 and add link
  clp_s_CMake->>clp_s_CMake: insert simdjson::simdjson into targets
  components_core_CMake->>Taskfile: update pinned simdjson tarball
  Taskfile->>Docs: bump docs deps entry to v4.6.4
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • gibber9809
  • LinZhihao-723
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: refactoring JSON string escaping to use simdjson's builder and bumping the simdjson version to 4.6.4, which are the primary objectives of this pull request.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread components/core/src/clp_s/Utils.cpp Outdated
return;
}

// Preserve the original implementation in case simdjson fails.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we may remove it completely?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I would just remove the original implementation, and throw if the simdjson function fails.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per other review comment -- should probably keep it actually, and create a new class for this new implementation.

Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Just a few things we should change.

Comment thread components/core/CMakeLists.txt Outdated

if(CLP_NEED_SIMDJSON)
find_package(simdjson REQUIRED)
find_package(simdjson 4.6.2 REQUIRED)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
find_package(simdjson 4.6.2 REQUIRED)
find_package(simdjson 4.3.0 REQUIRED)

To make it easier to integrate with other projects we should probably choose the earliest version that serves our needs -- it seems like 4.3.0 is the first version with the string escaping optimizations, so let's use that as our required version.

Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, sorry it might be better to require 4.0.0, which introduces the string builder API. Reason being that it seems like the version of velox integrate with currently pulls in simdjson 4.1.0, so requiring a newer version than that would cause problems.

clp_s::timestamp_pattern
msgpack-cxx
nlohmann_json::nlohmann_json
simdjson::simdjson
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be PRIVATE, since it's transitive through Utils.cpp, not through a header file.

We should probably also split out Utils.{cpp,hpp} into a library in our build since those utilities are used by a few of our libraries, but I won't make you do that as part of this PR. @coderabbitai, could you create an issue to track splitting off these utilities into a separate library in our build?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

@gibber9809, agreed — simdjson::simdjson should be PRIVATE here since the dependency is only used within Utils.cpp and is not exposed through any public header of clp_s_archive_reader.

I've created a GitHub issue to track splitting Utils.{cpp,hpp} into a dedicated library. I'll share the link once the script completes above.


🏁 Script executed:

#!/bin/bash
gh issue create \
  --repo y-scope/clp \
  --assignee gibber9809 \
  --title "refactor(clp-s): Split Utils.{cpp,hpp} into a separate library in the build" \
  --body "## Overview

\`Utils.cpp\` and \`Utils.hpp\` in \`components/core/src/clp_s/\` contain shared utilities that are used by several clp-s libraries (e.g., \`clp_s_archive_reader\`, \`clp_s_archive_writer\`, \`clp_s_io\`). Currently these files are compiled into multiple targets, which leads to duplicated compilation and makes it harder to manage library-specific dependencies (e.g., transitive dependencies like \`simdjson::simdjson\` must be declared on each consumer target even when they are only required by \`Utils.cpp\`).

## Goal

Split \`Utils.{cpp,hpp}\` into a dedicated CMake library target so that:
- The utilities are compiled once and shared across consumers.
- Transitive dependencies (e.g., \`simdjson::simdjson\`) are declared in a single place.
- The dependency graph in \`components/core/src/clp_s/CMakeLists.txt\` is simplified.

## Background

Raised during review of PR `#2216` (https://github.com/y-scope/clp/pull/2216#discussion_r3132106691) by `@gibber9809`."

Unrecoverable error: Failed to clone repository


🏁 Script executed:

#!/bin/bash
gh issue create \
  --repo y-scope/clp \
  --assignee gibber9809 \
  --title "refactor(clp-s): Split Utils.{cpp,hpp} into a separate library in the build" \
  --body "## Overview

\`Utils.cpp\` and \`Utils.hpp\` in \`components/core/src/clp_s/\` contain shared utilities that are used by several clp-s libraries (e.g., \`clp_s_archive_reader\`, \`clp_s_archive_writer\`, \`clp_s_io\`). Currently these files are compiled into multiple targets, which leads to duplicated compilation and makes it harder to manage library-specific dependencies (e.g., transitive dependencies like \`simdjson::simdjson\` must be declared on each consumer target even when they are only required by \`Utils.cpp\`).

## Goal

Split \`Utils.{cpp,hpp}\` into a dedicated CMake library target so that:
- The utilities are compiled once and shared across consumers.
- Transitive dependencies (e.g., \`simdjson::simdjson\`) are declared in a single place.
- The dependency graph in \`components/core/src/clp_s/CMakeLists.txt\` is simplified.

## Background

Raised during review of PR `#2216` (https://github.com/y-scope/clp/pull/2216#discussion_r3132106691) by `@gibber9809`."

Length of output: 42


@gibber9809, agreed — simdjson::simdjson should be PRIVATE since it's only consumed by Utils.cpp and not exposed through any public header of clp_s_archive_reader.

I've created issue #2218 to track splitting Utils.{cpp,hpp} into a separate library: #2218


🧠 Learnings used
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1262
File: docs/src/dev-guide/components-core/index.md:50-50
Timestamp: 2025-08-25T07:27:19.449Z
Learning: In the CLP project, nlohmann_json version 3.11.3 is installed through their dependency management process, but they don't enforce version constraints in CMake find_package calls - they control versions through installation rather than CMake version pinning.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 2144
File: components/core/src/clp_s/search/ast/SearchUtils.cpp:10-11
Timestamp: 2026-03-29T20:05:47.160Z
Learning: In the y-scope/clp repository (e.g., `components/core/src/clp_s/search/ast/SearchUtils.cpp`), the include style convention is intentional: angle-bracket includes (e.g., `<clp_s/archive_constants.hpp>`) are used for headers that belong to external CMake targets, while quoted relative includes (e.g., `"ConvertToExists.hpp"`) are used for headers within the same CMake target. Do not flag this as an inconsistency.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 486
File: components/core/tests/test-error_handling.cpp:37-38
Timestamp: 2024-11-26T19:12:43.982Z
Learning: In the CLP project, C++20 is used, allowing the utilization of C++20 features such as class template argument deduction (CTAD) with `std::array`.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1184
File: taskfiles/deps/main.yaml:0-0
Timestamp: 2025-08-13T19:58:26.058Z
Learning: In the CLP project, custom CMake find modules (like FindLibLZMA.cmake) are designed to flexibly link against either shared or static libraries based on availability, rather than requiring both variants to be built. The find modules use flags like LibLZMA_USE_STATIC_LIBS and temporarily modify CMAKE_FIND_LIBRARY_SUFFIXES to prefer the desired library type while gracefully falling back if needed.

Learnt from: gibber9809
Repo: y-scope/clp PR: 630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Learnt from: sitaowang1998
Repo: y-scope/clp PR: 1044
File: .github/workflows/clp-core-build-macos.yaml:0-0
Timestamp: 2025-06-27T01:49:15.724Z
Learning: In the clp project, manually setting LDFLAGS for library paths can cause runtime errors because it interferes with CMake's built-in library discovery and linking mechanisms. CMake can identify and correctly link to libraries on its own without requiring manual LDFLAGS configuration.

Learnt from: jackluo923
Repo: y-scope/clp PR: 1996
File: components/core/tools/packaging/common/bundle-libs.sh:49-49
Timestamp: 2026-02-15T04:15:30.839Z
Learning: In CLP packaging scripts (components/core/tools/packaging/common/bundle-libs.sh), the BINARIES array listing all CLP core binaries (clg, clo, clp, clp-s, indexer, log-converter, reducer-server) is intentionally explicit rather than dynamically derived. This design ensures that missing binaries fail the build immediately (fail-fast behavior) rather than being silently omitted from packages. The list is documented to be coupled with CMakeLists.txt and the RPM %files list for maintenance.

Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Learnt from: gibber9809
Repo: y-scope/clp PR: 955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1124
File: components/core/cmake/Modules/FindBZip2.cmake:49-65
Timestamp: 2025-08-03T18:56:07.366Z
Learning: In CLP's custom CMake Find modules, `FindStaticLibraryDependencies` populates the `*_DYNAMIC_LIBS` variable even when building with static libraries. When the main library is static, all dependencies must also use static libraries - there's no fallback to shared libraries for dependencies.

Comment thread taskfiles/deps/main.yaml Outdated
Comment on lines +486 to +487
TARBALL_SHA256: "c240d4bffcccda4fe3a2bba2872718d96fc92e56d2615bfac4f9b2bad89a6386"
TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.2.tar.gz"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TARBALL_SHA256: "c240d4bffcccda4fe3a2bba2872718d96fc92e56d2615bfac4f9b2bad89a6386"
TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.2.tar.gz"
TARBALL_SHA256: "bde0c42f43899c4c5c48be826c09abd22500ed537b89f16b3cced5eec477c263"
TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.6.3.tar.gz"

Might as well pull the newest version released a few days ago.

| [msgpack-cxx](https://github.com/msgpack/msgpack-c/tree/cpp_master) | v7.0.0 |
| [nlohmann_json](https://github.com/nlohmann/json) | v3.11.3 |
| [simdjson](https://github.com/simdjson/simdjson) | v3.13.0 |
| [simdjson](https://github.com/simdjson/simdjson) | v4.6.2 |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| [simdjson](https://github.com/simdjson/simdjson) | v4.6.2 |
| [simdjson](https://github.com/simdjson/simdjson) | v4.6.3 |

Per other comment.

@gibber9809
Copy link
Copy Markdown
Contributor

I did some more benchmarking on regular JSON datasets, and I noticed that this actually makes serialization a bit slower on average.

From profiling it seems like a lot of the regression comes from escaping keys -- right now we don't pre-escape keys at the leaf level, so for every field we end up calling into this escaping function, which appears to always do some heap allocation to create the string builder before doing the escaping. Since the previous implementation was non-allocating, and keys tend to not have any characters that require escaping (which means we don't benefit as much from simdjson's optimizations to the slow path), this ends up adding some meaningful overhead.

With that in mind, could we try leaving the current escaping function as is, and creating a new class which holds an instance of simdjson::string_builder w/ an escaping function that reuses that string builder instance? That way we can avoid the allocations on every invocation of the escaping function by just passing around an instance of that new class. It may also be worth testing whether we get better decompression speed if we only use the simdjson-based implementation for string values, or if it makes sense to use it to escape both keys and values.

Comment thread components/core/src/clp_s/Utils.cpp Outdated
}

void StringUtils::escape_json_string(std::string& destination, std::string_view const source) {
simdjson::builtin::builder::string_builder json_string_builder{};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
simdjson::builtin::builder::string_builder json_string_builder{};
simdjson::builder::string_builder json_string_builder{};

Didn't catch this the first time I looked -- seems like this is the class they want us to use, based on the simdjson docs.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/core/src/clp_s/Utils.cpp`:
- Around line 24-42: Add targeted unit tests for
SimdJsonStringEscaper::Implementation::escape to cover the simdjson fast path
and the scalar fallback: create tests that verify correct escaping for quotes,
backslashes, control characters (e.g., \n, \r, \t, 0x00-0x1F), multibyte UTF-8
sequences, and a failure scenario that forces m_builder
(simdjson::builder::string_builder) to return a non-SUCCESS status so the code
path calls StringUtils::escape_json_string; assert that the destination string
matches the expected JSON-escaped result in each case and include at least one
test that deliberately triggers the fallback to ensure behaviour parity with
StringUtils::escape_json_string.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 71cef448-2cf8-41a2-ac0b-5c7424e747ae

📥 Commits

Reviewing files that changed from the base of the PR and between 69c920d and f14b0e7.

📒 Files selected for processing (9)
  • components/core/CMakeLists.txt
  • components/core/src/clp_s/CMakeLists.txt
  • components/core/src/clp_s/ColumnReader.cpp
  • components/core/src/clp_s/ColumnReader.hpp
  • components/core/src/clp_s/JsonSerializer.hpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/Utils.hpp
  • docs/src/dev-docs/components-core/index.md
  • taskfiles/deps/main.yaml

Comment thread components/core/src/clp_s/Utils.cpp Outdated
Comment on lines +24 to +42
class SimdJsonStringEscaper::Implementation {
public:
void escape(std::string& destination, std::string_view const source) {
m_builder.clear();
m_builder.escape_and_append(source);

std::string_view escaped_source;
auto const status = m_builder.view().get(escaped_source);
if (simdjson::SUCCESS == status) {
destination.append(escaped_source);
return;
}

StringUtils::escape_json_string(destination, source);
}

private:
simdjson::builder::string_builder m_builder{};
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 🏗️ Heavy lift

Add targeted unit coverage for the new fast path and fallback.

This replaces a correctness-critical escaping primitive, but the validation called out in the PR is still dataset sanity checks. Please add focused cases for quotes, backslashes, control characters, multibyte UTF-8, and a failure path that exercises the scalar fallback so behaviour drift is caught early.

Also applies to: 220-232

🧰 Tools
🪛 Cppcheck (2.20.0)

[style] 29-29: The function 'get_curl_err' is never used.

(unusedFunction)


[style] 24-24: The function 'get_filename' is never used.

(unusedFunction)


[style] 26-26: The function 'get_line_number' is never used.

(unusedFunction)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/core/src/clp_s/Utils.cpp` around lines 24 - 42, Add targeted unit
tests for SimdJsonStringEscaper::Implementation::escape to cover the simdjson
fast path and the scalar fallback: create tests that verify correct escaping for
quotes, backslashes, control characters (e.g., \n, \r, \t, 0x00-0x1F), multibyte
UTF-8 sequences, and a failure scenario that forces m_builder
(simdjson::builder::string_builder) to return a non-SUCCESS status so the code
path calls StringUtils::escape_json_string; assert that the destination string
matches the expected JSON-escaped result in each case and include at least one
test that deliberately triggers the fallback to ensure behaviour parity with
StringUtils::escape_json_string.

Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Mostly looking good now, just a few details.

Comment thread taskfiles/deps/main.yaml Outdated
Comment on lines +486 to +487
TARBALL_SHA256: "ad238fc578a4c18d3263250d0efaacadbc6a98f9c7b6be2e70867011b627fa57"
TARBALL_URL: "https://github.com/simdjson/simdjson/archive/refs/tags/v4.0.0.tar.gz"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can pull the newest version of simdjson here (4.6.4) even though we only require 4.0.0 in our cmake scripts -- cmake won't complain about pulling in newer versions of the library, and it's worthwhile for us to use a version with more bug fixes & optimizations in our own builds.

| [msgpack-cxx](https://github.com/msgpack/msgpack-c/tree/cpp_master) | v7.0.0 |
| [nlohmann_json](https://github.com/nlohmann/json) | v3.11.3 |
| [simdjson](https://github.com/simdjson/simdjson) | v3.13.0 |
| [simdjson](https://github.com/simdjson/simdjson) | v4.0.0 |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, can update to 4.6.4.

Comment on lines +132 to +148
class SimdJsonStringEscaper {
public:
SimdJsonStringEscaper();
SimdJsonStringEscaper(SimdJsonStringEscaper&&) noexcept;
~SimdJsonStringEscaper();

auto operator=(SimdJsonStringEscaper&&) noexcept -> SimdJsonStringEscaper&;

SimdJsonStringEscaper(SimdJsonStringEscaper const&) = delete;
auto operator=(SimdJsonStringEscaper const&) -> SimdJsonStringEscaper& = delete;

void escape(std::string& destination, std::string_view const source);

private:
class Implementation;
std::unique_ptr<Implementation> m_implementation;
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change this class to own the simdjson::builder::string_builder + implement the escaping function directly. I think using this forward declaration to avoid pulling the simdjson header into this file is unnecessary.

Comment thread components/core/src/clp_s/Utils.hpp Outdated
Comment on lines +134 to +141
SimdJsonStringEscaper();
SimdJsonStringEscaper(SimdJsonStringEscaper&&) noexcept;
~SimdJsonStringEscaper();

auto operator=(SimdJsonStringEscaper&&) noexcept -> SimdJsonStringEscaper&;

SimdJsonStringEscaper(SimdJsonStringEscaper const&) = delete;
auto operator=(SimdJsonStringEscaper const&) -> SimdJsonStringEscaper& = delete;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're just explicitly defaulting/deleting these constructors and assignment operators we should do that all in this header instead of leaving some of it for the cpp file.

As well, we have a standard set of docstrings we use for different combinations of deleted/defaulted constructors that we should be using here as well. For example, take a look at components/core/src/clp/ffi/ir_stream/KvIrDeserializerImpl.hpp.

Boost::url
clp_s::clp_dependencies
fmt::fmt
simdjson::simdjson
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to other comment about not needing to avoid pulling simdjson into that header, we'll also need to move simdjson into public linkage here.

extract_string_value_into_buffer(cur_message, buffer);
}

virtual auto extract_escaped_string_value_into_buffer(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing docstring.

@gibber9809 gibber9809 self-requested a review May 21, 2026 18:40
Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after the most recent changes! Also confirmed on my side that the benchmark numbers look good on some of the other JSON datasets. For PR title how about:

refactor(clp-s): Use `simdjson::builder::string_builder` for JSON string escaping; Bump simdjson version to 4.6.4.

Could you also do a quick pass on the PR description to make sure it describes the final set of changes? I think some little details like version numbers need to be updated.

@ShangDanLuXian ShangDanLuXian changed the title refactor(clp-s): Use simdjson string_builder for JSON string escaping in marshalling. refactor(clp-s): Use simdjson::builder::string_builder for JSON string escaping; Bump simdjson version to 4.6.4. May 21, 2026
Copy link
Copy Markdown
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick skim and lgtm overall.

@gibber9809 gibber9809 merged commit 85eaa70 into y-scope:main May 21, 2026
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants