Skip to content

fix(json): don't crash on non-ASCII std::wstring / std::filesystem::path#668

Merged
liuzicheng1987 merged 3 commits into
getml:mainfrom
mrgyatso:fix/wstring-filepath-utf8-transcoding
May 15, 2026
Merged

fix(json): don't crash on non-ASCII std::wstring / std::filesystem::path#668
liuzicheng1987 merged 3 commits into
getml:mainfrom
mrgyatso:fix/wstring-filepath-utf8-transcoding

Conversation

@mrgyatso
Copy link
Copy Markdown
Contributor

Summary

A struct field of type std::wstring or std::filesystem::path containing any non-ASCII
character crashes rfl::json::read / rfl::json::write with std::terminate. This is the
crash behind #421 and #422. Both were reported from Windows, but the crash is not
platform-specific: it reproduces on Linux / glibc / g++ just as readily.

Minimal repro (Linux, default locale):

struct S { std::wstring w; };
rfl::json::write(S{ .w = L"é" });     // std::length_error, then std::terminate
rfl::json::read<S>(R"({"w":"é"})");   // std::length_error, then std::terminate (read is noexcept)

Root cause

Parser_wstring.hpp transcodes via std::mbsrtowcs / std::wcsrtombs and passes the
result straight into resize():

auto len = std::mbsrtowcs(outStr.data(), &ptr, val.size(), &state);
outStr.resize(len);   // len can be (size_t)-1

std::mbsrtowcs / std::wcsrtombs return (size_t)-1 when the input is not valid in the
current C locale's LC_CTYPE encoding. The default "C" locale is ASCII-only, so any
non-ASCII byte makes len equal to (size_t)-1, which becomes resize(SIZE_MAX) and
throws std::length_error. read is declared noexcept, so that becomes an immediate
std::terminate. Parser_filepath.hpp has the same locale dependency through
path::string() and the narrow path constructor.

The fix: two independent commits

Commit 1 serializes std::filesystem::path as UTF-8. Parser_filepath now uses
path::u8string() and construction from std::u8string instead of the locale-dependent
path::string() and narrow constructor. JSON is UTF-8 (RFC 8259), so this conversion is
well-defined and locale-independent. It uses the standard library only, about one
substantive line per function.

Commit 2 converts std::wstring with an explicit UTF-8 codec. It adds a small header-only
UTF-8 to wchar_t codec that handles both wchar_t widths (UTF-32 on Linux and macOS,
UTF-16 on Windows) and rejects malformed input rather than crashing, then rewires
Parser_wstring to use it. read now returns an rfl::Error on malformed UTF-8, and
write produces correct UTF-8. The parser itself gets shorter, since the locale juggling
is gone.

Commit 1 stands on its own, and commit 2 is independent of it.

Also fixes a crash that is not in either issue

#421 and #422 are about write and save. The read path crashes on the same input,
and because read is declared noexcept, there is no rfl::Error path and no
try/catch a caller can use: it is an unconditional std::terminate. Neither issue
mentions this. Both commits fix read and write.

A note on scope

I read through the #421 discussion. The genuinely hard part there, auto-detecting the
encoding of a std::string, is ambiguous on Windows, and this PR does not touch it. But
std::wstring and std::filesystem::path are not encoding-ambiguous (they are
well-defined UTF-16 / UTF-32, and the standard's defined path encoding), so a correct,
locale-independent conversion is well-defined, which is what commit 2 does.

If you would rather not take on the codec in commit 2, I am happy to reduce it to just
guarding the (size_t)-1 return, so read returns an rfl::Error and write emits an
empty string. That kills the crash without adding conversion logic. I led with the full
fix because it actually makes the types usable, but it is your call.

Tests

Regression tests added to tests/json/test_wstring.cpp and tests/json/test_filepath.cpp,
covering round-trips with Latin-1, Cyrillic, CJK, Hiragana, and astral-plane content.

Closes #422. Refs #421: this fully addresses its std::wstring and std::filesystem::path
cases. The separate std::string codepage question raised in that thread is intentionally
left out of scope.

mrgyatso added 2 commits May 14, 2026 15:35
…e locale

path::string() and the narrow std::filesystem::path constructor are
locale-dependent and crash or mangle non-ASCII paths in the default C
locale. JSON is UTF-8 (RFC 8259), so go through path::u8string() and
construction from std::u8string instead.

Refs getml#421.
Parser_wstring transcoded via std::mbsrtowcs / std::wcsrtombs and passed
the result straight into resize(). Those functions return (size_t)-1 on
input not valid in the current C locale, which becomes resize(SIZE_MAX)
and throws std::length_error; since read() is noexcept this is an
immediate std::terminate.

Replace the locale-dependent calls with an explicit, locale-independent
UTF-8 to wchar_t codec (rfl::internal::strings, header-only). read() now
returns an rfl::Error on malformed UTF-8; write() produces correct UTF-8.

Refs getml#421, getml#422.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces locale-independent UTF-8 conversion utilities to ensure robust handling of non-ASCII characters in std::wstring and std::filesystem::path during JSON serialization, replacing problematic locale-dependent C functions. The changes include a new internal header for UTF-8 decoding and encoding, along with regression tests for non-ASCII paths and strings. Review feedback focused on performance and safety optimizations, such as pre-allocating memory for decoded strings, using iterator-based constructors to avoid temporary allocations, and employing move semantics to prevent unnecessary copies. Additionally, it was recommended to wrap string conversions in try-catch blocks to handle potential memory allocation failures gracefully and maintain consistency with existing error handling patterns.

/// input is not well-formed UTF-8 (truncated, overlong, bad continuation byte,
/// surrogate code point, or out of range).
inline std::optional<std::u32string> decode_utf8(const std::string& _str) {
std::u32string out;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider using out.reserve(_str.size()) to avoid multiple reallocations during decoding. Since each UTF-8 byte represents at most one Unicode code point, the input string size is a safe upper bound for the number of code points.

Suggested change
std::u32string out;
std::u32string out;
out.reserve(_str.size());

Comment on lines +32 to +33
return std::filesystem::path(std::u8string(
reinterpret_cast<const char8_t*>(_str.data()), _str.size()));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using the iterator-based constructor of std::filesystem::path with char8_t* avoids the creation of a temporary std::u8string object and its associated heap allocation.

Suggested change
return std::filesystem::path(std::u8string(
reinterpret_cast<const char8_t*>(_str.data()), _str.size()));
const auto* ptr = reinterpret_cast<const char8_t*>(_str.data());
return std::filesystem::path(ptr, ptr + _str.size());

Comment on lines +45 to +51
auto outStr = internal::strings::utf8_to_wstring(inStr.value());
if (!outStr) {
return Result<std::wstring>(
error("Could not parse the string: it is not valid UTF-8."));
}

return Result<std::wstring>(outStr);
return Result<std::wstring>(std::move(*outStr));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This function is marked noexcept, but the string allocations and conversions inside utf8_to_wstring can throw std::bad_alloc. To prevent an unexpected std::terminate, consider wrapping the conversion in a try-catch block and returning a Result::error instead, which is consistent with the error handling pattern used in Parser_filepath.hpp.

    try {
      auto outStr = internal::strings::utf8_to_wstring(inStr.value());
      if (!outStr) {
        return error("Could not parse the string: it is not valid UTF-8.");
      }
      return std::move(*outStr);
    } catch (std::exception& e) {
      return error(e.what());
    }

Comment on lines +68 to +69
const auto outStr = internal::strings::wstring_to_utf8(_str);
ParentType::add_value(_w, outStr.value_or(std::string()), _parent);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation performs an unnecessary copy of the string when outStr has a value. By removing const and using std::move(outStr).value_or(...), you can transfer ownership of the internal string to the writer.

Suggested change
const auto outStr = internal::strings::wstring_to_utf8(_str);
ParentType::add_value(_w, outStr.value_or(std::string()), _parent);
auto outStr = internal::strings::wstring_to_utf8(_str);
ParentType::add_value(_w, std::move(outStr).value_or(std::string()), _parent);

@mrgyatso mrgyatso marked this pull request as ready for review May 14, 2026 16:05
@liuzicheng1987
Copy link
Copy Markdown
Contributor

@mrgyatso , thanks for the contribution. This looks great. I was about to do the 0.25.0 release, but I want to squeeze your commit in, so I will delay it for another day.

@liuzicheng1987
Copy link
Copy Markdown
Contributor

@mrgyatso it appears your solution does not work for Windows/MSVC. Could you take a look?

@mrgyatso
Copy link
Copy Markdown
Contributor Author

Ah yeah I overlooked that. I'll check it out tonight. Give me a bit.

MSVC reads source files using the system code page unless told otherwise,
so non-ASCII characters in the new wstring/filepath regression tests' wide
and u8 literals were mis-decoded before any library code ran, failing
json.test_wstring_non_ascii and json.test_filepath_non_ascii on the
windows-msvc CI job. /utf-8 makes MSVC treat the sources as UTF-8.
Library code is unchanged.
@mrgyatso
Copy link
Copy Markdown
Contributor Author

@liuzicheng1987 Sorry I took so long, just got home a bit ago. Pushed the fix in 8657c32. Library code is unchanged; this was a test-build issue. MSVC was reading the test sources using the system code page, so non-ASCII in the new L"..." / u8"..." literals was mis-decoded before the codec ran (expected \xC3\xA9, got \xC3\x83\xC2\xA9). The fix is /utf-8 in tests/CMakeLists.txt. Verified locally on VS 2022 / MSVC 19.50: without it the two non_ascii tests reproduce the CI failure, with it the full suite passes 340/340.

Can narrow /utf-8 to tests/json/CMakeLists.txt if you'd prefer.

@mrgyatso
Copy link
Copy Markdown
Contributor Author

mrgyatso commented May 15, 2026

Looks like this one failed due to a 502 error upstream with vcpkg, unrelated to the fix. Should work fine on re-run.

@liuzicheng1987
Copy link
Copy Markdown
Contributor

@mrgyatso , yes, unfortunately the build pipeline is a bit unstable. We have too many things running at the same time, so the vcpkg servers tend to block us. It's a known issue and clearly has nothing to do with your PR.

@liuzicheng1987 liuzicheng1987 merged commit 53530bc into getml:main May 15, 2026
330 of 331 checks passed
@liuzicheng1987
Copy link
Copy Markdown
Contributor

@mrgyatso , I have merged this. Thankl you for your contribution!

@mrgyatso
Copy link
Copy Markdown
Contributor Author

mrgyatso commented May 15, 2026

@mrgyatso , I have merged this. Thankl you for your contribution!

@liuzicheng1987
Happy to help! I sent you a request on LinkedIn by the way. Let's stay connected. Hope to build with you in the future. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

crash on rfl::json::save() while serializing struct with std::wstring containing cyrilic and japan chars

2 participants