fix(json): don't crash on non-ASCII std::wstring / std::filesystem::path by mrgyatso · Pull Request #668 · getml/reflect-cpp

mrgyatso · 2026-05-14T15:37:11Z

Summary

A struct field of type std::wstring or std::filesystem::path containing any non-ASCII
character crashes rfl::json::read / rfl::json::write with std::terminate. This is the
crash behind #421 and #422. Both were reported from Windows, but the crash is not
platform-specific: it reproduces on Linux / glibc / g++ just as readily.

Minimal repro (Linux, default locale):

struct S { std::wstring w; };
rfl::json::write(S{ .w = L"é" });     // std::length_error, then std::terminate
rfl::json::read<S>(R"({"w":"é"})");   // std::length_error, then std::terminate (read is noexcept)

Root cause

Parser_wstring.hpp transcodes via std::mbsrtowcs / std::wcsrtombs and passes the
result straight into resize():

auto len = std::mbsrtowcs(outStr.data(), &ptr, val.size(), &state);
outStr.resize(len);   // len can be (size_t)-1

std::mbsrtowcs / std::wcsrtombs return (size_t)-1 when the input is not valid in the
current C locale's LC_CTYPE encoding. The default "C" locale is ASCII-only, so any
non-ASCII byte makes len equal to (size_t)-1, which becomes resize(SIZE_MAX) and
throws std::length_error. read is declared noexcept, so that becomes an immediate
std::terminate. Parser_filepath.hpp has the same locale dependency through
path::string() and the narrow path constructor.

The fix: two independent commits

Commit 1 serializes std::filesystem::path as UTF-8. Parser_filepath now uses
path::u8string() and construction from std::u8string instead of the locale-dependent
path::string() and narrow constructor. JSON is UTF-8 (RFC 8259), so this conversion is
well-defined and locale-independent. It uses the standard library only, about one
substantive line per function.

Commit 2 converts std::wstring with an explicit UTF-8 codec. It adds a small header-only
UTF-8 to wchar_t codec that handles both wchar_t widths (UTF-32 on Linux and macOS,
UTF-16 on Windows) and rejects malformed input rather than crashing, then rewires
Parser_wstring to use it. read now returns an rfl::Error on malformed UTF-8, and
write produces correct UTF-8. The parser itself gets shorter, since the locale juggling
is gone.

Commit 1 stands on its own, and commit 2 is independent of it.

Also fixes a crash that is not in either issue

#421 and #422 are about write and save. The read path crashes on the same input,
and because read is declared noexcept, there is no rfl::Error path and no
try/catch a caller can use: it is an unconditional std::terminate. Neither issue
mentions this. Both commits fix read and write.

A note on scope

I read through the #421 discussion. The genuinely hard part there, auto-detecting the
encoding of a std::string, is ambiguous on Windows, and this PR does not touch it. But
std::wstring and std::filesystem::path are not encoding-ambiguous (they are
well-defined UTF-16 / UTF-32, and the standard's defined path encoding), so a correct,
locale-independent conversion is well-defined, which is what commit 2 does.

If you would rather not take on the codec in commit 2, I am happy to reduce it to just
guarding the (size_t)-1 return, so read returns an rfl::Error and write emits an
empty string. That kills the crash without adding conversion logic. I led with the full
fix because it actually makes the types usable, but it is your call.

Tests

Regression tests added to tests/json/test_wstring.cpp and tests/json/test_filepath.cpp,
covering round-trips with Latin-1, Cyrillic, CJK, Hiragana, and astral-plane content.

Closes #422. Refs #421: this fully addresses its std::wstring and std::filesystem::path
cases. The separate std::string codepage question raised in that thread is intentionally
left out of scope.

…e locale path::string() and the narrow std::filesystem::path constructor are locale-dependent and crash or mangle non-ASCII paths in the default C locale. JSON is UTF-8 (RFC 8259), so go through path::u8string() and construction from std::u8string instead. Refs getml#421.

Parser_wstring transcoded via std::mbsrtowcs / std::wcsrtombs and passed the result straight into resize(). Those functions return (size_t)-1 on input not valid in the current C locale, which becomes resize(SIZE_MAX) and throws std::length_error; since read() is noexcept this is an immediate std::terminate. Replace the locale-dependent calls with an explicit, locale-independent UTF-8 to wchar_t codec (rfl::internal::strings, header-only). read() now returns an rfl::Error on malformed UTF-8; write() produces correct UTF-8. Refs getml#421, getml#422.

gemini-code-assist

Code Review

This pull request introduces locale-independent UTF-8 conversion utilities to ensure robust handling of non-ASCII characters in std::wstring and std::filesystem::path during JSON serialization, replacing problematic locale-dependent C functions. The changes include a new internal header for UTF-8 decoding and encoding, along with regression tests for non-ASCII paths and strings. Review feedback focused on performance and safety optimizations, such as pre-allocating memory for decoded strings, using iterator-based constructors to avoid temporary allocations, and employing move semantics to prevent unnecessary copies. Additionally, it was recommended to wrap string conversions in try-catch blocks to handle potential memory allocation failures gracefully and maintain consistency with existing error handling patterns.

gemini-code-assist · 2026-05-14T15:40:50Z

+/// input is not well-formed UTF-8 (truncated, overlong, bad continuation byte,
+/// surrogate code point, or out of range).
+inline std::optional<std::u32string> decode_utf8(const std::string& _str) {
+  std::u32string out;


Consider using out.reserve(_str.size()) to avoid multiple reallocations during decoding. Since each UTF-8 byte represents at most one Unicode code point, the input string size is a safe upper bound for the number of code points.

Suggested change

std::u32string out;

std::u32string out;

out.reserve(_str.size());

gemini-code-assist · 2026-05-14T15:40:51Z

+        return std::filesystem::path(std::u8string(
+            reinterpret_cast<const char8_t*>(_str.data()), _str.size()));


Using the iterator-based constructor of std::filesystem::path with char8_t* avoids the creation of a temporary std::u8string object and its associated heap allocation.

Suggested change

return std::filesystem::path(std::u8string(

reinterpret_cast<const char8_t*>(_str.data()), _str.size()));

const auto* ptr = reinterpret_cast<const char8_t*>(_str.data());

return std::filesystem::path(ptr, ptr + _str.size());

gemini-code-assist · 2026-05-14T15:40:51Z

+    auto outStr = internal::strings::utf8_to_wstring(inStr.value());
+    if (!outStr) {
+      return Result<std::wstring>(
+          error("Could not parse the string: it is not valid UTF-8."));
+    }

-    return Result<std::wstring>(outStr);
+    return Result<std::wstring>(std::move(*outStr));


This function is marked noexcept, but the string allocations and conversions inside utf8_to_wstring can throw std::bad_alloc. To prevent an unexpected std::terminate, consider wrapping the conversion in a try-catch block and returning a Result::error instead, which is consistent with the error handling pattern used in Parser_filepath.hpp.

try { auto outStr = internal::strings::utf8_to_wstring(inStr.value()); if (!outStr) { return error("Could not parse the string: it is not valid UTF-8."); } return std::move(*outStr); } catch (std::exception& e) { return error(e.what()); }

gemini-code-assist · 2026-05-14T15:40:51Z

+    const auto outStr = internal::strings::wstring_to_utf8(_str);
+    ParentType::add_value(_w, outStr.value_or(std::string()), _parent);


The current implementation performs an unnecessary copy of the string when outStr has a value. By removing const and using std::move(outStr).value_or(...), you can transfer ownership of the internal string to the writer.

Suggested change

const auto outStr = internal::strings::wstring_to_utf8(_str);

ParentType::add_value(_w, outStr.value_or(std::string()), _parent);

auto outStr = internal::strings::wstring_to_utf8(_str);

ParentType::add_value(_w, std::move(outStr).value_or(std::string()), _parent);

liuzicheng1987 · 2026-05-14T20:49:43Z

@mrgyatso , thanks for the contribution. This looks great. I was about to do the 0.25.0 release, but I want to squeeze your commit in, so I will delay it for another day.

liuzicheng1987 · 2026-05-14T22:15:24Z

@mrgyatso it appears your solution does not work for Windows/MSVC. Could you take a look?

mrgyatso · 2026-05-14T22:52:53Z

Ah yeah I overlooked that. I'll check it out tonight. Give me a bit.

MSVC reads source files using the system code page unless told otherwise, so non-ASCII characters in the new wstring/filepath regression tests' wide and u8 literals were mis-decoded before any library code ran, failing json.test_wstring_non_ascii and json.test_filepath_non_ascii on the windows-msvc CI job. /utf-8 makes MSVC treat the sources as UTF-8. Library code is unchanged.

mrgyatso · 2026-05-15T04:50:41Z

@liuzicheng1987 Sorry I took so long, just got home a bit ago. Pushed the fix in 8657c32. Library code is unchanged; this was a test-build issue. MSVC was reading the test sources using the system code page, so non-ASCII in the new L"..." / u8"..." literals was mis-decoded before the codec ran (expected \xC3\xA9, got \xC3\x83\xC2\xA9). The fix is /utf-8 in tests/CMakeLists.txt. Verified locally on VS 2022 / MSVC 19.50: without it the two non_ascii tests reproduce the CI failure, with it the full suite passes 340/340.

Can narrow /utf-8 to tests/json/CMakeLists.txt if you'd prefer.

mrgyatso · 2026-05-15T13:36:46Z

Looks like this one failed due to a 502 error upstream with vcpkg, unrelated to the fix. Should work fine on re-run.

liuzicheng1987 · 2026-05-15T15:30:36Z

@mrgyatso , yes, unfortunately the build pipeline is a bit unstable. We have too many things running at the same time, so the vcpkg servers tend to block us. It's a known issue and clearly has nothing to do with your PR.

liuzicheng1987 · 2026-05-15T15:51:31Z

@mrgyatso , I have merged this. Thankl you for your contribution!

mrgyatso · 2026-05-15T18:34:16Z

@mrgyatso , I have merged this. Thankl you for your contribution!

@liuzicheng1987
Happy to help! I sent you a request on LinkedIn by the way. Let's stay connected. Hope to build with you in the future. :)

mrgyatso added 2 commits May 14, 2026 15:35

gemini-code-assist Bot reviewed May 14, 2026

View reviewed changes

mrgyatso marked this pull request as ready for review May 14, 2026 16:05

liuzicheng1987 merged commit 53530bc into getml:main May 15, 2026
330 of 331 checks passed

	std::u32string out;
	std::u32string out;
	out.reserve(_str.size());

		return std::filesystem::path(std::u8string(
		reinterpret_cast<const char8_t*>(_str.data()), _str.size()));

		const auto outStr = internal::strings::wstring_to_utf8(_str);
		ParentType::add_value(_w, outStr.value_or(std::string()), _parent);

Conversation

mrgyatso commented May 14, 2026

Summary

Root cause

The fix: two independent commits

Also fixes a crash that is not in either issue

A note on scope

Tests

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

liuzicheng1987 commented May 14, 2026

Uh oh!

liuzicheng1987 commented May 14, 2026

Uh oh!

mrgyatso commented May 14, 2026

Uh oh!

mrgyatso commented May 15, 2026

Uh oh!

mrgyatso commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

liuzicheng1987 commented May 15, 2026

Uh oh!

Uh oh!

liuzicheng1987 commented May 15, 2026

Uh oh!

mrgyatso commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mrgyatso commented May 15, 2026 •

edited

Loading

mrgyatso commented May 15, 2026 •

edited

Loading