fix(json): don't crash on non-ASCII std::wstring / std::filesystem::path#668
Conversation
…e locale path::string() and the narrow std::filesystem::path constructor are locale-dependent and crash or mangle non-ASCII paths in the default C locale. JSON is UTF-8 (RFC 8259), so go through path::u8string() and construction from std::u8string instead. Refs getml#421.
Parser_wstring transcoded via std::mbsrtowcs / std::wcsrtombs and passed the result straight into resize(). Those functions return (size_t)-1 on input not valid in the current C locale, which becomes resize(SIZE_MAX) and throws std::length_error; since read() is noexcept this is an immediate std::terminate. Replace the locale-dependent calls with an explicit, locale-independent UTF-8 to wchar_t codec (rfl::internal::strings, header-only). read() now returns an rfl::Error on malformed UTF-8; write() produces correct UTF-8. Refs getml#421, getml#422.
There was a problem hiding this comment.
Code Review
This pull request introduces locale-independent UTF-8 conversion utilities to ensure robust handling of non-ASCII characters in std::wstring and std::filesystem::path during JSON serialization, replacing problematic locale-dependent C functions. The changes include a new internal header for UTF-8 decoding and encoding, along with regression tests for non-ASCII paths and strings. Review feedback focused on performance and safety optimizations, such as pre-allocating memory for decoded strings, using iterator-based constructors to avoid temporary allocations, and employing move semantics to prevent unnecessary copies. Additionally, it was recommended to wrap string conversions in try-catch blocks to handle potential memory allocation failures gracefully and maintain consistency with existing error handling patterns.
| /// input is not well-formed UTF-8 (truncated, overlong, bad continuation byte, | ||
| /// surrogate code point, or out of range). | ||
| inline std::optional<std::u32string> decode_utf8(const std::string& _str) { | ||
| std::u32string out; |
There was a problem hiding this comment.
Consider using out.reserve(_str.size()) to avoid multiple reallocations during decoding. Since each UTF-8 byte represents at most one Unicode code point, the input string size is a safe upper bound for the number of code points.
| std::u32string out; | |
| std::u32string out; | |
| out.reserve(_str.size()); |
| return std::filesystem::path(std::u8string( | ||
| reinterpret_cast<const char8_t*>(_str.data()), _str.size())); |
There was a problem hiding this comment.
Using the iterator-based constructor of std::filesystem::path with char8_t* avoids the creation of a temporary std::u8string object and its associated heap allocation.
| return std::filesystem::path(std::u8string( | |
| reinterpret_cast<const char8_t*>(_str.data()), _str.size())); | |
| const auto* ptr = reinterpret_cast<const char8_t*>(_str.data()); | |
| return std::filesystem::path(ptr, ptr + _str.size()); |
| auto outStr = internal::strings::utf8_to_wstring(inStr.value()); | ||
| if (!outStr) { | ||
| return Result<std::wstring>( | ||
| error("Could not parse the string: it is not valid UTF-8.")); | ||
| } | ||
|
|
||
| return Result<std::wstring>(outStr); | ||
| return Result<std::wstring>(std::move(*outStr)); |
There was a problem hiding this comment.
This function is marked noexcept, but the string allocations and conversions inside utf8_to_wstring can throw std::bad_alloc. To prevent an unexpected std::terminate, consider wrapping the conversion in a try-catch block and returning a Result::error instead, which is consistent with the error handling pattern used in Parser_filepath.hpp.
try {
auto outStr = internal::strings::utf8_to_wstring(inStr.value());
if (!outStr) {
return error("Could not parse the string: it is not valid UTF-8.");
}
return std::move(*outStr);
} catch (std::exception& e) {
return error(e.what());
}| const auto outStr = internal::strings::wstring_to_utf8(_str); | ||
| ParentType::add_value(_w, outStr.value_or(std::string()), _parent); |
There was a problem hiding this comment.
The current implementation performs an unnecessary copy of the string when outStr has a value. By removing const and using std::move(outStr).value_or(...), you can transfer ownership of the internal string to the writer.
| const auto outStr = internal::strings::wstring_to_utf8(_str); | |
| ParentType::add_value(_w, outStr.value_or(std::string()), _parent); | |
| auto outStr = internal::strings::wstring_to_utf8(_str); | |
| ParentType::add_value(_w, std::move(outStr).value_or(std::string()), _parent); |
|
@mrgyatso , thanks for the contribution. This looks great. I was about to do the 0.25.0 release, but I want to squeeze your commit in, so I will delay it for another day. |
|
@mrgyatso it appears your solution does not work for Windows/MSVC. Could you take a look? |
|
Ah yeah I overlooked that. I'll check it out tonight. Give me a bit. |
MSVC reads source files using the system code page unless told otherwise, so non-ASCII characters in the new wstring/filepath regression tests' wide and u8 literals were mis-decoded before any library code ran, failing json.test_wstring_non_ascii and json.test_filepath_non_ascii on the windows-msvc CI job. /utf-8 makes MSVC treat the sources as UTF-8. Library code is unchanged.
|
@liuzicheng1987 Sorry I took so long, just got home a bit ago. Pushed the fix in 8657c32. Library code is unchanged; this was a test-build issue. MSVC was reading the test sources using the system code page, so non-ASCII in the new Can narrow |
|
Looks like this one failed due to a 502 error upstream with vcpkg, unrelated to the fix. Should work fine on re-run. |
|
@mrgyatso , yes, unfortunately the build pipeline is a bit unstable. We have too many things running at the same time, so the vcpkg servers tend to block us. It's a known issue and clearly has nothing to do with your PR. |
|
@mrgyatso , I have merged this. Thankl you for your contribution! |
@liuzicheng1987 |
Summary
A struct field of type
std::wstringorstd::filesystem::pathcontaining any non-ASCIIcharacter crashes
rfl::json::read/rfl::json::writewithstd::terminate. This is thecrash behind #421 and #422. Both were reported from Windows, but the crash is not
platform-specific: it reproduces on Linux / glibc / g++ just as readily.
Minimal repro (Linux, default locale):
Root cause
Parser_wstring.hpptranscodes viastd::mbsrtowcs/std::wcsrtombsand passes theresult straight into
resize():std::mbsrtowcs/std::wcsrtombsreturn(size_t)-1when the input is not valid in thecurrent C locale's
LC_CTYPEencoding. The default"C"locale is ASCII-only, so anynon-ASCII byte makes
lenequal to(size_t)-1, which becomesresize(SIZE_MAX)andthrows
std::length_error.readis declarednoexcept, so that becomes an immediatestd::terminate.Parser_filepath.hpphas the same locale dependency throughpath::string()and the narrowpathconstructor.The fix: two independent commits
Commit 1 serializes
std::filesystem::pathas UTF-8.Parser_filepathnow usespath::u8string()and construction fromstd::u8stringinstead of the locale-dependentpath::string()and narrow constructor. JSON is UTF-8 (RFC 8259), so this conversion iswell-defined and locale-independent. It uses the standard library only, about one
substantive line per function.
Commit 2 converts
std::wstringwith an explicit UTF-8 codec. It adds a small header-onlyUTF-8 to
wchar_tcodec that handles bothwchar_twidths (UTF-32 on Linux and macOS,UTF-16 on Windows) and rejects malformed input rather than crashing, then rewires
Parser_wstringto use it.readnow returns anrfl::Erroron malformed UTF-8, andwriteproduces correct UTF-8. The parser itself gets shorter, since the locale jugglingis gone.
Commit 1 stands on its own, and commit 2 is independent of it.
Also fixes a crash that is not in either issue
#421 and #422 are about
writeandsave. Thereadpath crashes on the same input,and because
readis declarednoexcept, there is norfl::Errorpath and notry/catcha caller can use: it is an unconditionalstd::terminate. Neither issuementions this. Both commits fix
readandwrite.A note on scope
I read through the #421 discussion. The genuinely hard part there, auto-detecting the
encoding of a
std::string, is ambiguous on Windows, and this PR does not touch it. Butstd::wstringandstd::filesystem::pathare not encoding-ambiguous (they arewell-defined UTF-16 / UTF-32, and the standard's defined path encoding), so a correct,
locale-independent conversion is well-defined, which is what commit 2 does.
If you would rather not take on the codec in commit 2, I am happy to reduce it to just
guarding the
(size_t)-1return, soreadreturns anrfl::Errorandwriteemits anempty string. That kills the crash without adding conversion logic. I led with the full
fix because it actually makes the types usable, but it is your call.
Tests
Regression tests added to
tests/json/test_wstring.cppandtests/json/test_filepath.cpp,covering round-trips with Latin-1, Cyrillic, CJK, Hiragana, and astral-plane content.
Closes #422. Refs #421: this fully addresses its
std::wstringandstd::filesystem::pathcases. The separate
std::stringcodepage question raised in that thread is intentionallyleft out of scope.