Backend
VL (Velox)
Bug description
The Iceberg table testA has a STRING column qbody that mixes Chinese characters with embedded 0x0A (LF) bytes. The following query returns different results on Spark and on Gluten:
select length(qbody),
length(regexp_replace(qbody, '\\n', '\\\\n'))
from testA
where id='86648395' and dt='20260509';
The output is shown below.
┌────────┬───────────────┬─────────────────────────────────────────────────┐
│ Engine │ length(qbody) │ length(regexp_replace(...)) │
├────────┼───────────────┼─────────────────────────────────────────────────┤
│ Spark │ 629 │ 647 (correct, +18 chars: each LF → 2-char "\n") │
├────────┼───────────────┼─────────────────────────────────────────────────┤
│ Gluten │ 629 │ 611 (wrong, -18 chars: each LF deleted) │
└────────┴───────────────┴─────────────────────────────────────────────────┘
The qbody contains 18 LF bytes interleaved with multibyte UTF-8 (Chinese) characters.
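The length arithmetic in the table can be checked with a small Python model. This is only a sketch: the sample string is synthetic (the real qbody value is not included in this report), and Python's `re` module stands in for Spark's regex engine, so it illustrates the expected semantics rather than either engine's implementation.

```python
import re

# Synthetic stand-in for qbody: CJK text with embedded LF (0x0A) characters.
# The real column value is not part of this report.
sample = "你好\n世界\n测试"

lf_count = sample.count("\n")

# Expected (Spark) behavior: each LF becomes the two characters backslash + 'n'.
expected = re.sub(r"\n", r"\\n", sample)

# Behavior reported for Gluten: each LF is removed instead of replaced.
observed = sample.replace("\n", "")

assert len(expected) == len(sample) + lf_count
assert len(observed) == len(sample) - lf_count
```

Applied to the reported numbers, 629 + 18 = 647 matches Spark and 629 - 18 = 611 matches Gluten, which is consistent with every one of the 18 LF bytes being dropped.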
Reduced repro using substr(qbody, 1, 50)
select length(regexp_replace(substr(qbody, 1, 50), '\\n', '\\\\n'))
from testA where id='86648395' and dt='20260509';
-- Spark: 53, Gluten: 47
Inline literal does NOT trigger the bug
select length(regexp_replace(unhex('<the same 50-char hex bytes>'), '\\n', '\\\\n'));
-- Both Spark and Gluten: 53 (correct)
Native scan column input does trigger the bug
The bug only appears when input flows from IcebergBatchScanTransformer (or
likely any *ScanTransformer producing Velox StringViews referencing the
original column buffer).
Workaround
Use replace() (literal string replacement, not regex):
replace(qbody, unhex('0A'), '\\n') -- works correctly on both engines
Or rebuild the string via unhex(hex(col)):
regexp_replace(unhex(hex(qbody)), '\\n', '\\\\n') -- bug avoided
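The literal replace() workaround is equivalent here because the pattern matches exactly one fixed character, so no regex features are needed. A minimal Python sketch of that equivalence follows. `str.replace` and `re.sub` stand in for the Spark functions, and the helper names are hypothetical; this is an analogy, not Spark or Velox code.

```python
import re

def escape_lf_literal(text: str) -> str:
    """Literal replacement: LF -> backslash + 'n' (models replace())."""
    return text.replace("\n", "\\n")

def escape_lf_regex(text: str) -> str:
    """Regex replacement with the same intent (models regexp_replace())."""
    return re.sub(r"\n", r"\\n", text)

# Synthetic value, not taken from the real table.
sample = "客服\n回复\n内容"

# Both approaches should produce identical output for this pattern.
assert escape_lf_literal(sample) == escape_lf_regex(sample)
```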
Suspected root cause
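When regexp_replace operates on a Velox StringView pointing into the original
column buffer, an LF byte (0x0A, an ASCII control code) immediately preceded or followed
by a multibyte UTF-8 lead byte (0xE5 etc.) appears to confuse RE2's UTF-8
boundary handling. Inline literals work because they go through a different
code path that materializes the string before the regex is applied.
This remains a hypothesis. The minimal reproduction in this report uses ASCII-only
input, so adjacency to multibyte characters may not be a required condition.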
Impact
In our data, 100% of rows containing both LF and CJK characters produce corrupt output.
We discovered this while comparing 266k rows of customer service text between
Spark and Gluten: every output row differed.
Reproduction
create table t (s string) using parquet;
insert into t values (concat('a', unhex('0A'), 'b', unhex('0A'), 'c'));
select hex(regexp_replace(substr(s, 1, 5), '\\n', '\\\\n')) from t;
-- Spark: 615C6E625C6E63, Gluten: 616263
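The two hex strings in the comment above can be decoded to confirm what each engine returned. A short check using only the Python standard library:

```python
spark_hex = "615C6E625C6E63"
gluten_hex = "616263"

spark_out = bytes.fromhex(spark_hex).decode("utf-8")
gluten_out = bytes.fromhex(gluten_hex).decode("utf-8")

# Spark: each LF was replaced by backslash + 'n'.
assert spark_out == "a\\nb\\nc"
# Gluten, as reported: both LF bytes were dropped.
assert gluten_out == "abc"
```

The difference in length (7 characters versus 3) matches the pattern seen in the table data: Spark's output grows by one character per LF, while Gluten's shrinks by one.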
Gluten version
Gluten-1.3
Spark version
Spark-3.5.x
Spark configurations
No response
System information
No response
Relevant logs