Velox regexp_replace drops LF bytes when input is a NativeScan string column #12120

@beliefer

Backend

VL (Velox)

Bug description

Iceberg table testA has a qbody STRING column containing mixed Chinese characters and embedded 0x0A (LF) bytes.

select length(qbody),
       length(regexp_replace(qbody, '\\n', '\\\\n'))
from testA
where id='86648395' and dt='20260509';

The output is shown below.

┌────────┬───────────────┬─────────────────────────────────────────────────┐
│ Engine │ length(qbody) │           length(regexp_replace(...))           │
├────────┼───────────────┼─────────────────────────────────────────────────┤
│ Spark  │ 629           │ 647 (correct, +18 chars: each LF → 2-char "\n") │
├────────┼───────────────┼─────────────────────────────────────────────────┤
│ Gluten │ 629           │ 611 (wrong, -18 chars: each LF deleted)         │
└────────┴───────────────┴─────────────────────────────────────────────────┘

The qbody contains 18 LF bytes interleaved with multibyte UTF-8 (Chinese) characters.
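
The length arithmetic in the table can be sanity-checked with a small Python model. This is only a sketch: the sample string is made up (the real qbody is not included in this report), and Python's re module stands in for Spark's regex engine, so it models the expected semantics rather than either engine's actual code path.

```python
import re

# Hypothetical stand-in for qbody: Chinese text with embedded LF (0x0A) characters.
sample = "您好\n请问\n订单状态\n谢谢"
lf_count = sample.count("\n")

# Expected (Spark) result: each LF becomes the two characters backslash + 'n'.
# A function replacement avoids any escape processing of the replacement text.
expected = re.sub(r"\n", lambda m: "\\n", sample)

# Result reported for Gluten: each LF is dropped.
observed = sample.replace("\n", "")

assert len(expected) == len(sample) + lf_count
assert len(observed) == len(sample) - lf_count
print(len(sample), len(expected), len(observed))
```

Applied to the row above (629 characters, 18 LFs), the same arithmetic gives 629 + 18 = 647 for the correct result and 629 - 18 = 611 for the faulty one.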

Reduced repro using substr(qbody, 1, 50)

  select length(regexp_replace(substr(qbody, 1, 50), '\\n', '\\\\n'))
  from testA where id='86648395' and dt='20260509';

-- Spark: 53, Gluten: 47

Inline literal does NOT trigger

select length(regexp_replace(unhex('<the same 50-char hex bytes>'), '\\n', '\\\\n'));
-- Both Spark and Gluten: 53 (correct)

NativeScan column input does trigger

The bug only appears when input flows from IcebergBatchScanTransformer (or
likely any *ScanTransformer producing Velox StringViews referencing the
original column buffer).

Workaround

Use replace() (a literal string replace, not a regex):
replace(qbody, unhex('0A'), '\\n') -- works correctly on both engines

Or rebuild the string via unhex(hex(col)):
regexp_replace(unhex(hex(qbody)), '\\n', '\\\\n') -- bug avoided

Suspected root cause

When regexp_replace operates on a Velox StringView pointing into the original
column buffer, an LF byte (0x0A, an ASCII control code) immediately preceded or
followed by the bytes of a multibyte UTF-8 character (lead byte 0xE5 etc.)
appears to confuse RE2's UTF-8 boundary handling. Inline literals work,
presumably because they go through a different code path that materializes the
string before the regex runs.
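
To make the suspected byte layout concrete, here is a small Python sketch that classifies the bytes around an LF sitting between two Chinese characters. The sample character and the classify helper are illustrative assumptions, not taken from the table data or from Velox/RE2 code. In UTF-8, any byte below 0x80 is a complete character by itself, so a correct decoder should never treat 0x0A as part of a neighbouring multibyte sequence.

```python
# Hypothetical sample: one CJK character, an LF, then the same character again.
data = "好\n好".encode("utf-8")

def classify(byte: int) -> str:
    """Classify a single UTF-8 byte by its high bits."""
    if byte < 0x80:
        return "ascii"         # complete one-byte character (includes LF, 0x0A)
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx, continues a multibyte character
    return "lead"              # 11xxxxxx, starts a multibyte character

kinds = [classify(b) for b in data]
for b, kind in zip(data, kinds):
    print(f"0x{b:02X} {kind}")
```

In this layout the LF is preceded by a continuation byte and followed by a lead byte (0xE5), which is the neighbourhood the hypothesis above points at.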

Impact

100% of rows containing both LF and CJK characters produce corrupt output.
We discovered this when comparing 266k rows of customer service text data
between Spark and Gluten: every single output row differed.

Reproduction

  create table t (s string) using parquet;
  insert into t values (concat('a', unhex('0A'), 'b', unhex('0A'), 'c'));
  select hex(regexp_replace(substr(s, 1, 5), '\\n', '\\\\n')) from t;

-- Spark: 615C6E625C6E63, Gluten: 616263
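
The two hex strings above can be cross-checked outside either engine. The following Python sketch rebuilds the same five-character value and computes both the expected result (each LF replaced by backslash + 'n') and the reported faulty result (each LF dropped). Python's re module is used only as a stand-in for the expected regex semantics, and the to_hex helper is a hypothetical name introduced here.

```python
import re

# Same value the INSERT statement builds: 'a', LF, 'b', LF, 'c'.
lf = bytes.fromhex("0A").decode("ascii")
s = "a" + lf + "b" + lf + "c"

def to_hex(text: str) -> str:
    """Upper-case hex of the UTF-8 bytes, mirroring the SQL hex() output format."""
    return text.encode("utf-8").hex().upper()

prefix = s[:5]  # substr(s, 1, 5) in SQL is 1-based; here it covers the whole string

spark_like = re.sub(r"\n", lambda m: "\\n", prefix)   # LF -> backslash + 'n'
gluten_like = prefix.replace("\n", "")                # LF dropped, as reported

print(to_hex(spark_like))   # 615C6E625C6E63
print(to_hex(gluten_like))  # 616263
```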

Gluten version

Gluten-1.3

Spark version

Spark-3.5.x

Spark configurations

No response

System information

No response

Relevant logs

Labels

bug, triage