Velox regexp_replace drops LF bytes when input is a NativeScan string column #12120

### Backend

VL (Velox)

### Bug description
  
Iceberg table `testA` has a `qbody` STRING column containing a mix of Chinese characters and embedded `0x0A` (LF) bytes.

```sql
select length(qbody),
       length(regexp_replace(qbody, '\\n', '\\\\n'))
from testA
where id='86648395' and dt='20260509';
```

The output is shown below.

| Engine | length(qbody) | length(regexp_replace(...)) |
|--------|---------------|-----------------------------|
| Spark  | 629           | 647 (correct, +18 chars: each LF → 2-char `\n`) |
| Gluten | 629           | 611 (wrong, -18 chars: each LF deleted) |

The `qbody` value contains 18 LF bytes interleaved with multibyte UTF-8 (Chinese) characters.
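
The LF count can be cross-checked without going through `regexp_replace` at all. The query below is a minimal sketch that reuses the table, column, and filter from this report and relies on `replace()`, which is reported further down to behave correctly on both engines. The aliases `total_chars` and `lf_count` are illustrative names only.

```sql
-- Count LF characters by removing them with a literal replace()
-- and comparing lengths. No regex is involved.
select length(qbody)                                           as total_chars,
       length(qbody) - length(replace(qbody, unhex('0A'), '')) as lf_count
from testA
where id='86648395' and dt='20260509';
```

If the numbers in this report hold, `lf_count` should be 18 on both engines, which accounts for 629 + 18 = 647 on Spark and 629 - 18 = 611 on Gluten.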

#### Reduced repro with `substr(qbody, 1, 50)`

```sql
select length(regexp_replace(substr(qbody, 1, 50), '\\n', '\\\\n'))
from testA where id='86648395' and dt='20260509';
-- Spark: 53, Gluten: 47
```

#### Inline literal does NOT trigger the bug

`select length(regexp_replace(unhex('<the same 50-char hex bytes>'), '\\n', '\\\\n'));`

Both Spark and Gluten return 53 (correct).

#### NativeScan column input does trigger the bug

The bug only appears when the input flows from `IcebergBatchScanTransformer` (or, likely, any `*ScanTransformer` that produces Velox `StringView`s referencing the original column buffer).
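
To confirm which scan operator feeds the projection in a given environment, the physical plan can be printed. This is only a sketch using Spark's standard `EXPLAIN FORMATTED` on the reduced query from this report. Which operator names appear in the output depends on the Gluten build and table format, so no particular plan text is assumed here.

```sql
-- Print the physical plan for the reduced repro query.
-- Look for the scan node under the projection that evaluates regexp_replace.
explain formatted
select length(regexp_replace(substr(qbody, 1, 50), '\\n', '\\\\n'))
from testA where id='86648395' and dt='20260509';
```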

#### Workaround

Use `replace()` (literal string replacement, not regex): `replace(qbody, unhex('0A'), '\\n')` works correctly on both engines.

Alternatively, rebuild the string via `unhex(hex(col))`: `regexp_replace(unhex(hex(qbody)), '\\n', '\\\\n')` avoids the bug.
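
The original expression and both workarounds can be compared in a single row. This is a sketch against the same table and filter used in this report; the column aliases are illustrative names, not anything defined by Spark or Gluten.

```sql
-- Compare the affected expression with the two workarounds side by side.
select length(regexp_replace(qbody, '\\n', '\\\\n'))             as via_regexp,
       length(replace(qbody, unhex('0A'), '\\n'))                as via_replace,
       length(regexp_replace(unhex(hex(qbody)), '\\n', '\\\\n')) as via_rebuild
from testA
where id='86648395' and dt='20260509';
```

Based on the behavior described in this report, all three columns should agree on Spark, while on Gluten only `via_regexp` should come out shorter.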
  
#### Suspected root cause

When `regexp_replace` operates on a Velox `StringView` pointing into the original column buffer, an LF byte (`0x0A`, an ASCII control character) immediately preceded or followed by a multibyte UTF-8 lead byte (`0xE5` etc.) appears to confuse RE2's UTF-8 boundary handling. Inline literals work, presumably because they go through a different code path that materializes the string before the regex runs.

#### Impact

Every row containing both LF and CJK characters produces corrupt output. We discovered this when comparing 266k rows of customer service text data between Spark and Gluten: every single output row differs.

#### Reproduction

```sql
create table t (s string) using parquet;
insert into t values (concat('a', unhex('0A'), 'b', unhex('0A'), 'c'));
select hex(regexp_replace(substr(s, 1, 5), '\\n', '\\\\n')) from t;
-- Spark: 615C6E625C6E63, Gluten: 616263
```
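
The repro above uses ASCII characters only, whereas the suspected root cause involves LF bytes adjacent to multibyte UTF-8 characters. A variant with CJK neighbors might help separate the two cases. The following is a sketch only: the table name `t_cjk` is hypothetical, and the inserted bytes are the UTF-8 encoding of `中`, LF, `文`, LF, `c`, written as hex so the statement does not depend on client encoding.

```sql
-- Hypothetical table; same shape as the ASCII repro, but with CJK
-- characters (E4B8AD = '中', E69687 = '文') on either side of each LF.
create table t_cjk (s string) using parquet;
insert into t_cjk values (cast(unhex('E4B8AD0AE696870A63') as string));
select hex(s)                                               as original_hex,
       hex(regexp_replace(substr(s, 1, 5), '\\n', '\\\\n')) as replaced_hex
from t_cjk;
-- A correct result for replaced_hex is E4B8AD5C6EE696875C6E63
-- (each 0A replaced by 5C6E, i.e. backslash followed by 'n').
```

If the ASCII-only table and this table fail in the same way, that would suggest the multibyte neighbors are not required to trigger the problem.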

### Gluten version

Gluten-1.3

### Spark version

Spark-3.5.x

### Spark configurations

_No response_

### System information

_No response_

### Relevant logs

```bash

```
