[FLINK-39602][table] Add IS_VALID_UTF8 and MAKE_VALID_UTF8 built-in functions #28111

Open
gustavodemorais wants to merge 4 commits into apache:master from confluentinc:FLINK-39602
Conversation

@gustavodemorais
Contributor

@gustavodemorais gustavodemorais commented May 4, 2026

What is the purpose of the change

Adds two built-in SQL functions: IS_VALID_UTF8 for routing invalid records to a dead-letter sink, and MAKE_VALID_UTF8 for explicit lossy decoding (substitutes invalid sequences with U+FFFD). Both are also exposed via the Table API; input is BINARY/VARBINARY. Part of FLIP-568.
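
To make the intended semantics concrete, here is a minimal sketch that uses only the JDK. It is not the code in this PR: the class name `Utf8Demo` and both method names are hypothetical, and the validator here is the JDK's `CharsetDecoder`, not the inline validator in `EncodingUtils`. The number of `U+FFFD` characters the JDK emits for a given invalid sequence may also differ from what `MAKE_VALID_UTF8` produces.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Demo {

    /** Strict check (assumed analogue of IS_VALID_UTF8): true only for well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Lossy decode (assumed analogue of MAKE_VALID_UTF8): malformed input becomes U+FFFD. */
    static String makeValidUtf8(byte[] bytes) {
        // The String(byte[], Charset) constructor always substitutes the
        // charset's replacement string for malformed input.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] clean = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] truncated = {(byte) 0xE2, (byte) 0x82};             // 3-byte sequence cut short
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};              // overlong encoding of U+0000
        byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80}; // encodes U+D800

        System.out.println(isValidUtf8(clean));     // true
        System.out.println(isValidUtf8(truncated)); // false
        System.out.println(isValidUtf8(overlong));  // false
        System.out.println(isValidUtf8(surrogate)); // false

        // A lone 0xFF byte can never appear in UTF-8, so it is replaced.
        String repaired = makeValidUtf8(new byte[] {'a', (byte) 0xFF, 'b'});
        System.out.println(repaired.equals("a\uFFFDb")); // true
    }
}
```

The strict path is what a filter would use to separate clean rows from rejects, while the lossy path trades fidelity for a result that is always a valid string.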

Brief change log

  • Add IS_VALID_UTF8 and MAKE_VALID_UTF8 to BuiltInFunctionDefinitions.
  • Add BYTES.isValidUtf8 and BYTES.makeValidUtf8 Table API methods on BaseExpressions.
  • Add EncodingUtils.isValidUtf8 with an inline Flink-style validator.
  • Document both functions in sql_functions.yml.

Important note: This PR is self-contained and does not depend on FLINK-39601; once that lands, this should delegate to StringUtf8Utils.firstInvalidUtf8ByteIndex and we can remove the duplicate code from EncodingUtils.

Verifying this change

  • Utf8FunctionsITCase
  • EncodingUtilsTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes - two new BuiltInFunctionDefinition entries and two new Table API methods on BaseExpressions)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no - new functions only)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (Entries in sql_functions.yml and sql_functions_zh.yml under conversion, plus JavaDocs on the runtime function classes and the new BaseExpressions methods.)

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

2.1.117 (Claude Code)

@flinkbot
Collaborator

flinkbot commented May 4, 2026

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • `@flinkbot run azure`: re-run the last Azure build

* the Unicode maximum U+10FFFF, and UTF-16 surrogate values U+D800-U+DFFF (which have no UTF-8
* representation).
*/
public OutType isValidUtf8() {
Contributor

I was expecting a boolean return type here.

Contributor

This is the expression API. You are constructing expressions, not evaluating data. The return type is another expression.

Contributor Author

Hey David, isValidUtf8() doesn't run the check - it just builds a small piece of an SQL plan that says "validate UTF-8 here". The actual true/false is computed later, on every row, on the cluster. So the method has to return something you can keep chaining onto (.and(...), .filter(...), etc.) - that's what OutType is. It's the same return type every other Table API method uses, like isNull() or like()
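
The build-now, evaluate-later distinction can be sketched with a toy expression type. Everything here is hypothetical and for illustration only: `Expr`, `Column`, and `isNotEmpty` are not Flink classes or methods, and the per-row check uses the JDK decoder, not the PR's validator.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;

final class DeferredExprDemo {

    /** Hypothetical expression node: a description plus a check to run later, per row. */
    static final class Expr {
        final String description;
        final Predicate<byte[]> perRow;

        Expr(String description, Predicate<byte[]> perRow) {
            this.description = description;
            this.perRow = perRow;
        }

        /** Chaining returns a new, larger expression; nothing is evaluated here. */
        Expr and(Expr other) {
            return new Expr(
                    "(" + description + " AND " + other.description + ")",
                    perRow.and(other.perRow));
        }
    }

    /** Hypothetical column reference whose methods return expressions, not booleans. */
    static final class Column {
        final String name;

        Column(String name) {
            this.name = name;
        }

        Expr isValidUtf8() {
            return new Expr(
                    "IS_VALID_UTF8(" + name + ")", DeferredExprDemo::strictDecodeSucceeds);
        }

        Expr isNotEmpty() {
            return new Expr(name + " IS NOT EMPTY", bytes -> bytes.length > 0);
        }
    }

    /** Stand-in validator based on the JDK's strict UTF-8 decoder. */
    static boolean strictDecodeSucceeds(byte[] bytes) {
        try {
            StandardCharsets.UTF_8
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Column payload = new Column("payload");

        // Building the filter touches no data; it only composes a description.
        Expr filter = payload.isValidUtf8().and(payload.isNotEmpty());
        System.out.println(filter.description);

        // Evaluation happens later, once per row.
        System.out.println(filter.perRow.test(new byte[] {'o', 'k'}));   // true
        System.out.println(filter.perRow.test(new byte[] {(byte) 0xFF})); // false
    }
}
```

If `isValidUtf8()` returned a `boolean`, the `.and(...)` chain could not be built before any row exists, which is why an expression type is returned.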

Contributor

@twalthr twalthr left a comment

Thank you @gustavodemorais.

Comment thread docs/data/sql_functions.yml Outdated
description: |
Returns `TRUE` if the input is well-formed UTF-8, `FALSE` otherwise. Specifically rejects: truncated multi-byte sequences (missing continuation bytes), "overlong" encodings (using more bytes than necessary for the code point), code points above the Unicode maximum U+10FFFF, and UTF-16 surrogate values U+D800-U+DFFF (which have no UTF-8 representation). Returns `NULL` if the input is `NULL`.
Useful for routing records with invalid UTF-8 to a dead-letter sink: `WHERE IS_VALID_UTF8(payload)` keeps clean rows; `WHERE NOT IS_VALID_UTF8(payload)` selects the rejects.
Contributor

Suggested change
Useful for routing records with invalid UTF-8 to a dead-letter sink: `WHERE IS_VALID_UTF8(payload)` keeps clean rows; `WHERE NOT IS_VALID_UTF8(payload)` selects the rejects.
Useful for filtering records with invalid UTF-8: `WHERE IS_VALID_UTF8(payload)` keeps clean rows; `WHERE NOT IS_VALID_UTF8(payload)` selects the rejects.

Contributor

let's not advertise features that don't exist

Contributor Author

fixed

Comment thread docs/data/sql_functions.yml Outdated
description: |
Decodes the input as UTF-8, replacing each invalid sequence with the Unicode replacement character `U+FFFD` (rendered as `�`). The substitution is lossy and irreversible. Returns `NULL` if the input is `NULL`.
If you want to explicitly have the behavior of silently substituting invalid bytes with `U+FFFD` when doing a `CAST(bytes AS STRING)`, replace the cast with `MAKE_VALID_UTF8(bytes)`.
Contributor

I find this comment confusing. How about "MAKE_VALID_UTF8() can fully replace a CAST(bytes AS STRING) which would error in case of invalid UTF-8"

Contributor Author

Your comment makes sense, but only once the CAST change is merged. I had planned to update this documentation at that point. With this change, CAST(bytes AS STRING) and MAKE_VALID_UTF8(bytes) do exactly the same thing; the difference is that MAKE_VALID_UTF8 is explicit about it.

Contributor Author

@gustavodemorais gustavodemorais May 5, 2026

I actually think we can drop this sentence for now and add your comment as a follow-up when we do the CAST change. The first paragraph already states clearly what the function does:

Decodes the input as UTF-8, replacing each invalid sequence with the Unicode replacement character `U+FFFD` (rendered as `�`). The substitution is lossy and irreversible. Returns `NULL` if the input is `NULL`.

}

/**
* Returns {@code true} if the input bytes form a well-formed UTF-8 sequence, {@code false}
Contributor

Suggested change
* Returns {@code true} if the input bytes form a well-formed UTF-8 sequence, {@code false}
* Returns {@code true} if the input bytes are a well-formed UTF-8 sequence, {@code false}

@gustavodemorais
Contributor Author

Thanks for the review, @twalthr. I've addressed the comments; please take another look.

Contributor

@twalthr twalthr left a comment

LGTM, thanks for working on this @gustavodemorais.

4 participants