[FLINK-39602][table] Add IS_VALID_UTF8 and MAKE_VALID_UTF8 built-in functions #28111
gustavodemorais wants to merge 4 commits into apache:master
 * the Unicode maximum U+10FFFF, and UTF-16 surrogate values U+D800-U+DFFF (which have no UTF-8
 * representation).
 */
public OutType isValidUtf8() {
I was expecting a boolean return type here.
This is the expression API. You are constructing expressions, not evaluating data, so the return type is another expression.
Hey David, isValidUtf8() doesn't run the check - it just builds a small piece of an SQL plan that says "validate UTF-8 here". The actual true/false is computed later, on every row, on the cluster. So the method has to return something you can keep chaining onto (.and(...), .filter(...), etc.) - that's what OutType is. It's the same return type every other Table API method uses, like isNull() or like()
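The distinction drawn in the replies above (building an expression versus evaluating data) can be illustrated with a toy sketch in plain Java. Everything here is hypothetical: `ToyExpr`, its `plan` string, and `evaluate` are invented for illustration and are not Flink's `BaseExpressions` or `OutType`. The per-row check uses the JDK's strict UTF-8 decoder as a stand-in for whatever the PR implements.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;

/** Hypothetical stand-in for an expression node: it describes a check, it does not run it. */
final class ToyExpr {
    final String plan;                       // textual fragment of the "plan"
    private final Predicate<byte[]> perRow;  // logic deferred until a row is available

    private ToyExpr(String plan, Predicate<byte[]> perRow) {
        this.plan = plan;
        this.perRow = perRow;
    }

    /** Builds a node meaning "validate UTF-8 here"; no bytes are inspected yet. */
    static ToyExpr isValidUtf8() {
        return new ToyExpr("IS_VALID_UTF8(payload)", ToyExpr::decodesStrictly);
    }

    /** Chaining returns another expression, which is why a boolean return type would not work. */
    ToyExpr not() {
        return new ToyExpr("NOT " + plan, perRow.negate());
    }

    ToyExpr and(ToyExpr other) {
        return new ToyExpr("(" + plan + " AND " + other.plan + ")", perRow.and(other.perRow));
    }

    /** Stand-in for the runtime: only here is a true/false value computed for one row. */
    boolean evaluate(byte[] row) {
        return perRow.test(row);
    }

    // Assumption: the JDK's strict decoder is used as the validity check in this sketch.
    private static boolean decodesStrictly(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

In this sketch, `ToyExpr.isValidUtf8().not()` only produces a new node whose `plan` reads `NOT IS_VALID_UTF8(payload)`; a result exists only once `evaluate` is called with a row.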
twalthr
left a comment
Thank you @gustavodemorais.
description: |
  Returns `TRUE` if the input is well-formed UTF-8, `FALSE` otherwise. Specifically rejects: truncated multi-byte sequences (missing continuation bytes), "overlong" encodings (using more bytes than necessary for the code point), code points above the Unicode maximum U+10FFFF, and UTF-16 surrogate values U+D800-U+DFFF (which have no UTF-8 representation). Returns `NULL` if the input is `NULL`.
  Useful for routing records with invalid UTF-8 to a dead-letter sink: `WHERE IS_VALID_UTF8(payload)` keeps clean rows; `WHERE NOT IS_VALID_UTF8(payload)` selects the rejects.
Suggested change:
- Useful for routing records with invalid UTF-8 to a dead-letter sink: `WHERE IS_VALID_UTF8(payload)` keeps clean rows; `WHERE NOT IS_VALID_UTF8(payload)` selects the rejects.
+ Useful for filtering records with invalid UTF-8: `WHERE IS_VALID_UTF8(payload)` keeps clean rows; `WHERE NOT IS_VALID_UTF8(payload)` selects the rejects.
let's not advertise features that don't exist
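The four rejection rules listed in the `IS_VALID_UTF8` description in this thread can be sketched as a hand-written validator. This is a minimal illustration of those rules, not the code from this PR: the class name `Utf8Validator` is hypothetical, and `NULL` handling is left out (the sketch assumes a non-null array).

```java
/** Hypothetical sketch of a strict UTF-8 well-formedness check; not the PR's implementation. */
final class Utf8Validator {

    static boolean isValidUtf8(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int lead = b[i] & 0xFF;
            if (lead < 0x80) {            // single-byte (ASCII) code point
                i++;
                continue;
            }
            int needed;                   // number of continuation bytes expected
            int min;                      // smallest code point that legitimately needs this length
            int cp;                       // code point accumulated so far
            if ((lead & 0xE0) == 0xC0) {
                needed = 1; min = 0x80; cp = lead & 0x1F;
            } else if ((lead & 0xF0) == 0xE0) {
                needed = 2; min = 0x800; cp = lead & 0x0F;
            } else if ((lead & 0xF8) == 0xF0) {
                needed = 3; min = 0x10000; cp = lead & 0x07;
            } else {
                return false;             // stray continuation byte or invalid lead byte
            }
            if (i + needed >= b.length) {
                return false;             // truncated: input ends before the sequence does
            }
            for (int k = 1; k <= needed; k++) {
                int c = b[i + k] & 0xFF;
                if ((c & 0xC0) != 0x80) {
                    return false;         // missing continuation byte
                }
                cp = (cp << 6) | (c & 0x3F);
            }
            if (cp < min) {
                return false;             // overlong encoding
            }
            if (cp > 0x10FFFF) {
                return false;             // above the Unicode maximum
            }
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                return false;             // UTF-16 surrogate, no UTF-8 representation
            }
            i += needed + 1;
        }
        return true;
    }
}
```

For instance, the bytes `C0 80` are rejected as an overlong encoding of U+0000, and `ED A0 80` is rejected because it decodes to the surrogate U+D800.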
description: |
  Decodes the input as UTF-8, replacing each invalid sequence with the Unicode replacement character `U+FFFD` (rendered as `�`). The substitution is lossy and irreversible. Returns `NULL` if the input is `NULL`.
  If you want to explicitly have the behavior of silently substituting invalid bytes with `U+FFFD` when doing a `CAST(bytes AS STRING)`, replace the cast with `MAKE_VALID_UTF8(bytes)`.
I find this comment confusing. How about "MAKE_VALID_UTF8() can fully replace a CAST(bytes AS STRING) which would error in case of invalid UTF-8"
Your comment makes sense, but only once the CAST change is merged. I had planned to update this documentation then. With this change, CAST(bytes AS STRING) and MAKE_VALID_UTF8(bytes) do exactly the same thing; the only difference is that MAKE_VALID_UTF8 is explicit about it.
I actually think we can drop this sentence for now and add your comment as a follow-up when we do the CAST change. The first paragraph already states clearly what the function does:
Decodes the input as UTF-8, replacing each invalid sequence with the Unicode replacement character `U+FFFD` (rendered as `�`). The substitution is lossy and irreversible. Returns `NULL` if the input is `NULL`.
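The lossy decoding described here has a close analogue in the JDK, which can serve as a rough illustration. This is a sketch only: `LossyDecode.makeValidUtf8` is a hypothetical helper, not the function added by this PR, and decoders can differ in how many `U+FFFD` characters they emit for a multi-byte malformed sequence, so the example sticks to a single invalid byte.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper mirroring the documented behavior; not the PR's implementation. */
final class LossyDecode {

    static String makeValidUtf8(byte[] bytes) {
        if (bytes == null) {
            return null; // NULL in, NULL out, as the description states
        }
        // String(byte[], Charset) always replaces malformed input with the charset's
        // default replacement string, which for UTF-8 is U+FFFD.
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Decoding the bytes `61 FF 62` this way yields `a`, `U+FFFD`, `b`. Re-encoding that string produces `61 EF BF BD 62`, so the original `FF` byte cannot be recovered, which is what "lossy and irreversible" means in practice.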
}

/**
 * Returns {@code true} if the input bytes form a well-formed UTF-8 sequence, {@code false}
Suggested change:
-  * Returns {@code true} if the input bytes form a well-formed UTF-8 sequence, {@code false}
+  * Returns {@code true} if the input bytes are a well-formed UTF-8 sequence, {@code false}
Thanks for the review, @twalthr. I've addressed the comments; please take a look.
…FLINK-39601 merge
551e256 to 57dddc2
twalthr
left a comment
LGTM, thanks for working on this @gustavodemorais.
What is the purpose of the change
Adds two built-in SQL functions: IS_VALID_UTF8 for routing invalid records to a dead-letter sink, and MAKE_VALID_UTF8 for explicit lossy decoding (substitutes invalid sequences with U+FFFD). Both are also exposed via the Table API; input is BINARY/VARBINARY. Part of FLIP-568.
Brief change log
Important note: this PR is self-contained so that it does not depend on FLINK-39601; once that lands, this should delegate to StringUtf8Utils.firstInvalidUtf8ByteIndex, and the duplicate code should be removed from EncodingUtils.
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes - two new BuiltInFunctionDefinition entries and two new Table API methods on BaseExpressions)
Documentation
Was generative AI tooling used to co-author this PR?
2.1.117 (Claude Code)