Always return UTF-8 from text string metadata #422
Open
splitbrain wants to merge 1 commit into
Merged
PrinsFrank
reviewed
Jun 22, 2026
PrinsFrank
reviewed
Jun 22, 2026
PrinsFrank
requested changes
Jun 22, 2026
PrinsFrank
left a comment
Owner
Thank you so much for the implementation! This was on my to-do list, but I can use all the help I can get! I added PDFDocEncoding; if you could use that here, this would fully support everything!
TextStringValue::getText() decodes literal and hex strings through a single path, then detects a leading UTF-16BE, UTF-16LE or UTF-8 byte order mark and converts to UTF-8 regardless of the surface form. Text strings without a byte order mark are decoded as PDFDocEncoding. Previously the UTF-16BE conversion only ran for hex strings, so UTF-16BE metadata written as a literal string was returned as raw bytes.
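For readers following along, the byte order mark dispatch described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name decodeTextString() is hypothetical, and the final branch uses ISO-8859-1 purely as a stand-in for PDFDocEncoding, which differs from Latin-1 in some code points.

```php
<?php

/**
 * Hypothetical helper (not part of the library): convert the raw bytes of a
 * PDF text string to UTF-8 by inspecting a leading byte order mark.
 * Requires the mbstring extension.
 */
function decodeTextString(string $bytes): string
{
    // UTF-16BE BOM (FE FF): strip the two BOM bytes, then convert.
    if (substr($bytes, 0, 2) === "\xFE\xFF") {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }

    // UTF-16LE BOM (FF FE): same idea with the opposite byte order.
    if (substr($bytes, 0, 2) === "\xFF\xFE") {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    }

    // UTF-8 BOM (EF BB BF): the payload is already UTF-8, only drop the BOM.
    if (substr($bytes, 0, 3) === "\xEF\xBB\xBF") {
        return substr($bytes, 3);
    }

    // No BOM: the PR decodes PDFDocEncoding at this point.
    // ISO-8859-1 is an assumption used here only to keep the sketch runnable.
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}
```

Because the function only looks at the decoded bytes, it gives the same result whether those bytes came from a literal string or a hex string, which is the point of routing both through a single path.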
Contributor
Author
I rebased off main and updated the commit to make use of the new PDFDocEncoding class. Comments removed.
I noticed weird strings returned in my metadata results. Investigating showed that I was getting "mojibake" strings: UTF-16BE strings decoded as if they were UTF-8. The reason was that literal (non-hex) strings were read with mb_chr(). This PR reads all data with getBinaryString() and then uses the data's byte order mark (BOM) to decide what decoding is needed.
Note: the final fallback to latin-1 is not 100% correct (see code comment) but should be good enough. If you'd rather have proper PDFDocEncoding handling, let me know.