Always return UTF-8 from text string metadata #422

Open
splitbrain wants to merge 1 commit into PrinsFrank:main from cosmocode:encoding

@splitbrain

Contributor

I noticed weird strings returned in my metadata results. Investigating showed that they were "mojibake": UTF-16BE strings decoded as if they were UTF-8. The cause was that literal (non-hex-encoded) strings were read with mb_chr().
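
To illustrate the kind of corruption described above, here is a minimal sketch. It assumes the old code path mapped each raw byte to a code point with mb_chr(); the description only says literal strings were "read with mb_chr()", so the loop is a guess at the mechanism, not the library's actual code. It requires the mbstring extension.

```php
<?php
// UTF-16BE bytes for "Hé": BOM (FE FF), then 00 48 ("H") and 00 E9 ("é").
$raw = "\xFE\xFF\x00\x48\x00\xE9";

// Assumed buggy path (hypothetical): every byte is treated as its own
// code point, so the BOM turns into "þÿ" and the text is interleaved
// with NUL characters.
$mojibake = '';
foreach (str_split($raw) as $byte) {
    $mojibake .= mb_chr(ord($byte), 'UTF-8');
}

// Byte-preserving path: keep the raw bytes, drop the 2-byte BOM and
// convert the remainder from UTF-16BE to UTF-8.
$decoded = mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');

echo bin2hex($mojibake), PHP_EOL; // hex dump of the corrupted string
echo $decoded, PHP_EOL;           // Hé
```

The point of the sketch is only that the raw bytes have to survive untouched until the encoding is known; once they have been mapped to code points one by one, the UTF-16 structure is lost.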

This PR reads all data with getBinaryString() and then uses the data's byte order mark (BOM) to decide what decoding is needed.

Note: the final fallback to Latin-1 is not 100% correct (see code comment) but should be good enough. If you'd rather have proper PDFDocEncoding handling, let me know.

@PrinsFrank PrinsFrank mentioned this pull request Jun 22, 2026
Comment thread src/Document/Dictionary/DictionaryValue/TextString/TextStringValue.php Outdated
Comment thread src/Document/Dictionary/DictionaryValue/TextString/TextStringValue.php Outdated

@PrinsFrank PrinsFrank left a comment

Owner

Thank you so much for the implementation! This was on my to-do list, but I can use all the help I can get! I added PDFDocEncoding; if you could use that here, this would fully support everything!

TextStringValue::getText() decodes literal and hex strings through a
single path, then detects a leading UTF-16BE, UTF-16LE or UTF-8 byte
order mark and converts to UTF-8 regardless of the surface form. Text
strings without a byte order mark are decoded as PDFDocEncoding.

Previously the UTF-16BE conversion only ran for hex strings, so UTF-16BE
metadata written as a literal string was returned as raw bytes.
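
The byte order mark dispatch described in the commit message can be sketched roughly as follows. This is not the code from the PR: decodeTextStringBytes() is a hypothetical helper name, and the final branch uses ISO-8859-1 as a stand-in for PDFDocEncoding, which is only an approximation (as noted in the PR description). It assumes PHP 8.0+ for str_starts_with() and the mbstring extension.

```php
<?php
/**
 * Hypothetical helper (not part of the library): convert the raw bytes of
 * a PDF text string to UTF-8 based on its leading byte order mark.
 */
function decodeTextStringBytes(string $bytes): string
{
    // UTF-16BE BOM: FE FF
    if (str_starts_with($bytes, "\xFE\xFF")) {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }

    // UTF-16LE BOM: FF FE
    if (str_starts_with($bytes, "\xFF\xFE")) {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    }

    // UTF-8 BOM: EF BB BF. The payload is already UTF-8, so only strip it.
    if (str_starts_with($bytes, "\xEF\xBB\xBF")) {
        return substr($bytes, 3);
    }

    // No BOM. ASSUMPTION: ISO-8859-1 stands in for PDFDocEncoding here;
    // the two differ for some byte values, so this is an approximation.
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// The same text in all four forms decodes to the same UTF-8 string.
echo decodeTextStringBytes("\xFE\xFF\x00\x48\x00\xE9"), PHP_EOL; // Hé
echo decodeTextStringBytes("\xFF\xFE\x48\x00\xE9\x00"), PHP_EOL; // Hé
echo decodeTextStringBytes("\xEF\xBB\xBFH\xC3\xA9"), PHP_EOL;    // Hé
echo decodeTextStringBytes("H\xE9"), PHP_EOL;                    // Hé
```

Because the helper works on raw bytes only, it does not matter whether they came from a literal or a hex string, which matches the single decoding path the commit message describes.
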
@splitbrain

Contributor Author

I rebased off main and updated the commit to make use of the new PDFDocEncoding class. Comments removed.

@splitbrain splitbrain requested a review from PrinsFrank June 22, 2026 19:30

2 participants