Always return UTF-8 from text string metadata by splitbrain · Pull Request #422 · PrinsFrank/pdfparser

splitbrain · 2026-06-22T12:43:21Z

I noticed weird strings in my metadata results. Investigating showed that they were "mojibake": UTF-16BE strings decoded as if they were UTF-8. The reason was that literal (non-hex-encoded) strings were read with mb_chr().
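To illustrate the failure mode, here is a minimal sketch in plain PHP. It assumes the old code path mapped each raw byte to a code point with mb_chr(); the helper name misreadBytesAsCodePoints() is hypothetical and not part of pdfparser.

```php
<?php
// Hypothetical reconstruction of the bug: every raw byte is treated as a
// Unicode code point and re-encoded as UTF-8, so the BOM and the zero
// high bytes of UTF-16BE end up in the result.
function misreadBytesAsCodePoints(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $out .= mb_chr(ord($byte), 'UTF-8');
    }
    return $out;
}

// "Hi" as a UTF-16BE text string: BOM FE FF, then 00 48 00 69.
$raw = "\xFE\xFF\x00\x48\x00\x69";

// Garbled result: the BOM turns into the two characters U+00FE and U+00FF.
$mojibake = misreadBytesAsCodePoints($raw);

// Intended result: strip the BOM and convert the payload from UTF-16BE.
$decoded = mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');

var_dump(bin2hex($mojibake), $decoded);
```

The two results differ even for pure ASCII content, which is why the problem shows up in any literal string that carries a UTF-16BE BOM.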

This PR reads all data with getBinaryString() and then uses the data's byte order mark (BOM) to decide what decoding is needed.

Note: the final fallback to Latin-1 is not 100% correct (see code comment) but should be good enough. If you'd rather have proper PDFDocEncoding handling, let me know.

PrinsFrank

Thank you so much for the implementation! This was on my to-do list, but I can use all the help I can get! I added PDFDocEncoding; if you could use that here, this will fully support everything!

TextStringValue::getText() decodes literal and hex strings through a single path, then detects a leading UTF-16BE, UTF-16LE or UTF-8 byte order mark and converts to UTF-8 regardless of the surface form. Text strings without a byte order mark are decoded as PDFDocEncoding. Previously the UTF-16BE conversion only ran for hex strings, so UTF-16BE metadata written as a literal string was returned as raw bytes.
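The BOM dispatch described here can be sketched roughly as follows. This is an illustration under assumptions, not the code from TextStringValue::getText(): the function name decodeTextString() is hypothetical, and the fallback uses ISO-8859-1 as a stand-in for PDFDocEncoding, since the API of the PDFDocEncoding class is not shown in this thread.

```php
<?php
// Hypothetical helper: turn the raw bytes of a PDF text string into UTF-8
// by inspecting a leading byte order mark.
function decodeTextString(string $bytes): string
{
    // UTF-16BE BOM (FE FF)
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }

    // UTF-16LE BOM (FF FE)
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    }

    // UTF-8 BOM (EF BB BF): the payload is already UTF-8, drop the BOM only
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return substr($bytes, 3);
    }

    // No BOM. ASSUMPTION: ISO-8859-1 approximates PDFDocEncoding here;
    // the two differ for some byte values, so this is not exact.
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// The same text in three surface encodings decodes to the same UTF-8 string.
var_dump(decodeTextString("\xFE\xFF\x00\x48\x00\x69")); // UTF-16BE
var_dump(decodeTextString("\xFF\xFE\x48\x00\x69\x00")); // UTF-16LE
var_dump(decodeTextString("\xEF\xBB\xBFHi"));           // UTF-8 with BOM
```

Because the function works on raw bytes, it does not matter whether the string was written in literal or hex form in the PDF, which matches the single decoding path described above.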

splitbrain · 2026-06-22T19:30:19Z

I rebased off main and updated the commit to make use of the new PDFDocEncoding class. Comments removed.

PrinsFrank mentioned this pull request Jun 22, 2026

Add PDFDocEncoding #423

Merged

PrinsFrank reviewed Jun 22, 2026

Comment thread src/Document/Dictionary/DictionaryValue/TextString/TextStringValue.php Outdated

PrinsFrank reviewed Jun 22, 2026

Comment thread src/Document/Dictionary/DictionaryValue/TextString/TextStringValue.php Outdated

PrinsFrank requested changes Jun 22, 2026

splitbrain force-pushed the encoding branch from b16553b to 6e879b9 on June 22, 2026 19:21

splitbrain force-pushed the encoding branch from 6e879b9 to c0175d9 on June 22, 2026 19:28

splitbrain requested a review from PrinsFrank June 22, 2026 19:30

splitbrain wants to merge 1 commit into PrinsFrank:main from cosmocode:encoding
