Normalize simple-font glyph widths to em units by splitbrain · Pull Request #421 · PrinsFrank/pdfparser

splitbrain · 2026-06-22T07:33:02Z

CID font widths are represented as em units throughout the code, eg. their width is divided by 1000 in RangeCIDWidth.php and ConsecutiveCIDWidth.php. This division by 1000 was not done for simple fonts in FontWidths.php. The result was that when calculating the advance width to decide if a space needs to be inserted between texts, the width for simple (eg. ASCII only) fonts were 1000 times too large. This resulted in some words not being separated correctly.

This PR fixes this by also dividing the simple font widths by 1000.

Obviously this changes how a lot of the samples are parsed. There are several obvious improvements:

  CH1CH1CH1CH1                          -> CH1 CH1 CH1 CH1
  RS232 INRS232 OUTRS232 INRS232 OUT    -> RS232 IN RS232 OUT RS232 IN RS232 OUT
  Control PanelControl PanelControl Panel -> Control Panel Control Panel Control Panel
  20232024Veraenderung                  -> 2023 2024 Veraenderung
  ...973.606169.622187.39325.376...      -> ...973.606 169.622 187.393 25.376...
  sit ametviverra                       -> sit amet viverra
  *Ut autem excepturi...                -> * Ut autem excepturi...

Unfortunately it also overinserts spaces sometimes. Most of them are harmless for the case of text extraction (two spaces instead of one, spaces around some punctuation marks) but in somewhat rare cases it also splits valid words ("Worte" -> "W orte").

I still think this implementation is more correct and fixes more cases than it overcorrects.

Note: this is also a pre-requisite for an up-coming PR I am working on that will correctly handle rotated text - without correctly calculated glyph widths it get's real weird otherwise.

Font::getWidthForChars() is the sole input to the inter-run space test in ContentStream::getText() (a space is emitted when the horizontal gap between two text runs exceeds the previous run's advance width by 0.40 * fontSize). That test only works if advance widths are expressed in em units. Glyph widths were inconsistently normalized: - CID fonts -> em units (RangeCIDWidth/ConsecutiveCIDWidth and the DW default already divide by 1000) - simple fonts -> raw 1000-unit glyph space (NOT divided) So for a simple (non-CID) font the advance width came out ~1000x too large. gap - advanceWidth was therefore always hugely negative, and the space test never fired between simple-font runs - leaving adjacent runs glued together in the extracted text. Fix, mirroring the CID width classes: - FontWidths::getWidthForCharacter() now divides the stored /Widths value by 1000 (the single, obvious spot symmetrical with the CID items). - Font::getDefaultWidth() non-CID fallback returns 1.0 em instead of 1000. The CID path is untouched to avoid double division. getWidthForChar()'s formula ((width * fontSize + charSpace) * scaleX) is already correct for em-unit input and needs no change. Effect on the sample fixtures (regenerated): Previously-unreadable jammed simple-font text now gains its missing spaces: CH1CH1CH1CH1 -> CH1 CH1 CH1 CH1 RS232 INRS232 OUTRS232 INRS232 OUT -> RS232 IN RS232 OUT RS232 IN RS232 OUT Control PanelControl PanelControl Panel -> Control Panel Control Panel Control Panel 20232024Veraenderung -> 2023 2024 Veraenderung ...973.606169.622187.39325.376... -> ...973.606 169.622 187.393 25.376... sit ametviverra -> sit amet viverra *Ut autem excepturi... -> * Ut autem excepturi... The now-active gap test also over-inserts in densely typeset / justified documents - occasional double spaces ("perferendis Et"), a space before a trailing-run punctuation glyph ("minima . Id"), and rare in-word splits ("Worte" -> "W orte") where a glyph pair sits just beyond the threshold. Tuning the 0.40 constant confirmed it is the best value: raising it drops genuine fixes (bullets, list dashes) while barely reducing the over-insertions, which stem from real inter-run gaps in the source layout. Net result is a large readability gain for tabular and label-heavy PDFs and probably worth the occasional overinsertion of spaces.

PrinsFrank · 2026-06-22T18:20:01Z

Hey, thanks so much for the high-quality PR! Just out of curiosity: Did you use an LLM and if so which one? I love the detailed commit message!

I'll continue reviewing this tomorrow, but looks amazing so far! I want to understand where the whitespace is coming from but I think this is fine to merge anyway. ;)

splitbrain · 2026-06-22T19:06:33Z

I am working with Claude Code (opus on thinking xhigh) but am reading and reviewing every line, to make sure I actually understand the changes. PR messages are hand crafted based on my own understanding. Same goes for all comments you'll receive from me.

PS: the rotation text PR will take a bit longer. I am not happy with it yet.

PrinsFrank · 2026-06-22T19:18:08Z

Kudos for the contribution and especially the commit messages, I think these are the highest quality contributions I've received so far! And no worries about AI usage if any! Thanks again for the contribution, I'll merge this somewhere tomorrow night! I'm looking forward to your future contributions!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Normalize simple-font glyph widths to em units#421

Normalize simple-font glyph widths to em units#421
splitbrain wants to merge 1 commit into
PrinsFrank:mainfrom
cosmocode:fontwidths

splitbrain commented Jun 22, 2026

Uh oh!

PrinsFrank commented Jun 22, 2026 •

edited

Loading

Uh oh!

splitbrain commented Jun 22, 2026 •

edited

Loading

Uh oh!

PrinsFrank commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

splitbrain commented Jun 22, 2026

Uh oh!

PrinsFrank commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

splitbrain commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PrinsFrank commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PrinsFrank commented Jun 22, 2026 •

edited

Loading

splitbrain commented Jun 22, 2026 •

edited

Loading