Skip to content

Normalize simple-font glyph widths to em units#421

Open
splitbrain wants to merge 1 commit into
PrinsFrank:mainfrom
cosmocode:fontwidths
Open

Normalize simple-font glyph widths to em units#421
splitbrain wants to merge 1 commit into
PrinsFrank:mainfrom
cosmocode:fontwidths

Conversation

@splitbrain

Copy link
Copy Markdown
Contributor

CID font widths are represented as em units throughout the code, eg. their width is divided by 1000 in RangeCIDWidth.php and ConsecutiveCIDWidth.php. This division by 1000 was not done for simple fonts in FontWidths.php. The result was that when calculating the advance width to decide if a space needs to be inserted between texts, the width for simple (eg. ASCII only) fonts were 1000 times too large. This resulted in some words not being separated correctly.

This PR fixes this by also dividing the simple font widths by 1000.

Obviously this changes how a lot of the samples are parsed. There are several obvious improvements:

  CH1CH1CH1CH1                          -> CH1 CH1 CH1 CH1
  RS232 INRS232 OUTRS232 INRS232 OUT    -> RS232 IN RS232 OUT RS232 IN RS232 OUT
  Control PanelControl PanelControl Panel -> Control Panel Control Panel Control Panel
  20232024Veraenderung                  -> 2023 2024 Veraenderung
  ...973.606169.622187.39325.376...      -> ...973.606 169.622 187.393 25.376...
  sit ametviverra                       -> sit amet viverra
  *Ut autem excepturi...                -> * Ut autem excepturi...

Unfortunately it also overinserts spaces sometimes. Most of them are harmless for the case of text extraction (two spaces instead of one, spaces around some punctuation marks) but in somewhat rare cases it also splits valid words ("Worte" -> "W orte").

I still think this implementation is more correct and fixes more cases than it overcorrects.

Note: this is also a pre-requisite for an up-coming PR I am working on that will correctly handle rotated text - without correctly calculated glyph widths it get's real weird otherwise.

Font::getWidthForChars() is the sole input to the inter-run space test in
ContentStream::getText() (a space is emitted when the horizontal gap between
two text runs exceeds the previous run's advance width by 0.40 * fontSize).
That test only works if advance widths are expressed in em units.

Glyph widths were inconsistently normalized:

  - CID fonts        -> em units      (RangeCIDWidth/ConsecutiveCIDWidth and
                                        the DW default already divide by 1000)
  - simple fonts     -> raw 1000-unit glyph space (NOT divided)

So for a simple (non-CID) font the advance width came out ~1000x too large.
gap - advanceWidth was therefore always hugely negative, and the space test
never fired between simple-font runs - leaving adjacent runs glued together
in the extracted text.

Fix, mirroring the CID width classes:

  - FontWidths::getWidthForCharacter() now divides the stored /Widths value
    by 1000 (the single, obvious spot symmetrical with the CID items).
  - Font::getDefaultWidth() non-CID fallback returns 1.0 em instead of 1000.

The CID path is untouched to avoid double division. getWidthForChar()'s
formula ((width * fontSize + charSpace) * scaleX) is already correct for
em-unit input and needs no change.

Effect on the sample fixtures (regenerated):

Previously-unreadable jammed simple-font text now gains its missing spaces:

  CH1CH1CH1CH1                          -> CH1 CH1 CH1 CH1
  RS232 INRS232 OUTRS232 INRS232 OUT    -> RS232 IN RS232 OUT RS232 IN RS232 OUT
  Control PanelControl PanelControl Panel -> Control Panel Control Panel Control Panel
  20232024Veraenderung                  -> 2023 2024 Veraenderung
  ...973.606169.622187.39325.376...      -> ...973.606 169.622 187.393 25.376...
  sit ametviverra                       -> sit amet viverra
  *Ut autem excepturi...                -> * Ut autem excepturi...

The now-active gap test also over-inserts in densely typeset / justified
documents - occasional double spaces ("perferendis  Et"), a space before a
trailing-run punctuation glyph ("minima . Id"), and rare in-word splits
("Worte" -> "W orte") where a glyph pair sits just beyond the threshold.
Tuning the 0.40 constant confirmed it is the best value: raising it
drops genuine fixes (bullets, list dashes) while barely reducing the
over-insertions, which stem from real inter-run gaps in the source layout.

Net result is a large readability gain for tabular and label-heavy PDFs
and probably worth the occasional overinsertion of spaces.
@PrinsFrank

PrinsFrank commented Jun 22, 2026

Copy link
Copy Markdown
Owner

Hey, thanks so much for the high-quality PR! Just out of curiosity: Did you use an LLM and if so which one? I love the detailed commit message!

I'll continue reviewing this tomorrow, but looks amazing so far! I want to understand where the whitespace is coming from but I think this is fine to merge anyway. ;)

@splitbrain

splitbrain commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

I am working with Claude Code (opus on thinking xhigh) but am reading and reviewing every line, to make sure I actually understand the changes. PR messages are hand crafted based on my own understanding. Same goes for all comments you'll receive from me.

PS: the rotation text PR will take a bit longer. I am not happy with it yet.

@PrinsFrank

Copy link
Copy Markdown
Owner

Kudos for the contribution and especially the commit messages, I think these are the highest quality contributions I've received so far! And no worries about AI usage if any! Thanks again for the contribution, I'll merge this somewhere tomorrow night! I'm looking forward to your future contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants