
Conversation

@mstoeckl (Contributor) commented Nov 14, 2025

This updates the PNM decoder to do a streaming parse of the PAM header, without requiring any unbounded-length allocations. As noted in the earlier PR for PnmDecoder, #2620, the way the current parser reads lines usually isn't a big issue, but there are cases where the file received by an image decoder (and thus the maximum line length) can be very long compared to the cost of storing it on disk. The change may also be helpful in the long run for supporting no_std + no-alloc builds.
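For illustration, the core of the streaming approach looks roughly like this (a minimal sketch with an assumed buffer size and a hypothetical function name, not the PR's actual code): each header token is read byte by byte into a fixed-size stack buffer, and a token that does not fit is reported as an error instead of triggering a growing allocation.

```rust
use std::io::{self, Read};

/// Sketch: read one whitespace-delimited PAM header token into a fixed-size
/// stack buffer, one byte at a time, so no allocation grows with the input.
fn read_header_token<R: Read>(reader: &mut R, buf: &mut [u8; 64]) -> io::Result<usize> {
    let mut len = 0;
    loop {
        let mut byte = [0u8; 1];
        if reader.read(&mut byte)? == 0 {
            return Ok(len); // EOF ends the token
        }
        if byte[0].is_ascii_whitespace() {
            return Ok(len); // whitespace ends the token
        }
        if len == buf.len() {
            // Bounded failure instead of an unbounded allocation.
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "PAM header token is longer than expected",
            ));
        }
        buf[len] = byte[0];
        len += 1;
    }
}
```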

It should not take too much work to establish that this style of decoder is unlikely to crash: the main points where it could panic are bounds checks when writing into stack buffers, and str::from_utf8().expect(). Checking that it decodes to match the specific behavior of the PAM spec is rather more complicated. I've added a few more tests to help check some of the edge cases.
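For example, the from_utf8 case cannot fire as long as the buffer only ever holds bytes the parser has already checked to be ASCII, since ASCII is always valid UTF-8 (a sketch of the reasoning, not the PR's code):

```rust
fn token_as_str(buf: &[u8]) -> &str {
    // Holds by construction if the parser only stores ASCII bytes.
    debug_assert!(buf.iter().all(u8::is_ascii));
    // ASCII is a subset of UTF-8, so this cannot panic for such buffers.
    std::str::from_utf8(buf).expect("header token was validated as ASCII")
}
```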

This is technically a breaking change, since it starts rejecting PAM images with custom TUPLTYPEs longer than 64 bytes. (But I am relatively confident that nobody has ever made such images, and I suspect that nobody has ever used custom TUPLTYPEs for non-testing purposes.)

(After PAM, I think the only remaining code directly in image that could allocate a large amount during header decoding, and that does not have Limits implemented, is for Radiance HDR (fixable with a similar patch to this one, or with hard line-length limits) and for the JPEG decoder (now fixable with zune-jpeg's reader changes). Formats like ICO and TGA only allocate small arrays with up to 2^16 entries.)

@197g (Member) left a comment

Took a while to prioritize this review.

I like the abundance of tests, but on the allocation side this swings heavily the other way. We have limits to control unbounded allocation, not an arbitrary 64. If we want such a hard limit, it should reference other programs; e.g. netpbm's pamstack interface allows 255. I don't have any information on usage, but I also have no samples at all -- absence of evidence vs. evidence of absence, and all that. Also, rejecting any header identifier longer than 16 bytes early seems okay, but the new error drops the information about which header it was. I think it may become limiting in a permissive mode that skips over such headers instead.

Also, I feel like having a parser loop for each individual header is quite a lot of duplication and does not aid readability. Instead of this minimalism, can we get away with doing one bounded allocation as a buffer for lines? Extra kudos if there's a small stack buffer that only falls back to it where required.
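(For illustration only: the kind of buffer suggested here might look like the sketch below -- hypothetical sizes and names, not code from the PR or from netpbm. Note that the heap path still needs an explicit cap.)

```rust
/// Sketch: a line buffer that starts on the stack and spills at most once
/// into a single, capped heap allocation.
struct LineBuf {
    stack: [u8; 64],
    len: usize,              // valid bytes in `stack` while `heap` is None
    heap: Option<Vec<u8>>,
}

impl LineBuf {
    fn new() -> Self {
        LineBuf { stack: [0; 64], len: 0, heap: None }
    }

    fn push(&mut self, byte: u8, max: usize) -> Result<(), &'static str> {
        if let Some(v) = &mut self.heap {
            if v.len() >= max {
                return Err("line exceeds the configured limit");
            }
            v.push(byte);
        } else if self.len < self.stack.len() {
            self.stack[self.len] = byte;
            self.len += 1;
        } else {
            // Fall back to the heap exactly once, keeping what is on the stack.
            let mut v = Vec::with_capacity(self.len + 1);
            v.extend_from_slice(&self.stack[..self.len]);
            v.push(byte);
            self.heap = Some(v);
        }
        Ok(())
    }

    fn as_bytes(&self) -> &[u8] {
        self.heap.as_deref().unwrap_or(&self.stack[..self.len])
    }
}
```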

@mstoeckl (Contributor, Author) commented Dec 12, 2025

First, thank you for taking the time to review this one.

> I like the abundance of tests, but on the allocation side this swings heavily the other way. We have limits to control unbounded allocation, not an arbitrary 64.

One of the reasons I took this approach, as verbose as the code may be, is to avoid the need for configurable memory limits.

Introducing memory limits, especially for header parsing, makes the API of the decoder more complicated -- not just the interface through ImageDecoder, but also the interface for whatever direct users PnmDecoder has, who would now need to determine whether memory limits are necessary, what to set them to, etc. If a change to introduce memory limits does not break the existing API, it is easy to overlook; and if it forces a breaking change, users have more work to do to upgrade.

Also: in general, I expect resource limits to be hard to test and to keep working as code changes over the years, but if the design inherently bounds the required memory, that property should be easier to verify and retain.

> If we want such a hard limit, it should reference other programs; e.g. netpbm's pamstack interface allows 255.

libnetpbm actually imposes a general 255-byte limit on the TUPLTYPE length for its struct pam, so I've updated this PR to match it.

> Also, rejecting any header identifier longer than 16 bytes early seems okay, but the new error drops the information about which header it was.

I've updated the code to include the header prefix in the error message.

> Also, I feel like having a parser loop for each individual header is quite a lot of duplication and does not aid readability. Instead of this minimalism, can we get away with doing one bounded allocation as a buffer for lines?

I've looked for ways to deduplicate some of the token recognition state machines (from read_next_pam_identifier, read_pam_intraline_whitespace, read_pam_decimal_value, read_pam_tupltype_value), but the cases are just different enough that I couldn't find a clean way to simplify them. (I could merge read_pam_decimal_value and read_pam_tupltype_value if you'd like, but that doesn't save much code and can lose some error information.) I expect that I could make the code more compact with some helper functions or macros, but it wouldn't necessarily be easier to understand.

Once it is stabilized, std::ascii::Char would be helpful.

Unfortunately, lines can contain whitespace padding of arbitrary length, so sticking with the current line-oriented approach would need some allocation limit.
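For context, this is roughly how such padding can be consumed without buffering a whole line (a sketch with a hypothetical name; the actual read_pam_intraline_whitespace may differ):

```rust
use std::io::{self, Read};

/// Sketch: consume a run of spaces and tabs of any length without storing it,
/// returning the first byte after the padding (None at end of input).
fn skip_intraline_whitespace<R: Read>(reader: &mut R) -> io::Result<Option<u8>> {
    loop {
        let mut byte = [0u8; 1];
        if reader.read(&mut byte)? == 0 {
            return Ok(None);
        }
        if byte[0] != b' ' && byte[0] != b'\t' {
            return Ok(Some(byte[0]));
        }
    }
}
```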


Edit: also, if you'd like, I can split off most of the new tests (except for the bounded-length ones) into a separate PR, since they might be helpful to have in isolation.

The new decoding logic does not require any unbounded-length allocations.

Behavioral changes:
* The decoder now rejects a leading + on numbers, to match PNM (see the sketch after this list).
* The length of custom TUPLTYPEs now has a hard limit. This does not
  affect realistic images, and PnmDecoder only accepts a short list of
  known TUPLTYPEs anyway.  (Without a hard limit, the PAM header
  decoding logic would require memory limits to be specified in
  advance.)
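
A sketch of the stricter number rule from the first bullet (an assumed shape, not the PR's exact code): only ASCII digits are accepted, so a leading + or - is rejected.

```rust
/// Sketch: parse a PAM decimal token, rejecting signs and non-digits.
fn parse_pam_decimal(token: &[u8]) -> Result<u32, &'static str> {
    if token.is_empty() {
        return Err("empty decimal value");
    }
    let mut value: u32 = 0;
    for &b in token {
        if !b.is_ascii_digit() {
            return Err("non-digit in decimal value (a leading '+' is not allowed)");
        }
        value = value
            .checked_mul(10)
            .and_then(|v| v.checked_add(u32::from(b - b'0')))
            .ok_or("decimal value overflows u32")?;
    }
    Ok(value)
}
```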