
Conversation

@mstoeckl (Contributor) commented Nov 14, 2025

This updates the PNM decoder to do a streaming parse of the PAM header, without requiring any unbounded-length allocations. As noted in the earlier PR for PnmDecoder, #2620, the way the current parser reads lines usually isn't a big issue, but there are cases where the file received by an image decoder (and thus the maximum line length) can be very long compared to the cost of storing it on disk. The change may also be helpful in the long run for supporting no_std + no-alloc builds.
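For illustration, the core of the streaming approach looks roughly like this (a minimal sketch with an assumed buffer size and a hypothetical function name, not the PR's actual code): each header token is read byte by byte into a fixed-size stack buffer, and a token that does not fit is reported as an error instead of triggering a growing allocation.

```rust
use std::io::{self, Read};

/// Sketch: read one whitespace-delimited PAM header token into a fixed-size
/// stack buffer, one byte at a time, so no allocation grows with the input.
fn read_header_token<R: Read>(reader: &mut R, buf: &mut [u8; 64]) -> io::Result<usize> {
    let mut len = 0;
    loop {
        let mut byte = [0u8; 1];
        if reader.read(&mut byte)? == 0 {
            return Ok(len); // EOF ends the token
        }
        if byte[0].is_ascii_whitespace() {
            return Ok(len); // whitespace ends the token
        }
        if len == buf.len() {
            // Bounded failure instead of an unbounded allocation.
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "PAM header token is longer than expected",
            ));
        }
        buf[len] = byte[0];
        len += 1;
    }
}
```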

It should not take too much work to establish that this style of decoder is unlikely to crash: the main points where it could panic are bounds checks when writing into stack buffers, and str::from_utf8().expect(). Checking that it decodes to match the specific behavior of the PAM spec is rather more complicated. I've added a few more tests to help check some of the edge cases.
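For example, the from_utf8 case cannot fire as long as the buffer only ever holds bytes the parser has already checked to be ASCII, since ASCII is always valid UTF-8 (a sketch of the reasoning, not the PR's code):

```rust
fn token_as_str(buf: &[u8]) -> &str {
    // Holds by construction if the parser only stores ASCII bytes.
    debug_assert!(buf.iter().all(u8::is_ascii));
    // ASCII is a subset of UTF-8, so this cannot panic for such buffers.
    std::str::from_utf8(buf).expect("header token was validated as ASCII")
}
```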

This is technically a breaking change, since it starts rejecting PAM images with custom TUPLTYPEs longer than 64 bytes. (But I am relatively confident that nobody has ever made such images, and I suspect that nobody has ever used custom TUPLTYPEs for non-testing purposes.)

(After PAM, I think the only remaining code directly in image that could allocate a large amount during header decoding, and that does not have Limits implemented, is for Radiance HDR (fixable with a similar patch to this one, or with hard line-length limits) and for the JPEG decoder (now fixable with zune-jpeg's reader changes). Formats like ICO and TGA only allocate small arrays with up to 2^16 entries.)

@197g (Member) left a comment

Took a while to prioritize this review.

I like the abundance of tests, but on the allocation side this swings heavily the other way. We have limits to control unbounded allocation, not an arbitrary 64. If we want such a hard limit, it should reference other programs; e.g. netpbm's pamstack interface allows 255. I don't have any information on usage, but I also have no samples at all -- absence of evidence vs. evidence of absence, and all that. Also, rejecting any header identifier longer than 16 bytes early seems okay, but the new error drops the information about which header it was. I think it may become limiting in a permissive mode that skips over such headers instead.

Also, I feel like having a parser loop for each individual header is quite a lot of duplication and does not aid readability. Instead of this minimalism, can we get away with doing one bounded allocation as a buffer for lines? Extra kudos if there's a small stack buffer that only falls back to it where required.
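(For illustration only: the kind of buffer suggested here might look like the sketch below -- hypothetical sizes and names, not code from the PR or from netpbm. Note that the heap path still needs an explicit cap.)

```rust
/// Sketch: a line buffer that starts on the stack and spills at most once
/// into a single, capped heap allocation.
struct LineBuf {
    stack: [u8; 64],
    len: usize,              // valid bytes in `stack` while `heap` is None
    heap: Option<Vec<u8>>,
}

impl LineBuf {
    fn new() -> Self {
        LineBuf { stack: [0; 64], len: 0, heap: None }
    }

    fn push(&mut self, byte: u8, max: usize) -> Result<(), &'static str> {
        if let Some(v) = &mut self.heap {
            if v.len() >= max {
                return Err("line exceeds the configured limit");
            }
            v.push(byte);
        } else if self.len < self.stack.len() {
            self.stack[self.len] = byte;
            self.len += 1;
        } else {
            // Fall back to the heap exactly once, keeping what is on the stack.
            let mut v = Vec::with_capacity(self.len + 1);
            v.extend_from_slice(&self.stack[..self.len]);
            v.push(byte);
            self.heap = Some(v);
        }
        Ok(())
    }

    fn as_bytes(&self) -> &[u8] {
        self.heap.as_deref().unwrap_or(&self.stack[..self.len])
    }
}
```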

@mstoeckl (Contributor, Author) commented Dec 12, 2025

First, thank you for taking the time to review this one.

> I like the abundance of tests, but on the allocation side this swings heavily the other way. We have limits to control unbounded allocation, not an arbitrary 64.

One of the reasons I took this approach, as verbose as the code may be, is to avoid the need for configurable memory limits.

Introducing memory limits, especially for header parsing, makes the API of the decoder more complicated -- not just the interface through ImageDecoder, but also the interface for whatever direct users PnmDecoder has, who would now need to determine whether memory limits are necessary, what to set them to, etc. If a change to introduce memory limits does not break the existing API, it is easy to overlook; and if it forces a breaking change, users have more work to do to upgrade.

Also: in general, I expect resource limits to be hard to test and to keep working as code changes over the years, but if the design inherently bounds the required memory, that property should be easier to verify and retain.

> If we want such a hard limit, it should reference other programs; e.g. netpbm's pamstack interface allows 255.

libnetpbm actually imposes a general 255-byte limit on the TUPLTYPE length for its struct pam, so I've updated this PR to match it.

> Also, rejecting any header identifier longer than 16 bytes early seems okay, but the new error drops the information about which header it was.

I've updated the code to include the header prefix in the error message.

> Also, I feel like having a parser loop for each individual header is quite a lot of duplication and does not aid readability. Instead of this minimalism, can we get away with doing one bounded allocation as a buffer for lines?

I've looked for ways to deduplicate some of the token recognition state machines (from read_next_pam_identifier, read_pam_intraline_whitespace, read_pam_decimal_value, read_pam_tupltype_value), but the cases are just different enough that I couldn't find a clean way to simplify them. (I could merge read_pam_decimal_value and read_pam_tupltype_value if you'd like, but that doesn't save much code and can lose some error information.) I expect that I could make the code more compact with some helper functions or macros, but it wouldn't necessarily be easier to understand.

Once it is stabilized, std::ascii::Char would be helpful.

Unfortunately, lines can contain whitespace padding of arbitrary length, so sticking with the current line-oriented approach would need some allocation limit.
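For context, this is roughly how such padding can be consumed without buffering a whole line (a sketch with a hypothetical name; the actual read_pam_intraline_whitespace may differ):

```rust
use std::io::{self, Read};

/// Sketch: consume a run of spaces and tabs of any length without storing it,
/// returning the first byte after the padding (None at end of input).
fn skip_intraline_whitespace<R: Read>(reader: &mut R) -> io::Result<Option<u8>> {
    loop {
        let mut byte = [0u8; 1];
        if reader.read(&mut byte)? == 0 {
            return Ok(None);
        }
        if byte[0] != b' ' && byte[0] != b'\t' {
            return Ok(Some(byte[0]));
        }
    }
}
```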


Edit: also, if you'd like, I can split off most of the new tests (except for the bounded-length ones) into a separate PR, since they might be helpful to have in isolation.

The new decoding logic does not require any unbounded-length allocations.

Behavioral changes:
* The decoder now rejects a leading + on numbers, to match PNM (see the sketch after this list).
* The length of custom TUPLTYPEs now has a hard limit. This does not
  affect realistic images, and PnmDecoder only accepts a short list of
  known TUPLTYPEs anyway.  (Without a hard limit, the PAM header
  decoding logic would require memory limits to be specified in
  advance.)
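
A sketch of the stricter number rule from the first bullet (an assumed shape, not the PR's exact code): only ASCII digits are accepted, so a leading + or - is rejected.

```rust
/// Sketch: parse a PAM decimal token, rejecting signs and non-digits.
fn parse_pam_decimal(token: &[u8]) -> Result<u32, &'static str> {
    if token.is_empty() {
        return Err("empty decimal value");
    }
    let mut value: u32 = 0;
    for &b in token {
        if !b.is_ascii_digit() {
            return Err("non-digit in decimal value (a leading '+' is not allowed)");
        }
        value = value
            .checked_mul(10)
            .and_then(|v| v.checked_add(u32::from(b - b'0')))
            .ok_or("decimal value overflows u32")?;
    }
    Ok(value)
}
```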