pnm: rewrite PAM header decoder #2647
Conversation
Took a while to prioritize this review.
I like the abundance of tests, but on the allocation side this swings heavily in the other direction. We have limits to control for unbounded allocation, not an arbitrary 64. If we want such a hard limit, it should reference what other programs do; e.g. netpbm's pamstack interface allows 255. I don't have any information on usage, but I also have no sample at all. Absence of information vs. information of absence, and all that. Also, rejecting any 16-byte header identifier early seems okay, but the new error drops the information on the header. I think it may become limiting in a permissive mode that skips over such headers instead.
Also, I feel like having the parser loop in each individual header is quite a bit of duplication and does not aid readability. Instead of this minimalism, can we get away with doing one bounded allocation as a buffer for lines? Extra kudos if there's a small stack buffer that only falls back to the heap where required — see the sketch below.
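A minimal sketch of that idea — a buffer that stays on the stack up to N bytes and spills to a single bounded heap allocation only when needed. All names (`LineBuf`, the `limit` plumbing) are hypothetical, not code from this PR:

```rust
/// Stack-first line buffer with a bounded heap fallback (illustrative only).
struct LineBuf<const N: usize> {
    stack: [u8; N],
    len: usize,
    heap: Option<Vec<u8>>, // used only once `len` would exceed N
    limit: usize,          // hard cap, e.g. taken from the decoder's Limits
}

impl<const N: usize> LineBuf<N> {
    fn new(limit: usize) -> Self {
        Self { stack: [0; N], len: 0, heap: None, limit }
    }

    fn push(&mut self, byte: u8) -> Result<(), &'static str> {
        if self.len >= self.limit {
            return Err("header line exceeds configured limit");
        }
        match &mut self.heap {
            Some(v) => v.push(byte),
            None if self.len < N => self.stack[self.len] = byte,
            None => {
                // First spill: copy the full stack contents to the heap.
                let mut v = Vec::with_capacity(N * 2);
                v.extend_from_slice(&self.stack);
                v.push(byte);
                self.heap = Some(v);
            }
        }
        self.len += 1;
        Ok(())
    }

    fn as_slice(&self) -> &[u8] {
        match &self.heap {
            Some(v) => v,
            None => &self.stack[..self.len],
        }
    }
}

fn main() {
    let mut buf: LineBuf<16> = LineBuf::new(4096);
    for &b in b"TUPLTYPE GRAYSCALE" {
        buf.push(b).unwrap(); // spills to the heap after 16 bytes
    }
    assert_eq!(buf.as_slice(), b"TUPLTYPE GRAYSCALE");
}
```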
First, thank you for taking the time to review this one.
One of the reasons I took this approach, as verbose as the code may be, is to avoid the need for configurable memory limits. Introducing memory limits, especially for header parsing, makes the API of the decoder more complicated -- not just the interface through ImageDecoder, but also for whatever direct users PnmDecoder has, who would now need to determine whether or not memory limits are necessary, what to set them to, etc. If a change to introduce memory limits does not break the existing API, then it is easy to overlook; and if it forces a breaking change, then users have more work to upgrade. Also: in general I expect resource limits can be hard to test and keep working as code changes over the years, but if the design inherently limits the required memory then it should be easier to verify and retain that property.
libnetpbm actually imposes a 255-byte limit in general on the TUPLTYPE length.
I've updated the code to include the header prefix in the error message.
I've looked for ways to deduplicate some of the token recognition state machines. Unfortunately, lines can contain whitespace padding of arbitrary length, so sticking with the current line-oriented approach would need some allocation limit. Edit: also, if you'd like, I can split off most of the new tests (except for the bounded-length ones) into a separate PR, since they might be helpful to have in isolation.
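For illustration, here's roughly the shape of a bounded, streaming token reader in that style — a sketch with assumed names and error handling, not the PR's actual code:

```rust
use std::io::{self, Read};

/// Read one whitespace-delimited token into a caller-provided stack buffer,
/// failing if the token exceeds the buffer instead of growing a String.
fn read_token<'a, R: Read>(r: &mut R, buf: &'a mut [u8]) -> io::Result<&'a str> {
    let mut byte = [0u8; 1];
    // Skip leading whitespace (PAM headers allow padding of arbitrary length).
    loop {
        r.read_exact(&mut byte)?;
        if !byte[0].is_ascii_whitespace() {
            break;
        }
    }
    let mut len = 0;
    loop {
        if len == buf.len() {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "token too long"));
        }
        buf[len] = byte[0];
        len += 1;
        match r.read(&mut byte)? {
            0 => break, // EOF ends the token
            _ if byte[0].is_ascii_whitespace() => break,
            _ => {}
        }
    }
    std::str::from_utf8(&buf[..len])
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "non-ASCII token"))
}
```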
The new decoding logic does not require any unbounded-length allocations. Behavioral changes:

* The decoder now rejects a leading + on numbers, to match PNM.
* The length of custom TUPLTYPEs now has a hard limit. This does not affect realistic images, and PnmDecoder only accepts a short list of known TUPLTYPEs anyway. (Without a hard limit, the PAM header decoding logic would require memory limits to be specified in advance.)
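The stricter number parsing might look roughly like this (an illustrative sketch, not the PR's actual code; note that `str::parse::<u32>()` on its own would accept a leading '+'):

```rust
/// Parse a PNM/PAM header number: ASCII digits only, no sign.
fn parse_header_number(token: &str) -> Result<u32, String> {
    if token.is_empty() || !token.bytes().all(|b| b.is_ascii_digit()) {
        return Err(format!("invalid number in header: {token:?}"));
    }
    token
        .parse::<u32>()
        .map_err(|e| format!("number out of range: {e}"))
}

fn main() {
    assert_eq!(parse_header_number("255"), Ok(255));
    assert!(parse_header_number("+255").is_err()); // rejected, matching PNM
    assert!(parse_header_number("-1").is_err());
}
```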
This updates the PNM decoder to do a streaming parse of the PAM header, without requiring any unbounded-length allocations. As noted in the earlier PR for the PnmDecoder, #2620, while the way the current parser reads lines usually isn't a big issue, there are cases where the input received by an image decoder (and thus the maximum line length) can be very large compared to the cost of storing it on disk. The change may also be helpful in the long run for supporting no_std + no-alloc builds.
It should not take too much work to establish that this style of decoder is unlikely to crash: the main points where it could panic are bounds checks when writing into stack buffers, and str::from_utf8().expect(). Checking that it decodes to match the specific behavior of the PAM spec is rather more complicated; I've added a few more tests to help check some of the edge cases.

This is technically a breaking change, since it starts rejecting PAM images with custom TUPLTYPEs longer than 64 bytes. (But I am relatively confident that nobody has ever made such images, and I suspect that nobody has ever used custom TUPLTYPEs for non-testing purposes.)
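A hypothetical regression test for that new limit might look like the following. The 64-byte cap comes from this PR's description; whether the returned error actually distinguishes "TUPLTYPE too long" from "unknown tuple type" is an assumption, since PnmDecoder rejects unknown custom TUPLTYPEs anyway:

```rust
use std::io::Cursor;

#[test]
fn rejects_overlong_custom_tupltype() {
    // Build a PAM header whose TUPLTYPE value is 65 bytes: one past the cap.
    let mut data = b"P7\nWIDTH 1\nHEIGHT 1\nDEPTH 1\nMAXVAL 255\nTUPLTYPE ".to_vec();
    data.extend(std::iter::repeat(b'X').take(65));
    data.extend_from_slice(b"\nENDHDR\n\x00");
    let result = image::codecs::pnm::PnmDecoder::new(Cursor::new(data));
    assert!(result.is_err());
}
```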
(After PAM, I think the only remaining code directly in image which could allocate a large amount during header decoding, and does not have Limits implemented, is for Radiance HDR (fixable using a similar patch to this one, or with hard line-length limits) and the JPEG decoder (now fixable with zune-jpeg's reader changes). Formats like ICO and TGA only allocate small arrays with up to 2^16 entries.)