Skip to content

Conversation

@afonso360
Copy link
Contributor

@afonso360 afonso360 commented Aug 21, 2023

👋 Hey,

This is a followup to #6874 where f{min,max}_pseudo was removed and replaced it with bitselect+fcmp. Here we optimize that pattern into a mask generation instruction and vmerge.vvm that merges both inputs.

This allows us to avoid the quite long sequence for bitselect (4 instructions) and also vector mask expansion (1 instruction) in these patterns.

For tests here I'm relying mostly on wasmtimes wast testsuite.

@afonso360 afonso360 added the cranelift:area:riscv64 Issues related to the RISC-V 64 backend. label Aug 21, 2023
@afonso360 afonso360 requested a review from a team as a code owner August 21, 2023 23:14
@afonso360 afonso360 requested review from jameysharp and removed request for a team August 21, 2023 23:14
@afonso360 afonso360 changed the title riscv64: Optimize bitselect+cmp codegen riscv64: Optimize bitselect+{i,f}cmp codegen Aug 21, 2023
@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label Aug 22, 2023
@alexcrichton
Copy link
Member

This might be good for a rule like on x64 where much of the logic of bitselect can be skipped if the mask is known to have a particular pattern? (e.g. generated by icmp or fcmp)

@afonso360
Copy link
Contributor Author

afonso360 commented Aug 22, 2023

Yeah, that's a neat way of de-duplicating these rules. I also think it might be worth copying that vconst_all_ones_or_zeros pattern into an egraphs rule, since it looks like we should be able to const propagate that.

Edit: Hmm actually I don't think we can quite deduplicate this as neatly since we have different ways of generating a icmp mask and fcmp mask.

Edit2: Wait, now I'm confused. We're already sort of skipping the main bitselect lowering by using vmerge instead, that merges whole lanes instead of bits. We can't skip generating the icmp/fcmp mask since in RISC-V the masks are 1 bit per lane, so we always need to make the conversion from 1 bit per bit, into 1 bit per lane. Is that what you meant?

@alexcrichton
Copy link
Member

Oh this may not be applicable for riscv64, but on x64 at least there's no bitselect instruction so it uses the and/or/not/etc combo by default, but x64 does have lane-based select so if a bitselect's input mask is a comparison then the and/or/not/etc combo can be skipped in favor of a single lane-based instruction. I figured the same might be possible for riscv64 where bitselect could skip the combo of instructions and generate a single vmerge* (which I interpreted, perhaps mistakenly, as a lane-based instruction) after the normal instruction generated for the comparison. That way the rules for bitselect wouldn't generate icmp/fcmp instructions at all, instead the output of those instructions will be fed directly into vmerge.

since we have different ways of generating a icmp mask and fcmp mask.

On x64 I believe this is handled by bitselect not actually generating the icmp/fcmp, instead basically calling put_in_reg for the output of the icmp/fcmp which then will later use other lowering rules to generate the icmp/fcmp.

We can't skip generating the icmp/fcmp mask since in RISC-V the masks are 1 bit per lane, so we always need to make the conversion from 1 bit per bit, into 1 bit per lane. Is that what you meant?

Right yeah, my thinking was that like on x64 the lowering for bitselect wouldn't generate the cmp, only the vmerge, but because the mask of the bitselect is only pattern-matched on but we don't use the sub-operands that it would still generate the cmp naturally via other lowering rules.

@afonso360
Copy link
Contributor Author

afonso360 commented Aug 22, 2023

Oh, okay so it's pretty much the same on RISC-V. We don't have a direct bitselect equivalent, we lower to and/or/not combo as well by default. vmerge is a lanewise select as you mentioned.

That way the rules for bitselect wouldn't generate icmp/fcmp instructions at all, instead the output of those instructions will be fed directly into vmerge.

Here's where this diverges. vmerge does not support the regular masks that are fed into bitselect. For a i8x16 we need a 16bit mask instead of a 128bit mask. (Or a 8 bit mask for i16x8, etc...)

In the regular icmp and fcmp instruction lowerings we do two things, first we generate the small mask (1 bit per lane), and then we do a second step of expanding that into a big mask.

In this PR gen_icmp_mask and gen_fcmp_mask only generate the small mask, that is compatible with vmerge. But we never expand it, which is what would happen if we let the regular icmp and fcmp instructions get lowered.

Does that make sense? I think I might be missing something, sorry if I am.

--
Edit:

Yeah I think this is where x64 and RISC-V are different. Reading the spec for the blend instruction it looks like it makes a decision based on the MSB on each lane, which allows it to directly use the input for bitselect, if we know its all the same for the entire lane.

On RISC-V our masks are completely different we only use the lowest n bits of the entire register (where n is the lane count). So they are sort of "compressed" masks. But that makes them incompatible with WASM and Cranelift's icmp/fcmp, so we need to pretend that they are the same by expanding them, and that prevents just generating vmerge as part of bitselect even if know that the input is a icmp/fcmp.

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha I see, and makes sense, thanks for explaining! In that case I don't think a pattern similar to x64 fits well here, so I think what you've written is the way to go 👍

@alexcrichton alexcrichton added this pull request to the merge queue Aug 22, 2023
Merged via the queue into bytecodealliance:main with commit 5b8bb97 Aug 22, 2023
@jameysharp
Copy link
Contributor

A different way to deduplicate these rules would be along the lines of maybe_uextend; riscv could use a maybe_bitcast extractor to have just one bitselect rule for each of icmp and fcmp, instead of two for each.

But I'm not sure that would be more broadly useful so I'm not sure it's worth doing. Just a thought in case it helps with something later.

eduardomourar pushed a commit to eduardomourar/wasmtime that referenced this pull request Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cranelift:area:riscv64 Issues related to the RISC-V 64 backend. cranelift Issues related to the Cranelift code generator

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants