riscv64: Optimize `bitselect+{i,f}cmp` codegen #6876

afonso360 · 2023-08-21T23:14:19Z

👋 Hey,

This is a followup to #6874 where f{min,max}_pseudo was removed and replaced it with bitselect+fcmp. Here we optimize that pattern into a mask generation instruction and vmerge.vvm that merges both inputs.

This allows us to avoid the quite long sequence for bitselect (4 instructions) and also vector mask expansion (1 instruction) in these patterns.

For tests here I'm relying mostly on wasmtimes wast testsuite.

alexcrichton · 2023-08-22T14:22:15Z

This might be good for a rule like on x64 where much of the logic of bitselect can be skipped if the mask is known to have a particular pattern? (e.g. generated by icmp or fcmp)

afonso360 · 2023-08-22T14:50:21Z

Yeah, that's a neat way of de-duplicating these rules. I also think it might be worth copying that vconst_all_ones_or_zeros pattern into an egraphs rule, since it looks like we should be able to const propagate that.

Edit: Hmm actually I don't think we can quite deduplicate this as neatly since we have different ways of generating a icmp mask and fcmp mask.

Edit2: Wait, now I'm confused. We're already sort of skipping the main bitselect lowering by using vmerge instead, that merges whole lanes instead of bits. We can't skip generating the icmp/fcmp mask since in RISC-V the masks are 1 bit per lane, so we always need to make the conversion from 1 bit per bit, into 1 bit per lane. Is that what you meant?

alexcrichton · 2023-08-22T15:16:47Z

Oh this may not be applicable for riscv64, but on x64 at least there's no bitselect instruction so it uses the and/or/not/etc combo by default, but x64 does have lane-based select so if a bitselect's input mask is a comparison then the and/or/not/etc combo can be skipped in favor of a single lane-based instruction. I figured the same might be possible for riscv64 where bitselect could skip the combo of instructions and generate a single vmerge* (which I interpreted, perhaps mistakenly, as a lane-based instruction) after the normal instruction generated for the comparison. That way the rules for bitselect wouldn't generate icmp/fcmp instructions at all, instead the output of those instructions will be fed directly into vmerge.

since we have different ways of generating a icmp mask and fcmp mask.

On x64 I believe this is handled by bitselect not actually generating the icmp/fcmp, instead basically calling put_in_reg for the output of the icmp/fcmp which then will later use other lowering rules to generate the icmp/fcmp.

We can't skip generating the icmp/fcmp mask since in RISC-V the masks are 1 bit per lane, so we always need to make the conversion from 1 bit per bit, into 1 bit per lane. Is that what you meant?

Right yeah, my thinking was that like on x64 the lowering for bitselect wouldn't generate the cmp, only the vmerge, but because the mask of the bitselect is only pattern-matched on but we don't use the sub-operands that it would still generate the cmp naturally via other lowering rules.

afonso360 · 2023-08-22T16:19:50Z

Oh, okay so it's pretty much the same on RISC-V. We don't have a direct bitselect equivalent, we lower to and/or/not combo as well by default. vmerge is a lanewise select as you mentioned.

That way the rules for bitselect wouldn't generate icmp/fcmp instructions at all, instead the output of those instructions will be fed directly into vmerge.

Here's where this diverges. vmerge does not support the regular masks that are fed into bitselect. For a i8x16 we need a 16bit mask instead of a 128bit mask. (Or a 8 bit mask for i16x8, etc...)

In the regular icmp and fcmp instruction lowerings we do two things, first we generate the small mask (1 bit per lane), and then we do a second step of expanding that into a big mask.

In this PR gen_icmp_mask and gen_fcmp_mask only generate the small mask, that is compatible with vmerge. But we never expand it, which is what would happen if we let the regular icmp and fcmp instructions get lowered.

Does that make sense? I think I might be missing something, sorry if I am.

--
Edit:

Yeah I think this is where x64 and RISC-V are different. Reading the spec for the blend instruction it looks like it makes a decision based on the MSB on each lane, which allows it to directly use the input for bitselect, if we know its all the same for the entire lane.

On RISC-V our masks are completely different we only use the lowest n bits of the entire register (where n is the lane count). So they are sort of "compressed" masks. But that makes them incompatible with WASM and Cranelift's icmp/fcmp, so we need to pretend that they are the same by expanding them, and that prevents just generating vmerge as part of bitselect even if know that the input is a icmp/fcmp.

alexcrichton

Aha I see, and makes sense, thanks for explaining! In that case I don't think a pattern similar to x64 fits well here, so I think what you've written is the way to go 👍

jameysharp · 2023-08-22T17:21:36Z

A different way to deduplicate these rules would be along the lines of maybe_uextend; riscv could use a maybe_bitcast extractor to have just one bitselect rule for each of icmp and fcmp, instead of two for each.

But I'm not sure that would be more broadly useful so I'm not sure it's worth doing. Just a thought in case it helps with something later.

riscv64: Optimize bitselect+cmp codegen

c5aeeb2

afonso360 added the cranelift:area:riscv64 Issues related to the RISC-V 64 backend. label Aug 21, 2023

afonso360 requested a review from a team as a code owner August 21, 2023 23:14

afonso360 requested review from jameysharp and removed request for a team August 21, 2023 23:14

afonso360 changed the title ~~riscv64: Optimize bitselect+cmp codegen~~ riscv64: Optimize bitselect+{i,f}cmp codegen Aug 21, 2023

github-actions bot added the cranelift Issues related to the Cranelift code generator label Aug 22, 2023

afonso360 mentioned this pull request Aug 22, 2023

egraphs: Recognize bitselect from bitwise operations #6879

Merged

alexcrichton approved these changes Aug 22, 2023

View reviewed changes

alexcrichton added this pull request to the merge queue Aug 22, 2023

Merged via the queue into bytecodealliance:main with commit 5b8bb97 Aug 22, 2023

eduardomourar pushed a commit to eduardomourar/wasmtime that referenced this pull request Sep 6, 2023

riscv64: Optimize bitselect+cmp codegen (bytecodealliance#6876)

7c3c128

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

riscv64: Optimize `bitselect+{i,f}cmp` codegen #6876

riscv64: Optimize `bitselect+{i,f}cmp` codegen #6876

afonso360 commented Aug 21, 2023 •

edited

Loading

Uh oh!

alexcrichton commented Aug 22, 2023

Uh oh!

afonso360 commented Aug 22, 2023 •

edited

Loading

Uh oh!

alexcrichton commented Aug 22, 2023

Uh oh!

afonso360 commented Aug 22, 2023 •

edited

Loading

Uh oh!

alexcrichton left a comment

Uh oh!

jameysharp commented Aug 22, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

riscv64: Optimize bitselect+{i,f}cmp codegen #6876

riscv64: Optimize bitselect+{i,f}cmp codegen #6876

Conversation

afonso360 commented Aug 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcrichton commented Aug 22, 2023

Uh oh!

afonso360 commented Aug 22, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcrichton commented Aug 22, 2023

Uh oh!

afonso360 commented Aug 22, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcrichton left a comment

Choose a reason for hiding this comment

Uh oh!

jameysharp commented Aug 22, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

riscv64: Optimize `bitselect+{i,f}cmp` codegen #6876

riscv64: Optimize `bitselect+{i,f}cmp` codegen #6876

afonso360 commented Aug 21, 2023 •

edited

Loading

afonso360 commented Aug 22, 2023 •

edited

Loading

afonso360 commented Aug 22, 2023 •

edited

Loading