After #115 lands, here's what needs to be done to make x86 SIMD (SSE4.2 and AVX2) feature-complete:
- Load/store interleaved. There's a specialized implementation for `load_interleaved_128_u32x16` on SSE4.2; all other operations (stores, other types, AVX2) use the fallback implementation. I think it's implementable for all types, and may be faster if specialized for AVX2, but I haven't tried. (Implement all interleaved load/store ops on x86 #140) We currently import `Fallback` to implement these operations, which could increase code size. If implementing this using native intrinsics isn't possible, we could just regenerate the fallback operations in the x86 code to avoid pulling in all of `Fallback`. (A rough sketch of the de-interleaving shuffle pattern is in the first example after this list.)
- f32 to u32 conversions (maybe the other way around too?). As a TODO comment in the codegen states, we currently just do an f32 to i32 conversion and pretend it's an f32 to u32 conversion. (Implement all float<->int conversions on x86 #134) This is just broken for numbers above `i32::MAX`. This StackOverflow post goes into detail on how to polyfill u32 -> f32 (the second example after this list sketches that approach); not sure about the other way. As part of this, we should add documentation to the f32 -> u32 conversions noting that they are polyfilled on x86, and that f32 -> i32 -> u32 will be faster if the full range isn't needed.
- (Maybe?) AVX2-specialized multiply and shift operations on `i8x16` and `u8x16`. These must be polyfilled by widening to 16-bit, performing the operation, and then narrowing back to 8-bit. Currently, the AVX2 backend widens into two separate `__m128` vectors just like the SSE4.2 backend does, but it could use a single `__m256` instead (the third example after this list sketches this). I'm not sure if this is actually faster.
- Everything with "precise" in the name, which currently seems to be ignored in the tests. (Nail down the semantics for `min/max` and `min_precise/max_precise` #133, Implement new min/max/min_precise/max_precise semantics #136)
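
For the interleaved load/store item, here's a minimal sketch of the kind of shuffle a native SSE path could use, assuming a 4-channel u32 de-interleave (which is just a 4x4 32-bit transpose). The function name and the channel grouping are made up for illustration; the real ops in the crate may interleave with a different stride.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Hypothetical sketch: de-interleave four 4-channel u32 vectors
/// (r0 = [a0 b0 c0 d0], r1 = [a1 b1 c1 d1], ...) into per-channel
/// vectors ([a0 a1 a2 a3], [b0 b1 b2 b3], ...) using unpack intrinsics.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn deinterleave_4ch_u32(
    r0: __m128i,
    r1: __m128i,
    r2: __m128i,
    r3: __m128i,
) -> [__m128i; 4] {
    // Interleave 32-bit lanes pairwise.
    let t0 = _mm_unpacklo_epi32(r0, r1); // a0 a1 b0 b1
    let t1 = _mm_unpackhi_epi32(r0, r1); // c0 c1 d0 d1
    let t2 = _mm_unpacklo_epi32(r2, r3); // a2 a3 b2 b3
    let t3 = _mm_unpackhi_epi32(r2, r3); // c2 c3 d2 d3
    // Then combine 64-bit halves to finish the transpose.
    [
        _mm_unpacklo_epi64(t0, t2), // a0 a1 a2 a3
        _mm_unpackhi_epi64(t0, t2), // b0 b1 b2 b3
        _mm_unpacklo_epi64(t1, t3), // c0 c1 c2 c3
        _mm_unpackhi_epi64(t1, t3), // d0 d1 d2 d3
    ]
}
```

Interleaved stores would use the same pattern in reverse.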
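
For the conversion item, a minimal sketch of the u32 -> f32 polyfill along the lines of the StackOverflow approach: split each lane into 16-bit halves, convert each half exactly with the signed conversion, and recombine. The function name is hypothetical; this isn't necessarily how the codegen would express it.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Hypothetical sketch of a u32 -> f32 polyfill: both 16-bit halves convert
/// exactly via the signed i32 -> f32 conversion, and the final add is the
/// only rounding step, so the result is correctly rounded.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn cvt_u32_to_f32(v: __m128i) -> __m128 {
    let lo = _mm_and_si128(v, _mm_set1_epi32(0xFFFF)); // low 16 bits of each lane
    let hi = _mm_srli_epi32::<16>(v);                  // high 16 bits of each lane
    let lo_f = _mm_cvtepi32_ps(lo);
    let hi_f = _mm_cvtepi32_ps(hi);
    // hi_f * 2^16 is exact (it only changes the exponent), so there's no double rounding.
    _mm_add_ps(_mm_mul_ps(hi_f, _mm_set1_ps(65536.0)), lo_f)
}
```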
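
For the 8-bit multiply/shift item, a sketch of what the single-`__m256i` path could look like for a wrapping `u8x16` multiply (the function name is made up; the shifts would follow the same widen/operate/narrow shape):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Hypothetical sketch: wrapping u8x16 multiply via one __m256i instead of
/// two __m128i vectors. Widen to 16 bits, multiply, then narrow back.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn mul_u8x16_avx2(a: __m128i, b: __m128i) -> __m128i {
    // Zero-extend each 8-bit lane to 16 bits; the whole vector fits in one __m256i.
    let a16 = _mm256_cvtepu8_epi16(a);
    let b16 = _mm256_cvtepu8_epi16(b);
    let prod = _mm256_mullo_epi16(a16, b16);
    // Keep only the low byte of each product so the unsigned-saturating pack
    // below can't clamp anything.
    let low = _mm256_and_si256(prod, _mm256_set1_epi16(0x00FF));
    // packus works per 128-bit lane, so a cross-lane permute restores lane order.
    let packed = _mm256_packus_epi16(low, low);
    _mm256_castsi256_si128(_mm256_permute4x64_epi64::<0b1000>(packed))
}
```

Whether this beats the two-`__m128i` version probably comes down to the cost of the cross-lane permute at the end, so it would need benchmarking either way.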