@abadams (Member) commented Dec 18, 2025

Narrowing from a double to a (b)float16 was done via float, which is incorrect because the value is rounded twice (double rounding). The new test case captures an input where this gives the wrong result.
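
To illustrate the double-rounding problem, here is a minimal sketch using only Python's standard library. It is not the test case from this PR: the helper names `to_f32` and `to_f16` and the input `x` are hypothetical, chosen for illustration. The input sits just above the midpoint between two adjacent float16 values, by an amount too small for float32 to keep.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_f16(x: float) -> float:
    """Round a Python float (a double) directly to the nearest float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Adjacent float16 values near 1.0 are 1.0 and 1.0 + 2**-10.
# Their midpoint is 1.0 + 2**-11. Nudge just above it by 2**-40,
# which a double can represent but a float32 cannot.
x = 1.0 + 2.0**-11 + 2.0**-40

# One rounding step: x is above the midpoint, so it rounds up.
direct = to_f16(x)

# Two rounding steps: float32 drops the 2**-40 term, leaving an exact
# tie, which round-to-nearest-even then resolves downward to 1.0.
via_float = to_f16(to_f32(x))

print(direct)     # 1.0009765625
print(via_float)  # 1.0
```

The two results differ by one unit in the last place of the float16 significand, which is the kind of error a direct double-to-float16 routine avoids.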

This PR fixes it by adding new routines that go directly from double to (b)float16. This required adding a strict_cast strict-float intrinsic to prevent bit-inexact fusion of narrowing casts.

This bug was discovered while working on emulating float16 FMAs. You can do them as a double FMA, but only if the narrowing cast is correct!

@mcourteaux (Contributor) commented

Interesting find! Funny to see how many code changes were required to fix one bit.

@abadams (Member, Author) commented Dec 18, 2025

I refactored a bit to share more code between the 32-bit and 64-bit source paths, and added some more explanatory comments (the constants were a bit mysterious to me).

@abadams mentioned this pull request Dec 18, 2025

@abadams (Member, Author) commented Dec 19, 2025

The failure is unrelated. PTAL.

@alexreinking self-requested a review December 19, 2025 21:30