HOL-Light: add rej_uniform_eta proofs for AArch64#1040
Conversation
a3673e9 to
0213d92
Compare
CBMC Results (ML-DSA-44)Full Results (201 proofs)
|
0213d92 to
4558665
Compare
CBMC Results (ML-DSA-65)Full Results (201 proofs)
|
4558665 to
73d8d61
Compare
CBMC Results (ML-DSA-87)Full Results (201 proofs)
|
73d8d61 to
a7cc582
Compare
77cb935 to
f986df8
Compare
37dfafd to
ddd97de
Compare
CBMC Results (ML-DSA-44, REDUCE-RAM)Full Results (201 proofs)
|
7429b71 to
0c8ed0a
Compare
0c8ed0a to
63f752b
Compare
|
@mkannwischer @hanno-becker Ready for review |
mkannwischer
left a comment
There was a problem hiding this comment.
Thanks @jakemas for finishing the last AArch64 proof - exciting!!
I found a couple of mistakes.
0320606 to
0541f44
Compare
mkannwischer
left a comment
There was a problem hiding this comment.
Thanks @jakemas. Looks good to me now.
| (stackpointer,576)] | ||
| ==> ensures arm | ||
| (\s. aligned_bytes_loaded s (word pc) mldsa_rej_uniform_eta2_mc /\ | ||
| read PC s = word(pc + 4) /\ |
There was a problem hiding this comment.
For new proofs, let's avoid those hardcoded offsets and instead use the established pattern of introducing named offsets for preamble/postamble. I know not all proofs do this yet, but let's not make it worse.
There was a problem hiding this comment.
Done — added MLDSA_REJ_UNIFORM_ETA{2,4}_{PREAMBLE,POSTAMBLE}_LENGTH, _CORE_START, _CORE_END and a LENGTH_SIMPLIFY_CONV to unfold them. The hardcoded pc + 4 / pc + 364 (eta2) / pc + 336 (eta4) are replaced with named offsets in _CORRECT / _SUBROUTINE_CORRECT / _MEMSAFE / _SUBROUTINE_MEMSAFE. Left the internal loop-body offsets (pc + 256, pc + 108, etc.) as numerals since they're block-internal rather than preamble/postamble boundaries.
| requires(table == mld_rej_uniform_eta_table) | ||
| assigns(memory_slice(r, sizeof(int32_t) * MLDSA_N)) | ||
| ensures(return_value <= MLDSA_N) | ||
| ensures(array_abs_bound(r, 0, return_value, MLDSA_ETA + 1)) |
There was a problem hiding this comment.
This should be a hardcoded 3 as in the HOL-Light spec. The ASM is level-specific, so the specs should not contain level-dependent macros.
There was a problem hiding this comment.
Hardcoded 3 for eta2 with a check-magic comment.
| requires(memory_no_alias(buf, buflen)) | ||
| requires(table == mld_rej_uniform_eta_table) | ||
| assigns(memory_slice(r, sizeof(int32_t) * MLDSA_N)) | ||
| ensures(return_value <= MLDSA_N) |
There was a problem hiding this comment.
This clause, while a trivial consequence, is not part of the HOL-Light spec
There was a problem hiding this comment.
Added outlen <= 256 to the HOL-Light _SUBROUTINE_CORRECT postcondition so it now establishes the same bound (return_value <= MLDSA_N) that CBMC requires. Also hardcoded the literal bound (5 for eta4, 3 for eta2) since the asm is level-specific.
| * in mldsa/src/native/aarch64/src/arith_native_aarch64.h *) | ||
|
|
||
| let MLDSA_REJ_UNIFORM_ETA4_SUBROUTINE_CORRECT = prove | ||
| (`!res buf buflen table (inlist:byte list) pc stackpointer returnaddress. |
There was a problem hiding this comment.
Blocker: This is not the right spec. Compare
. The inlist here is naturally a4 word list, and the output is a filter on exactly that list. Here, instead, REJ_SAMPLE_ETA4 does nibble unpacking logic, which it shouldn't.
There was a problem hiding this comment.
Same for eta2 of course.
There was a problem hiding this comment.
REJ_SAMPLE_ETA{2,4} now takes a nibble list (typed int16 list, since that's how nibbles already live in SIMD lanes) and is just MAP/FILTER. The byte→nibble unpacking is applied at the subroutine spec via REJ_SAMPLE_ETA{2,4} (NIBBLES_OF_BYTES inlist). Internally the proof body still works against a private REJ_SAMPLE_ETA{2,4}_BYTES alias bridged by a one-line lemma — kept it that way to avoid rewriting the loop machinery, which peels off 8 bytes / 16 nibbles per iteration. Is that what you had in mind?
There was a problem hiding this comment.
Thank you @jakemas!
It's not yet what I had in mind: The most natural formulations of rejection sampling -- and the one used so far -- expresses the specification at the natural bitwidth. Here, the input nibbles are 4 bit wide, so the inlist should be 4 word list. It is beautiful property of John's setup that the machinery is completely agnostic to the bitlength here, and can express the splitting of the input range into a list of 4 words without problem.
That the algorithm internally needs to load larger chunks at a time, and split them, is not visible at the spec level, and should not be.
So I am looking for inlist : 4 word list and REJ_SAMPLE_ETA[2|4] (l:4 word list) -> 32 word list.
Does that make sense?
It should be of little significance to the proof, I hope -- you mostly need to move the logic of NIBBLE_PAIR into the proof somehow.
There was a problem hiding this comment.
The spec is not yet in the right shape, see comment and precedent of rejection sampling. The CBMC specs should not use level-dependent macros.
Finally: I realize we're living in a world where agents write proofs and their readability is becoming less important, but the proofs strike me as extremely long and lacking any commentary helping a human understand what's happening. As the agent reworks the proofs, please nudge it to a) write as compact as possible, b) add some high-level comments on the high-level steps. In particular, the proof contains multiple steps which strike me as inlining lemmas that should either be hoisted out or already be present in general form, such as
SUBGOAL_THEN
`SUB_LIST(8 * i, 8) (inlist:byte list) =
[word_subword (loaded_d:int64) (0,8):byte;
word_subword loaded_d (8,8);
word_subword loaded_d (16,8);
word_subword loaded_d (24,8);
word_subword loaded_d (32,8);
word_subword loaded_d (40,8);
word_subword loaded_d (48,8);
word_subword loaded_d (56,8)]`
I'd also be surprised if something like
let WORD_SUBWORD_OF_JOIN_8X16 = BITBLAST_RULE
`word_subword
(word_join
(word_join (word_join (h7:int16) (h6:int16):int32) (word_join (h5:int16) (h4:int16):int32):int64)
(word_join (word_join (h3:int16) (h2:int16):int32) (word_join (h1:int16) (h0:int16):int32):int64):int128)
(0,16):int16 = h0 /\
word_subword
(word_join
(word_join (word_join (h7:int16) (h6:int16):int32) (word_join (h5:int16) (h4:int16):int32):int64)
(word_join (word_join (h3:int16) (h2:int16):int32) (word_join (h1:int16) (h0:int16):int32):int64):int128)
(16,16):int16 = h1 /\
word_subword
(word_join
(word_join (word_join (h7:int16) (h6:int16):int32) (word_join (h5:int16) (h4:int16):int32):int64)
(word_join (word_join (h3:int16) (h2:int16):int32) (word_join (h1:int16) (h0:int16):int32):int64):int128)
(32,16):int16 = h2 /\
word_subword
(word_join
(word_join (word_join (h7:int16) (h6:int16):int32) (word_join (h5:int16) (h4:int16):int32):int64)
(word_join (word_join (h3:int16) (h2:int16):int32) (word_join (h1:int16) (h0:int16):int32):int64):int128)
(48,16):int16 = h3 /\
word_subword
(word_join
(word_join (word_join (h7:int16) (h6:int16):int32) (word_join (h5:int16) (h4:int16):int32):int64)
(word_join (word_join (h3:int16) (h2:int16):int32) (word_join (h1:int16) (h0:int16):int32):int64):int128)
(64,16):int16 = h4 /\
word_subword
(word_join
(word_join (word_join (h7:int16) (h6:int16):int32) (word_join (h5:int16) (h4:int16):int32):int64)
(word_join (word_join (h3:int16) (h2:int16):int32) (word_join (h1:int16) (h0:int16):int32):int64):int128)
(80,16):int16 = h5 /\
word_subword
(word_join
(word_join (word_join (h7:int16) (h6:int16):int32) (word_join (h5:int16) (h4:int16):int32):int64)
(word_join (word_join (h3:int16) (h2:int16):int32) (word_join (h1:int16) (h0:int16):int32):int64):int128)
(96,16):int16 = h6 /\
word_subword
(word_join
(word_join (word_join (h7:int16) (h6:int16):int32) (word_join (h5:int16) (h4:int16):int32):int64)
(word_join (word_join (h3:int16) (h2:int16):int32) (word_join (h1:int16) (h0:int16):int32):int64):int128)
(112,16):int16 = h7`;;
was really necessary -- please look into existing split/merge tactics.
This is not only useful for humans, but also for agents: They will have an easier time learning from and adjusting the proof if it is well-structured, and common material is hoisted out. And, like humans, they typically need prodding to turn some proof into a good proof. So let's not stop at the first proof we find but use the power of the agent to have it write a good proof before committing. Nudge the agent, but also use your own judgement and go over the proof with human eyes and see if you can follow the structure and whether there are dubious steps or steps that ought to be hoisted out into / reused from general infrastructure.
hanno-becker
left a comment
There was a problem hiding this comment.
Some clarification on my previous review.
First, I should have said: Thank you @jakemas for getting this to work in the first place. Experience tells that rejection sampling is notoriously difficult, so having a working proof -- regardless of any opportunity for further improvement -- is fantastic. I should have paused before putting my review-cap on and acknowledged this -- apologies I didn't!
Now to the specific feedback. First, the following I think would be good to address before merge:
- Removing level-dependent constants from the CBMC specs for ASM kernels.
- Adding an explicit length clause to the HOL-Light subroutine spec, to match the respective clause in the CBMC spec.
- Adjusting the functional spec we're proving against to work with nibble lists rather than byte lists, so that the top-level spec of rejection sampling is just a plain filter. This makes for the simplest and most intuitive spec.
Things I'd like us to consider, but not necessarily now:
- Avoiding using hardcoded offsets for preamble and postamble. I believe we do this consistently in mlkem-native, but inconsistently in mldsa-native. If it's easy to do, let's do it -- but if not, we can wait for when we address this uniformly.
- Proof cleanup. As done e.g. in #113, one can defer this. What's most important is that the specs are correct and aligned and that the proof runs in acceptable time -- once that's met, we can merge and iterate. The concern I voiced is also not specific to this PR, but general: In a world where more and more proofs are AI-generated, we should pause and reflect on what the value of the individual proof still is, in that case, and in particular, to what extent we should care about the specific shape of a proof. My feeling is that the same engineering principles that we tend to apply as humans will still prove useful in an AI-world: Clean structure, separation specifics from generalities, hoisting out the latter, etc. -- will all make subsequent work with the code-base easier, whether done by a human or by an AI. As indicated in the review, I think there is potential for improvement in the current proof.
Finally: @jakemas If I ask something in a PR of your's that's not done elsewhere either, please flag it. I want us to have a consistent discipline for all proofs, so if there's divergence, please flag it so we can decide whether to go with precedent, or whether to expand the scope of the change to other proofs (potentially deferred and tracked in an issue).
…ty proofs Add functional correctness, subroutine correctness, memory-safety and subroutine-safety proofs for the AArch64 assembly implementations of rej_uniform_eta for both eta variants: - rej_uniform_eta2 (eta=2, used in ML-DSA-44) - rej_uniform_eta4 (eta=4, used in ML-DSA-65/87) Memory safety follows the mlkem_rej_uniform_VARIABLE_TIME pattern in s2n-bignum because the loop count is data-dependent (which nibbles pass the < 9 / < 15 filter). Written with the assistance of Claude Opus 4.7. Signed-off-by: Jake Massimo <jakemas@amazon.com>
- REJ_SAMPLE_ETA{2,4} take a nibble list (plain filter+map); the proofs
bridge to the byte-shape interior at the subroutine spec only.
- Add `outlen <= 256` clause to the subroutine post, matching CBMC.
- Hardcode the per-coefficient bound (3 for eta2, 5 for eta4) in the
CBMC contracts; the asm is level-specific so the spec mustn't depend
on MLDSA_ETA.
- Drop unused REJ_SAMPLE_ETA{2,4}_{EMPTY,APPEND}, REJ_NIBBLES_COUNT_8,
and a few duplicate helpers from the eta2 file.
Written with the assistance of Claude Opus 4.7.
Signed-off-by: Jake Massimo <jakemas@amazon.com>
Signed-off-by: Jake Massimo <jakemas@amazon.com>
53458cd to
a9f533a
Compare
Hoist SUB_LIST_8_BYTES_FROM_INT64 (the 8-byte chunk -> int64 split, used 4 times) into aarch64_utils.ml. Compact WORD_SUBWORD_OF_JOIN_8X16 from 42 to 16 lines using a programmatic generator. Signed-off-by: Jake Massimo <jakemas@amazon.com>
Ok addressing this with 2e48fa7 |
…twidth The public spec now operates on a (4 word) list -- the natural bitwidth of a nibble -- rather than int16 list. Callers lift their byte buffer into a nibble list via BYTES_TO_NIBBLES, which lives in common/mldsa_specs.ml. The byte->int16 nibble form (NIBBLES_OF_BYTES) is kept inside the proof file as a private helper to bridge to the byte-shape interior; the bridge goes through a one-line lemma at the subroutine spec. Signed-off-by: Jake Massimo <jakemas@amazon.com>
| (* Splits each input byte into its low and high 4-bit nibbles, expressed at *) | ||
| (* the natural bitwidth (4 word). The output is twice the length of the *) | ||
| (* input. Used by callers of REJ_SAMPLE_ETA{2,4} to lift a byte buffer into *) | ||
| (* a nibble list. *) | ||
| let BYTES_TO_NIBBLES = define | ||
| `BYTES_TO_NIBBLES [] = ([]:(4 word) list) /\ | ||
| BYTES_TO_NIBBLES (CONS (b:byte) t) = | ||
| APPEND [word(val b MOD 16):4 word; word(val b DIV 16):4 word] | ||
| (BYTES_TO_NIBBLES t)`;; | ||
|
|
||
| let BYTES_TO_NIBBLES_APPEND = prove | ||
| (`!l1 l2. BYTES_TO_NIBBLES(APPEND l1 l2) = | ||
| APPEND (BYTES_TO_NIBBLES l1) (BYTES_TO_NIBBLES l2)`, | ||
| LIST_INDUCT_TAC THEN | ||
| ASM_REWRITE_TAC[BYTES_TO_NIBBLES; APPEND; APPEND_ASSOC]);; |
There was a problem hiding this comment.
This is not needed in mldsa_specs.ml anymore I believe and should go in the proof script.
There was a problem hiding this comment.
Done — removed from mldsa_specs.ml.
| `REJ_SAMPLE_ETA2 (l:(4 word) list) = | ||
| MAP (\x:4 word. | ||
| word_sx(word_sub (word 2:int16) | ||
| (word_umod (word_zx x:int16) (word 5))):int32) | ||
| (FILTER (\x:4 word. val x < 15) l)`;; |
There was a problem hiding this comment.
This has the right shape, but can be simpler: The MAP can do the 2 - (x umod 5) computation in 4 word, and ultimately word_sx from 4 word -> 32 word.
There was a problem hiding this comment.
Actually, for a specification it is most natural to work in the integers where possible, which would mean MAPing over 4 word -> int -> int -> 32 word, where 4 word -> int is unsigned val, int -> int is x |-> 2 - (x umod 5), and int -> 32 word is signed interpretation.
Could you also check if HOL-Light provides a MAP_FILTER primitive?
There was a problem hiding this comment.
Will apply to REJ_SAMPLE_ETA2 too — coming in a follow-up commit. (Tested the per-element bridge; word_umod doesn't BITBLAST cleanly so the eta2 bridge takes a 15-case enumeration on val x, but it goes through.)
On MAP_FILTER: HOL Light has FILTER_MAP : FILTER P (MAP f l) = MAP f (FILTER (P o f) l), but no fused MAP_FILTER combinator at the term level — only mapfilter as an OCaml-level helper. I don't think there's a cleaner spec form available here.
| MAP (\x:4 word. | ||
| word_sx(word_sub (word 4:int16) (word_zx x:int16)):int32) |
There was a problem hiding this comment.
Depending on what we do in REJ_SAMPLE_ETA2, this should be similar: Either work mostly in 4 word, or go through int; and, if present, use a MAP_FILTER primitive.
There was a problem hiding this comment.
REJ_SAMPLE_ETA4 now does the (4 - x) computation in 4 word and word_sx's once to int32 at the end. REJ_SAMPLE_ETA2 will get the same treatment in a follow-up commit.
The MAP body now does the (4 - x) computation in 4 word and word_sx's once to int32 at the end, removing the int16 detour. BYTES_TO_NIBBLES_APPEND moved out of mldsa_specs.ml (was unused there). Signed-off-by: Jake Massimo <jakemas@amazon.com>
Resolves #924
Some (development) notes:
I've moved the theorems that may be shared with future x86 proof to
mldsa_specs.ml. Should it occur that more could be shared during the development of the x86 proof, then we can pull what we need at that point.The CI gives a nice break down on how long each of the CORRECT and MEMSAFE proofs take. I was able to optimize the full eta2 proof quite a bit from 1h7m to just 31m, which is great considering the CORRECT takes ~24min.
CORRECT proof:
HOL-Light / AArch64 HOL Light proof for rej_uniform_eta4_aarch64_asm.S (pull_request) Successful in 11m
HOL-Light / AArch64 HOL Light proof for rej_uniform_eta2_aarch64_asm.S (pull_request) Successful in 24m
CORRECT + MEMSAFE proof:
HOL-Light / AArch64 HOL Light proof for rej_uniform_eta4_aarch64_asm.S (pull_request) Successful in 21m
HOL-Light / AArch64 HOL Light proof for rej_uniform_eta2_aarch64_asm.S (pull_request) Successful in 1h7m
Optimizations CORRECT + MEMSAFE proof:
HOL-Light / AArch64 HOL Light proof for rej_uniform_eta4_aarch64_asm.S succeeded in 18m 40s
HOL-Light / AArch64 HOL Light proof for rej_uniform_eta2_aarch64_asm.S succeeded in 31m 29s
Built with Claude Opus 4.7 1m. Using the Hol-Light MCP. About 4 weeks of effort.