Add ppc64le backend (supports p8 and above architectures)#1648
Add ppc64le backend (supports p8 and above architectures)#1648dannytsen wants to merge 32 commits into
Conversation
|
@bhess @mkannwischer Please review. Thanks. |
Thanks @dannytsen for rebasing this PR. Could you get it through CI please? |
bhess
left a comment
There was a problem hiding this comment.
Thanks for the updates, @dannytsen.
Minor comments are inline.
supports p8 and above architectures.
I do not think this is correct as written. I see illegal instructions with qemu-ppc64le -cpu power8, while the same code runs with -cpu power9. In particular, lxvx and stxvx do not appear to be supported on p8.
Can you please double-check this and update the requirement to POWER9 if that is indeed the minimum supported target?
@mkannwischer I tested on p8 system. I have to double check that. Ah, I found the problem, my p8 supports ISA 2.07 which means the qemu supports older one. I'll fix that along with @bhess comments. Thanks. |
|
@mkannwischer @bhess Just submitted my changes and stated that the supporting architecture to be p9 and above or Power systems support ISA 2.07 and other fixes. Hope this solve all the issues. Thanks. |
|
@mkannwischer @bhess Submitted an updated code after re-run autogen. Please review. |
|
@mkannwischer @bhess Not sure what's the errors for 4 CI errors. This was happened after I ran the autogen. |
Thanks for the progress. You can ignore the two OpenTitan tests (those should not be run on a fork because the AWS auth fails) and the markdown lint (that is a spurious failure). |
OK. I'll check into the ppc one. Thanks. |
Thanks for the updates. I have a few more minor follow-ups inline. |
@bhess I did not see your follow-ups or I missed it. Can you post it again? Thanks. |
| #define V_26 2 | ||
| #define V_MKQ 3 | ||
|
|
||
| .machine "any" |
There was a problem hiding this comment.
I believe this directive will get removed after simpasm.
There was a problem hiding this comment.
does it still make sense to include it in the dev-folder then, e.g., for documentation purposes?
There was a problem hiding this comment.
@bhess Well, this code still supports p8 but with ISA 2.07. Some older p8s don't support ISA 2.07. Looks like qemu supports older p8 not with ISA 2.07. In general, we support p9 but also p8 with ISA 2.07.
There was a problem hiding this comment.
hm, is there any way to make this distinction with the .machine pseudo-op?
There was a problem hiding this comment.
That's possible. I'll think about it.
Should be visible now. |
If you rebase on top of main, the OpenTitan tests should be properly skipped (fixed in #1649) |
|
@mkannwischer @bhess I manually modified mlkem/mlkem_native_asm.S and mlkem/mlkem_native.c to add ppc64le include so to pass the monolithic_build_multilevel_native CI tests because autogen can't handle the ppc64le platform. So I am not sure how to fix the test unless autogen get fixed. |
I’ve opened a PR against your branch that updates autogen. |
@bhess Thanks. |
@bhess I merged the autogen and rerun the autogen. Hope this commit fix the problem. Thanks. |
|
@bhess I just added a capability check of PPC_FEATURE2_ARCH_2_07. It will fall back to the original implementation if the architecture doesn't support ISA 2.07 and above which include some p8 and p9 above. Please review. Thanks. |
|
@mkannwischer @bhess Any feedback on the recent commit? Thanks. |
|
@mkannwischer In qemu, power8e supports ISA 2.07 which is a variant of p8 with DD2. |
There was a problem hiding this comment.
Thanks for the update, @dannytsen.
The runtime capability check goes in a slightly different direction than what I had in mind. My suggestion was about whether the .machine pseudo-op can express something like "POWER8 with ISA 2.07 support". If there's no way to express this with .machine, I'd suggest just removing it from the assembly and documenting the ISA 2.07 requirement in the project docs / README instead (instead of machine "any").
For the dynamic getauxval approach: I wonder if runtime feature detection might be better left to downstream consumers of mlkem-native (e.g., liboqs) rather than adding it here. There's also the consideration that getauxval/<sys/auxv.h> is not portable AFAIK. Any thoughts @mkannwischer ?
@bhess ".machine any" is already a psudo-op. I can remove that in dev/ppc64le/src. The README is already had the update. For getauxval approach, it's to prevent the code run on a older Power systems that did not support proper instaructions. |
There was a problem hiding this comment.
Coming back to this after a closer look at runtime CPU checks in mlkem-native:
my remaining concern is the runtime capability check in the PR. mlkem-native already has a designed-in mechanism for exactly this case, and I think the PR should use it rather than doing its own in the native backend.
- Hook:
MLK_CONFIG_CUSTOM_CAPABILITY_FUNC+ the
mlk_sys_check_capability(mlk_sys_cap)callback declared in
mlkem/src/sys.h.
Backends call it; consumers override it for runtime dispatch. - Examples in other backends:
Concretely, this would mean adding MLK_SYS_CAP_PPC_ISA_2_07 to the enum,
replacing the four mlkem_ppc_check_cap() calls with
mlk_sys_check_capability(...), and shipping the auxv code as a new
test/configs/custom_native_capability_config_PPC_AUXV.h. Specifics are inline.
@mkannwischer: please correct me if that's not the intended pattern, and feel free to share any additional review.
| #include "../api.h" | ||
| #include "src/arith_native_ppc64le.h" | ||
|
|
||
| #include <sys/auxv.h> |
There was a problem hiding this comment.
Because this header is pulled in by everyone that uses the native backend (via src/native/meta.h), this include makes the whole native build glibc-dependent.
The existing AVX2 and NEON backends don't query the OS, they assume the ISA the build was configured for, and runtime dispatch is delegated to the consumer through MLK_CONFIG_CUSTOM_CAPABILITY_FUNC (see test/configs/custom_native_capability_config_CPUID_AVX2.h and ..._ID_AA64PFR1_EL1.h).
I'd advise to use the same pattern here: drop this include, and move the auxv-based detection into a new
test/configs/custom_native_capability_config_PPC_AUXV.h example.
There was a problem hiding this comment.
I believe this is auto generated.
There was a problem hiding this comment.
Ah yes, anyway I think this is moot and not a blocker if the checks are removed for now (see the other thread). I can handle the autogen in a followup.
bhess
left a comment
There was a problem hiding this comment.
Thanks for the updates @dannytsen. I re-reviewed this together with the earlier discussion in #1184. The removal of the runtime capability check resolves my previous comment, and the main readability points raised on the earlier PR also seem addressed for this iteration. I’m not the maintainer, but from my side this iteration now looks good and I’d support merging it.
- ruff format scripts/autogen (formatting fix) - Add check-magic annotation for array size 2072 in consts.c and consts.h (7 groups of 8 base constants + 4 twiddle tables * 63 rows * 8 values) Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
|
Thanks @mkannwischer. I pushed an update that fixes the linting (linting now passes locally). |
Thanks! There are some issues in the POWER-7 tests: https://github.com/pq-code-package/mlkem-native/actions/runs/26212206823/job/77130946223?pr=1648. Could you take a look please? |
Yes just noticed it, will take a look. |
|
For the HOL-Light failures, a workaround would be to add |
…power8 (nix libc is compiled for power8 and otherwise causes illegal instructions. Avoids unused data parameter errors in the fallback code path. Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
The Illegal instruction issue doen't appear to come from the mlkem code itself, but from the nix toolchain that uses a glibc compiled for power8, so binaries fault when run under qemu-ppc64le -cpu power7. I also reproduced the fallback path locally. While doing that I hit an issue in the fallback-path (unused variable) fixed in commit 3442786.
Done in the latest commit. |
Thanks for investigating. That's annoying. I think this solution is acceptable for now and I'm glad that it at least revealed some problems we would have otherwise missed. I have updated #1677 so we can run the full CI and benchmarks there. |
| #if defined(MLK_ARITH_BACKEND_PPC64LE_DEFAULT) && \ | ||
| !defined(MLK_CONFIG_MULTILEVEL_NO_SHARED) && defined(__POWER8_VECTOR__) | ||
|
|
||
| #include "consts.h" |
There was a problem hiding this comment.
Why is the entire file not autogenerated?
There was a problem hiding this comment.
Updated in cc9087d which autogenerates the full consts.c file (without needing .inc files)
|
|
||
| /* Offsets into the constant table */ | ||
| /* check-magic: off */ | ||
| #define MLK_PPC_NQ_OFFSET 0 |
There was a problem hiding this comment.
If we auto-generate consts.c, we can also auto-generate those offsets from the same code.
There was a problem hiding this comment.
Updated in cc9087d, consts.h is now auto-generated.
|
|
||
| /* | ||
| * Load coefficient in r[j+len] (r') vectors from offset, R10, R17, R19 and R21 | ||
| * r[j+len]: V13, V18, V23, V28 |
There was a problem hiding this comment.
It is confusing to talk about absolute register values here if the code itself uses symbolic ones.
There was a problem hiding this comment.
Fixed in cc9087d.Load_4Rjp is now load_high; the comment uses the symbolic names.
| /* Since the result of the Barrett multiplication is bounded | ||
| by q/2 in absolute value. |
There was a problem hiding this comment.
That's not true, and the sentence just seems to stop.
There was a problem hiding this comment.
Removed the comment in cc9087d as it was also out of context. I think the same bound applies to Barrett multiplication as in eprint 2021/986 (Algorithm 6 / Corollary 2: |bar_N(a,b)| < N),
| * r[j]: V12, V17, V22, V27 | ||
| */ | ||
| .macro Load_4Rj | ||
| lxvd2x 32+vdata_a1, rinp, a1_offset /* V12: vector r0 */ | ||
| lxvd2x 32+vdata_a2, rinp, a2_offset /* V17: vector r1 */ | ||
| lxvd2x 32+vdata_a3, rinp, a3_offset /* V22: vector r2 */ | ||
| lxvd2x 32+vdata_a4, rinp, a4_offset /* V27: vector r3 */ |
There was a problem hiding this comment.
As above: Why talk about absolute registers when using symbolic names?
There was a problem hiding this comment.
Fixed in cc9087d. Load_4Rj is now load_low; the comment uses the symbolic names
| * Compute forward NTT based on the following 7 layers - | ||
| * len = 128, 64, 32, 16, 8, 4, 2. | ||
| * | ||
| * Each layer compute the coefficients on 2 legs, start and start + len*2 offsets. |
There was a problem hiding this comment.
It's unclear what is meant here
There was a problem hiding this comment.
Rewritten. The comment now says: each layer pairs a low coefficient r[j] (leg 1) with
a high coefficient r[j+len] (leg 2).
| * | ||
| * leg 1 leg 2 | ||
| * ----- ----- | ||
| * start start+len*2 |
There was a problem hiding this comment.
Is len*2 correct here? For len=128 it would be 256, which seems wrong.
There was a problem hiding this comment.
Ok, one should clarify whether one is talking about byte-offsets or array-offsets.
There was a problem hiding this comment.
Updated, the comment now states Offsets below are byte offsets into r...
| * following order, | ||
| * rjlen0, rjlen1, rjlen2, rjlen3, rjlen4, rjlen5, rjlen6, rjlen7 | ||
| */ | ||
| .macro Load_4Coeffs start, next |
There was a problem hiding this comment.
It's a bit confusing that this includes Load_4Rjp but not Load_4Rp. I suggest to remove this macro and just inline it into its unique call-site layer12345
There was a problem hiding this comment.
Updated.. Load_4Coeffs is removed. CT_Butterfly_4x (formerly ntt_layer12345) now calls
Init_Coeffs_offset and load_high directly.
| * Load coefficient in r[j+len] (r') vectors from offset, R10, R17, R19 and R21 | ||
| * r[j+len]: V13, V18, V23, V28 | ||
| */ | ||
| .macro Load_4Rjp |
There was a problem hiding this comment.
Please use a more descriptive name, like load_high and load_low and comment that those are the coefficient subject to / not subject to scaling in the CT butterfly.
There was a problem hiding this comment.
Done: Load_4Rjp-> load_high, Load_4Rj -> load_low, each with a comment noting
which operand is scaled by zeta in the CT butterfly and which is not.
| Load_4Coeffs \start, \next | ||
| barrett_fqmul_4x \_vz0, \_vz1, \_vz2, \_vz3, \_vzt0, \_vzt1, \_vzt2, \_vzt3 | ||
| Load_4Rj | ||
| Compute_4Coeffs |
There was a problem hiding this comment.
Please pick a more descriptive name for Compute_4Coeffs, e.g. AddSub_4x or something.
| /* | ||
| * NTT other layers, 1, 2, 3, 4, 5. | ||
| */ | ||
| .macro ntt_layer12345 start, next, _vz0, _vz1, _vz2, _vz3, _vzt0, _vzt1, _vzt2, _vzt3 |
There was a problem hiding this comment.
Please pick a more descriptive name for this macro. This is not an entire NTT layer, it's a set of butterflies. E.g. CT_Butterfly_4x
There was a problem hiding this comment.
Done: renaming ntt_layer12345 to CT_Butterfly_4x
autogenerate full consts.c and consts.h Signed-off-by: Basil Hess <bhe@zurich.ibm.com> Sync iNTT & Improved ABI note Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
|
Thanks for the review, @hanno-becker! I pushed cc9087d with an update, see my replies in the inline comments. |
There was a problem hiding this comment.
Thank you @bhess for your work 🙏
poly_tomont still uses rounding Montgomery -- can we change this to Barrett multiplication, too, please?
There's still a large number of places where I feel documentation/naming should be improved, but I'll take care of that.
| vmladduhm 15, 13, V1353, 3 | ||
| vmladduhm 20, 18, V1353, 3 | ||
| vmladduhm 25, 23, V1353, 3 | ||
| vmladduhm 9, 7, V1353, 3 | ||
|
|
||
| vmhraddshs 14, 13, V1353, 3 | ||
| vmhraddshs 19, 18, V1353, 3 | ||
| vmhraddshs 24, 23, V1353, 3 | ||
| vmhraddshs 8, 7, V1353, 3 | ||
|
|
||
| vmladduhm 15, 15, V_QINV, 3 | ||
| vmladduhm 20, 20, V_QINV, 3 | ||
| vmladduhm 25, 25, V_QINV, 3 | ||
| vmladduhm 9, 9, V_QINV, 3 | ||
|
|
||
| vmhraddshs 15, 15, V_NMKQ, 14 |
There was a problem hiding this comment.
@bhess This is still rounding Montgomery. Can we adjust this to use Barrett multiplication as well, please?
Sure, I'll take care of that.
Thank you @hanno-becker! |
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
I pushed an update in f511d3b changing poly_tomont to Barrett multiplication. |
Updated patch for ML_KEM for ppc64le supports p8 and above architectures.
Signed-off-by: Danny Tsen dtsen@us.ibm.com
The following tests were run on p10.
[09:28] danny@ltcden12-lp1 new_ppc64le_mlkem % ./scripts/tests func
INFO > Functional Test Compile (native no_opt): make func OPT=0 AUTO=1 -j40
INFO > Functional Test ML-KEM-512 (native no_opt): make run_func_512 -j40
INFO > Functional Test ML-KEM-768 (native no_opt): make run_func_768 -j40
INFO > Functional Test ML-KEM-1024 (native no_opt): make run_func_1024 -j40
INFO > Functional Test Compile (native opt): make func OPT=1 AUTO=1 -j40
INFO > Functional Test ML-KEM-512 (native opt): make run_func_512 -j40
INFO > Functional Test ML-KEM-768 (native opt): make run_func_768 -j40
INFO > Functional Test ML-KEM-1024 (native opt): make run_func_1024 -j40
All good!
[09:28] danny@ltcden12-lp1 new_ppc64le_mlkem % ./scripts/tests bench -c PERF
INFO > Benchmark Compile (native no_opt): make bench OPT=0 AUTO=1 CYCLES=PERF -j40
INFO > Benchmark ML-KEM-512 (native no_opt): make run_bench_512
INFO > Benchmark ML-KEM-512 (native no_opt): test/build/mlkem512/bin/bench_mlkem512
keypair cycles = 66982
encaps cycles = 78820
decaps cycles = 100923
keypair percentiles: 66438 66690 66791 66857 66920 66982 67043 67122 67218 67306 71905
encaps percentiles: 78322 78516 78618 78687 78752 78820 78878 78933 79012 79116 83825
decaps percentiles: 100427 100634 100733 100804 100869 100923 100985 101056 101131 101253 105852
INFO > Benchmark ML-KEM-768 (native no_opt): make run_bench_768
INFO > Benchmark ML-KEM-768 (native no_opt): test/build/mlkem768/bin/bench_mlkem768
keypair cycles = 111380
encaps cycles = 125891
decaps cycles = 154364
keypair percentiles: 110575 110914 111083 111192 111291 111380 111496 111617 111821 112414 116725
encaps percentiles: 125081 125403 125526 125655 125776 125891 125998 126122 126293 126771 131358
decaps percentiles: 153575 153870 154008 154131 154261 154364 154487 154630 154782 155313 159863
INFO > Benchmark ML-KEM-1024 (native no_opt): make run_bench_1024
INFO > Benchmark ML-KEM-1024 (native no_opt): test/build/mlkem1024/bin/bench_mlkem1024
keypair cycles = 166809
encaps cycles = 185315
decaps cycles = 220229
keypair percentiles: 165339 165995 166236 166435 166616 166809 167007 167200 167505 171058 175606
encaps percentiles: 183839 184563 184778 184951 185158 185315 185518 185744 186123 189637 192014
decaps percentiles: 218911 219430 219705 219841 220027 220229 220436 220673 221029 224484 226901
INFO > Benchmark Compile (native opt): make bench OPT=1 AUTO=1 CYCLES=PERF -j40
INFO > Benchmark ML-KEM-512 (native opt): make run_bench_512
INFO > Benchmark ML-KEM-512 (native opt): test/build/mlkem512/bin/bench_mlkem512
keypair cycles = 45750
encaps cycles = 50661
decaps cycles = 63561
keypair percentiles: 45248 45469 45546 45620 45690 45750 45806 45886 45954 46063 50703
encaps percentiles: 50192 50367 50468 50542 50600 50661 50710 50771 50858 50954 55652
decaps percentiles: 63091 63276 63381 63436 63497 63561 63623 63679 63743 63857 68437
INFO > Benchmark ML-KEM-768 (native opt): make run_bench_768
INFO > Benchmark ML-KEM-768 (native opt): test/build/mlkem768/bin/bench_mlkem768
keypair cycles = 79045
encaps cycles = 86455
decaps cycles = 103878
keypair percentiles: 78313 78578 78742 78847 78954 79045 79169 79285 79470 79978 84430
encaps percentiles: 85628 86009 86172 86272 86363 86455 86592 86711 86879 87292 92038
decaps percentiles: 103041 103399 103540 103676 103788 103878 103993 104104 104274 104736 109361
INFO > Benchmark ML-KEM-1024 (native opt): make run_bench_1024
INFO > Benchmark ML-KEM-1024 (native opt): test/build/mlkem1024/bin/bench_mlkem1024
keypair cycles = 124072
encaps cycles = 134500
decaps cycles = 157090
keypair percentiles: 122727 123259 123515 123720 123929 124072 124253 124527 125009 128466 133334
encaps percentiles: 133064 133681 133933 134129 134320 134500 134711 134933 135346 138753 141067
decaps percentiles: 155503 156261 156510 156694 156894 157090 157285 157605 158014 161592 166723
All good!