Skip to content

Add ppc64le backend (supports p8 and above architectures)#1648

Open
dannytsen wants to merge 32 commits into
pq-code-package:mainfrom
dannytsen:new_main
Open

Add ppc64le backend (supports p8 and above architectures)#1648
dannytsen wants to merge 32 commits into
pq-code-package:mainfrom
dannytsen:new_main

Conversation

@dannytsen
Copy link
Copy Markdown

Updated patch for ML_KEM for ppc64le supports p8 and above architectures.

  1. Run scripts/autogen and scripts/lint on Mac but not sure if it runs for ppc64le.
  2. Run simpasm on Red Hat Linux.
  3. Added detailed comments on NTT and INTT implementations.
  4. Used C type symbols to improve readability.
  5. Fixed some typos.

Signed-off-by: Danny Tsen dtsen@us.ibm.com

The following tests were run on p10.

[09:28] danny@ltcden12-lp1 new_ppc64le_mlkem % ./scripts/tests func
INFO > Functional Test Compile (native no_opt): make func OPT=0 AUTO=1 -j40
INFO > Functional Test ML-KEM-512 (native no_opt): make run_func_512 -j40
INFO > Functional Test ML-KEM-768 (native no_opt): make run_func_768 -j40
INFO > Functional Test ML-KEM-1024 (native no_opt): make run_func_1024 -j40
INFO > Functional Test Compile (native opt): make func OPT=1 AUTO=1 -j40
INFO > Functional Test ML-KEM-512 (native opt): make run_func_512 -j40
INFO > Functional Test ML-KEM-768 (native opt): make run_func_768 -j40
INFO > Functional Test ML-KEM-1024 (native opt): make run_func_1024 -j40
All good!

[09:28] danny@ltcden12-lp1 new_ppc64le_mlkem % ./scripts/tests bench -c PERF
INFO > Benchmark Compile (native no_opt): make bench OPT=0 AUTO=1 CYCLES=PERF -j40
INFO > Benchmark ML-KEM-512 (native no_opt): make run_bench_512
INFO > Benchmark ML-KEM-512 (native no_opt): test/build/mlkem512/bin/bench_mlkem512
keypair cycles = 66982
encaps cycles = 78820
decaps cycles = 100923

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 66438 66690 66791 66857 66920 66982 67043 67122 67218 67306 71905
encaps percentiles: 78322 78516 78618 78687 78752 78820 78878 78933 79012 79116 83825
decaps percentiles: 100427 100634 100733 100804 100869 100923 100985 101056 101131 101253 105852

INFO > Benchmark ML-KEM-768 (native no_opt): make run_bench_768
INFO > Benchmark ML-KEM-768 (native no_opt): test/build/mlkem768/bin/bench_mlkem768
keypair cycles = 111380
encaps cycles = 125891
decaps cycles = 154364

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 110575 110914 111083 111192 111291 111380 111496 111617 111821 112414 116725
encaps percentiles: 125081 125403 125526 125655 125776 125891 125998 126122 126293 126771 131358
decaps percentiles: 153575 153870 154008 154131 154261 154364 154487 154630 154782 155313 159863

INFO > Benchmark ML-KEM-1024 (native no_opt): make run_bench_1024
INFO > Benchmark ML-KEM-1024 (native no_opt): test/build/mlkem1024/bin/bench_mlkem1024
keypair cycles = 166809
encaps cycles = 185315
decaps cycles = 220229

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 165339 165995 166236 166435 166616 166809 167007 167200 167505 171058 175606
encaps percentiles: 183839 184563 184778 184951 185158 185315 185518 185744 186123 189637 192014
decaps percentiles: 218911 219430 219705 219841 220027 220229 220436 220673 221029 224484 226901

INFO > Benchmark Compile (native opt): make bench OPT=1 AUTO=1 CYCLES=PERF -j40
INFO > Benchmark ML-KEM-512 (native opt): make run_bench_512
INFO > Benchmark ML-KEM-512 (native opt): test/build/mlkem512/bin/bench_mlkem512
keypair cycles = 45750
encaps cycles = 50661
decaps cycles = 63561

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 45248 45469 45546 45620 45690 45750 45806 45886 45954 46063 50703
encaps percentiles: 50192 50367 50468 50542 50600 50661 50710 50771 50858 50954 55652
decaps percentiles: 63091 63276 63381 63436 63497 63561 63623 63679 63743 63857 68437

INFO > Benchmark ML-KEM-768 (native opt): make run_bench_768
INFO > Benchmark ML-KEM-768 (native opt): test/build/mlkem768/bin/bench_mlkem768
keypair cycles = 79045
encaps cycles = 86455
decaps cycles = 103878

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 78313 78578 78742 78847 78954 79045 79169 79285 79470 79978 84430
encaps percentiles: 85628 86009 86172 86272 86363 86455 86592 86711 86879 87292 92038
decaps percentiles: 103041 103399 103540 103676 103788 103878 103993 104104 104274 104736 109361

INFO > Benchmark ML-KEM-1024 (native opt): make run_bench_1024
INFO > Benchmark ML-KEM-1024 (native opt): test/build/mlkem1024/bin/bench_mlkem1024
keypair cycles = 124072
encaps cycles = 134500
decaps cycles = 157090

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 122727 123259 123515 123720 123929 124072 124253 124527 125009 128466 133334
encaps percentiles: 133064 133681 133933 134129 134320 134500 134711 134933 135346 138753 141067
decaps percentiles: 155503 156261 156510 156694 156894 157090 157285 157605 158014 161592 166723

All good!

@dannytsen dannytsen requested a review from a team as a code owner April 6, 2026 22:16
@dannytsen dannytsen mentioned this pull request Apr 6, 2026
5 tasks
@dannytsen
Copy link
Copy Markdown
Author

@bhess @mkannwischer Please review. Thanks.

@mkannwischer
Copy link
Copy Markdown
Contributor

@bhess @mkannwischer Please review. Thanks.

Thanks @dannytsen for rebasing this PR. Could you get it through CI please?

@mkannwischer mkannwischer self-assigned this Apr 7, 2026
Copy link
Copy Markdown
Contributor

@bhess bhess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, @dannytsen.

Minor comments are inline.

supports p8 and above architectures.

I do not think this is correct as written. I see illegal instructions with qemu-ppc64le -cpu power8, while the same code runs with -cpu power9. In particular, lxvx and stxvx do not appear to be supported on p8.

Can you please double-check this and update the requirement to POWER9 if that is indeed the minimum supported target?

Comment thread dev/ppc64le/src/arith_native_ppc64le.h Outdated
Comment thread dev/ppc64le/src/consts.c Outdated
Comment thread dev/ppc64le/src/reduce.S Outdated
Comment thread dev/ppc64le/src/poly_tomont.S Outdated
Comment thread dev/ppc64le/src/reduce.S Outdated
Comment thread mlkem/src/native/ppc64le/meta.h
Comment thread dev/ppc64le/src/poly_tomont.S Outdated
Comment thread dev/ppc64le/src/reduce.S Outdated
@dannytsen
Copy link
Copy Markdown
Author

dannytsen commented Apr 7, 2026

Thanks for the updates, @dannytsen.

Minor comments are inline.

supports p8 and above architectures.

I do not think this is correct as written. I see illegal instructions with qemu-ppc64le -cpu power8, while the same code runs with -cpu power9. In particular, lxvx and stxvx do not appear to be supported on p8.

Can you please double-check this and update the requirement to POWER9 if that is indeed the minimum supported target?

@mkannwischer I tested on p8 system. I have to double check that. Ah, I found the problem, my p8 supports ISA 2.07 which means the qemu supports older one. I'll fix that along with @bhess comments. Thanks.

@dannytsen
Copy link
Copy Markdown
Author

@mkannwischer @bhess Just submitted my changes and stated that the supporting architecture to be p9 and above or Power systems support ISA 2.07 and other fixes. Hope this solve all the issues. Thanks.

@dannytsen
Copy link
Copy Markdown
Author

@mkannwischer @bhess Submitted an updated code after re-run autogen. Please review.

@dannytsen
Copy link
Copy Markdown
Author

@mkannwischer @bhess Not sure what's the errors for 4 CI errors. This was happened after I ran the autogen.

@mkannwischer
Copy link
Copy Markdown
Contributor

@mkannwischer @bhess Not sure what's the errors for 4 CI errors. This was happened after I ran the autogen.

Thanks for the progress.

You can ignore the two OpenTitan tests (those should not be run on a fork because the AWS auth fails) and the markdown lint (that is a spurious failure).
The ppc one needs investigation - you should be able to run the examples locally to reproduce it.

@dannytsen
Copy link
Copy Markdown
Author

@mkannwischer @bhess Not sure what's the errors for 4 CI errors. This was happened after I ran the autogen.

Thanks for the progress.

You can ignore the two OpenTitan tests (those should not be run on a fork because the AWS auth fails) and the markdown lint (that is a spurious failure). The ppc one needs investigation - you should be able to run the examples locally to reproduce it.

OK. I'll check into the ppc one. Thanks.

@bhess
Copy link
Copy Markdown
Contributor

bhess commented Apr 8, 2026

@mkannwischer @bhess Just submitted my changes and stated that the supporting architecture to be p9 and above or Power systems support ISA 2.07 and other fixes. Hope this solve all the issues. Thanks.

Thanks for the updates. I have a few more minor follow-ups inline.

@dannytsen
Copy link
Copy Markdown
Author

@mkannwischer @bhess Just submitted my changes and stated that the supporting architecture to be p9 and above or Power systems support ISA 2.07 and other fixes. Hope this solve all the issues. Thanks.

Thanks for the updates. I have a few more minor follow-ups inline.

@bhess I did not see your follow-ups or I missed it. Can you post it again? Thanks.

Comment thread mlkem/src/native/ppc64le/README.md Outdated
Comment thread dev/ppc64le/README.md Outdated
Comment thread dev/ppc64le/src/poly_tomont.S Outdated
Comment thread dev/ppc64le/src/reduce.S Outdated
#define V_26 2
#define V_MKQ 3

.machine "any"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this directive will get removed after simpasm.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it still make sense to include it in the dev-folder then, e.g., for documentation purposes?

Copy link
Copy Markdown
Author

@dannytsen dannytsen Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bhess Well, this code still supports p8 but with ISA 2.07. Some older p8s don't support ISA 2.07. Looks like qemu supports older p8 not with ISA 2.07. In general, we support p9 but also p8 with ISA 2.07.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, is there any way to make this distinction with the .machine pseudo-op?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's possible. I'll think about it.

@bhess
Copy link
Copy Markdown
Contributor

bhess commented Apr 9, 2026

@mkannwischer @bhess Just submitted my changes and stated that the supporting architecture to be p9 and above or Power systems support ISA 2.07 and other fixes. Hope this solve all the issues. Thanks.

Thanks for the updates. I have a few more minor follow-ups inline.

@bhess I did not see your follow-ups or I missed it. Can you post it again? Thanks.

Should be visible now.

@mkannwischer
Copy link
Copy Markdown
Contributor

@mkannwischer @bhess Not sure what's the errors for 4 CI errors. This was happened after I ran the autogen.

Thanks for the progress.
You can ignore the two OpenTitan tests (those should not be run on a fork because the AWS auth fails) and the markdown lint (that is a spurious failure). The ppc one needs investigation - you should be able to run the examples locally to reproduce it.

OK. I'll check into the ppc one. Thanks.

If you rebase on top of main, the OpenTitan tests should be properly skipped (fixed in #1649)

@dannytsen
Copy link
Copy Markdown
Author

@mkannwischer @bhess I manually modified mlkem/mlkem_native_asm.S and mlkem/mlkem_native.c to add ppc64le include so to pass the monolithic_build_multilevel_native CI tests because autogen can't handle the ppc64le platform. So I am not sure how to fix the test unless autogen get fixed.

@bhess
Copy link
Copy Markdown
Contributor

bhess commented Apr 9, 2026

@mkannwischer @bhess I manually modified mlkem/mlkem_native_asm.S and mlkem/mlkem_native.c to add ppc64le include so to pass the monolithic_build_multilevel_native CI tests because autogen can't handle the ppc64le platform. So I am not sure how to fix the test unless autogen get fixed.

I’ve opened a PR against your branch that updates autogen.
Running the updated autogen should produce identical results to your current manually modified code. PR: dannytsen#1

@dannytsen
Copy link
Copy Markdown
Author

@mkannwischer @bhess I manually modified mlkem/mlkem_native_asm.S and mlkem/mlkem_native.c to add ppc64le include so to pass the monolithic_build_multilevel_native CI tests because autogen can't handle the ppc64le platform. So I am not sure how to fix the test unless autogen get fixed.

I’ve opened a PR against your branch that updates autogen. Running the updated autogen should produce identical results to your current manually modified code. PR: dannytsen#1

@bhess Thanks.

@dannytsen
Copy link
Copy Markdown
Author

@mkannwischer @bhess I manually modified mlkem/mlkem_native_asm.S and mlkem/mlkem_native.c to add ppc64le include so to pass the monolithic_build_multilevel_native CI tests because autogen can't handle the ppc64le platform. So I am not sure how to fix the test unless autogen get fixed.

I’ve opened a PR against your branch that updates autogen. Running the updated autogen should produce identical results to your current manually modified code. PR: dannytsen#1

@bhess Thanks.

@bhess I merged the autogen and rerun the autogen. Hope this commit fix the problem. Thanks.

@dannytsen
Copy link
Copy Markdown
Author

@bhess I just added a capability check of PPC_FEATURE2_ARCH_2_07. It will fall back to the original implementation if the architecture doesn't support ISA 2.07 and above which include some p8 and p9 above. Please review. Thanks.

@dannytsen
Copy link
Copy Markdown
Author

@mkannwischer @bhess Any feedback on the recent commit? Thanks.

@dannytsen
Copy link
Copy Markdown
Author

@mkannwischer In qemu, power8e supports ISA 2.07 which is a variant of p8 with DD2.

Copy link
Copy Markdown
Contributor

@bhess bhess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, @dannytsen.
The runtime capability check goes in a slightly different direction than what I had in mind. My suggestion was about whether the .machine pseudo-op can express something like "POWER8 with ISA 2.07 support". If there's no way to express this with .machine, I'd suggest just removing it from the assembly and documenting the ISA 2.07 requirement in the project docs / README instead (instead of machine "any").

For the dynamic getauxval approach: I wonder if runtime feature detection might be better left to downstream consumers of mlkem-native (e.g., liboqs) rather than adding it here. There's also the consideration that getauxval/<sys/auxv.h> is not portable AFAIK. Any thoughts @mkannwischer ?

Comment thread mlkem/src/native/ppc64le/meta.h Outdated
@dannytsen
Copy link
Copy Markdown
Author

Thanks for the update, @dannytsen. The runtime capability check goes in a slightly different direction than what I had in mind. My suggestion was about whether the .machine pseudo-op can express something like "POWER8 with ISA 2.07 support". If there's no way to express this with .machine, I'd suggest just removing it from the assembly and documenting the ISA 2.07 requirement in the project docs / README instead (instead of machine "any").

For the dynamic getauxval approach: I wonder if runtime feature detection might be better left to upstream consumers of mlkem-native (e.g., liboqs) rather than adding it here. There's also the consideration that getauxval/<sys/auxv.h> is not portable AFAIK. Any thoughts @mkannwischer ?

@bhess ".machine any" is already a psudo-op. I can remove that in dev/ppc64le/src. The README is already had the update.

For getauxval approach, it's to prevent the code run on a older Power systems that did not support proper instaructions.

Copy link
Copy Markdown
Contributor

@bhess bhess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this after a closer look at runtime CPU checks in mlkem-native:
my remaining concern is the runtime capability check in the PR. mlkem-native already has a designed-in mechanism for exactly this case, and I think the PR should use it rather than doing its own in the native backend.

Concretely, this would mean adding MLK_SYS_CAP_PPC_ISA_2_07 to the enum,
replacing the four mlkem_ppc_check_cap() calls with
mlk_sys_check_capability(...), and shipping the auxv code as a new
test/configs/custom_native_capability_config_PPC_AUXV.h. Specifics are inline.

@mkannwischer: please correct me if that's not the intended pattern, and feel free to share any additional review.

Comment thread dev/ppc64le/meta.h Outdated
#include "../api.h"
#include "src/arith_native_ppc64le.h"

#include <sys/auxv.h>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this header is pulled in by everyone that uses the native backend (via src/native/meta.h), this include makes the whole native build glibc-dependent.

The existing AVX2 and NEON backends don't query the OS, they assume the ISA the build was configured for, and runtime dispatch is delegated to the consumer through MLK_CONFIG_CUSTOM_CAPABILITY_FUNC (see test/configs/custom_native_capability_config_CPUID_AVX2.h and ..._ID_AA64PFR1_EL1.h).
I'd advise to use the same pattern here: drop this include, and move the auxv-based detection into a new
test/configs/custom_native_capability_config_PPC_AUXV.h example.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is auto generated.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, anyway I think this is moot and not a blocker if the checks are removed for now (see the other thread). I can handle the autogen in a followup.

Comment thread dev/ppc64le/meta.h
Copy link
Copy Markdown
Contributor

@bhess bhess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @dannytsen. I re-reviewed this together with the earlier discussion in #1184. The removal of the runtime capability check resolves my previous comment, and the main readability points raised on the earlier PR also seem addressed for this iteration. I’m not the maintainer, but from my side this iteration now looks good and I’d support merging it.

- ruff format scripts/autogen (formatting fix)
- Add check-magic annotation for array size 2072 in consts.c and consts.h
  (7 groups of 8 base constants + 4 twiddle tables * 63 rows * 8 values)

Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
@bhess
Copy link
Copy Markdown
Contributor

bhess commented May 21, 2026

Thanks @mkannwischer. I pushed an update that fixes the linting (linting now passes locally).

@mkannwischer
Copy link
Copy Markdown
Contributor

Thanks @mkannwischer. I pushed an update that fixes the linting (linting now passes locally).

Thanks! There are some issues in the POWER-7 tests: https://github.com/pq-code-package/mlkem-native/actions/runs/26212206823/job/77130946223?pr=1648. Could you take a look please?

@bhess
Copy link
Copy Markdown
Contributor

bhess commented May 21, 2026

Thanks! There are some issues in the POWER-7 tests: https://github.com/pq-code-package/mlkem-native/actions/runs/26212206823/job/77130946223?pr=1648. Could you take a look please?

Yes just noticed it, will take a look.

@mkannwischer
Copy link
Copy Markdown
Contributor

For the HOL-Light failures, a workaround would be to add toolchain_ppc64le to the hol_light-cross, hol_light-cross-aarch64, hol_light-cross-x86_64 shells. That's what we did for gcc-arm-embedded, too. That is fine for now until @hanno-becker or I can take a look on how to make --force-cross a little more selective.

bhess added 2 commits May 21, 2026 13:27
…power8 (nix libc is compiled for power8 and otherwise causes illegal instructions.

Avoids unused data parameter errors in the fallback code path.

Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
@bhess
Copy link
Copy Markdown
Contributor

bhess commented May 21, 2026

Thanks! There are some issues in the POWER-7 tests: https://github.com/pq-code-package/mlkem-native/actions/runs/26212206823/job/77130946223?pr=1648. Could you take a look please?

The Illegal instruction issue doen't appear to come from the mlkem code itself, but from the nix toolchain that uses a glibc compiled for power8, so binaries fault when run under qemu-ppc64le -cpu power7.
As a workaround, I changed the CI to still compile with -mcpu=power7 to enable the fallback, but to run the resulting binary with qemu-ppc64le -cpu power8 so that the binary runs properly. I think a cleaner long-term fix would be a separate ppc64le toolchain built for power7.

I also reproduced the fallback path locally. While doing that I hit an issue in the fallback-path (unused variable) fixed in commit 3442786.

For the HOL-Light failures, a workaround would be to add toolchain_ppc64le to the hol_light-cross, hol_light-cross-aarch64, hol_light-cross-x86_64 shells. That's what we did for gcc-arm-embedded, too. That is fine for now until @hanno-becker or I can take a look on how to make --force-cross a little more selective.

Done in the latest commit.

@mkannwischer
Copy link
Copy Markdown
Contributor

Thanks! There are some issues in the POWER-7 tests: https://github.com/pq-code-package/mlkem-native/actions/runs/26212206823/job/77130946223?pr=1648. Could you take a look please?

The Illegal instruction issue doen't appear to come from the mlkem code itself, but from the nix toolchain that uses a glibc compiled for power8, so binaries fault when run under qemu-ppc64le -cpu power7. As a workaround, I changed the CI to still compile with -mcpu=power7 to enable the fallback, but to run the resulting binary with qemu-ppc64le -cpu power8 so that the binary runs properly. I think a cleaner long-term fix would be a separate ppc64le toolchain built for power7.

I also reproduced the fallback path locally. While doing that I hit an issue in the fallback-path (unused variable) fixed in commit 3442786.

Thanks for investigating. That's annoying. I think this solution is acceptable for now and I'm glad that it at least revealed some problems we would have otherwise missed.

I have updated #1677 so we can run the full CI and benchmarks there.

Comment thread dev/ppc64le/src/consts.c
#if defined(MLK_ARITH_BACKEND_PPC64LE_DEFAULT) && \
!defined(MLK_CONFIG_MULTILEVEL_NO_SHARED) && defined(__POWER8_VECTOR__)

#include "consts.h"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the entire file not autogenerated?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in cc9087d which autogenerates the full consts.c file (without needing .inc files)

Comment thread dev/ppc64le/src/consts.h

/* Offsets into the constant table */
/* check-magic: off */
#define MLK_PPC_NQ_OFFSET 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we auto-generate consts.c, we can also auto-generate those offsets from the same code.

Copy link
Copy Markdown
Contributor

@bhess bhess Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in cc9087d, consts.h is now auto-generated.

Comment thread dev/ppc64le/src/ntt_ppc_asm.S Outdated

/*
* Load coefficient in r[j+len] (r') vectors from offset, R10, R17, R19 and R21
* r[j+len]: V13, V18, V23, V28
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is confusing to talk about absolute register values here if the code itself uses symbolic ones.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in cc9087d.Load_4Rjp is now load_high; the comment uses the symbolic names.

Comment thread dev/ppc64le/src/ntt_ppc_asm.S Outdated
Comment on lines +309 to +310
/* Since the result of the Barrett multiplication is bounded
by q/2 in absolute value.
Copy link
Copy Markdown
Contributor

@hanno-becker hanno-becker May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not true, and the sentence just seems to stop.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment in cc9087d as it was also out of context. I think the same bound applies to Barrett multiplication as in eprint 2021/986 (Algorithm 6 / Corollary 2: |bar_N(a,b)| < N),

Comment thread dev/ppc64le/src/ntt_ppc_asm.S Outdated
Comment on lines +294 to +300
* r[j]: V12, V17, V22, V27
*/
.macro Load_4Rj
lxvd2x 32+vdata_a1, rinp, a1_offset /* V12: vector r0 */
lxvd2x 32+vdata_a2, rinp, a2_offset /* V17: vector r1 */
lxvd2x 32+vdata_a3, rinp, a3_offset /* V22: vector r2 */
lxvd2x 32+vdata_a4, rinp, a4_offset /* V27: vector r3 */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above: Why talk about absolute registers when using symbolic names?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in cc9087d. Load_4Rj is now load_low; the comment uses the symbolic names

Comment thread dev/ppc64le/src/ntt_ppc_asm.S Outdated
* Compute forward NTT based on the following 7 layers -
* len = 128, 64, 32, 16, 8, 4, 2.
*
* Each layer compute the coefficients on 2 legs, start and start + len*2 offsets.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear what is meant here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewritten. The comment now says: each layer pairs a low coefficient r[j] (leg 1) with
a high coefficient r[j+len] (leg 2).

*
* leg 1 leg 2
* ----- -----
* start start+len*2
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is len*2 correct here? For len=128 it would be 256, which seems wrong.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, one should clarify whether one is talking about byte-offsets or array-offsets.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, the comment now states Offsets below are byte offsets into r...

Comment thread dev/ppc64le/src/ntt_ppc_asm.S Outdated
* following order,
* rjlen0, rjlen1, rjlen2, rjlen3, rjlen4, rjlen5, rjlen6, rjlen7
*/
.macro Load_4Coeffs start, next
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit confusing that this includes Load_4Rjp but not Load_4Rp. I suggest to remove this macro and just inline it into its unique call-site layer12345

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.. Load_4Coeffs is removed. CT_Butterfly_4x (formerly ntt_layer12345) now calls
Init_Coeffs_offset and load_high directly.

Comment thread dev/ppc64le/src/ntt_ppc_asm.S Outdated
* Load coefficient in r[j+len] (r') vectors from offset, R10, R17, R19 and R21
* r[j+len]: V13, V18, V23, V28
*/
.macro Load_4Rjp
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a more descriptive name, like load_high and load_low and comment that those are the coefficient subject to / not subject to scaling in the CT butterfly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: Load_4Rjp-> load_high, Load_4Rj -> load_low, each with a comment noting
which operand is scaled by zeta in the CT butterfly and which is not.

Comment thread dev/ppc64le/src/ntt_ppc_asm.S Outdated
Load_4Coeffs \start, \next
barrett_fqmul_4x \_vz0, \_vz1, \_vz2, \_vz3, \_vzt0, \_vzt1, \_vzt2, \_vzt3
Load_4Rj
Compute_4Coeffs
Copy link
Copy Markdown
Contributor

@hanno-becker hanno-becker May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please pick a more descriptive name for Compute_4Coeffs, e.g. AddSub_4x or something.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: renamed to AddSub_4x.

Comment thread dev/ppc64le/src/ntt_ppc_asm.S Outdated
/*
* NTT other layers, 1, 2, 3, 4, 5.
*/
.macro ntt_layer12345 start, next, _vz0, _vz1, _vz2, _vz3, _vzt0, _vzt1, _vzt2, _vzt3
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please pick a more descriptive name for this macro. This is not an entire NTT layer, it's a set of butterflies. E.g. CT_Butterfly_4x

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: renaming ntt_layer12345 to CT_Butterfly_4x

Comment thread dev/ppc64le/src/ntt_ppc_asm.S
Comment thread dev/ppc64le/src/ntt_ppc_asm.S
autogenerate full consts.c and consts.h

Signed-off-by: Basil Hess <bhe@zurich.ibm.com>

Sync iNTT & Improved ABI note

Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
@bhess
Copy link
Copy Markdown
Contributor

bhess commented Jun 2, 2026

Thanks for the review, @hanno-becker! I pushed cc9087d with an update, see my replies in the inline comments.

Copy link
Copy Markdown
Contributor

@hanno-becker hanno-becker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @bhess for your work 🙏

poly_tomont still uses rounding Montgomery -- can we change this to Barrett multiplication, too, please?

There's still a large number of places where I feel documentation/naming should be improved, but I'll take care of that.

Comment thread dev/ppc64le/src/poly_tomont_ppc_asm.S Outdated
Comment on lines +48 to +63
vmladduhm 15, 13, V1353, 3
vmladduhm 20, 18, V1353, 3
vmladduhm 25, 23, V1353, 3
vmladduhm 9, 7, V1353, 3

vmhraddshs 14, 13, V1353, 3
vmhraddshs 19, 18, V1353, 3
vmhraddshs 24, 23, V1353, 3
vmhraddshs 8, 7, V1353, 3

vmladduhm 15, 15, V_QINV, 3
vmladduhm 20, 20, V_QINV, 3
vmladduhm 25, 25, V_QINV, 3
vmladduhm 9, 9, V_QINV, 3

vmhraddshs 15, 15, V_NMKQ, 14
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bhess This is still rounding Montgomery. Can we adjust this to use Barrett multiplication as well, please?

@bhess
Copy link
Copy Markdown
Contributor

bhess commented Jun 4, 2026

poly_tomont still uses rounding Montgomery -- can we change this to Barrett multiplication, too, please?

Sure, I'll take care of that.

There's still a large number of places where I feel documentation/naming should be improved, but I'll take care of that.

Thank you @hanno-becker!

Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
@bhess
Copy link
Copy Markdown
Contributor

bhess commented Jun 5, 2026

poly_tomont still uses rounding Montgomery -- can we change this to Barrett multiplication, too, please?

I pushed an update in f511d3b changing poly_tomont to Barrett multiplication.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

benchmark this PR should be benchmarked in CI ppc64le

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants