Proposal: AOT tail-call support #1284
|
Very interesting! Thanks! I'll take a close look very soon |
|
Thanks, no rush at all, happy to adjust as needed. As an aside, I have been very much liking Chicory; it's super useful! |
|
I'm facing the same issue with tail calls in wasm'd python 3.14. Would love to see it landing :) |
|
@zklapow thanks so much for this proposal! I love the direction and took a quick look at the implementation, a couple of high level comments for the first round:
Style-wise, what do you think about changing:
private boolean tailCallPending;
private int tailCallFuncId;
private long[] tailCallArgs;
to something like:
private TailCallPending tailCallPending; // null when not set
private record TailCallPending(int funcId, long[] args) {}
I have just skimmed through; if I'm missing something, please let me know! Unrelated, @wendigo have you seen this ? |
|
Yup, can do. I'll take a pass at making it have a smaller impact on the existing bytecode and update the style a bit tomorrow. Funnily enough, I am also using this for the Python 3.14 tail call compiler! I saw redline and it definitely gets better perf, so in the end I may just shift to redline. I will say that with some Chicory edits I got Python perf very close between Chicory AOT and redline, but I had to switch back to the non-tail-call interpreter there due to the lack of tail-call support. Anyhow, I figured this was worth PRing regardless for anyone not on redline. I can take a look at tail-call support in redline too, if that's of interest? Also @andreaTP, if I have some other small changes I'd like to run by you/possibly upstream, do you have a preferred method? Should I create issues first or just send PRs? |
|
@andreaTP not yet, but it looks promising, given that Trino is already on JDK 25 (so we can use FFM). The only issue I can see is that the Python 3.14 main loop still compiles to a large Wasm method that AOT couldn't handle, and I guess the same would happen with Cranelift. Enabling tail calls actually makes the method compile, but during execution it exhausts the stack depth. Having trampolines should solve this. |
|
@zklapow:
Since this seems to have some traction, when we can make it "generic enough", I am fully supportive of hosting (and helping with) a "cpython4j" project in the roastedroot organization (I imagine something similar to
Perf improvements very much depend on the Wasm payload. With redline we ensure performance similar to Wasmtime, but it is totally possible that you can get good results with the bytecode compiler.
Fully agree, thanks for taking a stab at it!
You inspired me, I'm on it! Thanks for offering help!
I'm happy to directly review PRs of small, self-contained improvements; when there are tradeoffs to be discussed, it is probably better to do a pre-flight check in an Issue or on Zulip. @wendigo:
I'm very happy to say that this is not correct 🙂. Sidestepping the problems with large functions on the JVM is the very reason why redline exists. Perf-wise, the closer comparison is probably with |
|
@andreaTP with the interruption support we could try it ;-) |
|
I'll join Zulip, happy to chat more there. I can package up some of my Python work for publishing; that had been on my list anyhow. I have a pretty good setup now with numpy/pandas/pydantic and all their static libs built in too. I might publish under our (HubSpot) org if I do that, though.
Yeah, I can confirm the native approach using Cranelift immediately works out to be better than all of these optimizations to AOT, even without the tail-call interpreter. For me this worked out to about a 1.8-2x improvement, not quite 4x, but still pretty good. I suspect Cranelift with tail calls would be even better. |
|
Before I forget, I took a stab at a generic CPython integration. It's basically just experiments, but I think it's worth sharing on this thread: https://github.com/andreaTP/cpython4j-poc I was not happy with the PoC, for multiple reasons, and I dropped the effort, but I'm happy to support someone else and see where it goes!
I'll take it! 🙂 Working on tail call and interrupts in redline, I aim to release with those by EOW. |
|
Updated so this doesn't touch existing bytecode, and added a specific approval test for tail calls. I ended up with a private static class for |
18328e9 to 81aacd2
…call spec tests
- Add Java pseudocode comment to compileMachineCallWithTailCalls explaining the trampoline loop pattern
- Add comment to RETURN_CALL emitter explaining callee-side behavior
- Restore accidentally deleted comments (// must be power of two, stack layout)
- Remove unused TailCallException class (leftover from earlier approach)
- Fix InterpreterFallbackTest approvals (lambda$0 -> lambda$function$0 due to new methods added to Instance.java)
- Enable 25 previously-excluded tail-call spec tests that now pass with the trampoline implementation (deep recursion + exception handling with tail calls)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
81aacd2 to 6e4c958
|
no worries, thx for doing that. |
|
Thanks! |
|
@wendigo also, I meant to follow up: it's not quite ready for prime time, but I did publish https://github.com/HubSpot/boomslang if it's of interest. It contains the full optimized CPython build, plus a very small extension system that auto-generates some of the Java host side for allowing Python to call into Java. |
The current AOT compiler approach to tail calls in Wasm is relatively simple: it treats them as regular function calls. With deep recursion, however, this can easily exhaust the JVM stack. This PR proposes a trampoline-based implementation to prevent stack growth.
I wasn't sure if there was a good way to benchmark this on existing programs/workloads. It's been a big perf improvement for us, largely because moving to a tail-call approach has allowed a very large method to become AOT compiled. All the existing tests pass without changes, other than the Approvaltests, but AFAICT those require modification whenever the emitted bytecode changes.
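For readers unfamiliar with the pattern, here is a minimal, standalone sketch of a trampoline in plain Java. It is not the code in this PR and it does not use Chicory's API: `TrampolineSketch`, `Func`, `Ctx`, `TABLE`, `call`, and `sumTo` are all hypothetical names chosen for illustration, and only the `TailCallPending` record shape is borrowed from the review suggestion above. The idea shown is that a tail-calling function records the next call and returns, and the caller loops until no call is pending, so the JVM stack stays flat.

```java
final class TrampolineSketch {
    /** Hypothetical stand-in for a compiled Wasm function. */
    interface Func {
        long[] apply(Ctx ctx, long[] args);
    }

    /** A requested tail call: which function to run next, and with what arguments. */
    record TailCallPending(int funcId, long[] args) {}

    /** Hypothetical per-instance state; pending is null when no tail call was requested. */
    static final class Ctx {
        TailCallPending pending;
    }

    // Function 0: sum(n, acc) = (n == 0) ? acc : tail-call sum(n - 1, acc + n)
    static final Func[] TABLE = {
        (ctx, args) -> {
            long n = args[0];
            long acc = args[1];
            if (n == 0) {
                return new long[] {acc};
            }
            // Callee side: instead of calling itself, record the request and return.
            ctx.pending = new TailCallPending(0, new long[] {n - 1, acc + n});
            return null;
        }
    };

    /** Caller side: run the function, then keep dispatching while a tail call is pending. */
    static long[] call(Ctx ctx, int funcId, long[] args) {
        long[] result = TABLE[funcId].apply(ctx, args);
        while (ctx.pending != null) {
            TailCallPending next = ctx.pending;
            ctx.pending = null; // clear before dispatch so the callee can set it again
            result = TABLE[next.funcId()].apply(ctx, next.args());
        }
        return result;
    }

    static long sumTo(long n) {
        return call(new Ctx(), 0, new long[] {n, 0})[0];
    }

    public static void main(String[] args) {
        // One million chained tail calls complete without growing the JVM stack.
        System.out.println(sumTo(1_000_000));
    }
}
```

In this sketch every step of the chain runs one frame above `call`, so the recursion depth is constant no matter how long the chain is. The cost is one small allocation per tail call, which a real implementation might avoid.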