Proposal: AOT tail-call support #1284

Merged
andreaTP merged 3 commits into dylibso:main from zklapow:aot-tail-call-upstream on May 5, 2026

Conversation

@zklapow
Contributor

@zklapow zklapow commented Apr 27, 2026

The AOT compiler's current approach to tail calls in wasm is relatively simple: it treats them as regular function calls. With deep recursion, however, this can easily exhaust the JVM stack. This PR proposes a trampoline-based implementation to prevent stack growth.

I wasn't sure whether there was a good way to benchmark this on existing programs/workloads. It has been a big perf improvement for us, largely because moving to a tail-call approach has allowed a very large method to become AOT compiled. All the existing tests pass without changes, other than the Approval tests, but AFAICT those require modification whenever the emitted bytecode changes.
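
For readers unfamiliar with the pattern, here is a minimal Java sketch of a trampoline, not the code generated by this PR. The class name, the `sum` function, and the `dispatch` scheme are all hypothetical; the three state fields borrow the names of the `Instance` fields discussed in this thread. Instead of calling its target, a tail-calling function records the target and returns, and a loop in the caller performs the next call, so the JVM stack does not grow with recursion depth.

```java
// Illustrative sketch only: class, method names and dispatch scheme are assumptions,
// not Chicory's actual API or generated bytecode.
final class TrampolineSketch {
    // Pending tail-call state, written by the callee and read by the trampoline.
    private boolean tailCallPending;
    private int tailCallFuncId;
    private long[] tailCallArgs;

    // Hypothetical wasm function 0: sum(n, acc), which tail-calls itself.
    private long sum(long[] args) {
        long n = args[0];
        long acc = args[1];
        if (n == 0) {
            return acc;
        }
        // Stand-in for return_call: record the target and return instead of calling it.
        tailCallPending = true;
        tailCallFuncId = 0;
        tailCallArgs = new long[] {n - 1, acc + n};
        return 0L; // placeholder, ignored while a tail call is pending
    }

    private long dispatch(int funcId, long[] args) {
        if (funcId == 0) {
            return sum(args);
        }
        throw new IllegalArgumentException("unknown function " + funcId);
    }

    // Trampoline: keeps dispatching until no tail call is pending.
    long call(int funcId, long[] args) {
        long result = dispatch(funcId, args);
        while (tailCallPending) {
            tailCallPending = false;
            result = dispatch(tailCallFuncId, tailCallArgs);
        }
        return result;
    }
}
```

With this shape, `new TrampolineSketch().call(0, new long[] {1_000_000L, 0L})` runs one million iterations of the loop while only ever holding two frames on the stack, whereas a direct recursive call of that depth would typically exhaust a default-sized JVM stack.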

@zklapow zklapow changed the title AOT: tail-call support with optimized dispatch chunking AOT: tail-call support Apr 27, 2026
@zklapow zklapow changed the title AOT: tail-call support Proposal: AOT tail-call support Apr 27, 2026
@andreaTP
Collaborator

Very interesting! Thanks! I'll take a close look very soon

@zklapow
Contributor Author

zklapow commented Apr 27, 2026

Thanks, no rush at all, happy to adjust as needed. As an aside, I have been very much liking chicory, it's super useful!

@wendigo

wendigo commented Apr 28, 2026

I'm facing the same issue with tail calls in wasm'd python 3.14. Would love to see it land :)

@andreaTP
Collaborator

andreaTP commented Apr 28, 2026

@zklapow thanks so much for this proposal!

I love the direction and took a quick look at the implementation. A couple of high-level comments for the first round:

  • I expect the generated code that is not using tail calls to remain unmodified, e.g. most of the approvals should remain the same.
  • Most of the logic for analyzing the module should go into the WasmAnalyzer. We are already looping through everything there, and I believe we can move the analysis in-line (e.g. move collectTailCallFunctions and similar to WasmAnalyzer).
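
As a rough illustration of the second point, the sketch below marks which functions contain a tail-call instruction during a single pass over already-decoded opcodes. The class name and the input shape (`List<int[]>`, one opcode array per function) are assumptions made for this example and do not reflect how WasmAnalyzer or collectTailCallFunctions are written; the two opcode values are the ones assigned to `return_call` and `return_call_indirect` by the WebAssembly tail-call proposal.

```java
import java.util.BitSet;
import java.util.List;

// Illustrative sketch only: not Chicory's WasmAnalyzer.
final class TailCallScanSketch {
    static final int RETURN_CALL = 0x12;
    static final int RETURN_CALL_INDIRECT = 0x13;

    // opcodesPerFunction.get(i) holds the decoded opcodes of function i
    // (hypothetical input shape; immediates are assumed to be stripped already).
    static BitSet functionsUsingTailCalls(List<int[]> opcodesPerFunction) {
        BitSet result = new BitSet(opcodesPerFunction.size());
        for (int funcId = 0; funcId < opcodesPerFunction.size(); funcId++) {
            for (int op : opcodesPerFunction.get(funcId)) {
                if (op == RETURN_CALL || op == RETURN_CALL_INDIRECT) {
                    result.set(funcId);
                    break; // one hit is enough to mark the function
                }
            }
        }
        return result;
    }
}
```

Functions whose bit is clear can then keep their existing code generation untouched, which is what the first point asks for.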

Style wise, what do you think about, in Instance, moving:

    private boolean tailCallPending;
    private int tailCallFuncId;
    private long[] tailCallArgs;

to something like:

    private TailCallPending tailCallPending; // null when not set
    
    private record TailCallPending(int funcId, long[] args) {}

I have just skimmed through, if I'm missing something please let me know!

Unrelated, @wendigo have you seen this?
I haven't tackled tail calls there yet, but it shouldn't be hard.

@zklapow
Contributor Author

zklapow commented Apr 29, 2026

Yup, can do. I'll take a pass at making it have a smaller impact on the existing bytecode and update the style a bit tomorrow.

Funnily enough, I am also using this for the python 3.14 tail call compiler! I saw redline and it definitely gets better perf, so in the end I may just shift to redline. I will say that with some chicory edits I got python perf very close between chicory AOT and redline, but I had to switch back to the non-tail-call interpreter there due to the lack of tail call support.

Anyhow, I figured this was worth PRing regardless for anyone not on redline. I can take a look at tail call support in redline too if that's of interest?

Also @andreaTP, if I have some other small changes I'd like to run by you or possibly upstream, do you have a preferred method? Should I create issues first or just send PRs?

@wendigo

wendigo commented Apr 29, 2026

@andreaTP not yet, but it looks promising, given that Trino is already on JDK 25 (so we can use FFM). The only issue I can see is that the python 3.14 main loop still compiles to a large wasm method that AOT couldn't handle, and I guess the same would happen with cranelift. Enabling tail-call actually makes the method compile, but during execution it exhausts the stack depth. Having trampolines should solve this.

@andreaTP
Collaborator

andreaTP commented Apr 29, 2026

@zklapow:

python 3.14

Since this seems to have some traction, once we can make it "generic enough", I am fully supportive of hosting (and helping with) a "cpython4j" project in the roastedroot organization (I imagine something similar to quickjs4j) when someone volunteers the initial work.

very close between chicory AOT and redline

Perf improvements very much depend on the Wasm payload. With redline we ensure performance similar to Wasmtime, but it is totally possible that you can get good results with the bytecode compiler.

this was worth regardless

Fully agree, thanks for taking a stab at it!

I can take a look at tail call support in redline too if thats of interest?

You inspired me, I'm on it! Thanks for offering help!

if I have some other small changes I'd like to run by you/possibly upstream do you have a preferred method should I create issues first or just send prs?

I'm happy to directly review PRs for small, self-contained improvements; when there are tradeoffs to be discussed, it is probably better to do a pre-flight check in an Issue or on Zulip.

@wendigo:

large wasm method that AOT couldn't handle and I guess the same would happen with cranelift.

I'm very happy to say that this is not correct 🙂. Sidestepping the problems with large functions on the JVM is the very reason why redline exists.
Functions are compiled to native code with Cranelift, and the generated ASM doesn't run into the limitations of the JVM.

Perf-wise, the closest comparison is probably with quickjs4j, where I get a ~4X performance improvement but, as said earlier, it very much depends on the WASM payload "shape".
To use redline in prod (to the best of my understanding) you need interruption support. Expect it soon, I'm on it.

@wendigo

wendigo commented Apr 29, 2026

@andreaTP with the interruption support we could try it ;-)

@zklapow
Contributor Author

zklapow commented Apr 29, 2026

I'll join Zulip, happy to chat more there. I can package up some of my python work for publishing; that had been on my list anyhow. I have a pretty good setup now with numpy/pandas/pydantic and all their static libs built in too. I might publish under our (HubSpot) org if I do that, though.

I'm very happy to say that this is not correct

Yeah, I can confirm the native approach using cranelift immediately works out to be better than all of these optimizations to AOT, even without the tail-call interpreter. For me this worked out to about a 1.8-2x improvement, not quite 4x, but still pretty good. I suspect cranelift with tail calls would be even better.

@andreaTP
Collaborator

Before I forget, I took a stab at a generic cpython integration. It's basically just experiments, but I think it's worth sharing on this thread: https://github.com/andreaTP/cpython4j-poc

I was not happy with the poc, for multiple reasons, and I dropped the effort, but I'm happy to support someone else and see where it goes!

1.8-2x improvement

I'll take it! 🙂 Working on tail call and interrupts in redline, I aim to release with those by EOW.

@zklapow
Contributor Author

zklapow commented Apr 29, 2026

Updated so this doesn't touch existing bytecode, and added a specific approval test for tail calls. I ended up with a private static class for TailCallPending instead of a record because chicory seems to still target Java 11 and I didn't want to change that.
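
For context, records only became a standard language feature in Java 16, so a Java 11 target needs a hand-written class. The following is a hedged sketch of what such a stand-in could look like, not the class that was merged. `TailCallStateSketch`, `requestTailCall`, and `takePending` are hypothetical names, and the nested class is package-private here (rather than private, as in the PR) only so that the usage below compiles.

```java
// Illustrative sketch only; the class actually merged in the PR may differ.
final class TailCallStateSketch {
    // Java 11-compatible stand-in for:
    //   record TailCallPending(int funcId, long[] args) {}
    static final class TailCallPending {
        private final int funcId;
        private final long[] args;

        TailCallPending(int funcId, long[] args) {
            this.funcId = funcId;
            this.args = args;
        }

        // Accessors named like the ones a record would generate.
        int funcId() {
            return funcId;
        }

        long[] args() {
            return args;
        }
    }

    private TailCallPending tailCallPending; // null when not set

    void requestTailCall(int funcId, long[] args) {
        tailCallPending = new TailCallPending(funcId, args);
    }

    // Returns the pending call (or null if none) and clears the slot in one step.
    TailCallPending takePending() {
        TailCallPending pending = tailCallPending;
        tailCallPending = null;
        return pending;
    }
}
```

Compared with three separate fields, a single nullable reference cannot end up in a half-updated state such as a pending flag that is set while the argument array is stale.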

@andreaTP andreaTP force-pushed the aot-tail-call-upstream branch from 18328e9 to 81aacd2 Compare April 30, 2026 13:36
@andreaTP
Collaborator

andreaTP commented Apr 30, 2026

@zklapow I took the liberty of directly editing the PR instead of adding comments, in the spirit of expediting it. I hope that is welcome.
The most meaningful change is this one; when CI is green we have a good indication that everything is in order to proceed.

…call spec tests

- Add Java pseudocode comment to compileMachineCallWithTailCalls explaining
  the trampoline loop pattern
- Add comment to RETURN_CALL emitter explaining callee-side behavior
- Restore accidentally deleted comments (// must be power of two, stack layout)
- Remove unused TailCallException class (leftover from earlier approach)
- Fix InterpreterFallbackTest approvals (lambda$0 -> lambda$function$0 due
  to new methods added to Instance.java)
- Enable 25 previously-excluded tail-call spec tests that now pass with the
  trampoline implementation (deep recursion + exception handling with tail calls)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andreaTP andreaTP force-pushed the aot-tail-call-upstream branch from 81aacd2 to 6e4c958 Compare April 30, 2026 14:47
@zklapow
Contributor Author

zklapow commented May 1, 2026

no worries, thx for doing that.

Collaborator

@andreaTP andreaTP left a comment

LGTM, thanks for the excellent contribution @zklapow

@andreaTP andreaTP merged commit 1303ad0 into dylibso:main May 5, 2026
25 of 29 checks passed
@andreaTP
Collaborator

andreaTP commented May 5, 2026

Thanks!

@zklapow
Contributor Author

zklapow commented May 15, 2026

@wendigo also, I meant to follow up: it's not quite ready for prime time, but I did publish https://github.com/HubSpot/boomslang if it's of interest. It contains the full optimized cpython build, plus a very small extension system that auto-generates some of the Java host side to allow python to call into Java.