Thanks for landing this in 0.7.0. I tested it on a Jetson AGX Thor with nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4, and single-stream throughput was excellent (~1.8× the same model under vLLM); load times are also much faster after the first run.
Two observations from the quick tests I ran:
The server handles one request at a time. Sending two concurrently triggers an error, and the process needs a restart to recover.
Responses seem to return plain text only. The OpenAI-standard tool_calls field doesn't appear to be populated yet; I rely on it for agentic workflows.
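For context, here's a minimal sketch of the kind of request that surfaces both observations, against an OpenAI-compatible /v1/chat/completions endpoint. The base URL, port, and get_time tool are placeholders I'm using for illustration, not part of my actual setup:

```python
# Repro sketch for both observations against an OpenAI-compatible endpoint.
# BASE_URL and the get_time tool are placeholders.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000/v1"  # placeholder, adjust to your server
MODEL = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4"

# A trivial tool definition; with this present, the response message should
# carry a tool_calls entry rather than (or alongside) plain text content.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Return the current time",
        "parameters": {"type": "object", "properties": {}},
    },
}]

def chat(prompt):
    """Send one chat completion request and return the response message."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
    }).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]

if __name__ == "__main__":
    # Observation 1: two concurrent requests -> one errors, and the server
    # stays wedged until restarted.
    with ThreadPoolExecutor(max_workers=2) as pool:
        msgs = list(pool.map(chat, ["What time is it?"] * 2))
    # Observation 2: tool_calls never seems to be populated, only content.
    for msg in msgs:
        print("tool_calls:", msg.get("tool_calls"))
```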
Let me know if I might be missing something on my end, or if these are known issues with fixes planned. I'm happy to share my full setup details if useful.