Minimalist, deterministic OpenAI-compatible stub for LLM infrastructure testing.
inference-stub is a specialized tool that simulates LLM inference streams. By providing a predictable, programmable backend, it enables isolated performance analysis of AI gateways and proxy layers.

Unlike a real LLM, the stub removes inference variability, making it possible to measure the precise overhead of the networking stack, i.e. time to first token (TTFT) and time per output token (TPOT), in cloud-native environments. It supports both streaming and non-streaming requests, returning dynamically generated Lorem Ipsum text based on configurable parameters.
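Because the stub's delays are fixed, the gateway's added overhead is just the difference between measured and configured latencies. A minimal sketch of that measurement, given per-token arrival timestamps (the helper name is illustrative, not part of inference-stub):

```python
# Compute TTFT and mean TPOT from token arrival timestamps.
# `t_request` is when the request was sent; `t_tokens` are the
# arrival times of each streamed token, all in seconds.

def stream_latency(t_request: float, t_tokens: list[float]) -> tuple[float, float]:
    """Return (ttft, mean_tpot) in seconds."""
    if not t_tokens:
        raise ValueError("no tokens received")
    ttft = t_tokens[0] - t_request  # time to first token
    if len(t_tokens) == 1:
        return ttft, 0.0
    # mean gap between consecutive tokens
    gaps = [b - a for a, b in zip(t_tokens, t_tokens[1:])]
    return ttft, sum(gaps) / len(gaps)

# With the stub configured as --ttft 100ms --tpot 20ms, anything
# above 0.1s / 0.02s here is overhead added by the path under test.
ttft, tpot = stream_latency(0.0, [0.100, 0.120, 0.140, 0.160])
print(round(ttft, 3), round(tpot, 3))  # → 0.1 0.02
```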
- Gateway Benchmarking: Isolate proxy latency by using deterministic TTFT/TPOT settings.
- Protocol Validation: Ensure Gateway-level filters (Rate Limiting, Usage Tracking) behave correctly against standard OpenAI-compatible JSON responses and SSE streams.
- CI/CD Integration: Provide a lightweight, zero-cost alternative to real LLMs for automated integration tests.
```shell
# Build the binary
make build

# Run the stub with 100ms TTFT, 20ms TPOT, and a fixed payload length of 15 words
./bin/inference-stub --ttft 100ms --tpot 20ms --length 15 --port 8080
```

- `--port` (default `8080`): The port to listen on.
- `--ttft` (default `100ms`): Time to first token. Simulates the initial processing delay.
- `--tpot` (default `20ms`): Time per output token. Simulates the delay between generation steps.
- `--length` (default `50`): The exact number of Lorem Ipsum words to generate in the mock response.
- `--timeout` (default `1m0s`): Timeout for requests.
- `--debug` (default `false`): Enable debug logging.
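A client driving the stub through a gateway typically reads the SSE stream line by line and extracts the delta content from each `data:` event. The sketch below parses OpenAI-style stream lines; the sample lines are illustrative and assume the response follows the standard OpenAI chat-completions streaming shape:

```python
import json

def parse_sse_line(line: str):
    """Parse one SSE line from an OpenAI-style stream.

    Returns the decoded JSON chunk, None for non-data lines,
    or the sentinel string "DONE" at end of stream.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return "DONE"
    return json.loads(payload)

# Illustrative lines shaped like an OpenAI-compatible stream.
sample = [
    'data: {"choices":[{"delta":{"content":"Lorem"}}]}',
    'data: {"choices":[{"delta":{"content":" ipsum"}}]}',
    "data: [DONE]",
]
tokens = []
for line in sample:
    chunk = parse_sse_line(line)
    if chunk in (None, "DONE"):
        continue
    tokens.append(chunk["choices"][0]["delta"].get("content", ""))

print("".join(tokens))  # → Lorem ipsum
```

In a real benchmark the lines would come from an HTTP response to the stub's chat-completions endpoint rather than a hardcoded list.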
You can deploy inference-stub directly to your Kubernetes cluster using the provided Helm chart. The chart is published automatically to GHCR on release.
```shell
# from GHCR
helm upgrade -i inference-stub oci://ghcr.io/robin-vidal/charts/inference-stub \
  --version 0.2.0 \
  --namespace inference-stub --create-namespace \
  --set stubConfig.ttft=100ms \
  --set stubConfig.tpot=20ms \
  --set stubConfig.length=15
```

Alternatively, you can install it directly from the local source tree if you are developing:
```shell
helm upgrade -i inference-stub charts/inference-stub \
  --namespace inference-stub --create-namespace
```

Planned features:

- Error Injection: Support for simulating `429 Too Many Requests` and `503 Service Unavailable` responses.
- Usage Reporting: Implementation of the `usage` field in the final stream chunk for quota testing.
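In the OpenAI streaming format, usage is reported in a final chunk whose `choices` array is empty and which carries a `usage` object. A quota-testing client could read it as sketched below; the chunk shape follows OpenAI's published format, not code that exists in the stub yet:

```python
import json

# Illustrative final stream chunk carrying the usage object
# (shape follows the OpenAI chat-completions streaming format).
final_chunk = json.loads(
    '{"choices":[],"usage":{"prompt_tokens":8,'
    '"completion_tokens":15,"total_tokens":23}}'
)

usage = final_chunk.get("usage")
if usage is not None and not final_chunk["choices"]:
    # A quota-tracking gateway filter would account these
    # counts per tenant before forwarding the chunk.
    print(usage["total_tokens"])  # → 23
```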
Developed for the GSoC 2026 - kgateway Performance Benchmarking project.