Minimalist, deterministic OpenAI-compatible stub for LLM infrastructure testing.
inference-stub is a specialized tool that simulates LLM inference streams. By providing a predictable, programmable backend, it enables isolated performance analysis of AI gateways and proxy layers.

Unlike a real LLM, the stub removes inference variability, making it possible to measure the precise overhead of the networking stack, i.e. time to first token (TTFT) and time per output token (TPOT), in cloud-native environments. It supports both streaming and non-streaming requests, returning dynamically generated Lorem Ipsum text based on configurable parameters.
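Because the stub's delays are fixed, the gateway's added overhead is just the difference between measured and configured latencies. A minimal sketch of that measurement, given per-token arrival timestamps (the helper name is illustrative, not part of inference-stub):

```python
# Compute TTFT and mean TPOT from token arrival timestamps.
# `t_request` is when the request was sent; `t_tokens` are the
# arrival times of each streamed token, all in seconds.

def stream_latency(t_request: float, t_tokens: list[float]) -> tuple[float, float]:
    """Return (ttft, mean_tpot) in seconds."""
    if not t_tokens:
        raise ValueError("no tokens received")
    ttft = t_tokens[0] - t_request  # time to first token
    if len(t_tokens) == 1:
        return ttft, 0.0
    # mean gap between consecutive tokens
    gaps = [b - a for a, b in zip(t_tokens, t_tokens[1:])]
    return ttft, sum(gaps) / len(gaps)

# With the stub configured as --ttft 100ms --tpot 20ms, anything
# above 0.1s / 0.02s here is overhead added by the path under test.
ttft, tpot = stream_latency(0.0, [0.100, 0.120, 0.140, 0.160])
print(round(ttft, 3), round(tpot, 3))  # → 0.1 0.02
```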
- Gateway Benchmarking: Isolate proxy latency by using deterministic TTFT/TPOT settings.
- Protocol Validation: Ensure Gateway-level filters (Rate Limiting, Usage Tracking) behave correctly against standard OpenAI-compatible JSON responses and SSE streams.
- CI/CD Integration: Provide a lightweight, zero-cost alternative to real LLMs for automated integration tests.
```shell
# Build the binary
make build

# Run the stub with 100ms TTFT, 20ms TPOT, and a fixed payload length of 15 words
./bin/inference-stub --ttft 100ms --tpot 20ms --length 15 --port 8080
```

- `--port` (default `8080`): The port to listen on.
- `--ttft` (default `100ms`): Time to first token. Simulates the initial processing delay.
- `--tpot` (default `20ms`): Time per output token. Simulates the delay between generation steps.
- `--length` (default `50`): The exact number of Lorem Ipsum words to generate in the mock response.
- `--timeout` (default `1m0s`): Timeout for requests.
- `--debug` (default `false`): Enable debug logging.
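A client driving the stub through a gateway typically reads the SSE stream line by line and extracts the delta content from each `data:` event. The sketch below parses OpenAI-style stream lines; the sample lines are illustrative and assume the response follows the standard OpenAI chat-completions streaming shape:

```python
import json

def parse_sse_line(line: str):
    """Parse one SSE line from an OpenAI-style stream.

    Returns the decoded JSON chunk, None for non-data lines,
    or the sentinel string "DONE" at end of stream.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return "DONE"
    return json.loads(payload)

# Illustrative lines shaped like an OpenAI-compatible stream.
sample = [
    'data: {"choices":[{"delta":{"content":"Lorem"}}]}',
    'data: {"choices":[{"delta":{"content":" ipsum"}}]}',
    "data: [DONE]",
]
tokens = []
for line in sample:
    chunk = parse_sse_line(line)
    if chunk in (None, "DONE"):
        continue
    tokens.append(chunk["choices"][0]["delta"].get("content", ""))

print("".join(tokens))  # → Lorem ipsum
```

In a real benchmark the lines would come from an HTTP response to the stub's chat-completions endpoint rather than a hardcoded list.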
You can deploy inference-stub directly to your Kubernetes cluster using the provided Helm chart. The chart is published automatically to GHCR on release.
```shell
# from GHCR
helm upgrade -i inference-stub oci://ghcr.io/robin-vidal/charts/inference-stub \
  --version 0.2.0 \
  --namespace inference-stub --create-namespace \
  --set stubConfig.ttft=100ms \
  --set stubConfig.tpot=20ms \
  --set stubConfig.length=15
```

Alternatively, you can install it directly from the local source tree if you are developing:
```shell
helm upgrade -i inference-stub charts/inference-stub \
  --namespace inference-stub --create-namespace
```

Planned features:

- Error Injection: Support for simulating `429 Too Many Requests` and `503 Service Unavailable` responses.
- Usage Reporting: Implementation of the `usage` field in the final stream chunk for quota testing.
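In the OpenAI streaming format, usage is reported in a final chunk whose `choices` array is empty and which carries a `usage` object. A quota-testing client could read it as sketched below; the chunk shape follows OpenAI's published format, not code that exists in the stub yet:

```python
import json

# Illustrative final stream chunk carrying the usage object
# (shape follows the OpenAI chat-completions streaming format).
final_chunk = json.loads(
    '{"choices":[],"usage":{"prompt_tokens":8,'
    '"completion_tokens":15,"total_tokens":23}}'
)

usage = final_chunk.get("usage")
if usage is not None and not final_chunk["choices"]:
    # A quota-tracking gateway filter would account these
    # counts per tenant before forwarding the chunk.
    print(usage["total_tokens"])  # → 23
```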
Developed for the GSoC 2026 - kgateway Performance Benchmarking project.