inference-stub

Minimalist, deterministic OpenAI-compatible stub for LLM infrastructure testing.


Overview

inference-stub is a purpose-built tool that simulates LLM inference streams. By providing a predictable, programmable backend, it enables isolated performance analysis of AI Gateways and Proxy layers.

Unlike a real LLM, the stub removes inference variability, making it possible to measure the precise overhead the networking stack adds to time to first token (TTFT) and time per output token (TPOT) in Cloud-Native environments. It supports both streaming and non-streaming requests, returning dynamically generated Lorem Ipsum text based on configurable parameters.

Current Focus

  • Gateway Benchmarking: Isolate proxy latency by using deterministic TTFT/TPOT settings.
  • Protocol Validation: Ensure Gateway-level filters (Rate Limiting, Usage Tracking) behave correctly against standard OpenAI-compatible JSON responses and SSE streams, as exercised in the streaming example below.
  • CI/CD Integration: Provide a lightweight, zero-cost alternative to real LLMs for automated integration tests.
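
As a concrete example, the SSE path can be exercised with plain curl once the stub is running (see Getting Started below). The /v1/chat/completions route here is an assumption based on the OpenAI compatibility claim; check the source for the routes the stub actually registers.

# Hypothetical streaming request; -N disables curl's output buffering
# so SSE chunks print as they arrive
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "stub", "messages": [{"role": "user", "content": "ping"}], "stream": true}'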

Getting Started

# Build the binary
make build

# Run the stub with 100ms TTFT, 20ms TPOT, and fixed payload length of 15 words
./bin/inference-stub --ttft 100ms --tpot 20ms --length 15 --port 8080
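
With the stub running, a quick smoke test confirms the non-streaming path (assuming the same /v1/chat/completions route as above):

# Non-streaming request; expect an OpenAI-style chat.completion JSON body
# containing 15 Lorem Ipsum words
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "stub", "messages": [{"role": "user", "content": "ping"}]}'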

Configuration Flags

  • --port (default 8080): The port to listen on.
  • --ttft (default 100ms): Time to first token. Simulates the initial processing delay.
  • --tpot (default 20ms): Time per output token. Simulates the delay between generation steps.
  • --length (default 50): The exact number of Lorem Ipsum words to generate in the mock response.
  • --timeout (default 1m0s): Timeout for requests.
  • --debug (default false): Enable debug logging.
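
As a client-side sanity check on the configured delays, curl's built-in timing variables give a rough approximation of TTFT. This measures from the client and therefore includes network overhead; it is not the stub's internal timing, and it again assumes the /v1/chat/completions route.

# time_starttransfer is the time in seconds until the first response byte,
# which for a streaming request approximates TTFT
curl -s -o /dev/null \
  -w "ttft=%{time_starttransfer}s total=%{time_total}s\n" \
  http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "stub", "messages": [{"role": "user", "content": "ping"}], "stream": true}'

With the Getting Started settings (--ttft 100ms, --tpot 20ms, --length 15), ttft should land just above 0.1s on loopback, and total roughly 0.3s higher, assuming the stub emits one word per chunk.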

Deploying with Helm

You can deploy inference-stub directly to your Kubernetes cluster using the provided Helm chart. The chart is published automatically to GHCR on release.

# Install the chart from GHCR
helm upgrade -i inference-stub oci://ghcr.io/robin-vidal/charts/inference-stub \
  --version 0.2.0 \
  --namespace inference-stub --create-namespace \
  --set stubConfig.ttft=100ms \
  --set stubConfig.tpot=20ms \
  --set stubConfig.length=15

Alternatively, you can install it directly from the local source tree if you are developing:

helm upgrade -i inference-stub charts/inference-stub \
  --namespace inference-stub --create-namespace
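
Once the chart is installed, a quick smoke test can be run through a port-forward. The Service name and port below are assumptions based on common chart defaults; helm get manifest inference-stub -n inference-stub shows the actual rendered resources.

# Forward the (assumed) inference-stub Service locally, then probe it
kubectl -n inference-stub port-forward svc/inference-stub 8080:8080 &
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "stub", "messages": [{"role": "user", "content": "ping"}]}'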

Roadmap

  • Error Injection: Support for simulating 429 Too Many Requests and 503 Service Unavailable.
  • Usage Reporting: Implement the usage field in the final stream chunk to support quota testing.

Developed for the GSoC 2026 - kgateway Performance Benchmarking project.
