inferrs

A TurboQuant LLM inference server.

Why inferrs?

Most LLM serving stacks force a trade-off between features and resource usage. inferrs targets both:

	inferrs	vLLM	llama.cpp
Language	Rust	Python/C++	C/C++
Streaming (SSE)	✓	✓	✓
KV cache management	TurboQuant, Per-context alloc, PagedAttention	PagedAttention	Per-context alloc
Memory friendly	✓ — lightweight	✗ — claims most GPU memory	✓ — lightweight
Binary footprint	Single binary	Python environment + deps	Single binary

Features

OpenAI-compatible API — /v1/completions, /v1/chat/completions, /v1/models, /health
Anthropic-compatible API — /v1/messages (streaming and non-streaming)
Ollama-compatible API — /api/generate, /api/chat, /api/tags, /api/ps, /api/show, /api/version
Hardware backends — CUDA, ROCm, Metal, Hexagon, OpenVino, MUSA, CANN, Vulkan and CPU

Quick start

Install

macOS / Linux

brew tap ericcurtin/inferrs
brew install inferrs

Windows

scoop bucket add inferrs https://github.com/ericcurtin/scoop-inferrs
scoop install inferrs

Run

inferrs run google/gemma-4-E2B-it

Serve

Serve a specific model (OpenAI/Anthropic/Ollama API on port 8080)

inferrs serve google/gemma-4-E2B-it

Serve a specific model vLLM-style

inferrs serve --paged-attention google/gemma-4-E2B-it

Serve a specific model llama.cpp-style

inferrs serve --quantize google/gemma-4-E2B-it

Serve without a model (Ollama-compatible mode on port 11434)

inferrs serve

This behaves like ollama serve: the server starts on 0.0.0.0:11434, responds "Ollama is running" at GET /, and exposes the full Ollama API. Any Ollama client — including the ollama CLI — can point at it directly.

Architecture

┌─────────┐      HTTP       ┌────────┐  channel  ┌────────┐
│  Client │ ──────────────▶ │ Server │ ────────▶ │ Engine │
└─────────┘  (axum + SSE)   └────────┘           └────────┘
                                                     │
                               ┌──────────┬──────────┼──────────┐
                               ▼          ▼          ▼          ▼
                          Scheduler    Transformer  KV Cache  Sampler

Name		Name	Last commit message	Last commit date
Latest commit History 283 Commits
.cargo		.cargo
.gemini		.gemini
.github/workflows		.github/workflows
backends		backends
candle-core		candle-core
candle-kernels		candle-kernels
candle-metal-kernels		candle-metal-kernels
candle-nn		candle-nn
inferrs-benchmark		inferrs-benchmark
inferrs		inferrs
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

inferrs

Why inferrs?

Features

Quick start

Install

Run

Serve

Serve a specific model (OpenAI/Anthropic/Ollama API on port 8080)

Serve a specific model vLLM-style

Serve a specific model llama.cpp-style

Serve without a model (Ollama-compatible mode on port 11434)

Architecture

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

inferrs

Why inferrs?

Features

Quick start

Install

Run

Serve

Serve a specific model (OpenAI/Anthropic/Ollama API on port 8080)

Serve a specific model vLLM-style

Serve a specific model llama.cpp-style

Serve without a model (Ollama-compatible mode on port 11434)

Architecture

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages