One command to rule all models.
Zero-config local LLM server with an OpenAI-compatible API.
LLM Gateway (llmgw) is a single-binary tool that lets you run any GGUF model from HuggingFace locally with one command — no Docker, no Python, no YAML configs, no fuss.
```bash
llmgw run tinyllama
```

That's it. llmgw downloads the model, sets up the inference engine, and gives you a fully OpenAI-compatible REST API at http://localhost:8080/v1.
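Once the model is loaded, you can sanity-check the server from any HTTP client. A minimal sketch using only the Python standard library, assuming the default port 8080 (the `/health` and `/v1/models` endpoints are listed in the API table below):

```python
import json
import urllib.request

# Liveness check against the local gateway (default port assumed).
with urllib.request.urlopen("http://localhost:8080/health") as resp:
    print("health:", resp.status)

# List the models the server currently exposes.
with urllib.request.urlopen("http://localhost:8080/v1/models") as resp:
    print(json.loads(resp.read().decode()))
```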
- 🚀 One-command setup — Just `llmgw run <model>` and you're live
- 🤗 HuggingFace integration — Auto-downloads GGUF models from the Hub
- 🔌 OpenAI-compatible API — Drop-in replacement for `api.openai.com`
- 📦 Zero dependencies — Single binary, no Python/Docker/CUDA install needed
- 🧠 Smart quantization — Auto-selects the optimal GGUF variant (Q4_K_M)
- 🏷️ Built-in aliases — `tinyllama`, `mistral`, `codellama`, `phi2`, and more
- 💾 Model caching — Downloads once, serves forever
- 🌐 CORS enabled — Use from any web app out of the box
- 🔄 Streaming support — Real-time token-by-token responses
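Streaming uses the standard OpenAI wire format, so existing clients need no changes beyond `stream=True`. A minimal sketch with the official `openai` Python package, assuming the server is running on the default port with the `tinyllama` alias loaded:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local gateway; no real API key is needed.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# stream=True returns an iterator of chunks as tokens are generated.
stream = client.chat.completions.create(
    model="tinyllama",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```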
```bash
# Build from source
git clone https://github.com/kacf/llm-gateway.git
cd llm-gateway
go build -o llmgw .

# Or on Windows
go build -o llmgw.exe .
```

```bash
# Using a built-in alias
llmgw run tinyllama

# Using a HuggingFace repo ID
llmgw run TheBloke/Mistral-7B-Instruct-v0.2-GGUF

# With custom options
llmgw run mistral -port 9000 -context 8192 -quant Q5_K_M
```

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    "messages": [{"role": "user", "content": "Explain quantum computing in one sentence."}]
  }'
```

Works with any OpenAI-compatible client:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="tinyllama",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
```

| Command | Description |
|---|---|
| `llmgw run <model>` | Download & serve a model |
| `llmgw search <query>` | Search HuggingFace for GGUF models |
| `llmgw list` | List locally cached models |
| `llmgw remove <model>` | Remove a cached model |
| `llmgw aliases` | Show built-in model aliases |
| `llmgw version` | Print version |
| Flag | Default | Description |
|---|---|---|
| `-port` | `8080` | API server port |
| `-context` | `4096` | Context window size (tokens) |
| `-quant` | `auto` | Preferred quantization (e.g. `Q4_K_M`) |
| `-verbose` | `false` | Show backend logs |
| `-token` | `$HF_TOKEN` | HuggingFace API token |
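These flags only change the server side; clients simply point at whatever port was chosen. For example, building on the custom-options command shown earlier (`-port 9000`), a client sketch would look like this, assuming the alias is accepted as the model name as in the earlier example:

```python
from openai import OpenAI

# Matches a server started with: llmgw run mistral -port 9000 -context 8192 -quant Q5_K_M
client = OpenAI(base_url="http://localhost:9000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "In one line, what does a larger context window buy me?"}],
)
print(response.choices[0].message.content)
```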
| Method | Endpoint | Description |
|---|---|---|
| POST | `/v1/chat/completions` | Chat completion (ChatGPT-style) |
| POST | `/v1/completions` | Text completion |
| GET | `/v1/models` | List available models |
| GET | `/health` | Health check |
| GET | `/` | Server info |
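The chat endpoint is shown above; `/v1/completions` accepts plain-prompt requests in the same way. A minimal sketch with the same Python client (parameters such as `max_tokens` follow the OpenAI completions spec and are assumed to be honored by the backend):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# POST /v1/completions: raw prompt in, continuation text out.
completion = client.completions.create(
    model="tinyllama",
    prompt="The capital of France is",
    max_tokens=32,
)
print(completion.choices[0].text)
```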
| Alias | HuggingFace Repo |
|---|---|
| `tinyllama` | TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF |
| `mistral` | TheBloke/Mistral-7B-Instruct-v0.2-GGUF |
| `llama2` | TheBloke/Llama-2-7B-Chat-GGUF |
| `codellama` | TheBloke/CodeLlama-7B-Instruct-GGUF |
| `phi2` | TheBloke/phi-2-GGUF |
| `zephyr` | TheBloke/zephyr-7B-beta-GGUF |
| `deepseek` | TheBloke/deepseek-coder-6.7B-instruct-GGUF |
| ...and more | Run `llmgw aliases` for the full list |
```
llmgw run tinyllama
│
├─ 1. Resolve alias → TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
├─ 2. Check local cache (~/.llmgw/models/)
├─ 3. Download GGUF from HuggingFace (if needed)
├─ 4. Download llama.cpp server (if needed)
├─ 5. Start inference engine on internal port
├─ 6. Expose OpenAI-compatible API on :8080
└─ 7. Ready! 🚀
```
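The real pipeline lives inside the Go binary; the sketch below is only an illustrative Python rendering of steps 1 and 2 (alias resolution and the cache lookup), with the alias map taken from the table above and the on-disk layout under `~/.llmgw/models/` treated as an assumption:

```python
from pathlib import Path

# Illustrative only: not the gateway's actual implementation.
ALIASES = {  # abbreviated copy of the alias table above
    "tinyllama": "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    "mistral": "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
}
CACHE_DIR = Path.home() / ".llmgw" / "models"


def resolve(model: str) -> str:
    """Step 1: map a built-in alias to a HuggingFace repo ID, or pass a repo ID through."""
    return ALIASES.get(model, model)


def cached_ggufs(repo_id: str) -> list[Path]:
    """Step 2: look for GGUF files already downloaded for this repo (directory layout assumed)."""
    repo_dir = CACHE_DIR / repo_id.replace("/", "_")
    return sorted(repo_dir.glob("*.gguf")) if repo_dir.is_dir() else []


repo = resolve("tinyllama")
files = cached_ggufs(repo)
print(repo, "->", files if files else "not cached, would download from HuggingFace")
```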
- Go 1.21+ to build
- 4GB+ RAM for small models (TinyLlama, Phi-2)
- 16GB+ RAM for 7B models (Mistral, Llama 2)
- Internet connection for first download
MIT