Universal gateway for intercepting cloud LLM API calls and routing them to local models running in Kubernetes.
Local LLM Gateway enables you to:
- 🔄 Intercept AWS Bedrock, OpenAI, and Anthropic API calls
- 🏠 Route requests to local models (vLLM, Ollama)
- 💰 Save costs by running models on your own infrastructure
- 🔒 Keep data private - no external API calls
- ⚡ Scale horizontally with Kubernetes
- Kubernetes 1.28+
- Helm 3.13+
- GPU nodes (for vLLM) or CPU nodes (for Ollama)
# Install with Helm
helm install local-llm-gateway \
oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
--version 0.1.0 \
--namespace local-llm-gateway \
--create-namespace

# Verify the installation
kubectl get pods -n local-llm-gateway
kubectl get svc -n local-llm-gateway

Local LLM Gateway acts as a transparent proxy between cloud LLM SDKs and local model runners, enabling cost savings and data privacy without code changes.
┌─────────────────────────────────────────────────────────────────┐
│                       Client Application                        │
│  (Existing code using AWS Bedrock, OpenAI, or Anthropic SDKs)   │
└─────────────────────────────────────────────────────────────────┘
                 │                             ▲
                 │ Provider-specific           │ Provider-specific
                 │ request format              │ response format
                 ▼                             │
┌─────────────────────────────────────────────────────────────────┐
│                     Local LLM Gateway Proxy                     │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐      │
│   │   Bedrock    │    │    OpenAI    │    │  Anthropic   │      │
│   │   Adapter    │    │   Adapter    │    │   Adapter    │      │
│   └──────────────┘    └──────────────┘    └──────────────┘      │
│          │                   │                   │              │
│          │ Standard          │ Standard          │              │
│          │ format            │ format            │              │
│          ▼                   ▼                   ▼              │
│                       Runner Orchestrator                       │
│               (Maps cloud models → local runners)               │
│                │                       ▲                        │
│                │ Standard              │ Standard               │
│                │ request               │ response               │
└────────────────┼───────────────────────┼────────────────────────┘
                 │                       │
         ┌───────┴────────────┐         │
         │                    │         │
         ▼                    ▼         │
┌──────────────────┐  ┌──────────────────┐
│   vLLM Runner    │  │  Ollama Runner   │
│ (GPU-optimized)  │  │  (CPU-friendly)  │
│   ModelRunner    │  │   ModelRunner    │
└──────────────────┘  └──────────────────┘
         │                    │
         │ vLLM API           │ Ollama API
         ▼                    ▼
┌──────────────────┐  ┌──────────────────┐
│    Llama 3 8B    │  │    Phi-3 Mini    │
│    Mistral 7B    │  │     Gemma 2B     │
└──────────────────┘  └──────────────────┘
         │                    │
         │ Inference          │ Inference
         │ results            │ results
         └──────────┬─────────┘
                    │
                    ▼
          (Results flow back up
           through the same path)
Components:
- Gateway: Translates cloud API requests into a standard internal format
- Adapters: Support for Bedrock, OpenAI, and Anthropic request/response formats (see the sketch below)
- Runners: Execute models (vLLM, Ollama)
- Model Mappings: Route requests to appropriate models
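
The translation layer is easiest to picture as a small set of interfaces. The sketch below is purely illustrative — the real definitions live under proxy/src/adapters and proxy/src/types, and every name here is hypothetical — but it shows how provider-specific payloads are normalized before a runner is chosen:

```typescript
// Illustrative sketch only — names and shapes are hypothetical,
// not the actual exports of proxy/src/types.

// Provider-neutral request/response shapes that adapters normalize into.
interface StandardRequest {
  model: string;                 // cloud model id exactly as the client sent it
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  maxTokens?: number;
  temperature?: number;
}

interface StandardResponse {
  content: string;
  usage?: { inputTokens: number; outputTokens: number };
}

// One adapter per provider translates to and from the standard shapes.
interface ProviderAdapter {
  provider: "bedrock" | "openai" | "anthropic";
  toStandard(rawRequestBody: unknown): StandardRequest;
  fromStandard(response: StandardResponse): unknown;
}

// The orchestrator maps cloud model ids to local runners and invokes them.
interface RunnerOrchestrator {
  resolveRunner(cloudModelId: string): string;   // e.g. "gpt-4" -> "llama3-8b"
  invoke(runner: string, request: StandardRequest): Promise<StandardResponse>;
}
```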
- ✅ AWS Bedrock (Anthropic Claude, Meta Llama, Amazon Titan, Cohere, AI21, Mistral)
- ✅ OpenAI (GPT-4, GPT-3.5)
- ✅ Anthropic (Claude 3.5 Sonnet, Claude 3 Opus)
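
Because the gateway speaks these providers' wire formats, existing SDK code can usually be pointed at it with only an endpoint change. A minimal sketch using the Anthropic Node SDK, assuming the gateway exposes Anthropic-compatible routes at its root; the base URL is a placeholder for your Service address or a port-forward, and the API key is never sent to Anthropic:

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Point the existing SDK at the gateway instead of api.anthropic.com.
// The URL is a placeholder for your in-cluster Service or port-forward.
const client = new Anthropic({
  baseURL: "http://local-llm-gateway.local-llm-gateway.svc:8080",
  apiKey: "not-used-by-local-models",
});

const message = await client.messages.create({
  // Any Claude model id that your mappings route to a local runner.
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 256,
  messages: [{ role: "user", content: "Hello from the local gateway!" }],
});

console.log(message.content);
```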
- ✅ vLLM - High-performance inference (GPU-optimized)
- ✅ Ollama - CPU-friendly, easy local development
- 🔄 TGI (coming soon)
- 📊 Prometheus metrics
- 🔍 Service mesh support (Istio, Linkerd)
- 🔄 Horizontal Pod Autoscaling
- 🎯 Model routing and fallback
- 🔒 mTLS via service mesh
# Production setup
helm install local-llm-gateway \
oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
--version 0.1.0 \
-f examples/values-production.yaml

# Development setup
helm install local-llm-gateway \
oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
--version 0.1.0 \
-f examples/values-dev.yaml

# custom-values.yaml
gateway:
  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10

runners:
  llama3-8b:
    enabled: true
    replicas: 2
    resources:
      limits:
        nvidia.com/gpu: 1

mappings:
  bedrock:
    'anthropic.claude-3-sonnet-20240229-v1:0': llama3-8b
  openai:
    'gpt-4': llama3-8b

# Install with custom values
helm install local-llm-gateway \
oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
--version 0.1.0 \
-f custom-values.yaml

See the examples/ directory for:
- values-dev.yaml - Development setup (minimal resources)
- values-staging.yaml - Staging environment
- values-production.yaml - Production setup (HA, autoscaling)
- values-gpu-production.yaml - GPU-optimized production
- values-cpu-only.yaml - CPU-only deployment (no GPU)
- values-eks-production.yaml - AWS EKS-specific configuration
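
With a mapping like the custom-values.yaml above in place, code that already calls Bedrock keeps working unchanged apart from the endpoint, and the gateway serves the request from llama3-8b. A hedged sketch with the AWS SDK for JavaScript v3; the endpoint, region, and dummy credentials are placeholders:

```typescript
import {
  BedrockRuntimeClient,
  InvokeModelCommand,
} from "@aws-sdk/client-bedrock-runtime";

// Override the endpoint so requests go to the gateway instead of AWS.
// Credentials are dummies; they only satisfy the SDK's request signing.
const client = new BedrockRuntimeClient({
  region: "us-east-1",
  endpoint: "http://local-llm-gateway.local-llm-gateway.svc:8080",
  credentials: { accessKeyId: "local", secretAccessKey: "local" },
});

const response = await client.send(
  new InvokeModelCommand({
    // Routed to the local llama3-8b runner by the mapping above.
    modelId: "anthropic.claude-3-sonnet-20240229-v1:0",
    contentType: "application/json",
    body: JSON.stringify({
      anthropic_version: "bedrock-2023-05-31",
      max_tokens: 256,
      messages: [{ role: "user", content: "Hello from Bedrock via the gateway" }],
    }),
  }),
);

console.log(JSON.parse(new TextDecoder().decode(response.body)));
```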
- Node.js 20+
- npm
- Docker (optional)
# Clone repository
git clone https://github.com/JoshWheeler08/local-llm-gateway.git
cd local-llm-gateway
# Install root dependencies (ESLint, commitlint, husky)
npm install
# Install application dependencies
cd proxy
npm install

# Run linting
npm run lint
# Run type checking
npm run typecheck
# Run tests
npm test
# Build application
npm run build
# Run locally
cd proxy
npm run dev

We use Conventional Commits:
feat(bedrock): add Mistral model support
fix(adapter): resolve provider detection bug
docs(readme): update installation instructions

See CONTRIBUTING.md for details.
local-llm-gateway/
├── proxy/                      # TypeScript gateway application
│   ├── src/
│   │   ├── adapters/           # Provider adapters (Bedrock, OpenAI, Anthropic)
│   │   ├── runners/            # Model runners (vLLM, Ollama)
│   │   ├── types/              # TypeScript type definitions
│   │   └── server.ts           # Fastify server
│   └── Dockerfile
├── kubernetes/
│   └── helm/
│       └── local-llm-gateway/  # Helm chart
├── examples/                   # Configuration examples
└── .github/                    # CI/CD workflows
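
As a rough orientation for proxy/src/server.ts, the gateway is a Fastify application; the snippet below is only an illustrative skeleton (route registration details are hypothetical), not the actual server code:

```typescript
import Fastify from "fastify";

// Illustrative skeleton only — not the actual proxy/src/server.ts.
const app = Fastify({ logger: true });

// Health endpoint used by the checks later in this README.
app.get("/health", async () => ({ status: "ok" }));

// Provider routes (Bedrock, OpenAI, Anthropic adapters) would be registered here.

await app.listen({ port: 8080, host: "0.0.0.0" });
```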
# Start local Kubernetes cluster
minikube start
# Install chart
helm install local-llm-gateway \
oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
--version 0.1.0 \
-f examples/values-dev.yaml
# Port forward to access locally
kubectl port-forward svc/local-llm-gateway 8080:8080 -n local-llm-gateway
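
With the port-forward in place, the gateway can be exercised from existing OpenAI SDK code by overriding the base URL; a minimal sketch, where the /v1 path is an assumption about where the OpenAI-compatible route is exposed:

```typescript
import OpenAI from "openai";

// Talk to the forwarded gateway instead of api.openai.com.
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-used-by-local-models",
});

const completion = await client.chat.completions.create({
  // With a mapping such as 'gpt-4': llama3-8b, this is served locally.
  model: "gpt-4",
  messages: [{ role: "user", content: "Say hello from a local model." }],
});

console.log(completion.choices[0].message.content);
```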
Some models require a Hugging Face token (gated models like Llama 3, Mistral).

- Get token from https://huggingface.co/settings/tokens
- Accept model license on Hugging Face (for gated models)
- Create Kubernetes secret:
kubectl create secret generic huggingface-token \
--from-literal=token=hf_xxxxxxxxxxxxx \
-n local-llm-gateway

helm install local-llm-gateway \
oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
--set huggingface.token.secretName=huggingface-token

# custom-values.yaml
runners:
  llama3-8b:
    enabled: true
    huggingfaceToken:
      secretName: llama-token    # Specific token
  mistral-7b:
    enabled: true
    huggingfaceToken:
      secretName: mistral-token  # Different token

helm install local-llm-gateway \
oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
-f custom-values.yaml

Requires token (gated):
- Meta Llama 3, Llama 2
- Mistral models
- Some Cohere models
No token needed (open):
- Microsoft Phi-3
- Google Gemma
- Most smaller models
The gateway exposes Prometheus metrics on port 9090:
local_llm_gateway_http_requests_total
local_llm_gateway_http_request_duration_seconds
local_llm_gateway_model_inference_requests_total
local_llm_gateway_model_inference_duration_seconds
local_llm_gateway_active_requests
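
For a quick look without a full Prometheus stack, the metrics endpoint can be scraped directly, for example after kubectl port-forward svc/local-llm-gateway 9090:9090; the /metrics path in the sketch below is an assumption (the conventional Prometheus exposition path):

```typescript
// Fetch the Prometheus text exposition and print only the gateway's metrics.
// Requires Node.js 18+ for the global fetch API.
const res = await fetch("http://localhost:9090/metrics");
const text = await res.text();

for (const line of text.split("\n")) {
  if (line.startsWith("local_llm_gateway_")) {
    console.log(line);
  }
}
```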
Enable Prometheus ServiceMonitor:
metrics:
  enabled: true
  serviceMonitor:
    enabled: true

# Gateway logs
kubectl logs -l app.kubernetes.io/name=local-llm-gateway -n local-llm-gateway

# Runner logs
kubectl logs -l local-llm-gateway.io/runner=llama3-8b -n local-llm-gateway

# Gateway health
kubectl exec -it deployment/local-llm-gateway -n local-llm-gateway -- curl http://localhost:8080/health
# Runner health
kubectl exec -it deployment/local-llm-gateway-llama3-8b -n local-llm-gateway -- curl http://localhost:8000/health

- Service mesh integration (Istio, Linkerd) for mTLS
- Pod security contexts with non-root users
- Network policies (optional)
- RBAC configurations
See Security Guide for details.
- vLLM: up to 24x higher throughput than Hugging Face Transformers
- Continuous batching for high throughput
- PagedAttention for efficient memory usage
- Horizontal Pod Autoscaling based on CPU/memory
Contributions are welcome! Please read CONTRIBUTING.md for details.
Apache License 2.0 - see LICENSE for details.
If you use Local LLM Gateway in your research or project, please cite:
@software{local_llm_gateway,
  author  = {Josh Wheeler},
  title   = {Local LLM Gateway: Universal Gateway for Local LLM Inference},
  year    = {2025},
  url     = {https://github.com/JoshWheeler08/local-llm-gateway},
  version = {0.1.0}
}

- vLLM - High-performance inference engine
- Ollama - Easy local LLM deployment
- Fastify - Fast web framework
This project follows a Code of Conduct to ensure a welcoming environment for everyone. Please read and follow it.
Made with ❤️ by the Local LLM Gateway community