Local LLM Gateway

Universal gateway for intercepting cloud LLM API calls and routing them to local models running in Kubernetes.


Overview

Local LLM Gateway enables you to:

  • 🔄 Intercept AWS Bedrock, OpenAI, and Anthropic API calls
  • 🏠 Route requests to local models (vLLM, Ollama)
  • 💰 Save costs by running models on your own infrastructure
  • 🔒 Keep data private: no external API calls
  • 📈 Scale horizontally with Kubernetes

Quick Start

Prerequisites

  • Kubernetes 1.28+
  • Helm 3.13+
  • GPU nodes (for vLLM) or CPU nodes (for Ollama)

Installation

# Install with Helm
helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --version 0.1.0 \
  --namespace local-llm-gateway \
  --create-namespace

Verify Installation

kubectl get pods -n local-llm-gateway
kubectl get svc -n local-llm-gateway

Architecture

Local LLM Gateway acts as a transparent proxy between cloud LLM SDKs and local model runners, enabling cost savings and data privacy without code changes.
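
For example, an application that already uses the official OpenAI SDK can be redirected by overriding only its base URL (newer versions of the SDK can also pick this up from the OPENAI_BASE_URL environment variable). The sketch below is a minimal illustration and assumes the gateway is reachable at the default in-cluster service address and accepts OpenAI-format chat completion requests under /v1; adjust the address and path to your installation.

import OpenAI from "openai";

// Point the existing SDK at the gateway instead of api.openai.com.
// The service address below is an assumption; adjust it to match your install.
const client = new OpenAI({
  baseURL: "http://local-llm-gateway.local-llm-gateway.svc.cluster.local:8080/v1",
  apiKey: "unused-locally", // no cloud key is needed when requests stay in-cluster
});

const completion = await client.chat.completions.create({
  model: "gpt-4", // the gateway maps this to a local runner (see Configuration)
  messages: [{ role: "user", content: "Hello from the local gateway!" }],
});

console.log(completion.choices[0].message.content);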

High-Level Overview

                  ┌─────────────────────────────────────────────────────────────────┐
                  │                        Client Application                       │
                  │  (Existing code using AWS Bedrock, OpenAI, or Anthropic SDKs)   │
                  └─────────────────────────────────────────────────────────────────┘
                                      │                            ▲
                                      │ Provider-specific          │ Provider-specific
                                      │ request format             │ response format
                                      ▼                            │
                  ┌─────────────────────────────────────────────────────────────────┐
                  │                         Local LLM Gateway Proxy                 │
                  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
                  │  │   Bedrock    │  │    OpenAI    │  │  Anthropic   │           │
                  │  │   Adapter    │  │   Adapter    │  │   Adapter    │           │
                  │  └──────────────┘  └──────────────┘  └──────────────┘           │
                  │           │                │                │                   │
                  │           │ Standard       │                │ Standard          │
                  │           │ format         │                │ format            │
                  │           ▼                ▼                ▼                   │
                  │                   Runner Orchestrator                           │
                  │            (Maps cloud models → local runners)                  │
                  │                            │                ▲                   │
                  │                            │ Standard       │ Standard          │
                  │                            │ request        │ response          │
                  └────────────────────────────┼────────────────┼───────────────────┘
                                               │                │
                                     ┌─────────┴─────────┐      │
                                     │                   │      │
                                     ▼                   ▼      │
                            ┌──────────────────┐  ┌──────────────────┐
                            │   vLLM Runner    │  │  Ollama Runner   │
                            │  (GPU-optimized) │  │ (CPU-friendly)   │
                            │   ModelRunner    │  │   ModelRunner    │
                            └──────────────────┘  └──────────────────┘
                                      │                   │
                                      │ vLLM API          │ Ollama API
                                      ▼                   ▼
                          ┌──────────────────┐  ┌──────────────────┐
                          │  Llama 3 8B      │  │   Phi-3 Mini     │
                          │  Mistral 7B      │  │   Gemma 2B       │
                          └──────────────────┘  └──────────────────┘
                                    │                   │
                                    │ Inference         │ Inference
                                    │ results           │ results
                                    └─────────┬─────────┘
                                              │
                                              ▼
                                      (Results flow back up
                                      through the same path)

Components:

  • Gateway: Translates cloud API formats into a standard internal format
  • Adapters: Provider-specific translation for Bedrock, OpenAI, and Anthropic
  • Runners: Execute inference on local models (vLLM, Ollama)
  • Model Mappings: Route each cloud model ID to a local runner

Features

Supported Providers

  • AWS Bedrock (Anthropic Claude, Meta Llama, Amazon Titan, Cohere, AI21, Mistral)
  • OpenAI (GPT-4, GPT-3.5)
  • Anthropic (Claude 3.5 Sonnet, Claude 3 Opus)
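
Pointing an existing Bedrock client at the gateway works the same way: only the SDK endpoint changes. A minimal sketch using the AWS SDK for JavaScript v3, assuming the gateway accepts Bedrock's InvokeModel request format at its service address (the endpoint is an assumption; adjust it to your install):

import {
  BedrockRuntimeClient,
  InvokeModelCommand,
} from "@aws-sdk/client-bedrock-runtime";

// Dummy static credentials satisfy the SDK's request-signing requirement;
// the gateway is assumed to ignore them.
const bedrock = new BedrockRuntimeClient({
  region: "us-east-1",
  endpoint: "http://local-llm-gateway.local-llm-gateway.svc.cluster.local:8080",
  credentials: { accessKeyId: "unused", secretAccessKey: "unused" },
});

const response = await bedrock.send(
  new InvokeModelCommand({
    modelId: "anthropic.claude-3-sonnet-20240229-v1:0", // mapped to a local runner
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify({
      anthropic_version: "bedrock-2023-05-31",
      max_tokens: 256,
      messages: [{ role: "user", content: "Hello from the local gateway!" }],
    }),
  }),
);

console.log(new TextDecoder().decode(response.body));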

Model Runners

  • vLLM - High-performance inference (GPU-optimized)
  • Ollama - CPU-friendly, easy local development
  • 🔄 TGI (coming soon)

Additional Features

  • 📊 Prometheus metrics
  • 🔍 Service mesh support (Istio, Linkerd)
  • 🔄 Horizontal Pod Autoscaling
  • 🎯 Model routing and fallback
  • 🔒 mTLS via service mesh

Configuration

Example: Production Setup

helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --version 0.1.0 \
  -f examples/values-production.yaml

Example: Development Setup

helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --version 0.1.0 \
  -f examples/values-dev.yaml

Custom Configuration

# custom-values.yaml
gateway:
  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10

runners:
  llama3-8b:
    enabled: true
    replicas: 2
    resources:
      limits:
        nvidia.com/gpu: 1

mappings:
  bedrock:
    'anthropic.claude-3-sonnet-20240229-v1:0': llama3-8b
  openai:
    'gpt-4': llama3-8b

Then install with your custom values:

helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --version 0.1.0 \
  -f custom-values.yaml
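
Conceptually, each entry under mappings resolves an incoming cloud model ID to the name of a local runner. The sketch below only illustrates that lookup; it is not the gateway's actual implementation, and the names are hypothetical:

// Illustrative only; not the gateway's actual code.
type Provider = "bedrock" | "openai" | "anthropic";

const mappings: Record<Provider, Record<string, string>> = {
  bedrock: { "anthropic.claude-3-sonnet-20240229-v1:0": "llama3-8b" },
  openai: { "gpt-4": "llama3-8b" },
  anthropic: {},
};

function resolveRunner(provider: Provider, cloudModelId: string): string {
  const runner = mappings[provider][cloudModelId];
  if (!runner) {
    throw new Error(`No local runner mapped for ${provider} model "${cloudModelId}"`);
  }
  // The request is then forwarded to this runner's service (e.g. "llama3-8b").
  return runner;
}

console.log(resolveRunner("openai", "gpt-4")); // -> llama3-8b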

Examples

See the examples/ directory for:

  • values-dev.yaml - Development setup (minimal resources)
  • values-staging.yaml - Staging environment
  • values-production.yaml - Production setup (HA, autoscaling)
  • values-gpu-production.yaml - GPU-optimized production
  • values-cpu-only.yaml - CPU-only deployment (no GPU)
  • values-eks-production.yaml - AWS EKS specific configuration

Development

Prerequisites

  • Node.js 20+
  • npm
  • Docker (optional)

Setup

# Clone repository
git clone https://github.com/JoshWheeler08/local-llm-gateway.git
cd local-llm-gateway

# Install root dependencies (ESLint, commitlint, husky)
npm install

# Install application dependencies
cd proxy
npm install

Development Workflow

# Run linting
npm run lint

# Run type checking
npm run typecheck

# Run tests
npm test

# Build application
npm run build

# Run locally
cd proxy
npm run dev

Commit Convention

We use Conventional Commits:

feat(bedrock): add Mistral model support
fix(adapter): resolve provider detection bug
docs(readme): update installation instructions

See CONTRIBUTING.md for details.

Project Structure

local-llm-gateway/
├── proxy/                  # TypeScript gateway application
│   ├── src/
│   │   ├── adapters/      # Provider adapters (Bedrock, OpenAI, Anthropic)
│   │   ├── runners/       # Model runners (vLLM, Ollama)
│   │   ├── types/         # TypeScript type definitions
│   │   └── server.ts      # Fastify server
│   └── Dockerfile
├── kubernetes/
│   └── helm/
│       └── local-llm-gateway/   # Helm chart
├── examples/              # Configuration examples
└── .github/               # CI/CD workflows

Local Development

# Start local Kubernetes cluster
minikube start

# Install chart
helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --version 0.1.0 \
  -f examples/values-dev.yaml

# Port forward to access locally
kubectl port-forward svc/local-llm-gateway 8080:8080 -n local-llm-gateway
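
With the port forward in place, the gateway can be exercised from your machine. A quick smoke test, assuming an OpenAI-compatible route is exposed (same caveats as the client sketch in the Architecture section):

import OpenAI from "openai";

// Talk to the port-forwarded gateway on localhost; no cloud API key is needed.
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "unused-locally" });

const res = await client.chat.completions.create({
  model: "gpt-4", // mapped to whichever local runner you configured
  messages: [{ role: "user", content: "ping" }],
});

console.log(res.choices[0].message.content);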

Hugging Face Authentication

Some models are gated on Hugging Face and require an access token to download (for example, Llama 3 and Mistral).

Create Token

  1. Get a token from https://huggingface.co/settings/tokens
  2. Accept the model license on Hugging Face (required for gated models)
  3. Create a Kubernetes secret containing the token:

kubectl create secret generic huggingface-token \
  --from-literal=token=hf_xxxxxxxxxxxxx \
  -n local-llm-gateway

Option 1: Global Token (All Runners)

helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --set huggingface.token.secretName=huggingface-token

Option 2: Per-Runner Tokens

# custom-values.yaml
runners:
  llama3-8b:
    enabled: true
    huggingfaceToken:
      secretName: llama-token # Specific token

  mistral-7b:
    enabled: true
    huggingfaceToken:
      secretName: mistral-token # Different token

Then install with the custom values file:

helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  -f custom-values.yaml

Which Models Need Tokens?

Requires token (gated):

  • Meta Llama 3, Llama 2
  • Mistral models
  • Some Cohere models

No token needed (open):

  • Microsoft Phi-3
  • Google Gemma
  • Most smaller models

Monitoring

Prometheus Metrics

The gateway exposes Prometheus metrics on port 9090:

local_llm_gateway_http_requests_total
local_llm_gateway_http_request_duration_seconds
local_llm_gateway_model_inference_requests_total
local_llm_gateway_model_inference_duration_seconds
local_llm_gateway_active_requests

ServiceMonitor

Enable Prometheus ServiceMonitor:

metrics:
  enabled: true
  serviceMonitor:
    enabled: true

Troubleshooting

Gateway Logs

kubectl logs -l app.kubernetes.io/name=local-llm-gateway -n local-llm-gateway

Runner Logs

kubectl logs -l local-llm-gateway.io/runner=llama3-8b -n local-llm-gateway

Health Checks

# Gateway health
kubectl exec -it deployment/local-llm-gateway -n local-llm-gateway -- curl http://localhost:8080/health

# Runner health
kubectl exec -it deployment/local-llm-gateway-llama3-8b -n local-llm-gateway -- curl http://localhost:8000/health

Security

  • Service mesh integration (Istio, Linkerd) for mTLS
  • Pod security contexts with non-root users
  • Network policies (optional)
  • RBAC configurations

See Security Guide for details.

Performance

  • vLLM: up to 24x higher throughput than Hugging Face Transformers (as reported by the vLLM project)
  • Continuous batching for high throughput
  • PagedAttention for efficient memory usage
  • Horizontal Pod Autoscaling based on CPU/memory

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for details.

License

Apache License 2.0 - see LICENSE for details.

Citation

If you use Local LLM Gateway in your research or project, please cite:

@software{local_llm_gateway,
  author = {Josh Wheeler},
  title = {Local LLM Gateway: Universal Gateway for Local LLM Inference},
  year = {2025},
  url = {https://github.com/JoshWheeler08/local-llm-gateway},
  version = {0.1.0}
}

Acknowledgments

  • vLLM - High-performance inference engine
  • Ollama - Easy local LLM deployment
  • Fastify - Fast web framework

Code of Conduct

This project follows a Code of Conduct to ensure a welcoming environment for everyone. Please read and follow it.


Made with ❤️ by the Local LLM Gateway community
