Local LLM Gateway

Universal gateway for intercepting cloud LLM API calls and routing them to local models running in Kubernetes.


Overview

Local LLM Gateway enables you to:

  • 🔄 Intercept AWS Bedrock, OpenAI, and Anthropic API calls
  • 🏠 Route requests to local models (vLLM, Ollama)
  • 💰 Save costs by running models on your own infrastructure
  • 🔒 Keep data private: no external API calls
  • 📈 Scale horizontally with Kubernetes

Quick Start

Prerequisites

  • Kubernetes 1.28+
  • Helm 3.13+
  • GPU nodes (for vLLM) or CPU nodes (for Ollama)

Installation

# Install with Helm
helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --version 0.1.0 \
  --namespace local-llm-gateway \
  --create-namespace

Verify Installation

kubectl get pods -n local-llm-gateway
kubectl get svc -n local-llm-gateway

Architecture

Local LLM Gateway acts as a transparent proxy between cloud LLM SDKs and local model runners, enabling cost savings and data privacy without code changes.
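
For example, an application that already uses the official OpenAI SDK can be redirected by overriding only its base URL (newer versions of the SDK can also pick this up from the OPENAI_BASE_URL environment variable). The sketch below is a minimal illustration and assumes the gateway is reachable at the default in-cluster service address and accepts OpenAI-format chat completion requests under /v1; adjust the address and path to your installation.

import OpenAI from "openai";

// Point the existing SDK at the gateway instead of api.openai.com.
// The service address below is an assumption; adjust it to match your install.
const client = new OpenAI({
  baseURL: "http://local-llm-gateway.local-llm-gateway.svc.cluster.local:8080/v1",
  apiKey: "unused-locally", // no cloud key is needed when requests stay in-cluster
});

const completion = await client.chat.completions.create({
  model: "gpt-4", // the gateway maps this to a local runner (see Configuration)
  messages: [{ role: "user", content: "Hello from the local gateway!" }],
});

console.log(completion.choices[0].message.content);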

High-Level Overview

                  ┌─────────────────────────────────────────────────────────────────┐
                  │                        Client Application                       │
                  │  (Existing code using AWS Bedrock, OpenAI, or Anthropic SDKs)   │
                  └─────────────────────────────────────────────────────────────────┘
                                      │                            ▲
                                      │ Provider-specific          │ Provider-specific
                                      │ request format             │ response format
                                      ▼                            │
                  ┌─────────────────────────────────────────────────────────────────┐
                  │                         Local LLM Gateway Proxy                 │
                  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
                  │  │   Bedrock    │  │    OpenAI    │  │  Anthropic   │           │
                  │  │   Adapter    │  │   Adapter    │  │   Adapter    │           │
                  │  └──────────────┘  └──────────────┘  └──────────────┘           │
                  │           │                │                │                   │
                  │           │ Standard       │                │ Standard          │
                  │           │ format         │                │ format            │
                  │           ▼                ▼                ▼                   │
                  │                   Runner Orchestrator                           │
                  │            (Maps cloud models → local runners)                  │
                  │                            │                ▲                   │
                  │                            │ Standard       │ Standard          │
                  │                            │ request        │ response          │
                  └────────────────────────────┼────────────────┼───────────────────┘
                                               │                │
                                     ┌─────────┴─────────┐      │
                                     │                   │      │
                                     ▼                   ▼      │
                            ┌──────────────────┐  ┌──────────────────┐
                            │   vLLM Runner    │  │  Ollama Runner   │
                            │  (GPU-optimized) │  │ (CPU-friendly)   │
                            │   ModelRunner    │  │   ModelRunner    │
                            └──────────────────┘  └──────────────────┘
                                      │                   │
                                      │ vLLM API          │ Ollama API
                                      ▼                   ▼
                          ┌──────────────────┐  ┌──────────────────┐
                          │  Llama 3 8B      │  │   Phi-3 Mini     │
                          │  Mistral 7B      │  │   Gemma 2B       │
                          └──────────────────┘  └──────────────────┘
                                    │                   │
                                    │ Inference         │ Inference
                                    │ results           │ results
                                    └─────────┬─────────┘
                                              │
                                              ▼
                                      (Results flow back up
                                      through the same path)

Components:

  • Gateway: Translates cloud API formats into a standard internal format
  • Adapters: Provider-specific translation for Bedrock, OpenAI, and Anthropic
  • Runners: Execute inference on local models (vLLM, Ollama)
  • Model Mappings: Route each cloud model ID to a local runner

Features

Supported Providers

  • AWS Bedrock (Anthropic Claude, Meta Llama, Amazon Titan, Cohere, AI21, Mistral)
  • OpenAI (GPT-4, GPT-3.5)
  • Anthropic (Claude 3.5 Sonnet, Claude 3 Opus)
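
Pointing an existing Bedrock client at the gateway works the same way: only the SDK endpoint changes. A minimal sketch using the AWS SDK for JavaScript v3, assuming the gateway accepts Bedrock's InvokeModel request format at its service address (the endpoint is an assumption; adjust it to your install):

import {
  BedrockRuntimeClient,
  InvokeModelCommand,
} from "@aws-sdk/client-bedrock-runtime";

// Dummy static credentials satisfy the SDK's request-signing requirement;
// the gateway is assumed to ignore them.
const bedrock = new BedrockRuntimeClient({
  region: "us-east-1",
  endpoint: "http://local-llm-gateway.local-llm-gateway.svc.cluster.local:8080",
  credentials: { accessKeyId: "unused", secretAccessKey: "unused" },
});

const response = await bedrock.send(
  new InvokeModelCommand({
    modelId: "anthropic.claude-3-sonnet-20240229-v1:0", // mapped to a local runner
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify({
      anthropic_version: "bedrock-2023-05-31",
      max_tokens: 256,
      messages: [{ role: "user", content: "Hello from the local gateway!" }],
    }),
  }),
);

console.log(new TextDecoder().decode(response.body));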

Model Runners

  • vLLM - High-performance inference (GPU-optimized)
  • Ollama - CPU-friendly, easy local development
  • 🔄 TGI (coming soon)

Additional Features

  • 📊 Prometheus metrics
  • 🔍 Service mesh support (Istio, Linkerd)
  • 🔄 Horizontal Pod Autoscaling
  • 🎯 Model routing and fallback
  • 🔒 mTLS via service mesh

Configuration

Example: Production Setup

helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --version 0.1.0 \
  -f examples/values-production.yaml

Example: Development Setup

helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --version 0.1.0 \
  -f examples/values-dev.yaml

Custom Configuration

# custom-values.yaml
gateway:
  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10

runners:
  llama3-8b:
    enabled: true
    replicas: 2
    resources:
      limits:
        nvidia.com/gpu: 1

mappings:
  bedrock:
    'anthropic.claude-3-sonnet-20240229-v1:0': llama3-8b
  openai:
    'gpt-4': llama3-8b

Then install with your custom values:

helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --version 0.1.0 \
  -f custom-values.yaml
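
Conceptually, each entry under mappings resolves an incoming cloud model ID to the name of a local runner. The sketch below only illustrates that lookup; it is not the gateway's actual implementation, and the names are hypothetical:

// Illustrative only; not the gateway's actual code.
type Provider = "bedrock" | "openai" | "anthropic";

const mappings: Record<Provider, Record<string, string>> = {
  bedrock: { "anthropic.claude-3-sonnet-20240229-v1:0": "llama3-8b" },
  openai: { "gpt-4": "llama3-8b" },
  anthropic: {},
};

function resolveRunner(provider: Provider, cloudModelId: string): string {
  const runner = mappings[provider][cloudModelId];
  if (!runner) {
    throw new Error(`No local runner mapped for ${provider} model "${cloudModelId}"`);
  }
  // The request is then forwarded to this runner's service (e.g. "llama3-8b").
  return runner;
}

console.log(resolveRunner("openai", "gpt-4")); // -> llama3-8b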

Examples

See the examples/ directory for:

  • values-dev.yaml - Development setup (minimal resources)
  • values-staging.yaml - Staging environment
  • values-production.yaml - Production setup (HA, autoscaling)
  • values-gpu-production.yaml - GPU-optimized production
  • values-cpu-only.yaml - CPU-only deployment (no GPU)
  • values-eks-production.yaml - AWS EKS specific configuration

Development

Prerequisites

  • Node.js 20+
  • npm
  • Docker (optional)

Setup

# Clone repository
git clone https://github.com/JoshWheeler08/local-llm-gateway.git
cd local-llm-gateway

# Install root dependencies (ESLint, commitlint, husky)
npm install

# Install application dependencies
cd proxy
npm install

Development Workflow

# Run linting
npm run lint

# Run type checking
npm run typecheck

# Run tests
npm test

# Build application
npm run build

# Run locally
cd proxy
npm run dev

Commit Convention

We use Conventional Commits:

feat(bedrock): add Mistral model support
fix(adapter): resolve provider detection bug
docs(readme): update installation instructions

See CONTRIBUTING.md for details.

Project Structure

local-llm-gateway/
├── proxy/                  # TypeScript gateway application
│   ├── src/
│   │   ├── adapters/      # Provider adapters (Bedrock, OpenAI, Anthropic)
│   │   ├── runners/       # Model runners (vLLM, Ollama)
│   │   ├── types/         # TypeScript type definitions
│   │   └── server.ts      # Fastify server
│   └── Dockerfile
├── kubernetes/
│   └── helm/
│       └── local-llm-gateway/   # Helm chart
├── examples/              # Configuration examples
└── .github/               # CI/CD workflows

Local Development

# Start local Kubernetes cluster
minikube start

# Install chart
helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --version 0.1.0 \
  -f examples/values-dev.yaml

# Port forward to access locally
kubectl port-forward svc/local-llm-gateway 8080:8080 -n local-llm-gateway
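
With the port forward in place, the gateway can be exercised from your machine. A quick smoke test, assuming an OpenAI-compatible route is exposed (same caveats as the client sketch in the Architecture section):

import OpenAI from "openai";

// Talk to the port-forwarded gateway on localhost; no cloud API key is needed.
const client = new OpenAI({ baseURL: "http://localhost:8080/v1", apiKey: "unused-locally" });

const res = await client.chat.completions.create({
  model: "gpt-4", // mapped to whichever local runner you configured
  messages: [{ role: "user", content: "ping" }],
});

console.log(res.choices[0].message.content);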

Hugging Face Authentication

Some models are gated on Hugging Face and require an access token to download (for example, Llama 3 and Mistral).

Create Token

  1. Get a token from https://huggingface.co/settings/tokens
  2. Accept the model license on Hugging Face (required for gated models)
  3. Create a Kubernetes secret containing the token:

kubectl create secret generic huggingface-token \
  --from-literal=token=hf_xxxxxxxxxxxxx \
  -n local-llm-gateway

Option 1: Global Token (All Runners)

helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  --set huggingface.token.secretName=huggingface-token

Option 2: Per-Runner Tokens

# custom-values.yaml
runners:
  llama3-8b:
    enabled: true
    huggingfaceToken:
      secretName: llama-token # Specific token

  mistral-7b:
    enabled: true
    huggingfaceToken:
      secretName: mistral-token # Different token

Then install with the custom values file:

helm install local-llm-gateway \
  oci://ghcr.io/JoshWheeler08/local-llm-gateway/charts/local-llm-gateway \
  -f custom-values.yaml

Which Models Need Tokens?

Requires token (gated):

  • Meta Llama 3, Llama 2
  • Mistral models
  • Some Cohere models

No token needed (open):

  • Microsoft Phi-3
  • Google Gemma
  • Most smaller models

Monitoring

Prometheus Metrics

The gateway exposes Prometheus metrics on port 9090:

local_llm_gateway_http_requests_total
local_llm_gateway_http_request_duration_seconds
local_llm_gateway_model_inference_requests_total
local_llm_gateway_model_inference_duration_seconds
local_llm_gateway_active_requests

ServiceMonitor

Enable Prometheus ServiceMonitor:

metrics:
  enabled: true
  serviceMonitor:
    enabled: true

Troubleshooting

Gateway Logs

kubectl logs -l app.kubernetes.io/name=local-llm-gateway -n local-llm-gateway

Runner Logs

kubectl logs -l local-llm-gateway.io/runner=llama3-8b -n local-llm-gateway

Health Checks

# Gateway health
kubectl exec -it deployment/local-llm-gateway -n local-llm-gateway -- curl http://localhost:8080/health

# Runner health
kubectl exec -it deployment/local-llm-gateway-llama3-8b -n local-llm-gateway -- curl http://localhost:8000/health

Security

  • Service mesh integration (Istio, Linkerd) for mTLS
  • Pod security contexts with non-root users
  • Network policies (optional)
  • RBAC configurations

See Security Guide for details.

Performance

  • vLLM: up to 24x higher throughput than Hugging Face Transformers (as reported by the vLLM project)
  • Continuous batching for high throughput
  • PagedAttention for efficient memory usage
  • Horizontal Pod Autoscaling based on CPU/memory

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for details.

License

Apache License 2.0 - see LICENSE for details.

Citation

If you use Local LLM Gateway in your research or project, please cite:

@software{local_llm_gateway,
  author = {Josh Wheeler},
  title = {Local LLM Gateway: Universal Gateway for Local LLM Inference},
  year = {2025},
  url = {https://github.com/JoshWheeler08/local-llm-gateway},
  version = {0.1.0}
}

Acknowledgments

  • vLLM - High-performance inference engine
  • Ollama - Easy local LLM deployment
  • Fastify - Fast web framework

Code of Conduct

This project follows a Code of Conduct to ensure a welcoming environment for everyone. Please read and follow it.


Made with ❤️ by the Local LLM Gateway community
