========================================
Intelligent Semantic Routing
========================================

What is vLLM Semantic Router?
========================================

vLLM Semantic Router is an AI-powered intelligent routing system for efficient LLM inference on Mixture-of-Models (MoM) deployments. It operates as an Envoy External Processor that semantically routes OpenAI API-compatible requests to the most suitable backend model using advanced neural network techniques.

**Key Features:**

* **🧠 Intelligent Routing**: Powered by fine-tuned ModernBERT models for intent understanding, it analyzes the context, intent, and complexity of each request to route it to the best-suited LLM
* **🛡️ AI-Powered Security**: Advanced PII detection and Prompt Guard identify and block jailbreak attempts, ensuring secure and responsible AI interactions
* **⚡ Semantic Caching**: An intelligent similarity cache stores semantic representations of prompts, dramatically reducing token usage and latency through smart content matching

Benefits of Integration
========================================

Integrating vLLM Semantic Router with AIBrix provides several advantages:

**1. Intelligent Request Routing**
   Semantic Router analyzes incoming requests and routes them to the most appropriate model based on content understanding, while AIBrix's gateway efficiently manages traffic distribution across model replicas.

**2. Enhanced Scalability**
   AIBrix's autoscaler works seamlessly with Semantic Router to dynamically adjust resources based on routing patterns and real-time demand.

**3. Cost Optimization**
   By combining Semantic Router's intelligent routing with AIBrix's heterogeneous serving capabilities, you can optimize GPU utilization and reduce infrastructure costs while maintaining SLO guarantees through per-token unit economics.

**4. Production-Ready Infrastructure**
   AIBrix provides enterprise-grade features such as distributed KV cache, GPU failure detection, and unified runtime management, making it easier to deploy Semantic Router in production environments.

**5. Simplified Operations**
   The integration leverages Kubernetes-native patterns and Gateway API resources, providing a familiar operational model for DevOps teams.

About vLLM AIBrix
========================================

`vLLM AIBrix <https://github.com/vllm-project/aibrix>`_ is an open-source initiative that provides essential building blocks for constructing scalable GenAI inference infrastructure. AIBrix delivers a cloud-native solution optimized for deploying, managing, and scaling large language model (LLM) inference, tailored specifically to enterprise needs.

Prerequisites
========================================

Before starting, ensure you have installed the AIBrix components in your cluster.

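
As a quick sanity check, you can verify that the AIBrix control-plane pods are up before proceeding. The snippet below is a minimal sketch: the ``aibrix-system`` namespace is an assumption based on AIBrix's defaults, so adjust it if your installation uses a different one.

.. code-block:: bash

   # Sanity check (sketch): fail if any pod in the AIBrix namespace is not Running.
   check_pods_running() {
     # Expects "kubectl get pods" output on stdin; skips the header row.
     awk 'NR > 1 && $3 != "Running" { bad = 1 } END { exit bad }'
   }

   # "aibrix-system" is an assumed namespace; change it to match your install.
   if kubectl get pods -n aibrix-system | check_pods_running; then
     echo "AIBrix components look healthy"
   else
     echo "Some AIBrix pods are not Running yet" >&2
   fi
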
Step 1: Deploy vLLM Semantic Router
========================================

Deploy the semantic router service with all required components:

.. code-block:: bash

   # Clone the semantic router repository
   git clone git@github.com:vllm-project/semantic-router.git && cd semantic-router

   # Deploy the semantic router using Kustomize
   kubectl apply -k deploy/kubernetes/aibrix/semantic-router

   # Wait for the deployment to become ready (this may take several minutes for model downloads)
   kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s

   # Verify deployment status
   kubectl get pods -n vllm-semantic-router-system

Step 2: Deploy Demo LLM
========================================

Create a demo LLM deployment to serve as the backend for the semantic router:

.. code-block:: bash

   # Deploy the demo LLM
   kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/base-model.yaml

   # Wait for the model deployment to become ready
   kubectl wait --timeout=2m -n default deployment/vllm-llama3-8b-instruct --for=condition=Available

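
If you want to confirm the demo model itself is serving before testing it through the router, you can port-forward to it directly and send one request. This is a sketch under stated assumptions: the Service name ``vllm-llama3-8b-instruct`` and port 8000 (vLLM's default) are guesses based on the deployment above, so adjust them to your environment.

.. code-block:: bash

   # Build a minimal OpenAI-style chat request body.
   make_chat_request() {
     local model="$1" prompt="$2"
     printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}' \
       "$model" "$prompt"
   }

   # Port-forward straight to the demo model, bypassing the router.
   # The Service name and port 8000 (vLLM's default) are assumptions.
   kubectl port-forward -n default svc/vllm-llama3-8b-instruct 8000:8000 >/dev/null 2>&1 &
   PF_PID=$!
   sleep 2
   make_chat_request "vllm-llama3-8b-instruct" "Say hello" |
     curl -s http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" -d @- \
     || echo "request failed; is the demo model Service reachable?" >&2
   kill "$PF_PID" 2>/dev/null || true
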
Step 3: Create Gateway API Resources
========================================

Create the necessary Gateway API resources for Envoy Gateway:

.. code-block:: bash

   kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/gwapi-resources.yaml

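
To verify that the gateway accepted these resources, you can inspect the Gateway's standard ``Programmed`` condition. The gateway name and namespace (``aibrix-eg`` in ``aibrix-system``) match the owning-gateway labels used in the testing section below; the condition check itself is a small sketch.

.. code-block:: bash

   # Check whether a Gateway reports Programmed=True.
   gateway_programmed() {
     # Expects "type=status" lines on stdin, one condition per line.
     grep -q '^Programmed=True'
   }

   kubectl get gateway aibrix-eg -n aibrix-system \
     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' |
     gateway_programmed && echo "Gateway is programmed" || echo "Gateway not ready yet"
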
Testing the Deployment
========================================

Method 1: Port Forwarding (Recommended for Local Testing)
----------------------------------------------------------

Set up port forwarding to access the gateway locally:

.. code-block:: bash

   # Get the Envoy service name
   export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
     --selector=gateway.envoyproxy.io/owning-gateway-namespace=aibrix-system,gateway.envoyproxy.io/owning-gateway-name=aibrix-eg \
     -o jsonpath='{.items[0].metadata.name}')

   # Forward local port 8080 to the gateway
   kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80

Send Test Requests
----------------------------------------------------------

Once the gateway is accessible, test the inference endpoint:

.. code-block:: bash

   # Test the chat completions endpoint with a math-domain prompt
   curl -i -X POST http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "MoM",
       "messages": [
         {"role": "user", "content": "What is the derivative of f(x) = x^3?"}
       ]
     }'

You should see a response from the demo LLM, along with additional headers injected by the semantic router:

.. code-block:: http

   HTTP/1.1 200 OK
   server: fasthttp
   date: Thu, 06 Nov 2025 06:38:08 GMT
   content-type: application/json
   x-inference-pod: vllm-llama3-8b-instruct-984659dbb-gp5l9
   x-went-into-req-headers: true
   request-id: b46b6f7b-5645-470f-9868-0dd8b99a7163
   x-vsr-selected-category: math
   x-vsr-selected-reasoning: on
   x-vsr-selected-model: vllm-llama3-8b-instruct
   x-vsr-injected-system-prompt: true
   transfer-encoding: chunked

   {"id":"chatcmpl-f390a0c6-b38f-4a73-b019-9374a3c5d69b","created":1762411088,"model":"vllm-llama3-8b-instruct","usage":{"prompt_tokens":42,"completion_tokens":48,"total_tokens":90},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am your AI assistant, how can I help you today? To be or not to be that is the question. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Testing, testing 1,2,3"}}]}

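
The ``x-vsr-*`` headers are how the router reports its decision: the detected category, whether reasoning mode was enabled, the model it selected, and whether it injected a system prompt. The snippet below is a sketch for pulling those headers out of a captured response; it assumes the port-forward from the previous step is still running.

.. code-block:: bash

   # Filter the router's decision headers out of an HTTP response.
   vsr_headers() {
     grep -i '^x-vsr-'
   }

   # Capture the response headers to a file, discarding the body.
   curl -s -D /tmp/vsr-headers.txt -o /dev/null \
     http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "MoM", "messages": [{"role": "user", "content": "What is 2 + 2?"}]}' \
     || true
   vsr_headers < /tmp/vsr-headers.txt || echo "no x-vsr headers captured" >&2
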
Cleanup
========================================

To remove the entire deployment:

.. code-block:: bash

   # Remove the Gateway API resources and the demo LLM
   kubectl delete -f deploy/kubernetes/aibrix/aigw-resources

   # Remove the semantic router
   kubectl delete -k deploy/kubernetes/aibrix/semantic-router

   # Delete the kind cluster (if you created one for this walkthrough)
   kind delete cluster --name semantic-router-cluster

Next Steps
========================================

* Set up monitoring and observability
* Implement authentication and authorization
* Scale the semantic router deployment for production workloads