Commit 8e52c35

[Integration]: Add Intelligent Semantic Routing with vLLM-SR
Signed-off-by: bitliu <bitliu@tencent.com>
1 parent 941e68f commit 8e52c35

2 files changed: +163 -0 lines changed

Lines changed: 162 additions & 0 deletions

@@ -0,0 +1,162 @@
========================================
Intelligent Semantic Routing
========================================

This guide demonstrates how to integrate vLLM Semantic Router with vLLM AIBrix to build an intelligent Mixture-of-Models (MoM) system. The integration brings **system-level intelligence** through AI-powered semantic understanding, automated reasoning capabilities, and intelligent caching mechanisms, enabling production-grade LLM inference with enhanced quality, security, and cost efficiency.

What is vLLM Semantic Router?
========================================

vLLM Semantic Router is an AI-powered intelligent routing system for efficient LLM inference. It operates as an Envoy External Processor that semantically routes OpenAI API-compatible requests to the most suitable backend model using advanced neural network technologies.
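Conceptually, the router performs a classify-then-dispatch step: infer the request's category, then forward it to the model configured for that category. The toy sketch below is illustrative only; the keyword "classifier", category labels, and category-to-model mapping are hypothetical stand-ins for the ModernBERT-based classification and configuration the real router uses.

```python
# Illustrative classify-then-dispatch sketch of semantic routing.
# The keyword classifier and the mapping below are hypothetical stand-ins
# for the router's ModernBERT-based intent classification and its config.

CATEGORY_TO_MODEL = {
    "math": "vllm-llama3-8b-instruct",      # assumed mapping, for illustration
    "general": "vllm-llama3-8b-instruct",
}

def classify(prompt: str) -> str:
    """Toy intent classifier; a real router uses a fine-tuned encoder model."""
    math_markers = ("derivative", "integral", "equation", "solve")
    if any(m in prompt.lower() for m in math_markers):
        return "math"
    return "general"

def route(prompt: str) -> tuple[str, str]:
    """Return (category, backend model) for a request."""
    category = classify(prompt)
    return category, CATEGORY_TO_MODEL[category]

category, model = route("What is the derivative of f(x) = x^3?")
print(category, model)  # math vllm-llama3-8b-instruct
```

The same two-stage shape (semantic classification, then dispatch) is what lets the router add decision metadata such as the selected category and model to each response.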
**Key Features:**

* **🧠 Intelligent Routing**: Powered by fine-tuned ModernBERT models for intent understanding, it analyzes context, intent, and complexity to route each request to the best LLM
* **🛡️ AI-Powered Security**: Advanced PII detection and prompt guarding to identify and block jailbreak attempts, ensuring secure and responsible AI interactions
* **⚡ Semantic Caching**: An intelligent similarity cache that stores semantic representations of prompts, dramatically reducing token usage and latency through smart content matching
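The semantic cache idea can be sketched in a few lines: store an embedding per prompt and serve the cached answer when a new prompt is similar enough. The bag-of-words "embedding" and the threshold below are toy stand-ins for the learned sentence embeddings and tuned similarity threshold a real semantic cache would use.

```python
# Toy sketch of a semantic cache: cache responses keyed by prompt embeddings
# and return a cached answer on sufficient cosine similarity.
# The bag-of-words "embedding" is a stand-in for learned sentence embeddings.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached response)

    def get(self, prompt: str):
        q = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit: the backend call is skipped entirely
        return None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

cache = SemanticCache(threshold=0.8)
cache.put("what is the derivative of x^3", "3x^2")
print(cache.get("what is the derivative of x^3 ?"))  # 3x^2 (near-duplicate hit)
```

Because the lookup matches meaning rather than exact strings, paraphrased prompts can hit the cache, which is where the token and latency savings come from.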
Benefits of Integration
========================================

Integrating vLLM Semantic Router with AIBrix provides several advantages:

**1. Intelligent Request Routing**

Semantic Router analyzes incoming requests and routes them to the most appropriate model based on content understanding, while AIBrix's gateway efficiently manages traffic distribution across model replicas.

**2. Enhanced Scalability**

AIBrix's autoscaler works seamlessly with Semantic Router to dynamically adjust resources based on routing patterns and real-time demand.

**3. Cost Optimization**

By combining Semantic Router's intelligent routing with AIBrix's heterogeneous serving capabilities, you can optimize GPU utilization and reduce infrastructure costs while maintaining SLO guarantees through per-token unit economics.

**4. Production-Ready Infrastructure**

AIBrix provides enterprise-grade features such as distributed KV cache, GPU failure detection, and unified runtime management, making it easier to deploy Semantic Router in production environments.

**5. Simplified Operations**

The integration leverages Kubernetes-native patterns and Gateway API resources, providing a familiar operational model for DevOps teams.
About vLLM AIBrix
========================================

`vLLM AIBrix <https://github.com/vllm-project/aibrix>`_ is an open-source initiative that provides essential building blocks for constructing scalable GenAI inference infrastructure. AIBrix delivers a cloud-native solution optimized for deploying, managing, and scaling large language model (LLM) inference, tailored specifically to enterprise needs.

Prerequisites
========================================

Before starting, ensure you have the AIBrix components installed.
Step 1: Deploy vLLM Semantic Router
========================================

Deploy the semantic router service with all required components:

.. code-block:: bash

   # Clone the semantic router repository
   git clone git@github.com:vllm-project/semantic-router.git && cd semantic-router

   # Deploy semantic router using Kustomize
   kubectl apply -k deploy/kubernetes/aibrix/semantic-router

   # Wait for the deployment to become ready (model downloads may take several minutes)
   kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s

   # Verify deployment status
   kubectl get pods -n vllm-semantic-router-system
Step 2: Deploy Demo LLM
========================================

Create a demo LLM deployment to serve as the backend for the semantic router:

.. code-block:: bash

   # Deploy the demo LLM
   kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/base-model.yaml

   # Wait for the backend deployment to become ready
   kubectl wait --timeout=2m -n default deployment/vllm-llama3-8b-instruct --for=condition=Available
Step 3: Create Gateway API Resources
========================================

Create the Gateway API resources that configure the Envoy gateway:

.. code-block:: bash

   kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/gwapi-resources.yaml
Testing the Deployment
========================================

Method 1: Port Forwarding (Recommended for Local Testing)
----------------------------------------------------------

Set up port forwarding to access the gateway locally:

.. code-block:: bash

   # Get the Envoy service name
   export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
     --selector=gateway.envoyproxy.io/owning-gateway-namespace=aibrix-system,gateway.envoyproxy.io/owning-gateway-name=aibrix-eg \
     -o jsonpath='{.items[0].metadata.name}')

   # Forward local port 8080 to the gateway service
   kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80
Send Test Requests
----------------------------------------------------------

Once the gateway is accessible, test the inference endpoint:

.. code-block:: bash

   # Test the chat completions endpoint with a math-domain prompt
   curl -i -X POST http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "MoM",
       "messages": [
         {"role": "user", "content": "What is the derivative of f(x) = x^3?"}
       ]
     }'
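The same request can be issued from Python using only the standard library. This sketch assumes the port-forward from Method 1 is active on ``localhost:8080``; the helper names are our own, not part of any vLLM API.

```python
# Build and send the same chat-completions request from Python (stdlib only).
# Assumes the port-forward from Method 1 is active on localhost:8080.
import json
import urllib.request

def build_request(prompt: str,
                  url: str = "http://localhost:8080/v1/chat/completions"):
    """Construct the POST request the curl example above sends."""
    body = json.dumps({
        "model": "MoM",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

def send(prompt: str) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        # Routing metadata travels in x-vsr-* response headers.
        print(resp.headers.get("x-vsr-selected-category"))
        return json.load(resp)

# Usage (with the gateway reachable):
#   reply = send("What is the derivative of f(x) = x^3?")
#   print(reply["choices"][0]["message"]["content"])
```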
You will see the response from the demo LLM, along with additional headers injected by the semantic router:

.. code-block:: http

   HTTP/1.1 200 OK
   server: fasthttp
   date: Thu, 06 Nov 2025 06:38:08 GMT
   content-type: application/json
   x-inference-pod: vllm-llama3-8b-instruct-984659dbb-gp5l9
   x-went-into-req-headers: true
   request-id: b46b6f7b-5645-470f-9868-0dd8b99a7163
   x-vsr-selected-category: math
   x-vsr-selected-reasoning: on
   x-vsr-selected-model: vllm-llama3-8b-instruct
   x-vsr-injected-system-prompt: true
   transfer-encoding: chunked

   {"id":"chatcmpl-f390a0c6-b38f-4a73-b019-9374a3c5d69b","created":1762411088,"model":"vllm-llama3-8b-instruct","usage":{"prompt_tokens":42,"completion_tokens":48,"total_tokens":90},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am your AI assistant, how can I help you today? To be or not to be that is the question. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Testing, testing 1,2,3"}}]}
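Client code can read the router's decisions (selected category, reasoning mode, model, injected system prompt) from these ``x-vsr-*`` headers. A minimal sketch, assuming only the header-name convention shown above:

```python
# Collect the semantic router's decision metadata from response headers.
# Assumes only the x-vsr-* naming convention shown in the example above.

def vsr_decisions(headers: dict) -> dict:
    """Return all x-vsr-* headers (case-insensitive) with the prefix stripped."""
    return {
        k.lower().removeprefix("x-vsr-"): v
        for k, v in headers.items()
        if k.lower().startswith("x-vsr-")
    }

headers = {
    "content-type": "application/json",
    "x-vsr-selected-category": "math",
    "x-vsr-selected-reasoning": "on",
    "x-vsr-selected-model": "vllm-llama3-8b-instruct",
    "x-vsr-injected-system-prompt": "true",
}
print(vsr_decisions(headers))
# {'selected-category': 'math', 'selected-reasoning': 'on',
#  'selected-model': 'vllm-llama3-8b-instruct', 'injected-system-prompt': 'true'}
```

This is useful for logging which model actually served each request, or for asserting routing behavior in integration tests.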
Cleanup
========================================

To remove the entire deployment:

.. code-block:: bash

   # Remove Gateway API resources and the demo LLM
   kubectl delete -f deploy/kubernetes/aibrix/aigw-resources

   # Remove the semantic router
   kubectl delete -k deploy/kubernetes/aibrix/semantic-router

   # Delete the kind cluster
   kind delete cluster --name semantic-router-cluster
Next Steps
========================================

* Set up monitoring and observability
* Implement authentication and authorization
* Scale the semantic router deployment for production workloads

docs/source/index.rst

Lines changed: 1 addition & 0 deletions

@@ -59,6 +59,7 @@ Documentation
     features/kv-event-sync.rst
     features/benchmark-and-generator.rst
     features/multi-engine.rst
+    features/semantic-routing.rst

  .. toctree::
     :maxdepth: 1
