Commit 3eb3cfa

[Integration]: Add Intelligent Semantic Routing with vLLM-SR
Signed-off-by: bitliu <bitliu@tencent.com>

2 files changed: 161 additions, 0 deletions

docs/source/features/semantic-routing.rst (160 additions)

========================================
Intelligent Semantic Routing
========================================

What is vLLM Semantic Router?
========================================

vLLM Semantic Router is an AI-powered routing system for efficient LLM inference across a Mixture-of-Models (MoM). It operates as an Envoy External Processor that semantically routes OpenAI API-compatible requests to the most suitable backend model using fine-tuned neural classifiers.

**Key Features:**

* **🧠 Intelligent Routing**: Fine-tuned ModernBERT models analyze each request's context, intent, and complexity and route it to the best-suited LLM
* **🛡️ AI-Powered Security**: PII detection and Prompt Guard identify and block jailbreak attempts, ensuring secure and responsible AI interactions
* **⚡ Semantic Caching**: An intelligent similarity cache stores semantic representations of prompts, dramatically reducing token usage and latency through smart content matching (see the sketch below)
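
As a rough illustration of the caching behavior described above, the sketch below sends an identical prompt twice and compares wall-clock latency. It assumes the full stack from the steps later in this guide is already deployed and port-forwarded to ``localhost:8080``; whether the second request is actually served from the cache depends on the router's cache configuration, so treat this as a demonstration idea rather than a guaranteed result:

.. code-block:: bash

    # Hypothetical demo: time two identical requests; if the second one hits
    # the semantic cache, it should return markedly faster.
    REQUEST='{"model": "MoM", "messages": [{"role": "user", "content": "What is the derivative of f(x) = x^3?"}]}'

    for attempt in 1 2; do
        echo "Attempt ${attempt}:"
        time curl -s -X POST http://localhost:8080/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d "${REQUEST}" > /dev/null
    done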

Benefits of Integration
========================================

Integrating vLLM Semantic Router with AIBrix provides several advantages:

**1. Intelligent Request Routing**
Semantic Router analyzes incoming requests and routes them to the most appropriate model based on content understanding, while AIBrix's gateway efficiently distributes traffic across model replicas.

**2. Enhanced Scalability**
AIBrix's autoscaler works seamlessly with Semantic Router, dynamically adjusting resources to match routing patterns and real-time demand.

**3. Cost Optimization**
Combining Semantic Router's intelligent routing with AIBrix's heterogeneous serving capabilities lets you optimize GPU utilization and reduce infrastructure costs while maintaining SLO guarantees through per-token unit economics.

**4. Production-Ready Infrastructure**
AIBrix provides enterprise-grade features such as a distributed KV cache, GPU failure detection, and unified runtime management, making it easier to deploy Semantic Router in production environments.

**5. Simplified Operations**
The integration leverages Kubernetes-native patterns and Gateway API resources, providing a familiar operational model for DevOps teams.

About vLLM AIBrix
========================================

`vLLM AIBrix <https://github.com/vllm-project/aibrix>`_ is an open-source initiative that provides essential building blocks for constructing scalable GenAI inference infrastructure. AIBrix delivers a cloud-native solution optimized for deploying, managing, and scaling large language model (LLM) inference, tailored specifically to enterprise needs.

Prerequisites
========================================

Before starting, ensure the AIBrix components are installed in your cluster; one way to check is shown below.
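
The following sanity check assumes a default AIBrix installation in the ``aibrix-system`` namespace (the same namespace that owns the ``aibrix-eg`` gateway used later in this guide); adjust the namespace if your installation differs:

.. code-block:: bash

    # Confirm the AIBrix control-plane pods are running
    kubectl get pods -n aibrix-system

    # Confirm the AIBrix gateway used in the steps below exists
    kubectl get gateway aibrix-eg -n aibrix-system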

Step 1: Deploy vLLM Semantic Router
========================================

Deploy the semantic router service with all required components:

.. code-block:: bash

    # Clone the semantic router repository
    git clone git@github.com:vllm-project/semantic-router.git && cd semantic-router

    # Deploy semantic router using Kustomize
    kubectl apply -k deploy/kubernetes/aibrix/semantic-router

    # Wait for the deployment to become ready (model downloads may take several minutes)
    kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s

    # Verify deployment status
    kubectl get pods -n vllm-semantic-router-system
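
If the ``kubectl wait`` times out, model downloads are the usual bottleneck. These generic checks (plain ``kubectl``, nothing specific to the semantic router) help confirm progress:

.. code-block:: bash

    # Stream logs from the semantic-router deployment to watch model downloads
    kubectl logs -n vllm-semantic-router-system deployment/semantic-router -f

    # Inspect pod events (image pulls, volume mounts, restarts)
    kubectl describe pods -n vllm-semantic-router-system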

Step 2: Deploy Demo LLM
========================================

Create a demo LLM to serve as the backend for the semantic router:

.. code-block:: bash

    # Deploy the demo LLM
    kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/base-model.yaml

    kubectl wait --timeout=2m -n default deployment/vllm-llama3-8b-instruct --for=condition=Available
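
To confirm the backend is actually serving before wiring up the gateway:

.. code-block:: bash

    # The deployment should report all replicas ready
    kubectl get deployment vllm-llama3-8b-instruct -n default

    # And its pod should be Running
    kubectl get pods -n default | grep vllm-llama3-8b-instruct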

Step 3: Create Gateway API Resources
========================================

Create the necessary Gateway API resources for the Envoy gateway:

.. code-block:: bash

    kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/gwapi-resources.yaml
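
To verify the resources were accepted, list the gateway and routes; the ``aibrix-eg`` gateway should be programmed (the exact route names depend on the contents of ``gwapi-resources.yaml``):

.. code-block:: bash

    # The AIBrix gateway that fronts the semantic router
    kubectl get gateway -n aibrix-system

    # HTTPRoutes created by gwapi-resources.yaml
    kubectl get httproute -A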

Testing the Deployment
========================================

Method 1: Port Forwarding (Recommended for Local Testing)
----------------------------------------------------------

Set up port forwarding to access the gateway locally:

.. code-block:: bash

    # Get the Envoy service name
    export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
        --selector=gateway.envoyproxy.io/owning-gateway-namespace=aibrix-system,gateway.envoyproxy.io/owning-gateway-name=aibrix-eg \
        -o jsonpath='{.items[0].metadata.name}')

    kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80
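
If the port-forward fails immediately, the most common cause is an empty ``ENVOY_SERVICE`` variable, meaning the selector matched no service:

.. code-block:: bash

    # Should print a non-empty Envoy service name
    echo "ENVOY_SERVICE=${ENVOY_SERVICE}"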

Send Test Requests
----------------------------------------------------------

Once the gateway is accessible, test the inference endpoint:

.. code-block:: bash

    # Test the chat completions endpoint with a math-domain prompt
    curl -i -X POST http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "MoM",
            "messages": [
                {"role": "user", "content": "What is the derivative of f(x) = x^3?"}
            ]
        }'

The response comes from the demo LLM and carries additional ``x-vsr-*`` headers injected by the semantic router, showing the selected category, reasoning mode, and model:

.. code-block:: http

    HTTP/1.1 200 OK
    server: fasthttp
    date: Thu, 06 Nov 2025 06:38:08 GMT
    content-type: application/json
    x-inference-pod: vllm-llama3-8b-instruct-984659dbb-gp5l9
    x-went-into-req-headers: true
    request-id: b46b6f7b-5645-470f-9868-0dd8b99a7163
    x-vsr-selected-category: math
    x-vsr-selected-reasoning: on
    x-vsr-selected-model: vllm-llama3-8b-instruct
    x-vsr-injected-system-prompt: true
    transfer-encoding: chunked

    {"id":"chatcmpl-f390a0c6-b38f-4a73-b019-9374a3c5d69b","created":1762411088,"model":"vllm-llama3-8b-instruct","usage":{"prompt_tokens":42,"completion_tokens":48,"total_tokens":90},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am your AI assistant, how can I help you today? To be or not to be that is the question. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Testing, testing 1,2,3"}}]}

Cleanup
========================================

To remove the entire deployment:

.. code-block:: bash

    # Remove the Gateway API resources and the demo LLM
    kubectl delete -f deploy/kubernetes/aibrix/aigw-resources

    # Remove the semantic router
    kubectl delete -k deploy/kubernetes/aibrix/semantic-router

    # Delete the kind cluster (if you created one for this walkthrough)
    kind delete cluster --name semantic-router-cluster

Next Steps
========================================

* Set up monitoring and observability
* Implement authentication and authorization
* Scale the semantic router deployment for production workloads

docs/source/index.rst (1 addition, 0 deletions)

@@ -59,6 +59,7 @@ Documentation
    features/kv-event-sync.rst
    features/benchmark-and-generator.rst
    features/multi-engine.rst
+   features/semantic-routing.rst

 .. toctree::
    :maxdepth: 1
