========================================
Intelligent Semantic Routing
========================================

What is vLLM Semantic Router?
========================================

vLLM Semantic Router is an AI-powered intelligent routing system for efficient LLM inference on a Mixture-of-Models (MoM). It operates as an Envoy External Processor that semantically routes OpenAI API-compatible requests to the most suitable backend model using advanced neural network technologies.

**Key Features:**

* **🧠 Intelligent Routing**: Powered by fine-tuned ModernBERT models for intent understanding, it analyzes context, intent, and complexity to route each request to the best LLM
* **🛡️ AI-Powered Security**: Advanced PII detection and prompt guarding identify and block jailbreak attempts, ensuring secure and responsible AI interactions
* **⚡ Semantic Caching**: A similarity cache stores semantic representations of prompts, dramatically reducing token usage and latency through smart content matching
* **🤖 Auto-Reasoning Engine**: Automatically analyzes request complexity, domain expertise requirements, and performance constraints to select the best model for each task
* **🔬 Real-time Analytics**: Comprehensive monitoring and analytics dashboard with neural network insights, model performance metrics, and visualization of routing decisions
* **🚀 Scalable Architecture**: Cloud-native design with distributed neural processing, auto-scaling capabilities, and seamless integration with existing LLM infrastructure
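The semantic caching idea above can be sketched in a few lines of Python. This is a hypothetical illustration, not the router's implementation: a toy bag-of-words vector stands in for the ModernBERT embeddings the real system uses, and a cached response is returned when cosine similarity between prompts crosses a threshold.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (the real router uses neural embeddings)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached response when a new prompt is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, prompt: str):
        query = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best and cosine(query, best[0]) >= self.threshold:
            return best[1]  # cache hit: the backend LLM is skipped entirely
        return None  # cache miss: forward to the backend, then store the answer

    def put(self, prompt: str, response: str):
        self.entries.append((embed(prompt), response))


cache = SemanticCache()
cache.put("what is the derivative of x^3", "3x^2")
print(cache.get("What is the derivative of x^3?"))  # near-identical prompt -> hit
```

A production cache would also bound its size and evict stale entries; the point here is only that lookups match on meaning-similarity rather than exact string equality.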
Benefits of Integration
========================================

Integrating vLLM Semantic Router with AIBrix provides several advantages:

**1. Intelligent Request Routing**

Semantic Router analyzes incoming requests and routes them to the most appropriate model based on content understanding, while AIBrix's gateway efficiently manages traffic distribution across model replicas.

**2. Enhanced Scalability**

AIBrix's autoscaler works seamlessly with Semantic Router to dynamically adjust resources based on routing patterns and real-time demand.

**3. Cost Optimization**

By combining Semantic Router's intelligent routing with AIBrix's heterogeneous serving capabilities, you can optimize GPU utilization and reduce infrastructure costs while maintaining SLO guarantees through per-token unit economics.

**4. Production-Ready Infrastructure**

AIBrix provides enterprise-grade features such as distributed KV cache, GPU failure detection, and unified runtime management, making it easier to deploy Semantic Router in production environments.

**5. Simplified Operations**

The integration leverages Kubernetes-native patterns and Gateway API resources, providing a familiar operational model for DevOps teams.
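The content-based routing described above can be sketched as follows. This is a deliberately simplified, hypothetical illustration: the real router classifies intent with fine-tuned ModernBERT models, whereas here a keyword lookup stands in for the classifier, and the category-to-model table is invented for the example.

```python
# Hypothetical stand-in for the neural intent classifier: keyword sets per category.
CATEGORY_KEYWORDS = {
    "math": {"derivative", "integral", "equation", "solve"},
    "code": {"python", "function", "compile", "bug"},
}

# Assumed category -> backend model table (illustrative names only).
MODEL_FOR_CATEGORY = {
    "math": "vllm-llama3-8b-instruct",
    "code": "vllm-llama3-8b-instruct",
    "general": "vllm-llama3-8b-instruct",
}


def classify(prompt: str) -> str:
    """Pick the first category whose keywords appear in the prompt."""
    words = set(prompt.lower().replace("?", "").split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return "general"


def route(prompt: str) -> dict:
    """Return the routing decision, shaped like the headers the router injects."""
    category = classify(prompt)
    return {
        "x-vsr-selected-category": category,
        "x-vsr-selected-model": MODEL_FOR_CATEGORY[category],
    }


print(route("What is the derivative of f(x) = x^3?"))
```

With several distinct backends in the table, requests in different domains would land on different models while AIBrix handles replica-level load balancing underneath.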
About vLLM AIBrix
========================================

`vLLM AIBrix <https://github.com/vllm-project/aibrix>`_ is an open-source initiative designed to provide essential building blocks for constructing scalable GenAI inference infrastructure. AIBrix delivers a cloud-native solution optimized for deploying, managing, and scaling large language model (LLM) inference, tailored specifically to enterprise needs.

Prerequisites
========================================

Before starting, ensure you have installed the AIBrix components.

Step 1: Deploy vLLM Semantic Router
========================================

Deploy the semantic router service with all required components:

.. code-block:: bash

   # Clone the semantic router repository
   git clone git@github.com:vllm-project/semantic-router.git && cd semantic-router

   # Deploy semantic router using Kustomize
   kubectl apply -k deploy/kubernetes/aibrix/semantic-router

   # Wait for the deployment to be ready (this may take several minutes for model downloads)
   kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s

   # Verify deployment status
   kubectl get pods -n vllm-semantic-router-system
67+
68+
Step 2: Deploy Demo LLM
69+
========================================
70+
71+
Create a demo LLM to serve as the backend for the semantic router:
72+
73+
.. code-block:: bash
74+
75+
# Deploy demo LLM
76+
kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/base-model.yaml
77+
78+
kubectl wait --timeout=2m -n default deployment/vllm-llama3-8b-instruct --for=condition=Available
79+
80+
Step 3: Create Gateway API Resources
81+
========================================
82+
83+
Create the necessary Gateway API resources for the envoy gateway:
84+
85+
.. code-block:: bash
86+
87+
kubectl apply -f deploy/kubernetes/aibrix/aigw-resources/gwapi-resources.yaml
88+
89+
Testing the Deployment
90+
========================================
91+
92+
Method 1: Port Forwarding (Recommended for Local Testing)
93+
----------------------------------------------------------
94+
95+
Set up port forwarding to access the gateway locally:
96+
97+
.. code-block:: bash
98+
99+
# Get the Envoy service name
100+
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
101+
--selector=gateway.envoyproxy.io/owning-gateway-namespace=aibrix-system,gateway.envoyproxy.io/owning-gateway-name=aibrix-eg \
102+
-o jsonpath='{.items[0].metadata.name}')
103+
104+
kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80
105+
106+
Send Test Requests
107+
----------------------------------------------------------
108+
109+
Once the gateway is accessible, test the inference endpoint:
110+
111+
.. code-block:: bash
112+
113+
# Test math domain chat completions endpoint
114+
curl -i -X POST http://localhost:8080/v1/chat/completions \
115+
-H "Content-Type: application/json" \
116+
-d '{
117+
"model": "MoM",
118+
"messages": [
119+
{"role": "user", "content": "What is the derivative of f(x) = x^3?"}
120+
]
121+
}'
You will see the response from the demo LLM, along with additional headers injected by the semantic router:

.. code-block:: http

   HTTP/1.1 200 OK
   server: fasthttp
   date: Thu, 06 Nov 2025 06:38:08 GMT
   content-type: application/json
   x-inference-pod: vllm-llama3-8b-instruct-984659dbb-gp5l9
   x-went-into-req-headers: true
   request-id: b46b6f7b-5645-470f-9868-0dd8b99a7163
   x-vsr-selected-category: math
   x-vsr-selected-reasoning: on
   x-vsr-selected-model: vllm-llama3-8b-instruct
   x-vsr-injected-system-prompt: true
   transfer-encoding: chunked

   {"id":"chatcmpl-f390a0c6-b38f-4a73-b019-9374a3c5d69b","created":1762411088,"model":"vllm-llama3-8b-instruct","usage":{"prompt_tokens":42,"completion_tokens":48,"total_tokens":90},"object":"chat.completion","do_remote_decode":false,"do_remote_prefill":false,"remote_block_ids":null,"remote_engine_id":"","remote_host":"","remote_port":0,"choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I am your AI assistant, how can I help you today? To be or not to be that is the question. Alas, poor Yorick! I knew him, Horatio: A fellow of infinite jest Testing, testing 1,2,3"}}]}
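If you script against the gateway, the routing decision can be recovered from those injected headers. A minimal sketch, using a hard-coded header dict in place of a live HTTP response:

```python
# Hard-coded stand-in for a real response's headers (values copied from the
# sample output above); in a real client these would come from the HTTP library.
headers = {
    "x-vsr-selected-category": "math",
    "x-vsr-selected-reasoning": "on",
    "x-vsr-selected-model": "vllm-llama3-8b-instruct",
    "x-vsr-injected-system-prompt": "true",
    "content-type": "application/json",
}


def routing_decision(headers: dict) -> dict:
    """Collect the x-vsr-* headers that describe the router's decision."""
    return {k: v for k, v in headers.items() if k.startswith("x-vsr-")}


decision = routing_decision(headers)
print(decision["x-vsr-selected-model"])  # -> vllm-llama3-8b-instruct
```

Logging these fields per request makes it easy to audit which category and model the router chose, and whether a system prompt was injected.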
Cleanup
========================================

To remove the entire deployment:

.. code-block:: bash

   # Remove Gateway API resources and the demo LLM
   kubectl delete -f deploy/kubernetes/aibrix/aigw-resources

   # Remove the semantic router
   kubectl delete -k deploy/kubernetes/aibrix/semantic-router

   # Delete the kind cluster (if you created one for this walkthrough)
   kind delete cluster --name semantic-router-cluster

Next Steps
========================================

* Set up monitoring and observability
* Implement authentication and authorization
* Scale the semantic router deployment for production workloads
