
Commit 73028e0

Triton TensorRT Falcon Ensemble BYOC Model Deployment
1 parent 8a69362 commit 73028e0

10 files changed: +1239 -0 lines changed
Lines changed: 18 additions & 0 deletions
FROM nvcr.io/nvidia/tritonserver:22.02-py3

# Report container health based on Triton's readiness endpoint
HEALTHCHECK --start-period=15m --interval=30s --timeout=3s \
  CMD curl -f localhost:5000/v2/health/ready || exit 1

# Add NVIDIA CUDA repository key
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

# Install python 3.7
RUN add-apt-repository ppa:deadsnakes/ppa && apt-get install -y libpython3.7-dev

# Python dependencies used by the preprocessing/postprocessing models
RUN pip install transformers
RUN pip install tensorflow

WORKDIR /opt/ds/model/deployed_model
COPY entrypoint.sh /
ENTRYPOINT []
RUN chmod +x /entrypoint.sh
Lines changed: 137 additions & 0 deletions
# Deploy TensorRT model in NVIDIA Triton Inference Server

This document provides a walkthrough for deploying a Falcon TensorRT ensemble model to an NVIDIA [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) using OCI Data Science Model Deployment's custom containers (BYOC) support.

The sample used here is based on [Triton's inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/47b609b670d6bb33a5ff113d98ad8a44d961c5c6/all_models/inflight_batcher_llm).

The Falcon TensorRT engine files need to be built using [TensorRT-LLM/examples/falcon](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/falcon).
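Building the engines is covered by that example; the following is only a hypothetical sketch of what the invocation might look like (the checkpoint path, output path, and the flags shown are assumptions to be checked against the example's README):

```
# Hypothetical sketch only: invoke the TensorRT-LLM 0.5.0 falcon example's build script.
# The paths and flags below are placeholders; consult TensorRT-LLM/examples/falcon
# for the options that match your checkpoint and target precision.
import subprocess

subprocess.run(
    [
        "python", "TensorRT-LLM/examples/falcon/build.py",
        "--model_dir", "./falcon-7b-instruct",      # assumed Hugging Face checkpoint directory
        "--dtype", "bfloat16",                      # assumed precision
        "--output_dir", "./falcon_engines/1-gpu",   # assumed destination for the engine files
    ],
    check=True,
)
```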
## Step 1: Set up Triton Inference Server

### Step 1.1: Create Model Artifact

To use Triton, we need to build a model repository. The structure of the repository is as follows:
```
model_repository
|
+-- ensemble
    |
    +-- config.pbtxt
    +-- 1
+-- postprocessing
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.py
+-- preprocessing
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.py
+-- tensorrt_llm
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.py
```
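Since this layout mirrors Triton's inflight_batcher_llm sample, one way to assemble it is to copy that sample into a local `model_repository` folder and then edit its templates. A minimal sketch, assuming `tensorrtllm_backend` has been cloned into the working directory:

```
# Sketch: assemble the model repository from the inflight_batcher_llm sample.
# The clone location and target folder are assumptions; adjust paths as needed.
import shutil
from pathlib import Path

sample = Path("tensorrtllm_backend/all_models/inflight_batcher_llm")  # assumed clone location
repo = Path("model_repository")

# Copy the ensemble, preprocessing, postprocessing and tensorrt_llm model folders
for model in ("ensemble", "preprocessing", "postprocessing", "tensorrt_llm"):
    shutil.copytree(sample / model, repo / model, dirs_exist_ok=True)

# The copied config.pbtxt files are templates: update tokenizer paths, engine
# locations, and batch sizes to match the Falcon engines built earlier.
```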
### Step 1.2 Upload NVIDIA base Triton server image to OCI Container Registry

```
docker login $(OCIR_REGION).ocir.io
mkdir -p tritonServer
cd tritonServer
git clone https://github.com/triton-inference-server/server.git -b v2.30.0 --depth 1
cd server
python compose.py --backend onnxruntime --repoagent checksum --output-name $(OCIR_REGION).ocir.io/$(OCIR_NAMESPACE)/oci-datascience-triton-server/onnx-runtime:1.0.0
docker push $(OCIR_REGION).ocir.io/$(OCIR_NAMESPACE)/oci-datascience-triton-server/onnx-runtime:1.0.0
```
### Step 1.3 Upload model artifact to Model Catalog

Compress the model_repository folder created in Step 1.1 into a zip archive and upload it to the model catalog. Refer to https://docs.oracle.com/en-us/iaas/data-science/using/models_saving_catalog.htm for details.
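The compression and upload can also be scripted with the OCI Python SDK; a minimal sketch (the display name is arbitrary and `<compartment_id>`/`<project_id>` are placeholders):

```
# Sketch: zip the model repository and upload it to the model catalog.
import shutil
import oci

# Produces model_repository.zip with the model_repository folder at its root
shutil.make_archive("model_repository", "zip", root_dir=".", base_dir="model_repository")

config = oci.config.from_file()
ds_client = oci.data_science.DataScienceClient(config)

# Create the model catalog entry
model = ds_client.create_model(
    oci.data_science.models.CreateModelDetails(
        compartment_id="<compartment_id>",
        project_id="<project_id>",
        display_name="falcon-triton-tensorrt",
    )
).data

# Attach the zipped model repository as the model artifact
with open("model_repository.zip", "rb") as artifact:
    ds_client.create_model_artifact(
        model.id,
        artifact,
        content_disposition='attachment; filename="model_repository.zip"',
    )
```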
### Step 1.4 Create Model Deployment

OCI Data Science Model Deployment has dedicated support for Triton images: it maps the service-mandated endpoints to Triton's inference and health HTTP/REST endpoints, which makes the Triton image easier to manage. To enable this support, set the following environment variable when creating the model deployment:
```
CONTAINER_TYPE = TRITON
```

#### Using the Python SDK
```
# Import the request/response model classes from the OCI Python SDK
from oci.data_science.models import (
    ModelConfigurationDetails,
    OcirModelDeploymentEnvironmentConfigurationDetails,
    SingleModelDeploymentConfigurationDetails,
    CreateModelDeploymentDetails,
)

# Create a model configuration details object
model_config_details = ModelConfigurationDetails(
    model_id=<model_id>,
    bandwidth_mbps=<bandwidth_mbps>,
    instance_configuration=<instance_configuration>,
    scaling_policy=<scaling_policy>
)

# Create the container environment configuration
environment_config_details = OcirModelDeploymentEnvironmentConfigurationDetails(
    environment_configuration_type="OCIR_CONTAINER",
    environment_variables={'CONTAINER_TYPE': 'TRITON'},
    image="iad.ocir.io/testtenancy/oci-datascience-triton-server/triton-tensorrt:1.1",
    image_digest=<image_digest>,
    cmd=[
        "tritonserver",
        "--model-repository=/opt/ds/model/deployed_model/model_repository"
    ],
    server_port=8000,
    health_check_port=8000
)

# Create a single-model deployment configuration
single_model_deployment_config_details = SingleModelDeploymentConfigurationDetails(
    deployment_type="SINGLE_MODEL",
    model_configuration_details=model_config_details,
    environment_configuration_details=environment_config_details
)

# Set up the parameters required to create a new model deployment
create_model_deployment_details = CreateModelDeploymentDetails(
    display_name=<deployment_name>,
    model_deployment_configuration_details=single_model_deployment_config_details,
    compartment_id=<compartment_id>,
    project_id=<project_id>
)
```
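To actually submit the request, pass the details object built above to a `DataScienceClient`; a minimal sketch (waiting for the deployment to become active is omitted):

```
import oci

config = oci.config.from_file()
ds_client = oci.data_science.DataScienceClient(config)

# Submit the model deployment request defined by create_model_deployment_details above
model_deployment = ds_client.create_model_deployment(create_model_deployment_details).data
print(model_deployment.id, model_deployment.lifecycle_state)
```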
## Step 2: Using Python SDK to query the Inference Server

Specify the JSON inference payload for the ensemble model, including the input text and the generation parameters it expects:
```
import json

request_body = {"text_input": "Explain Cloud Computing to a school kid", "max_tokens": 30, "bad_words": ["now", "process"], "stop_words": [""], "top_k": 20, "top_p": 1, "end_id": 3, "pad_id": 2}
request_body = json.dumps(request_body)
```
Specify the request headers indicating model name and version:
```
request_headers = {"model_name": "ensemble", "model_version": "1"}
```
Now, you can send an inference request to the Triton Inference Server:
```
# The OCI SDK must be installed for this example to function properly.
# Installation instructions can be found here: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/pythonsdk.htm

import requests
import oci
from oci.signer import Signer

config = oci.config.from_file("~/.oci/config")  # replace with the location of your oci config file
auth = Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config['pass_phrase'])

endpoint = <modelDeploymentEndpoint>

inference_output = requests.request('POST', endpoint, data=request_body, auth=auth, headers=request_headers).json()['outputs'][0]['data'][:5]
```
Lines changed: 7 additions & 0 deletions
#!/bin/bash
# Positional args: $1 = model repository path, $2 = model control mode, $3 = HTTP port
echo "tritonserver: model_repository_path: $1"
echo "tritonserver: mode: $2"
echo "tritonserver: http-port: $3"

exec /opt/tritonserver/bin/tritonserver --model-repository="$1" --model-control-mode="$2" --http-port="$3" --allow-gpu-metrics=false
Lines changed: 220 additions & 0 deletions
name: "ensemble"
platform: "ensemble"
max_batch_size: 1024
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "max_tokens"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "bad_words"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "stop_words"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "end_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "length_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "output_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1, -1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map {
        key: "QUERY"
        value: "text_input"
      }
      input_map {
        key: "REQUEST_OUTPUT_LEN"
        value: "max_tokens"
      }
      input_map {
        key: "BAD_WORDS_DICT"
        value: "bad_words"
      }
      input_map {
        key: "STOP_WORDS_DICT"
        value: "stop_words"
      }
      output_map {
        key: "REQUEST_INPUT_LEN"
        value: "_REQUEST_INPUT_LEN"
      }
      output_map {
        key: "INPUT_ID"
        value: "_INPUT_ID"
      }
      output_map {
        key: "REQUEST_OUTPUT_LEN"
        value: "_REQUEST_OUTPUT_LEN"
      }
    },
    {
      model_name: "tensorrt_llm"
      model_version: -1
      input_map {
        key: "input_ids"
        value: "_INPUT_ID"
      }
      input_map {
        key: "input_lengths"
        value: "_REQUEST_INPUT_LEN"
      }
      input_map {
        key: "request_output_len"
        value: "_REQUEST_OUTPUT_LEN"
      }
      input_map {
        key: "end_id"
        value: "end_id"
      }
      input_map {
        key: "pad_id"
        value: "pad_id"
      }
      input_map {
        key: "runtime_top_k"
        value: "top_k"
      }
      input_map {
        key: "runtime_top_p"
        value: "top_p"
      }
      input_map {
        key: "temperature"
        value: "temperature"
      }
      input_map {
        key: "len_penalty"
        value: "length_penalty"
      }
      input_map {
        key: "repetition_penalty"
        value: "repetition_penalty"
      }
      input_map {
        key: "min_length"
        value: "min_length"
      }
      input_map {
        key: "presence_penalty"
        value: "presence_penalty"
      }
      input_map {
        key: "random_seed"
        value: "random_seed"
      }
      input_map {
        key: "beam_width"
        value: "beam_width"
      }
      input_map {
        key: "output_log_probs"
        value: "output_log_probs"
      }
      output_map {
        key: "output_ids"
        value: "_TOKENS_BATCH"
      }
    },
    {
      model_name: "postprocessing"
      model_version: -1
      input_map {
        key: "TOKENS_BATCH"
        value: "_TOKENS_BATCH"
      }
      output_map {
        key: "OUTPUT"
        value: "text_output"
      }
    }
  ]
}
