Commit 0c1074b: Merge pull request #371 from gargnipungarg/master

Triton TensorRT Falcon Ensemble BYOC Model Deployment

2 parents 9cb41f1 + 4dddb27, 9 files changed: +1323 additions, 0 deletions
Lines changed: 239 additions & 0 deletions
# Deploy a TensorRT model on NVIDIA Triton Inference Server

This document provides a walkthrough for deploying a Falcon TensorRT ensemble model on the NVIDIA [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) using OCI Data Science Model Deployment's bring your own container [(BYOC)](https://docs.oracle.com/en-us/iaas/data-science/using/mod-dep-byoc.htm) support.

The sample used here is based on [Triton's inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/47b609b670d6bb33a5ff113d98ad8a44d961c5c6/all_models/inflight_batcher_llm).

The Falcon model's TensorRT engine files need to be built using [TensorRT-LLM/examples/falcon](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/falcon).

# Prerequisites

## Hardware Requirements

| GPU | Total GPU Memory | Price per hour (pay-as-you-go, on-demand, Nov 2023) |
| :-----: | :---: | :---: |
| VM.GPU.A10.1 (1x NVIDIA A10 Tensor Core) | 24GB | $2 |
| VM.GPU.A10.2 (2x NVIDIA A10 Tensor Core) | 48GB (2x 24GB) | $4 ($2 per node per hour) |
| BM.GPU.A10.4 (4x NVIDIA A10 Tensor Core) | 96GB (4x 24GB) | $8 ($2 per node per hour) |
| BM.GPU4.8 (8x NVIDIA A100 40GB Tensor Core) | 320GB (8x 40GB) | $24.40 ($3.05 per node per hour) |
| BM.GPU.H100.8 (8x NVIDIA H100 80GB Tensor Core) | 640GB (8x 80GB) | $80 ($10 per node per hour) |

## Environment Setup
Start by building a docker image that contains the tool chain required to convert a Hugging Face hosted model into a TensorRT-LLM compatible artifact. The same docker image can also be used to deploy the model for evaluation.

### Step 1 - Build the base image
```bash
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend/
git submodule update --init --recursive
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
```

### Step 2 - Download the model weights
```bash
git lfs install
git clone https://huggingface.co/tiiuae/falcon-180B
```
### Step 3 - Compiling the Model
The next phase converts the model into a TensorRT engine. This requires both the model weights and a model definition written with the TensorRT-LLM Python API. The TensorRT-LLM repository ships an extensive selection of pre-built model definitions; for this walkthrough we use the provided Falcon model definition rather than creating a custom one, which also serves as a basic illustration of some of the optimizations TensorRT-LLM offers.
```bash
# -v /falcon-180B/:/model:Z mounts the downloaded model directory into the tooling container
docker run --gpus=all --shm-size=1g -v /falcon-180B/:/model:Z -it triton_trt_llm bash

# Inside the container, run the following command
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.2/compat/lib.real/
```
### Step 4 - Perform the next two steps only if quantisation is required for your model
Install the quantisation tooling dependencies:

```bash
# Run the following commands at the container prompt
cuda_version=$(nvcc --version | grep 'release' | awk '{print $6}' | awk -F'[V.]' '{print $2$3}')
python_version=$(python3 --version 2>&1 | awk '{print $2}' | awk -F. '{print $1$2}')

# Download the AMMO framework required for quantization
wget https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.3.0.tar.gz
tar -xzf nvidia_ammo-0.3.0.tar.gz
pip install nvidia_ammo-0.3.0/nvidia_ammo-0.3.0+cu$cuda_version-cp$python_version-cp$python_version-linux_x86_64.whl
```
### Step 5 - Apply Quantisation and TensorRT-LLM conversion
```bash
# Run the following commands at the container prompt
cd tensorrt_llm/examples/falcon/

# Install model-specific dependencies
pip install -r requirements.txt

# Apply quantisation
python quantize.py --model_dir /model \
                   --dtype float16 \
                   --qformat fp8 \
                   --export_path quantized_fp8 \
                   --calib_size 16

# Apply the TensorRT-LLM conversion
# Build Falcon 180B with TP=8 using the HF checkpoint + PTQ scaling factors from the single-rank checkpoint
python build.py --model_dir /model \
                --quantized_fp8_model_path ./quantized_fp8/falcon_tp1_rank0.npz \
                --dtype float16 \
                --enable_context_fmha \
                --use_gpt_attention_plugin float16 \
                --output_dir falcon/180b/trt_engines/fp8/8-gpu/ \
                --remove_input_padding \
                --enable_fp8 \
                --fp8_kv_cache \
                --strongly_typed \
                --world_size 8 \
                --tp_size 8 \
                --load_by_shard \
                --parallel_build

# --tp_size 8 sets tensor parallelism to 8
# --output_dir falcon/180b/trt_engines/fp8/8-gpu/ is the location where the converted model artifacts are written
```
# Model Deployment
Set up a Triton Inference Server compliant docker image and model artifact.

### Step 1: Create Model Artifact
To use Triton, we need to build a model repository. The structure of the repository is as follows:
```
model_repository
|
+-- ensemble
    |
    +-- config.pbtxt
    +-- 1
+-- postprocessing
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.py
+-- preprocessing
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.py
+-- tensorrt_llm
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.py
```
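
The folders above mirror the layout of the inflight_batcher_llm sample referenced earlier. As a minimal sketch of how the repository can be assembled, assuming the clone location from Step 1 of Environment Setup and the engine output directory from Step 5 (adjust these paths to your environment), the sample model folders are copied in and the compiled engines are placed under tensorrt_llm/1:

```python
# Sketch: assemble the Triton model repository from the inflight_batcher_llm
# sample and the TensorRT engines built earlier. Paths are assumptions.
import shutil
from pathlib import Path

backend_repo = Path("tensorrtllm_backend")              # cloned in Step 1 of Environment Setup
engine_dir = Path("falcon/180b/trt_engines/fp8/8-gpu")  # --output_dir passed to build.py
repo = Path("model_repository")

# Copy the ensemble, preprocessing, postprocessing and tensorrt_llm folders.
shutil.copytree(backend_repo / "all_models" / "inflight_batcher_llm", repo, dirs_exist_ok=True)

# Place the compiled engine files under tensorrt_llm/1 so Triton can load them.
target = repo / "tensorrt_llm" / "1"
target.mkdir(parents=True, exist_ok=True)
for engine_file in engine_dir.iterdir():
    if engine_file.is_file():
        shutil.copy2(engine_file, target)
```

Depending on the tensorrtllm_backend version, the config.pbtxt files copied from the sample may contain template parameters (engine directory, batch size, and so on) that still need to be filled in for your deployment.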
### Step 2: Upload model artifact to Model catalog
Zip the model_repository folder into model_artifact.zip and follow the guidelines in the [Readme step](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/model-deployment/containers/llama2/README.md#one-time-download-to-oci-model-catalog) to create a model catalog item with model_artifact.zip.
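
If you prefer to script this step, the following is a minimal, hedged sketch of creating the catalog entry and uploading the artifact with the OCI Python SDK; the display name is illustrative, and <compartment_id> and <project_id> are placeholders you must supply.

```python
# Sketch: create a model catalog entry and upload model_artifact.zip.
import oci

config = oci.config.from_file("~/.oci/config")
ds_client = oci.data_science.DataScienceClient(config)

# Create the model catalog entry.
model = ds_client.create_model(
    oci.data_science.models.CreateModelDetails(
        compartment_id=<compartment_id>,
        project_id=<project_id>,
        display_name="falcon-trt-ensemble",  # illustrative name
    )
).data

# Attach the zipped model repository as the model artifact.
with open("model_artifact.zip", "rb") as artifact:
    ds_client.create_model_artifact(
        model.id,
        artifact,
        content_disposition="attachment; filename=model_artifact.zip",
    )
```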
### Step 3: Upload the NVIDIA base Triton server image to OCI Container Registry

```bash
docker login ${OCIR_REGION}.ocir.io
mkdir -p tritonServer
cd tritonServer
git clone https://github.com/triton-inference-server/server.git -b v2.30.0 --depth 1
cd server
python compose.py --backend onnxruntime --repoagent checksum --output-name ${OCIR_REGION}.ocir.io/${OCIR_NAMESPACE}/oci-datascience-triton-server/onnx-runtime:1.0.0
docker push ${OCIR_REGION}.ocir.io/${OCIR_NAMESPACE}/oci-datascience-triton-server/onnx-runtime:1.0.0
```
### Step 4: Create Model Deployment
OCI Data Science Model Deployment has dedicated support for the Triton image, making it easier to manage by mapping the service-mandated endpoints to Triton's inference and health HTTP/REST endpoints. To enable this support, set the following environment variable when creating the Model Deployment:
```bash
CONTAINER_TYPE = TRITON
```
#### Using the Python SDK
```python
from oci.data_science.models import (
    ModelConfigurationDetails,
    OcirModelDeploymentEnvironmentConfigurationDetails,
    SingleModelDeploymentConfigurationDetails,
    CreateModelDeploymentDetails,
)

# Create a model configuration details object
model_config_details = ModelConfigurationDetails(
    model_id=<model_id>,
    bandwidth_mbps=<bandwidth_mbps>,
    instance_configuration=<instance_configuration>,
    scaling_policy=<scaling_policy>
)

# Create the container environment configuration
environment_config_details = OcirModelDeploymentEnvironmentConfigurationDetails(
    environment_configuration_type="OCIR_CONTAINER",
    environment_variables={'CONTAINER_TYPE': 'TRITON'},
    image="iad.ocir.io/testtenancy/oci-datascience-triton-server/triton-tensorrt:1.1",
    image_digest=<image_digest>,
    cmd=[
        "tritonserver",
        "--model-repository=/opt/ds/model/deployed_model/model_repository"
    ],
    server_port=8000,
    health_check_port=8000
)

# Create a single-model deployment configuration
single_model_deployment_config_details = SingleModelDeploymentConfigurationDetails(
    deployment_type="SINGLE_MODEL",
    model_configuration_details=model_config_details,
    environment_configuration_details=environment_config_details
)

# Set up the parameters required to create a new model deployment
create_model_deployment_details = CreateModelDeploymentDetails(
    display_name=<deployment_name>,
    model_deployment_configuration_details=single_model_deployment_config_details,
    compartment_id=<compartment_id>,
    project_id=<project_id>
)
```
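
The block above only constructs the request payload. A minimal sketch of submitting it with the OCI Python SDK, assuming a standard ~/.oci/config profile, looks like this:

```python
# Submit the deployment request with the Data Science client.
import oci

config = oci.config.from_file("~/.oci/config")
ds_client = oci.data_science.DataScienceClient(config)

response = ds_client.create_model_deployment(
    create_model_deployment_details=create_model_deployment_details
)
model_deployment = response.data
print(model_deployment.id, model_deployment.lifecycle_state)
```

You can then poll get_model_deployment until the lifecycle state reaches ACTIVE before sending inference traffic.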

# Testing the model

## Using the Python SDK to query the Inference Server

Specify the JSON inference payload for the ensemble model, including the input text and the generation parameters it expects:
```python
import json

request_body = {"text_input": "Explain Cloud Computing to a school kid", "max_tokens": 30, "bad_words": ["now", "process"], "stop_words": [""], "top_k": 20, "top_p": 1, "end_id": 3, "pad_id": 2}
request_body = json.dumps(request_body)
```
Specify the request headers indicating the model name and version:
```python
request_headers = {"model_name": "ensemble", "model_version": "1"}
```
Now you can send an inference request to the Triton Inference Server:
```python
# The OCI SDK must be installed for this example to function properly.
# Installation instructions can be found here: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/pythonsdk.htm

import requests
import oci
from oci.signer import Signer

config = oci.config.from_file("~/.oci/config")  # replace with the location of your oci config file
auth = Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config['pass_phrase'])

endpoint = <modelDeploymentEndpoint>

inference_output = requests.request('POST', endpoint, data=request_body, auth=auth, headers=request_headers).json()['outputs'][0]['data'][:5]
```
## Testing via the OCI CLI, once the model deployment becomes active (green)
```bash
oci raw-request --http-method POST --target-uri https://<MODEL_DEPLOYMENT_URL>/predict --request-body '{"text_input": "What is the height of Eiffel Tower?", "max_tokens": 30, "bad_words": [""], "stop_words": [""], "top_k":10, "top_p":1, "end_id": 3, "pad_id": 2}'
```
## Testing locally on the host: start the Triton server and make an HTTP request
```bash
# Run the following commands at the container prompt (inside the tool chain container)
tritonserver --model-repository <location of the model_repository folder containing ensemble, preprocessing, postprocessing and tensorrt_llm>

curl -X POST localhost:8000/v2/models/ensemble/versions/1/generate -d '{"text_input": "What is the height of Eiffel Tower?", "max_tokens": 30, "bad_words": [""], "stop_words": [""], "top_k":10, "top_p":1, "end_id": 3, "pad_id": 2}'
```
Lines changed: 7 additions & 0 deletions

The second file added by the commit is a small shell entrypoint script that starts Triton with a configurable model repository path, model control mode and HTTP port:

```bash
#!/bin/bash
echo "tritonserver:model_repository_path : $1"
echo "tritonserver:mode : $2"
echo "tritonserver:http-port : $3"

exec /opt/tritonserver/bin/tritonserver --model-repository="$1" --model-control-mode="$2" --http-port="$3" --allow-gpu-metrics=false
```
