Commit b504894: Update readme with step details (parent 73028e0)

1 file changed: model-deployment/containers/Triton_TensorRT/Readme.md (+118, -17 lines)

# Deploy TensorRT model in NVIDIA Triton Inference Server

This document provides a walkthrough for deploying a Falcon TensorRT ensemble model with an NVIDIA [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) using OCI Data Science Model Deployment's bring your own container [(BYOC)](https://docs.oracle.com/en-us/iaas/data-science/using/mod-dep-byoc.htm) support.

The sample used here is based on [Triton's inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/47b609b670d6bb33a5ff113d98ad8a44d961c5c6/all_models/inflight_batcher_llm).

The Falcon model TensorRT engine files need to be built using [TensorRT-LLM/examples/falcon](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/falcon).

# Prerequisites

## Hardware Requirements

| Shape | Total GPU Memory | Price per hour (Pay-as-you-go, on-demand, Nov 2023) |
| :-----: | :---: | :---: |
| VM.GPU.A10.1 (1x NVIDIA A10 Tensor Core) | 24GB | $2 |
| VM.GPU.A10.2 (2x NVIDIA A10 Tensor Core) | 48GB (2x 24GB) | $4 ($2 per node per hour) |
| BM.GPU.A10.4 (4x NVIDIA A10 Tensor Core) | 96GB (4x 24GB) | $8 ($2 per node per hour) |
| BM.GPU4.8 (8x NVIDIA A100 40GB Tensor Core) | 320GB (8x 40GB) | $24.4 ($3.05 per node per hour) |
| BM.GPU.H100.8 (8x NVIDIA H100 80GB Tensor Core) | 640GB (8x 80GB) | $80 ($10 per node per hour) |
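
As a rough guide when choosing a shape, the memory needed just to hold the model weights can be estimated from the parameter count and the numeric precision. The sketch below is a back-of-the-envelope estimate only: it ignores the KV cache, activations, and runtime overhead, and the 180B parameter count simply refers to the Falcon-180B checkpoint used in this walkthrough.

```python
# Back-of-the-envelope estimate of the GPU memory needed to hold model weights alone.
# Real usage is higher: KV cache, activations, and runtime buffers add on top of this.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1, "int8": 1}

def weight_memory_gb(num_params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB for a given parameter count and precision."""
    return num_params_billion * BYTES_PER_PARAM[dtype]

# Falcon-180B quantised to fp8 needs roughly 180 GB for the weights alone, which is why
# multi-GPU shapes such as BM.GPU4.8 (320 GB) or BM.GPU.H100.8 (640 GB) are used here.
print(weight_memory_gb(180, "fp8"))   # ~180 GB
print(weight_memory_gb(180, "fp16"))  # ~360 GB
```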

## Environment Setup
Start by building a docker image that contains the tool chain required to convert a Hugging Face hosted model into a TensorRT-LLM compatible artifact. The same docker image can also be used to deploy the model for evaluation:

### Step 1 - Build the base image
```bash
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend/
git submodule update --init --recursive
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
```

### Step 2 - Download the model weights
```bash
git lfs install
git clone https://huggingface.co/tiiuae/falcon-180B
```
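
As an optional alternative to `git lfs`, the same checkpoint can be pulled with the `huggingface_hub` Python client. Falcon-180B is a gated repository, so the access token shown here is a placeholder, and the local directory is chosen to match the path mounted into the container in the next step.

```python
# Optional alternative to git lfs: download the checkpoint with huggingface_hub.
# Falcon-180B is gated on the Hugging Face Hub, so a valid access token is assumed.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tiiuae/falcon-180B",
    local_dir="/falcon-180B",         # matches the -v /falcon-180B/:/model:Z mount used below
    token="<YOUR_HF_ACCESS_TOKEN>",   # placeholder; replace with your own token
)
```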

### Step 3 - Compiling the Model
The next phase involves converting the model into a TensorRT engine. This requires both the model weights and a model definition crafted using the TensorRT-LLM Python API. The TensorRT-LLM repository contains an extensive selection of pre-established model definitions. For the purposes of this walkthrough, we'll use the provided Falcon model definition rather than creating a custom one. This serves as a basic illustration of some of the optimizations that TensorRT-LLM offers.
```bash
# -v /falcon-180B/:/model:Z mounts the downloaded model directory into the tooling container
docker run --gpus=all --shm-size=1g -v /falcon-180B/:/model:Z -it triton_trt_llm bash

# Inside the container, run the following command
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.2/compat/lib.real/
```

### Step 4 - Install quantisation tooling dependencies
Perform Steps 4 and 5 only if quantisation is required for your model.

```bash
# Run the following commands in the container prompt
cuda_version=$(nvcc --version | grep 'release' | awk '{print $6}' | awk -F'[V.]' '{print $2$3}')
python_version=$(python3 --version 2>&1 | awk '{print $2}' | awk -F. '{print $1$2}')

# Download the ammo framework required for quantization
wget https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.3.0.tar.gz
tar -xzf nvidia_ammo-0.3.0.tar.gz
pip install nvidia_ammo-0.3.0/nvidia_ammo-0.3.0+cu$cuda_version-cp$python_version-cp$python_version-linux_x86_64.whl
```

### Step 5 - Apply quantisation and TensorRT-LLM conversion
```bash
# Run the following commands in the container prompt
cd tensorrt_llm/examples/falcon/
# Install model specific dependencies
pip install -r requirements.txt

# Apply quantisation
python quantize.py --model_dir /model \
                   --dtype float16 \
                   --qformat fp8 \
                   --export_path quantized_fp8 \
                   --calib_size 16

# Apply TensorRT-LLM conversion
# Build Falcon 180B TP=8 using the HF checkpoint + PTQ scaling factors from the single-rank checkpoint
python build.py --model_dir /model \
                --quantized_fp8_model_path ./quantized_fp8/falcon_tp1_rank0.npz \
                --dtype float16 \
                --enable_context_fmha \
                --use_gpt_attention_plugin float16 \
                --output_dir falcon/180b/trt_engines/fp8/8-gpu/ \
                --remove_input_padding \
                --enable_fp8 \
                --fp8_kv_cache \
                --strongly_typed \
                --world_size 8 \
                --tp_size 8 \
                --load_by_shard \
                --parallel_build

# --tp_size 8 indicates tensor parallelism of 8
# --output_dir falcon/180b/trt_engines/fp8/8-gpu/ is where the converted model artifacts will be placed
```
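
As a quick sanity check after the build, TensorRT-LLM typically writes one engine file per tensor-parallel rank into the output directory. The sketch below simply counts those files; the directory path and the expected count of eight are taken from the `build.py` flags above and will differ for other configurations.

```python
# Sanity-check the build output: expect one engine file per tensor-parallel rank.
from pathlib import Path

output_dir = Path("falcon/180b/trt_engines/fp8/8-gpu")  # --output_dir passed to build.py above
tp_size = 8                                             # --tp_size passed to build.py above

engines = sorted(output_dir.glob("*.engine"))           # engine file naming can vary by version
print(f"Found {len(engines)} engine file(s) in {output_dir}:")
for engine in engines:
    print(" -", engine.name)

assert len(engines) == tp_size, "expected one engine per rank"
```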

# Model Deployment

Set up a Triton Inference Server compliant docker image and model artifact.

### Step 1: Create Model Artifact
To use Triton, we need to build a model repository. The structure of the repository is as follows:
```
model_repository
...
+-- model.py
```
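
The `model.py` files in the repository are Triton Python-backend models (in this sample, the preprocessing and postprocessing steps that wrap the tokenizer). The skeleton below is only a minimal sketch of the interface the Python backend expects, not the sample's actual logic; the tensor names are illustrative.

```python
# Minimal skeleton of a Triton Python-backend model (model.py).
# The sample's real preprocessing/postprocessing models wrap the tokenizer;
# the tensor names "INPUT" and "OUTPUT" below are illustrative only.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args carries the model name, config, instance kind, etc.
        # Load tokenizers or other resources here.
        self.model_name = args["model_name"]

    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            # Pass the input through unchanged; real models transform it here.
            out_tensor = pb_utils.Tensor("OUTPUT", in_tensor.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        # Optional cleanup when the model is unloaded.
        pass
```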

### Step 2: Upload model artifact to Model catalog
Zip the model_repository folder into model_artifact.zip and follow the guidelines in this [Readme step](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/model-deployment/containers/llama2/README.md#one-time-download-to-oci-model-catalog) to create a model catalog item with model_artifact.zip.
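
If you prefer to script this step, the archive can be created with Python's standard library; the sketch assumes it is run from the directory that contains model_repository.

```python
# Create model_artifact.zip from the model_repository folder built in Step 1.
import shutil

# Produces model_artifact.zip in the current directory, with model_repository/ inside it.
shutil.make_archive("model_artifact", "zip", root_dir=".", base_dir="model_repository")
```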

### Step 3: Upload NVIDIA base triton server image to OCI Container Registry

```bash
docker login $(OCIR_REGION).ocir.io
mkdir -p tritonServer
cd tritonServer
...
python compose.py --backend onnxruntime --repoagent checksum --output-name $(OCI
docker push $(OCIR_REGION).ocir.io/$(OCIR_NAMESPACE)/oci-datascience-triton-server/onnx-runtime:1.0.0
```

### Step 4: Create Model Deployment
OCI Data Science Model Deployment has dedicated support for the Triton image, which makes the image easier to manage by mapping the service-mandated endpoints to Triton's inference and health HTTP/REST endpoints. To enable this support, set the following environment variable when creating the Model Deployment:
```bash
CONTAINER_TYPE = TRITON
```

#### Using Python SDK
```python
# Create a model configuration details object
model_config_details = ModelConfigurationDetails(
    model_id= <model_id>,
    ...

create_model_deployment_details = CreateModelDeploymentDetails(
    ...
)
```

# Testing the model

## Using Python SDK to query the Inference Server

Specify the JSON inference payload with the input and output layers for the model, as well as the shape and datatype of the expected input and output:
```python
import json

...
request_body = json.dumps(request_body)
```
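
For example, a payload in the KServe v2 "inputs" format for the ensemble model might look like the sketch below. The tensor names, shapes, and datatypes are assumptions based on the generate-endpoint example at the end of this document; confirm them against the config.pbtxt files in your model_repository.

```python
import json

# Illustrative payload for the "ensemble" model in the KServe v2 inference format.
# Tensor names, shapes, and datatypes are assumptions; check the ensemble's config.pbtxt.
request_body = {
    "inputs": [
        {"name": "text_input", "shape": [1, 1], "datatype": "BYTES",
         "data": ["What is the height of Eiffel Tower?"]},
        {"name": "max_tokens", "shape": [1, 1], "datatype": "INT32",
         "data": [30]},
    ]
}
request_body = json.dumps(request_body)
```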

Specify the request headers indicating the model name and version:
```python
request_headers = {"model_name":"ensemble", "model_version":"1"}
```

Now, you can send an inference request to the Triton Inference Server:
```python
# The OCI SDK must be installed for this example to function properly.
# Installation instructions can be found here: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/pythonsdk.htm

...
auth = Signer(
    ...
)

endpoint = <modelDeploymentEndpoint>

inference_output = requests.request('POST', endpoint, data=request_body, auth=auth, headers=request_headers).json()['outputs'][0]['data'][:5]
```

## Testing via OCI CLI, once the Model Deployment is active
```bash
oci raw-request --http-method POST --target-uri https://<MODEL_DEPLOYMENT_URL>/predict --request-body '{"text_input": "What is the height of Eiffel Tower?", "max_tokens": 30, "bad_words": [""], "stop_words": [""], "top_k":10, "top_p":1, "end_id": 3, "pad_id": 2}'
```

## Testing locally on the host: start the Triton server and make an HTTP request
```bash
# Run the following commands inside the tool chain container
tritonserver --model-repository <location of the model_repository folder containing ensemble,preprocessor,postprocessor,tensorrt_llm>

curl -X POST localhost:8000/v2/models/ensemble/versions/1/generate -d '{"text_input": "What is the height of Eiffel Tower?", "max_tokens": 30, "bad_words": [""], "stop_words": [""], "top_k":10, "top_p":1, "end_id": 3, "pad_id": 2}'
```
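
The same local check can be made from Python with the `requests` library; this simply mirrors the curl call above against Triton's generate endpoint, and it assumes the ensemble returns a `text_output` field (as configured in this sample).

```python
# Python equivalent of the curl call above, against Triton's generate endpoint.
import requests

payload = {
    "text_input": "What is the height of Eiffel Tower?",
    "max_tokens": 30,
    "bad_words": [""],
    "stop_words": [""],
    "top_k": 10,
    "top_p": 1,
    "end_id": 3,
    "pad_id": 2,
}

resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/versions/1/generate",
    json=payload,
    timeout=120,
)
# The response field name follows the ensemble's output tensor; "text_output" is assumed here.
print(resp.json().get("text_output"))
```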
