# Deploy TensorRT model in NVIDIA Triton Inference Server
This document provides a walkthrough for deploying a Falcon TensorRT ensemble model on an NVIDIA [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) using OCI Data Science Model Deployment's bring your own container [(BYOC)](https://docs.oracle.com/en-us/iaas/data-science/using/mod-dep-byoc.htm) support.
The sample used here is based on [Triton's inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/47b609b670d6bb33a5ff113d98ad8a44d961c5c6/all_models/inflight_batcher_llm).
The Falcon Model TensorRT engine files need to be built using [TensorRT-LLM/examples/falcon](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/falcon).
# Prerequisites
## Hardware Requirements
| Shape | Total GPU Memory | Price per hour (Pay-as-you-go, on-demand, Nov 2023) |
|-------|------------------|------------------------------------------------------|
| VM.GPU.A10.2 (2x NVIDIA A10 Tensor Core) | 48GB (2x 24GB) | $4 ($2 per GPU per hour) |
| BM.GPU.A10.4 (4x NVIDIA A10 Tensor Core) | 96GB (4x 24GB) | $8 ($2 per GPU per hour) |
| BM.GPU4.8 (8x NVIDIA A100 40GB Tensor Core) | 320GB (8x 40GB) | $24.40 ($3.05 per GPU per hour) |
| BM.GPU.H100.8 (8x NVIDIA H100 80GB Tensor Core) | 640GB (8x 80GB) | $80 ($10 per GPU per hour) |
## Environment Setup
Start by building a Docker image that contains the toolchain required to convert a Hugging Face hosted model into a TensorRT-LLM compatible artifact. The same Docker image can also be used to serve the model for evaluation.
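As a minimal sketch, assuming the tooling image is built from the [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend) repository's Dockerfile and tagged `triton_trt_llm` (the tag reused by the `docker run` command below); adjust the branch and Dockerfile path to match your TensorRT-LLM release:

```bash
# Clone the Triton TensorRT-LLM backend along with its TensorRT-LLM submodule.
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

# Build the tooling/serving image; the tag must match the one used in the docker run below.
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
```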
The next step is to convert the model into a TensorRT engine. This requires the model weights and a model definition written with the TensorRT-LLM Python API. The TensorRT-LLM repository ships with a wide selection of predefined model definitions; for this walkthrough we use the provided Falcon model definition rather than writing a custom one, which also serves as a basic illustration of some of the optimizations TensorRT-LLM offers.
```bash
# -v /falcon-180B/:/model:Z mounts the downloaded model directory into the tooling container.
docker run --gpus=all --shm-size=1g -v /falcon-180B/:/model:Z -it triton_trt_llm bash

# --output_dir falcon/180b/trt_engines/fp8/8-gpu/ in the engine build command indicates where the converted engine artifacts are placed.
```
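The engine build itself runs inside this container. The following is only a sketch, assuming the `build.py` script and flags from the TensorRT-LLM `examples/falcon` directory linked above; consult that example's README for the exact quantization and build commands for your precision and GPU count:

```bash
# Inside the tooling container: build 8-GPU TensorRT engines for the mounted Falcon checkpoint.
# The in-container path and the flags below are illustrative; FP8 builds typically require a
# separate quantization step first (see examples/falcon in TensorRT-LLM).
cd tensorrt_llm/examples/falcon

python build.py \
    --model_dir /model \
    --dtype bfloat16 \
    --enable_fp8 \
    --fp8_kv_cache \
    --world_size 8 \
    --output_dir falcon/180b/trt_engines/fp8/8-gpu/
```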
# Model Deployment
Next, set up a Triton Inference Server compliant Docker image and model repository.
### Step 1: Create Model Artifact
To use Triton, we need to build a model repository. The structure of the repository is as follows:
```
model_repository
|
+-- ensemble
|   +-- config.pbtxt
|   +-- 1
+-- preprocessing
|   +-- config.pbtxt
|   +-- 1
|       +-- model.py
+-- tensorrt_llm
|   +-- config.pbtxt
|   +-- 1
+-- postprocessing
    +-- config.pbtxt
    +-- 1
        +-- model.py
```
### Step 2: Upload model artifact to Model catalog
Zip the model_repository folder into model_artifact.zip and follow the guidelines in the [Readme step](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/model-deployment/containers/llama2/README.md#one-time-download-to-oci-model-catalog) to create a model catalog entry from model_artifact.zip.
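For reference, a minimal packaging sketch (whether the archive should contain the folder itself or only its contents follows the linked Readme step; adjust accordingly):

```bash
# Package the model repository created in Step 1 as model_artifact.zip.
zip -r model_artifact.zip model_repository
```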
### Step 3: Upload NVIDIA base triton server image to OCI Container Registry
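A minimal sketch of pushing a Triton image to OCI Container Registry (OCIR); the NGC tag and the `<region-key>`, `<tenancy-namespace>`, and `<repo-name>` values are placeholders, and the locally built `triton_trt_llm` image can be pushed the same way:

```bash
# Pull the NVIDIA Triton base image with the TensorRT-LLM backend (tag is illustrative).
docker pull nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

# Log in to OCIR with your tenancy namespace, username, and an auth token as the password.
docker login <region-key>.ocir.io -u <tenancy-namespace>/<username>

# Tag the image for your OCIR repository and push it.
docker tag nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 \
    <region-key>.ocir.io/<tenancy-namespace>/<repo-name>:23.10-trtllm-python-py3
docker push <region-key>.ocir.io/<tenancy-namespace>/<repo-name>:23.10-trtllm-python-py3
```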
### Step 4: Create Model Deployment
OCI Data Science Model Deployment has dedicated support for Triton images, which simplifies managing the Triton image by mapping the service-mandated endpoints to Triton's inference and health HTTP/REST endpoints. To enable this support, set the following environment variable when creating the Model Deployment:
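The key/value shown below follows the OCI Model Deployment documentation for Triton support; verify it against the current BYOC documentation linked above:

```
CONTAINER_TYPE = TRITON
```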
# Testing the model
## Using Python SDK to query the Inference Server
Specify the JSON inference payload with the input and output layers for the model, and describe the shape and datatype of the expected input and output:
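A minimal sketch using the OCI Python SDK request signer together with `requests`; the endpoint OCID/region, the `model-name`/`model-version` headers, and the tensor names (`text_input`, `max_tokens`, `text_output`) are assumptions based on the inflight_batcher_llm ensemble and may differ in your deployment:

```python
import json

import oci
import requests

# Build a request signer from the default OCI config file (~/.oci/config).
config = oci.config.from_file()
signer = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

# Model deployment invoke endpoint (placeholder OCID and region).
endpoint = (
    "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/"
    "ocid1.datasciencemodeldeployment.oc1.iad.<unique_id>/predict"
)

# KServe v2 style payload for the Triton ensemble; names, shapes, and datatypes
# follow the inflight_batcher_llm ensemble and may need adjusting.
payload = {
    "inputs": [
        {
            "name": "text_input",
            "datatype": "BYTES",
            "shape": [1, 1],
            "data": ["What is the capital of France?"],
        },
        {
            "name": "max_tokens",
            "datatype": "INT32",
            "shape": [1, 1],
            "data": [64],
        },
    ],
    "outputs": [{"name": "text_output"}],
}

# The model-name/model-version headers route the request to the ensemble model.
response = requests.post(
    endpoint,
    json=payload,
    auth=signer,
    headers={"model-name": "ensemble", "model-version": "1"},
)
print(json.dumps(response.json(), indent=2))
```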