# Deploy TensorRT model in NVIDIA Triton Inference Server
This document provides a walkthrough for deploying a Falcon TensorRT ensemble model on an NVIDIA [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) using OCI Data Science Model Deployment's bring your own container [(BYOC)](https://docs.oracle.com/en-us/iaas/data-science/using/mod-dep-byoc.htm) support.

The sample used here is based on [Triton's inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/47b609b670d6bb33a5ff113d98ad8a44d961c5c6/all_models/inflight_batcher_llm).

The Falcon model TensorRT engine files need to be built using [TensorRT-LLM/examples/falcon](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/falcon).

# Prerequisites

## Hardware Requirements

| GPU Shape | Total GPU Memory | Price per hour (pay-as-you-go, on-demand, Nov 2023) |
| :-----: | :---: | :---: |
| VM.GPU.A10.1 (1x NVIDIA A10 Tensor Core) | 24GB | $2 |
| VM.GPU.A10.2 (2x NVIDIA A10 Tensor Core) | 48GB (2x 24GB) | $4 ($2 per GPU per hour) |
| BM.GPU.A10.4 (4x NVIDIA A10 Tensor Core) | 96GB (4x 24GB) | $8 ($2 per GPU per hour) |
| BM.GPU4.8 (8x NVIDIA A100 40GB Tensor Core) | 320GB (8x 40GB) | $24.40 ($3.05 per GPU per hour) |
| BM.GPU.H100.8 (8x NVIDIA H100 80GB Tensor Core) | 640GB (8x 80GB) | $80 ($10 per GPU per hour) |

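As a rough guide for choosing a shape, the sketch below estimates the weight-memory footprint of the Falcon-180B FP8, tensor-parallel 8 build used later in this walkthrough. It is an estimate only: KV cache and activations need additional headroom on top of the weights, and among the shapes listed above only the H100 supports FP8.

```python
# Rough per-GPU weight-memory estimate (weights only; KV cache and activations
# are extra). Assumes ~1 byte per parameter at FP8.
params_billion = 180      # Falcon-180B
bytes_per_param = 1       # FP8
tensor_parallel = 8       # TP degree used in the build step below

total_weights_gb = params_billion * bytes_per_param        # ~180 GB in total
per_gpu_gb = total_weights_gb / tensor_parallel            # ~22.5 GB per GPU
print(f"~{total_weights_gb} GB of weights, ~{per_gpu_gb:.1f} GB per GPU before KV cache")
```
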
## Environment Setup
Start by building a docker image that contains the toolchain required to convert a Hugging Face hosted model into a TensorRT-LLM compatible artifact. The same docker image can also be used to deploy the model for evaluation.

### Step 1 - Build the base image
```bash
 git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
 cd tensorrtllm_backend/
 git submodule update --init --recursive
 DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
```

### Step 2 - Download the model weights
```bash
 git lfs install
 git clone https://huggingface.co/tiiuae/falcon-180B
```
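
If git-lfs is not convenient, the weights can also be fetched with the `huggingface_hub` client. This is a minimal sketch, assuming `huggingface_hub` is installed, the model's license has been accepted on Hugging Face, and an access token is available in `HF_TOKEN` (the token variable and local directory name are illustrative):

```python
# Alternative download path using the Hugging Face Hub client.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tiiuae/falcon-180B",
    local_dir="falcon-180B",                 # same directory name git clone would create
    token=os.environ.get("HF_TOKEN"),        # required for this gated model
)
```
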
### Step 3 - Compile the model
The next phase converts the model into a TensorRT engine. This requires both the model weights and a model definition written with the TensorRT-LLM Python API. The TensorRT-LLM repository ships an extensive selection of pre-built model definitions; for this walkthrough we use the provided Falcon model definition rather than writing a custom one. It serves as a basic illustration of some of the optimizations that TensorRT-LLM offers.
```bash
 # -v /falcon-180B/:/model:Z mounts the downloaded model directory into the tooling container
 docker run --gpus=all --shm-size=1g -v /falcon-180B/:/model:Z -it triton_trt_llm bash
 # Inside the container, run the following command
 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.2/compat/lib.real/
```

### Step 4 - Perform the next two steps only if quantisation is required for your model
Install the quantisation tooling dependencies:

```bash
 # Run the following commands at the container prompt
 cuda_version=$(nvcc --version | grep 'release' | awk '{print $6}' | awk -F'[V.]' '{print $2$3}')
 python_version=$(python3 --version 2>&1 | awk '{print $2}' | awk -F. '{print $1$2}')

 # Download the AMMO framework required for quantization
 wget https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.3.0.tar.gz
 tar -xzf nvidia_ammo-0.3.0.tar.gz
 pip install nvidia_ammo-0.3.0/nvidia_ammo-0.3.0+cu$cuda_version-cp$python_version-cp$python_version-linux_x86_64.whl
```

### Step 5 - Apply quantisation and TensorRT-LLM conversion
```bash
 # Run the following commands at the container prompt
 cd tensorrt_llm/examples/falcon/

 # Install model-specific dependencies
 pip install -r requirements.txt

 # Apply quantisation
 python quantize.py --model_dir /model \
                    --dtype float16 \
                    --qformat fp8 \
                    --export_path quantized_fp8 \
                    --calib_size 16

 # Apply TensorRT-LLM conversion
 # Build Falcon 180B TP=8 using the HF checkpoint + PTQ scaling factors from the single-rank checkpoint
 python build.py --model_dir /model \
                 --quantized_fp8_model_path ./quantized_fp8/falcon_tp1_rank0.npz \
                 --dtype float16 \
                 --enable_context_fmha \
                 --use_gpt_attention_plugin float16 \
                 --output_dir falcon/180b/trt_engines/fp8/8-gpu/ \
                 --remove_input_padding \
                 --enable_fp8 \
                 --fp8_kv_cache \
                 --strongly_typed \
                 --world_size 8 \
                 --tp_size 8 \
                 --load_by_shard \
                 --parallel_build

 # --tp_size 8 sets tensor parallelism to 8
 # --output_dir falcon/180b/trt_engines/fp8/8-gpu/ is where the converted model artifacts are written
```

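Before moving on, it can be worth sanity-checking the build output. This is a minimal sketch, assuming `build.py` writes one serialized engine per tensor-parallel rank plus a `config.json` into the output directory (the layout we expect from the 0.5.x release); adjust the path and expected count to your build:

```python
# Quick check that the TP=8 build produced the expected artifacts.
from pathlib import Path

engine_dir = Path("falcon/180b/trt_engines/fp8/8-gpu")   # --output_dir from build.py
engines = sorted(engine_dir.glob("*.engine"))

assert (engine_dir / "config.json").exists(), "missing config.json"
assert len(engines) == 8, f"expected 8 engines (one per TP rank), found {len(engines)}"
print("\n".join(str(e) for e in engines))
```
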
# Model Deployment

Set up a Triton Inference Server compliant docker image and model artifact.
### Step 1: Create Model Artifact
To use Triton, we need to build a model repository. The structure of the repository is as follows:
```
model_repository
|
+-- ensemble
    |
    +-- config.pbtxt
    +-- 1
+-- postprocessing
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.py
+-- preprocessing
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.py
+-- tensorrt_llm
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.py
```

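The skeleton for this repository can be taken from the `all_models/inflight_batcher_llm` directory of the `tensorrtllm_backend` checkout referenced above, with the engines built in Step 5 placed under `tensorrt_llm/1/`. Below is a minimal sketch assuming the paths from the earlier steps (adjust them to your environment); note that the `config.pbtxt` files contain placeholders, such as the engine path and tokenizer location, that must still be edited by hand:

```python
# Populate model_repository from the inflight_batcher_llm template and the built engines.
import shutil
from pathlib import Path

template = Path("tensorrtllm_backend/all_models/inflight_batcher_llm")
engines = Path("tensorrt_llm/examples/falcon/falcon/180b/trt_engines/fp8/8-gpu")
repo = Path("model_repository")

# Copy the ensemble/preprocessing/postprocessing/tensorrt_llm template.
shutil.copytree(template, repo, dirs_exist_ok=True)

# Place the serialized engines under tensorrt_llm/1/ as version 1 of that model.
engine_dir = repo / "tensorrt_llm" / "1"
engine_dir.mkdir(parents=True, exist_ok=True)
for f in engines.iterdir():
    if f.is_file():
        shutil.copy2(f, engine_dir / f.name)
```
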
### Step 2: Upload model artifact to Model Catalog
Zip the model_repository folder into model_artifact.zip and follow the guidelines in [this Readme step](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/model-deployment/containers/llama2/README.md#one-time-download-to-oci-model-catalog) to create a model catalog item with model_artifact.zip.

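A minimal sketch of the zipping step, assuming model_repository sits in the current working directory; the archive keeps the model_repository folder at its root so that it matches the `--model-repository` path used at deployment time:

```python
# Create model_artifact.zip containing the model_repository folder.
import shutil

shutil.make_archive("model_artifact", "zip", root_dir=".", base_dir="model_repository")
```
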
### Step 3: Upload NVIDIA base Triton server image to OCI Container Registry

```bash
docker login $(OCIR_REGION).ocir.io
mkdir -p tritonServer
cd tritonServer
git clone https://github.com/triton-inference-server/server.git -b v2.30.0 --depth 1
cd server
python compose.py --backend onnxruntime --repoagent checksum --output-name $(OCIR_REGION).ocir.io/$(OCIR_NAMESPACE)/oci-datascience-triton-server/onnx-runtime:1.0.0
docker push $(OCIR_REGION).ocir.io/$(OCIR_NAMESPACE)/oci-datascience-triton-server/onnx-runtime:1.0.0
```

### Step 4: Create Model Deployment
OCI Data Science Model Deployment has dedicated support for Triton images: it maps the service-mandated endpoints to Triton's inference and health HTTP/REST endpoints, which makes the Triton image easier to manage. To enable this support, set the following environment variable when creating the Model Deployment:
```bash
CONTAINER_TYPE = TRITON
```

#### Using the Python SDK
```python
import oci
from oci.data_science.models import (
    ModelConfigurationDetails,
    OcirModelDeploymentEnvironmentConfigurationDetails,
    SingleModelDeploymentConfigurationDetails,
    CreateModelDeploymentDetails,
)

# Create a model configuration details object
model_config_details = ModelConfigurationDetails(
    model_id=<model_id>,
    bandwidth_mbps=<bandwidth_mbps>,
    instance_configuration=<instance_configuration>,
    scaling_policy=<scaling_policy>
)

# Create the container environment configuration
environment_config_details = OcirModelDeploymentEnvironmentConfigurationDetails(
    environment_configuration_type="OCIR_CONTAINER",
    environment_variables={'CONTAINER_TYPE': 'TRITON'},
    image="iad.ocir.io/testtenancy/oci-datascience-triton-server/triton-tensorrt:1.1",
    image_digest=<image_digest>,
    cmd=[
        "tritonserver",
        "--model-repository=/opt/ds/model/deployed_model/model_repository"
    ],
    server_port=8000,
    health_check_port=8000
)

# Create a single-model deployment configuration
single_model_deployment_config_details = SingleModelDeploymentConfigurationDetails(
    deployment_type="SINGLE_MODEL",
    model_configuration_details=model_config_details,
    environment_configuration_details=environment_config_details
)

# Set up the parameters required to create a new model deployment
create_model_deployment_details = CreateModelDeploymentDetails(
    display_name=<deployment_name>,
    model_deployment_configuration_details=single_model_deployment_config_details,
    compartment_id=<compartment_id>,
    project_id=<project_id>
)
```
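
The block above only builds the request objects; below is a minimal sketch of actually submitting the request with the OCI Python SDK client, assuming `config` is an OCI config dict loaded as in the testing section further down:

```python
import oci

# Create the Data Science client and submit the deployment request.
data_science_client = oci.data_science.DataScienceClient(config)
response = data_science_client.create_model_deployment(
    create_model_deployment_details=create_model_deployment_details
)
print(response.data.id, response.data.lifecycle_state)
```
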

# Testing the model

## Using Python SDK to query the Inference Server

Specify the JSON inference payload with the inputs expected by the ensemble model:
```python
import json

request_body = {"text_input": "Explain Cloud Computing to a school kid", "max_tokens": 30, "bad_words": ["now", "process"], "stop_words": [""], "top_k":20, "top_p":1, "end_id": 3, "pad_id": 2}
request_body = json.dumps(request_body)
```

Specify the request headers indicating model name and version:
```python
request_headers = {"model_name":"ensemble", "model_version":"1"}
```

Now, you can send an inference request to the Triton Inference Server:
```python
# The OCI SDK must be installed for this example to function properly.
# Installation instructions can be found here: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/pythonsdk.htm

import requests
import oci
from oci.signer import Signer

config = oci.config.from_file("~/.oci/config")  # replace with the location of your oci config file
auth = Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config['pass_phrase'])

endpoint = <modelDeploymentEndpoint>

inference_output = requests.request('POST', endpoint, data=request_body, auth=auth, headers=request_headers).json()['outputs'][0]['data'][:5]
```

## Testing via the OCI CLI, once the Model Deployment is active
```bash
 oci raw-request --http-method POST --target-uri https://<MODEL_DEPLOYMENT_URL>/predict --request-body '{"text_input": "What is the height of Eiffel Tower?", "max_tokens": 30, "bad_words": [""], "stop_words": [""], "top_k":10, "top_p":1, "end_id": 3, "pad_id": 2}'
```

## Testing locally on the host: start the Triton server and make an HTTP request
```bash
 # Run the following commands inside the toolchain container
 tritonserver --model-repository <location of the model_repository folder containing ensemble, preprocessing, postprocessing, tensorrt_llm>

 curl -X POST localhost:8000/v2/models/ensemble/versions/1/generate -d '{"text_input": "What is the height of Eiffel Tower?", "max_tokens": 30, "bad_words": [""], "stop_words": [""], "top_k":10, "top_p":1, "end_id": 3, "pad_id": 2}'
```