# Deploy TensorRT model in NVIDIA Triton Inference Server
This document provides a walkthrough for deploying a Falcon TensorRT-LLM ensemble model on NVIDIA [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) using OCI Data Science Model Deployment's custom containers support.

The sample used here is based on [Triton's inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/47b609b670d6bb33a5ff113d98ad8a44d961c5c6/all_models/inflight_batcher_llm).

The Falcon model TensorRT engine files need to be built using [TensorRT-LLM/examples/falcon](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/falcon).
## Step 1: Set up Triton Inference Server
### Step 1.1: Create Model Artifact
To use Triton, we need to build a model repository. The structure of the repository is as follows:
```
model_repository
|
+-- ensemble
|   +-- config.pbtxt
|   +-- 1
+-- postprocessing
|   +-- config.pbtxt
|   +-- 1
|       +-- model.py
+-- preprocessing
|   +-- config.pbtxt
|   +-- 1
|       +-- model.py
+-- tensorrt_llm
    +-- config.pbtxt
    +-- 1
        +-- <compiled Falcon TensorRT-LLM engine files>
```
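
One way to assemble this layout is to start from the inflight_batcher_llm templates and place the compiled Falcon engine files under `tensorrt_llm/1`. A minimal sketch, assuming a local clone of `tensorrtllm_backend` and an engine output directory named `falcon_engine_output` (both names are placeholders):
```
import shutil

# Copy the ensemble, preprocessing, postprocessing and tensorrt_llm templates
# from a local clone of the tensorrtllm_backend repository
shutil.copytree(
    "tensorrtllm_backend/all_models/inflight_batcher_llm",
    "model_repository",
)

# Drop the compiled Falcon TensorRT-LLM engine files into tensorrt_llm/1
shutil.copytree(
    "falcon_engine_output",  # output directory of the TensorRT-LLM Falcon build
    "model_repository/tensorrt_llm/1",
    dirs_exist_ok=True,
)
```
Note that the template `config.pbtxt` files contain placeholder parameters (such as tokenizer and engine paths) that still need to be filled in for your environment.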

### Step 1.2: Upload NVIDIA base Triton server image to OCI Container Registry

```
docker login ${OCIR_REGION}.ocir.io
mkdir -p tritonServer
cd tritonServer
git clone https://github.com/triton-inference-server/server.git -b v2.30.0 --depth 1
cd server
python compose.py --backend onnxruntime --repoagent checksum --output-name ${OCIR_REGION}.ocir.io/${OCIR_NAMESPACE}/oci-datascience-triton-server/onnx-runtime:1.0.0
docker push ${OCIR_REGION}.ocir.io/${OCIR_NAMESPACE}/oci-datascience-triton-server/onnx-runtime:1.0.0
```

### Step 1.3: Upload model artifact to Model catalog
Compress the model_repository folder created in Step 1.1 into a zip archive and upload it to the model catalog. Refer to [Saving Models to the Model Catalog](https://docs.oracle.com/en-us/iaas/data-science/using/models_saving_catalog.htm) for details.
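
For reference, a minimal sketch of this step with the OCI Python SDK (the display name and paths below are placeholders; the console flow in the link above works just as well):
```
import shutil
import oci
from oci.data_science.models import CreateModelDetails

# Zip the repository so model_repository/ is the top-level folder in the archive
shutil.make_archive("model_repository", "zip", root_dir=".", base_dir="model_repository")

config = oci.config.from_file("~/.oci/config")
ds_client = oci.data_science.DataScienceClient(config)

# Register a model entry in the model catalog
model = ds_client.create_model(
    CreateModelDetails(
        compartment_id="<compartment_id>",
        project_id="<project_id>",
        display_name="falcon-trtllm-ensemble",
    )
).data

# Attach the zipped model repository as the model artifact
with open("model_repository.zip", "rb") as f:
    ds_client.create_model_artifact(
        model.id,
        f,
        content_disposition="attachment; filename=model_repository.zip",
    )
```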

### Step 1.4: Create Model Deployment
OCI Data Science Model Deployment has dedicated support for Triton images, which makes them easier to manage by mapping the service-mandated endpoints to Triton's inference and health HTTP/REST endpoints. To enable this support, set the following environment variable when creating the Model Deployment:
```
CONTAINER_TYPE = TRITON
```

#### Using the Python SDK
```
import oci
from oci.data_science.models import (
    ModelConfigurationDetails,
    OcirModelDeploymentEnvironmentConfigurationDetails,
    SingleModelDeploymentConfigurationDetails,
    CreateModelDeploymentDetails,
)

# Create a model configuration details object
model_config_details = ModelConfigurationDetails(
    model_id=<model_id>,
    bandwidth_mbps=<bandwidth_mbps>,
    instance_configuration=<instance_configuration>,
    scaling_policy=<scaling_policy>
)

# Create the container environment configuration
environment_config_details = OcirModelDeploymentEnvironmentConfigurationDetails(
    environment_configuration_type="OCIR_CONTAINER",
    environment_variables={'CONTAINER_TYPE': 'TRITON'},
    image="iad.ocir.io/testtenancy/oci-datascience-triton-server/triton-tensorrt:1.1",
    image_digest=<image_digest>,
    cmd=[
        "tritonserver",
        "--model-repository=/opt/ds/model/deployed_model/model_repository"
    ],
    server_port=8000,
    health_check_port=8000
)

# Create a single-model deployment configuration
single_model_deployment_config_details = SingleModelDeploymentConfigurationDetails(
    deployment_type="SINGLE_MODEL",
    model_configuration_details=model_config_details,
    environment_configuration_details=environment_config_details
)

# Set up the parameters required to create a new model deployment
create_model_deployment_details = CreateModelDeploymentDetails(
    display_name=<deployment_name>,
    model_deployment_configuration_details=single_model_deployment_config_details,
    compartment_id=<compartment_id>,
    project_id=<project_id>
)
```
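
The details objects above only describe the deployment; submitting the request is done through the Data Science client. A minimal sketch, assuming the standard `~/.oci/config` profile:
```
config = oci.config.from_file("~/.oci/config")
data_science_client = oci.data_science.DataScienceClient(config)

# Submit the deployment request; provisioning continues asynchronously
model_deployment = data_science_client.create_model_deployment(
    create_model_deployment_details
).data
print(model_deployment.id, model_deployment.lifecycle_state)
```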

## Step 2: Using the Python SDK to query the Inference Server

Specify the JSON inference payload with the input text and generation parameters expected by the ensemble model:
```
import json

request_body = {"text_input": "Explain Cloud Computing to a school kid", "max_tokens": 30, "bad_words": ["now", "process"], "stop_words": [""], "top_k": 20, "top_p": 1, "end_id": 3, "pad_id": 2}
request_body = json.dumps(request_body)
```

Specify the request headers indicating model name and version:
```
request_headers = {"model_name": "ensemble", "model_version": "1"}
```

Now, you can send an inference request to the Triton Inference Server:
```
# The OCI SDK must be installed for this example to function properly.
# Installation instructions can be found here: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/pythonsdk.htm

import requests
import oci
from oci.signer import Signer

config = oci.config.from_file("~/.oci/config")  # replace with the location of your oci config file
auth = Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config['pass_phrase'])

endpoint = <modelDeploymentEndpoint>

inference_output = requests.request('POST', endpoint, data=request_body, auth=auth, headers=request_headers).json()['outputs'][0]['data'][:5]
```
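
The response follows Triton's KServe-style JSON format, where each output carries a name, datatype, shape, and data. Assuming the ensemble exposes a `text_output` tensor as in the inflight_batcher_llm sample (an assumption about your ensemble's config), the generated text can also be selected by name rather than by position:
```
# Hypothetical follow-up: pick the generated text out of the response by output name
response = requests.post(endpoint, data=request_body, auth=auth, headers=request_headers).json()
text_output = next(o for o in response['outputs'] if o['name'] == 'text_output')
print(text_output['data'][0])
```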