You can debug the code in the container using the [Visual Studio Code Remote - Tunnels](https://code.visualstudio.com/docs/remote/tunnels) extension, which lets you connect to a remote machine, like a desktop PC or virtual machine (VM), via a secure tunnel. You can connect to that machine from a VS Code client anywhere, without setting up SSH yourself, including when the remote machine is an Oracle Cloud Infrastructure Data Science Job.

The tunnel securely transmits data from one network to another. This can eliminate the need for the source code to be on your VS Code client machine, since the extension runs commands and other extensions directly on the remote OCI Data Science Job machine.

## Requirements

To use this debugging workflow, you must first complete the steps of building and pushing the container of your choice to the Oracle Cloud Container Registry.

## Run for Debugging

For debugging purposes we will utilize the OCI Data Science Jobs service. Once the TGI or vLLM container has been built and published to the OCIR, we can run it as a Job, which enables us to take advantage of VS Code Remote Tunneling. To do so, follow these steps:
* In your [OCI Data Science](https://cloud.oracle.com/data-science/projects) section, select the project you've created for deployment
* Under the `Resources` section select `Jobs`
* Click on the `Create job` button
* Under the `Default Configuration` select the checkbox for `Bring your own container`
* Set the following environment variables:
  * `CONTAINER_CUSTOM_IMAGE` with the value of the OCI Container Registry repository location where you pushed your container, for example: `<your-region>.ocir.io/<your-tenancy-name>/vllm-odsc:0.1.3`
  * `CONTAINER_ENTRYPOINT` with the value `"/bin/bash", "--login", "-c"`
  * `CONTAINER_CMD` with the value `/aiapps/runner.sh`
* The above values override the defaults set in the `Dockerfile` and enable launching the tunnel (see the local sketch after this list)
* Under `Compute shape` select `Custom configuration`, then `Specialty and previous generation`, and select the `VM.GPU.A10.2` shape
* Under `Logging` select the log group you've created for the model deployment and keep the option `Enable automatic log creation`
* Under `Storage` set 500 GB or more of storage
* Under `Networking` keep the `Default networking` configuration
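If you want to sanity-check these overrides before starting a job, the same entrypoint and command can be reproduced locally with plain Docker. This is a minimal sketch, assuming the example image tag above and that `/aiapps/runner.sh` exists in your image:

```bash
# Local equivalent of the job overrides above (illustrative only):
# CONTAINER_ENTRYPOINT maps to --entrypoint plus the leading arguments,
# CONTAINER_CMD maps to the trailing argument handed to `bash -c`.
docker run --rm -it \
  --entrypoint /bin/bash \
  <your-region>.ocir.io/<your-tenancy-name>/vllm-odsc:0.1.3 \
  --login -c /aiapps/runner.sh
```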
With this, we are now ready to start the job:

* Select the newly created job, if you have not done so already
* Click on `Start a job run`
* Keep all default settings and click on the `Start` button at the bottom left

Once the job is up and running, you will notice an authentication code appearing in the logs. Copy it and use it to authorize the tunnel; a few seconds later the link for the tunnel will appear.


Copy the link and open it in a browser, which should load the VS Code editor and reveal the code inside the job, enabling direct debugging and coding.

`Notice` that you can also use your local VS Code IDE for the same purpose via the [Visual Studio Code Remote - Tunnels](https://code.visualstudio.com/docs/remote/tunnels) extension.
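For reference, the tunnel is started by the container's entry script. The sketch below shows what such a script could look like; the download URL is the documented VS Code CLI endpoint, but the name `/aiapps/runner.sh` and its exact contents in this repository's image are assumptions:

```bash
#!/bin/bash
# Hypothetical /aiapps/runner.sh: download the VS Code CLI and open a tunnel.
# The actual script shipped in the container may differ.
set -euo pipefail

curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' \
  --output /tmp/vscode_cli.tar.gz
tar -xzf /tmp/vscode_cli.tar.gz -C /tmp

# Prints a device-login code to the job logs, then the tunnel URL.
/tmp/code tunnel --accept-server-license-terms
```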
This repo provides two approaches to deploy the Llama-2 LLM:
* [Text Generation Inference](https://github.com/huggingface/text-generation-inference) from HuggingFace.
* [vLLM](https://github.com/vllm-project/vllm) developed at UC Berkeley.
## Prerequisites

* This is a Limited Availability feature. Please reach out to us via email at `ask-oci-data-science_grp@oracle.com` to be allowlisted for this LA feature.
* Configure your [API Auth Token](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrygettingauthtoken.htm) to be able to run and test your code locally.
* Install [Docker](https://docs.docker.com/get-docker) or [Rancher Desktop](https://rancherdesktop.io/) as a Docker alternative.
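Pushing and pulling images from the Oracle Cloud Container Registry later in this guide requires a Docker login using that auth token. A minimal sketch, assuming your tenancy's object storage namespace and region key (check the OCIR docs for the exact username format in federated tenancies):

```bash
# Log in to OCI Container Registry; the password is your API auth token.
# <tenancy-namespace> is the tenancy's object storage namespace and
# <region-key> is the registry region key (e.g. iad, fra).
docker login <region-key>.ocir.io \
  -u '<tenancy-namespace>/<your-username>' \
  -p '<your-auth-token>'
```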
## OCI Logging

When experimenting with new frameworks and models, it is highly advisable to attach log groups to the model deployment in order to enable self-assistance in debugging. Follow the steps below to create the log groups.

* Create logging for the model deployment (if you have already created it, you can skip this step)
  * Go to the [OCI Logging Service](https://cloud.oracle.com/logging/log-groups) and select `Log Groups`
  * Either select one of the existing log groups or create a new one
  * In the log group create ***two*** `Log`s, one predict log and one access log, as follows:
    * Click on `Create custom log`
    * Specify a name (predict|access) and select the log group you want to use
    * Under `Create agent configuration` select `Add configuration later`
    * Then click `Create agent configuration`
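If you prefer the command line, the same log group and custom logs can be created with the OCI CLI. A minimal sketch, assuming your compartment OCID and a display name of your choosing:

```bash
# Create a log group for the model deployment.
oci logging log-group create \
  --compartment-id <compartment-ocid> \
  --display-name llama2-md-logs

# Create the two custom logs, using the log group OCID returned above.
oci logging log create \
  --log-group-id <log-group-ocid> \
  --display-name predict \
  --log-type CUSTOM

oci logging log create \
  --log-group-id <log-group-ocid> \
  --display-name access \
  --log-type CUSTOM
```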
## Required IAM Policies

See the public [documentation](https://docs.oracle.com/en-us/iaas/data-science/using/policies.htm) for details.
### Generic Model Deployment policies

`allow group <group-name> to manage data-science-model-deployments in compartment <compartment-name>`

`allow dynamic-group <dynamic-group-name> to manage data-science-model-deployments in compartment <compartment-name>`
### Logging service policy

This allows a model deployment to emit logs to the Logging service. You need this policy if you're using Logging in a model deployment.

`allow any-user to use log-content in tenancy where ALL {request.principal.type = 'datasciencemodeldeployment'}`
### Bring your own container [policies](https://docs.oracle.com/en-us/iaas/data-science/using/model-dep-policies-auth.htm#model_dep_policies_auth__access-custom-container)

`allow dynamic-group <dynamic-group-name> to read repos in compartment <compartment-name> where ANY {request.operation='ReadDockerRepositoryMetadata', request.operation='ReadDockerRepositoryManifest', request.operation='PullDockerLayer'}`

#### If the repository is in the root compartment, allow read for the tenancy

`allow dynamic-group <dynamic-group-name> to read repos in tenancy where ANY {request.operation='ReadDockerRepositoryMetadata', request.operation='ReadDockerRepositoryManifest', request.operation='PullDockerLayer'}`

#### For user level policies

`allow any-user to read repos in tenancy where ALL { request.principal.type = 'datasciencemodeldeployment' }`

`allow any-user to read repos in compartment <compartment-name> where ALL { request.principal.type = 'datasciencemodeldeployment' }`
### Model Store [export API](https://docs.oracle.com/en-us/iaas/data-science/using/large-model-artifact-export.htm#large-model-artifact-export) for creating model artifacts greater than 6 GB in size

`allow service datascience to manage object-family in compartment <compartment> where ALL {target.bucket.name='<bucket_name>'}`

`allow service objectstorage-<region> to manage object-family in compartment <compartment> where ALL {target.bucket.name='<bucket_name>'}`

### Policy to check Data Science work requests

`allow group <group_name> to manage data-science-work-requests in compartment <compartment_name>`

For all other Data Science policies, please refer to these [details](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/distributed_training/README.md#3-oci-policies).
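The `<dynamic-group-name>` referenced in the policies above must match your model deployment resources. A typical matching rule, assuming you scope it to one compartment, looks like: `ALL {resource.type = 'datasciencemodeldeployment', resource.compartment.id = '<compartment-ocid>'}`.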
## Methods for model weight downloads

### Direct Download

If you choose to download the model directly from the HuggingFace repository at container startup time, you will need to configure a VCN with access to the internet. Create a subnet for the model deployment (a CLI equivalent is sketched after this list):

* Go to the [Virtual Cloud Network](https://cloud.oracle.com/networking/vcns) section in your tenancy
* Select one of your existing VCNs
* Click on the `Create subnet` button
* Specify a name
* As `Subnet type` select `Regional`
* As IP CIDR block set `10.0.32.0/19`
* Under `Route Table` select the routing table for the private subnet
* Under `Subnet Access` select `Private Subnet`
* Click on `Create Subnet` to create it
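A minimal OCI CLI sketch of the same subnet creation, assuming your VCN and private route table OCIDs:

```bash
# Create a regional private subnet for the model deployment.
oci network subnet create \
  --compartment-id <compartment-ocid> \
  --vcn-id <vcn-ocid> \
  --display-name llama2-md-subnet \
  --cidr-block 10.0.32.0/19 \
  --route-table-id <private-route-table-ocid> \
  --prohibit-public-ip-on-vnic true
```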
The model will be downloaded at container startup time; we just need to provide an authentication token to connect to the model repository. Follow the steps below to host the token:

* Create a file called `token` in the same folder and store your Hugging Face user access token inside, which you can locate under your [Hugging Face Settings](https://huggingface.co/settings/tokens)
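One way to create that file, with your token swapped in:

```bash
# Store the Hugging Face user access token in a file named `token`.
echo '<your-huggingface-token>' > token
```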
* Depending on the size of the model, the model catalog item will take time to be prepared before it can be deployed using the Model Deployment service. The script above will return the status SUCCEEDED once the model is completely uploaded and ready to be used in the Model Deployment service.

## Build TGI Container
To construct the required containers for this deployment and retain the necessary information, please complete the following steps:

* Checkout this repository

```bash
cd model-deployment/containers/llama2
```
* This example uses [OCI Container Registry](https://docs.oracle.com/en-us/iaas/Content/Registry/Concepts/registryoverview.htm) to store the container image required for the deployment. For the `Makefile` to execute the container build and push process to Oracle Cloud Container Registry, you have to set the `TENANCY_NAME` and `REGION_KEY` environment variables in your local terminal. `TENANCY_NAME` is the name of your tenancy, which you can find under your [account settings](https://cloud.oracle.com/tenancy), and `REGION_KEY` is the 3-letter key of the tenancy region you intend to use for this example, for example IAD for Ashburn or FRA for Frankfurt. You can find the region keys in our public documentation for [Regions and Availability Domains](https://docs.oracle.com/en-us/iaas/Content/General/Concepts/regions.htm)
```bash
export TENANCY_NAME=<your-tenancy-name>
export REGION_KEY=<region-key>
```
You can find the official documentation about OCI Data Science Model Deployment here: <https://docs.oracle.com/en-us/iaas/data-science/using/model_dep_create.htm>

* Build the TGI container image; this step will take a while (a hypothetical `make` invocation is sketched below)
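The build and push are driven by the repository's `Makefile`, where the real target names are defined. A purely illustrative invocation, with target names assumed:

```bash
# Hypothetical Makefile targets; check the repo's Makefile for the real names.
make build    # build the TGI container image locally
make push     # push it to ${REGION_KEY}.ocir.io/${TENANCY_NAME}/...
```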
```bash
ads opctl run -f ads-md-deploy-tgi.yaml
ads opctl run -f ads-md-deploy-vllm.yaml
```
## Troubleshooting

The following have been identified as the most probable failure cases when deploying large models.
### Create/Update Model deployment failure (insufficient deployment timeout)

#### Reason

Insufficient model deployment timeout.

#### Symptoms

The Work Request logs will show the following error:

`Workflow timed out. Maximum runtime was: <deployment_timeout> minutes.`
#### Mitigation

If model loading takes more time than the service allows by default, you can override the deployment timeout through the default configuration override, using the key-value pair below:

`DEPLOYMENT_TIMEOUT_IN_MINUTES`: `60`

The maximum allowed value is 240.
### Create/Update Model deployment failure (insufficient boot volume storage)

#### Reason

Insufficient boot volume storage.

#### Symptoms

The Work Request logs will show the following error: `Errors occurred while bootstrapping the Model Deployment.`

You should investigate further to determine whether the failure was due to a lack of sufficient boot volume size.
#### Mitigation

If the model zip compression ratio is too high and the service fails to download and unzip the artifact, you can override the volume size through the default configuration override, using the key-value pair below:

`STORAGE_SIZE_IN_GB`: `1000`

The maximum allowed value is as high as the maximum boot volume size allowed.
### Create/Update Model deployment failure (container timeout)

#### Reason

Container timeout.

#### Symptoms

The Work Request logs will show the following error: `Errors occurred while bootstrapping the Model Deployment.`
#### Mitigation

Check that the predict and health check endpoints, if defined through environment variables, are valid for the specified container image. You can also check the predict and access logs for more information.
### Advanced debugging options: Code debugging inside the container using a job

For a more detailed level of debugging, refer to [README-DEBUG.md](./README-DEBUG.md).