
Commit 3466cd5

Merge pull request #343 from gargnipungarg/md-llama-content-realignment
Updated readme with troubleshooting section
2 parents 93f517b + 1e7e490

File tree

2 files changed: +143 −57
model-deployment/containers/llama2/README-DEBUG.md

Lines changed: 41 additions & 0 deletions

@@ -0,0 +1,41 @@
# Debugging

You can debug the code inside the container by using the [Visual Studio Code Remote - Tunnels](https://code.visualstudio.com/docs/remote/tunnels) extension, which lets you connect to a remote machine, such as a desktop PC or a virtual machine (VM), through a secure tunnel. You can connect to that machine from a VS Code client anywhere, without having to set up SSH yourself, including from Oracle Cloud Infrastructure Data Science Jobs.

Tunneling securely transmits data from one network to another. This can eliminate the need for the source code to be on your VS Code client machine, since the extension runs commands and other extensions directly on the remote OCI Data Science Job machine.

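For context, the VS Code CLI can open such a tunnel with a single command. A minimal sketch of what a container startup script might run (an assumption for illustration; the actual script shipped in the container may differ):

```bash
# Hypothetical: start a VS Code tunnel from inside the remote machine.
# Flags name the tunnel and accept the server license non-interactively.
code tunnel --accept-server-license-terms --name oci-job-debug
```
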
## Requirements

To use debugging, you must first complete the steps of building and pushing the container of your choice to the Oracle Cloud Container Registry.

## Run for Debugging

For debugging purposes we will use the OCI Data Science Jobs service. Once the TGI or vLLM container has been built and published to the OCIR, we can run it as a Job, which lets us take advantage of VS Code Remote Tunneling. To do so, follow these steps:

* In your [OCI Data Science](https://cloud.oracle.com/data-science/projects) section, select the project you've created for deployment
* Under the `Resources` section, select `Jobs`
* Click the `Create job` button
* Under `Default Configuration`, select the checkbox for `Bring your own container`
* Set the following environment variables:
  * `CONTAINER_CUSTOM_IMAGE` with the value of the OCI Container Registry repository location where you pushed your container, for example: `<your-region>.ocir.io/<your-tenancy-name>/vllm-odsc:0.1.3`
  * `CONTAINER_ENTRYPOINT` with the value `"/bin/bash", "--login", "-c"`
  * `CONTAINER_CMD` with the value `/aiapps/runner.sh`
  * These values override the defaults set in the `Dockerfile` and enable launching the tunnel (see the local smoke test sketched after this list)
* Under `Compute shape`, select `Custom configuration`, then `Specialty and previous generation`, and select the `VM.GPU.A10.2` shape
* Under `Logging`, select the log group you've created for the model deployment and keep the option `Enable automatic log creation`
* Under `Storage`, set 500 GB or more of storage
* Under `Networking`, keep the `Default networking` configuration
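
Before creating the Job, you can sanity-check locally that the image starts with the same entrypoint and command the Job will use. A minimal sketch, assuming Docker is available and using the example image tag from above (adjust it to the tag you actually pushed):

```bash
# Hypothetical local smoke test: run the image with the same entrypoint
# ("/bin/bash", "--login", "-c") and command (/aiapps/runner.sh) as the Job.
docker run --rm -it \
  --entrypoint /bin/bash \
  <your-region>.ocir.io/<your-tenancy-name>/vllm-odsc:0.1.3 \
  --login -c /aiapps/runner.sh
```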

With this, we are now ready to start the job:

* Select the newly created job, if you have not done so already
* Click on `Start a job run`
* Keep all default settings and click the `Start` button at the bottom left

Once the job is up and running, you will notice an authentication code appear in the logs; copy it and use it to authorize the tunnel. A few seconds later, the link for the tunnel will appear.

![vscode tunnel in the oci job](../../../jobs/tutorials/assets/images/vscde-server-tunnel-job.png)

Copy the link and open it in a browser; this loads the VS Code editor and reveals the code inside the job, enabling direct debugging and coding.

Note that you can also use your local VS Code IDE for the same purpose via the [Visual Studio Code Remote - Tunnels](https://code.visualstudio.com/docs/remote/tunnels) extension.

model-deployment/containers/llama2/README.md

Lines changed: 102 additions & 57 deletions
@@ -5,17 +5,81 @@ This repo provides two approaches to deploy the Llama-2 LLM:
* [Text Generation Inference](https://github.com/huggingface/text-generation-inference) from HuggingFace.
* [vLLM](https://github.com/vllm-project/vllm) developed at UC Berkeley.

## Prerequisites

* This is a Limited Availability feature. Please reach out to us via email at `ask-oci-data-science_grp@oracle.com` to ask to be allowlisted for this LA feature.
* Configure your [API Auth Token](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrygettingauthtoken.htm) to be able to run and test your code locally.
* Install [Docker](https://docs.docker.com/get-docker) or [Rancher Desktop](https://rancherdesktop.io/) as a Docker alternative.

## OCI Logging

When experimenting with new frameworks and models, it is highly advisable to attach log groups to the model deployment to enable self-service debugging. Follow the steps below to create the logs.

* Create logging for the model deployment (if you have already created it, you can skip this step)
  * Go to the [OCI Logging Service](https://cloud.oracle.com/logging/log-groups) and select `Log Groups`
  * Either select one of the existing log groups or create a new one
  * In the log group, create ***two*** `Log`s, one predict log and one access log, as follows:
    * Click on `Create custom log`
    * Specify a name (predict|access) and select the log group you want to use
    * Under `Create agent configuration`, select `Add configuration later`
    * Then click `Create agent configuration`

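Equivalently, the two custom logs can be created from the command line. A sketch using the OCI CLI (the log group OCID is a placeholder; verify the flags with `oci logging log create --help`):

```bash
# Hypothetical OCI CLI equivalent of the console steps above.
# Replace <log-group-ocid> with your log group's OCID.
oci logging log create --log-group-id <log-group-ocid> \
  --display-name predict --log-type CUSTOM
oci logging log create --log-group-id <log-group-ocid> \
  --display-name access --log-type CUSTOM
```
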
## Required IAM Policies

See the public [documentation](https://docs.oracle.com/en-us/iaas/data-science/using/policies.htm) for the full list of policies.

### Generic Model Deployment policies

`allow group <group-name> to manage data-science-model-deployments in compartment <compartment-name>`

`allow dynamic-group <dynamic-group-name> to manage data-science-model-deployments in compartment <compartment-name>`

### Logging policy

The following policy allows a model deployment to emit logs to the Logging service. You need it if you're using Logging in a model deployment:

`allow any-user to use log-content in tenancy where ALL {request.principal.type = 'datasciencemodeldeployment'}`

### Bring your own container [policies](https://docs.oracle.com/en-us/iaas/data-science/using/model-dep-policies-auth.htm#model_dep_policies_auth__access-custom-container)

Create a dynamic group with the following matching rule:

`ALL { resource.type = 'datasciencemodeldeployment' }`

Then allow it to pull images from the registry:

`allow dynamic-group <dynamic-group-name> to read repos in compartment <compartment-name> where ANY {request.operation='ReadDockerRepositoryMetadata',request.operation='ReadDockerRepositoryManifest',request.operation='PullDockerLayer' }`


#### If the repository is in the root compartment, allow read for the tenancy

`allow dynamic-group <dynamic-group-name> to read repos in tenancy where ANY { request.operation='ReadDockerRepositoryMetadata', request.operation='ReadDockerRepositoryManifest', request.operation='PullDockerLayer' }`

#### For user-level policies

`allow any-user to read repos in tenancy where ALL { request.principal.type = 'datasciencemodeldeployment' }`

`allow any-user to read repos in compartment <compartment-name> where ALL { request.principal.type = 'datasciencemodeldeployment' }`

### Model Store [export API](https://docs.oracle.com/en-us/iaas/data-science/using/large-model-artifact-export.htm#large-model-artifact-export) for creating model artifacts greater than 6 GB in size

`allow service datascience to manage object-family in compartment <compartment> where ALL {target.bucket.name='<bucket_name>'}`

`allow service objectstorage-<region> to manage object-family in compartment <compartment> where ALL {target.bucket.name='<bucket_name>'}`

### Policy to check Data Science work requests

`allow group <group_name> to manage data-science-work-requests in compartment <compartment_name>`

For all other Data Science policies, please refer to these [details](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/distributed_training/README.md#3-oci-policies).

## Methods for model weight downloads

### Direct Download

If you choose to download the model directly from the HuggingFace repository at container startup time, you will need to configure a VCN with access to the internet. Create a subnet for the model deployment:

* Go to the [Virtual Cloud Network](https://cloud.oracle.com/networking/vcns) page in your tenancy
* Select one of your existing VCNs
* Click the `Create subnet` button
* Specify a name
* As `Subnet type`, select `Regional`
* As the IP CIDR block, set `10.0.32.0/19`
* Under `Route Table`, select the routing table for the private subnet
* Under `Subnet Access`, select `Private Subnet`
* Click on `Create Subnet` to create it

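Equivalently, a private regional subnet can be created with the OCI CLI. A sketch, with placeholder OCIDs and a hypothetical display name (verify flags with `oci network subnet create --help`):

```bash
# Hypothetical OCI CLI equivalent of the console steps above.
oci network subnet create \
  --compartment-id <compartment-ocid> \
  --vcn-id <vcn-ocid> \
  --cidr-block 10.0.32.0/19 \
  --display-name llm-md-subnet \
  --prohibit-public-ip-on-vnic true
```
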

The model will be downloaded at container startup time; we just need to provide an authentication token to connect to the model repository. Follow the steps below to host the token:

* Create a file called `token` in the same folder and store your Hugging Face user access token inside it; you can locate the token under your [Hugging Face settings](https://huggingface.co/settings/tokens)
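
For example, from a shell (a minimal sketch; `<your-hf-token>` is a placeholder for your actual token):

```bash
# Store the Hugging Face user access token in a file named `token`
# in the same folder as the container build files.
echo -n "<your-hf-token>" > token
chmod 600 token  # optional: restrict read access to the current user
```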

@@ -56,9 +120,7 @@ The model will be downloaded at container startup time, we just need to provide
```
* Depending on the size of the model, the model catalog item will take some time to be prepared before it can be deployed using the Model Deploy service. The script above will return the status SUCCEEDED once the model is completely uploaded and ready to be used in the Model Deploy service.

## Build TGI Container

To construct the containers required for this deployment, complete the following steps:

* Check out this repository
@@ -67,36 +129,13 @@ To construct the required containers for this deployment and retain the necessar
```bash
cd model-deployment/containers/llama2
```

* This example uses [OCI Container Registry](https://docs.oracle.com/en-us/iaas/Content/Registry/Concepts/registryoverview.htm) to store the container image required for the deployment. For the `Makefile` to execute the container build-and-push process to the Oracle Cloud Container Registry, you have to set the `TENANCY_NAME` and `REGION_KEY` environment variables in your local terminal. `TENANCY_NAME` is the name of your tenancy, which you can find under your [account settings](https://cloud.oracle.com/tenancy), and `REGION_KEY` is the 3-letter key of the tenancy region you intend to use for this example, for example IAD for Ashburn or FRA for Frankfurt. You can find the region keys in the public documentation for [Regions and Availability Domains](https://docs.oracle.com/en-us/iaas/Content/General/Concepts/regions.htm)
```bash
export TENANCY_NAME=<your-tenancy-name>
export REGION_KEY=<region-key>
```
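
For reference, here is a sketch of how these variables typically compose the OCIR image path used in this example (the repository name and tag are assumptions taken from the example above):

```bash
# Hypothetical: compose the OCIR image path from the exported variables.
# OCIR hostnames use the lowercase region key, e.g. iad.ocir.io.
export IMAGE="${REGION_KEY,,}.ocir.io/${TENANCY_NAME}/vllm-odsc:0.1.3"
echo "${IMAGE}"
```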
You can find the official documentation about OCI Data Science Model Deployment at <https://docs.oracle.com/en-us/iaas/data-science/using/model_dep_create.htm>.

* Build the TGI container image; this step will take a while
@@ -329,47 +368,53 @@ ads opctl run -f ads-md-deploy-tgi.yaml
ads opctl run -f ads-md-deploy-vllm.yaml
```

## Troubleshooting

The following are the most common failure cases when deploying large models.

### Create/Update Model deployment failure: insufficient deployment timeout

#### Reason

Insufficient model deployment timeout.

#### Symptoms

The Work Request logs will show the following error:

`Workflow timed out. Maximum runtime was: <deployment_timeout> minutes.`

#### Mitigation

If model loading takes more time than the service has allotted, you can override the deployment timeout through the default configuration override, using the key-value pair below:

`DEPLOYMENT_TIMEOUT_IN_MINUTES`: `60`

The maximum allowed value is 240 minutes.

### Create/Update Model deployment failure: insufficient boot volume storage

#### Reason

Insufficient boot volume storage.

#### Symptoms

The Work Request logs will show the following error: `Errors occurred while bootstrapping the Model Deployment.`

Investigate further to determine whether the failure was due to insufficient boot volume size.

#### Mitigation

If the model archive's compression ratio is very high and the service fails to download and unzip the artifact, you can override the volume size through the default configuration override, using the key-value pair below:

`STORAGE_SIZE_IN_GB`: `1000`

The maximum allowed value is the maximum supported boot volume size.

### Create/Update Model deployment failure: container timeout

#### Reason

Container timeout.

#### Symptoms

The Work Request logs will show the following error: `Errors occurred while bootstrapping the Model Deployment.`

#### Mitigation

Check that the predict and health check endpoints, if defined through environment variables, are valid for the specified container image. The predict and access logs may also contain more information.
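
For example, to verify an endpoint against a locally running copy of the container, a sketch assuming it listens on port 8080 and exposes a `/health` route (both are assumptions; use your container's actual values):

```bash
# Hypothetical health probe; -f makes curl fail on HTTP error statuses.
curl -sf http://localhost:8080/health && echo "health check OK"
```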

### Advanced debugging options: code debugging inside the container using a Job

For a more detailed level of debugging, refer to [README-DEBUG.md](./README-DEBUG.md).

## Additional Make Commands
