
Commit 3466cd5

Merge pull request #343 from gargnipungarg/md-llama-content-realignment
Updated readme with troubleshooting section
2 parents 93f517b + 1e7e490

File tree

2 files changed: +143 −57
model-deployment/containers/llama2/README-DEBUG.md

Lines changed: 41 additions & 0 deletions

@@ -0,0 +1,41 @@
# Debugging

You can debug the code inside the container by using the [Visual Studio Code Remote - Tunnels](https://code.visualstudio.com/docs/remote/tunnels) extension, which lets you connect to a remote machine, such as a desktop PC or a virtual machine (VM), through a secure tunnel. You can connect to that machine from a VS Code client anywhere, without having to set up SSH yourself, including from Oracle Cloud Infrastructure Data Science Jobs.

Tunneling securely transmits data from one network to another. This can eliminate the need for the source code to be on your VS Code client machine, since the extension runs commands and other extensions directly on the remote OCI Data Science Job machine.

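For context, the VS Code CLI can open such a tunnel with a single command. A minimal sketch of what a container startup script might run (an assumption for illustration; the actual script shipped in the container may differ):

```bash
# Hypothetical: start a VS Code tunnel from inside the remote machine.
# Flags name the tunnel and accept the server license non-interactively.
code tunnel --accept-server-license-terms --name oci-job-debug
```
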
## Requirements

To use debugging, you must first complete the steps of building and pushing the container of your choice to the Oracle Cloud Container Registry.

## Run for Debugging

For debugging purposes we will use the OCI Data Science Jobs service. Once the TGI or vLLM container has been built and published to the OCIR, we can run it as a Job, which lets us take advantage of VS Code Remote Tunneling. To do so, follow these steps:

* In your [OCI Data Science](https://cloud.oracle.com/data-science/projects) section, select the project you've created for deployment
* Under the `Resources` section, select `Jobs`
* Click the `Create job` button
* Under `Default Configuration`, select the checkbox for `Bring your own container`
* Set the following environment variables:
  * `CONTAINER_CUSTOM_IMAGE` with the value of the OCI Container Registry repository location where you pushed your container, for example: `<your-region>.ocir.io/<your-tenancy-name>/vllm-odsc:0.1.3`
  * `CONTAINER_ENTRYPOINT` with the value `"/bin/bash", "--login", "-c"`
  * `CONTAINER_CMD` with the value `/aiapps/runner.sh`
  * These values override the defaults set in the `Dockerfile` and enable launching the tunnel (see the local smoke test sketched after this list)
* Under `Compute shape`, select `Custom configuration`, then `Specialty and previous generation`, and select the `VM.GPU.A10.2` shape
* Under `Logging`, select the log group you've created for the model deployment and keep the option `Enable automatic log creation`
* Under `Storage`, set 500 GB or more of storage
* Under `Networking`, keep the `Default networking` configuration
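
Before creating the Job, you can sanity-check locally that the image starts with the same entrypoint and command the Job will use. A minimal sketch, assuming Docker is available and using the example image tag from above (adjust it to the tag you actually pushed):

```bash
# Hypothetical local smoke test: run the image with the same entrypoint
# ("/bin/bash", "--login", "-c") and command (/aiapps/runner.sh) as the Job.
docker run --rm -it \
  --entrypoint /bin/bash \
  <your-region>.ocir.io/<your-tenancy-name>/vllm-odsc:0.1.3 \
  --login -c /aiapps/runner.sh
```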

With this, we are now ready to start the job:

* Select the newly created job, if you have not done so already
* Click on `Start a job run`
* Keep all default settings and click the `Start` button at the bottom left

Once the job is up and running, you will notice an authentication code appear in the logs; copy it and use it to authorize the tunnel. A few seconds later, the link for the tunnel will appear.

![vscode tunnel in the oci job](../../../jobs/tutorials/assets/images/vscde-server-tunnel-job.png)

Copy the link and open it in a browser; this loads the VS Code editor and reveals the code inside the job, enabling direct debugging and coding.

Note that you can also use your local VS Code IDE for the same purpose via the [Visual Studio Code Remote - Tunnels](https://code.visualstudio.com/docs/remote/tunnels) extension.

model-deployment/containers/llama2/README.md

Lines changed: 102 additions & 57 deletions
@@ -5,17 +5,81 @@ This repo provides two approaches to deploy the Llama-2 LLM:
* [Text Generation Inference](https://github.com/huggingface/text-generation-inference) from HuggingFace.
* [vLLM](https://github.com/vllm-project/vllm) developed at UC Berkeley.

## Prerequisites

* This is a Limited Availability feature. Please reach out to us via email at `ask-oci-data-science_grp@oracle.com` to ask to be allowlisted for this LA feature.
* Configure your [API Auth Token](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrygettingauthtoken.htm) to be able to run and test your code locally.
* Install [Docker](https://docs.docker.com/get-docker) or [Rancher Desktop](https://rancherdesktop.io/) as a Docker alternative.

## OCI Logging

When experimenting with new frameworks and models, it is highly advisable to attach log groups to the model deployment to enable self-service debugging. Follow the steps below to create the logs.

* Create logging for the model deployment (if you have already created it, you can skip this step)
  * Go to the [OCI Logging Service](https://cloud.oracle.com/logging/log-groups) and select `Log Groups`
  * Either select one of the existing log groups or create a new one
  * In the log group, create ***two*** `Log`s, one predict log and one access log, as follows:
    * Click on `Create custom log`
    * Specify a name (predict|access) and select the log group you want to use
    * Under `Create agent configuration`, select `Add configuration later`
    * Then click `Create agent configuration`

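Equivalently, the two custom logs can be created from the command line. A sketch using the OCI CLI (the log group OCID is a placeholder; verify the flags with `oci logging log create --help`):

```bash
# Hypothetical OCI CLI equivalent of the console steps above.
# Replace <log-group-ocid> with your log group's OCID.
oci logging log create --log-group-id <log-group-ocid> \
  --display-name predict --log-type CUSTOM
oci logging log create --log-group-id <log-group-ocid> \
  --display-name access --log-type CUSTOM
```
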
## Required IAM Policies

See the public [documentation](https://docs.oracle.com/en-us/iaas/data-science/using/policies.htm) for the full list of policies.

### Generic Model Deployment policies

`allow group <group-name> to manage data-science-model-deployments in compartment <compartment-name>`

`allow dynamic-group <dynamic-group-name> to manage data-science-model-deployments in compartment <compartment-name>`

### Logging policy

The following policy allows a model deployment to emit logs to the Logging service. You need it if you're using Logging in a model deployment:

`allow any-user to use log-content in tenancy where ALL {request.principal.type = 'datasciencemodeldeployment'}`

### Bring your own container [policies](https://docs.oracle.com/en-us/iaas/data-science/using/model-dep-policies-auth.htm#model_dep_policies_auth__access-custom-container)

Create a dynamic group with the following matching rule:

`ALL { resource.type = 'datasciencemodeldeployment' }`

Then allow it to pull images from the registry:

`allow dynamic-group <dynamic-group-name> to read repos in compartment <compartment-name> where ANY {request.operation='ReadDockerRepositoryMetadata',request.operation='ReadDockerRepositoryManifest',request.operation='PullDockerLayer' }`


#### If the repository is in the root compartment, allow read for the tenancy

`allow dynamic-group <dynamic-group-name> to read repos in tenancy where ANY { request.operation='ReadDockerRepositoryMetadata', request.operation='ReadDockerRepositoryManifest', request.operation='PullDockerLayer' }`

#### For user-level policies

`allow any-user to read repos in tenancy where ALL { request.principal.type = 'datasciencemodeldeployment' }`

`allow any-user to read repos in compartment <compartment-name> where ALL { request.principal.type = 'datasciencemodeldeployment' }`

### Model Store [export API](https://docs.oracle.com/en-us/iaas/data-science/using/large-model-artifact-export.htm#large-model-artifact-export) for creating model artifacts greater than 6 GB in size

`allow service datascience to manage object-family in compartment <compartment> where ALL {target.bucket.name='<bucket_name>'}`

`allow service objectstorage-<region> to manage object-family in compartment <compartment> where ALL {target.bucket.name='<bucket_name>'}`

### Policy to check Data Science work requests

`allow group <group_name> to manage data-science-work-requests in compartment <compartment_name>`

For all other Data Science policies, please refer to these [details](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/distributed_training/README.md#3-oci-policies).

## Methods for model weight downloads

### Direct Download

If you choose to download the model directly from the HuggingFace repository at container startup time, you will need to configure a VCN with access to the internet. Create a subnet for the model deployment:

* Go to the [Virtual Cloud Network](https://cloud.oracle.com/networking/vcns) page in your tenancy
* Select one of your existing VCNs
* Click the `Create subnet` button
* Specify a name
* As `Subnet type`, select `Regional`
* As the IP CIDR block, set `10.0.32.0/19`
* Under `Route Table`, select the routing table for the private subnet
* Under `Subnet Access`, select `Private Subnet`
* Click on `Create Subnet` to create it

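Equivalently, a private regional subnet can be created with the OCI CLI. A sketch, with placeholder OCIDs and a hypothetical display name (verify flags with `oci network subnet create --help`):

```bash
# Hypothetical OCI CLI equivalent of the console steps above.
oci network subnet create \
  --compartment-id <compartment-ocid> \
  --vcn-id <vcn-ocid> \
  --cidr-block 10.0.32.0/19 \
  --display-name llm-md-subnet \
  --prohibit-public-ip-on-vnic true
```
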

The model will be downloaded at container startup time; we just need to provide an authentication token to connect to the model repository. Follow the steps below to host the token:

* Create a file called `token` in the same folder and store your Hugging Face user access token inside it; you can locate the token under your [Hugging Face settings](https://huggingface.co/settings/tokens)
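
For example, from a shell (a minimal sketch; `<your-hf-token>` is a placeholder for your actual token):

```bash
# Store the Hugging Face user access token in a file named `token`
# in the same folder as the container build files.
echo -n "<your-hf-token>" > token
chmod 600 token  # optional: restrict read access to the current user
```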

@@ -56,9 +120,7 @@ The model will be downloaded at container startup time, we just need to provide
```
* Depending on the size of the model, the model catalog item will take some time to be prepared before it can be deployed using the Model Deploy service. The script above will return the status SUCCEEDED once the model is completely uploaded and ready to be used in the Model Deploy service.

## Build TGI Container

To construct the containers required for this deployment, complete the following steps:

* Check out this repository
@@ -67,36 +129,13 @@ To construct the required containers for this deployment and retain the necessar
```bash
cd model-deployment/containers/llama2
```

* This example uses [OCI Container Registry](https://docs.oracle.com/en-us/iaas/Content/Registry/Concepts/registryoverview.htm) to store the container image required for the deployment. For the `Makefile` to execute the container build-and-push process to the Oracle Cloud Container Registry, you have to set the `TENANCY_NAME` and `REGION_KEY` environment variables in your local terminal. `TENANCY_NAME` is the name of your tenancy, which you can find under your [account settings](https://cloud.oracle.com/tenancy), and `REGION_KEY` is the 3-letter key of the tenancy region you intend to use for this example, for example IAD for Ashburn or FRA for Frankfurt. You can find the region keys in the public documentation for [Regions and Availability Domains](https://docs.oracle.com/en-us/iaas/Content/General/Concepts/regions.htm)
```bash
export TENANCY_NAME=<your-tenancy-name>
export REGION_KEY=<region-key>
```
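
For reference, here is a sketch of how these variables typically compose the OCIR image path used in this example (the repository name and tag are assumptions taken from the example above):

```bash
# Hypothetical: compose the OCIR image path from the exported variables.
# OCIR hostnames use the lowercase region key, e.g. iad.ocir.io.
export IMAGE="${REGION_KEY,,}.ocir.io/${TENANCY_NAME}/vllm-odsc:0.1.3"
echo "${IMAGE}"
```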
You can find the official documentation about OCI Data Science Model Deployment at <https://docs.oracle.com/en-us/iaas/data-science/using/model_dep_create.htm>.

* Build the TGI container image; this step will take a while
@@ -329,47 +368,53 @@ ads opctl run -f ads-md-deploy-tgi.yaml
ads opctl run -f ads-md-deploy-vllm.yaml
```

## Troubleshooting

The following are the most common failure cases when deploying large models.

### Create/Update Model deployment failure: insufficient deployment timeout

#### Reason

Insufficient model deployment timeout.

#### Symptoms

The Work Request logs will show the following error:

`Workflow timed out. Maximum runtime was: <deployment_timeout> minutes.`

#### Mitigation

If model loading takes more time than the service has allotted, you can override the deployment timeout through the default configuration override, using the key-value pair below:

`DEPLOYMENT_TIMEOUT_IN_MINUTES`: `60`

The maximum allowed value is 240 minutes.

### Create/Update Model deployment failure: insufficient boot volume storage

#### Reason

Insufficient boot volume storage.

#### Symptoms

The Work Request logs will show the following error: `Errors occurred while bootstrapping the Model Deployment.`

Investigate further to determine whether the failure was due to insufficient boot volume size.

#### Mitigation

If the model archive's compression ratio is very high and the service fails to download and unzip the artifact, you can override the volume size through the default configuration override, using the key-value pair below:

`STORAGE_SIZE_IN_GB`: `1000`

The maximum allowed value is the maximum supported boot volume size.

### Create/Update Model deployment failure: container timeout

#### Reason

Container timeout.

#### Symptoms

The Work Request logs will show the following error: `Errors occurred while bootstrapping the Model Deployment.`

#### Mitigation

Check that the predict and health check endpoints, if defined through environment variables, are valid for the specified container image. The predict and access logs may also contain more information.
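
For example, to verify an endpoint against a locally running copy of the container, a sketch assuming it listens on port 8080 and exposes a `/health` route (both are assumptions; use your container's actual values):

```bash
# Hypothetical health probe; -f makes curl fail on HTTP error statuses.
curl -sf http://localhost:8080/health && echo "health check OK"
```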

### Advanced debugging options: code debugging inside the container using a Job

For a more detailed level of debugging, refer to [README-DEBUG.md](./README-DEBUG.md).

## Additional Make Commands
