
Commit 93f517b

Merge pull request #345 from lyudmil-pelov/main
Llama2 fine-tuning docs update and move to README
2 parents: 2f09225 + 164b2d4

11 files changed: +548 −220 lines

distributed_training/llama2/README.md

Lines changed: 353 additions & 0 deletions
Large diffs are not rendered by default.

distributed_training/llama2/demo.md

Lines changed: 0 additions & 191 deletions
This file was deleted.
7 binary image files added (141 KB, 627 KB, 262 KB, 239 KB, 539 KB, 582 KB, 1.09 MB); previews not rendered.
Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "58901f38",
   "metadata": {},
   "source": [
    "# Loading back FSDP checkpoints\n",
    "\n",
    "For more information: https://github.com/facebookresearch/llama-recipes/blob/main/docs/inference.md#loading-back-fsdp-checkpoints"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98b57d30",
   "metadata": {},
   "source": [
    "## All of the code in this notebook should be run in the OCI Data Science Notebook Terminal!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05a59132",
   "metadata": {},
   "source": [
    "Before you start, make sure that you've installed the `pytorch20_p39_gpu_v2` Conda environment and activated it in the `Terminal`:\n",
    "\n",
    "```bash\n",
    "odsc conda install -s pytorch20_p39_gpu_v2\n",
    "```\n",
    "\n",
    "...then activate it:\n",
    "\n",
    "```bash\n",
    "conda activate /home/datascience/conda/pytorch20_p39_gpu_v2\n",
    "```\n",
    "\n",
    "Then install all of the required dependencies:\n",
    "\n",
    "```bash\n",
    "pip install tokenizers==0.13.3 -U && pip install transformers -U && pip install llama-recipes==0.0.1\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e2aecea",
   "metadata": {},
   "source": [
    "The following commands also work best when you execute them in the `terminal`!\n",
    "\n",
    "First, you have to log in to get access to the Llama2 model:\n",
    "\n",
    "```bash\n",
    "huggingface-cli login\n",
    "```\n",
    "\n",
    "Then run the checkpoint converter; the invocation looks like the following:\n",
    "\n",
    "```bash\n",
    "python -m llama_recipes.inference.checkpoint_converter_fsdp_hf --fsdp_checkpoint_path /mnt/llama2/outputs/lvp-7b/ocid1.datasciencejob.oc1.eu-frankfurt-1.amaaaaaan/fine-tuned-meta-llama/Llama-2-7b-hf --consolidated_model_path /mnt/llama2/fsdp_consolidated_checkpoints --HF_model_path_or_name \"meta-llama/Llama-2-13b-hf\"\n",
    "```\n",
    "\n",
    "Replace the `--fsdp_checkpoint_path` value with the folder you specified via `--dist_checkpoint_root_folder`, which will be a location in your object storage bucket, as per the example above. Notice that we ran this in OCI Data Science Notebooks and mounted the object storage bucket used to store the FSDP checkpoints under `/mnt/llama2` (one way to do this is sketched below). The `--consolidated_model_path` is the path where the consolidated weights will be stored. The `--HF_model_path_or_name` is the name of the model used for the fine-tuning or, if you downloaded the model locally, the location of the downloaded model.\n",
    "\n",
    "If the merging process was successful, you should see something like this in your `--consolidated_model_path` folder:\n",
    "\n",
    "```bash\n",
    " 0 drwxr-xr-x. 1 datascience users 0 Oct 18 15:48 .\n",
    " 0 drwxr-xr-x. 1 datascience users 0 Oct 18 14:38 ..\n",
    " 512 -rw-r--r--. 1 datascience users 42 Oct 18 16:35 added_tokens.json\n",
    "1.0K -rw-r--r--. 1 datascience users 656 Oct 18 16:35 config.json\n",
    " 512 -rw-r--r--. 1 datascience users 111 Oct 18 16:35 generation_config.json\n",
    "9.2G -rw-r--r--. 1 datascience users 9.2G Oct 18 16:35 pytorch_model-00001-of-00003.bin\n",
    "9.3G -rw-r--r--. 1 datascience users 9.3G Oct 18 16:36 pytorch_model-00002-of-00003.bin\n",
    "6.7G -rw-r--r--. 1 datascience users 6.7G Oct 18 16:36 pytorch_model-00003-of-00003.bin\n",
    " 24K -rw-r--r--. 1 datascience users 24K Oct 18 16:36 pytorch_model.bin.index.json\n",
    " 512 -rw-r--r--. 1 datascience users 72 Oct 18 16:35 special_tokens_map.json\n",
    "1.5K -rw-r--r--. 1 datascience users 1.2K Oct 18 16:35 tokenizer_config.json\n",
    "489K -rw-r--r--. 1 datascience users 489K Oct 18 16:35 tokenizer.model\n",
    "```"
   ]
  },
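  {
   "cell_type": "markdown",
   "id": "b3f1c2d0",
   "metadata": {},
   "source": [
    "Once the consolidation has finished, you can load the merged weights back for inference with the standard Hugging Face `transformers` API. The cell below is a minimal sketch: it assumes the `--consolidated_model_path` used above (`/mnt/llama2/fsdp_consolidated_checkpoints`), enough GPU memory to hold the model in fp16, and that the `accelerate` package is available for `device_map=\"auto\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4e5f6a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal sketch: load the consolidated checkpoint back for inference.\n",
    "# The path below assumes the --consolidated_model_path from the converter step.\n",
    "import torch\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "consolidated_model_path = \"/mnt/llama2/fsdp_consolidated_checkpoints\"\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(consolidated_model_path)\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    consolidated_model_path,\n",
    "    torch_dtype=torch.float16,  # fp16 halves the memory footprint\n",
    "    device_map=\"auto\",  # requires `accelerate`; auto-places layers on available GPUs\n",
    ")\n",
    "\n",
    "prompt = \"Tell me about Oracle Cloud Infrastructure.\"\n",
    "inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
    "output = model.generate(**inputs, max_new_tokens=64)\n",
    "print(tokenizer.decode(output[0], skip_special_tokens=True))"
   ]
  },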
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2407ae40",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:pytorch20_p39_gpu_v2]",
   "language": "python",
   "name": "conda-env-pytorch20_p39_gpu_v2-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

0 commit comments
