From e8c1d6eae5d03464953b20491047f7b3b6486b16 Mon Sep 17 00:00:00 2001
From: Will Price
Date: Fri, 11 Dec 2020 13:20:45 +0000
Subject: [PATCH 1/4] Add first draft of nvidia instructions

---
 source/running.rst | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/source/running.rst b/source/running.rst
index 1f8b078..615d5d2 100644
--- a/source/running.rst
+++ b/source/running.rst
@@ -119,6 +119,49 @@ Once the script has been edited to your liking, re-run Packer with:
 This will start a VM inside your cloud account, build the image and then shut
 down the VM. From that point on, any newly-started nodes will use the new
 image.
 
+AWS GPU nodes
++++++++++++++
+
+We need to adapt the default packer image as out of the box it does not
+contain any of the nvidia software necessary to interact with the GPU.
+
+The first step is to change the ``compute_image_extra.sh`` script to
+install the nvidia driver and CUDA toolchain:
+
+.. code-block:: shell-session
+
+   [citc@mgmt ~]$ cat >> compute_image_extra.sh << EOF
+       sudo dnf clean all
+       sudo dnf -y install kernel-devel
+       sudo dnf -y module install nvidia-driver:latest-dkms
+       sudo dnf -y install cuda
+       sudo dkms autoinstall
+       EOF
+
+We can't just rebuild our image straight away though since the CUDA
+toolchain is large and exceeds the base image size, consequently we need
+to change the packer configuration to create a larger image. Edit
+``/etc/citc/packer/all.pkr.hcl`` in your favourite editor, and add the
+following to the end of the ``source "amazon-ebs" "aws"`` section
+
+.. code-block::
+
+   launch_block_device_mappings {
+     device_name = "/dev/sda1"
+     volume_size = 40
+   }
+
+We can now re-build the image used to provision compute nodes:
+
+.. code-block:: shell-session

From: Will Price
Date: Fri, 11 Dec 2020 13:51:21 +0000
Subject: [PATCH 2/4] Decrease the image size used for building an AWS GPU AMI

40GB is unnecessarily large and causes the AMI build to take a long time
and increases the provisioning time of the compute nodes.
---
 source/running.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/source/running.rst b/source/running.rst
index 615d5d2..1f9ff0f 100644
--- a/source/running.rst
+++ b/source/running.rst
@@ -149,7 +149,7 @@ following to the end of the ``source "amazon-ebs" "aws"`` section
 
    launch_block_device_mappings {
      device_name = "/dev/sda1"
-     volume_size = 40
+     volume_size = 10
    }
 
 We can now re-build the image used to provision compute nodes:

From 6b3541009a230017a2734fd1a130c5af17420c66 Mon Sep 17 00:00:00 2001
From: Will Price
Date: Fri, 11 Dec 2020 15:10:58 +0000
Subject: [PATCH 3/4] Remove installation of CUDA for AWS GPU AMI

It is not necessary to install CUDA to run GPU accelerated things
(e.g. pytorch); we will leave it up to users to install CUDA as a module.
---
 source/running.rst | 1 -
 1 file changed, 1 deletion(-)

diff --git a/source/running.rst b/source/running.rst
index 1f9ff0f..db67c05 100644
--- a/source/running.rst
+++ b/source/running.rst
@@ -135,7 +135,6 @@ install the nvidia driver and CUDA toolchain:
        sudo dnf clean all
        sudo dnf -y install kernel-devel
        sudo dnf -y module install nvidia-driver:latest-dkms
-       sudo dnf -y install cuda
        sudo dkms autoinstall
        EOF
 

From bded4cbf3fbff662d21710dfe63a997321ce3b74 Mon Sep 17 00:00:00 2001
From: Will Price
Date: Fri, 11 Dec 2020 15:12:03 +0000
Subject: [PATCH 4/4] Remove instructions to modify AMI builder size

We have now updated the CitC config so that by default the builder size
is 20GB; see https://github.com/clusterinthecloud/ansible/pull/93
---
 source/running.rst | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/source/running.rst b/source/running.rst
index db67c05..014f805 100644
--- a/source/running.rst
+++ b/source/running.rst
@@ -138,19 +138,6 @@ install the nvidia driver and CUDA toolchain:
        sudo dkms autoinstall
        EOF
 
-We can't just rebuild our image straight away though since the CUDA
-toolchain is large and exceeds the base image size, consequently we need
-to change the packer configuration to create a larger image. Edit
-``/etc/citc/packer/all.pkr.hcl`` in your favourite editor, and add the
-following to the end of the ``source "amazon-ebs" "aws"`` section
-
-.. code-block::
-
-   launch_block_device_mappings {
-     device_name = "/dev/sda1"
-     volume_size = 10
-   }
-
 We can now re-build the image used to provision compute nodes:
 
 .. code-block:: shell-session
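
Taken together, the series leaves ``compute_image_extra.sh`` installing only
the NVIDIA driver: no CUDA package and no AMI size override. A minimal sketch
of the lines the script ends up appending after all four patches is shown
below; the ``dnf config-manager --add-repo`` line is an assumption standing in
for the repository-setup command whose text is missing from the first patch,
so check the URL against NVIDIA's current instructions before relying on it.

.. code-block:: shell

   # Assumed repository setup: the original line is missing from patch 1/4;
   # NVIDIA's documented CUDA repo for RHEL/CentOS 8 is used as a stand-in.
   sudo dnf config-manager --add-repo \
       https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
   sudo dnf clean all
   # Kernel headers let DKMS build the driver module during image creation.
   sudo dnf -y install kernel-devel
   # Driver only; CUDA is left for users to provide as a module (patch 3/4).
   sudo dnf -y module install nvidia-driver:latest-dkms
   sudo dkms autoinstall

With the default builder volume now 20GB (patch 4/4), the image is rebuilt
with Packer exactly as in the existing instructions, with no
``launch_block_device_mappings`` override required.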