Commit 2d4e2ea

[Feature] Multiple minor improvements, simpler experiment template. (#30)

1 parent: a3168fb
37 files changed (+336 −259 lines)


.gitignore

Lines changed: 1 addition & 1 deletion

@@ -143,6 +143,6 @@ outputs/*
 !outputs/README.md
 
 wandb
-slurm-*.out
+**/*.out
 
 third-party
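The change above swaps `slurm-*.out` for the broader `**/*.out` glob, which ignores `.out` files at any depth rather than only Slurm logs at the repository root. A throwaway check of the pattern (all paths below are illustrative, created in a temporary repo):

```shell
#!/bin/sh
# Verify that the new **/*.out pattern ignores .out files at any depth.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
printf '**/*.out\n' > .gitignore
mkdir -p outputs/nested
touch slurm-1234.out outputs/nested/job.out
# check-ignore exits 0 and prints the paths when they are ignored
git check-ignore slurm-1234.out outputs/nested/job.out
```

A leading `**/` in gitignore matches in all directories, including the root, so this single pattern covers both cases.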

README.md

Lines changed: 12 additions & 5 deletions

@@ -60,16 +60,23 @@ Remember to get back to this root one after finishing each step.
 
 1. Clone the repo with destination `PROJECT_NAME`.
    - If you plan to develop on your local computer, clone it there.
-   - If you plan to develop on your remote server (with direct access over say SSH, e.g. EPFL HaaS), clone it there.
-   - If you plan to develop on CSCS Clariden, clone it there (use an allocation with a compute node, not the login node)
-   - If you plan to develop or deploy on a managed cluster without a build engine
+   - If you plan to develop or deploy on a remote server/cluster without a build engine
      (e.g., EPFL Run:ai clusters, SCITAS clusters), clone on your local machine.
-     (Docker allows cross-platform builds with emulation, but it can be slow.
+     (You will build the image on your local machine, then clone there for deployment.
+     Docker allows cross-platform builds with emulation, but it can be slow.
      We would recommend that your local machine is of the same platform as the cluster (e.g. `amd64`, `arm64`),
      or that you have access to a remote Docker engine running on the same platform as the cluster.)
+   - If you plan to develop on a remote server/cluster with a build engine
+     (e.g. EPFL HaaS, CSCS Clariden), clone it there.
    ```
-   git clone <HTTPS/SSH> PROJECT_NAME
+   # For your local machine, clone anywhere.
+   # For clusters with scratch filesystems that have a cleaning policy, clone in your home directory.
+   # The training artifacts will later be stored on the scratch filesystem and symlinked to this directory.
+   # Also note the creation of a `dev` instance of the repo (and later a `run` instance for unattended jobs).
+   mkdir PROJECT_NAME
    cd PROJECT_NAME
+   git clone <HTTPS/SSH> dev
+   cd dev
    # The current directory is referred to as PROJECT_ROOT
    ```
 We will refer to the absolute path to the root of the repository as `PROJECT_ROOT`.
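The clone layout introduced in this hunk can be sketched end to end. In the sketch below, `git clone` is replaced by `mkdir` so it runs without network access, and `PROJECT_NAME` stands in for your repository name:

```shell
#!/bin/sh
# Sketch of the PROJECT_NAME/{dev,run} layout the updated README creates.
set -e
work=$(mktemp -d)
cd "$work"
mkdir PROJECT_NAME   # outer directory named after the project
cd PROJECT_NAME
mkdir dev            # stands in for: git clone <HTTPS/SSH> dev
mkdir run            # later: git clone <HTTPS/SSH> run (for unattended jobs)
cd dev               # this directory is PROJECT_ROOT
pwd
```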

data/README.md

Lines changed: 31 additions & 30 deletions

@@ -13,40 +13,41 @@ The directory can be accessed in the experiments with `config.data_dir`.
 Of course, this doesn't mean that the datasets inside `PROJECT_ROOT/data/` need to be physically in the same directory
 as the project.
 You can create symlinks to them.
-This shifts the data path configuration from the code and config,
-which we prefer to be identical across deployment options,
-to the installation steps.
+This shifts the data path configuration from the code and config to the installation steps
+(which we prefer, as it makes the committed code identical across deployment options).
 This is also more convenient than using environment variables to point to individual dataset locations.
 
 Below, you can instruct the users on how to download or link to the data and preprocess it.
 
-When the data is small enough, you can instruct the users to download it in the `PROJECT_ROOT/data/` directory.
-Otherwise, you can provide hints to them on how to download it (or reuse parts of it) in separate storages
-(potentially in a shared storage where some datasets already exist) and then create symlinks to it.
-As these instructions to use separate storage will depend on the installation method, we advise moving
-them to the `installation/*/README.md` file after the development environment installation instructions.
-You can link back to this file for common instructions and information about download links.
-
-* For the local conda installation method, this is only symlinks to other locations in the local filesystem. E.g.
-  ```bash
-  # The data set already exist at /absolute_path/to/some-dataset
-  # FROM the PROJECT_ROOT do
-  ln -s /absolute-path/to/some-dataset data/some-dataset
-  ```
-* For the local deployment option with Docker, and the container deployment on managed clusters,
-  in addition to symlinks, you naturally have to mount the storage where the symlinks points to.
-  The symlink would look like:
-  ```bash
-  # Make sure to consistently mount the volume with in the same location.
-  # The data set already exist at /absolute-path-in-the-mounted-volume/to/some-dataset
-  # FROM the PROJECT_ROOT do
-  ln -s /absolute-path-in-the-mounted-volume/to/some-dataset data/some-dataset
-  ```
-  And you would edit the `../installation/docker-*/compose.yaml` file for the local deployment option with Docker,
-  otherwise the flags of the cluster client for the managed clusters.
-  mount the individual datasets inside the `PROJECT_ROOT/data/` directory. Avoid nested mounts.
-  Otherwise, you can do like the option below, to mount every shared dataset directory somewhere and symlink to them.
-  (Note: we use symlinks in this case as well to avoid nested mounts which can cause issues.)
+When the data is small enough (a few MBs),
+you can instruct the users (including you) to download it in the `PROJECT_ROOT/data/` directory.
+
+Otherwise, you can provide hints to them on how to download it (or reuse parts of it) in a separate storage
+(likely in a shared storage where some datasets already exist) and then create symlinks to the different parts.
+For managed clusters where you need to mount different filesystems, remember to add this to the deployment scripts
+and setup files (e.g. `compose.yaml` for deployment with Docker).
+
+Here are example instructions:
+
+To set up the `data` directory, you can download the data anywhere on your system and then symlink to the data from
+the `PROJECT_ROOT/data/` directory.
+
+```bash
+# The dataset already exists at /absolute-path/to/some-dataset
+# FROM the PROJECT_ROOT do
+ln -s /absolute-path/to/some-dataset data/some-dataset
+# Do this for each dataset root.
+# TEMPLATE TODO: list all dataset roots (it's better to group them and use the groups accordingly in your code).
+```
+
+Be mindful that for the different deployment methods with container engines you will have to mount the filesystems
+where the data is stored (e.g. the local deployment option with Docker, and the container deployment on managed clusters).
+
+`TEMPLATE TODO:` For the local deployment option with Docker you would edit the `../installation/docker-*/compose.yaml` file,
+for the managed clusters you would edit the flags of the cluster client (`runai`, `srun`, etc.).
+Avoid nested mounts.
+It's better to mount the whole "scratch" filesystem and let the symlinks handle the rest.
 
 ## Description of the data
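The symlink pattern described in this hunk can be exercised with throwaway directories; every path below is an illustrative stand-in for your shared storage and `PROJECT_ROOT`:

```shell
#!/bin/sh
# Sketch: link a dataset stored elsewhere into PROJECT_ROOT/data/.
set -e
storage=$(mktemp -d)    # stands in for the shared/scratch storage
mkdir -p "$storage/some-dataset"
project=$(mktemp -d)    # stands in for PROJECT_ROOT
mkdir -p "$project/data"
cd "$project"
ln -s "$storage/some-dataset" data/some-dataset
# The symlink resolves to the absolute dataset path:
readlink data/some-dataset
```

Because the link target is absolute, it keeps resolving inside a container as long as the storage is mounted at the same path, which is why the hunk recommends mounting the whole filesystem rather than individual datasets.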

installation/conda-osx-arm64-mps/README.md

Lines changed: 6 additions & 1 deletion

@@ -31,6 +31,7 @@ for your future users (and yourself).
 # https://anaconda.org/pytorch/pytorch
 # The hardware acceleration will be determined by the packages you install.
 # E.g. if you install PyTorch with CUDA, set the acceleration to cuda.
+# Note: new PyTorch versions are only distributed on PyPI (i.e. with `pip`).
 ```
 If you plan to support multiple platforms or hardware accelerations,
 you can duplicate this installation method directory
 
@@ -61,8 +62,11 @@ for your future users (and yourself).
 Clone the git repository.
 
 ```bash
-git clone <HTTPS/SSH> template-project-name
+# Keep a /dev copy for development and a /run copy for running unattended experiments.
+mkdir template-project-name
 cd template-project-name
+git clone <HTTPS/SSH> dev
+cd dev
 ```
 
 We will refer the absolute path to the root of the repository as `PROJECT_ROOT`.
 
@@ -131,6 +135,7 @@ Python dependencies are managed by both conda and pip.
 If not available on conda use `brew`.
 - Use `conda` for Python dependencies packaged with more that just Python code (e.g. `pytorch`, `numpy`).
 These will typically be your main dependencies and will likely not change as your project grows.
+Note: new PyTorch versions are only distributed on PyPI (i.e., with `pip`).
 - Use `pip` for the rest of the Python dependencies (e.g. `tqdm`).
 - For more complex dependencies that may require a custom installation or build,
 manually follow their installation steps.

installation/docker-amd64-cuda/CSCS-Clariden-setup/README.md

Lines changed: 19 additions & 28 deletions

@@ -124,7 +124,7 @@ cp _TODO ADD IMAGE_PATH_ $CONTAINER_IMAGES/ADAPTED_NAME.sqsh
 
 #### From a registry (TODO)
 
-### Clone your repository in your scratch directory
+### Clone your repository in your home directory
 
 We strongly suggest having two instances of your project repository.
 
@@ -138,34 +138,25 @@ This guide includes the steps to do it, and there are general details in `data/R
 ```bash
 # SSH to a cluster.
 ssh clariden
-cd $SCRATCH
 # Clone the repo twice with name dev and run (if you already have one, mv it to a different name)
-mkdir template-project-name
-git clone <HTTPS/SSH> template-project-name/dev
-git clone <HTTPS/SSH> template-project-name/run
+mkdir -p $HOME/projects/template-project-name
+cd $HOME/projects/template-project-name
+git clone <HTTPS/SSH> dev
+git clone <HTTPS/SSH> run
 ```
 
 The rest of the instructions should be performed on the cluster from the dev instance of the project.
 ```bash
-cd $SCRATCH/template-project-name/dev/
+cd dev
 # It may also be useful to open a remote code editor on a login node to view the project. (The remote development will happen in another IDE in the container.)
 # Push what you did on your local machine so far (change project name etc) and pull it on the cluster.
 git pull
-cd template-project-name/dev/installation/docker-amd64-cuda
+cd installation/docker-amd64-cuda
 ```
 
-### Note about the examples
-
-The example files were made with username `smoalla` and lab-name `claire`.
-Adapt them accordingly to your username and lab name.
-Run
-```bash
-./template.sh env
-# Edit the .env file with your lab name (you can ignore the rest).
-./template.sh get_cscs_scripts
-```
-to get a copy of the examples in this guide with your username, lab name, etc.
-They will be in `./EPFL-SCITAS-setup/submit-scripts`.
+Example submit scripts are provided in the `example-submit-scripts` directory and are used in the following examples.
+You can copy them to the `submit-scripts` directory, which is not tracked by git, and edit them to your needs.
+Otherwise, we use shared scripts with shared configurations (including IDE and shell setups) in `shared-submit-scripts`.
 
 ### A quick test to understand how the template works
 
@@ -176,14 +167,14 @@ and the [`pyxis`](https://github.com/NVIDIA/pyxis) plugin directly integrated in
 
 Run the script to see how the template works.
 ```bash
-cd installation/docker-amd64-cuda//CSCS-Clariden-setup/submit-scripts
+cd installation/docker-amd64-cuda/CSCS-Clariden-setup/submit-scripts
 bash minimal.sh
 ```
 
 When the container starts, its entrypoint does the following:
 
 - It runs the entrypoint of the base image if you specified it in the `compose-base.yaml` file.
-- It expects you specify `PROJECT_ROOT_AT=<location to your project in scratch (dev or run)>`.
+- It expects you to specify `PROJECT_ROOT_AT=<location to your project (dev or run)>`
 and `PROJECT_ROOT_AT` to be the working directory of the container.
 Otherwise, it will issue a warning and set it to the default working directory of the container.
 - It then tries to install the project in editable mode.
 
@@ -328,9 +319,9 @@ EOL
 We support the [Remote Development](https://www.jetbrains.com/help/pycharm/remote-development-overview.html)
 feature of PyCharm that runs a remote IDE in the container.
 
-The first time connecting you will have to install the IDE in the server in a location mounted from `/scratch` so
-that is stored for future use.
-After that, or if you already have the IDE stored in `/scratch` from a previous project,
+The first time connecting, you will have to install the IDE on the server in a location mounted in the container
+that is stored for future use (somewhere in your `$HOME` directory).
+After that, or if you already have the IDE stored from a previous project,
 the template will start the IDE on its own at the container creation,
 and you will be able to directly connect to it from the JetBrains Gateway client on your local machine.
 
@@ -351,13 +342,13 @@ All the directories will be created automatically.
 variables
 - `JETBRAINS_SERVER_AT`: set it to the `jetbrains-server` directory described above.
 - `PYCHARM_IDE_AT`: don't include it as IDE is not installed yet.
-2. Enable port forwarding for the SSH port.
+2. Add `JETBRAINS_SERVER_AT` to the `--container-mounts`.
 3. Then follow the instructions [here](https://www.jetbrains.com/help/pycharm/remote-development-a.html#gateway) and
 install the IDE in your `${JETBRAINS_SERVER_AT}/dist`
-(something like `/scratch/moalla/jetbrains-server/dist`)
+(something like `/users/smoalla/jetbrains-server/dist`)
 not in its default location **(use the small "installation options..." link)**.
 For the project directory, it should be in the same location where it was mounted (`${PROJECT_ROOT_AT}`,
-something like `/scratch/moalla/template-project-name/dev`).
+something like `/users/smoalla/projects/template-project-name/dev`).
 
 When in the container, locate the name of the PyCharm IDE installed.
 It will be at
 
@@ -424,7 +415,7 @@ to set up your ssh config file.
 1. In your submit command, set the environment variables for
 - Opening an ssh server `SSH_SERVER=1`.
 - preserving your config `VSCODE_SERVER_AT`.
-2. Enable port forwarding for the SSH connection.
+2. Add `VSCODE_SERVER_AT` to the `--container-mounts`.
 3. Have the [Remote - SSH](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh)
 extension on your local VS Code.
 4. Connect to the ssh host following the
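The entrypoint behavior described above ("issue a warning and set it to the default working directory" when `PROJECT_ROOT_AT` is missing) can be sketched as follows. This is a hypothetical reimplementation of the check for illustration, not the template's actual entrypoint code:

```shell
#!/bin/sh
# Hypothetical sketch of the PROJECT_ROOT_AT fallback described in the README.
if [ -z "${PROJECT_ROOT_AT:-}" ]; then
  echo "Warning: PROJECT_ROOT_AT is not set; falling back to $PWD" >&2
  PROJECT_ROOT_AT=$PWD
fi
echo "PROJECT_ROOT_AT=$PROJECT_ROOT_AT"
```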

installation/docker-amd64-cuda/CSCS-Clariden-setup/template-submit-examples/README.md renamed to installation/docker-amd64-cuda/CSCS-Clariden-setup/example-submit-scripts/README.md

File renamed without changes.

installation/docker-amd64-cuda/CSCS-Clariden-setup/template-submit-examples/edf.toml renamed to installation/docker-amd64-cuda/CSCS-Clariden-setup/example-submit-scripts/edf.toml

Lines changed: 0 additions & 2 deletions

@@ -3,6 +3,4 @@ com.hooks.aws_ofi_nccl.enabled = "true"
 com.hooks.aws_ofi_nccl.variant = "cuda12"
 
 [env]
-FI_CXI_DISABLE_HOST_REGISTER = "1"
-FI_MR_CACHE_MONITOR = "userfaultfd"
 NCCL_DEBUG = "INFO"

installation/docker-amd64-cuda/CSCS-Clariden-setup/template-submit-examples/minimal.sh renamed to installation/docker-amd64-cuda/CSCS-Clariden-setup/example-submit-scripts/minimal.sh

Lines changed: 5 additions & 3 deletions

@@ -2,7 +2,9 @@
 
 # Variables used by the entrypoint script
 # Change this to the path of your project (can be the /dev or /run copy)
-export PROJECT_ROOT_AT=$SCRATCH/template-project-name/dev
+export PROJECT_ROOT_AT=$HOME/projects/template-project-name/dev
+export PROJECT_NAME=template-project-name
+export PACKAGE_NAME=template_package_name
 export SLURM_ONE_ENTRYPOINT_SCRIPT_PER_NODE=1
 
 # Enroot + Pyxis
 
@@ -13,9 +15,9 @@ export SLURM_ONE_ENTRYPOINT_SCRIPT_PER_NODE=1
 srun \
 -J template-minimal \
 --pty \
---container-image=$CONTAINER_IMAGES/claire+smoalla+template-project-name+amd64-cuda-root-latest.sqsh \
+--container-image=$CONTAINER_IMAGES/$(id -gn)+$(id -un)+template-project-name+amd64-cuda-root-latest.sqsh \
 --environment="${PROJECT_ROOT_AT}/installation/docker-amd64-cuda/CSCS-Clariden-setup/submit-scripts/edf.toml" \
---container-mounts=$SCRATCH \
+--container-mounts=$PROJECT_ROOT_AT,$SCRATCH \
 --container-workdir=$PROJECT_ROOT_AT \
 --no-container-mount-home \
 --no-container-remap-root \
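The updated `--container-image` line derives the image name from the submitting user's group and username instead of hardcoding `claire+smoalla`. How the substitution expands can be checked on its own; the `.sqsh` naming scheme follows the script above, and the `CONTAINER_IMAGES` default below is purely illustrative:

```shell
#!/bin/sh
# Show how the image path is assembled from the current group and user.
set -e
CONTAINER_IMAGES=${CONTAINER_IMAGES:-$HOME/container-images}  # illustrative default
image="$CONTAINER_IMAGES/$(id -gn)+$(id -un)+template-project-name+amd64-cuda-root-latest.sqsh"
echo "$image"
```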

installation/docker-amd64-cuda/CSCS-Clariden-setup/template-submit-examples/remote-development.sh renamed to installation/docker-amd64-cuda/CSCS-Clariden-setup/example-submit-scripts/remote-development.sh

Lines changed: 10 additions & 4 deletions

@@ -5,27 +5,33 @@
 
 # Variables used by the entrypoint script
 # Change this to the path of your project (can be the /dev or /run copy)
-export PROJECT_ROOT_AT=$SCRATCH/template-project-name/dev
+export PROJECT_ROOT_AT=$HOME/projects/template-project-name/dev
+export PROJECT_NAME=template-project-name
+export PACKAGE_NAME=template_package_name
 export SLURM_ONE_ENTRYPOINT_SCRIPT_PER_NODE=1
 export WANDB_API_KEY_FILE_AT=$HOME/.wandb-api-key
 export HF_TOKEN_AT=$HOME/.hf-token
 export HF_HOME=$SCRATCH/huggingface
 export SSH_SERVER=1
 export NO_SUDO_NEEDED=1
-export JETBRAINS_SERVER_AT=$SCRATCH/jetbrains-server
+# For the first time, mkdir -p $HOME/jetbrains-server, and comment out PYCHARM_IDE_AT
+export JETBRAINS_SERVER_AT=$HOME/jetbrains-server
 #export PYCHARM_IDE_AT=744eea3d4045b_pycharm-professional-2024.1.6-aarch64
 # or
 # export VSCODE_SERVER_AT=$SCRATCH/vscode-server
+# and replace JETBRAINS_SERVER_AT in the container-mounts
 
 srun \
---container-image=$CONTAINER_IMAGES/claire+smoalla+template-project-name+amd64-cuda-root-latest.sqsh \
+--container-image=$CONTAINER_IMAGES/$(id -gn)+$(id -un)+template-project-name+amd64-cuda-root-latest.sqsh \
 --environment="${PROJECT_ROOT_AT}/installation/docker-amd64-cuda/CSCS-Clariden-setup/submit-scripts/edf.toml" \
 --container-mounts=\
+$PROJECT_ROOT_AT,\
 $SCRATCH,\
 $WANDB_API_KEY_FILE_AT,\
 $HOME/.gitconfig,\
 $HF_TOKEN_AT,\
-$HOME/.ssh/authorized_keys \
+$JETBRAINS_SERVER_AT,\
+$HOME/.ssh \
 --container-workdir=$PROJECT_ROOT_AT \
 --no-container-mount-home \
 --no-container-remap-root \
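In the script above, `--container-mounts` is a single comma-separated value built with backslash line continuations, which is why the mount paths must start at the beginning of their lines. A stand-alone sketch of how those lines concatenate (all paths illustrative):

```shell
#!/bin/sh
# Sketch: the continued lines collapse into one comma-separated mounts list.
set -e
PROJECT_ROOT_AT=$HOME/projects/template-project-name/dev   # illustrative
SCRATCH=/scratch/example                                   # illustrative
JETBRAINS_SERVER_AT=$HOME/jetbrains-server                 # illustrative
mounts=\
$PROJECT_ROOT_AT,\
$SCRATCH,\
$JETBRAINS_SERVER_AT,\
$HOME/.ssh
echo "$mounts"
```

Any leading whitespace on the continued lines would end up inside the argument and break the list, so keep them flush left as in the submit script.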

installation/docker-amd64-cuda/CSCS-Clariden-setup/template-submit-examples/test-interactive.sh renamed to installation/docker-amd64-cuda/CSCS-Clariden-setup/example-submit-scripts/test-interactive.sh

Lines changed: 0 additions & 2 deletions

@@ -2,8 +2,6 @@
 
 # Enroot + Pyxis
 
-export PROJECT_ROOT_AT=$SCRATCH/template-project-name/dev
-
 srun \
 -J template-test \
 --pty \
