Running on alihlt #45

@saganatt

(Just keeping notes for myself. Maybe someone else will find these useful as well.)

Running the script on alihlt

Version with virtualenv

Prerequisites - to be installed by admin:

  • python3-virtualenv
  • graphviz
  • ROOT prerequisites
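
A quick way to check whether the admin-installed tools are already visible on PATH is sketched below. It assumes that python3-virtualenv provides a `virtualenv` executable and that graphviz provides `dot`; the executable names may differ on a given system, and `missing_tools` is a hypothetical helper, not part of the repository. ROOT prerequisites are not covered.

```python
import shutil


def missing_tools(tools=("virtualenv", "dot")):
    """Return the names from `tools` that are not found on PATH.

    The default names are assumptions: `virtualenv` for python3-virtualenv
    and `dot` for graphviz.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, ask the admin to install the corresponding package before continuing.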

Running

  1. Install ROOT 6. Add to ~/.bashrc:

     export PATH=/opt/rocm/bin:$PATH
     export ALIBUILD_WORK_DIR="$HOME/alice/sw"
     eval "`alienv shell-helper`"

     Reload the shell.
  2. Add PYTHONPATH=/home/${LOGNAME}/.virtualenvs/tpcwithdnn/lib/python3.6/site-packages/:$PYTHONPATH to load.sh:89 and comment out the LD_LIBRARY_PATH line.
  3. Copy the input data from aliceml and change the paths in database*.yml (/home/mkabus/data/...).
  4. Enter the ROOT environment, source load.sh, and replace tf-nightly-gpu with tensorflow-rocm:

     alienv enter ROOT/latest
     source load.sh
     pip uninstall tf-nightly-gpu
     pip install tensorflow-rocm

  5. In utilities_dnn.py:58 replace pool_type with 1. This forces AveragePooling3D; MaxPooling3D causes the error "3D pooling doesn't support workspace index mask mode".
  6. Change run_parallel to true in database*.yml.
  7. In dnn_optimiser.py:58 set the devices explicitly. For 6 devices: self.strategy = MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3", "/gpu:4", "/gpu:5"])
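
The explicit device list in the dnn_optimiser.py step can also be generated instead of typed out, which helps when the number of GPUs changes. The following is a minimal sketch, not the repository's code: `gpu_device_names` and `make_strategy` are hypothetical helpers, and it assumes a TensorFlow version that exposes `tf.distribute.MirroredStrategy`.

```python
def gpu_device_names(n_devices):
    """Return the device strings "/gpu:0" ... "/gpu:<n_devices - 1>"."""
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    return ["/gpu:%d" % i for i in range(n_devices)]


def make_strategy(n_devices):
    """Build a MirroredStrategy over the first n_devices GPUs (hypothetical helper)."""
    # Imported lazily so that gpu_device_names works without TensorFlow installed.
    import tensorflow as tf
    return tf.distribute.MirroredStrategy(devices=gpu_device_names(n_devices))
```

With these helpers, the assignment would read self.strategy = make_strategy(6) for the 6-device case above.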

Debugging

Comments on a TensorFlow issue
ROCm guide on HIP debugging
ROCm guide on system-level debugging

Profiling

ROCm guide
