33 changes: 22 additions & 11 deletions doc/user_guide/profiling.rst
@@ -35,6 +35,7 @@
.. Modified by A. R. Porter, STFC Daresbury Lab
.. Modified by R. W. Ford, STFC Daresbury Lab
.. Modified by I. Kavcic, Met Office
.. Modified by T. H. Gibson, Advanced Micro Devices, Inc.

.. _userguide-profiling:

@@ -52,7 +53,8 @@ transformation within a transformation script.

PSyclone can be used with a variety of existing profiling tools.
It currently supports dl_timer, TAU, Vernier, Dr Hook, the NVIDIA GPU
profiling tools and it comes with a simple stand-alone timer library.
profiling tools (NVTX) and the AMD ROCm profiling tools (ROCTx). It also
comes with a simple stand-alone timer library.
The :ref:`PSyData API <psy_data>` (see also the
:ref:`Developer Guide <devguide_psy_data>`)
is utilised to implement wrapper libraries that connect the PSyclone
@@ -78,11 +80,11 @@ Interface to Third Party Profiling Tools

PSyclone comes with :ref:`wrapper libraries <libraries>` to support
usage of TAU, Vernier, Dr Hook, dl_timer, NVTX (NVIDIA Tools Extension
library), and a simple non-thread-safe timing library. Support for further
profiling libraries will be added in the future. To compile the
wrapper libraries, change into the directory ``lib/profiling``
of PSyclone and type ``make`` to compile all wrappers. If only
some of the wrappers are required, you can either use
library), ROCTx (AMD library for code instrumentation), and a simple non-thread-safe timing
library. Support for further profiling libraries will be added in the
future. To compile the wrapper libraries, change into the directory
``lib/profiling`` of PSyclone and type ``make`` to compile all wrappers.
If only some of the wrappers are required, you can either use
``make wrapper-name`` (e.g. ``make drhook``), or change
into the corresponding directory and use ``make``. The
corresponding ``README.md`` files contain additional parameters
@@ -131,6 +133,11 @@ libraries that come with PSyclone:
    to the NVIDIA Tools Extension library (NVTX). This library is
    available from https://developer.nvidia.com/cuda-toolkit.

``lib/profiling/amd``
    This is a wrapper library that maps the PSyclone profiling API
    to the AMD ROCTx library. ROCTx documentation is available
    from https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/how-to/using-rocprofiler-sdk-roctx.html.

``lib/profiling/lfric_timer``
    This profile wrapper uses the timer functionality provided by
    LFRic, and it comes in two different versions:
@@ -160,15 +167,18 @@ wrapper provided by the tool which will provide the required additional
compiler parameters. The exceptions are the template and simple_timing
libraries, which are stand alone. The profiling example in
``examples/gocean/eg5/profile`` can be used with any of the
wrapper libraries (except ``nvidia``) to see how they work.
wrapper libraries (except ``nvidia`` and ``amd``) to see how they work.

.. _required_profiling_calls:

Required Modifications to the Program
-------------------------------------
In order to guarantee that any profiling library is properly
initialised, PSyclone's profiling wrappers utilise two additional
function calls that the user must manually insert into the program:
function calls that the user must manually insert into the program
(the NVIDIA NVTX wrapper in ``lib/profiling/nvidia`` and the AMD ROCTx
wrapper in ``lib/profiling/amd`` are exceptions and do not require these
calls):

profile_PSyDataInit()
~~~~~~~~~~~~~~~~~~~~~
@@ -249,9 +259,10 @@ cannot be used as there is no concept of `kernels`.
GPU execution).

.. note:: It is still the responsibility of the user to manually
          add the calls to ``profile_PSyDataInit`` and
          ``profile_PSyDataShutdown`` to the
          code base (see :ref:`required_profiling_calls`).
          add the calls to ``profile_PSyDataInit`` and
          ``profile_PSyDataShutdown`` to the code base (see
          :ref:`required_profiling_calls`), unless using the NVIDIA NVTX or
          AMD ROCTx wrapper.

PSyclone will modify the schedule of each invoke to insert the
profiling regions. Below we show an example of a schedule created
10 changes: 10 additions & 0 deletions examples/nemo/README.md
@@ -149,3 +149,13 @@ supported for generic transformations.
A simple stand-alone example that shows verification that read-only data
is not modified, e.g. by out-of-bounds accesses to other variables.
This uses the PSyData interface to instrument generic Fortran code.

## Example 7

OpenMP parallelisation (for CPU and GPU) of `tra_adv` over levels, using
`nowait` and minimisation of introduced barriers.

## Example 8

A simple profiling example that shows OpenMP offloading transformations
with profiling hooks enabled.
81 changes: 81 additions & 0 deletions examples/nemo/eg8/Makefile
@@ -0,0 +1,81 @@
# -----------------------------------------------------------------------------
# BSD 3-Clause License
#
# Copyright (c) 2026, Science and Technology Facilities Council.
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# * Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# * Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
# FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
# COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
# BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
# ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
# ------------------------------------------------------------------------------
# Author: T. H. Gibson, Advanced Micro Devices, Inc.

# Set the compiler and flags in your environment before running make.
# This Makefile assumes F90FLAGS contains all compiler options needed for
# OpenMP and target offload (e.g. -fopenmp and --offload-arch=...).
# It also assumes LDFLAGS contains all linker options and vendor runtime
# libraries needed by your compiler/runtime stack.

# Example for AMD GPU offload using amdflang:
# export F90=amdflang
# export F90FLAGS="-O3 -fopenmp --offload-arch=gfx942"
# export LDFLAGS="-fopenmp --offload-arch=gfx942 -L${ROCM_PATH}/lib -lrocprofiler-sdk-roctx"

# Then run:
# make clean compile run

include ../../common.mk

GENERATED_FILES = traadv_instrumented.F90 \
                  traadv_instrumented.o \
                  traadv.exe \
                  output.dat

PROFILER ?= rocprofv3
PROFILER_FLAGS ?= --runtime-trace --output-format pftrace

# Profiling wrapper settings
PSYCLONE_PROFILING_DIR ?= $(PSYCLONE_DIR)/lib/profiling/amd
PSYCLONE_PROFILING_INCLUDE ?= ${PSYCLONE_PROFILING_DIR}
PSYCLONE_PROFILING_LIB ?= ${PSYCLONE_PROFILING_DIR}/libroctx_prof.a
PSYCLONE_PROFILING_LIBS ?= -L${PSYCLONE_PROFILING_DIR} -lroctx_prof

transform:
	ENABLE_PROFILING=1 ${PSYCLONE} -s ./omp_gpu_profile_trans.py ../code/tra_adv.F90 -o traadv_instrumented.F90

compile: transform traadv.exe

run: traadv.exe
	IT=10 JPI=64 JPJ=64 JPK=32 ${PROFILER} ${PROFILER_FLAGS} -- ./traadv.exe

traadv.exe: traadv_instrumented.o ${PSYCLONE_PROFILING_LIB}
	${F90} ${F90FLAGS} traadv_instrumented.o -o traadv.exe ${LDFLAGS} ${PSYCLONE_PROFILING_LIBS}

traadv_instrumented.o: traadv_instrumented.F90 ${PSYCLONE_PROFILING_LIB}
	${F90} ${F90FLAGS} -I${PSYCLONE_PROFILING_INCLUDE} -c $< -o $@

${PSYCLONE_PROFILING_LIB}:
	${MAKE} -C ${PSYCLONE_PROFILING_DIR} F90=${F90}
90 changes: 90 additions & 0 deletions examples/nemo/eg8/README.md
@@ -0,0 +1,90 @@
# PSyclone NEMO Example 8

**Author:** T. H. Gibson, Advanced Micro Devices, Inc.

This example demonstrates a simple profiling workflow for OpenMP target
offloading, using the tracer advection demo. It processes `../code/tra_adv.F90` and
generates `traadv_instrumented.F90` with OpenMP target offload directives plus
profiling hooks. The transformation script `omp_gpu_profile_trans.py` is local
to this example; it uses shared helpers from `../scripts` and inserts profile
regions around *all* top-level OpenMP target regions.
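
The wrapping rule can be illustrated without PSyclone. The following is a minimal sketch and not the script itself: `ToyNode`, `wrap_top_level_directives` and the `kind` strings are hypothetical stand-ins for PSyIR nodes. It only mirrors the idea that an OpenMP directive is wrapped when it has no enclosing directive or profile region.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ToyNode:
    """Hypothetical stand-in for a PSyIR node (not a PSyclone class)."""
    kind: str  # e.g. "routine", "if", "omp", "loop" or "profile"
    children: List["ToyNode"] = field(default_factory=list)
    parent: Optional["ToyNode"] = field(default=None, repr=False)

    def add(self, child: "ToyNode") -> "ToyNode":
        """Attach `child` below this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    def has_ancestor(self, kind: str) -> bool:
        """Return True if any node above this one has the given kind."""
        node = self.parent
        while node is not None:
            if node.kind == kind:
                return True
            node = node.parent
        return False


def wrap_top_level_directives(node: ToyNode) -> None:
    """Wrap each outermost "omp" node below `node` in a "profile" node."""
    # Iterate over a copy because entries are replaced in place.
    for index, child in enumerate(list(node.children)):
        if (child.kind == "omp" and not child.has_ancestor("omp")
                and not child.has_ancestor("profile")):
            region = ToyNode("profile", parent=node)
            node.children[index] = region
            region.add(child)
        else:
            # Look for directives nested in if-blocks, loops and so on.
            wrap_top_level_directives(child)
```

Applied to a routine that holds one directive with a nested directive, plus an if-block containing another directive, this wraps the two outermost directives and leaves the nested one alone.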

## Transforming

```sh
make transform
```

or explicitly:

```sh
ENABLE_PROFILING=1 ${PSYCLONE} -s ./omp_gpu_profile_trans.py ../code/tra_adv.F90 -o traadv_instrumented.F90
```

This emits transformed Fortran code with PSyData profiling around OpenMP target
regions.
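
Environment variables are always strings, so a switch such as `ENABLE_PROFILING` needs explicit parsing if `ENABLE_PROFILING=0` is meant to disable profiling. A minimal sketch of one way to do this; `env_flag` and its accepted spellings are assumptions for illustration and not part of PSyclone:

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Return True only for explicit truthy spellings of an environment flag.

    Hypothetical helper: the accepted spellings are a choice made here.
    """
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}
```

With this helper, `env_flag("ENABLE_PROFILING")` is false both when the variable is unset and when it is set to `0`, whereas a plain truthiness test on the raw string treats `"0"` as enabled.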

## Compiling and Running

This example supports compilation and execution using the AMD ROCTx profiling
wrapper in `../../../lib/profiling/amd` and the ROCm profiler (`rocprofv3`)
by default. It can also be tested with NVIDIA tooling by overriding the relevant
Makefile variables (compiler/flags and profiling wrapper variables such as
`PSYCLONE_PROFILING_DIR`, `PSYCLONE_PROFILING_LIB`, and
`PSYCLONE_PROFILING_LIBS`).

Typical compiler settings for AMD GPU offloading are:

```sh
export F90=amdflang
export F90FLAGS="-O3 -fopenmp --offload-arch=<arch>"
export LDFLAGS="-fopenmp --offload-arch=<arch> -L${ROCM_PATH}/lib -lrocprofiler-sdk-roctx"
```

Then build and run:

```sh
make compile
make run
```
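
The `run` target wraps the executable in the profiler and passes the problem size through environment variables. As a rough sketch, the same invocation could be assembled from Python, assuming the Makefile defaults (`rocprofv3`, `--runtime-trace --output-format pftrace`, and `IT`, `JPI`, `JPJ`, `JPK`). The helper names `build_run_command` and `run_environment` are hypothetical, and nothing is executed here:

```python
import os
import shlex
from typing import Dict, List, Sequence


def build_run_command(profiler: str = "rocprofv3",
                      profiler_flags: Sequence[str] = (
                          "--runtime-trace", "--output-format", "pftrace"),
                      executable: str = "./traadv.exe") -> List[str]:
    """Return the argument list used by the Makefile's `run` target."""
    return [profiler, *profiler_flags, "--", executable]


def run_environment(iterations: int = 10, jpi: int = 64, jpj: int = 64,
                    jpk: int = 32) -> Dict[str, str]:
    """Return a copy of the environment with the problem size added."""
    env = dict(os.environ)
    env.update(IT=str(iterations), JPI=str(jpi), JPJ=str(jpj), JPK=str(jpk))
    return env


# Show the command line without running it.
print(shlex.join(build_run_command()))
```

The two results can be handed to `subprocess.run(cmd, env=env, check=True)` once the executable has been built. Overriding `profiler` and `profiler_flags` corresponds to overriding `PROFILER` and `PROFILER_FLAGS` on the `make` command line.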

For more information on profiling wrappers and profiler-specific options, see the
[profiling wrappers README](../../../lib/profiling/README.md).

## Licence

-----------------------------------------------------------------------------

BSD 3-Clause License

Copyright (c) 2026, Science and Technology Facilities Council.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

-----------------------------------------------------------------------------
116 changes: 116 additions & 0 deletions examples/nemo/eg8/omp_gpu_profile_trans.py
@@ -0,0 +1,116 @@
#!/usr/bin/env python
# -----------------------------------------------------------------------------
# BSD 3-Clause License
#
# Copyright (c) 2026, Science and Technology Facilities Council.
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# * Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# * Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
# FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
# COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
# BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
# ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
# -----------------------------------------------------------------------------

import os
import pathlib
import sys
from typing import List, Union
from psyclone.psyir.nodes import (
    Assignment, IfBlock, Node, OMPDirective, ProfileNode, Routine, Schedule)
from psyclone.psyir.transformations import OMPTargetTrans
from psyclone.transformations import OMPLoopTrans

# Add examples/nemo/scripts to python path; needed to import utils.py
SCRIPT_DIR = pathlib.Path(__file__).resolve().parent
NEMO_SCRIPTS_DIR = SCRIPT_DIR.parent / "scripts"
if str(NEMO_SCRIPTS_DIR) not in sys.path:
    sys.path.insert(0, str(NEMO_SCRIPTS_DIR))


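# Environment values are strings, so any non-empty value (including "0")
# would be truthy. Accept only explicit "enabled" spellings.
PROFILING_ENABLED = os.environ.get(
    "ENABLE_PROFILING", "0").strip().lower() in ("1", "true", "yes", "on")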


def add_omp_region_profiling_markers(children: Union[List[Node], Schedule]):
    """Insert profiling markers around all top-level OpenMP directives.

    :param children: a Schedule or sibling nodes in the PSyIR to which to
        attempt to add profiling regions.
    """
    from utils import add_profile_region

    if children and isinstance(children, Schedule):
        # If we are given a Schedule, we look at its children.
        children = children.children
    # If we are given an empty list, we return.
    if not children:
        return
    # We do not want profiling calipers inside functions (such as the
    # PSyclone-generated comparison functions).
    parent_routine = children[0].ancestor(Routine)
    if parent_routine and parent_routine.return_symbol:
        return
    # Iterate over the children and wrap top-level OpenMP directives.
    for child in children[:]:
        if isinstance(child, OMPDirective):
            # Only wrap top-level OpenMP directives and not
            # nested directives or profiling markers.
            if (not child.ancestor(OMPDirective) and
                    not child.ancestor(ProfileNode)):
                add_profile_region([child])
        if isinstance(child, IfBlock):
            # Recursively wrap any nested OpenMP kernels in if/else constructs.
            add_omp_region_profiling_markers(child.if_body)
            add_omp_region_profiling_markers(child.else_body)
        elif not isinstance(child, Assignment):
            add_omp_region_profiling_markers(child.children)


def trans(psyir):
    """Apply OpenMP offloading and insert profiling around target regions."""
    from utils import normalise_loops, insert_explicit_loop_parallelism

    omp_target_trans = OMPTargetTrans()
    omp_loop_trans = OMPLoopTrans(omp_schedule="none")
    omp_loop_trans.omp_directive = "teamsloop"

    for subroutine in psyir.walk(Routine):
        normalise_loops(
            subroutine,
            hoist_local_arrays=False,
            convert_array_notation=True,
            loopify_array_intrinsics=True,
            convert_range_loops=True,
            increase_array_ranks=False,
            hoist_expressions=True
        )
        insert_explicit_loop_parallelism(
            subroutine,
            region_directive_trans=omp_target_trans,
            loop_directive_trans=omp_loop_trans,
            collapse=True,
            enable_reductions=True
        )
        if PROFILING_ENABLED:
            add_omp_region_profiling_markers(subroutine.children)
4 changes: 2 additions & 2 deletions lib/profiling/Makefile
@@ -48,10 +48,10 @@ NO_DEP_LIBS = lfric_timer simple_timing template

# The list with all libraries (include the ones that have additional
# dependencies):
ALL_LIBS = $(NO_DEP_LIBS) dl_timer drhook nvidia tau vernier
ALL_LIBS = $(NO_DEP_LIBS) amd dl_timer drhook nvidia tau vernier

.PHONY: default all $(NO_DEP_LIBS) clean allclean \
dl_timer drhook nvidia tau vernier
amd dl_timer drhook nvidia tau vernier

# By default, compile all libraries that do not have additional dependencies
# The 'all' target is used by the compilation tests, so this also can only