From 62d0db4848965cfb35ad135e780cc85fa6d93031 Mon Sep 17 00:00:00 2001 From: Hank Wikle Date: Mon, 9 Mar 2026 10:53:34 -0600 Subject: [PATCH 1/3] Refine scheduler plugin documentation --- RELEASE.txt | 4 + docs/plugins/schedulers.rst | 201 ++++++++++++++++++------------------ docs/tests/scheduling.rst | 4 +- 3 files changed, 108 insertions(+), 101 deletions(-) diff --git a/RELEASE.txt b/RELEASE.txt index ffd1370ef..33f887d90 100644 --- a/RELEASE.txt +++ b/RELEASE.txt @@ -4,6 +4,10 @@ RELEASE=2.4 # Release History +## 2.6 Pre-release notes + +- Minor documentation improvements. + ## 2.5 Release Notes - Tests can (and should) now be structured as suites directories, which can contain test, host, diff --git a/docs/plugins/schedulers.rst b/docs/plugins/schedulers.rst index dbdc8653b..502ed34b6 100644 --- a/docs/plugins/schedulers.rst +++ b/docs/plugins/schedulers.rst @@ -18,7 +18,7 @@ Scheduler Requirements For a scheduler to work with Pavilion, it must: -- Produce jobs with a unique (for the moment), trackable job id +- Produce jobs with a unique (for the moment), trackable job ID - Produce jobs that can be cancelled - Allow a job to be started asynchronously. @@ -35,8 +35,9 @@ Advanced schedulers must be able to get an accurate inventory of nodes, includin - Whether each node is currently 'up' or 'allocated'. - System information about each node (CPUS, memory info, etc...) -- The scheduler 'groups' that the node belongs to: reservations, partitions. Pavilion's - must be able to filter nodes according the allocation parameters the same way the scheduler would. +- The scheduler groups that the node belongs to: reservations, partitions. Pavilion + must be able to filter nodes according to the allocation parameters the same way the scheduler + would. Advanced schedulers must also be able to dictate to the scheduler exactly which nodes to use. 
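To make the "trackable job ID" requirement concrete, here is a minimal sketch of pulling a job ID out of a scheduler's submission output. The Slurm-style "Submitted batch job N" message and the helper name are assumptions for illustration only; a real plugin would parse whatever its scheduler prints and raise ``SchedulerPluginError`` on failure.

```python
import re


def parse_job_id(submit_output: str) -> str:
    """Extract a trackable job ID from scheduler submission output.

    Assumes Slurm-style output like 'Submitted batch job 12345'; other
    schedulers will need their own parsing.
    """
    match = re.search(r'Submitted batch job (\d+)', submit_output)
    if match is None:
        # In a real plugin this would raise SchedulerPluginError instead.
        raise ValueError("Could not find a job ID in: {}".format(submit_output))
    return match.group(1)
```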
@@ -46,31 +47,31 @@ Scheduler Plugins The Scheduler Plugin ~~~~~~~~~~~~~~~~~~~~ -This inherits from the 'pavilion.schedulers.BasicSchedulerPlugin' or -'pavilion.schedulers.AdvancedSchedulerPlugin' class. All of these are fully documented in +This inherits from the ``pavilion.schedulers.SchedulerPluginBasic`` or +``pavilion.schedulers.SchedulerPluginAdvanced`` class. All of these are fully documented in the ``pavilion.schedulers.scheduler.SchedulerPlugin`` class. All scheduler plugins require that you extend the base class by providing: -1. A ``_kickoff()`` method - a means to acquire an allocation given the scheduler parameters - and run a script on it. Also needs to return a 'serializable' job id, to uniquely +1. A ``_kickoff()`` method — a means to acquire an allocation given the scheduler parameters + and run a script on it. Also needs to return a 'serializable' job ID, to uniquely identify a scheduler job. -2. A ``job_status()`` method, that asks the scheduler whether a given job id is +2. A ``_job_status()`` method, that asks the scheduler whether a given job ID is scheduled, had a scheduling error, was cancelled, or is running. -3. A ``cancel()`` method, to cancel a given job id. +3. A ``cancel()`` method, to cancel a given job ID. 4. A ``_get_alloc_nodes()`` method, to get the list of nodes in an allocation that Pavilion is currently running under. -5. An ``available()`` method, to tell Pavilion if your scheduler can be used at all. +5. An ``_available()`` method, to tell Pavilion if your scheduler can be used at all. Advanced schedulers must also override the following. They are fully documented -in the 'pavilion.schedulers.advanced.SchedulerPluginAdvanced' class. +in the ``pavilion.schedulers.SchedulerPluginAdvanced`` class. 1. ``_get_raw_node_data()`` - Should fetch and return a list of information about each node. This is the per-node information mentioned above. -2. ``_transform_raw_node_data()`` - Converts that data into a '{node: info_dict}' dictionary.
+2. ``_transform_raw_node_data()`` - Converts that data into a ``{node: info_dict}`` dictionary. - There are several required keys each node's info_dict must contain, see the method + There are several required keys each node's ``info_dict`` must contain. See the method documentation for info on the required and optional keys. Basic scheduler plugins don't require any extra methods, but are limited in functionality. @@ -80,9 +81,8 @@ Scheduler Variables ~~~~~~~~~~~~~~~~~~~ Every scheduler should also include a scheduler variables class, assigned to your -class's 'VAR_CLASS' class variable. This provides information from the scheduler -for each test to use in it's configuration, such as ``sched.test_nodes`` (the -for each test to use in it's configuration, such as `sched.test_nodes` (the +class's ``VAR_CLASS`` class variable. This provides information from the scheduler +for each test to use in its configuration, such as ``sched.test_nodes`` (the number of nodes in the test's allocation). The base class uses information given by the scheduler plugin and the test's configuration to figure out 99% of these on its own. You'll only need to override a few. @@ -94,11 +94,10 @@ Handling Errors ~~~~~~~~~~~~~~~ Your scheduler class should catch any errors it reasonably expects to occur. -This includes OSError when making system calls, ValueError when manipulating -values (like converting strings to ints), etc. Once caught, then raise a Pavilion -specific error, in this case it should always be SchedulerPluginError. Pavilion exceptions -take a message about the local context as their first argument, and the prior exception -as the second (optional) argument. +This includes ``OSError`` when making system calls, ``ValueError`` when manipulating +values (like converting strings to ints), etc. Once an error is caught, the scheduler plugin should +raise a ``SchedulerPluginError``. 
A Pavilion exception takes a message about the local context as +its first argument and the prior exception as its second (optional) argument. .. code-block:: python @@ -110,11 +109,11 @@ as the second (optional) argument. except ValueError as exc: raise SchedulerPluginError("Invalid value for foo.", exc) -This allows Pavilion to catch and handle predictable errors, and pass them +This allows Pavilion to catch and handle predictable errors and pass them directly to the user. -Init -~~~~ +Initialization +~~~~~~~~~~~~~~ Scheduler plugins initialize much like other Pavilion plugins: @@ -132,7 +131,7 @@ Scheduler plugins initialize much like other Pavilion plugins: Most customization is through method overrides and a few class variables that we'll cover later. There is also a ``SchedulerPluginBasic`` which allows for working -with schedulers with a much reduced feature set. +with schedulers with a reduced feature set. .. _Yaml Config: https://yaml-config.readthedocs.io/en/latest/ @@ -140,50 +139,50 @@ with schedulers with a much reduced feature set. Configuration ~~~~~~~~~~~~~ -Pavilion has unified scheduler plugin configuration into the 'schedule' section. Not all keys from -this section will apply to your scheduler, and that's ok. Most keys are handled automatically given -the information gathered on nodes. +Like Pavilion's built-in scheduler plugins, configuration for custom scheduler plugins is handled +in a test's ``schedule`` section. Not all keys from this section will apply to your scheduler. Most +keys are handled automatically given the information gathered on nodes. -You can also, optionally, add a scheduler specific configuration section. To do this, you'll need +You can also, optionally, add a scheduler-specific configuration section. To do this, you'll need to override the ``_get_config_elems()`` method. This method returns three items: - 1. A list of YamlConfig Elements. + 1. A list of ``YamlConfig`` elements. 2.
A dictionary of validation/normalization functions. These will be called to transform the data for each key to a standard format. 3. A dictionary of default values for each key. -Pavilion uses the `Yaml Config`_ library to manage it's configuration format. -Yaml Config uses 'config elements' to describe each component of the -configuration and their relationships. +Pavilion uses the `Yaml Config`_ library to manage its configuration format. +Yaml Config uses "config elements" to describe each component of the +configuration and the relationships between them. The Slurm scheduler plugin provides a solid example of this, but in general: - - You should only use yaml_config StrElem, ListElem, KeyedElem (a dict with specific key - and value formats), and CategoryElem (a dict with mostly unlimited keys, and a shared + - You should only use Yaml Config ``StrElem``, ``ListElem``, ``KeyedElem`` (a dict with specific + key and value formats), and ``CategoryElem`` (a dict with mostly unrestricted keys and a shared value format). - - Validators for individual keys are optional, but you should do str to int conversion and value - range checking. These can take several forms, see the ``SchedulerPlugin._get_config_elems()`` - method documentation. - - Don't use the built-in validation and default options for the yaml_config objects, + - Validators for individual keys are optional, but in general, validators should be provided + to perform ``str`` to ``int`` conversion and value range checking. These can take several + forms. See the ``SchedulerPlugin._get_config_elems()`` method documentation. - - Don't use the built-in validation and default options for the Yaml Config objects; use the validation callbacks/objects and defaults dictionary returned by the function instead.
Kicking Off Tests ~~~~~~~~~~~~~~~~~ -Pavilion scheduler plugins generate a kickoff script for each job - a script that will +Pavilion scheduler plugins generate a ``kickoff`` script for each job — a script that will be handed to the scheduler to be run within the allocation. That script will run Pavilion one or more times within that allocation, starting a ``run.sh`` script for each test. It's the responsibility of the ``run.sh`` script to actually run applications under MPI, either with ``mpirun``, ``srun``, or similar. -Many schedulers rely on a header information in that ``kickoff`` script to relay to -the scheduler what the settings for the allocation should be. This is header is optional - the +Many schedulers rely on header information in the ``kickoff`` script to relay to +the scheduler what the settings for the allocation should be. This header is optional - the default header adds nothing to the file except a ``#!/bin/bash`` line. If you need to define header lines, you'll need to create a class that inherits from -``pavilion.schedulers.scheduler.KickoffScriptHeader``, and override the +``pavilion.schedulers.scheduler.KickoffScriptHeader`` and override the ``_kickoff_lines()`` method. This method simply returns a list of header lines -to add. +to add to the script. Alternatively, when writing your ``_kickoff`` method, you can simply pass any relevant information about the job to the scheduler directly through the command line @@ -198,11 +197,12 @@ Composing Commands ~~~~~~~~~~~~~~~~~~ Your scheduler plugin will most likely require that you run commands in a subshell. This -section provides guidance on how to do so reliably under Pavilion. +section provides guidance on how to do so reliably under Pavilion. The following are several +useful idioms for working with scheduler commands: .. code-block:: python - # These should be at the top of the file, as standard + # These modules will commonly be needed when working with scheduler commands. 
import subprocess import shutil @@ -231,8 +231,8 @@ section provides guidance on how to do so reliably under Pavilion. run_output = run_output.decode() -To find commands on a system, 'distutils.spawn.find_executable' is essentially -an in-python version of 'which'. +To find commands on a system, ``shutil.which`` is essentially +an in-Python version of ``which`` (the older ``distutils.spawn.find_executable`` is deprecated). Environment Variables ^^^^^^^^^^^^^^^^^^^^^ @@ -251,35 +251,35 @@ need to make sure to include the base environment in most cases. subprocess.run(my_cmd, env=myenv) -Job Id's -^^^^^^^^ +Job IDs +^^^^^^^ -Regardless of how you kickoff a test, you must capture a job id for it, and return it -as part of a JobInfo object (which is really just a dict). All scheduler commands that act on a -job, like cancel, will have access to this object either directly or through an attached test. +Regardless of how you kick off a test, you must capture a job ID for it, and return it +as part of a ``JobInfo`` object (which is really just a dict). All scheduler commands that act on a +job, like ``cancel``, will have access to this object either directly or through an attached test. -The JobInfo dict can contain any keys and values you like, as long as they're all strings. It's -useful to include the 'sys_name' of the machine you're on (via 'sys_vars.get_vars(True) -["sys_name"]') so that you also check if the system that started the job is the same as the one -that's manipulating it. +The ``JobInfo`` dict can contain any keys and values you like, as long as they're all strings. It's +useful to include the ``sys_name`` of the machine you're on (via +``sys_vars.get_vars(True)["sys_name"]``) so that you can also check if the system that started the job +is the same as the one that's manipulating it. Job Status ~~~~~~~~~~ -The '_job_status()' method takes the Pavilion base config (Pavilion's configuration, rather than -a test configuration), and the JobInfo for job that status is needed for.
It returns a -'TestStatusInfo' object, describing the job state returned by the scheduler. +The ``_job_status`` method takes the Pavilion base config (Pavilion's configuration, rather than +a test configuration), and the ``JobInfo`` object for the job whose status is needed. It returns a +``TestStatusInfo`` object, describing the job state returned by the scheduler. -It's job is to translate all the complicated potential job states for any particular scheduler -into one of four more basic states understood by Pavilion: +The method's job is to translate all the complicated potential job states for any particular +scheduler into one of four basic states understood by Pavilion: -- SCHED_ERROR - There was an error in scheduling the job -- SCHED_CANCELLED - The job was cancelled (usually externally to Pavilion) -- SCHED_RUNNING - The job is running (but not necessarily the particular test. -- SCHEDULED - The job is simply waiting for an allocation. +- ``SCHEDULED`` - The job is still waiting for an allocation. +- ``SCHED_ERROR`` - There was an error in scheduling the job. +- ``SCHED_CANCELLED`` - The job was cancelled (usually externally to Pavilion). +- ``SCHED_STARTUP`` - The job has started, but not the test. Note that this will only be called if the cached job status in the plugin's internal -'_job_statuses' dictionary is out of date. In fact, you can (as the slurm plugin does), simply +``_job_statuses`` dictionary is out of date. In fact, you can (as the Slurm plugin does) simply use the first call of this function to update the status of all the jobs on the system at once in that dictionary. @@ -292,23 +292,23 @@ in that dictionary. my_status = TestStatusInfo( STATES.SCHED_ERROR, # Simply pass one of the valid scheduler state constants. - "Cthulhu at my test.") # Along with a longer message describing the state. + "Cthulhu ate my test.") # Along with a longer message describing the state.
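The translation itself can be sketched as a simple lookup. The Slurm-like state strings on the left are illustrative only, not an exhaustive or authoritative mapping, and a real ``_job_status()`` would wrap the chosen state constant in a ``TestStatusInfo`` rather than return a bare string.

```python
# Hypothetical mapping from Slurm-like scheduler states to the four basic
# Pavilion scheduler states. A real plugin would use the STATES constants.
SCHED_TO_PAV = {
    'PENDING':   'SCHEDULED',
    'RUNNING':   'SCHED_STARTUP',
    'CANCELLED': 'SCHED_CANCELLED',
    'FAILED':    'SCHED_ERROR',
    'NODE_FAIL': 'SCHED_ERROR',
}


def translate_state(sched_state: str) -> str:
    """Collapse a scheduler-specific state into a Pavilion state name,
    treating anything unrecognized as a scheduling error."""
    return SCHED_TO_PAV.get(sched_state, 'SCHED_ERROR')
```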
-Cancelling Runs +Canceling Runs ~~~~~~~~~~~~~~~ -To write the 'cancel()' method, all you need to do is use the job id you saved when you -kicked a test off. If there's an error doing so, return a message why, otherwise simply -return 'None' to denote success. +To write the ``cancel`` method, all you need to do is use the job ID you saved when you +kicked a test off. If there's an error doing so, return a message explaining why. Otherwise simply +return ``None`` to denote success. -All the more complicated parts of cancelling are handled by functions that will wrap your method, -so there really isn't too much to worry about here. The Slurm plugin cancel command is a good -example in how simple this can be. +All the more complicated parts of canceling are handled by functions that will wrap your method, +so there really isn't too much to worry about here. The Slurm plugin ``cancel`` command is a good +example of how simple this can be. Finding the Allocation Nodes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The ``_get_alloc_nodes()`` method needs to be overridden to find the list of nodes for +The ``_get_alloc_nodes`` method needs to be overridden to find the list of nodes for a test's allocation. This will always be called only from within the allocation - typically the scheduler sets an environment variable with this information. @@ -319,8 +319,8 @@ the exact list of allocation nodes before the test is kicked off. Scheduler Availability ~~~~~~~~~~~~~~~~~~~~~~ -The 'available()' method simply tells Pavilion if the scheduler is available to run jobs -on the given system. It's not a measure of operability, simply a True/False value saying +The ``_available`` method simply tells Pavilion if the scheduler is available to run jobs +on the given system. It's not a measure of operability, simply a ``True``/``False`` value indicating whether the basic commands (or API modules) needed to use the plugin exist. ..
_decoratored: https://www.programiz.com/python-programming/decorator @@ -328,48 +328,49 @@ whether the basic commands (or API modules) needed to use the plugin exist. Advanced Scheduler Methods -------------------------- -If you're trying to write an advanced scheduler plugin using the 'SchedulerPluginAdvanced' +If you're trying to write an advanced scheduler plugin using the ``SchedulerPluginAdvanced`` parent class, there are a couple more methods to override. These are: -- ``_get_raw_node_data()`` - A method to gather raw information on the cluster's nodes. +- ``_get_raw_node_data`` - A method to gather raw information on the cluster's nodes. - ``_transform_raw_node_data`` - A method that translates that same data into a dictionary of information about each node. For information on overriding each of these, refer to the doc strings for each as defined -in the 'pavilion.schedulers.advanced.SchedulerPluginAdvanced' class. They will tell you +in the ``pavilion.schedulers.advanced.SchedulerPluginAdvanced`` class. They will tell you everything you need to know about how to write those methods. -The purpose of these methods is to provide Pavilion with the information it needs to make -decisions about what nodes to schedule on itself, rather than relying on the scheduler to do +The purpose of these methods is to provide Pavilion with the information it needs to decide for +itself which nodes to schedule tests on, rather than relying on the scheduler to do so. This allows Pavilion to partition the system in ways that the scheduler might not support -on its own. These include the ability to specify 'all' as the number of nodes requested, -and the ability to perform :ref:`tests.scheduling.chunking` of system into multiple, evenly sized -pieces. +on its own. These include the ability to specify ``all`` as the number of nodes requested +and the ability to perform :ref:`chunking <tests.scheduling.chunking>` of a system into multiple, +evenly sized pieces.
-The downside is that the per-node information must be perfectly accurate or jobs may be rejected by -the scheduler (such as when improperly requesting nodes not in the selected partition) or simply -wait in the queue forever (such as when selecting nodes that are down). +Be cautious when implementing these methods; per-node information must be perfectly accurate or jobs +may be rejected by the scheduler (when bad per-node data results in improperly requesting nodes not +in the selected partition) or simply wait in the queue forever (when it results in the selection of +nodes that are down). Scheduler Variables ------------------- The second part of creating a scheduler plugin is adding a set of variables that -test configs can use to manipulate their test. The vast majority of these are automatically -derived from the information you gathered about the nodes for Advanced scheduler plugins or -via the ``schedule.cluster_info`` test configuration information for Basic scheduler plugins. +test configs can use to manipulate tests. The vast majority of these variables are automatically +derived from the information an advanced scheduler plugin gathers about nodes or +via the ``schedule.cluster_info`` test configuration information for basic scheduler plugins. Pavilion provides a framework for creating these variables, the ``pavilion.schedulers.vars.SchedulerVariables`` class. By inheriting from this class, you can define scheduler variables simply by adding `decoratored`_ -methods to your child class. The decorators do most of the hard work, you -simply have create and return the value. The class itself provides good documentation +methods to your child class. The decorators do most of the hard work, requiring you to +simply create and return the value. The class itself provides good documentation on how to do this. 
-The most important variable in all of these is the ``test_cmd`` variable, which is probably the +The most important of these variables is ``test_cmd``, which is probably the only variable that will need to be customized for your scheduler plugin. It provides -tests with an mpi startup command, such as ``mpirun``, with arguments automatically set +tests with an MPI startup command, such as ``mpirun``, with arguments automatically set according to the test's settings. Pavilion tests generally use this variable to prefix -their mpi runs when writing their run scripts: +their MPI runs when writing their run scripts: .. code-block:: yaml @@ -383,7 +384,7 @@ their mpi runs when writing their run scripts: cmds: - '{{test_cmd}} ./my_mpi_cmd' -How to write a ``test_cmd`` variable is documented in the ``SchedulerVariables.test_cmd()`` method's +How to write a ``test_cmd`` variable is documented in the ``SchedulerVariables.test_cmd`` method's doc string. @@ -391,7 +392,7 @@ Adding the Scheduler Vars to the Scheduler Plugin ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To add your scheduler variable class to your scheduler plugin, simply -set the variable class as the ``VAR_CLASS`` attribute on your scheduler. +set the variable class as the ``VAR_CLASS`` attribute on your scheduler: .. code-block:: python @@ -400,6 +401,8 @@ set the variable class as the ``VAR_CLASS`` attribute on your scheduler. class MyVarClass(schedulers.SchedulerVariables): # Your scheduler variable class + ... + class MySchedPlugin(schedulers.SchedulerPlugin): VAR_CLASS = MyVarClass diff --git a/docs/tests/scheduling.rst b/docs/tests/scheduling.rst index 49b371d26..083ca6d8a 100644 --- a/docs/tests/scheduling.rst +++ b/docs/tests/scheduling.rst @@ -240,8 +240,6 @@ nodes. With ``schedule.share_allocation`` set to ``max``, Pavilion forces as many test runs into the same job as possible. -..
_tests.scheduling.chunking: - Node Filtering Exceptions ------------------------- @@ -276,6 +274,8 @@ To accomplish the same thing via a command-line override: pav run mytest -c schedule.exclude_nodes='nid001,nid003,nid[007-023]' +.. _tests.scheduling.chunking: + Chunking -------- From 90c5f0da98db00a6384c3085147134f110eea56f Mon Sep 17 00:00:00 2001 From: Hank Wikle Date: Mon, 9 Mar 2026 11:35:50 -0600 Subject: [PATCH 2/3] Minor edits --- docs/plugins/schedulers.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/plugins/schedulers.rst b/docs/plugins/schedulers.rst index 502ed34b6..781f931e2 100644 --- a/docs/plugins/schedulers.rst +++ b/docs/plugins/schedulers.rst @@ -53,23 +53,23 @@ the ``pavilion.schedulers.scheduler.SchedulerPlugin`` class. All scheduler plugins require that you extend the base class by providing: -1. A ``_kickoff()`` method — a means to acquire an allocation given the scheduler parameters +1. A ``_kickoff`` method — a means to acquire an allocation given the scheduler parameters and run a script on it. Also needs to return a 'serializable' job ID, to uniquely identify a scheduler job. -2. A ``_job_status()`` method, that asks the scheduler whether a given job ID is +2. A ``_job_status`` method, that asks the scheduler whether a given job ID is scheduled, had a scheduling error, was cancelled, or is running. -3. A ``cancel()`` method, to cancel a given job ID. -4. A ``_get_alloc_nodes()`` method, to get the list of nodes in an allocation that +3. A ``cancel`` method, to cancel a given job ID. +4. A ``_get_alloc_nodes`` method, to get the list of nodes in an allocation that Pavilion is currently running under. -5. An ``_available()`` method, to tell Pavilion if your scheduler can be used at all. +5. An ``_available`` method, to tell Pavilion if your scheduler can be used at all. Advanced schedulers must also override the following.
They are fully documented in the ``pavilion.schedulers.SchedulerPluginAdvanced`` class. -1. ``_get_raw_node_data()`` - Should fetch and return a list of information about each node. +1. ``_get_raw_node_data`` - Should fetch and return a list of information about each node. This is the per-node information mentioned above. -2. ``_transform_raw_node_data()`` - Converts that data into a ``{node: info_dict}`` dictionary. +2. ``_transform_raw_node_data`` - Converts that data into a ``{node: info_dict}`` dictionary. There are several required keys each node's ``info_dict`` must contain. See the method documentation for info on the required and optional keys. From 149903807f44b5daa9cfdbcdc369828dc5fddbdb Mon Sep 17 00:00:00 2001 From: Hank Wikle Date: Mon, 9 Mar 2026 12:08:37 -0600 Subject: [PATCH 3/3] Minor edits to scheduling docs --- docs/tests/scheduling.rst | 88 +++++++++++++++++++-------------------- 1 file changed, 43 insertions(+), 45 deletions(-) diff --git a/docs/tests/scheduling.rst b/docs/tests/scheduling.rst index 083ca6d8a..5b3943c6d 100644 --- a/docs/tests/scheduling.rst +++ b/docs/tests/scheduling.rst @@ -40,9 +40,8 @@ You may also notice scheduler specific sections in the listed options as well. T allow for custom configuration specific to a particular scheduler - options that are not generally applicable. -Note that not all options are expected to be generally applicable either. We may, in the future, -add a scheduler with concept of a QOS setting, for instance. When a setting is not applicable, it -is simply ignored. +Note that not all options are expected to be generally applicable either. When a setting is not +applicable, it is simply ignored. ..
code-block:: yaml @@ -62,11 +61,11 @@ Scheduler Plugin Basics Scheduler plugins are responsible for the following: -- Providing test runs with *scheduler* variables +- Providing test runs with ``scheduler`` variables - (Optionally) writing kickoff scripts -- Using kickoff scripts (or other mechanisms) to then run `pav _run - ` on allocations with reasonable environments -- Generating a unique scheduler ``job_id`` for each test run +- Using kickoff scripts (or other mechanisms) to then run ``pav _run + `` on allocations with reasonable environments +- Generating a unique scheduler job ID for each test run - Providing mechanisms for canceling tests - Providing mechanisms for checking test statuses @@ -122,16 +121,16 @@ Jobs ---- When Pavilion schedules a test, it also creates a job. Jobs organize all the information used -to kick off a test (or tests!), including the kickoff script, kickoff log, job id, and symlinks +to kick off a test (or tests!), including the kickoff script, kickoff log, job ID, and symlinks back to each test that's part of the job. Each job is named by a random hash located in the -``working_dir>/jobs`` directory. Tests also refer back to their job through a symlink in each +``<working_dir>/jobs`` directory. Each test also refers back to its job through a symlink in its test run directory. The Kickoff Script ~~~~~~~~~~~~~~~~~~ The kickoff script's job is to have Pavilion run specific test run instances under an -allocation. This is generally expected to be a shell script of some sort that +allocation. The script is generally expected to be a shell script of some sort that will both define the allocation (if possible) and run ``pav _run `` within that allocation under an environment that can find Pavilion and its libraries. For slurm, the kickoff script would look something like this: ..
code-block:: bash #!/bin/bash - #SBATCH --job-name "pav test #18697" + #SBATCH --job-name "pav test s3.7" #SBATCH -p standard #SBATCH -N 3-3 #SBATCH --tasks-per-node=1 # Redirect all output to kickoff.log - exec >/usr/local/pav/working_dir/test_runs/0018697/kickoff.log 2>&1 + exec >/usr/local/pav/working_dir/test_runs/s3.7/kickoff.log 2>&1 export PATH=/usr/local/pav/src/bin:${PATH} export PAV_CONFIG_FILE=/usr/local/pav/config/pavilion.yaml export PAV_CONFIG_DIR=/usr/local/pav/config - pav _run 18697 + pav _run s3.7 job_id ~~~~~~ -The plugin must assign the test run a job id. This will generally be used by -the scheduler plugin to cancel or check the status of tests. It's saved in -the job's 'job_id' file, and also as part of the test results. +The plugin must assign each test run a job ID. This will generally be used by +the scheduler plugin to cancel or check the status of the test. It's saved in +the job's ``job_id`` file and also as part of the test results. Cancel Mechanisms ~~~~~~~~~~~~~~~~~ -Pavilion scheduler plugins are required to provide a mechanism to cancel jobs +Each Pavilion scheduler plugin is required to provide a mechanism to cancel jobs managed by that scheduler, whether they're currently running or queued under -the scheduler. Generally this means just using the test_run's job id to -cancel the test. Cancelled tests will be given the 'SCHED_CANCELLED' status. +the scheduler. Generally this means just using the test_run's job ID to +cancel the test. Cancelled tests will be given the ``SCHED_CANCELLED`` status. Status Mechanisms ~~~~~~~~~~~~~~~~~ Similarly, Pavilion scheduler plugins must be able to query the status of -jobs, and give useful feedback on their state in the scheduler. As long as -the test is in the 'SCHEDULED' or 'RUNNING' states from the test run's perspective (in the -run's status file), Pavilion will use the scheduler to look up the schedulers -status for the job, in order to provide more up-to-date test status -information.
+jobs and give useful feedback on their state in the scheduler. As long as +a test is in the ``SCHEDULED`` or ``RUNNING`` states from the test run's perspective (in the +run's status file), Pavilion will use the scheduler to look up the status for the job, in order to +provide more up-to-date test status information. .. _tests.scheduling.types: Scheduler Plugin Types ---------------------- -Scheduler plugins come in two varieties: Basic and Advanced +Scheduler plugins come in two varieties: basic and advanced. Basic ~~~~~ -**The only 'basic' scheduler is 'raw' which only ever has one node. Most of this doesn't apply -except to user added schedulers.** +**The only basic scheduler is the raw scheduler, which only ever has one node. Most of this doesn't +apply except to user-added schedulers.** -Basic Schedulers don't know anything about the system that isn't manually configured. This +Basic schedulers don't know anything about the system that isn't manually configured. This information is given via the ``schedule.cluster_info`` section (see ``pav show sched --config``). This information should generally be set in the host config for a particular system. -Asking for 'all' nodes on a basic scheduler will result in an allocation for the +Asking for ``all`` nodes on a basic scheduler will result in an allocation for the configured number of nodes, regardless of the state of those nodes. .. code-block:: yaml @@ -216,7 +214,7 @@ Advanced Advanced scheduler plugins are plugins that can get an inventory of nodes and node state from the system. Such schedulers are able to dynamically determine how many nodes are up or -available, and create allocations based on that. As a result, asking for 'all' nodes via an +available and create allocations based on that. As a result, asking for ``all`` nodes via an advanced scheduler will get you an allocation request for all nodes that are currently up and not otherwise filtered out by ``partition`` or other scheduler settings.
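As a sketch of the node-inventory side of an advanced plugin, the function below turns ``Key=Value`` lines (loosely modeled on ``scontrol`` output) into the ``{node: info_dict}`` mapping that ``_transform_raw_node_data()`` must return. Both the input format and the ``info_dict`` keys here are made up for illustration; the actually required keys are listed in that method's documentation.

```python
def transform_raw_node_data(raw_lines):
    """Turn raw per-node scheduler output into a {node: info_dict} mapping.

    Assumes each input line looks like 'NodeName=nid001 State=IDLE CPUs=36';
    a real plugin would parse whatever _get_raw_node_data() collected.
    """
    nodes = {}
    for line in raw_lines:
        # Split 'Key=Value' tokens into a field dictionary.
        fields = dict(item.split('=', 1) for item in line.split())
        name = fields.pop('NodeName')
        nodes[name] = {
            # Illustrative keys only -- not the required Pavilion key set.
            'up': fields.get('State') in ('IDLE', 'ALLOCATED'),
            'cpus': int(fields.get('CPUs', 0)),
        }
    return nodes
```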
@@ -247,17 +245,17 @@ Advanced schedulers filter nodes down to only those which are currently usable.
several mechanisms for providing exceptions to filtering rules. There are three scheduler options
that control this behavior:

-1. `include_nodes`: specifies a set of nodes to be included in every chunk. Other nodes may be
+1. ``include_nodes``: specifies a set of nodes to be included in every chunk. Other nodes may be
   used as well, but those specified are guaranteed to be among the final set of
-   nodes on which the test is scheduled (provided they are in an 'available'
+   nodes on which the test is scheduled (provided they are in an ``available``
   state).
-2. `exclude_nodes`: specifies a set of nodes to be excluded when scheduling tests.
-3. `across_nodes`: specifies a set of nodes to be considered exclusively when scheduling tests. No
+2. ``exclude_nodes``: specifies a set of nodes to be excluded when scheduling tests.
+3. ``across_nodes``: specifies a set of nodes to be considered exclusively when scheduling tests. No
   nodes beyond those requested will be scheduled. The final set of nodes on which the test is
   scheduled may be a subset of those specified.

-The syntax for specifying nodes is identical to that used with Slurm's `--nodelist` option; it can
-combine full names of nodes (e.g. `nid001`) with node ranges (e.g. `nid[007-023]`), which can in
+The syntax for specifying nodes is identical to that used with Slurm's ``--nodelist`` option; it can
+combine full names of nodes (e.g. ``nid001``) with node ranges (e.g. ``nid[007-023]``), which can in
turn be combined with commas. For example, the following test excludes nodes 1, 3, and 7-23 from
being scheduled:

@@ -291,13 +289,13 @@ specific chunk size.
            # When using chunking, this is relative to the chunk and not the whole system.
            nodes: all

-            # Get 500 node chunks
+            # Get 500-node chunks
            chunking:
                size: 500

When using chunking, Pavilion selects nodes for each job entirely in advance.
This can lead to the tests being a bit more fragile than usual: the failure of a single node can keep a test
-from running even if the are 'spare' nodes outside of the chunk.
+from running even if there are spare nodes outside of the chunk.

Chunk Selection
~~~~~~~~~~~~~~~

@@ -326,7 +324,7 @@ of the chunk of which it is a member.
            # rather than on the whole system.
            nodes: all

-            # Get 500 node chunks
+            # Get 500-node chunks
            chunking:
                size: 500

@@ -342,22 +340,22 @@ system (``dist``), or semi-randomly distributed (``rand-dist``). Regardless of t
the number of chunks will be the same and they (mostly) won't overlap. It is likely that the chunk
size won't divide evenly into the total number of nodes. Nodes which make
-up the remainder may be excluded or back-filled with nodes from another chunk (these nodes are always
-drawn from the second to last chunk). The default behavior is to 'backfill'.
+up the remainder may be excluded or backfilled with nodes from another chunk (these nodes are always
+drawn from the second-to-last chunk). The default behavior is to perform backfilling.

Chunking behavior is set via the ``schedule.chunking.node_selection`` and ``schedule.chunking.extra``
options.

.. code-block:: yaml

-    # This test run over a random selection of 25% of the nodes on the system.
+    # This test runs over a random selection of 25% of the nodes on the system.
    mytest:
        schedule:
            # When using chunking, 'all' refers to all nodes in the chunk
            # rather than on the whole system.
            nodes: all

-            # Get 500 node chunks
+            # Get chunks of 25% of the system's nodes
            chunking:
                size: 25%
                node_selection: random

@@ -367,7 +365,7 @@ options.

Wrapper
-------

-You can use the wrapper feature on any scheduler to wrap the scheduler test command and run the
+You can use the ``wrapper`` feature on any scheduler to wrap the scheduler test command and run the
wrapper command before actually running the intended command.

.. code-block:: yaml

@@ -386,7 +384,7 @@ wrapper command before actually running the intended command.

        - '{{sched.test_cmd}} ./supermagic -a'

When using the ``raw`` scheduler, ``{{sched.test_cmd}}`` normally evaluates to an empty string. You
-can use the wrapper setting to control a different scheduler directly.
+can use the ``wrapper`` setting to control a different scheduler directly.

.. code-block:: yaml