diff --git a/models/sweeps/add-w-and-b-to-your-code.mdx b/models/sweeps/add-w-and-b-to-your-code.mdx index 206045728c..824f440a84 100644 --- a/models/sweeps/add-w-and-b-to-your-code.mdx +++ b/models/sweeps/add-w-and-b-to-your-code.mdx @@ -3,20 +3,22 @@ description: Add W&B to your Python code script or Jupyter Notebook. title: Add W&B (wandb) to your code --- -This guide provides recommendations on how to integrate W&B into your Python training script or notebook for hyperparameter search optimization. +This guide provides recommendations on how to integrate W&B into your Python training script or notebook for hyperparameter search optimization. By following these recommendations, you can use W&B Sweeps to explore hyperparameter values, log training and validation metrics, and identify the configuration that produces strong model performance. + +This guide is for machine learning practitioners who already have a Python training script and want to add hyperparameter sweep support. The following sections walk through an example training script, then show how to update it to work with W&B Sweeps. ## Original training script -Suppose you have a Python script that trains a model (see below). Your goal is to find the hyperparameters that maxmimizes the validation accuracy(`val_acc`). +Suppose you have a Python script that trains a model (see the following code). Your goal is to find the hyperparameters that maximize the validation accuracy (`val_acc`). -In your Python script, you define two functions: `train_one_epoch` and `evaluate_one_epoch`. The `train_one_epoch` function simulates training for one epoch and returns the training accuracy and loss. The `evaluate_one_epoch` function simulates evaluating the model on the validation data set and returns the validation accuracy and loss. +In your Python script, you define two functions: `train_one_epoch` and `evaluate_one_epoch`. 
The `train_one_epoch` function simulates training for one epoch and returns the training accuracy and loss. The `evaluate_one_epoch` function simulates evaluation of the model on the validation data set and returns the validation accuracy and loss. -You define a configuration dictionary (`config`) that contains hyperparameter values such as the learning rate (`lr`), batch size (`batch_size`), and number of epochs (`epochs`). The values in the configuration dictionary control the training process. +You define a configuration dictionary (`config`) that contains hyperparameter values such as the learning rate (`lr`), batch size (`batch_size`), and number of epochs (`epochs`). The values in the configuration dictionary control the training process. -Next you define a function called `main` that mimics a typical training loop. For each epoch, the accuracy and loss is computed on the training and validation data sets. +Next, you define a function called `main` that mimics a typical training loop. For each epoch, the script computes the accuracy and loss on the training and validation data sets. -This code is a mock training script. It does not train a model, but simulates the training process by generating random accuracy and loss values. The purpose of this code is to demonstrate how to integrate W&B into your training script. +This code is a mock training script. It doesn't train a model, but simulates the training process by generating random accuracy and loss values. The purpose of this code is to demonstrate how to integrate W&B into your training script. ```python @@ -53,21 +55,21 @@ if __name__ == "__main__": main() ``` -In the next section, you will add W&B to your Python script to track hyperparameters and metrics during training. You want to use W&B to find the best hyperparameters that maximize the validation accuracy (`val_acc`). +The following section shows how to add W&B to your Python script to track hyperparameters and metrics during training. 
You want to use W&B to find the best hyperparameters that maximize the validation accuracy (`val_acc`). ## Add W&B to your training script -Update you training script to include W&B. How you integrate W&B to your Python script or notebook depends on how you manage sweeps. +This section shows how to modify the original training script so that the sweep agent can pass hyperparameter values into each run and W&B can record the resulting metrics. How you integrate W&B into your Python script or notebook depends on how you manage sweeps. -To use the W&B Python SDK to start, stop, and manage sweeps, follow the instructions in the **Python script or notebook** tab. To use the W&B CLI instead, follow the instructions in the **CLI** tab. +To use the W&B Python SDK to start, stop, and manage sweeps, follow the instructions in the **Python script or notebook** tab. To use the W&B CLI instead, follow the instructions in the **CLI** tab. Create a YAML configuration file with your sweep configuration. The configuration file contains the hyperparameters you want the sweep to explore. In -the following example, the batch size (`batch_size`), epochs (`epochs`), and -the learning rate (`lr`) hyperparameters are varied during each sweep. +the following example, the sweep varies the batch size (`batch_size`), epochs +(`epochs`), and learning rate (`lr`) hyperparameters during each run. ```yaml # config.yaml @@ -87,18 +89,18 @@ parameters: values: [5, 10, 15] ``` -For more information on how to create a W&B Sweep configuration, see [Define sweep configuration](/models/sweeps/define-sweep-configuration/). +For more information, see [Define sweep configuration](/models/sweeps/define-sweep-configuration). You must provide the name of your Python script for the `program` key in your YAML file. Next, add the following to the code example: -1. Import the W&B Python SDK (`wandb`) and PyYAML (`yaml`). PyYAML is used to read in our YAML configuration file. +1. 
Import the W&B Python SDK (`wandb`) and PyYAML (`yaml`). Use PyYAML to read in your YAML configuration file. 2. Read in the configuration file. 3. Use [`wandb.init()`](/models/ref/python/functions/init) to start a background process to sync and log data as a [W&B Run](/models/ref/python/experiments/run). Pass the config object to the config parameter. -4. Define hyperparameter values from `wandb.Run.config` instead of using hard coded values. -5. Log the metric you want to optimize with [`wandb.Run.log()`](/models/ref/python/experiments/run.md/#method-runlog). You must log the metric defined in your configuration. Within the configuration dictionary (`sweep_configuration` in this example) you define the sweep to maximize the `val_acc` value. +4. Define hyperparameter values from `wandb.Run.config` instead of using hardcoded values. +5. Log the metric you want to optimize with [`wandb.Run.log()`](/models/ref/python/experiments/run.md/#method-runlog). You must log the metric defined in your configuration. Within the configuration dictionary (`sweep_configuration` in this example), you define the sweep to maximize the `val_acc` value. ```python import wandb @@ -142,46 +144,43 @@ def main(): main() ``` -In your CLI, set a maximum number of runs for the sweep -agent to try. This is optional. This example we set the -maximum number to 5. +After you update your training script, initialize and start the sweep from your CLI: -```bash -NUM=5 -``` +1. Optionally, set a maximum number of runs for the sweep agent to try. This example sets the maximum to five: -Next, initialize the sweep with the [`wandb sweep`](/models/ref/cli/wandb-sweep) command. Provide the name of the YAML file. Optionally provide the name of the project for the project flag (`--project`): + ```bash + NUM=5 + ``` -```bash -wandb sweep --project sweep-demo-cli config.yaml -``` +2. Initialize the sweep with the [`wandb sweep`](/models/ref/cli/wandb-sweep) command. Provide the name of the YAML file. 
Optionally, provide the name of the project for the project flag (`--project`): -This returns a sweep ID. For more information on how to initialize sweeps, see -[Initialize sweeps](./initialize-sweeps). + ```bash + wandb sweep --project sweep-demo-cli config.yaml + ``` -Copy the sweep ID and replace `sweepID` in the following code snippet to start -the sweep job with the [`wandb agent`](/models/ref/cli/wandb-agent) -command: + This returns a sweep ID. For more information, see [Initialize sweeps](/models/sweeps/initialize-sweeps). -```bash -wandb agent --count $NUM your-entity/sweep-demo-cli/sweepID -``` +3. Copy the sweep ID and replace `[SWEEP-ID]` in the following command to start the sweep job with the [`wandb agent`](/models/ref/cli/wandb-agent) command. Replace `[YOUR-ENTITY]` with your W&B entity name: + + ```bash + wandb agent --count $NUM [YOUR-ENTITY]/sweep-demo-cli/[SWEEP-ID] + ``` -For more information, see [Start sweep jobs](./start-sweep-agents). +The sweep agent runs your training script repeatedly, each time with a different combination of hyperparameter values from your YAML configuration, and logs the results to W&B. For more information, see [Start sweep jobs](/models/sweeps/start-sweep-agents). Follow these steps to add W&B to your Python script: -1. Create a dictionary object where the key-value pairs define a [sweep configuration](/models/sweeps/define-sweep-configuration/). The sweep configuration defines the hyperparameters you want W&B to explore on your behalf along with the metric you want to optimize. Continuing from the previous example, the batch size (`batch_size`), epochs (`epochs`), and the learning rate (`lr`) are the hyperparameters to vary during each sweep. You want to maximize the accuracy of the validation score so you set `"goal": "maximize"` and the name of the variable you want to optimize for, in this case `val_acc` (`"name": "val_acc"`). -2. 
Pass the sweep configuration dictionary to [`wandb.sweep()`](/models/ref/python/functions/sweep). This initializes the sweep and returns a sweep ID (`sweep_id`). For more information, see [Initialize sweeps](./initialize-sweeps). +1. Create a dictionary object where the key-value pairs define a [sweep configuration](/models/sweeps/define-sweep-configuration). The sweep configuration defines the hyperparameters you want W&B to explore on your behalf along with the metric you want to optimize. Continuing from the previous example, the batch size (`batch_size`), epochs (`epochs`), and the learning rate (`lr`) are the hyperparameters to vary during each sweep. You want to maximize the accuracy of the validation score, so set `"goal": "maximize"` and the name of the variable you want to optimize for, in this case `val_acc` (`"name": "val_acc"`). +2. Pass the sweep configuration dictionary to [`wandb.sweep()`](/models/ref/python/functions/sweep). This initializes the sweep and returns a sweep ID (`sweep_id`). For more information, see [Initialize sweeps](/models/sweeps/initialize-sweeps). 3. At the top of your script, import the W&B Python SDK (`wandb`). -4. Within your `main` function, use [`wandb.init()`](/models/ref/python/functions/init) to generate a background process to sync and log data as a [W&B Run](/models/ref/python/experiments/run). Pass the project name as a parameter to the `wandb.init()` method. If you do not pass a project name, W&B uses the default project name. -5. Fetch the hyperparameter values from the `wandb.Run.config` object. This allows you to use the hyperparameter values defined in the sweep configuration dictionary instead of hard coded values. -6. Log the metric you are optimizing for to W&B using [`wandb.Run.log()`](/models/ref/python/experiments/run.md/#method-runlog). You must log the metric defined in your configuration. For example, if you define the metric to optimize as `val_acc`, you must log `val_acc`. 
If you do not log the metric, W&B does not know what to optimize for. Within the configuration dictionary (`sweep_configuration` in this example), you define the sweep to maximize the `val_acc` value. -7. Start the sweep with [`wandb.agent()`](/models/ref/python/functions/agent). Provide the sweep ID and the name of the function the sweep will execute (`function=main`), and specify the maximum number of runs to try to four (`count=4`). +4. Within your `main` function, use [`wandb.init()`](/models/ref/python/functions/init) to generate a background process to sync and log data as a [W&B Run](/models/ref/python/experiments/run). Pass the project name as a parameter to the `wandb.init()` method. If you don't pass a project name, W&B uses the default project name. +5. Fetch the hyperparameter values from the `wandb.Run.config` object. This lets you use the hyperparameter values defined in the sweep configuration dictionary instead of hardcoded values. +6. Log the metric you're optimizing for to W&B using [`wandb.Run.log()`](/models/ref/python/experiments/run.md/#method-runlog). You must log the metric defined in your configuration. For example, if you define the metric to optimize as `val_acc`, you must log `val_acc`. If you don't log the metric, W&B can't perform optimization. Within the configuration dictionary (`sweep_configuration` in this example), you define the sweep to maximize the `val_acc` value. +7. Start the sweep with [`wandb.agent()`](/models/ref/python/functions/agent). Provide the sweep ID and the name of the function the sweep executes (`function=main`), and set the maximum number of runs to four (`count=4`). 
-Putting this all together, your script might look similar to the following: +To put this together, your script might look similar to the following: ```python import wandb # Import the W&B Python SDK @@ -200,8 +199,8 @@ def evaluate_one_epoch(epoch): return acc, loss def main(args=None): - # When called by sweep agent, args will be None, - # so we use the project from sweep config + # When called by sweep agent, args is None, + # so use the project from sweep config project = args.project if args else None with wandb.init(project=project) as run: @@ -254,6 +253,8 @@ if __name__ == "__main__": # Start the sweep job wandb.agent(sweep_id, function=main, count=4) ``` + +When you run this script, W&B starts the sweep, executes the `main` function up to four times with different hyperparameter combinations, and logs each run's metrics so you can compare results in the W&B App. @@ -261,7 +262,7 @@ if __name__ == "__main__": **Logging metrics to W&B in a sweep** -You must log the metric you define and are optimizing for in both your sweep configuration and with `wandb.Run.log()`. For example, if you define the metric to optimize as `val_acc` within your sweep configuration, you must also log `val_acc` to W&B. If you do not log the metric, W&B does not know what to optimize for. +You must log the metric you define and are optimizing for in both your sweep configuration and with `wandb.Run.log()`. For example, if you define the metric to optimize as `val_acc` within your sweep configuration, you must also log `val_acc` to W&B. If you don't log the metric, W&B can't perform optimization. ```python with wandb.init() as run: @@ -274,7 +275,7 @@ with wandb.init() as run: ) ``` -The following is an incorrect example of logging the metric to W&B. The metric that is optimized for in the sweep configuration is `val_acc`, but the code logs `val_acc` within a nested dictionary under the key `validation`. You must log the metric directly, not within a nested dictionary. 
+The following is an incorrect example of logging the metric to W&B. The sweep configuration optimizes for `val_acc`, but the code logs `val_acc` within a nested dictionary under the key `validation`. You must log the metric directly, not within a nested dictionary. ```python with wandb.init() as run: diff --git a/models/sweeps/define-sweep-configuration.mdx b/models/sweeps/define-sweep-configuration.mdx index 83af47237a..180b9fac7c 100644 --- a/models/sweeps/define-sweep-configuration.mdx +++ b/models/sweeps/define-sweep-configuration.mdx @@ -4,7 +4,9 @@ title: Overview --- -A W&B Sweep combines a strategy for exploring hyperparameter values with the code that evaluates them. The strategy can be as simple as trying every option or as complex as Bayesian Optimization and Hyperband ([BOHB](https://arxiv.org/abs/1807.01774)). +A sweep combines a strategy for exploring hyperparameter values with the code that evaluates them. The strategy can be as simple as trying every option or as complex as Bayesian Optimization and Hyperband ([BOHB](https://arxiv.org/abs/1807.01774)). + +This guide shows you how to author a sweep configuration that specifies which hyperparameters to search, which search strategy to use, and how to evaluate each run. Use it when you're setting up a new sweep or adapting an existing configuration to a different search method or parameter space. Define a sweep configuration either in a [Python dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) or a [YAML](https://yaml.org/) file. How you define your sweep configuration depends on how you want to manage your sweep. @@ -12,21 +14,21 @@ Define a sweep configuration either in a [Python dictionary](https://docs.python Define your sweep configuration in a YAML file if you want to initialize a sweep and start a sweep agent from the command line. Define your sweep in a Python dictionary if you initialize a sweep and start a sweep entirely within a Python script or notebook. 
-The following guide describes how to format your sweep configuration. See [Sweep configuration options](./sweep-config-keys) for a comprehensive list of top-level sweep configuration keys. +The following sections describe how to format your sweep configuration. See [Sweep configuration options](/models/sweeps/sweep-config-keys) for a comprehensive list of top-level sweep configuration keys. ## Basic structure -Both sweep configuration format options (YAML and Python dictionary) use key-value pairs and nested structures. +Both sweep configuration format options (YAML and Python dictionary) use key-value pairs and nested structures. -Use top-level keys within your sweep configuration to define qualities of your sweep search such as the name of the sweep ([`name`](./sweep-config-keys) key), the parameters to search through ([`parameters`](./sweep-config-keys#parameters) key), the methodology to search the parameter space ([`method`](./sweep-config-keys#method) key), and more. +Use top-level keys within your sweep configuration to define qualities of your sweep search. These qualities include the name of the sweep ([`name`](/models/sweeps/sweep-config-keys) key), the parameters to search through ([`parameters`](/models/sweeps/sweep-config-keys#parameters) key), the methodology to search the parameter space ([`method`](/models/sweeps/sweep-config-keys#method) key), and more. -For example, the following code snippets show the same sweep configuration defined within a YAML file and within a Python dictionary. Within the sweep configuration there are five top level keys specified: `program`, `name`, `method`, `metric` and `parameters`. +For example, the following code snippets show the same sweep configuration defined within a YAML file and within a Python dictionary. The sweep configuration specifies five top-level keys: `program`, `name`, `method`, `metric`, and `parameters`. 
-Define a sweep configuration in a YAML file if you want to manage sweeps interactively from the command line (CLI) +Define a sweep configuration in a YAML file if you want to manage sweeps interactively from the command line (CLI). ```yaml title="config.yaml" program: train.py @@ -48,7 +50,7 @@ parameters: ``` -Define a sweep in a Python dictionary data structure if you define training algorithm in a Python script or notebook. +Define a sweep in a Python dictionary data structure if you define your training algorithm in a Python script or notebook. The following code snippet stores a sweep configuration in a variable named `sweep_configuration`: @@ -69,20 +71,16 @@ sweep_configuration = { -Within the top level `parameters` key, the following keys are nested: `learning_rate`, `batch_size`, `epoch`, and `optimizer`. For each of the nested keys you specify, you can provide one or more values, a distribution, a probability, and more. For more information, see the [parameters](./sweep-config-keys#parameters) section in [Sweep configuration options](./sweep-config-keys). +The top-level `parameters` key nests the following keys: `learning_rate`, `batch_size`, `epochs`, and `optimizer`. For each nested key you specify, you can provide one or more values, a distribution, a probability, and more. For more information, see the [parameters](/models/sweeps/sweep-config-keys#parameters) section in [Sweep configuration options](/models/sweeps/sweep-config-keys). ## Double nested parameters -Sweep configurations support nested parameters. To define a nested parameter, include an additional `parameters` key under the top-level parameter name. - -The following example shows a sweep configuration with three nested parameters: `nested_category_1`, `nested_category_2`, and `nested_category_3`. Each nested parameter includes two additional parameters: `momentum` and `weight_decay`. 
+Use nested parameters when you want to group related hyperparameters together or when your training code expects a nested configuration structure. To define a nested parameter, include an additional `parameters` key under the top-level parameter name. - -`nested_category_1`, `nested_category_2`, and `nested_category_3` are placeholders. Replace them with names that fit your use case. - +The following example shows a sweep configuration with nested parameters `nested_category_1`, `nested_category_2`, and `nested_category_3`. Each nested parameter includes the additional parameters `momentum` and `weight_decay`. -The following code snippets show how to define nested parameters in both a YAML file and a Python dictionary. +The following code snippets show how to define nested parameters in both a YAML file and a Python dictionary: @@ -220,9 +218,9 @@ parameters: 1. Create a top level `parameters` key in your sweep config. 2. Within the `parameters`key, nest the following: - 1. Specify the name of hyperparameter you want to optimize. + 1. Specify the name of hyperparameter you want to optimize. 2. Specify the distribution you want to use for the `distribution` key. Nest the `distribution` key-value pair underneath the hyperparameter name. - 3. Specify one or more values to explore. The value (or values) should be inline with the distribution key. + 3. Specify one or more values to explore. The value (or values) should be inline with the distribution key. 1. (Optional) Use an additional parameters key under the top level parameter name to delineate a nested parameter. */} {/* For example, the following code snippets show a sweep config both in a YAML config file and a Python script. 
*/} @@ -258,61 +256,62 @@ sweep_configuration = { }, } -sweep_id = wandb.sweep(sweep=sweep_configuration, project="") +sweep_id = wandb.sweep(sweep=sweep_configuration, project="[PROJECT]") wandb.agent(sweep_id, function=main, count=4) ``` -During a sweep run, `run.config["nested_param"]` reflects the subtree defined by the -sweep (`learning_rate`, `double_nested_param`) config and does not include `manual_key` defined -in `wandb.init(config=...)`. +During a sweep run, `run.config["nested_param"]` reflects the subtree defined by the sweep configuration (`learning_rate` and `double_nested_param`). It doesn't include `manual_key`, which is defined in `wandb.init(config=...)`. ## Sweep configuration template +Use this template as a starting point when authoring a new sweep configuration. It illustrates the most common parameter and early-termination patterns so you can copy it and fill in the values for your own search. -The following template shows how you can configure parameters and specify search constraints. Replace `hyperparameter_name` with the name of your hyperparameter and any values enclosed in `<>`. +The following template shows how you can configure parameters and specify search constraints. Replace `hyperparameter_name` with the name of your hyperparameter and any values enclosed in brackets. 
```yaml title="config.yaml" -program: -method: -parameter: +program: [INSERT] +method: [INSERT] +parameters: hyperparameter_name0: - value: 0 - hyperparameter_name1: + value: 0 + hyperparameter_name1: values: [0, 0, 0] - hyperparameter_name: - distribution: - value: - hyperparameter_name2: - distribution: - min: - max: - q: - hyperparameter_name3: - distribution: + hyperparameter_name: + distribution: [INSERT] + value: [INSERT] + hyperparameter_name2: + distribution: [INSERT] + min: [INSERT] + max: [INSERT] + q: [INSERT] + hyperparameter_name3: + distribution: [INSERT] values: - - - - - - + - [LIST-OF-VALUES] + - [LIST-OF-VALUES] + - [LIST-OF-VALUES] early_terminate: type: hyperband - s: 0 - eta: 0 - max_iter: 0 + s: [INSERT] + eta: [INSERT] + max_iter: [INSERT] command: - ${Command macro} - ${Command macro} - ${Command macro} -- ${Command macro} +- ${Command macro} ``` -To express a numeric value using scientific notation, add the YAML `!!float` operator, which casts the value to a floating point number. For example, `min: !!float 1e-5`. See [Command example](#command-example). +To express a numeric value using scientific notation, add the YAML `!!float` operator, which casts the value to a floating-point number. For example, `min: !!float 1e-5`. For more information, see [Macro and custom command arguments example](#macro-and-custom-command-arguments-example). ## Sweep configuration examples +The following sweep configurations illustrate common scenarios. Use them as references when adapting a sweep to your own training script. 
+ -```yaml title="config.yaml" +```yaml title="config.yaml" program: train.py method: random metric: @@ -321,14 +320,14 @@ metric: parameters: batch_size: distribution: q_log_uniform_values - max: 256 + max: 256 min: 32 q: 8 - dropout: + dropout: values: [0.3, 0.4, 0.5] epochs: value: 1 - fc_layer_size: + fc_layer_size: values: [128, 256, 512] learning_rate: distribution: uniform @@ -339,7 +338,7 @@ parameters: ``` -```python title="train.py" +```python title="train.py" sweep_config = { "method": "random", "metric": {"goal": "minimize", "name": "loss"}, @@ -365,6 +364,8 @@ sweep_config = { ### Bayes hyperband example +The following example combines Bayesian search with Hyperband early termination to stop underperforming runs early and preserve resources for more promising configurations. + ```yaml program: train.py method: bayes @@ -398,8 +399,8 @@ early_terminate: The following tabs show how to specify either a minimum or maximum number of iterations for `early_terminate`: - -The brackets for this example are: `[3, 3*eta, 3*eta*eta, 3*eta*eta*eta]`, which equals `[3, 9, 27, 81]`. + +The brackets for this example are `[3, 3*eta, 3*eta*eta, 3*eta*eta*eta]`, which equals `[3, 9, 27, 81]`. ```yaml early_terminate: @@ -407,8 +408,8 @@ early_terminate: min_iter: 3 ``` - -The brackets for this example are `[27/eta, 27/eta/eta]`, which equals `[9, 3]`. + +The brackets for this example are `[27/eta, 27/eta/eta]`, which equals `[9, 3]`. ```yaml early_terminate: @@ -423,7 +424,9 @@ early_terminate: ### Macro and custom command arguments example -For more complex command line arguments, you can use macros to pass environment variables, the Python interpreter, and additional arguments. [W&B supports pre defined macros](./sweep-config-keys#command-macros) and custom command line arguments that you can specify in your sweep configuration. 
+This example shows how to construct the command that the sweep agent runs for each trial when you need finer control than the default invocation provides. + +For more complex command-line arguments, you can use macros to pass environment variables, the Python interpreter, and additional arguments. [W&B supports predefined macros](/models/sweeps/sweep-config-keys#command-macros) and custom command-line arguments that you can specify in your sweep configuration. For example, the following sweep configuration (`sweep.yaml`) defines a command that runs a Python script (`run.py`) with the `${env}`, `${interpreter}`, and `${program}` macros replaced with the appropriate values when the sweep runs. @@ -446,31 +449,44 @@ command: - "--optimizer=${optimizer}" - "--test=True" ``` -The associated Python script (`run.py`) can then parse these command line arguments using the `argparse` module. + +The associated Python script (`run.py`) can then parse these command-line arguments using the `argparse` module: ```python title="run.py" -# run.py +# run.py import wandb import argparse + +def str2bool(v: str) -> bool: + """Convert a string such as "True" to a boolean, because argparse + doesn't support boolean arguments by default. + """ + if isinstance(v, bool): + return v + return v.lower() in ('yes', 'true', 't', '1') + + parser = argparse.ArgumentParser() parser.add_argument('--batch_size', type=int) parser.add_argument('--optimizer', type=str, choices=['adam', 'sgd'], required=True) parser.add_argument('--test', type=str2bool, default=False) args = parser.parse_args() -# Initialize a W&B Run -with wandb.init('test-project') as run: - run.log({'validation_loss':1}) +# Initialize a W&B run +with wandb.init(project="test-project") as run: + run.log({'validation_loss': 1}) ``` -See the [Command macros](./sweep-config-keys#command-macros) section in [Sweep configuration options](./sweep-config-keys) for a list of pre-defined macros you can use in your sweep configuration. 
+See the [Command macros](/models/sweeps/sweep-config-keys#command-macros) section in [Sweep configuration options](/models/sweeps/sweep-config-keys) for a list of predefined macros you can use in your sweep configuration. #### Boolean arguments -The `argparse` module does not support boolean arguments by default. To define a boolean argument, you can use the [`action`](https://docs.python.org/3/library/argparse.html#action) parameter or use a custom function to convert the string representation of the boolean value to a boolean type. +If your sweep passes boolean flags through command arguments, your training script needs extra handling because `argparse` doesn't interpret boolean strings by default. + +To define a boolean argument, use the [`action`](https://docs.python.org/3/library/argparse.html#action) parameter of `add_argument()` or use a custom function to convert the string representation of the boolean value to a boolean type. -As an example, you can use the following code snippet to define a boolean argument. Pass `store_true` or `store_false` as an argument to `ArgumentParser`. +For example, the following code snippet defines a boolean argument. Pass `store_true` or `store_false` as the `action` argument to `ArgumentParser.add_argument()`: ```python import wandb @@ -483,12 +499,12 @@ args = parser.parse_args() args.test # This will be True if --test is passed, otherwise False ``` -You can also define a custom function to convert the string representation of the boolean value to a boolean type. For example, the following code snippet defines the `str2bool` function, which converts a string to a boolean value. +You can also define a custom function to convert the string representation of the boolean value to a boolean type. For example, the following code snippet defines the `str2bool` function, which converts a string to a boolean value: ```python def str2bool(v: str) -> bool: """Convert a string to a boolean. 
This is required because - argparse does not support boolean arguments by default. + argparse doesn't support boolean arguments by default. """ if isinstance(v, bool): return v diff --git a/models/sweeps/existing-project.mdx b/models/sweeps/existing-project.mdx index 3a1b4d8fe5..a33f0b731e 100644 --- a/models/sweeps/existing-project.mdx +++ b/models/sweeps/existing-project.mdx @@ -3,64 +3,66 @@ description: Tutorial on how to create sweep jobs from a pre-existing W&B projec title: 'Tutorial: Create sweep job from project' --- -This tutorial explains how to create sweep jobs from a pre-existing W&B project. We will use the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) to train a PyTorch convolutional neural network how to classify images. The required code an dataset is located in the [W&B examples repository (PyTorch CNN Fashion)](https://github.com/wandb/examples/tree/master/examples/pytorch/pytorch-cnn-fashion) +This tutorial explains how to create sweep jobs from a pre-existing W&B project. By the end, you'll have created a baseline project, configured a hyperparameter sweep, and launched agents that run training jobs in parallel. You use the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) to train a PyTorch convolutional neural network to classify images. The [W&B examples repository (PyTorch CNN Fashion)](https://github.com/wandb/examples/tree/master/examples/pytorch/pytorch-cnn-fashion) provides the required code and dataset. Explore the results in this [W&B Dashboard](https://app.wandb.ai/carey/pytorch-cnn-fashion). -## 1. Create a project +## Create a project -First, create a baseline. Download the PyTorch MNIST dataset example model from W&B examples GitHub repository. Next, train the model. The training script is within the `examples/pytorch/pytorch-cnn-fashion` directory. +First, create a baseline project by training the example model at least once. 
This baseline gives the sweep something to configure against in later steps. Download the PyTorch MNIST dataset example model from the W&B examples GitHub repository. Next, train the model. The training script is within the `examples/pytorch/pytorch-cnn-fashion` directory. -1. Clone this repo `git clone https://github.com/wandb/examples.git` -2. Open this example `cd examples/pytorch/pytorch-cnn-fashion` -3. Run a run manually `python train.py` +To download and train the example model, follow these steps: -Optionally explore the example appear in the W&B App UI dashboard. +1. Clone the repository: `git clone https://github.com/wandb/examples.git`. +2. Open the example directory: `cd examples/pytorch/pytorch-cnn-fashion`. +3. Run the training script manually: `python train.py`. -[View an example project page →](https://app.wandb.ai/carey/pytorch-cnn-fashion) +Optional: Explore the example in the W&B App dashboard. [View an example project page](https://app.wandb.ai/carey/pytorch-cnn-fashion). -## 2. Create a sweep +After this initial run completes, you have a baseline project in W&B that the sweep can build on. -From your project page, open the [Sweep tab](./visualize-sweep-results) in the project sidebar and select **Create Sweep**. +## Create a sweep + +With a baseline project in place, you can configure a sweep over its runs. From your project page, open the [**Sweep** tab](/models/sweeps/visualize-sweep-results) in the project sidebar and select **Create Sweep**. - Sweep overview + W&B project page with the Sweep tab open and the Create Sweep button highlighted -The auto-generated configuration guesses values to sweep over based on the runs you have completed. Edit the configuration to specify what ranges of hyperparameters you want to try. When you launch the sweep, it starts a new process on the hosted W&B sweep server. This centralized service coordinates the agents— the machines that are running the training jobs. 
+The auto-generated configuration suggests values to sweep over based on the runs you've completed. Edit the configuration to specify what ranges of hyperparameters you want to try. When you launch the sweep, it starts a new process on W&B's hosted sweep server. This centralized service coordinates the agents (the machines that run the training jobs). - Sweep configuration + Auto-generated sweep configuration editor showing hyperparameter ranges -## 3. Launch agents +## Launch agents -Next, launch an agent locally. You can launch up to 20 agents on different machines in parallel if you want to distribute the work and finish the sweep job more quickly. The agent will print out the set of parameters it’s trying next. +After you configure the sweep, launch one or more agents locally to execute the runs. To distribute the work and finish the sweep job more quickly, launch up to 20 agents on different machines in parallel. The agent prints out the next set of parameters to use. - Launch agents + Terminal output from a sweep agent printing the next set of hyperparameters -Now you're running a sweep. The following image demonstrates what the dashboard looks like as the example sweep job is running. [View an example project page →](https://app.wandb.ai/carey/pytorch-cnn-fashion) +You now have a running sweep that coordinates training jobs across your agents and reports results back to W&B. The following image shows what the dashboard looks like as the example sweep job runs. - Sweep dashboard + Sweep dashboard plotting metrics across parallel training runs ## Seed a new sweep with existing runs -Launch a new sweep using existing runs that you've previously logged. +You can also launch a new sweep using existing runs that you've previously logged, which lets you reuse earlier results as a starting point. To seed a new sweep with existing runs, follow these steps: 1. Open your project table. -2. Select the runs you want to use with checkboxes on the left side of the table. 
-3. Click the dropdown to create a new sweep. +2. Select the runs you want to use by enabling their row checkboxes. +3. Select the dropdown to create a new sweep. -Your sweep will now be set up on our server. All you need to do is launch one or more agents to start running runs. +Your sweep is now set up on the server. Launch one or more agents to start the runs. - Seed sweep from runs + Project runs table with rows selected and the create sweep option in the dropdown -If you kick off the new sweep as a bayesian sweep, the selected runs will also seed the Gaussian Process. +If you start the new sweep as a Bayesian sweep, the selected runs also seed the Gaussian Process. \ No newline at end of file diff --git a/models/sweeps/initialize-sweeps.mdx b/models/sweeps/initialize-sweeps.mdx index 4db2b6ed03..7c694042ee 100644 --- a/models/sweeps/initialize-sweeps.mdx +++ b/models/sweeps/initialize-sweeps.mdx @@ -1,21 +1,23 @@ --- -description: "Initialize a W&B Sweep using the Python SDK or CLI to start hyperparameter searches with your sweep configuration." +description: "Initialize a sweep using the Python SDK or CLI to start hyperparameter searches with your sweep configuration." title: Initialize a sweep --- -W&B uses a _Sweep Controller_ to manage sweeps on the cloud (standard), locally (local) across one or more machines. After a run completes, the sweep controller will issue a new set of instructions describing a new run to execute. These instructions are picked up by _agents_ who actually perform the runs. In a typical W&B Sweep, the controller lives on the W&B server. Agents live on _your_ machines. +Initialize a sweep to register your sweep configuration with W&B and return a sweep ID that agents use to fetch run instructions. Initialize a sweep after you define a sweep configuration and before you launch agents, so that you have a sweep ready to coordinate hyperparameter searches across one or more machines. 
-The following code snippets demonstrate how to initialize sweeps with the CLI and within a Jupyter Notebook or Python script. +W&B uses a _sweep controller_ to manage sweeps on the cloud (standard) or locally across one or more machines. After a run completes, the sweep controller issues a new set of instructions describing a new run to execute. _Agents_ pick up these instructions and perform the runs. In a typical sweep, the controller lives on the W&B server. Agents live on your machines. + +Before you initialize a sweep, you must have a sweep configuration defined either in a YAML file or a nested Python dictionary object in your script. For more information, see [Define sweep configuration](/models/sweeps/define-sweep-configuration). + +The following code snippets show how to initialize sweeps with the CLI and within a Jupyter notebook or Python script. Choose the method that matches your workflow. -1. Before you initialize a sweep, make sure you have a sweep configuration defined either in a YAML file or a nested Python dictionary object in your script. For more information, see [Define sweep configuration](/models/sweeps/define-sweep-configuration/). -2. Both the W&B Sweep and the W&B Run must be in the same project. Therefore, the name you provide when you initialize W&B ([`wandb.init()`](/models/ref/python/functions/init)) must match the name of the project you provide when you initialize a W&B Sweep ([`wandb.sweep()`](/models/ref/python/functions/sweep)). +The sweep and the run must be in the same project. The project name you provide when you initialize W&B ([`wandb.init()`](/models/ref/python/functions/init)) must match the project name you provide when you initialize a sweep ([`wandb.sweep()`](/models/ref/python/functions/sweep)). - -Use the W&B SDK to initialize a sweep. Pass the sweep configuration dictionary to the `sweep` parameter. 
Optionally provide the name of the project for the project parameter (`project`) where you want the output of the W&B Run to be stored. If the project is not specified, the run is put in an "Uncategorized" project. +Use the W&B SDK to initialize a sweep. Pass the sweep configuration dictionary to the `sweep` parameter. Optionally provide the name of the project for the project parameter (`project`) where you want W&B to store the output of the run. If you don't specify the project, W&B puts the run in an "Uncategorized" project. ```python import wandb @@ -35,10 +37,10 @@ sweep_configuration = { sweep_id = wandb.sweep(sweep=sweep_configuration, project="project-name") ``` -The [`wandb.sweep()`](/models/ref/python/functions/sweep) function returns the sweep ID. The sweep ID includes the entity name and the project name. Make a note of the sweep ID. +The [`wandb.sweep()`](/models/ref/python/functions/sweep) function returns the sweep ID. The sweep ID includes the entity name and the project name. Note the sweep ID, since you pass it to agents when you start them. With the sweep ID, you're ready to launch one or more agents to execute runs. -Use the W&B CLI to initialize a sweep. Provide the name of your configuration file. Optionally provide the name of the project for the `project` flag. If the project is not specified, the W&B Run is put in an "Uncategorized" project. +Use the W&B CLI to initialize a sweep. Provide the name of your configuration file. Optionally provide the name of the project for the `project` flag. If you don't specify the project, W&B puts the run in an "Uncategorized" project. Use the [`wandb sweep`](/models/ref/cli/wandb-sweep) command to initialize a sweep. The following code example initializes a sweep for a `sweeps_demo` project and uses a `config.yaml` file for the configuration. 
@@ -46,7 +48,7 @@ Use the [`wandb sweep`](/models/ref/cli/wandb-sweep) command to initialize a swe wandb sweep --project sweeps_demo config.yaml ``` -This command will print out a sweep ID. The sweep ID includes the entity name and the project name. Make a note of the sweep ID. +This command prints out a sweep ID. The sweep ID includes the entity name and the project name. Note the sweep ID, since you pass it to agents when you start them. With the sweep ID, you're ready to launch one or more agents to execute runs. diff --git a/models/sweeps/local-controller.mdx b/models/sweeps/local-controller.mdx index a98720ce4d..46b6736957 100644 --- a/models/sweeps/local-controller.mdx +++ b/models/sweeps/local-controller.mdx @@ -4,31 +4,35 @@ description: Search and stop algorithms locally instead of using the W&B cloud-h title: Manage algorithms locally --- -The hyper-parameter controller is hosted by Weights & Biased as a cloud service by default. W&B agents communicate with the controller to determine the next set of parameters to use for training. The controller is also responsible for running early stopping algorithms to determine which runs can be stopped. +This page shows you how to run a sweep's search and stopping algorithms locally instead of using the W&B cloud-hosted controller. Use the local controller when you want to inspect and instrument the code to debug issues or develop new algorithms that you can later incorporate into the cloud service. -The local controller feature allows you to run search and stop algorithms locally. The local controller gives you the ability to inspect and instrument the code to debug issues and develop new features that can be incorporated into the cloud service. +By default, W&B hosts the hyperparameter controller as a cloud service. W&B agents communicate with the controller to determine the next set of parameters to use for training. The controller also runs early stopping algorithms to determine which runs to stop. 
+ +The local controller feature lets you run search and stop algorithms locally. -This feature is offered to support faster development and debugging of new algorithms for the Sweeps tool. It is not intended for actual hyperparameter optimization workloads. +This feature supports faster development and debugging of new algorithms for the Sweeps tool. It isn't intended for hyperparameter optimization workloads. -Before you get start, you must install the W&B SDK(`wandb`). Type the following code snippet into your command line: +## Prerequisites -``` -pip install wandb sweeps +Install the W&B SDK (`wandb`) so that the local controller commands are available: + +```bash +pip install wandb sweeps ``` -The following examples assume you already have a configuration file and a training loop defined in a python script or Jupyter Notebook. For more information about how to define a configuration file, see [Define sweep configuration](/models/sweeps/define-sweep-configuration/). +The following examples assume you already have a configuration file and a training loop defined in a Python script or Jupyter notebook. For more information, see [Define sweep configuration](/models/sweeps/define-sweep-configuration). -### Run the local controller from the command line +## Run the local controller from the command line -Initialize a sweep similarly to how you normally would when you use hyper-parameter controllers hosted by W&B as a cloud service. Specify the controller flag (`controller`) to indicate you want to use the local controller for W&B sweep jobs: +Initialize a sweep the same way you do when you use hyperparameter controllers hosted by W&B as a cloud service. Specify the `--controller` flag to indicate that you want to use the local controller for sweep jobs: ```bash wandb sweep --controller config.yaml ``` -Alternatively, you can separate initializing a sweep and specifying that you want to use a local controller into two steps. 
+Alternatively, you can initialize the sweep and specify the local controller in two steps. To separate the steps, first add the following key-value to your sweep's YAML configuration file: @@ -43,30 +47,32 @@ Next, initialize the sweep: wandb sweep config.yaml ``` -`wandb sweep` generates a sweep ID. After you initialized the sweep, start a controller with [`wandb controller`](/models/ref/python/functions/controller): +`wandb sweep` generates a sweep ID. After you initialize the sweep, start a controller with `wandb controller`. Replace `[SWEEP-ID]` with the sweep ID that `wandb sweep` generated. You can pass the short sweep ID or include the entity and project as a path (`[ENTITY]/[PROJECT]/[SWEEP-ID]`): ```bash -wandb controller {user}/{entity}/{sweep_id} +wandb controller [SWEEP-ID] ``` -Once you have specified you want to use a local controller, start one or more Sweep agents to execute the sweep. Start a W&B Sweep similar to how you normally would. See [Start sweep agents](/models/sweeps/start-sweep-agents/), for more information. +After you specify that you want to use a local controller, start one or more sweep agents to run the sweep, the same way you usually do. For more information, see [Start sweep agents](/models/sweeps/start-sweep-agents). + +Replace `[SWEEP-ID]` with the sweep ID returned by `wandb sweep`: ```bash -wandb sweep sweep_ID +wandb agent [SWEEP-ID] ``` -### Run a local controller with W&B Python SDK +## Run a local controller with W&B Python SDK -The following code snippets demonstrate how to specify and use a local controller with the W&B Python SDK. +The following code snippets demonstrate how to specify and use a local controller with the W&B Python SDK. Each example offers progressively more control over the controller loop, so you can choose the approach that matches how much you need to customize search and scheduling. 
-The simplest way to use a controller with the Python SDK is to pass the sweep ID to the [`wandb.controller()`](/models/ref/python/functions/controller) method. Next, use the return objects `run` method to start the sweep job: +The simplest way to use a controller with the Python SDK is to pass the sweep ID to the [`wandb.controller()`](/models/ref/python/functions/controller) method. Then, use the returned object's `run` method to start the sweep job: ```python sweep = wandb.controller(sweep_id) sweep.run() ``` -If you want more control of the controller loop: +For more control over the controller loop, step through it yourself: ```python import wandb @@ -78,7 +84,7 @@ while not sweep.done(): time.sleep(5) ``` -Or even more control over the parameters served: +For even more control over the parameters served, call `search` and `schedule` directly: ```python import wandb @@ -90,7 +96,7 @@ while not sweep.done(): sweep.print_status() ``` -If you want to specify your sweep entirely with code you can do something like this: +To specify your sweep entirely in code rather than a YAML configuration file, configure the search, program, controller, and parameters in Python: ```python import wandb @@ -102,4 +108,4 @@ sweep.configure_controller(type="local") sweep.configure_parameter("param1", value=3) sweep.create() sweep.run() -``` \ No newline at end of file +``` diff --git a/models/sweeps/parallelize-agents.mdx b/models/sweeps/parallelize-agents.mdx index e46270ca69..bd21a9d504 100644 --- a/models/sweeps/parallelize-agents.mdx +++ b/models/sweeps/parallelize-agents.mdx @@ -1,60 +1,57 @@ --- -description: Parallelize W&B Sweep agents on multi-core or multi-GPU machine. +description: Parallelize sweep agents on multi-core or multi-GPU machines. title: Parallelize agents --- -Parallelize your W&B Sweep agents on a multi-core or multi-GPU machine. Before you get started, ensure you have initialized your W&B Sweep. 
For more information on how to initialize a W&B Sweep, see [Initialize sweeps](./initialize-sweeps). +Parallelize your sweep agents on a multi-core or multi-GPU machine to run multiple sweep runs at the same time. Parallelization reduces the time needed to explore your hyperparameter space. This page shows you how to launch parallel agents on multi-CPU and multi-GPU machines so you can use all available compute on a single host. -### Parallelize on a multi-CPU machine +Before you start, you must initialize your sweep. For more information, see [Initialize sweeps](/models/sweeps/initialize-sweeps). -Depending on your use case, explore the following tabs to learn how to parallelize W&B Sweep agents using the CLI or within a Jupyter Notebook. +## Parallelize on a multi-CPU machine + +The following tabs describe how to run multiple sweep agents in parallel on the same machine. Parallel agents are most useful on a machine with multiple CPU cores, where each agent can run on its own core. Choose the tab for the CLI or a Jupyter Notebook based on your workflow. -Use the [`wandb agent`](/models/ref/cli/wandb-agent) command to parallelize your sweep agent across multiple CPUs with the terminal. Provide the sweep ID that was returned when you [initialized the sweep](./initialize-sweeps). +Use the [`wandb agent`](/models/ref/cli/wandb-agent) command to parallelize your sweep agent across multiple CPUs with the terminal. Provide the sweep ID that W&B returned when you [initialized the sweep](/models/sweeps/initialize-sweeps). 1. Open more than one terminal window on your local machine. -2. Copy and paste the code snippet below and replace `sweep_id` with your sweep ID: +2. Copy and paste the following code snippet and replace `[SWEEP-ID]` with your sweep ID: -```bash -wandb agent sweep_id -``` + ```bash + wandb agent [SWEEP-ID] + ``` -Use the W&B Python SDK library to parallelize your W&B Sweep agent across multiple CPUs within Jupyter Notebooks. 
Ensure you have the sweep ID that was returned when you [initialized the sweep](./initialize-sweeps). In addition, provide the name of the function the sweep will execute for the `function` parameter: +Use the W&B Python SDK library to parallelize your sweep agent across multiple CPUs within Jupyter Notebooks. Ensure you have the sweep ID that W&B returned when you [initialized the sweep](/models/sweeps/initialize-sweeps). Also provide the name of the function the sweep executes for the `function` parameter: 1. Open more than one Jupyter Notebook. -2. Copy and paste the W&B Sweep ID on multiple Jupyter Notebooks to parallelize a W&B Sweep. For example, you can paste the following code snippet on multiple jupyter notebooks to parallelize your sweep if you have the sweep ID stored in a variable called `sweep_id` and the name of the function is `function_name`: +2. Copy and paste the sweep ID on multiple Jupyter Notebooks to parallelize a sweep. For example, if you have the sweep ID stored in a variable called `sweep_id` and the name of the function is `function_name`, paste the following code snippet on multiple Jupyter Notebooks to parallelize your sweep: -```python -wandb.agent(sweep_id=sweep_id, function=function_name) -``` + ```python + wandb.agent(sweep_id=sweep_id, function=function_name) + ``` -### Parallelize on a multi-GPU machine - -Follow the procedure outlined to parallelize your W&B Sweep agent across multiple GPUs with a terminal using CUDA Toolkit: - -1. Open more than one terminal window on your local machine. -2. Specify the GPU instance to use with `CUDA_VISIBLE_DEVICES` when you start a W&B Sweep job ([`wandb agent`](/models/ref/cli/wandb-agent)). Assign `CUDA_VISIBLE_DEVICES` an integer value corresponding to the GPU instance to use. +## Parallelize on a multi-GPU machine -For example, suppose you have two NVIDIA GPUs on your local machine. Open a terminal window and set `CUDA_VISIBLE_DEVICES` to `0` (`CUDA_VISIBLE_DEVICES=0`). 
Replace `sweep_ID` in the following example with the W&B Sweep ID that is returned when you initialized a W&B Sweep: +The following procedure describes how to run sweep agents in parallel across multiple GPUs on the same machine, using a terminal with CUDA Toolkit. This procedure requires a machine with more than one GPU. Use `CUDA_VISIBLE_DEVICES` to assign each [`wandb agent`](/models/ref/cli/wandb-agent) to a different GPU so the agents run in parallel without competing for the same device. -Terminal 1 +For example, suppose you have two NVIDIA GPUs on your local machine. Replace `[SWEEP-ID]` in each command with the sweep ID that W&B returns when you initialize a sweep: -```bash -CUDA_VISIBLE_DEVICES=0 wandb agent sweep_ID -``` +1. In one terminal, set `CUDA_VISIBLE_DEVICES` to `0` and start an agent: -Open a second terminal window. Set `CUDA_VISIBLE_DEVICES` to `1` (`CUDA_VISIBLE_DEVICES=1`). Paste the same W&B Sweep ID for the `sweep_ID` mentioned in the following code snippet: + ```bash + CUDA_VISIBLE_DEVICES=0 wandb agent [SWEEP-ID] + ``` -Terminal 2 +2. In a second terminal, set `CUDA_VISIBLE_DEVICES` to `1` and start another agent with the same sweep ID: -```bash -CUDA_VISIBLE_DEVICES=1 wandb agent sweep_ID -``` \ No newline at end of file + ```bash + CUDA_VISIBLE_DEVICES=1 wandb agent [SWEEP-ID] + ``` \ No newline at end of file diff --git a/models/sweeps/pause-resume-and-cancel-sweeps.mdx b/models/sweeps/pause-resume-and-cancel-sweeps.mdx index e16e18f154..e8b2f71796 100644 --- a/models/sweeps/pause-resume-and-cancel-sweeps.mdx +++ b/models/sweeps/pause-resume-and-cancel-sweeps.mdx @@ -1,49 +1,39 @@ --- -description: Pause, resume, and cancel a W&B Sweep with the CLI. +description: Pause, resume, and cancel a sweep with the CLI. title: Manage sweeps --- -Use the [W&B CLI](/models/ref/cli/wandb-sweep) to pause, resume, and cancel a sweep. 
The CLI's `sweep` command uses flags such as `--pause` and `--resume` to control the sweep's ability to create new W&B runs, with different effects on existing runs: +This page shows you how to control the lifecycle of an active sweep from the command line. You can stop wasted compute, preserve in-progress experiments, or temporarily halt exploration without losing state. -- `--pause`: When you pause a sweep, the agent creates no new runs until you resume the sweep. Existing runs continue to execute normally. -- `--resume`: When you resume a sweep, the agent continues creating new runs according to the search strategy. -- `--stop`: When you stop a sweep, the agent stops creating new runs. Existing runs continue to completion. -- `--cancel`: When you cancel a sweep, the agent immediately kills all currently executing runs and stops creating new runs. +Use the [W&B CLI](/models/ref/cli/wandb-sweep) to pause, resume, stop, and cancel a sweep. Each command has a different effect on whether the sweep creates new runs and on runs that are already executing. In each case, provide the sweep ID that W&B generated when you initialized the sweep. +## Pause a sweep -Use the following guidance to pause, resume, and cancel a sweep. In each case, provide the sweep ID that was generated when you initialized a sweep. - -### Pause a sweep - -Pause a sweep so it temporarily stops creating new runs. Runs that are already executing will continue to run until completion. Use the [`wandb sweep --pause`](/models/ref/cli/wandb-sweep) command to pause a sweep. Provide the sweep ID that you want to pause. +Pause a sweep so it temporarily stops creating new runs. Runs that are already executing continue running until completion. Use the [`wandb sweep --pause`](/models/ref/cli/wandb-sweep) command to pause a sweep. Provide the sweep ID that you want to pause. 
```bash wandb sweep --pause entity/project/sweep_ID ``` -### Resume a sweep +## Resume a sweep -Resume a paused sweep with the [`wandb sweep --resume`](/models/ref/cli/wandb-sweep) command. The sweep will start creating new runs again according to its search strategy. Provide the sweep ID that you want to resume: +Resume a paused sweep with the [`wandb sweep --resume`](/models/ref/cli/wandb-sweep) command. The sweep starts creating new runs again according to its search strategy. Provide the sweep ID that you want to resume: ```bash wandb sweep --resume entity/project/sweep_ID ``` -### Stop a sweep +## Stop a sweep -Finish a sweep to stop creating new runs while letting currently executing runs finish gracefully. Use the [`wandb sweep --stop`](/models/ref/cli/wandb-sweep) command: +Stop a sweep to prevent the creation of new runs while letting executing runs finish gracefully. Use the [`wandb sweep --stop`](/models/ref/cli/wandb-sweep) command: ```bash wandb sweep --stop entity/project/sweep_ID ``` - -W&B does not terminate active [sweeps](/models/sweeps) or agents when you delete a project. - +## Cancel a sweep -### Cancel a sweep - -Cancel a sweep to immediately kill all running runs and stop creating new runs. This is the only sweep command that forcibly terminates existing runs. Runs are terminated abruptly; the running processes have no chance to run user-defined signal handlers. Use the [`wandb sweep --cancel`](/models/ref/cli/wandb-sweep) command to cancel a sweep. Provide the sweep ID that you want to cancel. For more on signals and sweep runs, see [Signal handling and sweep runs](/models/sweeps/signal-handling-sweep-runs). +Cancel a sweep to immediately terminate all active runs and stop creating new runs. This is the only sweep command that forcibly terminates existing runs. Runs terminate abruptly, and the running processes have no chance to run user-defined signal handlers. 
Use the [`wandb sweep --cancel`](/models/ref/cli/wandb-sweep) command to cancel a sweep. Provide the sweep ID that you want to cancel. For more information about signals and sweep runs, see [Signal handling and sweep runs](/models/sweeps/signal-handling-sweep-runs). ```bash wandb sweep --cancel entity/project/sweep_ID @@ -52,24 +42,27 @@ wandb sweep --cancel entity/project/sweep_ID For a full list of CLI command options, see the [wandb sweep](/models/ref/cli/wandb-sweep) CLI Reference Guide. -W&B does not terminate active [sweeps](/models/sweeps) or agents when you delete a project. +W&B doesn't terminate active [sweeps](/models/sweeps) or agents when you delete a project. -## Understanding sweep and run statuses +## Sweep and run statuses -A sweep orchestrates multiple runs to explore hyperparameter combinations. Understanding how sweep status and run status interact is crucial for effectively managing your hyperparameter optimization. +A sweep orchestrates multiple runs to explore hyperparameter combinations. To manage your hyperparameter optimization effectively, you must understand how sweep status and run status interact. The following sections describe how the two statuses differ, what happens when you stop an individual run, and which lifecycle command to choose. ### Key differences -- **Sweep status** controls whether new runs are created (Running, Paused, Stopped, Cancelled, Finished, Failed, Crashed) -- **Run status** reflects the execution state of individual runs (Pending, Running, Finished, Failed, Crashed, Killed) +- **Sweep status** controls whether the agent creates new runs (Running, Paused, Stopped, Cancelled, Finished, Failed, Crashed). +- **Run status** reflects the execution state of individual runs (Pending, Running, Finished, Failed, Crashed, Killed). ### Stop an individual run -When you [stop a run](/models/runs/stop-runs) in a sweep, the sweep agent automatically kicks off the next run in the sweep. 
This allows you to skip poorly performing configurations without interrupting the sweep's overall progress. + +When you [stop a run](/models/runs/stop-runs) in a sweep, the sweep agent automatically starts the next run in the sweep. You can skip poorly performing configurations without interrupting the sweep's overall progress. ### Best practices -- Use `--pause` instead of cancel when you want to temporarily halt exploration without losing running experiments -- Monitor individual run statuses to identify systematic failures -- Use `--stop` for graceful termination when you've found satisfactory hyperparameters -- Reserve `--cancel` for emergencies when runs are consuming excessive resources or producing errors +The following recommendations help you choose the right lifecycle command for the situation, so you avoid losing useful work or holding onto unwanted compute. + +- Use `--pause` instead of cancel when you want to temporarily halt exploration without losing running experiments. +- Monitor individual run statuses to identify systematic failures. +- Use `--stop` for graceful termination when you've found satisfactory hyperparameters. +- Reserve `--cancel` for emergencies when runs consume excessive resources or produce errors. diff --git a/models/sweeps/signal-handling-sweep-runs.mdx b/models/sweeps/signal-handling-sweep-runs.mdx index ff830025f1..ff83d6525e 100644 --- a/models/sweeps/signal-handling-sweep-runs.mdx +++ b/models/sweeps/signal-handling-sweep-runs.mdx @@ -3,7 +3,7 @@ description: Learn how W&B Sweeps handle UNIX signals, exit codes, and preemptio title: Signal handling and sweep runs --- -This page provides details about how W&B Sweeps handle system signals and process exit codes, to help you run sweeps reliably in preemptible environments such as SLURM, EC2 Spot, or Google Cloud preemptible VMs. These sections explain how to interrupt runs cleanly from the keyboard and give details to help you understand and predict run requeue behavior. 
For details about how runs are requeued when preempted, see [Resume preemptible Sweeps runs](/models/runs/resuming#resume-preemptible-sweeps-runs). +This page provides details about how W&B Sweeps handle system signals and process exit codes. Use this information to run sweeps reliably in preemptible environments such as SLURM, EC2 Spot, or Google Cloud preemptible VMs. The following sections describe how to interrupt runs cleanly from the keyboard and give details to help you understand and predict run requeue behavior. This page targets users who run sweeps on preemptible infrastructure or who need fine-grained control over run lifecycle and cleanup. For details about how W&B requeues runs when they're preempted, see [Resume preemptible Sweeps runs](/models/runs/resuming#resume-preemptible-sweeps-runs). ## Exit status and signals @@ -11,20 +11,20 @@ W&B uses the training process exit status to decide whether a run is requeued an **Exit code contract:** -- **Exit code 0**: The run is considered to have completed successfully and is not requeued. -- **Non-zero exit code**: The run is treated as failed or preempted. When you use [`mark_preempting()`](/models/ref/python/experiments/run#mark_preempting), W&B requeues the run so another agent (or the same agent after restart) can resume it. +- **Exit code 0**: W&B considers the run to have completed successfully and doesn't requeue it. +- **Non-zero exit code**: W&B treats the run as failed or preempted. When you use [`mark_preempting()`](/models/ref/python/experiments/run#mark_preempting), W&B requeues the run so another agent (or the same agent after restart) can resume it. -This applies whether the process exits from a signal handler, from an exception, or from an explicit `sys.exit()` call. Understanding and relying on this contract is vitally important in preemptible or cluster environments. +This applies whether the process exits from a signal handler, from an exception, or from an explicit `sys.exit()` call. 
Understanding and relying on this contract matters in preemptible or cluster environments. -When the process exits due to a [**catchable** signal](#catchable-signals-and-preemption), your handler can run, call [`wandb.run.mark_preempting()`](/models/ref/python/experiments/run#mark_preempting) if you want the run requeued, perform cleanup (for example, save a checkpoint), then exit with a non-zero code. A common convention is `sys.exit(128 + signum)` for termination by signal. W&B records that exit code and the same [requeue rules](/models/runs/resuming#resume-preemptible-sweeps-runs) apply. When the process is killed by the operating system kernel with [**`SIGKILL`**](#sigkill-uncatchable), the process cannot run exit hooks, so no final summary is written and the run may appear as crashed or killed; the agent still starts the next run. +When the process exits due to a [catchable signal](#catchable-signals-and-preemption), your handler can run, call [`wandb.run.mark_preempting()`](/models/ref/python/experiments/run#mark_preempting) if you want the run requeued, perform cleanup (for example, save a checkpoint), then exit with a non-zero code. A common convention is `sys.exit(128 + signum)` for termination by signal. W&B records that exit code and the same [requeue rules](/models/runs/resuming#resume-preemptible-sweeps-runs) apply. When the operating system kernel kills the process with [`SIGKILL`](#sigkill-uncatchable), the process can't run exit hooks, so W&B doesn't write a final summary and the run might appear as crashed or killed. The agent still starts the next run. ## Stale runs and server-side timeouts -If a run neither finishes nor posts new metrics for a long time (on the order of about five minutes), the W&B server marks the run as **crashed**. That often happens when the training process hangs, stops logging, or is terminated without a clean exit (for example after `SIGKILL`). 
Logging metrics on a steady cadence or exiting with a defined code helps keep run state aligned with what actually happened. +Exit codes aren't the only way W&B determines run state. The W&B server also infers run state from activity. If a run neither finishes nor posts new metrics for about 5 minutes, the W&B server marks the run as crashed. That often happens when the training process becomes unresponsive, stops logging, or terminates without a clean exit (for example, after `SIGKILL`). Logging metrics on a steady cadence or exiting with a defined code helps keep run state aligned with what happened. ## Catchable signals and preemption -You can register custom signal handlers in your training script. When a catchable signal is delivered, your handler runs; metrics already sent to W&B are preserved, and the agent detects the process exit and starts the next run. +Most signals you'll encounter in preemptible environments are catchable, meaning your training script can intercept them and shut down cleanly. You can register custom signal handlers in your training script. When the system delivers a catchable signal, your handler runs. W&B preserves metrics that were already sent, and the agent detects the process exit and starts the next run. **Best practices:** @@ -64,11 +64,11 @@ if __name__ == "__main__": ## `SIGKILL` (uncatchable) -`SIGKILL` cannot be caught or ignored. The process terminates immediately with no chance to run handlers or atexit callbacks. W&B cannot write a final summary for the run. The agent still recovers and continues the sweep, but run data for that run is incomplete. Use `SIGKILL` only as a last resort; prefer `SIGTERM` or `SIGINT` when you need graceful shutdown. +`SIGKILL` can't be caught or ignored. The process terminates immediately with no chance to run handlers or atexit callbacks. W&B can't write a final summary for the run. The agent still recovers and continues the sweep, but run data for that run is incomplete. 
Use `SIGKILL` only as a last resort. Prefer `SIGTERM` or `SIGINT` when you need graceful shutdown. -## Forwarding signals from agent to child +## Signal forwarding from agent to child -When you use the [`wandb agent`](/models/ref/cli/wandb-agent) CLI, the agent runs your training script as a **child process**. When you interrupt the **agent** (for example, with Ctrl+C or when a scheduler sends `SIGTERM` to the job), the **child** (training process) does not receive the signal by default; the training script cannot run its handler or call `mark_preempting()`. This is described in [GitHub #3667](https://github.com/wandb/wandb/issues/3667). +When you use the [`wandb agent`](/models/ref/cli/wandb-agent) CLI, the agent runs your training script as a child process. When you interrupt the agent (for example, with Ctrl+C or when a scheduler sends `SIGTERM` to the job), the child (training process) doesn't receive the signal by default. The training script can't run its handler or call `mark_preempting()`. For more information, see [wandb GitHub issue #3667](https://github.com/wandb/wandb/issues/3667). To let the child shut down gracefully and call `wandb.run.mark_preempting()` in a handler, run the CLI agent with `--forward-signals`: @@ -76,57 +76,57 @@ To let the child shut down gracefully and call `wandb.run.mark_preempting()` in wandb agent --forward-signals entity/project/sweep_ID ``` -Signal forwarding is **not** supported for [`wandb.agent()`](/models/ref/python/functions/agent) in the Python API. That path runs your training function in a thread, not as a separate child process, so the same forwarding behavior does not apply. +W&B doesn't support signal forwarding for [`wandb.agent()`](/models/ref/python/functions/agent) in the Python API. That path runs your training function in a thread, not as a separate child process, so the same forwarding behavior doesn't apply. 
-When the CLI agent receives `SIGINT` or `SIGTERM` with forwarding enabled, it relays the signal to the child so your training script's handler can run, call `wandb.run.mark_preempting()` and [`wandb.finish()`](/models/ref/python/experiments/run#finish) with a non-zero exit code if needed, and exit with a non-zero code. If you press Ctrl+C twice on the agent process, the agent receives `SIGTERM` by default. With `--forward-signals`, `SIGINT` can be forwarded to the child so your handler runs. +When the CLI agent receives `SIGINT` or `SIGTERM` with forwarding enabled, it relays the signal to the child. Your training script's handler can then run, call `wandb.run.mark_preempting()` and [`wandb.finish()`](/models/ref/python/experiments/run#finish) with a non-zero exit code if needed, and exit with a non-zero code. If you press Ctrl+C twice on the agent process, the agent receives `SIGTERM` by default. With `--forward-signals`, the agent can forward `SIGINT` to the child so your handler runs. -See the [wandb agent](/models/ref/cli/wandb-agent) CLI reference for details. +For more information, see the [`wandb agent`](/models/ref/cli/wandb-agent) CLI reference. -## Preemptible clusters like `SLURM` +## Preemptible clusters like SLURM -On preemption, the **training process** must receive the signal, mark the run as preempting, and exit with a non-zero code so the run is requeued. A new agent (or the same agent after the job is requeued) can then resume the run. +This section describes how to configure sweeps so that runs survive preemption on clusters such as SLURM, EC2 Spot, or Google Cloud preemptible VMs. On preemption, the training process must receive the signal, mark the run as preempting, and exit with a non-zero code so W&B requeues the run. A new agent (or the same agent after the job is requeued) can then resume the run. **Ensure the training process receives the signal:** -1. 
**When the scheduler signals the agent**: Run the agent with `wandb agent --forward-signals` so that when the scheduler (or user) sends a signal to the agent, the agent forwards it to the child. The child's handler can then call `wandb.run.mark_preempting()`, [`wandb.finish(exit_code=...)`](/models/ref/python/experiments/run#finish) with a non-zero code, and `sys.exit(128 + signum)` (or another non-zero exit code). -2. **When the scheduler signals the launch script (not the agent directly)**: Have the launch script send the preemption signal directly to the training process. For example, the training script writes its process ID to a file; the launch script traps the cluster signal (for example `SIGUSR1`) and runs `kill -SIGUSR1 $(cat $PID_FILE)` so the training process's handler runs. +- **When the scheduler signals the agent**: Run the agent with `wandb agent --forward-signals` so that when the scheduler (or user) sends a signal to the agent, the agent forwards it to the child. The child's handler can then call `wandb.run.mark_preempting()`, [`wandb.finish(exit_code=...)`](/models/ref/python/experiments/run#finish) with a non-zero code, and `sys.exit(128 + signum)` (or another non-zero exit code). +- **When the scheduler signals the launch script (not the agent directly)**: Have the launch script send the preemption signal directly to the training process. For example, the training script writes its process ID to a file. The launch script traps the cluster signal (for example, `SIGUSR1`) and runs `kill -SIGUSR1 $(cat $PID_FILE)` so the training process's handler runs. -**In the training script:** Register a handler for the signal your cluster uses (for example `SIGTERM` or `SIGUSR1`). In the handler, call `wandb.run.mark_preempting()` if a run is active, then finish the run with a non-zero exit code and `sys.exit(128 + signum)` (or another non-zero code) so the run is requeued. 
See [Resume preemptible Sweeps runs](/models/runs/resuming#resume-preemptible-sweeps-runs) for when runs are requeued and how that interacts with `mark_preempting()`. +**In the training script:** Register a handler for the signal your cluster uses (for example, `SIGTERM` or `SIGUSR1`). In the handler, call `wandb.run.mark_preempting()` if a run is active, then finish the run with a non-zero exit code and `sys.exit(128 + signum)` (or another non-zero code) so W&B requeues the run. For more information about when W&B requeues runs and how that interacts with `mark_preempting()`, see [Resume preemptible Sweeps runs](/models/runs/resuming#resume-preemptible-sweeps-runs). -**Sweep state:** Run `wandb sweep entity/project/sweep_ID --resume` before starting the agent so the sweep is in resume mode and will hand out requeued runs. +**Sweep state:** Run `wandb sweep entity/project/sweep_ID --resume` before starting the agent so the sweep is in resume mode and hands out requeued runs. -**Multi-agent coordination:** When many agents run at once (such as SLURM array jobs), they can race to claim the same preempted run. This is a known limitation. Stagger agent startup or use external coordination mechanisms like locks to help work around this potential issue. +**Multi-agent coordination:** When many agents run at once (such as SLURM array jobs), they can race to claim the same preempted run. This is a known limitation. To work around it, stagger agent startup or use external coordination mechanisms such as locks. ## `wandb sweep --cancel` -You cancel a sweep using the W&B API, not an OS signal. Run a command like `wandb sweep --cancel entity/project/sweep_ID`. The server tells the agent to exit, and the agent then terminates running child processes and stops. There can be a short delay (on the order of the agent's API polling interval) before cancellation takes effect. 
+This section describes how the `--cancel` command interacts with signals and child processes, since cancellation behaves differently from sending an OS signal directly. You cancel a sweep using the W&B API, not an OS signal. Run a command such as `wandb sweep --cancel entity/project/sweep_ID`. The server tells the agent to exit, and the agent then terminates running child processes and stops. A short delay (on the order of the agent's API polling interval) can occur before cancellation takes effect. -Cancellation delivers **`SIGKILL`** to runs. Child processes have no chance to run user-defined signal handlers. The same applies when you use the **Cancel** control on the Sweeps UI. Use `--cancel` when you want to stop the entire sweep and mark it cancelled. For graceful shutdown of the current run, send a catchable signal to the run (or use `--forward-signals` with the CLI agent and signal the agent). For graceful sweep completion, use [`wandb sweep --stop`](/models/sweeps/pause-resume-and-cancel-sweeps#stop-a-sweep) instead of `--cancel`. +Cancellation delivers `SIGKILL` to runs. Child processes have no chance to run user-defined signal handlers. The same applies when you use the **Cancel** control on the Sweeps UI. Use `--cancel` when you want to stop the entire sweep and mark it canceled. For graceful shutdown of the current run, send a catchable signal to the run (or use `--forward-signals` with the CLI agent and signal the agent). For graceful sweep completion, use [`wandb sweep --stop`](/models/sweeps/pause-resume-and-cancel-sweeps#stop-a-sweep) instead of `--cancel`. -See [Manage sweeps](/models/sweeps/pause-resume-and-cancel-sweeps) for pause, resume, stop, and cancel options. +For more information about pause, resume, stop, and cancel options, see [Manage sweeps](/models/sweeps/pause-resume-and-cancel-sweeps). 
-## Killing the agent vs the run +## Signals to the agent versus signals to the run -If you send a signal to the **agent** process (not the child training process), the agent may exit while the child continues running as an orphan. The orphan may keep printing to your terminal, and the shell may not show a new prompt until you press Enter. +Understanding the distinction between signaling the agent and signaling the training run helps you avoid orphaned processes and unexpected behavior. If you send a signal to the agent process (not the child training process), the agent might exit while the child continues running as an orphan. The orphan might keep printing to your terminal, and the shell might not show a new prompt until you press Enter. -Unless you use `--forward-signals` with the CLI agent, stopping the agent does not guarantee the child training process stops. +Unless you use `--forward-signals` with the CLI agent, stopping the agent doesn't guarantee the child training process stops. -To confirm the agent has exited, use an OS command like `ps -p ` or `pgrep -f "wandb agent"` instead of relying on prompt appearance. +To confirm the agent has exited, use an OS command like `ps -p [AGENT-PID]` or `pgrep -f "wandb agent"` instead of relying on prompt appearance. ## Reference: `mark_preempting()` and final run state -The table below summarizes how run state depends on **when** you call `mark_preempting()` and how the process exits. It assumes you use the [`wandb agent`](/models/ref/cli/wandb-agent) CLI with your training program as a subprocess. +The following table summarizes how run state depends on when you call `mark_preempting()` and how the process exits. It assumes you use the [`wandb agent`](/models/ref/cli/wandb-agent) CLI with your training program as a subprocess. 
| Scenario | No `mark_preempting()` | Signal handler calls `mark_preempting()` and exits non-zero | `mark_preempting()` always called right after `init()` | | --- | --- | --- | --- | | Run completes normally with exit code 0 | FINISHED | FINISHED | FINISHED | | Run fails with non-zero exit code | FAILED | FAILED | PREEMPTED | -| Run receives `SIGKILL` | CRASHED after about five minutes | CRASHED after about five minutes (uncatchable) | PREEMPTED after about five minutes | +| Run receives `SIGKILL` | CRASHED after about 5 minutes | CRASHED after about 5 minutes (uncatchable) | PREEMPTED after about 5 minutes | | Run receives `SIGINT` | KILLED | PREEMPTED (with a `SIGINT` handler) | PREEMPTED | -| Run receives another signal (for example `SIGTERM` or `SIGUSR1`) | CRASHED after about five minutes | PREEMPTED (with a matching handler) | PREEMPTED after about five minutes | +| Run receives another signal (for example, `SIGTERM` or `SIGUSR1`) | CRASHED after about 5 minutes | PREEMPTED (with a matching handler) | PREEMPTED after about 5 minutes | -If you only call `mark_preempting()` inside a signal handler, you do not cover cases where the handler never runs, such as `SIGKILL`. +If you only call `mark_preempting()` inside a signal handler, you don't cover cases where the handler never runs, such as `SIGKILL`. -If you always call `mark_preempting()` immediately after `wandb.init()`, any failure can be treated as preemption and the run may be requeued repeatedly, including for bugs or bad configuration. +If you always call `mark_preempting()` immediately after `wandb.init()`, W&B can treat any failure as preemption and might requeue the run repeatedly, including for bugs or bad configuration. -For environments with a well-defined preemption signal, the usual approach is a **signal handler** that calls `mark_preempting()` and exits non-zero, not an unconditional call after `init()`. 
+For environments with a well-defined preemption signal, the usual approach is a signal handler that calls `mark_preempting()` and exits non-zero, not an unconditional call after `init()`. diff --git a/models/sweeps/start-sweep-agents.mdx b/models/sweeps/start-sweep-agents.mdx index fc6a601fc0..29efc58a3b 100644 --- a/models/sweeps/start-sweep-agents.mdx +++ b/models/sweeps/start-sweep-agents.mdx @@ -1,20 +1,18 @@ --- -description: Start or stop a W&B Sweep Agent on one or more machines. +description: Start or stop a sweep agent on one or more machines. title: Start a sweep agent --- +This page explains how to start a sweep agent on one or more machines so that W&B can run your hyperparameter search. Sweep agents use the sweep configuration that you define when you [initialize a sweep](/models/sweeps/initialize-sweeps) to explore different hyperparameter combinations. W&B creates a new run for each hyperparameter combination that the sweep agent runs. -Start a sweep on one or more agents on one or more machines. Sweep agents use the sweep configuration defined when you [initialize a sweep](/models/sweeps/initialize-sweeps) to explore different hyperparameter combinations. W&B creates a new run for each hyperparameter combination the sweep agent tries. +See [Manage sweeps](/models/sweeps/pause-resume-and-cancel-sweeps) to learn how to pause, resume, stop, or cancel a running sweep agent. -See [Manage sweeps](/models/sweeps/pause-resume-and-cancel-sweeps) to learn how to pause, resume, stop, or cancel a sweep. +Before you continue: - -Before you continue, make sure you: -* Configure your training script to create and track hyperparameter combinations with W&B. For more information, see [Add W&B to your code](./add-w-and-b-to-your-code#training-script-with-w%26b-python-sdk). -* Have a [configuration file](./define-sweep-configuration) defined for your sweep. - +* Configure your training script to create and track hyperparameter combinations with W&B. 
See [Add W&B to your code](/models/sweeps/add-w-and-b-to-your-code#python-script-or-notebook) for examples. +* Define a [configuration file](/models/sweeps/define-sweep-configuration) for your sweep. -The following code snippets demonstrate how to start an agent with the CLI and within a Jupyter Notebook or Python script. For both methods, provide the sweep ID that W&B returns when you initialized the sweep. The sweep ID has the form: +These examples show how to start a sweep agent with the W&B CLI or from Python. You need the sweep ID that W&B returns when you initialize the sweep. You can use the short sweep ID on its own (for example, `dtzl1o7u`), or include the entity and project as a path: ```bash entity/project/sweep_ID @@ -23,84 +21,83 @@ entity/project/sweep_ID Where: * `entity`: Your W&B username or team name. -* `project`: The name of the project where you want W&B to store the output of the run. If the project is not specified, W&B puts the run in a project called "Uncategorized". -* `sweep_ID`: The pseudo random, unique ID generated by W&B. +* `project`: The name of the project where you want W&B to store the output of the run. If you don't specify the project, W&B puts the run in a project called "Uncategorized". +* `sweep_ID`: The pseudo-random, unique ID that W&B generates. -Use the `wandb agent` command to start a sweep. Provide the sweep ID that W&B returns when you initialized the sweep. +Use the `wandb agent` command to start a sweep. Provide the sweep ID that W&B returns when you initialize the sweep. -Copy and paste the code snippet below and replace `sweep_id` with your sweep ID: +Replace `[SWEEP-ID]` with your sweep ID in the following command: ```bash -wandb agent sweep_id +wandb agent [SWEEP-ID] ``` -For graceful shutdown when you interrupt the agent (for example, with Ctrl+C), use `wandb agent --forward-signals sweep_id` so the current run receives the signal and can shut down cleanly. 
See [Signal handling and sweep runs](/models/sweeps/signal-handling-sweep-runs) for details. +For graceful shutdown when you interrupt the agent (for example, with `Ctrl+C`), use `wandb agent --forward-signals [SWEEP-ID]` so the current run receives the signal and can shut down cleanly. See [Signal handling and sweep runs](/models/sweeps/signal-handling-sweep-runs) for details. -Use [`wandb.agent()`](/models/ref/python/functions/agent) to start a sweep. Provide the sweep ID that W&B returns when you initialized the sweep along with the name of the function that is the entrypoint to your training script. +Use [`wandb.agent()`](/models/ref/python/functions/agent) to start a sweep. Provide the sweep ID that W&B returns when you initialize the sweep, along with the name of the function that serves as the entrypoint to your training script. -Copy and paste the code snippet below and replace `` with your sweep ID and `` with the name of your training function: +Replace `[SWEEP-ID]` with your sweep ID and `[FUNCTION-NAME]` with the name of your training function in the following code: ```python -wandb.agent(sweep_id="", function="") +wandb.agent(sweep_id="[SWEEP-ID]", function="[FUNCTION-NAME]") ``` -Signal forwarding from the agent to the training run is only supported when you use the CLI (`wandb agent --forward-signals`). It is not supported for `wandb.agent()` in Python because the training function runs in a thread, not as a child process. See [Signal handling and sweep runs](/models/sweeps/signal-handling-sweep-runs) for details. +W&B only supports signal forwarding from the agent to the training run when you use the CLI (`wandb agent --forward-signals`). W&B doesn't support signal forwarding for `wandb.agent()` in Python because the training function runs in a thread, not as a child process. 
-See [Python script or notebook tab](/models/sweeps/add-w-and-b-to-your-code#python-script-or-notebook) in Add W&B to your code for an example of how to set up your training script if you use this method. +See the [Python script or notebook tab](/models/sweeps/add-w-and-b-to-your-code#python-script-or-notebook) in Add W&B to your code for an example of how to set up your training script if you use this method. **Multiprocessing** -You must wrap your `wandb.agent()` and `wandb.sweep()` calls with `if __name__ == '__main__':` if you use Python standard library's `multiprocessing` or PyTorch's `pytorch.multiprocessing` package. For example: +You must wrap your `wandb.agent()` and `wandb.sweep()` calls with `if __name__ == '__main__':` if you use the Python standard library's `multiprocessing` or PyTorch's `torch.multiprocessing` package. For example: ```python if __name__ == '__main__': - wandb.agent(sweep_id="", function="", count="") + wandb.agent(sweep_id="[SWEEP-ID]", function="[FUNCTION]", count="[COUNT]") ``` -Wrapping your code with this convention ensures that it is only executed when the script is run directly, and not when it is imported as a module in a worker process. +Wrapping your code with this convention ensures that Python only runs it when you run the script directly, and not when a worker process imports it as a module. -See [Python standard library `multiprocessing`](https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods) or [PyTorch `multiprocessing`](https://docs.pytorch.org/docs/stable/notes/multiprocessing.html#asynchronous-multiprocess-training-e-g-hogwild) for more information about multiprocessing. See https://realpython.com/if-name-main-python/ for information about the `if __name__ == '__main__':` convention.
+For more information about multiprocessing, see [Python standard library `multiprocessing`](https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods) or [PyTorch `multiprocessing`](https://docs.pytorch.org/docs/stable/notes/multiprocessing.html#asynchronous-multiprocess-training-e-g-hogwild). For more information about the `if __name__ == '__main__':` convention, see [Defining main functions in Python](https://realpython.com/if-name-main-python/). +## Limit the number of runs a sweep agent tries -### Limit the number of runs a sweep agent tries +By default, random and Bayesian searches run indefinitely, so cap how many runs an agent attempts. Specify the number of runs a sweep agent should try to bound its work. The following code snippets demonstrate how to set a maximum number of [runs](/models/ref/python/experiments/run) with the CLI and within a Jupyter notebook or Python script. -Random and Bayesian searches will run forever. You must stop the process from the command line, within your python script, or the [Sweeps UI](./visualize-sweep-results). +Random and Bayesian searches run indefinitely. If you don't set a run count, you must stop the process from the command line, from within your Python script, or from the [Sweeps UI](/models/sweeps/visualize-sweep-results). -Specify the number of runs a sweep agent should try. The following code snippets demonstrate how to set a maximum number of [W&B Runs](/models/ref/python/experiments/run) with the CLI and within a Jupyter Notebook, Python script. - -First, initialize your sweep with the [`wandb sweep`](/models/ref/cli/wandb-sweep) command. For more information, see [Initialize sweeps](./initialize-sweeps). +First, initialize your sweep with the [`wandb sweep`](/models/ref/cli/wandb-sweep) command. For more information, see [Initialize sweeps](/models/sweeps/initialize-sweeps). 
-``` +```bash wandb sweep config.yaml ``` Next, pass an integer value to the count flag to set the maximum number of runs to try. -```python +```bash NUM=10 SWEEPID="dtzl1o7u" wandb agent --count $NUM $SWEEPID ``` -First, initialize your sweep. For more information, see [Initialize sweeps](./initialize-sweeps). +First, initialize your sweep. For more information, see [Initialize sweeps](/models/sweeps/initialize-sweeps). -``` +```python sweep_id = wandb.sweep(sweep_config) ``` -Next, start the sweep job. Provide the sweep ID generated from sweep initiation. Pass an integer value to the count parameter to set the maximum number of runs to try. +Next, start the sweep job. Provide the sweep ID that sweep initiation generates. Pass an integer value to the count parameter to set the maximum number of runs to try. ```python sweep_id, count = "dtzl1o7u", 10 @@ -108,7 +105,7 @@ wandb.agent(sweep_id, count=count) ``` -If you start a new run after the sweep agent has finished, within the same script or notebook, then you should call `wandb.teardown()` before starting the new run. +If you start a new run after the sweep agent finishes, within the same script or notebook, call `wandb.teardown()` before you start the new run. diff --git a/models/sweeps/sweep-config-keys.mdx b/models/sweeps/sweep-config-keys.mdx index 7a943890b0..d98016c524 100644 --- a/models/sweeps/sweep-config-keys.mdx +++ b/models/sweeps/sweep-config-keys.mdx @@ -1,12 +1,12 @@ --- title: Sweep configuration options -description: "Reference for all W&B Sweep configuration keys including method, metric, parameters, early termination, and command." +description: "Reference for all sweep configuration keys including method, metric, parameters, early termination, and command." --- -A sweep configuration consists of nested key-value pairs. 
Use top-level keys within your sweep configuration to define qualities of your sweep search such as the parameters to search through ([`parameter`](./sweep-config-keys#parameters) key), the methodology to search the parameter space ([`method`](./sweep-config-keys#method) key), and more. +A sweep configuration consists of nested key-value pairs. Use top-level keys within your sweep configuration to define qualities of your sweep search, such as the parameters to search through ([`parameters`](#parameters) key) and the methodology to search the parameter space ([`method`](#method) key). -The following table lists top-level sweep configuration keys and a brief description. See the respective sections for more information about each key. +The following table lists top-level sweep configuration keys and a brief description. For more information about each key, see the respective sections. | Top-level keys | Description | @@ -23,7 +23,7 @@ The following table lists top-level sweep configuration keys and a brief descrip | [`command`](#command) | Command structure for invoking and passing arguments to the training script | | `run_cap` | Maximum number of runs for this sweep | -See the [Sweep configuration](./sweep-config-keys) structure for more information on how to structure your sweep configuration. +For more information, see [Define sweep configuration](/models/sweeps/define-sweep-configuration). {/* ## `program` @@ -42,16 +42,17 @@ Use the `metric` top-level sweep configuration key to specify the name, the goal |Key | Description | | -------- | --------------------------------------------------------- | | `name` | Name of the metric to optimize. | -| `goal` | Either `minimize` or `maximize` (Default is `minimize`). | -| `target` | Goal value for the metric you are optimizing. The sweep does not create new runs when if or when a run reaches a target value that you specify. 
Active agents that have a run executing (when the run reaches the target) wait until the run completes before the agent stops creating new runs. | +| `goal` | Either `minimize` or `maximize` (default is `minimize`). | +| `target` | Goal value for the metric you're optimizing. The sweep doesn't create new runs when a run reaches a target value that you specify. Active agents that have a run executing (when the run reaches the target) wait until the run completes before the agent stops creating new runs. | ## `parameters` -In your YAML file or Python script, specify `parameters` as a top level key. Within the `parameters` key, provide the name of a hyperparameter you want to optimize. Common hyperparameters include: learning rate, batch size, epochs, optimizers, and more. For each hyperparameter you define in your sweep configuration, specify one or more search constraints. -The following table shows supported hyperparameter search constraints. Based on your hyperparameter and use case, use one of the search constraints below to tell your sweep agent where (in the case of a distribution) or what (`value`, `values`, and so forth) to search or use. +In your YAML file or Python script, specify `parameters` as a top level key. Within the `parameters` key, provide the name of a hyperparameter you want to optimize. Common hyperparameters include learning rate, batch size, epochs, and optimizers. For each hyperparameter you define in your sweep configuration, specify one or more search constraints. + +The following table shows supported hyperparameter search constraints. Based on your hyperparameter and use case, use one of the following search constraints to tell your sweep agent where (in the case of a distribution) or what (such as `value` or `values`) to search or use. | Search constraint | Description | @@ -60,42 +61,48 @@ The following table shows supported hyperparameter search constraints. 
Based on | `value` | Specifies the single valid value for this hyperparameter. Compatible with `grid`. | | `distribution` | Specify a probability [distribution](#distribution-options-for-random-and-bayesian-search). See the note following this table for information on default values. | | `probabilities` | Specify the probability of selecting each element of `values` when using `random`. | -| `min`, `max` | (`int`or `float`) Maximum and minimum values. If `int`, for `int_uniform` -distributed hyperparameters. If `float`, for `uniform` -distributed hyperparameters. | -| `mu` | (`float`) Mean parameter for `normal` - or `lognormal` -distributed hyperparameters. | -| `sigma` | (`float`) Standard deviation parameter for `normal` - or `lognormal` -distributed hyperparameters. | +| `min`, `max` | (`int` or `float`) Maximum and minimum values. If `int`, for `int_uniform`-distributed hyperparameters. If `float`, for `uniform`-distributed hyperparameters. | +| `mu` | (`float`) Mean parameter for `normal`- or `lognormal`-distributed hyperparameters. | +| `sigma` | (`float`) Standard deviation parameter for `normal`- or `lognormal`-distributed hyperparameters. | | `q` | (`float`) Quantization step size for quantized hyperparameters. | | `parameters` | Nest other parameters inside a root level parameter. | -W&B sets the following distributions based on the following conditions if a [distribution](#distribution-options-for-random-and-bayesian-search) is not specified: +If you don't specify a [distribution](#distribution-options-for-random-and-bayesian-search), W&B sets the following distributions based on these conditions: * `categorical` if you specify `values` * `int_uniform` if you specify `max` and `min` as integers * `uniform` if you specify `max` and `min` as floats -* `constant` if you provide a set to `value` +* `constant` if you specify `value` ## `method` -Specify the hyperparameter search strategy with the `method` key. 
There are three hyperparameter search strategies to choose from: grid, random, and Bayesian search. -#### Grid search -Iterate over every combination of hyperparameter values. Grid search makes uninformed decisions on the set of hyperparameter values to use on each iteration. Grid search can be computationally costly. -Grid search executes forever if it is searching within in a continuous search space. +Specify the hyperparameter search strategy with the `method` key. Choose from three hyperparameter search strategies: grid, random, and Bayesian search. + +### Grid search + +Grid search iterates over every combination of hyperparameter values. Grid search makes uninformed decisions on the set of hyperparameter values to use on each iteration. Grid search can be computationally costly. + +Grid search executes forever if it searches within a continuous search space. -#### Random search -Choose a random, uninformed, set of hyperparameter values on each iteration based on a distribution. Random search runs forever unless you stop the process from the command line, within your python script, or [the W&B App](/models/sweeps/visualize-sweep-results/). +### Random search -Specify the distribution space with the metric key if you choose random (`method: random`) search. +Random search chooses a random, uninformed set of hyperparameter values on each iteration based on a distribution. Random search runs forever unless you stop the process from the command line, within your Python script, or from [the W&B App](/models/sweeps/visualize-sweep-results). -#### Bayesian search -In contrast to [random](#random-search) and [grid](#grid-search) search, Bayesian models make informed decisions. Bayesian optimization uses a probabilistic model to decide which values to use through an iterative process of testing values on a surrogate function before evaluating the objective function. Bayesian search works well for small numbers of continuous parameters but scales poorly. 
For more information about Bayesian search, see the [Bayesian Optimization Primer paper](https://web.archive.org/web/20240209053347/https://static.sigopt.com/b/20a144d208ef255d3b981ce419667ec25d8412e2/static/pdf/SigOpt_Bayesian_Optimization_Primer.pdf). +If you choose random search (`method: random`), specify the distribution for each hyperparameter in the `parameters` key. + +### Bayesian search + +In contrast to [random](#random-search) and [grid](#grid-search) search, Bayesian models make informed decisions. Bayesian optimization uses a probabilistic model to decide which values to use. The process iteratively tests values on a surrogate function before evaluating the objective function. Bayesian search works well for small numbers of continuous parameters but scales poorly. For more information about Bayesian search, see the [Bayesian Optimization Primer paper](https://web.archive.org/web/20240209053347/https://static.sigopt.com/b/20a144d208ef255d3b981ce419667ec25d8412e2/static/pdf/SigOpt_Bayesian_Optimization_Primer.pdf). {/* There are different Bayesian optimization methods. W&B uses a Gaussian process to model the relationship between hyperparameters and the model metric. For more information, see this paper. [LINK] */} -Bayesian search runs forever unless you stop the process from the command line, within your python script, or [the W&B App](/models/sweeps/visualize-sweep-results/). +Bayesian search runs forever unless you stop the process from the command line, within your Python script, or from [the W&B App](/models/sweeps/visualize-sweep-results). ### Distribution options for random and Bayesian search -Within the `parameter` key, nest the name of the hyperparameter. Next, specify the `distribution` key and specify a distribution for the value. + +Within the `parameters` key, nest the name of the hyperparameter. Next, specify the `distribution` key and specify a distribution for the value. The following table lists distributions W&B supports. 
@@ -105,15 +112,15 @@ The following table lists distributions W&B supports. | `categorical` | Categorical distribution. Must specify all valid values (`values`) for this hyperparameter. | | `int_uniform` | Discrete uniform distribution on integers. Must specify `max` and `min` as integers. | | `uniform` | Continuous uniform distribution. Must specify `max` and `min` as floats. | -| `q_uniform` | Quantized uniform distribution. Returns `round(X / q) * q` where X is uniform. `q` defaults to `1`.| -| `log_uniform` | Log-uniform distribution. Returns a value `X` between `exp(min)` and `exp(max)`such that the natural logarithm is uniformly distributed between `min` and `max`. | -| `log_uniform_values` | Log-uniform distribution. Returns a value `X` between `min` and `max` such that `log(`X`)` is uniformly distributed between `log(min)` and `log(max)`. | +| `q_uniform` | Quantized uniform distribution. Returns `round(X / q) * q` where `X` is uniform. `q` defaults to `1`.| +| `log_uniform` | Log-uniform distribution. Returns a value `X` between `exp(min)` and `exp(max)` such that the natural logarithm is uniformly distributed between `min` and `max`. | +| `log_uniform_values` | Log-uniform distribution. Returns a value `X` between `min` and `max` such that `log(X)` is uniformly distributed between `log(min)` and `log(max)`. | | `q_log_uniform` | Quantized log uniform. Returns `round(X / q) * q` where `X` is `log_uniform`. `q` defaults to `1`. | | `q_log_uniform_values` | Quantized log uniform. Returns `round(X / q) * q` where `X` is `log_uniform_values`. `q` defaults to `1`. | -| `inv_log_uniform` | Inverse log uniform distribution. Returns `X`, where `log(1/X)` is uniformly distributed between `min` and `max`. | -| `inv_log_uniform_values` | Inverse log uniform distribution. Returns `X`, where `log(1/X)` is uniformly distributed between `log(1/max)` and `log(1/min)`. | +| `inv_log_uniform` | Inverse log uniform distribution. 
Returns `X`, where `log(1/X)` is uniformly distributed between `min` and `max`. | +| `inv_log_uniform_values` | Inverse log uniform distribution. Returns `X`, where `log(1/X)` is uniformly distributed between `log(1/max)` and `log(1/min)`. | | `normal` | Normal distribution. Return value is normally distributed with mean `mu` (default `0`) and standard deviation `sigma` (default `1`).| -| `q_normal` | Quantized normal distribution. Returns `round(X / q) * q` where `X` is `normal`. Q defaults to 1. | +| `q_normal` | Quantized normal distribution. Returns `round(X / q) * q` where `X` is `normal`. `q` defaults to `1`. | | `log_normal` | Log normal distribution. Returns a value `X` such that the natural logarithm `log(X)` is normally distributed with mean `mu` (default `0`) and standard deviation `sigma` (default `1`). | | `q_log_normal` | Quantized log normal distribution. Returns `round(X / q) * q` where `X` is `log_normal`. `q` defaults to `1`. | @@ -131,14 +138,14 @@ You must specify a stopping algorithm if you use `early_terminate`. Nest the `ty ### Stopping algorithm -W&B currently supports [Hyperband](https://arxiv.org/abs/1603.06560) stopping algorithm. +W&B supports the [Hyperband](https://arxiv.org/abs/1603.06560) stopping algorithm. -[Hyperband](https://arxiv.org/abs/1603.06560) hyperparameter optimization evaluates if a program should stop or if it should to continue at one or more pre-set iteration counts, called *brackets*. +[Hyperband](https://arxiv.org/abs/1603.06560) hyperparameter optimization evaluates whether a program should stop or continue at one or more pre-set iteration counts, called *brackets*. When a W&B run reaches a bracket, the sweep compares that run's metric to all previously reported metric values. The sweep terminates the run if the run's metric value is too high (when the goal is minimization) or if the run's metric is too low (when the goal is maximization). -Brackets are based on the number of logged iterations. 
The number of brackets corresponds to the number of times you log the metric you are optimizing. The iterations can correspond to steps, epochs, or something in between. The numerical value of the step counter is not used in bracket calculations. +Brackets are based on the number of logged iterations. The number of brackets corresponds to the number of times you log the metric you're optimizing. The iterations can correspond to steps, epochs, or something in between. Bracket calculations don't use the numerical value of the step counter. Specify either `min_iter` or `max_iter` to create a bracket schedule. @@ -151,19 +158,19 @@ Specify either `min_iter` or `max_iter` to create a bracket schedule. | `max_iter` | Specify the maximum number of iterations. | | `s` | Specify the total number of brackets (required for `max_iter`) | | `eta` | Specify the bracket multiplier schedule (default: `3`). | -| `strict` | Enable 'strict' mode that prunes runs aggressively, more closely following the original Hyperband paper. Defaults to false. | +| `strict` | Enable `strict` mode that prunes runs aggressively, more closely following the original Hyperband paper. Defaults to `false`. | -Hyperband checks which [runs](/models/ref/python/experiments/run) to end once every few minutes. The end run timestamp might differ from the specified brackets if your run or iteration are short. +Hyperband checks which [runs](/models/ref/python/experiments/run) to end once every few minutes. The end run timestamp might differ from the specified brackets if your run or iteration is short. -## `command` +## `command` {/* Agents created with [`wandb agent`](/models/ref/cli/wandb-agent) receive a command in the following format by default: */} -Modify the format and contents with nested values within the `command` key. You can directly include fixed components such as filenames. +Use the `command` top-level key to control how the sweep agent invokes your training script and passes arguments to it. 
Modify the format and contents with nested values within the `command` key. You can directly include fixed components such as filenames. On Unix systems, `/usr/bin/env` ensures that the OS chooses the correct Python interpreter based on the environment. @@ -181,4 +188,4 @@ W&B supports the following macros for variable components of the command: | `${args_no_hyphens}` | Hyperparameters and their values in the form `param1=value1 param2=value2`. | | `${args_json}` | Hyperparameters and their values encoded as JSON. | | `${args_json_file}` | The path to a file containing the hyperparameters and their values encoded as JSON. | -| `${envvar}` | A way to pass environment variables. `${envvar:MYENVVAR}` __ expands to the value of MYENVVAR environment variable. __ | \ No newline at end of file +| `${envvar}` | A way to pass environment variables. `${envvar:MYENVVAR}` expands to the value of the `MYENVVAR` environment variable. | \ No newline at end of file diff --git a/models/sweeps/troubleshoot-sweeps.mdx b/models/sweeps/troubleshoot-sweeps.mdx index 7fce5dbc52..26b570c369 100644 --- a/models/sweeps/troubleshoot-sweeps.mdx +++ b/models/sweeps/troubleshoot-sweeps.mdx @@ -1,74 +1,80 @@ --- -description: "Troubleshoot common W&B Sweep issues including CommError, CUDA out of memory, and wandb agent failures." +description: "Troubleshoot common sweep issues including CommError, CUDA out of memory, and wandb agent failures." title: Sweeps troubleshooting --- -Troubleshoot common error messages with the guidance suggested. +This page helps you diagnose and resolve common error messages you might encounter when you run W&B Sweeps. The following sections describe an error, explain why it occurs, and recommend how to fix it. -### `CommError, Run does not exist` and `ERROR Error uploading` +## `CommError, Run does not exist` and `ERROR Error uploading` -Your W&B Run ID might be defined if these two error messages are both returned. 
As an example, you might have a similar code snippet defined somewhere in your Jupyter Notebooks or Python script: +Your W&B run ID might be defined if W&B returns both of these error messages. For example, you might have a similar code snippet defined somewhere in your Jupyter Notebooks or Python script: ```python wandb.init(id="some-string") ``` -You can not set a Run ID for W&B Sweeps because W&B automatically generates random, unique IDs for Runs created by W&B Sweeps. +You can't set a run ID for sweeps because W&B automatically generates random, unique IDs for runs that sweeps create. -W&B Run IDs need to be unique within a project. +W&B run IDs need to be unique within a project. -We recommend you pass a name to the name parameter when you initialized W&B, if you want to set a custom name that will appear on tables and graphs. For example: +If you want to set a custom name that appears on tables and graphs, pass a name to the `name` parameter when you initialize W&B. For example: ```python wandb.init(name="a helpful readable run name") ``` -### `Cuda out of memory` +After you remove the `id` argument from `wandb.init()`, the sweep can assign its own unique run IDs, and the upload errors stop. -Refactor your code to use process-based executions if you see this error message. More specifically, rewrite your code to a Python script. In addition, call the W&B Sweep Agent from the CLI, instead of the W&B Python SDK. +## `CUDA out of memory` -As an example, suppose you rewrite your code to a Python script called `train.py`. Add the name of the training script (`train.py`) to your YAML Sweep configuration file (`config.yaml` in this example): +If you see this error message, refactor your code to use process-based executions. When you run each trial in its own process, W&B releases GPU memory between runs. 
-```yaml -program: train.py -method: bayes -metric: - name: validation_loss - goal: maximize -parameters: - learning_rate: - min: 0.0001 - max: 0.1 - optimizer: - values: ["adam", "sgd"] -``` +To refactor your code, complete the following steps: -Next, add the following to your `train.py` Python script: +1. Rewrite your code as a Python script called `train.py`. Add the name of the training script (`train.py`) to your YAML sweep configuration file (`config.yaml` in this example): -```python -if _name_ == "_main_": - train() -``` + ```yaml + program: train.py + method: bayes + metric: + name: validation_loss + goal: maximize + parameters: + learning_rate: + min: 0.0001 + max: 0.1 + optimizer: + values: ["adam", "sgd"] + ``` -Navigate to your CLI and initialize a W&B Sweep with wandb sweep: +2. Add the following to your `train.py` Python script: -```shell -wandb sweep config.yaml -``` + ```python + if __name__ == "__main__": + train() + ``` -Make a note of the W&B Sweep ID that is returned. Next, start the Sweep job with [`wandb agent`](/models/ref/cli/wandb-agent) with the CLI instead of the Python SDK ([`wandb.agent()`](/models/ref/python/functions/agent)). Replace `sweep_ID` in the code snippet below with the Sweep ID that was returned in the previous step: +3. From your CLI, initialize a sweep with `wandb sweep`: -```shell -wandb agent sweep_ID -``` + ```bash + wandb sweep config.yaml + ``` + +4. Note the sweep ID that W&B returns. Start the sweep job with [`wandb agent`](/models/ref/cli/wandb-agent) from the CLI instead of the Python SDK ([`wandb.agent()`](/models/ref/python/functions/agent)). Replace `[SWEEP-ID]` in the following code snippet with the sweep ID that W&B returned in the previous step: + + ```bash + wandb agent [SWEEP-ID] + ``` + +With your training code running as a script under the CLI agent, each trial executes in its own process, and W&B releases GPU memory between runs. 
-### `anaconda 400 error` +## `anaconda 400 error` -The following error usually occurs when you do not log the metric that you are optimizing: +The following error usually occurs when you don't log the metric you're optimizing: -```shell +```text wandb: ERROR Error while calling W&B API: anaconda 400 error: {"code": 400, "message": "TypeError: bad operand type for unary -: 'NoneType'"} ``` -Within your YAML file or nested dictionary you specify a key named "metric" to optimize. Ensure that you log (`wandb.Run.log()`) this metric. In addition, ensure you use the _exact_ metric name that you defined the sweep to optimize within your Python script or Jupyter Notebook. For more information about configuration files, see [Define sweep configuration](/models/sweeps/define-sweep-configuration/). \ No newline at end of file +Within your YAML file or nested dictionary, you specify a key named `metric` to optimize. Ensure that you log this metric with `wandb.Run.log()`. Also, ensure you use the exact metric name you defined the sweep to optimize within your Python script or Jupyter Notebook. For more information about configuration files, see [Define sweep configuration](/models/sweeps/define-sweep-configuration). \ No newline at end of file diff --git a/models/sweeps/useful-resources.mdx b/models/sweeps/useful-resources.mdx index 342a5e4c22..43ccc5af2a 100644 --- a/models/sweeps/useful-resources.mdx +++ b/models/sweeps/useful-resources.mdx @@ -3,34 +3,40 @@ description: "Find links to academic papers, example reports, tutorials, and the title: Learn more about sweeps --- -### Academic papers +This page collects external resources to help you learn more about W&B Sweeps. Resources include academic background, example projects shared as W&B Reports, a hands-on tutorial, and the open source repository. -Li, Lisha, et al. 
"[Hyperband: A novel bandit-based approach to hyperparameter optimization.](https://arxiv.org/pdf/1603.06560.pdf)" _The Journal of Machine Learning Research_ 18.1 (2017): 6765-6816. +## Academic papers -### Sweep Experiments +The following paper provides background on the algorithms behind hyperparameter optimization techniques used in Sweeps. -The following W&B Reports demonstrate examples of projects that explore hyperparameter optimization with W&B Sweeps. +Li, Lisha, et al. "[Hyperband: A novel bandit-based approach to hyperparameter optimization.](https://arxiv.org/pdf/1603.06560.pdf)" _The Journal of Machine Learning Research_ 18.1 (2017): 6765-6816. + +## Sweeps experiments + +The following W&B Reports showcase projects that explore hyperparameter optimization with Sweeps. * [Drought Watch Benchmark Progress](https://wandb.ai/stacey/droughtwatch/reports/Drought-Watch-Benchmark-Progress--Vmlldzo3ODQ3OQ) * Description: Developing the baseline and exploring submissions to the Drought Watch benchmark. * [Tuning Safety Penalties in Reinforcement Learning](https://wandb.ai/safelife/benchmark-sweeps/reports/Tuning-Safety-Penalties-in-Reinforcement-Learning---VmlldzoyNjQyODM) - * Description: We examine agents trained with different side effect penalties on three different tasks: pattern creation, pattern removal, and navigation. + * Description: Examines agents trained with different side effect penalties on three different tasks: pattern creation, pattern removal, and navigation. * [Meaning and Noise in Hyperparameter Search with W&B](https://wandb.ai/stacey/pytorch_intro/reports/Meaning-and-Noise-in-Hyperparameter-Search--Vmlldzo0Mzk5MQ) [Stacey Svetlichnaya](https://wandb.ai/stacey) - * Description: How do we distinguish signal from pareidolia (imaginary patterns)? This article is showcases what is possible with W&B and aims to inspire further exploration. + * Description: How do you distinguish signal from pareidolia (imaginary patterns)? 
This article showcases what's possible with W&B and aims to inspire further exploration. * [Who is Them? Text Disambiguation with Transformers](https://wandb.ai/stacey/winograd/reports/Who-is-Them-Text-Disambiguation-with-Transformers--VmlldzoxMDU1NTc) - * Description: Using Hugging Face to explore models for natural language understanding + * Description: Using Hugging Face to explore models for natural language understanding. * [DeepChem: Molecular Solubility](https://wandb.ai/stacey/deepchem_molsol/reports/DeepChem-Molecular-Solubility--VmlldzoxMjQxMjM) * Description: Predict chemical properties from molecular structure with random forests and deep nets. * [Intro to MLOps: Hyperparameter Tuning](https://wandb.ai/iamleonie/Intro-to-MLOps/reports/Intro-to-MLOps-Hyperparameter-Tuning--VmlldzozMTg2OTk3) - * Description: Explore why hyperparameter optimization matters and look at three algorithms to automate hyperparameter tuning for your machine learning models. + * Description: Explore why hyperparameter optimization matters and examine three algorithms to automate hyperparameter tuning for your machine learning models. + +## How-to guide -### selfm-anaged +The following how-to guide demonstrates how to solve real-world problems with W&B: -The following how-to-guide demonstrates how to solve real-world problems with W&B: +* [Sweeps with XGBoost](https://github.com/wandb/examples/blob/master/examples/wandb-sweeps/sweeps-xgboost/xgboost_tune.py) + * Description: How to use Sweeps for hyperparameter tuning using XGBoost. -* [Sweeps with XGBoost ](https://github.com/wandb/examples/blob/master/examples/wandb-sweeps/sweeps-xgboost/xgboost_tune.py) - * Description: How to use W&B Sweeps for hyperparameter tuning using XGBoost. +## Sweeps GitHub repository -### Sweep GitHub repository +This section points to the source code for Sweeps and explains how to contribute. -W&B advocates open source and welcome contributions from the community. 
Find the [W&B Sweeps GitHub repository](https://github.com/wandb/sweeps). For information on how to contribute to the W&B open source repo, see the W&B GitHub [Contribution guidelines](https://github.com/wandb/wandb/blob/main/CONTRIBUTING.md). \ No newline at end of file +W&B supports open source and welcomes contributions from the community. Find the [W&B Sweeps GitHub repository](https://github.com/wandb/sweeps). For information on how to contribute to the W&B open source repository, see the W&B GitHub [Contribution guidelines](https://github.com/wandb/wandb/blob/main/CONTRIBUTING.md). \ No newline at end of file diff --git a/models/sweeps/visualize-sweep-results.mdx b/models/sweeps/visualize-sweep-results.mdx index 531ade653a..22bc72c82c 100644 --- a/models/sweeps/visualize-sweep-results.mdx +++ b/models/sweeps/visualize-sweep-results.mdx @@ -3,31 +3,42 @@ description: Visualize the results of your W&B Sweeps with the W&B App UI. title: Visualize sweep results --- -Visualize the results of your W&B Sweeps with the W&B App. Navigate to the [W&B App](https://wandb.ai/home). Choose the project that you specified when you initialized a sweep. You will be redirected to your project [workspace](/models/track/workspaces/). Select the **Sweep icon** in the project sidebar (broom icon). From the Sweep UI, select the name of your Sweep from the list. +Visualize the results of your W&B Sweeps with the W&B App. The Sweep UI provides charts and controls that help you compare runs, understand which hyperparameters matter most, and manage sweep execution. Use this page to find the sweep view in the App, interpret the default charts, and customize them. -The sweep list shows each sweep's state (**State**), creation time (**Created**), who started it (**Creator**), how many runs finished (**Run count**), and total **Compute time**. For a grid search over a discrete search space, W&B also shows **Est. Runs** (the expected number of runs). 
Open a sweep from the list to pause, resume, stop, or kill it from the app. For the same controls with the CLI, see [Manage sweeps](/models/sweeps/pause-resume-and-cancel-sweeps/). +## Open the sweep UI -By default, W&B will automatically create a parallel coordinates plot, a parameter importance plot, and a scatter plot when you start a W&B Sweep job. +To open the Sweep UI, navigate to the [W&B App](https://wandb.ai/home). Choose the project that you specified when you initialized a sweep. W&B redirects you to your project [workspace](/models/track/workspaces). Select the broom (**Sweep**) icon in the project sidebar. From the Sweep UI, select the name of your sweep from the list. The Sweep UI lists every sweep in the project and serves as the entry point for inspecting an individual sweep's runs and charts. + +The sweep list shows each sweep's state (**State**), creation time (**Created**), who started it (**Creator**), how many runs finished (**Run count**), and total **Compute time**. For a grid search over a discrete search space, W&B also shows **Est. Runs** (the expected number of runs). Open a sweep from the list to pause, resume, stop, or cancel it from the app. For the same controls with the CLI, see [Manage sweeps](/models/sweeps/pause-resume-and-cancel-sweeps). Sweep UI in project sidebar -Parallel coordinates charts summarize the relationship between large numbers of hyperparameters and model metrics at a glance. For more information on parallel coordinates plots, see [Parallel coordinates](/models/app/features/panels/parallel-coordinates/). +## Default visualizations + +By default, W&B creates a parallel coordinates plot, a parameter importance plot, and a scatter plot when you start a sweep job. The following sections describe each of these default visualizations and how to customize them. + +### Parallel coordinates plot + +Parallel coordinates charts summarize the relationship between large numbers of hyperparameters and model metrics at a glance. 
For more information about parallel coordinates plots, see [Parallel coordinates](/models/app/features/panels/parallel-coordinates). - Example parallel coordinates plot. + Example parallel coordinates plot -The scatter plot(left) compares the W&B Runs that were generated during the Sweep. For more information about scatter plots, see [Scatter Plots](/models/app/features/panels/scatter-plot/). +### Scatter plot and parameter importance + +The scatter plot (left) compares the runs generated during the sweep. For more information about scatter plots, see [Scatter Plots](/models/app/features/panels/scatter-plot). -The parameter importance plot(right) lists the hyperparameters that were the best predictors of, and highly correlated to desirable values of your metrics. For more information on parameter importance plots, see [Parameter Importance](/models/app/features/panels/parameter-importance/). +The parameter importance plot (right) lists the hyperparameters that are the best predictors of, and highly correlated with, desirable values of your metrics. For more information about parameter importance plots, see [Parameter Importance](/models/app/features/panels/parameter-importance). Scatter plot and parameter importance +## Customize panels -You can alter the dependent and independent values (x and y axis) that are automatically used. Within each panel there is a pencil icon called **Edit panel**. Choose **Edit panel**. A model will appear. Within the modal, you can alter the behavior of the graph. +If the default axes don't show the comparison you need, you can alter the dependent and independent values (x-axis and y-axis) that W&B uses automatically. Each panel has a pencil icon called **Edit panel**. Choose **Edit panel** to open a modal where you can alter the behavior of the graph. -For more information on all default W&B visualization options, see [Panels](/models/app/features/panels/). 
See the [Data Visualization docs](/models/tables/) for information on how to create plots from W&B Runs that are not part of a W&B Sweep. \ No newline at end of file +For more information about all default W&B visualization options, see [Panels](/models/app/features/panels). See the [Data Visualization docs](/models/tables) for information about how to create plots from runs that are not part of a sweep. \ No newline at end of file diff --git a/models/sweeps/walkthrough.mdx b/models/sweeps/walkthrough.mdx index 71074481f1..75102f5d60 100644 --- a/models/sweeps/walkthrough.mdx +++ b/models/sweeps/walkthrough.mdx @@ -1,18 +1,18 @@ --- -description: Sweeps quickstart shows how to define, initialize, and run a sweep. There - are four main steps +description: Define, initialize, and run a sweep to search a hyperparameter space and find the configuration that produces the best model. title: 'Tutorial: Define, initialize, and run a sweep' --- -This page shows how to define, initialize, and run a sweep. There are four main steps: +This tutorial shows how to define, initialize, and run a sweep so you can automate hyperparameter search and find the configuration that produces the best model. It's intended for users who are already familiar with logging runs to W&B and want to start tuning hyperparameters at scale. + +The tutorial has four main steps: 1. [Set up your training code](#set-up-your-training-code) 2. [Define the search space with a sweep configuration](#define-the-search-space-with-a-sweep-configuration) 3. [Initialize the sweep](#initialize-the-sweep) 4. [Start the sweep agent](#start-the-sweep) - -Copy and paste the following code into a Jupyter Notebook or Python script: +To get started, copy and paste the following code into a Jupyter Notebook or Python script. The sections that follow break down each part of this example. 
```python # Import the W&B Python Library and log into W&B @@ -44,13 +44,13 @@ sweep_id = wandb.sweep(sweep=sweep_configuration, project="my-first-sweep") wandb.agent(sweep_id, function=main, count=10) ``` -The following sections break down and explains each step in the code sample. +## Set up your training code +The sweep agent calls your training function with each combination of hyperparameter values to try, so the first step is to write a function that accepts those values and reports a metric back to W&B. -## Set up your training code Define a training function that takes in hyperparameter values from `wandb.Run.config` and uses them to train a model and return metrics. -Optionally provide the name of the project where you want the output of the W&B Run to be stored (project parameter in [`wandb.init()`](/models/ref/python/functions/init)). If the project is not specified, the run is put in an "Uncategorized" project. +Optionally provide the name of the project where you want to store the output of the run (project parameter in [`wandb.init()`](/models/ref/python/functions/init)). If you don't specify a project, W&B puts the run in an "Uncategorized" project. Both the sweep and the run must be in the same project. Therefore, the name you provide when you initialize W&B must match the name of the project you provide when you initialize a sweep. @@ -71,11 +71,13 @@ def main(): ## Define the search space with a sweep configuration -Specify the hyperparameters to sweep in a dictionary. For configuration options, see [Define sweep configuration](/models/sweeps/define-sweep-configuration/). +With the training function in place, the next step is to tell W&B which hyperparameters to vary and how to search over them. + +Specify the hyperparameters to sweep in a dictionary. For configuration options, see [Define sweep configuration](/models/sweeps/define-sweep-configuration). 
-The following example demonstrates a sweep configuration that uses a random search (`'method':'random'`). The sweep will randomly select a random set of values listed in the configuration for the batch size, epoch, and the learning rate. +The following example shows a sweep configuration that uses a random search (`'method':'random'`). The sweep randomly selects a set of values listed in the configuration for the `x` and `y` parameters. -W&B minimizes the metric specified in the `metric` key when `"goal": "minimize"` is associated with it. In this case, W&B will optimize for minimizing the metric `score` (`"name": "score"`). +W&B minimizes the metric specified in the `metric` key when `"goal": "minimize"` is associated with it. In this case, W&B optimizes for minimizing the metric `score` (`"name": "score"`). ```python @@ -90,52 +92,60 @@ sweep_configuration = { } ``` -## Initialize the Sweep +## Initialize the sweep -W&B uses a _Sweep Controller_ to manage sweeps on the cloud (standard), locally (local) across one or more machines. For more information about Sweep Controllers, see [Search and stop algorithms locally](./local-controller). +Initializing the sweep registers your search space with W&B and returns an identifier that the agent uses to request hyperparameter combinations. -A sweep identification number is returned when you initialize a sweep: +W&B uses a _Sweep Controller_ to manage sweeps in the cloud (standard) or locally (local) across one or more machines. For more information about Sweep Controllers, see [Search and stop algorithms locally](/models/sweeps/local-controller). + +Initializing a sweep returns a sweep identification number: ```python sweep_id = wandb.sweep(sweep=sweep_configuration, project="my-first-sweep") ``` -For more information about initializing sweeps, see [Initialize sweeps](./initialize-sweeps). +For more information, see [Initialize sweeps](/models/sweeps/initialize-sweeps). 
+ + ## Start the sweep -## Start the Sweep +With the sweep registered, start an agent to execute the runs and explore the search space. -Use the [`wandb.agent()`](/models/ref/python/functions/agent) API call to start a sweep. +To start a sweep, use the [`wandb.agent()`](/models/ref/python/functions/agent) API call. ```python wandb.agent(sweep_id, function=main, count=10) ``` +After the agent starts, it requests hyperparameter combinations from W&B, calls your training function for each one, and logs the resulting metrics back to your project. + **Multiprocessing** -You must wrap your `wandb.agent()` and `wandb.sweep()` calls with `if __name__ == '__main__':` if you use Python standard library's `multiprocessing` or PyTorch's `pytorch.multiprocessing` package. For example: +If you use the Python standard library's `multiprocessing` package or PyTorch's `torch.multiprocessing` package, you must wrap your `wandb.agent()` and `wandb.sweep()` calls with `if __name__ == '__main__':`. For example: ```python if __name__ == '__main__': - wandb.agent(sweep_id="", function="", count="") + wandb.agent(sweep_id="[SWEEP-ID]", function="[FUNCTION]", count="[COUNT]") ``` -Wrapping your code with this convention ensures that it is only executed when the script is run directly, and not when it is imported as a module in a worker process. +This convention ensures the code runs only when the script runs directly, not when it is imported as a module in a worker process. -See [Python standard library `multiprocessing`](https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods) or [PyTorch `multiprocessing`](https://docs.pytorch.org/docs/stable/notes/multiprocessing.html#asynchronous-multiprocess-training-e-g-hogwild) for more information about multiprocessing. See https://realpython.com/if-name-main-python/ for information about the `if __name__ == '__main__':` convention.
+For more information about multiprocessing, see [Python standard library `multiprocessing`](https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods) or [PyTorch `multiprocessing`](https://docs.pytorch.org/docs/stable/notes/multiprocessing.html#asynchronous-multiprocess-training-e-g-hogwild). For more information about the `if __name__ == '__main__':` convention, see [Real Python's guide to `__main__`](https://realpython.com/if-name-main-python/). -## Visualize results (optional) +## Optional: Visualize results + +Once the sweep is running, you can explore how different hyperparameter combinations affect your metric in the W&B App. -Open your project to see your live results in the W&B App dashboard. With just a few clicks, construct rich, interactive charts like [parallel coordinates plots](/models/app/features/panels/parallel-coordinates/),[ parameter importance analyzes](/models/app/features/panels/parameter-importance/), and [additional chart types](/models/app/features/panels/). +Open your project to see your live results in the W&B App dashboard. In a few clicks, build interactive charts such as [parallel coordinates plots](/models/app/features/panels/parallel-coordinates), [parameter importance analyses](/models/app/features/panels/parameter-importance), and [other chart types](/models/app/features/panels). Sweeps Dashboard example -For more information about how to visualize results, see [Visualize sweep results](./visualize-sweep-results). For an example dashboard, see this sample [Sweeps Project](https://wandb.ai/anmolmann/pytorch-cnn-fashion/sweeps/pmqye6u3). +For more information, see [Visualize sweep results](/models/sweeps/visualize-sweep-results). For an example dashboard, see this sample [Sweeps Project](https://wandb.ai/anmolmann/pytorch-cnn-fashion/sweeps/pmqye6u3). -## Stop the agent (optional) +## Optional: Stop the agent -In the terminal, press `Ctrl+C` to stop the current run. 
Press it again to terminate the agent. +In the terminal, press `Ctrl+C` to stop the current run. Press it again to end the agent.
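
Reviewer note: to make the random search described in the "Define the search space" section concrete, the following is a minimal local sketch of what `'method': 'random'` with `"goal": "minimize"` amounts to. It does not call W&B and is not W&B's implementation. The parameter ranges, the `objective` function, and the helper names `sample_parameters` and `local_random_search` are hypothetical, and it assumes that a `min`/`max` range is sampled uniformly.

```python
import random

# Hypothetical search space, written in the same shape as a sweep configuration.
sweep_configuration = {
    "method": "random",
    "metric": {"goal": "minimize", "name": "score"},
    "parameters": {
        "x": {"min": 0.01, "max": 0.1},  # continuous range (assumed uniform)
        "y": {"values": [1, 3, 7]},      # discrete choices
    },
}


def sample_parameters(parameters, rng):
    """Draw one random value for each parameter in the search space."""
    sampled = {}
    for name, spec in parameters.items():
        if "values" in spec:
            sampled[name] = rng.choice(spec["values"])
        else:
            sampled[name] = rng.uniform(spec["min"], spec["max"])
    return sampled


def objective(x, y):
    """Hypothetical stand-in for a training function that returns a score."""
    return 0.1 * y + x ** 2


def local_random_search(config, count, seed=0):
    """Try `count` random combinations and keep the best one for the goal."""
    rng = random.Random(seed)
    minimize = config["metric"]["goal"] == "minimize"
    best = None
    for _ in range(count):
        params = sample_parameters(config["parameters"], rng)
        score = objective(**params)
        improved = best is None or (
            score < best["score"] if minimize else score > best["score"]
        )
        if improved:
            best = {"score": score, "params": params}
    return best


best = local_random_search(sweep_configuration, count=10)
```

In an actual sweep, `wandb.sweep()` and `wandb.agent()` take the place of the loop, and the training function logs `score` so that W&B can compare runs.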