Skip to content

Conversation

@victorlin
Copy link
Member

@victorlin victorlin commented Sep 22, 2025

Description of proposed changes

The augur subsample command is built to replace workflow-specific augur filter calls with a generic augur subsample call.

Note that this is a breaking change and the old configuration will no longer work.


WNV is the only other repo that is using augur subsample. The RSV workflow is much more complicated with a matrix of subtype (A/B), sequence type (genome/G/F), and focal timespan (all-time/6y/3y). For this reason, the config is structured differently (rule-first instead of build-first).

I've used YAML "variables" (anchors and aliases) to construct a config that works with augur subsample. Unfortunately, this results in a much more complicated config file, which is not great for users. It's somewhat justified by the greater flexibility that comes with augur subsample, but I'm open to other ideas.

Related issue(s)

Closes #101

Checklist

@victorlin victorlin self-assigned this Sep 22, 2025
@victorlin victorlin force-pushed the victorlin/use-augur-subsample branch from e869e31 to 18232af Compare September 22, 2025 22:24
@victorlin victorlin mentioned this pull request Sep 22, 2025
2 tasks
@victorlin victorlin linked an issue Sep 22, 2025 that may be closed by this pull request
The augur subsample command is built to replace workflow-specific augur
filter calls with a generic augur subsample call.

Note that this is a breaking change and the old configuration will no
longer work.
@victorlin victorlin force-pushed the victorlin/use-augur-subsample branch from 18232af to b0d6728 Compare September 24, 2025 20:20
genome/all-time:
samples:
all-time:
<<: [*subsample_genome, *subsample_all-time, *subsample_defaults]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I'm not sure this pattern of YAML anchors/aliases is very user friendly. If we do use anchors/aliases, then we should probably disable them in the dumped config so that users can at least see the fully expanded config in results/run_config.yaml (see suggested workaround in yaml/pyyaml#103).

I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu. These could be expanded at start of Snakemake (or whatever comes of discussion in nextstrain/public#23).

Copy link
Member

@jameshadfield jameshadfield Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do use anchors/aliases, then we should probably disable them in the dumped config

Yes, agreed. (I'd go further: we should write out small-multiples with lots of duplication, but that's beyond this PR.)

I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu.

Across the 9 builds almost all of the difference is the subsampling parameters, so at some level it is going to be either complex or verbose. I don't think glob-like syntax will help here with the structure as it is now, but it could if you moved towards something more like:

subsample:
  samples:
    min_length:
      "genome/*": 10_000
      "G/*": 600
      "F/*": 1_200
    group_by: "year country"
    min_coverage: 0.3
    resolutions:
      "*/all-time": {"min_date": "1795-01-01"}
      "*/6y": {"min_date": "6Y", "background_min_date": "1975-01-01"}
      "*/3y": {"min_date": "3Y", "background_min_date": "1975-01-01"}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't think we should use YAML anchors and aliases, and would like to decide on a good alternative in nextstrain/public#27. I'll consider something like the above with config pre-processing to translate into augur subsample-ready config.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my ability to follow this thread, the reason against using YAML anchors and aliases:

  • not sure if the pattern of YAML anchors/aliases is very user friendly (for which users? A google search is returning several articles recommending YAML anchors (example) to avoid YAML duplication, but maybe these are not our user group or feel free to add comments from WA-DOH/others etc.)
  • requires resolving the anchors explicitly in the dumped config (similar lift as config pre-processing to translate into augur subsample-ready config)

Feel free to add to the above list, mostly looking for a summary statement

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if the pattern of YAML anchors/aliases is very user friendly

I don't think we're advising against anchors per-se -- they can be very useful! -- rather I think the pushback was against the specific usage of anchors in this config. (The pushback wasn't from me, so I won't elaborate.)

requires resolving the anchors explicitly in the dumped config

Anchors are resolved automatically, i.e. yaml.dump(yaml.load(...)) will not preserve anchors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anchors are resolved automatically, i.e. yaml.dump(yaml.load(...)) will not preserve anchors.

I got this wrong (see nextstrain/shared#62), at least partially. Anchors of primitive types won't be preserved, but anchors of more complex types are preserved. E.g. the following YAML

foo: &foo
  a: 1
  b: 2
bar: *foo

x: &x 42
y: *x

when roundtripped through yaml.dump(yaml.load(...)) will keep one of the anchors and drop the other:

bar: &id001
  a: 1
  b: 2
foo: *id001
x: 42
y: 42

@victorlin victorlin mentioned this pull request Oct 7, 2025
3 tasks
@victorlin
Copy link
Member Author

I'll wait for a decision in nextstrain/public#27 before continuing here.

@victorlin victorlin marked this pull request as draft October 7, 2025 02:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use augur subsample

5 participants