Skip to content

Conversation

@victorlin
Copy link
Member

@victorlin victorlin commented Sep 22, 2025

Description of proposed changes

Small adjustments to config, particularly around filtering and subsampling.

Related issue(s)

Preparing for #103

Checklist

  • Checks pass
  • Update changelog

augur filter has never taken a reference file as input, so my guess is
this was originally copied by mistake.
Preparing to make changes to the config that would have broken this. It
seems fine to hardcode along with the already hardcoded sample size.
This will make it easier to compare changes in the switch to augur
subsample which will define all parameters in config.
@victorlin victorlin self-assigned this Sep 22, 2025
Prevents unintentional file pattern matching by rule inputs/outputs.
This shouldn't rely on config from another rule.
@victorlin victorlin mentioned this pull request Sep 22, 2025
3 tasks
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General config improvements make sense to me 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these changes were made before you added the shared write_config function. Update to use shared/vendored?

"""
input:
sequences="data/{a_or_b}/sequences.fasta",
reference="config/{a_or_b}reference.gbk",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

augur filter has never taken a reference file as input, so my guess is
this was originally copied by mistake.

I wondered if this was added to enforce the order in which rules are run. However, based on initial workflow that just uses config directly, agree this was probably copied by mistake.

--group-by {params.group_by} \
--subsample-max-sequences {params.subsample_max_sequences} \
--query '({params.min_coverage}) & missing_data<1000'
--query '({params.min_coverage}) & missing_data<{params.missing_data_threshold}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

I was going to suggest combining these two params into a single query param, but maybe not necessary to do here since we'll switch to augur subsample ~soon.

Comment on lines +9 to +10
build_name=r"genome|G|F",
resolution=r"all-time|6y|3y",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to suggest these should be dynamic

Suggested change
build_name=r"genome|G|F",
resolution=r"all-time|6y|3y",
build_name="|".join(config["builds_to_run"]),
resolution="|".join(config["resolutions_to_run"]),

However I then realized that these already hardcoded due to the values in the config/distance_maps.tsv. Probably good to add a note here and add to config validation in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants