
Conversation

@xonq
Contributor

xonq commented Aug 6, 2025

This pull request initializes a Lyssavirus rabies (rabies) Nextclade dataset with clade-subclade resolution. Created in collaboration with @kimandrews and with subject matter expertise/user input from Massachusetts Department of Public Health. Please review the README.md for information on dataset creation and citations.

@ivan-aksamentov
Member

Thanks! Seems to be working

https://master.clades.nextstrain.org/?dataset-server=gh:xonq/nextclade_data@master@/data_output&dataset-name=community/theiagen/rabies/all-clades

As a dev I can only review the technical side, so I will let our scientists check the sciency bits :)

The virus is quite diverse it seems - lots of mutations. But this is probably expected.

If you have a public repo where you prepare trees and other data for the dataset, adding it to the README would be a great help to the users of your dataset. We typically use a boilerplate like this in Nextstrain datasets:

| Key | Value |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| authors | [Cornelius Roemer](https://neherlab.org), [Richard Neher](https://neherlab.org), [Nextstrain](https://nextstrain.org) |
| reference | `Wuhan-Hu-1/2019` |
| workflow | https://github.com/neherlab/nextclade_data_workflows/tree/v3-sc2/sars-cov-2 |
| path | `nextstrain/sars-cov-2/orfs` |
| clade definitions | [Nextstrain clades](https://nextstrain.org/blog/2022-04-29-SARS-CoV-2-clade-naming-2022) and [Pango lineages](https://www.nature.com/articles/s41564-020-0770-5) |

But that's not mandatory.

All looks good to me. Smooth work!

@xonq
Contributor Author

xonq commented Aug 7, 2025

Thank you. Rabies is indeed very diverse - I contemplated creating independent datasets for each clade, but the genotyping of this "all-clades" dataset has been sufficient for our SME partners. Additionally, there were issues with sub-clade metadata quality that limit the improvements more refined datasets may provide.

I do not have a repository for tree building - I built the Nextclade dataset using the Nextstrain rabies build as a template, though I ended up deviating from it in the tree-building methodology and metadata acquisition. The methodology is hopefully adequately documented for users in this PR's README.

@rneher
Member

rneher commented Aug 11, 2025

Thanks a lot for contributing this dataset! Overall, this looks very good. But I have a few suggestions to make it better.

  • The amino acid annotation on the tree is inconsistent with the annotation in Nextclade. On the tree, CDSs are called RABVgp1_N etc., whereas in the Nextclade annotation you have things like NP_062343. Harmonizing these annotations would be important.
  • Given the diversity of the virus, I would recommend different alignment parameters. See here for an example.
  • I am not sure exactly how you do things in your workflow, but we recommend aligning and translating the sequences in the tree exactly as Nextclade will align query sequences later. See here for an example. The translations can later be used in ancestral reconstruction.
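
To make the first point concrete, here is a minimal Python sketch of a consistency check between the CDS names in the tree and those in the annotation. It assumes the tree is an Auspice v2 JSON whose `meta.genome_annotations` keys are the CDS names, and that the relevant name in the GFF3 lives in the `Name` attribute; which attribute Nextclade actually reads should be checked against its documentation. The function names, coordinates, and sample records are hypothetical; only the two identifiers come from this thread.

```python
def cds_names_from_gff3(gff3_text, attribute="Name"):
    """Collect one attribute (assumed: Name) from every CDS feature in GFF3 text."""
    names = set()
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.split("\t")
        if len(columns) != 9 or columns[2] != "CDS":
            continue  # only well-formed CDS rows are of interest
        attributes = dict(
            field.split("=", 1) for field in columns[8].split(";") if "=" in field
        )
        if attribute in attributes:
            names.add(attributes[attribute])
    return names


def cds_names_from_auspice(tree_json):
    """Collect annotation keys from an Auspice v2 JSON, ignoring the 'nuc' entry."""
    annotations = tree_json.get("meta", {}).get("genome_annotations", {})
    return {name for name in annotations if name != "nuc"}


# Hypothetical sample inputs; coordinates are placeholders, not real rabies positions.
gff3 = "\n".join([
    "##gff-version 3",
    "ref\tRefSeq\tCDS\t1\t300\t.\t+\t0\tID=cds-1;Name=NP_062343",
])
tree = {
    "meta": {
        "genome_annotations": {
            "nuc": {"start": 1, "end": 300},
            "RABVgp1_N": {"start": 1, "end": 300},
        }
    }
}

annotation_names = cds_names_from_gff3(gff3)
tree_names = cds_names_from_auspice(tree)
only_in_tree = sorted(tree_names - annotation_names)
only_in_annotation = sorted(annotation_names - tree_names)
print("only in tree:", only_in_tree)
print("only in annotation:", only_in_annotation)
```

Any name reported on either side points to a CDS whose amino acid mutations on the tree would not line up with the ones Nextclade reports for query sequences.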

@xonq
Contributor Author

xonq commented Aug 20, 2025

Hey @rneher, just wanted to reply and let you know that I cannot return to this to address your concerns until a later date. Not sure when, but hopefully within the next several weeks. Thanks for your suggestions, and my apologies for my ignorance of some of the standardized procedures.

RE: alignment parameters: I'm not really certain how to systematically adjust these parameters - do you have specific recommendations/procedures to determine what parameters are more ideal, or do you suggest dragging and dropping the linked pathogen.json you sent?
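
For the "drag and drop" option, a minimal Python sketch of copying only the `alignmentParams` block from a donor `pathogen.json` into this dataset's `pathogen.json`, leaving every other field untouched. The helper name is hypothetical, and the parameter keys and values shown are placeholders rather than recommendations; the real ones would come from the linked example.

```python
import copy
import json


def with_alignment_params(target, donor):
    """Return a copy of `target` carrying the donor's alignmentParams block."""
    merged = copy.deepcopy(target)
    if "alignmentParams" in donor:
        merged["alignmentParams"] = copy.deepcopy(donor["alignmentParams"])
    return merged


# Placeholder content: the keys and values are illustrative assumptions only.
donor = {"alignmentParams": {"minSeedCover": 0.1, "excessBandwidth": 100}}
target = {"schemaVersion": "3.0.0", "attributes": {"name": "rabies"}}

merged = with_alignment_params(target, donor)
print(json.dumps(merged, indent=2))
```

In practice the two dictionaries would be read with `json.load` from the two files and the result written back with `json.dump`. Copying blindly is only a starting point; the parameters would still need to be checked against a sample of divergent rabies genomes.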

RE: workflow: apart from the tree building, the workflow was performed with Augur. With this in mind, do these steps deviate from Nextclade in the way you're suggesting?

Alignment:

# align genomes relative to the reference
augur align \
  -s <GENOMES.fa> \
  --nthreads <CPUS> \
  --output aligned.fa \
  --reference-name <REFERENCE_ACCESSION> \
  --debug

Tree building:

performed as discussed in the README

Refinement:

augur refine \
  --tree <IQTREE.contree> \
  --alignment <ALIGNMENT.fa> \
  --metadata <METADATA.tsv> \
  --output-tree refined.nwk \
  --output-node-data branch_lengths.json \
  --keep-root \
  --metadata-id-columns <TIP_NAMES_COLUMN>

Trait application:

augur traits \
  --tree refined.nwk \
  --metadata <METADATA.tsv> \
  --output-node-data traits.json \
  --columns <COLUMN1> <COLUMN2> .. <COLUMNn> \
  --confidence \
  --metadata-id-columns <TIP_NAMES_COLUMN>

Nucleotide mutation calling:

augur ancestral \
  --tree refined.nwk \
  --alignment <ALIGNMENT.fa> \
  --output-node-data nt_muts.json \
  --root-sequence <REFERENCE.fasta> \
  --inference joint

Translation:

augur translate \
  --tree refined.nwk \
  --ancestral-sequences nt_muts.json \
  --reference-sequence <REFERENCE.gbk> \
  --output-node-data aa_muts.json

Clade mutation extraction (non-AUGUR):

extract_nextclades.py \
  -t refined.nwk \
  -m <METADATA.tsv> \
  -cc <CLADE_COLUMN1> <CLADE_COLUMN2> .. <CLADE_COLUMNn> \
  -tc <TIP_NAMES_COLUMN> \
  -aa aa_muts.json \
  -nt nt_muts.json \
  -n

Clade mutation application:

augur clades \
  --tree refined.nwk \
  --mutations nt_muts.json aa_muts.json \
  --clades clades.tsv \
  --output-node-data clades.json
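
For readers unfamiliar with the `clades.tsv` input, `augur clades` reads a tab-separated file with the columns `clade`, `gene`, `site`, and `alt`, where nucleotide mutations use `nuc` as the gene. The Python sketch below writes such a file; the helper name, clade label, and mutations are hypothetical placeholders, not actual rabies clade definitions and not the output of `extract_nextclades.py`.

```python
import csv
import io


def clades_tsv(definitions):
    """Serialize (clade, gene, site, alt) tuples in the layout augur clades reads."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["clade", "gene", "site", "alt"])
    for clade, gene, site, alt in definitions:
        writer.writerow([clade, gene, site, alt])
    return buffer.getvalue()


# Hypothetical definitions: one nucleotide and one amino acid mutation.
example = clades_tsv([
    ("Hypothetical-A", "nuc", 123, "T"),
    ("Hypothetical-A", "RABVgp1_N", 45, "K"),
])
print(example, end="")
```

Note that the gene column must use the same CDS names as the translation node data (`aa_muts.json`), which is another place where inconsistent CDS naming would surface.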

Export:

augur export v2 \
  --tree refined.nwk \
  --metadata <METADATA.tsv> \
  --node-data branch_lengths.json \
    traits.json \
    nt_muts.json \
    aa_muts.json \
    clades.json \
  --lat-longs <LATITUDE_LONGITUDES.tsv> \
  --colors <COLORS.tsv> \
  --auspice-config <AUSPICE_CONFIG>.json \
  --output <FINAL_AUGUR_OUTPUT.json> \
  --metadata-id-columns <TIP_NAMES_COLUMN>
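
As one possible reading of the earlier suggestion to align and translate the tree sequences the way Nextclade will, the Python sketch below builds two command lines: a `nextclade run` call that would stand in for `augur align`, and an `augur ancestral` call that reuses the resulting per-CDS translations. All file names and the CDS list are hypothetical, and the flags reflect my understanding of the Nextclade v3 and Augur CLIs, so they should be verified with `--help` before use. The sketch only prints the commands; it does not run them.

```python
import shlex


def nextclade_align_command(sequences, reference, annotation, pathogen_json, out_dir):
    """Build a `nextclade run` call that writes the alignment and per-CDS translations."""
    return [
        "nextclade", "run",
        "--input-ref", reference,
        "--input-annotation", annotation,
        "--input-pathogen-json", pathogen_json,
        "--output-fasta", f"{out_dir}/aligned.fa",
        # {cds} is a literal placeholder that Nextclade expands per CDS (assumed v3 syntax)
        "--output-translations", f"{out_dir}/translations/{{cds}}.fa",
        sequences,
    ]


def augur_ancestral_command(tree, reference, annotation, genes, out_dir):
    """Build an `augur ancestral` call that reconstructs nucleotides and translations."""
    return [
        "augur", "ancestral",
        "--tree", tree,
        "--alignment", f"{out_dir}/aligned.fa",
        "--root-sequence", reference,
        "--annotation", annotation,
        "--genes", *genes,
        # %GENE is Augur's placeholder for the per-gene translation files
        "--translations", f"{out_dir}/translations/%GENE.fa",
        "--inference", "joint",
        "--output-node-data", "muts.json",
    ]


# Hypothetical file names and CDS list, for illustration only.
align_cmd = nextclade_align_command(
    "genomes.fa", "reference.fasta", "genome_annotation.gff3", "pathogen.json", "results"
)
ancestral_cmd = augur_ancestral_command(
    "refined.nwk", "reference.fasta", "genome_annotation.gff3", ["RABVgp1_N"], "results"
)
print(shlex.join(align_cmd))
print(shlex.join(ancestral_cmd))
```

With this layout the same reference, annotation, and alignment parameters are used for the tree sequences and for later query sequences, which is the consistency the suggestion asks for.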


3 participants