[Rabies] Initialize Lyssavirus rabies all-clades community dataset #333

xonq · 2025-08-06T21:12:56Z

This pull request initializes a Lyssavirus rabies (rabies) Nextclade dataset with clade-subclade resolution. Created in collaboration with @kimandrews and with subject matter expertise/user input from the Massachusetts Department of Public Health. Please review the README.md for information on dataset creation and citations.

ivan-aksamentov · 2025-08-06T22:21:27Z

Thanks! Seems to be working

https://master.clades.nextstrain.org/?dataset-server=gh:xonq/nextclade_data@master@/data_output&dataset-name=community/theiagen/rabies/all-clades

As a dev I can only review the technical side. I will let our scientists check the sciency bits :)

The virus is quite diverse it seems - lots of mutations. But this is probably expected.

If you have a public repo where you prepare trees and other data for the dataset, it would be a great help to the users of your dataset if you added it to the README. We typically use a boilerplate like this in Nextstrain datasets:

nextclade_data/data/nextstrain/sars-cov-2/wuhan-hu-1/orfs/README.md

Lines 3 to 9 in 54b1567

    
| Key               | Value |
| ----------------- | ----- |
| authors           | [Cornelius Roemer](https://neherlab.org), [Richard Neher](https://neherlab.org), [Nextstrain](https://nextstrain.org) |
| reference         | `Wuhan-Hu-1/2019` |
| workflow          | https://github.com/neherlab/nextclade_data_workflows/tree/v3-sc2/sars-cov-2 |
| path              | `nextstrain/sars-cov-2/orfs` |
| clade definitions | [Nextstrain clades](https://nextstrain.org/blog/2022-04-29-SARS-CoV-2-clade-naming-2022) and [Pango lineages](https://www.nature.com/articles/s41564-020-0770-5) |

But that's not mandatory.

All looks good to me. Smooth work!

xonq · 2025-08-07T19:53:43Z

Thank you. Rabies is indeed very diverse - I contemplated creating independent datasets for each clade, but the genotyping of this "all-clades" dataset has been sufficient for our SME partners. Additionally, there were issues with sub-clade metadata quality that limit the improvements more refined datasets may provide.

I do not have a repository for tree building. I built the Nextclade dataset using the Nextstrain rabies build as a template, though I ended up deviating from it in the tree-building methodology and metadata acquisition. The methodology is hopefully adequately documented for users in this PR's README.

rneher · 2025-08-11T07:11:24Z

Thanks a lot for contributing this dataset! Overall, this looks very good. But I have a few suggestions to make it better.

- The amino acid annotation on the tree is inconsistent with the annotation in Nextclade. On the tree, CDS are called RABVgp1_N etc., while in the Nextclade annotation you have things like NP_062343. Harmonizing these annotations would be important.
- Given the diversity of the virus, I would recommend different alignment parameters. See here for an example.
- I am not sure how exactly you do things in your workflow, but we recommend aligning and translating the sequences in the tree exactly like Nextclade will align query sequences later. See here for an example. The translations can later be used in ancestral reconstruction.
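
A minimal sketch of what aligning and translating with Nextclade before ancestral reconstruction might look like, assuming Nextclade v3 and a recent Augur are installed. The function names, argument order, and output paths (`aligned.fa`, `translations/`, `muts.json`) are hypothetical and not taken from this PR; `refined.nwk` is the tree name used in the workflow listed further down the thread.

```shell
# Sketch only: function names, argument order and output paths are assumptions,
# not part of the dataset in this PR.

align_like_nextclade() {
  reference=$1    # reference FASTA, e.g. <REFERENCE.fasta>
  annotation=$2   # GFF3 annotation that ships with the dataset
  genomes=$3      # genomes that will go into the tree, e.g. <GENOMES.fa>

  # Align and translate with the same aligner that will later process queries.
  # Nextclade replaces '{cds}' with each CDS name found in the annotation.
  nextclade run \
    --input-ref "$reference" \
    --input-annotation "$annotation" \
    --output-fasta aligned.fa \
    --output-translations 'translations/{cds}.fasta' \
    "$genomes"
}

reconstruct_with_translations() {
  annotation=$1   # same GFF3 as above, so CDS names match the Nextclade ones
  shift           # remaining arguments are the CDS names to reconstruct

  # Augur replaces '%GENE' with each name passed to --genes.
  augur ancestral \
    --tree refined.nwk \
    --alignment aligned.fa \
    --annotation "$annotation" \
    --genes "$@" \
    --translations 'translations/%GENE.fasta' \
    --output-node-data muts.json \
    --inference joint
}
```

Usage would be along the lines of `align_like_nextclade <REFERENCE.fasta> <ANNOTATION.gff3> <GENOMES.fa>`, then tree building, then `reconstruct_with_translations <ANNOTATION.gff3> <CDS1> <CDS2>`. Because both steps read CDS names from the same annotation file, the names on the tree and in the Nextclade annotation should end up identical.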

xonq · 2025-08-20T17:47:06Z

Hey @rneher, just wanted to let you know that I cannot return to this to address your concerns until a later date. Not sure when, but hopefully within the next several weeks. Thanks for your suggestions, and my apologies for my ignorance of some of the standardized procedures.

RE: alignment parameters: I'm not really certain how to systematically adjust these parameters - do you have specific recommendations/procedures to determine what parameters are more ideal, or do you suggest dragging and dropping the linked pathogen.json you sent?

RE: aligning and translating like Nextclade: apart from the tree building, the workflow was performed with AUGUR. With this in mind, do these steps deviate from Nextclade in the way you're suggesting?

Alignment:

# align genomes relative to the reference
augur align \
  -s <GENOMES.fa> \
  --nthreads <CPUS> \
  --output aligned.fa \
  --reference-name <REFERENCE_ACCESSION> \
  --debug

Tree building:

performed as discussed in the README

Refinement:

augur refine \
  --tree <IQTREE.contree> \
  --alignment <ALIGNMENT.fa> \
  --metadata <METADATA.tsv> \
  --output-tree refined.nwk \
  --output-node-data branch_lengths.json \
  --keep-root \
  --metadata-id-columns <TIP_NAMES_COLUMN>

Trait application:

augur traits \
  --tree refined.nwk \
  --metadata <METADATA.tsv> \
  --output-node-data traits.json \
  --columns <COLUMN1> <COLUMN2> .. <COLUMNn> \
  --confidence \
  --metadata-id-columns <TIP_NAMES_COLUMN>

Nucleotide mutation calling:

augur ancestral \
  --tree refined.nwk \
  --alignment <ALIGNMENT.fa> \
  --output-node-data nt_muts.json \
  --root-sequence <REFERENCE.fasta> \
  --inference joint

Translation:

augur translate \
  --tree refined.nwk \
  --ancestral-sequences nt_muts.json \
  --reference-sequence <REFERENCE.gbk> \
  --output-node-data aa_muts.json

Clade mutation extraction (non-AUGUR):

extract_nextclades.py \
  -t refined.nwk \
  -m <METADATA.tsv> \
  -cc <CLADE_COLUMN1> <CLADE_COLUMN2> .. <CLADE_COLUMNn> \
  -tc <TIP_NAMES_COLUMN> \
  -aa aa_muts.json \
  -nt nt_muts.json \
  -n

Clade mutation application:

augur clades \
  --tree refined.nwk \
  --mutations nt_muts.json aa_muts.json \
  --clades clades.tsv \
  --output-node-data clades.json

Export:

augur export v2 \
  --tree refined.nwk \
  --metadata <METADATA.tsv> \
  --node-data branch_lengths.json \
    traits.json \
    nt_muts.json \
    aa_muts.json \
    clades.json \
  --lat-longs <LATITUDE_LONGITUDES.tsv> \
  --colors <COLORS.tsv> \
  --auspice-config <AUSPICE_CONFIG>.json \
  --output <FINAL_AUGUR_OUTPUT.json> \
  --metadata-id-columns <TIP_NAMES_COLUMN>
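
On the alignment-parameter question above, one possible way to explore settings before pinning anything in `pathogen.json` is to run the same test genomes through Nextclade under different alignment presets and compare the results. The sketch below assumes the installed Nextclade v3 CLI exposes `--alignment-preset` with the values `default` and `high-diversity`; the function name and the `preset_*` output directories are hypothetical, and nothing here says which preset, if any, suits rabies.

```shell
# Sketch only: the helper name and output directory names are assumptions.
# Runs Nextclade once per alignment preset so the results can be compared.
compare_presets() {
  dataset_dir=$1   # dataset directory (reference, annotation, pathogen.json, tree)
  sequences=$2     # FASTA of test genomes spanning the clades of interest

  for preset in default high-diversity; do
    nextclade run \
      --input-dataset "$dataset_dir" \
      --alignment-preset "$preset" \
      --output-all "preset_${preset}" \
      "$sequences" || return 1
  done
}
```

After `compare_presets <DATASET_DIR> <GENOMES.fa>`, each `preset_*` directory should contain its own set of Nextclade outputs. Comparing per-sequence failures, coverage, and mutation counts between them gives a rough signal of whether a setting helps. Values that appear to work can then be written into the dataset's `pathogen.json`.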

xonq added 5 commits August 6, 2025 17:23

- initialize dataset (af286ce)
- update citations (99f5a21)
- add script reference (4ad0dea)
- update doi link (c54b596)
- appropriate explanation for tree generation (e824f77)

xonq mentioned this pull request Aug 6, 2025

[Rabies] Initialize Lyssavirus rabies all-clades community dataset #332

Merged

chore: rebuild

0edca50

ivan-aksamentov requested review from corneliusroemer, kimandrews and rneher August 6, 2025 22:21

Merge branch 'nextstrain:master' into master

f6d9875

j23414 mentioned this pull request Sep 30, 2025

Add Nextclade workflow for Norovirus genotyping nextstrain/norovirus#6

Open

13 tasks
