Description
I'm trying to create a custom RSV Nextclade dataset, following the tutorial at https://docs.nextstrain.org/en/latest/tutorials/creating-a-phylogenetic-workflow.html#annotate-the-phylogeny and the dataset creation guide at https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-creation-guide.md.
I have two questions:
1. For the reference tree, I used the same sequences as provided in the "Underlying data" from https://nextstrain.org/rsv/a/genome/6y. I also used the same parameters in pathogen.json as the official dataset. However, my QC results differ significantly from the official Nextclade results: many samples that pass QC with the official dataset are marked as "bad" with my custom dataset. What could be causing this discrepancy?
2. I noticed that the official Nextclade datasets use EPI_ISL_412866 (for RSVA) and OP975389 (for RSVB) as references, while many academic publications, such as the Nature Communications paper "Distinct patterns of within-host virus populations between two subgroups of human respiratory syncytial virus", use NC_038235 (RSVA) and NC_001781 (RSVB) as references. What's the rationale behind these different reference choices?
Would appreciate any insights into these questions. Thank you!
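
To make the discrepancy in the first question easier to pin down, here is a minimal sketch of how the two result sets could be compared sample by sample. It assumes both runs were exported as Nextclade TSV files with `seqName` and `qc.overallStatus` columns. The helper names (`load_qc`, `qc_discrepancies`) and the inline toy data are hypothetical and only stand in for the real output files.

```python
import csv
import io


def load_qc(tsv_text):
    """Map each seqName to its qc.overallStatus from Nextclade TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["seqName"]: row["qc.overallStatus"] for row in reader}


def qc_discrepancies(official, custom):
    """List (name, official_status, custom_status) for samples that disagree."""
    shared = official.keys() & custom.keys()
    return sorted(
        (name, official[name], custom[name])
        for name in shared
        if official[name] != custom[name]
    )


# Toy stand-ins for the two TSV exports (assumed column names, made-up samples).
OFFICIAL_TSV = "seqName\tqc.overallStatus\nsampleA\tgood\nsampleB\tgood\n"
CUSTOM_TSV = "seqName\tqc.overallStatus\nsampleA\tgood\nsampleB\tbad\n"

if __name__ == "__main__":
    rows = qc_discrepancies(load_qc(OFFICIAL_TSV), load_qc(CUSTOM_TSV))
    for name, official_status, custom_status in rows:
        print(name, official_status, custom_status, sep="\t")
```

With real files, the text passed to `load_qc` would come from reading each TSV. Listing the per-rule QC columns for the disagreeing samples would then show which rule drives the "bad" status.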