Questions about creating a custom RSVA nextclade dataset with NC_038235 reference #237

@YangJingqii

I'm trying to create a custom RSV-A Nextclade dataset, following the tutorial at https://docs.nextstrain.org/en/latest/tutorials/creating-a-phylogenetic-workflow.html#annotate-the-phylogeny and the dataset creation guide at https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-creation-guide.md.

I have two questions:

1. For the reference tree, I used the same sequences as provided in the Underlying data from https://nextstrain.org/rsv/a/genome/6y. I also used identical parameters in pathogen.json as the official dataset. However, my QC results differ significantly from the official nextclade results - many samples that pass QC in the official dataset are marked as "bad" in my custom dataset. What could be causing this discrepancy?
2. I noticed that the official nextclade datasets use EPI_ISL_412866 (for RSVA) and OP975389 (for RSVB) as references, while many academic publications, such as Nature Communications' "Distinct patterns of within-host virus populations between two subgroups of human respiratory syncytial virus", use NC_038235 (RSVA) and NC_001781 (RSVB) as references. What's the rationale behind these different reference choices?
Would appreciate any insights into these questions. Thank you!
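
To make question 1 concrete, here is a minimal sketch of how the QC results of the two runs could be compared. It assumes both runs were exported as Nextclade TSV files containing `seqName` and `qc.overallStatus` columns. The helper names, sample names, and inline data are hypothetical and only illustrate the comparison.

```python
import csv
import io


def read_qc_status(tsv_text):
    """Map seqName -> qc.overallStatus from the text of a Nextclade TSV.

    Assumes the TSV has 'seqName' and 'qc.overallStatus' columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["seqName"]: row["qc.overallStatus"] for row in reader}


def qc_disagreements(official, custom):
    """Return {seqName: (official_status, custom_status)} where the two differ.

    Only sequences present in both result sets are compared.
    """
    shared = official.keys() & custom.keys()
    return {
        name: (official[name], custom[name])
        for name in sorted(shared)
        if official[name] != custom[name]
    }


# Made-up stand-ins for the two TSV exports (official vs. custom dataset).
official_tsv = "seqName\tqc.overallStatus\nsampleA\tgood\nsampleB\tgood\n"
custom_tsv = "seqName\tqc.overallStatus\nsampleA\tgood\nsampleB\tbad\n"

diff = qc_disagreements(read_qc_status(official_tsv), read_qc_status(custom_tsv))
print(diff)
```

With real files, the text would come from reading each TSV from disk. Listing the disagreeing samples this way makes it easier to check which individual QC rules drive the difference.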
