Tango is a Snakemake-based RNA-seq processing pipeline built for time-series experiments where you want reliable gene- and transcript-level quantification (Salmon) against a reference genome/annotation, with an optional genome-guided de novo transcriptome assembly (via StringTie) for exploring novel isoforms. Starting from paired-end FASTQ files, Tango performs read trimming and QC, splice-aware alignment, transcript assembly/merging, annotation comparison, assembly QC (e.g., BUSCO), and Salmon quantification against a reference transcriptome and/or a StringTie-merged transcriptome. Where possible, steps are streamed/piped between tools (e.g., aligner → BAM processing) to avoid writing large intermediate files and to keep scratch usage manageable.
Why the name Tango? We originally set out to choreograph the entire time-series rhythmicity analysis end to end, but downstream rhythm calling is often an interactive and lab-specific process. We decided to stop Tango here and focus on the steps that benefit most from Snakemake-style automation on an HPC: robust RNA-seq processing and fast, scalable gene- and transcript-level quantification (reference and/or genome-guided de novo). The pipeline was built to parallelize cleanly across many samples via SLURM job scheduling for large time-series experiments (>100 paired-end samples). In the future we may extend Tango with optional analysis rules for common rhythmicity workflows such as cosinor or JTK_CYCLE, while keeping the core pipeline modular so labs can mix and match or swap downstream methods as needed.
- Rulegraph (SVG): rulegraph.svg
Given paired-end RNA-seq reads and a reference genome/annotation, Tango runs:
- fastp: adapter/quality trimming + per-sample HTML/JSON QC
- HISAT2: splice-aware alignment to the genome (with exon/splice-site guided index)
- samtools: BAM sorting + indexing
- deepTools: BigWig generation for genome browser visualization
- StringTie: per-sample transcript assembly, merge, and Ballgown tables
- prepDE.py: gene/transcript count matrices (DESeq2-ready CSVs)
- gffcompare: comparison of merged transcripts vs reference annotation (class codes, tracking, stats)
- gffread: FASTA export of merged transcriptome from merged GTF
- BUSCO: assembly QC on denovo transcriptome, plus optional BUSCO on a provided reference transcriptome
- Salmon (decoy-aware): quantification against both:
  - the reference transcriptome (ref_quant)
  - the denovo transcriptome (denovo_quant)
Key locations:
- tango/SnakeFiles/Snakefile: main workflow
- tango/SnakeFiles/tango_Snakemake_config.yaml: user config (paths + output dir)
- tango/SnakeFiles/ClusterProfiles/slurm/: example SLURM executor profile
- tango/RawData/: optional place to put reads (see input structure below)
- tango/Genome/: optional place to put genome FASTA + GFF
- tango/Utils/: helper scripts used by the workflow
- tango_envi.yml: conda environment for Tango
Your output directories do NOT need to live inside the tango/ repository.
For HPC usage it is typically best to write outputs to scratch space, e.g. /scratch/.../tango_output_run1/.
This is controlled by:
- output_dir: in tango/SnakeFiles/tango_Snakemake_config.yaml
The paths currently in that config file are examples left over from a previous analysis; replace them with your own.
Similarly, your raw reads do not need to be copied into the repo:
- rawdata_dir: in the same config (can be an absolute scratch path)
Keeping large, frequently-written outputs on scratch will:
- avoid filling home/project quotas,
- speed up I/O-heavy steps.
You’ll need a working conda (or mamba) install.
Tools included in the provided environment:
- snakemake
- fastp, hisat2, samtools, deeptools
- stringtie, gffcompare, gffread
- busco
- salmon
- multiqc
- (optional) snakemake SLURM executor plugin
From the repo root:
conda env create -f tango_envi.yml
conda activate tango
Tip: You may want to delete the prefix: line in tango_envi.yml if it points to a path that doesn't exist on your system.
Tango expects paired-end reads named:
- {sample}_1.fq.gz
- {sample}_2.fq.gz
…and organized like:
RAW_DIR/
├── SampleA
│ ├── SampleA_1.fq.gz
│ ├── SampleA_2.fq.gz
├── SampleB
│ ├── SampleB_1.fq.gz
│ ├── SampleB_2.fq.gz
└── ...
The {sample} name is discovered automatically from files matching:
{rawdata_dir}/{sample}/{sample}_1.fq.gz
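To preview which samples that pattern would pick up before launching Snakemake, a small standalone check can help. The following is a minimal sketch using only the Python standard library; `discover_samples` is a hypothetical helper and is not taken from the Tango Snakefile. Requiring the `_2` mate at discovery time is an assumption made here because Tango expects paired-end input.

```python
from pathlib import Path


def discover_samples(rawdata_dir):
    """Return sorted sample names found under rawdata_dir.

    Follows the documented layout {rawdata_dir}/{sample}/{sample}_1.fq.gz.
    Illustrative only; the real workflow may discover samples differently.
    """
    suffix = "_1.fq.gz"
    samples = []
    for r1 in Path(rawdata_dir).glob("*/*" + suffix):
        sample = r1.name[: -len(suffix)]
        # The subfolder name must equal the file prefix exactly.
        if r1.parent.name != sample:
            continue
        # Assumption: only keep samples that also have the second mate.
        if (r1.parent / f"{sample}_2.fq.gz").is_file():
            samples.append(sample)
    return sorted(samples)
```

Calling `discover_samples("/path/to/RAW_DIR")` on the layout shown above should list `SampleA` and `SampleB`; an empty list usually points to a naming mismatch between a folder and its files.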
You provide paths in the config:
- fasta: reference genome FASTA (.fa, .fasta, etc.)
- gff: reference annotation GFF (.gff, .gff3, etc.)
- trans_fasta: reference transcriptome FASTA (used for Salmon ref_quant + optional BUSCO)
If your genome is large (≥4 Gbp), HISAT2 may require the ht2l index format. Control this with:
- hisat2_index_extension: "ht2" or "ht2l"
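If you are unsure which value applies, you can total the sequence lengths in your genome FASTA and compare against the threshold. This is a rough sketch: the 4,000,000,000 bp cutoff simply mirrors the "≥4 Gbp" guidance above and should be treated as approximate, and both function names are hypothetical.

```python
import gzip

# Cutoff taken from the "≥4 Gbp" guidance above; approximate, not exact.
LARGE_GENOME_BP = 4_000_000_000


def genome_length(fasta_path):
    """Sum sequence lengths in a FASTA file (plain text or .gz)."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    total = 0
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            # Header lines start with '>' and carry no sequence.
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def suggest_index_extension(fasta_path, threshold=LARGE_GENOME_BP):
    """Return 'ht2l' for genomes at or above the threshold, else 'ht2'."""
    return "ht2l" if genome_length(fasta_path) >= threshold else "ht2"
```

The result is only a suggestion for the `hisat2_index_extension` field; if HISAT2 reports that it is building a large index, trust that message over this estimate.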
Edit:
./tango/SnakeFiles/tango_Snakemake_config.yaml
Key fields:
- rawdata_dir: directory containing sample subfolders with FASTQs (can be absolute)
- fasta: path to genome FASTA
- gff: path to genome annotation GFF
- trans_fasta: path to reference transcriptome FASTA (recommended if available)
- hisat2_index_extension: ht2 (default) or ht2l for large genomes
- output_dir: where all outputs will be written (scratch recommended)
- utils_dir: leave as-is unless you moved the repo structure
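A quick sanity check on these fields can catch typos in paths before any jobs are submitted. The sketch below operates on an already-loaded config dictionary; `check_config` is a hypothetical helper, and treating `rawdata_dir`, `fasta`, `gff`, and `output_dir` as required (with `trans_fasta` optional) is an assumption based on the descriptions above, not on the Snakefile itself.

```python
from pathlib import Path

# Assumed required keys; trans_fasta is described as optional above.
REQUIRED_KEYS = ("rawdata_dir", "fasta", "gff", "output_dir")
INPUT_PATH_KEYS = ("rawdata_dir", "fasta", "gff", "trans_fasta")
VALID_INDEX_EXTENSIONS = ("ht2", "ht2l")


def check_config(cfg):
    """Return a list of human-readable problems found in the config dict."""
    problems = []
    for key in REQUIRED_KEYS:
        if not cfg.get(key):
            problems.append(f"missing value for '{key}'")
    # Input paths must already exist; output_dir may be created later.
    for key in INPUT_PATH_KEYS:
        value = cfg.get(key)
        if value and not Path(value).exists():
            problems.append(f"'{key}' path does not exist: {value}")
    ext = cfg.get("hisat2_index_extension", "ht2")
    if ext not in VALID_INDEX_EXTENSIONS:
        problems.append(f"hisat2_index_extension must be ht2 or ht2l, got: {ext}")
    return problems
```

The dictionary could be loaded with, for example, PyYAML's `yaml.safe_load` on the config file; an empty list from `check_config` means none of these basic checks failed.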
From tango/SnakeFiles/:
cd tango/SnakeFiles
snakemake -j 8 -p
Adjust -j to match your available cores.
This repo includes a SLURM executor profile:
tango/SnakeFiles/ClusterProfiles/slurm/config.yaml
From tango/SnakeFiles/:
cd tango/SnakeFiles
snakemake --workflow-profile ClusterProfiles/slurm -p
There is also an example sbatch launcher:
tango/SnakeFiles/execute_snakemake.sbatch
You will likely want to edit the SLURM partition/account/QOS/email settings in the profile and sbatch script to match your cluster.
If you're new to Snakemake, it's a good idea to start with a dry run to confirm that Snakemake can find your inputs and that the planned jobs look correct.
A dry-run shows what would run without executing anything:
cd tango/SnakeFiles
snakemake -n -p
You can visualize the dependency graph (DAG) of the workflow. This requires graphviz (dot) to be available on your system.
From tango/SnakeFiles/:
# Generate a DAG file and render it to PNG
snakemake --dag | dot -Tpng > tango_dag.png
If you want to see rule-level dependencies (often cleaner):
snakemake --rulegraph | dot -Tpng > tango_rulegraph.png
If you’re unfamiliar with Snakemake, the official docs are a great starting point:
- Snakemake documentation: https://snakemake.readthedocs.io/
- Snakemake CLI reference: https://snakemake.readthedocs.io/en/stable/executing/cli.html
All outputs are written under output_dir and organized into subfolders:
- fastp/: trimmed FASTQs + fastp reports
- hisat2_index/: HISAT2 index + exon/splice-site files + logs
- hisat2_samtools/: sorted BAMs, BAI indices, BigWigs, FASTA index, logs
- stringtie/: per-sample assemblies, merged GTF/FASTA, Ballgown tables, prepDE count matrices
- gff_compare/: gffcompare outputs + class code summaries
- busco/: BUSCO runs (denovo transcriptome + reference transcriptome if provided)
- salmon/
  - ref_quant/: Salmon index + per-sample quants vs reference transcriptome
  - denovo_quant/: Salmon index + per-sample quants vs denovo transcriptome
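For downstream time-series work you will usually want the per-sample Salmon results in a single table. The sketch below gathers TPM values from Salmon's standard `quant.sf` files (tab-separated, with `Name` and `TPM` columns). It assumes each sample has its own subfolder directly under the quant directory, e.g. `salmon/ref_quant/{sample}/quant.sf`; that layout is a guess and is not confirmed by this README, and `collect_tpm` is a hypothetical helper.

```python
import csv
from pathlib import Path


def collect_tpm(quant_root):
    """Build {transcript: {sample: TPM}} from {quant_root}/{sample}/quant.sf.

    The per-sample subfolder layout is an assumption; adjust the glob
    pattern if your output directory is organized differently.
    """
    table = {}
    for quant_file in sorted(Path(quant_root).glob("*/quant.sf")):
        sample = quant_file.parent.name
        with open(quant_file, newline="") as handle:
            # quant.sf is tab-separated with a header row.
            for row in csv.DictReader(handle, delimiter="\t"):
                table.setdefault(row["Name"], {})[sample] = float(row["TPM"])
    return table
```

Folders without a `quant.sf` (such as an index directory) are skipped by the glob, so the same call should work for either `ref_quant/` or `denovo_quant/` if the layout assumption holds.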
- Disk usage: BAMs, BigWigs, Salmon indices, and BUSCO downloads can be large; use scratch (output_dir) whenever possible.
- Sample discovery: if Snakemake finds zero samples, double-check:
  - subfolder names match file prefixes exactly
  - files are named {sample}_1.fq.gz and {sample}_2.fq.gz
- Reference transcriptome: if you don't have one, you can still run most of the pipeline, but ref_quant and reference BUSCO require trans_fasta.
- Re-running: Snakemake will skip completed outputs; delete specific output folders (or targets) if you want to force regeneration.
Tango wraps widely used open-source bioinformatics tools (fastp, HISAT2, samtools, deepTools, StringTie, gffcompare/gffread, BUSCO, Salmon) via Snakemake. Please cite the underlying tools appropriately in publications.
Portions of this documentation were drafted with assistance from ChatGPT (GPT-5.2 Thinking, OpenAI; February 2026) and were reviewed/edited by Daniel Kunk.
