ssi_analysis_utility is a modular, reproducible bioinformatics pipeline built for bacterial genome analysis.
It integrates various tools for antimicrobial resistance profiling, sequence typing, virulence detection, and more. Designed for flexibility and transparency, it leverages Snakemake for workflow management and nbdev for literate programming.
- π Getting Started
- π Project Structure
- π§« Supported Databases
- π Running Pipeline
- 𧬠Coding: Species Additions
- π§ͺ Conda (Miniconda or Anaconda) >=24.7.1
- 𧬠Snakemake
For inspiration, inspect the example sheet found in examples/samplesheet.tsv
| sample_name | Illumina_read_files | Nanopore_read_file | assembly_file | organism | variant | notes |
|---|---|---|---|---|---|---|
| SRR10518319 | examples/Dataset/reads/SRR10518319_1.fastq.gz,examples/Dataset/reads/SRR10518319_2.fastq.gz | Na | SRR10518319.fasta | Clostridioides difficile | Na | ST2 |
-
Clone the repository:
git clone https://github.com/your-org/ssi_analysis_utility.git cd ssi_analysis_utility -
Create and activate the conda environment:
conda env create -f conda.dev.env.yaml conda activate ssi_analysis_dev
Input manager configuration:
The main configuration file (ssi_analysis_utility/config/config.yaml) specifies default input files and defines the databases to be updated using the specified environment during the pipeline execution.
####################### INPUT MANAGER #######################################
input_manager:
path: examples/samplesheet.tsv # Path to the input samplesheet (CSV/TSV file)
database_path: resources
config_species: workflow/configs_species/
out_folder: examples/Results # Folder where the analysis results will be saved
###################### ANALYSIS & TOOLS SETTINGS ############################
analysis_settings:
kmeraligner:
yaml: ../envs/kmeraligner.yaml
database: kmeraligner
Clostridioides_difficile_db:
yaml: ../envs/DatabaseFetch.yaml
database: Clostridioides_difficile_db
.
.
.
skesa:
yaml : ../envs/skesa.yamlSpecies-specific configurations:
Certain species may require different tools than those currently defined or tool-specific parameters that differ from those used in other species. The options which distinguish them are defined in the species-specific configuration files, such as Clostridioides difficile specific (ssi_analysis_utility/workflow/configs_species/C.diff.yaml)
analyses_to_run:
kmeraligner:
status : False
Title : Kmer Aligner on two pair reads
ID: 90
additional_option : -matrix
database : resources/plasmidfinder_db/enterobacteriales
wrangler: workflow/scripts/KMA_wrangler.py
Cdiff_KMA_Toxin:
status : True
Title : Kmer Aligner for Clostridium difficile Toxin detection
ID: 91
additional_option : -ref_fsa -nf -sam 4 -vcf 2
database : resources/Clostridioides_difficile_db/Toxin/
.
.
.
bcftools_view_filter:
status: True
Title: Filter INDELs from mpileup
ID: 91
additional_option: -v indels -i 'INFO/DP>10'
region: AM180355.1_tcdC_804309_805008The core Python module implements the logic for sample management, configuration parsing, and pipeline execution. This package is designed to be importable and modular.
Contains all Snakemake-related files: rule definitions, species-specific configs, conda environments, and helper scripts.
Includes test data, example results, and template samplesheets to demonstrate pipeline functionality and structure.
Stores the important external datasets and pre-built KMA or tool-specific databases (e.g., for AMRFinder, kmeraligner, etc.).
Notebook-based documentation and exploratory development environment built using nbdev. These include tutorials, demos, and tests.
π CLICK TO EXPAND: FULL PROJECT TREE STRUCTURE WITH EXAMPLES FOR SPECIES SAMPLE
The structure below shows all information within the directory after running two snakemake pipelines
- Updating all databases
- Running species-specific pipeline for one sample SRR10518319 belonging to the species Clostridioides difficile
.
βββ LICENSE
βββ Logs
βΒ Β βββ Databases
βΒ Β βΒ Β βββ setup_AMRFinder.log
βΒ Β βΒ Β βββ setup_CdiffTRST.log
βΒ Β βΒ Β βββ setup_CdiffToxin.log
βΒ Β βΒ Β βββ setup_DisinFinder.log
βΒ Β βΒ Β βββ setup_EcoliKmerAligner.log
βΒ Β βΒ Β βββ setup_PlasmidFinder.log
βΒ Β βΒ Β βββ setup_PointFinder.log
βΒ Β βΒ Β βββ setup_ResFinder.log
βΒ Β βΒ Β βββ setup_VirulenceFinder.log
βΒ Β βββ SRR10518319
βΒ Β βΒ Β βββ Cdiff_KMA_Toxin.log
βββ MANIFEST.in
βββ README.md
βββ conda.dev.env.yaml
βββ config
βΒ Β βββ config.yaml
βββ examples
βΒ Β βββ Dataset
βΒ Β βΒ Β βββ assemblies
βΒ Β βΒ Β βΒ Β βββ SRR10518319.fasta
βΒ Β βΒ Β βββ reads
βΒ Β βΒ Β βββ SRR10518319_1.fastq.gz
βΒ Β βΒ Β βββ SRR10518319_2.fastq.gz
βΒ Β βΒ Β βββ dl_script.sh
βΒ Β βββ Log
βΒ Β βΒ Β βββ SRR10518319_kma_cdiff.log
βΒ Β βββ Results
βΒ Β βΒ Β βββ SRR10518319
βΒ Β βΒ Β βΒ Β βββ Cdiff_KMA_Toxin
βΒ Β βΒ Β βΒ Β βββ GenotypeCalls
βΒ Β βΒ Β βΒ Β βββ skesa
βΒ Β βββ samplesheet.tsv
βββ input
βββ nbs
βΒ Β βββ 00_core.ipynb
βββ pyproject.toml
βββ resources
βΒ Β βββ Clostridioides_difficile_db
βΒ Β βΒ Β βββ TRST
βΒ Β βΒ Β βΒ Β βββ TR10_repeat_sequences.fa
βΒ Β βΒ Β βΒ Β βββ TR10_repeat_types.txt
βΒ Β βΒ Β βΒ Β βββ TR6_repeat_sequences.fa
βΒ Β βΒ Β βΒ Β βββ TR6_repeat_types.txt
βΒ Β βΒ Β βΒ Β βββ TRST_repeat_types.txt
βΒ Β βΒ Β βββ Toxin
βΒ Β βΒ Β βββ Cdiff_Toxin.bed6
βΒ Β βΒ Β βββ Cdiff_Toxin.comp.b
βΒ Β βΒ Β βββ Cdiff_Toxin.fasta
βΒ Β βΒ Β βββ Cdiff_Toxin.fasta.fai
βΒ Β βΒ Β βββ Cdiff_Toxin.length.b
βΒ Β βΒ Β βββ Cdiff_Toxin.name
βΒ Β βΒ Β βββ Cdiff_Toxin.seq.b
βΒ Β βΒ Β βββ Cdiff_Toxin.txt
βΒ Β βββ amrfinderplus
βΒ Β βββ disinfinder_db
βΒ Β βββ kmeraligner
βΒ Β βββ plasmidfinder_db
βΒ Β βββ pointfinder_db
βΒ Β βββ resfinder_db
βΒ Β βββ virulencefinder_db
βββ settings.ini
βββ setup.py
βββ ssi_analysis_utility
βΒ Β βββ analysis_utility.py
βΒ Β βββ config
βΒ Β βΒ Β βββ config.default.env
βΒ Β βΒ Β βββ config.default.yaml
βΒ Β βββ core.py
βΒ Β βββ pipeline_runner.py
βΒ Β βββ sample_manager.py
βββ workflow
βββ Snakefile
βββ configs_species
βΒ Β βββ C.diff.yaml
βββ envs
βΒ Β βββ DatabaseFetch.yaml
βΒ Β βββ skesa.yaml
βββ rules
βΒ Β βββ characterizers.smk
βΒ Β βββ db_setups.smk
βΒ Β βββ finders.smk
βΒ Β βββ others.smk
βββ scripts
βββ Cdiff_KMA.py
βββ genbank_fetcher.py
βββ logging_utils.py
βββ thresholds.py
When cloning the repository, the databases have not been downloaded. The source for each database is defined in the database setup snakemake file (ssi_analysis_utility/workflow/rules/db_setups.smk). The current database includes:
AMR Profiling
setup_AMRFinder: Sets up AMRFinderPlus for AMR gene and mutation detection (multi-species).setup_ResFinder: Prepares ResFinder for acquired AMR gene identification.setup_PointFinder: Indexes PointFinder for chromosomal resistance mutations (e.g., E. coli, Salmonella).setup_DisinFinder: Installs DisinFinder for disinfectant resistance gene detection.
Virulence & Plasmids
setup_PlasmidFinder: Sets up PlasmidFinder for plasmid replicon detection (e.g., E. coli).setup_VirulenceFinder: Sets up VirulenceFinder for virulence gene detection (e.g., E. coli).
Typing & Subtyping
setup_EcoliKmerAligner: Downloads and indexes k-mer gene set for E. coli typing.setup_CdiffToxin: Builds Clostridioides difficile toxin gene database from GenBank.setup_CdiffTRST: Downloads TRST repeat files for C. difficile strain typing.
-
When running the entire pipeline, the species-specific pipeline (based on the organism provided in the sample sheet) is determined, and based on the configuration files, it automatically downloads the required databases for the species-specific analysis.
-
It is also possible for the user to manually download or update all databases
snakemake --use-conda --cores all setup_all_databases
-
It is also possible for the user to manually download or update specific databases
Clostridioides difficile toxin database
snakemake --use-conda --cores 1 setup_CdiffToxin
To execute the full workflow using all available cores:
snakemake --use-conda --cores allTo re-run incomplete or failed steps:
snakemake --use-conda --cores all --rerun-incompleteTo run a specific rule (e.g., cdiff_kma):
snakemake --use-conda --cores 1 cdiff_kmaTo preview what will run without executing (dry-run):
snakemake -nResults are organized inside the out_folder variable specified within the config/config.yaml file (e.g., Results). Analysis output are stored according to {Sample}/{Tool}:
Results/
βββ sample1/
β βββ analysis_1/
β βββ analysis_2/
β βββ analysis_3/
βββ sample2/
| βββ analysis_1/
| βββ analysis_2/
| βββ...
βββ...
Depending on the chosen tools used for analysis, individual species require specific information extracted from similar output data files, defined using the species-specific config files. The final species-specific output function extracts relevant information from the output folder and appends it to the original examples/samplesheet.tsv.
Clostridioides difficile specific final output
python workflow/scripts/Cdiff_KMA.py --samplesheet examples/samplesheet.tsv --outputfile Results/Cdiff_results.tsvExtended samplesheet output file
| sample_name | Illumina_read_files | Nanopore_read_file | assembly_file | organism | variant | notes | tcdA | tcdB | tcdC | cdtAB | verbose | tcdC117 | tcdCdel | deletion_details | TRST | TR6 | TR10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRR10518319 | examples/Dataset/reads/SRR10518319_1.fastq.gz,examples/Dataset/reads/SRR10518319_2.fastq.gz | Na | SRR10518319.fasta | Clostridioides difficile | Na | ST2 | Positive | Positive | Positive | - | tcdB_283.94_100.00_99.97;tcdA_268.76_100.00_99.85;tcdC_256.69_100.00_99.86 | wt | - | - | tr046 | A001 | B038 |
The snakemake pipeline will always depend on the config files and overall structure, supporting a limited number of species. If a user needs to include a new species, several crucial files must be altered, and specific steps must be taken to update existing files.
This section describes how to extend the pipeline to accommodate new bacterial species through coding with the example of "Clostridioides difficile".
If a species requires a new database, the following files need to be altered:
config/config.yaml: add the database- a yaml entry name - new unique name
- a yaml environment - add dependencies to the
workflow/envs/DatabaseFetch.yamlor create a separate yaml file - A database name used during the pipeline. The current convention is that the yaml entry name and database name remain identical.
Clostridioides_difficile_db:
yaml: ../envs/DatabaseFetch.yaml
database: Clostridioides_difficile_db workflow/rules/db_setups.smk: Add a new Snakemake rule which downloads the database, which refers to the database specifications defined in step 1
rule setup_CdiffToxin:
conda:
config["analysis_settings"]["Clostridioides_difficile_db"]["yaml"]
output:
database = directory(f'{database_path}/{config["analysis_settings"]["Clostridioides_difficile_db"]["database"]}/Toxin')
params:
accession_loci = "AM180355.1:tcdA,tcdB,tcdC AF271719.1:cdtA,cdtB",
db_toxin = "Cdiff_Toxin"
log:
stdout = f'Logs/Databases/setup_CdiffToxin.log'
message:
"[setup_CdiffToxin]: Setting up C. difficile toxin database"
shell:
r"""
code to download the database files and create index files if necessary
"""- The "conda:", "output:", "log:", "message:" parts of the Snakemake rule are the conventions for all database download rules.
- Add the database setup rule name from step 2 to as an input in the "rule setup_all_databases:" of
workflow/rules/db_setups.smk.
rule setup_all_databases:
input:
rules.setup_EcoliKmerAligner.output.database,
.
.
rules.setup_CdiffToxin.output.database,During development, test data might be required:
examples/Dataset/reads/dl_script.sh: add the SRA link for simpler download of multiple samples from same or different species
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/ERR142064/ERR142064_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/ERR142064/ERR142064_2.fastq.gzexamples/samplesheet.tsv: add the new sample information as described inexamples/samplesheet.tsv.
Species-specific information
workflow/Snakefile: add the various naming conventions of the species to the "species_name_map" dictionary, with the key representing the "organism" column from theexamples/samplesheet.tsv, and the value representing the prefix of the species-specific configuration file. By defining various naming conventions, the workflow allows for greater flexibility of different spelling and older organism names.
species_name_map = {
"Clostridioides difficile": "C.diff",
"Clostridium difficile": "C.diff",
"C. difficile": "C.diff",
"C difficile": "C.diff",
"C. diff": "C.diff",
"C.diff": "C.diff",
"Escherichia coli": "E.coli",
"E coli": "E.coli",
"E.coli": "E.coli",
# Add more mappings here
}workflow/config_species/{species_name_map.values}: Create a species-specific configuration file, with the naming convention following the defined values from the "species_name_map" dictionary (described in step 3), e.g.workflow/configs_species/C.diff.yaml. Once created, the conventions are as follows:- All species specific configuration files must contain the same rules
- Thus, when adding a new unique species-specific snakemake rule to the C.diff.yaml, the status is set as status : True. The rule should also be copied to the E. coli.yaml (and other species) with status : False
C.diff.yaml
analyses_to_run:
kmeraligner:
status : False
Title : Kmer Aligner on two pair reads
ID: 90
additional_option : -matrix
database : resources/plasmidfinder_db/enterobacteriales
wrangler: workflow/scripts/KMA_wrangler.py
Cdiff_KMA_Toxin:
status : True
Title : Kmer Aligner for Clostridium difficile Toxin detection
ID: 91
additional_option : -ref_fsa -nf -sam 4 -vcf 2
database : resources/Clostridioides_difficile_db/Toxin/E.coli.yaml
analyses_to_run:
kmeraligner:
status : True
Title : Kmer Aligner on two pair reads
ID: 90
additional_option : -matrix
database : resources/plasmidfinder_db/enterobacteriales
wrangler: workflow/scripts/KMA_wrangler.py
Cdiff_KMA_Toxin:
status : False
Title : Kmer Aligner for Clostridium difficile Toxin detection
ID: 91
additional_option : -nc -nf -sam 4 -vcf 2
database : resources/Clostridioides_difficile_db/Toxin/workflow/rules/others.smk: Once the status of the species-specific rules have been set, the corresponding rule and expected output is defined.
rule Cdiff_KMA_Toxin:
input:
R1 = lambda wildcards: sample_to_illumina[wildcards.sample][0],
R2 = lambda wildcards: sample_to_illumina[wildcards.sample][1],
database = rules.setup_CdiffToxin.output.database
params:
db_prefix = "Cdiff_Toxin",
add_opt = lambda wildcards: species_configs[sample_to_organism[wildcards.sample]]["analyses_to_run"]["Cdiff_KMA_Toxin"]["additional_option"],
prefix = "%s/{sample}/Cdiff_KMA_Toxin/{sample}" % OUT_FOLDER
output:
aln = temp("%s/{sample}/Cdiff_KMA_Toxin/{sample}.aln" % OUT_FOLDER),
res = "%s/{sample}/Cdiff_KMA_Toxin/{sample}.res" % OUT_FOLDER,
fsa = "%s/{sample}/Cdiff_KMA_Toxin/{sample}.fsa" % OUT_FOLDER,
sam = temp("%s/{sample}/Cdiff_KMA_Toxin/{sample}.sam" % OUT_FOLDER),
vcf_gz = temp("%s/{sample}/Cdiff_KMA_Toxin/{sample}.vcf.gz" % OUT_FOLDER),
conda:
config["analysis_settings"]["Clostridioides_difficile_db"]["yaml"]
log:
stdout = 'Logs/{sample}/Cdiff_KMA_Toxin.log'
message:
"[Cdiff_KMA_Toxin]: Running Clostridium difficile kmer aligner on {wildcards.sample}"
shell:
"""
mkdir -p $(dirname {output.sam})
cmd="kma -ipe {input.R1} {input.R2} -o {params.prefix} {params.add_opt} -t_db {input.database}/{params.db_prefix} > {output.sam}"
echo "Executing command:\n$cmd\n" > {log.stdout} 2>&1
eval $cmd >> {log.stdout} 2>&1
"""workflow/Snakefile: Once these species-specific rules and their status have been defined, the expected output from those rules should be added to the rule all in the Snakefile, ensuring the correct config rules are initiated.
rule all:
input:
expand(
"%s/{sample}/kmeraligner/" %OUT_FOLDER,
sample=[
s for s in SAMPLES
if species_configs[sample_to_organism[s]]["analyses_to_run"]["kmeraligner"]["status"] is True
]
),
expand(
"%s/{sample}/Cdiff_KMA_Toxin/{sample}.sam" % OUT_FOLDER,
sample=[
s for s in SAMPLES
if species_configs[sample_to_organism[s]]["analyses_to_run"]["Cdiff_KMA_Toxin"]["status"] is True
]
),
.
.
.
include : "rules/db_setups.smk"
include : "rules/finders.smk"
include : "rules/characterizers.smk"
include : "rules/others.smk"-
Once the snakemake rules have been completed, the results is present in the defined sample specific folders. Depending on the species, different output information is relevant. As such, each species has its own data wrangling script. e.g
workflow/scripts/Cdiff_wrangler.pyorworkflow/scripts/Ecoli_wrangler.py.- When creating a new data wrangling script for a species, the convention is that it should at least take as input the
examples/samplesheet.tsv - The output should extend the columns of the
examples/samplesheet.tsvas explained in section Species-specific results β extending samplesheet
- When creating a new data wrangling script for a species, the convention is that it should at least take as input the
-
workflow/scripts/thresholds.py: contains the thresholds used for all species to filter on the input data, such as removing low-quality alignment information etc. This is to ensure that if two species need to apply the same threshold to output files created from the same Snakemake rule, it can easily be imported as opposed to redefined.
### KMAfinder thresholds
ecoli_kma_threshold = {
"stx": [98, 98],
"wzx": [98, 98],
"wzy": [98, 98],
"wzt": [98, 98],
"wzm": [98, 98],
"fliC": [90, 90],
"fli": [90, 90],
"eae": [95, 95],
"ehxA": [95, 95],
"other": [98, 98]
}
cdiff_kma_threshold = {
"tcdA": [90,90,10],
"tcdB": [90,90,10],
"tcdC": [90,90,10],
"cdtAB": [90,90,10],
"other": [98, 98,10]
}
.
.