Tangle is an ecosystem of tools for curating and classifying proteins and transcripts across species.
There are several repositories in this project:
- tangle: (this repository) core data models, core libraries, scripts to download public data
- heap: tools to annotate proteins and transcripts, by structure or sequence similarity
- needle: HMM-based detection of protein sequences from genomic sequences, bypassing gene prediction
- pile: tools to work with RNAseq data, including assembly, searching, comparison, and quantification
Download these two Docker images:
- MMSeqs2 docker image:
docker pull ghcr.io/soedinglab/mmseqs2
- Muscle aligner:
docker pull pegi3s/muscle
Point the TANGLE_WORLD environment variable to a root directory. This is where all tangle files are located.
Set the TANGLE_AREA environment variable to the name of a focus area, e.g. "coral".
Scripts in scripts/world maintain files under TANGLE_WORLD. Scripts in
scripts/area maintain files under the focus area.
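Since every script depends on these two variables, it can help to check them up front. The following is a minimal sketch, not code from this repository: the helper names `require_env` and `tangle_settings` are hypothetical, and nothing is assumed about how directories are laid out under TANGLE_WORLD.

```python
import os
from pathlib import Path


def require_env(name: str, env=os.environ) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running tangle scripts")
    return value


def tangle_settings(env=os.environ) -> tuple[Path, str]:
    """Read the world root directory and the focus-area name from the environment.

    Hypothetical helper: it only reads the two variables and does not
    guess where the area directory lives relative to the world root.
    """
    world = Path(require_env("TANGLE_WORLD", env)).expanduser()
    area = require_env("TANGLE_AREA", env)
    return world, area
```

Calling `tangle_settings()` with no arguments reads the real environment; passing a plain dict makes the check easy to exercise in isolation.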
Download files related to an NCBI genome accession
python3 scripts/world/ncbi-download.py <accession>
Download various KEGG, Pfam, OrthoDB files
bash scripts/world/kegg-download.sh
bash scripts/world/pfam-download.sh
bash scripts/world/odb-download.sh
Genomes to be used for an area should be manually curated in the genomes.csv
file under the area directory. This CSV should follow the schema outlined in
the tangle.genomes.GenomeAccessionList table.
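Because genomes.csv is edited by hand, a quick sanity check on accession formatting may catch typos before any download starts. This is a rough sketch under stated assumptions: the column name `accession` is a placeholder (the real schema is defined by tangle.genomes.GenomeAccessionList and is not reproduced here), and the helper `check_accessions` is hypothetical. The pattern reflects the usual NCBI assembly accession form, e.g. GCF_002042975.1.

```python
import csv
import io
import re

# NCBI assembly accessions: GCF_ (RefSeq) or GCA_ (GenBank), nine digits, a dot, a version.
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def check_accessions(csv_text: str, column: str = "accession") -> list[str]:
    """Return the accessions found in the CSV text, raising on the first malformed one.

    `column` is an assumed column name, not the documented schema.
    """
    accessions = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        value = (row.get(column) or "").strip()
        if not ACCESSION_RE.match(value):
            raise ValueError(f"row {row_number}: bad accession {value!r}")
        accessions.append(value)
    return accessions
```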
To list all genomes, only those requiring protein detection (-d), or only those with NCBI-curated proteins (-n), respectively, use these scripts
python3 scripts/area/genome-list.py
python3 scripts/area/genome-list.py -d
python3 scripts/area/genome-list.py -n
You can use this with the ncbi-download.py script, e.g.
python3 scripts/area/genome-list.py | python3 scripts/world/ncbi-download.py -
To fetch taxonomy metadata (note that the taxonomy files are kept in "world" and shared across areas):
python3 scripts/area/genome-list.py | python3 scripts/world/ncbi-genome-metadata.py -
Or to get default paths, either of the following works
python3 scripts/area/genome-list.py -d | python3 scripts/defaults.py --file ncbi_genome_fna -
python3 scripts/area/genome-list.py -d | xargs python3 scripts/defaults.py ncbi_genome_fna
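The commands above rely on the common convention that a `-` argument means "read items from standard input", which is why piping and `xargs` give the same result. A minimal sketch of that convention follows; `read_items` is a hypothetical helper and is not taken from the tangle scripts, whose actual argument handling may differ.

```python
import sys
from typing import Iterable, TextIO


def read_items(args: Iterable[str], stdin: TextIO = sys.stdin) -> list[str]:
    """Expand command-line arguments, replacing "-" with lines read from stdin."""
    items = []
    for arg in args:
        if arg == "-":
            # One item per non-empty line, with surrounding whitespace removed.
            items.extend(line.strip() for line in stdin if line.strip())
        else:
            items.append(arg)
    return items
```

With this shape, `script.py GCF_002042975.1` and `echo GCF_002042975.1 | script.py -` would hand the same list to the rest of the script.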
Combine detected and reference FASTAs together
tangle-py tangle/scripts/defaults.py \
-m area_detected_proteins \
-m ncbi_genome_proteins_path \
GCF_002042975.1 | xargs cat > combined.faa
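Plain `cat` works when every input is uncompressed and ends with a newline. If either assumption might not hold, the same step could be sketched in Python as below. The function names are hypothetical, and choosing gzip by the `.gz` suffix is an assumption made for illustration.

```python
import gzip
from pathlib import Path


def open_text(path: Path):
    """Open a plain or gzip-compressed file as text, based on its suffix."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def combine_fastas(inputs: list[Path], output: Path) -> int:
    """Concatenate FASTA files into `output`; return the number of records written."""
    records = 0
    with open(output, "w") as out:
        for path in inputs:
            with open_text(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        records += 1
                    # Ensure a trailing newline so the next file's header
                    # never fuses with the last sequence line of this one.
                    out.write(line if line.endswith("\n") else line + "\n")
    return records
```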
Extract entries from FASTA
echo aten_0.1.m1.10024.m1 | \
tangle-py tangle/scripts/fasta-emit.py "$(tangle-py tangle/scripts/defaults.py -m area_experiment PM32426508)"/proteins.faa.gz -
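For readers who want to see what extracting entries by ID involves, here is a small illustrative sketch. It is not the implementation of fasta-emit.py: `emit_records` is a hypothetical function, and it assumes the record ID is the header text up to the first whitespace.

```python
from typing import Iterable, Iterator, TextIO


def emit_records(fasta: TextIO, wanted: Iterable[str]) -> Iterator[str]:
    """Yield the full text of each FASTA record whose ID is in `wanted`.

    The ID is assumed to be the header text up to the first whitespace.
    """
    wanted = set(wanted)
    keep = False
    record: list[str] = []
    for line in fasta:
        if line.startswith(">"):
            # A new header closes the previous record; emit it if it was selected.
            if keep:
                yield "".join(record)
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted
            record = [line]
        elif keep:
            record.append(line)
    if keep:
        yield "".join(record)
```

For a gzipped file such as proteins.faa.gz, the handle could come from `gzip.open(path, "rt")` in the standard library.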