benjiec/tangle

Tangle

Tangle is an ecosystem of tools for curating and classifying proteins and transcripts across species.

The project is split across several repositories:

  • tangle: (this repository) core data models, core libraries, scripts to download public data
  • heap: tools to annotate proteins and transcripts, by structure or sequence similarity
  • needle: HMM-based detection of protein sequences from genomic sequences, bypassing gene prediction
  • pile: tools to work with RNAseq data, including assembly, searching, comparison, and quantification

Docker Images

Pull these two images:

  • MMSeqs2 docker image: docker pull ghcr.io/soedinglab/mmseqs2
  • Muscle aligner: docker pull pegi3s/muscle

Environment Variables

Point the TANGLE_WORLD environment variable to a root directory. All Tangle files live under it.

Set the TANGLE_AREA environment variable to the name of a focus area, e.g. "coral".

Scripts in scripts/world maintain files under TANGLE_WORLD. Scripts in scripts/area maintain files under the focus area.
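A minimal sketch of checking both variables before running anything. The helper name load_tangle_env is hypothetical and not part of Tangle; it only verifies that the variables are set and makes no assumption about where the area directory lives.

```python
import os
from pathlib import Path


def load_tangle_env(environ=os.environ):
    """Return (world_dir, area_name) read from the two Tangle variables.

    Hypothetical helper for illustration; Tangle's own scripts may read
    these variables differently.
    """
    world = environ.get("TANGLE_WORLD")
    area = environ.get("TANGLE_AREA")
    if not world:
        raise RuntimeError("TANGLE_WORLD is not set; point it to a root directory")
    if not area:
        raise RuntimeError('TANGLE_AREA is not set; name a focus area, e.g. "coral"')
    return Path(world), area


# Example values only; substitute your own root directory and area name.
world_dir, area_name = load_tangle_env(
    {"TANGLE_WORLD": "/data/tangle", "TANGLE_AREA": "coral"}
)
print(world_dir, area_name)
```

Passing a plain dict instead of os.environ keeps the check easy to try without touching the real environment.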

World Scripts

Download files related to an NCBI genome accession

python3 scripts/world/ncbi-download.py <accession>
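NCBI assembly accessions start with GCA_ (GenBank) or GCF_ (RefSeq), followed by nine digits and a version, as in GCF_002042975.1. The sketch below checks that shape before an accession is handed to the download script. The function name is hypothetical, and whether ncbi-download.py performs a similar check or accepts other accession forms is not stated here.

```python
import re

# GCA_ (GenBank) or GCF_ (RefSeq), nine digits, a dot, then a version number.
ACCESSION_RE = re.compile(r"GC[AF]_\d{9}\.\d+")


def is_assembly_accession(text):
    """Return True if text looks like a versioned NCBI assembly accession."""
    return ACCESSION_RE.fullmatch(text.strip()) is not None


for candidate in ["GCF_002042975.1", "GCF_002042975", "coral"]:
    print(candidate, is_assembly_accession(candidate))
```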

Download various KEGG, Pfam, and OrthoDB files

bash scripts/world/kegg-download.sh
bash scripts/world/pfam-download.sh
bash scripts/world/odb-download.sh

Area Scripts

The genomes used for an area are curated manually in the genomes.csv file under the area directory. This CSV should follow the schema defined by the tangle.genomes.GenomeAccessionList table.

To list all genomes, only those requiring protein detection (-d), or only those with NCBI-curated proteins (-n), run the following, respectively:

python3 scripts/area/genome-list.py
python3 scripts/area/genome-list.py -d
python3 scripts/area/genome-list.py -n

You can pipe this list into the ncbi-download.py script, where the trailing - stands for standard input, e.g.

python3 scripts/area/genome-list.py | python3 scripts/world/ncbi-download.py -
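The pipeline above relies on the common convention that a - argument means "read from standard input". A minimal sketch of that convention follows. It is not the actual ncbi-download.py code, the function name is hypothetical, and it assumes the upstream script prints one accession per line.

```python
import sys


def read_accessions(args, stdin=sys.stdin):
    """Collect accessions from args, expanding "-" into lines read from stdin.

    Assumption: stdin carries one accession per line; blank lines are skipped.
    """
    accessions = []
    for arg in args:
        if arg == "-":
            accessions.extend(line.strip() for line in stdin if line.strip())
        else:
            accessions.append(arg)
    return accessions


# Example with an in-memory stand-in for stdin (any iterable of lines works).
print(read_accessions(["-"], stdin=["GCF_002042975.1\n", "\n"]))
```

Accepting both forms lets the same script take an accession directly on the command line or a whole list through a pipe.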

To fetch taxonomy metadata (the taxonomy files are kept in "world" and shared across areas):

python3 scripts/area/genome-list.py | python3 scripts/world/ncbi-genome-metadata.py -

To get default paths, either of the following works:

python3 scripts/area/genome-list.py -d | python3 scripts/defaults.py --file ncbi_genome_fna -
python3 scripts/area/genome-list.py -d | xargs python3 scripts/defaults.py ncbi_genome_fna

Combine detected and reference FASTA files into one file

tangle-py tangle/scripts/defaults.py \
  -m area_detected_proteins \
  -m ncbi_genome_proteins_path \
  GCF_002042975.1 | xargs cat > combined.faa
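The xargs cat step simply concatenates the files whose paths defaults.py prints. For use inside a Python workflow, a rough equivalent might look like the sketch below. The function name is hypothetical and the paths are placeholders supplied by the caller.

```python
import shutil


def concatenate(paths, out_path):
    """Append the bytes of each input file to out_path, in the given order."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                # Stream in chunks so large FASTA files are not held in memory.
                shutil.copyfileobj(src, out)
```

Like cat, this copies bytes verbatim, so the inputs should be either all plain text or all gzip-compressed.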

Generic Scripts

Extract entries from a FASTA file by ID

echo aten_0.1.m1.10024.m1 | \
  tangle-py tangle/scripts/fasta-emit.py "$(tangle-py tangle/scripts/defaults.py -m area_experiment PM32426508)/proteins.faa.gz" -
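To illustrate what extracting entries by ID involves, here is a minimal sketch that pulls records out of a plain or gzip-compressed FASTA file. It is not the fasta-emit.py implementation. The function names are hypothetical, and it assumes the record ID is the first whitespace-delimited token of the header line.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text, handling a .gz suffix transparently."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def emit_entries(fasta_path, wanted_ids):
    """Yield (header, sequence) for records whose ID is in wanted_ids.

    Assumption: the ID is the first whitespace-delimited token after ">".
    """
    wanted = set(wanted_ids)
    header, chunks, keep = None, [], False
    with open_fasta(fasta_path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                if keep:
                    # A new header closes the previous wanted record.
                    yield header, "".join(chunks)
                header = line[1:]
                tokens = header.split()
                keep = bool(tokens) and tokens[0] in wanted
                chunks = []
            elif keep:
                chunks.append(line.strip())
    if keep:
        # Flush the last record if it was wanted.
        yield header, "".join(chunks)
```

Records come back in file order, and multi-line sequences are joined into a single string.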
