Tangle

Tangle is a platform and ecosystem of tools for curating and classifying proteins and transcripts across species.

This project spans several repositories:

tangle: (this repository) core data models, core libraries, scripts to download public data
heap: tools to annotate proteins and transcripts, by structure or sequence similarity
needle: HMM-based detection of protein sequences from genomic sequences, bypassing gene prediction
pile: tools to work with RNAseq data, including assembly, searching, comparison, and quantification

Docker Images

Pull these two images:

MMSeqs2 docker image: docker pull ghcr.io/soedinglab/mmseqs2
Muscle aligner: docker pull pegi3s/muscle

Environment Variables

Point the TANGLE_WORLD environment variable to a root directory. This is where all tangle files are located.

Set the TANGLE_AREA environment variable to the name of a focus area, e.g. "coral".

Scripts in scripts/world maintain files under TANGLE_WORLD. Scripts in scripts/area maintain files under the focus area.
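
As an illustration of how a script might pick up these two variables, here is a minimal sketch. The helper name tangle_dirs is hypothetical and is not taken from the Tangle code base; the sketch only reads the variables and fails early when one is unset, and makes no assumption about the directory layout underneath TANGLE_WORLD.

```python
import os
from pathlib import Path


def tangle_dirs(environ=None):
    """Return (world_dir, area_name) read from TANGLE_WORLD and TANGLE_AREA.

    Hypothetical helper for illustration only; not part of Tangle.
    """
    env = os.environ if environ is None else environ
    # Collect every unset or empty variable so the error names all of them.
    missing = [name for name in ("TANGLE_WORLD", "TANGLE_AREA") if not env.get(name)]
    if missing:
        raise RuntimeError("unset environment variable(s): " + ", ".join(missing))
    return Path(env["TANGLE_WORLD"]).expanduser(), env["TANGLE_AREA"]
```

Passing a plain dict instead of os.environ, e.g. tangle_dirs({"TANGLE_WORLD": "/data/tangle", "TANGLE_AREA": "coral"}), makes the behaviour easy to check without touching the real environment.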

World Scripts

Download files related to an NCBI genome accession

python3 scripts/world/ncbi-download.py <accession>

Download various KEGG, Pfam, and OrthoDB files

bash scripts/world/kegg-download.sh
bash scripts/world/pfam-download.sh
bash scripts/world/odb-download.sh

Area Scripts

Genomes to be used for an area should be manually curated in the genomes.csv file under the area directory. This CSV should follow the schema outlined in the tangle.genomes.GenomeAccessionList table.

To list all genomes, only those requiring protein detection (-d), or only those with NCBI-curated proteins (-n), use these scripts, respectively:

python3 scripts/area/genome-list.py
python3 scripts/area/genome-list.py -d
python3 scripts/area/genome-list.py -n
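
To make the filtering idea concrete, here is a rough sketch of splitting a genome list by whether curated proteins exist. The real schema lives in tangle.genomes.GenomeAccessionList and is not reproduced here: the column names accession and has_ncbi_proteins, and the function filter_genomes, are invented for illustration and may not match genomes.csv.

```python
import csv
import io


def filter_genomes(csv_text, curated=None):
    """Return accessions from CSV text, optionally filtered.

    curated=None  -> all genomes
    curated=True  -> only genomes with curated proteins
    curated=False -> only genomes that still need protein detection

    Column names are hypothetical, not Tangle's actual schema.
    """
    accessions = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        has_proteins = row["has_ncbi_proteins"].strip().lower() == "true"
        if curated is None or has_proteins == curated:
            accessions.append(row["accession"].strip())
    return accessions
```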

You can use this with the ncbi-download.py script, e.g.

python3 scripts/area/genome-list.py | python3 scripts/world/ncbi-download.py -

And to fetch taxonomy metadata (note that the taxonomy files are kept in "world" and shared across areas):

python3 scripts/area/genome-list.py | python3 scripts/world/ncbi-genome-metadata.py -

Or to get default paths, either of the following works

python3 scripts/area/genome-list.py -d | python3 scripts/defaults.py --file ncbi_genome_fna -
python3 scripts/area/genome-list.py -d | xargs python3 scripts/defaults.py ncbi_genome_fna
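
Both forms hand the same accessions to the script: the first through standard input, signalled by the trailing "-", and the second as positional arguments appended by xargs. A minimal sketch of how a script could accept either form follows; read_accessions is a hypothetical name, and this is not necessarily how the Tangle scripts parse their arguments.

```python
import sys


def read_accessions(args, stream=None):
    """Expand positional arguments into a list of accessions.

    A literal "-" is replaced by the non-empty lines read from the stream
    (standard input by default); any other argument is kept as-is.
    Hypothetical helper for illustration only.
    """
    stream = sys.stdin if stream is None else stream
    accessions = []
    for arg in args:
        if arg == "-":
            # One accession per line; blank lines are skipped.
            accessions.extend(line.strip() for line in stream if line.strip())
        else:
            accessions.append(arg)
    return accessions
```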

Combine detected and reference FASTAs together

tangle-py tangle/scripts/defaults.py \
  -m area_detected_proteins \
  -m ncbi_genome_proteins_path \
  GCF_002042975.1 | xargs cat > combined.faa

Generic Scripts

Extract entries from FASTA

echo aten_0.1.m1.10024.m1 | \
  tangle-py tangle/scripts/fasta-emit.py "$(tangle-py tangle/scripts/defaults.py -m area_experiment PM32426508)/proteins.faa.gz" -
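
For readers unfamiliar with this kind of extraction, the sketch below shows one way to pull records out of a (possibly gzipped) FASTA file by ID. It is an assumption-laden illustration, not the implementation of fasta-emit.py: the function name emit_fasta_entries is hypothetical, and the record ID is taken to be the first whitespace-delimited token of the header line.

```python
import gzip


def _record_id(header):
    """Return the first whitespace-delimited token of a header ('' if empty)."""
    parts = header.split()
    return parts[0] if parts else ""


def emit_fasta_entries(path, wanted_ids):
    """Yield (header, sequence) for FASTA records whose ID is in wanted_ids.

    Hypothetical helper for illustration; reads .gz files transparently.
    """
    wanted = set(wanted_ids)
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None and _record_id(header) in wanted:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None and line:
                chunks.append(line)
    # Emit the final record, which has no following header to close it.
    if header is not None and _record_id(header) in wanted:
        yield header, "".join(chunks)
```

Records are streamed one at a time, so memory use depends on the longest single sequence rather than on the size of the file.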
