taffish/edta

taf-edta

TAFFISH wrapper for EDTA, the Extensive de novo TE Annotator. EDTA builds high-quality transposable element libraries and can perform whole-genome repeat annotation.

This repository packages EDTA 2.3.0 as a TAFFISH tool app. The image is based on the BioContainers/Bioconda EDTA environment quay.io/biocontainers/edta:2.3.0--hdfd78af_0, pinned by digest in the Dockerfile for reproducibility.

Installation

Install from the public TAFFISH Hub index:

taf update
taf install edta

Install the exact release:

taf install edta 2.3.0-r2

For local testing before the app is published to the public index:

taf install --from .

Usage

Show TAFFISH app help:

taf-edta --help

Show the TAFFISH package version:

taf-edta --version

Show upstream EDTA help:

taf-edta EDTA.pl -h
taf-edta -- -h

Check that the bundled EDTA runtime dependencies are available:

taf-edta EDTA.pl --check_dependencies
taf-edta -- --check_dependencies

Run the main EDTA pipeline:

taf-edta EDTA.pl --genome genome.fa --threads 10

Run a fuller official-style annotation command:

taf-edta EDTA.pl \
  --genome genome.fa \
  --cds genome.cds.fa \
  --curatedlib curated.fa \
  --exclude exclude.bed \
  --overwrite 1 \
  --sensitive 1 \
  --anno 1 \
  --threads 10

Option-leading shorthand also works because EDTA.pl is the default command:

taf-edta -- --genome genome.fa --threads 10

Command Mode

This is a command-mode TAFFISH tool. The first non-option argument is treated as an executable inside the container, so the clearest form is to name the upstream EDTA command explicitly:

taf-edta EDTA.pl --genome genome.fa --threads 10
taf-edta EDTA_raw.pl --genome genome.fa --type ltr --threads 10
taf-edta panEDTA.sh
taf-edta panEDTA.sh -g genome_list.txt -c cds.fa -t 10

Do not assume that taf-edta raw ... or taf-edta -- raw ... means EDTA.pl raw .... EDTA does not expose a single subcommand-style CLI; its public interfaces are separate scripts such as EDTA.pl, EDTA_raw.pl, EDTA_processK.pl, lib-test.pl, and panEDTA.sh.
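
The dispatch rule described in this section can be illustrated with a small shell function. This is only a sketch of the documented behaviour, not TAFFISH's actual implementation: the function name resolve_edta_command and the "wrapper-option" label are hypothetical, and the no-argument case is left undefined because the README does not describe it.

```shell
# Hypothetical illustration of the command-mode rule; not TAFFISH's own code.
# Prints the command line that would run inside the container.
resolve_edta_command() {
  default_cmd="EDTA.pl"   # the README states that EDTA.pl is the default command
  [ "$#" -gt 0 ] || return 1   # no-argument behaviour is not covered here
  case "$1" in
    --)
      # "taf-edta -- ARGS" forwards ARGS to the default command.
      shift
      if [ "$#" -gt 0 ]; then
        printf '%s\n' "$default_cmd $*"
      else
        printf '%s\n' "$default_cmd"
      fi
      ;;
    -*)
      # Leading options such as --help or --version belong to the wrapper itself.
      printf '%s\n' "wrapper-option $*"
      ;;
    *)
      # The first non-option argument names an executable in the container.
      printf '%s\n' "$*"
      ;;
  esac
}

# Example: "raw" is treated as an executable name, not as an EDTA.pl subcommand.
#   resolve_edta_command raw --genome genome.fa
```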

Access bundled helper executables directly:

taf-edta RepeatMasker -h
taf-edta RepeatModeler -h
taf-edta BuildDatabase -h
taf-edta LTR_retriever -h
taf-edta TEsorter -h
taf-edta makeblastdb -version
taf-edta samtools --version
taf-edta Rscript --version
taf-edta python3 --version

Practical Notes

EDTA is sensitive to paths. The upstream Docker documentation recommends running with all needed inputs in the current working directory and using plain local filenames. In practice, prefer:

cd my-edta-run
taf-edta EDTA.pl --genome genome.fa --cds genome.cds.fa --threads 10

Avoid absolute paths and symlink-heavy inputs unless you have tested that EDTA resolves them correctly in the container. This matters because EDTA creates many working directories, symlinks, and intermediate files while coordinating RepeatMasker, RepeatModeler, BLAST+, LTR_retriever, TIR-Learner, and related tools.
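
A preflight check along these lines can catch path problems before a long run starts. This is a minimal sketch, not part of the package: the function name edta_check_inputs is hypothetical, and the rule it enforces (plain filenames in the current directory, no symlinks) is a stricter reading of the advice above than EDTA itself may require.

```shell
# Return 0 when every argument is a plain filename in the current directory:
# no slash (so no absolute or nested path), an existing regular file, not a symlink.
edta_check_inputs() {
  rc=0
  for f in "$@"; do
    case "$f" in
      */*)
        echo "not a plain local filename: $f" >&2
        rc=1
        continue
        ;;
    esac
    if [ -L "$f" ]; then
      echo "symlink: $f" >&2
      rc=1
    elif [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example:
#   edta_check_inputs genome.fa genome.cds.fa &&
#     taf-edta EDTA.pl --genome genome.fa --cds genome.cds.fa --threads 10
```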

EDTA is also I/O intensive. Use a fast local working directory for real genomes, and reserve enough disk space for intermediate files.
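
Free space in the working directory can be checked before starting. The sketch below uses the POSIX df -Pk output format; the function names are hypothetical, and the threshold is left to the caller because this README does not state how much space EDTA needs for a given genome.

```shell
# Print the free space, in kilobytes, of the filesystem holding directory $1 (default: .).
edta_free_kb() {
  df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# Succeed when at least $1 kilobytes are free in directory $2 (default: .).
edta_has_space() {
  free_kb=$(edta_free_kb "${2:-.}") || return 1
  [ "$free_kb" -ge "$1" ]
}

# Example (the 50 GiB figure is an arbitrary placeholder, not an EDTA requirement):
#   edta_has_space $((50 * 1024 * 1024)) . || echo "low disk space" >&2
```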

Outputs

Common EDTA outputs include:

genome.fa.mod.EDTA.TElib.fa       Final non-redundant TE library
genome.fa.mod.EDTA.intact.gff3    Structurally intact TE annotation
genome.fa.mod.EDTA.TEanno.gff3    Whole-genome TE annotation, with --anno 1
genome.fa.mod.EDTA.TEanno.gtf     Whole-genome TE annotation in GTF, with --anno 1
genome.fa.mod.EDTA.TEanno.sum     Whole-genome TE annotation summary
genome.fa.mod.MAKER.masked        Low-threshold masked genome, with --anno 1
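
After a run, a quick loop can report which of these files were written. This is a sketch only: the function name edta_list_outputs is hypothetical, and the file names are taken from the list above, with the genome filename as the prefix.

```shell
# List the common EDTA outputs for genome file $1 and mark each as present or missing.
# Some of these files are produced only with --anno 1, as noted in the list above.
edta_list_outputs() {
  genome=$1
  for suffix in \
    mod.EDTA.TElib.fa \
    mod.EDTA.intact.gff3 \
    mod.EDTA.TEanno.gff3 \
    mod.EDTA.TEanno.gtf \
    mod.EDTA.TEanno.sum \
    mod.MAKER.masked
  do
    # -s is true only for files that exist and are not empty.
    if [ -s "$genome.$suffix" ]; then
      echo "present $genome.$suffix"
    else
      echo "missing $genome.$suffix"
    fi
  done
}

# Example:
#   edta_list_outputs genome.fa
```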

For pan-genome annotation, use panEDTA.sh:

taf-edta panEDTA.sh -g genome_list.txt -c cds.fa -t 10

The genome list should use paths accessible from the working directory. For the most predictable TAFFISH/container behavior, keep the listed genomes and CDS files in or under the current working directory.
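
The genome list can be checked against this advice before launching panEDTA.sh. The sketch below assumes a list format of one genome per line with the genome path in the first whitespace-separated column; that format is an assumption here and should be confirmed against the upstream panEDTA documentation. The function name edta_check_genome_list is hypothetical.

```shell
# Check a panEDTA genome list. Assumption: one genome per line, with the genome
# path in the first whitespace-separated column; blank lines are skipped.
edta_check_genome_list() {
  list=$1
  rc=0
  while read -r genome _rest || [ -n "$genome" ]; do
    [ -n "$genome" ] || continue
    case "$genome" in
      /*|../*|*/../*)
        # Absolute paths and paths that climb out of the working directory.
        echo "outside working directory: $genome" >&2
        rc=1
        continue
        ;;
    esac
    if [ ! -f "$genome" ]; then
      echo "missing: $genome" >&2
      rc=1
    fi
  done < "$list"
  return "$rc"
}

# Example:
#   edta_check_genome_list genome_list.txt &&
#     taf-edta panEDTA.sh -g genome_list.txt -c cds.fa -t 10
```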

Package

name: edta
command: taf-edta
version: 2.3.0-r2
kind: tool
image: ghcr.io/taffish/edta:2.3.0-r2

Container

The container image starts from the official BioContainers EDTA image:

quay.io/biocontainers/edta:2.3.0--hdfd78af_0
digest: sha256:6dfb5313b05caf4d6cafa724d6c5a95365e0471adee29005c42a338dfdf358c5
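
For reference, a digest-pinned image reference has the form repository@sha256:<64 hex characters>. The helper below builds one after a basic format check. It is a sketch with a hypothetical function name; the exact FROM line in this repository's Dockerfile is not shown in this README, so treat the result as an illustration of the pinning form rather than a copy of the Dockerfile.

```shell
# Build an "image@digest" reference after checking that the digest
# looks like sha256:<64 lowercase hex characters>.
pinned_image_ref() {
  repo=$1
  digest=$2
  hex=${digest#sha256:}
  if [ "$hex" = "$digest" ] || [ "${#hex}" -ne 64 ]; then
    echo "unexpected digest format: $digest" >&2
    return 1
  fi
  case "$hex" in
    *[!0-9a-f]*)
      echo "digest is not lowercase hex: $digest" >&2
      return 1
      ;;
  esac
  printf '%s@%s\n' "$repo" "$digest"
}

# Example:
#   docker pull "$(pinned_image_ref quay.io/biocontainers/edta \
#     sha256:6dfb5313b05caf4d6cafa724d6c5a95365e0471adee29005c42a338dfdf358c5)"
```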

EDTA has a large dependency set. The image intentionally keeps the full BioContainers/Bioconda runtime rather than rebuilding a partial custom environment. It includes EDTA plus the major tools EDTA calls internally:

RepeatMasker, RepeatModeler, BuildDatabase, BLAST+, LTR_retriever,
LTR_FINDER_parallel, LTR_HARVEST_parallel, TIR-Learner, HelitronScanner,
AnnoSINE, TEsorter, GenomeTools, TRF, GRF, CD-HIT, SAMtools, BEDTools,
R, Python, Java, and Perl modules

The image is large, but this is the reliable choice for EDTA. Rebuilding EDTA manually from Debian would still require installing and validating this whole toolchain, and would likely recreate most of the BioContainers image while increasing maintenance risk.

The TAFFISH Dockerfile adds only:

TAFFISH environment metadata
PYTHONNOUSERSITE=1
a corrected panEDTA.sh bash launcher
build-time dependency/help checks

The current release is built for:

linux/amd64

The BioContainers tag is a single linux/amd64 manifest, not a native multi-architecture image. For Docker and Podman, src/main.taf declares --platform linux/amd64, so arm64 machines such as Apple Silicon Macs can run it through normal amd64 emulation:

TAFFISH_CONTAINER_BACKEND=docker \
  taf-edta EDTA.pl --check_dependencies

This does not mean the image contains a native arm64 build. Apptainer compatibility depends on the host and site configuration.
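
To see whether a given host would run the image under emulation, the machine type reported by uname -m can be compared against the amd64 names. This is a small sketch with a hypothetical function name; it only classifies the host and does not check that an emulation layer is actually installed.

```shell
# Succeed (exit 0) when machine type $1 (default: output of uname -m) is not amd64,
# meaning a linux/amd64 image would have to run under emulation on that host.
needs_amd64_emulation() {
  machine=${1:-$(uname -m)}
  case "$machine" in
    x86_64|amd64) return 1 ;;
    *) return 0 ;;
  esac
}

# Example:
#   if needs_amd64_emulation; then
#     echo "non-amd64 host: expect slower, emulated runs" >&2
#   fi
```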

Smoke Checks

The TAFFISH metadata declares a Docker smoke check:

exist:
  EDTA.pl, EDTA_raw.pl, EDTA_processK.pl, lib-test.pl, panEDTA.sh
  RepeatMasker, RepeatModeler, BuildDatabase, LTR_retriever
  LTR_FINDER_parallel, LTR_HARVEST_parallel, TIR-Learner, HelitronScanner
  AnnoSINE_v2, TEsorter, makeblastdb, blastn, blastx, gt, grf-main, trf
  mdust, cd-hit-est, samtools, bedtools, Rscript, python3, perl, java

test:
  EDTA.pl help is available
  EDTA.pl --check_dependencies reports "All passed"
  EDTA_raw.pl help is available
  EDTA_processK.pl help is available
  lib-test.pl help is available
  panEDTA.sh usage is available through the corrected bash launcher
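
The "exist" part of such a check amounts to looking up each name on PATH. The function below is a sketch of that idea using the POSIX command -v builtin; it is not the TAFFISH smoke-check implementation, and the function name is hypothetical. It would need to run in a shell inside the container to say anything about the image.

```shell
# Print each named executable that is not found on PATH;
# succeed only when none are missing.
missing_executables() {
  rc=0
  for exe in "$@"; do
    if ! command -v "$exe" >/dev/null 2>&1; then
      echo "$exe"
      rc=1
    fi
  done
  return "$rc"
}

# Example, using a few names from the list above:
#   missing_executables EDTA.pl RepeatMasker LTR_retriever samtools
```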

The smoke check deliberately does not run a full genome annotation. Upstream's own toy genome test takes minutes and produces many intermediate files, so it is better kept as a manual functional test when needed:

taf-edta EDTA.pl \
  --genome genome.fa \
  --cds genome.cds.fa \
  --curatedlib curated.fa \
  --exclude exclude.bed \
  --overwrite 1 \
  --sensitive 1 \
  --anno 1 \
  --threads 10

Upstream

Maintainer Notes

Useful checks before publishing:

taf check
taf publish --release --dry-run
docker build --platform linux/amd64 -t ghcr.io/taffish/edta:2.3.0-r2 -f docker/Dockerfile .
docker run --rm --platform linux/amd64 ghcr.io/taffish/edta:2.3.0-r2 EDTA.pl --check_dependencies
docker run --rm --platform linux/amd64 ghcr.io/taffish/edta:2.3.0-r2 EDTA_raw.pl -h
docker run --rm --platform linux/amd64 ghcr.io/taffish/edta:2.3.0-r2 panEDTA.sh
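
These checks can be chained so that the first failure stops the sequence. The sketch below wraps the same commands in two hypothetical helper functions, run_step and prepublish_checks, with a DRY_RUN switch that prints the commands instead of running them. It assumes each command exits 0 on success; if a help or usage command exits non-zero in practice, that step would need its own handling.

```shell
# Run the publishing checks in order and stop at the first failure.
# Set DRY_RUN=1 to print the commands instead of running them.
IMAGE="ghcr.io/taffish/edta:2.3.0-r2"

run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

prepublish_checks() {
  run_step taf check &&
  run_step taf publish --release --dry-run &&
  run_step docker build --platform linux/amd64 -t "$IMAGE" -f docker/Dockerfile . &&
  run_step docker run --rm --platform linux/amd64 "$IMAGE" EDTA.pl --check_dependencies &&
  run_step docker run --rm --platform linux/amd64 "$IMAGE" EDTA_raw.pl -h &&
  run_step docker run --rm --platform linux/amd64 "$IMAGE" panEDTA.sh
}

# Example:
#   DRY_RUN=1; prepublish_checks
```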

The repository wrapper files are licensed under Apache-2.0. Upstream EDTA is GPL-3.0-only, and third-party runtime components are distributed under their own upstream licenses.
