Build-pipeline resilience fixes: BioMart retries, GENCODE streaming, URL updates by af8 · Pull Request #234 · ndaniel/fusioncatcher

af8 · 2026-05-25T10:29:54Z

Build-pipeline resilience fixes (2026-05)

A focused set of fixes to make fusioncatcher-build.py finish cleanly against a current Ensembl/GENCODE release. All changes are confined to bin/get_*.py and bin/fusioncatcher-build.py — none of the runtime fusion-detection code is touched.

This PR partially addresses #189 ("Download data using docker is not working as expected") and the broader experience that the build pipeline has become brittle against transient BioMart errors and against several upstream resources that have moved or been decommissioned since the 1.33 release.

Tested with

- GENCODE v48 / Ensembl v114 / Homo sapiens build, run end-to-end inside an nfcore/base:1.7-derived image with the fusioncatcher-1.33 conda package.
- 160/160 steps complete in ~4h52m (matches the duration of an unpatched build that I am able to coax to completion with manual retries).
- Filter-input outputs (paralogs.txt, adjacent_genes.txt, biotypes.txt, synonyms.txt) are byte-identical to what an unpatched run produces when retries succeed by chance.
- Four previously empty fusion catalogs are now populated:
  - pcawg.txt: 4,346 fusion entries
  - cancer_genes.txt: 2,986 gene symbols (Bushman Lab June 2021 list)
  - mitelman.txt: 202,423 entries (April 2026 ISB-CGC snapshot)
  - chimerdb4kb.txt / chimerdb4pub.txt / chimerdb4seq.txt: 1,951 + 1,941 + 374,029 entries (KOBIC ChimerDB v4)

Commits (in order)

1. Add retry logic to BioMart-based download scripts: uniform 5-attempt loop with a backoff that grows with each attempt and a 60 s timeout across the 8 BioMart helpers (get_biotypes, get_genes_descriptions, get_genes_symbols, get_hla2, get_mtrna, get_refseq_ensembl, get_rrna, get_trna). Ensembl BioMart returns transient HTTP 405 on a small but non-zero fraction of requests; this turns them from build-aborting errors into a small delay.
2. Add per-chromosome retry and fail-on-incomplete in get_paralogs and get_exons_positions: same retry pattern but applied to both queries (chromosome list + per-chromosome). On exhausted retries the script calls sys.exit(1): paralogs.txt and adjacent_genes.txt are used downstream as fusion-pair filters, so a partial file would silently produce non-deterministic detection output across reference rebuilds of the same Ensembl version.
3. Stream GENCODE GTF decompression to avoid OOM: the human v48 GTF is ~3 GB uncompressed; gzip.open(...).read() followed by .readlines() held the entire string in memory plus a second copy as a list, running out of memory on machines with less than ~8 GB available to the Python 2.7 interpreter. Replaced with shutil.copyfileobj() streaming + line-by-line iteration. Also adds the FTP retry pattern. Output is bit-identical to the previous behaviour on machines where it didn't OOM.
4. Update URLs for migrated and dead resources:
   - get_pcawg.py: dcc.icgc.org/api/v1/download?fn=/PCAWG/... is decommissioned; PCAWG open-tier data was migrated to the ICGC ARGO icgc25k-open bucket at https://object.genomeinformatics.org. TSV schema is unchanged; parser untouched. Also wraps the existing gzip parse in a try/except so a future migration breakage produces an empty file rather than crashing the build.
   - get_cancer-genes.py: the Bushman Lab moved its resource directory; /assets/doc/allOnco_May2018.tsv is dead, replaced by /export/geneLists/allOnco_June2021.tsv. The new file is a quoted TSV with a row-index column at position 0 and the gene symbol at position 1, so the parser is adjusted accordingly.
   - get_mitelman.py: the default URL pointed at storage.cloud.google.com (the GCS web console endpoint, returns HTML) and silently produced an empty mitelman.txt. Switched to the data endpoint storage.googleapis.com. Separately, the dataset moved its files from a mitelman_db/ subdirectory to the archive root; the ZIP lookup now tries both layouts.
   - get_chimerdb4.py: KOBIC consolidated their ChimerDB site and dropped the /chimerdb_mirror/ path segment; the live download endpoint is /chimerdb/downloads. The server only serves over HTTPS now (HTTP returns 500), so the default --server is updated from http://www.kobic.re.kr to https://www.kobic.re.kr. Parser untouched (locates H_GENE/T_GENE columns by header name).
5. Fix get_synonyms.py filter to exclude GRCh37 entries instead of .gz files: one-line fix to the Ensembl MySQL directory filter at line 138. The previous filter (not el.lower().endswith(".gz")) was meant to exclude tarballs but accidentally let homo_sapiens_core_*_37 (GRCh37) entries through, occasionally polluting a GRCh38 build with GRCh37 synonyms.
6. Auto-pin GENCODE release in fusioncatcher-build.py based on Ensembl version: when --ftp-ensembl-path is supplied (e.g. /pub/release-114), derive the matching GENCODE release as ensembl_release − 66 and pass it to get_gencode.py as --release <N>. Without this, get_gencode.py always picks the latest release from ftp.ebi.ac.uk, which produces a gencode_genes.txt that does not match the requested Ensembl version (e.g. Ensembl v114 paired with GENCODE v49 instead of v48). Skipped silently when the path is omitted or unparseable, preserving previous behaviour for those callers.
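
For the get_cancer-genes.py change in the commit list above, the parser adjustment for the quoted TSV can be sketched roughly as follows. This is an illustration, not the patched script: the function name `parse_gene_symbols` and the sample rows are hypothetical, and header handling is left out because the file's header layout is not described here.

```python
def parse_gene_symbols(lines):
    """Return the gene symbol from column 1 of each row of a quoted TSV.

    Column 0 is the row index, column 1 the quoted gene symbol.
    Rows with fewer than two columns are skipped.
    """
    symbols = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue
        # Previously split("\t")[0]; now column 1 with the quotes removed.
        symbols.append(fields[1].strip('"'))
    return symbols


# Hypothetical rows in the new layout, for illustration only.
sample = ['"1"\t"ABL1"\t"extra"\n', '"2"\t"TP53"\t"extra"\n']
print(parse_gene_symbols(sample))
```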

Out of scope (intentionally)

These were considered but deliberately not included to keep the PR focused on regressions and dead links:

OncoKB: the public data dir was removed from oncokb/oncokb-public on GitHub in 2024; the official API requires an authentication token. Fixing this requires a credentials story, not just a URL update.
COSMIC / CGP: login-gated; same reasoning. Already documented as --skip-database cgp,cosmic workaround.
TICdb / ConjoinG / CACG: server permanently offline or genuinely unavailable. The current "produce an empty output file silently" behaviour is preserved here.
TCGA Yoshihara / Cell Reports supplements: publisher CDNs return 403 under their Text-and-Data-Mining policies. PMC mirrors exist but require a session-based fetch I haven't automated.

These can be addressed in follow-up PRs if there is interest.

Reviewer notes

The diff for the BioMart scripts (commit 1) is large because the existing try / urlopen / except block is wrapped in a for attempt in xrange(5) loop. The logical change is small and uniform across the 8 files; reviewing one of them is sufficient to validate the others.
get_paralogs.py (commit 2) intentionally does not silent-skip a chromosome on exhausted retries. An intermediate version of this patch did silent-skip, and the resulting paralogs.txt was non-deterministic across rebuilds — that behaviour was reverted.
No requirements.txt or environment.yml changes are needed; everything uses the standard library plus what fusioncatcher-1.33 already depends on.

Suggestion for a future iteration (not part of this PR)

While working through these fixes I noticed that the build pipeline currently treats the ~30 resource downloads as a flat list: when one comes back empty, the user has no way to tell whether the missing file is a problem (because that file feeds the detection algorithm or is a hard filter) or a non-issue (because that file is a "seen in cohort X" enrichment label that the call can be made without).

It might be worth, at some point, surfacing this distinction to the user — for example as a one-line note in the end-of-build log saying "the following empty outputs do not affect fusion calling: …" versus "the following empty outputs do affect fusion calling and should be investigated: …". The contract for each get_*.py would not change; it would just be a summary helper. Happy to draft that as a separate small PR if you think it would be useful, but it's also fine to leave as-is — this PR is intentionally limited to the resilience and URL fixes.
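
A rough sketch of what such a summary helper could look like is below. Everything in it is hypothetical: the function name, the `affects_calling` mapping, and the idea that the caller supplies the classification per file. It only reports; no get_*.py contract changes.

```python
import os


def summarize_empty_outputs(out_dir, affects_calling):
    """Split empty or missing outputs by whether they affect fusion calling.

    `affects_calling` maps an output file name to True (feeds detection or
    is a hard filter) or False (annotation-only label). The mapping is
    supplied by the caller; this sketch does not decide it.
    """
    critical, harmless = [], []
    for name in sorted(affects_calling):
        path = os.path.join(out_dir, name)
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            continue  # populated output, nothing to report
        (critical if affects_calling[name] else harmless).append(name)
    return critical, harmless
```

The end-of-build log could then print the two lists under separate messages, which keeps the distinction visible without touching the download scripts themselves.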

Ensembl BioMart returns transient HTTP 405 responses on a small but non-zero fraction of requests; the existing scripts treated any error as fatal and aborted the build pipeline.

Wrap each urlopen call in a 5-attempt retry loop whose backoff grows with each attempt (60 * attempt seconds) and raise the request timeout from 30 s to 60 s. On the fifth failed attempt the script still calls sys.exit(1) -- the contract for callers is unchanged; only the success rate against a flaky BioMart improves.

Applies the same pattern uniformly to:
- get_biotypes.py
- get_genes_descriptions.py
- get_genes_symbols.py
- get_hla2.py
- get_mtrna.py
- get_refseq_ensembl.py
- get_rrna.py
- get_trna.py

No business logic, output format, or parse rule is modified.
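
The retry pattern described in this commit message can be sketched as follows. This is a minimal illustration, not the code in the diff: the helper name `fetch_with_retry`, the injected `fetch` callable, and the `sleep` parameter are assumptions made so the sketch can run without network access. It is written for Python 3; the real scripts target Python 2.7 and use xrange.

```python
import sys
import time


def fetch_with_retry(fetch, attempts=5, base_delay=60, sleep=time.sleep):
    """Call fetch() until it succeeds or `attempts` tries are used up.

    `fetch` is any zero-argument callable that performs the request
    (for example a wrapped urlopen call with timeout=60) and raises on error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:  # the real scripts may catch narrower errors
            if attempt == attempts:
                # Unchanged contract: a persistent failure is still fatal.
                sys.stderr.write("giving up after %d attempts: %s\n"
                                 % (attempts, error))
                sys.exit(1)
            # Wait 60 * attempt seconds before the next try.
            sleep(base_delay * attempt)
```

Passing `sleep` as a parameter is only a convenience for testing the loop; in a build script the default `time.sleep` would be used.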

…et_exons_positions

Both scripts issue one BioMart query to fetch a chromosome list, then one query per chromosome to fetch paralog or exon-coordinate data. The single try/except around each call previously aborted on the first transient error.

Wrap each query in its own 5-attempt retry loop with exponential backoff. Timeouts: 60 s for the chromosome-list query, 120 s for the per-chromosome queries.

On exhausted retries of any chromosome the script calls sys.exit(1) -- this is intentional. paralogs.txt and adjacent_genes.txt are used downstream as fusion-pair filters; a partial file (missing one or more chromosomes) would silently produce non-deterministic fusion-detection output across reference rebuilds of the same Ensembl version.
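
The fail-on-incomplete behaviour can be illustrated with the sketch below. The function name `collect_per_chromosome` and the injected `query` callable are hypothetical, and the backoff between attempts is omitted to keep the example short. The point it shows is that one chromosome exhausting its retries aborts the whole run instead of being skipped.

```python
import sys


def collect_per_chromosome(chromosomes, query, attempts=5):
    """Run query(chrom) for every chromosome; abort if any one never succeeds."""
    rows = []
    for chrom in chromosomes:
        for attempt in range(1, attempts + 1):
            try:
                # Materialise first so a mid-stream error adds no partial rows.
                result = list(query(chrom))
            except Exception as error:
                sys.stderr.write("chromosome %s, attempt %d failed: %s\n"
                                 % (chrom, attempt, error))
                continue
            rows.extend(result)
            break
        else:
            # for/else: reached only when no attempt ended in `break`.
            # A partial file would be worse than no file, so stop here.
            sys.exit(1)
    return rows
```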

The previous implementation called gzip.open(...).read() / .readlines() on the GENCODE GTF, which is ~3 GB uncompressed for the human v48 annotation. Holding the entire decompressed string in memory plus a second copy as a list of lines caused the script to run out of memory on machines with less than ~8 GB available to the Python 2.7 interpreter (a common situation in containerized builds).

Replace with shutil.copyfileobj() streaming decompression to a temporary file, then iterate line by line. Output is bit-identical to the previous behaviour on machines where the previous code did not OOM.

Also wrap the FTP fetch in a 5-attempt retry loop with reconnect on each attempt. ftplib previously aborted on any transient FTP error.
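
A minimal sketch of the streaming approach, under the assumption that the compressed GTF is already on disk. The function names and the chunk size are illustrative and not taken from get_gencode.py.

```python
import gzip
import shutil


def decompress_to_file(gz_path, out_path, chunk_size=1024 * 1024):
    """Stream-decompress gz_path into out_path in fixed-size chunks."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # copyfileobj reads and writes chunk_size bytes at a time, so memory
        # use stays bounded regardless of the uncompressed size.
        shutil.copyfileobj(src, dst, chunk_size)


def iter_gtf_records(path):
    """Yield the tab-separated fields of each non-comment line, one at a time."""
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            yield line.rstrip("\r\n").split("\t")
```

Because `iter_gtf_records` is a generator, only one line is held in memory at a time, in contrast to the .read() / .readlines() pair it replaces.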

get_pcawg.py

ICGC DCC was decommissioned in 2024; PCAWG open-tier files were migrated to the ICGC ARGO icgc25k-open S3-compatible bucket. Update the default --server from https://dcc.icgc.org to https://object.genomeinformatics.org and the URL path accordingly. TSV schema is unchanged (26 columns, geneA->geneB at column 0, Ensembl IDs at columns 4 and 5); parser untouched. Also wrap the existing gzip.open(...).readlines() call in try/except so a future migration breakage (server returns HTML instead of gzip) produces an empty pcawg.txt rather than crashing the build.

get_cancer-genes.py

Bushman Lab restructured its resource directory; the old path /assets/doc/allOnco_May2018.tsv returns 404 and the file layout changed. Update to /export/geneLists/allOnco_June2021.tsv. The new file is a quoted TSV with a row-index column at position 0 and the gene symbol at position 1, so the parser is adjusted from split("\t")[0] to split("\t")[1].strip('"').

get_mitelman.py

The default --url pointed at storage.cloud.google.com (the GCS web console endpoint, which returns HTML) and silently produced an empty mitelman.txt. Switch to the actual data endpoint https://storage.googleapis.com/mitelman-data-files/prod/mitelman_db.zip (HTTP 200, ~21 MB). Separately, the Mitelman dataset moved its files from the mitelman_db/ subdirectory to the archive root. The lookup of MBCA.TXT.DATA inside the ZIP now tries both layouts via zf.getinfo() and warns instead of raising KeyError when neither is present.

get_chimerdb4.py

KOBIC consolidated their ChimerDB site and dropped the /chimerdb_mirror/ path segment; the live download endpoint is now /chimerdb/downloads. The same server only serves over HTTPS, so the default --server is updated from http://www.kobic.re.kr to https://www.kobic.re.kr (HTTP returns 500). Verified end-to-end against the live downloads: ChimerKB4.xlsx (1,951 fusions), ChimerPub4.xlsx (1,941 fusions), ChimerSeq4.xlsx (374,029 fusions). The parser was not touched -- it locates the H_GENE and T_GENE columns by header name, which the XLSX files preserve from previous releases.
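
The two-layout lookup described for get_mitelman.py might look roughly like this. The function name `find_member` and the order in which the layouts are tried are assumptions; only zf.getinfo() and the warn-instead-of-raise behaviour come from the commit message.

```python
import sys
import zipfile


def find_member(zf, name="MBCA.TXT.DATA", subdir="mitelman_db/"):
    """Return the ZipInfo for `name` under either layout, or None if absent."""
    for candidate in (subdir + name, name):
        try:
            return zf.getinfo(candidate)
        except KeyError:
            # getinfo raises KeyError when the member does not exist;
            # fall through and try the next layout.
            continue
    sys.stderr.write("WARNING: %s not found in archive\n" % name)
    return None
```

A caller would open the downloaded archive with `zipfile.ZipFile(path)` and treat a `None` result as "write an empty output", matching the behaviour the message describes.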

…iles

The Ensembl MySQL directory filter at line 138 used to be:

    not el.lower().endswith(".gz")

which was meant to exclude tarball files, but accidentally let the homo_sapiens_core_*_37 directories (the GRCh37 assembly entries) through. When the FTP listing happened to include those, the script picked GRCh37 synonyms for a GRCh38 build, producing wrong gene synonyms.

Change the filter to:

    not el.lower().endswith("37")

so only the current assembly entries are kept.
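
To make the effect of the new predicate concrete, here it is applied to a small listing. The wrapper function and the directory names are made up for illustration; only the predicate itself is taken from the commit message.

```python
def keep_current_assembly(entries):
    """Drop directory entries whose name ends in "37" (the GRCh37 databases)."""
    return [el for el in entries if not el.lower().endswith("37")]


# Hypothetical directory names, for illustration only.
listing = ["homo_sapiens_core_114_38", "homo_sapiens_core_114_37"]
print(keep_current_assembly(listing))  # ['homo_sapiens_core_114_38']
```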

…ersion When --ftp-ensembl-path is supplied (e.g. /pub/release-114), derive the matching GENCODE release as ensembl_release - 66 and pass it to get_gencode.py as --release <N>. Without this, get_gencode.py always picks the latest release from ftp.ebi.ac.uk, which produces a gencode_genes.txt that does not match the requested Ensembl version (e.g. Ensembl v114 paired with GENCODE v49 instead of v48). The derivation is skipped silently if the Ensembl path does not end in a parseable number, preserving previous behaviour for callers that omit --ftp-ensembl-path.
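
The derivation can be sketched as below. The function name, the regular expression, and the tolerance for a trailing slash are assumptions of this sketch; the offset of 66 and the skip-on-unparseable behaviour are taken from the commit message.

```python
import re

# From the commit message: GENCODE release = Ensembl release - 66.
GENCODE_OFFSET = 66


def gencode_release_from_path(ftp_path):
    """Return the GENCODE release for a path like /pub/release-114, else None."""
    if not ftp_path:
        return None  # option omitted: keep the previous behaviour
    # Assumption: a trailing slash is tolerated before looking for the number.
    match = re.search(r"(\d+)$", ftp_path.rstrip("/"))
    if match is None:
        return None  # path does not end in a parseable number
    return int(match.group(1)) - GENCODE_OFFSET


print(gencode_release_from_path("/pub/release-114"))  # 48
```

When the function returns `None`, the build script would simply not pass --release to get_gencode.py, which keeps the existing pick-the-latest behaviour.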

af8 added 3 commits May 25, 2026 10:05

af8 changed the title ~~Fix/build resilience 2026 05~~ Build-pipeline resilience fixes: BioMart retries, GENCODE streaming, URL updates May 25, 2026

af8 force-pushed the fix/build-resilience-2026-05 branch from 7332123 to 3a82564 Compare May 25, 2026 15:21

af8 added 2 commits May 25, 2026 17:33

af8 force-pushed the fix/build-resilience-2026-05 branch from 3a82564 to 4e33ae7 Compare May 25, 2026 15:33

af8 wants to merge 6 commits into ndaniel:master from af8:fix/build-resilience-2026-05
