Build-pipeline resilience fixes: BioMart retries, GENCODE streaming, URL updates #234

Open
af8 wants to merge 6 commits into ndaniel:master from af8:fix/build-resilience-2026-05

af8 commented May 25, 2026

Build-pipeline resilience fixes (2026-05)

A focused set of fixes to make fusioncatcher-build.py finish cleanly against a current Ensembl/GENCODE release. All changes are confined to bin/get_*.py and bin/fusioncatcher-build.py — none of the runtime fusion-detection code is touched.

This PR partially addresses #189 ("Download data using docker is not working as expected") and the broader experience that the build pipeline has become brittle against transient BioMart errors and against several upstream resources that have moved or been decommissioned since the 1.33 release.

Tested with

  • GENCODE v48 / Ensembl v114 / Homo sapiens build, run end-to-end inside an nfcore/base:1.7-derived image with the fusioncatcher-1.33 conda package.
  • 160/160 steps complete in ~4h52m (matches the duration of an unpatched build that I am able to coax to completion with manual retries).
  • Filter-input outputs (paralogs.txt, adjacent_genes.txt, biotypes.txt, synonyms.txt) are byte-identical to what an unpatched run produces when retries succeed by chance.
  • Four previously-empty fusion catalogs are now populated:
    • pcawg.txt — 4,346 fusion entries
    • cancer_genes.txt — 2,986 gene symbols (Bushman Lab June 2021 list)
    • mitelman.txt — 202,423 entries (April 2026 ISB-CGC snapshot)
    • chimerdb4kb.txt / chimerdb4pub.txt / chimerdb4seq.txt — 1,951 + 1,941 + 374,029 entries (KOBIC ChimerDB v4)

Commits (in order)

  1. Add retry logic to BioMart-based download scripts — uniform 5-attempt loop with exponential backoff and a 60 s timeout across the 8 BioMart helpers (get_biotypes, get_genes_descriptions, get_genes_symbols, get_hla2, get_mtrna, get_refseq_ensembl, get_rrna, get_trna). Ensembl BioMart returns transient HTTP 405 on a small but non-zero fraction of requests; this turns them from build-aborting errors into a small delay.

  2. Add per-chromosome retry and fail-on-incomplete in get_paralogs and get_exons_positions — same retry pattern but applied to both queries (chromosome list + per-chromosome). On exhausted retries the script sys.exit(1)s: paralogs.txt and adjacent_genes.txt are used downstream as fusion-pair filters, so a partial file would silently produce non-deterministic detection output across reference rebuilds of the same Ensembl version.

  3. Stream GENCODE GTF decompression to avoid OOM — the human v48 GTF is ~3 GB uncompressed; gzip.open(...).read() followed by .readlines() held the entire string in memory plus a second copy as a list, OOM-ing on machines with less than ~8 GB available to the Python 2.7 interpreter. Replaced with shutil.copyfileobj() streaming + line-by-line iteration. Also adds the FTP retry pattern. Output is bit-identical to the previous behaviour on machines where it didn't OOM.

  4. Update URLs for migrated and dead resources

    • get_pcawg.py: dcc.icgc.org/api/v1/download?fn=/PCAWG/... is decommissioned; PCAWG open-tier data was migrated to the ICGC ARGO icgc25k-open bucket at https://object.genomeinformatics.org. The TSV schema is unchanged, so the parser is untouched. The existing gzip parse is also wrapped in a try/except so that a future migration breakage produces an empty file rather than crashing the build.
    • get_cancer-genes.py: the Bushman Lab moved its resource directory; /assets/doc/allOnco_May2018.tsv is dead and is replaced by /export/geneLists/allOnco_June2021.tsv. The new file is a quoted TSV with a row-index column at position 0 and the gene symbol at position 1, so the parser is adjusted accordingly.
    • get_mitelman.py: the default URL pointed at storage.cloud.google.com (the GCS web console endpoint, which returns HTML) and silently produced an empty mitelman.txt. Switched to the data endpoint storage.googleapis.com. Separately, the dataset moved its files from a mitelman_db/ subdirectory to the archive root; the ZIP lookup now tries both layouts.
    • get_chimerdb4.py: KOBIC consolidated their ChimerDB site and dropped the /chimerdb_mirror/ path segment; the live download endpoint is /chimerdb/downloads. The server now serves only over HTTPS (HTTP returns 500), so the default --server is updated from http://www.kobic.re.kr to https://www.kobic.re.kr. The parser is untouched (it locates the H_GENE/T_GENE columns by header name).
  5. Fix get_synonyms.py filter to exclude GRCh37 entries instead of .gz files — one-line fix to the Ensembl MySQL directory filter at line 138. The previous filter (not el.lower().endswith(".gz")) was meant to exclude tarballs but accidentally let homo_sapiens_core_*_37 (GRCh37) entries through, occasionally polluting a GRCh38 build with GRCh37 synonyms.

  6. Auto-pin GENCODE release in fusioncatcher-build.py based on Ensembl version — when --ftp-ensembl-path is supplied (e.g. /pub/release-114), derive the matching GENCODE release as ensembl_release − 66 and pass it to get_gencode.py as --release <N>. Without this, get_gencode.py always picks the latest release from ftp.ebi.ac.uk, which produces a gencode_genes.txt that does not match the requested Ensembl version (e.g. Ensembl v114 paired with GENCODE v49 instead of v48). Skipped silently when the path is omitted or unparseable, preserving previous behaviour for those callers.

Out of scope (intentionally)

These were considered but deliberately not included to keep the PR focused on regressions and dead links:

  • OncoKB: the public data dir was removed from oncokb/oncokb-public on GitHub in 2024; the official API requires an authentication token. Fixing this requires a credentials story, not just a URL update.
  • COSMIC / CGP: login-gated; same reasoning. The --skip-database cgp,cosmic workaround is already documented.
  • TICdb / ConjoinG / CACG: server permanently offline or genuinely unavailable. The current "produce an empty output file silently" behaviour is preserved here.
  • TCGA Yoshihara / Cell Reports supplements: publisher CDNs return 403 under their Text-and-Data-Mining policies. PMC mirrors exist but require a session-based fetch I haven't automated.

These can be addressed in follow-up PRs if there is interest.

Reviewer notes

  • The diff for the BioMart scripts (commit 1) is large because the existing try / urlopen / except block is wrapped in a for attempt in xrange(5) loop. The logical change is small and uniform across the 8 files; reviewing one of them is sufficient to validate the others.
  • get_paralogs.py (commit 2) intentionally does not silent-skip a chromosome on exhausted retries. An intermediate version of this patch did silent-skip, and the resulting paralogs.txt was non-deterministic across rebuilds — that behaviour was reverted.
  • No requirements.txt or environment.yml changes are needed; everything uses the standard library plus what fusioncatcher-1.33 already depends on.

Suggestion for a future iteration (not part of this PR)

While working through these fixes I noticed that the build pipeline currently treats the ~30 resource downloads as a flat list: when one comes back empty, the user has no way to tell whether the missing file is a problem (because that file feeds the detection algorithm or is a hard filter) or a non-issue (because that file is a "seen in cohort X" enrichment label that the call can be made without).

It might be worth, at some point, surfacing this distinction to the user — for example as a one-line note in the end-of-build log saying "the following empty outputs do not affect fusion calling: …" versus "the following empty outputs do affect fusion calling and should be investigated: …". The contract for each get_*.py would not change; it would just be a summary helper. Happy to draft that as a separate small PR if you think it would be useful, but it's also fine to leave as-is — this PR is intentionally limited to the resilience and URL fixes.
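
To make the suggestion concrete, here is a rough sketch of what such a summary helper could look like. The function names and the `affects_calling` parameter are hypothetical; which outputs actually feed fusion calling would have to be decided by the maintainers and passed in by the caller.

```python
import os


def empty_outputs(paths):
    """Return the paths that are missing or zero bytes long."""
    return [p for p in paths
            if not os.path.isfile(p) or os.path.getsize(p) == 0]


def summarize_empty(paths, affects_calling):
    """Build end-of-build log lines for empty outputs, grouped by impact.

    `affects_calling` is a caller-supplied set of file names (an assumption
    of this sketch) whose absence changes fusion-calling results.
    """
    empty = empty_outputs(paths)
    critical = [p for p in empty if os.path.basename(p) in affects_calling]
    benign = [p for p in empty if os.path.basename(p) not in affects_calling]
    lines = []
    if critical:
        lines.append("the following empty outputs do affect fusion calling "
                     "and should be investigated: " + ", ".join(critical))
    if benign:
        lines.append("the following empty outputs do not affect fusion "
                     "calling: " + ", ".join(benign))
    return lines
```

Because the helper only inspects files after the fact, none of the get_*.py scripts would need to change.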

af8 added 3 commits May 25, 2026 10:05
Ensembl BioMart returns transient HTTP 405 responses on a small but
non-zero fraction of requests; the existing scripts treated any error
as fatal and aborted the build pipeline. Wrap each urlopen call in a
5-attempt retry loop with increasing backoff (60 * attempt seconds)
and raise the request timeout from 30 s to 60 s. On the fifth failed
attempt the script still sys.exit(1)s -- the contract for callers is
unchanged; only the success rate against a flaky BioMart improves.

Applies the same pattern uniformly to:
  - get_biotypes.py
  - get_genes_descriptions.py
  - get_genes_symbols.py
  - get_hla2.py
  - get_mtrna.py
  - get_refseq_ensembl.py
  - get_rrna.py
  - get_trna.py

No business logic, output format, or parse rule is modified.
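
A minimal sketch of the retry pattern described above, written as a standalone helper rather than the inline loop the scripts use. The helper name and the `fetch`/`sleep` parameters are illustrative and not taken from the patch; the wait schedule follows the "60 * attempt seconds" wording of this message.

```python
import time


def fetch_with_retry(fetch, attempts=5, delay_step=60, sleep=time.sleep):
    """Call fetch() until it succeeds or `attempts` tries are used up.

    After each failed attempt, wait delay_step * attempt seconds. Once the
    last attempt fails the error is re-raised, so the caller can still
    abort the build (for example with sys.exit(1)).
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:  # a real script would catch narrower error types
            if attempt == attempts:
                raise
            sleep(delay_step * attempt)
```

Passing `sleep` as a parameter is only there so the loop can be exercised without real waiting.
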
…et_exons_positions

Both scripts issue one BioMart query to fetch a chromosome list, then
one query per chromosome to fetch paralog or exon-coordinate data.
The single try/except around each call previously aborted on the
first transient error.

Wrap each query in its own 5-attempt retry loop with exponential
backoff. Timeouts: 60 s for the chromosome-list query, 120 s for the
per-chromosome queries. On exhausted retries of any chromosome the
script sys.exit(1)s -- this is intentional. paralogs.txt and
adjacent_genes.txt are used downstream as fusion-pair filters; a
partial file (missing one or more chromosomes) would silently produce
non-deterministic fusion-detection output across reference rebuilds
of the same Ensembl version.
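
The fail-on-incomplete behaviour can be sketched roughly as follows. The function name and the `query` callable are hypothetical stand-ins for the per-chromosome BioMart request, which is assumed here to do its own retrying and to raise once retries are exhausted.

```python
import sys


def collect_per_chromosome(chromosomes, query, log=sys.stderr):
    """Run query(chrom) for every chromosome and return all rows.

    Any chromosome that still fails aborts the whole run with exit code 1,
    so a partial result never reaches the output file.
    """
    rows = []
    for chrom in chromosomes:
        try:
            rows.extend(query(chrom))
        except Exception as err:
            log.write("ERROR: chromosome %s failed: %s\n" % (chrom, err))
            sys.exit(1)
    return rows
```

The point of exiting instead of skipping is that the output is either complete or absent, never silently short by one chromosome.
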

The previous implementation called gzip.open(...).read() / .readlines()
on the GENCODE GTF, which is ~3 GB uncompressed for the human v48
annotation. Holding the entire decompressed string in memory plus a
second copy as a list of lines caused the script to OOM on machines
with less than ~8 GB available to the Python 2.7 interpreter (a
common situation in containerized builds).

Replace with shutil.copyfileobj() streaming decompression to a
temporary file, then iterate line by line. Output is bit-identical
to the previous behaviour on machines where the previous code did
not OOM.

Also wrap the FTP fetch in a 5-attempt retry loop with reconnect on
each attempt. ftplib previously aborted on any transient FTP error.
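
As an illustration of the streaming approach, the sketch below decompresses a gzip file to a temporary file in chunks and then yields it line by line. The function name is made up for this example and the code is written for Python 3, whereas the patched script targets Python 2.7.

```python
import gzip
import shutil
import tempfile


def iter_gzip_lines(gz_path):
    """Yield decoded lines of a gzip file without loading it all into memory.

    The archive is first decompressed in fixed-size chunks to a temporary
    file, which is then read back one line at a time.
    """
    with tempfile.TemporaryFile() as tmp:
        with gzip.open(gz_path, "rb") as src:
            shutil.copyfileobj(src, tmp)  # chunked copy, bounded memory
        tmp.seek(0)
        for raw in tmp:
            yield raw.decode("utf-8")
```

Peak memory is then bounded by the copy buffer and the longest single line, not by the size of the decompressed GTF.
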
af8 changed the title from "Fix/build resilience 2026 05" to "Build-pipeline resilience fixes: BioMart retries, GENCODE streaming, URL updates" May 25, 2026
get_pcawg.py
  ICGC DCC was decommissioned in 2024; PCAWG open-tier files were
  migrated to the ICGC ARGO icgc25k-open S3-compatible bucket. Update
  the default --server from https://dcc.icgc.org to
  https://object.genomeinformatics.org and the URL path accordingly.
  TSV schema is unchanged (26 columns, geneA->geneB at column 0,
  Ensembl IDs at columns 4 and 5); parser untouched.

  Also wrap the existing gzip.open(...).readlines() call in
  try/except so a future migration breakage (server returns HTML
  instead of gzip) produces an empty pcawg.txt rather than crashing
  the build.
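
A small sketch of the defensive parse described above, assuming the download has already been saved to a local path. The function name and the exact set of caught exceptions are choices made for this example, not a copy of the patch.

```python
import gzip
import zlib


def read_gzip_lines_or_empty(path):
    """Return the lines of a gzip file, or [] if it cannot be decoded."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.readlines()
    except (IOError, OSError, EOFError, zlib.error):
        # e.g. the server returned an HTML error page instead of gzip data
        return []
```

An empty list then translates into an empty pcawg.txt, which matches how the build already treats other unavailable resources.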

get_cancer-genes.py
  Bushman Lab restructured its resource directory; the old path
  /assets/doc/allOnco_May2018.tsv returns 404 and the file layout
  changed. Update to /export/geneLists/allOnco_June2021.tsv. The
  new file is a quoted TSV with a row-index column at position 0
  and the gene symbol at position 1, so the parser is adjusted from
  split("\t")[0] to split("\t")[1].strip('"').
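
For illustration, the adjusted column access can be wrapped as below. The function name is hypothetical, and stripping the line ending first is an addition of this sketch so that a symbol in the last column would also be unquoted correctly.

```python
def gene_symbol(line):
    """Extract the gene symbol from one row of the quoted TSV.

    Column 0 holds the row index and column 1 the quoted gene symbol.
    """
    return line.rstrip("\r\n").split("\t")[1].strip('"')
```

For a made-up row such as `"1"<TAB>"ABL1"<TAB>...`, this returns `ABL1` without the surrounding quotes.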

get_mitelman.py
  The default --url pointed at storage.cloud.google.com (the GCS
  web console endpoint, which returns HTML) and silently produced
  an empty mitelman.txt. Switch to the actual data endpoint
  https://storage.googleapis.com/mitelman-data-files/prod/mitelman_db.zip
  (HTTP 200, ~21 MB).

  Separately, the Mitelman dataset moved its files from the
  mitelman_db/ subdirectory to the archive root. The lookup of
  MBCA.TXT.DATA inside the ZIP now tries both layouts via
  zf.getinfo() and warns instead of raising KeyError when neither
  is present.
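
The two-layout lookup might look roughly like this. The helper name, the order in which the layouts are tried, and the warning text are assumptions of this sketch; only the use of `zf.getinfo()` and the two layouts come from the description above.

```python
import sys
import zipfile


def find_member(zf, name, prefixes=("mitelman_db/", "")):
    """Return the ZipInfo for `name` under the first layout that has it.

    Tries the old subdirectory layout and the archive root, and returns
    None (after a warning) instead of raising KeyError if both are missing.
    """
    for prefix in prefixes:
        try:
            return zf.getinfo(prefix + name)
        except KeyError:
            continue
    sys.stderr.write("WARNING: %s not found in archive\n" % name)
    return None
```

A caller would open the archive with `zipfile.ZipFile(path)` and pass the result of `find_member(zf, "MBCA.TXT.DATA")` to `zf.open()` when it is not None.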

get_chimerdb4.py
  KOBIC consolidated their ChimerDB site and dropped the
  /chimerdb_mirror/ path segment; the live download endpoint is now
  /chimerdb/downloads. The same server only serves over HTTPS, so the
  default --server is updated from http://www.kobic.re.kr to
  https://www.kobic.re.kr (HTTP returns 500).

  Verified end-to-end against the live downloads: ChimerKB4.xlsx
  (1,951 fusions), ChimerPub4.xlsx (1,941 fusions), ChimerSeq4.xlsx
  (374,029 fusions). The parser was not touched -- it locates the
  H_GENE and T_GENE columns by header name, which the XLSX files
  preserve from previous releases.
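
To show why a lookup by header name survives column reordering, here is a sketch that works on rows already read from a sheet as plain lists. The function name is hypothetical and the XLSX reading itself is left out; this is not the existing parser.

```python
def fusion_pairs(rows):
    """Yield (head_gene, tail_gene) pairs; the first row must be the header."""
    rows = iter(rows)
    header = [str(cell).strip() for cell in next(rows)]
    h_idx = header.index("H_GENE")  # raises ValueError if the column is gone
    t_idx = header.index("T_GENE")
    for row in rows:
        yield row[h_idx], row[t_idx]
```

As long as the two header names are present, extra or reordered columns in a new release do not change the result.
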
af8 force-pushed the fix/build-resilience-2026-05 branch from 7332123 to 3a82564 May 25, 2026 15:21
af8 added 2 commits May 25, 2026 17:33
…iles

The Ensembl MySQL directory filter at line 138 used to be:

    not el.lower().endswith(".gz")

which was meant to exclude tarball files, but accidentally let the
homo_sapiens_core_*_37 directories (the GRCh37 assembly entries)
through. When the FTP listing happened to include those, the script
picked GRCh37 synonyms for a GRCh38 build, producing wrong gene
synonyms.

Change the filter to:

    not el.lower().endswith("37")

so only the current assembly entries are kept.
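
The effect of the two filters can be compared on a small listing. The directory names below are hypothetical examples that follow the homo_sapiens_core_*_37 pattern mentioned above; they are not copied from the Ensembl server.

```python
# Hypothetical directory names, for illustration only.
listing = [
    "homo_sapiens_core_114_38",
    "homo_sapiens_core_114_37",
]

# Old filter: drops only names ending in ".gz", so the GRCh37 entry stays.
old_kept = [el for el in listing if not el.lower().endswith(".gz")]

# New filter: drops names ending in "37", so only the GRCh38 entry stays.
new_kept = [el for el in listing if not el.lower().endswith("37")]
```

With this listing the old filter keeps both entries, while the new one keeps only the entry ending in 38.
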
…ersion

When --ftp-ensembl-path is supplied (e.g. /pub/release-114), derive
the matching GENCODE release as ensembl_release - 66 and pass it to
get_gencode.py as --release <N>. Without this, get_gencode.py always
picks the latest release from ftp.ebi.ac.uk, which produces a
gencode_genes.txt that does not match the requested Ensembl version
(e.g. Ensembl v114 paired with GENCODE v49 instead of v48).

The derivation is skipped silently if the Ensembl path does not end
in a parseable number, preserving previous behaviour for callers
that omit --ftp-ensembl-path.
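
A sketch of the derivation, assuming the release number is the trailing run of digits in the path. The function and constant names are illustrative, and tolerating a trailing slash is an extra assumption of this example.

```python
import re

ENSEMBL_TO_GENCODE_OFFSET = 66  # offset stated in this commit message


def gencode_release_for(ftp_ensembl_path):
    """Return the GENCODE release for an Ensembl FTP path, or None.

    None means "do not pin", which leaves get_gencode.py to pick the
    latest release as before.
    """
    if not ftp_ensembl_path:
        return None
    match = re.search(r"(\d+)/?$", ftp_ensembl_path)
    if match is None:
        return None
    return int(match.group(1)) - ENSEMBL_TO_GENCODE_OFFSET
```

For `/pub/release-114` this gives 114 - 66 = 48, the pairing used in the tested build.
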
af8 force-pushed the fix/build-resilience-2026-05 branch from 3a82564 to 4e33ae7 May 25, 2026 15:33