Build-pipeline resilience fixes: BioMart retries, GENCODE streaming, URL updates#234
Open
af8 wants to merge 6 commits into
Ensembl BioMart returns transient HTTP 405 responses on a small but non-zero fraction of requests; the existing scripts treated any error as fatal and aborted the build pipeline.

Wrap each urlopen call in a 5-attempt retry loop with linearly increasing backoff (60 * attempt seconds) and raise the request timeout from 30 s to 60 s. On the fifth failed attempt the script still calls sys.exit(1) -- the contract for callers is unchanged; only the success rate against a flaky BioMart improves.

Applies the same pattern uniformly to:
- get_biotypes.py
- get_genes_descriptions.py
- get_genes_symbols.py
- get_hla2.py
- get_mtrna.py
- get_refseq_ensembl.py
- get_rrna.py
- get_trna.py

No business logic, output format, or parse rule is modified.
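The retry pattern described above can be sketched roughly as follows. This is a minimal illustration, not the patched code: the function name `fetch_with_retry` and the injectable `opener` and `sleep` parameters are assumptions added for testability, and the sketch uses Python 3's `urllib.request` where the scripts themselves target Python 2.7.

```python
import sys
import time
import urllib.request  # Python 3 stand-in; the Python 2.7 scripts would use urllib2


def fetch_with_retry(url, attempts=5, timeout=60, base_delay=60,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the response body, retrying transient failures.

    Hypothetical helper: the name and the opener/sleep hooks are not
    taken from the PR; they only make the sketch easy to exercise.
    """
    for attempt in range(1, attempts + 1):
        try:
            response = opener(url, timeout=timeout)
            try:
                return response.read()
            finally:
                response.close()
        except Exception as exc:
            sys.stderr.write("attempt %d/%d failed: %s\n" % (attempt, attempts, exc))
            if attempt == attempts:
                # Same contract as before the patch: a persistent failure is fatal.
                sys.exit(1)
            # Wait 60 s, 120 s, 180 s, ... before the next attempt.
            sleep(base_delay * attempt)
```

With the defaults, the waits between the five attempts are 60, 120, 180 and 240 seconds, so a fully failing request exits after roughly ten minutes of backoff plus the time spent in the requests themselves.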
…et_exons_positions

Both scripts issue one BioMart query to fetch a chromosome list, then one query per chromosome to fetch paralog or exon-coordinate data. The single try/except around each call previously aborted on the first transient error.

Wrap each query in its own 5-attempt retry loop with increasing backoff. Timeouts: 60 s for the chromosome-list query, 120 s for the per-chromosome queries.

On exhausted retries of any chromosome the script calls sys.exit(1) -- this is intentional. paralogs.txt and adjacent_genes.txt are used downstream as fusion-pair filters; a partial file (missing one or more chromosomes) would silently produce non-deterministic fusion-detection output across reference rebuilds of the same Ensembl version.
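A rough sketch of the fail-on-incomplete behaviour, under the assumption that each per-chromosome query can be expressed as a callable. The names `query_with_retry`, `collect_per_chromosome` and `fetch_chromosome` are hypothetical and do not come from the patched scripts.

```python
import sys
import time


def query_with_retry(query, attempts=5, base_delay=60, sleep=time.sleep):
    """Call query() until it succeeds; return (True, result) or (False, None)."""
    for attempt in range(1, attempts + 1):
        try:
            return True, query()
        except Exception as exc:
            sys.stderr.write("attempt %d/%d failed: %s\n" % (attempt, attempts, exc))
            if attempt < attempts:
                sleep(base_delay * attempt)
    return False, None


def collect_per_chromosome(chromosomes, fetch_chromosome, **retry_options):
    """Fetch rows for every chromosome, or abort; never return a partial result."""
    rows = []
    for chrom in chromosomes:
        ok, data = query_with_retry(lambda: fetch_chromosome(chrom), **retry_options)
        if not ok:
            # A file missing one chromosome would change downstream filtering,
            # so the whole run is aborted instead of skipping the chromosome.
            sys.stderr.write("ERROR: chromosome %s failed; no partial output\n" % chrom)
            sys.exit(1)
        rows.extend(data)
    return rows
```

The design point is that the output is written only after every chromosome has been collected, so a rebuild either produces the complete file or no file at all.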
The previous implementation called gzip.open(...).read() / .readlines() on the GENCODE GTF, which is ~3 GB uncompressed for the human v48 annotation. Holding the entire decompressed string in memory plus a second copy as a list of lines caused the script to OOM on machines with less than ~8 GB available to the Python 2.7 interpreter (a common situation in containerized builds). Replace with shutil.copyfileobj() streaming decompression to a temporary file, then iterate line by line. Output is bit-identical to the previous behaviour on machines where the previous code did not OOM. Also wrap the FTP fetch in a 5-attempt retry loop with reconnect on each attempt. ftplib previously aborted on any transient FTP error.
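The streaming approach can be illustrated with a short sketch. It assumes a local gzipped file path and Python 3; the helper name `iter_gtf_lines` is hypothetical, and the actual script may manage its temporary file differently.

```python
import gzip
import shutil
import tempfile


def iter_gtf_lines(gz_path):
    """Yield the lines of a gzipped GTF without holding the whole file in memory."""
    with tempfile.TemporaryFile() as plain:
        with gzip.open(gz_path, "rb") as compressed:
            # copyfileobj moves the data in fixed-size chunks, so peak memory
            # stays small regardless of the decompressed size.
            shutil.copyfileobj(compressed, plain)
        plain.seek(0)
        for raw in plain:
            yield raw.decode("utf-8")
```

Peak memory is then bounded by the copy buffer plus one line, at the cost of temporary disk space equal to the decompressed size.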
get_pcawg.py

ICGC DCC was decommissioned in 2024; PCAWG open-tier files were migrated to the ICGC ARGO icgc25k-open S3-compatible bucket. Update the default --server from https://dcc.icgc.org to https://object.genomeinformatics.org and the URL path accordingly. TSV schema is unchanged (26 columns, geneA->geneB at column 0, Ensembl IDs at columns 4 and 5); parser untouched. Also wrap the existing gzip.open(...).readlines() call in try/except so a future migration breakage (server returns HTML instead of gzip) produces an empty pcawg.txt rather than crashing the build.

get_cancer-genes.py

Bushman Lab restructured its resource directory; the old path /assets/doc/allOnco_May2018.tsv returns 404 and the file layout changed. Update to /export/geneLists/allOnco_June2021.tsv. The new file is a quoted TSV with a row-index column at position 0 and the gene symbol at position 1, so the parser is adjusted from split("\t")[0] to split("\t")[1].strip('"').

get_mitelman.py

The default --url pointed at storage.cloud.google.com (the GCS web console endpoint, which returns HTML) and silently produced an empty mitelman.txt. Switch to the actual data endpoint https://storage.googleapis.com/mitelman-data-files/prod/mitelman_db.zip (HTTP 200, ~21 MB). Separately, the Mitelman dataset moved its files from the mitelman_db/ subdirectory to the archive root. The lookup of MBCA.TXT.DATA inside the ZIP now tries both layouts via zf.getinfo() and warns instead of raising KeyError when neither is present.

get_chimerdb4.py

KOBIC consolidated their ChimerDB site and dropped the /chimerdb_mirror/ path segment; the live download endpoint is now /chimerdb/downloads. The same server only serves over HTTPS, so the default --server is updated from http://www.kobic.re.kr to https://www.kobic.re.kr (HTTP returns 500). Verified end-to-end against the live downloads: ChimerKB4.xlsx (1,951 fusions), ChimerPub4.xlsx (1,941 fusions), ChimerSeq4.xlsx (374,029 fusions). The parser was not touched -- it locates the H_GENE and T_GENE columns by header name, which the XLSX files preserve from previous releases.
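Two of the parser adjustments described above (the two-layout ZIP lookup in get_mitelman.py and the quoted-TSV column change in get_cancer-genes.py) might look roughly like this. The function names are hypothetical, and the sample row used afterwards is illustrative rather than copied from the real file.

```python
import sys
import zipfile


def find_member(zf, name, prefixes=("", "mitelman_db/")):
    """Return the ZipInfo for name under the first matching layout, else None."""
    for prefix in prefixes:
        try:
            return zf.getinfo(prefix + name)  # raises KeyError when absent
        except KeyError:
            continue
    sys.stderr.write("WARNING: %s not found in archive\n" % name)
    return None


def parse_gene_symbol(line):
    """Extract the gene symbol from a quoted TSV row (row index in column 0)."""
    return line.rstrip("\r\n").split("\t")[1].strip('"')
```

For a row such as `"1"\t"ABL1"`, `parse_gene_symbol` returns `ABL1`. Trying the archive root first matches the newer layout, with the `mitelman_db/` prefix kept as a fallback for older archives.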
…iles
The Ensembl MySQL directory filter at line 138 used to be:
not el.lower().endswith(".gz")
which was meant to exclude tarball files, but accidentally let the
homo_sapiens_core_*_37 directories (the GRCh37 assembly entries)
through. When the FTP listing happened to include those, the script
picked GRCh37 synonyms for a GRCh38 build, producing wrong gene
synonyms.
Change the filter to:
not el.lower().endswith("37")
so only the current assembly entries are kept.
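The effect of the two filters can be compared on a small listing. The directory names below are illustrative examples of the homo_sapiens_core_*_37 and _38 naming, not an actual FTP listing, and the function names are hypothetical.

```python
def old_filter(entries):
    """Previous behaviour: only drops names ending in .gz."""
    return [el for el in entries if not el.lower().endswith(".gz")]


def new_filter(entries):
    """Patched behaviour: drops the GRCh37 entries (names ending in 37)."""
    return [el for el in entries if not el.lower().endswith("37")]


# Illustrative listing; real directory names depend on the Ensembl release.
listing = ["homo_sapiens_core_114_37", "homo_sapiens_core_114_38"]

kept_before = old_filter(listing)  # both assemblies pass through
kept_after = new_filter(listing)   # only the GRCh38 entry remains
```

Note that the new predicate no longer excludes `.gz` names, so this sketch assumes, as the commit message implies, that such entries are not a concern at this point in the listing.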
…ersion

When --ftp-ensembl-path is supplied (e.g. /pub/release-114), derive the matching GENCODE release as ensembl_release - 66 and pass it to get_gencode.py as --release <N>. Without this, get_gencode.py always picks the latest release from ftp.ebi.ac.uk, which produces a gencode_genes.txt that does not match the requested Ensembl version (e.g. Ensembl v114 paired with GENCODE v49 instead of v48).

The derivation is skipped silently if the Ensembl path does not end in a parseable number, preserving previous behaviour for callers that omit --ftp-ensembl-path.
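The derivation amounts to a small piece of arithmetic on the trailing number of the path. A minimal sketch, assuming a hypothetical helper name `gencode_release_for` and tolerating one trailing slash (the latter is an assumption, not something the commit states):

```python
import re


def gencode_release_for(ensembl_path, offset=66):
    """Map an Ensembl FTP path to a GENCODE release number, or None."""
    if not ensembl_path:
        return None  # option omitted: keep the previous "latest release" behaviour
    match = re.search(r"(\d+)/?$", ensembl_path)
    if match is None:
        return None  # no trailing number: skip the derivation silently
    return int(match.group(1)) - offset
```

For `/pub/release-114` this gives 114 - 66 = 48, matching the Ensembl v114 and GENCODE v48 pairing mentioned above. When the function returns `None`, the caller would simply not pass `--release`.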
Build-pipeline resilience fixes (2026-05)
A focused set of fixes to make fusioncatcher-build.py finish cleanly against a current Ensembl/GENCODE release. All changes are confined to bin/get_*.py and bin/fusioncatcher-build.py; none of the runtime fusion-detection code is touched.

This PR partially addresses #189 ("Download data using docker is not working as expected") and the broader experience that the build pipeline has become brittle against transient BioMart errors and against several upstream resources that have moved or been decommissioned since the 1.33 release.

Tested with
- nfcore/base:1.7-derived image with the fusioncatcher-1.33 conda package.
- … (paralogs.txt, adjacent_genes.txt, biotypes.txt, synonyms.txt) are byte-identical to what an unpatched run produces when retries succeed by chance.
- pcawg.txt: 4,346 fusion entries
- cancer_genes.txt: 2,986 gene symbols (Bushman Lab June 2021 list)
- mitelman.txt: 202,423 entries (April 2026 ISB-CGC snapshot)
- chimerdb4kb.txt / chimerdb4pub.txt / chimerdb4seq.txt: 1,951 + 1,941 + 374,029 entries (KOBIC ChimerDB v4)

Commits (in order)
1. Add retry logic to BioMart-based download scripts: uniform 5-attempt loop with increasing backoff and a 60 s timeout across the 8 BioMart helpers (get_biotypes, get_genes_descriptions, get_genes_symbols, get_hla2, get_mtrna, get_refseq_ensembl, get_rrna, get_trna). Ensembl BioMart returns transient HTTP 405 on a small but non-zero fraction of requests; this turns them from build-aborting errors into a small delay.
2. Add per-chromosome retry and fail-on-incomplete in get_paralogs and get_exons_positions: same retry pattern but applied to both queries (chromosome list + per-chromosome). On exhausted retries the script calls sys.exit(1): paralogs.txt and adjacent_genes.txt are used downstream as fusion-pair filters, so a partial file would silently produce non-deterministic detection output across reference rebuilds of the same Ensembl version.
3. Stream GENCODE GTF decompression to avoid OOM: the human v48 GTF is ~3 GB uncompressed; gzip.open(...).read() followed by .readlines() held the entire string in memory plus a second copy as a list, OOM-ing on machines with less than ~8 GB available to the Python 2.7 interpreter. Replaced with shutil.copyfileobj() streaming + line-by-line iteration. Also adds the FTP retry pattern. Output is bit-identical to the previous behaviour on machines where it didn't OOM.
4. Update URLs for migrated and dead resources:
   - get_pcawg.py: dcc.icgc.org/api/v1/download?fn=/PCAWG/... is decommissioned; PCAWG open-tier data was migrated to the ICGC ARGO icgc25k-open bucket at https://object.genomeinformatics.org. TSV schema is unchanged; parser untouched. Also wraps the existing gzip parse in a try/except so a future migration breakage produces an empty file rather than crashing the build.
   - get_cancer-genes.py: the Bushman Lab moved its resource directory; /assets/doc/allOnco_May2018.tsv is dead, replaced by /export/geneLists/allOnco_June2021.tsv. The new file is a quoted TSV with a row-index column at position 0 and the gene symbol at position 1, so the parser is adjusted accordingly.
   - get_mitelman.py: the default URL pointed at storage.cloud.google.com (the GCS web console endpoint, returns HTML) and silently produced an empty mitelman.txt. Switched to the data endpoint storage.googleapis.com. Separately, the dataset moved its files from a mitelman_db/ subdirectory to the archive root; the ZIP lookup now tries both layouts.
   - get_chimerdb4.py: KOBIC consolidated their ChimerDB site and dropped the /chimerdb_mirror/ path segment; the live download endpoint is /chimerdb/downloads. The server only serves over HTTPS now (HTTP returns 500), so the default --server is updated from http://www.kobic.re.kr to https://www.kobic.re.kr. Parser untouched (locates H_GENE/T_GENE columns by header name).
5. Fix get_synonyms.py filter to exclude GRCh37 entries instead of .gz files: one-line fix to the Ensembl MySQL directory filter at line 138. The previous filter (not el.lower().endswith(".gz")) was meant to exclude tarballs but accidentally let homo_sapiens_core_*_37 (GRCh37) entries through, occasionally polluting a GRCh38 build with GRCh37 synonyms.
6. Auto-pin GENCODE release in fusioncatcher-build.py based on Ensembl version: when --ftp-ensembl-path is supplied (e.g. /pub/release-114), derive the matching GENCODE release as ensembl_release - 66 and pass it to get_gencode.py as --release <N>. Without this, get_gencode.py always picks the latest release from ftp.ebi.ac.uk, which produces a gencode_genes.txt that does not match the requested Ensembl version (e.g. Ensembl v114 paired with GENCODE v49 instead of v48). Skipped silently when the path is omitted or unparseable, preserving previous behaviour for those callers.

Out of scope (intentionally)
These were considered but deliberately not included to keep the PR focused on regressions and dead links:
- … oncokb/oncokb-public on GitHub in 2024; the official API requires an authentication token. Fixing this requires a credentials story, not just a URL update.
- … --skip-database cgp,cosmic workaround.

These can be addressed in follow-up PRs if there is interest.
Reviewer notes
- Each try / urlopen / except block is wrapped in a for attempt in xrange(5) loop. The logical change is small and uniform across the 8 files; reviewing one of them is sufficient to validate the others.
- get_paralogs.py (commit 2) intentionally does not silent-skip a chromosome on exhausted retries. An intermediate version of this patch did silent-skip, and the resulting paralogs.txt was non-deterministic across rebuilds; that behaviour was reverted.
- No requirements.txt or environment.yml changes are needed; everything uses the standard library plus what fusioncatcher-1.33 already depends on.

Suggestion for a future iteration (not part of this PR)
While working through these fixes I noticed that the build pipeline currently treats the ~30 resource downloads as a flat list: when one comes back empty, the user has no way to tell whether the missing file is a problem (because that file feeds the detection algorithm or is a hard filter) or a non-issue (because that file is a "seen in cohort X" enrichment label that the call can be made without).
It might be worth, at some point, surfacing this distinction to the user, for example as a one-line note in the end-of-build log saying "the following empty outputs do not affect fusion calling: …" versus "the following empty outputs do affect fusion calling and should be investigated: …". The contract for each get_*.py would not change; it would just be a summary helper. Happy to draft that as a separate small PR if you think it would be useful, but it's also fine to leave as-is; this PR is intentionally limited to the resilience and URL fixes.
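To make the suggestion concrete, here is one possible shape for such a summary helper. Everything in it is an assumption for illustration: the function name, its arguments, and in particular which files fall into which category, which this PR does not define and which the caller would have to supply.

```python
import os


def summarize_empty_outputs(out_dir, affects_calling, annotation_only):
    """Return end-of-build log lines describing empty or missing outputs.

    Hypothetical helper: both file-name lists are supplied by the caller;
    the classification itself is not defined anywhere in this PR.
    """
    def is_empty(name):
        path = os.path.join(out_dir, name)
        return not os.path.isfile(path) or os.path.getsize(path) == 0

    critical = sorted(n for n in affects_calling if is_empty(n))
    harmless = sorted(n for n in annotation_only if is_empty(n))

    lines = []
    if critical:
        lines.append("the following empty outputs do affect fusion calling "
                     "and should be investigated: " + ", ".join(critical))
    if harmless:
        lines.append("the following empty outputs do not affect fusion calling: "
                     + ", ".join(harmless))
    return lines
```

A caller might, for instance, pass `["paralogs.txt", "adjacent_genes.txt"]` as `affects_calling`, since those are described above as fusion-pair filters, and a list of cohort-label files as `annotation_only`; the latter list is purely a placeholder here.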