feat: added ESM-2 preprocessing pipeline #49

Merged
jdpearce4 merged 5 commits into main from resolves-45-emb-preproc
Aug 11, 2025

Conversation

@jdpearce4
Collaborator

Resolves #45

Changes

  • Added the preprocessing pipeline from the internal codebase.
  • Added instructions to the main README.md.
  • Created a preprocessing README.md.
  • Replaced the wget and gunzip dependencies with Python modules.
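
As an illustration of the last bullet, here is a minimal sketch of how wget and gunzip can be replaced with the Python standard library. The function names `download_file` and `gunzip_file` are hypothetical and are not taken from this PR; the pipeline may use different modules or helpers.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path) -> Path:
    """Stream a URL to disk (a stand-in for `wget`). Hypothetical helper."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # copyfileobj streams in chunks, so large FASTA archives are not held in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def gunzip_file(src: Path, dest: Path) -> Path:
    """Decompress a .gz file to `dest` (a stand-in for `gunzip`). Hypothetical helper."""
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dest
```

Unlike `gunzip`, this sketch leaves the compressed file in place; deleting it afterwards would be a separate step.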

Testing

  • Tested human ESM-2 embedding generation with defaults.
  • Tested embedding generation for a new species (sheep) with defaults.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds a comprehensive ESM-2 preprocessing pipeline for generating protein embeddings from FASTA files. The implementation replaces shell-based dependencies (wget/gunzip) with pure Python modules and provides a complete workflow for downloading, processing, and converting protein sequences to gene-level embeddings.
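
The overview does not say how per-protein embeddings are combined into gene-level embeddings. One plausible approach, shown here purely as an assumption, is to average the embeddings of all proteins that map to the same gene. The function name `mean_pool_by_gene` and the dictionary shapes are hypothetical and may not match what `protein_embedding.py` does.

```python
from collections import defaultdict


def mean_pool_by_gene(protein_embeddings, protein_to_gene):
    """Average protein vectors per gene (assumed aggregation, not confirmed by the PR).

    protein_embeddings: dict of protein ID -> list of floats (all the same length)
    protein_to_gene:    dict of protein ID -> gene ID
    """
    grouped = defaultdict(list)
    for protein_id, vector in protein_embeddings.items():
        gene_id = protein_to_gene.get(protein_id)
        if gene_id is not None:  # skip proteins with no known gene mapping
            grouped[gene_id].append(vector)
    # zip(*vectors) walks the vectors dimension by dimension.
    return {
        gene_id: [sum(column) / len(vectors) for column in zip(*vectors)]
        for gene_id, vectors in grouped.items()
    }
```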

Key changes:

  • Complete preprocessing pipeline with ESM-2 model integration for protein embedding generation
  • Python-based file downloading and decompression replacing shell dependencies
  • Support for 20+ species with extensible configuration through JSON manifest
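
To make the manifest-driven configuration in the last bullet concrete, here is a small sketch of a species lookup. The schema (species name mapped directly to a URL), the placeholder `example.org` URLs, and the names `load_manifest` and `fasta_url` are all assumptions for illustration; the actual layout of `fasta_manifest_pep.json` is not shown in this PR page.

```python
import json

# Hypothetical manifest layout: species name -> URL of a gzipped peptide FASTA.
# The URLs are placeholders, not real Ensembl download links.
EXAMPLE_MANIFEST = """
{
  "human": "https://example.org/human.pep.all.fa.gz",
  "sheep": "https://example.org/sheep.pep.all.fa.gz"
}
"""


def load_manifest(text: str) -> dict:
    """Parse the manifest JSON into a dict."""
    return json.loads(text)


def fasta_url(manifest: dict, species: str) -> str:
    """Return the download URL for a species, with a readable error if it is missing."""
    try:
        return manifest[species]
    except KeyError:
        known = ", ".join(sorted(manifest))
        raise ValueError(f"Unknown species {species!r}; known species: {known}") from None
```

Under this assumed schema, supporting another species would only require adding one entry to the JSON file, with no code changes.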

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| preprocess/requirements.txt | Dependencies for ESM-2 preprocessing pipeline |
| preprocess/protein_embedding.py | Main script for generating protein embeddings using ESM-2 models |
| preprocess/get_stable_id_mapping.py | Utility functions for mapping gene/protein/transcript stable IDs |
| preprocess/fasta_manifest_pep.json | Configuration manifest with Ensembl download URLs for supported species |
| preprocess/README.md | Comprehensive documentation for the preprocessing pipeline |
| README.md | Updated main README with protein embedding availability table and preprocessing instructions |

Comments suppressed due to low confidence (2)

preprocess/requirements.txt:4

  • The biopython dependency is missing a version specification. Consider pinning to a specific version for reproducibility, e.g., 'biopython==1.81'.
biopython

preprocess/requirements.txt:5

  • The h5py dependency is missing a version specification. Consider pinning to a specific version for reproducibility, e.g., 'h5py==3.9.0'.
h5py

jdpearce4 and others added 4 commits July 28, 2025 16:04
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@jdpearce4 jdpearce4 merged commit 181193c into main Aug 11, 2025
5 checks passed
@jdpearce4 jdpearce4 deleted the resolves-45-emb-preproc branch August 11, 2025 20:44

Development

Successfully merging this pull request may close these issues.

Gene embeddings code
