45 changes: 44 additions & 1 deletion README.md
@@ -3,7 +3,7 @@
<p align="center">
<img src="assets/model_overview.png" width="600" alt="TranscriptFormer Overview">
<br>
<em>Overview of TranscriptFormer pretraining data, model, outputs and downstream tasks.
<em>Overview of TranscriptFormer pretraining data (A), model (B), outputs (C) and downstream tasks (D).
</em>
</p>

@@ -112,6 +112,49 @@ The command will download and extract the following files to the `./checkpoints`
- `./checkpoints/tf_metazoa/`: Metazoa model weights
- `./checkpoints/all_embeddings/`: Embedding files for out-of-distribution species

#### Available Protein Embeddings

The following protein embeddings are available for download with `transcriptformer download all-embeddings`:

| Scientific Name | Common Name | TF-Metazoa | TF-Exemplar | TF-Sapiens | Notes |
|-----------------|-------------|------------|-------------|------------|-------|
| *Homo sapiens* | Human | ✓ | ✓ | ✓ | Primary training species |
| *Mus musculus* | Mouse | ✓ | ✓ | - | Model organism |
| *Danio rerio* | Zebrafish | ✓ | ✓ | - | Model organism |
| *Drosophila melanogaster* | Fruit fly | ✓ | ✓ | - | Model organism |
| *Caenorhabditis elegans* | C. elegans | ✓ | ✓ | - | Model organism |
| *Oryctolagus cuniculus* | Rabbit | ✓ | - | - | Vertebrate |
| *Gallus gallus* | Chicken | ✓ | - | - | Vertebrate |
| *Xenopus laevis* | African clawed frog | ✓ | - | - | Vertebrate |
| *Lytechinus variegatus* | Sea urchin | ✓ | - | - | Invertebrate |
| *Spongilla lacustris* | Freshwater sponge | ✓ | - | - | Invertebrate |
| *Saccharomyces cerevisiae* | Yeast | ✓ | - | - | Fungus |
| *Plasmodium falciparum* | Malaria parasite | ✓ | - | - | Protist |
| *Rattus norvegicus* | Rat | - | - | - | Out-of-distribution |
| *Sus scrofa* | Pig | - | - | - | Out-of-distribution |
| *Pan troglodytes* | Chimpanzee | - | - | - | Out-of-distribution |
| *Gorilla gorilla* | Gorilla | - | - | - | Out-of-distribution |
| *Macaca mulatta* | Rhesus macaque | - | - | - | Out-of-distribution |
| *Callithrix jacchus* | Marmoset | - | - | - | Out-of-distribution |
| *Xenopus tropicalis* | Western clawed frog | - | - | - | Out-of-distribution |
| *Ornithorhynchus anatinus* | Platypus | - | - | - | Out-of-distribution |
| *Monodelphis domestica* | Opossum | - | - | - | Out-of-distribution |
| *Heterocephalus glaber* | Naked mole-rat | - | - | - | Out-of-distribution |
| *Petromyzon marinus* | Sea lamprey | - | - | - | Out-of-distribution |
| *Stylophora pistillata* | Coral | - | - | - | Out-of-distribution |

**Legend:**
- ✓ = Species included in model training data
- \- = Species not included in model training (out-of-distribution)

### Generating Protein Embeddings for New Species

The pre-generated embeddings cover the most commonly used species. If you need to work with a species not included in the downloaded embeddings, you can generate protein embeddings using the ESM-2 models.

**Note**: This is only necessary for new species that don't have pre-generated embeddings available for download.

For detailed instructions on generating protein embeddings for additional species, see the [preprocess/README.md](preprocess/README.md) documentation.

### Downloading Training Datasets

Use the CLI to download single-cell RNA sequencing datasets from the CellxGene Discover portal:
194 changes: 194 additions & 0 deletions preprocess/README.md
@@ -0,0 +1,194 @@
# Protein Embeddings Generation

This directory contains scripts for generating protein embeddings with Facebook's ESM-2 (Evolutionary Scale Modeling) models. The pipeline downloads protein sequences from Ensembl, runs them through pre-trained ESM-2 models, and writes gene-level embeddings suitable as input to TranscriptFormer.

## Overview

The protein embedding pipeline consists of three main components:

1. **`protein_embedding.py`** - Main script for generating protein embeddings using ESM-2 models
2. **`get_stable_id_mapping.py`** - Utility functions for mapping between gene, transcript, and protein stable IDs
3. **`fasta_manifest_pep.json`** - Configuration file containing download URLs for protein FASTA files from Ensembl

## Installation

### Using pip (traditional)

Install the required dependencies:

```bash
pip install -r requirements.txt
pip install fair-esm
```

For GPU acceleration (recommended):
```bash
# For CUDA-enabled PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

### Using uv (recommended)

[uv](https://github.com/astral-sh/uv) is a fast Python package installer and resolver:

```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv protein-embeddings
source protein-embeddings/bin/activate # On Windows: protein-embeddings\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt
uv pip install fair-esm

# For GPU acceleration
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

### System Requirements

- **Memory**: At least 16GB RAM (32GB+ recommended for large models)
- **GPU**: NVIDIA GPU with 8GB+ VRAM (optional but highly recommended)
- **Storage**: Several GB for downloaded FASTA files and generated embeddings
- **Network**: Internet connection for downloading protein sequences from Ensembl

## Usage

### Basic Usage

Generate protein embeddings for a single species:

```bash
python protein_embedding.py --organism_key homo_sapiens
```

### Advanced Usage

```bash
python protein_embedding.py \
  --organism_key mus_musculus \
  --output_dir /path/to/output \
  --batch_size 32 \
  --use_large_model true
```

### Command Line Arguments

- `--organism_key`: Species to process (see [Supported Species](#supported-species))
- `--output_dir`: Directory to save embeddings (default: current directory `./`)
- `--batch_size`: Batch size for processing (default: 16)
- `--use_large_model`: Use ESM-2 15B parameter model instead of 3B (default: false)
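
As a rough illustration of how these flags fit together, the sketch below parses them with `argparse`. It is not the actual parser in `protein_embedding.py`: `str_to_bool` and `build_parser` are hypothetical names, treating `--organism_key` as required is an assumption, and only the flag names and defaults are taken from the list above.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Hypothetical helper: interpret 'true'/'false' style strings."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Assumed reconstruction of the CLI, using the documented defaults."""
    parser = argparse.ArgumentParser(description="Generate ESM-2 protein embeddings")
    parser.add_argument("--organism_key", required=True)  # assumption: required
    parser.add_argument("--output_dir", default="./")
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--use_large_model", type=str_to_bool, default=False)
    return parser


# Parse the same flags as the "Advanced Usage" example, minus --output_dir.
args = build_parser().parse_args(
    ["--organism_key", "mus_musculus", "--batch_size", "32", "--use_large_model", "true"]
)
print(args.organism_key, args.output_dir, args.batch_size, args.use_large_model)
```

Omitted flags fall back to the documented defaults, so here `output_dir` stays at `./`.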


## Supported Species

The pipeline supports the following species (from Ensembl releases 110 and 113, plus Ensembl Genomes release 60 for *Stylophora pistillata*):

| Species | Organism Key | Common Name |
|---------|-------------|-------------|
| Homo sapiens | `homo_sapiens` | Human |
| Mus musculus | `mus_musculus` | Mouse |
| Rattus norvegicus | `rattus_norvegicus` | Rat |
| Sus scrofa | `sus_scrofa` | Pig |
| Oryctolagus cuniculus | `oryctolagus_cuniculus` | Rabbit |
| Macaca mulatta | `macaca_mulatta` | Rhesus macaque |
| Pan troglodytes | `pan_troglodytes` | Chimpanzee |
| Gorilla gorilla | `gorilla_gorilla` | Gorilla |
| Callithrix jacchus | `callithrix_jacchus` | Marmoset |
| Microcebus murinus | `microcebus_murinus` | Mouse lemur |
| Gallus gallus | `gallus_gallus` | Chicken |
| Danio rerio | `danio_rerio` | Zebrafish |
| Xenopus tropicalis | `xenopus_tropicalis` | Western clawed frog |
| Drosophila melanogaster | `drosophila_melanogaster` | Fruit fly |
| Petromyzon marinus | `petromyzon_marinus` | Sea lamprey |
| Ornithorhynchus anatinus | `ornithorhynchus_anatinus` | Platypus |
| Monodelphis domestica | `monodelphis_domestica` | Opossum |
| Heterocephalus glaber | `heterocephalus_glaber` | Naked mole-rat |
| Stylophora pistillata | `stylophora_pistillata` | Coral |
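
Each organism key in the table is a top-level key of `fasta_manifest_pep.json`, and the release a species comes from can be read off its download URL. The sketch below shows one way to do that; `release_of` is a hypothetical helper, and the manifest is inlined as a three-entry excerpt so the example runs on its own.

```python
import json
import re

# Excerpt of fasta_manifest_pep.json (three of its entries).
MANIFEST_EXCERPT = """
{
  "homo_sapiens": {"fa": "https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz"},
  "rattus_norvegicus": {"fa": "https://ftp.ensembl.org/pub/release-113/fasta/rattus_norvegicus/pep/Rattus_norvegicus.mRatBN7.2.pep.all.fa.gz"},
  "stylophora_pistillata": {"fa": "https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-60/metazoa/fasta/stylophora_pistillata_gca002571385v1/pep/Stylophora_pistillata_gca002571385v1.Stylophora_pistillata_v1.pep.all.fa.gz"}
}
"""


def release_of(url: str) -> str:
    """Extract the number from the 'release-N' path component of an Ensembl FTP URL."""
    match = re.search(r"/release-(\d+)/", url)
    return match.group(1) if match else "unknown"


manifest = json.loads(MANIFEST_EXCERPT)
releases = {key: release_of(entry["fa"]) for key, entry in manifest.items()}
for key, release in sorted(releases.items()):
    print(f"{key}: release {release}")
```

To inspect the full manifest, replace the excerpt with `json.load(open("fasta_manifest_pep.json"))`.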

## Adding New Species

To add support for a new species, you need to update the `fasta_manifest_pep.json` file with the appropriate Ensembl download URLs.

### Step 1: Find Ensembl URLs

1. Visit the [Ensembl FTP site](https://ftp.ensembl.org/pub/), or the [Ensembl Genomes FTP site](https://ftp.ebi.ac.uk/ensemblgenomes/pub/) for non-vertebrates
2. Navigate to the latest release (e.g., `release-113/`)
3. Find your species under `fasta/{species_name}/pep/` (on Ensembl Genomes the path starts with a division, e.g., `metazoa/fasta/{species_name}/pep/`)
4. Copy the URL for the `.pep.all.fa.gz` file

### Step 2: Update the Manifest

Add an entry to `fasta_manifest_pep.json`:

```json
{
  "new_species_name": {
    "fa": "https://ftp.ensembl.org/pub/release-113/fasta/new_species/pep/New_species.Assembly.pep.all.fa.gz"
  }
}
```

### Step 3: Generate Embeddings

```bash
python protein_embedding.py --organism_key new_species_name
```

### Example: Adding Sheep (Ovis aries)

```json
{
  "ovis_aries": {
    "fa": "https://ftp.ensembl.org/pub/release-113/fasta/ovis_aries/pep/Ovis_aries_rambouillet.ARS-UI_Ramb_v2.0.pep.all.fa.gz"
  }
}
```

### Notes

- Use lowercase with underscores for organism keys (e.g., `ovis_aries`)
- Ensure the FASTA file contains protein sequences (`.pep.` not `.cdna.` or `.dna.`)
- Some species may be in Ensembl Genomes rather than main Ensembl
- Check that the assembly and release versions are current
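
The first two notes can be checked automatically before running the pipeline. The sketch below is a hypothetical helper (`validate_manifest_entry` is not part of the repository) that applies the key-naming and `.pep.` rules to a single manifest entry. It inspects only the URL string and does not contact Ensembl, so it cannot tell whether the release or assembly is current.

```python
import re

# Lowercase words (letters or digits) joined by underscores, e.g. "ovis_aries".
KEY_PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)+$")


def validate_manifest_entry(key: str, entry: dict) -> list[str]:
    """Return a list of problems found in one manifest entry (empty if none)."""
    problems = []
    if not KEY_PATTERN.match(key):
        problems.append(f"organism key {key!r} should be lowercase with underscores")
    url = entry.get("fa", "")
    file_name = url.rsplit("/", 1)[-1]
    if not url:
        problems.append("missing 'fa' URL")
    elif ".pep." not in file_name:
        problems.append("FASTA file name should contain '.pep.' (protein sequences)")
    elif not file_name.endswith(".fa.gz"):
        problems.append("expected a gzipped FASTA file ending in '.fa.gz'")
    return problems


sheep = {
    "fa": "https://ftp.ensembl.org/pub/release-113/fasta/ovis_aries/pep/"
          "Ovis_aries_rambouillet.ARS-UI_Ramb_v2.0.pep.all.fa.gz"
}
print(validate_manifest_entry("ovis_aries", sheep))

# A badly named key pointing at a cDNA file triggers both checks.
bad = {"fa": "https://example.org/Ovis_aries.cdna.all.fa.gz"}
for problem in validate_manifest_entry("Ovis aries", bad):
    print("-", problem)
```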

## Output Format

The script generates embeddings in HDF5 format with the following structure:

```python
import h5py

# Load embeddings. All reads stay inside the `with` block, because
# h5py objects become invalid once the file is closed.
with h5py.File('homo_sapiens_gene.h5', 'r') as f:
    keys = f['keys'][:]       # Gene IDs
    embeddings = f['arrays']  # Group containing embedding arrays

    # Access specific gene embedding
    gene_id = 'ENSG00000139618'  # Example: BRCA2
    embedding = embeddings[gene_id][:]  # Shape: (2560,) for ESM-2 3B model
```

### Output Files

- **Standard model**: `{organism}_gene.h5` (d=2560, TranscriptFormer default)
- **Large model**: `{organism}_gene_large.h5` (d=5120, UCE default)
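
The naming convention above could be expressed as a small helper. This is an illustrative sketch rather than code from `protein_embedding.py`: `output_filename` and `EMBEDDING_DIM` are hypothetical names, and the dimensions are the ones listed above.

```python
from pathlib import Path

# Embedding dimension per model variant, keyed by the --use_large_model flag.
EMBEDDING_DIM = {False: 2560, True: 5120}


def output_filename(organism: str, use_large_model: bool = False, output_dir: str = "./") -> Path:
    """Build the expected HDF5 path for an organism (assumed naming scheme)."""
    suffix = "_gene_large.h5" if use_large_model else "_gene.h5"
    return Path(output_dir) / f"{organism}{suffix}"


print(output_filename("homo_sapiens"), EMBEDDING_DIM[False])
print(output_filename("mus_musculus", use_large_model=True, output_dir="/path/to/output"), EMBEDDING_DIM[True])
```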

## Pipeline Details

### Processing Steps

1. **Download**: Automatically downloads protein FASTA files from Ensembl FTP
2. **Parse**: Extracts gene IDs from protein sequence headers
3. **Deduplicate**: Removes duplicate sequences for the same gene
4. **Clean**: Replaces invalid amino acid symbols (`*`) with `<unk>` tokens
5. **Embed**: Generates embeddings using ESM-2 model (layer 33 for 3B, layer 48 for 15B)
6. **Average**: Averages embeddings across all protein isoforms per gene
7. **Save**: Stores final gene-level embeddings in HDF5 format
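
To make steps 2 to 6 concrete, here is a toy sketch in plain Python. It is not the repository's implementation: the function names are hypothetical, the header parsing assumes the `gene:<ID>` field of Ensembl peptide FASTA headers, dropping the version suffix from gene IDs is an assumption, and a stand-in `fake_embed` function replaces the ESM-2 forward pass so the example runs without a GPU or model weights.

```python
from collections import defaultdict

# Toy FASTA: gene ENSG0001 has three isoforms (two identical), ENSG0002 has one.
FASTA = """>ENSP0001.1 pep gene:ENSG0001.3 transcript:ENST0001.1
MKT*A
>ENSP0002.1 pep gene:ENSG0001.3 transcript:ENST0002.1
MKT*A
>ENSP0003.1 pep gene:ENSG0001.3 transcript:ENST0003.1
MKV
>ENSP0004.1 pep gene:ENSG0002.1 transcript:ENST0004.1
GG
"""


def parse_fasta(text):
    """Yield (gene_id, sequence) pairs; the gene ID comes from the 'gene:' header field."""
    gene_id, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if gene_id is not None:
                yield gene_id, "".join(chunks)
            field = next(f for f in line.split() if f.startswith("gene:"))
            gene_id = field[len("gene:"):].split(".")[0]  # drop version suffix (assumption)
            chunks = []
        elif line.strip():
            chunks.append(line.strip())
    if gene_id is not None:
        yield gene_id, "".join(chunks)


def clean(sequence):
    """Replace the invalid '*' symbol with the <unk> token."""
    return sequence.replace("*", "<unk>")


def fake_embed(sequence):
    """Stand-in for ESM-2: a 2-dimensional 'embedding' derived from the string."""
    return [float(len(sequence)), float(sequence.count("K"))]


def gene_embeddings(text):
    # Deduplicate: keep each distinct sequence once per gene.
    unique = defaultdict(set)
    for gene_id, seq in parse_fasta(text):
        unique[gene_id].add(seq)
    # Clean and embed each isoform, then average element-wise per gene.
    result = {}
    for gene_id, seqs in unique.items():
        vectors = [fake_embed(clean(s)) for s in sorted(seqs)]
        result[gene_id] = [sum(column) / len(vectors) for column in zip(*vectors)]
    return result


print(gene_embeddings(FASTA))
```

In the real pipeline the per-isoform vectors would be the ESM-2 representations, and the averaged result is what gets written to the HDF5 file in step 7.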

### Models Used

- **ESM-2 3B** (`esm2_t36_3B_UR50D`): 36-layer, 3 billion parameter model
- **ESM-2 15B** (`esm2_t48_15B_UR50D`): 48-layer, 15 billion parameter model
63 changes: 63 additions & 0 deletions preprocess/fasta_manifest_pep.json
@@ -0,0 +1,63 @@
{
"homo_sapiens": {
"fa": "https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz"
},
"mus_musculus": {
"fa": "https://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/pep/Mus_musculus.GRCm39.pep.all.fa.gz"
},
"danio_rerio": {
"fa": "https://ftp.ensembl.org/pub/release-110/fasta/danio_rerio/pep/Danio_rerio.GRCz11.pep.all.fa.gz"
},
"callithrix_jacchus": {
"fa": "https://ftp.ensembl.org/pub/release-110/fasta/callithrix_jacchus/pep/Callithrix_jacchus.mCalJac1.pat.X.pep.all.fa.gz"
},
"gorilla_gorilla": {
"fa": "https://ftp.ensembl.org/pub/release-110/fasta/gorilla_gorilla/pep/Gorilla_gorilla.gorGor4.pep.all.fa.gz"
},
"macaca_mulatta": {
"fa": "https://ftp.ensembl.org/pub/release-110/fasta/macaca_mulatta/pep/Macaca_mulatta.Mmul_10.pep.all.fa.gz"
},
"sus_scrofa": {
"fa": "https://ftp.ensembl.org/pub/release-110/fasta/sus_scrofa/pep/Sus_scrofa.Sscrofa11.1.pep.all.fa.gz"
},
"pan_troglodytes": {
"fa": "https://ftp.ensembl.org/pub/release-110/fasta/pan_troglodytes/pep/Pan_troglodytes.Pan_tro_3.0.pep.all.fa.gz"
},
"gallus_gallus": {
"fa": "https://ftp.ensembl.org/pub/release-113/fasta/gallus_gallus/pep/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.pep.all.fa.gz"
},
"heterocephalus_glaber": {
"fa": "https://ftp.ensembl.org/pub/release-113/fasta/heterocephalus_glaber_female/pep/Heterocephalus_glaber_female.Naked_mole-rat_maternal.pep.all.fa.gz"
},
"monodelphis_domestica": {
"fa": "https://ftp.ensembl.org/pub/release-113/fasta/monodelphis_domestica/pep/Monodelphis_domestica.ASM229v1.pep.all.fa.gz"
},
"drosophila_melanogaster": {
"fa": "https://ftp.ensembl.org/pub/release-113/fasta/drosophila_melanogaster/pep/Drosophila_melanogaster.BDGP6.46.pep.all.fa.gz"
},
"ornithorhynchus_anatinus": {
"fa": "https://ftp.ensembl.org/pub/release-113/fasta/ornithorhynchus_anatinus/pep/Ornithorhynchus_anatinus.mOrnAna1.p.v1.pep.all.fa.gz",
"gff3": "https://ftp.ensembl.org/pub/release-113/gff3/ornithorhynchus_anatinus/Ornithorhynchus_anatinus.mOrnAna1.p.v1.113.gff3.gz"
},
"oryctolagus_cuniculus": {
"fa": "https://ftp.ensembl.org/pub/release-113/fasta/oryctolagus_cuniculus/pep/Oryctolagus_cuniculus.OryCun2.0.pep.all.fa.gz",
"gff3": "https://ftp.ensembl.org/pub/release-113/gff3/oryctolagus_cuniculus/Oryctolagus_cuniculus.OryCun2.0.113.gff3.gz"
},
"xenopus_tropicalis": {
"fa": "https://ftp.ensembl.org/pub/release-113/fasta/xenopus_tropicalis/pep/Xenopus_tropicalis.UCB_Xtro_10.0.pep.all.fa.gz",
"gff3": "https://ftp.ensembl.org/pub/release-113/gff3/xenopus_tropicalis/Xenopus_tropicalis.UCB_Xtro_10.0.113.gff3.gz"
},
"microcebus_murinus": {
"fa": "https://ftp.ensembl.org/pub/release-113/fasta/microcebus_murinus/pep/Microcebus_murinus.Mmur_3.0.pep.all.fa.gz",
"gff3": "https://ftp.ensembl.org/pub/release-113/gff3/microcebus_murinus/Microcebus_murinus.Mmur_3.0.113.chr.gff3.gz"
},
"stylophora_pistillata": {
"fa": "https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-60/metazoa/fasta/stylophora_pistillata_gca002571385v1/pep/Stylophora_pistillata_gca002571385v1.Stylophora_pistillata_v1.pep.all.fa.gz"
},
"petromyzon_marinus": {
"fa": "https://ftp.ensembl.org/pub/release-113/fasta/petromyzon_marinus/pep/Petromyzon_marinus.Pmarinus_7.0.pep.all.fa.gz"
},
"rattus_norvegicus": {
"fa": "https://ftp.ensembl.org/pub/release-113/fasta/rattus_norvegicus/pep/Rattus_norvegicus.mRatBN7.2.pep.all.fa.gz"
}
}