Positive-Unlabeled Learning for Predicting Small Molecule MS2 Identifiability from MS1 Context and Acquisition Parameters
This repository contains the complete pipeline for training a model that predicts the identifiability of MS2 spectra based on provided MS1 spectra and instrument configurations used to generate consecutive MS2 scans.
The approach utilizes Positive-Unlabeled (PU) Learning to train models using only positive examples (library-matched spectra) and unlabeled data. It features a Transformer-based architecture (using the depthcharge library) to encode spectra and incorporates acquisition parameters as features. To handle large-scale spectral data efficiently, the project utilizes the Lance data format.
```
spectral_quality_assessment/
├── README.md                     # This file
├── requirements.txt              # Python dependencies
├── environment.yml               # Conda environment specification
│
├── src/                          # Core model implementations
│   └── transformers/
│       ├── model_bce_loss_one_hot.py                   # BCE loss model (polarity-aware)
│       └── model_nn_pu_loss_detach_diff_polarity.py    # nnPU loss model
│
├── scripts/                      # Python scripts for each pipeline step
│   ├── data_download/
│   │   └── msv_download_datasets.py          # Download datasets from MassIVE using CSV metadata
│   │
│   ├── data_preprocessing/
│   │   ├── split_library.py                  # Split GNPS library by polarity
│   │   ├── process_raw.py                    # Convert .raw → .mzML, run ScanHeadsman
│   │   ├── library_matching_diff_polarity.py # GNPS library matching
│   │   ├── data_processing_pipeline.py       # Complete data processing pipeline
│   │   └── create_lance_add_one_hot.py       # Create Lance dataset
│   │
│   ├── training/
│   │   ├── training_bce_loss_diff_polarity_one_hot.py  # Train BCE models
│   │   └── training_nn_pu_loss_detach_diff_polarity.py # Train nnPU model
│   │
│   └── inference/
│       ├── predict_lance_all.py                    # Run predictions
│       └── predict_lance_diff_polarity_one_hot.py  # Polarity-specific predictions
│
├── slurm_scripts/                # Cluster job submission scripts
│   ├── data_download/
│   │   └── msv_download.sh
│   ├── data_preprocessing/
│   │   ├── run_process_raw.sh            # Convert raw files to mzML
│   │   ├── library_matching.sh           # Run library matching
│   │   ├── run_processing_pipeline.sh    # Complete processing pipeline
│   │   └── run_build_lance.sh            # Build Lance datasets
│   ├── training/
│   │   ├── run_train_bce_loss_diff_polarity.sh
│   │   └── run_train_nnpu_loss.sh
│   └── inference/
│       ├── run_predict_lance.sh
│       └── run_predict_lance_val.sh
│
├── checkpoints/                  # Pre-trained model checkpoints (download from Zenodo)
│   └── README.md                 # Download instructions
│
├── tools/                        # External tools (download separately)
│   └── README.md                 # Installation guide for ThermoRawFileParser & ScanHeadsman
│
├── data/                         # Data and metadata
│   ├── README.md                 # Data directory documentation
│   ├── metadata/                 # Dataset metadata (in repo)
│   │   ├── train_datasets.csv
│   │   ├── val_datasets.csv
│   │   ├── test_1_metadata.csv
│   │   ├── test_2_metadata.csv
│   │   └── test_3_metadata.csv
│   ├── libraries/                # GNPS libraries (download & split)
│   │   └── README.md             # Download & split instructions
│   ├── file_paths/               # [User-created] Lists of local file paths
│   │   ├── file_paths_train.txt
│   │   └── file_paths_val.txt
│   ├── lance_datasets/           # [External] Training & validation Lance data (download from Zenodo)
│   ├── lance_data_test_set_1/    # [External] Test Set 1 (download from Zenodo)
│   ├── lance_data_test_set_2/    # [External] Test Set 2 (download from Zenodo)
│   └── lance_data_test_set_3/    # [External] Test Set 3 (download from Zenodo)
│
└── docs/                         # Detailed documentation
    ├── DATA_PREPROCESSING.md     # Preprocessing pipeline
    ├── TRAINING.md               # Model training guide
    └── INFERENCE.md              # Running predictions
```
- Python 3.11+
- CUDA 12.8+ (for GPU training)
- Conda
- Access to a computing cluster (recommended for full pipeline)
Note: This project has been tested on a Linux HPC cluster, where the model was trained using 2 GPUs. Training and inference have also been tested on macOS.
```shell
# Clone the repository
git clone https://github.com/bittremieuxlab/pu_ms2_identifiability.git
cd pu_ms2_identifiability

# Create and activate the conda environment
# For Linux HPC clusters:
conda env create -f environment.yml
# For macOS (training and inference):
conda env create -f environment-mac.yml

conda activate instrument_setting
```

Note: Use `environment.yml` for Linux HPC cluster environments. Use `environment-mac.yml` for testing training and inference on macOS.
External Tools:

If you intend to process raw data (convert `.raw` files to the Lance format used by the model), you must install the following tools in the `tools/` directory. If you only plan to use the pre-processed data from Zenodo, these are not required.

- ThermoRawFileParser: for converting `.raw` to `.mzML`.
- ScanHeadsman: for extracting MS1 spectra.
Please refer to tools/README.md for installation instructions.
Note: External tools are not required if you use the pre-processed data from Zenodo.
All datasets and pre-trained models are hosted on Zenodo.
Download the checkpoints to the checkpoints/ directory.
- nnPU Model (Recommended): The final model trained with non-negative PU loss.
- BCE Models: Polarity-specific models used for prior estimation.
If you wish to reproduce the training or testing results without processing raw files, download the pre-processed Lance datasets.
- Training/Validation Data: `lance_data_train_validation.tar.gz`
- Test Sets: `lance_data_test_set_1.tar.gz`, `lance_data_test_set_2.tar.gz`, `lance_data_test_set_3.tar.gz`
Download Datasets (Zenodo Link)
For detailed inference instructions, see docs/INFERENCE.md.
You can run the model on the provided test set or on your own custom data.
Note: Inference is supported on both Linux HPC clusters and macOS. The standalone inference script `scripts/inference/predict_lance_all.py` has been tested on both platforms (Linux HPC cluster with 1 GPU, and macOS 15.6.1 (24G90)).
To evaluate the model on the provided Test Set 3 (Lance format downloaded from Zenodo: lance_data_test_set_3):
```shell
sbatch slurm_scripts/inference/run_predict_lance.sh
```

Note: Before running, edit `slurm_scripts/inference/run_predict_lance.sh` to configure:
- Paths to checkpoint and dataset
- Output directory
- Batch size and other parameters
To run the model on your own data, you must first convert your .raw or .mzML files into the Lance format required by the model.
1. Preprocess Data: Follow the instructions in `docs/DATA_PREPROCESSING.md` to generate a Lance dataset from your files.
2. Run Prediction: Edit `slurm_scripts/inference/run_predict_lance.sh` to point to your custom dataset, then run:

```shell
sbatch slurm_scripts/inference/run_predict_lance.sh
```

Or run directly with Python:

```shell
python scripts/inference/predict_lance_all.py \
    --checkpoint_path checkpoints/best_model_nnpu.ckpt \
    --lance_path path/to/your/custom_lance_dataset/test_data \
    --output_csv your_results.csv
```
Output: The script generates a CSV containing the `original_index`, `probability` (quality score), `mzml_filepath`, and `scan_number` columns.
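The output CSV can then be filtered to keep only scans predicted as identifiable. The rows and the 0.5 threshold below are illustrative stand-ins; only the column names match the script's actual output:

```python
import pandas as pd

# Illustrative stand-in rows; in practice load the script's output with
# pd.read_csv("your_results.csv"). Only the column names match the real CSV.
results = pd.DataFrame({
    "original_index": [0, 1, 2],
    "probability": [0.92, 0.15, 0.60],
    "mzml_filepath": ["run_a.mzML", "run_a.mzML", "run_b.mzML"],
    "scan_number": [101, 205, 87],
})

# Keep scans whose predicted identifiability exceeds a chosen threshold.
identifiable = results[results["probability"] >= 0.5]
```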
For detailed training instructions, see docs/TRAINING.md.
Training involves a multi-stage pipeline designed for PU learning. You may train using the provided Zenodo datasets or your own preprocessed Lance datasets.
Production Training: All models were trained on a high-performance computing (HPC) cluster node equipped with:
- CPUs: Dual Intel Xeon Gold 5320 (2.20 GHz)
- GPUs: 4× NVIDIA A100 (80 GB)
Training was performed using distributed data parallelism across 2 GPUs with the following SLURM configuration:
```shell
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=200G
```

Note: While production training was performed on an HPC cluster, the training scripts (`training_nn_pu_loss_detach_diff_polarity.py` and `training_bce_loss_diff_polarity_one_hot.py`) have been tested and are fully functional on macOS. The scripts automatically detect available hardware (GPU/MPS/CPU) and configure accordingly.
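The hardware auto-detection mentioned above follows the usual PyTorch pattern; this is a minimal sketch, not the exact code from the training scripts:

```python
import torch

def pick_device() -> torch.device:
    # Prefer CUDA (Linux HPC cluster), then Apple Silicon MPS (macOS),
    # falling back to CPU when no accelerator is available.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
```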
1. BCE Pre-training: Train separate models for positive and negative polarities using Binary Cross-Entropy loss.
2. Prior Estimation: Use the best BCE models to estimate the class prior ($\pi$) on a held-out validation set (Test Set 1).
3. nnPU Training: Train the final model using the estimated priors.
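The non-negative correction at the heart of nnPU training can be illustrated with a small NumPy sketch using a sigmoid surrogate loss; this is a conceptual example, not the implementation in `src/transformers/model_nn_pu_loss_detach_diff_polarity.py`:

```python
import numpy as np

def sigmoid_loss(z):
    # Surrogate loss l(z) = sigmoid(-z): small for confident correct scores.
    return 1.0 / (1.0 + np.exp(z))

def nnpu_risk(scores_pos, scores_unl, prior):
    """Non-negative PU risk estimator (Kiryo et al., 2017)."""
    # Risk of misclassifying positives as negative, weighted by the prior pi.
    r_p_pos = prior * np.mean(sigmoid_loss(scores_pos))
    # Negative-class risk estimated from unlabeled data, with the positive
    # contribution subtracted out and the result clamped at zero.
    r_u_neg = np.mean(sigmoid_loss(-scores_unl))
    r_p_neg = prior * np.mean(sigmoid_loss(-scores_pos))
    return r_p_pos + max(0.0, r_u_neg - r_p_neg)

# Toy scores: one confident positive, one confidently negative unlabeled point.
risk = nnpu_risk(np.array([2.0]), np.array([-2.0]), prior=0.5)
```

The clamping step is what distinguishes nnPU from unbiased PU learning: without it, the estimated negative risk can go negative and the model overfits.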
Train separate models for positive (`--polarity 1`) and negative (`--polarity 0`) modes.

```shell
sbatch slurm_scripts/training/run_train_bce_loss_diff_polarity.sh
```

Note: Before running, edit `slurm_scripts/training/run_train_bce_loss_diff_polarity.sh` to configure:

- Polarity setting (`--polarity 0` for negative, `--polarity 1` for positive)
- Paths to Lance datasets
Use the trained BCE models to predict probabilities on Test Set 1, then calculate the average probability to estimate the priors.
```shell
# Predict on Test Set 1
python scripts/inference/predict_lance_diff_polarity_one_hot.py \
    --checkpoint_path logs/training_bce_loss/best_model_bce_negative.ckpt \
    --lance_path data/lance_data_test_set_1/test_data \
    --output_csv results/predictions_prior_est.csv \
    --polarity 0
```
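Averaging the predicted probabilities then yields the prior estimate. A minimal sketch, where the DataFrame is a stand-in for reading the prediction CSV:

```python
import pandas as pd

# Stand-in for preds = pd.read_csv("results/predictions_prior_est.csv");
# only the "probability" column name matches the real output.
preds = pd.DataFrame({"probability": [0.9, 0.1, 0.7, 0.3]})

# The class prior pi is estimated as the mean predicted probability on the
# held-out set (Test Set 1); pass it to nnPU training via --prior_neg or
# --prior_pos depending on polarity.
prior_estimate = preds["probability"].mean()
```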
Train the final model using the prior estimates.
```shell
sbatch slurm_scripts/training/run_train_nnpu_loss.sh
```

Note: Before running, edit `slurm_scripts/training/run_train_nnpu_loss.sh` to configure:

- Prior estimates (`--prior_pos` and `--prior_neg`)
- Paths to Lance datasets
- Hyperparameters (learning rates, batch size, etc.)
For detailed data preprocessing instructions, see docs/DATA_PREPROCESSING.md.
If you want to process raw data and create the Lance datasets yourself:
Install ThermoRawFileParser and ScanHeadsman (required for .raw file processing):
```shell
# See detailed installation instructions
cat tools/README.md

# Quick install (example for ThermoRawFileParser)
cd tools
wget https://github.com/compomics/ThermoRawFileParser/releases/download/v1.4.4/ThermoRawFileParser1.4.4.zip
unzip ThermoRawFileParser1.4.4.zip -d ThermoRawFileParser/
```

For detailed instructions, see `tools/README.md`.
For detailed instructions on downloading and processing GNPS spectral libraries, see data/libraries/README.md.
See detailed instructions in docs/DATA_PREPROCESSING.md for:
- Converting raw files to mzML and mgf
- Running library matching
- Creating Lance datasets
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.