Associated Publication: Young, J., Gu, L. & Zhou, R. "Predicting FOX gene candidates for oxic nitrogen fixation using multi-omic machine learning and comparative bioinformatics." Scientific Reports (submitted).
git clone https://github.com/jamesyoung93/FoxGenes_ML.git
cd FoxGenes_ML
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtFoxGenes_ML/
├── src/ # Core module
│ ├── data_prep.py # Data loading
│ ├── feature_sets.py # Feature set definitions
│ ├── nested_cv.py # Nested CV (not used for Table 1)
│ ├── bootstrap_holdout.py # Repeated holdout evaluation (Table 1)
│ └── blocked_area_cv.py # Location-blocked CV (Table S6)
├── data/
│ └── prepared_for_modeling.plus_conservation_v2.csv
├── outputs/ # Manuscript results
│ ├── bootstrap_ablation/ # Table 1 & S5 metrics
│ ├── blocked_cv/ # Table S6 metrics
│ └── final_predictions/ # Trained models & pFOX scores
├── run_bootstrap_ablation_suite.py
├── run_blocked_area_cv.py
├── train_final_and_predict_unknowns.py
├── make_fig5_6panel.py
└── requirements.txt
Table 1 reports metrics from 20 repeated stratified 80/20 train-test splits using the principled_expression feature set (includes genomic position).
python run_bootstrap_ablation_suite.py \
--data data/prepared_for_modeling.plus_conservation_v2.csv \
--out outputs/table1_check \
--train_fracs 0.8 \
--n_splits 20Pre-computed results are in outputs/bootstrap_ablation/principled_expression/summary.csv.
Tests generalization across genomic regions (5 blocks × 4 partition offsets = 20 splits).
python run_blocked_area_cv.py \
--data data/prepared_for_modeling.plus_conservation_v2.csv \
--feature_set principled_expression \
--n_groups 5 \
--n_partitions 4 \
--out outputs/blocked_cv_checkAfter evaluation, models were retrained on all 903 labeled genes with 5-fold CV hyperparameter tuning (60 iterations).
python train_final_and_predict_unknowns.py \
--data data/prepared_for_modeling.plus_conservation_v2.csv \
--feature_set principled_expression \
--out outputs/final_check \
--n_iter 60 \
--cv_folds 5Final trained models and genome-wide pFOX scores are in outputs/final_predictions/.
python make_fig5_6panel.py \
--model_joblib outputs/final_predictions/final_model_xgb.joblib \
--feature_matrix data/prepared_for_modeling.plus_conservation_v2.csv \
--predictions_all outputs/final_predictions/predictions_all_ensemble.csv \
--outdir outputs/fig5| Name | Description |
|---|---|
principled_expression |
RPKM + proteomics + promoter + conservation + position (main model) |
principled_expression_no_position |
Same without genomic coordinates (ablation) |
baseline_full |
All available features |
- FoxGenesApp — Interactive web tool for exploring predictions
- cyanobacteria-diazotrophic-proteome — RBH comparative proteomics pipeline
MIT