This repository implements small-scale active learning experiments that illustrate the ridge leverage score approximation to Shapley data values described in the paper "Geometric Data Valuation via Leverage Scores" (arXiv:2511.02100). It compares different selection strategies on MNIST, CIFAR-10, and synthetic datasets.
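For readers unfamiliar with the quantity involved, the following is a minimal NumPy sketch of the textbook ridge leverage score, l_i = x_i^T (X^T X + lambda I)^{-1} x_i, together with a simple top-k selection. It is not taken from this repository: the function names `ridge_leverage_scores` and `select_batch`, the choice to score points against their own Gram matrix, and the top-k rule are all assumptions made for illustration, and the repository's selector may differ.

```python
import numpy as np


def ridge_leverage_scores(X: np.ndarray, lam: float) -> np.ndarray:
    """Return l_i = x_i^T (X^T X + lam * I)^{-1} x_i for every row x_i of X.

    Hypothetical helper for illustration; not the repository's API.
    """
    n, d = X.shape
    gram = X.T @ X + lam * np.eye(d)
    # Solve (X^T X + lam I) Z = X^T rather than forming the inverse explicitly.
    Z = np.linalg.solve(gram, X.T)        # shape (d, n)
    # Diagonal of X @ Z, computed without building the full n x n matrix.
    return np.einsum("ij,ji->i", X, Z)


def select_batch(X_pool: np.ndarray, lam: float, batch_size: int) -> np.ndarray:
    """Return indices of the batch_size rows with the largest scores.

    Assumption: plain top-k selection; the actual strategy may sample instead.
    """
    scores = ridge_leverage_scores(X_pool, lam)
    return np.argsort(scores)[::-1][:batch_size]


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 16))        # synthetic features, 200 points
    picked = select_batch(X, lam=0.1, batch_size=5)
    print("selected indices:", picked)
```

Each score lies in [0, 1) for lambda > 0, and larger values of lambda shrink all scores toward zero, which is why the regularisation strength matters for which points get selected.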
To install, create a virtual environment and install the dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

To run a single active learning experiment:
python scripts/run_active_learning.py \
--dataset mnist \
--model mlp \
--selector ridge-leverage \
--rounds 20 \
--batch-size 5 \
--initial-size 100 \
--pretraining 10 \
--adaptive-lambda \
--alpha 0.01 \
--seed 42 \
--device cpu

Parameters:
--dataset: Dataset to use (mnist, cifar10, synthetic)
--model: Model architecture (mlp, cnn)
--selector: Selection strategy (ridge-leverage, uniform, kcenter, margin, entropy, loss, egl)
--rounds: Number of active learning rounds
--batch-size: Samples selected per round
--initial-size: Initial labeled set size
--pretraining: Pretraining rounds before active learning
--adaptive-lambda: Use adaptive lambda calculation
--alpha: Scaling factor for adaptive lambda (default: 0.01)
--seed: Random seed for reproducibility
--device: Device to use (cpu, cuda, mps)
To compare all selection strategies, run python scripts/run_comparison.py with any of the parameters above, omitting the --selector flag. CSV files, plots, and tables are saved to geosh/experiments/output.
To replicate the figures from the NeurIPS Workshop paper, run:
bash mlxor.sh

This executes the full experimental setup with 40 rounds, 20 pretraining rounds, and 5 random seeds. No GPUs required!
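If you prefer to drive a multi-seed comparison from Python, the sketch below builds one scripts/run_comparison.py command per seed using only the flags documented above. It is an illustration, not a transcription of mlxor.sh: the contents of that script are not shown here, so the seed values (0 to 4), the dataset and model defaults, and the helper names `comparison_commands` and `run_all` are assumptions.

```python
import shlex
import subprocess
import sys


def comparison_commands(seeds, rounds=40, pretraining=20,
                        dataset="mnist", model="mlp", device="cpu"):
    """Build one run_comparison.py command per seed (hypothetical helper).

    The --selector flag is deliberately omitted so that all strategies
    are compared, as described above.
    """
    cmds = []
    for seed in seeds:
        cmds.append([
            sys.executable, "scripts/run_comparison.py",
            "--dataset", dataset,
            "--model", model,
            "--rounds", str(rounds),
            "--pretraining", str(pretraining),
            "--seed", str(seed),
            "--device", device,
        ])
    return cmds


def run_all(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a run fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Assumed seeds 0..4; the seeds used by mlxor.sh may differ.
    run_all(comparison_commands(range(5)), dry_run=True)
```

Call `run_all(..., dry_run=False)` from the repository root to actually launch the runs; the dry-run default only prints the commands so they can be inspected first.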
If you use this code in your research, please cite:
@misc{mendozasmith2025geometricdatavaluationleverage,
title={Geometric Data Valuation via Leverage Scores},
author={Rodrigo Mendoza-Smith},
year={2025},
eprint={2511.02100},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.02100},
}