Decon identifies documents contaminated with eval instances.
It uses simple token-based sampling and counting methods, making it suitable for large datasets. It is deterministic and its results are interpretable.
Decon can produce contamination reports and cleaned datasets.
> **Note**
>
> 🐍 This fork adds Python bindings — the core Rust functionality is unchanged. Skip to Python Quick Start to get started, or see the Architecture section to understand how the bindings are structured. For the full Python API signature, see crates/decon-py/src/lib.rs.
The goal of this fork is simply to expose the API transparently to Python users.
Please note: the PyPI package is named `decontaminate`, but the module is imported as `decon`.
Run `pip install decontaminate` to install the package and get started.
Consider a 30GB web dataset in ~/sample-data that includes documents containing evaluation question text.
TRAINING DOC:
"... for θ 30 c i θ i0 4 for θ 90 d i θ is constant for all values of θ the plane face of plano convex lens of focal length 20 cm is silvered this combination is equivalent to the type of mirror and its focal length is a convex f 20 c m b concave f 20 cm in a displacement method using convex lens two images are obtained for a separation of d between ..."
EVAL PROMPT: the plane face of plano convex lens of focal length 20 cm is silvered this combination is equivalent to the type of mirror and its focal length is
EVAL ANSWER: concave f 10 cm
We can identify the contamination locations by running decon:
```
$ decon detect --training-dir ~/sample-data --evals-dir ~/references

Training files 4,487/4,487 [00:02:55/00:00:00] [████████████████████████████████████]

┌─────────────────────────────────────────────┐
│ Contamination Detection Results             │
├─────────────────────────────────────────────┤
│ Training lines              5,162,084       │
│ Processing rate             34 μs/doc       │
├─────────────────────────────────────────────┤
│ Index building time         38.59s          │
│ Detection time              175.69s         │
│ Total time                  214.28s         │
├─────────────────────────────────────────────┤
│ Contaminated matches        7,699           │
│ Contaminated documents      1,851           │
└─────────────────────────────────────────────┘
```
```
$ decon review --stats /tmp/decon-295c0cbd

=== TRAINING DOCUMENTS CONTAMINATED BY EVAL SUITE ===
(Each count represents unique training documents that need removal)

sciq          652 █████████████████████████████████████████
mmlu          278 ██████████████████████
mmlu_pro      211 █████████████████
ai2_arc_easy   83 ███████
super_gpqa     65 █████
...
```
Install the package from PyPI:

```shell
pip install decontaminate
```

```python
import decon

config = decon.Config(
    training_dir="/path/to/training/data",
    evals_dir="/path/to/eval/references",
    report_output_dir="/path/to/output",
)
report_dir = decon.detect(config)
```

Remember: the PyPI package is `decontaminate`, but the import is `import decon`.
See crates/decon-py/src/lib.rs for all Config parameters and available functions (detect, review, compare, evals, server, Tokenizer, clean_text).
```shell
# Clone and build. Requires Rust 1.88.
git clone https://github.com/allenai/decon
cd decon

# For the full set of commands and options, help is available.
cargo run --release -- --help

# List current eval datasets in reference (small default set initially).
cargo run --release -- evals

# Run contamination detection.
cargo run --release -- detect --training-dir tests/fixtures/training/

# Create a clean copy (contaminated documents removed) of your dataset.
cargo run --release -- detect --training-dir tests/fixtures/training/ --purify

# Review report output. A decon detect run will report an output directory.
cargo run --release -- review /tmp/decon-output-directory
```

Sensible defaults are provided for decon's parameters, with a single `contamination_score_threshold` that can be adjusted to the desired sensitivity. Experimenting with these parameters on your own dataset and eval reference set is recommended.
Decon operates on a directory containing jsonl files.
Each JSON object in those files must contain a field with a string value representing a training document [example].
Decon runs against a reference set of eval suites, which is also expected to be a directory containing jsonl files [example].
Decon eval reference files use a normalized format with `passage`, `question`, and `answer` keys, as well as metadata for reporting. Decon includes tooling to generate reference files from Hugging Face datasets.
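As a concrete illustration, one line of a normalized reference file might look like the sketch below. The `passage`/`question`/`answer` keys follow the format described above; the metadata field names used here (`eval_key`, `split`, `index`) are illustrative assumptions only — inspect a generated reference file for the real schema.

```python
import json

# Sketch of one record in a reference jsonl file.
# passage/question/answer follow the normalized format described above;
# eval_key/split/index are ASSUMED metadata names, for illustration only.
record = {
    "eval_key": "mmlu",   # assumed: suite name used in reports
    "split": "test",      # assumed metadata
    "index": 0,           # assumed metadata
    "passage": "",        # optional supporting passage
    "question": "the plane face of plano convex lens of focal length 20 cm is silvered ...",
    "answer": "concave f 10 cm",
}

# Each line of a reference .jsonl file is one such JSON object.
line = json.dumps(record)
print(line)
```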
Three eval suites are included in the eval reference dataset by default: gsm8k, mmlu, and agi_eval.
You will likely want to build your own reference set with your evals of interest.
The `decon evals` command can process an extensible, declarative YAML file to normalize Hugging Face datasets.
To download all the pre-configured evals included in the configuration file, run the following command. This requires python3 with the `datasets` library installed.
```shell
# Review current set of evals in reference
cargo run --release -- evals

# Download and normalize all evals configured in a config file
cargo run --release -- evals --download --config config/evals.yaml
See the Evaluation Dataset Guide for more information on preparing evaluation datasets.
Decon can also be run as a server to facilitate distributing workloads.

```shell
# Launch a server
decon server --port 8080
```

An example orchestration script is provided that demonstrates one approach: batch-retrieve a partition of documents, submit them to the server, poll for job status, and upload reports and cleaned documents to a new location.
See the deployment guide for details.
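The orchestration pattern above can be sketched in Python. This is not the provided script: the batching helper is generic, and the `/jobs` route plus the `{"documents": ...}` payload are purely hypothetical placeholders for whatever protocol the server and example script actually use.

```python
import json
from urllib import request

def partition(docs, batch_size):
    """Yield successive fixed-size batches of documents."""
    for i in range(0, len(docs), batch_size):
        yield docs[i:i + batch_size]

def submit(server_url, batch):
    """Submit one batch of documents to a decon server.

    The "/jobs" route and payload shape are HYPOTHETICAL; consult the
    example orchestration script for the real protocol.
    """
    body = json.dumps({"documents": batch}).encode()
    req = request.Request(
        server_url + "/jobs",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)

# Usage sketch:
# for batch in partition(documents, 1000):
#     submit("http://localhost:8080", batch)
```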
Decon includes tools for qualitative review and basic stats, which can be filtered to analyze contamination.

```shell
# To qualitatively review individual matches
cargo run --release -- review /my-results-directory

# To see statistics
cargo run --release -- review --stats /my-results-directory

# To review with filters, e.g. a specific eval with a minimum score
cargo run --release -- review /my-results-directory --eval mmlu --min-score 0.9

# Compare results between different decontamination runs
cargo run --release -- compare /tmp/results-a /tmp/results-b
```

Decon reports are jsonl files, ready for analysis beyond the provided tooling.
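Since each report line is a JSON object, ad-hoc analysis takes only a few lines of stdlib Python. Below is a minimal sketch that tallies matches per eval suite; the `eval_key` field name is an assumption — check the keys in your own report files before relying on it.

```python
import json
from collections import Counter

def tally_matches(report_path):
    """Count contamination matches per eval suite in a decon report.

    Assumes each jsonl line has an "eval_key" field naming the suite;
    verify the field name against your own report output.
    """
    counts = Counter()
    with open(report_path) as f:
        for line in f:
            if line.strip():
                match = json.loads(line)
                counts[match.get("eval_key", "unknown")] += 1
    return counts
```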
This fork restructures decon as a Rust workspace with three crates:

| Crate | Source | Description |
|---|---|---|
| decon-core | `crates/decon-core/` | Core detection engine — pure Rust library (unchanged from upstream) |
| decon-cli | `crates/decon-cli/` | Command-line interface built on decon-core |
| decon-py | `crates/decon-py/` | Python bindings via PyO3 |

The Python bindings are a thin wrapper around decon-core — no detection logic is reimplemented in Python. Key files:

| File | Purpose |
|---|---|
| `crates/decon-py/src/lib.rs` | PyO3 wrapper classes (`PyConfig`, `PyTokenizer`) and functions (`detect`, `clean_text`) |
| `crates/decon-py/python/decon/__init__.py` | Python module re-exports |
| `crates/decon-py/tests/test_parity.py` | Parity tests ensuring Python ↔ Rust equivalence |
The detect() function releases the GIL via py.allow_threads(), enabling full utilization of Rayon's parallel processing on all CPU cores.
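Because the GIL is released during detection, other Python threads keep running while `detect()` works. Here is a minimal sketch of that pattern using a generic heartbeat helper; the commented `decon` usage at the bottom assumes the bindings from this fork are installed, and everything else is plain stdlib threading.

```python
import threading

def run_with_heartbeat(task, interval=1.0):
    """Run task() in a worker thread, printing a heartbeat until it finishes.

    Works for any callable; with decon.detect the main thread stays
    responsive because detect() releases the GIL.
    """
    result = {}
    worker = threading.Thread(target=lambda: result.setdefault("value", task()))
    worker.start()
    while worker.is_alive():
        worker.join(timeout=interval)
        if worker.is_alive():
            print("detection still running...")
    return result["value"]

# Usage sketch (assumes the decon bindings are installed):
# import decon
# config = decon.Config(training_dir="...", evals_dir="...")
# report_dir = run_with_heartbeat(lambda: decon.detect(config))
```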
Rust CLI:

```shell
cargo build --release
# Binary at: target/release/decon
```

Python bindings (requires maturin):

```shell
cd crates/decon-py
maturin develop --release
# Or build wheels: maturin build --release
```

📦 Detailed guide: see doc/building.md for cross-platform builds, troubleshooting, and CI/CD.
- Rust: 1.88+ (edition 2024)
- Python: 3.12+ (for bindings)