# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

GDR (Genetic Dimension Reduction) is a scikit-learn compatible transformer that uses Gene Expression Programming (GEP) to evolve interpretable synthetic features for binary classification tasks on tabular data.

## Setup & Running

Install dependencies (run from repo root):
```bash
pip install -r requirements.txt
```

Run an experiment (always from the **repo root**, not from `gdr/`):
```bash
python ./gdr/run_benchmark.py
```

Configure the experiment by editing `gdr/experiment_config.py` before running.

## Architecture

All scripts import each other as sibling modules (there is no `__init__.py`), so `run_benchmark.py` must be invoked from the repo root. Paths such as `data/` and `experiments/` are resolved relative to the working directory (the repo root).

### Core flow

```
experiment_config.py ← edit this to configure a run
run_benchmark.py ← orchestrator: loops datasets × folds, saves results
simple_benchmark.py ← one fold: baseline eval → GDR fit/transform → post eval
gdr.py (GDR class) ← the algorithm itself
```

### GDR algorithm (`gdr/gdr.py`)

`GDR` is a `BaseEstimator` / `TransformerMixin`. Each call to `fit()` evolves `generate_features_num` synthetic features sequentially. For each feature:

1. **Exploration phase** (1 generation, full population) — broad search
2. **Exploitation phase** (`generations_number - 1` more generations, top 10% kept) — refinement
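
The two phases above can be sketched as a plain-Python loop. This is a toy random-search stand-in, not the actual GEP machinery in `gdr.py`; all names (`evolve_feature`, the callbacks) are illustrative:

```python
import random

random.seed(0)  # seeded so the toy run below is reproducible

def evolve_feature(population_size, generations_number, fitness,
                   random_individual, mutate):
    # Exploration phase: one generation over a full random population (broad search).
    population = sorted((random_individual() for _ in range(population_size)),
                        key=fitness, reverse=True)

    # Exploitation phase: keep the top 10% and refine them over the
    # remaining (generations_number - 1) generations.
    elite = population[: max(1, population_size // 10)]
    for _ in range(generations_number - 1):
        offspring = [mutate(ind) for ind in elite for _ in range(10)]
        elite = sorted(elite + offspring, key=fitness, reverse=True)[: len(elite)]
    return elite[0]

# Toy usage: "evolve" a number close to the target 3.14.
best = evolve_feature(
    population_size=50,
    generations_number=5,
    fitness=lambda x: -abs(x - 3.14),
    random_individual=lambda: random.uniform(-10, 10),
    mutate=lambda x: x + random.gauss(0, 0.1),
)
```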

The **fitness function rotates criteria** across features (cycling through LogisticRegression → DecisionTree → pairwise class distance). Each individual is also scored with a `LinearSVC` "mixin" score; fitness is a 2-tuple.
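A minimal sketch of the rotation and the 2-tuple fitness shape. The criterion names and helper functions here are stand-ins, not the real scoring code (which trains the actual models):

```python
# Hypothetical stand-ins for the three rotating criteria; the real code
# scores individuals with LogisticRegression, DecisionTree, and a
# pairwise class-distance measure respectively.
CRITERIA = ["logistic_regression", "decision_tree", "pairwise_distance"]

def criterion_for_feature(feature_index):
    # Feature 0 -> logistic_regression, 1 -> decision_tree,
    # 2 -> pairwise_distance, 3 -> logistic_regression, ...
    return CRITERIA[feature_index % len(CRITERIA)]

def fitness(primary_score, svc_mixin_score):
    # Fitness is a 2-tuple: (rotating criterion score, LinearSVC "mixin" score).
    return (primary_score, svc_mixin_score)
```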

`transform()` applies compiled GEP expressions to produce standardized synthetic features. The `DEAP` `creator` classes are guarded against re-registration on repeated `fit()` calls.
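Assuming "standardized" means the usual z-score, the per-column step of `transform()` can be sketched as follows (the real code first compiles and evaluates the GEP expression to get the raw column):

```python
from statistics import mean, pstdev

def standardize(values):
    # Z-score one synthetic feature column: zero mean, unit variance.
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature -> all zeros
    return [(v - mu) / sigma for v in values]

col = standardize([2.0, 4.0, 6.0, 8.0])
```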

### Key files

| File | Role |
|---|---|
| `gdr/gdr.py` | `GDR` class — fit, transform, visualization helpers |
| `gdr/experiment_config.py` | Experiment parameters (edit per run) |
| `gdr/run_benchmark.py` | Entry point; orchestrates loops, logging, CSV output |
| `gdr/simple_benchmark.py` | Single-fold benchmark (`benchmark_gdr()`) |
| `gdr/config_preprocessing.py` | Per-dataset numerical/categorical column lists |
| `gdr/plotting.py` | Standalone plotting utilities (comparison bars, loss curves, fitness) |
| `gdr/time_logger.py` | Per-feature timing statistics |
| `gdr/clean_experiment_data.py` | Wipes & recreates the experiment output directory |
| `data/` | CSV datasets (not in repo; download from OpenML — see README) |
| `experiments/` | Output root; each run creates `experiments/{experiment_name}/` |

### Experiment output structure

```
experiments/{experiment_name}/
experiment.log # full debug log
experiment_config.py # config snapshot
results_gdr.csv # final pre/post metrics table
diagrams/ # fitness evolution PNGs
pics/ # expression tree PNGs
time_logs/ # per-feature timing CSVs
transformed_data/ # GDR-transformed dataset CSVs
loss_curves/ # train/val AUC CSVs + PNGs
```

## Configuration (`experiment_config.py`)

Key parameters:

| Key | Description |
|---|---|
| `experiment_name` | Output folder name under `experiments/` |
| `datasets` | List of `(filename, target_column)` tuples; CSVs go in `data/` |
| `merge_mode` | `'categoricals'` (append label-encoded cats) or `'all'` (append all original features) |
| `generate_features` | Number of GDR synthetic features to evolve |
| `generations` | Generations per feature |
| `population` | Population size per generation |
| `terminal_selection` | `'important'` (LightGBM-ranked), `'random'`, or `'all'` |
| `base_feats` | Max input features fed to GEP |
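
Put together, a config edit might look like this. The keys come from the table above; every value (including the dataset name) is an illustrative example, not a project default:

```python
# Illustrative experiment_config.py fragment; values are examples only.
experiment_name = "demo_run"
datasets = [("my_dataset.csv", "target")]  # CSVs live in data/
merge_mode = "categoricals"                # or "all"
generate_features = 5                      # synthetic features to evolve
generations = 10                           # generations per feature
population = 200                           # individuals per generation
terminal_selection = "important"           # 'important' | 'random' | 'all'
base_feats = 15                            # max input features fed to GEP
```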

## Adding a New Dataset

1. Place CSV in `data/`.
2. Add an entry to `experiment_config.py` `datasets` list: `('filename.csv', 'target_col')`.
3. If the dataset has categorical columns, add a `config_preprocessing` entry in `gdr/config_preprocessing.py` listing `numericals` and `non_numericals`. Otherwise all columns are treated as numerical.
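
Steps 2–3 might look like this. The container shape below is an assumption; the doc only says each entry lists `numericals` and `non_numericals`, and the column names are hypothetical:

```python
# Hypothetical entry in gdr/config_preprocessing.py.
config_preprocessing = {
    "my_dataset.csv": {
        "numericals": ["age", "bmi"],          # used as-is
        "non_numericals": ["gender", "city"],  # label-encoded when merge_mode='categoricals'
    },
}
```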

## Notes

- `run_benchmark.py` wipes the entire `experiments/{experiment_name}/` directory at startup via `clean_experiment_data`. Rename or archive outputs before re-running.
- Interim results are saved after each fold as `temp_results_gdr.csv` and removed when the run completes.
- The `med_research` branch contains work with medical datasets (Retinopathy Debrecen, Myopia) stored in `data/`.