# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

GDR (Genetic Dimension Reduction) is a scikit-learn compatible transformer that uses Gene Expression Programming (GEP) to evolve interpretable synthetic features for binary classification tasks on tabular data.

## Setup & Running

Install dependencies (run from repo root):
```bash
pip install -r requirements.txt
```

Run an experiment (always from the **repo root**, not from `gdr/`):
```bash
python ./gdr/run_benchmark.py
```

Configure the experiment by editing `gdr/experiment_config.py` before running.

## Architecture

All scripts import each other as sibling modules (there is no `__init__.py`), so `run_benchmark.py` must be invoked from the repo root. Paths such as `data/` and `experiments/` are resolved relative to the working directory (the repo root).

### Core flow

```
experiment_config.py ← edit this to configure a run
run_benchmark.py ← orchestrator: loops datasets × folds, saves results
simple_benchmark.py ← one fold: baseline eval → GDR fit/transform → post eval
gdr.py (GDR class) ← the algorithm itself
```

### GDR algorithm (`gdr/gdr.py`)

`GDR` is a `BaseEstimator` / `TransformerMixin`. Each call to `fit()` evolves `generate_features_num` synthetic features sequentially. For each feature:

1. **Exploration phase** (1 generation, full population) — broad search
2. **Exploitation phase** (`generations_number - 1` more generations, top 10% kept) — refinement
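
The two phases above can be sketched as a plain-Python loop. This is a toy random-search stand-in, not the actual GEP machinery in `gdr.py`; all names (`evolve_feature`, the callbacks) are illustrative:

```python
import random

random.seed(0)  # seeded so the toy run below is reproducible

def evolve_feature(population_size, generations_number, fitness,
                   random_individual, mutate):
    # Exploration phase: one generation over a full random population (broad search).
    population = sorted((random_individual() for _ in range(population_size)),
                        key=fitness, reverse=True)

    # Exploitation phase: keep the top 10% and refine them over the
    # remaining (generations_number - 1) generations.
    elite = population[: max(1, population_size // 10)]
    for _ in range(generations_number - 1):
        offspring = [mutate(ind) for ind in elite for _ in range(10)]
        elite = sorted(elite + offspring, key=fitness, reverse=True)[: len(elite)]
    return elite[0]

# Toy usage: "evolve" a number close to the target 3.14.
best = evolve_feature(
    population_size=50,
    generations_number=5,
    fitness=lambda x: -abs(x - 3.14),
    random_individual=lambda: random.uniform(-10, 10),
    mutate=lambda x: x + random.gauss(0, 0.1),
)
```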

The **fitness function rotates criteria** across features (cycling through LogisticRegression → DecisionTree → pairwise class distance). Each individual is also scored with a `LinearSVC` "mixin" score; fitness is a 2-tuple.
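A minimal sketch of the rotation and the 2-tuple fitness shape. The criterion names and helper functions here are stand-ins, not the real scoring code (which trains the actual models):

```python
# Hypothetical stand-ins for the three rotating criteria; the real code
# scores individuals with LogisticRegression, DecisionTree, and a
# pairwise class-distance measure respectively.
CRITERIA = ["logistic_regression", "decision_tree", "pairwise_distance"]

def criterion_for_feature(feature_index):
    # Feature 0 -> logistic_regression, 1 -> decision_tree,
    # 2 -> pairwise_distance, 3 -> logistic_regression, ...
    return CRITERIA[feature_index % len(CRITERIA)]

def fitness(primary_score, svc_mixin_score):
    # Fitness is a 2-tuple: (rotating criterion score, LinearSVC "mixin" score).
    return (primary_score, svc_mixin_score)
```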

`transform()` applies compiled GEP expressions to produce standardized synthetic features. The `DEAP` `creator` classes are guarded against re-registration on repeated `fit()` calls.
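Assuming "standardized" means the usual z-score, the per-column step of `transform()` can be sketched as follows (the real code first compiles and evaluates the GEP expression to get the raw column):

```python
from statistics import mean, pstdev

def standardize(values):
    # Z-score one synthetic feature column: zero mean, unit variance.
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature -> all zeros
    return [(v - mu) / sigma for v in values]

col = standardize([2.0, 4.0, 6.0, 8.0])
```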

### Key files

| File | Role |
|---|---|
| `gdr/gdr.py` | `GDR` class — fit, transform, visualization helpers |
| `gdr/experiment_config.py` | Experiment parameters (edit per run) |
| `gdr/run_benchmark.py` | Entry point; orchestrates loops, logging, CSV output |
| `gdr/simple_benchmark.py` | Single-fold benchmark (`benchmark_gdr()`) |
| `gdr/config_preprocessing.py` | Per-dataset numerical/categorical column lists |
| `gdr/plotting.py` | Standalone plotting utilities (comparison bars, loss curves, fitness) |
| `gdr/time_logger.py` | Per-feature timing statistics |
| `gdr/clean_experiment_data.py` | Wipes & recreates the experiment output directory |
| `data/` | CSV datasets (not in repo; download from OpenML — see README) |
| `experiments/` | Output root; each run creates `experiments/{experiment_name}/` |

### Experiment output structure

```
experiments/{experiment_name}/
experiment.log # full debug log
experiment_config.py # config snapshot
results_gdr.csv # final pre/post metrics table
diagrams/ # fitness evolution PNGs
pics/ # expression tree PNGs
time_logs/ # per-feature timing CSVs
transformed_data/ # GDR-transformed dataset CSVs
loss_curves/ # train/val AUC CSVs + PNGs
```

## Configuration (`experiment_config.py`)

Key parameters:

| Key | Description |
|---|---|
| `experiment_name` | Output folder name under `experiments/` |
| `datasets` | List of `(filename, target_column)` tuples; CSVs go in `data/` |
| `merge_mode` | `'categoricals'` (append label-encoded cats) or `'all'` (append all original features) |
| `generate_features` | Number of GDR synthetic features to evolve |
| `generations` | Generations per feature |
| `population` | Population size per generation |
| `terminal_selection` | `'important'` (LightGBM-ranked), `'random'`, or `'all'` |
| `base_feats` | Max input features fed to GEP |
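
Put together, a config edit might look like this. The keys come from the table above; every value (including the dataset name) is an illustrative example, not a project default:

```python
# Illustrative experiment_config.py fragment; values are examples only.
experiment_name = "demo_run"
datasets = [("my_dataset.csv", "target")]  # CSVs live in data/
merge_mode = "categoricals"                # or "all"
generate_features = 5                      # synthetic features to evolve
generations = 10                           # generations per feature
population = 200                           # individuals per generation
terminal_selection = "important"           # 'important' | 'random' | 'all'
base_feats = 15                            # max input features fed to GEP
```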

## Adding a New Dataset

1. Place CSV in `data/`.
2. Add an entry to `experiment_config.py` `datasets` list: `('filename.csv', 'target_col')`.
3. If the dataset has categorical columns, add a `config_preprocessing` entry in `gdr/config_preprocessing.py` listing `numericals` and `non_numericals`. Otherwise all columns are treated as numerical.
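
Steps 2–3 might look like this. The container shape below is an assumption; the doc only says each entry lists `numericals` and `non_numericals`, and the column names are hypothetical:

```python
# Hypothetical entry in gdr/config_preprocessing.py.
config_preprocessing = {
    "my_dataset.csv": {
        "numericals": ["age", "bmi"],          # used as-is
        "non_numericals": ["gender", "city"],  # label-encoded when merge_mode='categoricals'
    },
}
```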

## Notes

- `run_benchmark.py` wipes the entire `experiments/{experiment_name}/` directory at startup via `clean_experiment_data`. Rename or archive outputs before re-running.
- Interim results are saved after each fold as `temp_results_gdr.csv` and removed when the run completes.
- The `med_research` branch contains work with medical datasets (Retinopathy Debrecen, Myopia) stored in `data/`.