diff --git a/README.md b/README.md index c658bd2..f36d463 100644 --- a/README.md +++ b/README.md @@ -7,8 +7,10 @@ EcoNetToolkit lets you train a shallow neural network or classical models on you - CSV input with automatic preprocessing (impute, scale, encode) - Model zoo: MLP (shallow), Random Forest, SVM, XGBoost, Logistic Regression, Linear Regression +- **Hyperparameter tuning** with grouped train/val/test splits (prevents data leakage) - Repeated training with different seeds for stable estimates - Metrics, including for unbalanced datasets (balanced accuracy, PR AUC) +- K-fold cross-validation with spatial/temporal grouping - Configure the project from a single config file ## Table of Contents @@ -26,6 +28,7 @@ EcoNetToolkit lets you train a shallow neural network or classical models on you - [Available models and key parameters](#available-models-and-key-parameters) - [Notes on metrics](#notes-on-metrics) - [Additional notes](#additional-notes) + - [Hyperparameter Tuning](#hyperparameter-tuning) - [Using your own data](#using-your-own-data) - [Testing](#testing) - [Unit Tests](#unit-tests) @@ -211,6 +214,60 @@ output: **Note:** If `output.dir` is not specified, outputs are automatically saved to `outputs//` where `` is derived from your config file name. +### Multi-output (multi-target) prediction + +EcoNetToolkit supports predicting **multiple target variables simultaneously** (multi-output learning). This is useful when you want to predict several related outcomes from the same features. + +**Example: Multi-output classification** + +```yaml +problem_type: classification + +data: + path: data/palmerpenguins_extended.csv + features: [bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, island] + labels: [species, sex, life_stage] # Predict 3 labels simultaneously + test_size: 0.2 + val_size: 0.15 + scaling: standard + +models: + - name: random_forest + params: + n_estimators: 100 + max_depth: 15 + +training: + repetitions: 5 +``` + +**Example: Multi-output regression** + +```yaml +problem_type: regression + +data: + path: data/possum.csv + features: [hdlngth, skullw, totlngth, taill] + labels: [age, chest, belly] # Predict 3 continuous values + test_size: 0.2 + scaling: standard + +models: + - name: mlp + params: + hidden_layer_sizes: [32, 16] + max_iter: 500 +``` + +**Key points:** +- Use `labels:` (list) instead of `label:` (single string) to specify multiple targets +- For backward compatibility, `label:` still works for single-output prediction +- Multi-output metrics report mean and standard deviation across all outputs +- Some models support multi-output natively (Random Forest, MLP Regressor), others are wrapped automatically (Logistic Regression, SVM, Linear Regression) + +See `configs/penguins_multilabel.yaml` and `configs/possum_multilabel.yaml` for complete examples. + ### Available models and key parameters **MLP (Multi-Layer Perceptron)** @@ -265,6 +322,70 @@ output: - For multi-class problems, macro-averaged Precision/Recall/F1 summarise performance across all classes. - Models are ranked by Cohen's kappa (classification) or MSE (regression) to identify the best performer. +## Hyperparameter Tuning + +EcoNetToolkit includes automated hyperparameter tuning with proper train/validation/test splits to prevent data leakage. This is especially important for ecological data with spatial or temporal structure. 
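To make the leakage argument concrete, the sketch below shows the group-assignment idea in plain pandas/NumPy. It is a simplified illustration of what `ecosci/data/splitting.py` does in this PR (the helper name `grouped_split` is invented for this example): whole groups, not individual rows, are assigned to train, validation, or test, so no patch, site, or year ever appears in more than one split.

```python
import numpy as np
import pandas as pd

def grouped_split(df: pd.DataFrame, group_col: str,
                  n_train: int, n_val: int, n_test: int, seed: int = 42):
    """Assign whole groups to train/val/test so splits never share a group."""
    groups = df[group_col].unique()
    if len(groups) < n_train + n_val + n_test:
        raise ValueError("Not enough unique groups for the requested split")
    shuffled = np.random.RandomState(seed).permutation(groups)
    train_g = shuffled[:n_train]
    val_g = shuffled[n_train:n_train + n_val]
    test_g = shuffled[n_train + n_val:n_train + n_val + n_test]
    # Boolean masks keep every row of a group inside exactly one split
    return (df[df[group_col].isin(train_g)],
            df[df[group_col].isin(val_g)],
            df[df[group_col].isin(test_g)])
```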
+ +**Quick Example:** + +```bash +python run.py --config configs/mangrove_tuning.yaml +``` + +**Key Features:** + +- **Grouped splits**: Assign groups (e.g., patches, sites, years) to train/val/test sets +- **Automatic search**: GridSearchCV or RandomizedSearchCV to find optimal hyperparameters +- **Multiple seeds**: Run with different random seeds for stable results +- **Proper evaluation**: Tune on train+val, evaluate on held-out test set + +**Example Config:** + +```yaml +problem_type: regression + +data: + path: data/mangrove.csv + cv_group_column: patch_id # Group by spatial patches + n_train_groups: 4 # 4 patches for training + n_val_groups: 2 # 2 patches for validation (tuning) + n_test_groups: 2 # 2 patches for test (final eval) + + labels: [NDVI] + features: [pu_x, pu_y, temperature, ...] + scaling: standard + +# Enable hyperparameter tuning +tuning: + enabled: true + search_method: random # "random" or "grid" + n_iter: 30 # Number of parameter combinations + cv_folds: 3 # CV folds during tuning + scoring: neg_mean_squared_error + n_jobs: -1 # Use all CPU cores + +# Define models and search spaces +models: + - name: random_forest + param_space: + n_estimators: [100, 200, 500, 1000] + max_depth: [10, 20, 30, 50] + min_samples_split: [2, 5, 10] + max_features: [sqrt, log2, 0.5] + +training: + repetitions: 5 + random_seed: 42 +``` + +**Outputs include:** +- Best hyperparameters for each seed +- Validation and test set performance +- Comparison plots +- Trained models with optimal parameters + +For detailed information, see [docs/HYPERPARAMETER_TUNING.md](docs/HYPERPARAMETER_TUNING.md) + ## Using your own data 1. Place your CSV file in the `data` folder. @@ -321,7 +442,7 @@ python run.py --config configs/penguins_config.yaml python run.py --config configs/possum_config.yaml ``` -These demonstrate that the toolkit works correctly for both problem types and generates appropriate metrics and visualizations. +These demonstrate that the toolkit works correctly for both problem types and generates appropriate metrics and visualisations. 
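If you want to sanity-check the tuning step against plain scikit-learn, the random search configured in the Hyperparameter Tuning section above corresponds roughly to the sketch below. This is illustrative only, not the toolkit's internal code; `X_train_val`, `y_train_val`, and `groups_train_val` are placeholders for the preprocessed train+validation data and their group IDs.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# Mirrors the example config: tuning.search_method: random, n_iter: 30,
# cv_folds: 3, scoring: neg_mean_squared_error, grouped by patch_id.
param_space = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [10, 20, 30, 50],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_space,
    n_iter=30,
    cv=GroupKFold(n_splits=3),
    scoring="neg_mean_squared_error",
    n_jobs=-1,
    random_state=42,
)
# search.fit(X_train_val, y_train_val, groups=groups_train_val)
# best_model = search.best_estimator_  # then evaluate once on the held-out test groups
```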
## Troubleshooting diff --git a/configs/penguins_multilabel.yaml b/configs/penguins_multilabel.yaml new file mode 100644 index 0000000..5452b6f --- /dev/null +++ b/configs/penguins_multilabel.yaml @@ -0,0 +1,64 @@ +# Multi-output classification example using Palmer Penguins dataset +# This config demonstrates predicting multiple labels simultaneously + +problem_type: classification # Multi-output classification + +data: + path: data/palmerpenguins_extended.csv + + # Features: Physical measurements and contextual variables + features: + - bill_length_mm # Continuous: length of bill in millimeters + - bill_depth_mm # Continuous: depth of bill in millimeters + - flipper_length_mm # Continuous: flipper length in millimeters + - body_mass_g # Continuous: body mass in grams + - island # Categorical: Biscoe, Dream, Torgensen + - year # Numeric: 2021-2025 + + # Multiple target variables to predict simultaneously + labels: + - species # Adelie, Chinstrap, Gentoo (3 classes) + - sex # male, female (2 classes) + - life_stage # adult, juvenile, chick (3 classes) + + test_size: 0.2 # 20% for testing + val_size: 0.15 # 15% of remaining for validation + random_state: 42 # For reproducibility + scaling: standard # Standardize features + impute_strategy: mean # Imputation for missing values + +# Train multiple models for comparison +models: + # Random Forest - natively supports multi-output + - name: random_forest + params: + n_estimators: 100 + max_depth: 15 + min_samples_split: 5 + min_samples_leaf: 2 + max_features: sqrt + class_weight: balanced + + # MLP - good for multi-output tasks + - name: mlp + params: + hidden_layer_sizes: [64, 32] + max_iter: 300 + early_stopping: true + validation_fraction: 0.1 + n_iter_no_change: 15 + + # Logistic Regression - wrapped with MultiOutputClassifier + - name: logistic + params: + max_iter: 1000 + class_weight: balanced + +# Training configuration +training: + repetitions: 5 # Train 5 times with different seeds + random_seed: 0 + +# Output configuration +output: + dir: outputs/penguins_multilabel diff --git a/configs/possum_multilabel.yaml b/configs/possum_multilabel.yaml new file mode 100644 index 0000000..b293047 --- /dev/null +++ b/configs/possum_multilabel.yaml @@ -0,0 +1,61 @@ +# Multi-output regression example using Possum dataset +# This config demonstrates predicting multiple continuous variables simultaneously + +problem_type: regression # Multi-output regression + +data: + path: data/possum.csv + + # Features: Various possum measurements + features: + - hdlngth # Head length + - skullw # Skull width + - totlngth # Total length + - taill # Tail length + - footlgth # Foot length + - earconch # Ear conch length + - eye # Eye measurement + + # Multiple target variables to predict simultaneously + labels: + - age # Age of the possum + - chest # Chest circumference + - belly # Belly circumference + + test_size: 0.2 # 20% for testing + val_size: 0.15 # 15% of remaining for validation + random_state: 42 # For reproducibility + scaling: standard # Standardize features + impute_strategy: mean # Imputation for missing values + +# Train multiple models for comparison +models: + # Random Forest - natively supports multi-output + - name: random_forest + params: + n_estimators: 100 + max_depth: 10 + min_samples_split: 3 + min_samples_leaf: 2 + + # MLP Regressor - natively supports multi-output + - name: mlp + params: + hidden_layer_sizes: [32, 16] + max_iter: 500 + early_stopping: true + validation_fraction: 0.15 + n_iter_no_change: 20 + + # Linear Regression - wrapped 
with MultiOutputRegressor + - name: linear + params: {} + +# Training configuration +training: + repetitions: 5 # Train 5 times with different seeds + random_seed: 0 + +# Output configuration +output: + dir: outputs/possum_multilabel diff --git a/ecosci/__init__.py b/ecosci/__init__.py index 5f8a414..08d3f21 100644 --- a/ecosci/__init__.py +++ b/ecosci/__init__.py @@ -4,6 +4,6 @@ from .data import CSVDataLoader from .models import ModelZoo from .trainer import Trainer -from .eval import evaluate_and_report +from .evaluation import evaluate_and_report __all__ = ["load_config", "CSVDataLoader", "ModelZoo", "Trainer", "evaluate_and_report"] diff --git a/ecosci/data.py b/ecosci/data.py deleted file mode 100644 index c9e705a..0000000 --- a/ecosci/data.py +++ /dev/null @@ -1,209 +0,0 @@ -"""Data loading and preprocessing utilities. - -What this does for you (no coding required): -- Reads your CSV file -- Picks which columns are inputs (features) and which is the label (what you want to predict) -- Handles missing values (imputation) -- Scales numeric columns so models train better -- Encodes text/categorical columns into numbers automatically -- Splits your data into train/validation/test sets -""" - -from typing import List, Optional, Tuple -import pandas as pd - - -class CSVDataLoader: - """Load a CSV and build a simple preprocessing pipeline. - - How to use (via YAML): specify the CSV path, list the feature columns and the - label column, and choose options for imputation and scaling. - - Notes - - Scaling: "standard" makes numbers with mean=0 and std=1 (good default). - "minmax" squashes numbers to 0..1. - - Categorical text columns are one-hot encoded automatically. - - Missing numbers use the mean (by default). Categorical uses the most - frequent value. - """ - - def __init__( - self, - path: str, - features: Optional[List[str]] = None, - label: Optional[str] = None, - test_size: float = 0.2, - val_size: float = 0.2, - random_state: int = 0, - scaling: str = "standard", - impute_strategy: str = "mean", - problem_type: str = "classification", - ): - self.path = path - self.features = features - self.label = label - self.test_size = test_size - self.val_size = val_size - self.random_state = random_state - self.scaling = scaling - self.impute_strategy = impute_strategy - self.problem_type = problem_type - - def load(self) -> pd.DataFrame: - """Read the CSV into a DataFrame.""" - df = pd.read_csv(self.path) - return df - - def prepare(self) -> Tuple: - """Prepare arrays for modeling. - - Returns - ------- - tuple - (X_train, X_val, X_test, y_train, y_val, y_test) - - Under the hood we use scikit-learn pipelines for imputation, encoding - and scaling to keep things robust and simple. 
- """ - from sklearn.model_selection import train_test_split - from sklearn.impute import SimpleImputer - from sklearn.preprocessing import ( - StandardScaler, - MinMaxScaler, - OneHotEncoder, - LabelEncoder, - ) - from sklearn.compose import ColumnTransformer - from sklearn.pipeline import Pipeline - import numpy as np - - df = self.load() - - if self.label is None: - raise ValueError("Label column must be provided in config") - - if self.features is None: - # default: all columns except label - self.features = [c for c in df.columns if c != self.label] - - X = df[self.features].copy() - y = df[self.label].copy() - - # Drop rows with NaN in target variable - mask_target = ~y.isna() - if not mask_target.all(): - n_dropped_target = (~mask_target).sum() - print( - f"Warning: Dropping {n_dropped_target} rows with NaN in target '{self.label}'" - ) - X = X[mask_target] - y = y[mask_target] - - # Drop rows with NaN in any features - mask_features = ~X.isna().any(axis=1) - if not mask_features.all(): - n_dropped_features = (~mask_features).sum() - cols_with_nan = X.columns[X.isna().any()].tolist() - print( - f"Warning: Dropping {n_dropped_features} additional rows with NaN in features: {cols_with_nan}" - ) - X = X[mask_features] - y = y[mask_features] - - # Reset indices after dropping rows - X = X.reset_index(drop=True) - y = y.reset_index(drop=True) - - # Encode labels if they are strings/categorical (needed for XGBoost and others) - if y.dtype == "object" or y.dtype.name == "category": - label_encoder = LabelEncoder() - y = label_encoder.fit_transform(y) - # Store the encoder for later use if needed - self.label_encoder = label_encoder - self.label_classes = label_encoder.classes_ - else: - self.label_encoder = None - self.label_classes = None - - # Identify numeric vs categorical (strings, categories, etc.) 
- numeric_cols = X.select_dtypes(include=["number"]).columns.tolist() - cat_cols = [c for c in X.columns if c not in numeric_cols] - - transformers = [] - - if numeric_cols: - num_pipeline = Pipeline( - [ - ("imputer", SimpleImputer(strategy=self.impute_strategy)), - ( - "scaler", - ( - StandardScaler() - if self.scaling == "standard" - else MinMaxScaler() - ), - ), - ] - ) - transformers.append(("num", num_pipeline, numeric_cols)) - - if cat_cols: - cat_pipeline = Pipeline( - [ - ("imputer", SimpleImputer(strategy="most_frequent")), - ( - "onehot", - OneHotEncoder(handle_unknown="ignore", sparse_output=False), - ), - ] - ) - transformers.append(("cat", cat_pipeline, cat_cols)) - - preprocessor = ColumnTransformer(transformers, remainder="drop") - - X_proc = preprocessor.fit_transform(X) - - # Check number of unique classes for stratification - # Only stratify for classification problems with reasonable number of classes - if self.problem_type == "classification": - n_unique = len(np.unique(y)) - stratify_y = y if n_unique <= 20 else None - else: - stratify_y = None - - # Split: first test, then validation from remaining - X_train_val, X_test, y_train_val, y_test = train_test_split( - X_proc, - y, - test_size=self.test_size, - random_state=self.random_state, - stratify=stratify_y, - ) - - # compute val fraction relative to train_val - if self.val_size <= 0: - X_train, X_val, y_train, y_val = X_train_val, None, y_train_val, None - else: - val_fraction = self.val_size / (1.0 - self.test_size) - # Only stratify for classification problems with reasonable number of classes - if self.problem_type == "classification": - n_unique_train = len(np.unique(y_train_val)) - stratify_train = y_train_val if n_unique_train <= 20 else None - else: - stratify_train = None - X_train, X_val, y_train, y_val = train_test_split( - X_train_val, - y_train_val, - test_size=val_fraction, - random_state=self.random_state, - stratify=stratify_train, - ) - - return ( - X_train, - X_val, - X_test, - np.array(y_train), - (None if y_val is None else np.array(y_val)), - np.array(y_test), - ) diff --git a/ecosci/data/__init__.py b/ecosci/data/__init__.py new file mode 100644 index 0000000..1b84bcf --- /dev/null +++ b/ecosci/data/__init__.py @@ -0,0 +1,5 @@ +"""Data loading, preprocessing, and splitting utilities.""" + +from .loader import CSVDataLoader + +__all__ = ["CSVDataLoader"] diff --git a/ecosci/data/loader.py b/ecosci/data/loader.py new file mode 100644 index 0000000..3a3f859 --- /dev/null +++ b/ecosci/data/loader.py @@ -0,0 +1,396 @@ +"""CSV data loader with preprocessing pipeline.""" + +from typing import List, Optional, Tuple +import pandas as pd +import numpy as np + +from .preprocessing import ( + get_feature_names_after_transform, + build_preprocessing_pipeline, + encode_labels, +) +from .splitting import prepare_cv_folds, prepare_grouped_splits + + +class CSVDataLoader: + """Load a CSV and build a simple preprocessing pipeline. + + How to use (via YAML): specify the CSV path, list the feature columns and the + label column, and choose options for imputation and scaling. + + Notes + ----- + - Scaling: "standard" makes numbers with mean=0 and std=1 (good default). + "minmax" squashes numbers to 0..1. + - Categorical text columns are one-hot encoded automatically. + - Missing numbers use the mean (by default). Categorical uses the most + frequent value. + + Parameters + ---------- + path : str + Path to CSV file + features : list of str, optional + Column names to use as features. 
If None, all columns except labels are used. + label : str, optional + Single target column name (for backward compatibility) + labels : list of str, optional + Multiple target column names (for multi-output) + test_size : float + Fraction of data for test set + val_size : float + Fraction of data for validation set + random_state : int + Random seed for reproducibility + scaling : str + "standard" or "minmax" scaling + impute_strategy : str + Imputation strategy: "mean", "median", or "most_frequent" + problem_type : str + "classification" or "regression" + cv_group_column : str, optional + Column name for grouping in cross-validation + """ + + def __init__( + self, + path: str, + features: Optional[List[str]] = None, + label: Optional[str] = None, + labels: Optional[List[str]] = None, + test_size: float = 0.2, + val_size: float = 0.2, + random_state: int = 0, + scaling: str = "standard", + impute_strategy: str = "mean", + problem_type: str = "classification", + cv_group_column: Optional[str] = None, + ): + self.path = path + self.features = features + # Support both single label and multiple labels + if labels is not None: + self.labels = labels if isinstance(labels, list) else [labels] + self.label = None + elif label is not None: + self.labels = [label] + self.label = label + else: + self.labels = None + self.label = None + self.test_size = test_size + self.val_size = val_size + self.random_state = random_state + self.scaling = scaling + self.impute_strategy = impute_strategy + self.problem_type = problem_type + self.cv_group_column = cv_group_column + + def load(self) -> pd.DataFrame: + """Read the CSV into a DataFrame.""" + df = pd.read_csv(self.path) + return df + + def _get_feature_names_after_transform(self, preprocessor, numeric_cols, cat_cols): + """Get feature names after preprocessing (including one-hot encoding).""" + return get_feature_names_after_transform(preprocessor, numeric_cols, cat_cols) + + def prepare(self) -> Tuple: + """Prepare arrays for modeling. + + Returns + ------- + tuple + (X_train, X_val, X_test, y_train, y_val, y_test) + + Under the hood we use scikit-learn pipelines for imputation, encoding + and scaling to keep things robust and simple. 
+ """ + from sklearn.model_selection import train_test_split + + df = self.load() + + if self.labels is None: + raise ValueError("Label column(s) must be provided in config") + + if self.features is None: + # default: all columns except labels + self.features = [c for c in df.columns if c not in self.labels] + + X = df[self.features].copy() + + # When single label, DataFrame.iloc returns Series; when multiple, returns DataFrame + if len(self.labels) == 1: + y = df[self.labels[0]].copy() # Get as Series + mask_target = ~y.isna() + else: + y = df[self.labels].copy() # Get as DataFrame + mask_target = ~y.isna().any(axis=1) + + if mask_target.any() and not mask_target.all(): + n_dropped_target = (~mask_target).sum() + print(f"Warning: Dropping {n_dropped_target} rows with NaN in target {self.labels}") + X = X[mask_target] + y = y[mask_target] + + # Drop rows with NaN in any features + mask_features = ~X.isna().any(axis=1) + if not mask_features.all(): + n_dropped_features = (~mask_features).sum() + cols_with_nan = X.columns[X.isna().any()].tolist() + print(f"Warning: Dropping {n_dropped_features} additional rows with NaN in features: {cols_with_nan}") + X = X[mask_features] + y = y[mask_features] + + # Reset indices after dropping rows + X = X.reset_index(drop=True) + y = y.reset_index(drop=True) + + # Encode labels if they are strings/categorical + y, self.label_encoders, self.label_classes_dict, self.label_encoder, self.label_classes = encode_labels( + y, self.labels, self.problem_type + ) + + # Identify numeric vs categorical (strings, categories, etc.) + numeric_cols = X.select_dtypes(include=["number"]).columns.tolist() + cat_cols = [c for c in X.columns if c not in numeric_cols] + + # Build preprocessing pipeline + preprocessor = build_preprocessing_pipeline( + numeric_cols, cat_cols, self.scaling, self.impute_strategy + ) + + X_proc = preprocessor.fit_transform(X) + + # Store preprocessor and generate feature names after transformation + self.preprocessor = preprocessor + self.processed_feature_names = self._get_feature_names_after_transform( + preprocessor, numeric_cols, cat_cols + ) + + # Check number of unique classes for stratification + if self.problem_type == "classification" and len(self.labels) == 1: + n_unique = len(np.unique(y)) + stratify_y = y if n_unique <= 20 else None + else: + # Cannot stratify on multi-output or regression + stratify_y = None + + # Split: first test, then validation from remaining + X_train_val, X_test, y_train_val, y_test = train_test_split( + X_proc, + y, + test_size=self.test_size, + random_state=self.random_state, + stratify=stratify_y, + ) + + # compute val fraction relative to train_val + if self.val_size <= 0: + X_train, X_val, y_train, y_val = X_train_val, None, y_train_val, None + else: + val_fraction = self.val_size / (1.0 - self.test_size) + # Only stratify for classification problems with reasonable number of classes + if self.problem_type == "classification" and len(self.labels) == 1: + n_unique_train = len(np.unique(y_train_val)) + stratify_train = y_train_val if n_unique_train <= 20 else None + else: + stratify_train = None + X_train, X_val, y_train, y_val = train_test_split( + X_train_val, + y_train_val, + test_size=val_fraction, + random_state=self.random_state, + stratify=stratify_train, + ) + + return ( + X_train, + X_val, + X_test, + np.array(y_train), + (None if y_val is None else np.array(y_val)), + np.array(y_test), + ) + + def prepare_cv_folds(self): + """Prepare k-fold cross-validation splits using group-based splitting. 
+ + Returns + ------- + list of tuples + Each tuple contains (X_train, X_val, X_test, y_train, y_val, y_test, fold_id) + + Notes + ----- + - Uses GroupKFold to keep all samples from the same group together + - Number of folds = number of unique groups + """ + if self.cv_group_column is None: + raise ValueError("cv_group_column must be specified for k-fold cross-validation") + + df = self.load() + + if self.labels is None: + raise ValueError("Label column(s) must be provided in config") + + if self.features is None: + self.features = [c for c in df.columns if c not in self.labels and c != self.cv_group_column] + + # Get group IDs + if self.cv_group_column not in df.columns: + raise ValueError(f"Group column '{self.cv_group_column}' not found in data") + + groups = df[self.cv_group_column].copy() + X = df[self.features].copy() + + # Handle single vs multi-output + if len(self.labels) == 1: + y = df[self.labels[0]].copy() + mask_target = ~y.isna() + else: + y = df[self.labels].copy() + mask_target = ~y.isna().any(axis=1) + + # Drop NaN rows + if mask_target.any() and not mask_target.all(): + n_dropped_target = (~mask_target).sum() + print(f"Warning: Dropping {n_dropped_target} rows with NaN in target {self.labels}") + X, y, groups = X[mask_target], y[mask_target], groups[mask_target] + + mask_features = ~X.isna().any(axis=1) + if not mask_features.all(): + n_dropped_features = (~mask_features).sum() + cols_with_nan = X.columns[X.isna().any()].tolist() + print(f"Warning: Dropping {n_dropped_features} additional rows with NaN in features: {cols_with_nan}") + X, y, groups = X[mask_features], y[mask_features], groups[mask_features] + + # Reset indices + X = X.reset_index(drop=True) + y = y.reset_index(drop=True) + groups = groups.reset_index(drop=True) + + # Encode labels + y, self.label_encoders, self.label_classes_dict, self.label_encoder, self.label_classes = encode_labels( + y, self.labels, self.problem_type + ) + + # Identify numeric vs categorical + numeric_cols = X.select_dtypes(include=["number"]).columns.tolist() + cat_cols = [c for c in X.columns if c not in numeric_cols] + + # Store feature names from a sample fit + sample_preprocessor = build_preprocessing_pipeline( + numeric_cols, cat_cols, self.scaling, self.impute_strategy + ) + sample_preprocessor.fit(X) + self.processed_feature_names = self._get_feature_names_after_transform( + sample_preprocessor, numeric_cols, cat_cols + ) + + # Prepare folds + return prepare_cv_folds( + X, y, groups, self.labels, numeric_cols, cat_cols, + self.val_size, self.random_state, self.scaling, self.impute_strategy, + self.problem_type, self.cv_group_column + ) + + def prepare_grouped_splits( + self, + n_train_groups: int = 4, + n_val_groups: int = 2, + n_test_groups: int = 2, + ): + """Prepare train/val/test splits by assigning groups to each split. + + This method ensures no data leakage by keeping all samples from the same + group together in one split. 
+ + Parameters + ---------- + n_train_groups : int + Number of groups to assign to training set + n_val_groups : int + Number of groups to assign to validation set + n_test_groups : int + Number of groups to assign to test set + + Returns + ------- + tuple + (X_train, X_val, X_test, y_train, y_val, y_test, group_assignments, + groups_train, groups_val, groups_test) + """ + if self.cv_group_column is None: + raise ValueError("cv_group_column must be specified for grouped train/val/test splits") + + df = self.load() + + if self.labels is None: + raise ValueError("Label column(s) must be provided in config") + + if self.features is None: + self.features = [c for c in df.columns if c not in self.labels and c != self.cv_group_column] + + if self.cv_group_column not in df.columns: + raise ValueError(f"Group column '{self.cv_group_column}' not found in data") + + groups = df[self.cv_group_column].copy() + X = df[self.features].copy() + + # Handle single vs multi-output + if len(self.labels) == 1: + y = df[self.labels[0]].copy() + mask_target = ~y.isna() + else: + y = df[self.labels].copy() + mask_target = ~y.isna().any(axis=1) + + # Drop NaN rows + if mask_target.any() and not mask_target.all(): + n_dropped_target = (~mask_target).sum() + print(f"Warning: Dropping {n_dropped_target} rows with NaN in target {self.labels}") + X, y, groups = X[mask_target], y[mask_target], groups[mask_target] + + mask_features = ~X.isna().any(axis=1) + if not mask_features.all(): + n_dropped_features = (~mask_features).sum() + cols_with_nan = X.columns[X.isna().any()].tolist() + print(f"Warning: Dropping {n_dropped_features} additional rows with NaN in features: {cols_with_nan}") + X, y, groups = X[mask_features], y[mask_features], groups[mask_features] + + # Reset indices + X = X.reset_index(drop=True) + y = y.reset_index(drop=True) + groups = groups.reset_index(drop=True) + + # Encode labels + y, self.label_encoders, self.label_classes_dict, self.label_encoder, self.label_classes = encode_labels( + y, self.labels, self.problem_type + ) + + # Identify numeric vs categorical + numeric_cols = X.select_dtypes(include=["number"]).columns.tolist() + cat_cols = [c for c in X.columns if c not in numeric_cols] + + # Prepare grouped splits + result = prepare_grouped_splits( + X, y, groups, self.labels, numeric_cols, cat_cols, + self.random_state, self.scaling, self.impute_strategy, + n_train_groups, n_val_groups, n_test_groups, self.cv_group_column + ) + + # Unpack result + (X_train, X_val, X_test, y_train, y_val, y_test, group_assignments, + groups_train, groups_val, groups_test, preprocessor) = result + + # Store preprocessor and feature names + self.preprocessor = preprocessor + self.processed_feature_names = self._get_feature_names_after_transform( + preprocessor, numeric_cols, cat_cols + ) + + return ( + X_train, X_val, X_test, y_train, y_val, y_test, group_assignments, + groups_train, groups_val, groups_test + ) diff --git a/ecosci/data/preprocessing.py b/ecosci/data/preprocessing.py new file mode 100644 index 0000000..9748690 --- /dev/null +++ b/ecosci/data/preprocessing.py @@ -0,0 +1,144 @@ +"""Preprocessing utilities for feature extraction and transformation.""" + +from typing import List + + +def get_feature_names_after_transform(preprocessor, numeric_cols: List[str], cat_cols: List[str]) -> List[str]: + """Get feature names after preprocessing (including one-hot encoding). 
+ + Parameters + ---------- + preprocessor : ColumnTransformer + Fitted sklearn ColumnTransformer with numeric and categorical pipelines + numeric_cols : list of str + Names of numeric columns + cat_cols : list of str + Names of categorical columns + + Returns + ------- + list of str + Feature names after transformation + """ + feature_names = [] + + # Get feature names from each transformer + for name, transformer, columns in preprocessor.transformers_: + if name == "num": + # Numeric columns keep their names + feature_names.extend(columns) + elif name == "cat": + # For categorical columns, get one-hot encoded names + try: + # Get the OneHotEncoder from the pipeline + onehot = transformer.named_steps.get("onehot") + if onehot is not None and hasattr(onehot, "get_feature_names_out"): + # Get one-hot feature names + cat_features = onehot.get_feature_names_out(columns) + feature_names.extend(cat_features) + else: + # Fallback: just use column names + feature_names.extend(columns) + except Exception: + # Fallback if something goes wrong + feature_names.extend(columns) + + return feature_names + + +def build_preprocessing_pipeline(numeric_cols: List[str], cat_cols: List[str], + scaling: str = "standard", impute_strategy: str = "mean"): + """Build a scikit-learn preprocessing pipeline. + + Parameters + ---------- + numeric_cols : list of str + Names of numeric columns + cat_cols : list of str + Names of categorical columns + scaling : str + "standard" or "minmax" scaling for numeric features + impute_strategy : str + Imputation strategy for numeric features ("mean", "median", "most_frequent") + + Returns + ------- + ColumnTransformer + Preprocessing pipeline ready to fit + """ + from sklearn.impute import SimpleImputer + from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder + from sklearn.compose import ColumnTransformer + from sklearn.pipeline import Pipeline + + transformers = [] + + if numeric_cols: + scaler = StandardScaler() if scaling == "standard" else MinMaxScaler() + num_pipeline = Pipeline([ + ("imputer", SimpleImputer(strategy=impute_strategy)), + ("scaler", scaler) + ]) + transformers.append(("num", num_pipeline, numeric_cols)) + + if cat_cols: + cat_pipeline = Pipeline([ + ("imputer", SimpleImputer(strategy="most_frequent")), + ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)), + ]) + transformers.append(("cat", cat_pipeline, cat_cols)) + + return ColumnTransformer(transformers=transformers, remainder="drop") + + +def encode_labels(y, labels: List[str], problem_type: str = "classification"): + """Encode labels if they are categorical strings. 
+ + Parameters + ---------- + y : pandas Series or DataFrame + Target variable(s) + labels : list of str + Names of target variable(s) + problem_type : str + "classification" or "regression" + + Returns + ------- + tuple + (encoded_y, label_encoders_dict, label_classes_dict, label_encoder, label_classes) + Last two maintain backward compatibility for single output + """ + from sklearn.preprocessing import LabelEncoder + import numpy as np + import pandas as pd + + label_encoders = {} + label_classes_dict = {} + + if len(labels) == 1: + # Single output: maintain backward compatibility + if y.dtype == "object" or y.dtype.name == "category": + label_encoder = LabelEncoder() + y_encoded = label_encoder.fit_transform(y) + label_encoders[labels[0]] = label_encoder + label_classes_dict[labels[0]] = label_encoder.classes_ + return y_encoded, label_encoders, label_classes_dict, label_encoder, label_encoder.classes_ + else: + return y, label_encoders, label_classes_dict, None, None + else: + # Multi-output: encode each column separately + y_encoded = [] + for col in labels: + if y[col].dtype == "object" or y[col].dtype.name == "category": + label_encoder = LabelEncoder() + y_col_encoded = label_encoder.fit_transform(y[col]) + label_encoders[col] = label_encoder + label_classes_dict[col] = label_encoder.classes_ + else: + y_col_encoded = y[col].values + label_encoders[col] = None + label_classes_dict[col] = None + y_encoded.append(y_col_encoded) + y_encoded = np.column_stack(y_encoded) + return y_encoded, label_encoders, label_classes_dict, None, None diff --git a/ecosci/data/splitting.py b/ecosci/data/splitting.py new file mode 100644 index 0000000..a832071 --- /dev/null +++ b/ecosci/data/splitting.py @@ -0,0 +1,283 @@ +"""Data splitting utilities for cross-validation and grouped splits.""" + +from typing import List, Tuple +import numpy as np +import pandas as pd + + +def prepare_cv_folds( + X: pd.DataFrame, + y, + groups: pd.Series, + labels: List[str], + numeric_cols: List[str], + cat_cols: List[str], + val_size: float, + random_state: int, + scaling: str, + impute_strategy: str, + problem_type: str, + cv_group_column: str, +) -> List[Tuple]: + """Prepare k-fold cross-validation splits using group-based splitting. 
+ + Parameters + ---------- + X : DataFrame + Feature data + y : array-like + Target data (already encoded if categorical) + groups : Series + Group IDs for GroupKFold + labels : list of str + Names of target variables + numeric_cols : list of str + Names of numeric columns + cat_cols : list of str + Names of categorical columns + val_size : float + Validation set size as fraction + random_state : int + Random seed + scaling : str + "standard" or "minmax" + impute_strategy : str + Imputation strategy + problem_type : str + "classification" or "regression" + cv_group_column : str + Name of the group column + + Returns + ------- + list of tuples + Each tuple contains (X_train, X_val, X_test, y_train, y_val, y_test, fold_id) + """ + from sklearn.model_selection import GroupKFold, train_test_split + from .preprocessing import build_preprocessing_pipeline + + unique_groups = groups.unique() + n_folds = len(unique_groups) + + print(f"\nK-Fold Cross-Validation Setup:") + print(f" Group column: {cv_group_column}") + print(f" Number of unique groups: {n_folds}") + print(f" Number of folds: {n_folds}") + print(f" Samples per group: {groups.value_counts().to_dict()}\n") + + # Create k-fold splits + group_kfold = GroupKFold(n_splits=n_folds) + + fold_data = [] + for fold_idx, (train_val_idx, test_idx) in enumerate(group_kfold.split(X, y, groups)): + # Get raw train_val and test data + X_train_val_raw = X.iloc[train_val_idx] + y_train_val = y[train_val_idx] if isinstance(y, np.ndarray) else y.iloc[train_val_idx] + X_test_raw = X.iloc[test_idx] + y_test = y[test_idx] if isinstance(y, np.ndarray) else y.iloc[test_idx] + + # Create a NEW preprocessor for this fold (to avoid data leakage) + fold_preprocessor = build_preprocessing_pipeline( + numeric_cols, cat_cols, scaling, impute_strategy + ) + + # Fit preprocessor ONLY on train_val data (not on test!) + X_train_val_proc = fold_preprocessor.fit_transform(X_train_val_raw) + X_test_proc = fold_preprocessor.transform(X_test_raw) + + # Further split train_val into train and val + if val_size <= 0: + X_train, X_val, y_train, y_val = X_train_val_proc, None, y_train_val, None + else: + val_fraction = val_size / (1.0 - (1.0 / n_folds)) + + # For stratification + if problem_type == "classification" and len(labels) == 1: + n_unique_train = len(np.unique(y_train_val)) + stratify_train = y_train_val if n_unique_train <= 20 else None + else: + stratify_train = None + + X_train, X_val, y_train, y_val = train_test_split( + X_train_val_proc, + y_train_val, + test_size=val_fraction, + random_state=random_state, + stratify=stratify_train, + ) + + fold_data.append(( + X_train, + X_val, + X_test_proc, + np.array(y_train), + (None if y_val is None else np.array(y_val)), + np.array(y_test), + fold_idx + )) + + return fold_data + + +def prepare_grouped_splits( + X: pd.DataFrame, + y, + groups: pd.Series, + labels: List[str], + numeric_cols: List[str], + cat_cols: List[str], + random_state: int, + scaling: str, + impute_strategy: str, + n_train_groups: int, + n_val_groups: int, + n_test_groups: int, + cv_group_column: str, +) -> Tuple: + """Prepare train/val/test splits by assigning groups to each split. + + This function ensures no data leakage by keeping all samples from the same + group together in one split (train, val, or test). 
+ + Parameters + ---------- + X : DataFrame + Feature data + y : array-like + Target data (already encoded if categorical) + groups : Series + Group IDs + labels : list of str + Names of target variables + numeric_cols : list of str + Names of numeric columns + cat_cols : list of str + Names of categorical columns + random_state : int + Random seed + scaling : str + "standard" or "minmax" + impute_strategy : str + Imputation strategy + n_train_groups : int + Number of groups for training + n_val_groups : int + Number of groups for validation + n_test_groups : int + Number of groups for testing + cv_group_column : str + Name of the group column + + Returns + ------- + tuple + (X_train, X_val, X_test, y_train, y_val, y_test, group_assignments, + groups_train, groups_val, groups_test, preprocessor) + + X_train : ndarray + Preprocessed training features + X_val : ndarray + Preprocessed validation features + X_test : ndarray + Preprocessed testing features + y_train : ndarray + Training target values + y_val : ndarray + Validation target values + y_test : ndarray + Testing target values + group_assignments : dict + Dictionary mapping split names to group IDs + groups_train : ndarray + Group IDs for training samples + groups_val : ndarray + Group IDs for validation samples + groups_test : ndarray + Group IDs for testing samples + preprocessor : ColumnTransformer + Fitted preprocessing pipeline + """ + from .preprocessing import build_preprocessing_pipeline + + unique_groups = groups.unique() + n_groups = len(unique_groups) + + # Validate we have enough groups + total_groups_needed = n_train_groups + n_val_groups + n_test_groups + if n_groups < total_groups_needed: + raise ValueError( + f"Not enough groups: need {total_groups_needed} " + f"({n_train_groups} train + {n_val_groups} val + {n_test_groups} test), " + f"but only have {n_groups} unique groups" + ) + + # Shuffle groups for random assignment + rng = np.random.RandomState(random_state) + shuffled_groups = rng.permutation(unique_groups) + + # Assign groups to splits + train_groups = shuffled_groups[:n_train_groups] + val_groups = shuffled_groups[n_train_groups:n_train_groups + n_val_groups] + test_groups = shuffled_groups[n_train_groups + n_val_groups:n_train_groups + n_val_groups + n_test_groups] + + group_assignments = { + 'train': train_groups.tolist(), + 'val': val_groups.tolist(), + 'test': test_groups.tolist(), + } + + print(f"\nGrouped Train/Val/Test Split Setup:") + print(f" Group column: {cv_group_column}") + print(f" Total unique groups: {n_groups}") + print(f" Train groups ({n_train_groups}): {train_groups.tolist()}") + print(f" Val groups ({n_val_groups}): {val_groups.tolist()}") + print(f" Test groups ({n_test_groups}): {test_groups.tolist()}") + + # Create masks for each split + train_mask = groups.isin(train_groups) + val_mask = groups.isin(val_groups) + test_mask = groups.isin(test_groups) + + print(f"\n Train samples: {train_mask.sum()}") + print(f" Val samples: {val_mask.sum()}") + print(f" Test samples: {test_mask.sum()}\n") + + # Build preprocessing pipeline + preprocessor = build_preprocessing_pipeline( + numeric_cols, cat_cols, scaling, impute_strategy + ) + + # Fit preprocessor on training data only (important!) 
+ X_train_raw = X[train_mask] + preprocessor.fit(X_train_raw) + + # Transform all splits + X_proc = preprocessor.transform(X) + + # Split the data + X_train = X_proc[train_mask] + X_val = X_proc[val_mask] + X_test = X_proc[test_mask] + + y_train = y[train_mask] + y_val = y[val_mask] + y_test = y[test_mask] + + # Extract group IDs for each split + groups_train = groups[train_mask].values + groups_val = groups[val_mask].values + groups_test = groups[test_mask].values + + return ( + X_train, + X_val, + X_test, + np.array(y_train), + np.array(y_val), + np.array(y_test), + group_assignments, + groups_train, + groups_val, + groups_test, + preprocessor, # Return for feature name extraction + ) diff --git a/ecosci/eval.py b/ecosci/eval.py deleted file mode 100644 index 6e798dd..0000000 --- a/ecosci/eval.py +++ /dev/null @@ -1,501 +0,0 @@ -"""Evaluation and reporting utilities. - -Focus on metrics that make sense for unbalanced datasets: -- Balanced accuracy (accounts for class imbalance) -- Precision/Recall/F1 (macro for multi-class) -- ROC-AUC and Average Precision (PR AUC) when probabilities are available - -Outputs: -- `report.json`: list of per-seed metrics -- `metric_*.png`: quick boxplots across seeds -- `confusion_matrix.png`: confusion matrix heatmap (last run) -- `pr_curve.png`: precision-recall curve if binary and probabilities exist -""" - -from typing import Dict, Any, Optional -import os -import json -import numpy as np -import matplotlib.pyplot as plt -import seaborn as sns - - -def safe_std(vals): - """Calculate std, returning 0.0 instead of nan when there's only 1 value.""" - if len(vals) <= 1: - return 0.0 - return vals.std() - - -def compute_classification_metrics(y_true, y_pred, y_proba=None) -> Dict[str, Any]: - from sklearn.metrics import ( - accuracy_score, - precision_score, - recall_score, - f1_score, - roc_auc_score, - confusion_matrix, - balanced_accuracy_score, - average_precision_score, - cohen_kappa_score, - ) - - out = {} - out["accuracy"] = accuracy_score(y_true, y_pred) - out["balanced_accuracy"] = balanced_accuracy_score(y_true, y_pred) - out["precision"] = precision_score( - y_true, - y_pred, - average="binary" if len(np.unique(y_true)) == 2 else "macro", - zero_division=0, - ) - out["recall"] = recall_score( - y_true, - y_pred, - average="binary" if len(np.unique(y_true)) == 2 else "macro", - zero_division=0, - ) - out["f1"] = f1_score( - y_true, - y_pred, - average="binary" if len(np.unique(y_true)) == 2 else "macro", - zero_division=0, - ) - out["cohen_kappa"] = cohen_kappa_score(y_true, y_pred) - out["confusion_matrix"] = confusion_matrix(y_true, y_pred).tolist() - - if y_proba is not None: - try: - # Binary classification: use positive class probabilities - if y_proba.ndim == 2 and y_proba.shape[1] == 2: - out["roc_auc"] = roc_auc_score(y_true, y_proba[:, 1]) - out["average_precision"] = average_precision_score( - y_true, y_proba[:, 1] - ) - # Multiclass: use OVR strategy - elif y_proba.ndim == 2 and y_proba.shape[1] > 2: - out["roc_auc"] = roc_auc_score(y_true, y_proba, multi_class="ovr") - out["average_precision"] = average_precision_score( - y_true, y_proba, average="macro" - ) - # 1D probabilities (rare, but handle it) - else: - out["roc_auc"] = roc_auc_score(y_true, y_proba) - out["average_precision"] = average_precision_score(y_true, y_proba) - except Exception as e: - print(f"Warning: Could not compute ROC-AUC/AP: {e}") - out["roc_auc"] = None - out["average_precision"] = None - else: - out["roc_auc"] = None - out["average_precision"] = None - - return 
out - - -def compute_regression_metrics(y_true, y_pred) -> Dict[str, Any]: - """Compute regression metrics. - - Parameters - ---------- - y_true : array-like - Ground truth target values. - y_pred : array-like - Predicted target values. - - Returns - ------- - Dict[str, Any] - Dictionary containing: - - mse: Mean Squared Error - - rmse: Root Mean Squared Error - - mae: Mean Absolute Error - - r2: R-squared (coefficient of determination) - - mape: Mean Absolute Percentage Error (if no zeros in y_true) - """ - from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score - - out = {} - out["mse"] = mean_squared_error(y_true, y_pred) - out["rmse"] = np.sqrt(out["mse"]) - out["mae"] = mean_absolute_error(y_true, y_pred) - out["r2"] = r2_score(y_true, y_pred) - - # MAPE - only compute if no zeros in y_true - try: - if not np.any(y_true == 0): - mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100 - out["mape"] = mape - else: - out["mape"] = None - except Exception: - out["mape"] = None - - return out - - -def evaluate_and_report( - results, y_test, output_dir: str = "outputs", problem_type: str = "classification" -): - """Compute metrics across runs and save a compact report and plots. - - Parameters - ---------- - results : dict or list - If dict: Output from Trainer.run() with multiple models {model_name: [run_results]} - If list: Single model results (backward compatibility) - y_test : array-like - Ground-truth labels (classification) or values (regression) for the test set. - output_dir : str - Where to save the report and plots. - problem_type : str - "classification" or "regression". Determines which metrics to compute. - """ - os.makedirs(output_dir, exist_ok=True) - import pandas as pd - - # Handle backward compatibility: convert list to dict - if isinstance(results, list): - # Assume single model (old format) - model_name = results[0].get("model_name", "model") if results else "model" - results = {model_name: results} - - # Compute metrics for each model - all_summaries = {} - all_dfs = {} - - for model_name, model_results in results.items(): - summary = [] - for r in model_results: - seed = r.get("seed") - y_pred = r.get("y_pred") - y_proba = r.get("y_proba") - - if problem_type == "regression": - metrics = compute_regression_metrics(y_test, y_pred) - else: - metrics = compute_classification_metrics(y_test, y_pred, y_proba) - - metrics["seed"] = seed - metrics["model"] = model_name - summary.append(metrics) - - all_summaries[model_name] = summary - all_dfs[model_name] = pd.DataFrame(summary) - - # Save individual model reports in model-specific subfolders - for model_name, summary in all_summaries.items(): - model_dir = os.path.join(output_dir, model_name) - os.makedirs(model_dir, exist_ok=True) - report_path = os.path.join(model_dir, f"report_{model_name}.json") - with open(report_path, "w") as f: - json.dump(summary, f, indent=2) - - # Combine all results for comparison - combined_df = pd.concat(all_dfs.values(), ignore_index=True) - combined_summary = [item for sublist in all_summaries.values() for item in sublist] - - # Save combined report - with open(os.path.join(output_dir, "report_all_models.json"), "w") as f: - json.dump(combined_summary, f, indent=2) - - # Print results for each model - if problem_type == "regression": - display_cols = ["seed", "mse", "rmse", "mae", "r2"] - if "mape" in combined_df.columns: - display_cols.append("mape") - comparison_metrics = ["mse", "rmse", "mae", "r2"] - plot_metrics = ["mse", "rmse", "mae", "r2"] - else: - display_cols 
= [ - "seed", - "accuracy", - "balanced_accuracy", - "precision", - "recall", - "f1", - "cohen_kappa", - ] - if "roc_auc" in combined_df.columns: - display_cols.append("roc_auc") - if "average_precision" in combined_df.columns: - display_cols.append("average_precision") - comparison_metrics = [ - "accuracy", - "balanced_accuracy", - "f1", - "cohen_kappa", - "roc_auc", - ] - plot_metrics = ["accuracy", "balanced_accuracy", "f1", "cohen_kappa"] - - for model_name, df in all_dfs.items(): - print(f"\n{'='*80}") - print(f"RESULTS FOR MODEL: {model_name.upper()}") - print(f"{'='*80}") - print( - df[display_cols].to_string( - index=False, - float_format=lambda x: f"{x:.4f}" if not pd.isna(x) else "N/A", - ) - ) - - print(f"\n{'-'*80}") - print(f"SUMMARY STATISTICS FOR {model_name.upper()} (mean ± std)") - print(f"{'-'*80}") - metric_cols = [c for c in display_cols if c != "seed"] - for col in metric_cols: - if col in df.columns: - vals = df[col].dropna() - if len(vals) > 0: - print(f"{col:20s}: {vals.mean():.4f} ± {safe_std(vals):.4f}") - print(f"{'-'*80}\n") - - # Model comparison table - if len(all_dfs) > 1: - print(f"\n{'='*80}") - print("MODEL COMPARISON (mean ± std)") - print(f"{'='*80}") - comparison_data = [] - for model_name, df in all_dfs.items(): - row = {"Model": model_name} - for col in comparison_metrics: - if col in df.columns: - vals = df[col].dropna() - if len(vals) > 0: - row[col] = f"{vals.mean():.4f} ± {safe_std(vals):.4f}" - else: - row[col] = "N/A" - comparison_data.append(row) - - comparison_df = pd.DataFrame(comparison_data) - print(comparison_df.to_string(index=False)) - print(f"{'='*80}\n") - - # Determine best and second-best models - # For regression: lower MSE is better - # Select primary metric based on problem type - # For classification: Cohen's kappa is more robust as it accounts for chance agreement - # For regression: MSE is standard (lower is better) - if problem_type == "regression": - primary_metric = "mse" - lower_is_better = True - else: - primary_metric = "cohen_kappa" - lower_is_better = False - - # Calculate mean primary metric for each model - model_scores = [] - for model_name, df in all_dfs.items(): - if primary_metric in df.columns: - vals = df[primary_metric].dropna() - if len(vals) > 0: - model_scores.append( - { - "model": model_name, - "mean": vals.mean(), - "std": safe_std(vals), - } - ) - - if len(model_scores) > 0: - # Sort by primary metric - model_scores.sort(key=lambda x: x["mean"], reverse=not lower_is_better) - - print(f"\n{'='*80}") - print("MODEL RANKING") - print(f"{'='*80}") - print( - f"Ranked by: {primary_metric.replace('_', ' ').upper()} ({'Lower is better' if lower_is_better else 'Higher is better'})" - ) - print(f"{'-'*80}") - - # Create ranking table with all metrics - ranking_data = [] - for rank, score in enumerate(model_scores, 1): - model_name = score["model"] - model_df = all_dfs[model_name] - - row = { - "Rank": f"{rank}.", - "Model": model_name.upper(), - primary_metric: f"{score['mean']:.4f} ± {score['std']:.4f}", - } - - # Add other metrics - metric_cols = [ - c for c in display_cols if c != "seed" and c != primary_metric - ] - for col in metric_cols: - if col in model_df.columns: - vals = model_df[col].dropna() - if len(vals) > 0: - row[col] = f"{vals.mean():.4f} ± {safe_std(vals):.4f}" - - ranking_data.append(row) - - # Print ranking table - ranking_df = pd.DataFrame(ranking_data) - print(ranking_df.to_string(index=False)) - print(f"{'='*80}\n") - - # Plot comparison across models - - if len(all_dfs) > 1: - # 
Multi-model comparison plots - for m in plot_metrics: - if m in combined_df.columns: - plt.figure(figsize=(max(6, len(all_dfs) * 1.5), 4)) - sns.boxplot(data=combined_df, x="model", y=m) - plt.title(f"{m.replace('_', ' ').title()} - Model Comparison") - plt.xlabel("Model") - plt.ylabel(m.replace("_", " ").title()) - plt.xticks(rotation=45, ha="right") - plt.tight_layout() - plt.savefig(os.path.join(output_dir, f"comparison_{m}.png")) - plt.close() - else: - # Single model plots (backward compatibility) - for m in plot_metrics: - if m in combined_df.columns: - plt.figure(figsize=(4, 3)) - sns.boxplot(data=combined_df, y=m) - plt.title(m.replace("_", " ").title()) - plt.tight_layout() - plt.savefig(os.path.join(output_dir, f"metric_{m}.png")) - plt.close() - - # Confusion matrices for each model (last run) - classification only - if problem_type == "classification": - for model_name, model_results in results.items(): - try: - from sklearn.metrics import confusion_matrix - - last = model_results[-1] - cm = confusion_matrix(y_test, last.get("y_pred")) - plt.figure(figsize=(5, 4)) - sns.heatmap(cm, annot=True, fmt="d", cmap="Blues") - plt.title(f"Confusion Matrix - {model_name.upper()} (last run)") - plt.xlabel("Predicted") - plt.ylabel("True") - plt.tight_layout() - model_dir = os.path.join(output_dir, model_name) - os.makedirs(model_dir, exist_ok=True) - plt.savefig( - os.path.join(model_dir, f"confusion_matrix_{model_name}.png") - ) - plt.close() - except Exception as e: - print( - f"Warning: Could not create confusion matrix for {model_name}: {e}" - ) - - # Precision-Recall curves for each model - classification only - if problem_type == "classification": - for model_name, model_results in results.items(): - try: - from sklearn.metrics import ( - precision_recall_curve, - average_precision_score, - ) - - last = model_results[-1] - y_proba = last.get("y_proba") - if y_proba is not None: - # if proba has shape (n,2), take positive class - if getattr(y_proba, "ndim", 1) == 2 and y_proba.shape[1] >= 2: - pos_scores = y_proba[:, 1] - else: - pos_scores = y_proba - precision, recall, _ = precision_recall_curve(y_test, pos_scores) - ap = average_precision_score(y_test, pos_scores) - plt.figure(figsize=(5, 4)) - plt.plot(recall, precision, label=f"AP={ap:.3f}") - plt.xlabel("Recall") - plt.ylabel("Precision") - plt.title(f"Precision-Recall - {model_name.upper()} (last run)") - plt.legend() - plt.tight_layout() - model_dir = os.path.join(output_dir, model_name) - os.makedirs(model_dir, exist_ok=True) - plt.savefig(os.path.join(model_dir, f"pr_curve_{model_name}.png")) - plt.close() - except Exception as e: - pass # Silently skip if PR curve can't be generated - - # Combined PR curve for model comparison (binary classification only) - if len(results) > 1: - try: - from sklearn.metrics import ( - precision_recall_curve, - average_precision_score, - ) - - plt.figure(figsize=(7, 5)) - for model_name, model_results in results.items(): - last = model_results[-1] - y_proba = last.get("y_proba") - if y_proba is not None: - if getattr(y_proba, "ndim", 1) == 2 and y_proba.shape[1] >= 2: - pos_scores = y_proba[:, 1] - else: - pos_scores = y_proba - precision, recall, _ = precision_recall_curve( - y_test, pos_scores - ) - ap = average_precision_score(y_test, pos_scores) - plt.plot( - recall, - precision, - label=f"{model_name} (AP={ap:.3f})", - linewidth=2, - ) - - plt.xlabel("Recall") - plt.ylabel("Precision") - plt.title("Precision-Recall Comparison (last run)") - plt.legend() - plt.grid(True, alpha=0.3) - 
plt.tight_layout() - plt.savefig(os.path.join(output_dir, "pr_curve_comparison.png")) - plt.close() - except Exception as e: - pass # Silently skip if comparison can't be generated - - # Residual plots for regression - elif problem_type == "regression": - for model_name, model_results in results.items(): - try: - last = model_results[-1] - y_pred = last.get("y_pred") - residuals = y_test - y_pred - - ax1, ax2 = plt.subplots(1, 2, figsize=(12, 4)) - - # Predicted vs Actual - ax1.scatter(y_test, y_pred, alpha=0.6) - min_val = min(y_test.min(), y_pred.min()) - max_val = max(y_test.max(), y_pred.max()) - ax1.plot([min_val, max_val], [min_val, max_val], "r--", lw=2) - ax1.set_xlabel("Actual") - ax1.set_ylabel("Predicted") - ax1.set_title(f"Predicted vs Actual - {model_name.upper()}") - ax1.grid(True, alpha=0.3) - - # Residual plot - ax2.scatter(y_pred, residuals, alpha=0.6) - ax2.axhline(y=0, color="r", linestyle="--", lw=2) - ax2.set_xlabel("Predicted") - ax2.set_ylabel("Residuals") - ax2.set_title(f"Residual Plot - {model_name.upper()}") - ax2.grid(True, alpha=0.3) - - plt.tight_layout() - model_dir = os.path.join(output_dir, model_name) - os.makedirs(model_dir, exist_ok=True) - plt.savefig(os.path.join(model_dir, f"residual_plot_{model_name}.png")) - plt.close() - except Exception as e: - print(f"Warning: Could not create residual plot for {model_name}: {e}") - - return combined_summary diff --git a/ecosci/evaluation/__init__.py b/ecosci/evaluation/__init__.py new file mode 100644 index 0000000..040b9a6 --- /dev/null +++ b/ecosci/evaluation/__init__.py @@ -0,0 +1,59 @@ +"""Evaluation module for EcoNetToolkit. + +This module provides comprehensive evaluation and reporting capabilities +for machine learning models, including: +- Metrics computation for classification and regression +- Feature importance analysis +- Visualisation and plotting +- Cross-validation results reporting +- Hyperparameter tuning evaluation + +The module is organised into submodules for better maintainability: +- metrics: Core metric computation functions +- plotting: Visualisation utilities +- feature_importance: Feature importance extraction and analysis +- reporting: Main evaluation and reporting orchestration + +For backward compatibility, key functions are re-exported from the main module. +""" + +# Re-export key functions for backward compatibility +from .metrics import ( + safe_std, + compute_classification_metrics, + compute_regression_metrics, + compute_multi_output_classification_metrics, + compute_multi_output_regression_metrics, +) + +from .reporting import ( + evaluate_and_report, + evaluate_and_report_cv, + evaluate_tuning_results, +) + +# Make submodules available +from . import metrics +from . import plotting +from . import feature_importance +from . import reporting + +__all__ = [ + # Core functions + 'evaluate_and_report', + 'evaluate_and_report_cv', + 'evaluate_tuning_results', + + # Metric functions + 'safe_std', + 'compute_classification_metrics', + 'compute_regression_metrics', + 'compute_multi_output_classification_metrics', + 'compute_multi_output_regression_metrics', + + # Submodules + 'metrics', + 'plotting', + 'feature_importance', + 'reporting', +] diff --git a/ecosci/evaluation/feature_importance.py b/ecosci/evaluation/feature_importance.py new file mode 100644 index 0000000..75da92e --- /dev/null +++ b/ecosci/evaluation/feature_importance.py @@ -0,0 +1,340 @@ +"""Feature importance extraction and analysis. 
+ +This module contains functions for extracting and analyzing feature importance +from trained models, including built-in importance (tree models, linear models) +and permutation importance (for models like MLPs). +""" + +from typing import Dict, Optional, List +import os +import numpy as np +import pandas as pd + +from .plotting import plot_feature_importance + + +def extract_feature_importance( + results: Dict, + X_test, + y_test, + output_dir: str, + problem_type: str, + feature_names: Optional[List[str]] = None +): + """Extract and report feature importance for all models. + + Parameters + ---------- + results : Dict + Results dictionary {model_name: [run_results]} + X_test : array-like + Test features + y_test : array-like + Test labels/values + output_dir : str + Directory to save outputs + problem_type : str + "classification" or "regression" + feature_names : Optional[List[str]] + Names of features + """ + import joblib + from sklearn.inspection import permutation_importance + from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor + + print(f"\n{'='*80}") + print("FEATURE IMPORTANCE ANALYSIS") + print(f"{'='*80}") + + for model_name, model_results in results.items(): + try: + # Collect feature importances from all trained models + all_importances = [] + + for result in model_results: + model_path = result.get("model_path") + + if not model_path or not os.path.exists(model_path): + continue + + model = joblib.load(model_path) + + # Extract feature importances based on model type + feature_importances = None + importance_method = None + + # For multi-output models, need to handle differently + if isinstance(model, (MultiOutputClassifier, MultiOutputRegressor)): + # Get the base estimator + if hasattr(model.estimators_[0], 'feature_importances_'): + # Average importances across outputs + importances_list = [est.feature_importances_ for est in model.estimators_ if hasattr(est, 'feature_importances_')] + if importances_list: + feature_importances = np.mean(importances_list, axis=0) + importance_method = "built-in (averaged across outputs)" + elif hasattr(model, 'feature_importances_'): + # Tree-based models (Random Forest, XGBoost) + feature_importances = model.feature_importances_ + importance_method = "built-in (Gini/gain)" + elif hasattr(model, 'coef_'): + # Linear models (use absolute coefficient values) + coef = model.coef_ + if len(coef.shape) > 1: + # Multi-output or multi-class: average absolute coefficients + feature_importances = np.mean(np.abs(coef), axis=0) + else: + feature_importances = np.abs(coef) + importance_method = "coefficient magnitude" + + # If no built-in importance, use permutation importance (works for all models including MLPs) + if feature_importances is None and X_test is not None: + # Only print once for the first model + if len(all_importances) == 0: + print(f"\n{model_name.upper()}: Computing permutation importance across all {len(model_results)} models...") + print(f" (This measures performance drop when each feature is shuffled)") + + # Determine scoring metric based on problem type + if problem_type == "classification": + scoring = "accuracy" + else: # regression + scoring = "r2" + + # Compute permutation importance + perm_importance = permutation_importance( + model, X_test, y_test, + n_repeats=10, # Number of times to permute each feature + random_state=42, + scoring=scoring, + n_jobs=-1 # Use all available cores + ) + feature_importances = perm_importance.importances_mean + importance_method = f"permutation (based on {scoring}, 
averaged across seeds)" + elif feature_importances is not None and importance_method: + # Update method to indicate averaging across seeds + if len(all_importances) == 0: + importance_method = importance_method + f", averaged across {len(model_results)} seeds" + + if feature_importances is not None: + all_importances.append(feature_importances) + + # Average importances across all models + if all_importances: + # Average across all seeds + avg_feature_importances = np.mean(all_importances, axis=0) + std_feature_importances = np.std(all_importances, axis=0) + + print(f"\n{model_name.upper()}:") + print(f"{'-'*80}") + print(f"Method: {importance_method}") + print(f"Averaged across {len(all_importances)} model(s)") + + # Sort features by importance + indices = np.argsort(avg_feature_importances)[::-1] + + # Helper to get feature label + def get_feature_label(idx): + if feature_names and idx < len(feature_names): + return feature_names[idx] + return f"Feature {idx}" + + # Print top 20 features with mean ± std + print(f"\nTop 20 most important features (mean ± std):") + for i, idx in enumerate(indices[:20], 1): + importance = avg_feature_importances[idx] + std = std_feature_importances[idx] + feat_label = get_feature_label(idx) + print(f" {i:2d}. {feat_label:40s}: {importance:.6f} ± {std:.6f}") + + # Create feature importance plot + plot_feature_importance( + avg_feature_importances, + std_feature_importances, + feature_names, + model_name, + output_dir, + importance_method, + top_n=20 + ) + else: + print(f"\n{model_name.upper()}: Feature importance not available (no models or no test data provided)") + except Exception as e: + print(f"\nWarning: Could not extract feature importance for {model_name}: {e}") + + print(f"\n{'='*80}\n") + + +def extract_cv_feature_importance( + results: Dict, + output_dir: str, + problem_type: str, + feature_names: Optional[List[str]] = None +): + """Extract and report feature importance for cross-validation results. 
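The permutation-importance fallback used in `extract_feature_importance` can be illustrated with plain scikit-learn; this is a hedged sketch on synthetic data, not the toolkit's own code path.

```python
# Hedged sketch of the fallback above: when a model exposes neither
# feature_importances_ nor coef_, permutation importance measures the score
# drop when each feature is shuffled.  Synthetic data, not toolkit code.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X, y)

perm = permutation_importance(model, X, y, n_repeats=10, scoring="r2",
                              random_state=42, n_jobs=-1)
ranking = np.argsort(perm.importances_mean)[::-1]
print("Most important feature index:", ranking[0])
```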
+ + Parameters + ---------- + results : Dict + Results dictionary {model_name: [run_results_with_fold_info]} + output_dir : str + Directory to save outputs + problem_type : str + "classification" or "regression" + feature_names : Optional[List[str]] + Names of features + """ + import joblib + from sklearn.inspection import permutation_importance + from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor + + print(f"{'='*80}") + print("FEATURE IMPORTANCE ANALYSIS (per fold and averaged)") + print(f"{'='*80}") + + for model_name, model_results in results.items(): + try: + # Group results by fold + folds_for_fi = {} + for r in model_results: + fold_id = r.get("fold") + if fold_id not in folds_for_fi: + folds_for_fi[fold_id] = [] + folds_for_fi[fold_id].append(r) + + print(f"\n{model_name.upper()}: Collecting feature importances from all {len(folds_for_fi)} folds...") + + # Collect feature importance for each fold + fold_importances_dict = {} + importance_method = None + + for fold_id in sorted(folds_for_fi.keys()): + fold_results = folds_for_fi[fold_id] + fold_importances_list = [] + + for result in fold_results: + model_path = result.get("model_path") + + if not model_path or not os.path.exists(model_path): + continue + + model = joblib.load(model_path) + y_test_fold = result.get("y_test") + X_test_fold = result.get("X_test") + + # Extract feature importances based on model type + feature_importances = None + + # For multi-output models + if isinstance(model, (MultiOutputClassifier, MultiOutputRegressor)): + if hasattr(model.estimators_[0], 'feature_importances_'): + importances_list = [est.feature_importances_ for est in model.estimators_ if hasattr(est, 'feature_importances_')] + if importances_list: + feature_importances = np.mean(importances_list, axis=0) + importance_method = "built-in (averaged across outputs, seeds, and folds)" + elif hasattr(model, 'feature_importances_'): + feature_importances = model.feature_importances_ + importance_method = "built-in (Gini/gain, averaged across seeds and folds)" + elif hasattr(model, 'coef_'): + coef = model.coef_ + if len(coef.shape) > 1: + feature_importances = np.mean(np.abs(coef), axis=0) + else: + feature_importances = np.abs(coef) + importance_method = "coefficient magnitude (averaged across seeds and folds)" + + # If no built-in importance, use permutation importance (e.g., for MLPs) + if feature_importances is None and X_test_fold is not None and y_test_fold is not None: + # Only print once per fold + if len(fold_importances_list) == 0: + print(f" Fold {fold_id}: Computing permutation importance (this may take a while)...") + + # Determine scoring metric based on problem type + if problem_type == "classification": + scoring = "accuracy" + else: # regression + scoring = "r2" + + # Compute permutation importance + perm_importance = permutation_importance( + model, X_test_fold, y_test_fold, + n_repeats=10, # Number of times to permute each feature + random_state=42, + scoring=scoring, + n_jobs=-1 # Use all available cores + ) + feature_importances = perm_importance.importances_mean + importance_method = f"permutation (based on {scoring}, averaged across seeds and folds)" + + if feature_importances is not None: + fold_importances_list.append(feature_importances) + + # Average across seeds for this fold + if fold_importances_list: + fold_avg_importance = np.mean(fold_importances_list, axis=0) + fold_importances_dict[fold_id] = fold_avg_importance + + # Average importances across all folds + if fold_importances_dict: + 
all_fold_importances = list(fold_importances_dict.values()) + avg_feature_importances = np.mean(all_fold_importances, axis=0) + std_feature_importances = np.std(all_fold_importances, axis=0) + + print(f"\n{model_name.upper()}:") + print(f"{'-'*80}") + print(f"Method: {importance_method}") + print(f"Computed from all {len(fold_importances_dict)} folds") + + # Sort features by importance + indices = np.argsort(avg_feature_importances)[::-1] + + # Helper to get feature label + def get_feature_label(idx): + if feature_names and idx < len(feature_names): + return feature_names[idx] + return f"Feature {idx}" + + # Print top 20 features + print(f"\nTop 20 most important features (mean ± std):") + for i, idx in enumerate(indices[:20], 1): + importance = avg_feature_importances[idx] + std = std_feature_importances[idx] + feat_label = get_feature_label(idx) + print(f" {i:2d}. {feat_label:40s}: {importance:.6f} ± {std:.6f}") + + # Save feature importance to CSV (all features, sorted by importance) + feature_importance_data = [] + for rank, idx in enumerate(indices, 1): + row = { + "rank": rank, + "feature": get_feature_label(idx), + "importance_mean_all_folds": avg_feature_importances[idx], + "importance_std_across_folds": std_feature_importances[idx], + "feature_index": idx + } + # Add per-fold importance + for fold_id in sorted(fold_importances_dict.keys()): + row[f"importance_fold_{fold_id}"] = fold_importances_dict[fold_id][idx] + feature_importance_data.append(row) + + fi_df = pd.DataFrame(feature_importance_data) + model_dir = os.path.join(output_dir, model_name) + os.makedirs(model_dir, exist_ok=True) + fi_csv_path = os.path.join(model_dir, f"feature_importance_{model_name}_cv_all_folds.csv") + fi_df.to_csv(fi_csv_path, index=False) + print(f"\n Feature importance CSV (all folds) saved to: {fi_csv_path}") + + # Create feature importance plot + plot_feature_importance( + avg_feature_importances, + std_feature_importances, + feature_names, + f"{model_name}_cv", + output_dir, + importance_method, + top_n=20 + ) + else: + print(f"\n{model_name.upper()}: Feature importance not available") + except Exception as e: + print(f"\nWarning: Could not extract feature importance for {model_name}: {e}") + + print(f"\n{'='*80}\n") diff --git a/ecosci/evaluation/metrics.py b/ecosci/evaluation/metrics.py new file mode 100644 index 0000000..9b6b49b --- /dev/null +++ b/ecosci/evaluation/metrics.py @@ -0,0 +1,239 @@ +"""Metrics computation for classification and regression tasks. + +This module contains functions for computing various performance metrics +for both single-output and multi-output problems. +""" + +from typing import Dict, Any +import numpy as np + + +def safe_std(vals): + """Calculate std, returning 0.0 instead of nan when there's only 1 value.""" + if len(vals) <= 1: + return 0.0 + return vals.std() + + +def compute_classification_metrics(y_true, y_pred, y_proba=None) -> Dict[str, Any]: + """Compute classification metrics. + + Parameters + ---------- + y_true : array-like + Ground truth labels. + y_pred : array-like + Predicted labels. + y_proba : array-like, optional + Predicted probabilities (if available). + + Returns + ------- + Dict[str, Any] + Dictionary containing accuracy, balanced_accuracy, precision, recall, + f1, cohen_kappa, confusion_matrix, and optionally roc_auc and average_precision. 
+ """ + from sklearn.metrics import ( + accuracy_score, + precision_score, + recall_score, + f1_score, + roc_auc_score, + confusion_matrix, + balanced_accuracy_score, + average_precision_score, + cohen_kappa_score, + ) + + out = {} + out["accuracy"] = accuracy_score(y_true, y_pred) + out["balanced_accuracy"] = balanced_accuracy_score(y_true, y_pred) + out["precision"] = precision_score( + y_true, + y_pred, + average="binary" if len(np.unique(y_true)) == 2 else "macro", + zero_division=0, + ) + out["recall"] = recall_score( + y_true, + y_pred, + average="binary" if len(np.unique(y_true)) == 2 else "macro", + zero_division=0, + ) + out["f1"] = f1_score( + y_true, + y_pred, + average="binary" if len(np.unique(y_true)) == 2 else "macro", + zero_division=0, + ) + out["cohen_kappa"] = cohen_kappa_score(y_true, y_pred) + out["confusion_matrix"] = confusion_matrix(y_true, y_pred).tolist() + + if y_proba is not None: + try: + # Binary classification: use positive class probabilities + if y_proba.ndim == 2 and y_proba.shape[1] == 2: + out["roc_auc"] = roc_auc_score(y_true, y_proba[:, 1]) + out["average_precision"] = average_precision_score( + y_true, y_proba[:, 1] + ) + # Multiclass: use OVR strategy + elif y_proba.ndim == 2 and y_proba.shape[1] > 2: + out["roc_auc"] = roc_auc_score(y_true, y_proba, multi_class="ovr") + out["average_precision"] = average_precision_score( + y_true, y_proba, average="macro" + ) + # 1D probabilities (rare, but handle it) + else: + out["roc_auc"] = roc_auc_score(y_true, y_proba) + out["average_precision"] = average_precision_score(y_true, y_proba) + except Exception as e: + print(f"Warning: Could not compute ROC-AUC/AP: {e}") + out["roc_auc"] = None + out["average_precision"] = None + else: + out["roc_auc"] = None + out["average_precision"] = None + + return out + + +def compute_multi_output_classification_metrics(y_true, y_pred, y_proba=None) -> Dict[str, Any]: + """Compute classification metrics for multi-output predictions. + + For multi-output, we compute metrics per output and aggregate. + + Parameters + ---------- + y_true : array-like, shape (n_samples, n_outputs) + Ground truth labels + y_pred : array-like, shape (n_samples, n_outputs) + Predicted labels + y_proba : array-like, shape (n_samples, n_outputs, n_classes) or None + Predicted probabilities (if available) + + Returns + ------- + Dict[str, Any] + Dictionary with average metrics across outputs and per-output metrics + """ + from sklearn.metrics import ( + accuracy_score, + precision_score, + recall_score, + f1_score, + balanced_accuracy_score, + cohen_kappa_score, + ) + + n_outputs = y_true.shape[1] + out = {} + + # Compute per-output metrics + per_output_metrics = [] + for i in range(n_outputs): + y_true_i = y_true[:, i] + y_pred_i = y_pred[:, i] + y_proba_i = y_proba[i] if y_proba is not None and isinstance(y_proba, list) else None + + metrics_i = compute_classification_metrics(y_true_i, y_pred_i, y_proba_i) + per_output_metrics.append(metrics_i) + + # Aggregate metrics (average across outputs) + for key in ['accuracy', 'balanced_accuracy', 'precision', 'recall', 'f1', 'cohen_kappa']: + values = [m[key] for m in per_output_metrics if key in m and m[key] is not None] + if values: + out[f"{key}_mean"] = np.mean(values) + out[f"{key}_std"] = np.std(values) + + # Store per-output metrics + out["per_output"] = per_output_metrics + out["n_outputs"] = n_outputs + + return out + + +def compute_regression_metrics(y_true, y_pred) -> Dict[str, Any]: + """Compute regression metrics. 
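A hedged usage sketch of `compute_classification_metrics` as defined above, with invented binary predictions; it assumes the `ecosci` package is importable.

```python
# Hedged usage sketch (invented values); for binary labels the function uses
# binary-averaged precision/recall/F1 and the positive-class probabilities.
import numpy as np
from ecosci.evaluation.metrics import compute_classification_metrics

y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])
y_proba = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4],
                    [0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])

metrics = compute_classification_metrics(y_true, y_pred, y_proba)
print(metrics["accuracy"], metrics["balanced_accuracy"], metrics["roc_auc"])
```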
+ + Parameters + ---------- + y_true : array-like + Ground truth target values. + y_pred : array-like + Predicted target values. + + Returns + ------- + Dict[str, Any] + Dictionary containing: + - mse: Mean Squared Error + - rmse: Root Mean Squared Error + - mae: Mean Absolute Error + - r2: R-squared (coefficient of determination) + - mape: Mean Absolute Percentage Error (if no zeros in y_true) + """ + from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score + + out = {} + out["mse"] = mean_squared_error(y_true, y_pred) + out["rmse"] = np.sqrt(out["mse"]) + out["mae"] = mean_absolute_error(y_true, y_pred) + out["r2"] = r2_score(y_true, y_pred) + + # MAPE - only compute if no zeros in y_true + try: + if not np.any(y_true == 0): + mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100 + out["mape"] = mape + else: + out["mape"] = None + except Exception: + out["mape"] = None + + return out + + +def compute_multi_output_regression_metrics(y_true, y_pred) -> Dict[str, Any]: + """Compute regression metrics for multi-output predictions. + + For multi-output regression, we compute metrics per output and aggregate. + + Parameters + ---------- + y_true : array-like, shape (n_samples, n_outputs) + Ground truth values + y_pred : array-like, shape (n_samples, n_outputs) + Predicted values + + Returns + ------- + Dict[str, Any] + Dictionary with average metrics across outputs and per-output metrics + """ + from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score + + n_outputs = y_true.shape[1] + out = {} + + # Compute per-output metrics + per_output_metrics = [] + for i in range(n_outputs): + y_true_i = y_true[:, i] + y_pred_i = y_pred[:, i] + + metrics_i = compute_regression_metrics(y_true_i, y_pred_i) + per_output_metrics.append(metrics_i) + + # Aggregate metrics (average across outputs) + for key in ['mse', 'rmse', 'mae', 'r2']: + values = [m[key] for m in per_output_metrics if key in m and m[key] is not None] + if values: + out[f"{key}_mean"] = np.mean(values) + out[f"{key}_std"] = np.std(values) + + # Store per-output metrics + out["per_output"] = per_output_metrics + out["n_outputs"] = n_outputs + + return out diff --git a/ecosci/evaluation/plotting.py b/ecosci/evaluation/plotting.py new file mode 100644 index 0000000..f6c8ef2 --- /dev/null +++ b/ecosci/evaluation/plotting.py @@ -0,0 +1,427 @@ +"""Plotting and visualisation utilities for evaluation results. + +This module contains functions for creating various plots and visualisations +for model evaluation, including confusion matrices, PR curves, residual plots, +and feature importance plots. +""" + +from typing import Dict, Optional, List +import os +import numpy as np +import matplotlib.pyplot as plt +import seaborn as sns +import pandas as pd + + +def plot_metric_comparisons( + combined_df: pd.DataFrame, + plot_metrics: List[str], + output_dir: str, + all_dfs: Optional[Dict] = None +): + """Create boxplot comparisons for metrics across models or seeds. 
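Similarly, a small sketch (made-up arrays) of the aggregation performed by `compute_multi_output_regression_metrics`: metrics are computed per output, then `*_mean`/`*_std` aggregates and a `per_output` breakdown are returned.

```python
# Hedged sketch of multi-output regression metrics (invented values).
import numpy as np
from ecosci.evaluation.metrics import compute_multi_output_regression_metrics

y_true = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 15.0]])
y_pred = np.array([[1.1, 11.0], [1.9, 13.0], [3.2, 14.0]])

m = compute_multi_output_regression_metrics(y_true, y_pred)
print(m["r2_mean"], m["rmse_mean"], m["n_outputs"])  # aggregated over 2 outputs
print(m["per_output"][0]["mae"])                     # first output only
```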
+ + Parameters + ---------- + combined_df : pd.DataFrame + Combined DataFrame with all results + plot_metrics : List[str] + List of metric names to plot + output_dir : str + Directory to save plots + all_dfs : Optional[Dict] + Dictionary of DataFrames per model (for multi-model comparison) + """ + if all_dfs and len(all_dfs) > 1: + # Multi-model comparison plots + for m in plot_metrics: + if m in combined_df.columns: + plt.figure(figsize=(max(6, len(all_dfs) * 1.5), 4)) + sns.boxplot(data=combined_df, x="model", y=m) + plt.title(f"{m.replace('_', ' ').title()} - Model Comparison") + plt.xlabel("Model") + plt.ylabel(m.replace("_", " ").title()) + plt.xticks(rotation=45, ha="right") + plt.tight_layout() + plt.savefig(os.path.join(output_dir, f"comparison_{m}.png")) + plt.close() + else: + # Single model plots (backward compatibility) + for m in plot_metrics: + if m in combined_df.columns: + plt.figure(figsize=(4, 3)) + sns.boxplot(data=combined_df, y=m) + plt.title(m.replace("_", " ").title()) + plt.tight_layout() + plt.savefig(os.path.join(output_dir, f"metric_{m}.png")) + plt.close() + + +def plot_confusion_matrices( + results: Dict, + y_test, + output_dir: str +): + """Create confusion matrix heatmaps for classification models. + + Parameters + ---------- + results : Dict + Results dictionary {model_name: [run_results]} + y_test : array-like + Test labels + output_dir : str + Directory to save plots + """ + from sklearn.metrics import confusion_matrix + + for model_name, model_results in results.items(): + try: + last = model_results[-1] + cm = confusion_matrix(y_test, last.get("y_pred")) + plt.figure(figsize=(5, 4)) + sns.heatmap(cm, annot=True, fmt="d", cmap="Blues") + plt.title(f"Confusion Matrix - {model_name.upper()} (last run)") + plt.xlabel("Predicted") + plt.ylabel("True") + plt.tight_layout() + model_dir = os.path.join(output_dir, model_name) + os.makedirs(model_dir, exist_ok=True) + plt.savefig( + os.path.join(model_dir, f"confusion_matrix_{model_name}.png") + ) + plt.close() + except Exception as e: + print( + f"Warning: Could not create confusion matrix for {model_name}: {e}" + ) + + +def plot_pr_curves( + results: Dict, + y_test, + output_dir: str +): + """Create precision-recall curves for classification models. 
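For reference, a hedged sketch of the `{model_name: [run results]}` structure these plotting helpers consume; the predictions and the output directory are illustrative only. The PR-curve and residual helpers below read the same structure, additionally using the `y_proba` entry when present.

```python
# Hedged sketch of the expected input (made-up predictions, arbitrary output
# directory): one list of per-repetition dicts per model; the last entry is
# the one that gets plotted.
import numpy as np
from ecosci.evaluation.plotting import plot_confusion_matrices

y_test = np.array([0, 1, 1, 0])
results = {
    "random_forest": [
        {"seed": 0, "y_pred": np.array([0, 1, 0, 0]), "y_proba": None},
        {"seed": 1, "y_pred": np.array([0, 1, 1, 0]), "y_proba": None},
    ],
}
plot_confusion_matrices(results, y_test, output_dir="outputs")
```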
+ + Parameters + ---------- + results : Dict + Results dictionary {model_name: [run_results]} + y_test : array-like + Test labels + output_dir : str + Directory to save plots + """ + from sklearn.metrics import ( + precision_recall_curve, + average_precision_score, + ) + + # Individual PR curves for each model + for model_name, model_results in results.items(): + try: + last = model_results[-1] + y_proba = last.get("y_proba") + if y_proba is not None: + # if proba has shape (n,2), take positive class + if getattr(y_proba, "ndim", 1) == 2 and y_proba.shape[1] >= 2: + pos_scores = y_proba[:, 1] + else: + pos_scores = y_proba + precision, recall, _ = precision_recall_curve(y_test, pos_scores) + ap = average_precision_score(y_test, pos_scores) + plt.figure(figsize=(5, 4)) + plt.plot(recall, precision, label=f"AP={ap:.3f}") + plt.xlabel("Recall") + plt.ylabel("Precision") + plt.title(f"Precision-Recall - {model_name.upper()} (last run)") + plt.legend() + plt.tight_layout() + model_dir = os.path.join(output_dir, model_name) + os.makedirs(model_dir, exist_ok=True) + plt.savefig(os.path.join(model_dir, f"pr_curve_{model_name}.png")) + plt.close() + except Exception as e: + pass # Silently skip if PR curve can't be generated + + # Combined PR curve for model comparison (binary classification only) + if len(results) > 1: + try: + plt.figure(figsize=(7, 5)) + for model_name, model_results in results.items(): + last = model_results[-1] + y_proba = last.get("y_proba") + if y_proba is not None: + if getattr(y_proba, "ndim", 1) == 2 and y_proba.shape[1] >= 2: + pos_scores = y_proba[:, 1] + else: + pos_scores = y_proba + precision, recall, _ = precision_recall_curve( + y_test, pos_scores + ) + ap = average_precision_score(y_test, pos_scores) + plt.plot( + recall, + precision, + label=f"{model_name} (AP={ap:.3f})", + linewidth=2, + ) + + plt.xlabel("Recall") + plt.ylabel("Precision") + plt.title("Precision-Recall Comparison (last run)") + plt.legend() + plt.grid(True, alpha=0.3) + plt.tight_layout() + plt.savefig(os.path.join(output_dir, "pr_curve_comparison.png")) + plt.close() + except Exception as e: + pass # Silently skip if comparison can't be generated + + +def plot_residuals( + results: Dict, + y_test, + output_dir: str +): + """Create residual plots for regression models. 
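A tiny sketch of the probability handling above: for binary problems `predict_proba` returns shape `(n, 2)` and only the positive-class column feeds the precision-recall curve.

```python
# Invented probabilities; mirrors the positive-class extraction used above.
import numpy as np

y_proba = np.array([[0.8, 0.2], [0.3, 0.7], [0.55, 0.45]])
if getattr(y_proba, "ndim", 1) == 2 and y_proba.shape[1] >= 2:
    pos_scores = y_proba[:, 1]   # positive-class column for binary problems
else:
    pos_scores = y_proba
print(pos_scores)  # approximately [0.2 0.7 0.45]
```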
+ + Parameters + ---------- + results : Dict + Results dictionary {model_name: [run_results]} + y_test : array-like + Test values + output_dir : str + Directory to save plots + """ + for model_name, model_results in results.items(): + try: + last = model_results[-1] + y_pred = last.get("y_pred") + residuals = y_test - y_pred + + _, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4)) + + # Predicted vs Actual + ax1.scatter(y_test, y_pred, alpha=0.6) + min_val = min(y_test.min(), y_pred.min()) + max_val = max(y_test.max(), y_pred.max()) + ax1.plot([min_val, max_val], [min_val, max_val], "r--", lw=2) + ax1.set_xlabel("Actual") + ax1.set_ylabel("Predicted") + ax1.set_title(f"Predicted vs Actual - {model_name.upper()}") + ax1.grid(True, alpha=0.3) + + # Residual plot + ax2.scatter(y_pred, residuals, alpha=0.6) + ax2.axhline(y=0, color="r", linestyle="--", lw=2) + ax2.set_xlabel("Predicted") + ax2.set_ylabel("Residuals") + ax2.set_title(f"Residual Plot - {model_name.upper()}") + ax2.grid(True, alpha=0.3) + + plt.tight_layout() + model_dir = os.path.join(output_dir, model_name) + os.makedirs(model_dir, exist_ok=True) + plt.savefig(os.path.join(model_dir, f"residual_plot_{model_name}.png")) + plt.close() + except Exception as e: + print(f"Warning: Could not create residual plot for {model_name}: {e}") + + +def plot_feature_importance( + feature_importances: np.ndarray, + feature_std: np.ndarray, + feature_names: Optional[List[str]], + model_name: str, + output_dir: str, + importance_method: str, + top_n: int = 20 +): + """Create feature importance bar plot. + + Parameters + ---------- + feature_importances : np.ndarray + Array of feature importance values + feature_std : np.ndarray + Array of standard deviations for feature importances + feature_names : Optional[List[str]] + Names of features + model_name : str + Name of the model + output_dir : str + Directory to save plot + importance_method : str + Description of the method used to compute importance + top_n : int + Number of top features to display + """ + try: + # Sort features by importance + indices = np.argsort(feature_importances)[::-1] + + # Helper to get feature label + def get_feature_label(idx): + if feature_names and idx < len(feature_names): + return feature_names[idx] + return f"Feature {idx}" + + fig, ax = plt.subplots(figsize=(10, 8)) + + # Plot top N features + top_n = min(top_n, len(feature_importances)) + top_indices = indices[:top_n] + top_importances = feature_importances[top_indices] + top_stds = feature_std[top_indices] + + y_pos = np.arange(top_n) + ax.barh(y_pos, top_importances, xerr=top_stds, align='center', alpha=0.7, capsize=3) + ax.set_yticks(y_pos) + ax.set_yticklabels([get_feature_label(idx) for idx in top_indices]) + ax.invert_yaxis() + ax.set_xlabel('Importance') + ax.set_title(f'Top {top_n} Feature Importances - {model_name.upper()}\n({importance_method})') + ax.grid(True, alpha=0.3, axis='x') + + plt.tight_layout() + model_dir = os.path.join(output_dir, model_name) + os.makedirs(model_dir, exist_ok=True) + plt.savefig(os.path.join(model_dir, f"feature_importance_{model_name}.png"), dpi=100, bbox_inches='tight') + plt.close() + + print(f"\n Feature importance plot saved to: {model_dir}/feature_importance_{model_name}.png") + except Exception as e: + print(f" Warning: Could not create feature importance plot: {e}") + + +def plot_cv_comparison( + all_dfs: Dict[str, pd.DataFrame], + output_dir: str, + problem_type: str, + is_multi_output: bool +): + """Create comparison plots across folds and models for 
cross-validation. + + Parameters + ---------- + all_dfs : Dict[str, pd.DataFrame] + Dictionary of DataFrames per model + output_dir : str + Directory to save plots + problem_type : str + "classification" or "regression" + is_multi_output : bool + Whether this is multi-output problem + """ + if len(all_dfs) <= 1: + return + + print(f"\nGenerating comparison plots...") + + # Determine which metrics to plot + if problem_type == "classification": + if is_multi_output: + metrics_to_plot = {"accuracy_mean": "Accuracy", "balanced_accuracy_mean": "Balanced Accuracy", + "f1_mean": "F1 Score", "cohen_kappa_mean": "Cohen's Kappa"} + else: + metrics_to_plot = {"accuracy": "Accuracy", "balanced_accuracy": "Balanced Accuracy", + "f1": "F1 Score", "cohen_kappa": "Cohen's Kappa"} + else: + if is_multi_output: + metrics_to_plot = {"r2_mean": "R²", "rmse_mean": "RMSE", "mae_mean": "MAE", "mse_mean": "MSE"} + else: + metrics_to_plot = {"r2": "R²", "rmse": "RMSE", "mae": "MAE", "mse": "MSE"} + + for metric_key, metric_label in metrics_to_plot.items(): + try: + fig, ax = plt.subplots(figsize=(10, 6)) + + data_for_plot = [] + labels_for_plot = [] + + for model_name, df in all_dfs.items(): + if metric_key in df.columns: + data_for_plot.append(df[metric_key].values) + labels_for_plot.append(model_name.upper()) + + if data_for_plot: + bp = ax.boxplot(data_for_plot, labels=labels_for_plot, patch_artist=True) + for patch in bp["boxes"]: + patch.set_facecolor("lightblue") + + ax.set_ylabel(metric_label) + ax.set_title(f"{metric_label} Comparison - K-Fold CV") + ax.grid(True, alpha=0.3, axis='y') + plt.xticks(rotation=45, ha='right') + plt.tight_layout() + plt.savefig(os.path.join(output_dir, f"cv_comparison_{metric_key}.png")) + plt.close() + except Exception as e: + print(f" Warning: Could not create comparison plot for {metric_key}: {e}") + + +def plot_validation_vs_test( + val_df: pd.DataFrame, + test_df: pd.DataFrame, + model_name: str, + output_dir: str, + problem_type: str +): + """Create validation vs test comparison plots for hyperparameter tuning. 
+ + Parameters + ---------- + val_df : pd.DataFrame + Validation metrics DataFrame + test_df : pd.DataFrame + Test metrics DataFrame + model_name : str + Name of the model + output_dir : str + Directory to save plots + problem_type : str + "classification" or "regression" + """ + try: + # Select key metrics based on problem type + if problem_type == "regression": + metrics_to_plot = ['r2', 'rmse', 'mae'] + else: + metrics_to_plot = ['accuracy', 'balanced_accuracy', 'f1'] + + fig, axes = plt.subplots(1, len(metrics_to_plot), figsize=(5*len(metrics_to_plot), 5)) + if len(metrics_to_plot) == 1: + axes = [axes] + + for idx, metric in enumerate(metrics_to_plot): + ax = axes[idx] + + # Prepare data for boxplot + data = [val_df[metric].values, test_df[metric].values] + positions = [1, 2] + labels = ['Validation', 'Test'] + + bp = ax.boxplot(data, positions=positions, labels=labels, + patch_artist=True, widths=0.6) + + # Colour boxes + colours = ['lightblue', 'lightcoral'] + for patch, colour in zip(bp['boxes'], colours): + patch.set_facecolor(colour) + + ax.set_ylabel(metric.upper()) + ax.set_title(f'{metric.upper()} - {model_name.upper()}') + ax.grid(True, alpha=0.3, axis='y') + + plt.tight_layout() + model_dir = os.path.join(output_dir, model_name) + os.makedirs(model_dir, exist_ok=True) + plot_path = os.path.join(model_dir, f"val_vs_test_{model_name}.png") + plt.savefig(plot_path, dpi=100, bbox_inches='tight') + plt.close() + print(f" Comparison plot saved to: {plot_path}") + except Exception as e: + print(f" Warning: Could not create comparison plot: {e}") diff --git a/ecosci/evaluation/reporting/__init__.py b/ecosci/evaluation/reporting/__init__.py new file mode 100644 index 0000000..f118cd0 --- /dev/null +++ b/ecosci/evaluation/reporting/__init__.py @@ -0,0 +1,13 @@ +"""Reporting submodule for evaluation.""" + +from .evaluation import ( + evaluate_and_report, + evaluate_and_report_cv, + evaluate_tuning_results, +) + +__all__ = [ + "evaluate_and_report", + "evaluate_and_report_cv", + "evaluate_tuning_results", +] diff --git a/ecosci/evaluation/reporting/display.py b/ecosci/evaluation/reporting/display.py new file mode 100644 index 0000000..a54340a --- /dev/null +++ b/ecosci/evaluation/reporting/display.py @@ -0,0 +1,372 @@ +"""Display and formatting utilities for evaluation reports.""" + +import numpy as np +import pandas as pd +from ..metrics import safe_std + + +def get_metric_columns(problem_type, is_multi_output, combined_df): + """Determine which metric columns to display based on problem type. 
+ + Parameters + ---------- + problem_type : str + "classification" or "regression" + is_multi_output : bool + Whether the problem has multiple outputs + combined_df : DataFrame + Combined results DataFrame + + Returns + ------- + tuple + (display_cols, comparison_metrics, plot_metrics) + """ + if is_multi_output: + if problem_type == "regression": + display_cols = ["seed", "mse_mean", "rmse_mean", "mae_mean", "r2_mean"] + comparison_metrics = ["mse_mean", "rmse_mean", "mae_mean", "r2_mean"] + plot_metrics = ["mse_mean", "rmse_mean", "mae_mean", "r2_mean"] + else: + display_cols = [ + "seed", "accuracy_mean", "balanced_accuracy_mean", + "precision_mean", "recall_mean", "f1_mean", "cohen_kappa_mean", + ] + comparison_metrics = [ + "accuracy_mean", "balanced_accuracy_mean", "f1_mean", "cohen_kappa_mean", + ] + plot_metrics = ["accuracy_mean", "balanced_accuracy_mean", "f1_mean", "cohen_kappa_mean"] + else: + if problem_type == "regression": + display_cols = ["seed", "mse", "rmse", "mae", "r2"] + if "mape" in combined_df.columns: + display_cols.append("mape") + comparison_metrics = ["mse", "rmse", "mae", "r2"] + plot_metrics = ["mse", "rmse", "mae", "r2"] + else: + display_cols = [ + "seed", "accuracy", "balanced_accuracy", + "precision", "recall", "f1", "cohen_kappa", + ] + if "roc_auc" in combined_df.columns: + display_cols.append("roc_auc") + if "average_precision" in combined_df.columns: + display_cols.append("average_precision") + comparison_metrics = [ + "accuracy", "balanced_accuracy", "f1", "cohen_kappa", "roc_auc", + ] + plot_metrics = ["accuracy", "balanced_accuracy", "f1", "cohen_kappa"] + + return display_cols, comparison_metrics, plot_metrics + + +def print_model_results(all_dfs, display_cols, is_multi_output, label_names): + """Print results for each model. + + Parameters + ---------- + all_dfs : dict + Dictionary mapping model names to DataFrames with results + display_cols : list + Columns to display + is_multi_output : bool + Whether the problem has multiple outputs + label_names : list or None + Names of output labels + """ + for model_name, df in all_dfs.items(): + print(f"\n{'='*80}") + print(f"RESULTS FOR MODEL: {model_name.upper()}") + print(f"{'='*80}") + print( + df[display_cols].to_string( + index=False, + float_format=lambda x: f"{x:.4f}" if not pd.isna(x) else "N/A", + ) + ) + + print(f"\n{'-'*80}") + print(f"SUMMARY STATISTICS FOR {model_name.upper()} (mean ± std)") + print(f"{'-'*80}") + metric_cols = [c for c in display_cols if c != "seed"] + for col in metric_cols: + if col in df.columns: + vals = df[col].dropna() + if len(vals) > 0: + print(f"{col:20s}: {vals.mean():.4f} ± {safe_std(vals):.4f}") + print(f"{'-'*80}\n") + + # For multi-output, print per-output breakdown + if is_multi_output and len(df) > 0 and "per_output" in df.iloc[0]: + print_per_output_breakdown(df, label_names, model_name) + + +def print_per_output_breakdown(df, label_names, model_name): + """Print per-output breakdown for multi-output problems. 
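A short sketch of how `get_metric_columns` selects display columns; the one-row DataFrame here is only a stand-in for the per-seed metrics table assembled later by `evaluate_and_report`.

```python
# Hedged sketch of the column selection above (stand-in DataFrame).
import pandas as pd
from ecosci.evaluation.reporting.display import get_metric_columns

combined_df = pd.DataFrame([{"seed": 0, "mse": 1.2, "rmse": 1.1,
                             "mae": 0.9, "r2": 0.8, "mape": 12.0}])
display_cols, comparison_metrics, plot_metrics = get_metric_columns(
    "regression", False, combined_df
)
print(display_cols)  # ['seed', 'mse', 'rmse', 'mae', 'r2', 'mape']
```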
+ + Parameters + ---------- + df : DataFrame + Results DataFrame + label_names : list or None + Names of output labels + model_name : str + Name of the model + """ + print(f"\n{'-'*80}") + print(f"PER-OUTPUT BREAKDOWN FOR {model_name.upper()}") + print(f"{'-'*80}") + + first_row = df.iloc[0] + if "per_output" in first_row and first_row["per_output"] is not None: + per_output_list = first_row["per_output"] + n_outputs = len(per_output_list) + + # Collect metrics across all seeds for each output + for output_idx in range(n_outputs): + # Use label name if provided, otherwise use generic Output N + if label_names and output_idx < len(label_names): + output_label = f"{label_names[output_idx]}" + else: + output_label = f"Output {output_idx + 1}" + + print(f"\n{output_label}:") + + # Aggregate metrics across seeds for this output + output_metrics = {} + for _, row in df.iterrows(): + if "per_output" in row and row["per_output"] is not None: + per_out = row["per_output"] + if output_idx < len(per_out): + for key, val in per_out[output_idx].items(): + if key not in ["confusion_matrix", "seed", "model"] and val is not None: + if key not in output_metrics: + output_metrics[key] = [] + output_metrics[key].append(val) + + # Print summary for this output + for metric_name, values in sorted(output_metrics.items()): + if len(values) > 0: + mean_val = np.mean(values) + std_val = safe_std(np.array(values)) + print(f" {metric_name:20s}: {mean_val:.4f} ± {std_val:.4f}") + print(f"{'-'*80}\n") + + +def print_model_comparison(all_dfs, comparison_metrics): + """Print model comparison table. + + Parameters + ---------- + all_dfs : dict + Dictionary mapping model names to DataFrames + comparison_metrics : list + Metrics to compare + """ + print(f"\n{'='*80}") + print("MODEL COMPARISON (mean ± std)") + print(f"{'='*80}") + comparison_data = [] + for model_name, df in all_dfs.items(): + row = {"Model": model_name} + for col in comparison_metrics: + if col in df.columns: + vals = df[col].dropna() + if len(vals) > 0: + row[col] = f"{vals.mean():.4f} ± {safe_std(vals):.4f}" + else: + row[col] = "N/A" + comparison_data.append(row) + + comparison_df = pd.DataFrame(comparison_data) + print(comparison_df.to_string(index=False)) + print(f"{'='*80}\n") + + +def print_model_ranking(all_dfs, display_cols, problem_type, is_multi_output): + """Print model ranking based on primary metric. 
+ + Parameters + ---------- + all_dfs : dict + Dictionary mapping model names to DataFrames + display_cols : list + Columns to display + problem_type : str + "classification" or "regression" + is_multi_output : bool + Whether the problem has multiple outputs + """ + # Determine primary metric + if is_multi_output: + if problem_type == "regression": + primary_metric = "mse_mean" + lower_is_better = True + else: + primary_metric = "cohen_kappa_mean" + lower_is_better = False + else: + if problem_type == "regression": + primary_metric = "mse" + lower_is_better = True + else: + primary_metric = "cohen_kappa" + lower_is_better = False + + # Calculate mean primary metric for each model + model_scores = [] + for model_name, df in all_dfs.items(): + if primary_metric in df.columns: + vals = df[primary_metric].dropna() + if len(vals) > 0: + model_scores.append({ + "model": model_name, + "mean": vals.mean(), + "std": safe_std(vals), + }) + + if len(model_scores) > 0: + # Sort by primary metric + model_scores.sort(key=lambda x: x["mean"], reverse=not lower_is_better) + + print(f"\n{'='*80}") + print("MODEL RANKING") + print(f"{'='*80}") + print( + f"Ranked by: {primary_metric.replace('_', ' ').upper()} " + f"({'Lower is better' if lower_is_better else 'Higher is better'})" + ) + print(f"{'-'*80}") + + # Create ranking table with all metrics + ranking_data = [] + for rank, score in enumerate(model_scores, 1): + model_name = score["model"] + model_df = all_dfs[model_name] + + row = { + "Rank": f"{rank}.", + "Model": model_name.upper(), + primary_metric: f"{score['mean']:.4f} ± {score['std']:.4f}", + } + + # Add other metrics + metric_cols = [c for c in display_cols if c != "seed" and c != primary_metric] + for col in metric_cols: + if col in model_df.columns: + vals = model_df[col].dropna() + if len(vals) > 0: + row[col] = f"{vals.mean():.4f} ± {safe_std(vals):.4f}" + + ranking_data.append(row) + + # Print ranking table + ranking_df = pd.DataFrame(ranking_data) + print(ranking_df.to_string(index=False)) + print(f"{'='*80}\n") + + +def get_cv_metric_keys(problem_type, is_multi_output): + """Get metric keys for CV results. + + Parameters + ---------- + problem_type : str + "classification" or "regression" + is_multi_output : bool + Whether the problem has multiple outputs + + Returns + ------- + list + List of metric keys + """ + if problem_type == "classification": + if is_multi_output: + return ["accuracy_mean", "balanced_accuracy_mean", "f1_mean"] + else: + return ["accuracy", "balanced_accuracy", "f1"] + else: + if is_multi_output: + return ["r2_mean", "rmse_mean", "mae_mean"] + else: + return ["r2", "rmse", "mae"] + + +def print_cv_fold_results(folds, fold_summaries, metric_keys): + """Print per-fold CV results. 
+ + Parameters + ---------- + folds : dict + Dictionary of fold results + fold_summaries : list + List of fold summaries + metric_keys : list + Metric keys to display + """ + print(f"\nPer-Fold Metrics (averaged across seeds per fold):") + + # Print table header + print(f"\n {'Fold':<6}", end="") + for mk in metric_keys: + print(f"{mk:<20}", end="") + print() + print(f" {'-'*6}", end="") + for _ in metric_keys: + print(f"{'-'*20}", end="") + print() + + # Print per-fold metrics + for fold_id in sorted(folds.keys()): + fold_results_list = fold_summaries[fold_id] + fold_df = pd.DataFrame(fold_results_list) + + print(f" {fold_id:<6}", end="") + for mk in metric_keys: + if mk in fold_df.columns: + mean_val = fold_df[mk].mean() + std_val = fold_df[mk].std() + print(f"{mean_val:.4f} ± {std_val:.4f} ", end="") + else: + print(f"{'N/A':<20}", end="") + print() + + +def print_tuning_summary(val_df, test_df, best_cv_scores, problem_type, n_seeds): + """Print summary of hyperparameter tuning results. + + Parameters + ---------- + val_df : DataFrame + Validation metrics + test_df : DataFrame + Test metrics + best_cv_scores : list + Best CV scores across seeds + problem_type : str + "classification" or "regression" + n_seeds : int + Number of seeds + """ + print(f"\nValidation Set Performance (across {n_seeds} seeds):") + if problem_type == "regression": + print(f" R² = {val_df['r2'].mean():.4f} ± {safe_std(val_df['r2']):.4f}") + print(f" RMSE = {val_df['rmse'].mean():.4f} ± {safe_std(val_df['rmse']):.4f}") + print(f" MAE = {val_df['mae'].mean():.4f} ± {safe_std(val_df['mae']):.4f}") + else: + print(f" Accuracy = {val_df['accuracy'].mean():.4f} ± {safe_std(val_df['accuracy']):.4f}") + print(f" Balanced Accuracy = {val_df['balanced_accuracy'].mean():.4f} ± {safe_std(val_df['balanced_accuracy']):.4f}") + print(f" F1 = {val_df['f1'].mean():.4f} ± {safe_std(val_df['f1']):.4f}") + + print(f"\nTest Set Performance (across {n_seeds} seeds):") + if problem_type == "regression": + print(f" R² = {test_df['r2'].mean():.4f} ± {safe_std(test_df['r2']):.4f}") + print(f" RMSE = {test_df['rmse'].mean():.4f} ± {safe_std(test_df['rmse']):.4f}") + print(f" MAE = {test_df['mae'].mean():.4f} ± {safe_std(test_df['mae']):.4f}") + else: + print(f" Accuracy = {test_df['accuracy'].mean():.4f} ± {safe_std(test_df['accuracy']):.4f}") + print(f" Balanced Accuracy = {test_df['balanced_accuracy'].mean():.4f} ± {safe_std(test_df['balanced_accuracy']):.4f}") + print(f" F1 = {test_df['f1'].mean():.4f} ± {safe_std(test_df['f1']):.4f}") + + print(f"\nBest CV Score (during tuning): {np.mean(best_cv_scores):.4f} ± {safe_std(np.array(best_cv_scores)):.4f}") diff --git a/ecosci/evaluation/reporting/evaluation.py b/ecosci/evaluation/reporting/evaluation.py new file mode 100644 index 0000000..5664c40 --- /dev/null +++ b/ecosci/evaluation/reporting/evaluation.py @@ -0,0 +1,385 @@ +"""Main evaluation orchestration functions. + +This module contains the high-level functions that orchestrate metrics computation, +feature importance analysis, and plot generation. 
+""" + +from typing import Dict, Optional +import os +import json +import numpy as np +import pandas as pd + +from ..metrics import ( + compute_classification_metrics, + compute_regression_metrics, + compute_multi_output_classification_metrics, + compute_multi_output_regression_metrics, +) +from ..plotting import ( + plot_metric_comparisons, + plot_confusion_matrices, + plot_pr_curves, + plot_residuals, + plot_cv_comparison, + plot_validation_vs_test, +) +from ..feature_importance import ( + extract_feature_importance, + extract_cv_feature_importance, +) +from .display import ( + get_metric_columns, + print_model_results, + print_model_comparison, + print_model_ranking, + get_cv_metric_keys, + print_cv_fold_results, + print_tuning_summary, +) +from .output import ( + save_cv_reports, + save_tuning_results, + save_tuning_summary_report, +) + + +def evaluate_and_report( + results, y_test, output_dir: str = "outputs", problem_type: str = "classification", + label_names: list = None, feature_names: list = None, X_test=None +): + """Compute metrics across runs and save a compact report and plots. + + Parameters + ---------- + results : dict or list + If dict: Output from Trainer.run() with multiple models {model_name: [run_results]} + If list: Single model results (backward compatibility) + y_test : array-like + Ground-truth labels (classification) or values (regression) for the test set. + output_dir : str + Where to save the report and plots. + problem_type : str + "classification" or "regression". Determines which metrics to compute. + label_names : list, optional + Names of the target variables (for multi-output). + feature_names : list, optional + Names of the input features. + X_test : array-like, optional + Test features for permutation importance. 
+ + Returns + ------- + list + Combined summary of results across all models + """ + os.makedirs(output_dir, exist_ok=True) + + # Handle backward compatibility: convert list to dict + if isinstance(results, list): + model_name = results[0].get("model_name", "model") if results else "model" + results = {model_name: results} + + # Determine if we have multi-output + is_multi_output = len(y_test.shape) > 1 and y_test.shape[1] > 1 + + # Compute metrics for each model + all_summaries = {} + all_dfs = {} + + for model_name, model_results in results.items(): + summary = [] + for r in model_results: + seed = r.get("seed") + y_pred = r.get("y_pred") + y_proba = r.get("y_proba") + + if is_multi_output: + if problem_type == "regression": + metrics = compute_multi_output_regression_metrics(y_test, y_pred) + else: + metrics = compute_multi_output_classification_metrics(y_test, y_pred, y_proba) + else: + if problem_type == "regression": + metrics = compute_regression_metrics(y_test, y_pred) + else: + metrics = compute_classification_metrics(y_test, y_pred, y_proba) + + metrics["seed"] = seed + metrics["model"] = model_name + summary.append(metrics) + + all_summaries[model_name] = summary + all_dfs[model_name] = pd.DataFrame(summary) + + # Save individual model reports + for model_name, summary in all_summaries.items(): + model_dir = os.path.join(output_dir, model_name) + os.makedirs(model_dir, exist_ok=True) + report_path = os.path.join(model_dir, f"report_{model_name}.json") + with open(report_path, "w") as f: + json.dump(summary, f, indent=2) + + # Combine all results + combined_df = pd.concat(all_dfs.values(), ignore_index=True) + combined_summary = [item for sublist in all_summaries.values() for item in sublist] + + # Save combined report + with open(os.path.join(output_dir, "report_all_models.json"), "w") as f: + json.dump(combined_summary, f, indent=2) + + # Print results + display_cols, comparison_metrics, plot_metrics = get_metric_columns( + problem_type, is_multi_output, combined_df + ) + + print_model_results(all_dfs, display_cols, is_multi_output, label_names) + + if len(all_dfs) > 1: + print_model_comparison(all_dfs, comparison_metrics) + print_model_ranking(all_dfs, display_cols, problem_type, is_multi_output) + + # Create plots + plot_metric_comparisons(combined_df, plot_metrics, output_dir, all_dfs) + + if problem_type == "classification": + plot_confusion_matrices(results, y_test, output_dir) + plot_pr_curves(results, y_test, output_dir) + elif problem_type == "regression" and not is_multi_output: + plot_residuals(results, y_test, output_dir) + + # Feature importance + extract_feature_importance(results, X_test, y_test, output_dir, problem_type, feature_names) + + return combined_summary + + +def evaluate_and_report_cv( + results, output_dir: str = "outputs", problem_type: str = "classification", + label_names: list = None, feature_names: list = None, X_test=None +): + """Compute metrics for k-fold cross-validation with per-fold breakdown. + + Parameters + ---------- + results : dict + Output from Trainer.run_cv() with format {model_name: [run_results_with_fold_info]} + output_dir : str + Where to save the report and plots. + problem_type : str + "classification" or "regression". + label_names : list, optional + Names of the target variables. + feature_names : list, optional + Names of the input features. + X_test : array-like, optional + Test features (note: in CV, each fold has different test set). 
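An end-to-end sketch of `evaluate_and_report` with a single model and invented regression predictions; the `output_dir` value is arbitrary and the results dict simply mirrors the structure documented above.

```python
# Hedged end-to-end sketch (single model, invented predictions, arbitrary
# output directory).
import numpy as np
from ecosci.evaluation import evaluate_and_report

y_test = np.array([3.1, 2.4, 5.0, 4.2])
results = {
    "linear_regression": [
        {"seed": 0, "y_pred": np.array([3.0, 2.5, 4.8, 4.4]), "y_proba": None},
    ],
}
summary = evaluate_and_report(
    results, y_test, output_dir="outputs/demo",
    problem_type="regression", feature_names=["ndvi", "temperature"],
)
print(summary[0]["rmse"])
```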
+ + Returns + ------- + dict + Summaries for all models + """ + os.makedirs(output_dir, exist_ok=True) + + print(f"\n{'='*80}") + print("K-FOLD CROSS-VALIDATION RESULTS") + print(f"{'='*80}\n") + + # Determine if multi-output + first_result = next(iter(results.values()))[0] + y_test_sample = first_result["y_test"] + is_multi_output = len(y_test_sample.shape) > 1 and y_test_sample.shape[1] > 1 + + all_summaries = {} + all_dfs = {} + + for model_name, model_results in results.items(): + # Group results by fold + folds = {} + for r in model_results: + fold_id = r.get("fold") + if fold_id not in folds: + folds[fold_id] = [] + folds[fold_id].append(r) + + n_folds = len(folds) + + print(f"\n{model_name.upper()} - {n_folds} Folds") + print(f"{'-'*80}") + + # Compute metrics for each fold + fold_summaries = [] + for fold_id in sorted(folds.keys()): + fold_results = folds[fold_id] + fold_metrics_list = [] + + for r in fold_results: + seed = r.get("seed") + y_pred = r.get("y_pred") + y_test = r.get("y_test") + y_proba = r.get("y_proba") + + if problem_type == "classification": + if is_multi_output: + metrics = compute_multi_output_classification_metrics(y_test, y_pred, y_proba) + else: + metrics = compute_classification_metrics(y_test, y_pred, y_proba) + else: + if is_multi_output: + metrics = compute_multi_output_regression_metrics(y_test, y_pred) + else: + metrics = compute_regression_metrics(y_test, y_pred) + + metrics["seed"] = seed + metrics["fold"] = fold_id + fold_metrics_list.append(metrics) + + fold_summaries.append(fold_metrics_list) + + # Print per-fold metrics + metric_keys = get_cv_metric_keys(problem_type, is_multi_output) + print_cv_fold_results(folds, fold_summaries, metric_keys) + + # Compute overall metrics + all_fold_metrics = [m for fold_list in fold_summaries for m in fold_list] + overall_df = pd.DataFrame(all_fold_metrics) + + # Print overall metrics + print(f"\n {'Overall':<6}", end="") + for mk in metric_keys: + if mk in overall_df.columns: + mean_val = overall_df[mk].mean() + std_val = overall_df[mk].std() + print(f"{mean_val:.4f} ± {std_val:.4f} ", end="") + else: + print(f"{'N/A':<20}", end="") + print("\n") + + # Save reports + save_cv_reports(model_name, all_fold_metrics, fold_summaries, folds, overall_df, output_dir) + + all_summaries[model_name] = overall_df + all_dfs[model_name] = overall_df + + # Create plots + plot_cv_comparison(all_dfs, output_dir, problem_type, is_multi_output) + + print(f"\n{'='*80}\n") + + # Feature importance + extract_cv_feature_importance(results, output_dir, problem_type, feature_names) + + return all_summaries + + +def evaluate_tuning_results( + results: Dict, + y_val: np.ndarray, + y_test: np.ndarray, + output_dir: str = "outputs", + problem_type: str = "classification", + label_names: Optional[list] = None, + feature_names: Optional[list] = None, + X_val: Optional[np.ndarray] = None, + X_test: Optional[np.ndarray] = None, +): + """Evaluate and report results from hyperparameter tuning. 
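A hedged sketch of the per-fold structure `evaluate_and_report_cv` consumes; fold ids, seeds and values are invented, and each entry carries its own test split because the test set differs per fold.

```python
# Hedged sketch (invented folds/seeds/values) of CV results input.
import numpy as np
from ecosci.evaluation import evaluate_and_report_cv

results = {
    "random_forest": [
        {"fold": 0, "seed": 0,
         "y_test": np.array([1.0, 2.0]), "y_pred": np.array([1.1, 1.8])},
        {"fold": 1, "seed": 0,
         "y_test": np.array([3.0, 4.0]), "y_pred": np.array([2.9, 4.3])},
    ],
}
summaries = evaluate_and_report_cv(results, output_dir="outputs/demo_cv",
                                   problem_type="regression")
print(summaries["random_forest"][["fold", "r2"]])
```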
+ + Parameters + ---------- + results : Dict + Results dictionary from trainer.run_with_tuning() + y_val : np.ndarray + Validation set labels + y_test : np.ndarray + Test set labels + output_dir : str + Directory to save outputs + problem_type : str + "classification" or "regression" + label_names : list, optional + Names of label columns + feature_names : list, optional + Names of features + X_val : np.ndarray, optional + Validation features + X_test : np.ndarray, optional + Test features + + Returns + ------- + dict + Summary statistics for each model + """ + os.makedirs(output_dir, exist_ok=True) + + print(f"\n{'='*80}") + print("HYPERPARAMETER TUNING EVALUATION") + print(f"{'='*80}\n") + + all_summaries = {} + + for model_name, model_results in results.items(): + print(f"\n{model_name.upper()}") + print("-" * 60) + + # Collect metrics for each seed + val_metrics_per_seed = [] + test_metrics_per_seed = [] + best_params_per_seed = [] + best_cv_scores = [] + + for run in model_results: + seed = run['seed'] + best_params = run['best_params'] + best_cv_score = run['best_cv_score'] + + best_params_per_seed.append(best_params) + best_cv_scores.append(best_cv_score) + + # Compute validation metrics + if problem_type == "regression": + val_metrics = compute_regression_metrics(y_val, run['y_val_pred']) + else: + val_metrics = compute_classification_metrics( + y_val, run['y_val_pred'], run.get('y_val_proba') + ) + val_metrics['seed'] = seed + val_metrics_per_seed.append(val_metrics) + + # Compute test metrics + if problem_type == "regression": + test_metrics = compute_regression_metrics(y_test, run['y_test_pred']) + else: + test_metrics = compute_classification_metrics( + y_test, run['y_test_pred'], run.get('y_test_proba') + ) + test_metrics['seed'] = seed + test_metrics_per_seed.append(test_metrics) + + # Create summary DataFrames + val_df = pd.DataFrame(val_metrics_per_seed) + test_df = pd.DataFrame(test_metrics_per_seed) + + # Print summary + print_tuning_summary(val_df, test_df, best_cv_scores, problem_type, len(model_results)) + + # Save results + save_tuning_results(model_name, val_df, test_df, best_params_per_seed, output_dir) + + # Create plots + plot_validation_vs_test(val_df, test_df, model_name, output_dir, problem_type) + + # Store summary + all_summaries[model_name] = { + 'validation': val_df, + 'test': test_df, + 'best_params': best_params_per_seed, + 'best_cv_scores': best_cv_scores, + } + + # Save summary report + save_tuning_summary_report(all_summaries, output_dir, problem_type) + + return all_summaries diff --git a/ecosci/evaluation/reporting/output.py b/ecosci/evaluation/reporting/output.py new file mode 100644 index 0000000..b80c936 --- /dev/null +++ b/ecosci/evaluation/reporting/output.py @@ -0,0 +1,151 @@ +"""Output and file saving utilities for evaluation reports.""" + +import os +import json +import pandas as pd +from ..metrics import safe_std + + +def save_cv_reports(model_name, all_fold_metrics, fold_summaries, folds, overall_df, output_dir): + """Save CV reports and metrics to files. 
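A hedged sketch of the tuning results consumed by `evaluate_tuning_results`; the keys mirror the ones read in the loop above (`seed`, `best_params`, `best_cv_score`, `y_val_pred`, `y_test_pred`), and all values are invented.

```python
# Hedged sketch (all values invented) of the per-seed run dicts.
import numpy as np
from ecosci.evaluation import evaluate_tuning_results

y_val = np.array([1.0, 2.0, 3.0])
y_test = np.array([1.5, 2.5, 3.5])
results = {
    "random_forest": [
        {"seed": 0,
         "best_params": {"n_estimators": 200, "max_depth": 20},
         "best_cv_score": -0.15,                  # e.g. neg_mean_squared_error
         "y_val_pred": np.array([1.1, 2.1, 2.8]),
         "y_test_pred": np.array([1.4, 2.7, 3.3])},
    ],
}
evaluate_tuning_results(results, y_val, y_test, output_dir="outputs/demo_tuning",
                        problem_type="regression")
```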
+ + Parameters + ---------- + model_name : str + Name of the model + all_fold_metrics : list + All metrics across folds + fold_summaries : list + Per-fold summaries + folds : dict + Dictionary of fold results + overall_df : DataFrame + Overall results DataFrame + output_dir : str + Output directory + """ + model_dir = os.path.join(output_dir, model_name) + os.makedirs(model_dir, exist_ok=True) + + # Save detailed report + report_path = os.path.join(model_dir, f"report_{model_name}_cv.json") + with open(report_path, "w") as f: + json.dump(all_fold_metrics, f, indent=2, default=str) + print(f" Detailed JSON report saved to: {report_path}") + + # Save detailed metrics as CSV (all seeds and folds) + csv_detailed_path = os.path.join(model_dir, f"metrics_{model_name}_cv_detailed.csv") + overall_df.to_csv(csv_detailed_path, index=False) + print(f" Detailed CSV metrics saved to: {csv_detailed_path}") + + # Save per-fold summary (averaged across seeds) + fold_summary_data = [] + for fold_id in sorted(folds.keys()): + fold_results_list = fold_summaries[fold_id] + fold_df = pd.DataFrame(fold_results_list) + + # Calculate mean and std for each metric + fold_summary = {"fold": fold_id} + for col in fold_df.columns: + if col not in ["seed", "fold"]: + fold_summary[f"{col}_mean"] = fold_df[col].mean() + fold_summary[f"{col}_std"] = fold_df[col].std() + fold_summary_data.append(fold_summary) + + # Add overall row + overall_summary = {"fold": "overall"} + for col in overall_df.columns: + if col not in ["seed", "fold"]: + overall_summary[f"{col}_mean"] = overall_df[col].mean() + overall_summary[f"{col}_std"] = overall_df[col].std() + fold_summary_data.append(overall_summary) + + fold_summary_df = pd.DataFrame(fold_summary_data) + csv_summary_path = os.path.join(model_dir, f"metrics_{model_name}_cv_per_fold.csv") + fold_summary_df.to_csv(csv_summary_path, index=False) + print(f" Per-fold summary CSV saved to: {csv_summary_path}") + + +def save_tuning_results(model_name, val_df, test_df, best_params_per_seed, output_dir): + """Save hyperparameter tuning results to files. + + Parameters + ---------- + model_name : str + Name of the model + val_df : DataFrame + Validation metrics + test_df : DataFrame + Test metrics + best_params_per_seed : list + Best parameters for each seed + output_dir : str + Output directory + """ + model_dir = os.path.join(output_dir, model_name) + os.makedirs(model_dir, exist_ok=True) + + # Save validation metrics + val_csv = os.path.join(model_dir, f"metrics_{model_name}_validation.csv") + val_df.to_csv(val_csv, index=False) + print(f"\n Validation metrics saved to: {val_csv}") + + # Save test metrics + test_csv = os.path.join(model_dir, f"metrics_{model_name}_test.csv") + test_df.to_csv(test_csv, index=False) + print(f" Test metrics saved to: {test_csv}") + + # Save hyperparameter summary + params_df = pd.DataFrame(best_params_per_seed) + params_csv = os.path.join(model_dir, f"best_params_{model_name}.csv") + params_df.to_csv(params_csv, index=False) + print(f" Best parameters saved to: {params_csv}") + + +def save_tuning_summary_report(all_summaries, output_dir, problem_type): + """Save summary report comparing all models. 
+ + Parameters + ---------- + all_summaries : dict + Dictionary of summaries for all models + output_dir : str + Output directory + problem_type : str + "classification" or "regression" + """ + summary_report = {} + for model_name, summary in all_summaries.items(): + val_df = summary['validation'] + test_df = summary['test'] + + if problem_type == "regression": + summary_report[model_name] = { + 'val_r2_mean': float(val_df['r2'].mean()), + 'val_r2_std': float(safe_std(val_df['r2'])), + 'test_r2_mean': float(test_df['r2'].mean()), + 'test_r2_std': float(safe_std(test_df['r2'])), + 'val_rmse_mean': float(val_df['rmse'].mean()), + 'val_rmse_std': float(safe_std(val_df['rmse'])), + 'test_rmse_mean': float(test_df['rmse'].mean()), + 'test_rmse_std': float(safe_std(test_df['rmse'])), + } + else: + summary_report[model_name] = { + 'val_accuracy_mean': float(val_df['accuracy'].mean()), + 'val_accuracy_std': float(safe_std(val_df['accuracy'])), + 'test_accuracy_mean': float(test_df['accuracy'].mean()), + 'test_accuracy_std': float(safe_std(test_df['accuracy'])), + 'val_f1_mean': float(val_df['f1'].mean()), + 'val_f1_std': float(safe_std(val_df['f1'])), + 'test_f1_mean': float(test_df['f1'].mean()), + 'test_f1_std': float(safe_std(test_df['f1'])), + } + + # Save summary report + summary_path = os.path.join(output_dir, "tuning_summary_all_models.json") + with open(summary_path, 'w') as f: + json.dump(summary_report, f, indent=2) + print(f"\n{'='*60}") + print(f"Summary report saved to: {summary_path}") + print(f"{'='*60}\n") diff --git a/ecosci/hyperopt.py b/ecosci/hyperopt.py new file mode 100644 index 0000000..851c8d5 --- /dev/null +++ b/ecosci/hyperopt.py @@ -0,0 +1,507 @@ +"""Hyperparameter optimisation for different model types. + +This module provides hyperparameter tuning functionality using scikit-learn's +search methods (GridSearchCV, RandomizedSearchCV). Each model type has a dedicated +tuning function with sensible default search spaces that can be overridden via config. + +Key features: +- Separate tuning functions for each model type (MLP, RandomForest, XGBoost, SVM, Linear) +- Uses validation set for tuning, test set held out for final evaluation +- Supports both exhaustive grid search and randomized search +- Returns best model and tuning results +""" + +from typing import Dict, Optional, Tuple +import numpy as np +from sklearn.model_selection import GridSearchCV, RandomizedSearchCV +from sklearn.base import BaseEstimator +import joblib +import os + + +class HyperparameterTuner: + """Hyperparameter tuning for various model types. 
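+
+    Thin wrapper around scikit-learn's ``GridSearchCV``/``RandomizedSearchCV`` with
+    per-model default search spaces. When group labels are supplied, ``GroupKFold``
+    is used during tuning so that samples from the same group never appear in both
+    a training and a validation fold.
+
+    A minimal usage sketch (assuming ``X_train``/``y_train`` are already prepared)::
+
+        tuner = HyperparameterTuner(problem_type="regression", n_iter=10, cv=3)
+        best_model, info = tuner.tune_model("random_forest", X_train, y_train)
+        print(info["best_params"], info["best_score"])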
+ + Parameters + ---------- + problem_type : str + "classification" or "regression" + n_iter : int, optional + Number of iterations for RandomizedSearchCV (default: 50) + cv : int, optional + Number of cross-validation folds to use during tuning (default: 3) + scoring : str, optional + Scoring metric for tuning (default: auto-selected based on problem_type) + search_method : str, optional + Either "grid" for GridSearchCV or "random" for RandomizedSearchCV (default: "random") + random_state : int, optional + Random state for reproducibility (default: 42) + n_jobs : int, optional + Number of parallel jobs (default: -1, use all cores) + verbose : int, optional + Verbosity level (default: 1) + """ + + def __init__( + self, + problem_type: str = "regression", + n_iter: int = 50, + cv: int = 3, + scoring: Optional[str] = None, + search_method: str = "random", + random_state: int = 42, + n_jobs: int = -1, + verbose: int = 1, + ): + self.problem_type = problem_type + self.n_iter = n_iter + self.cv = cv + self.search_method = search_method + self.random_state = random_state + self.n_jobs = n_jobs + self.verbose = verbose + + # Auto-select scoring metric if not provided + if scoring is None: + if problem_type == "regression": + self.scoring = "neg_mean_squared_error" + else: + self.scoring = "accuracy" + else: + self.scoring = scoring + + def _get_search_estimator(self, model: BaseEstimator, param_space: Dict, groups: Optional[np.ndarray] = None): + """Create appropriate search estimator (Grid or Randomized). + + Parameters + ---------- + model : BaseEstimator + Base model to tune + param_space : Dict + Parameter search space + groups : np.ndarray, optional + Group labels for GroupKFold. If provided, uses GroupKFold instead of regular KFold. + + Returns + ------- + search estimator + GridSearchCV or RandomizedSearchCV configured appropriately + """ + from sklearn.model_selection import GroupKFold + + # Determine CV strategy + if groups is not None: + # Use GroupKFold to respect group structure + cv_strategy = GroupKFold(n_splits=self.cv) + else: + # Use default KFold (integer cv value) + cv_strategy = self.cv + + if self.search_method == "grid": + return GridSearchCV( + model, + param_space, + cv=cv_strategy, + scoring=self.scoring, + n_jobs=self.n_jobs, + verbose=self.verbose, + refit=True, + ) + else: + return RandomizedSearchCV( + model, + param_space, + n_iter=self.n_iter, + cv=cv_strategy, + scoring=self.scoring, + n_jobs=self.n_jobs, + verbose=self.verbose, + refit=True, + random_state=self.random_state, + ) + + def tune_mlp( + self, + X_train: np.ndarray, + y_train: np.ndarray, + param_space: Optional[Dict] = None, + groups: Optional[np.ndarray] = None, + ) -> Tuple[BaseEstimator, Dict]: + """Tune MLP hyperparameters. + + Parameters + ---------- + X_train : np.ndarray + Training features + y_train : np.ndarray + Training labels + param_space : Dict, optional + Custom parameter search space. If None, uses sensible defaults. 
+ groups : np.ndarray, optional + Group labels for GroupKFold cross-validation + + Returns + ------- + best_model : BaseEstimator + Trained model with best hyperparameters + results : Dict + Dictionary containing best parameters, best score, and CV results + """ + from sklearn.neural_network import MLPRegressor, MLPClassifier + + # Default parameter space + if param_space is None: + param_space = { + 'hidden_layer_sizes': [ + (16,), (32,), (64,), + (16, 8), (32, 16), (64, 32), + (32, 16, 8), (64, 32, 16), + ], + 'activation': ['relu', 'tanh'], + 'alpha': [0.0001, 0.001, 0.01, 0.1], + 'learning_rate_init': [0.0001, 0.001, 0.01], + 'batch_size': [8, 16, 32, 64], + 'early_stopping': [True], + 'validation_fraction': [0.1], + 'n_iter_no_change': [10, 20], + 'max_iter': [1000], + } + + # Create base model + if self.problem_type == "regression": + base_model = MLPRegressor(random_state=self.random_state, verbose=False) + else: + base_model = MLPClassifier(random_state=self.random_state, verbose=False) + + # Perform search + search = self._get_search_estimator(base_model, param_space, groups=groups) + search.fit(X_train, y_train, groups=groups) + + results = { + 'best_params': search.best_params_, + 'best_score': search.best_score_, + 'cv_results': search.cv_results_, + } + + return search.best_estimator_, results + + def tune_random_forest( + self, + X_train: np.ndarray, + y_train: np.ndarray, + param_space: Optional[Dict] = None, + groups: Optional[np.ndarray] = None, + ) -> Tuple[BaseEstimator, Dict]: + """Tune Random Forest hyperparameters. + + Parameters + ---------- + X_train : np.ndarray + Training features + y_train : np.ndarray + Training labels + param_space : Dict, optional + Custom parameter search space. If None, uses sensible defaults. + groups : np.ndarray, optional + Group labels for GroupKFold cross-validation + + Returns + ------- + best_model : BaseEstimator + Trained model with best hyperparameters + results : Dict + Dictionary containing best parameters, best score, and CV results + """ + from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier + + # Default parameter space + if param_space is None: + param_space = { + 'n_estimators': [100, 200, 500, 1000], + 'max_depth': [5, 10, 20, 30, 50, None], + 'min_samples_split': [2, 5, 10, 20], + 'min_samples_leaf': [1, 2, 5, 10], + 'max_features': ['sqrt', 'log2', 0.5], + 'bootstrap': [True, False], + } + + # Create base model + if self.problem_type == "regression": + base_model = RandomForestRegressor(random_state=self.random_state, n_jobs=1) + else: + base_model = RandomForestClassifier(random_state=self.random_state, n_jobs=1) + + # Perform search + search = self._get_search_estimator(base_model, param_space, groups=groups) + search.fit(X_train, y_train, groups=groups) + + results = { + 'best_params': search.best_params_, + 'best_score': search.best_score_, + 'cv_results': search.cv_results_, + } + + return search.best_estimator_, results + + def tune_xgboost( + self, + X_train: np.ndarray, + y_train: np.ndarray, + param_space: Optional[Dict] = None, + groups: Optional[np.ndarray] = None, + ) -> Tuple[BaseEstimator, Dict]: + """Tune XGBoost hyperparameters. + + Parameters + ---------- + X_train : np.ndarray + Training features + y_train : np.ndarray + Training labels + param_space : Dict, optional + Custom parameter search space. If None, uses sensible defaults. 
+ groups : np.ndarray, optional + Group labels for GroupKFold cross-validation + + Returns + ------- + best_model : BaseEstimator + Trained model with best hyperparameters + results : Dict + Dictionary containing best parameters, best score, and CV results + """ + try: + from xgboost import XGBRegressor, XGBClassifier + except ImportError: + raise ImportError("xgboost is required: install via pip install xgboost") + + # Default parameter space + if param_space is None: + param_space = { + 'n_estimators': [100, 200, 500], + 'max_depth': [3, 5, 7, 10], + 'learning_rate': [0.01, 0.05, 0.1, 0.2], + 'subsample': [0.6, 0.8, 1.0], + 'colsample_bytree': [0.6, 0.8, 1.0], + 'min_child_weight': [1, 3, 5], + 'reg_alpha': [0, 0.01, 0.1, 1.0], + 'reg_lambda': [0.1, 1.0, 10.0], + } + + # Create base model + if self.problem_type == "regression": + base_model = XGBRegressor(random_state=self.random_state, n_jobs=1) + else: + base_model = XGBClassifier( + random_state=self.random_state, + eval_metric='logloss', + n_jobs=1 + ) + + # Perform search + search = self._get_search_estimator(base_model, param_space, groups=groups) + search.fit(X_train, y_train, groups=groups) + + results = { + 'best_params': search.best_params_, + 'best_score': search.best_score_, + 'cv_results': search.cv_results_, + } + + return search.best_estimator_, results + + def tune_svm( + self, + X_train: np.ndarray, + y_train: np.ndarray, + param_space: Optional[Dict] = None, + groups: Optional[np.ndarray] = None, + ) -> Tuple[BaseEstimator, Dict]: + """Tune SVM hyperparameters. + + Parameters + ---------- + X_train : np.ndarray + Training features + y_train : np.ndarray + Training labels + param_space : Dict, optional + Custom parameter search space. If None, uses sensible defaults. + groups : np.ndarray, optional + Group labels for GroupKFold cross-validation + + Returns + ------- + best_model : BaseEstimator + Trained model with best hyperparameters + results : Dict + Dictionary containing best parameters, best score, and CV results + """ + from sklearn.svm import SVR, SVC + + # Default parameter space + if param_space is None: + if self.problem_type == "regression": + param_space = { + 'kernel': ['rbf', 'poly', 'linear'], + 'C': [0.1, 1, 10, 100], + 'epsilon': [0.01, 0.1, 0.5, 1.0], + 'gamma': ['scale', 'auto', 0.001, 0.01, 0.1], + } + else: + param_space = { + 'kernel': ['rbf', 'poly', 'linear'], + 'C': [0.1, 1, 10, 100], + 'gamma': ['scale', 'auto', 0.001, 0.01, 0.1], + } + + # Create base model + if self.problem_type == "regression": + base_model = SVR() + else: + base_model = SVC(probability=True, random_state=self.random_state) + + # Perform search + search = self._get_search_estimator(base_model, param_space, groups=groups) + search.fit(X_train, y_train, groups=groups) + + results = { + 'best_params': search.best_params_, + 'best_score': search.best_score_, + 'cv_results': search.cv_results_, + } + + return search.best_estimator_, results + + def tune_linear( + self, + X_train: np.ndarray, + y_train: np.ndarray, + param_space: Optional[Dict] = None, + groups: Optional[np.ndarray] = None, + ) -> Tuple[BaseEstimator, Dict]: + """Tune Linear model hyperparameters. + + For linear regression, there are few hyperparameters to tune, + but we can tune regularisation if using Ridge/Lasso. + + Parameters + ---------- + X_train : np.ndarray + Training features + y_train : np.ndarray + Training labels + param_space : Dict, optional + Custom parameter search space. If None, uses Ridge regression with alpha tuning. 
+ groups : np.ndarray, optional + Group labels for GroupKFold cross-validation + + Returns + ------- + best_model : BaseEstimator + Trained model with best hyperparameters + results : Dict + Dictionary containing best parameters, best score, and CV results + """ + from sklearn.linear_model import Ridge, Lasso, ElasticNet, LogisticRegression + + # Default parameter space - use Ridge for regression, LogisticRegression for classification + if param_space is None: + if self.problem_type == "regression": + # Use Ridge with alpha tuning + param_space = { + 'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], + } + base_model = Ridge(random_state=self.random_state) + else: + # Use LogisticRegression with C and penalty tuning + # Separate parameter spaces for different penalties to avoid invalid combinations + c_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0] + param_space = [ + {'C': c_values, 'penalty': ['l1'], 'solver': ['saga']}, + {'C': c_values, 'penalty': ['l2'], 'solver': ['saga']}, + {'C': c_values, 'penalty': ['elasticnet'], 'solver': ['saga'], 'l1_ratio': [0.1, 0.5, 0.9]}, + ] + base_model = LogisticRegression( + random_state=self.random_state, + max_iter=1000 + ) + else: + # Use provided param_space and determine model type from it + if self.problem_type == "regression": + base_model = Ridge(random_state=self.random_state) + else: + base_model = LogisticRegression( + random_state=self.random_state, + max_iter=1000 + ) + + # Perform search + search = self._get_search_estimator(base_model, param_space, groups=groups) + search.fit(X_train, y_train, groups=groups) + + results = { + 'best_params': search.best_params_, + 'best_score': search.best_score_, + 'cv_results': search.cv_results_, + } + + return search.best_estimator_, results + + def tune_model( + self, + model_name: str, + X_train: np.ndarray, + y_train: np.ndarray, + param_space: Optional[Dict] = None, + groups: Optional[np.ndarray] = None, + ) -> Tuple[BaseEstimator, Dict]: + """Tune hyperparameters for specified model type. + + Parameters + ---------- + model_name : str + Name of model to tune ('mlp', 'random_forest', 'xgboost', 'svm', 'linear') + X_train : np.ndarray + Training features + y_train : np.ndarray + Training labels + param_space : Dict, optional + Custom parameter search space. If None, uses sensible defaults. + groups : np.ndarray, optional + Group labels for GroupKFold cross-validation + + Returns + ------- + best_model : BaseEstimator + Trained model with best hyperparameters + results : Dict + Dictionary containing best parameters, best score, and CV results + """ + model_name_lower = model_name.lower() + + if model_name_lower == "mlp": + return self.tune_mlp(X_train, y_train, param_space, groups=groups) + elif model_name_lower == "random_forest": + return self.tune_random_forest(X_train, y_train, param_space, groups=groups) + elif model_name_lower == "xgboost": + return self.tune_xgboost(X_train, y_train, param_space, groups=groups) + elif model_name_lower == "svm": + return self.tune_svm(X_train, y_train, param_space, groups=groups) + elif model_name_lower in ["linear", "logistic"]: + return self.tune_linear(X_train, y_train, param_space, groups=groups) + else: + raise ValueError(f"Unknown model name for tuning: {model_name}") + + def save_tuning_results(self, results: Dict, output_path: str): + """Save tuning results to disk. 
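+
+        The dictionary is serialised with ``joblib.dump`` and the parent directory
+        of ``output_path`` is created if it does not already exist.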
+ + Parameters + ---------- + results : Dict + Results dictionary from tuning + output_path : str + Path to save results (as joblib file) + """ + os.makedirs(os.path.dirname(output_path), exist_ok=True) + joblib.dump(results, output_path) diff --git a/ecosci/models.py b/ecosci/models.py index 1bdfe62..c59632d 100644 --- a/ecosci/models.py +++ b/ecosci/models.py @@ -19,7 +19,10 @@ class ModelZoo: @staticmethod def get_model( - name: str, problem_type: str = "classification", params: Dict[str, Any] = None + name: str, + problem_type: str = "classification", + params: Dict[str, Any] = None, + n_outputs: int = 1 ): """Return a model instance by name using params dict. @@ -31,6 +34,8 @@ def get_model( "classification" or "regression". Determines which estimator is used. params : Dict[str, Any] Hyperparameters forwarded to the underlying estimator. + n_outputs : int + Number of output targets. If > 1, wraps the model in MultiOutput wrapper. """ params = params or {} @@ -39,13 +44,14 @@ def get_model( if problem_type == "classification": # shallow MLP: single hidden layer default - return MLPClassifier( + base_model = MLPClassifier( hidden_layer_sizes=tuple(params.get("hidden_layer_sizes", (32,))), max_iter=params.get("max_iter", 200), early_stopping=params.get("early_stopping", True), validation_fraction=params.get("validation_fraction", 0.1), n_iter_no_change=params.get("n_iter_no_change", 10), random_state=params.get("random_state", 0), + verbose=params.get("verbose", False), # Disable verbose output by default **{ k: v for k, v in params.items() @@ -57,10 +63,13 @@ def get_model( "validation_fraction", "n_iter_no_change", "random_state", + "verbose", ] }, ) + return ModelZoo.wrap_for_multioutput(base_model, problem_type, n_outputs) else: + # MLPRegressor natively supports multi-output return MLPRegressor( hidden_layer_sizes=tuple(params.get("hidden_layer_sizes", (32,))), max_iter=params.get("max_iter", 200), @@ -68,6 +77,7 @@ def get_model( validation_fraction=params.get("validation_fraction", 0.1), n_iter_no_change=params.get("n_iter_no_change", 10), random_state=params.get("random_state", 0), + verbose=params.get("verbose", False), # Disable verbose output by default **{ k: v for k, v in params.items() @@ -79,6 +89,7 @@ def get_model( "validation_fraction", "n_iter_no_change", "random_state", + "verbose", ] }, ) @@ -86,6 +97,7 @@ def get_model( if name.lower() == "random_forest": from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor + # Random Forest natively supports multi-output if problem_type == "classification": return RandomForestClassifier( random_state=params.get("random_state", 0), @@ -101,14 +113,16 @@ def get_model( from sklearn.svm import SVC, SVR if problem_type == "classification": - return SVC( + base_model = SVC( probability=True, random_state=params.get("random_state", 0), **{k: v for k, v in params.items() if k != "random_state"}, ) + return ModelZoo.wrap_for_multioutput(base_model, problem_type, n_outputs) else: # SVR doesn't support random_state parameter - return SVR(**{k: v for k, v in params.items() if k != "random_state"}) + base_model = SVR(**{k: v for k, v in params.items() if k != "random_state"}) + return ModelZoo.wrap_for_multioutput(base_model, problem_type, n_outputs) if name.lower() == "xgboost": try: @@ -125,14 +139,16 @@ def get_model( params_filtered = { k: v for k, v in params.items() if k != "eval_metric" } - return XGBClassifier(eval_metric=eval_metric, **params_filtered) + base_model = XGBClassifier(eval_metric=eval_metric, 
**params_filtered) + return ModelZoo.wrap_for_multioutput(base_model, problem_type, n_outputs) else: - return XGBRegressor(**params) + base_model = XGBRegressor(**params) + return ModelZoo.wrap_for_multioutput(base_model, problem_type, n_outputs) if name.lower() == "logistic": from sklearn.linear_model import LogisticRegression - return LogisticRegression( + base_model = LogisticRegression( random_state=params.get("random_state", 0), max_iter=params.get("max_iter", 1000), **{ @@ -141,18 +157,75 @@ def get_model( if k not in ["random_state", "max_iter"] }, ) + return ModelZoo.wrap_for_multioutput(base_model, problem_type, n_outputs) if name.lower() == "linear": - from sklearn.linear_model import LinearRegression + from sklearn.linear_model import LinearRegression, Ridge if problem_type == "classification": raise ValueError( "Linear model is only for regression. Use 'logistic' for classification." ) - # LinearRegression doesn't support random_state - return LinearRegression( - **{k: v for k, v in params.items() if k != "random_state"} - ) + # If alpha is specified, use Ridge regression (supports regularisation) + # Otherwise use plain LinearRegression + if "alpha" in params: + base_model = Ridge( + random_state=params.get("random_state", 0), + **{k: v for k, v in params.items() if k != "random_state"} + ) + else: + # LinearRegression doesn't support random_state or alpha + base_model = LinearRegression( + **{k: v for k, v in params.items() if k not in ["random_state", "alpha"]} + ) + + # Wrap in MultiOutputRegressor if needed + if n_outputs > 1: + from sklearn.multioutput import MultiOutputRegressor + return MultiOutputRegressor(base_model) + return base_model raise ValueError(f"Unknown model name: {name}") + + @staticmethod + def wrap_for_multioutput(model, problem_type: str, n_outputs: int): + """Wrap a model for multi-output prediction if necessary. + + Parameters + ---------- + model : estimator + Base sklearn-compatible model. + problem_type : str + "classification" or "regression". + n_outputs : int + Number of output targets. + + Returns + ------- + estimator + The model, wrapped in MultiOutput if n_outputs > 1. 
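+
+        Examples
+        --------
+        Illustrative sketch of the wrapping behaviour:
+
+        >>> from ecosci.models import ModelZoo
+        >>> from sklearn.linear_model import LogisticRegression
+        >>> wrapped = ModelZoo.wrap_for_multioutput(LogisticRegression(), "classification", 3)
+        >>> type(wrapped).__name__
+        'MultiOutputClassifier'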
+ """ + if n_outputs <= 1: + return model + + # Some models natively support multi-output + native_multioutput = [ + 'RandomForestClassifier', 'RandomForestRegressor', + 'ExtraTreesClassifier', 'ExtraTreesRegressor', + 'DecisionTreeClassifier', 'DecisionTreeRegressor', + 'KNeighborsClassifier', 'KNeighborsRegressor', + 'MLPRegressor', # MLP Regressor supports multi-output + ] + + model_class = model.__class__.__name__ + if model_class in native_multioutput: + return model + + # Wrap in MultiOutput wrapper + if problem_type == "classification": + from sklearn.multioutput import MultiOutputClassifier + return MultiOutputClassifier(model) + else: + from sklearn.multioutput import MultiOutputRegressor + return MultiOutputRegressor(model) diff --git a/ecosci/trainer.py b/ecosci/trainer.py index 49f58fa..0eca02f 100644 --- a/ecosci/trainer.py +++ b/ecosci/trainer.py @@ -5,14 +5,16 @@ - Uses scikit-learn's built-in early stopping for the MLP if enabled in config - Saves each trained model as a `.joblib` file - Returns predictions per run so the evaluator can compute metrics +- Supports hyperparameter tuning with train/val/test splits """ -from typing import Any, Dict, List +from typing import Any, Dict, Optional from tqdm import tqdm import numpy as np import random import joblib import os +import json class Trainer: @@ -67,6 +69,12 @@ def run(self, cfg: Dict[str, Any], X_train, X_val, X_test, y_train, y_val, y_tes seeds_list = [base_seed + i for i in range(repetitions)] all_results = {} + + # Determine number of outputs from y_train shape + if len(y_train.shape) == 1: + n_outputs = 1 + else: + n_outputs = y_train.shape[1] # Train each model type for model_idx, model_cfg in enumerate(models_cfg): @@ -84,7 +92,7 @@ def run(self, cfg: Dict[str, Any], X_train, X_val, X_test, y_train, y_val, y_tes if "random_state" in mparams_local or True: mparams_local.setdefault("random_state", s) - model = self.model_factory(mname, self.problem_type, mparams_local) + model = self.model_factory(mname, self.problem_type, mparams_local, n_outputs) # fit: sklearn models have fit(X, y). For MLP, early_stopping is # handled internally if enabled via params in the YAML config. @@ -130,3 +138,334 @@ def run(self, cfg: Dict[str, Any], X_train, X_val, X_test, y_train, y_val, y_tes all_results[mname] = model_results return all_results + + def run_cv(self, cfg: Dict[str, Any], fold_data_list): + """Run k-fold cross-validation training. 
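+
+        For every fold in ``fold_data_list`` and every seed, a fresh model is
+        created via ``self.model_factory``, fitted on that fold's training data,
+        evaluated on the fold's test split, and saved under
+        ``<output_dir>/<model_name>/fold<id>/``.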
+ + Parameters + ---------- + cfg : Dict[str, Any] + Configuration dictionary + fold_data_list : list of tuples + List of (X_train, X_val, X_test, y_train, y_val, y_test, fold_id) tuples + + Returns + ------- + dict + Results organised as {model_name: [run_results_with_fold_info]} + """ + models_cfg = cfg.get("models", []) + + seeds = cfg.get("training", {}).get("seeds") + repetitions = cfg.get("training", {}).get("repetitions", 1) + base_seed = cfg.get("training", {}).get("random_seed", 0) + + seeds_list = [] + if seeds: + seeds_list = seeds + else: + seeds_list = [base_seed + i for i in range(repetitions)] + + all_results = {} + + # Train each model type + for model_idx, model_cfg in enumerate(models_cfg): + mname = model_cfg.get("name", "mlp") + mparams = model_cfg.get("params", {}) + + model_results = [] + + # Iterate over each fold + for fold_idx, (X_train, X_val, X_test, y_train, y_val, y_test, fold_id) in enumerate(fold_data_list): + # Determine number of outputs + if len(y_train.shape) == 1: + n_outputs = 1 + else: + n_outputs = y_train.shape[1] + + # Run each seed for this fold + for s in tqdm(seeds_list, desc=f"{mname.upper()} - Fold {fold_id+1}/{len(fold_data_list)}", unit="seed"): + self._set_seed(s) + + # ensure model has random_state where appropriate + mparams_local = dict(mparams) + if "random_state" in mparams_local or True: + mparams_local.setdefault("random_state", s) + + model = self.model_factory(mname, self.problem_type, mparams_local, n_outputs) + + # fit + if ( + X_val is not None + and hasattr(model, "partial_fit") + and cfg.get("training", {}).get("use_partial_fit", False) + ): + model.partial_fit(X_train, y_train) + else: + model.fit(X_train, y_train) + + y_pred = model.predict(X_test) + y_proba = None + + # Only try to get probabilities for classification problems + if self.problem_type == "classification" and hasattr(model, "predict_proba"): + try: + y_proba = model.predict_proba(X_test) + except Exception as e: + # predict_proba may fail for some classifiers (e.g., SVC without probability=True) + # or when the model has issues with the data. This is not critical for training, + # but we silently continue without probabilities for this fold. + y_proba = None + + # save model for this run in model-specific subfolder + model_dir = os.path.join(self.output_dir, mname, f"fold{fold_id}") + os.makedirs(model_dir, exist_ok=True) + fname = os.path.join(model_dir, f"model_{mname}_fold{fold_id}_seed{s}.joblib") + joblib.dump(model, fname) + + model_results.append( + { + "seed": s, + "fold": fold_id, + "model_name": mname, + "model_path": fname, + "y_pred": y_pred, + "y_proba": y_proba, + "y_test": y_test, # Include y_test for fold-specific evaluation + "X_test": X_test, # Include X_test for permutation importance + } + ) + + all_results[mname] = model_results + + return all_results + + def run_with_tuning( + self, + cfg: Dict[str, Any], + X_train: np.ndarray, + X_val: np.ndarray, + X_test: np.ndarray, + y_train: np.ndarray, + y_val: np.ndarray, + y_test: np.ndarray, + group_assignments: Optional[Dict] = None, + groups_train: Optional[np.ndarray] = None, + groups_val: Optional[np.ndarray] = None, + ): + """Run training with hyperparameter tuning. + + This method performs hyperparameter optimisation on the training + validation + sets, then evaluates the best model on the held-out test set. Multiple seeds + are used for stable results. 
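+
+        Tuning is performed on the concatenated train+validation data (with
+        ``GroupKFold`` when group labels are provided); the best hyperparameters
+        are then refitted on the training set alone before scoring the validation
+        and test sets.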
+ + Parameters + ---------- + cfg : Dict[str, Any] + Configuration dictionary with tuning settings + X_train : np.ndarray + Training features + X_val : np.ndarray + Validation features (used during hyperparameter tuning) + X_test : np.ndarray + Test features (held out for final evaluation) + y_train : np.ndarray + Training labels + y_val : np.ndarray + Validation labels + y_test : np.ndarray + Test labels + group_assignments : Dict, optional + Dictionary mapping split names to group IDs (for documentation) + groups_train : np.ndarray, optional + Group IDs for training samples + groups_val : np.ndarray, optional + Group IDs for validation samples + + Returns + ------- + dict + Results organised as {model_name: [run_results_with_tuning_info]} + """ + from ecosci.hyperopt import HyperparameterTuner + + models_cfg = cfg.get("models", []) + tuning_cfg = cfg.get("tuning", {}) + + # Get seeds for stable results + seeds = cfg.get("training", {}).get("seeds") + repetitions = cfg.get("training", {}).get("repetitions", 1) + base_seed = cfg.get("training", {}).get("random_seed", 0) + + if seeds: + seeds_list = seeds + else: + seeds_list = [base_seed + i for i in range(repetitions)] + + # Tuning configuration + search_method = tuning_cfg.get("search_method", "random") + n_iter = tuning_cfg.get("n_iter", 50) + cv_folds = tuning_cfg.get("cv_folds", 3) + scoring = tuning_cfg.get("scoring") + n_jobs = tuning_cfg.get("n_jobs", -1) + verbose = tuning_cfg.get("verbose", 1) + + all_results = {} + + # Train each model type + for model_idx, model_cfg in enumerate(models_cfg): + mname = model_cfg.get("name", "mlp") + + # Get hyperparameter search space from config + param_space = model_cfg.get("param_space") + + model_results = [] + + print(f"\n{'='*60}") + print(f"Tuning {mname.upper()}") + print(f"{'='*60}") + print(f"Metric: {scoring} ({'lower is better' if 'neg_' in str(scoring) else 'higher is better'})") + print(f"Search method: {search_method}") + print(f"CV folds: {cv_folds}") + if search_method == "random": + print(f"Iterations: {n_iter}") + + # Combine train and val for tuning (GridSearchCV will split internally) + X_train_val = np.vstack([X_train, X_val]) + y_train_val = np.concatenate([y_train, y_val]) + + # Combine groups for grouped CV during tuning (if groups are provided) + groups_train_val = None + if groups_train is not None and groups_val is not None: + groups_train_val = np.concatenate([groups_train, groups_val]) + + # Track best params across seeds + best_params_per_seed = [] + + for seed_idx, s in enumerate(tqdm(seeds_list, desc=f"Seeds", unit="seed")): + self._set_seed(s) + + # Create tuner for this seed + tuner = HyperparameterTuner( + problem_type=self.problem_type, + n_iter=n_iter, + cv=cv_folds, + scoring=scoring, + search_method=search_method, + random_state=s, + n_jobs=n_jobs, + verbose=verbose if seed_idx == 0 else 0, # Only verbose for first seed + ) + + # Tune hyperparameters + print(f"\n Seed {s}: Tuning hyperparameters...") + best_model, tuning_results = tuner.tune_model( + mname, X_train_val, y_train_val, param_space, groups=groups_train_val + ) + + best_params = tuning_results['best_params'] + best_score = tuning_results['best_score'] + best_params_per_seed.append(best_params) + + # Make the score more interpretable + if self.problem_type == "regression": + # For regression, the score is typically neg_mean_squared_error + if best_score < 0: + mse = -best_score + rmse = np.sqrt(mse) + print(f" Seed {s}: Best CV score: MSE={mse:.4f}, RMSE={rmse:.4f} (lower is better)") + else: + 
# If positive (e.g., R²), show as-is + print(f" Seed {s}: Best CV score: {best_score:.4f}") + else: + # For classification, usually accuracy or similar (higher is better) + print(f" Seed {s}: Best CV score: {best_score:.4f}") + print(f" Seed {s}: Best params: {best_params}") + + # Retrain on train set only (optional, but good practice) + # The best_model is already trained on train_val, but we can + # retrain on just train to match the typical workflow + from ecosci.models import ModelZoo + + # Determine number of outputs + if len(y_train.shape) == 1: + n_outputs = 1 + else: + n_outputs = y_train.shape[1] + + # Create new model with best params and train on train set + final_model = ModelZoo.get_model( + mname, + self.problem_type, + best_params, + n_outputs + ) + final_model.fit(X_train, y_train) + + # Evaluate on validation and test sets + y_val_pred = final_model.predict(X_val) + y_test_pred = final_model.predict(X_test) + + y_val_proba = None + y_test_proba = None + if self.problem_type == "classification" and hasattr(final_model, "predict_proba"): + try: + y_val_proba = final_model.predict_proba(X_val) + y_test_proba = final_model.predict_proba(X_test) + except Exception as e: + # predict_proba may fail for some classifiers (e.g., SVC without probability=True) + # or when the model has issues with the data. This is not critical for evaluation, + # but we silently continue without probabilities. Metrics like ROC-AUC won't be available. + y_val_proba = None + y_test_proba = None + + # Save model + model_dir = os.path.join(self.output_dir, mname) + os.makedirs(model_dir, exist_ok=True) + model_fname = os.path.join(model_dir, f"model_{mname}_seed{s}.joblib") + joblib.dump(final_model, model_fname) + + # Save tuning results + tuning_fname = os.path.join(model_dir, f"tuning_results_{mname}_seed{s}.json") + tuning_summary = { + 'best_params': best_params, + 'best_cv_score': float(best_score), + 'seed': s, + 'cv_folds': cv_folds, + 'search_method': search_method, + } + with open(tuning_fname, 'w') as f: + json.dump(tuning_summary, f, indent=2) + + model_results.append({ + 'seed': s, + 'model_name': mname, + 'model_path': model_fname, + 'best_params': best_params, + 'best_cv_score': best_score, + 'y_val_pred': y_val_pred, + 'y_val_proba': y_val_proba, + 'y_test_pred': y_test_pred, + 'y_test_proba': y_test_proba, + 'tuning_results_path': tuning_fname, + }) + + # Save summary of best params across all seeds + summary_path = os.path.join(model_dir, f"tuning_summary_{mname}.json") + summary = { + 'model_name': mname, + 'n_seeds': len(seeds_list), + 'seeds': seeds_list, + 'best_params_per_seed': best_params_per_seed, + 'search_method': search_method, + 'cv_folds': cv_folds, + 'group_assignments': group_assignments, + } + with open(summary_path, 'w') as f: + json.dump(summary, f, indent=2) + + print(f"\n Saved tuning summary to: {summary_path}") + + all_results[mname] = model_results + + return all_results diff --git a/run.py b/run.py index 9525f51..c77ffd5 100644 --- a/run.py +++ b/run.py @@ -9,7 +9,7 @@ from ecosci.data import CSVDataLoader from ecosci.models import ModelZoo from ecosci.trainer import Trainer -from ecosci.eval import evaluate_and_report +from ecosci.evaluation import evaluate_and_report, evaluate_and_report_cv parser = argparse.ArgumentParser(description="Train a model using a YAML config.") parser.add_argument("--config", required=True, help="Path to YAML config file") @@ -26,18 +26,21 @@ # Load and prepare data data_cfg = cfg.get("data", {}) problem_type = 
cfg.get("problem_type", "classification") +cv_group_column = data_cfg.get("cv_group_column") + loader = CSVDataLoader( path=data_cfg.get("path"), features=data_cfg.get("features"), label=data_cfg.get("label"), + labels=data_cfg.get("labels"), test_size=data_cfg.get("test_size", 0.2), val_size=data_cfg.get("val_size", 0.2), random_state=data_cfg.get("random_state", 0), scaling=data_cfg.get("scaling", "standard"), impute_strategy=data_cfg.get("impute_strategy", "mean"), problem_type=problem_type, + cv_group_column=cv_group_column, ) -X_train, X_val, X_test, y_train, y_val, y_test = loader.prepare() # Get output directory output_dir = cfg.get("output", {}).get("dir", "outputs") @@ -48,26 +51,107 @@ problem_type=cfg.get("problem_type", "classification"), output_dir=output_dir, ) -results = trainer.run(cfg, X_train, X_val, X_test, y_train, y_val, y_test) -# Evaluate -summary = evaluate_and_report( - results, - y_test, - output_dir=output_dir, - problem_type=problem_type, -) +# Check if hyperparameter tuning is enabled +tuning_enabled = cfg.get("tuning", {}).get("enabled", False) + +# Determine which mode to use +if tuning_enabled and cv_group_column is not None: + # Hyperparameter tuning mode with grouped train/val/test splits + n_train_groups = data_cfg.get("n_train_groups", 4) + n_val_groups = data_cfg.get("n_val_groups", 2) + n_test_groups = data_cfg.get("n_test_groups", 2) + + print(f"\n{'='*70}") + print(f"Hyperparameter Tuning Mode") + print(f"{'='*70}") + print(f"Using grouped train/val/test splits with group column: {cv_group_column}") + print(f" Train groups: {n_train_groups}") + print(f" Val groups: {n_val_groups}") + print(f" Test groups: {n_test_groups}") + print(f"{'='*70}\n") + + # Prepare grouped splits + (X_train, X_val, X_test, y_train, y_val, y_test, group_assignments, + groups_train, groups_val, groups_test) = \ + loader.prepare_grouped_splits(n_train_groups, n_val_groups, n_test_groups) + + # Run training with hyperparameter tuning + results = trainer.run_with_tuning( + cfg, X_train, X_val, X_test, y_train, y_val, y_test, group_assignments, + groups_train, groups_val + ) + + # Evaluate on both validation and test sets + # For tuning mode, we want to see performance on both val and test + from ecosci.evaluation import evaluate_tuning_results + summary = evaluate_tuning_results( + results, + y_val, + y_test, + output_dir=output_dir, + problem_type=problem_type, + label_names=loader.labels if hasattr(loader, 'labels') else None, + feature_names=loader.processed_feature_names if hasattr(loader, 'processed_feature_names') else None, + X_val=X_val, + X_test=X_test, + ) + +elif cv_group_column is not None: + # K-fold cross-validation mode (no tuning) + print(f"Running k-fold cross-validation using group column: {cv_group_column}") + fold_data_list = loader.prepare_cv_folds() + results = trainer.run_cv(cfg, fold_data_list) + + # Evaluate with CV-specific reporting + summary = evaluate_and_report_cv( + results, + output_dir=output_dir, + problem_type=problem_type, + label_names=loader.labels if hasattr(loader, 'labels') else None, + feature_names=loader.processed_feature_names if hasattr(loader, 'processed_feature_names') else None, + ) +else: + # Regular train/test split (no tuning, no CV) + X_train, X_val, X_test, y_train, y_val, y_test = loader.prepare() + results = trainer.run(cfg, X_train, X_val, X_test, y_train, y_val, y_test) + + # Evaluate + summary = evaluate_and_report( + results, + y_test, + output_dir=output_dir, + problem_type=problem_type, + label_names=loader.labels if 
hasattr(loader, 'labels') else None, + feature_names=loader.processed_feature_names if hasattr(loader, 'processed_feature_names') else None, + X_test=X_test, + ) # Print quick summary -if problem_type == "regression": - r2s = [r.get("r2") for r in summary if "r2" in r] - if r2s: - print(f"\nMean R²: {np.mean(r2s):.3f}") - rmses = [r.get("rmse") for r in summary if "rmse" in r] - if rmses: - print(f"Mean RMSE: {np.mean(rmses):.3f}") +if tuning_enabled and cv_group_column is not None: + # For tuning mode + print(f"\n{'='*70}") + print(f"Hyperparameter Tuning Complete!") + print(f"{'='*70}") + print(f"Results saved to: {output_dir}/") + print(f" - Best hyperparameters per seed") + print(f" - Validation and test set predictions") + print(f" - Model checkpoints") + print(f"{'='*70}\n") +elif cv_group_column is not None: + # For CV, summary is a dict of DataFrames + print(f"\nDone. See {output_dir}/ for full CV reports and plots.") else: - accs = [r.get("accuracy") for r in summary if "accuracy" in r] - if accs: - print(f"\nMean accuracy: {np.mean(accs):.3f}") -print(f"Done. See {output_dir}/ for full report and plots.") + # For regular split, summary is a list of dicts + if problem_type == "regression": + r2s = [r.get("r2") for r in summary if "r2" in r] + if r2s: + print(f"\nMean R²: {np.mean(r2s):.3f}") + rmses = [r.get("rmse") for r in summary if "rmse" in r] + if rmses: + print(f"Mean RMSE: {np.mean(rmses):.3f}") + else: + accs = [r.get("accuracy") for r in summary if "accuracy" in r] + if accs: + print(f"\nMean accuracy: {np.mean(accs):.3f}") + print(f"Done. See {output_dir}/ for full report and plots.") diff --git a/run_tests.py b/run_tests.py index 5b2f7e2..64aac0d 100644 --- a/run_tests.py +++ b/run_tests.py @@ -19,6 +19,7 @@ "trainer": "tests/test_trainer.py", "eval": "tests/test_eval.py", "config": "tests/test_config.py", + "hyperopt": "tests/test_hyperopt.py", "integration": "tests/test_integration.py", } diff --git a/tests/test_data.py b/tests/test_data.py index 3efe992..8facab5 100644 --- a/tests/test_data.py +++ b/tests/test_data.py @@ -80,3 +80,77 @@ def test_random_state_makes_splits_reproducible(sample_csv): np.testing.assert_array_equal(X_train1, X_train2) np.testing.assert_array_equal(y_train1, y_train2) + + +@pytest.fixture +def multi_label_csv(): + """Create a temporary CSV with multiple labels.""" + data = pd.DataFrame( + { + "num1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0] * 5, + "num2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] * 5, + "label1": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1] * 5, + "label2": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 5, + } + ) + with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: + data.to_csv(f, index=False) + return f.name + + +def test_multi_label_loader_creates_splits(multi_label_csv): + """Test that data loader handles multiple labels.""" + loader = CSVDataLoader( + path=multi_label_csv, + features=["num1", "num2"], + labels=["label1", "label2"], + test_size=0.2, + val_size=0.2, + random_state=42, + ) + X_train, X_val, X_test, y_train, y_val, y_test = loader.prepare() + + # Verify we got all the data split properly + total_samples = len(y_train) + len(y_val) + len(y_test) + assert total_samples == 50 + + # Check that y has 2 outputs + assert y_train.shape[1] == 2 + assert y_test.shape[1] == 2 + assert y_val.shape[1] == 2 + + +def test_single_label_via_labels_param(sample_csv): + """Test that passing single label via labels param works.""" + loader = CSVDataLoader( + path=sample_csv, + features=["num1", "num2"], + 
labels=["label"], # Single label as list + test_size=0.2, + val_size=0.2, + random_state=42, + ) + X_train, X_val, X_test, y_train, y_val, y_test = loader.prepare() + + # Should work like single label (1D array) + assert len(y_train.shape) == 1 or (len(y_train.shape) == 2 and y_train.shape[1] == 1) + assert len(X_train) == len(y_train) + + +def test_multi_label_regression(multi_label_csv): + """Test multi-label with regression problem type.""" + loader = CSVDataLoader( + path=multi_label_csv, + features=["num1", "num2"], + labels=["label1", "label2"], + test_size=0.2, + val_size=0.2, + random_state=42, + problem_type="regression", + ) + X_train, X_val, X_test, y_train, y_val, y_test = loader.prepare() + + # Verify shape + assert y_train.shape[1] == 2 + assert not np.isnan(X_train).any() + assert not np.isnan(y_train).any() diff --git a/tests/test_eval.py b/tests/test_eval.py index 066d9e9..8cdc67c 100644 --- a/tests/test_eval.py +++ b/tests/test_eval.py @@ -7,7 +7,7 @@ import json import os -from ecosci.eval import compute_classification_metrics, evaluate_and_report +from ecosci.evaluation import compute_classification_metrics, evaluate_and_report @pytest.fixture @@ -118,7 +118,7 @@ def test_evaluate_and_report_multiple_seeds_computes_stats(temp_output_dir): # Regression tests def test_regression_metrics_computed_correctly(): """Test that regression metrics are computed correctly.""" - from ecosci.eval import compute_regression_metrics + from ecosci.evaluation import compute_regression_metrics y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) y_pred = np.array([1.1, 2.2, 2.9, 4.1, 4.8]) @@ -142,7 +142,7 @@ def test_regression_metrics_computed_correctly(): def test_regression_metrics_perfect_prediction(): """Test regression metrics for perfect predictions.""" - from ecosci.eval import compute_regression_metrics + from ecosci.evaluation import compute_regression_metrics y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) y_pred = y_true.copy() @@ -158,7 +158,7 @@ def test_regression_metrics_perfect_prediction(): def test_regression_metrics_with_zeros(): """Test that MAPE is None when y_true contains zeros.""" - from ecosci.eval import compute_regression_metrics + from ecosci.evaluation import compute_regression_metrics y_true = np.array([0.0, 1.0, 2.0, 3.0]) y_pred = np.array([0.1, 1.1, 2.1, 3.1]) @@ -212,3 +212,87 @@ def test_evaluate_and_report_regression_multiple_seeds(temp_output_dir): assert all("r2" in m for m in summary) # Report is now saved in model-specific subfolder assert os.path.exists(os.path.join(temp_output_dir, "model", "report_model.json")) + + +# Multi-output tests +def test_multi_output_classification_metrics(): + """Test multi-output classification metrics computation.""" + from ecosci.evaluation import compute_multi_output_classification_metrics + + # Multi-output: predicting 2 outputs + y_true = np.array([[0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]) + y_pred = np.array([[0, 1], [1, 0], [0, 0], [1, 1], [0, 1], [1, 0]]) + + metrics = compute_multi_output_classification_metrics(y_true, y_pred) + + assert "accuracy_mean" in metrics + assert "balanced_accuracy_mean" in metrics + assert "f1_mean" in metrics + assert "n_outputs" in metrics + assert metrics["n_outputs"] == 2 + assert "per_output" in metrics + assert len(metrics["per_output"]) == 2 + + +def test_multi_output_regression_metrics(): + """Test multi-output regression metrics computation.""" + from ecosci.evaluation import compute_multi_output_regression_metrics + + # Multi-output: predicting 3 outputs + y_true = 
np.array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0], [4.0, 5.0, 6.0]]) + y_pred = np.array([[1.1, 2.1, 3.1], [2.1, 3.1, 4.1], [2.9, 3.9, 4.9], [4.1, 5.1, 6.1]]) + + metrics = compute_multi_output_regression_metrics(y_true, y_pred) + + assert "mse_mean" in metrics + assert "rmse_mean" in metrics + assert "mae_mean" in metrics + assert "r2_mean" in metrics + assert "n_outputs" in metrics + assert metrics["n_outputs"] == 3 + assert "per_output" in metrics + assert len(metrics["per_output"]) == 3 + + +def test_evaluate_and_report_multi_output_classification(temp_output_dir): + """Test evaluation for multi-output classification.""" + y_true = np.array([[0, 1], [1, 0], [0, 1], [1, 0]] * 10) + + results = [ + { + "seed": 42, + "y_pred": np.array([[0, 1], [1, 0], [0, 0], [1, 1]] * 10), + "y_proba": None, + "model_name": "model", + } + ] + + summary = evaluate_and_report(results, y_true, temp_output_dir, problem_type="classification") + + assert isinstance(summary, list) + assert len(summary) > 0 + assert "accuracy_mean" in summary[0] + assert "n_outputs" in summary[0] + + +def test_evaluate_and_report_multi_output_regression(temp_output_dir): + """Test evaluation for multi-output regression.""" + np.random.seed(42) + y_true = np.random.randn(30, 3) # 3 outputs + + results = [ + { + "seed": 42, + "y_pred": y_true + np.random.randn(30, 3) * 0.1, + "y_proba": None, + "model_name": "model", + } + ] + + summary = evaluate_and_report(results, y_true, temp_output_dir, problem_type="regression") + + assert isinstance(summary, list) + assert len(summary) > 0 + assert "mse_mean" in summary[0] + assert "n_outputs" in summary[0] + assert summary[0]["n_outputs"] == 3 diff --git a/tests/test_hyperopt.py b/tests/test_hyperopt.py new file mode 100755 index 0000000..52e9e51 --- /dev/null +++ b/tests/test_hyperopt.py @@ -0,0 +1,235 @@ +#!/usr/bin/env python3 +"""Tests for hyperparameter tuning functionality. 
+ +This module tests the hyperparameter optimisation features including: +- Grouped train/validation/test splits +- Hyperparameter tuning for different model types (MLP, Random Forest) +- Full integration of tuning with the training pipeline +""" + +import numpy as np +import pandas as pd +import tempfile +import os +import sys + +from ecosci.data import CSVDataLoader + +def create_test_data(): + """Create a small synthetic dataset with groups for testing.""" + np.random.seed(42) + + n_samples = 200 + n_features = 5 + n_groups = 8 + + # Create features + X = np.random.randn(n_samples, n_features) + + # Create groups (e.g., patches) + groups = np.repeat(range(n_groups), n_samples // n_groups) + + # Create target with some signal + y = X[:, 0] * 2 + X[:, 1] - X[:, 2] * 0.5 + np.random.randn(n_samples) * 0.5 + + # Create DataFrame + df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(n_features)]) + df['target'] = y + df['group_id'] = groups + + return df + +def test_grouped_splits(): + """Test grouped train/val/test splits.""" + # Create test data + df = create_test_data() + + # Save to temporary file + with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f: + df.to_csv(f.name, index=False) + temp_file = f.name + + try: + # Create loader + loader = CSVDataLoader( + path=temp_file, + features=[f'feature_{i}' for i in range(5)], + labels=['target'], + cv_group_column='group_id', + problem_type='regression', + ) + + # Test grouped splits + (X_train, X_val, X_test, _, _, _, group_assignments, _, _, _) = \ + loader.prepare_grouped_splits(n_train_groups=4, n_val_groups=2, n_test_groups=2) + + # Verify no overlap + assert X_train.shape[0] > 0, "Train set is empty" + assert X_val.shape[0] > 0, "Val set is empty" + assert X_test.shape[0] > 0, "Test set is empty" + assert len(group_assignments) == 3, "Should have 3 group assignments (train, val, test)" + + finally: + os.unlink(temp_file) + +def test_hyperparameter_tuner_mlp(): + """Test hyperparameter tuner with MLP.""" + from ecosci.hyperopt import HyperparameterTuner + + # Create simple test data + np.random.seed(42) + X_train = np.random.randn(100, 5) + y_train = X_train[:, 0] * 2 + np.random.randn(100) * 0.5 + + # Test MLP tuning + tuner = HyperparameterTuner( + problem_type='regression', + n_iter=3, # Small for testing + cv=2, # Small for testing + verbose=0, + ) + + param_space = { + 'hidden_layer_sizes': [(16,), (32,)], + 'alpha': [0.001, 0.01], + 'max_iter': [100], + } + + best_model, results = tuner.tune_mlp(X_train, y_train, param_space) + + assert best_model is not None, "Best model should not be None" + assert 'best_params' in results, "Results should contain best_params" + assert 'best_score' in results, "Results should contain best_score" + + +def test_hyperparameter_tuner_random_forest(): + """Test hyperparameter tuner with Random Forest.""" + from ecosci.hyperopt import HyperparameterTuner + + # Create simple test data + np.random.seed(42) + X_train = np.random.randn(100, 5) + y_train = X_train[:, 0] * 2 + np.random.randn(100) * 0.5 + + # Test Random Forest tuning + tuner = HyperparameterTuner( + problem_type='regression', + n_iter=3, # Small for testing + cv=2, # Small for testing + verbose=0, + ) + + param_space = { + 'n_estimators': [10, 20], + 'max_depth': [3, 5], + } + + best_model, results = tuner.tune_random_forest(X_train, y_train, param_space) + + assert best_model is not None, "Best model should not be None" + assert 'best_params' in results, "Results should contain best_params" + assert 
'best_score' in results, "Results should contain best_score" +def test_tuning_integration(): + """Test full integration with config and hyperparameter tuning.""" + # Create test data + df = create_test_data() + + # Save to temporary file + with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f: + df.to_csv(f.name, index=False) + temp_file = f.name + + # Create temporary output directory + temp_output_dir = tempfile.mkdtemp() + + try: + # Create a minimal config + config = { + 'problem_type': 'regression', + 'data': { + 'path': temp_file, + 'features': [f'feature_{i}' for i in range(5)], + 'labels': ['target'], + 'cv_group_column': 'group_id', + 'n_train_groups': 4, + 'n_val_groups': 2, + 'n_test_groups': 2, + 'scaling': 'standard', + }, + 'tuning': { + 'enabled': True, + 'search_method': 'random', + 'n_iter': 2, # Very small for testing + 'cv_folds': 2, + 'verbose': 0, + }, + 'models': [ + { + 'name': 'linear', + 'param_space': { + 'alpha': [0.1, 1.0], + } + } + ], + 'training': { + 'repetitions': 2, # Small for testing + 'random_seed': 42, + }, + 'output': { + 'dir': temp_output_dir, + } + } + + from ecosci.data import CSVDataLoader + from ecosci.trainer import Trainer + from ecosci.models import ModelZoo + + # Load data - extract n_*_groups before passing to loader + n_train_groups = config['data'].pop('n_train_groups') + n_val_groups = config['data'].pop('n_val_groups') + n_test_groups = config['data'].pop('n_test_groups') + + loader = CSVDataLoader(**config['data'], problem_type=config['problem_type']) + (X_train, X_val, X_test, y_train, y_val, y_test, group_assignments, + groups_train, groups_val, groups_test) = \ + loader.prepare_grouped_splits( + n_train_groups=n_train_groups, + n_val_groups=n_val_groups, + n_test_groups=n_test_groups + ) + + # Create trainer + trainer = Trainer( + ModelZoo.get_model, + problem_type=config['problem_type'], + output_dir=config['output']['dir'], + ) + + # Run training with tuning + results = trainer.run_with_tuning( + config, X_train, X_val, X_test, y_train, y_val, y_test, group_assignments + ) + + # Check outputs + assert len(results) > 0, "No models were trained" + + for model_name in results.keys(): + model_dir = os.path.join(temp_output_dir, model_name) + assert os.path.exists(model_dir), f"Model directory not found: {model_dir}" + + # Check for key files + files = os.listdir(model_dir) + assert any('model_' in f for f in files), "No model files found" + assert any('tuning_results_' in f for f in files), "No tuning results found" + + finally: + os.unlink(temp_file) + # Cleanup temp output dir + import shutil + shutil.rmtree(temp_output_dir, ignore_errors=True) + + +if __name__ == "__main__": + # This file can now be run with pytest + import pytest + sys.exit(pytest.main([__file__, "-v"])) \ No newline at end of file diff --git a/tests/test_integration.py b/tests/test_integration.py index 1e297c0..edb1e51 100644 --- a/tests/test_integration.py +++ b/tests/test_integration.py @@ -12,7 +12,7 @@ from ecosci.data import CSVDataLoader from ecosci.models import ModelZoo from ecosci.trainer import Trainer -from ecosci.eval import evaluate_and_report +from ecosci.evaluation import evaluate_and_report @pytest.fixture @@ -315,3 +315,132 @@ def test_regression_pipeline_svm(regression_csv, temp_output_dir): assert len(results["svm"]) == 1 assert "y_pred" in results["svm"][0] assert len(results["svm"][0]["y_pred"]) == len(y_test) + + +# Multi-output integration tests +@pytest.fixture +def multi_output_csv(): + """Create a temporary CSV 
with multiple labels.""" + np.random.seed(42) + n = 100 + data = { + "num1": np.random.randn(n), + "num2": np.random.randn(n) + 1, + "label1": np.random.choice([0, 1], n), + "label2": np.random.choice([0, 1, 2], n), + } + df = pd.DataFrame(data) + with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: + df.to_csv(f, index=False) + return f.name + + +def test_multi_output_classification_pipeline(multi_output_csv, temp_output_dir): + """Test complete pipeline for multi-output classification.""" + loader = CSVDataLoader( + path=multi_output_csv, + features=["num1", "num2"], + labels=["label1", "label2"], + test_size=0.2, + val_size=0.2, + random_state=42, + problem_type="classification", + ) + X_train, X_val, X_test, y_train, y_val, y_test = loader.prepare() + + # Verify multi-output shape + assert y_train.shape[1] == 2 + assert y_test.shape[1] == 2 + + cfg = { + "problem_type": "classification", + "models": [ + {"name": "random_forest", "params": {"n_estimators": 50, "random_state": 42}}, + {"name": "logistic", "params": {"max_iter": 500, "random_state": 42}}, + ], + "training": {"repetitions": 2, "random_seed": 0}, + "output_dir": temp_output_dir, + } + + trainer = Trainer( + ModelZoo.get_model, problem_type="classification", output_dir=temp_output_dir + ) + results = trainer.run(cfg, X_train, X_val, X_test, y_train, y_val, y_test) + + # Verify predictions have correct shape + assert results["random_forest"][0]["y_pred"].shape == y_test.shape + assert results["logistic"][0]["y_pred"].shape == y_test.shape + + summary = evaluate_and_report( + results, y_test, temp_output_dir, problem_type="classification" + ) + + # Check that we got multi-output metrics + assert "accuracy_mean" in summary[0] + assert "n_outputs" in summary[0] + assert summary[0]["n_outputs"] == 2 + + +def test_multi_output_regression_pipeline(temp_output_dir): + """Test complete pipeline for multi-output regression.""" + np.random.seed(42) + n = 100 + data = { + "num1": np.random.randn(n), + "num2": np.random.randn(n) + 1, + "target1": np.random.randn(n) * 2, + "target2": np.random.randn(n) * 3 + 5, + } + df = pd.DataFrame(data) + + with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: + df.to_csv(f, index=False) + csv_path = f.name + + loader = CSVDataLoader( + path=csv_path, + features=["num1", "num2"], + labels=["target1", "target2"], + test_size=0.2, + val_size=0.2, + random_state=42, + problem_type="regression", + ) + X_train, X_val, X_test, y_train, y_val, y_test = loader.prepare() + + # Verify multi-output shape + assert y_train.shape[1] == 2 + assert y_test.shape[1] == 2 + + cfg = { + "problem_type": "regression", + "models": [ + {"name": "random_forest", "params": {"n_estimators": 50, "random_state": 42}}, + {"name": "mlp", "params": {"hidden_layer_sizes": [16], "max_iter": 200, "random_state": 42}}, + ], + "training": {"repetitions": 2, "random_seed": 0}, + "output_dir": temp_output_dir, + } + + trainer = Trainer( + ModelZoo.get_model, problem_type="regression", output_dir=temp_output_dir + ) + + import warnings + from sklearn.exceptions import ConvergenceWarning + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=ConvergenceWarning) + results = trainer.run(cfg, X_train, X_val, X_test, y_train, y_val, y_test) + + # Verify predictions have correct shape + assert results["random_forest"][0]["y_pred"].shape == y_test.shape + assert results["mlp"][0]["y_pred"].shape == y_test.shape + + summary = evaluate_and_report( + results, y_test, 
temp_output_dir, problem_type="regression" + ) + + # Check that we got multi-output metrics + assert "mse_mean" in summary[0] + assert "n_outputs" in summary[0] + assert summary[0]["n_outputs"] == 2 diff --git a/tests/test_models.py b/tests/test_models.py index 794e2ed..4e6d621 100644 --- a/tests/test_models.py +++ b/tests/test_models.py @@ -141,3 +141,60 @@ def test_svm_regression(): params = model.get_params() assert params["C"] == 1.0 assert params["kernel"] == "rbf" + + +# Multi-output tests +def test_multi_output_classification_wrapping(): + """Test that models are wrapped for multi-output classification.""" + from sklearn.multioutput import MultiOutputClassifier + + # Logistic regression should be wrapped + model = ModelZoo.get_model( + "logistic", problem_type="classification", params={"random_state": 42}, n_outputs=2 + ) + assert isinstance(model, MultiOutputClassifier) + + # Random Forest should NOT be wrapped (native support) + model_rf = ModelZoo.get_model( + "random_forest", problem_type="classification", params={"random_state": 42}, n_outputs=2 + ) + assert not isinstance(model_rf, MultiOutputClassifier) + + +def test_multi_output_regression_wrapping(): + """Test that models are wrapped for multi-output regression.""" + from sklearn.multioutput import MultiOutputRegressor + + # Linear regression should be wrapped + model = ModelZoo.get_model( + "linear", problem_type="regression", params={}, n_outputs=2 + ) + assert isinstance(model, MultiOutputRegressor) + + # Random Forest should NOT be wrapped (native support) + model_rf = ModelZoo.get_model( + "random_forest", problem_type="regression", params={"random_state": 42}, n_outputs=2 + ) + assert not isinstance(model_rf, MultiOutputRegressor) + + # MLP Regressor should NOT be wrapped (native support) + model_mlp = ModelZoo.get_model( + "mlp", problem_type="regression", params={"random_state": 42}, n_outputs=2 + ) + assert not isinstance(model_mlp, MultiOutputRegressor) + + +def test_single_output_not_wrapped(): + """Test that single output models are not wrapped.""" + from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor + + # Single output should not be wrapped + model_cls = ModelZoo.get_model( + "logistic", problem_type="classification", params={"random_state": 42}, n_outputs=1 + ) + assert not isinstance(model_cls, MultiOutputClassifier) + + model_reg = ModelZoo.get_model( + "linear", problem_type="regression", params={}, n_outputs=1 + ) + assert not isinstance(model_reg, MultiOutputRegressor)