[OpenSTEF 4.0] Periodic hyperparameter retuning in the backtesting baseline #852

@egordm

Description

As a benchmark operator
I want to configure the openstef4 backtesting baseline to retune hyperparameters on a schedule (e.g. every 7 days of backtest time)
So that benchmark results reflect realistic periodic model maintenance and capture the benefit of adaptive tuning over a static config

❗Priority (What if we don't do this?/Are there any deadlines? etc.)

The backtest currently uses a single fixed hyperparameter config for the entire run, whereas real production deployments retune periodically. This skews benchmark results optimistically for models whose static config happens to suit the whole period, and makes it hard to evaluate whether periodic retuning actually improves forecast quality.

Definition of Done:

✅ Acceptance criteria

  • TuningSchedule config class added to openstef-models (alongside HyperparameterTuner); controls retune_every, n_trials, metric_name
  • OpenSTEF4BacktestForecaster accepts an optional tuning_schedule field; when set, runs HyperparameterTuner before each fit() that falls on the schedule
  • Works for both single-model and ensemble configs: for ensembles, each base-model member is tuned independently and the ensemble config is reassembled with tuned sub-configs
  • Tuned config is cached between retune windows (does not retune on every fit)
  • TuningSchedule is optional — existing users who don't set it see zero behaviour change
  • openstef-beam[baselines,tuning] extras cover all new dependencies (optuna stays optional)
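To make the first two criteria concrete, here is an illustrative sketch of `TuningSchedule`. The design proposes a Pydantic `BaseModel`, but a stdlib dataclass is used here to keep the example dependency-free; the default values and the `is_due` semantics (retune on the first fit, then whenever the interval has elapsed) are assumptions, not the final API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Literal, Optional


@dataclass(frozen=True)
class TuningSchedule:
    """Controls how often the backtest retunes hyperparameters (sketch)."""

    retune_every: timedelta = timedelta(days=7)  # backtest-time retune interval
    n_trials: int = 50                           # optuna trial budget per retune
    metric_name: str = "rmse"                    # objective metric
    direction: Literal["minimize", "maximize"] = "minimize"

    def is_due(self, horizon: datetime, last_tuned_at: Optional[datetime]) -> bool:
        # Retune on the very first fit, then whenever retune_every has elapsed.
        if last_tuned_at is None:
            return True
        return horizon - last_tuned_at >= self.retune_every
```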

📄 Documentation criteria:

  • Update the openstef4 baseline docstring / README section
  • Add example usage to the benchmarking tutorial or a new tutorial showing backtesting + tuning

🧪 Test criteria:

  • Unit test: TuningSchedule.is_due() with various retune_every values
  • Unit test: OpenSTEF4BacktestForecaster.fit() with tuning schedule — verify tuner is called only when due, carries forward tuned config between retune windows
  • Unit test: ensemble path — verify each member is tuned separately

⌛ Dependencies:

🚀 Releasing:

Part of OpenSTEF 4.0; no separate release needed

Other information:

🌍 Background

Proposed design (from design discussion):

  • TuningSchedule is a Pydantic BaseModel (or frozen base-class-based design if generics become unwieldy). It owns:
    • retune_every: timedelta — how often to retune
    • n_trials: int — optuna trial budget per retune
    • metric_name: str — objective metric
    • direction: Literal["minimize", "maximize"] — objective direction
    • is_due(horizon, last_tuned_at) and mark_done(horizon) behaviour (via private state or passed-in state)
  • In OpenSTEF4BacktestForecaster.fit(), before workflow.fit():
    1. Check tuning_schedule.is_due(data.horizon)
    2. If yes → run HyperparameterTuner(config=current_config, train_dataset=training_data, ...)
    3. Cache result.best_config as _tuned_config
    4. Build the workflow from _tuned_config (instead of workflow_template.config)
  • For ensembles: the baseline iterates ensemble_config.members, tunes each member.config separately, reassembles
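The four `fit()`-time steps above can be sketched as a small self-contained toy. `BacktestTuningState`, `_run_tuner`, and the dict config are illustrative stand-ins (the real code would call `HyperparameterTuner` and read `result.best_config`); the point shown is the caching behaviour — the tuner runs only when due, and the tuned config carries forward between retune windows.

```python
from datetime import datetime, timedelta


class BacktestTuningState:
    """Toy stand-in for the tuning-aware fit() flow in the baseline."""

    def __init__(self, base_config: dict, retune_every: timedelta):
        self.base_config = base_config
        self.retune_every = retune_every
        self._tuned_config = None
        self._last_tuned_at = None
        self.tune_calls = 0  # instrumentation for this sketch only

    def _run_tuner(self, config: dict, horizon: datetime) -> dict:
        # Stand-in for HyperparameterTuner(...): the real tuner runs optuna
        # trials and returns result.best_config.
        self.tune_calls += 1
        return {**config, "tuned_at": horizon.isoformat()}

    def fit(self, horizon: datetime) -> dict:
        # Step 1: check whether a retune is due at this backtest horizon.
        due = self._last_tuned_at is None or (
            horizon - self._last_tuned_at >= self.retune_every
        )
        if due:
            # Steps 2-3: tune starting from the current (possibly already
            # tuned) config, and cache the result.
            current = self._tuned_config or self.base_config
            self._tuned_config = self._run_tuner(current, horizon)
            self._last_tuned_at = horizon
        # Step 4: build the workflow from the cached tuned config when
        # available, falling back to the template config otherwise.
        return self._tuned_config or self.base_config
```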

The split/tune/reassemble logic for ensembles belongs in the baseline, not in TuningSchedule, keeping each layer scoped to what it knows.
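A minimal sketch of that split/tune/reassemble step, under assumed config shapes: `MemberConfig`, `EnsembleConfig`, and `tune_member` are hypothetical stand-ins (the real baseline would iterate `ensemble_config.members` and run `HyperparameterTuner` per member), but the shape of the logic is what the design describes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemberConfig:
    name: str
    learning_rate: float


@dataclass(frozen=True)
class EnsembleConfig:
    members: tuple[MemberConfig, ...]


def tune_member(member: MemberConfig) -> MemberConfig:
    # Stand-in for a per-member HyperparameterTuner run; here it just
    # perturbs one field so the reassembly is observable.
    return replace(member, learning_rate=member.learning_rate / 2)


def retune_ensemble(config: EnsembleConfig) -> EnsembleConfig:
    # Split, tune each member independently, reassemble the ensemble config.
    return EnsembleConfig(members=tuple(tune_member(m) for m in config.members))
```

Keeping this loop in the baseline (rather than in `TuningSchedule`) means the schedule only answers "is a retune due?", while the baseline decides what a retune means for its particular config shape.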
