As a benchmark operator
I want to configure the openstef4 backtesting baseline to retune hyperparameters on a schedule (e.g. every 7 days of backtest time)
So that benchmark results reflect realistic periodic model maintenance and capture the benefit of adaptive tuning over a static config
❗Priority (What if we don't do this?/Are there any deadlines? etc.)
Currently the backtest always uses a fixed hyperparameter config. Real production deployments periodically retune. This skews benchmark results optimistically for tuned models and makes it hard to evaluate whether periodic retuning actually improves forecast quality.
Definition of Done:
✅ Acceptance criteria
- TuningSchedule config class added to openstef-models (alongside HyperparameterTuner); controls retune_every, n_trials, metric_name
- OpenSTEF4BacktestForecaster accepts an optional tuning_schedule field; when set, runs HyperparameterTuner before each fit() that falls on the schedule
- TuningSchedule is optional — existing users who don't set it see zero behaviour change
- openstef-beam[baselines,tuning] extras cover all new dependencies (optuna stays optional)
📄 Documentation criteria:
- Update the openstef4 baseline docstring / README section
- Add example usage to the benchmarking tutorial or a new tutorial showing backtesting + tuning
🧪 Test criteria:
- Unit test: TuningSchedule.is_due() with various retune_every values
- Unit test: OpenSTEF4BacktestForecaster.fit() with tuning schedule — verify the tuner is called only when due and that the tuned config is carried forward between retune windows
- Unit test: ensemble path — verify each member is tuned separately
⌛ Dependencies:
- openstef-models (HyperparameterTuner lives there)
🚀 Releasing:
Part of OpenSTEF 4.0; no separate release needed
Other information:
🌍 Background
Proposed design (from design discussion):
TuningSchedule is a Pydantic BaseModel (or frozen base-class-based design if generics become unwieldy). It owns:
- retune_every: timedelta — how often to retune
- n_trials: int — optuna trial budget per retune
- metric_name: str — objective metric
- direction: Literal["minimize", "maximize"] — objective direction
- is_due(horizon, last_tuned_at) and mark_done(horizon) behaviour (via private state or passed-in state)
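A minimal sketch of what this model could look like, assuming Pydantic v2. The defaults (7-day interval, 50 trials, "mae") are illustrative placeholders, not decided values, and mark_done is folded into the passed-in-state variant by taking last_tuned_at as an argument:

```python
# Hypothetical TuningSchedule sketch; this class does not exist yet in
# openstef-models, and all defaults below are placeholders.
from datetime import datetime, timedelta
from typing import Literal, Optional

from pydantic import BaseModel


class TuningSchedule(BaseModel):
    """Controls how often the backtest baseline retunes hyperparameters."""

    retune_every: timedelta = timedelta(days=7)  # how often to retune
    n_trials: int = 50                           # optuna trial budget per retune
    metric_name: str = "mae"                     # objective metric
    direction: Literal["minimize", "maximize"] = "minimize"

    def is_due(self, horizon: datetime, last_tuned_at: Optional[datetime]) -> bool:
        # Always tune on the first fit; afterwards only when the interval elapsed.
        if last_tuned_at is None:
            return True
        return horizon - last_tuned_at >= self.retune_every
```

Keeping the state (last_tuned_at) outside the model keeps the schedule itself immutable, which fits the frozen-design fallback mentioned above.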
- In OpenSTEF4BacktestForecaster.fit(), before workflow.fit():
  - Check tuning_schedule.is_due(data.horizon)
  - If yes → run HyperparameterTuner(config=current_config, train_dataset=training_data, ...)
  - Cache result.best_config as _tuned_config
  - Build the workflow from _tuned_config (instead of workflow_template.config)
- For ensembles: the baseline iterates ensemble_config.members, tunes each member.config separately, and reassembles the ensemble
The split/tune/reassemble logic for ensembles belongs in the baseline, not in TuningSchedule, keeping each layer scoped to what it knows.
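The per-fit() control flow above can be sketched as a plain function. The tuner is stubbed as a callable because HyperparameterTuner's exact signature is not fixed here; the point is the due-check, the cache, and the carry-forward between retune windows:

```python
# Hedged sketch of the fit()-time tuning flow; names are illustrative.
from datetime import datetime, timedelta
from typing import Callable, Optional


def fit_with_schedule(
    horizon: datetime,
    base_config: dict,
    tuned_config: Optional[dict],
    last_tuned_at: Optional[datetime],
    retune_every: timedelta,
    run_tuner: Callable[[dict], dict],
) -> tuple[dict, Optional[datetime], Optional[dict]]:
    """Return (config to build the workflow from, last_tuned_at, tuned cache)."""
    due = last_tuned_at is None or horizon - last_tuned_at >= retune_every
    if due:
        # Stand-in for HyperparameterTuner(...) → result.best_config
        tuned_config = run_tuner(base_config)
        last_tuned_at = horizon  # mark_done
    # Between retune windows, the previously tuned config is carried forward;
    # before the first tune, the template config is used as-is.
    return tuned_config or base_config, last_tuned_at, tuned_config
```

For the ensemble path, the baseline would call this once per ensemble_config.members entry with that member's config, then reassemble the ensemble from the tuned member configs.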