Draft
398 commits
6ea07e7
restore masking_strategy to random
shmh40 Nov 28, 2025
4281aff
restore loader_num_workers to 8
shmh40 Nov 28, 2025
950e5b4
set loader_num_workers to 8
Jubeku Nov 28, 2025
15b46e9
fix indentation of else: assert False in _get_sample msds
shmh40 Nov 28, 2025
76270aa
[1269] Noise generation in diffusion inference (#1374)
moritzhauschulz Nov 28, 2025
6fe8561
Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into shmh…
clessig Nov 28, 2025
b662bf2
Made pre-trained encoder weights optional
MatKbauer Nov 28, 2025
3b55ef5
Update validation to new data structure
MatKbauer Dec 2, 2025
dc736e5
merge with dev
tjhunter Dec 2, 2025
2b2c977
linter warnings
tjhunter Dec 2, 2025
c8a2aad
commenting tests
tjhunter Dec 2, 2025
2599ec2
Restructured code so that mask generation and application is cleanly …
clessig Dec 2, 2025
c8a26d7
Commit
clessig Dec 2, 2025
23e0267
Update
clessig Dec 2, 2025
33d9d8d
Merge branch 'shmh40/dev/1270-idx-global-local' of github.com:ecmwf/W…
clessig Dec 2, 2025
9f5e49c
Fixed uv.lock
clessig Dec 2, 2025
3641e1f
Fix for integration test
clessig Dec 2, 2025
9a1a6a9
Re-enabled multi-source training
clessig Dec 3, 2025
402b8de
1390 - Adapt forward pass of new batch object (#1391)
Jubeku Dec 3, 2025
2cd3971
Completed migration to new batch class by removing reference to old l…
clessig Dec 3, 2025
51754fa
Fixed missing non_blocking=True in to_device()
clessig Dec 3, 2025
69b53a6
Removed old comments
clessig Dec 3, 2025
59510dd
Fixed problem with non_blocking=True
clessig Dec 3, 2025
b69b743
Cleaned up comments and return values a bit
clessig Dec 4, 2025
d36367a
Changed args to embedding
clessig Dec 4, 2025
3f52a8d
Changed core functions to take sample as arg
clessig Dec 4, 2025
9065219
Changed that model takes sample as input
clessig Dec 4, 2025
12bae15
Fixes for diffusion
clessig Dec 4, 2025
7745e47
Switched to lists of model / target stratgies
clessig Dec 4, 2025
bf17bfe
Updated config
clessig Dec 4, 2025
89f770e
Changed to per masking strategy loss terms
clessig Dec 5, 2025
a93fdb3
Removed old masking options. Still needs to be fully cleaned up
clessig Dec 5, 2025
454dffb
More robust handling of empty streams
clessig Dec 5, 2025
5cbbaa3
Fixed incorrect handling of empty target_coords_idx
clessig Dec 5, 2025
9c74741
Fixed problem when number of model and target samples is different
clessig Dec 5, 2025
085b55f
Example for config with non-trivial model and target inputs
clessig Dec 5, 2025
4dac76d
Fixed bug in total sample counting
clessig Dec 5, 2025
fe2f63a
Re-enabled missing healpix level
clessig Dec 5, 2025
b9195bb
Fixed incorrect handling of masking and student_teacher modes. Follow…
clessig Dec 6, 2025
43f9b01
An encoder formed by embedding + local assimilation + global assimila…
kctezcan Dec 6, 2025
4d27a95
Formatting
clessig Dec 6, 2025
9cf040e
Fix source-target matching problem.
clessig Dec 6, 2025
5fca790
Enabled multiple input steps. Fixed various robustness that arose thr…
clessig Dec 7, 2025
47e81fa
Linting
clessig Dec 7, 2025
e0f6cc4
Missing update to validation()
clessig Dec 9, 2025
8f097ec
Improved robustness through sanity checking of arguments
clessig Dec 9, 2025
6b64511
Improved handling of corner cases
clessig Dec 9, 2025
ed886e2
Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into shmh…
clessig Dec 9, 2025
303f48a
- Fixed incorrect call to get_forecast_steps() in validation
clessig Dec 9, 2025
9638de8
[NOT WORKING] Merged current data-branch. TargetAuxCalculator argumen…
MatKbauer Dec 9, 2025
7299106
More fixed to validation
clessig Dec 9, 2025
45189a4
Adding stream_id
clessig Dec 9, 2025
50b0a89
[NOT WORKING] Added modifications from data branch
MatKbauer Dec 10, 2025
5bed792
Cleaned up ModelOutput class to have proper access functions and a be…
clessig Dec 10, 2025
06f2e06
Switched to use dict to internally represent streams_datasets
clessig Dec 10, 2025
ad5a19c
Improving robustness of interface of ModelOutput class
clessig Dec 10, 2025
4f8abbb
Re-enabling model output
clessig Dec 10, 2025
d36716c
Ruff
clessig Dec 11, 2025
b8d95b2
Minor clean-ups and additional comments
clessig Dec 11, 2025
081d90a
Minor cleanups
clessig Dec 11, 2025
6b8fe83
Cleaned up handling of masks and masking metadata
clessig Dec 11, 2025
5a8ad49
Resolved bugs when updating data structure
MatKbauer Dec 11, 2025
eedaa8a
Updated to new data output structure
MatKbauer Dec 11, 2025
f768046
Linter
MatKbauer Dec 11, 2025
ca9e605
Current working version of default_config
clessig Dec 11, 2025
f8b1ca6
Fixed problem with branches with old code and incomplete cleanup
clessig Dec 11, 2025
003b0cf
Updated to test convergence of integration test.
clessig Dec 11, 2025
f38e6d2
Updated settings
clessig Dec 11, 2025
7e7ff8e
Clessig/ypd/dev/1353 add tokens latent state finalization (#1452)
clessig Dec 12, 2025
31a0b96
Ruffed
clessig Dec 12, 2025
4fe90d7
Adding sanity check for register tokens
clessig Dec 12, 2025
46bd7a2
Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into shmh…
clessig Dec 12, 2025
48dee1e
Update to latest data branch: latent_state dataclass
MatKbauer Dec 12, 2025
e2c09f2
Update to latest develop with new data structure
MatKbauer Dec 20, 2025
238e321
Merge branch 'develop' into mk/develop/1300_assemble_diffusion_model
Jubeku Jan 14, 2026
458e652
debug target_aux, loss_module, engines, etc
Jubeku Jan 14, 2026
61dce39
debug, diffusion_rn and batch.sample
Jubeku Jan 14, 2026
ea4d76c
Corrected latent token retrieval in loss calculation
MatKbauer Jan 15, 2026
b875734
working training loop on single sample
Jubeku Jan 15, 2026
c91d5c9
update config to fit forecast checkpoint
Jubeku Jan 15, 2026
3a8fead
Merge branch 'develop' into mk/develop/1300_assemble_diffusion_model
Jubeku Jan 19, 2026
d58032d
Merge branch 'develop' into mk/develop/1300_assemble_diffusion_model
Jubeku Jan 19, 2026
91d633b
reset default config
Jubeku Jan 19, 2026
bbdb3a1
modify default config for diffusion
Jubeku Jan 19, 2026
43b21c4
adding encoder loading to model interface
Jubeku Jan 19, 2026
52b6bb1
setting checkpoint to null temporarily
Jubeku Jan 20, 2026
0f7d4e5
rm activation checkpoint around diff forecast engine
Jubeku Jan 20, 2026
a51f706
[Diff] sbAsma/issue1279 noise conditioning (#1358)
sbAsma Jan 23, 2026
47566be
Correct forecast engine initialization
MatKbauer Jan 23, 2026
82a78f9
Merge branch 'develop' into 1300_assemble_diffusion_model_w_develop
moritzhauschulz Feb 8, 2026
3ce80f0
code runs...
moritzhauschulz Feb 8, 2026
a144867
remove some debugging code
moritzhauschulz Feb 18, 2026
e5cccbe
Merge branch 'develop' into mh/develop/1843_viz_denoised_image
moritzhauschulz Feb 18, 2026
63b3f78
adjusted diffusion config
moritzhauschulz Feb 18, 2026
83bb4c9
fixed inference
moritzhauschulz Feb 18, 2026
bb3bbe5
actually fiex inference (via config)
moritzhauschulz Feb 18, 2026
b5ee071
Plot maps during training at validation time
MatKbauer Feb 19, 2026
55b69c2
Intermediate state. Single sample overfitting works
MatKbauer Feb 20, 2026
a93b978
Intermediate multi-GPU error state
MatKbauer Feb 20, 2026
be6cb24
Successful single-sample overfitting on one GPU
MatKbauer Feb 20, 2026
2c63c7e
Minor config change
MatKbauer Feb 20, 2026
4414fe6
Adding missing reset() function for FSDP
clessig Feb 21, 2026
c917777
Linting
clessig Feb 21, 2026
4ae7c13
Linting
clessig Feb 21, 2026
268d34f
Linting
clessig Feb 21, 2026
6a487d9
Workding on FSDP
clessig Feb 21, 2026
351e8f9
Working on FSDP
clessig Feb 21, 2026
fbc7cd1
Linting
clessig Feb 21, 2026
873f7b3
minor config changes
moritzhauschulz Feb 23, 2026
7149866
Activating diffusion model
MatKbauer Feb 24, 2026
dbefecc
Merge branch 'mk/mh/1843_viz_denoised_image' of github.com:ecmwf/Weat…
MatKbauer Feb 24, 2026
fcf4bbc
Merge branch 'mk/mh/1843_viz_denoised_image' into mh/mk/diffusion-sin…
moritzhauschulz Feb 25, 2026
610334c
temp set MLP
moritzhauschulz Feb 25, 2026
c1cf8f5
Combined physical and latent loss experiments
MatKbauer Feb 27, 2026
506089c
Adjust diffusion config to 3 samples per GPU
MatKbauer Feb 27, 2026
ffe89c2
Pull plot_train from develop
MatKbauer Feb 27, 2026
f7a42f6
Latent size downscaling mlps for diffusion
MatKbauer Mar 2, 2026
1db5fe6
Improve support for latent losses
clessig Mar 2, 2026
b9feb92
Fix to support models trained on older code versions
MatKbauer Mar 2, 2026
49a886f
Enable plot train latent loss
MatKbauer Mar 2, 2026
9e205ab
Repair code and update to load more recent pre-trained model
MatKbauer Mar 3, 2026
31b93a0
Fixes that enable basic single sample overfitting (#2003)
moritzhauschulz Mar 9, 2026
b55ab9d
Merge branch 'develop' into mk/mh/diffusion-single-sample-rebase
Jubeku Mar 10, 2026
44c8816
Update diffusion to develop (#2022)
Jubeku Mar 10, 2026
6b0fbb0
add noise distribution plotting
Jubeku Mar 10, 2026
d3bd383
Merge branch 'mk/mh/diffusion-single-sample' into jk/mk/mh/diffusion-…
Jubeku Mar 10, 2026
1a04f33
plot noise distribution and decoded noised tokens
Jubeku Mar 11, 2026
3f731ad
fix noise level in validation to p_mean
Jubeku Mar 12, 2026
100b5c2
rm noise and token distribution plotting
Jubeku Mar 12, 2026
89d5c85
add multple fixed val noise levels
Jubeku Mar 12, 2026
79dcc90
enable multiple fixed val noise levels
Jubeku Mar 13, 2026
162a860
Mh/single sample diffusion plotting update (#2049)
moritzhauschulz Mar 15, 2026
7ed3b8d
not write zarr outputs, update plot_training
Jubeku Mar 15, 2026
3d52633
update val noise levels
Jubeku Mar 16, 2026
594412b
avoid rounding of validation noise levels for logging
Jubeku Mar 19, 2026
366c6b5
ERA5 distribution setup
MatKbauer Mar 23, 2026
bad9073
Update diffusion config to normal train/val split
MatKbauer Mar 23, 2026
1d4f30b
Testing inference
MatKbauer Mar 25, 2026
57a300c
Train more samples
MatKbauer Mar 25, 2026
3857b2d
Untrack runs_plot_train
MatKbauer Mar 26, 2026
cfce7a2
Enable DDP training
MatKbauer Mar 27, 2026
c2029c5
inter commit
moritzhauschulz Mar 27, 2026
30fed1b
Add z500 only configs
MatKbauer Mar 30, 2026
96a074b
Overfitting to constant random noise fails
MatKbauer Mar 31, 2026
2490a31
implement date-time conditioning data flow
moritzhauschulz Mar 31, 2026
77e248c
change config
moritzhauschulz Mar 31, 2026
c787a6c
Merge develop and fix overfitting to static noise tensor
MatKbauer Mar 31, 2026
2153713
apply PR review changes
moritzhauschulz Mar 31, 2026
e8664ab
Finish SwiGLU implementation
Mar 31, 2026
c1ead62
Implement XSA
Mar 31, 2026
76aeada
fixed MLP implementation
moritzhauschulz Apr 1, 2026
8658f69
re-added adalayernormlayer
moritzhauschulz Apr 1, 2026
352f1ab
change norms
moritzhauschulz Apr 1, 2026
6c03b35
updated adanorm, added new diffusion forward
moritzhauschulz Apr 1, 2026
546847f
fix conditioning during inference
moritzhauschulz Apr 1, 2026
922dd95
disable inference
moritzhauschulz Apr 1, 2026
11f9687
adjust eta rendering
moritzhauschulz Apr 1, 2026
6ff02bb
Merge branch 'develop' into sophiex/dev-fc/feat-swiglu-xsa
sophie-xhonneux Apr 1, 2026
f0e77d7
New best
sophie-xhonneux Apr 2, 2026
594a119
Fix plotting reset
Apr 2, 2026
92af4bf
Inference success with z500 d128 model
MatKbauer Apr 3, 2026
9eec75f
Multi-sample small working
MatKbauer Apr 6, 2026
b93b792
Successful inference with 5% noise or full noise with small model and…
MatKbauer Apr 7, 2026
720681f
Minor diffusion config update
MatKbauer Apr 7, 2026
2c48e31
Config for 128-dim hl5 z500
MatKbauer Apr 8, 2026
900e220
config changes
moritzhauschulz Apr 8, 2026
d1f2a08
Inference diagnostic tools
MatKbauer Apr 8, 2026
4eaf333
Refined inference diagnostics
MatKbauer Apr 8, 2026
f746eab
Minor adjustments
MatKbauer Apr 8, 2026
45790cd
Minor adjustments
MatKbauer Apr 8, 2026
bd42849
diffusion adjustment
moritzhauschulz Apr 9, 2026
aa954dd
Merge branch 'mk/debug-single-sample-diffusion' into mh/diffusion-dat…
moritzhauschulz Apr 9, 2026
ba913de
Config edits for 512 dim model
MatKbauer Apr 10, 2026
f024715
Clean up for reproduction and hand-over
MatKbauer Apr 13, 2026
dca612a
Merge branch 'mk/debug-single-sample-diffusion' into mh/diffusion-dat…
moritzhauschulz Apr 13, 2026
9c09043
merged conditioning with Matze's branch
moritzhauschulz Apr 13, 2026
1b06241
update num blocks
moritzhauschulz Apr 13, 2026
394fba3
incorporate feedback from matze
moritzhauschulz Apr 13, 2026
1f8fc09
LayerNorm config
MatKbauer Apr 15, 2026
c3b676d
quickfixes
moritzhauschulz Apr 17, 2026
38ac54f
add layernorm
moritzhauschulz Apr 17, 2026
6b7fe2b
now running with new layer structure
moritzhauschulz Apr 17, 2026
ddcd916
minor additions
moritzhauschulz Apr 17, 2026
a3cf6b6
Config and plot noised/denoised side by side
MatKbauer Apr 18, 2026
1c8623c
Remove fixed seed from inference
MatKbauer Apr 20, 2026
ee301e2
inter changes
moritzhauschulz Apr 20, 2026
1002b40
some fixes
moritzhauschulz Apr 20, 2026
0a45413
config changes
moritzhauschulz Apr 20, 2026
6afe572
Merge branch 'mk/mutli-sample-diff' into mh/diffusion-date-time-condi…
moritzhauschulz Apr 20, 2026
320bf07
config changes
moritzhauschulz Apr 21, 2026
7539a98
default config now converges
moritzhauschulz Apr 22, 2026
3569447
config changes
moritzhauschulz Apr 22, 2026
b432e68
remove conditioning
moritzhauschulz Apr 22, 2026
bc243e5
remove more conditioning
moritzhauschulz Apr 22, 2026
092079d
Mh/diffusion era5 uncond (#2257)
moritzhauschulz Apr 24, 2026
efaadab
adding validation pass of random noise
Jubeku Apr 24, 2026
e2c8de3
Merge branch 'develop' into jk/develop/diffusion-full-pipeline
Jubeku Apr 24, 2026
ebf1375
Merge branch 'develop' into sophiex/dev-fc/feat-swiglu-xsa
MatKbauer Apr 28, 2026
30461ea
rm plotting during validation, rm decoding of noised tokens, lint
Jubeku Apr 28, 2026
0ec723a
only write first noise leveld during validation
Jubeku Apr 29, 2026
f6df910
Configs for forecast model candidate
MatKbauer Apr 29, 2026
e095f8b
Store denoising steps during diffusion inference (#2284)
Jubeku Apr 30, 2026
a717fed
Fix max_num_targets=-1 for inference
MatKbauer May 2, 2026
18fcc58
Config minor adjustment: add forecast.time_step: 06:00:00 back in
MatKbauer May 5, 2026
c6d1d75
Bugfix + remove assertion FC_offset=0 (#2323)
kctezcan May 6, 2026
646fa7e
Change config to 1-step pre-training
MatKbauer May 7, 2026
44cd8b9
plot histograms
iluise May 7, 2026
045d311
add agg_dim
iluise May 7, 2026
17c97d7
Merge branch 'develop' into iluise/develop/global-stats
iluise May 7, 2026
518c909
Adjusted configs for forecast backbone model
MatKbauer May 8, 2026
85f6cbd
Merge branch 'jk/develop/diffusion-full-pipeline' of github.com:ecmwf…
MatKbauer May 8, 2026
2a29edf
Working era5 distribution config
MatKbauer May 12, 2026
12a1a9a
Adjust sigma_data and lr_max for diffusion training
MatKbauer May 12, 2026
f01ff6f
Merge feature to generate histogram over samples
MatKbauer May 14, 2026
67cb91c
Add geoinfo to stream config and ensure identical target samples duri…
MatKbauer May 15, 2026
4e68c02
Add missing cf.data_laoding.rng_seed back in
MatKbauer May 15, 2026
945045d
Merged SwiGLU and XSA
MatKbauer May 18, 2026
1abbff1
Update diffusion config to swiglu chkpt and p_mean=1.5
MatKbauer May 18, 2026
366ae77
bug fix
moritzhauschulz Apr 22, 2026
dc6d82d
bug fix
moritzhauschulz Apr 22, 2026
aaeb073
config change
moritzhauschulz Apr 22, 2026
bce064d
plot config
moritzhauschulz Apr 23, 2026
77f6f43
plot config
moritzhauschulz May 4, 2026
140d1ca
update stage handling in diffusion
moritzhauschulz May 4, 2026
825f841
re-implement conditioning, update adalayernorm and embedding function
moritzhauschulz May 7, 2026
dd97830
remove debugging tool
moritzhauschulz May 7, 2026
77d5e0a
implement time only / day only conditioning
moritzhauschulz May 7, 2026
11c9e6a
date_time conditioning
moritzhauschulz May 13, 2026
b6ace25
activate swiglu, xsa
moritzhauschulz May 19, 2026
2b1ff15
Update diffusion config with more pre-trained models
MatKbauer May 19, 2026
4dca300
initial commit with data flow for the forecast conditioning (conditio…
moritzhauschulz May 19, 2026
d2e7b5d
offset 1
moritzhauschulz May 19, 2026
fde1230
bug fix from merge
moritzhauschulz May 19, 2026
3f89769
change ada_ln argument passing
moritzhauschulz May 20, 2026
d74029d
naive implementation of conditioning via concatenation
moritzhauschulz May 20, 2026
8a1b698
remove CLAUDE.md
moritzhauschulz May 20, 2026
1844cbf
implemented cross-attn in fe engine
moritzhauschulz May 20, 2026
372bab4
removed concatenation option
moritzhauschulz May 20, 2026
13560fc
date in config
moritzhauschulz May 20, 2026
66ff754
comment in config
moritzhauschulz May 20, 2026
7369439
minor improvements
moritzhauschulz May 20, 2026
810987a
assert offset zero
moritzhauschulz May 20, 2026
8acbb01
Config for 2048-dim model
MatKbauer May 21, 2026
58bf2d4
roll back data flow (not working)
moritzhauschulz May 21, 2026
9b652f4
cleanup rollback
moritzhauschulz May 21, 2026
27a3b1b
inter commit
moritzhauschulz May 21, 2026
8534dd2
fixes – forecast + cross_attn should run now
moritzhauschulz May 21, 2026
df3adb9
Merge remote-tracking branch 'moritzhauschulz/mh/jk/diffusion-full-pi…
MatKbauer May 21, 2026
17c11fa
Add 2048-dim diffusion configs and inference fixes
MatKbauer May 26, 2026
58ef4de
[DRAFT] Mh/jk/diffusion full pipeline forecast (#2396)
moritzhauschulz May 28, 2026
40 changes: 40 additions & 0 deletions NOTICE
@@ -1,3 +1,43 @@
=======================================================================
NVLABS/EDM (Elucidating the Design of Diffusion Models)

This software incorporates code from the 'edm' repository.

Original Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

The source code is available at:
https://github.com/NVlabs/edm

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0

=======================================================================
google-deepmind/graphcast (several associated papers)

This software incorporates code from the 'google-deepmind/graphcast' repository, with adaptations.

Original Copyright 2024 DeepMind Technologies Limited.

The source code is available at:
https://github.com/google-deepmind/graphcast

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0

=======================================================================
facebookresearch/DiT (Scalable Diffusion Models with Transformers (DiT))

This software incorporates code from the 'facebookresearch/DiT' repository, with adaptations.

The source code is available at:
https://github.com/facebookresearch/DiT

The code and model weights are licensed under CC-BY-NC.
See https://raw.githubusercontent.com/facebookresearch/DiT/refs/heads/main/LICENSE.txt for details.
This project includes code derived from project "DINOv2: Learning Robust Visual Features without Supervision",
originally developed by Meta Platforms, Inc. and affiliates,
licensed under the Apache License, Version 2.0.
339 changes: 339 additions & 0 deletions config/config_diffusion.yml
@@ -0,0 +1,339 @@
# (C) Copyright 2025 WeatherGenerator contributors.
#
# This software is licensed under the terms of the Apache Licence Version 2.0
# which can be obtained at http://www.apache.org/licenses/LICENSE-2.0.
#
# In applying this licence, ECMWF does not waive the privileges and immunities
# granted to it by virtue of its status as an intergovernmental organisation
# nor does it submit to any jurisdiction.

embed_orientation: "channels"
embed_unembed_mode: "block"
embed_dropout_rate: 0.1

ae_local_dim_embed: 512
ae_local_num_blocks: 0
ae_local_num_heads: 16
ae_local_dropout_rate: 0.1
ae_local_with_qk_lnorm: True

ae_local_num_queries: 1
ae_local_queries_per_cell: False
ae_adapter_num_heads: 16
ae_adapter_embed: 128
ae_adapter_with_qk_lnorm: True
ae_adapter_with_residual: True
ae_adapter_dropout_rate: 0.1

ae_global_dim_embed: 512
ae_global_num_blocks: 4
ae_global_num_heads: 32
ae_global_dropout_rate: 0.1
ae_global_with_qk_lnorm: True
# TODO: switching to < 1 triggers triton-related issues.
# See https://github.com/ecmwf/WeatherGenerator/issues/1050
ae_global_att_dense_rate: 1.0
ae_global_block_factor: 64
ae_global_mlp_hidden_factor: 2
ae_global_trailing_layer_norm: False

ae_aggregation_num_blocks: 0
ae_aggregation_num_heads: 32
ae_aggregation_dropout_rate: 0.1
ae_aggregation_with_qk_lnorm: True
ae_aggregation_att_dense_rate: 1.0
ae_aggregation_block_factor: 64
ae_aggregation_mlp_hidden_factor: 2

decoder_type: PerceiverIOCoordConditioning # Main options PerceiverIOCoordConditioning or Linear
pred_adapter_kv: False
pred_self_attention: True
pred_dyadic_dims: False
pred_mlp_adaln: True
num_class_tokens: 0
num_register_tokens: 0

# number of steps offset applied to first target window; if set to zero and forecast_steps=0 then
# one is training an auto-encoder
fe_num_blocks: 6
fe_num_heads: 16
fe_dropout_rate: 0.1
fe_with_qk_lnorm: True
fe_diffusion_model: True
fe_diffusion_model_conditioning: "forecast" # options: "date_time", "time", "forecast"
fe_diffusion_model_conditioning_type: "cross_attn" # options: "cross_attn", "ada_ln"
fe_layer_norm_after_blocks: [] # Index starts at 0. Thus, [3] adds a LayerNorm after the fourth layer
fe_impute_latent_noise_std: 0.0 # 1e-4
# currently fixed to 1.0 (due to limitations with flex_attention and triton)
forecast_att_dense_rate: 1.0
with_step_conditioning: True # False
# Diffusion related parameters
diffusion_conditioning_embed_dim: 32
frequency_embedding_dim: 256
embedding_dim: 512
sigma_min: 0.002
sigma_max: 80
sigma_data: 1.0
rho: 7
p_mean: 1.5
p_std: 1.2
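
The diffusion parameters above carry the same names as the EDM formulation credited in the NOTICE file. The sketch below shows how such parameters are typically used in EDM: a rho-spaced sampling schedule between `sigma_max` and `sigma_min`, a log-normal training noise level controlled by `p_mean` and `p_std`, and preconditioning factors that depend on `sigma_data`. It assumes this config follows the EDM conventions; the function names are illustrative and not taken from this repository.

```python
import math
import random

# Values copied from the config above.
SIGMA_MIN, SIGMA_MAX, SIGMA_DATA = 0.002, 80.0, 1.0
RHO, P_MEAN, P_STD = 7.0, 1.5, 1.2


def karras_sigmas(num_steps: int) -> list[float]:
    """Sampling noise levels from sigma_max down to sigma_min (rho-spaced)."""
    if num_steps == 1:
        return [SIGMA_MAX]
    hi, lo = SIGMA_MAX ** (1 / RHO), SIGMA_MIN ** (1 / RHO)
    return [(hi + i / (num_steps - 1) * (lo - hi)) ** RHO for i in range(num_steps)]


def sample_training_sigma(rng: random.Random) -> float:
    """Log-normal training noise level: ln(sigma) ~ N(P_MEAN, P_STD^2)."""
    return math.exp(rng.gauss(P_MEAN, P_STD))


def edm_preconditioning(sigma: float) -> dict[str, float]:
    """EDM scaling factors and loss weight for one noise level."""
    total = sigma**2 + SIGMA_DATA**2
    return {
        "c_skip": SIGMA_DATA**2 / total,
        "c_out": sigma * SIGMA_DATA / math.sqrt(total),
        "c_in": 1 / math.sqrt(total),
        "c_noise": math.log(sigma) / 4,
        "loss_weight": total / (sigma * SIGMA_DATA) ** 2,
    }
```

The EDM sampler additionally appends a final noise level of zero to the schedule; that step is omitted here to keep the sketch short.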

healpix_level: 5

# Use 2D RoPE instead of traditional global positional encoding
# When True: uses 2D RoPE based on healpix cell coordinates (lat/lon)
# When False: uses traditional pe_global positional encoding
rope_2D: True
mlp_type: swiglu
use_xsa: True
# mlp_type: mlp
# use_xsa: False

with_mixed_precision: True
with_flash_attention: True
compile_model: False
with_fsdp: False
attention_dtype: bf16
mixed_precision_dtype: bf16
mlp_norm_eps: 1e-5
norm_eps: 1e-4

latent_noise_kl_weight: 0.0 # 1e-5
latent_noise_gamma: 2.0
latent_noise_saturate_encodings: 5
latent_noise_use_additive_noise: False
latent_noise_deterministic_latents: True


freeze_modules: ".*latent_pre_norm.*|.*latent_heads.*|.*pred_heads.*|.*target_token_engines.*|.*embed_target_coords.*|.*encoder.*|.*StreamEmbedder_ERA5.*|.*embed_engine.*|.*embed_engine.*|.*ae_local_engine.*|.*ae_local_global_engine.*|.*ae_global_engine.*"
# freeze_modules: ".*latent_pre_norm.*|.*latent_heads.*|.*encoder.*|.*StreamEmbedder_ERA5.*|.*embed_engine.*|.*embed_engine.*|.*fe.*|.*ae_local_engine.*|.*ae_local_global_engine.*|.*ae_global_engine.*"
# freeze_modules: ".*latent_pre_norm.*|.*latent_heads.*|.*encoder.*|.*StreamEmbedder_ERA5.*|.*embed_engine.*|.*embed_engine.*|.*ae_local_engine.*|.*ae_local_global_engine.*|.*ae_global_engine.*"
# freeze_modules: ""
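
The `freeze_modules` value is a regular expression built from `|`-separated alternatives. The sketch below illustrates how such a pattern could select parameters to freeze, assuming it is matched against full parameter names; the matching rule (`re.fullmatch`) and the parameter names are assumptions for illustration, not the repository's implementation.

```python
import re

# Shortened version of the freeze_modules pattern above.
FREEZE_PATTERN = ".*latent_heads.*|.*pred_heads.*|.*encoder.*|.*ae_global_engine.*"


def is_frozen(param_name: str, pattern: str = FREEZE_PATTERN) -> bool:
    """Return True when the pattern matches the whole parameter name."""
    # An empty pattern (the `freeze_modules: ""` variant) freezes nothing.
    return bool(pattern) and re.fullmatch(pattern, param_name) is not None


# Hypothetical parameter names, for illustration only.
names = [
    "encoder.ae_global_engine.blocks.0.weight",
    "forecast_engine.blocks.3.attn.proj.weight",
    "pred_heads.ERA5.0.bias",
]
frozen = [name for name in names if is_frozen(name)]
```

With a model in hand, the same predicate could be used to decide which parameters get `requires_grad = False`.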
# load_chkpt: {'run_id': 't0bdz7qn', 'epoch': -1} # multi-var d2048 hl5, sigma_data=1.7
# load_chkpt: {'run_id': 'dcl584vo', 'epoch': -1} # z500 d2048 hl5, sigma_data=159.08
# load_chkpt: {'run_id': 'tvkicam9', 'epoch': -1} # z500 d2048 hl3 enc-lnorm, sigma_data=1.0
# load_chkpt: {'run_id': 'q9grso75', 'epoch': -1} # z500 d2048 hl3, sigma_data=39.2936
# load_chkpt: {'run_id': 'qxivdyqz', 'epoch': -1} # z500 d2048 hl5 enc-lnorm, sigma_data=1.0
# load_chkpt: {'run_id': 'h8x1qgz3', 'epoch': -1} # z500 d128 hl5, sigma_data=12.93
# load_chkpt: {'run_id': '', 'epoch': -1} # z500 d128 hl5 enc-lnorm, sigma_data=1.0
# load_chkpt: {'run_id': 'wvpb76ai', 'epoch': -1} # multi-var d2048 hl3 enc-lnorm, sigma_data=1.0
# load_chkpt: {'run_id': 'ae4wlc5m', 'epoch': -1} # multi-var d2048 hl3, sigma_data=2.7047
# load_chkpt: {'run_id': 'r45iwyns', 'epoch': -1} # multi-var d512 hl3, sigma_data=1.1785
# load_chkpt: {'run_id': 'ydka6uql', 'epoch': -1} # multi-var d512 hl4, sigma_data=0.827
# load_chkpt: {'run_id': 'lwjkb3y4', 'epoch': -1} # multi-var d512 hl5, sigma_data=0.5789
# load_chkpt: {'run_id': 'v8kd6xc1', 'epoch': -1} # multi-var d512 hl5 nopos, sigma_data=0.6481
# load_chkpt: {'run_id': 'lwjkb3y4', 'epoch': -1} # multi-var d512 hl5 enc-lnorm, sigma_data=1.0
# load_chkpt: {'run_id': 'y1gu5md8', 'epoch': -1} # multi-var d512 hl5, sigma_data=1.0, diffusion-full-pipeline
# load_chkpt: {'run_id': 'mal6u4gc', 'epoch': -1} # multi-var d512 hl5, sigma_data=1.0, geoinfos 64 epochs, diffusion-full-pipeline
# load_chkpt: {'run_id': 'zrpncqb0', 'epoch': -1} # multi-var d512 hl5, sigma_data=1.0, geoinfos 196 epochs, diffusion-full-pipeline
# load_chkpt: {'run_id': 'm6fs8wvj', 'epoch': -1} # multi-var d512 hl5, sigma_data=1.0, swiglu xsa geoinfos, diffusion-full-pipeline
# load_chkpt: {'run_id': 'cgxt9imf', 'epoch': -1} # diffusion model to fine-tune decoder, p_mean=0.5, SwiGLU+XSA+geoinfos, based on m6fs8wvj backbone
# load_chkpt: {'run_id': 'wo5mf2z4', 'epoch': -1} # diffusion model to fine-tune decoder, p_mean=1.5, SwiGLU+XSA+geoinfos, based on m6fs8wvj backbone
# load_chkpt: {'run_id': 'zf6wnmpe', 'epoch': -1} # multi-var d2048 hl5, sigma_data=1.832
# load_chkpt: {'run_id': 'mivw6jda', 'epoch': -1} # multi-var d2048 hl5 enc-lnorm, sigma_data=1.0
# load_chkpt: {'run_id': 'j74tn8le', 'epoch': -1} # forecasting d512 hl5, diffusion-full-pipeline, p_mean=-1.5, based on m6fs8wvj backbone
# load_chkpt: {'run_id': 'j7lr0jws', 'epoch': -1} # forecasting d512 hl5, diffusion-full-pipeline, p_mean=-1.2, based on m6fs8wvj backbone
# load_chkpt: {'run_id': 'cbras2el', 'epoch': -1} # forecasting d512 hl5, diffusion-full-pipeline, p_mean=-0.5, based on m6fs8wvj backbone
# load_chkpt: {'run_id': 'kn3124hp', 'epoch': -1} # forecasting d512 hl5, diffusion-full-pipeline, p_mean=0.0, based on m6fs8wvj backbone
# load_chkpt: {'run_id': 'qqbu9852', 'epoch': -1} # forecasting d512 hl5, diffusion-full-pipeline, p_mean=0.5, based on m6fs8wvj backbone
# load_chkpt: {'run_id': 'vqsh3yrl', 'epoch': -1} # forecasting d512 hl5, diffusion-full-pipeline, p_mean=1.0, based on m6fs8wvj backbone
# load_chkpt: {'run_id': 'xl8h7vbt', 'epoch': -1} # forecasting d512 hl5, diffusion-full-pipeline, p_mean=1.5, based on m6fs8wvj backbone
# load_chkpt: {'run_id': 'p9m2jwvc', 'epoch': -1} # forecasting d512 hl5, diffusion-full-pipeline, p_mean=2.0, based on m6fs8wvj backbone


norm_type: "LayerNorm"

#####################################

streams_directory: "./config/streams/era5_1deg_forecasting/"
# streams_directory: "./config/streams/era5_1deg_forecasting_z500/"
streams: ???

# type of zarr_store
zarr_store: "zip" # "zarr" for LocalStore, "zip" for ZipStore

general:

  # mutable parameters
  istep: 0
  rank: ???
  world_size: ???

  # local_rank,
  # with_ddp,
  # data_path_*,
  # model_path,
  # run_path,
  # path_shared_

  multiprocessing_method: "fork"

  desc: ""
  run_id: ???
  run_history: []

# logging frequency in the training loop (in number of batches)
train_logging:
  terminal: 10
  metrics: 20
  checkpoint: 250
  log_grad_norms: False

# parameters for data loading
data_loading :

  num_workers: 12
  rng_seed: ???
  repeat_data_in_mini_epoch : False

  # pin GPU memory for faster transfer; it is possible that enabling memory_pinning with
  # FSDP2 + DINOv2 can cause the job to hang and trigger a PyTorch timeout error.
  # If this happens, you can disable the flag, but performance will drop on GH200.
  memory_pinning: True


# config for training
training_config:

  # training_mode: "masking", "student_teacher", "latent_loss"
  training_mode: ["masking","student_teacher"]

  num_mini_epochs: 128
  samples_per_mini_epoch: 4096
  shuffle: True

  start_date: 1979-01-01T00:00
  end_date: 2022-12-31T18:00

  time_window_step: 06:00:00
  time_window_len: 06:00:00

  learning_rate_scheduling :
    lr_start: 1e-6 #5e-5
    lr_max: 1e-5 #1e-4
    lr_final_decay: 1e-6
    lr_final: 0.0
    num_steps_warmup: 64
    num_steps_cooldown: 512
    policy_warmup: "cosine"
    policy_decay: "constant"
    policy_cooldown: "linear"
    parallel_scaling_policy: "sqrt"

  optimizer:
    grad_clip: 1.0
    weight_decay: 0.1
    log_grad_norms: False
    adamw :
      # parameters are scaled by number of DDP workers
      beta1 : 0.975
      beta2 : 0.9875
      eps : 2e-08

  losses : {
    "physical": {
      type: LossPhysical,
      weight: 0.0,
      loss_fcts: {
        "mse": {},
      },
      target_and_aux_calc: "Physical",
    },
    "latent_diff": {
      type: LossLatentDiffusion,
      weight: 1.0,
      target_and_aux_calc: DiffusionLatentTargetEncoder,
      loss_fcts: { "mse": { }, },
    }
  }

  model_input: {
    "forecasting" : {
      # masking strategy: "random", "healpix", "forecast"
      masking_strategy: "forecast",
      masking_strategy_config: {diffusion_rn: True},
      num_steps_input: 2,
      num_samples: 1,
    }
  }

  target_input: {
    "forecasting" : {
      masking_strategy: "forecast",
      masking_strategy_config: {diffusion_rn: True},
      num_steps_input: 1,
      num_samples: 1,
    }
  }

  forecast :
    time_step: 06:00:00
    num_steps: 1
    offset: 0
    policy: "fixed"


# validation config; full validation config is merge of training and validation config
validation_config:

  # Noise levels (eta values in standard normal space) at which to evaluate the
  # diffusion model during validation. sigma = exp(eta * p_std + p_mean).
  # Each value produces a separate validation pass with independently logged metrics.
  validation_noise_levels: [1.0, 2.0, 3.0, 4.0]

  samples_per_mini_epoch: 256
  shuffle: True

  start_date: 2023-10-01T00:00
  end_date: 2023-12-31T18:00

  # whether to track the exponential moving average of weights for validation
  validate_with_ema:
    enabled : True
    ema_ramp_up_ratio: 0.09
    ema_halflife_in_thousands: 1e-3

  # parameters for validation samples that are written to disk
  output : {
    # number of samples that are written
    num_samples: 0,
    # write samples in normalized model space
    normalized_samples: False,
    # output streams to write; default all
    streams: null,
  }

  # run validation before training starts (mainly for model development)
  validate_before_training: True
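
Two of the validation settings above can be made concrete with a small calculation. The sketch below converts the `validation_noise_levels` (eta values) to sigma with the formula stated in the config comment, and computes an EMA decay factor from `ema_halflife_in_thousands` and `ema_ramp_up_ratio`. The EMA part assumes the settings are used as in the EDM training loop, with the half-life counted in samples; this is an assumption, and the function names are illustrative.

```python
import math

P_MEAN, P_STD = 1.5, 1.2  # from the diffusion parameters in this config
VALIDATION_NOISE_LEVELS = [1.0, 2.0, 3.0, 4.0]  # eta values


def eta_to_sigma(eta: float) -> float:
    """sigma = exp(eta * p_std + p_mean), as stated in the config comment."""
    return math.exp(eta * P_STD + P_MEAN)


def ema_beta(
    samples_seen: int,
    batch_size: int,
    halflife_in_thousands: float = 1e-3,
    ramp_up_ratio: float = 0.09,
) -> float:
    """EMA decay per optimizer step, EDM-style (assumed usage)."""
    halflife = halflife_in_thousands * 1000
    # Early in training the half-life is capped by the number of samples seen.
    halflife = min(halflife, samples_seen * ramp_up_ratio)
    return 0.5 ** (batch_size / max(halflife, 1e-8))


sigmas = {eta: eta_to_sigma(eta) for eta in VALIDATION_NOISE_LEVELS}
```

With `p_mean: 1.5` and `p_std: 1.2`, the eta values 3.0 and 4.0 map to sigma values above the configured `sigma_max` of 80, while 1.0 and 2.0 stay below it.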


# test config; full test config is merge of validation and test config
# test config is used by default when running inference

# Tags for experiment tracking
# These tags will be logged in MLFlow along with completed runs for train, eval, val
# The tags are free-form, with the following rules:
# - tags should be primitive types (strings, numbers, booleans). NO lists or dictionaries
# - tags should not duplicate existing config entries.
# - try to reuse existing tags where possible. MLFlow does not like having too many unique tags
# - do not use long strings in values (less than 20 characters is a good rule of thumb, we may enforce this in the future)
wgtags:
  # The name of the organization of the person running the experiment.
  # This may be autofilled in the future. Expected values are lowercase strings
  # e.g. "ecmwf", "cmcc", "metnor", "jsc", "escience"
  org: null
  # The Github issue corresponding to this run (number such as 1234)
  # Github issues are the central point when running experiment and contain
  # links to hedgedocs, code branches, pull requests etc.
  # It is recommended to associate a run with a Github issue.
  issue: null
  # The name of the experiment. This is a distinctive codename for the experiment campaign being run.
  # This is expected to be the primary tag for comparing experiments in MLFlow, along with the
  # issue number.
  # Expected values are lowercase strings with no spaces, just underscores:
  # Examples: "rollout_ablation_grid"
  exp: null
  # *** Experiment-specific tags ***
  # All extra tags (including lists, dictionaries, etc.) are treated
  # as strings by mlflow, so treat all extra tags as simple string key: value pairs.
  grid: null
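
The tag rules listed in the comments above (primitive values only, short strings) can be checked mechanically before a run is logged. The sketch below is a hypothetical helper, not part of the repository; it treats the 20-character guideline as a limit, which the config describes only as a rule of thumb.

```python
def check_wgtags(tags: dict) -> list[str]:
    """Return rule violations for a wgtags mapping (hypothetical helper)."""
    problems = []
    for key, value in tags.items():
        if value is None:
            continue  # unset tags (null in YAML) are allowed
        if not isinstance(value, (str, int, float, bool)):
            problems.append(f"{key}: must be a string, number or boolean")
        elif isinstance(value, str) and len(value) >= 20:
            problems.append(f"{key}: string values should stay under 20 characters")
    return problems


# Example values in the style suggested by the config comments.
ok = check_wgtags({"org": "ecmwf", "issue": 1234, "exp": None, "grid": None})
bad = check_wgtags({"exp": ["rollout", "ablation"], "grid": "x" * 25})
```

Here `ok` is empty, and `bad` reports one problem for the list value and one for the overlong string.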