GenDA - Generative Data Assimilation.
Experiments in generative neural data assimilation for multi-modal surface ocean state estimation. This code is associated with the paper:
Martin, S. A., Manucharyan, G. E., & Klein, P. (under review), Generative Data Assimilation for Surface Ocean State Estimation from Multi-Modal Satellite Observations, EarthArXiv
The problem: Estimate the multi-modal dynamical state of the surface ocean (sea surface height, temperature, salinity, and surface currents) from sparse satellite observations of sea surface height and temperature and low-resolution objective analysis products for sea surface height, temperature, and salinity.
The approach: Given high-resolution training data from eddy-resolving numerical simulations, train a generative model to produce realistic multi-modal surface snapshots that resemble the simulation output (e.g. sea surface height, temperature, salinity, & surface currents). Can we then use this generative model to estimate poorly-observed quantities (e.g. surface currents/salinity) from satellite observables (e.g. sea surface height and temperature)?
Motivations for a generative approach vs a regression approach:
- A regression approach predicts a single value, which smooths out small-scale features and degrades higher-order dynamical diagnostics. A generative approach can potentially produce an ensemble of high-resolution reconstructions, each of which preserves the fine-scale features.
- A regression approach provides no robust way to transfer from the training environment (simulation data) to real-world observations. Subtle differences between real observations at inference and simulated observations during training propagate through the network with no well-defined behaviour. A generative approach would ensure that fields generated from observations 'look like' the simulated data, i.e. hopefully preserve the simulation's physics.
The Method: Score-Based Data Assimilation (referred to here as 'generative data assimilation' or 'GenDA')
Step 1: Train an unconditional diffusion model to produce realistic multi-modal samples. NB: this training is conducted on full model fields, with no generation of simulated observations.
Step 2: Guide the generation from the trained model using sparse observations by taking gradient steps with respect to the state estimate, x, while keeping the diffusion model parameters fixed to preserve the qualitative nature of the model outputs. (Method proposed by Rozet & Louppe 2023 and recently applied to atmospheric reanalysis by Manshausen et al.).
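
As a rough illustration of Step 2, the toy sketch below updates a two-component state x using the sum of a fixed prior score and an observation-likelihood gradient. It is not the code in this repository: the analytic Gaussian score stands in for the frozen diffusion network, the noise schedule is collapsed to a single level, and every name (`prior_score`, `likelihood_score`, `guided_estimate`, `RHO`) is hypothetical. Only the first component is observed; the second is inferred through the correlation encoded in the prior.

```python
import random

RHO = 0.8  # assumed prior correlation between the observed and unobserved variable


def prior_score(x):
    """Score (gradient of the log-density) of a zero-mean, unit-variance bivariate
    Gaussian with correlation RHO. Stands in for the frozen, pre-trained model."""
    scale = 1.0 / (1.0 - RHO ** 2)
    return [-scale * (x[0] - RHO * x[1]), -scale * (x[1] - RHO * x[0])]


def likelihood_score(x, y_obs, obs_var):
    """Gradient of log p(y_obs | x) when only x[0] is observed with Gaussian noise."""
    return [(y_obs - x[0]) / obs_var, 0.0]


def guided_estimate(y_obs, obs_var=0.01, step=0.005, n_steps=5000,
                    noise_scale=0.0, seed=0):
    """Take gradient steps on the state x only; the prior (the 'model') is never updated.
    noise_scale=0 gives plain gradient ascent on the log-posterior;
    noise_scale=1 gives unadjusted Langevin sampling of the posterior."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n_steps):
        s_prior = prior_score(x)
        s_like = likelihood_score(x, y_obs, obs_var)
        for i in range(2):
            drift = step * (s_prior[i] + s_like[i])
            noise = noise_scale * (2.0 * step) ** 0.5 * rng.gauss(0.0, 1.0)
            x[i] += drift + noise
    return x


# Observe x[0] = 1.0; x[1] is unobserved and recovered via the prior correlation.
estimate = guided_estimate(y_obs=1.0)
print(estimate)
```

With `noise_scale=0.0` the result converges to the posterior mean of this linear-Gaussian toy problem, which makes it easy to check; setting `noise_scale=1.0` and varying `seed` yields an ensemble of samples instead of a single estimate.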

Training data: simulation data from the 1/12 degree global reanalysis product GLORYS 12, subset to a region surrounding the Gulf Stream.
Experiments:
- Observing System Simulation Experiment (OSSE): estimate state from simulated satellite observations and compare to known 2D ground truth.
- Observing System Experiment (OSE): estimate state from real-world satellite observations and compare to some independent withheld observations.
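
The OSSE workflow above can be sketched in a few lines: sample sparse, noisy observations from a known truth field, build a reconstruction, and score it against the full truth. This is a minimal sketch under stated assumptions, not the repository's evaluation code: the random observation mask is a simplification of real satellite sampling, the synthetic truth field and the mean-fill "reconstruction" are placeholders, and the names `simulate_observations` and `rmse` are hypothetical.

```python
import math
import random


def simulate_observations(truth, obs_fraction, noise_std, seed=0):
    """Sample a sparse, noisy subset of a 2D truth field.
    Unobserved grid cells are returned as None."""
    rng = random.Random(seed)
    obs = []
    for row in truth:
        obs_row = []
        for value in row:
            if rng.random() < obs_fraction:
                obs_row.append(value + rng.gauss(0.0, noise_std))
            else:
                obs_row.append(None)
        obs.append(obs_row)
    return obs


def rmse(estimate, truth):
    """Root-mean-square error over every grid cell of two 2D fields."""
    sq = [(e - t) ** 2
          for est_row, true_row in zip(estimate, truth)
          for e, t in zip(est_row, true_row)]
    return math.sqrt(sum(sq) / len(sq))


# Synthetic stand-in for a simulated field on a 16 x 16 grid.
truth = [[math.sin(0.5 * i) * math.cos(0.3 * j) for j in range(16)]
         for i in range(16)]
obs = simulate_observations(truth, obs_fraction=0.2, noise_std=0.05)

# Placeholder reconstruction: fill every cell with the mean of the observations.
observed = [v for row in obs for v in row if v is not None]
mean_obs = sum(observed) / len(observed)
baseline = [[mean_obs for _ in row] for row in truth]

score = rmse(baseline, truth)
print(score)
```

In an OSE the full truth field is unavailable, so the same error metric would instead be evaluated only at the locations of the withheld observations.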
Structure of the code:
- ./pre-processing contains code for preparing the desired target fields from publicly available datasets. For example, we subtract geostrophic currents and Ekman currents (derived using a linear regression model) from the surface current variable we seek to reconstruct.
- ./src contains utility code (e.g. dataloaders, neural network architecture for a baseline UNet regression approach).
  - The GenDA diffusion model code is adapted from NVIDIA Modulus CorrDiff (installed from upstream repo on 07/21/2024, looks like they refactored the code since so I include here my local copy for reproducibility).
- ./conf contains hydra config files used for model training.
- ./training contains training scripts.
- ./inference contains inference scripts for both the OSE and OSSE.
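
The pre-processing note above mentions subtracting geostrophic currents. For orientation, the sketch below computes geostrophic velocities from sea surface height using the standard balance u = -(g/f) dη/dy, v = (g/f) dη/dx with centred finite differences on a regular lat/lon grid. It is an illustrative sketch only and may differ from the code in ./pre-processing: the function name `geostrophic_currents`, the grid, and the synthetic SSH field are assumptions, and the Ekman component is not covered here.

```python
import math

G = 9.81                # gravitational acceleration, m s^-2
OMEGA = 7.2921e-5       # Earth's rotation rate, rad s^-1
EARTH_RADIUS = 6.371e6  # mean Earth radius, m


def geostrophic_currents(ssh, lats, lons):
    """Geostrophic velocities (m/s) from SSH (m) on a regular lat/lon grid (degrees).
    Interior points use centred differences; boundary points are left as None."""
    n_lat, n_lon = len(lats), len(lons)
    u = [[None] * n_lon for _ in range(n_lat)]
    v = [[None] * n_lon for _ in range(n_lat)]
    for i in range(1, n_lat - 1):
        lat_rad = math.radians(lats[i])
        f = 2.0 * OMEGA * math.sin(lat_rad)  # Coriolis parameter
        dy = EARTH_RADIUS * math.radians(lats[i + 1] - lats[i - 1])
        for j in range(1, n_lon - 1):
            dx = (EARTH_RADIUS * math.cos(lat_rad)
                  * math.radians(lons[j + 1] - lons[j - 1]))
            deta_dy = (ssh[i + 1][j] - ssh[i - 1][j]) / dy
            deta_dx = (ssh[i][j + 1] - ssh[i][j - 1]) / dx
            u[i][j] = -G / f * deta_dy  # eastward component
            v[i][j] = G / f * deta_dx   # northward component
    return u, v


# Synthetic example: SSH rising by 0.1 m per degree of latitude near 35N.
lats = [35.0 + k / 12.0 for k in range(5)]
lons = [-70.0 + k / 12.0 for k in range(5)]
ssh = [[0.1 * (lat - 35.0) for _ in lons] for lat in lats]

u, v = geostrophic_currents(ssh, lats, lons)
print(u[2][2], v[2][2])
```

The balance breaks down near the equator, where f approaches zero, so a sketch like this only makes sense for mid-latitude regions such as the Gulf Stream domain used here.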

