Direct netCDF I/O for soca state/geometry (soca_io_mod) by travissluka · Pull Request #1242 · JCSDA-internal/soca

travissluka · 2026-05-13T02:51:09Z

Description

Resolves #1225

Rips out FMS I/O! Replaces SOCA's register_restart_field / save_restart / restore_state paths with direct nf90_* I/O in a new module src/soca/IO/soca_io_mod.F90. Same on-disk file format. The register-then-commit API mirrors FMS semantics so call sites in soca_geom_mod, soca_fields_mod, and soca_balance_mod are minimally adapted.

Why:

FMS I/O v1 is being retired and we have to move off it.
FMS I/O v2 was problematic when we tried it — not a drop-in.
Independent of that pressure, we wanted out anyway: a follow-up PR builds on this to greatly accelerate ensemble writes (one PE per member writing its own file in parallel, instead of the FMS-mandated collective on the compute communicator).

Issue(s) addressed

n/a

Dependencies

n/a

Impact

soca-only. Same files on disk; same public Fortran interface in soca_geom, soca_fields, soca_balance. FMS dependency for register_restart_field / save_restart / restore_state is removed; mpp_gather / mpp_scatter / mpp_broadcast are still used.

Manual Testing Instructions (optional)

n/a

Checklist

I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have run the unit tests before creating the PR

Introduces src/soca/IO/soca_io_mod.F90 with soca_io_file_reader / soca_io_file_writer types (init/enqueue/commit/close), and routes all state, geometry, and balance-coefficient I/O through it. PE 0 calls nf90_get_var / nf90_put_var; mpp_broadcast and mpp_gather distribute / collect data on the geometry's pelist. Removes all fms_io_mod usages (register_restart_field, restore_state, save_restart, read_data, file_exist, field_exist, fms_io_init, fms_io_exit) from soca, including the global_soca_geom_counter shim. fms_init / fms_end are retained (subsystem init only, no I/O). Resolves the GDAS Tsnz_h shape mismatch by building nf90_get_var count from the file's actual dim sizes, not the caller's buffer rank. Per-domain state writes (ocn/sfc/ice/wav/bio) and reads now use the new module; the FMS code paths are removed in the same change rather than gated. Refs #1125. Obsoletes PR #1241.

Under FMS, the writer would implicitly add the .nc suffix, so test YAMLs and downstream readers reference 'ocn.<exp>.<typ>.<date>.nc'. The direct-netCDF writer in soca_io_mod is faithful to whatever string it's given, so files were landing without the extension and no consumer could find them.

Six soca_add_test() calls listed test_socs_parameters_diffusion as a dependency (should be test_soca_). Because the named test does not exist, ctest treated the dependency as void and ran the variational tests concurrently with parameters_diffusion under -j>1, causing races on the diffusion calibration outputs.

Drops the SOCA_IO_SERIAL / SOCA_IO_PARALLEL mode bifurcation (and the method= arg + io.method config knob) and rewrites the reader to do per-PE strided nf90_get_var, mirroring FMS mpp_io's domain-decomposed read pattern. Each PE opens the file once via NF90_NOWRITE and reads its own compute-domain tile via nf90_get_var(start, count) -- the same shape as fms_io's READ_RECORD_ for fileset.NE.MPP_SINGLE. Adds a module-level read cache mirroring fms_io's get_file_unit + files_read(i)%var(j) tables: - one nf90_open per (PE, filename) for the whole run - cached (varid, file_ndims, middle_dim sizes) per (file, var) - soca_io_close_all() flushes the cache; wired into soca_geom_end Caller updates: drop method= from reader/writer init() and the soca_io_method_from_config helper from soca_fields_mod / soca_geom_mod / soca_balance_mod. 1-deg 20-mem LETKF read phase (LocalEnsembleDA before solver ctor): baseline (FMS): 3.91 s before this commit (no cache): 13.92 s with this commit: 3.97 s Outputs are bit-identical.

…cache) Adds a second reader_commit implementation -- PE 0 nf90_get_var the global field + mpp_broadcast -- alongside the existing per-PE strided path. Both paths share the file-handle + var-metadata cache. Select via SOCA_IO_READ_MODE=broadcast|strided (default broadcast); the env var is latched on first reader_commit. Why broadcast as default: in DA cycling the page cache is always cold, since each cycle reads files that the previous cycle / model run just wrote. On cold cache the broadcast path (1 sequential read on PE 0, kernel readahead happy) beats strided (8 concurrent strided readers thrash the prefetcher and pay 8x the open-syscall cost). 1-deg 20-mem LETKF, rancor, taskset -c 0-7, cold cache (sample 1): broadcast: 13.5 s read strided: 30.5 s read Warm cache reverses the result (strided 3.9 s vs broadcast 13.3 s) but cycling never sees warm cache. On a parallel filesystem (Lustre/GPFS) or multi-node setup the strided path may win again -- toggle the env var to test.

Drops SOCA_IO_READ_MODE in favor of a yaml block on the geometry config: geometry: io: read mode: broadcast | strided If absent the module-level default (broadcast) is kept. soca_geom_init calls soca_io_read_mode_from_config(f_conf) once at startup; the choice latches for every subsequent reader_commit. Unrecognized values abort so typos surface immediately.

Annotates testinput/letkf.yml's geometry block with the two read-mode options so a contributor reading the canonical letkf example sees both choices without having to hunt for the soca_io_mod docstring.

…t-io

The write path emits FMS-style auto-numbered dim names (xaxis_N / yaxis_N / zaxis_N / Time); the file-header docstring incorrectly claimed xh / yh / Time. Correct the comment -- no code change.

Writer: - enqueue holds pointers to caller buffers instead of allocating and copying (mirrors FMS register_restart_field contract). Compute-slice extraction moves from enqueue to commit, so peak per-writer memory drops from sum(var_bytes) to one (nx_c x ny_c) tile. - commit uses the 3D mpp_gather overload for 3D vars: one collective per var instead of nlevels, and no per-level gbuf3d(:,:,k) = gbuf2d memcpy on root. - Reuse gbuf3d / tile3 across 3D vars when nlevels matches. Reader: - commit_reader_strided hoists tile2 out of the per-var loop and reuses tile3 / tile4 when trailing dims match. - Dead tile3/tile4 zero-inits removed (read_var_strided fully fills the buffer). read_var_strided: fixed-size stack arrays for st / ct instead of per-call heap allocation. put_axis_coord_data: reuse idxbuf across axes. Cleanup: - Remove dead 'use mpi' and 'use fckit_configuration_module'. - Remove dead cartesian_axis parameter on writer_enqueue_1d (it was stored but never written to the netcdf output). - Drop unused nprocs out-param of mpi_pelist. - soca_io_close_all: use ncc on nf90_close instead of silently discarding the status. - commit_reader_scatter: matching dead-init cleanup; comment updated to flag it as pending parallel-ensemble I/O exercise. - Trim verbose / redundant comments.

travissluka · 2026-05-13T02:56:24Z

Note, @shlyaeva, this uses the strided method that you previously tested, and that is the default, so this PR should now be a drop-in replacement with no changes needed to yamls. The other reader method is disabled (but the code is still there so that it can be used with the ensemble I/O follow-up)

Copilot

Pull request overview

This PR replaces SOCA’s previous FMS restart I/O pathway (register_restart_field / save_restart / restore_state) with direct NetCDF Fortran (nf90_*) I/O implemented in a new soca_io_mod module, and updates geometry/fields/balance call sites to use a register-then-commit reader/writer API.

Changes:

Added src/soca/IO/soca_io_mod.F90 providing soca_io_reader/soca_io_writer plus basic file/variable existence helpers and a cached read-handle table.
Updated soca_geom_mod, soca_fields_mod, and soca_balance_mod to use the new reader/writer API in place of FMS I/O.
Extended LETKF YAML test input with an io.read mode knob and adjusted test dependencies naming in test/CMakeLists.txt.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
test/testinput/letkf.yml	Adds YAML configuration for geometry I/O read strategy.
test/CMakeLists.txt	Fixes/aligns diffusion parameter test dependency name.
src/soca/LinearVariableChange/Balance/soca_balance_mod.F90	Switches sea-ice Jacobian read from `read_data` to `soca_io_reader`.
src/soca/IO/soca_io_mod.F90	Introduces direct NetCDF I/O module with writer/reader and cached read handles.
src/soca/IO/CMakeLists.txt	Adds the new IO module to SOCA build sources.
src/soca/Geometry/soca_geom_mod.F90	Replaces gridspec read/write via FMS restart API with `soca_io_reader`/`writer`; closes cached handles on geom end.
src/soca/Fields/soca_fields_mod.F90	Replaces restart I/O and existence checks with `soca_io_*` equivalents; updates filename generation to include `.nc`.
src/soca/CMakeLists.txt	Adds IO subdirectory to the build.

Comments suppressed due to low confidence (3)

src/soca/IO/soca_io_mod.F90:196

The writer stores pointers to src dummy arguments (self%vars(... )%data* => src) and relies on them remaining valid until commit(). Per the Fortran standard this is only safe if the actual arguments have the TARGET (or POINTER) attribute; otherwise the stored pointers can become undefined after enqueue_* returns. Either (a) change the API to copy data into internal allocatables, or (b) require/enforce TARGET on all actual arrays passed to enqueue (including allocatable components like geom%lon, varwrapper%data, etc.), and update call sites accordingly.

subroutine writer_enqueue_1d(self, name, src, long_name, units)
  class(soca_io_writer),         intent(inout) :: self
  character(len=*),              intent(in)    :: name
  real(kind=kind_real), target,  intent(in)    :: src(:)
  character(len=*), optional,    intent(in)    :: long_name, units

  call grow_if_needed(self)
  self%nvars = self%nvars + 1
  self%vars(self%nvars)%name    = name
  self%vars(self%nvars)%ndims   = 1
  self%vars(self%nvars)%nlevels = 1
  self%vars(self%nvars)%data1d => src
  self%vars(self%nvars)%long_name = name
  self%vars(self%nvars)%units     = 'none'
  if (present(long_name)) self%vars(self%nvars)%long_name = long_name
  if (present(units))     self%vars(self%nvars)%units     = units
end subroutine writer_enqueue_1d

src/soca/IO/soca_io_mod.F90:401

Same issue for 3D gathers: gbuf3d is only allocated on root, but it is passed to mpp_gather on all PEs. Passing an unallocated allocatable as an actual argument is not safe even if non-root never reads it. Allocate a dummy gbuf3d(1,1,1) (or otherwise ensure it is allocated) on non-root PEs before calling mpp_gather.

      nlev = self%vars(v)%nlevels
      if (is_root) then
        if (allocated(gbuf3d)) then
          if (size(gbuf3d, 3) /= nlev) deallocate(gbuf3d)
        end if
        if (.not. allocated(gbuf3d)) allocate(gbuf3d(self%nx_g, self%ny_g, nlev))
      end if
      if (allocated(tile3)) then
        if (size(tile3, 3) /= nlev) deallocate(tile3)
      end if
      if (.not. allocated(tile3)) allocate(tile3(nx_c, ny_c, nlev))
      tile3 = self%vars(v)%data3d(i_off : i_off + nx_c - 1, &
                                  j_off : j_off + ny_c - 1, :)
      call mpp_gather(self%isc, self%iec, self%jsc, self%jec, nlev, pelist, &
                      tile3, gbuf3d, is_root)
      if (is_root) then

src/soca/Geometry/soca_geom_mod.F90:257

The new greader%enqueue(...) calls pass allocatable components like self%lon, self%mask2d, etc. that are not declared with TARGET. Since soca_io_reader stores pointers to the enqueued buffers until commit(), these actual arguments should be TARGET (or the reader should be changed to copy into internal storage) to avoid non-standard/undefined pointer lifetimes.

    call greader%init(self%Domain%mpp_domain, str)
    call greader%enqueue("lonh",    self%lonh)
    call greader%enqueue("lath",    self%lath)
    call greader%enqueue("lonq",    self%lonq)
    call greader%enqueue("latq",    self%latq)
    call greader%enqueue("lon",     self%lon)
    call greader%enqueue("lat",     self%lat)
    call greader%enqueue("lonu",    self%lonu)
    call greader%enqueue("latu",    self%latu)
    call greader%enqueue("lonv",    self%lonv)
    call greader%enqueue("latv",    self%latv)
    call greader%enqueue("sin_rot", self%sin_rot)
    call greader%enqueue("cos_rot", self%cos_rot)
    call greader%enqueue("dx",      self%dx)
    call greader%enqueue("dy",      self%dy)
    call greader%enqueue("area",    self%cell_area)
    call greader%enqueue("mask2d",  self%mask2d)
    call greader%enqueue("mask2du", self%mask2du)
    call greader%enqueue("mask2dv", self%mask2dv)
    call greader%commit()

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Drop dangling references to the removed soca_io_read_mode_from_config / 'io.read mode' YAML key in soca_geom_mod and the letkf test input. The selector was already gone from soca_io_mod; without these the PR fails to compile.

- soca_io_mod: allocate gbuf2d / gbuf3d as 1x1 dummies on non-root so the actuals passed to mpp_gather are always allocated (assumed-shape dummy requires it). Pattern matches commit_reader_scatter. - soca_io_mod: refresh stale 'Data is copied in' comment to reflect the pointer-based enqueue semantics (caller buffer kept alive/unmutated through commit; actual must satisfy TARGET association rules). - soca_geom_mod: TARGET on the 'self' dummy in soca_geom_init, soca_geom_init_fieldset, and soca_geom_write so the allocatable components (self%lonh, self%lat, ..., self%mask2d*) are valid pointer targets for the soca_io reader/writer enqueue. TARGET on the local fieldData / fieldDataVars too. - soca_fields_mod: TARGET on h_common and on the local vars(:) wrapper array so vars(n)%data sections used in enqueue are valid pointer targets. - soca_balance_mod: TARGET on local kct for the same reason.

travissluka · 2026-05-13T03:12:14Z

Thanks for the review. All six Copilot findings addressed:

soca_io_read_mode_from_config undefined — fixed in 4f33324 (dropped the matching import + call + YAML key that referenced the removed routine).
gbuf2d/gbuf3d only allocated on root — fixed in 71f14e2 (allocate 1×1 dummy on non-root so the actuals passed to mpp_gather are always allocated; pattern matches commit_reader_scatter).
Stale "Data is copied in" comment — refreshed in 71f14e2 to describe the pointer-based semantics and the TARGET-association contract.
Reader stores pointers to dst (F03 conformance) — addressed by adding TARGET to callers' actual arguments (see below), keeping the FMS-mirroring pointer design. Same for the writer.
kct missing TARGET — added in 71f14e2 on the local declaration.
h_common missing TARGET — added in 71f14e2, plus TARGET on the local vars wrapper array (so vars(n)%data sections are valid pointer targets), and TARGET on the self dummy of soca_geom_init / soca_geom_init_fieldset / soca_geom_write (so self%lonh and friends are valid targets).

All 7 writer/reader-heavy soca ctests pass locally; cluster-faithful timing unchanged from the prior commit on this branch.

shlyaeva · 2026-05-13T12:23:28Z

@travissluka thank you! I'll test this today in the workflow and report back.

shlyaeva

Thank you @travissluka! This approval is based solely on testing with 0.25 degree soca in GFSv17 "hybrid" configuration (the one that also runs LETKF) for 1-2 cycles. The timings of I/O are slightly better than on develop 🎉 . I have not reviewed any of this code, and don't plan to review it.

(side note unrelated to this PR: the overall timing of 3DVar is significantly slower than with the code from a few weeks ago, I suspect something may have changed in how we calibrate vertical diffusion. we'll investigate that separately).

travissluka · 2026-05-14T19:49:34Z

Thank you @travissluka! This approval is based solely on testing with 0.25 degree soca in GFSv17 "hybrid" configuration (the one that also runs LETKF) for 1-2 cycles. The timings of I/O are slightly better than on develop 🎉 . I have not reviewed any of this code, and don't plan to review it.

I think the slight improvement is because FMS was copying fields before writing them, whereas now we just pass pointers around. Memory usage should also be slightly less now.

shlyaeva

Thank you for pointing me to looking at the memory, I just rebuilt with soca develop with using the same oops/etc branches for apples to apples comparison (the previous one again wasn't!), and actually there are two issues (one crucial).

[major, needs fixing before the merge]. The issue with vertical diffusion that I mentioned above is due to this branch.
On develop, consistent with other builds:

  0: group 1 of 1
  0: Generating vertical scales
  0:   Reading from file
  0:   mask: NONE
  0:   scale type: Gaussian
  0:   vtScales:  min: 5  mean: 7.17912  max: 15
  0: Diffusion: vertical iterations: 450
  0: Calculating vertical normalization...
  0: Info     : Writing file: ./staticb/vt_ocean.nc

on this branch:

  0: ==================================================================================
  0:  saber::Diffusion calibration
  0: ----------------------------------------------------------------------------------
  0:
  0: group 1 of 1
  0: Generating vertical scales
  0:   Reading from file
  0:   mask: NONE
  0:   scale type: Gaussian
  0:   vtScales:  min: -2.30585  mean: 7.85725  max: 32.6274
  0: Diffusion: vertical iterations: 2128
  0: Calculating vertical normalization...
  0: Info     : Writing file: ./staticb/vt_ocean.nc
  0: ==================================================================================

I don't know yet why this happens, I'll dig into what exactly is happening in our workflow here, compare some files and analyze.

Memory is actually worse here than on develop, especially for LETKF.
Var:
develop:

  0: OOPS_STATS IncrementalAssimilation start            - Runtime:     4.81 sec,  Local Memory:   470.56 MB
  0: OOPS_STATS Variational end                          - Runtime:   115.71 sec,  Local Memory:     4.19 GB
  0: OOPS_STATS Run end                                  - Runtime:    117.60 sec,  Memory: total:   418.78 GB, per task: min =   825.51 MB, max =  4189.71 MB

this branch:

  0: OOPS_STATS IncrementalAssimilation start            - Runtime:     5.85 sec,  Local Memory:   517.79 MB
  0: OOPS_STATS Variational end                          - Runtime:   156.98 sec,  Local Memory:     3.81 GB
  0: OOPS_STATS Run end                                  - Runtime:    159.86 sec,  Memory: total:   526.89 GB, per task: min =     1.05 GB, max =     3.81 GB

You'll notice the difference in runtime -- this is due to vertical diffusion behaving differently.

LETKF:
develop:

  0: OOPS_STATS LocalEnsembleDA before solver ctor       - Runtime:    30.10 sec,  Local Memory:     1.78 GB
  0: OOPS_STATS LocalEnsembleDA before solver            - Runtime:     97.67 sec,  Memory: total:   742.32 GB, per task: min =     2.83 GB, max =     5.08 GB
  0: OOPS_STATS Run end                                  - Runtime:    631.32 sec,  Memory: total:   898.97 GB, per task: min =     2.98 GB, max =     8.45 GB

this branch:

  0: OOPS_STATS LocalEnsembleDA before solver ctor       - Runtime:    11.51 sec,  Local Memory:     4.31 GB
  0: OOPS_STATS LocalEnsembleDA before solver            - Runtime:     77.66 sec,  Memory: total:  1420.82 GB, per task: min =     5.66 GB, max =     7.71 GB
  0: OOPS_STATS Run end                                  - Runtime:    546.03 sec,  Memory: total:  1597.63 GB, per task: min =     6.00 GB, max =     9.41 GB

@travissluka can you look into whether we can bring the memory down? I'll try to figure out why the vertical diffusion estimation is misbehaving.

travissluka · 2026-05-14T23:02:03Z

@shlyaeva 1) I'm able to reproduce the vertical correlation bug locally, so i can look into it on my side (it might have something to do with the levels in the yaml not matching up with the number of levels in the file, and the old code just silently ignored that mismatch... or not.. i dunno 2) I found the memory leak, the I/O code was forgetting to close file handles until the application exit.

Once I get number 1 fixed i'll push changes for a retest

Edit: I'm not actually able to reproduce the vertical correlation bug. Given the value range, that does oddly seem like it's reading in temperature values, not correlation values 🤔

…e checks Three related fixes in the direct-netCDF reader path: 1. read_var_strided previously hardcoded "last Fortran dim is trailing time" and set ct(file_ndims)=1 unconditionally. For a file with spatial-only layout (Temp(z,y,x) with no leading Time), this read only level 1 of z into a multi-level destination, leaving the rest uninitialized -- showing up as garbage scale values in the vertical diffusion calibration. Now discriminate file_ndims == dst_rank (no time, fill every dim from the file) vs file_ndims > dst_rank (trailing time + middle squeeze, e.g. CICE Tsnz_h's nksnow=1). A total-element-count check catches silent partial-fills and dim-size mismatches (e.g. file z=75 vs destination z=25). 2. Drop the read_cache / cached_open / soca_io_close_all machinery. Each reader_commit nf90_opens, reads all enqueued vars (with inline inq_varid + inquire_variable + inquire_dimension; microseconds), and nf90_closes. Holding NetCDF4 handles open across commits was bloating LETKF per-task memory by ~MB-to-GB scale (HDF5 metadata + chunk caches per open file). Restores per-task max to match develop. 3. Add check_buf_1d / check_buf_2d helpers called from every reader/writer enqueue. Catches unallocated and wrong-sized caller buffers at the enqueue site instead of silently storing the pointer for a later get_var/put_var to mishandle.

travissluka · 2026-05-15T18:19:45Z

Both items addressed in 2d7ab3c.

Vertical diffusion: your vt_scales file has Temp(z, y, x) with no leading Time dim. The reader hardcoded "last Fortran dim is trailing time" and sliced it to 1, so only z-level 1 read correctly; levels 2-N stayed uninitialized → your [-2.31 .. 32.6] garbage. Now discriminates file_ndims == dst_rank (no time) vs file_ndims > dst_rank (trailing time + CICE-style middle squeeze), with a total-element-count check to catch level-size mismatches too.

Memory: dropped the read-side file/var cache that was keeping NetCDF4 handles (and their HDF5 chunk/metadata caches) open across commits. Per-task max now matches develop on the local emulator.

72/72 ctests pass.

shlyaeva

I've retested cleanly (two separate experiments, the only difference is soca branch), and the latest fix fixed both the issue with reading vertical diffusion parameters, and with memory. Thank you @travissluka!

Var:
this branch:

Runtime:    132.91 sec,  Memory: total:   405.00 GB, per task: min =   800.67 MB, max =  3408.76 MB

develop (ran after this branch):

Runtime:    111.65 sec,  Memory: total:   418.95 GB, per task: min =   824.59 MB, max =  4297.21 MB

LETKF:
this branch:

Runtime:    541.38 sec,  Memory: total:   880.34 GB, per task: min =     2.91 GB, max =     6.54 GB

develop:

Runtime:    642.86 sec,  Memory: total:   899.06 GB, per task: min =     2.97 GB, max =     8.61 GB

Runtimes for the other tasks using soca I/O (bmat tasks, parallel recentering) have reasonably similar runtime and memory use.

DavidNew-NOAA

No objections on my end

eap · 2026-05-19T22:34:16Z

@travissluka
I tried building this with my spack-stack 2.1 environment and got the following build error. This could still be a problem specific to how I configured the build but I'd appreciate if you could check it out. Feel free to use the jedi-tools EC2 scripts to launch a spack-stack 2.1 VM.

[ 95%] Building Fortran object soca/src/soca/CMakeFiles/soca.dir/IO/soca_io_mod.F90.o
/home/ubuntu/gcc_jedi_bundle_test/jedi-bundle/soca/src/soca/IO/soca_io_mod.F90:671:52:

  671 |                        ishift=ishift, jshift=jshift)
      |                                                    1
Error: There is no specific subroutine for the generic ‘mpp_scatter’ at (1)
/home/ubuntu/gcc_jedi_bundle_test/jedi-bundle/soca/src/soca/IO/soca_io_mod.F90:689:52:

  689 |                        ishift=ishift, jshift=jshift)
      |                                                    1
Error: There is no specific subroutine for the generic ‘mpp_scatter’ at (1)
/home/ubuntu/gcc_jedi_bundle_test/jedi-bundle/soca/src/soca/IO/soca_io_mod.F90:711:56:

  711 |                            ishift=ishift, jshift=jshift)
      |                                                        1
Error: There is no specific subroutine for the generic ‘mpp_scatter’ at (1)
/home/ubuntu/gcc_jedi_bundle_test/jedi-bundle/soca/src/soca/IO/soca_io_mod.F90:715:56:

  715 |                            ishift=ishift, jshift=jshift)
      |                                                        1
Error: There is no specific subroutine for the generic ‘mpp_scatter’ at (1

travissluka · 2026-05-20T15:38:03Z

@travissluka I tried building this with my spack-stack 2.1 environment and got the following build error. This could still be a problem specific to how I configured the build but I'd appreciate if you could check it out. Feel free to use the jedi-tools EC2 scripts to launch a spack-stack 2.1 VM.

It turns out those ishift/jshift keywords were removed in the newer version of FMS that's in spack stack 2.1. I'll refactor to remove that dependency.

FMS 2025.02 removed the ishift/jshift optional args from mpp_scatter, breaking compilation on spack-stack 2.1. Pre-apply the shift in the indices we pass; behavior unchanged on FMS 2024.02 (spack-stack 1.9.2).

travissluka · 2026-05-20T16:06:18Z

Pushed in a547504. No answer changes expected. Note: this code path is dormant on this PR — wired up in the follow-on parallel-ensemble I/O PR — so it unblocks spack-stack 2.1 compile but isn't exercised functionally yet.

Dooruk

I ran two identical 0.25-deg multi-node 3DFGAT tests with a January 15th SOCA version and this branch. Residual norms are identical. I do see some improvement which is great, especially with the runtime in the new version:

These are with Spack-stack 1.9

This branch:

OOPS_STATS Run end      - Runtime:    255.88 sec,  Memory: total:  1140.07 GB, per task: min =     1.51 GB, max =     4.14 GB
Run: Finishing oops::Variational<SOCA, UFO and IODA observations> with status = 0
OOPS Ending   2026-05-20 14:41:34 (UTC+0000)

Jan15th:

OOPS_STATS Run end      - Runtime:    358.17 sec,  Memory: total:  1292.89 Gb, per task: min =     1.54 Gb, max =     4.51 Gb
Run: Finishing oops::Variational<SOCA, UFO and IODA observations> with status = 0
OOPS Ending   2026-05-20 14:39:44 (UTC+0000)

DavidNew-NOAA · 2026-05-20T18:57:18Z

@travissluka I was getting the following error compiling your branch with spack stack 2.1

/home/ubuntu/jedi/jedi-bundle/soca/src/soca/Geometry/soca_geom_mod.F90:34:5:

   34 | use time_interp_external_mod, only : time_interp_external_init
      |     1
Fatal Error: Cannot open module file ‘time_interp_external_mod.mod’ for reading at (1): No such file or directory
compilation terminated.

time_interp_external_mod was removed I believe in FMS/2025.1. Changing time_interp_external_mod to the new time_interp_external2_mod fixes the compilation error.

travissluka · 2026-05-20T20:44:01Z

@travissluka I was getting the following error compiling your branch with spack stack 2.1
/home/ubuntu/jedi/jedi-bundle/soca/src/soca/Geometry/soca_geom_mod.F90:34:5:

   34 | use time_interp_external_mod, only : time_interp_external_init
      |     1
Fatal Error: Cannot open module file ‘time_interp_external_mod.mod’ for reading at (1): No such file or directory
compilation terminated.
time_interp_external_mod was removed I believe in FMS/2025.1. Changing time_interp_external_mod to the new time_interp_external2_mod fixes the compilation error.

@DavidNew-NOAA Well, that's obnoxious. Does time_interp_external2_mod exist in the earlier FMS? (e.g. if we switch that now, does it break backward compatibility with spack-stack < 2.1 ?) Since NASA has now approved this PR, I think we (@shlyaeva , @guillaumevernieres ) should go ahead and merge this, and the final spack stack 2.1 fixes can be applied afterward

DavidNew-NOAA · 2026-05-20T20:59:57Z

@travissluka I concur.

shlyaeva · 2026-05-20T21:07:32Z

Sounds good to me, I'm merging since both I and Doruk approved. If there are remaining issues, let's do follow-up PRs!

Dooruk · 2026-05-20T21:08:22Z

Have other centers switched to spack 2 or 2.1?

FWIW, I naively googled is "time_interp_external2_mod" in fms/2024.02 and found the following:

https://egitlab.gfdl.noaa.gov/NOAA-GFDL/FMS/-/blob/2024.02/time_interp/time_interp_external2.F90?ref_type=tags

Welp, nevermind 😄

Rebased onto develop after the direct-netCDF soca_io_mod PR (#1242) landed. End-to-end ensemble-parallel I/O on top of the stateless direct-netCDF reader. Opt-in via the new oops StateSet/IncrementSet bulk-readEnsemble/writeEnsemble dispatch path; soca provides the model-level implementations. State/Increment: - State::{readEnsemble,writeEnsemble} and Increment::{readEnsemble, writeEnsemble} (Fields + interface Fortran bindings to match) - writeEnsemble: gather each member to a strided writer PE (soca_io_ensemble_root_pe assignment), then phase-2 per-member netCDF writes happen concurrently across writer PEs - readEnsemble: per-member loop today, honoring single-state read mode (strided vs scatter); parallel-across-members reads via soca_io_readers_commit_ensemble are wired for a follow-up soca_io_mod: - writer staging split into define/gather/write/close phases plus soca_io_writers_commit_ensemble; reader staging mirrored with read/distribute/close and soca_io_readers_commit_ensemble. Reader opens+closes ncid inside stage_read (no global file-handle cache, matching develop's stateless model post-2d7ab3cf) - new public knobs: soca_io_ensemble_write_parallel, soca_io_ensemble_read_scatter, soca_io_single_state_read_scatter, soca_io_async_mpi, soca_io_ensemble_root_pe - mpi_pelist_and_comm helper for the writer/reader PE subsets Geometry: - soca_io_config_from_yaml called from soca_geom_init resolves the parallel/sequential, scatter/strided, async-mpi knobs from geometry.io once; values persist module-level for the run Companion to JCSDA-internal/oops setID fix in StateSet::buildFromConfigs (needed so PseudoModel and HTLM dispatch see correct per-member IDs when StateSet takes the HasReadEnsemble path).

travissluka added 10 commits May 11, 2026 13:35

soca letkf ctest: document soca_io read mode block

4652de9

Annotates testinput/letkf.yml's geometry block with the two read-mode options so a contributor reading the canonical letkf example sees both choices without having to hunt for the soca_io_mod docstring.

Merge remote-tracking branch 'origin/develop' into feature/soca-direc…

8c16936

…t-io

soca_io_mod: fix stale docstring about output dim names

1449bc6

The write path emits FMS-style auto-numbered dim names (xaxis_N / yaxis_N / zaxis_N / Time); the file-header docstring incorrectly claimed xh / yh / Time. Correct the comment -- no code change.

Copilot AI review requested due to automatic review settings May 13, 2026 02:51

travissluka self-assigned this May 13, 2026

Copilot started reviewing on behalf of travissluka May 13, 2026 02:51 View session

travissluka requested review from DavidNew-NOAA, DavidSRussell, eap, guillaumevernieres and shlyaeva May 13, 2026 02:54

Copilot AI reviewed May 13, 2026

View reviewed changes

travissluka added 2 commits May 12, 2026 21:02

claude was BAD and forgot to commit this

4f33324

Drop dangling references to the removed soca_io_read_mode_from_config / 'io.read mode' YAML key in soca_geom_mod and the letkf test input. The selector was already gone from soca_io_mod; without these the PR fails to compile.

shlyaeva approved these changes May 14, 2026

View reviewed changes

travissluka mentioned this pull request May 14, 2026

Parallel ensemble I/O for State and Increment #1243

Draft

3 tasks

shlyaeva requested changes May 14, 2026

View reviewed changes

travissluka requested a review from shlyaeva May 15, 2026 18:25

shlyaeva approved these changes May 18, 2026

View reviewed changes

DavidNew-NOAA approved these changes May 18, 2026

View reviewed changes

shlyaeva requested a review from Dooruk May 19, 2026 20:06

soca_io_mod: drop mpp_scatter ishift/jshift kwargs (FMS 2025.02 compat)

a547504

FMS 2025.02 removed the ishift/jshift optional args from mpp_scatter, breaking compilation on spack-stack 2.1. Pre-apply the shift in the indices we pass; behavior unchanged on FMS 2024.02 (spack-stack 1.9.2).

Dooruk approved these changes May 20, 2026

View reviewed changes

travissluka added the ready for merge PR is reviewed and is ready for merge label May 20, 2026

shlyaeva merged commit 4b0c6a8 into develop May 20, 2026
3 checks passed

shlyaeva deleted the feature/soca-direct-io branch May 20, 2026 21:07

travissluka mentioned this pull request May 21, 2026

Use time_interp_external2_mod for FMS 2.1 compatibility #1245

Merged

3 tasks

Conversation

travissluka commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Issue(s) addressed

Dependencies

Impact

Manual Testing Instructions (optional)

Checklist

Uh oh!

travissluka commented May 13, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

travissluka commented May 13, 2026

Uh oh!

shlyaeva commented May 13, 2026

Uh oh!

shlyaeva left a comment

Choose a reason for hiding this comment

Uh oh!

travissluka commented May 14, 2026

Uh oh!

shlyaeva left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

travissluka commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

travissluka commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shlyaeva left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DavidNew-NOAA left a comment

Choose a reason for hiding this comment

Uh oh!

eap commented May 19, 2026

Uh oh!

travissluka commented May 20, 2026

Uh oh!

travissluka commented May 20, 2026

Uh oh!

Dooruk left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DavidNew-NOAA commented May 20, 2026

Uh oh!

travissluka commented May 20, 2026

Uh oh!

DavidNew-NOAA commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shlyaeva commented May 20, 2026

Uh oh!

Uh oh!

Dooruk commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

travissluka commented May 13, 2026 •

edited

Loading

shlyaeva left a comment •

edited

Loading

travissluka commented May 14, 2026 •

edited

Loading

travissluka commented May 15, 2026 •

edited

Loading

shlyaeva left a comment •

edited

Loading

Dooruk left a comment •

edited

Loading

DavidNew-NOAA commented May 20, 2026 •

edited

Loading

Dooruk commented May 20, 2026 •

edited

Loading