Migrate SOCA IO from fms_io to fms2_io#1241
Conversation
Replace fms_io_mod with fms2_io_mod across all Fortran source files: - soca_balance_mod.F90: Replace read_data from fms_mod with open_file/read_data/close_file from fms2_io_mod - soca_geom_mod.F90: Replace restart_file_type with FmsNetcdfDomainFile_t, register_restart_field/restore_state/save_restart with open_file/register_restart_field/read_restart/write_restart/close_file, remove fms_io_init/fms_io_exit and associated counter - soca_fields_mod.F90: Same pattern replacements plus replace file_exist/field_exist with open_file/variable_exists Co-authored-by: DavidNew-NOAA <134300700+DavidNew-NOAA@users.noreply.github.com> Agent-Logs-Url: https://github.com/DavidNew-NOAA/soca/sessions/79ffa8fd-153d-4e6d-a50b-0f9b31130309
Co-authored-by: DavidNew-NOAA <134300700+DavidNew-NOAA@users.noreply.github.com> Agent-Logs-Url: https://github.com/DavidNew-NOAA/soca/sessions/79ffa8fd-153d-4e6d-a50b-0f9b31130309
…art for writes Use read_data() for all file reads instead of register_restart_field + read_restart, since input files (gridspec, ocean data, ice data) may not have restart metadata. Add is_restart=.true. to open_file for write operations so register_restart_field + write_restart works correctly when creating new files. Agent-Logs-Url: https://github.com/DavidNew-NOAA/soca/sessions/40ac97e7-ffa8-4a2f-bc1b-e089f96badc9 Co-authored-by: DavidNew-NOAA <134300700+DavidNew-NOAA@users.noreply.github.com>
FMS2's register_restart_field requires explicit dimension names when writing (non-read-only). Added: - ["xaxis_1"] for 1D x-axis variables (lonh, lonq) - ["yaxis_1"] for 1D y-axis variables (lath, latq) - ["xaxis_1", "yaxis_1"] for 2D domain-decomposed variables - ["xaxis_1", "yaxis_1", "zaxis_1"] for 3D domain-decomposed variables Agent-Logs-Url: https://github.com/DavidNew-NOAA/soca/sessions/4a7f62e8-b633-428b-acfa-7788b715560a Co-authored-by: DavidNew-NOAA <134300700+DavidNew-NOAA@users.noreply.github.com>
|
@DavidNew-NOAA and @eap could you please discuss and decide which PR should be reviewed and tested, and let us know. |
|
If people like it. Yes! If there's big issues, then probably no. Maybe move #1239 to a draft for the moment. In any case your inital work was useful and preserved, so this final squashed PR will be a coauthor commit. Edit: @DavidNew-NOAA It wasn't my intention to scoop you here; I just got carried away by a cocktail of Claude and schedule pressure. Also if you have any unsubmitted work that improves this PR, feel free to push it directly to my branch. |
|
@eap No worries, thanks for your work on this |
|
OK, not great news. Just as in my previous half-hearted attempt to switch to FMS2, this is noticeably slower than develop, both for read and for write. I ran a 3DVar and an LETKF as in gdasapp configuration, and compared to my previous run from a month or so ago. Comparing I/O: State ctor in LETKF (includes read for ensembles): 8s vs 205s Total runtimes (use these estimates with caution, as the codebase in general changed quite a bit since the previous run, so it's not an apples to apples comparison): 90s vs 140s for Var and 570s vs 1100s for LETKF (it was always dominated by I/O, and now a lot more so). @DavidNew-NOAA from your experience adapting fv3-jedi, do you have any ideas/tricks we can apply here to make this faster? |
|
@shlyaeva My FMS2 IO implementation in FV3-JEDI is plain vanilla FMS. I'm going to review this PR tomorrow morning and see what I can find. |
The seaice reads in soca_fields_read_seaice routed 4D/5D buffers through fms2_io's domain_read_4d/domain_read_5d, which loop on root issuing one nf90_get_var per io-domain PE (FMS source comment acknowledges these "do not use mpp_scatter yet"). At N PEs that is N small disk reads per variable instead of one bulk read. Add direct-netCDF helpers in soca_utils that open on root only, do one nf90_get_var of the global cube, broadcast to all PEs via mpp_broadcast, and have each PE slice its compute-domain portion. Same idiom already used in soca_ciceutils_mod. Convert the three seaice read blocks (cice_vars_cats, cice_vars_cats_levs, cice_vars_snow) to use them. Ocean reads (2D/3D) stay on fms2_io's read_data — those already hit the fast bulk-read + mpp_scatter path in domain_read_2d/3d. The write path is untouched. Bit-for-bit reproducible: root reads the same bytes, broadcast is exact, each PE slices the same compute-domain region.
This reverts commit 914ba4b.
|
This new change is even more disruptive but it ditches the fms2_io in many cases and uses netcdf reads directly. @DavidNew-NOAA @shlyaeva - do you know a good way to performance profile or test this? The unit-tests aren't heavy enough to show any change (they didn't flag the regression, so I assume they won't be likely to show improvement). Alternatively, if either of you were willing to test this latest change. |
|
@eap I will defer to @shlyaeva about whether it's appropriate to replace FMS routines with direct netcdf calls since she's more involved with SOCA than I am. For the reads I would at least like to see how the batched reads perform using the @BenjaminRuston Do you have any general advice for maintaining IO performance when transitioning from FMS IO to FMS2 IO? See #1241 (comment) |
|
I've tested the latest commit, it's not better unfortunately (about the same as the previous version. I did make sure I rebuilt, because I didn't believe it at first). @DavidNew-NOAA and I will meet and discuss how to proceed. |
|
I had dreams at one point of rewriting the soca IO with raw netcdf calls, bypassing fms, and with some custom MPI magic in the background so that we could do per-ensemble member I/O (see #1125) maybe now would be a good time to revive that effort? I could have my "assistant" give it a try if you think it would be worth it. |
|
@travissluka Yes, we're on the same wavelength. I was just brainstorming something similar myself. I'm will get in touch |
|
I've pushed a commit reverting the netcdf fms2_io bypass reads. As @shlyaeva noted, there was no performance gain and since the complexity was significantly higher I figure it isn't worth keeping. Anyways the change remains in the history of this branch (see link above) if there's any interest |
|
@eap I had a tagup with @DavidNew-NOAA and @shlyaeva . i think Dave is going to tackle looking at your PR here to see if there is anything he can do to make fms2 I/O go faster. Simultaneously, I'll take a look at implementing #1125 to see if I can have a raw netcdf IO in soca that can use a parallel state read/write from LETKF to substantially speed up the ensemble I/O. I think they're okay with slower I/O, so long as the LETKF isn't THAT horribly slower. |
|
I've pushed By default, the IO operates the same as FMS IO (a single read on PE 0 followed by an MPI scatter). There is also a strided parallel read mode that might be worth testing, enabled by changing the yaml (see the letkf ctest yaml) Don't run those tests back-to-back on the same nodes, though, the Linux page cache makes the 'strided' mode look 5–10× faster, but that benefit won't carry over to real-world cycling. I'm curious if either of these modes performs the same as the old FMS IO v1 did. If it doesn't, then something weird is going on, and we should look at other causes, because it might not be the FMS IO v2 itself that's misbehaving. The changes I have in that branch should be a nearly direct soca-specific port of the FMS IO v1 code. If the tests DO work for you, then the next step will be modifying this to work with parallel ensemble reads/writes, but the branch would hopefully be good enough to merge as is without having the FMS IO dependency anymore. |
|
Thanks for the update @travissluka . I looked into FMS2 IO performance improvements today. The latest version of FMS has MPI-IO and collective mode IO as options. This should allow reading/writing from multiple ranks. Unfortunately, this newest version of FMS is not in Spack Stack 1.9, and I'm still working on getting FV3-JEDI and FV3-JEDI-LM to compile with spack stack 2.1, so I can build the JEDI bundle with 2.1. Once I have that done, I can test the changes I made in feature/fms-to-fms2-extended-mpiio . Hopefully tomorrow. |
|
@travissluka I've rerun Var and LETKF cases with |
|
I've rerun the develop benchmark case to ensure that all the other branches are consistent and I'm comparing apples to apples. This did increase the overall time for the "develop" runs, but didn't change my analysis much. Here's relevant comparison: Var:
LETKF:
The interesting findings here are:
|
|
I have now also ran Var:
LETKF:
Looks like |
|
thanks @shlyaeva for testing! I'll change the |
|
Thank you @travissluka the hard work. I've literally never been happier to have my PR rejected. |
* Replace FMS I/O with direct netCDF (soca_io_mod) Introduces src/soca/IO/soca_io_mod.F90 with soca_io_file_reader / soca_io_file_writer types (init/enqueue/commit/close), and routes all state, geometry, and balance-coefficient I/O through it. PE 0 calls nf90_get_var / nf90_put_var; mpp_broadcast and mpp_gather distribute / collect data on the geometry's pelist. Removes all fms_io_mod usages (register_restart_field, restore_state, save_restart, read_data, file_exist, field_exist, fms_io_init, fms_io_exit) from soca, including the global_soca_geom_counter shim. fms_init / fms_end are retained (subsystem init only, no I/O). Resolves the GDAS Tsnz_h shape mismatch by building nf90_get_var count from the file's actual dim sizes, not the caller's buffer rank. Per-domain state writes (ocn/sfc/ice/wav/bio) and reads now use the new module; the FMS code paths are removed in the same change rather than gated. Refs #1125. Obsoletes PR #1241. * soca_genfilename: append .nc to output filenames Under FMS, the writer would implicitly add the .nc suffix, so test YAMLs and downstream readers reference 'ocn.<exp>.<typ>.<date>.nc'. The direct-netCDF writer in soca_io_mod is faithful to whatever string it's given, so files were landing without the extension and no consumer could find them. * test: fix test_socs_parameters_diffusion typo in TEST_DEPENDS Six soca_add_test() calls listed test_socs_parameters_diffusion as a dependency (should be test_soca_). Because the named test does not exist, ctest treated the dependency as void and ran the variational tests concurrently with parameters_diffusion under -j>1, causing races on the diffusion calibration outputs. * soca_io_mod: per-PE strided reads + file-handle / var-metadata cache Drops the SOCA_IO_SERIAL / SOCA_IO_PARALLEL mode bifurcation (and the method= arg + io.method config knob) and rewrites the reader to do per-PE strided nf90_get_var, mirroring FMS mpp_io's domain-decomposed read pattern. Each PE opens the file once via NF90_NOWRITE and reads its own compute-domain tile via nf90_get_var(start, count) -- the same shape as fms_io's READ_RECORD_ for fileset.NE.MPP_SINGLE. Adds a module-level read cache mirroring fms_io's get_file_unit + files_read(i)%var(j) tables: - one nf90_open per (PE, filename) for the whole run - cached (varid, file_ndims, middle_dim sizes) per (file, var) - soca_io_close_all() flushes the cache; wired into soca_geom_end Caller updates: drop method= from reader/writer init() and the soca_io_method_from_config helper from soca_fields_mod / soca_geom_mod / soca_balance_mod. 1-deg 20-mem LETKF read phase (LocalEnsembleDA before solver ctor): baseline (FMS): 3.91 s before this commit (no cache): 13.92 s with this commit: 3.97 s Outputs are bit-identical. * soca_io_mod: add broadcast read path, default to it (real-world cold cache) Adds a second reader_commit implementation -- PE 0 nf90_get_var the global field + mpp_broadcast -- alongside the existing per-PE strided path. Both paths share the file-handle + var-metadata cache. Select via SOCA_IO_READ_MODE=broadcast|strided (default broadcast); the env var is latched on first reader_commit. Why broadcast as default: in DA cycling the page cache is always cold, since each cycle reads files that the previous cycle / model run just wrote. On cold cache the broadcast path (1 sequential read on PE 0, kernel readahead happy) beats strided (8 concurrent strided readers thrash the prefetcher and pay 8x the open-syscall cost). 1-deg 20-mem LETKF, rancor, taskset -c 0-7, cold cache (sample 1): broadcast: 13.5 s read strided: 30.5 s read Warm cache reverses the result (strided 3.9 s vs broadcast 13.3 s) but cycling never sees warm cache. On a parallel filesystem (Lustre/GPFS) or multi-node setup the strided path may win again -- toggle the env var to test. * soca_io_mod: move read-mode selector from env var to YAML Drops SOCA_IO_READ_MODE in favor of a yaml block on the geometry config: geometry: io: read mode: broadcast | strided If absent the module-level default (broadcast) is kept. soca_geom_init calls soca_io_read_mode_from_config(f_conf) once at startup; the choice latches for every subsequent reader_commit. Unrecognized values abort so typos surface immediately. * soca letkf ctest: document soca_io read mode block Annotates testinput/letkf.yml's geometry block with the two read-mode options so a contributor reading the canonical letkf example sees both choices without having to hunt for the soca_io_mod docstring. * soca_io_mod: fix stale docstring about output dim names The write path emits FMS-style auto-numbered dim names (xaxis_N / yaxis_N / zaxis_N / Time); the file-header docstring incorrectly claimed xh / yh / Time. Correct the comment -- no code change. * soca_io_mod: pointer-based writer, 3D mpp_gather, cleanup Writer: - enqueue holds pointers to caller buffers instead of allocating and copying (mirrors FMS register_restart_field contract). Compute-slice extraction moves from enqueue to commit, so peak per-writer memory drops from sum(var_bytes) to one (nx_c x ny_c) tile. - commit uses the 3D mpp_gather overload for 3D vars: one collective per var instead of nlevels, and no per-level gbuf3d(:,:,k) = gbuf2d memcpy on root. - Reuse gbuf3d / tile3 across 3D vars when nlevels matches. Reader: - commit_reader_strided hoists tile2 out of the per-var loop and reuses tile3 / tile4 when trailing dims match. - Dead tile3/tile4 zero-inits removed (read_var_strided fully fills the buffer). read_var_strided: fixed-size stack arrays for st / ct instead of per-call heap allocation. put_axis_coord_data: reuse idxbuf across axes. Cleanup: - Remove dead 'use mpi' and 'use fckit_configuration_module'. - Remove dead cartesian_axis parameter on writer_enqueue_1d (it was stored but never written to the netcdf output). - Drop unused nprocs out-param of mpi_pelist. - soca_io_close_all: use ncc on nf90_close instead of silently discarding the status. - commit_reader_scatter: matching dead-init cleanup; comment updated to flag it as pending parallel-ensemble I/O exercise. - Trim verbose / redundant comments. * claude was BAD and forgot to commit this Drop dangling references to the removed soca_io_read_mode_from_config / 'io.read mode' YAML key in soca_geom_mod and the letkf test input. The selector was already gone from soca_io_mod; without these the PR fails to compile. * Address Copilot review: TARGET attribute + gbuf non-root dummy + comment - soca_io_mod: allocate gbuf2d / gbuf3d as 1x1 dummies on non-root so the actuals passed to mpp_gather are always allocated (assumed-shape dummy requires it). Pattern matches commit_reader_scatter. - soca_io_mod: refresh stale 'Data is copied in' comment to reflect the pointer-based enqueue semantics (caller buffer kept alive/unmutated through commit; actual must satisfy TARGET association rules). - soca_geom_mod: TARGET on the 'self' dummy in soca_geom_init, soca_geom_init_fieldset, and soca_geom_write so the allocatable components (self%lonh, self%lat, ..., self%mask2d*) are valid pointer targets for the soca_io reader/writer enqueue. TARGET on the local fieldData / fieldDataVars too. - soca_fields_mod: TARGET on h_common and on the local vars(:) wrapper array so vars(n)%data sections used in enqueue are valid pointer targets. - soca_balance_mod: TARGET on local kct for the same reason. * soca_io_mod: fix no-time-dim reads, drop reader cache, add buffer-size checks Three related fixes in the direct-netCDF reader path: 1. read_var_strided previously hardcoded "last Fortran dim is trailing time" and set ct(file_ndims)=1 unconditionally. For a file with spatial-only layout (Temp(z,y,x) with no leading Time), this read only level 1 of z into a multi-level destination, leaving the rest uninitialized -- showing up as garbage scale values in the vertical diffusion calibration. Now discriminate file_ndims == dst_rank (no time, fill every dim from the file) vs file_ndims > dst_rank (trailing time + middle squeeze, e.g. CICE Tsnz_h's nksnow=1). A total-element-count check catches silent partial-fills and dim-size mismatches (e.g. file z=75 vs destination z=25). 2. Drop the read_cache / cached_open / soca_io_close_all machinery. Each reader_commit nf90_opens, reads all enqueued vars (with inline inq_varid + inquire_variable + inquire_dimension; microseconds), and nf90_closes. Holding NetCDF4 handles open across commits was bloating LETKF per-task memory by ~MB-to-GB scale (HDF5 metadata + chunk caches per open file). Restores per-task max to match develop. 3. Add check_buf_1d / check_buf_2d helpers called from every reader/writer enqueue. Catches unallocated and wrong-sized caller buffers at the enqueue site instead of silently storing the pointer for a later get_var/put_var to mishandle. * soca_io_mod: drop mpp_scatter ishift/jshift kwargs (FMS 2025.02 compat) FMS 2025.02 removed the ishift/jshift optional args from mpp_scatter, breaking compilation on spack-stack 2.1. Pre-apply the shift in the indices we pass; behavior unchanged on FMS 2024.02 (spack-stack 1.9.2).
Replaces all uses of the legacy fms_io_mod API with the FMS2 fms2_io_mod API across SOCA's IO paths. Also fixes a latent bug in reading GDAS-format sea ice files, exposed because FMS2 enforces stricter array-shape matching than FMS1. This PR is based on the change here #1239 originally authored by @DavidNew-NOAA.
Motivation: we are seeking to do a jedi-bundle release soon (end of May), and the fms IO api migration is blocking the release (see reviewer notes).
Details:
New helper soca_register_domain_axes: FMS2 domain-decomposed reads require the x and y axes to be explicitly registered before calling read_data. SOCA reads files with several different horizontal dimension names (staggered grids: lonh/lath, lonq/latq, ni/nj, etc.), so axis registration cannot be hardcoded. The new soca_register_domain_axes utility discovers all dimensions in the open file, matches them against the domain's global nx/ny, and registers matching ones as 'x' or 'y'.
Bug fix: GDAS ice file snow temperature (Tsnz_h) GDAS ice restart files store snow layer temperature as Tsnz_h(time, nc=5, nksnow=1, nj, ni) — with an explicit nksnow dimension even though there is only one snow layer. FMS1 silently read this into a 3D (x, y, nc) buffer. FMS2 enforces shape consistency and requires a 4D buffer (x, y, nksnow, nc).
Reviewer notes:
Truth be told; fortran, netcdf, and model io semantics are all domains where I have minimal experience. This pull request was created with extensive AI help and heavy lifting. I have done my best to understand it and I find it reasonable, but if there were a more parsomoniuous way to achieve the same goal I am not the person likely to find it. Similarly, if this introduces hazards I'm unlikely to find them. So please review carefully and understand that I have no emotional attachment to this code. (if it not for the upcoming release, I would not be the one initiating this PR).
Issue(s) addressed
Resolves #1225
run-ci-on-draft = true
jedi-ci-test-select=gcc