Conversation

@Mikolaj-A-Kowalski
Contributor

Closes #35

Writes a set of CSV files to a folder instead of pickling the output data class. This is more human-readable and should make reading the data in post-processing scripts easier.

The drawback is that if new members of an unsupported type are introduced they may not be written (a warning is raised in that case), and adding an extra case to the export functions might be necessary.

I will need your input @jgwalkup to make sure that this is more or less what you had in mind (hence the draft). I don't feel 100% comfortable with the physics (biology?) that the code is doing, so I need you to make sure I am writing everything that needs to be written and that the output format is useful ;-)

What the code now does: instead of serialising the Output class to a pickle archive, the data is written to a folder filled with CSV files.

Basically for each pd.DataFrame member of Output a separate CSV file is written. The pd.Series and Number data members are collected and written to series.csv and scalars.csv files respectively.

The Initialisation dictionary is written to a sub-folder in a similar fashion, with the difference that np.ndarray and pd.Series members are written to separate files.
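
For illustration, a minimal sketch of the export logic (the Output class layout and member names here are stand-ins, not the real DEMENTpy code):

import numbers
import warnings
from pathlib import Path

import pandas as pd


def export_output(output, folder: Path) -> None:
    # Write each supported member of `output` to CSV files under `folder`
    folder.mkdir(parents=True, exist_ok=True)
    series = {}   # collected pd.Series -> series.csv
    scalars = {}  # collected Numbers   -> scalars.csv

    for name, value in vars(output).items():
        if isinstance(value, pd.DataFrame):
            value.to_csv(folder / f"{name}.csv")
        elif isinstance(value, pd.Series):
            series[name] = value
        elif isinstance(value, numbers.Number):
            scalars[name] = value
        else:
            warnings.warn(f"Member '{name}' has unsupported type "
                          f"{type(value).__name__} and was not written")

    if series:
        pd.DataFrame(series).to_csv(folder / "series.csv")
    if scalars:
        pd.Series(scalars).to_csv(folder / "scalars.csv")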

After the run one should obtain a listing in the output folder like the following (for outname = 7778):

7778/
├── EnzymesSeries.csv
├── Enzyme_TaxonSeries.csv
├── Growth_yield.csv
├── Initialization
│   ├── Ea.csv
│   ├── EnzAttrib.csv
│   ├── EnzGenes.csv
│   ├── EnzProdConstit.csv
│   ├── EnzProdConsti_trait.csv
│   ├── EnzProdInduce.csv
│   ├── EnzProdInduci_trait.csv
│   ├── Enzymes.csv
│   ├── Km0.csv
│   ├── Microbes.csv
│   ├── Microbes_pp.csv
│   ├── MinRatios.csv
│   ├── MonInput.csv
│   ├── Monomer_ratio.csv
│   ├── Monomers.csv
│   ├── MonomersProduced.csv
│   ├── OsmoGenes.csv
│   ├── OsmoProdConsti.csv
│   ├── OsmoProdConsti_trait.csv
│   ├── OsmoProdInduci.csv
│   ├── OsmoProdInduci_trait.csv
│   ├── ReqEnz.csv
│   ├── scalars.csv
│   ├── SubInput.csv
│   ├── Substrates.csv
│   ├── TaxDroughtTol.csv
│   ├── Uptake_Ea.csv
│   ├── UptakeGenesCost.csv
│   ├── UptakeGenes.csv
│   ├── UptakeGenes_trait.csv
│   ├── Uptake_Km0.csv
│   ├── Uptake_ReqEnz.csv
│   ├── Uptake_Vmax0.csv
│   └── Vmax0.csv
├── microbes_grid_taxa.csv
├── MicrobesSeries.csv
├── MicrobesSeries_repop.csv
├── Microbial_traits.csv
├── MonomersSeries.csv
├── Osmolyte_TaxonSeries.csv
├── Runtime.csv
├── scalars.csv
├── series.csv
├── SubstratesSeries.csv
└── Taxon_count.csv

@sjavis
Contributor

sjavis commented Jun 23, 2025

These are the main points from our discussion earlier (although we didn't come to hard conclusions):

  • We could separate writing of the initialization data from the output data. This could make reading and writing the outputs more straightforward. It could also enable writing multiple times during the run without writing the initialization data every time.

  • It might be better to have a more standardized format that allows for the different dimensional outputs, e.g. using netCDF files. This might also allow the outputs to be written to a single file, which would simplify reading the data and could make it more efficient to read specific variables / dimensions rather than needing to read the whole dataset (see the sketch after this list).

  • We want it to be easy to specify which variables should be output, possibly through an input file. This would allow the user to output fewer variables to lower memory usage, or more / all variables.

  • Having modified the output file types, we will want an updated extraction script to convert ensemble outputs into the same format of CSV files that is currently used.
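
As a rough illustration of the netCDF point above, multi-dimensional variables of different shapes can live side by side in one file via the netCDF4 Python library (the variable and dimension names here are invented for the example):

import numpy as np
from netCDF4 import Dataset

# One file holding outputs of different dimensionality
with Dataset("output.nc", "w") as ds:
    ds.createDimension("x", 100)
    ds.createDimension("y", 100)
    ds.createDimension("taxa", 2)
    ds.createDimension("time", None)  # unlimited, grows as data is appended

    biomass = ds.createVariable("biomass", "f8", ("time", "x", "y", "taxa"))
    biomass[0, :, :, :] = np.zeros((100, 100, 2))

# Reading back only a slice avoids loading the whole dataset
with Dataset("output.nc") as ds:
    first_taxon = ds["biomass"][0, :, :, 0]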

Modify this or reply if I missed anything or got anything wrong.

@jgwalkup
Collaborator

@sjavis Thanks for summarizing everything we discussed. I updated the project board with issues #40 (updated extraction script) and #41 (Add option to select model output).

I think that @bioatmosphere's comment on #35, that the model output dimensionality is already high and can go higher, makes a good case for using netCDF as the default export format.

@Mikolaj-A-Kowalski I think that these changes you have made are in line with what I was asking for. I want to be able to export files in CSV format, but I think it should be included as an optional feature in the updated post processing data extraction script.

@Mikolaj-A-Kowalski
Contributor Author

Mikolaj-A-Kowalski commented Jun 27, 2025

Hi @jgwalkup thanks for the comments!

Also, I am thinking that perhaps we should move this discussion back to #35, since this PR is most likely not going to be merged (it will be cleaner to open a new one for whatever we eventually settle on).

I agree that the combination of netCDF (probably the version 4 format with multiple groups?) and some post-processing scripts is a good idea. But I am also thinking that we should pivot the discussion a bit and focus on the export/import interface that would be optimal for your use, since I think that is the first thing we should clarify. Then it will be easier to figure out how exactly to structure the (netCDF) output file.

For the reading of the data I imagine something along the following lines:

  • We would store the result of each replica in an independent file in some output folder
  • Then we would have a "post-processing" module in the DEMENTpy package, that could be used like:
# Assumes we have made dementpy a Python package already
import numpy as np
from dementpy.postprocessing import OutputDatasets

with OutputDatasets("./path/to/output_folder/prefix-*.nc") as out:
    iterator_over_replicas, axis = out.get_iter_over("output_variable_name")
    # `axis` will be a tuple of dimension labels, e.g. ("x", "y", "taxa")

    # `iterator_over_replicas` will yield a rank == len(axis) numpy array with the data from each replica
    # (or, to avoid unnecessary copies, we could just return e.g. a netCDF Variable handle?)
    # For example, to calculate the mean over replicas of the per-taxon totals (summing over "x" and "y"):
    mean = np.mean([np.sum(data, axis=(0, 1)) for data in iterator_over_replicas], axis=0)

Although I am not sure if the usage example is representative of how you need to access the data.

On the other side, we need to specify some interface to register and write the output variables to the output file. I was thinking along these lines:

out = OutputFile(
    fname="./output/name.nc",
    dimensions={
        "x": np.linspace(-1.0, 1.0, 100),
        "y": np.linspace(-1.0, 1.0, 100),
        "taxa": ["taxa0", "taxa1"],
    },
)

# Each output variable is given a reader function that extracts the relevant
# multi-dimensional data from `Grid` into a numpy.ndarray.
# The shape will be checked inside the `OutputFile` on each extraction.
def read_ones(eco: Grid) -> np.ndarray:
    # Normally we would use data from the Grid;
    # for the purpose of the example we return just ones
    return np.ones((100, 100))


out.add_output_variable(
    name="ones",
    dims=("x", "y"),
    interval=20,
    reader_func=read_ones,
)


with out.open():
    ...  # Simulation code
    for pulse in range(n_pulse):
        ...  # Simulation code
        for cycle in range(n_cycle):
            # Simulation code
            out.write(cycle=pulse * n_cycle + cycle, grid_obj=Ecosystem)

Basically, we would need to pre-define the dimensions at the start and then just register the different output variables, each with an instruction (a function) for how to extract that variable from the Grid at a particular state.
Handling of the write intervals could get a little bit tricky, since we would need to define an unlimited time dimension for each interval value (to avoid large sections of missing data; I don't think netCDF optimises space usage in that case), but it is feasible.
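
For what it is worth, here is a rough netCDF4 sketch of the per-interval unlimited time dimensions I mean (all names invented):

import numpy as np
from netCDF4 import Dataset

ds = Dataset("output.nc", "w")
ds.createDimension("x", 100)
ds.createDimension("y", 100)

# One unlimited time dimension per write interval, so a variable written
# every 20 cycles leaves no gaps next to one written every cycle
ds.createDimension("time_every_1", None)
ds.createDimension("time_every_20", None)

dense = ds.createVariable("dense_var", "f8", ("time_every_1", "x", "y"))
sparse = ds.createVariable("sparse_var", "f8", ("time_every_20", "x", "y"))

# Writing one index past the current end grows the unlimited dimension
sparse[sparse.shape[0], :, :] = np.ones((100, 100))
ds.close()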

At the later stage we could automate registration of output variables based on some user input file.

Also, I think errors in data extraction should not cause the calculation to terminate. The output file would catch and log them, and at the end of the simulation print that some variables were not captured as a result.
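
As a sketch, something like this inside the (hypothetical) OutputFile.write:

import logging

logger = logging.getLogger(__name__)

class OutputFile:
    # ... construction and variable registration as sketched above ...

    def write(self, cycle, grid_obj):
        for var in self._variables:  # the registered output variables
            try:
                data = var.reader_func(grid_obj)
                self._store(var.name, cycle, data)  # hypothetical internal writer
            except Exception:
                # Log and carry on; report the missed variables at the end of the run
                logger.exception("Failed to extract '%s' at cycle %d", var.name, cycle)
                self._missed.add(var.name)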

@dorchard

Can this be merged now given that it has been approved @Mikolaj-A-Kowalski ?

@Mikolaj-A-Kowalski
Contributor Author

Can this be merged now given that it has been approved @Mikolaj-A-Kowalski ?

Myself, I have a feeling that it is really a decision for @jgwalkup. On my side, I think that the outstanding issue was that we weren't sure this kind of refactoring is what we wanted to do (hence the discussion about using netCDF above). But if the CSV-only approach is required, it is basically ready to merge in the form it is in this PR.
