Change output format to CSV #38
base: main
Closes #35. Writes a set of CSV files to a folder instead of pickling the output data class. This is more human-readable and should make the data easier to read from post-processing scripts. The drawback is that new members of an unsupported type may not be written (a warning is raised in that case), so adding an extra case to the export functions might be necessary.
|
These are the main points from our discussion earlier (although we didn't come to hard conclusions):
Modify this or reply if I missed anything or got anything wrong. |
|
@sjavis Thanks for summarizing everything we discussed. I updated the project board with issues #40 (updated extraction script) and #41 (Add option to select model output). I think that @bioatmosphere's comment on #35, that the model output dimensionality is already high and can go higher, makes a good case for using netCDF as the default export format. @Mikolaj-A-Kowalski I think that the changes you have made are in line with what I was asking for. I want to be able to export files in CSV format, but I think it should be included as an optional feature in the updated post-processing data extraction script.
|
Hi @jgwalkup, thanks for the comments! I am also thinking that perhaps we should move this discussion back to #35, since this PR is most likely not going to be merged (it will be cleaner to open a new one for what we eventually settle on).

I agree that the combination of netCDF (probably version 4 format with multiple groups?) and some post-processing scripts is a good idea. But I also think we should pivot the discussion a bit and talk more about an interface to export/import data that would be optimal for your use. I think this is the first thing we should try to clarify. Then it will be easier to figure out how exactly to structure the (netCDF) output file.

For the reading of the data I imagine something along the following lines:
```python
# Assumes we made dementpy a python package already
import numpy as np
from dementpy.postprocessing import OutputDatasets

with OutputDatasets("./path/to/output_folder/prefix-*.nc") as out:
    iterator_over_replicas, axis = out.get_iter_over("output_variable_name")
    # axis will be a tuple of dimension labels e.g. ("x", "y", "taxa")
    # `iterator_over_replicas` will yield a rank == len(axis) numpy array with the data from each replica
    # (or to avoid unnecessary copies we could just return e.g. netCDF Variable handle?)

    # For example then to calculate the mean of the sum of all "taxa" over the replicas we could do something like this
    mean = np.mean([np.sum(data, axis=(0, 1)) for data in iterator_over_replicas], axis=0)
```

I am not sure, though, whether this usage example is representative of how you need to access the data.

On the other side, we need to specify some interface to register and write the output variables to the output file. I was thinking along the lines of:

```python
out = OutputFile(
    fname="./output/name.nc",
    dimensions={
        "x": np.linspace(-1.0, 1.0, 100),
        "y": np.linspace(-1.0, 1.0, 100),
        "taxa": ["taxa0", "taxa1"],
    },
)

# Each output variable is given a reader function
# that extracts relevant multi-dimensional data from `Grid` to numpy.ndarray
# Shape will be checked inside the `OutputFile` on each extraction
def read_ones(eco: Grid) -> np.ndarray:
    # Normally we would use data from the Grid
    # For the purpose of example we return just ones
    return np.ones((100, 100))

out.add_output_variable(
    name="ones",
    dims=("x", "y"),
    interval=20,
    reader_func=read_ones,
)

with OutputFile.open() as out:
    ...  # Simulation code
    for pulse in range(n_pulse):
        ...  # Simulation code
        for cycle in range(n_cycles):
            ...  # Simulation code
            out.write(cycle=pulse * n_cycles + cycle, grid_obj=Ecosystem)
```

Basically we would need to pre-define dimensions at the start and then just register different output variables with some data and an instruction how to extract each variable from the `Grid`. At a later stage we could automate registration of output variables based on some user input file.

Also, I think errors in data extraction should not cause the calculation to terminate. The output file would catch and log them, and report at the end of the simulation that some variables were not captured as a result.
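To make the registration and error-handling idea above concrete, here is a minimal sketch, not the proposed implementation. It assumes a hypothetical in-memory class `OutputRegistry` standing in for `OutputFile`, uses nested lists instead of `numpy.ndarray` (with a hypothetical `shape_of` helper) so that it runs without dependencies, and leaves out the netCDF writing entirely. It only shows interval-based extraction, the shape check, and logging of failed extractions without stopping the loop.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple

log = logging.getLogger("output_registry")


def shape_of(data: Any) -> Tuple[int, ...]:
    """Shape of a rectangular nested list (stand-in for ndarray.shape)."""
    shape = []
    while isinstance(data, (list, tuple)):
        shape.append(len(data))
        data = data[0] if data else None
    return tuple(shape)


@dataclass
class OutputVariable:
    name: str
    dims: Tuple[str, ...]
    interval: int
    reader_func: Callable[[Any], Any]


class OutputRegistry:
    """Hypothetical in-memory stand-in for the proposed OutputFile."""

    def __init__(self, dimensions: Dict[str, Sequence]):
        self.dimensions = dimensions
        self.variables: List[OutputVariable] = []
        self.records: Dict[str, List[Tuple[int, Any]]] = {}
        self.failed: Dict[str, int] = {}

    def add_output_variable(self, name, dims, interval, reader_func):
        # Dimensions must be pre-defined before a variable can use them
        missing = [d for d in dims if d not in self.dimensions]
        if missing:
            raise KeyError(f"undefined dimensions: {missing}")
        self.variables.append(OutputVariable(name, tuple(dims), interval, reader_func))
        self.records[name] = []

    def write(self, cycle: int, grid_obj: Any) -> None:
        for var in self.variables:
            if cycle % var.interval != 0:
                continue
            expected = tuple(len(self.dimensions[d]) for d in var.dims)
            try:
                data = var.reader_func(grid_obj)
                if shape_of(data) != expected:
                    raise ValueError(f"shape {shape_of(data)} != expected {expected}")
            except Exception as err:
                # A failed extraction is logged and counted, never re-raised
                log.warning("variable %r skipped at cycle %d: %s", var.name, cycle, err)
                self.failed[var.name] = self.failed.get(var.name, 0) + 1
            else:
                self.records[var.name].append((cycle, data))


registry = OutputRegistry({"x": [0.0, 0.5, 1.0], "taxa": ["taxa0", "taxa1"]})
registry.add_output_variable("ones", ("x", "taxa"), interval=2,
                             reader_func=lambda grid: [[1.0, 1.0] for _ in range(3)])
registry.add_output_variable("broken", ("x",), interval=1,
                             reader_func=lambda grid: grid["no_such_key"])
for cycle in range(4):
    registry.write(cycle=cycle, grid_obj={})
```

After the loop, `registry.failed` holds the per-variable count of skipped extractions, which is the information that could be printed as a summary at the end of a simulation.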
|
Can this be merged now given that it has been approved @Mikolaj-A-Kowalski ? |
Myself, I have a feeling that this is really a decision for @jgwalkup. On my side, I think the outstanding issue was that we weren't sure this kind of refactoring is what we wanted to do (hence the discussion about using netCDF above). But if the CSV-only approach is required, it is basically ready to merge in the form it is in this PR.
Closes #35
Writes a set of CSV files to a folder instead of pickling the output data class. This is more human-readable and should make the data easier to read from post-processing scripts.
The drawback is that new members of an unsupported type may not be written (a warning is raised in that case), so adding an extra case to the export functions might be necessary.
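As an illustration of the type-dispatched export and its drawback, here is a minimal, dependency-free sketch rather than the code in this PR. A list of row dicts stands in for a `pd.DataFrame` member, the function name `export_members` and the member names `cycle`, `biomass` and `notes` are made up for the example, and only the `scalars.csv` file name is taken from the description.

```python
import csv
import tempfile
import warnings
from numbers import Number
from pathlib import Path
from typing import Any, Dict


def export_members(members: Dict[str, Any], folder: Path) -> None:
    """Write each member to CSV according to its type; warn on unknown types."""
    folder.mkdir(parents=True, exist_ok=True)
    scalars = {}
    for name, value in members.items():
        if isinstance(value, Number):
            # Scalars are collected and written together at the end
            scalars[name] = value
        elif isinstance(value, list) and value and all(isinstance(r, dict) for r in value):
            # Table-like member (stand-in for pd.DataFrame): one file per member
            with open(folder / f"{name}.csv", "w", newline="") as fh:
                writer = csv.DictWriter(fh, fieldnames=list(value[0]))
                writer.writeheader()
                writer.writerows(value)
        else:
            # Unsupported type: the member is skipped, which is the drawback noted above
            warnings.warn(f"member {name!r} of type {type(value).__name__} was not written")
    with open(folder / "scalars.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["name", "value"])
        writer.writerows(scalars.items())


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "7778"
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        export_members(
            {"cycle": 3, "biomass": [{"taxon": "taxa0", "C": 1.5}], "notes": {1, 2}},
            out_dir,
        )
    written = sorted(p.name for p in out_dir.iterdir())
    scalars_text = (out_dir / "scalars.csv").read_text()
```

Supporting a new member type in a scheme like this means adding another `elif` branch, which is the maintenance cost the description mentions.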
I will need your input @jgwalkup to make sure that this is more or less what you had in mind (hence the draft). I don't feel 100% comfortable with the physics (biology?) which the code is doing so I need you to make sure I am writing everything that needs to be written and that the output format is useful ;-)
What the code now does: instead of serialising the `Output` class to a pickle archive, a folder is created and filled with CSV files.

Basically, for each `pd.DataFrame` member of `Output` a separate CSV file is written. The `pd.Series` and `Number` data members are collected and written to the `series.csv` and `scalars.csv` files respectively.

The `Initialisation` dictionary is written to a subfolder in a similar fashion, with the difference that `np.ndarray` and `pd.Series` members are written to separate files.

After the run one should obtain a listing in the output folder like the following (for `outname` = 7778):