Add initial set of generation and storage schemas #93

Open
EllieKallmier wants to merge 5 commits into main from gen-and-storage-schema-definitions

@EllieKallmier
Member

Adds YAML validation schemas for generation and storage input tables to the initial set of ISPyPSA input table schemas.

Schemas added:

  • generators_existing_planned.yaml - Existing and planned generator characteristics
  • generators_new_entrant.yaml - New entrant generator technology options
  • storage_existing_planned.yaml - Existing and planned storage unit characteristics
  • storage_new_entrant.yaml - New entrant storage technology options (battery, PHES)
  • costs_connection.yaml - Connection cost data
  • costs_fuel_prices.yaml - Fuel price projections
  • costs_new_entrant_build.yaml - Build cost projections for new entrants
  • emissions_reduction.yaml - Emissions reduction targets/constraints

Still to be added: policy table schemas.

Schemas follow structure as outlined in #85 and mostly follow the draft table structures given in the alternative templater review, with tweaks to match the 2026 (v7.5) IASR workbook data.

@EllieKallmier added the labels "type: feature" (New feature or request) and "category: data-validation" (Relates to data validation practices across any module, e.g. tables, schema or enforcement) on Apr 14, 2026
codecov Bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
see 6 files with indirect coverage changes

Member

@nick-gorman left a comment

This looks great Ellie, just a few comments / questions.

  • I made a bunch of comments on just one table, but they might apply to other tables too. It should be obvious where they do.
  • In the inline comments I suggested validating the fuel types in the fuel cost table against the generator tables. It could also make sense to do this in reverse: if the fuel cost table exists, require that the generator table values exist in the fuel cost table. But then this raises the question of what happens when the fuel cost table isn't provided. Another option would be to have a canonical values yaml (or csv) that we validate against for things like fuel types, technology types, etc. This is a broader issue than fuel types; I'm just using them as an example.
  • Do you think we should have a standard way of separating description notes that pertain to the source / IASR from those that describe ISPyPSA behaviour?

Comment on lines +27 to +28
If costs are given by region in source data, create separate rows for each
geo_id in the region with the same cost.
Member

Great idea to have this note. Do you think we should consistently use a heading like "Source notes" or "Data preparation notes" to preface notes like this? Anyway, not a big one, just a thought.
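
The data preparation step described in the quoted note could look roughly like the sketch below. It assumes a hypothetical region-to-geo_id lookup; the column names `region`, `year` and `cost` and the IDs are placeholders for illustration, not taken from the schema.

```python
# Hypothetical lookup from a source-data region to the geo_ids it contains.
REGION_TO_GEO_IDS = {"REGION_A": ["GEO_1", "GEO_2"]}


def expand_region_costs(rows, region_to_geo_ids):
    """Return one row per geo_id, copying the region-level cost to each."""
    expanded = []
    for row in rows:
        for geo_id in region_to_geo_ids[row["region"]]:
            # Drop the region key and replace it with the geo_id.
            new_row = {k: v for k, v in row.items() if k != "region"}
            new_row["geo_id"] = geo_id
            expanded.append(new_row)
    return expanded


region_rows = [{"region": "REGION_A", "year": 2030, "cost": 120.0}]
geo_rows = expand_region_costs(region_rows, REGION_TO_GEO_IDS)
# geo_rows holds two rows (GEO_1 and GEO_2), each with cost 120.0
```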

Standardised technology name mapping to new entrant technologies in
`generators_new_entrant` and/or `storage_new_entrant` tables.

If blank: treat as geo_id-level VRE connection cost.
Member

Should this be more explicit? Maybe: "If blank: the cost is applied to all new entrant technologies in the geo_id that have not been given a technology-specific connection cost for the applicable year."
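
The fallback behaviour in the suggested wording can be sketched as a lookup that prefers a technology-specific row and otherwise uses the blank-technology row for the same geo_id and year. The function name, the row layout, the use of `None` for a blank cell, and the example values are all assumptions made for illustration.

```python
def connection_cost(rows, geo_id, technology, year):
    """Look up a connection cost, preferring a technology-specific row.

    Falls back to the row with a blank (None) technology for the same
    geo_id and year, and returns None if neither exists.
    """
    fallback = None
    for row in rows:
        if row["geo_id"] != geo_id or row["year"] != year:
            continue
        if row["technology"] == technology:
            return row["cost"]
        if row["technology"] is None:
            fallback = row["cost"]
    return fallback


costs = [
    {"geo_id": "GEO_1", "technology": None, "year": 2030, "cost": 100.0},
    {"geo_id": "GEO_1", "technology": "tech_a", "year": 2030, "cost": 150.0},
]
specific = connection_cost(costs, "GEO_1", "tech_a", 2030)  # 150.0
general = connection_cost(costs, "GEO_1", "tech_b", 2030)   # 100.0
```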

type: int
required: true
description: >
Financial year in which this cost applies.
Member

Maybe just "Year", to leave open the cost being applied on a calendar-year basis.

Member Author

Yep, I hear this. Maybe I'd also add a note (possibly in the description?) that says something along the lines of "Year type is either financial or calendar year based on config" (phrased better).

Side thought: I'm not sure how we want to handle converting data that's given as FY into calendar year (or vice versa) when it's just a single value for each year. I don't think it's a big deal, but it would be good to take a standard approach.

Member

On the side thought: agree, but I would kind of leave this up to the user. I.e. the templater doesn't have a calendar year option, but if users want to fill out the ISPyPSA tables with their own inputs they are free to specify them as calendar years.

Comment on lines +24 to +25
If absent: no dynamic marginal costs calculated. Requires later user input of
fixed or dynamic marginal cost.
Member

Should we be more specific about where these inputs would need to be provided if they aren't given here?

Comment thread src/ispypsa/validation/schemas/costs_new_entrant_build.yaml
Comment on lines +22 to +23

Should match fuel_type values in `generators_existing_planned` or
Member

Should these be validated?

fuel_type:
type: string
required: true
description: Fuel type used by the generator or storage unit.
Member

@nick-gorman Apr 16, 2026

Should we note here the model behaviour if the fuel_type (or maybe even the fuel_type / price_node combo) for a generator is missing? Is this allowed? Maybe it's the same behaviour as if the whole table is missing, but just applied to a single generator. Should this be stated at the table description level, noting that the "if absent" behaviour applies both to the whole table and on a row basis? And I guess this comment applies to other tables as well.

Member Author

I have also been wondering about this, and for me it's still an open question whether what I've proposed for the missing table case is the best way forward. But yes, as it stands I think it makes sense to add some cross-validation here to the summary tables and, as you say, add a table-level note re: "if absent" behaviour.

Note: for all power stations except for Kogan Gas, this should exactly match
the `power_station` column.

If absent: assume no dynamic marginal price is calculated for this model run.
Member

Does this mean zero cost or a fixed cost?

Member Author

I was thinking it would mean a fixed price (user to supply), or that users would need to provide dynamic prices (if that's desired). I can clarify this wording to be more specifically about the calculation of dynamic prices and the consequence/required action.

I'm also not totally convinced this is the best way to handle this scenario, I just think that this is a case with an obvious set of options that might be desirable to users AND can offer a simplification. I would be very happy to chat more about this case and whether there's a nicer way to implement!

Member

Maybe if fuel_price_node isn't supplied then the user needs to add a column called fuel_price, which would just specify one fixed price. Or you could allow numeric values in fuel_price_node which override the dynamic values, and just not allow NaN. The user can set it to zero if that's what they want.

required: false
units: '%'
description: >
Maximum allowed state of charge (%). Must be between 0.0 and 100.0,
Member

validation?

Member Author

yes for sure - at the time I was thinking it might be easier to define column-level validation rules (in the schema) once we had a more solid plan for how the validation will be handled (i.e. if there's an existing package or smth we want to use). But yeah it's probably just as easy to define clearly now and update that as we go. I might in that case use the custom_validation attribute at the column level and start defining a suite of standard validations. Lmk if you have other ideas or preferences!

Member Author

also - could use allowed_values but atm that's defined as a list of permitted values, which doesn't translate well here; we could instead update the definition of that attribute.

Member

Ok, yep, I agree it could make sense to wait till we know more about how we will be doing the validation.

type: date
required: false
description: >
Date when the storage unit begins operation. Format: %d/%m/%Y
Member

Validation?

@EllieKallmier
Member Author

Collection of my (edited) thoughts and updates:

New schema additions for validation metadata

Super open to not doing this/changing stuff around, particularly as validation "enforcement" gets set up and might require certain structures etc. To summarise: this shifts away from embedding validation constraints as prose in description fields (e.g. "All values should be >= $0.0") and towards explicit fields at the column level. The new fields I'm testing out are:

| Field | Meaning |
| --- | --- |
| gte | Greater than or equal to (inclusive lower bound) |
| gt | Strictly greater than (exclusive lower bound) |
| lte | Less than or equal to (inclusive upper bound) |
| lt | Strictly less than (exclusive upper bound) |
| format | Format string for date or string pattern validation (e.g. "%d/%m/%Y") |
| allowed_values | Explicit list of permitted values |
| allowed_values_from | Reference to a column in another table whose values define the permitted set |
| nan_fill | Value applied to null/NaN cells at the validation enforcement step, IF required: false for this column |

The intent is that these fields can be read directly by validation enforcement code without parsing prose. Table-level custom_validation is reserved for more complex cross-table/cross-column conditions that can't be expressed this way, and allowed_values stays as a static list (discrete). The idea being that these fields only exist as needed per column.
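
As a rough illustration of how enforcement code might read these fields directly, here is a minimal sketch. It assumes each column's schema has already been loaded into a plain dict; the function names are hypothetical, and the `format` check only covers the date case, not string patterns.

```python
import operator
from datetime import datetime

# Map each bound field to the comparison a valid value must satisfy.
BOUND_CHECKS = {
    "gte": operator.ge,
    "gt": operator.gt,
    "lte": operator.le,
    "lt": operator.lt,
}


def check_bounds(column_schema, values):
    """Return (value, field, bound) for each non-null value that breaks a bound."""
    failures = []
    for field, passes in BOUND_CHECKS.items():
        if field not in column_schema:
            continue
        bound = column_schema[field]
        failures.extend(
            (v, field, bound)
            for v in values
            if v is not None and not passes(v, bound)
        )
    return failures


def check_date_format(column_schema, values):
    """Return the non-null values that do not parse with the column's format."""
    fmt = column_schema.get("format")
    if fmt is None:
        return []
    bad = []
    for v in values:
        if v is None:
            continue
        try:
            datetime.strptime(v, fmt)
        except ValueError:
            bad.append(v)
    return bad


soc_failures = check_bounds({"gte": 0.0, "lte": 100.0}, [50.0, 120.0, None, -1.0])
# soc_failures == [(-1.0, "gte", 0.0), (120.0, "lte", 100.0)]
date_failures = check_date_format({"format": "%d/%m/%Y"}, ["01/07/2026", "2026-07-01", None])
# date_failures == ["2026-07-01"]
```

Because only the fields present on a column are checked, a column with no bounds or format passes trivially, which matches the idea that these fields only exist as needed.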


Clarified semantics of required vs nan_fill

The column-level required field defines: whether a column can contain null values at validation time (without raising an error).

For columns that are required: false but have a sensible default value, nan_fill specifies the value that will be applied to any null/NaN cells during the validation enforcement step. For example:

lcf_build:
  type: float
  required: false
  units: '%'
  gte: 0.0
  nan_fill: 100.0

This says: the column is optional; if cells are null, fill them with 100.0 (i.e., no locational scaling); validate that all resulting values are >= 0.
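
The fill-then-validate order described here could be sketched as follows. This is an assumption about how enforcement might work, not the implemented behaviour: the function names are hypothetical, a missing `required` key is treated as false, and only the `gte` bound is checked.

```python
import math


def is_null(value):
    """Treat None and float NaN as null cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_and_validate(column_schema, values):
    """Apply nan_fill to an optional column, then check the gte bound."""
    has_nulls = any(is_null(v) for v in values)
    if column_schema.get("required", False) and has_nulls:
        raise ValueError("required column contains null values")
    if "nan_fill" in column_schema:
        fill = column_schema["nan_fill"]
        values = [fill if is_null(v) else v for v in values]
    if "gte" in column_schema:
        too_low = [v for v in values if not is_null(v) and v < column_schema["gte"]]
        if too_low:
            raise ValueError(f"values below {column_schema['gte']}: {too_low}")
    return list(values)


lcf_build = {"type": "float", "required": False, "units": "%", "gte": 0.0, "nan_fill": 100.0}
filled = fill_and_validate(lcf_build, [None, 95.0, float("nan")])
# filled == [100.0, 95.0, 100.0]
```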


Standardised description sub-headers

Description fields now use consistent sub-headers to distinguish different types of content:

  • Source tables: — the IASR workbook tables drawn on by the templater to produce this output table
  • Source notes: — templater transformation notes (what has already happened to get to this point; not enforced by the validator)
  • If absent: — behavioural consequence when the entire table or column is not present
  • If absent (or empty): — used where both the column being absent and individual null cells have the same consequence

I’m not locked in on having the source notes stuff here, maybe it’s more useful for me at this stage during the templater refactor but will become redundant/doubled as the templater documentation gets filled out? But the idea being that Source notes: defines stuff that has happened before reaching validation, in part documenting the templater behaviour but also in particular where new rows have been added that don’t exist in the IASR workbook. Happy to lose this subheading if it feels like a double up/not that useful!


Simplified dynamic marginal cost data requirements

Basically - I was getting stuck on how to imply this kind of flexibly required if/then structure in the schema without having a more concrete picture of the validation and where it sits in the flow, and if/how we required users to manually fill some values at different points etc - so I made an executive decision to simplify and remove this optionality/flexibility for the moment so I can just keep moving!

Previously, the generator tables used an optional_sets field to indicate that fuel_price_node (now fuel_price_mapping), vom, and heat_rate were all required together or all absent. This approach was removed. These columns — along with fom — are now simply required: true in both new entrant generator and storage tables.

The same change applies to costs_fuel_prices, which has been made required: true at the table level (was previously optional). This is noted in the schema with a comment explaining the intent to revisit optionality of SRMC-related data once the validation enforcement design is clearer.

I do think this is important to revisit just with a more serious think about the user interaction piece (where and how to enforce)!


Column identifier rename: fuel_price_node → fuel_price_mapping

I renamed the column fuel_price_node in the generator tables and costs_fuel_prices back to fuel_price_mapping, because in the IASR v7.5 workbook structure fuel prices are not specifically linked to a geographic node: for many generators they are identified by a mapping ID that doesn't correspond to a specific location. Basically, there's no region/subregion/rez/location column in the fuel price tables anymore!


Cross-table references: new entrant technology validation

The technology column in both generators_new_entrant and storage_new_entrant now has:

allowed_values_from:
  - costs_new_entrant_build: technology

This enforces that any technology defined in the new entrant asset tables also has a build cost entry.

Open question for discussion: This validates that new entrant assets have build costs, but it doesn't validate that all entries in costs_new_entrant_build correspond to technologies actually defined in the asset tables (which in my mind would be about catching stale or misspelled entries).
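
Both directions of this check can be sketched with set differences. The sketch assumes tables are held as lists of row dicts keyed by table name; the function names and the technology values are made up for illustration.

```python
def missing_references(tables, table, column, ref_table, ref_column):
    """Values used in table.column that never appear in ref_table.ref_column."""
    permitted = {row[ref_column] for row in tables[ref_table]}
    used = {row[column] for row in tables[table]}
    return sorted(used - permitted)


def unused_entries(tables, ref_table, ref_column, referencing):
    """Values in ref_table.ref_column that no referencing column uses."""
    defined = {row[ref_column] for row in tables[ref_table]}
    used = set()
    for table, column in referencing:
        used |= {row[column] for row in tables[table]}
    return sorted(defined - used)


tables = {
    "costs_new_entrant_build": [{"technology": "tech_a"}, {"technology": "tech_old"}],
    "generators_new_entrant": [{"technology": "tech_a"}, {"technology": "tech_b"}],
    "storage_new_entrant": [],
}

# Forward check (what allowed_values_from enforces): assets without a build cost.
no_build_cost = missing_references(
    tables, "generators_new_entrant", "technology",
    "costs_new_entrant_build", "technology",
)  # ["tech_b"]

# Reverse check (the open question): build cost entries no asset table uses.
stale = unused_entries(
    tables, "costs_new_entrant_build", "technology",
    [("generators_new_entrant", "technology"), ("storage_new_entrant", "technology")],
)  # ["tech_old"]
```

The reverse check is arguably better suited to a warning than an error, since unused build cost rows do not stop a model from being built.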

@EllieKallmier requested a review from nick-gorman May 11, 2026 23:50
Member

@nick-gorman left a comment

Hey Ellie,

This looks really good, big fan.

I think there are some broader questions about how we implement referential integrity, which we haven't quite answered, similar to your question on costs_new_entrant_build. But I think that is something we can keep thinking through, and it doesn't need to block this PR by any means. I might open a discussion on referential integrity.

Comment thread src/ispypsa/validation/schemas/costs_fuel_prices.yaml Outdated
fuel_type:
type: string
required: true
allowed_values: ["Black Coal", "Brown Coal", "Liquid Fuel", "Gas", "Water", "Solar", "Wind", "Biomass", "Biomethane"]
Member

I did something similar in network_geography.yaml, but I wonder if hard-coding like this is actually bad, as it prevents people from defining new fuel costs. I'm thinking again of a toy example type case where I might just want to define whatever fuels I like: Coal, Hydrogen, etc.

Another way of doing this might be to force the generation tables to only have fuel types that exist in costs_fuel_prices.

Member Author

yeahh I thought about this too, and after the chat this morning about the role of validator totally agree hard-coding isn't ideal here. And yep I like that option - will implement!

EllieKallmier and others added 2 commits May 13, 2026 12:30
Co-authored-by: nick-gorman <40549624+nick-gorman@users.noreply.github.com>