Skip to content

368/numpy conv func#386

Open
yuvraajnarula wants to merge 17 commits intoopenclimatefix:mainfrom
yuvraajnarula:368/numpy-conv-func
Open

368/numpy conv func#386
yuvraajnarula wants to merge 17 commits intoopenclimatefix:mainfrom
yuvraajnarula:368/numpy-conv-func

Conversation

@yuvraajnarula
Copy link

Pull Request

Description

This PR introduces a unified xarray -> NumpySample conversion function to replace the existing modality-specific numpy conversion functions in numpy_sample.

What this pr does

  • Adds a single convert_xarray_dict_to_numpy_sample function that:
  • accepts a dictionary of xarray.DataArray objects (matching the structure already used in the PVNet datasets)
  • converts all supported modalities (generation, nwp, satellite) into a single, structured NumpySample
  • Extracts the core modality conversion logic into internal helpers (_convert_generation, _convert_nwp, _convert_satellite)
  • Keeps the existing convert_*_to_numpy_sample functions as thin wrappers for now

What this PR intentionally does not do

  • Remove or refactor *SampleKey classes
  • Update PVNetDataset or other downstream consumers to the new unified API
  • Change the runtime structure of existing samples
  • Those changes are intentionally deferred until we agree on whether this new approach is the right long-term direction.

Design question for reviewers

@dfulu
The main motivation for this PR is to move away from:

  • many modality-specific numpy conversion functions
  • scattered *SampleKey classes defining implicit sample structure

and towards:

  • a single, explicit conversion entrypoint
  • a structured, self-describing NumpySample layout

Before continuing with:

  • deprecating/removing *SampleKey
  • migrating datasets and collate logic

I’d really appreciate feedback on whether this unified conversion approach feels sustainable and directionally correct for the project.

Fixes #368

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration

  • Yes

If your changes affect data processing, have you plotted any changes? i.e. have you done a quick sanity check?

  • Yes

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@dfulu
Copy link
Member

dfulu commented Jan 5, 2026

@yuvraajnarula just a heads up that your PR isn't passing linting. Also just a warning that I don't think your PR will pass the pytests since you've changed the structure of the dict. So there's probably quite a lot of changes that will be required.

Also the main point of the issue was to remove the calls to convert_X_to_numpy_sample() inside the the torch PVNetDataset class which isn't addressed here

@yuvraajnarula
Copy link
Author

@dfulu This implementation addresses the immediate test failure regarding datetime encodings. Please let me know if it aligns with the core principles for this issue.

@yuvraajnarula
Copy link
Author

Hey @devsjc
Any reviews on the current implementation, and if it meets the standard. Also, I have fixed linting on my end and double checked with ruff as well. So let me know if there's anything more required for this designated issue.

@yuvraajnarula
Copy link
Author

Hey @devsjc
Any reviews?

Copy link
Member

@dfulu dfulu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuvraajnarula thanks for getting the linting passing. I'm sorry to say that there are still quite a few changes required and ones which our testing suite doesn't catch.

We keep compatibility between ocf-data-sampler and PVNet, and particularly with models we've already trained. You changes modify the structure of and values in the NumpySample dict. For example you change the nesting structure of the dictionary, and also the representation of the times. NumpySample should not change in this PR. Neither should batch_to_tensor().

Also the changes you make to the Location class are dangerous since the new properties x/y sometimes return the lon/lats and sometimes the ocsgb coords. Similarly the longitude/latitude properties sometimes return the osgb coords.

Also, the point of this issue was essentially to remove the need for the numpy_sample/[generation.py, nwp.py, satellite.py] files and compile those in a single place

@yuvraajnarula
Copy link
Author

Hey @dfulu, sorry for the noise and back-and-forth on this PR. I understand the compatibility constraints are strict here. I’ve now updated the implementation to fully preserve the existing NumpySample structure (keys/nesting/dtypes), kept batch_to_tensor() unchanged, and reverted the risky Location changes. The numpy conversion logic from generation.py, nwp.py, and satellite.py is now consolidated into the unified converter as intended, and lint/tests are passing locally.

Even went through all the testing that was required with linting. I hope these current implementations match your core principles.

@yuvraajnarula yuvraajnarula requested a review from dfulu February 13, 2026 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consolidate numpy conversion functions

2 participants