Description of Refactoring/Improvement
Split DataHandler into two focused classes: SimulationDataLoader (for loading .npy files) and TensorFlowDatasetConverter (for SED processing and TensorFlow conversion). Keep DataHandler as a deprecated wrapper for backward compatibility.
Goals and Objectives
- Separate file loading logic from data conversion logic
- Create reusable converter that works with any PSF dataset format (not just simulations)
- Enable upcoming adapter pattern for supporting both simulation and real observational data
- Maintain backward compatibility with existing code
Current Code Behaviour
DataHandler currently mixes three responsibilities:
- Loading simulation .npy files from disk
- Converting NumPy arrays to TensorFlow tensors
- Processing SEDs with simPSF
This tight coupling makes it difficult to:
- Use PSF dataset dataclasses (like SHEPSFDataset) without converting to dicts
- Reuse conversion logic for different data sources
- Test loading and conversion independently
Proposed Changes
Create simulation_data_loader.py:
class SimulationDataLoader:
    """Loads .npy simulation files and validates structure."""

    def load(self):
        """Load from disk, validate, and return dataset dict."""
Create tensorflow_converter.py:
class TensorFlowDatasetConverter:
    """Converts PSF datasets to TensorFlow tensors."""

    def convert_psf_dataset(self, dataset, target_field='images'):
        """Convert PSF dataclass to TF dict."""

    def convert_dict(self, dataset_dict, dataset_type='train'):
        """Convert legacy dict to TF dict."""

    def _process_seds(self, sed_data):
        """Process SEDs using simPSF."""
Update data_handler.py:
import warnings

class DataHandler:
    """DEPRECATED: Thin wrapper delegating to SimulationDataLoader."""

    def __init__(self, *args, **kwargs):
        # stacklevel=2 attributes the warning to the caller, not this module
        warnings.warn("DataHandler is deprecated...", DeprecationWarning, stacklevel=2)
        self._loader = SimulationDataLoader(*args, **kwargs)
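To make the loader and wrapper proposal concrete, here is a minimal sketch of how the two classes could fit together. It assumes the simulation file is a dict saved with `np.save`, that the loader takes a single `path` argument, and that `positions` and `stars` are required keys; all three are illustrative assumptions, as are the warning text and the `__getattr__` delegation. None of this is the project's actual schema or API.

```python
import warnings
from pathlib import Path

import numpy as np


class SimulationDataLoader:
    """Loads a .npy simulation file and validates its structure."""

    # Hypothetical schema, for illustration only.
    REQUIRED_KEYS = ("positions", "stars")

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        """Load from disk, validate, and return the dataset dict."""
        if not self.path.is_file():
            raise FileNotFoundError(f"Simulation file not found: {self.path}")
        # A dict saved with np.save comes back as a 0-d object array;
        # indexing with an empty tuple unwraps it.
        dataset = np.load(self.path, allow_pickle=True)[()]
        if not isinstance(dataset, dict):
            raise ValueError(f"Expected a dict in {self.path}")
        missing = [key for key in self.REQUIRED_KEYS if key not in dataset]
        if missing:
            raise ValueError(f"Missing keys in {self.path}: {missing}")
        return dataset


class DataHandler:
    """DEPRECATED: thin wrapper delegating to SimulationDataLoader."""

    def __init__(self, *args, **kwargs):
        # Message wording is a placeholder, not the final text.
        warnings.warn(
            "DataHandler is deprecated; use SimulationDataLoader instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self._loader = SimulationDataLoader(*args, **kwargs)

    def __getattr__(self, name):
        # Only called when normal lookup fails, so wrapper attributes win.
        if name == "_loader":
            raise AttributeError(name)  # avoid recursion before __init__ runs
        return getattr(self._loader, name)
```

With this shape, existing calls such as `DataHandler(path).load()` keep working and emit one `DeprecationWarning` per construction. The legacy attributes the real wrapper must still expose would need to be checked against the current `DataHandler`.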
Expected Benefits
- Reusability: TensorFlowDatasetConverter works with any PSF dataset (Euclid, Roman, JWST)
- Testability: Can test loading and conversion independently
- Maintainability: Single Responsibility Principle - each class has one clear purpose
- Extensibility: Enables adapter pattern for unified training interface
- Backward compatibility: Existing code continues working with deprecation warning
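The reusability and testability points can be illustrated with a rough sketch of the converter. It keeps the two public method names from the proposal but makes several assumptions: the `to_tensor` constructor argument is a hypothetical injection point (so tests can run without TensorFlow), only NumPy array fields are converted, `dataset_type` is accepted but unused because its role is not specified here, and SED processing is left out entirely.

```python
from dataclasses import fields, is_dataclass

import numpy as np


def _default_to_tensor(array):
    # Imported lazily so this module can be imported without TensorFlow.
    import tensorflow as tf

    return tf.convert_to_tensor(np.asarray(array, dtype=np.float32))


class TensorFlowDatasetConverter:
    """Converts PSF datasets (dataclass or dict) to a dict of tensors."""

    def __init__(self, to_tensor=_default_to_tensor):
        # `to_tensor` is a hypothetical hook, not part of the proposal.
        self._to_tensor = to_tensor

    def convert_psf_dataset(self, dataset, target_field="images"):
        """Convert a PSF dataclass instance to a dict of tensors."""
        if not is_dataclass(dataset) or isinstance(dataset, type):
            raise TypeError("dataset must be a dataclass instance")
        # Read fields directly; dataclasses.asdict would deep-copy arrays.
        arrays = {f.name: getattr(dataset, f.name) for f in fields(dataset)}
        if target_field not in arrays:
            raise KeyError(f"dataset has no field {target_field!r}")
        return self._convert(arrays)

    def convert_dict(self, dataset_dict, dataset_type="train"):
        """Convert a legacy dict; dataset_type is unused in this sketch."""
        return self._convert(dict(dataset_dict))

    def _convert(self, arrays):
        # Convert NumPy arrays, pass everything else through unchanged.
        return {
            key: self._to_tensor(value) if isinstance(value, np.ndarray) else value
            for key, value in arrays.items()
        }
```

Both entry points share `_convert`, so a dataclass such as SHEPSFDataset and a legacy simulation dict would go through the same conversion path.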
Dependencies
- No breaking changes - existing code using DataHandler continues to work
- Enables follow-up PRs for adapter pattern implementation
- Future external API (TrainWaveDiffPSF) will use TensorFlowDatasetConverter directly
Testing Plan
- Unit tests for SimulationDataLoader:
  - Test loading .npy files
  - Test validation of simulation-specific structure
  - Test error handling for missing/invalid files
- Unit tests for TensorFlowDatasetConverter:
  - Test convert_psf_dataset() with mock PSF dataclass
  - Test convert_dict() with simulation dict
  - Test SED processing pipeline
  - Verify correct tensor shapes and dtypes
- Integration tests:
  - Test deprecated DataHandler produces identical results to new classes
  - Test with real simulation data end-to-end
- Regression tests:
  - Ensure existing training scripts work unchanged
  - Verify deprecation warning fires correctly
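The "identical results" integration check could rest on a small comparison helper like the one below. The name `assert_datasets_equal` is hypothetical, and the sketch assumes both code paths return dicts whose values can be viewed as NumPy arrays (eager TensorFlow tensors generally can, via `np.asarray`).

```python
import numpy as np


def assert_datasets_equal(actual, expected):
    """Assert that two dataset dicts hold the same keys, dtypes, and values."""
    assert set(actual) == set(expected), (
        f"key mismatch: {sorted(set(actual) ^ set(expected))}"
    )
    for key, expected_value in expected.items():
        actual_array = np.asarray(actual[key])
        expected_array = np.asarray(expected_value)
        # assert_array_equal ignores dtype, so check it explicitly.
        assert actual_array.dtype == expected_array.dtype, (
            f"dtype mismatch for {key!r}"
        )
        np.testing.assert_array_equal(
            actual_array, expected_array, err_msg=f"value mismatch for {key!r}"
        )
```

An integration test would then load the same file once through the deprecated DataHandler and once through the new classes, and pass both results to this helper.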
Additional Context
This refactoring is prerequisite work for supporting Euclid SHEPSFDataset (real observational data from Euclid SHE) alongside existing simulation workflows. The converter's generic design (convert_psf_dataset()) will support future missions (Roman, JWST) without modification.
Related: Upcoming PRs will introduce a TrainingDataAdapter pattern that builds on these refactored components.
Impact Assessment
Low risk, high value foundation work:
- No breaking changes - 100% backward compatible via deprecated wrapper
- Enables future work - Required for adapter pattern and real data support
- Small scope - Code reorganization without algorithmic changes
- Well-isolated - Changes contained to data loading/conversion layer
- Migration path - Deprecation warning guides users to new classes
Estimated files changed: 3 new, 1 modified
Estimated LOC: ~400 (mostly moved, not new logic)
Next Steps
- Implement SimulationDataLoader and TensorFlowDatasetConverter
- Add deprecation wrapper to DataHandler
- Write comprehensive unit and integration tests
- Update internal documentation with migration examples
- Merge this PR before opening follow-up adapter pattern PRs:
  - PR #199: Add TrainingDataAdapter for simulation data
  - PR (TBD): Add PSFDatasetAdapter for real data
  - PR #200: Migrate training code to use adapters
Thank you for starting this request to refactor or improve the code. We will review it and collaborate to enhance the codebase together! 🛠️