Refactor: Generalize dataset base classes & consolidate dynamic splitting logic #122

@aditya0by0

Description:
The dynamic splitting code in chebi.py is duplicated in the proteins repo's data class. The two implementations are effectively the same, so every change to the splitting logic has to be made in two places.

Proposed changes:

  1. Move the common code to a base class (e.g., DynamicDataset) that encapsulates the shared dynamic splitting logic.

    • Both ChEBI and protein dataset classes should inherit from this base class.
    • This will centralize changes and make maintenance easier.
  2. Refactor dataset hierarchy to be more generic:

    • Certain hyperparameters that are specific to ChEBI, such as

      chebi_version: int = 200

      in XYBaseDataModule, should be pushed down into a ChEBI-specific base class rather than living in the generic base class.

  3. Expected outcome:

    • Eliminate duplicate code between chebi.py and the proteins repo.
    • Improve maintainability by isolating dataset-specific configurations.
    • Make it easier to introduce new datasets without rewriting the splitting logic.
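
The proposed hierarchy could look roughly like the sketch below. It is only an illustration under stated assumptions: XYBaseDataModule, DynamicDataset and chebi_version come from this issue, while ChEBIDataset, ProteinDataset, load_ids, dynamic_split, seed, train_ratio and val_ratio are hypothetical names. The split itself is a plain seeded random split used as a stand-in, not the splitting logic currently in chebi.py.

```python
import random
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class XYBaseDataModule(ABC):
    """Generic base (simplified stand-in): dataset-agnostic settings only."""

    def __init__(self, seed: int = 42):
        self.seed = seed  # hypothetical parameter, used to make splits reproducible


class DynamicDataset(XYBaseDataModule, ABC):
    """Holds the dynamic splitting logic once, for every dataset."""

    def __init__(self, train_ratio: float = 0.8, val_ratio: float = 0.1, **kwargs):
        super().__init__(**kwargs)
        if train_ratio <= 0 or val_ratio < 0 or train_ratio + val_ratio >= 1:
            raise ValueError("ratios must leave a non-empty share for the test split")
        self.train_ratio = train_ratio
        self.val_ratio = val_ratio

    @abstractmethod
    def load_ids(self) -> Sequence[str]:
        """Return the identifiers of all samples (dataset-specific)."""

    def dynamic_split(self) -> Dict[str, List[str]]:
        # Sort first so the result depends only on the seed, not on load order.
        ids = sorted(self.load_ids())
        random.Random(self.seed).shuffle(ids)
        n_train = int(len(ids) * self.train_ratio)
        n_val = int(len(ids) * self.val_ratio)
        return {
            "train": ids[:n_train],
            "validation": ids[n_train:n_train + n_val],
            "test": ids[n_train + n_val:],
        }


class ChEBIDataset(DynamicDataset):
    """ChEBI-specific settings live here instead of in the generic base."""

    def __init__(self, chebi_version: int = 200, **kwargs):
        super().__init__(**kwargs)
        self.chebi_version = chebi_version

    def load_ids(self) -> Sequence[str]:
        # Placeholder data; a real implementation would read the ChEBI release.
        return [f"CHEBI:{i}" for i in range(10)]


class ProteinDataset(DynamicDataset):
    """Inherits the same splitting logic without any ChEBI parameters."""

    def load_ids(self) -> Sequence[str]:
        # Placeholder data; a real implementation would read protein records.
        return [f"P{i:05d}" for i in range(10)]
```

With this layout, a new dataset only has to implement load_ids (or whatever hook the real base class ends up exposing), and chebi_version no longer appears on classes that have nothing to do with ChEBI.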

    Labels: enhancement (New feature or request), on-hold (Work on this issue is temporarily paused), priority: low (Issue with low priority)
