GSOC 2026: Improve efficiency of OptiSim selection method #285

@marco-2023

Description

OptiSim is a distance-based method implemented in Selector. The method itself scales linearly with the number of points to select. However, the algorithm is slowed down by its setup phase, which searches for the radius that yields the required number of selected data points. In the current implementation, the cost of this search makes the method impractical for large databases. The primary goal of this project is to remove this limitation.
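To make the cost of the setup phase concrete, here is a minimal sketch of a radius search by bisection. It is not Selector's implementation: `sphere_exclusion` is a simplified greedy stand-in for the selection step (not the full OptiSim algorithm), and both function names, the tolerance, and the iteration cap are assumptions chosen for illustration. The point it shows is that every trial radius reruns the whole selection.

```python
import numpy as np


def sphere_exclusion(points, radius):
    """Simplified stand-in for the selection step (hypothetical helper).

    Keep a point only if it lies farther than `radius` from every point
    kept so far. Returns the indices of the kept points.
    """
    selected = [0]  # start from the first point (arbitrary choice)
    for i in range(1, len(points)):
        dists = np.linalg.norm(points[selected] - points[i], axis=1)
        if np.all(dists > radius):
            selected.append(i)
    return selected


def find_radius(points, n_select, tol=0.05, max_iter=50):
    """Bisect on the radius until the selection size is within
    `tol * n_select` of the target (hypothetical helper).

    Returns the best radius found, the matching selection, and the number
    of full selection runs that were needed.
    """
    lo = 0.0
    # The bounding-box diagonal is an upper bound on any pairwise distance.
    hi = float(np.linalg.norm(points.max(axis=0) - points.min(axis=0)))
    best = None
    for n_calls in range(1, max_iter + 1):
        mid = 0.5 * (lo + hi)
        selected = sphere_exclusion(points, mid)  # full rerun per trial radius
        if best is None or abs(len(selected) - n_select) < abs(len(best[1]) - n_select):
            best = (mid, selected)
        if abs(len(selected) - n_select) <= tol * n_select:
            break
        if len(selected) > n_select:
            lo = mid  # too many points kept: the radius is too small
        else:
            hi = mid  # too few points kept: the radius is too large
    return best[0], best[1], n_calls


rng = np.random.default_rng(0)
points = rng.random((300, 2))
radius, selected, n_calls = find_radius(points, n_select=30)
print(f"radius={radius:.4f}, selected={len(selected)}, selection runs={n_calls}")
```

In this sketch the total cost is roughly the cost of one selection multiplied by the number of trial radii, which is why reducing either factor lowers the prefactor.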

📚 Package Description and Impact

The Selector library provides methods for selecting a diverse subset of a dataset. The goal is to build representative subsets from large datasets without introducing bias and, in some cases, while even reducing it (e.g., when the original dataset is unbalanced). This is critical for data-driven modeling across various fields and helps mitigate issues such as class imbalance in machine learning training sets. More information about this package can be found at https://www.biorxiv.org/content/10.1101/2025.11.21.689756v1 and https://selector.qcdevs.org.

👷 What will you do?

The main focus will be to refactor the OptiSim class and, most likely, the radius optimization utility function. The objective is to decrease the computational prefactor by several orders of magnitude while maintaining readability and a clear mapping between the implementation and the underlying mathematics.

🏁 Expected Outcomes

  1. Decrease the computational cost of OptiSim by several orders of magnitude.
  2. Improve the efficiency of the radius optimization used by OptiSim.
  3. Write tests to ensure correctness and numerical stability after performance optimizations.
  4. Provide profiling-based examples demonstrating the performance improvements.

Required skills: Python, OOP, DevOps
Preferred skills: Experience profiling and improving the performance of Python applications, preferably in scientific computing
Project size: Small
Difficulty: Medium
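Outcomes 3 and 4 could be approached along the following lines: pin the optimized routine to a reference implementation with an equality check, then time both. This is only a sketch under stated assumptions. `reference_select`, `fast_select`, and `time_call` are hypothetical names, and the selection rule is a simplified sphere exclusion used as a stand-in, not Selector's OptiSim code.

```python
import time

import numpy as np


def reference_select(points, radius):
    """Baseline (hypothetical): test each candidate against all kept points."""
    selected = [0]
    for i in range(1, len(points)):
        dists = np.linalg.norm(points[selected] - points[i], axis=1)
        if np.all(dists > radius):
            selected.append(i)
    return selected


def fast_select(points, radius):
    """Same selection rule (hypothetical), restructured so that each kept
    point excludes its neighbours in one vectorized pass. NumPy is called
    once per kept point rather than once per candidate."""
    excluded = np.zeros(len(points), dtype=bool)
    selected = []
    for i in range(len(points)):
        if excluded[i]:
            continue
        selected.append(i)
        excluded |= np.linalg.norm(points - points[i], axis=1) <= radius
    return selected


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


rng = np.random.default_rng(42)
points = rng.random((2000, 3))
radius = 0.15

# Correctness first: the optimized routine must reproduce the baseline.
assert fast_select(points, radius) == reference_select(points, radius)

print(f"reference: {time_call(reference_select, points, radius):.4f} s")
print(f"fast:      {time_call(fast_select, points, radius):.4f} s")
```

Checking equality before timing keeps the performance numbers meaningful. For a finer breakdown, the same calls could be run under the standard library's `cProfile` module instead of the simple timer.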

🙋 Mentors

Fanwang Meng fanwang.meng_at_queensu.ca @FanwangM
Marco Martínez-González mmg870630_at_gmail_dot_com @marco-2023
Farnaz Heidar-Zadeh farnaz.heidarzadeh_at_queensu.ca @FarnazH

🏋️ Stretch Goal

Also improve the performance of the directed sphere exclusion (DISE) algorithm.
