Description
OptiSim is a distance-based selection method implemented in Selector. The selection step itself scales linearly with the number of points to select. In practice, however, the run time is dominated by the setup phase, which searches for the radius that yields the requested number of selected data points. In the current implementation, this cost makes OptiSim impractical for large databases. The primary goal of this project is to remove this limitation.
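To make the cost of the setup phase concrete, here is a minimal pure-Python sketch of a radius search. It is not Selector's implementation: it uses a plain greedy sphere-exclusion selection instead of OptiSim, and the names `sphere_exclusion` and `find_radius`, the bisection strategy, and the tolerance are all assumptions made for illustration. The point it shows is that every trial radius reruns the whole selection, so the total cost is the selection cost multiplied by the number of trials.

```python
import math
import random


def sphere_exclusion(points, radius):
    """Greedy selection: keep a point only if it lies farther than
    `radius` from every point kept so far (hypothetical helper)."""
    selected = []
    for p in points:
        if all(math.dist(p, q) > radius for q in selected):
            selected.append(p)
    return selected


def find_radius(points, n_target, tol=0.05, max_iter=50):
    """Bisect on the radius until the number of selected points is
    within a fraction `tol` of `n_target`, or `max_iter` is reached.

    Returns (radius, selected, n_trials) for the best trial seen.
    """
    lo = 0.0
    # With this radius only the first point survives the selection.
    hi = max(math.dist(points[0], p) for p in points)
    best = None
    n_trials = 0
    for n_trials in range(1, max_iter + 1):
        radius = 0.5 * (lo + hi)
        # Each trial repeats the full selection: this is the setup cost.
        selected = sphere_exclusion(points, radius)
        error = abs(len(selected) - n_target)
        if best is None or error < abs(len(best[1]) - n_target):
            best = (radius, selected)
        if error <= tol * n_target:
            break
        if len(selected) > n_target:
            lo = radius  # too many points selected: grow the radius
        else:
            hi = radius  # too few points selected: shrink the radius
    return best[0], best[1], n_trials


random.seed(0)
points = [(random.random(), random.random()) for _ in range(300)]
radius, selected, n_trials = find_radius(points, n_target=30)
print(f"radius={radius:.4f}, selected={len(selected)}, trials={n_trials}")
```

The number of selected points is not guaranteed to decrease monotonically with the radius for a greedy selection, so the sketch keeps the best trial rather than assuming the bisection converges exactly.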
📚 Package Description and Impact
The Selector library provides methods for selecting a diverse subset of a dataset. The goal is to build representative subsets from large datasets without introducing bias, and in some cases even reducing it (e.g., when the original dataset is unbalanced). This is critical for data-driven modeling across various fields and helps mitigate issues such as class imbalance in machine learning training sets. More information about this package can be found at https://www.biorxiv.org/content/10.1101/2025.11.21.689756v1 and https://selector.qcdevs.org.
👷 What will you do?
The main focus will be to refactor the OptiSim class and likely the radius optimization utility function. The objective is to decrease the computational prefactor by several orders of magnitude, while at the same time maintaining readability and a clear mapping between the implementation and the underlying mathematics.
🏁 Expected Outcomes
- Decrease the computational cost of OptiSim by several orders of magnitude.
- Improve the efficiency of the radius optimization used by OptiSim.
- Write tests to ensure correctness and numerical stability after performance optimizations.
- Provide profiling-based examples demonstrating the performance improvements.
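One possible shape for a profiling-based example, using only the standard-library `cProfile` and `pstats` modules, is sketched below. The workload `pairwise_min_distance` and the wrapper `profile_call` are hypothetical stand-ins, not Selector code. In an actual example the profiled call would be the OptiSim selection before and after the refactoring.

```python
import cProfile
import io
import math
import pstats
import random


def pairwise_min_distance(points):
    """Deliberately naive O(N^2) stand-in for an expensive setup phase."""
    return min(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    # Collect the report in memory, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
result, report = profile_call(pairwise_min_distance, points)
print(report)
```

Comparing two such reports, one per implementation, shows where the time goes as well as how much of it was saved.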
| Required skills | Python, OOP, DevOps |
| Preferred skills | Experience profiling and improving the performance of Python applications, preferably in scientific computing |
| Project size | Small |
| Difficulty | Medium |
🙋 Mentors
| Fanwang Meng | fanwang.meng_at_queensu.ca | @FanwangM |
| Marco Martínez-González | mmg870630_at_gmail_dot_com | @marco-2023 |
| Farnaz Heidar-Zadeh | farnaz.heidarzadeh_at_queensu.ca | @FarnazH |
🏋️ Stretch Goal
Also improve the performance of the directed sphere exclusion (DISE) algorithm.