AIB001/MolScope

MolScope

Select representative molecules from a candidate set to maximize coverage of the MW × ALogP chemical space.

Given an Excel file of molecules with molecular weight (MW) and ALogP values, MolScope picks a subset that spreads as evenly as possible across the 2D chemical space. It uses greedy farthest-point initialization followed by Metropolis-Hastings simulated annealing to maximize the minimum pairwise distance between selected molecules.

Installation

pip install -e .

Quick Start

Command Line

molscope molecules.xlsx -n 100

This reads molecules.xlsx, selects 100 representative molecules, and outputs:

  • selected_molecules.xlsx — the selected subset
  • chemical_space.png — coverage visualization

Python API

from molscope import run_pipeline

df_selected, indices, min_dist, history = run_pipeline(
    input_file="molecules.xlsx",
    n_select=100,
)

MCP Server

MolScope can run as an MCP server for integration with Claude Code or other MCP clients:

python molscope/server.py

CLI Options

| Option | Default | Description |
| --- | --- | --- |
| input_file | (required) | Input Excel file |
| -n, --n-select | 100 | Number of molecules to select |
| -i, --iterations | 100000 | Metropolis-Hastings iterations |
| -o, --output | selected_molecules.xlsx | Output Excel path |
| -p, --plot | chemical_space.png | Output plot path |
| --mw-col | MW | Column name for molecular weight |
| --alogp-col | ALogP | Column name for ALogP |
| --lambda-var | 0.01 | Weight $\lambda$ of the spatial variance term |
| --seed | 42 | Random seed |
| -q, --quiet | off | Suppress progress output |

Input Format

An Excel file (.xlsx) with at least two numeric columns for MW and ALogP. Rows with missing values in these columns are automatically dropped.

Method

Problem Formulation

Given $N$ candidate molecules, each described by molecular weight $\text{MW}_i$ and $\text{ALogP}_i$, select a subset $S$ of size $k$ that maximally covers the 2D chemical space.

Normalization. Raw descriptors are scaled to $[-1, 1]$:

$$\hat{x}_i = \frac{2(x_i - x_{\min})}{x_{\max} - x_{\min}} - 1$$

Each molecule is then represented as a point $\mathbf{p}_i = (\widehat{\text{MW}}_i,\; \widehat{\text{ALogP}}_i) \in [-1, 1]^2$.
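The scaling above can be sketched in a few lines of standard-library Python. This is an illustration only, not MolScope's own code: the function name `scale_to_unit_range` and the sample values are hypothetical, and mapping a constant column to 0 is an assumption the source does not specify.

```python
def scale_to_unit_range(values):
    """Linearly map values onto [-1, 1] using their own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no spread, so map it to 0.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


# Hypothetical descriptor values for three molecules.
mw = [150.0, 300.0, 450.0]
alogp = [-1.0, 1.0, 5.0]

# Each molecule becomes a point (scaled MW, scaled ALogP) in [-1, 1]^2.
points = list(zip(scale_to_unit_range(mw), scale_to_unit_range(alogp)))
```

Each descriptor is scaled independently, so MW (hundreds of daltons) and ALogP (a few log units) contribute comparably to Euclidean distances.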

Objective. The optimization maximizes:

$$f(S) = \underbrace{\min_{i,j \in S,\; i \neq j} \lVert \mathbf{p}_i - \mathbf{p}_j \rVert_2}_{d_{\min}(S)} + \lambda \underbrace{\bigl(\operatorname{Var}(S_1) + \operatorname{Var}(S_2)\bigr)}_{\text{spatial variance}}$$

where $S_1, S_2$ are the first and second coordinates of the selected points, and $\lambda$ (default 0.01) balances the two terms. The first term (maximin criterion) prevents clustering; the second encourages spread across the full space.
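As a rough sketch, the objective can be evaluated directly from a list of normalized points. The names `min_pairwise_distance` and `objective` are illustrative, and the use of population variance (rather than sample variance) is an assumption, since the formula above does not say which one is meant.

```python
import math
from itertools import combinations
from statistics import pvariance


def min_pairwise_distance(points):
    """d_min(S): smallest Euclidean distance over all pairs of points."""
    return min(math.dist(p, q) for p, q in combinations(points, 2))


def objective(points, lambda_var=0.01):
    """f(S) = d_min(S) + lambda * (Var(S_1) + Var(S_2))."""
    # Assumption: population variance; the source does not specify.
    spread = pvariance([p[0] for p in points]) + pvariance([p[1] for p in points])
    return min_pairwise_distance(points) + lambda_var * spread


# The four corners of the normalized square: d_min is 2 and each
# coordinate has variance 1, so f = 2 + 0.01 * (1 + 1).
corners = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
f_corners = objective(corners)
```

With the default $\lambda = 0.01$ the variance term is small next to $d_{\min}$, so it acts mainly as a tie-breaker between subsets with similar minimum distance.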

Algorithm

Step 1: Greedy farthest-point initialization. Start with the pair $(i^*, j^*)$ of maximum distance. Iteratively add the point with the largest minimum distance to the current set:

$$s_{t+1} = \arg\max_{i \notin S_t} \min_{j \in S_t} \lVert \mathbf{p}_i - \mathbf{p}_j \rVert_2$$

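This gives a 2-approximation to the maximin problem: the minimum pairwise distance of the greedy set is at least half that of the best possible subset of size $k$. The greedy set serves as the initial solution.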
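A minimal sketch of this initialization in standard-library Python follows. It is not MolScope's implementation: the function name `farthest_point_init` and the example points are hypothetical, and the all-pairs scan for the seed pair is the simplest choice rather than the fastest.

```python
import math
from itertools import combinations


def farthest_point_init(points, k):
    """Return indices of k points chosen by greedy farthest-point selection."""
    n = len(points)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of points")
    # Seed with the most distant pair (simple O(N^2) scan).
    i, j = max(combinations(range(n), 2),
               key=lambda pair: math.dist(points[pair[0]], points[pair[1]]))
    selected = [i, j]
    # nearest[c]: distance from candidate c to its closest selected point;
    # already-selected indices are marked with -1 so they are never re-picked.
    nearest = [min(math.dist(p, points[i]), math.dist(p, points[j])) for p in points]
    nearest[i] = nearest[j] = -1.0
    while len(selected) < k:
        nxt = max(range(n), key=nearest.__getitem__)
        selected.append(nxt)
        nearest = [d if d < 0 else min(d, math.dist(p, points[nxt]))
                   for d, p in zip(nearest, points)]
        nearest[nxt] = -1.0
    return selected


# Hypothetical normalized points: four corners plus two near the centre.
pts = [(-1.0, -1.0), (1.0, 1.0), (-1.0, 1.0), (0.0, 0.0), (1.0, -1.0), (0.1, 0.1)]
init = farthest_point_init(pts, 4)
```

Keeping the `nearest` list up to date makes each greedy step linear in the number of candidates, since only distances to the newly added point need to be computed.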

Step 2: Metropolis-Hastings refinement with simulated annealing. At each iteration $t$:

  1. Propose a swap: randomly pick $a \in S$ and $b \notin S$, and let $S' = (S \setminus \{a\}) \cup \{b\}$
  2. Compute $\Delta = f(S') - f(S)$
  3. Accept with probability:

$$P(\text{accept}) = \begin{cases} 1 & \text{if } \Delta > 0 \\ \exp(\Delta / T_t) & \text{otherwise} \end{cases}$$

The temperature follows an exponential decay schedule:

$$T_t = T_{\text{start}} \cdot \left(\frac{T_{\text{end}}}{T_{\text{start}}}\right)^{t/(n_{\text{iter}}-1)}$$

where $T_{\text{start}} = 0.05$ and $T_{\text{end}} = 0.0005$ by default. Early iterations explore broadly; later iterations exploit the best region found.
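The refinement loop can be sketched as below, again with the standard library only. All names (`temperature`, `anneal`, `grid`, `start`) are illustrative and not MolScope's API. Two assumptions go beyond the text: the objective uses population variance, and the sketch returns the best subset seen during the run rather than the final state.

```python
import math
import random
from itertools import combinations
from statistics import pvariance


def objective(pts, lambda_var=0.01):
    """f(S) = d_min(S) + lambda * (Var(S_1) + Var(S_2)); population variance assumed."""
    d_min = min(math.dist(p, q) for p, q in combinations(pts, 2))
    return d_min + lambda_var * (pvariance([p[0] for p in pts]) + pvariance([p[1] for p in pts]))


def temperature(t, n_iter, t_start=0.05, t_end=0.0005):
    """Exponential decay from t_start (t = 0) to t_end (t = n_iter - 1)."""
    if n_iter <= 1:
        return t_start
    return t_start * (t_end / t_start) ** (t / (n_iter - 1))


def anneal(points, selected, n_iter=2000, lambda_var=0.01, seed=42):
    """Refine a selection (list of indices) by swap proposals with annealing."""
    rng = random.Random(seed)
    current = list(selected)
    chosen = set(current)
    outside = [i for i in range(len(points)) if i not in chosen]
    if not outside:
        raise ValueError("need at least one unselected point to swap in")
    f_cur = objective([points[i] for i in current], lambda_var)
    best, f_best = list(current), f_cur  # assumption: keep the best state seen
    for t in range(n_iter):
        a = rng.randrange(len(current))
        b = rng.randrange(len(outside))
        current[a], outside[b] = outside[b], current[a]  # propose the swap
        f_new = objective([points[i] for i in current], lambda_var)
        delta = f_new - f_cur
        if delta > 0 or rng.random() < math.exp(delta / temperature(t, n_iter)):
            f_cur = f_new  # accept
            if f_cur > f_best:
                best, f_best = list(current), f_cur
        else:
            current[a], outside[b] = outside[b], current[a]  # reject: undo
    return best, f_best


# Hypothetical 5 x 5 grid on [-1, 1]^2 with a deliberately clustered start.
grid = [(x / 2 - 1, y / 2 - 1) for x in range(5) for y in range(5)]
start = [0, 1, 2, 3]
best, f_best = anneal(grid, start)
```

The sketch recomputes $f(S')$ from scratch on every proposal, which costs $O(k^2)$ per iteration. That is fine for illustration, but a run with the default 100000 iterations would likely call for incremental distance updates.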

Output

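The achieved $d_{\min}$ defines a non-overlapping coverage radius $R = d_{\min} / 2$ around each selected molecule: any two selected points are at least $2R$ apart, so circles of radius $R$ can touch but never overlap. The plot draws these circles to visualize coverage.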

License

MIT
