
Conversation

@axiomcura (Member) commented Jan 16, 2026

This PR sets up the analysis of outputs generated by the buscar pipeline. It introduces three notebooks that together document the full buscar workflow, the generation of aggregate profiles, and the computation of centroids for the clusters identified by buscar.

These notebooks produce intermediate outputs that enable inspection of how buscar processes the data and derives its scoring. Plots and final figures will be added in a subsequent PR.

Notebooks

  • 1.cfret-pilot-buscar-analysis.ipynb: Runs the full buscar pipeline.
  • 2.2.generate-aggregate-profiles.ipynb: Generates aggregate profiles at the replicate level and as a treatment-level consensus (see the sketch after this list).
  • 3.generate-centroid.ipynb: Computes a centroid for each cluster identified by the buscar module.
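
A minimal sketch of the two aggregation levels, assuming pycytominer's aggregate() with hypothetical metadata column names and input path (the notebook's actual I/O and feature columns may differ):

```python
# A minimal sketch, assuming pycytominer's aggregate() and hypothetical
# metadata column names / input path; the notebook's actual I/O may differ.
import pandas as pd
from pycytominer import aggregate

# hypothetical single-cell profile table with CellProfiler-style feature columns
profiles = pd.read_parquet("single_cell_profiles.parquet")

# replicate-level profiles: one median profile per well
replicate_profiles = aggregate(
    population_df=profiles,
    strata=["Metadata_Plate", "Metadata_Well"],
    features="infer",          # infers CellProfiler-style feature columns
    operation="median",
)

# treatment-level consensus: one median profile per treatment
consensus_profiles = aggregate(
    population_df=profiles,
    strata=["Metadata_treatment"],
    features="infer",
    operation="median",
)
```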

Other changes

  • 2.preprocessing.ipynb: Removed drug_x (data cannot be published).
  • Updated the clustering parameter grid to allow a broader search of the parameter space (see the hypothetical sketch after this list).
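
Purely as an illustration of what a broader grid could look like (the actual clusterer and parameter names used in this repo are not shown in this PR description), using scikit-learn's ParameterGrid:

```python
# A hypothetical illustration of a broader clustering parameter grid; the
# actual clusterer and parameter names in the repo may differ.
from sklearn.model_selection import ParameterGrid

cluster_param_grid = ParameterGrid({
    "n_neighbors": [15, 30, 50],
    "min_cluster_size": [25, 50, 100, 200],
    "min_samples": [5, 10, 25],
})

for params in cluster_param_grid:
    # fit the clustering model with `params` here and record a quality metric
    ...
```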


@axiomcura marked this pull request as ready for review January 16, 2026 16:15
@axiomcura requested a review from wli51 January 16, 2026 20:45
@wli51 (Contributor) left a comment


LGTM! Great work, and happy to see things coming together. I have a question about the median aggregation step used to generate the centroids; if you could clarify things a little more, that would be great.


```python
# calculate phenotypic distance scores
if treatment_dist_scores_outpath.exists():
    print("Treatment phenotypic distance scores already exist, skipping this step.")
```

You might want to perform a content check to ensure everything needed is present, so that if samples are included or excluded earlier on, the pipeline knows whether this output needs to be regenerated. Alternatively, a force-rerun flag would be handy.
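
A minimal sketch of that suggestion, assuming a hypothetical FORCE_RERUN flag, a parquet output file, and a Metadata_treatment column (none of which are confirmed by the notebook under review):

```python
# A minimal sketch of the suggestion; the flag, file format, and column names
# below are hypothetical, not the notebook's actual code.
import polars as pl

FORCE_RERUN = False                                  # hypothetical user-facing flag
REQUIRED_TREATMENTS = {"DMSO", "drug_a", "drug_b"}   # hypothetical expected contents

def needs_regeneration(outpath) -> bool:
    """Return True if the distance scores are missing, incomplete, or forced to rerun."""
    if FORCE_RERUN or not outpath.exists():
        return True
    # content check: confirm every expected treatment is present in the output
    present = set(pl.read_parquet(outpath)["Metadata_treatment"].unique().to_list())
    return not REQUIRED_TREATMENTS.issubset(present)

if needs_regeneration(treatment_dist_scores_outpath):
    ...  # (re)compute phenotypic distance scores here
else:
    print("Treatment phenotypic distance scores already exist, skipping this step.")
```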

```python
cfret_df.select(cfret_meta + cfret_feats).head()
```


# We use **median aggregation** to generate centroid profiles for each cluster. For each cluster, we calculate the component-wise median across all cells to create a synthetic representative profile that captures the central tendency. This approach is robust to outliers, consistent with replicate and consensus profile generation workflow, and works well for high-dimensional morphological features.
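
For concreteness, the component-wise median described above amounts to something like the following (a minimal sketch, assuming a polars DataFrame and a hypothetical Metadata_cluster_id column; the notebook's actual column names may differ):

```python
# A minimal sketch of the described aggregation, assuming a polars DataFrame
# and a hypothetical Metadata_cluster_id column; actual column names may differ.
import polars as pl

cluster_centroids = (
    cfret_df
    .group_by("Metadata_cluster_id")
    # component-wise median: each feature is median-ed independently per cluster
    .agg([pl.col(feat).median() for feat in cfret_feats])
)
```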

Maybe I am misunderstanding your goals or the pycytominer aggregation, so please let me know if so. You said here in the documentation that the data was aggregated by component-wise median, which I assume just involves median-ing every feature separately (or is this a typo and you meant the geometric median?). I have read that component-wise median operations generalize poorly from 1D to higher-dimensional data and can produce unrealistic centroid profiles.

In the extreme case, if we have very anti-correlated or correlated features, the median profile can be far from anything that is typical or realistic, unless this type of correlation structure has been dealt with previously:

| Sample | Feature X | Feature Y |
|--------|-----------|-----------|
| S1     |    1      |    10     |
| S2     |    2      |     9     |
| S3     |    9      |     2     |
| S4     |   10      |     1     |
| Median    |   5.5     |   5.5     |
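
To make the concern concrete, here is a small standalone check (not notebook code) showing that the component-wise median of the table above lies far from every observed sample:

```python
# A small standalone check using the table's values: the component-wise
# median lands far from every observed sample.
import numpy as np

X = np.array([[1, 10], [2, 9], [9, 2], [10, 1]], dtype=float)

centroid = np.median(X, axis=0)               # -> array([5.5, 5.5])
dists = np.linalg.norm(X - centroid, axis=1)  # distance of each sample to the centroid

print(centroid)      # [5.5 5.5]
print(dists.min())   # ~4.95, vs. ~1.41 between the two closest actual samples
```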
