Cfret pilot analysis #60
wli51 left a comment
LGTM! Great work, and happy to see things coming together. I have a question about the median aggregation step that generates the centroids; it would be great if you could clarify it a little more.
    # calculate phenotypic distance scores
    if treatment_dist_scores_outpath.exists():
        print("Treatment phenotypic distance scores already exist, skipping this step.")
You might want to add a content check to ensure everything needed is present, so that if you include or exclude samples earlier in the pipeline, it knows whether the outputs need to be regenerated. Alternatively, a force-re-run flag would be handy.
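A minimal sketch of such a check, assuming the step's inputs can be summarized by a list of sample IDs. The helper names (`input_fingerprint`, `needs_rerun`, `record_inputs`), the `force` flag, and the sidecar-file convention are all hypothetical and not part of this PR; only the standard library is used.

```python
import hashlib
import json
from pathlib import Path


def input_fingerprint(sample_ids):
    """Hash the sorted, de-duplicated sample IDs that feed the step."""
    payload = json.dumps(sorted(set(sample_ids))).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def _sidecar_path(outpath):
    # Hypothetical convention: store the fingerprint next to the output file.
    outpath = Path(outpath)
    return outpath.with_name(outpath.name + ".inputs.sha256")


def needs_rerun(outpath, sample_ids, force=False):
    """Return True if the output is missing, stale, or a re-run is forced."""
    outpath = Path(outpath)
    sidecar = _sidecar_path(outpath)
    if force or not outpath.exists() or not sidecar.exists():
        return True
    # Re-run when the recorded inputs differ from the current ones.
    return sidecar.read_text().strip() != input_fingerprint(sample_ids)


def record_inputs(outpath, sample_ids):
    """Call after writing the output so later runs can detect changed inputs."""
    _sidecar_path(outpath).write_text(input_fingerprint(sample_ids))
```

In the notebook, the `exists()` guard could then become something like `if not needs_rerun(treatment_dist_scores_outpath, sample_ids, force=FORCE_RERUN):`, where `sample_ids` and `FORCE_RERUN` are placeholders for whatever the pipeline actually tracks.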
    cfret_df.select(cfret_meta + cfret_feats).head()

    # We use **median aggregation** to generate centroid profiles for each cluster. For each cluster, we calculate the component-wise median across all cells to create a synthetic representative profile that captures the central tendency. This approach is robust to outliers, consistent with replicate and consensus profile generation workflow, and works well for high-dimensional morphological features.
Maybe I am misunderstanding your goals or the pycytominer aggregation, so please let me know if so. The documentation here says the data are aggregated by component-wise median, which I assume just means taking the median of each feature separately (or is this a typo and you meant the geometric median?). I have read that component-wise median operations generalize poorly from 1D to higher-dimensional data and can produce unrealistic centroid profiles.
In the extreme case, if we have strongly anti-correlated or correlated features, the median profile can be far from anything typical or realistic, unless this type of correlation structure has already been dealt with:
| Sample | Feature X | Feature Y |
|--------|-----------|-----------|
| S1 | 1 | 10 |
| S2 | 2 | 9 |
| S3 | 9 | 2 |
| S4 | 10 | 1 |
| Median | 5.5 | 5.5 |
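The table can be checked numerically. The sketch below recomputes the component-wise median for the four samples and measures how far it lies from the nearest observed sample; it also reports a medoid (the observed sample with the smallest summed distance to the others) as one possible alternative that always returns a real profile. The helper names are illustrative only and this is not what pycytominer does internally. Note that the geometric median would not help for this particular table: the four points are collinear, so any point between S2 and S3, including (5.5, 5.5), minimizes the summed distance.

```python
import math
from statistics import median

# Toy data copied from the table above: (Feature X, Feature Y) per sample.
samples = {
    "S1": (1, 10),
    "S2": (2, 9),
    "S3": (9, 2),
    "S4": (10, 1),
}


def componentwise_median(points):
    """Median of each feature taken independently of the others."""
    return tuple(median(axis) for axis in zip(*points))


def medoid(named_points):
    """Name of the observed sample with the smallest summed distance to all samples."""
    return min(
        named_points,
        key=lambda name: sum(
            math.dist(named_points[name], other) for other in named_points.values()
        ),
    )


center = componentwise_median(samples.values())
nearest = min(math.dist(center, p) for p in samples.values())

print("component-wise median:", center)
print("distance to nearest sample:", round(nearest, 2))
print("medoid:", medoid(samples))  # S2 and S3 tie, so either may be reported
```

The component-wise median sits about 4.95 units from the closest sample, while neighbouring samples within each group are only about 1.41 units apart, which is the sense in which the synthetic profile is unrepresentative here.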
This PR prepares the analysis of outputs generated by the buscar pipeline. It introduces three notebooks that together document the full buscar workflow, the generation of aggregate profiles, and the computation of cluster centroids identified by buscar.
These notebooks produce intermediate outputs that enable inspection of how buscar processes the data and derives its scoring. Plots and final figures will be added in a subsequent PR.
Notebooks
Other changes
drug_x(data cannot be published).