feat: add hierarchical FDR correction for dose-response data #116

shntnu · 2025-12-08T02:29:36Z

Summary

Implements two-stage hierarchical FDR to reduce over-correction when testing related hypotheses (e.g., multiple doses of the same compound).

- Add hierarchical_by parameter to mean_average_precision()
- Stage 1: Use minimum p-value per group, apply BH at group level
- Stage 2: For significant groups, apply BH within each group
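
The two stages can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `bh_adjust` and `hierarchical_bh` are hypothetical names, the input is a list of `(group, p_value)` pairs rather than a DataFrame, significance is taken as corrected p-value strictly below the threshold, and groups that fail Stage 1 are simply marked non-significant.

```python
from collections import defaultdict


def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so the adjusted
    # values stay monotone in the original ranking.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def hierarchical_bh(rows, threshold=0.05):
    """Two-stage BH on (group, p_value) pairs; returns one dict per row."""
    members = defaultdict(list)
    for i, (group, _) in enumerate(rows):
        members[group].append(i)

    # Stage 1: minimum p-value per group, BH across groups.
    names = list(members)
    stage1_p = {g: min(rows[i][1] for i in members[g]) for g in names}
    stage1_adj = dict(zip(names, bh_adjust([stage1_p[g] for g in names])))

    results = [None] * len(rows)
    for g in names:
        passed = stage1_adj[g] < threshold
        # Stage 2: BH within the group, only if the group passed Stage 1.
        within = bh_adjust([rows[i][1] for i in members[g]]) if passed else None
        for k, i in enumerate(members[g]):
            corrected = within[k] if passed else None
            results[i] = {
                "group": g,
                "stage1_p_value": stage1_p[g],
                "stage1_corrected_p_value": stage1_adj[g],
                "stage1_significant": passed,
                "corrected_p_value": corrected,
                "significant": passed and corrected < threshold,
            }
    return results


# Hypothetical p-values for three compounds, not data from the PR.
rows = [("A", 0.001), ("A", 0.20), ("A", 0.04),
        ("B", 0.30), ("B", 0.50),
        ("C", 0.60), ("C", 0.02)]
hits = [i for i, r in enumerate(hierarchical_bh(rows)) if r["significant"]]
print(hits)
```

In this toy input, compound B fails Stage 1, so none of its doses are tested in Stage 2. Compounds A and C pass and are corrected only against their own doses.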

Usage

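from copairs.map import mean_average_precision

# ap_scores: average-precision table computed in a previous step
result = mean_average_precision(
    ap_scores,
    sameby=["compound", "dose"],      # mAP per compound×dose
    hierarchical_by=["compound"],     # but correct at compound level
    null_size=10000,
    threshold=0.05,
    seed=42,
)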

When hierarchical_by is specified, the result includes additional columns:

- stage1_p_value: Group-level p-value (minimum p-value in group)
- stage1_corrected_p_value: BH-corrected Stage 1 p-value
- stage1_significant: Whether the group passed Stage 1

Why min p-value instead of Simes?

For dose-response data, low doses are expected to be inactive. Simes' method penalizes compounds for having inactive low doses, which is biologically normal. Min p-value is more appropriate: a compound passes Stage 1 if ANY dose shows activity.

Example on LINCS data (4 plates, 58 compounds × 6 doses):

- Flat BH: 26 significant doses
- Hierarchical with Simes: 33 significant doses (but misses compounds with strong high-dose signal)
- Hierarchical with min-p: 49 significant doses (88% power gain over flat BH)
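
To make the difference between the two aggregation rules concrete, here is a small sketch comparing them on one group. The p-values are hypothetical (one active high dose, five inactive lower doses), and `simes_pvalue` / `min_pvalue` are illustrative implementations, not necessarily the functions from this PR.

```python
def simes_pvalue(pvalues):
    """Simes combination: min over ranks i of n * p_(i) / i."""
    n = len(pvalues)
    ranked = enumerate(sorted(pvalues), start=1)
    return min(n * p / rank for rank, p in ranked)


def min_pvalue(pvalues):
    """Min-p aggregation: the group passes on its single best dose."""
    return min(pvalues)


# Hypothetical 6-dose compound, not data from the PR.
doses = [0.002, 0.30, 0.45, 0.60, 0.75, 0.90]
print(simes_pvalue(doses), min_pvalue(doses))
```

With a single active dose, the Simes value is the smallest p-value multiplied by the group size (about 0.012 here versus 0.002 for min-p), which is the dilution described above. Simes is never smaller than min-p, because each term n * p_(i) / i is at least p_(i).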

Why hierarchical FDR matters

With 1000 compounds × 5 doses = 5000 tests, standard BH corrects all 5000 as one flat family. But doses of the same compound test the same underlying hypothesis. Hierarchical FDR:

- Stage 1: 1000 compound tests → say 50 pass (if any dose is active)
- Stage 2: 50 × 5 = 250 dose tests, corrected in groups of 5
- Much less harsh than correcting across 5000
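
One rough way to see the difference is to compare the BH critical value for the smallest p-value, which is rank × alpha / m, across the three family sizes mentioned above. The helper name `bh_critical_value` is hypothetical, and the comparison ignores how many hypotheses are rejected at higher ranks.

```python
def bh_critical_value(rank, n_tests, alpha=0.05):
    """BH step-up cutoff for the p-value at a given rank among n_tests."""
    return rank * alpha / n_tests


flat = bh_critical_value(1, 5000)    # all compound x dose tests at once
stage1 = bh_critical_value(1, 1000)  # one test per compound
stage2 = bh_critical_value(1, 5)     # doses within one passing compound

print(flat, stage1, stage2)
```

Under these assumptions the strictest cutoff a dose faces inside a passing compound is 0.01, compared with 0.00001 in the flat correction.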

Test plan

- 8 tests for hierarchical FDR behavior
- All tests pass
- Ruff checks pass

Context for Broadies: See https://github.com/broadinstitute/cpg0037-oasis-broad-U2OS-data/issues/9#issuecomment-3624961526 for a real-world example of its utility

Closes #115

🤖 Generated with Claude Code

shntnu · 2025-12-08T14:23:52Z

Design note: Why min p-value instead of Simes?

The initial implementation 1ede068 used Simes' method for Stage 1 p-value aggregation, following Yekutieli (2008) hierarchical FDR. However, testing on real dose-response data (LINCS, 4 plates, 58 compounds × 6 doses) revealed a problem:

Simes penalizes compounds for having inactive low doses, which is the expected biological behavior in dose-response data.

For example, compound BRD-K72414522-001-06-7:

- 10 µM dose: mAP=0.47, p=0.0029 (clearly active)
- Low doses: p=0.03-0.14 (expected to be inactive)
- Simes p-value: 0.012 → after BH: 0.058 (just misses the 0.05 threshold)

The compound has a strong signal at high dose but Simes dilutes it with the inactive low doses.

Min p-value is more appropriate: a compound passes Stage 1 if ANY dose is active. This matches the biological question: "Does this compound have a phenotype at any tested dose?"

Results on LINCS data:

| Method | Stage 1 compounds | Significant doses |
| --- | --- | --- |
| Flat BH | - | 26 |
| Hierarchical + Simes | 10 | 33 |
| Hierarchical + min-p | 19 | 49 |

Min-p provides an 88% power gain over flat BH (49 vs. 26 significant doses) while correctly handling the dose-response structure.
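
For reference, the percentages follow from the counts in the table above. The snippet below just reproduces the arithmetic, reading "gain" as the relative increase in the number of significant doses over flat BH.

```python
# Significant-dose counts taken from the results table above.
counts = {
    "Flat BH": 26,
    "Hierarchical + Simes": 33,
    "Hierarchical + min-p": 49,
}

baseline = counts["Flat BH"]
gains = {
    method: (n - baseline) / baseline
    for method, n in counts.items()
    if method != "Flat BH"
}

for method, gain in gains.items():
    print(f"{method}: {gain:.0%} more significant doses than flat BH")
```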

When would Simes be appropriate?

Simes would be better when you expect most or all group members to be active (e.g., testing replicates of the same condition). For dose-response, where only high doses are expected to be active, min-p is the right choice.

If there's future demand, we could add a stage1_method parameter to allow users to choose. For now, min-p is hardcoded as the biologically appropriate default for the dose-response use case.

src/copairs/map/map.py

afermg

The logic seems okay to me, but (and I consulted with @alxndrkalinin) we don't want to have an overloaded function with a bunch of arguments: Let's do composition instead.

At a general level, this means:

- Split p-value calculation from statistical correction.
- Isolate hierarchical statistical correction into its own function (ideally in a new file).

I'm a bit torn as to how much to modify the original mean_average_precision function, but I think it's worth isolating the two main steps, before and after the p-value is calculated (this section), to avoid code duplication.

Then we would have composition:

- One small function (e.g., get_map_pvalue) that covers the p-value calculation (the first section of mean_average_precision).
- One function with hierarchical FDR correction.
- Refactor the function mean_average_precision into get_map_pvalue and multipletests, to retain backwards compatibility.
- Potentially another function that wraps get_map_pvalue and either multipletests or hierarchical_fdr, if you want the convenience of map+hierarchical fdr in one.

This minimises repetition (as it is only the call to multipletests), while maximising modularity in case we have to add a different statistical correction in the future. It should be relatively simple, the code would remain modular enough, and we wouldn't start accumulating a bunch of flags and arguments on the main functions.
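
A minimal sketch of what this composition could look like, in plain Python. The function names `get_map_pvalue`, `apply_fdr_correction`, and `apply_hierarchical_fdr` echo names used in this thread, but the bodies are stand-ins. `map_with_correction`, the `_bh` helper, the list-of-dicts input, and the `group_key` argument are assumptions made for illustration, and the p-values are hypothetical.

```python
from functools import partial
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def _bh(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def get_map_pvalue(rows: List[Row]) -> List[Row]:
    """Step 1 (stand-in): the real step would compute mAP and p-values."""
    return [dict(row) for row in rows]


def apply_fdr_correction(rows: List[Row], threshold: float) -> List[Row]:
    """Step 2a: flat BH across all rows."""
    adjusted = _bh([r["p_value"] for r in rows])
    return [dict(r, significant=a < threshold) for r, a in zip(rows, adjusted)]


def apply_hierarchical_fdr(rows: List[Row], threshold: float,
                           group_key: str) -> List[Row]:
    """Step 2b: BH on per-group minimum p-values, then BH within passing groups."""
    groups: Dict[Any, List[int]] = {}
    for i, r in enumerate(rows):
        groups.setdefault(r[group_key], []).append(i)
    names = list(groups)
    stage1 = _bh([min(rows[i]["p_value"] for i in groups[g]) for g in names])
    out = [dict(r, significant=False) for r in rows]
    for g, adj_group in zip(names, stage1):
        if adj_group < threshold:
            idx = groups[g]
            for i, adj in zip(idx, _bh([rows[i]["p_value"] for i in idx])):
                out[i]["significant"] = adj < threshold
    return out


def map_with_correction(rows: List[Row], threshold: float,
                        correct: Callable = apply_fdr_correction) -> List[Row]:
    """Convenience wrapper: p-value step followed by a pluggable correction."""
    return correct(get_map_pvalue(rows), threshold)


# Hypothetical p-values, not data from the PR.
pairs = [("A", 0.001), ("A", 0.20), ("A", 0.04),
         ("B", 0.30), ("B", 0.50), ("C", 0.60), ("C", 0.02)]
rows = [{"compound": c, "p_value": p} for c, p in pairs]

flat = map_with_correction(rows, 0.05)
hier = map_with_correction(
    rows, 0.05, partial(apply_hierarchical_fdr, group_key="compound")
)
print(sum(r["significant"] for r in flat), sum(r["significant"] for r in hier))
```

The wrapper only knows that a correction takes rows and a threshold, so a different correction could be added later without changing the p-value step or the wrapper's signature.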

Tests run on my side so far.

shntnu · 2026-01-08T18:39:47Z

At a general level, this means:

Split p value calculation from statistical correction

Isolate hierarchical statistical correction into its own function (ideally in a new file).

I'm a bit torn as to how much to modify the original mean_average_precision function, but I think it's worth isolating the two main steps, before and after the p-value is calculated (this section), to avoid code duplication.

Then we would have composition:

One small function (e.g., get_map_pvalue) that covers the p-value calculation (the first section of mean_average_precision)

One function with hierarchical FDR correction

refactor the function mean_average_precision into get_map_pvalue and multipletests, to retain backwards compatibility.

Potentially another function that wraps get_map_pvalue and either multipletests or hierarchical_fdr, if you want the convenience of map+hierarchical fdr in one.

This minimises repetition (as it is only the call to multipletests), while maximising modularity in case we have to add a different statistical correction in the future. It should be relatively simple, the code would remain modular enough, and we wouldn't start accumulating a bunch of flags and arguments on the main functions.

Thanks for the excellent suggestions!

I've made all these changes (using Claude) but have not reviewed them carefully myself. I'll tag you again when I'm ready.

shntnu · 2026-01-08T20:38:17Z

Alright ready for you @afermg

afermg

I like the structure much more, but issues regarding the signature of mean_average_precision remain. We may need to bring Alex or John to give their opinion, but mine is that we should not lightly add new parameters unless we know that they are going to be widely used. I suggest instead creating another mean_average_precision function that uses hierarchical correction, or (if it doesn't add too much complexity) a function that dispatches different correction methods for the same mAP table.

afermg · 2026-01-16T16:18:59Z

src/copairs/map/map.py

+    progress_bar: bool = True,
+    max_workers: Optional[int] = None,
+    cache_dir: Optional[Union[str, Path]] = None,
+    hierarchical_by: Optional[List[str]] = None,


My previous suggestion was:

Potentially another function that wraps get_map_pvalue and either multipletests or hierarchical_fdr, if you want the convenience of map+hierarchical fdr in one.

I don't think we should modify the signature of our core mean_average_precision function. Since hierarchical FDR is not standard (at least for now), we don't want to accumulate arguments unless strictly necessary (to avoid ending up with overloaded seaborn-like functions). I suggest instead having something like mean_average_precision_hierarchical. At the cost of duplicated documentation, we have a clear separation of behaviours. If you want one function to rule them all we can just wrap both hierarchical and non-hierarchical and pass all arguments.

This design follows the previous decision of splitting matching between monolabel and multilabel.

I am trying to come up with a better solution, but perhaps using mean_average_precision as the dispatcher of subfunctions is the correct decision. I suggest we consult Alex or John before adding a parameter to a signature though. My personal preference is to keep mean_average_precision as-is, and add mean_average_precision_hierarchical as an alternative function that covers this case (since as far as I know it is rather niche).
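
One way to keep the two public functions aligned is to check their signatures mechanically. The sketch below uses signature-only stand-ins, not the real implementations. The five leading parameter names come from the usage example in the PR description, while the `None` default for `hierarchical_by`, its position at the end, and the helper `required_params` are assumptions.

```python
import inspect


def mean_average_precision(ap_scores, sameby, null_size, threshold, seed,
                           progress_bar=True):
    raise NotImplementedError("signature stand-in only")


def mean_average_precision_hierarchical(ap_scores, sameby, null_size,
                                        threshold, seed, progress_bar=True,
                                        hierarchical_by=None):
    raise NotImplementedError("signature stand-in only")


def required_params(func):
    """Names of parameters without defaults, in declaration order."""
    params = inspect.signature(func).parameters
    return [name for name, p in params.items()
            if p.default is inspect.Parameter.empty]


base = required_params(mean_average_precision)
hier = required_params(mean_average_precision_hierarchical)
print(base == hier)
```

A check like this could live in the test suite so that the separate hierarchical function does not drift from the core one, which is the main cost of splitting instead of adding a parameter.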

shntnu · 2026-01-16T17:57:12Z

I like the structure much more, but issues regarding the signature of mean_average_precision remain. We may need to bring Alex or John to give their opinion, but mine is that we should not lightly add new parameters unless we know that they are going to be widely used. I suggest instead creating another mean_average_precision function that uses hierarchical correction, or (if it doesn't add too much complexity) a function that dispatches different correction methods for the same mAP table.

I've resolved all the minor comments now.

Please feel free to ping / discuss with Alex or John to decide what to do here

Implements two-stage hierarchical FDR (Yekutieli 2008) to reduce over-correction when testing related hypotheses (e.g., multiple doses of the same compound).
- Add `hierarchical_by` parameter to `mean_average_precision()`
- Stage 1: Aggregate p-values by group using Simes' method, apply BH
- Stage 2: For significant groups, apply BH within each group
- Add `simes_pvalue()` function for p-value combination
- Fix `silent_thread_map` bug (handle `leave` kwarg)
- Add comprehensive tests for hierarchical FDR
Closes #115
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Simes method penalizes compounds for having inactive low doses, which is the expected biological behavior in dose-response data. Min p-value is more appropriate: a compound passes Stage 1 if ANY dose shows activity.
- Replace simes_pvalue() aggregation with simple min()
- Remove unused simes_pvalue function and tests
- Update docstrings to reflect the change

The silent_thread_map leave kwarg issue will be properly fixed in the fix/silent-thread-map-leave-kwarg branch. This PR should be rebased after that fix is merged.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Split mean_average_precision into modular components per PR review:
- get_map_pvalue(): compute mAP scores and p-values
- apply_fdr_correction(): standard BH correction
- apply_hierarchical_fdr(): two-stage hierarchical correction
This enables composition without accumulating flags on the main function.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Extract hierarchical FDR into a dedicated function rather than adding a parameter to mean_average_precision, following the existing pattern for monolabel/multilabel functions. This keeps the main API simple and avoids parameter accumulation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

shntnu · 2026-01-16T18:12:37Z

Please feel free to ping / discuss with Alex or John to decide what to do here

Here's what @afermg's version of this would look like: b9934c7 (we can revert it if we decide not to split the function)

afermg · 2026-01-16T18:37:09Z

Feel free to bring up any opinions (or lack thereof). @alxndrkalinin @johnarevalo -- You don't have to read the PR/code review, but let us know if you have a preference between adding a new argument to mean_average_precision and adding a different function, mean_average_precision_hierarchical, to support a different statistical correction. On my side, this PR will be ready once the couple of remaining minor changes are implemented.

This additional function for hierarchical FDR is better IMO. I just added a couple of comments on the commit wrt documentation and one argument that changed position. Once those are fixed it will be good enough for me.

shntnu force-pushed the hierarchical-fdr branch from b06003f to 1ede068 Compare December 8, 2025 02:30

shntnu changed the title ~~feat: add hierarchical FDR correction for grouped hypotheses~~ feat: add hierarchical FDR correction for dose-response data Dec 8, 2025

afermg reviewed Dec 8, 2025

afermg suggested changes Dec 9, 2025

shntnu force-pushed the hierarchical-fdr branch from baff4b1 to eea7f76 Compare January 8, 2026 17:07

shntnu requested a review from afermg January 8, 2026 20:38

afermg suggested changes Jan 16, 2026

shntnu force-pushed the hierarchical-fdr branch from 8cebe77 to 818c198 Compare January 16, 2026 18:11

shntnu and others added 9 commits January 16, 2026 13:11

style: rename apply_hierarchical_fdr to apply_hierarchical_fdr_correction

b32ca7a

Consistent naming pattern with apply_fdr_correction. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

style: format map.py

a3840f3

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

docs: add comment noting stage1 columns could be dropped in future

e025f82

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

refactor: use absolute imports and trim superficial tests

d0818ec

- Change relative import to absolute import in map.py
- Remove tests that only validate DataFrame structure
- Add TODO for test improvements once API is finalized
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

shntnu force-pushed the hierarchical-fdr branch from 818c198 to b9934c7 Compare January 16, 2026 18:11

refactor: move hierarchical_by parameter to end for API consistency

4ff3f2c

Reorder parameters in mean_average_precision_hierarchical so that the first 5 required parameters match mean_average_precision exactly. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
