feat: `AnnData.can_write` based on `AnnData.reduce` + refactored `AnnData.__sizeof__` by ilan-gold · Pull Request #2372 · scverse/anndata

ilan-gold · 2026-03-19T13:21:54Z

maybe TODOs:

fold? reduce?
Implement this inside write_{zarr,h5ad} to prevent partial stores from being written?

We introduce AnnData.fold and AnnData.can_write for "crawling" an AnnData and accumulating, implementing a can_write to tell if an AnnData can be written to a given format.

I'm also open to reduce as the name instead of fold.

For #2236, I think we could implement this more complexly by building on fold and accumulating a set of AdRef (or similar) objects pointing at what can't be written. Open to implement this! I reimplemented __sizeof__ to stress-test using reference-accessors and it went ok, but it's not amazingly smooth (like uns and trying to figure out if you are in a "top-level" elem like obs or varm)

If zarr-developers/zarr-python#3613 ever happens (I hope so) or we do #2238, this will be crucial to making things smoother with hdf5 for preventing incompatibilities.

Towards Simplify backed internal checking #2369 and moving towards something raised in feat: Add HTML representation #2236
Tests added

Release note not necessary because:

codecov · 2026-03-19T13:23:17Z

Codecov Report

❌ Patch coverage is 97.36842% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.59%. Comparing base (fbc696f) to head (6cffc05).
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/anndata/_core/anndata.py	98.30%	1 Missing ⚠️
src/anndata/acc/__init__.py	93.33%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2372      +/-   ##
==========================================
- Coverage   87.45%   85.59%   -1.87%     
==========================================
  Files          49       49              
  Lines        7742     7801      +59     
==========================================
- Hits         6771     6677      -94     
- Misses        971     1124     +153

Files with missing lines	Coverage Δ
src/anndata/types.py	`100.00% <100.00%> (ø)`
src/anndata/_core/anndata.py	`86.36% <98.30%> (+0.95%)`	⬆️
src/anndata/acc/__init__.py	`96.48% <93.33%> (-0.14%)`	⬇️

... and 8 files with indirect coverage changes

src/anndata/_core/anndata.py

ilan-gold · 2026-03-19T14:41:49Z

Just looking for a first-pass vibe check on this!

src/anndata/_core/anndata.py

flying-sheep

Very exciting! I think this mainly needs some serious documenting and examples, not much else.

For examples, maybe first something that just handles one kind of accessor/reference first to show how one can short-circuit by returning the accumulator early.

Regarding naming: the canonical name in the stdlib is reduce, it even used to be in builtins, but maybe the scientific python world deviates for no good reason again. It’s also fair if you have other objections to reduce!

src/anndata/_core/anndata.py

flying-sheep · 2026-03-20T16:02:26Z

src/anndata/_core/anndata.py

+                    (is_ad_ref := isinstance(ref_acc, AdRef))
+                    or (
+                        is_ref_acc := isinstance(
+                            ref_acc, LayerAcc | MultiAcc | GraphAcc
+                        )
+                    )


this is

isinstance(ref_acc, AdRef | RefAcc) and not isinstance(ref_acc, MetaAcc)

because you want to handle each homogeneous array.

I think there are some ways to make this more ergonomic, e.g.:

we can add more marker superclasses that handle this use case, or

we could change the implementation of fold/reduce so it descends into things and returning or raising some marker would prevent descent. that might help handling things as early as possible and therefore having less complicated logic.

katosh · 2026-03-20T20:06:04Z

Hi @ilan-gold, interesting direction! The HTML repr PR (#2236) intersects with fold in a couple of ways, so I wanted to share some observations from the implementation side.

HTML repr (#2236): traversal overlap

The _repr_html_ traverses the same element tree that fold formalizes. The core loop lives in html.py:_render_all_sections and iterates over a SECTION_ORDER tuple:

SECTION_ORDER = ("X", "obs", "var", "uns", "obsm", "varm", "layers", "obsp", "varp", "raw")

for section in SECTION_ORDER:
    parts.append(_render_section(adata, section, context))

This is structurally very similar to fold's inner loop. A few differences worth noting:

raw is missing from fold. The repr includes .raw as a collapsible section (X, var, varm). Would fold eventually cover raw?
Section-level grouping. The repr builds a tree (section headers, then entries, then nested content). Since entries within a section are visited contiguously, the callback could track state in the accumulator to detect section transitions: in DFS-pre mode the parent visit signals "entering," and the next section's parent visit implicitly signals "exiting" the previous one. This is workable, but it would be smoother if fold supported explicit enter/exit events for sections, so consumers don't need to roll their own state tracking for something that fold already knows internally.
Custom/unknown sections. The repr also detects mapping-like attributes not in the standard list (for ecosystem packages like TreeData that add .obst/.vart). fold currently hardcodes the same attribute list. Worth considering whether it should be extensible or whether that's out of scope.
Nested AnnData in uns. The repr handles AnnData objects found inside uns values, with depth tracking to prevent infinite recursion. fold doesn't recurse into nested AnnData within uns, so that use case wouldn't be covered.

On the positive side, the per-column iteration of obs/var via MetaAcc aligns well with how the repr already iterates df.columns to render each column individually.

The repr would naturally use DFS-pre order (render section header first, then its entries), so it's good that option exists. If fold stabilizes, I could likely use it as the backbone of the repr's traversal by accumulating a list[RenderedSection]. The SECTION_ORDER loop would become a fold call. The main prerequisite would be that fold supports distinguishing "entering a section" from "visiting a leaf within it."

Serialization checks: `can_write` overlap

I understand can_write has a different primary goal: preventing partial writes to zarr/h5 stores by checking upfront whether the whole object is writable. That's valuable on its own. But there's a nice overlap: the repr already performs the same registry lookups at element granularity (utils.py:is_serializable, formatters.py:_check_series_serializability) to show warning icons on entries that can't be written. These checks:

Query _REGISTRY.write to test if a type has a registered writer (same approach as can_write's predicate)
Additionally check Series backing arrays, key name validity (slashes, non-string keys), and nested container contents
Return (is_serializable, reason) tuples so the repr can show why something can't be written

If the underlying element-level check were factored into a shared utility, something like can_write_elem(obj, store_type=) -> tuple[bool, str], both use cases could build on it: can_write would fold with all(...), and the repr would call it per-entry to get the reason string for its warning icon. I'd be happy to contribute to making that work if it's a direction you'd consider.

Happy to discuss further, just wanted to flag the touchpoints while the API is still taking shape.

ilan-gold · 2026-03-23T10:25:26Z

raw is missing from fold. The repr includes .raw as a collapsible section (X, var, varm). Would fold eventually cover raw?

Great catch! I think we currently don't have a path accessor for it, so I think that's the same thing as uns

Nested AnnData in uns. The repr handles AnnData objects found inside uns values, with depth tracking to prevent infinite recursion. fold doesn't recurse into nested AnnData within uns, so that use case wouldn't be covered.

Phil brought this up in a call on Friday as well, not sure (again) what is to be done

The main prerequisite would be that fold supports distinguishing "entering a section" from "visiting a leaf within it."

I'm not sure I follow this line of thinking fully - why can't you distinguish between these two now? If you are doing pre order, you know you enter a section when you see the reference to obs or varp etc. No? Why do you need an exit signal as well?

If the underlying element-level check were factored into a shared utility, something like can_write_elem(obj, store_type=) -> tuple[bool, str], both use cases could build on it: can_write would fold with all(...), and the repr would call it per-entry to get the reason string for its warning icon. I'd be happy to contribute to making that work if it's a direction you'd consider.

That could make sense. Once you merge this into your PR we can refactor this if needed + work out more kinks

ilan-gold · 2026-03-23T12:54:29Z

src/anndata/_core/anndata.py

-
-        sizes = {}
-        attrs = ["X", "_obs", "_var"]
-        attrs_multi = ["_uns", "_obsm", "_varm", "varp", "_obsp", "_layers"]


FWIW this never handled raw anyway

ilan-gold · 2026-03-23T15:56:57Z

See the reduce docs for more info but we can now visit + introspect said visits to .raw and .uns. Aside from that, things should be clearer now:

__sizeof__ now actually visits raw and does more than __sizeof__ on uns (which as I understand, doesn't actually visit the values of that dict)
fold -> reduce
Docs added with clear explanation of how the accumulation can be used for introspection
Tests added that explicitly check the traversal in show_stratified of all elements

katosh · 2026-03-23T17:35:29Z

Hi @ilan-gold, thanks for the updates! The rename to reduce, the raw/uns support, the docs, and the traversal tests are all great improvements. Really nice to see __sizeof__ now covering raw properly too.

I was curious how reduce would work as a traversal backbone for the HTML repr, so I ran an experiment: merged ig/fold_can_write into the repr branch and rewrote _render_all_sections to use adata.reduce(DFS-pre). The experiment lives here: settylab/anndata#6 (the PR description has a guided walkthrough, and the code has PAIN POINT comments at each friction point).

Before getting into findings: is reduce intended as a general-purpose traversal tool that consumers like the repr could build on? Or is it primarily an internal utility for __sizeof__ and can_write, where the callback cares about individual values but not about the tree structure? The answer changes how much the points below matter.

What worked well:

The shared IO registry utilities are a clear win. I extracted IORegistry.get_writeable_types() and IORegistry.has_spec() from the can_write logic, and both can_write and the repr's serializability checks now use them instead of each rolling their own _REGISTRY.get_spec() try/except. This simplification is independent of reduce itself.

What I ran into:

Accumulator complexity (the enter/exit question). Without enter/exit signals, the callback must carry a 4-tuple through every call to track which section is "open." The "close section" logic appears in two places, and section entry is detected via isinstance checks on the accessor hierarchy. These checks are fragile: for example, LayerAcc is used for both X (k=None) and individual layer entries (k='counts'), so a naive isinstance match treats each layer as its own section. The initial implementation had exactly this bug. All of this re-derives structure that reduce already knows internally.

Without enter/exit events (what the experiment implements):

def render_html(elem, *, accumulate, ref_acc):
    sections, current_rows, current_ref, current_elem = accumulate

    # Detect section entry by isinstance-checking the accessor hierarchy.
    # Also implicitly closes the *previous* section, since there's no exit signal.
    if isinstance(ref_acc, (AdAcc, MapAcc)):
        if current_rows is not None:
            # Wrap up the previous section (need current_elem for entry count, collapse, etc.)
            sections.append(render_section(current_ref, "\n".join(current_rows), n_items=len(current_elem), ...))
        current_ref = ref_acc
        current_elem = elem
        current_rows = []
    elif isinstance(ref_acc, (AdRef, RefAcc)):
        # Leaf: format and collect
        output = formatter_registry.format_value(elem, context)
        current_rows.append(_render_entry_row(ref_acc.k, output))

    return sections, current_rows, current_ref, current_elem

sections, leftover_rows, last_ref, last_elem = adata.reduce(
    render_html, init=([], None, None, None), order="DFS-pre",
)
# The *last* section never got an "enter-next" to trigger its wrap-up,
# so we must duplicate the render_section() call here:
if leftover_rows is not None:
    sections.append(render_section(last_ref, "\n".join(leftover_rows), n_items=len(last_elem), ...))

With optional on_enter/on_exit hooks (hypothetical):

def on_enter(elem, *, accumulate, ref_acc):
    accumulate.append(SectionBuilder(ref_acc=ref_acc, elem=elem, rows=[]))
    return accumulate

def on_exit(elem, *, accumulate, ref_acc):
    sb = accumulate[-1]
    accumulate[-1] = render_section(sb.ref_acc, "\n".join(sb.rows), n_items=len(sb.elem), ...)
    return accumulate

def visit_leaf(elem, *, accumulate, ref_acc):
    output = formatter_registry.format_value(elem, context)
    accumulate[-1].rows.append(_render_entry_row(ref_acc.k, output))
    return accumulate

sections = adata.reduce(
    visit_leaf, init=[],
    on_enter=on_enter, on_exit=on_exit,
)

The accumulator becomes a plain list, each concern is its own function, and the ReduceFunc protocol stays unchanged. Consumers that don't need hooks simply don't pass them. See the PR diff for the full implementation, specifically _finalize_section and _is_section_visit.

Metadata mismatch. For obs/var, len(elem) returns the number of rows, not columns. The callback needs a per-type branch to get the correct entry count. The original renderer just called len(df.columns).
Opaque uns/raw. reduce doesn't descend into these, so the callback falls back to the same dedicated renderers the old code used (_render_uns_section, _render_raw_section). For the repr, these are the most complex sections (nested AnnData, type hints, depth tracking), so reduce doesn't help where it would help most.
Hardcoded attribute list. Extension packages (TreeData adding .obst/.vart) are invisible to reduce. Custom sections and unknown section detection must be handled entirely outside the reduce call. For can_write this is a correctness concern: it could return True when a custom section contains non-serializable data. Having extensions override reduce wouldn't compose well either (last override wins), so some form of discovery mechanism would be needed.
Robustness regression. This is probably the most important point. The current per-section rendering is deliberately defensive: each section is rendered independently, and if one section fails, the rest still render fine. With reduce as the backbone, a single problematic element can crash the entire traversal. For example, a non-string DataFrame column name (e.g., a tuple key in obs) causes the accessor system to throw, which aborts reduce entirely. The repr goes from showing all other sections correctly with just one error indicator, to showing nothing but "reduce failed."

Visual test comparison (test 23 "Serialization warnings"):
- Original (per-section rendering) — tuple column name handled gracefully, other sections render fine
- With reduce — entire AnnData shows "reduce failed" because the accessor system crashes on the tuple key
Screenshots (test 23)

Original — per-section rendering:

With reduce:

can_write_elem shared utility:

Sounds like a good plan. Both can_write and the repr's per-element warning icons ultimately query _REGISTRY for writers, so extracting a shared has_writer(obj, store_type=) utility should be straightforward once both PRs are closer to merging.

Nested AnnData in uns:

Understood that this is still an open question. For reference, the HTML repr already handles this case with a simple depth counter: when it encounters an AnnData inside uns, it calls generate_repr_html recursively with depth + 1 and stops at max_depth (html.py#L273). If reduce eventually wants to descend into nested AnnData, that pattern could serve as a reference. No rush on this though, perfectly fine to scope it out for now.

Net assessment:

For __sizeof__ and can_write, where the callback accumulates a scalar and doesn't care about tree structure, reduce works well. For the repr, where the callback needs to produce grouped, ordered, nested HTML with section headers, entry counts, truncation, and custom sections, reduce makes the code harder to follow than the original per-section loop. The shared registry utilities are the real win, and they don't depend on reduce.

Happy to walk through the code in more detail. The experiment branch has PAIN POINT comments at each friction point.

feat: AnnData.can_write based on AnnData.fold

b907554

ilan-gold added this to the 0.13.0 milestone Mar 19, 2026

ilan-gold added the skip-gpu-ci label Mar 19, 2026

ilan-gold added 2 commits March 19, 2026 14:42

chore: docs

19daed5

refactor: use accessors

4125375

ilan-gold changed the title ~~feat: AnnData.can_write based on AnnData.fold~~ feat: AnnData.can_write based on AnnData.fold + refactored AnnData.__sizeof__ Mar 19, 2026

ilan-gold commented Mar 19, 2026

View reviewed changes

src/anndata/_core/anndata.py Show resolved Hide resolved

ilan-gold requested a review from flying-sheep March 19, 2026 14:41

ilan-gold commented Mar 19, 2026

View reviewed changes

src/anndata/_core/anndata.py Outdated Show resolved Hide resolved

fix: DFS order + fixes

8be5ba2

ilan-gold commented Mar 19, 2026

View reviewed changes

src/anndata/_core/anndata.py Outdated Show resolved Hide resolved

chore: add test for uns

0f4d1b0

flying-sheep reviewed Mar 20, 2026

View reviewed changes

src/anndata/_core/anndata.py Show resolved Hide resolved

flying-sheep reviewed Mar 20, 2026

View reviewed changes

Merge branch 'main' into ig/fold_can_write

9338baa

ilan-gold commented Mar 23, 2026

View reviewed changes

ilan-gold added 5 commits March 23, 2026 16:07

feat: raw + uns traversal

69daf90

fix: fold -> reduce

932d766

chore: docs

e0f3ee2

Merge branch 'main' into ig/fold_can_write

436fc68

fix: meth not func

ee04741

ilan-gold marked this pull request as ready for review March 23, 2026 15:46

ilan-gold added 4 commits March 23, 2026 16:46

fix: fold not reduce in relnote

6d6f454

fix: nested

1f77a4c

chore: more func clarification

91adffe

fix: link

928b72a

fix: link

19a915d

ilan-gold requested a review from flying-sheep March 23, 2026 15:58

ilan-gold added 2 commits March 23, 2026 16:59

refactor: simpler

c0886fe

fix: relnote number

6cffc05

katosh mentioned this pull request Mar 23, 2026

Html rep reduce settylab/anndata#6

Draft

flying-sheep changed the title ~~feat: AnnData.can_write based on AnnData.fold + refactored AnnData.__sizeof__~~ feat: AnnData.can_write based on AnnData.reduce + refactored AnnData.__sizeof__ Mar 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: `AnnData.can_write` based on `AnnData.reduce` + refactored `AnnData.sizeof`#2372

feat: `AnnData.can_write` based on `AnnData.reduce` + refactored `AnnData.sizeof`#2372
ilan-gold wants to merge 18 commits intomainfrom
ig/fold_can_write

ilan-gold commented Mar 19, 2026 •

edited

Loading

Uh oh!

codecov bot commented Mar 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

ilan-gold commented Mar 19, 2026

Uh oh!

Uh oh!

Uh oh!

flying-sheep left a comment

Uh oh!

Uh oh!

flying-sheep Mar 20, 2026

Uh oh!

katosh commented Mar 20, 2026 •

edited

Loading

Uh oh!

ilan-gold commented Mar 23, 2026 •

edited

Loading

Uh oh!

ilan-gold Mar 23, 2026

Uh oh!

ilan-gold commented Mar 23, 2026 •

edited

Loading

Uh oh!

katosh commented Mar 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ilan-gold commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

ilan-gold commented Mar 19, 2026

Uh oh!

Uh oh!

Uh oh!

flying-sheep left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

flying-sheep Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

katosh commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

HTML repr (#2236): traversal overlap

Serialization checks: can_write overlap

Uh oh!

ilan-gold commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ilan-gold Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

ilan-gold commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

katosh commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ilan-gold commented Mar 19, 2026 •

edited

Loading

codecov bot commented Mar 19, 2026 •

edited

Loading

katosh commented Mar 20, 2026 •

edited

Loading

Serialization checks: `can_write` overlap

ilan-gold commented Mar 23, 2026 •

edited

Loading

ilan-gold commented Mar 23, 2026 •

edited

Loading

katosh commented Mar 23, 2026 •

edited

Loading