[codex] Upgrade ShardedZarrStore for pyramid zarrs#86
Conversation
WalkthroughRefactors ChangesShardedZarrStore v2 and HAMT Converter
Sequence Diagram(s)sequenceDiagram
participant Caller
participant convert_hamt_to_sharded
participant HAMTStore
participant ShardedZarrStore
Caller->>convert_hamt_to_sharded: cas, hamt_root_cid, chunks_per_shard
convert_hamt_to_sharded->>HAMTStore: open(hamt_root_cid)
convert_hamt_to_sharded->>ShardedZarrStore: open(manifest_version=v2)
loop metadata pass
convert_hamt_to_sharded->>HAMTStore: filtered Zarr metadata keys
convert_hamt_to_sharded->>ShardedZarrStore: set_pointer(key, cid)
end
loop chunk pass
convert_hamt_to_sharded->>HAMTStore: normalized chunk keys
convert_hamt_to_sharded->>ShardedZarrStore: set_pointer(key, cid)
end
convert_hamt_to_sharded->>ShardedZarrStore: flush()
convert_hamt_to_sharded-->>Caller: new root CID
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 12
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@py_hamt/hamt_to_sharded_converter.py`:
- Around line 12-15: Handle classic Zarr v2 chunk keys in _is_zarr_chunk_key and
the conversion flow that reaches set_pointer(), since only c/... keys are
currently recognized. Update the logic in hamt_to_sharded_converter.py to either
reject .zarray-based inputs early or translate v2 chunk keys like 0.0 and
var/0.0 into the sharded chunk-key form before pointer assignment, so they are
not treated as root metadata.
In `@py_hamt/sharded_zarr_store.py`:
- Around line 873-881: Validate the decoded shard length before storing it in
the shard cache: in sharded_zarr_store.py, after dag_cbor.decode in the shard
decode path, add a length check on decoded_shard against
array_index.chunks_per_shard before building shard_data and calling
_shard_data_cache.put. Keep the existing CID/type validation in place, but
reject malformed shards with the wrong number of entries early so later indexed
reads do not fail.
- Around line 150-157: Resizing in sharded_zarr_store.py currently only
truncates shard CID slots in the resize flow, which leaves stale chunk data in
the retained tail shard when the array shrinks. Update the resize logic around
the shard CID handling to also clear any chunk entries whose offsets are beyond
total_chunks in the final retained shard, and make sure any cached shard/chunk
entries are evicted or refreshed so old data cannot reappear after a later grow.
Apply the same cleanup in the related resize paths tied to the shard CID updates
and cache maintenance.
- Around line 1158-1196: The flush path in ShardedZarrStore currently snapshots
dirty shard data and calls mark_clean() without holding the corresponding
_shard_locks, so a concurrent set() or delete() can mutate the shard while
cas.save() is in progress and then get incorrectly cleared. Update the flush
logic around the dirty_shards loop to acquire the shard lock for each cache_key
before reading from _shard_data_cache, saving via cas.save, and marking the
shard clean, and keep the lock held until the cache entry is safely updated; use
the existing _shard_locks, _shard_data_cache, and mark_clean flow as the place
to apply the fix.
- Around line 1315-1316: The public set_pointer mutator currently bypasses the
same read-only write guard used by set(), delete(), resize_store(), and
graft_store(). Update set_pointer in ShardedZarrStore to enforce the read-only
check before calling _set_pointer, using the same guard pattern already applied
in the other mutating methods so pointer updates are blocked when the store is
read-only.
- Around line 1058-1062: The timeout handling in sharded_zarr_store.py should
explicitly chain the caught asyncio.TimeoutError when re-raising the
RuntimeError in this load path. Update the except asyncio.TimeoutError block in
the shard loading logic to preserve the original exception context while keeping
the existing pending-load cleanup for _pending_shard_loads and the timeout
message that references shard_idx.
- Around line 683-688: Reject chunk-shape updates for an existing array index in
sharded_zarr_store.py: when the existing index is reused in the branch that
updates existing.array_shape and existing.chunk_shape, validate that the
incoming chunk_shape matches the stored one and fail instead of overwriting it.
Keep the fix localized to the existing array index handling path in the array
index creation/update logic so coordinate-to-slot mapping cannot change for
already-written shards.
- Around line 297-305: The _V1_COORDINATE_ARRAY_PREFIXES class attribute in
ShardedZarrStore is currently mutable and triggers Ruff; update the symbol to be
an immutable ClassVar backed by a frozenset so it cannot be modified as shared
class state. Keep the same coordinate prefixes, but change the declaration and
any related type annotation in the class definition to use ClassVar and
frozenset to satisfy linting.
- Around line 1324-1335: Invalidate any cached entry in _metadata_read_cache
when set_pointer() updates a metadata pointer so later get() calls do not return
stale bytes. Update the metadata-pointer branch in
ShardedZarrStore.set_pointer() to remove or refresh the cache entry for that key
before returning, and make sure the fix covers keys handled through
_array_path_from_metadata_key() and the existing
_register_array_metadata_from_bytes() flow.
In `@tests/test_sharded_zarr_store_v2.py`:
- Around line 69-906: The v2 path-aware chunk/index behavior in ShardedZarrStore
is only covered by example-based tests, so add at least one Hypothesis invariant
test in test_v2_sharded_zarr_store_v2.py. Use the existing helpers and symbols
such as ShardedZarrStore.open, list_dir, list_prefix, get, exists, and
ArrayIndex to generate varied array/group paths and chunk layouts, then assert
the v2 invariants still hold across those combinations. Keep the test focused on
the new v2 routing/metadata behavior rather than duplicating the current fixed
cases.
- Around line 41-47: The helper method signature in load uses the built-in name
id, which will trigger Ruff A002 under the tests lint rules. Rename that
parameter to a non-builtin identifier in the load method of the test helper
class, and update any internal references to match so the test file continues to
pass ruff check.
In `@tests/test_sharded_zarr_store.py`:
- Around line 785-786: The assertion in the directory listing check is too weak
because it only verifies zarr.json, which would miss regressions where chunk
entries are no longer returned. Update the check around prefixed_dir_keys in the
test to also assert that the expected chunk directory entry c is present, so
list_dir(prefix) is validated for both metadata and chunk contents.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: e52f2759-26cf-43ea-9d50-adb7b40e41b0
📒 Files selected for processing (5)
py_hamt/hamt_to_sharded_converter.pypy_hamt/sharded_zarr_store.pytests/test_sharded_store_grafting.pytests/test_sharded_zarr_store.pytests/test_sharded_zarr_store_v2.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_sharded_zarr_store_v2.py`:
- Line 310: The test helpers using monkeypatch are missing an explicit type
annotation, which violates the fully typed test rule. Update the affected test
functions in test_sharded_zarr_store_v2, including
test_v2_resize_blocks_interleaved_chunk_mutators and the other monkeypatch-based
tests in that file, so the monkeypatch fixture parameter is annotated as
pytest.MonkeyPatch.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f05c5aaa-d3dd-4afd-b77d-30350f61861f
📒 Files selected for processing (3)
py_hamt/hamt_to_sharded_converter.pypy_hamt/sharded_zarr_store.pytests/test_sharded_zarr_store_v2.py
🚧 Files skipped from review as they are similar to previous changes (2)
- py_hamt/hamt_to_sharded_converter.py
- py_hamt/sharded_zarr_store.py
|
|
||
|
|
||
| @pytest.mark.asyncio | ||
| async def test_v2_resize_blocks_interleaved_chunk_mutators(monkeypatch) -> None: |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
from pathlib import Path
path = Path("tests/test_sharded_zarr_store_v2.py")
lines = path.read_text().splitlines()
for start in (300, 390, 445):
end = start + 20
print(f"\n--- lines {start}-{end} ---")
for i in range(start, min(end, len(lines)) + 1):
print(f"{i:4d}: {lines[i-1]}")
PYRepository: dClimate/py-hamt
Length of output: 2357
Annotate monkeypatch as pytest.MonkeyPatch. tests/test_sharded_zarr_store_v2.py:310, 400, 455 leave the fixture parameter untyped, which breaks the repo’s fully typed test rule.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_sharded_zarr_store_v2.py` at line 310, The test helpers using
monkeypatch are missing an explicit type annotation, which violates the fully
typed test rule. Update the affected test functions in
test_sharded_zarr_store_v2, including
test_v2_resize_blocks_interleaved_chunk_mutators and the other monkeypatch-based
tests in that file, so the monkeypatch fixture parameter is annotated as
pytest.MonkeyPatch.
Source: Coding guidelines
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #86 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 8 8
Lines 1617 2270 +653
==========================================
+ Hits 1617 2270 +653 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Summary
Implements FPAR pyramid Ticket 1 by upgrading
ShardedZarrStoreto support path-awaresharded_zarr_v2manifests with one shard index per Zarr array path.Changes
ArrayIndex/ChunkKeyhandling and path-aware chunk parsing, get/set/exists/delete/list/resize/graft behavior.c/...metadata chunks and top-levelcgroup paths.migrate_v1_to_v2(primary_array_path)support that reuses existing v1 shard blocks and rewrites metadata under the requested primary path.c-named paths, chunk-prefix listing, grafting, and malformed shard loads.Validation
uv run pytest tests/test_sharded_zarr_store_v2.py tests/test_converter.py -q: 23 passeduv run pre-commit run --all-files --show-diff-on-failure: passedbash run-checks.sh: 157 passed, 2 skipped, 100% coverage, hooks passedNotes
FPAR_PYRAMID_TICKETS.mdwas intentionally not included in this branch.Summary by CodeRabbit
New Features
Bug Fixes
Tests