Skip to content

[codex] Upgrade ShardedZarrStore for pyramid zarrs#85

Closed
Faolain wants to merge 2 commits into
mainfrom
codex/upgrade-sharded-zarr-pyramids
Closed

[codex] Upgrade ShardedZarrStore for pyramid zarrs#85
Faolain wants to merge 2 commits into
mainfrom
codex/upgrade-sharded-zarr-pyramids

Conversation

@Faolain

@Faolain Faolain commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements FPAR pyramid Ticket 1 by upgrading ShardedZarrStore to support path-aware sharded_zarr_v2 manifests with one shard index per Zarr array path.

Changes

  • Added v2 per-array ArrayIndex/ChunkKey handling and path-aware chunk parsing, get/set/exists/delete/list/resize/graft behavior.
  • Preserved v1 compatibility, including legacy root c/... metadata chunks and top-level c group paths.
  • Added optional migrate_v1_to_v2(primary_array_path) support that reuses existing v1 shard blocks and rewrites metadata under the requested primary path.
  • Updated HAMT-to-sharded conversion to create v2 stores and copy metadata before chunk pointers.
  • Added v2 regression coverage for grouped pyramid arrays, coordinate arrays, migration, c-named paths, chunk-prefix listing, grafting, and malformed shard loads.

Validation

  • bash run-checks.sh
    • 157 passed, 2 skipped
    • 100% coverage
    • pre-commit hooks passed
  • Independent reviewing agent completed final pass with no blocking issues remaining.

Notes

FPAR_PYRAMID_TICKETS.md was intentionally not included in this commit.

Summary by CodeRabbit

  • New Features

    • Added support for a new sharded storage format that better handles multiple arrays and grouped data.
    • Enabled converting existing stores into the new format while keeping data readable.
    • Improved directory listing and path handling for chunked data.
  • Bug Fixes

    • Fixed metadata and chunk lookups for grouped arrays and nested paths.
    • Preserved shard data during format migration and improved reopen behavior after resizing.
    • Tightened validation for malformed stores and invalid chunk locations.
  • Tests

    • Expanded test coverage for migration, resizing, listing, and error handling across both store formats.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Adds a sharded_zarr_v2 manifest format to ShardedZarrStore with per-array ArrayIndex geometry, replacing the single global shard index. Updates all core store operations (get, set, delete, flush, list, resize) for v2 routing, adds resize_variable and migrate_v1_to_v2, and rewrites the HAMT-to-sharded converter to use a two-pass key migration without xarray introspection.

Changes

ShardedZarrStore v2 manifest and HAMT converter

Layer / File(s) Summary
ArrayIndex dataclass and v2 module constants
py_hamt/sharded_zarr_store.py
Introduces ArrayIndex with geometry validation, derived chunk/shard counts, and shard CID list handling; adds ShardCacheKey type alias and manifest version constants; refactors MemoryBoundedLRUCache to accept ShardCacheKey (int or tuple) instead of int.
ShardedZarrStore init, open, and root loading
py_hamt/sharded_zarr_store.py
Reworks __init__ to track manifest_version, array_indices dict, primary_array_path, and default chunks_per_shard; extends open() with manifest_version and primary_array_path parameters; adds _initialize_new_root_v2, _load_v2_root, per-array registration helpers, and v2 parent-group metadata initialization; aligns v1 root init fields.
Shard cache load, pending-load sync, and with_read_only clone
py_hamt/sharded_zarr_store.py
Updates _load_or_initialize_shard_cache to use per-array cache keys, await pending-load events keyed by the full cache key, bounds-check against the resolved ArrayIndex, and initialize empty shard lists when no shard CID exists; updates with_read_only to carry v2 fields into the clone.
get, set, delete, exists, flush
py_hamt/sharded_zarr_store.py
Updates get to branch on v2 chunk-key parsing and resolve per-array shard locations with partial-read support; updates set/set_pointer/exists to ensure v2 group metadata, register array metadata from JSON blobs, route writes to per-array shard slots, and mark dirty; rewrites flush to persist dirty shards and synchronize v2 ArrayIndex entries back into the root manifest.
Listing, delete, resize, resize_variable, migrate_v1_to_v2, list_dir
py_hamt/sharded_zarr_store.py
Updates delete to clear per-array shard slots; updates list/list_prefix to include v2 chunk keys; adds v2-specific list_dir with chunk-key prefix handling and deduplication; updates graft_store/resize_store for per-array indexing; adds resize_variable to refresh Zarr metadata and re-register ArrayIndex; adds migrate_v1_to_v2 to reuse existing v1 shard CIDs under a v2 root.
HAMT-to-sharded converter two-pass migration
py_hamt/hamt_to_sharded_converter.py
Removes xarray/ZarrHAMTStore introspection; adds _is_zarr_chunk_key helper; changes cas parameter type to ContentAddressedStore; initializes ShardedZarrStore v2 with only chunks_per_shard and manifest_version; performs two-pass migration copying non-chunk keys before chunk keys.
New v2 test suite
tests/test_sharded_zarr_store_v2.py
Adds LocalCIDCAS in-memory backend and _pyramid_level helper; covers grouped pyramid arrays, array-local resize, v1-to-v2 migration (shard reuse, metadata rewrite, coordinate chunks, c/-named groups), list_dir traversal, defensive validation, graft rejection, converter chunk-key classification, and ArrayIndex/ShardedZarrStore error paths.
Existing test updates
tests/test_sharded_store_grafting.py, tests/test_sharded_zarr_store.py
Broadens the out-of-bounds graft_store error match pattern; updates list_dir assertion to expect returned entries instead of NotImplementedError.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant ShardedZarrStore
  participant MemoryBoundedLRUCache
  participant CAS

  Caller->>ShardedZarrStore: set(key, bytes)
  ShardedZarrStore->>ShardedZarrStore: parse chunk key → array_path + coords
  ShardedZarrStore->>ShardedZarrStore: resolve ArrayIndex for array_path
  ShardedZarrStore->>CAS: store chunk bytes → CID
  ShardedZarrStore->>MemoryBoundedLRUCache: mark_dirty((array_path, shard_idx))

  Caller->>ShardedZarrStore: flush()
  ShardedZarrStore->>MemoryBoundedLRUCache: iterate dirty (array_path, shard_idx) entries
  ShardedZarrStore->>CAS: store shard DAG-CBOR block → shard CID
  ShardedZarrStore->>ShardedZarrStore: update ArrayIndex.shard_cids[shard_idx]
  ShardedZarrStore->>CAS: store v2 root manifest → new root CID
Loading
sequenceDiagram
  participant Caller
  participant convert_hamt_to_sharded
  participant HAMTStore
  participant ShardedZarrStore

  Caller->>convert_hamt_to_sharded: cas, hamt_root_cid, chunks_per_shard
  convert_hamt_to_sharded->>ShardedZarrStore: open(manifest_version=v2, chunks_per_shard)
  convert_hamt_to_sharded->>HAMTStore: iterate all keys
  loop non-chunk keys first
    convert_hamt_to_sharded->>ShardedZarrStore: set_pointer(key, cid)
  end
  loop chunk keys second
    convert_hamt_to_sharded->>ShardedZarrStore: set_pointer(key, cid)
  end
  convert_hamt_to_sharded->>ShardedZarrStore: flush() → new root CID
  convert_hamt_to_sharded->>Caller: root CID string
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • dClimate/py-hamt#44: Refactored the CAS abstraction to ContentAddressedStore, which directly precedes this PR's change of convert_hamt_to_sharded's cas parameter type.
  • dClimate/py-hamt#68: Introduced the initial sharded Zarr store (sharded_zarr_v1) and HAMT converter that this PR substantially extends with v2 manifest support.

Suggested reviewers

  • 0xSwego
  • TheGreatAlgo

🐇 Two passes through the HAMT we hop,
First metadata, then chunks on top!
Each array gets its own index slot,
Shard by shard, CID by CID — what a lot!
v2 is here, the migration's done,
This bunny's sharding work is never quite done. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: upgrading ShardedZarrStore to support pyramid zarrs.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/upgrade-sharded-zarr-pyramids

Comment @coderabbitai help to get the list of available commands.

@Faolain Faolain marked this pull request as ready for review June 29, 2026 16:33

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac1d1d2942

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +684 to +687
existing.array_shape = tuple(array_shape)
existing.chunk_shape = tuple(chunk_shape)
existing._validate_geometry()
existing.resize(tuple(array_shape))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reset stale shards when array geometry changes

When a v2 array's metadata is replaced for the same path with a different chunk grid or rank, this branch updates the geometry but keeps/resizes the previous shard_cids. Any old chunk slots that remain in range are then interpreted under the new grid, so an overwritten or rechunked array can return stale chunks before they are rewritten, or for fill chunks that Zarr omits. Reset the shard index when the chunk geometry changes instead of reusing the old pointers.

Useful? React with 👍 / 👎.

Comment on lines +777 to +781
for group_path in group_paths:
metadata_key = (
"zarr.json" if group_path == "" else f"{group_path}/zarr.json"
)
metadata[metadata_key] = await self.cas.save(group_metadata, codec="raw")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve existing group metadata during migration

During v1-to-v2 migration this helper runs after migrated_metadata has been built, but it unconditionally assigns the root and parent zarr.json entries. If the v1 store already had group metadata at those keys, such as dataset attributes or consolidated metadata in zarr.json or 0/zarr.json, it is replaced with an empty group document and the migrated dataset silently loses metadata. Only add these placeholders when the key is absent.

Useful? React with 👍 / 👎.

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (e932534) to head (ac1d1d2).

Additional details and impacted files
@@            Coverage Diff             @@
##              main       #85    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            8         8            
  Lines         1617      2127   +510     
==========================================
+ Hits          1617      2127   +510     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dClimate dClimate deleted a comment from Faolain Jun 29, 2026
@0xSwego 0xSwego closed this Jun 29, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (1)
tests/test_sharded_zarr_store_v2.py (1)

679-739: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider a small Hypothesis geometry test for ArrayIndex. The rank/chunk/shard-count invariants here are pure and fit a property-based test better than a handful of fixed examples.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_sharded_zarr_store_v2.py` around lines 679 - 739, Add a small
Hypothesis-based geometry test for ArrayIndex instead of relying only on the
fixed validation examples in test_array_index_validation_paths. Use ArrayIndex
and its constructors/new/resize behavior to generate valid and invalid shapes,
chunk shapes, and chunks_per_shard values, then assert the rank, chunk, and
shard-count invariants hold across many cases. Keep the existing targeted
error-message checks, but factor the pure geometry coverage into a
property-based test so the ArrayIndex constraints are exercised more broadly.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@py_hamt/hamt_to_sharded_converter.py`:
- Around line 18-19: The convert_hamt_to_sharded entrypoint currently advertises
a generic ContentAddressedStore even though the implementation only works with
CID-backed stores. Update convert_hamt_to_sharded to either narrow the cas
annotation/docs to the CID-backed store type or add an upfront validation that
rejects non-CID immutable IDs before any sharded migration work begins. Make
sure the pointer encoding/decoding path in convert_hamt_to_sharded and the
ShardedZarrStore handoff stays consistent with CID assumptions.

In `@py_hamt/sharded_zarr_store.py`:
- Around line 1593-1597: The resize path in sharded_zarr_store.py leaves stale
cached and persisted shard slots after `array_index.resize()`, so shrinking can
still expose CIDs beyond the new `total_chunks`. Update the resize handling
around `array_index.resize()`, `_sync_arrays_to_root()`, and the dirty shard
cache to clear any out-of-shape shard entries and last-shard slots before
marking `_dirty_root`; also make sure `flush()` and any shard-index accessors in
this class tolerate the truncated `shard_cids` state without indexing past the
new bounds or re-persisting old chunks after re-expansion.
- Around line 1352-1355: Remove the legacy metadata fallback path for migrated
chunks in the shard write/delete flow so stale metadata cannot reappear after a
v2 shard mutation. Update the shard update and deletion logic around the shard
list mutation code and the `get()`/`exists()` lookup path in
`sharded_zarr_store.py` so once a chunk is promoted into a shard, reads no
longer consult the old metadata entry after the shard entry is written or
removed. Use the existing shard mutation helpers and cache invalidation behavior
to ensure the shard state is the single source of truth.
- Around line 777-782: The _add_missing_group_metadata() migration path is
overwriting existing group metadata by always saving group_metadata into each
zarr.json entry. Update the logic in _add_missing_group_metadata() to first
check whether a parent zarr.json already exists and only create missing group
metadata, preserving any existing attributes or consolidated metadata during
v1-to-v2 migration. Use the existing metadata key generation in this method to
locate the write path and avoid replacing already-present group records.
- Around line 1058-1062: The timeout handling in the shard load path drops the
original exception context, so update the `except asyncio.TimeoutError` block in
`sharded_zarr_store.py` to chain the new `RuntimeError` from the caught timeout
exception. Keep the cleanup of `_pending_shard_loads` as-is, but raise the error
using the original `asyncio.TimeoutError` variable so the `ShardedZarrStore`
load failure preserves context and satisfies the B904 rule.
- Line 10: The shared coordinate prefix collection in ShardedZarrStore is
currently a mutable class attribute and should be made immutable to satisfy the
Ruff RUF012 warning. Update the `_V1_COORDINATE_ARRAY_PREFIXES` declaration in
`ShardedZarrStore` to use `ClassVar[frozenset[str]]` (or otherwise move the
initialization into `__init__`) so the prefix set cannot be mutated across
instances.
- Around line 1158-1196: The flush path in ShardedZarrStore is reading, saving,
and marking shard data clean without holding the per-shard lock, which can race
with concurrent set()/delete() updates. Update the flush logic around the dirty
shard loop to acquire and hold the corresponding _shard_locks entry for each
cache_key while fetching from _shard_data_cache, encoding with dag_cbor, saving
via cas.save, updating shard_cids, and calling mark_clean. Use the existing
_shard_locks, _shard_data_cache, and _manifest_version branches in the flush
method so a shard cannot be modified after it is persisted but before it is
marked clean.

In `@tests/test_sharded_zarr_store_v2.py`:
- Around line 41-48: The helper method load currently uses the parameter name
id, which shadows Python’s builtin and triggers Ruff A002. Rename that parameter
in load to a more descriptive name (for example, key or item_id) and update the
lookup using self._key(...) so the test remains Ruff-clean without changing
behavior.

---

Nitpick comments:
In `@tests/test_sharded_zarr_store_v2.py`:
- Around line 679-739: Add a small Hypothesis-based geometry test for ArrayIndex
instead of relying only on the fixed validation examples in
test_array_index_validation_paths. Use ArrayIndex and its
constructors/new/resize behavior to generate valid and invalid shapes, chunk
shapes, and chunks_per_shard values, then assert the rank, chunk, and
shard-count invariants hold across many cases. Keep the existing targeted
error-message checks, but factor the pure geometry coverage into a
property-based test so the ArrayIndex constraints are exercised more broadly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 33a6d4af-7a0d-41eb-a4ac-69d14b6a9bcd

📥 Commits

Reviewing files that changed from the base of the PR and between e932534 and ac1d1d2.

📒 Files selected for processing (5)
  • py_hamt/hamt_to_sharded_converter.py
  • py_hamt/sharded_zarr_store.py
  • tests/test_sharded_store_grafting.py
  • tests/test_sharded_zarr_store.py
  • tests/test_sharded_zarr_store_v2.py

Comment on lines 18 to +19
async def convert_hamt_to_sharded(
cas: KuboCAS, hamt_root_cid: str, chunks_per_shard: int
cas: ContentAddressedStore, hamt_root_cid: str, chunks_per_shard: int

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Narrow the CAS contract back to CID-backed implementations.

ContentAddressedStore allows non-CID immutable IDs, but this code immediately enforces CID and base32-encodes pointers before handing them to ShardedZarrStore, which also decodes them as CIDs. That means a valid non-CID ContentAddressedStore now type-checks at the boundary and then fails on the first migrated key. Either narrow the annotation/docstring again or reject non-CID-backed stores up front.

Also applies to: 61-76

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@py_hamt/hamt_to_sharded_converter.py` around lines 18 - 19, The
convert_hamt_to_sharded entrypoint currently advertises a generic
ContentAddressedStore even though the implementation only works with CID-backed
stores. Update convert_hamt_to_sharded to either narrow the cas annotation/docs
to the CID-backed store type or add an upfront validation that rejects non-CID
immutable IDs before any sharded migration work begins. Make sure the pointer
encoding/decoding path in convert_hamt_to_sharded and the ShardedZarrStore
handoff stays consistent with CID assumptions.

from collections.abc import AsyncIterator, Iterable
from typing import DefaultDict, Dict, List, Optional, Set, Tuple
from dataclasses import dataclass
from typing import DefaultDict, Dict, List, Optional, Set, Tuple, cast

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
uv run ruff check py_hamt/sharded_zarr_store.py --select RUF012

Repository: dClimate/py-hamt

Length of output: 194


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf 'ruff: '
(command -v ruff && ruff --version) || true
printf 'python: '
python3 --version

echo
echo '--- file outline ---'
ast-grep outline py_hamt/sharded_zarr_store.py --view expanded | sed -n '1,220p'

echo
echo '--- surrounding lines 1-80 ---'
sed -n '1,80p' py_hamt/sharded_zarr_store.py

echo
echo '--- surrounding lines 280-320 ---'
sed -n '280,320p' py_hamt/sharded_zarr_store.py

Repository: dClimate/py-hamt

Length of output: 9152


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
from pathlib import Path
p = Path("py_hamt/sharded_zarr_store.py")
text = p.read_text()
for needle in ["_V1_COORDINATE_ARRAY_PREFIXES", "ClassVar", "frozenset"]:
    print(f"== {needle} ==")
    for i, line in enumerate(text.splitlines(), 1):
        if needle in line:
            print(f"{i}: {line}")
PY

Repository: dClimate/py-hamt

Length of output: 341


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
import shutil, subprocess, sys

ruff = shutil.which("ruff")
if not ruff:
    print("ruff-not-found")
    sys.exit(0)

cmd = [ruff, "check", "py_hamt/sharded_zarr_store.py", "--select", "RUF012"]
print("running:", " ".join(cmd))
proc = subprocess.run(cmd, text=True, capture_output=True)
print("returncode:", proc.returncode)
print(proc.stdout)
print(proc.stderr)
PY

Repository: dClimate/py-hamt

Length of output: 1241


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
from pathlib import Path
text = Path("py_hamt/sharded_zarr_store.py").read_text().splitlines()
for start in (1, 90, 290, 1050, 1060):
    end = min(start + 25, len(text))
    print(f"\n--- lines {start}-{end} ---")
    for i in range(start, end + 1):
        if i <= len(text):
            print(f"{i:4}: {text[i-1]}")
PY

Repository: dClimate/py-hamt

Length of output: 5880


Mark _V1_COORDINATE_ARRAY_PREFIXES as immutable
py_hamt/sharded_zarr_store.py:297-305 defines a mutable class attribute that Ruff flags (RUF012). Annotate it as ClassVar[frozenset[str]] (or move it to __init__) so the shared prefix set can’t be mutated accidentally.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@py_hamt/sharded_zarr_store.py` at line 10, The shared coordinate prefix
collection in ShardedZarrStore is currently a mutable class attribute and should
be made immutable to satisfy the Ruff RUF012 warning. Update the
`_V1_COORDINATE_ARRAY_PREFIXES` declaration in `ShardedZarrStore` to use
`ClassVar[frozenset[str]]` (or otherwise move the initialization into
`__init__`) so the prefix set cannot be mutated across instances.

Sources: Coding guidelines, Linters/SAST tools

Comment on lines +777 to +782
for group_path in group_paths:
metadata_key = (
"zarr.json" if group_path == "" else f"{group_path}/zarr.json"
)
metadata[metadata_key] = await self.cas.save(group_metadata, codec="raw")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Preserve existing group metadata during v1-to-v2 migration.

_add_missing_group_metadata() currently overwrites any existing parent zarr.json, which can drop group attributes or consolidated metadata during migration.

Proposed fix
         for group_path in group_paths:
             metadata_key = (
                 "zarr.json" if group_path == "" else f"{group_path}/zarr.json"
             )
+            if metadata_key in metadata:
+                continue
             metadata[metadata_key] = await self.cas.save(group_metadata, codec="raw")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for group_path in group_paths:
metadata_key = (
"zarr.json" if group_path == "" else f"{group_path}/zarr.json"
)
metadata[metadata_key] = await self.cas.save(group_metadata, codec="raw")
for group_path in group_paths:
metadata_key = (
"zarr.json" if group_path == "" else f"{group_path}/zarr.json"
)
if metadata_key in metadata:
continue
metadata[metadata_key] = await self.cas.save(group_metadata, codec="raw")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@py_hamt/sharded_zarr_store.py` around lines 777 - 782, The
_add_missing_group_metadata() migration path is overwriting existing group
metadata by always saving group_metadata into each zarr.json entry. Update the
logic in _add_missing_group_metadata() to first check whether a parent zarr.json
already exists and only create missing group metadata, preserving any existing
attributes or consolidated metadata during v1-to-v2 migration. Use the existing
metadata key generation in this method to locate the write path and avoid
replacing already-present group records.

Comment on lines 1058 to 1062
except asyncio.TimeoutError:
# Clean up the pending load to allow retry
if shard_idx in self._pending_shard_loads:
self._pending_shard_loads[shard_idx].set()
del self._pending_shard_loads[shard_idx]
if cache_key in self._pending_shard_loads:
self._pending_shard_loads[cache_key].set()
del self._pending_shard_loads[cache_key]
raise RuntimeError(f"Timeout waiting for shard {shard_idx} to load.")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
uv run ruff check py_hamt/sharded_zarr_store.py --select B904

Repository: dClimate/py-hamt

Length of output: 194


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant region and surrounding context.
sed -n '1035,1075p' py_hamt/sharded_zarr_store.py

# Find whether the same timeout pattern appears elsewhere.
rg -n "except asyncio\.TimeoutError|raise RuntimeError\\(f\"Timeout waiting for shard" py_hamt/sharded_zarr_store.py

Repository: dClimate/py-hamt

Length of output: 2047


Chain the timeout exception. Raise the RuntimeError with from err here so the original asyncio.TimeoutError stays attached and the B904 lint rule is satisfied.

🧰 Tools
🪛 Ruff (0.15.20)

[warning] 1062-1062: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@py_hamt/sharded_zarr_store.py` around lines 1058 - 1062, The timeout handling
in the shard load path drops the original exception context, so update the
`except asyncio.TimeoutError` block in `sharded_zarr_store.py` to chain the new
`RuntimeError` from the caught timeout exception. Keep the cleanup of
`_pending_shard_loads` as-is, but raise the error using the original
`asyncio.TimeoutError` variable so the `ShardedZarrStore` load failure preserves
context and satisfies the B904 rule.

Sources: Coding guidelines, Linters/SAST tools

Comment on lines +1158 to +1196
for cache_key in sorted(dirty_shards, key=str):
shard_data_list = await self._shard_data_cache.get(cache_key)
if shard_data_list is None:
raise RuntimeError(f"Dirty shard {shard_idx} not found in cache")
raise RuntimeError(f"Dirty shard {cache_key} not found in cache")

# Encode this list into a DAG-CBOR byte representation
shard_data_bytes = dag_cbor.encode(shard_data_list)

# Save the DAG-CBOR block and get its CID
shard_data_bytes = dag_cbor.encode(cast(IPLDKind, shard_data_list))
new_shard_cid_obj = await self.cas.save(
shard_data_bytes,
codec="dag-cbor", # Use 'dag-cbor' codec
codec="dag-cbor",
)

if (
self._root_obj["chunks"]["shard_cids"][shard_idx]
!= new_shard_cid_obj
):
# Store the CID object directly
self._root_obj["chunks"]["shard_cids"][shard_idx] = (
new_shard_cid_obj
if not isinstance(new_shard_cid_obj, CID): # pragma: no cover
raise TypeError(
"ShardedZarrStore requires CAS.save to return CIDs."
)
self._dirty_root = True
# Mark shard as clean after flushing
await self._shard_data_cache.mark_clean(shard_idx)

if self._manifest_version == SHARDED_ZARR_V1:
if not isinstance(cache_key, int): # pragma: no cover
raise TypeError("v1 shard cache keys must be integers.")
shard_idx = int(cache_key)
if (
self._root_obj["chunks"]["shard_cids"][shard_idx]
!= new_shard_cid_obj
):
self._root_obj["chunks"]["shard_cids"][shard_idx] = (
new_shard_cid_obj
)
self.array_indices[""].shard_cids[shard_idx] = new_shard_cid_obj
self._dirty_root = True
else:
if isinstance(cache_key, int): # pragma: no cover
raise TypeError("v2 shard cache keys must include array paths.")
array_path, shard_idx = cache_key
array_index = self.array_indices[array_path]
if array_index.shard_cids[shard_idx] != new_shard_cid_obj:
array_index.shard_cids[shard_idx] = new_shard_cid_obj
self._dirty_root = True
self._sync_arrays_to_root()

await self._shard_data_cache.mark_clean(cache_key)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🔴 Critical | ⚡ Quick win

Hold the per-shard lock while flushing dirty shard data.

set()/delete() mutate shard lists under _shard_locks, but flush() reads, saves, and marks them clean without that lock. A concurrent write can be marked dirty after encoding but before mark_clean(), losing the persisted update.

Proposed direction
         if dirty_shards:
             for cache_key in sorted(dirty_shards, key=str):
-                shard_data_list = await self._shard_data_cache.get(cache_key)
-                if shard_data_list is None:
-                    raise RuntimeError(f"Dirty shard {cache_key} not found in cache")
+                shard_lock = self._shard_locks[cache_key]
+                async with shard_lock:
+                    shard_data_list = await self._shard_data_cache.get(cache_key)
+                    if shard_data_list is None:
+                        raise RuntimeError(
+                            f"Dirty shard {cache_key} not found in cache"
+                        )
 
-                shard_data_bytes = dag_cbor.encode(cast(IPLDKind, shard_data_list))
+                    shard_data_bytes = dag_cbor.encode(cast(IPLDKind, shard_data_list))
+                    # save, update root/array index, then mark_clean while still locked
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@py_hamt/sharded_zarr_store.py` around lines 1158 - 1196, The flush path in
ShardedZarrStore is reading, saving, and marking shard data clean without
holding the per-shard lock, which can race with concurrent set()/delete()
updates. Update the flush logic around the dirty shard loop to acquire and hold
the corresponding _shard_locks entry for each cache_key while fetching from
_shard_data_cache, encoding with dag_cbor, saving via cas.save, updating
shard_cids, and calling mark_clean. Use the existing _shard_locks,
_shard_data_cache, and _manifest_version branches in the flush method so a shard
cannot be modified after it is persisted but before it is marked clean.

Comment on lines 1352 to 1355
if target_shard_list[index_in_shard] != pointer_cid_obj:
target_shard_list[index_in_shard] = pointer_cid_obj
await self._shard_data_cache.mark_dirty(shard_idx)
await self._shard_data_cache.mark_dirty(cache_key)
return None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Remove legacy metadata fallbacks when shard chunks are written or deleted.

A migrated chunk can exist as metadata first, then later be promoted into a v2 shard. If the metadata fallback remains, deleting the shard entry makes get()/exists() fall back to the stale metadata and the chunk reappears.

Proposed fix
             if target_shard_list[index_in_shard] != pointer_cid_obj:
                 target_shard_list[index_in_shard] = pointer_cid_obj
                 await self._shard_data_cache.mark_dirty(cache_key)
+            if self._root_obj["metadata"].pop(key, None) is not None:
+                self._metadata_read_cache.pop(key, None)
+                self._dirty_root = True
@@
             if target_shard_list[index_in_shard] is not None:
                 target_shard_list[index_in_shard] = None
                 await self._shard_data_cache.mark_dirty(cache_key)
-            elif self._root_obj["metadata"].pop(key, None) is not None:
+            if self._root_obj["metadata"].pop(key, None) is not None:
                 self._metadata_read_cache.pop(key, None)
                 self._dirty_root = True

Also applies to: 1424-1429

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@py_hamt/sharded_zarr_store.py` around lines 1352 - 1355, Remove the legacy
metadata fallback path for migrated chunks in the shard write/delete flow so
stale metadata cannot reappear after a v2 shard mutation. Update the shard
update and deletion logic around the shard list mutation code and the
`get()`/`exists()` lookup path in `sharded_zarr_store.py` so once a chunk is
promoted into a shard, reads no longer consult the old metadata entry after the
shard entry is written or removed. Use the existing shard mutation helpers and
cache invalidation behavior to ensure the shard state is the single source of
truth.

Comment on lines +1593 to +1597
array_index.resize(tuple(new_shape))
if self._primary_array_path == array_index.array_path:
self._set_legacy_geometry_from_index(array_index)
self._sync_arrays_to_root()
self._dirty_root = True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Clear out-of-shape cached and persisted shard slots on resize.

After shrinking, array_index.resize() truncates shard_cids but cached dirty shards and retained last-shard slots can still contain CIDs outside the new total_chunks. flush() can either index past the truncated list or persist stale chunks that reappear after a later expansion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@py_hamt/sharded_zarr_store.py` around lines 1593 - 1597, The resize path in
sharded_zarr_store.py leaves stale cached and persisted shard slots after
`array_index.resize()`, so shrinking can still expose CIDs beyond the new
`total_chunks`. Update the resize handling around `array_index.resize()`,
`_sync_arrays_to_root()`, and the dirty shard cache to clear any out-of-shape
shard entries and last-shard slots before marking `_dirty_root`; also make sure
`flush()` and any shard-index accessors in this class tolerate the truncated
`shard_cids` state without indexing past the new bounds or re-persisting old
chunks after re-expansion.

Comment on lines +41 to +48
async def load(
self,
id: IPLDKind,
offset: int | None = None,
length: int | None = None,
suffix: int | None = None,
) -> bytes:
data = self.store[self._key(id)]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Rename id to avoid the Ruff A002 failure.

This helper now shadows the builtin name, so the new test file is not Ruff-clean. As per coding guidelines, "Python code must pass ruff linting (ruff check)" and "Use meaningful variable and function names".

Suggested fix
     async def load(
         self,
-        id: IPLDKind,
+        cid_or_id: IPLDKind,
         offset: int | None = None,
         length: int | None = None,
         suffix: int | None = None,
     ) -> bytes:
-        data = self.store[self._key(id)]
+        data = self.store[self._key(cid_or_id)]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def load(
self,
id: IPLDKind,
offset: int | None = None,
length: int | None = None,
suffix: int | None = None,
) -> bytes:
data = self.store[self._key(id)]
async def load(
self,
cid_or_id: IPLDKind,
offset: int | None = None,
length: int | None = None,
suffix: int | None = None,
) -> bytes:
data = self.store[self._key(cid_or_id)]
🧰 Tools
🪛 Ruff (0.15.20)

[error] 43-43: Function argument id is shadowing a Python builtin

(A002)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_sharded_zarr_store_v2.py` around lines 41 - 48, The helper method
load currently uses the parameter name id, which shadows Python’s builtin and
triggers Ruff A002. Rename that parameter in load to a more descriptive name
(for example, key or item_id) and update the lookup using self._key(...) so the
test remains Ruff-clean without changing behavior.

Sources: Coding guidelines, Linters/SAST tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants