[codex] Upgrade ShardedZarrStore for pyramid zarrs#85
Conversation
WalkthroughAdds a ChangesShardedZarrStore v2 manifest and HAMT converter
Sequence Diagram(s)sequenceDiagram
participant Caller
participant ShardedZarrStore
participant MemoryBoundedLRUCache
participant CAS
Caller->>ShardedZarrStore: set(key, bytes)
ShardedZarrStore->>ShardedZarrStore: parse chunk key → array_path + coords
ShardedZarrStore->>ShardedZarrStore: resolve ArrayIndex for array_path
ShardedZarrStore->>CAS: store chunk bytes → CID
ShardedZarrStore->>MemoryBoundedLRUCache: mark_dirty((array_path, shard_idx))
Caller->>ShardedZarrStore: flush()
ShardedZarrStore->>MemoryBoundedLRUCache: iterate dirty (array_path, shard_idx) entries
ShardedZarrStore->>CAS: store shard DAG-CBOR block → shard CID
ShardedZarrStore->>ShardedZarrStore: update ArrayIndex.shard_cids[shard_idx]
ShardedZarrStore->>CAS: store v2 root manifest → new root CID
sequenceDiagram
participant Caller
participant convert_hamt_to_sharded
participant HAMTStore
participant ShardedZarrStore
Caller->>convert_hamt_to_sharded: cas, hamt_root_cid, chunks_per_shard
convert_hamt_to_sharded->>ShardedZarrStore: open(manifest_version=v2, chunks_per_shard)
convert_hamt_to_sharded->>HAMTStore: iterate all keys
loop non-chunk keys first
convert_hamt_to_sharded->>ShardedZarrStore: set_pointer(key, cid)
end
loop chunk keys second
convert_hamt_to_sharded->>ShardedZarrStore: set_pointer(key, cid)
end
convert_hamt_to_sharded->>ShardedZarrStore: flush() → new root CID
convert_hamt_to_sharded->>Caller: root CID string
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ac1d1d2942
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| existing.array_shape = tuple(array_shape) | ||
| existing.chunk_shape = tuple(chunk_shape) | ||
| existing._validate_geometry() | ||
| existing.resize(tuple(array_shape)) |
There was a problem hiding this comment.
Reset stale shards when array geometry changes
When a v2 array's metadata is replaced for the same path with a different chunk grid or rank, this branch updates the geometry but keeps/resizes the previous shard_cids. Any old chunk slots that remain in range are then interpreted under the new grid, so an overwritten or rechunked array can return stale chunks before they are rewritten, or for fill chunks that Zarr omits. Reset the shard index when the chunk geometry changes instead of reusing the old pointers.
Useful? React with 👍 / 👎.
| for group_path in group_paths: | ||
| metadata_key = ( | ||
| "zarr.json" if group_path == "" else f"{group_path}/zarr.json" | ||
| ) | ||
| metadata[metadata_key] = await self.cas.save(group_metadata, codec="raw") |
There was a problem hiding this comment.
Preserve existing group metadata during migration
During v1-to-v2 migration this helper runs after migrated_metadata has been built, but it unconditionally assigns the root and parent zarr.json entries. If the v1 store already had group metadata at those keys, such as dataset attributes or consolidated metadata in zarr.json or 0/zarr.json, it is replaced with an empty group document and the migrated dataset silently loses metadata. Only add these placeholders when the key is absent.
Useful? React with 👍 / 👎.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #85 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 8 8
Lines 1617 2127 +510
==========================================
+ Hits 1617 2127 +510 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (1)
tests/test_sharded_zarr_store_v2.py (1)
679-739: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winConsider a small Hypothesis geometry test for
ArrayIndex. The rank/chunk/shard-count invariants here are pure and fit a property-based test better than a handful of fixed examples.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_sharded_zarr_store_v2.py` around lines 679 - 739, Add a small Hypothesis-based geometry test for ArrayIndex instead of relying only on the fixed validation examples in test_array_index_validation_paths. Use ArrayIndex and its constructors/new/resize behavior to generate valid and invalid shapes, chunk shapes, and chunks_per_shard values, then assert the rank, chunk, and shard-count invariants hold across many cases. Keep the existing targeted error-message checks, but factor the pure geometry coverage into a property-based test so the ArrayIndex constraints are exercised more broadly.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@py_hamt/hamt_to_sharded_converter.py`:
- Around line 18-19: The convert_hamt_to_sharded entrypoint currently advertises
a generic ContentAddressedStore even though the implementation only works with
CID-backed stores. Update convert_hamt_to_sharded to either narrow the cas
annotation/docs to the CID-backed store type or add an upfront validation that
rejects non-CID immutable IDs before any sharded migration work begins. Make
sure the pointer encoding/decoding path in convert_hamt_to_sharded and the
ShardedZarrStore handoff stays consistent with CID assumptions.
In `@py_hamt/sharded_zarr_store.py`:
- Around line 1593-1597: The resize path in sharded_zarr_store.py leaves stale
cached and persisted shard slots after `array_index.resize()`, so shrinking can
still expose CIDs beyond the new `total_chunks`. Update the resize handling
around `array_index.resize()`, `_sync_arrays_to_root()`, and the dirty shard
cache to clear any out-of-shape shard entries and last-shard slots before
marking `_dirty_root`; also make sure `flush()` and any shard-index accessors in
this class tolerate the truncated `shard_cids` state without indexing past the
new bounds or re-persisting old chunks after re-expansion.
- Around line 1352-1355: Remove the legacy metadata fallback path for migrated
chunks in the shard write/delete flow so stale metadata cannot reappear after a
v2 shard mutation. Update the shard update and deletion logic around the shard
list mutation code and the `get()`/`exists()` lookup path in
`sharded_zarr_store.py` so once a chunk is promoted into a shard, reads no
longer consult the old metadata entry after the shard entry is written or
removed. Use the existing shard mutation helpers and cache invalidation behavior
to ensure the shard state is the single source of truth.
- Around line 777-782: The _add_missing_group_metadata() migration path is
overwriting existing group metadata by always saving group_metadata into each
zarr.json entry. Update the logic in _add_missing_group_metadata() to first
check whether a parent zarr.json already exists and only create missing group
metadata, preserving any existing attributes or consolidated metadata during
v1-to-v2 migration. Use the existing metadata key generation in this method to
locate the write path and avoid replacing already-present group records.
- Around line 1058-1062: The timeout handling in the shard load path drops the
original exception context, so update the `except asyncio.TimeoutError` block in
`sharded_zarr_store.py` to chain the new `RuntimeError` from the caught timeout
exception. Keep the cleanup of `_pending_shard_loads` as-is, but raise the error
using the original `asyncio.TimeoutError` variable so the `ShardedZarrStore`
load failure preserves context and satisfies the B904 rule.
- Line 10: The shared coordinate prefix collection in ShardedZarrStore is
currently a mutable class attribute and should be made immutable to satisfy the
Ruff RUF012 warning. Update the `_V1_COORDINATE_ARRAY_PREFIXES` declaration in
`ShardedZarrStore` to use `ClassVar[frozenset[str]]` (or otherwise move the
initialization into `__init__`) so the prefix set cannot be mutated across
instances.
- Around line 1158-1196: The flush path in ShardedZarrStore is reading, saving,
and marking shard data clean without holding the per-shard lock, which can race
with concurrent set()/delete() updates. Update the flush logic around the dirty
shard loop to acquire and hold the corresponding _shard_locks entry for each
cache_key while fetching from _shard_data_cache, encoding with dag_cbor, saving
via cas.save, updating shard_cids, and calling mark_clean. Use the existing
_shard_locks, _shard_data_cache, and _manifest_version branches in the flush
method so a shard cannot be modified after it is persisted but before it is
marked clean.
In `@tests/test_sharded_zarr_store_v2.py`:
- Around line 41-48: The helper method load currently uses the parameter name
id, which shadows Python’s builtin and triggers Ruff A002. Rename that parameter
in load to a more descriptive name (for example, key or item_id) and update the
lookup using self._key(...) so the test remains Ruff-clean without changing
behavior.
---
Nitpick comments:
In `@tests/test_sharded_zarr_store_v2.py`:
- Around line 679-739: Add a small Hypothesis-based geometry test for ArrayIndex
instead of relying only on the fixed validation examples in
test_array_index_validation_paths. Use ArrayIndex and its
constructors/new/resize behavior to generate valid and invalid shapes, chunk
shapes, and chunks_per_shard values, then assert the rank, chunk, and
shard-count invariants hold across many cases. Keep the existing targeted
error-message checks, but factor the pure geometry coverage into a
property-based test so the ArrayIndex constraints are exercised more broadly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 33a6d4af-7a0d-41eb-a4ac-69d14b6a9bcd
📒 Files selected for processing (5)
py_hamt/hamt_to_sharded_converter.pypy_hamt/sharded_zarr_store.pytests/test_sharded_store_grafting.pytests/test_sharded_zarr_store.pytests/test_sharded_zarr_store_v2.py
| async def convert_hamt_to_sharded( | ||
| cas: KuboCAS, hamt_root_cid: str, chunks_per_shard: int | ||
| cas: ContentAddressedStore, hamt_root_cid: str, chunks_per_shard: int |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Narrow the CAS contract back to CID-backed implementations.
ContentAddressedStore allows non-CID immutable IDs, but this code immediately enforces CID and base32-encodes pointers before handing them to ShardedZarrStore, which also decodes them as CIDs. That means a valid non-CID ContentAddressedStore now type-checks at the boundary and then fails on the first migrated key. Either narrow the annotation/docstring again or reject non-CID-backed stores up front.
Also applies to: 61-76
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@py_hamt/hamt_to_sharded_converter.py` around lines 18 - 19, The
convert_hamt_to_sharded entrypoint currently advertises a generic
ContentAddressedStore even though the implementation only works with CID-backed
stores. Update convert_hamt_to_sharded to either narrow the cas annotation/docs
to the CID-backed store type or add an upfront validation that rejects non-CID
immutable IDs before any sharded migration work begins. Make sure the pointer
encoding/decoding path in convert_hamt_to_sharded and the ShardedZarrStore
handoff stays consistent with CID assumptions.
| from collections.abc import AsyncIterator, Iterable | ||
| from typing import DefaultDict, Dict, List, Optional, Set, Tuple | ||
| from dataclasses import dataclass | ||
| from typing import DefaultDict, Dict, List, Optional, Set, Tuple, cast |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
uv run ruff check py_hamt/sharded_zarr_store.py --select RUF012Repository: dClimate/py-hamt
Length of output: 194
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf 'ruff: '
(command -v ruff && ruff --version) || true
printf 'python: '
python3 --version
echo
echo '--- file outline ---'
ast-grep outline py_hamt/sharded_zarr_store.py --view expanded | sed -n '1,220p'
echo
echo '--- surrounding lines 1-80 ---'
sed -n '1,80p' py_hamt/sharded_zarr_store.py
echo
echo '--- surrounding lines 280-320 ---'
sed -n '280,320p' py_hamt/sharded_zarr_store.pyRepository: dClimate/py-hamt
Length of output: 9152
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
from pathlib import Path
p = Path("py_hamt/sharded_zarr_store.py")
text = p.read_text()
for needle in ["_V1_COORDINATE_ARRAY_PREFIXES", "ClassVar", "frozenset"]:
print(f"== {needle} ==")
for i, line in enumerate(text.splitlines(), 1):
if needle in line:
print(f"{i}: {line}")
PYRepository: dClimate/py-hamt
Length of output: 341
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import shutil, subprocess, sys
ruff = shutil.which("ruff")
if not ruff:
print("ruff-not-found")
sys.exit(0)
cmd = [ruff, "check", "py_hamt/sharded_zarr_store.py", "--select", "RUF012"]
print("running:", " ".join(cmd))
proc = subprocess.run(cmd, text=True, capture_output=True)
print("returncode:", proc.returncode)
print(proc.stdout)
print(proc.stderr)
PYRepository: dClimate/py-hamt
Length of output: 1241
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
from pathlib import Path
text = Path("py_hamt/sharded_zarr_store.py").read_text().splitlines()
for start in (1, 90, 290, 1050, 1060):
end = min(start + 25, len(text))
print(f"\n--- lines {start}-{end} ---")
for i in range(start, end + 1):
if i <= len(text):
print(f"{i:4}: {text[i-1]}")
PYRepository: dClimate/py-hamt
Length of output: 5880
Mark _V1_COORDINATE_ARRAY_PREFIXES as immutable
py_hamt/sharded_zarr_store.py:297-305 defines a mutable class attribute that Ruff flags (RUF012). Annotate it as ClassVar[frozenset[str]] (or move it to __init__) so the shared prefix set can’t be mutated accidentally.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@py_hamt/sharded_zarr_store.py` at line 10, The shared coordinate prefix
collection in ShardedZarrStore is currently a mutable class attribute and should
be made immutable to satisfy the Ruff RUF012 warning. Update the
`_V1_COORDINATE_ARRAY_PREFIXES` declaration in `ShardedZarrStore` to use
`ClassVar[frozenset[str]]` (or otherwise move the initialization into
`__init__`) so the prefix set cannot be mutated across instances.
Sources: Coding guidelines, Linters/SAST tools
| for group_path in group_paths: | ||
| metadata_key = ( | ||
| "zarr.json" if group_path == "" else f"{group_path}/zarr.json" | ||
| ) | ||
| metadata[metadata_key] = await self.cas.save(group_metadata, codec="raw") | ||
|
|
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Preserve existing group metadata during v1-to-v2 migration.
_add_missing_group_metadata() currently overwrites any existing parent zarr.json, which can drop group attributes or consolidated metadata during migration.
Proposed fix
for group_path in group_paths:
metadata_key = (
"zarr.json" if group_path == "" else f"{group_path}/zarr.json"
)
+ if metadata_key in metadata:
+ continue
metadata[metadata_key] = await self.cas.save(group_metadata, codec="raw")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for group_path in group_paths: | |
| metadata_key = ( | |
| "zarr.json" if group_path == "" else f"{group_path}/zarr.json" | |
| ) | |
| metadata[metadata_key] = await self.cas.save(group_metadata, codec="raw") | |
| for group_path in group_paths: | |
| metadata_key = ( | |
| "zarr.json" if group_path == "" else f"{group_path}/zarr.json" | |
| ) | |
| if metadata_key in metadata: | |
| continue | |
| metadata[metadata_key] = await self.cas.save(group_metadata, codec="raw") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@py_hamt/sharded_zarr_store.py` around lines 777 - 782, The
_add_missing_group_metadata() migration path is overwriting existing group
metadata by always saving group_metadata into each zarr.json entry. Update the
logic in _add_missing_group_metadata() to first check whether a parent zarr.json
already exists and only create missing group metadata, preserving any existing
attributes or consolidated metadata during v1-to-v2 migration. Use the existing
metadata key generation in this method to locate the write path and avoid
replacing already-present group records.
| except asyncio.TimeoutError: | ||
| # Clean up the pending load to allow retry | ||
| if shard_idx in self._pending_shard_loads: | ||
| self._pending_shard_loads[shard_idx].set() | ||
| del self._pending_shard_loads[shard_idx] | ||
| if cache_key in self._pending_shard_loads: | ||
| self._pending_shard_loads[cache_key].set() | ||
| del self._pending_shard_loads[cache_key] | ||
| raise RuntimeError(f"Timeout waiting for shard {shard_idx} to load.") |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
uv run ruff check py_hamt/sharded_zarr_store.py --select B904Repository: dClimate/py-hamt
Length of output: 194
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant region and surrounding context.
sed -n '1035,1075p' py_hamt/sharded_zarr_store.py
# Find whether the same timeout pattern appears elsewhere.
rg -n "except asyncio\.TimeoutError|raise RuntimeError\\(f\"Timeout waiting for shard" py_hamt/sharded_zarr_store.pyRepository: dClimate/py-hamt
Length of output: 2047
Chain the timeout exception. Raise the RuntimeError with from err here so the original asyncio.TimeoutError stays attached and the B904 lint rule is satisfied.
🧰 Tools
🪛 Ruff (0.15.20)
[warning] 1062-1062: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@py_hamt/sharded_zarr_store.py` around lines 1058 - 1062, The timeout handling
in the shard load path drops the original exception context, so update the
`except asyncio.TimeoutError` block in `sharded_zarr_store.py` to chain the new
`RuntimeError` from the caught timeout exception. Keep the cleanup of
`_pending_shard_loads` as-is, but raise the error using the original
`asyncio.TimeoutError` variable so the `ShardedZarrStore` load failure preserves
context and satisfies the B904 rule.
Sources: Coding guidelines, Linters/SAST tools
| for cache_key in sorted(dirty_shards, key=str): | ||
| shard_data_list = await self._shard_data_cache.get(cache_key) | ||
| if shard_data_list is None: | ||
| raise RuntimeError(f"Dirty shard {shard_idx} not found in cache") | ||
| raise RuntimeError(f"Dirty shard {cache_key} not found in cache") | ||
|
|
||
| # Encode this list into a DAG-CBOR byte representation | ||
| shard_data_bytes = dag_cbor.encode(shard_data_list) | ||
|
|
||
| # Save the DAG-CBOR block and get its CID | ||
| shard_data_bytes = dag_cbor.encode(cast(IPLDKind, shard_data_list)) | ||
| new_shard_cid_obj = await self.cas.save( | ||
| shard_data_bytes, | ||
| codec="dag-cbor", # Use 'dag-cbor' codec | ||
| codec="dag-cbor", | ||
| ) | ||
|
|
||
| if ( | ||
| self._root_obj["chunks"]["shard_cids"][shard_idx] | ||
| != new_shard_cid_obj | ||
| ): | ||
| # Store the CID object directly | ||
| self._root_obj["chunks"]["shard_cids"][shard_idx] = ( | ||
| new_shard_cid_obj | ||
| if not isinstance(new_shard_cid_obj, CID): # pragma: no cover | ||
| raise TypeError( | ||
| "ShardedZarrStore requires CAS.save to return CIDs." | ||
| ) | ||
| self._dirty_root = True | ||
| # Mark shard as clean after flushing | ||
| await self._shard_data_cache.mark_clean(shard_idx) | ||
|
|
||
| if self._manifest_version == SHARDED_ZARR_V1: | ||
| if not isinstance(cache_key, int): # pragma: no cover | ||
| raise TypeError("v1 shard cache keys must be integers.") | ||
| shard_idx = int(cache_key) | ||
| if ( | ||
| self._root_obj["chunks"]["shard_cids"][shard_idx] | ||
| != new_shard_cid_obj | ||
| ): | ||
| self._root_obj["chunks"]["shard_cids"][shard_idx] = ( | ||
| new_shard_cid_obj | ||
| ) | ||
| self.array_indices[""].shard_cids[shard_idx] = new_shard_cid_obj | ||
| self._dirty_root = True | ||
| else: | ||
| if isinstance(cache_key, int): # pragma: no cover | ||
| raise TypeError("v2 shard cache keys must include array paths.") | ||
| array_path, shard_idx = cache_key | ||
| array_index = self.array_indices[array_path] | ||
| if array_index.shard_cids[shard_idx] != new_shard_cid_obj: | ||
| array_index.shard_cids[shard_idx] = new_shard_cid_obj | ||
| self._dirty_root = True | ||
| self._sync_arrays_to_root() | ||
|
|
||
| await self._shard_data_cache.mark_clean(cache_key) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🔴 Critical | ⚡ Quick win
Hold the per-shard lock while flushing dirty shard data.
set()/delete() mutate shard lists under _shard_locks, but flush() reads, saves, and marks them clean without that lock. A concurrent write can be marked dirty after encoding but before mark_clean(), losing the persisted update.
Proposed direction
if dirty_shards:
for cache_key in sorted(dirty_shards, key=str):
- shard_data_list = await self._shard_data_cache.get(cache_key)
- if shard_data_list is None:
- raise RuntimeError(f"Dirty shard {cache_key} not found in cache")
+ shard_lock = self._shard_locks[cache_key]
+ async with shard_lock:
+ shard_data_list = await self._shard_data_cache.get(cache_key)
+ if shard_data_list is None:
+ raise RuntimeError(
+ f"Dirty shard {cache_key} not found in cache"
+ )
- shard_data_bytes = dag_cbor.encode(cast(IPLDKind, shard_data_list))
+ shard_data_bytes = dag_cbor.encode(cast(IPLDKind, shard_data_list))
+ # save, update root/array index, then mark_clean while still locked🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@py_hamt/sharded_zarr_store.py` around lines 1158 - 1196, The flush path in
ShardedZarrStore is reading, saving, and marking shard data clean without
holding the per-shard lock, which can race with concurrent set()/delete()
updates. Update the flush logic around the dirty shard loop to acquire and hold
the corresponding _shard_locks entry for each cache_key while fetching from
_shard_data_cache, encoding with dag_cbor, saving via cas.save, updating
shard_cids, and calling mark_clean. Use the existing _shard_locks,
_shard_data_cache, and _manifest_version branches in the flush method so a shard
cannot be modified after it is persisted but before it is marked clean.
| if target_shard_list[index_in_shard] != pointer_cid_obj: | ||
| target_shard_list[index_in_shard] = pointer_cid_obj | ||
| await self._shard_data_cache.mark_dirty(shard_idx) | ||
| await self._shard_data_cache.mark_dirty(cache_key) | ||
| return None |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Remove legacy metadata fallbacks when shard chunks are written or deleted.
A migrated chunk can exist as metadata first, then later be promoted into a v2 shard. If the metadata fallback remains, deleting the shard entry makes get()/exists() fall back to the stale metadata and the chunk reappears.
Proposed fix
if target_shard_list[index_in_shard] != pointer_cid_obj:
target_shard_list[index_in_shard] = pointer_cid_obj
await self._shard_data_cache.mark_dirty(cache_key)
+ if self._root_obj["metadata"].pop(key, None) is not None:
+ self._metadata_read_cache.pop(key, None)
+ self._dirty_root = True
@@
if target_shard_list[index_in_shard] is not None:
target_shard_list[index_in_shard] = None
await self._shard_data_cache.mark_dirty(cache_key)
- elif self._root_obj["metadata"].pop(key, None) is not None:
+ if self._root_obj["metadata"].pop(key, None) is not None:
self._metadata_read_cache.pop(key, None)
self._dirty_root = TrueAlso applies to: 1424-1429
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@py_hamt/sharded_zarr_store.py` around lines 1352 - 1355, Remove the legacy
metadata fallback path for migrated chunks in the shard write/delete flow so
stale metadata cannot reappear after a v2 shard mutation. Update the shard
update and deletion logic around the shard list mutation code and the
`get()`/`exists()` lookup path in `sharded_zarr_store.py` so once a chunk is
promoted into a shard, reads no longer consult the old metadata entry after the
shard entry is written or removed. Use the existing shard mutation helpers and
cache invalidation behavior to ensure the shard state is the single source of
truth.
| array_index.resize(tuple(new_shape)) | ||
| if self._primary_array_path == array_index.array_path: | ||
| self._set_legacy_geometry_from_index(array_index) | ||
| self._sync_arrays_to_root() | ||
| self._dirty_root = True |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
Clear out-of-shape cached and persisted shard slots on resize.
After shrinking, array_index.resize() truncates shard_cids but cached dirty shards and retained last-shard slots can still contain CIDs outside the new total_chunks. flush() can either index past the truncated list or persist stale chunks that reappear after a later expansion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@py_hamt/sharded_zarr_store.py` around lines 1593 - 1597, The resize path in
sharded_zarr_store.py leaves stale cached and persisted shard slots after
`array_index.resize()`, so shrinking can still expose CIDs beyond the new
`total_chunks`. Update the resize handling around `array_index.resize()`,
`_sync_arrays_to_root()`, and the dirty shard cache to clear any out-of-shape
shard entries and last-shard slots before marking `_dirty_root`; also make sure
`flush()` and any shard-index accessors in this class tolerate the truncated
`shard_cids` state without indexing past the new bounds or re-persisting old
chunks after re-expansion.
| async def load( | ||
| self, | ||
| id: IPLDKind, | ||
| offset: int | None = None, | ||
| length: int | None = None, | ||
| suffix: int | None = None, | ||
| ) -> bytes: | ||
| data = self.store[self._key(id)] |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Rename id to avoid the Ruff A002 failure.
This helper now shadows the builtin name, so the new test file is not Ruff-clean. As per coding guidelines, "Python code must pass ruff linting (ruff check)" and "Use meaningful variable and function names".
Suggested fix
async def load(
self,
- id: IPLDKind,
+ cid_or_id: IPLDKind,
offset: int | None = None,
length: int | None = None,
suffix: int | None = None,
) -> bytes:
- data = self.store[self._key(id)]
+ data = self.store[self._key(cid_or_id)]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async def load( | |
| self, | |
| id: IPLDKind, | |
| offset: int | None = None, | |
| length: int | None = None, | |
| suffix: int | None = None, | |
| ) -> bytes: | |
| data = self.store[self._key(id)] | |
| async def load( | |
| self, | |
| cid_or_id: IPLDKind, | |
| offset: int | None = None, | |
| length: int | None = None, | |
| suffix: int | None = None, | |
| ) -> bytes: | |
| data = self.store[self._key(cid_or_id)] |
🧰 Tools
🪛 Ruff (0.15.20)
[error] 43-43: Function argument id is shadowing a Python builtin
(A002)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_sharded_zarr_store_v2.py` around lines 41 - 48, The helper method
load currently uses the parameter name id, which shadows Python’s builtin and
triggers Ruff A002. Rename that parameter in load to a more descriptive name
(for example, key or item_id) and update the lookup using self._key(...) so the
test remains Ruff-clean without changing behavior.
Sources: Coding guidelines, Linters/SAST tools
Summary
Implements FPAR pyramid Ticket 1 by upgrading
ShardedZarrStoreto support path-awaresharded_zarr_v2manifests with one shard index per Zarr array path.Changes
ArrayIndex/ChunkKeyhandling and path-aware chunk parsing, get/set/exists/delete/list/resize/graft behavior.c/...metadata chunks and top-levelcgroup paths.migrate_v1_to_v2(primary_array_path)support that reuses existing v1 shard blocks and rewrites metadata under the requested primary path.c-named paths, chunk-prefix listing, grafting, and malformed shard loads.Validation
bash run-checks.shNotes
FPAR_PYRAMID_TICKETS.mdwas intentionally not included in this commit.Summary by CodeRabbit
New Features
Bug Fixes
Tests