zarrs/zarrs#358

Description
```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "zarrs",
#     "numpy",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues
import zarr
import numpy as np
import time
import platform
import subprocess


def clear_cache():
    # Drop the OS page cache so timed reads hit the disk rather than memory
    if platform.system() == "Darwin":
        # "&&" must go through a shell; as a plain argv element it would just
        # be passed to sync as an argument
        subprocess.call("sync && sudo purge", shell=True)
    elif platform.system() == "Linux":
        subprocess.call(["sudo", "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"])
    else:
        raise Exception("Unsupported platform")


zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
z = zarr.create_array(
    "foo.zarr",
    shape=(8192, 4, 128, 128),
    shards=(4096, 4, 128, 128),
    chunks=(1, 1, 128, 128),
    dtype=np.float64,
    overwrite=True,
)
# Originally the reproducer used np.random.randn, but it's clear that the
# behavior is somewhat dependent on compression
# z[...] = np.random.randn(8192, 4, 128, 128)
z[...] = np.ones((8192, 4, 128, 128))
clear_cache()
t = time.time()
z[...]
print("full read took: ", time.time() - t)
clear_cache()
t = time.time()
z[:4095, ...]
print("partial shard read took: ", time.time() - t)
```

On my Mac, the partial read takes as long as the full read:
```
full read took:  2.4683969020843506
partial shard read took:  2.291616201400757
```
While I love sequential I/O just as much as the next guy, this feels wrong. I have a feeling this has to do with the way concurrency is calculated when there are very few shards. As an instructive counter-example, increasing that 4095 to 4096 yields
```
full read took:  2.334256887435913
partial shard read took:  0.7899982929229736
```
Which feels much more reasonable.
(I am aware these chunks are comically small, so just setting that aside for now)
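A quick sanity check on the geometry (a sketch based on the shapes in the reproducer; `touched` is a hypothetical helper, not a zarrs API): both slices fall entirely within the first shard, so the difference between them is only whether that shard is covered exactly or one row short.

```python
import math

# Axis-0 geometry from the reproducer: shape 8192, shard 4096, inner chunk 1
SHARD, CHUNK = 4096, 1

def touched(extent: int, size: int) -> int:
    # Number of shard/chunk intervals along axis 0 that a [:extent] slice touches
    return math.ceil(extent / size)

for extent in (4095, 4096, 8192):
    print(f"[:{extent}] touches {touched(extent, SHARD)} shard(s), "
          f"{touched(extent, CHUNK)} inner chunk(s) along axis 0")
# → [:4095] touches 1 shard(s), 4095 inner chunk(s) along axis 0
# → [:4096] touches 1 shard(s), 4096 inner chunk(s) along axis 0
# → [:8192] touches 2 shard(s), 8192 inner chunk(s) along axis 0
```

So the slow case is a partial read of a single shard (4095 of its 4096 inner chunks), while the fast case reads one whole shard, which is consistent with the concurrency hypothesis above.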
UPDATE: I've changed the np.random to np.ones because the above repro was only working on Macs, which have compressed file systems. With np.ones (i.e., non-random data), this issue should now be reproducible on Linux machines.
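For what it's worth, the compressibility gap behind that update is easy to see (zlib here is just a stand-in codec, not necessarily what zarr is configured with): constant data compresses to almost nothing, while random float64 data barely compresses at all.

```python
import zlib
import numpy as np

# One inner chunk's worth of float64 data (128 * 128 elements), as raw bytes
ones = np.ones(128 * 128, dtype=np.float64).tobytes()
rand = np.random.default_rng(0).standard_normal(128 * 128).tobytes()

# Compressed-size ratios: near zero for constant data, near 1.0 for random data
print(len(zlib.compress(ones)) / len(ones))
print(len(zlib.compress(rand)) / len(rand))
```

That is why the random-data repro was sensitive to the file system: the amount of I/O actually performed depended on how well the bytes compressed, whereas with np.ones every platform does comparably little disk work.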