vecann: fix concurrent download race in dataset cache #164563
Merged
trunk-io[bot] merged 1 commit into cockroachdb:master on Mar 2, 2026
Member
mw5h
commented
Feb 28, 2026
Contributor
Author
mw5h
left a comment
@mw5h made 1 comment.
Reviewable status:complete! 0 of 0 LGTMs obtained.
pkg/workload/vecann/datasets.go line 349 at r1 (raw file):
// Atomic rename: the destination is either absent or contains the complete
// extracted file. Concurrent processes performing the same download will
// each rename their own temp file, with the last writer winning.
We already said this above.
e566d79 to 169d62e
yuzefovich
approved these changes
Mar 1, 2026
Member
yuzefovich
left a comment
@yuzefovich reviewed 2 files and all commit messages, and made 3 comments.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on mw5h, nameisbhaskar, and williamchoe3).
pkg/cmd/roachtest/tests/vecindex.go line 256 at r2 (raw file):
loader := vecann.DatasetLoader{
	DatasetName: opts.dataset,
	ResetCache:  true,
nit: we can now remove ResetCache option since it's no longer used anywhere.
pkg/workload/vecann/datasets.go line 303 at r2 (raw file):
writer.OnProgress = dl.OnDownloadProgress
_, copyErr := io.Copy(&writer, reader)
_ = reader.Close()
nit: why are we ignoring this error but not on tempZip.Close? A quick comment would be helpful.
The vecindex roachtest has been failing 100% of the time on master because
the prefix=0 and prefix=3 test variants run concurrently and both
download the same dataset to the same cache directory. The previous code
used a fixed temp file path (destPath + ".zip") for the download, so
concurrent downloaders would clobber each other's writes, producing
either a corrupt zip ("unexpected EOF" during extraction) or a missing
file ("no such file or directory" when one process's defer cleanup
deletes the file before the other can read it).
Fix this by using os.CreateTemp for both the downloaded zip and the
extracted output, giving each concurrent downloader its own unique temp
files. The extracted file is then installed at the destination via atomic
os.Rename, ensuring the cached file is either absent or complete — never
a truncated partial write.
This also removes the ResetCache field from DatasetLoader entirely. It
was only set to true in the vecindex roachtest (added in 2794f07 as
a workaround for corrupted cache files persisting across runs). With
atomic extraction, the cache can never contain a truncated file, so the
workaround is no longer needed. Since no callers remain, the field and
all associated checks are removed.
Fixes: cockroachdb#163471
Fixes: cockroachdb#159333
Release note: None
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
25090fb to 6f5344c
This was referenced Mar 3, 2026
Summary
- Fix vecindex/dbpedia-100k/nodes=3/prefix=0 on master. The prefix=0 and
  prefix=3 test variants run concurrently and both download the same dataset
  to the same cache directory using a fixed temp file path, causing concurrent
  writers to corrupt each other's data.
- Change downloadAndUnzip to use os.CreateTemp for unique temp files per
  downloader, with atomic os.Rename to install the extracted file at the
  destination. The cached file is now either absent or complete, never
  truncated.
- Remove ResetCache: true from the vecindex roachtest (added in 2794f07 as a
  workaround for corrupt cache files). With atomic extraction the workaround
  is no longer needed, and removing it eliminates the redundant 443MB download
  that was triggering the race.
Fixes: #163471
Fixes: #159333
Test plan
- Ran vecindex/random-s/nodes=1/prefix=0 locally with --local: passed (2250s).