Skip to content

vecann: fix concurrent download race in dataset cache#164563

Merged
trunk-io[bot] merged 1 commit intocockroachdb:masterfrom
mw5h:worktree-fix-vecann-race
Mar 2, 2026
Merged

vecann: fix concurrent download race in dataset cache#164563
trunk-io[bot] merged 1 commit intocockroachdb:masterfrom
mw5h:worktree-fix-vecann-race

Conversation

@mw5h
Copy link
Contributor

@mw5h mw5h commented Feb 28, 2026

Summary

  • Fixed a race condition causing 100% nightly failure of vecindex/dbpedia-100k/nodes=3/prefix=0 on master.
    The prefix=0 and prefix=3 test variants run concurrently and both download
    the same dataset to the same cache directory using a fixed temp file path,
    causing concurrent writers to corrupt each other's data.
  • Changed downloadAndUnzip to use os.CreateTemp for unique temp files per
    downloader, with atomic os.Rename to install the extracted file at the
    destination. The cached file is now either absent or complete — never truncated.
  • Removed ResetCache:true from the vecindex roachtest (added in 2794f07 as
    a workaround for corrupt cache files). With atomic extraction the workaround is
    no longer needed, and removing it eliminates the redundant 443MB download that
    was triggering the race.

Fixes: #163471
Fixes: #159333

Test plan

  • Ran vecindex/random-s/nodes=1/prefix=0 locally with --local — passed (2250s).

@trunk-io
Copy link
Contributor

trunk-io bot commented Feb 28, 2026

😎 Merged successfully - details.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor Author

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mw5h made 1 comment.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained.


pkg/workload/vecann/datasets.go line 349 at r1 (raw file):

	// Atomic rename: the destination is either absent or contains the complete
	// extracted file. Concurrent processes performing the same download will
	// each rename their own temp file, with the last writer winning.

We already said this above.

@mw5h mw5h force-pushed the worktree-fix-vecann-race branch from e566d79 to 169d62e Compare February 28, 2026 01:01
@mw5h mw5h marked this pull request as ready for review March 1, 2026 21:47
@mw5h mw5h requested a review from a team as a code owner March 1, 2026 21:47
@mw5h mw5h requested review from a team, nameisbhaskar, williamchoe3 and yuzefovich and removed request for a team March 1, 2026 21:47
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find! :lgtm:

@yuzefovich reviewed 2 files and all commit messages, and made 3 comments.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on mw5h, nameisbhaskar, and williamchoe3).


pkg/cmd/roachtest/tests/vecindex.go line 256 at r2 (raw file):

	loader := vecann.DatasetLoader{
		DatasetName: opts.dataset,
		ResetCache:  true,

nit: we can now remove ResetCache option since it's no longer used anywhere.


pkg/workload/vecann/datasets.go line 303 at r2 (raw file):

	writer.OnProgress = dl.OnDownloadProgress
	_, copyErr := io.Copy(&writer, reader)
	_ = reader.Close()

nit: why are we ignoring this error but not on tempZip.Close? A quick comment would be helpful.

The vecindex roachtest has been failing 100% of the time on master since
the prefix=0 and prefix=3 test variants run concurrently and both
download the same dataset to the same cache directory. The previous code
used a fixed temp file path (destPath + ".zip") for the download, so
concurrent downloaders would clobber each other's writes, producing
either a corrupt zip ("unexpected EOF" during extraction) or a missing
file ("no such file or directory" when one process's defer cleanup
deletes the file before the other can read it).

Fix this by using os.CreateTemp for both the downloaded zip and the
extracted output, giving each concurrent downloader its own unique temp
files. The extracted file is then installed at the destination via atomic
os.Rename, ensuring the cached file is either absent or complete — never
a truncated partial write.

This also removes the ResetCache field from DatasetLoader entirely. It
was only set to true in the vecindex roachtest (added in 2794f07 as
a workaround for corrupted cache files persisting across runs). With
atomic extraction, the cache can never contain a truncated file, so the
workaround is no longer needed. Since no callers remain, the field and
all associated checks are removed.

Fixes: cockroachdb#163471
Fixes: cockroachdb#159333

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@mw5h mw5h force-pushed the worktree-fix-vecann-race branch from 25090fb to 6f5344c Compare March 2, 2026 18:48
@trunk-io trunk-io bot merged commit e5b3ac8 into cockroachdb:master Mar 2, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

3 participants