IGNITE-627 Fixed inconsistent value in near cache on concurrent update #13316
Open
anton-vinogradov wants to merge 1 commit into
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
IGNITE-627
https://issues.apache.org/jira/browse/IGNITE-627
Problem
Near cache, atomic mode. A near node puts a value; almost simultaneously another client updates the same key. Due to message reordering the reader notification for the second update reaches the near node before the response to the node's own put. Result: the near cache is left with a stale value forever (`get(k)` returns the old value while the whole cluster sees the new one), and the reader subscription is silently dropped, so no future update ever repairs it.
Root cause
The near entry does not exist during the put's in-flight window. For a near cache "no entry" legitimately means "evicted → unsubscribe me", so the protocol cannot distinguish it from "not created yet":
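The reordering described above can be replayed in a small single-threaded toy model. This is only a sketch: `Primary`, `NearNode`, `Versioned` and their methods are hypothetical names invented for illustration, not Ignite classes, and message delivery is simulated by plain method calls in the problematic order.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy replay of the race on unpatched code. All names here are hypothetical, not Ignite APIs. */
public class NearRaceSketch {
    /** A value together with the version the primary assigned to it. */
    static final class Versioned {
        final int val;
        final long ver;

        Versioned(int val, long ver) {
            this.val = val;
            this.ver = ver;
        }
    }

    /** Primary side: current value plus the near nodes registered as readers. */
    static final class Primary {
        Versioned cur;
        long verCntr;
        final Set<String> readers = new HashSet<>();

        /** Applies a put; a non-null nearNodeId registers that node as a reader. */
        Versioned put(int val, String nearNodeId) {
            cur = new Versioned(val, ++verCntr);

            if (nearNodeId != null)
                readers.add(nearNodeId);

            return cur;
        }
    }

    /** Near side without the fix: the entry appears only when the put response is processed. */
    static final class NearNode {
        final Map<String, Versioned> cache = new HashMap<>();

        /** Returns false when there is no entry, which the primary reads as "evicted, unsubscribe me". */
        boolean onReaderUpdate(String key, Versioned upd) {
            if (!cache.containsKey(key))
                return false;

            cache.put(key, upd);

            return true;
        }

        /** The late response creates the entry with whatever value it carries. */
        void onPutResponse(String key, Versioned res) {
            cache.put(key, res);
        }
    }

    /** Returns {near value, primary value, number of registered readers}. */
    static int[] run() {
        Primary primary = new Primary();
        NearNode near = new NearNode();

        Versioned delayedRes = primary.put(1, "near"); // near node's own put, response delayed
        Versioned upd = primary.put(2, null);          // concurrent put from another client

        if (!near.onReaderUpdate("k", upd))             // reader update overtakes the response
            primary.readers.remove("near");             // "no entry" drops the subscription

        near.onPutResponse("k", delayedRes);            // stale value 1 lands last

        return new int[] {near.cache.get("k").val, primary.cur.val, primary.readers.size()};
    }

    public static void main(String[] args) {
        int[] r = run();

        System.out.println("near=" + r[0] + " primary=" + r[1] + " readers=" + r[2]);
        // prints "near=1 primary=2 readers=0"
    }
}
```

The model ends with the near node holding 1, the primary holding 2, and no reader left to deliver a repair, which is the state the Problem section describes.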
The fix — one guarantee
The near entry exists, and cannot be evicted, for the whole duration of a near update. It is created empty and reserved (`GridNearCacheEntry.reserveEviction()`, the existing mechanism `GridNearGetFuture` already uses for reads) at future mapping time, before the request is sent; all reservations are released when the future completes, on every completion path.
Everything else follows from mechanisms that already exist:
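As a rough illustration of the guarantee stated in this section, the toy model below creates an empty entry and reserves it before the request would be sent, then delivers the reader update and the late response out of order. It is a sketch under assumptions: `NearEntry`, `NearCache` and `UpdateFuture` are hypothetical stand-ins, the method names `reserveEviction`, `releaseEviction` and `reserveNearCacheEntry` are borrowed from this description but their bodies are invented, and the version check is reduced to a plain `long` comparison.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the reservation guarantee. Classes and method bodies are hypothetical, not Ignite code. */
public class NearReservationSketch {
    /** Near entry: starts empty (version 0), accepts only newer versions, counts eviction reservations. */
    static final class NearEntry {
        Integer val;
        long ver;
        int reservations;

        void reserveEviction() { reservations++; }

        void releaseEviction() { reservations--; }

        /** Version check: an update that is not newer than the current version is discarded. */
        boolean update(int newVal, long newVer) {
            if (newVer <= ver)
                return false;

            val = newVal;
            ver = newVer;

            return true;
        }
    }

    static final class NearCache {
        final Map<String, NearEntry> entries = new HashMap<>();

        NearEntry entryOrCreate(String key) {
            return entries.computeIfAbsent(key, k -> new NearEntry());
        }

        /** Eviction skips entries that hold at least one reservation. */
        boolean evict(String key) {
            NearEntry e = entries.get(key);

            if (e == null || e.reservations > 0)
                return false;

            entries.remove(key);

            return true;
        }
    }

    /** Update future: reserves each key at most once and releases everything when it completes. */
    static final class UpdateFuture {
        final NearCache cache;
        final Set<String> reserved = new HashSet<>();
        boolean done;

        UpdateFuture(NearCache cache) { this.cache = cache; }

        /** Idempotent per future, so a remap does not reserve the same entry twice. */
        void reserveNearCacheEntry(String key) {
            if (reserved.add(key))
                cache.entryOrCreate(key).reserveEviction();
        }

        /** Single release point; a second completion call is a no-op. */
        void onDone() {
            if (done)
                return;

            done = true;

            for (String key : reserved)
                cache.entries.get(key).releaseEviction();
        }
    }

    static String run() {
        NearCache cache = new NearCache();
        UpdateFuture fut = new UpdateFuture(cache);

        fut.reserveNearCacheEntry("k");                 // mapping: empty entry created and reserved
        fut.reserveNearCacheEntry("k");                 // remap: no second reservation

        boolean evictedInFlight = cache.evict("k");     // eviction is refused while reserved

        NearEntry e = cache.entries.get("k");
        e.update(2, 2);                                 // reader update (ver 2) arrives first
        boolean staleApplied = e.update(1, 1);          // late response (ver 1) is discarded

        fut.onDone();                                   // reservation released on completion

        return "val=" + e.val + " evictedInFlight=" + evictedInFlight
            + " staleApplied=" + staleApplied + " reservations=" + e.reservations;
    }

    public static void main(String[] args) {
        System.out.println(run());
        // prints "val=2 evictedInFlight=false staleApplied=false reservations=0"
    }
}
```

In this model the entry is present when the reader update arrives, so the update is applied and the subscription question never comes up. The late response then loses the version comparison instead of overwriting the fresh value.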
Implementation notes: the reservation is taken before the future is published via `addAtomicFuture` (a published future can be completed concurrently, and a reservation taken after completion would never be released); `reserveNearCacheEntry` is idempotent per future, so remap does not double-reserve. Response processing is not changed.
Before / after
Master:
    sequenceDiagram
        participant N as Near node
        participant P as Primary node
        N->>P: put(k, 1)
        Note over P: value 1 (ver 1), near node registered as reader
        P--)N: response (ver 1), delayed
        Note over P: concurrent put(k, 2) → ver 2
        P->>N: reader update (ver 2)
        Note over N: no entry → update dropped
        N--)P: "no entry"
        Note over P: reader removed, no future updates
        Note over N: late response creates entry with stale value 1
        Note over N: get(k) = 1 forever, cluster sees 2
With the fix:
    sequenceDiagram
        participant N as Near node
        participant P as Primary node
        Note over N: mapping: entry created empty + reserveEviction()
        N->>P: put(k, 1)
        Note over P: value 1 (ver 1), near node registered as reader
        P--)N: response (ver 1), delayed
        Note over P: concurrent put(k, 2) → ver 2
        P->>N: reader update (ver 2)
        Note over N: entry exists → value 2 (ver 2) applied, reader kept
        Note over N: late response: ver 1 < ver 2 → discarded by version check
        Note over N: releaseEviction() on future completion
        Note over N: get(k) = 2, consistent with the cluster
Transactional part (separate race, same ticket)
In tx caches the reader was registered in `GridDhtTxLocalAdapter.addEntry`, i.e. before prepare acquired entry locks, so `clearReaders` of a concurrently finishing remove could wipe it. Registration is moved into `GridDhtTxPrepareFuture.map(IgniteTxEntry)`, after `onEntriesLocked()`. `GridNearTxLocal.addReader` is a no-op, so near-local transactions are unaffected.
The reservation approach follows Semen Boikov's 2019 prototype (branches `ignite-627` / `ignite-627-tx`), minimized.
Tests
- `IgniteCacheAtomicProtocolTest.testNearEntryUpdateRace{Put,PutIfAbsent,Invoke,PutAll}`: block `GridNearAtomicUpdateResponse`, update the same key from the primary, unblock, assert the near cache observes the fresh value. They fail on unpatched master (`expected:<2> but was:<1>`) and pass with the fix.
- `CacheNearReaderUpdateTest` (hard `fail()` since 2016) and `testPut[Remove]ConsistencyMultithreaded` for near-enabled configurations: all pass, stable across repeated runs.
- `IgniteCacheAtomicProtocolTest`): 92 tests, all green.
🤖 Generated with Claude Code