feat(datalayer): schema caching for write transactions#3160
Conversation
| return "", err | ||
| } | ||
|
|
||
| // Update cache after successful write |
There was a problem hiding this comment.
misleading comment here - the "write" was successful but the commit hasn't happened yet
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
c9e3c6c to
7a968c8
Compare
a3f7fab to
85e0299
Compare
85e0299 to
bacc03b
Compare
| rwt.mustLock() | ||
| defer rwt.Unlock() |
There was a problem hiding this comment.
Is there a read-only mutex? Would that be worth using here?
There was a problem hiding this comment.
this lock is for serializing access to the rwt itself, it's not a lock on the data being read
| OrderBy("created_xid DESC"). | ||
| Limit(1). |
There was a problem hiding this comment.
Are we ever expecting there to be more than one live hash, or is this defensiveness?
There was a problem hiding this comment.
just defensiveness for now
| if !errors.Is(err, datastore.ErrSchemaNotFound) { | ||
| return nil, err | ||
| } | ||
| return ApplySchemaChangesOverExisting(ctx, rwt, caveatTypeSet, validated, nil, nil) |
There was a problem hiding this comment.
Which issue did this change address?
There was a problem hiding this comment.
When testing I discovered that this would fail on a fresh DB without a schema written - which didn't make sense to do for a WriteSchema call
There was a problem hiding this comment.
Oh yeah that's not ideal.
| var pendingHash SchemaHash | ||
| var pendingSchema *datastore.ReadOnlyStoredSchema |
There was a problem hiding this comment.
Sanity check: are transactions always handled by a single thread? i.e. are there thread safety concerns in this context?
There was a problem hiding this comment.
Imagining something where you race two goroutines within a transaction, both of which are executing a schema write. I realize that sounds like it would be insane calling code.
There was a problem hiding this comment.
yeah in general the transactions are not thread safe, though I'm not sure how good a job we do of documenting that
01e9222 to
d84e296
Compare
19a2701 to
17f5363
Compare
11dfefe to
e612621
Compare
tstirrat15
left a comment
There was a problem hiding this comment.
See comments, otherwise LGTM
| // | ||
| // When WithSchemaHashPrecondition is passed to ReadWriteTx, the datalayer may transparently retry | ||
| // fn with a refreshed schema hash if the stored schema changed since the precondition was captured. | ||
| // This is only safe when ALL schema-sensitive validation (ReadSchema, ValidateRelationshipUpdates, |
There was a problem hiding this comment.
This scares me a little... Do we have a way to check/assert this? Would an analyzer do it?
There was a problem hiding this comment.
I can't think of an easy way to validate this, maybe an analyzer.
It's not really different semantics than before - this schema validation step still happened before my change - just documenting it for future uses.
Maybe I can make a followup?
| case errors.Is(err, datastore.ErrSchemaHashPreconditionFailed): | ||
| return status.Errorf(codes.FailedPrecondition, "%s", err) | ||
| case errors.Is(err, datastore.ErrSchemaNotFound): | ||
| return status.Errorf(codes.NotFound, "%s", err) |
There was a problem hiding this comment.
Does this error get returned on things other than ReadSchema, e.g. WriteRels? If so, it seems like it should be NotFound in some cases and FailedPrecondition in others. Not sure how difficult this would be to implement, though.
There was a problem hiding this comment.
Yeah it would get returned from writerels if there hasn't been a schema write yet, but to me that doesn't sound too unreasonable
On main, newStoredSchemaReaderAdapter catches ErrSchemaNotFound and returns an empty schema adapter with no error. That's behavior I changed in this PR (mentioned in the PR desc) because I couldn't understand a reason for it. So on main, a rel write with no schema would succeed at the ReadSchema call, then fail validation with UnknownNamespaceError because the namespace doesn't exist in the empty schema...IMO this is better but I can see how you'd want a preconditionfailed here.
|
|
||
| // assertSchemaHash acquires a FOR UPDATE lock on the schema_revision row and verifies | ||
| // it matches expectedHash. Returns ErrSchemaHashPreconditionFailed if not found or mismatched. | ||
| func assertSchemaHash(ctx context.Context, tx pgx.Tx, expectedHash string) error { |
There was a problem hiding this comment.
Just to make sure I understand the mental model - these functions are more or less acquiring some kind of lock on the datastore, and then a subsequent call that invokes this same function will fail (e.g. because of the locking behavior of SELECT FOR UPDATE), and that subsequent call will be retried?
There was a problem hiding this comment.
yes, I updated this to do FOR UPDATE only for schema writes, not for rel writes
Rel writes use FOR SHARE which just serializes them around an in flight FOR UPDATE. The only reason this is needed at all is because we have a flag that lets you use repeatable read, otherwise the fact that the DBs are serializable should make this work with just a read.
I added tests to confirm these behaviors.
| } | ||
| if errors.Is(err, ErrSerialization) { | ||
| mdb.Unlock() | ||
| time.Sleep(1 * time.Millisecond) |
There was a problem hiding this comment.
Is this necessary/is there another way to get the same behavior?
| } | ||
| } | ||
|
|
||
| for attempt := 0; ; attempt++ { |
There was a problem hiding this comment.
Is there a reason this isn't range maxSchemaHashRetries?
| if SchemaHash(rwtOpts.SchemaHashPrecondition) != preconditionHash { | ||
| overridden := make([]options.RWTOptionsOption, len(opts)+1) | ||
| copy(overridden, opts) | ||
| overridden[len(opts)] = options.WithSchemaHashPrecondition(string(preconditionHash)) | ||
| dsOpts = overridden | ||
| } |
There was a problem hiding this comment.
Is this the "standard" way to do this kind of logic? I'm kind of surprised that we're not just mutating the ds, or else appending without copying, since the last option will win.
There was a problem hiding this comment.
this works around a specific gocritic check that doesn't like when you append but reassign to a different variable
e612621 to
b5f2833
Compare
Adds ReadStoredSchemaHash to the datstore interface so that the hash can be checked without loading the full schema. OptimizedRevision already loads this field, but can't be used inside of a transaction. Move cache.Set out of WriteSchemaViaStoredSchema and into a post-commit callback in ReadWriteTx. Previously, cache.Set was called inside the transaction callback before the datastore committed, so a rolled-back write would leave a stale entry in the cache. Now the written hash and schema are captured by an onSchemaWritten callback and only committed to the cache after the transaction succeeds. Replace the NoSchemaHashInTransaction bypass in readWriteTransaction.ReadSchema with a direct ReadStoredSchema call using the transaction's own connection. The result is memoized on the readWriteTransaction so multiple ReadSchema calls within the same transaction (e.g. ImportBulkRelationships) pay only one DB round-trip. The read schema is also written to the cross-transaction cache to warm it for subsequent transactions. Propagate ErrSchemaNotFound from ReadSchema (both write transactions and snapshot reads) rather than silently returning an empty schema adapter. An empty schema was previously returned as a migration convenience, but migration phases are held long enough that OptimizedRevision will never return a pre-migration revision by the time the mode switch occurs. Refactor WriteSchemaViaStoredSchema to remove the cache parameter and return (SchemaHash, *ReadOnlyStoredSchema, error). Cache responsibility is now exclusively with ReadWriteTx post-commit. Skip schemacaching.NewCachingDatastoreProxy when reads come from new schema storage: the proxy's per-revision namespace cache is never populated in that mode, leaving only overhead. Parse schemaMode early in server setup so this decision can be made before the proxy chain is assembled; also fix a bug where WithSchemaMode was only appended when ExperimentalSchemaMode was non-empty, leaving the mode unset for the default case.
this calls with all schema modes
5c3a94b to
cf20a2b
Compare
instead of reading the latest schema hash in write calls, we now just store the most recently seen schema hash from other operations and use that. the write transaction will fail if the schema hash doesn't match the current at the time of write. it will retry and revalidate against the current schema.
cf20a2b to
b25807f
Compare
Description
With the experimental schema storage enabled, I was noticing a significant load on
ReadStoredSchemadue to the write path.This PR addresses it by storing a "last seen schema hash". When performing writes, the last schema hash is included in a guard in the transaction, and if it differs from what is stored on the server, the write tx is aborted. Basically this adds optimistic concurrency around schema changes.
I ran into several related issues when validating this:
WriteStoredSchemaupdated the cached version before committing; which means a failed schema write would poison the cache with a bad entry.ErrSchemaNotFoundwas swallowed and replaced with an "empty" schema in some cases; I couldn't find a motivation for this and it seems to make error messages strictly worse so I changed this.Testing
Added unit tests for the issues I listed above, and built and ran this in staging and watched the ReadStoredSchema requests drop to ~0 under write load.