Skip to content

feat(datalayer): schema caching for write transactions#3160

Open
ecordell wants to merge 3 commits into
authzed:mainfrom
ecordell:evan/schemacache
Open

feat(datalayer): schema caching for write transactions#3160
ecordell wants to merge 3 commits into
authzed:mainfrom
ecordell:evan/schemacache

Conversation

@ecordell

@ecordell ecordell commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Description

With the experimental schema storage enabled, I was noticing a significant load on ReadStoredSchema due to the write path.

This PR addresses it by storing a "last seen schema hash". When performing writes, the last schema hash is included in a guard in the transaction, and if it differs from what is stored on the server, the write tx is aborted. Basically this adds optimistic concurrency around schema changes.

I ran into several related issues when validating this:

  • WriteStoredSchema updated the cached version before committing; which means a failed schema write would poison the cache with a bad entry.
  • ErrSchemaNotFound was swallowed and replaced with an "empty" schema in some cases; I couldn't find a motivation for this and it seems to make error messages strictly worse so I changed this.
  • The "old" schema caching proxy was unconditionally installed; I updated to avoid installing it if in new schema mode.

Testing

Added unit tests for the issues I listed above, and built and ran this in staging and watched the ReadStoredSchema requests drop to ~0 under write load.

return "", err
}

// Update cache after successful write

@ecordell ecordell Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

misleading comment here - the "write" was successful but the commit hasn't happened yet

@github-actions github-actions Bot added area/cli Affects the command line area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) labels Jun 8, 2026
@ecordell ecordell force-pushed the evan/schemacache branch from c9e3c6c to 7a968c8 Compare June 8, 2026 18:03
@github-actions github-actions Bot added the area/datastore Affects the storage system label Jun 8, 2026
@ecordell ecordell force-pushed the evan/schemacache branch 3 times, most recently from a3f7fab to 85e0299 Compare June 8, 2026 21:59
@ecordell ecordell marked this pull request as ready for review June 8, 2026 22:00
@ecordell ecordell requested a review from a team as a code owner June 8, 2026 22:00
@ecordell ecordell force-pushed the evan/schemacache branch from 85e0299 to bacc03b Compare June 8, 2026 22:18
tstirrat15
tstirrat15 previously approved these changes Jun 8, 2026

@tstirrat15 tstirrat15 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; see comments

Comment on lines +47 to +48
rwt.mustLock()
defer rwt.Unlock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a read-only mutex? Would that be worth using here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this lock is for serializing access to the rwt itself, it's not a lock on the data being read

Comment on lines +38 to +39
OrderBy("created_xid DESC").
Limit(1).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ever expecting there to be more than one live hash, or is this defensiveness?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just defensiveness for now

if !errors.Is(err, datastore.ErrSchemaNotFound) {
return nil, err
}
return ApplySchemaChangesOverExisting(ctx, rwt, caveatTypeSet, validated, nil, nil)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which issue did this change address?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When testing I discovered that this would fail on a fresh DB without a schema written - which didn't make sense to do for a WriteSchema call

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah that's not ideal.

Comment thread pkg/datalayer/impl.go Outdated
Comment on lines +88 to +89
var pendingHash SchemaHash
var pendingSchema *datastore.ReadOnlyStoredSchema

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity check: are transactions always handled by a single thread? i.e. are there thread safety concerns in this context?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imagining something where you race two goroutines within a transaction, both of which are executing a schema write. I realize that sounds like it would be insane calling code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah in general the transactions are not thread safe, though I'm not sure how good a job we do of documenting that

@ecordell ecordell force-pushed the evan/schemacache branch 2 times, most recently from 01e9222 to d84e296 Compare June 9, 2026 04:08
@ecordell ecordell marked this pull request as draft June 9, 2026 13:20
@ecordell ecordell force-pushed the evan/schemacache branch 3 times, most recently from 19a2701 to 17f5363 Compare June 9, 2026 18:39
@ecordell ecordell marked this pull request as ready for review June 9, 2026 18:59
@ecordell ecordell force-pushed the evan/schemacache branch 3 times, most recently from 11dfefe to e612621 Compare June 9, 2026 21:11

@tstirrat15 tstirrat15 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments, otherwise LGTM

//
// When WithSchemaHashPrecondition is passed to ReadWriteTx, the datalayer may transparently retry
// fn with a refreshed schema hash if the stored schema changed since the precondition was captured.
// This is only safe when ALL schema-sensitive validation (ReadSchema, ValidateRelationshipUpdates,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scares me a little... Do we have a way to check/assert this? Would an analyzer do it?

@ecordell ecordell Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of an easy way to validate this, maybe an analyzer.

It's not really different semantics than before - this schema validation step still happened before my change - just documenting it for future uses.

Maybe I can make a followup?

case errors.Is(err, datastore.ErrSchemaHashPreconditionFailed):
return status.Errorf(codes.FailedPrecondition, "%s", err)
case errors.Is(err, datastore.ErrSchemaNotFound):
return status.Errorf(codes.NotFound, "%s", err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this error get returned on things other than ReadSchema, e.g. WriteRels? If so, it seems like it should be NotFound in some cases and FailedPrecondition in others. Not sure how difficult this would be to implement, though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it would get returned from writerels if there hasn't been a schema write yet, but to me that doesn't sound too unreasonable

On main, newStoredSchemaReaderAdapter catches ErrSchemaNotFound and returns an empty schema adapter with no error. That's behavior I changed in this PR (mentioned in the PR desc) because I couldn't understand a reason for it. So on main, a rel write with no schema would succeed at the ReadSchema call, then fail validation with UnknownNamespaceError because the namespace doesn't exist in the empty schema...IMO this is better but I can see how you'd want a preconditionfailed here.


// assertSchemaHash acquires a FOR UPDATE lock on the schema_revision row and verifies
// it matches expectedHash. Returns ErrSchemaHashPreconditionFailed if not found or mismatched.
func assertSchemaHash(ctx context.Context, tx pgx.Tx, expectedHash string) error {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I understand the mental model - these functions are more or less acquiring some kind of lock on the datastore, and then a subsequent call that invokes this same function will fail (e.g. because of the locking behavior of SELECT FOR UPDATE), and that subsequent call will be retried?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I updated this to do FOR UPDATE only for schema writes, not for rel writes

Rel writes use FOR SHARE which just serializes them around an in flight FOR UPDATE. The only reason this is needed at all is because we have a flag that lets you use repeatable read, otherwise the fact that the DBs are serializable should make this work with just a read.

I added tests to confirm these behaviors.

Comment thread internal/datastore/memdb/memdb.go Outdated
}
if errors.Is(err, ErrSerialization) {
mdb.Unlock()
time.Sleep(1 * time.Millisecond)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary/is there another way to get the same behavior?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored

Comment thread pkg/datalayer/impl.go Outdated
}
}

for attempt := 0; ; attempt++ {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this isn't range maxSchemaHashRetries?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment thread pkg/datalayer/impl.go Outdated
Comment on lines +134 to +139
if SchemaHash(rwtOpts.SchemaHashPrecondition) != preconditionHash {
overridden := make([]options.RWTOptionsOption, len(opts)+1)
copy(overridden, opts)
overridden[len(opts)] = options.WithSchemaHashPrecondition(string(preconditionHash))
dsOpts = overridden
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the "standard" way to do this kind of logic? I'm kind of surprised that we're not just mutating the ds, or else appending without copying, since the last option will win.

@ecordell ecordell Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this works around a specific gocritic check that doesn't like when you append but reassign to a different variable

@ecordell ecordell marked this pull request as draft June 10, 2026 16:49
@ecordell ecordell marked this pull request as ready for review June 10, 2026 20:22
@github-actions github-actions Bot added the area/api v1 Affects the v1 API label Jun 10, 2026
Adds ReadStoredSchemaHash to the datstore interface so that the hash
can be checked without loading the full schema. OptimizedRevision
already loads this field, but can't be used inside of a transaction.

Move cache.Set out of WriteSchemaViaStoredSchema and into a
post-commit callback in ReadWriteTx. Previously, cache.Set was called
inside the transaction callback before the datastore committed, so a
rolled-back write would leave a stale entry in the cache. Now the
written hash and schema are captured by an onSchemaWritten callback and
only committed to the cache after the transaction succeeds.

Replace the NoSchemaHashInTransaction bypass in
readWriteTransaction.ReadSchema with a direct ReadStoredSchema call
using the transaction's own connection. The result is memoized on the
readWriteTransaction so multiple ReadSchema calls within the same
transaction (e.g. ImportBulkRelationships) pay only one DB round-trip.
The read schema is also written to the cross-transaction cache to warm
it for subsequent transactions.

Propagate ErrSchemaNotFound from ReadSchema (both write transactions
and snapshot reads) rather than silently returning an empty schema
adapter. An empty schema was previously returned as a migration
convenience, but migration phases are held long enough that
OptimizedRevision will never return a pre-migration revision by the
time the mode switch occurs.

Refactor WriteSchemaViaStoredSchema to remove the cache parameter and
return (SchemaHash, *ReadOnlyStoredSchema, error). Cache responsibility
is now exclusively with ReadWriteTx post-commit.

Skip schemacaching.NewCachingDatastoreProxy when reads come from new
schema storage: the proxy's per-revision namespace cache is never
populated in that mode, leaving only overhead. Parse schemaMode early
in server setup so this decision can be made before the proxy chain is
assembled; also fix a bug where WithSchemaMode was only appended when
ExperimentalSchemaMode was non-empty, leaving the mode unset for the
default case.
@ecordell ecordell force-pushed the evan/schemacache branch 7 times, most recently from 5c3a94b to cf20a2b Compare June 10, 2026 23:48
instead of reading the latest schema hash in write calls, we now just
store the most recently seen schema hash from other operations and use
that.

the write transaction will fail if the schema hash doesn't match the
current at the time of write. it will retry and revalidate against the
current schema.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/api v1 Affects the v1 API area/cli Affects the command line area/datastore Affects the storage system area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants