Just wanted to chime in and say this would be super helpful for my personal use case with PowerSync in 2 ways:
Background
Currently, when changes to Sync Rules or Sync Streams are deployed, PowerSync re-replicates all data from the source database from scratch, processing it with the new Sync Rules. Once that is ready, clients are switched over to sync from the new copy.
While there is no direct "downtime", it can take a long time on large databases, and clients have to re-sync all data even if only a small portion changed.
Status
2025-12-09: Updated plan with more specifics and implementation tasks.
2025-09-01: Original version of proposal outlined two implementation options.
Proposal
The base idea is to only reprocess bucket or Sync Stream definitions that have actually changed. This operates at the definition level: any change to any single query in a bucket definition causes the entire bucket definition to be re-processed, and all related buckets to be re-synced.
Specifically, with bucket definitions:
For Sync Streams:
In the future, Sync Streams could support more granular reprocessing depending on the changes to the query. For example, only changing a subquery could be treated the same as updating a parameter query in Sync Rules bucket definitions.
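As a rough illustration of the change-detection step (all names here are hypothetical, not the service's actual API), a deploy could diff the new Sync Rules against the currently deployed version by comparing a canonical serialization of each bucket or stream definition:

```typescript
// Hypothetical types - the actual service types may differ.
interface DefinitionSet {
  // definition name -> canonical serialized form of all its queries
  [name: string]: string;
}

type DefinitionDiff = {
  unchanged: string[];
  added: string[];
  removed: string[];
  changed: string[]; // any query change marks the whole definition as changed
};

/**
 * Classify bucket/stream definitions between the currently deployed
 * Sync Rules and a newly deployed version. Only `added` and `changed`
 * definitions would need to be re-replicated; `removed` ones are cleaned up.
 */
function diffDefinitions(current: DefinitionSet, next: DefinitionSet): DefinitionDiff {
  const diff: DefinitionDiff = { unchanged: [], added: [], removed: [], changed: [] };
  for (const name of Object.keys(next)) {
    if (!(name in current)) {
      diff.added.push(name);
    } else if (current[name] === next[name]) {
      diff.unchanged.push(name);
    } else {
      diff.changed.push(name);
    }
  }
  for (const name of Object.keys(current)) {
    if (!(name in next)) {
      diff.removed.push(name);
    }
  }
  return diff;
}
```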
Implementation
Where we currently use a separate replication stream per Sync Rules version (and associated logical replication slot in Postgres), this will change to only use a single replication stream, which processes all Sync Rules versions. When a new Sync Rules version is deployed, it re-replicates relevant data:
Unchanged definitions, new definitions, and removed definitions are handled differently: data for unchanged definitions is kept as-is, data for new definitions is re-replicated, and data for removed definitions is removed from the replication stream.
What makes this implementation particularly tricky is avoiding updates to existing bucket data when it is unchanged: if we do trigger updates for those, clients can re-sync the data twice, once on the old definitions and again on the new definitions.
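A minimal sketch of how a deploy could act on that classification, assuming hypothetical storage helpers (`snapshotDefinition`, `dropDefinitionData`) that do not exist in the service as-is:

```typescript
// Hypothetical orchestration sketch; the real replication pipeline differs.
type DefinitionDiff = {
  unchanged: string[];
  added: string[];
  changed: string[];
  removed: string[];
};

interface ReplicationStorage {
  // Assumed helpers - illustrative names only.
  snapshotDefinition(name: string): Promise<void>; // full re-replication for one definition
  dropDefinitionData(name: string): Promise<void>; // delete buckets of a removed definition
}

async function applyNewSyncRules(storage: ReplicationStorage, diff: DefinitionDiff): Promise<void> {
  // Unchanged definitions: do nothing. Existing bucket data stays untouched,
  // so clients do not re-sync data that did not change.

  // New or changed definitions: re-replicate their data from scratch.
  for (const name of [...diff.added, ...diff.changed]) {
    await storage.snapshotDefinition(name);
  }

  // Removed definitions: delete their bucket data.
  for (const name of diff.removed) {
    await storage.dropDefinitionData(name);
  }

  // From here on, the single replication stream keeps applying incremental
  // changes for all current definitions - no per-version replication slot.
}
```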
Storage changes
The relevant data we store are:
Currently, each of the above is scoped to a specific Sync Rules version. This needs to be changed to be more granular:
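As a hypothetical illustration of that more granular scoping (not the actual storage schema), bucket data could move from a single Sync-Rules-version key to a per-definition key:

```typescript
// Illustration only - not the actual storage schema.

// Today (simplified): bucket data is keyed by the Sync Rules version,
// so every deploy implies a full new copy of all data.
interface VersionScopedBucketData {
  syncRulesVersion: number;
  bucket: string;
  opId: bigint;
  data: unknown;
}

// More granular scoping: data is keyed by the definition that produced it,
// with a per-definition version that only changes when that definition changes.
interface DefinitionScopedBucketData {
  definitionName: string;    // bucket definition or Sync Stream name
  definitionVersion: number; // bumped only when this definition changes
  bucket: string;
  opId: bigint;
  data: unknown;
}
```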
Implementation progress
Other considerations
Defragmenting
Currently, the full reprocessing of data doubles as a form of "defragmenting", as described here. If we implement incremental reprocessing, we will need alternative methods for defragmenting, such as the heuristic sketched below.
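One possible alternative, sketched here with made-up names and thresholds, is to trigger explicit compaction per bucket based on a fragmentation metric:

```typescript
// Hypothetical heuristic with made-up names and thresholds.
interface BucketStats {
  totalOps: number;    // all operations retained in the bucket history
  currentRows: number; // rows a fresh client actually needs
}

// Trigger explicit compaction when the bucket holds far more operations
// than the number of rows a new client would download.
function needsCompaction(stats: BucketStats, maxFragmentation = 3): boolean {
  if (stats.currentRows === 0) {
    return stats.totalOps > 0;
  }
  return stats.totalOps / stats.currentRows > maxFragmentation;
}
```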
Config changes
Changes to the replication config affect all bucket & stream definitions, so they still require re-replicating all data. For the most part, it is very difficult to predict the effects of config changes at a more granular level.
However, if we avoid creating new operations for unchanged bucket data, we can avoid re-syncing unaffected data to clients after a config change.
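A hedged sketch of that idea, with hypothetical names: before writing a new operation during re-replication, compare a content hash of the row against what is already stored and skip the write when nothing changed.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical sketch: skip writing a new bucket operation when the row's
// content is identical to what is already stored, so a re-replication pass
// (e.g. after a config change) does not force clients to re-download it.
// Note: a real implementation would need a canonical serialization
// (stable key order) rather than plain JSON.stringify.
function contentHash(row: Record<string, unknown>): string {
  return createHash('sha256').update(JSON.stringify(row)).digest('hex');
}

interface StoredEntry {
  hash: string;
}

function shouldWriteOperation(
  existing: StoredEntry | undefined,
  row: Record<string, unknown>
): boolean {
  return existing === undefined || existing.hash !== contentHash(row);
}
```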