Skip to content

Conversation

@pav-kv
Copy link
Contributor

@pav-kv pav-kv commented Nov 28, 2023

This commit introduces MemFS.Sync method which persists the entire filesystem. It makes it possible to emulate a process crash and restart, as follows:

fs.Sync()                        // capture/snapshot the filesystem
fs.SetIgnoreSyncs(true)          // prevent changes from taking effect
db.Close()                       // shutdown, with no writes taking effect
fs.ResetToSyncedState()          // rollback to the snapshot
strictFS.SetIgnoreSyncs(false)   // allow changes again
db = Open(..., &Options{FS: fs}) // reopen on top of the reverted data

This commit introduces MemFS.Sync method which persists the entire
filesystem. It makes it possible to emulate a process crash and restart,
as follows:

```
fs.Sync()                        // capture/snapshot the filesystem
fs.SetIgnoreSyncs(true)          // prevent changes from taking effect
db.Close()                       // shutdown, with no writes taking effect
fs.ResetToSyncedState()          // rollback to the snapshot
strictFS.SetIgnoreSyncs(false)   // allow changes again
db = Open(..., &Options{FS: fs}) // reopen on top of the reverted data
```
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. Wouldn't correct modeling of a crash be to not sync the entire filesystem? We want to ensure that we've issued the correct syncs to all the individual files?

Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @sumeerbhola)

@pav-kv
Copy link
Contributor Author

pav-kv commented Nov 29, 2023

@jbowens The point of fs.Sync() is not so much in syncing the filesystem, but in checkpointing/forking it for the purposes of the test, in order to roll it back to exactly the same state later. In a real crash I'm imagining something like this happens:

  1. the process is writing and flushing/fsyncing files
    • typically during this operation there is some "dirty" state that is not fully synced
  2. crash
  3. everything flushed so far is still in OS/FS buffers, so likely will persist
  4. a restart will observe the FS at the "dirty" state just before step (2), including the partially-written files that were not fully synced

In an emulated/unit-test environment we can't guarantee an immediate stop of the DB/server in step (2). So we snapshot the state of the filesystem just before step (2), and terminate the DB/server gracefully in step (2). The graceful stop will write some extra state and sync files, but we will ignore it all. In step (4) we rollback the FS to state that it was in just before the "crash" in step (2).

The point of all this is to not let the graceful stop to take effect. We need to instantly cut-off writes, i.e. turn MemFS into a /dev/null.

I could rename it to avoid the confusion. Something like Checkpoint / Stash / Snapshot / Fork would do?

@pav-kv
Copy link
Contributor Author

pav-kv commented Nov 29, 2023

A strict MemFS persists only the data that was synced. This is useful to make sure the necessary syncs are in place, and we can recover after a full system crash.

A non-strict MemFS persists everything. This is not super useful to test crashes, but still can be used to test clean restarts.

There is nothing in between. The fs.(Sync/Checkpoint/Stash/etc) kind of method allows testing restarts in "dirty" state in between "only the synced data is persisted" and "everything is persisted".

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining. That makes sense. No need to consider it here, but I can imagine it being worthwhile to reset a random subset of unsynced data as well.

I think it would be possible to implement these various testing semantics as vfs.FS middleware, like we do in vfs/errorfs. But maybe we aren't gonna need it.

:lgtm:

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @sumeerbhola)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants