Add timeouts, and test restore with network partition #2247

Open
aredridel wants to merge 4 commits into tursodatabase:main from spice-labs-inc:as/timeouts

@aredridel

I'm happy to discuss and/or rework this, but I found that sqld did not recover when object storage was non-responsive.

Adding timeouts to the S3 client configuration fixes this for me: a non-responsive object store is now detected as a failure.

aredridel added 2 commits June 1, 2026 11:59
Add integration tests for libsql-server bottomless replication restore
behavior when interrupted by various failure modes.

Tests verify sqld can resume and complete an interrupted restore from
S3-compatible object storage (minio) without requiring a restart.

Test cases:
- basic_restore: Sanity check that sqld restores from minio
- sqld_interrupted: sqld killed mid-restore, restarted, completes
- minio_interrupted: minio stopped mid-restore, restarted, sqld retries
- network_partition: sqld disconnected from network mid-restore, reconnected

Infrastructure:
- Docker-based fixtures with isolated networks per test
- Unique container/network names and ports via atomic counters
- Port mapping (not host networking) for isolation
- Automatic cleanup of Docker resources after each test
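
The "unique names and ports via atomic counters" idea can be sketched roughly as follows. This is a minimal illustration, not the code in `tests/bottomless/fixtures.rs`: the `TestResources` struct, the `allocate_resources` function, the name format, and the base port 19000 are all hypothetical.

```rust
use std::sync::atomic::{AtomicU16, AtomicUsize, Ordering};

// Hypothetical process-wide counters; the real fixtures may use different
// names, types, and starting values.
static NEXT_ID: AtomicUsize = AtomicUsize::new(0);
static NEXT_PORT: AtomicU16 = AtomicU16::new(19000); // assumed base port

/// Names and host ports reserved for one test's Docker resources.
#[derive(Debug, Clone, PartialEq)]
pub struct TestResources {
    pub network: String,
    pub minio_container: String,
    pub sqld_container: String,
    pub minio_port: u16,
    pub sqld_port: u16,
}

/// Hands out names and host ports that no other test in this process receives.
pub fn allocate_resources(test_name: &str) -> TestResources {
    // fetch_add returns the previous value, so concurrent callers never
    // observe the same id.
    let id = NEXT_ID.fetch_add(1, Ordering::Relaxed);
    // Reserve two consecutive ports (minio, sqld) in a single atomic step.
    let base_port = NEXT_PORT.fetch_add(2, Ordering::Relaxed);
    TestResources {
        network: format!("bottomless-{test_name}-{id}-net"),
        minio_container: format!("bottomless-{test_name}-{id}-minio"),
        sqld_container: format!("bottomless-{test_name}-{id}-sqld"),
        minio_port: base_port,
        sqld_port: base_port + 1,
    }
}
```

The counters only guarantee uniqueness within one test process. Two separate `cargo test` invocations running at the same time could still collide unless something like the process id is mixed into the names.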

Files added:
- tests/bottomless/mod.rs
- tests/bottomless/fixtures.rs
- tests/bottomless/basic_restore.rs
- tests/bottomless/sqld_interrupted.rs
- tests/bottomless/minio_interrupted.rs
- tests/bottomless/network_partition.rs
- tests/bottomless/README.md

Files modified:
- tests/tests.rs: Add bottomless module
- Cargo.toml: Add reqwest dev-dependency, remove duplicate hex

- Add LIBSQL_BOTTOMLESS_S3_READ_TIMEOUT_SECS (default 5s)
- Add LIBSQL_BOTTOMLESS_S3_CONNECT_TIMEOUT_SECS (default 5s)
- Add LIBSQL_BOTTOMLESS_S3_OPERATION_ATTEMPT_TIMEOUT_SECS (default 10s)
- Configure TimeoutConfig on aws_sdk_s3::Config in bottomless::replicator::Options::client_config()
- Update meta_store.rs Options construction to include new timeout fields
- Remove #[ignore] from network_partition test
- Fix test fixtures: endpoint timing, image caching, mut minio
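
As an illustration of how the three environment variables and their defaults could be resolved, here is a minimal standard-library sketch. The `S3Timeouts` struct and the helper functions are hypothetical names, and falling back to the default on an unparsable value is an assumption; the PR may handle that case differently.

```rust
use std::time::Duration;

/// Hypothetical holder for the three timeouts named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct S3Timeouts {
    pub read: Duration,
    pub connect: Duration,
    pub operation_attempt: Duration,
}

/// Parses a whole number of seconds. Falls back to `default_secs` when the
/// value is unset or unparsable (an assumption, not confirmed by the PR).
fn secs_or_default(raw: Option<&str>, default_secs: u64) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

/// Resolves the timeouts through an arbitrary lookup function, so the logic
/// can be exercised without touching the process environment.
pub fn timeouts_from(lookup: impl Fn(&str) -> Option<String>) -> S3Timeouts {
    S3Timeouts {
        // Defaults (5s, 5s, 10s) are the ones listed in the commit message.
        read: secs_or_default(
            lookup("LIBSQL_BOTTOMLESS_S3_READ_TIMEOUT_SECS").as_deref(),
            5,
        ),
        connect: secs_or_default(
            lookup("LIBSQL_BOTTOMLESS_S3_CONNECT_TIMEOUT_SECS").as_deref(),
            5,
        ),
        operation_attempt: secs_or_default(
            lookup("LIBSQL_BOTTOMLESS_S3_OPERATION_ATTEMPT_TIMEOUT_SECS").as_deref(),
            10,
        ),
    }
}

/// Reads the timeouts from the real process environment.
pub fn timeouts_from_env() -> S3Timeouts {
    timeouts_from(|key| std::env::var(key).ok())
}
```

With the AWS SDK for Rust, durations like these would typically be passed to `TimeoutConfig::builder()` via `read_timeout`, `connect_timeout`, and `operation_attempt_timeout`, and the result attached to the S3 config builder with `timeout_config`. How exactly `client_config()` wires this up in the PR is not shown here.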

aredridel added 2 commits June 1, 2026 12:31
- Increase dataset to 20000 rows to ensure snapshot restore takes time
- Disconnect network immediately after starting sqld restore (500ms delay)
- Add stop/restart after network heals so sqld retries restore
- Remove wait_for_restore_start() since restore happens during startup
- Update assertion to match new row count

- Start network partition 500ms after restore begins (was 1s after waiting for restore log)
- Add 5s wait after reconnecting network for failed restore to time out
- Stop and restart sqld after network heals so it retries the failed restore
- Remove duplicate row count assertion (verify_integrity already checks count)
- Keep dataset at 1000 rows (moderate size, snapshot restore completes fast enough to not hang test)
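
The partition-and-heal sequence described in these commits could be driven with the Docker CLI's `docker network disconnect` and `docker network connect` subcommands. The sketch below only builds the commands; the function and constant names are hypothetical, and the PR's fixtures may talk to Docker differently.

```rust
use std::process::Command;
use std::time::Duration;

// Delays taken from the commit message; the constant names are hypothetical.
pub const PARTITION_DELAY: Duration = Duration::from_millis(500);
pub const HEAL_WAIT: Duration = Duration::from_secs(5);

/// Builds `docker network <action> <network> <container>` without running it.
fn docker_network(action: &str, network: &str, container: &str) -> Command {
    let mut cmd = Command::new("docker");
    cmd.args(["network", action, network, container]);
    cmd
}

/// Command that detaches the container from the test network (the partition).
pub fn partition(network: &str, container: &str) -> Command {
    docker_network("disconnect", network, container)
}

/// Command that reattaches the container to the test network (the heal).
pub fn heal(network: &str, container: &str) -> Command {
    docker_network("connect", network, container)
}
```

A test following the commit message would start sqld, sleep for `PARTITION_DELAY`, run `partition(..).status()`, then run `heal(..).status()`, sleep for `HEAL_WAIT` so the failed restore can time out, and finally stop and restart sqld so it retries the restore.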