Add timeouts, and test restore with network partition by aredridel · Pull Request #2247 · tursodatabase/libsql

aredridel · 2026-06-01T15:05:40Z

I'm happy to discuss and/or rework this, but I found that sqld did not recover when object storage was non-responsive.

Adding timeouts to the S3 library fixes this for me, and failures are detected.

Add integration tests for libsql-server bottomless replication restore behavior when interrupted by various failure modes. Tests verify sqld can resume and complete an interrupted restore from S3-compatible object storage (minio) without requiring a restart.

Test cases:
- basic_restore: Sanity check that sqld restores from minio
- sqld_interrupted: sqld killed mid-restore, restarted, completes
- minio_interrupted: minio stopped mid-restore, restarted, sqld retries
- network_partition: sqld disconnected from network mid-restore, reconnected

Infrastructure:
- Docker-based fixtures with isolated networks per test
- Unique container/network names and ports via atomic counters
- Port mapping (not host networking) for isolation
- Automatic cleanup of Docker resources after each test

Files added:
- tests/bottomless/mod.rs
- tests/bottomless/fixtures.rs
- tests/bottomless/basic_restore.rs
- tests/bottomless/sqld_interrupted.rs
- tests/bottomless/minio_interrupted.rs
- tests/bottomless/network_partition.rs
- tests/bottomless/README.md

Files modified:
- tests/tests.rs: Add bottomless module
- Cargo.toml: Add reqwest dev-dependency, remove duplicate hex
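To illustrate the "unique container/network names and ports via atomic counters" idea, here is a minimal sketch using only the standard library. The struct, function name, name prefixes, and starting port are all hypothetical and are not taken from tests/bottomless/fixtures.rs; the actual fixtures may be organized differently.

```rust
use std::sync::atomic::{AtomicU16, AtomicUsize, Ordering};

// Process-wide counters: each test that asks for resources gets a fresh id
// and a fresh pair of host ports, even when tests run in parallel threads.
static NEXT_ID: AtomicUsize = AtomicUsize::new(0);
// Hypothetical starting port; the real fixtures may use a different range.
static NEXT_PORT: AtomicU16 = AtomicU16::new(19000);

/// Hypothetical bundle of per-test Docker resource names and host ports.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TestResources {
    pub network: String,
    pub minio_container: String,
    pub sqld_container: String,
    pub minio_port: u16,
    pub sqld_port: u16,
}

/// Allocate names and ports that do not collide with any other call
/// made in the same process.
pub fn allocate_resources() -> TestResources {
    let id = NEXT_ID.fetch_add(1, Ordering::Relaxed);
    // Reserve two consecutive ports in one atomic step.
    let base = NEXT_PORT.fetch_add(2, Ordering::Relaxed);
    TestResources {
        network: format!("bottomless-test-net-{id}"),
        minio_container: format!("bottomless-test-minio-{id}"),
        sqld_container: format!("bottomless-test-sqld-{id}"),
        minio_port: base,
        sqld_port: base + 1,
    }
}
```

Because the counters are only unique within one process, a sketch like this would still need something extra (for example the process id in the name) if several test binaries ran against the same Docker daemon at once.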

- Add LIBSQL_BOTTOMLESS_S3_READ_TIMEOUT_SECS (default 5s)
- Add LIBSQL_BOTTOMLESS_S3_CONNECT_TIMEOUT_SECS (default 5s)
- Add LIBSQL_BOTTOMLESS_S3_OPERATION_ATTEMPT_TIMEOUT_SECS (default 10s)
- Configure TimeoutConfig on aws_sdk_s3::Config in bottomless::replicator::Options::client_config()
- Update meta_store.rs Options construction to include new timeout fields
- Remove #[ignore] from network_partition test
- Fix test fixtures: endpoint timing, image caching, mut minio
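As a rough sketch of how the three environment variables and their defaults could be resolved into durations, the following uses only the standard library. The `S3Timeouts` struct and the function names are hypothetical, and falling back to the default on an unparsable value is an assumption; the PR may handle bad input differently.

```rust
use std::time::Duration;

/// Hypothetical holder for the three timeouts named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct S3Timeouts {
    pub read: Duration,
    pub connect: Duration,
    pub operation_attempt: Duration,
}

/// Parse a whole number of seconds. Falls back to `default_secs` when the
/// value is missing or not a valid integer (an assumption, not confirmed
/// behavior of the PR).
pub fn secs_or_default(raw: Option<&str>, default_secs: u64) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

/// Resolve all three timeouts through a lookup function so the logic can be
/// exercised without touching the real process environment.
pub fn timeouts_from_lookup<F>(lookup: F) -> S3Timeouts
where
    F: Fn(&str) -> Option<String>,
{
    let get = |key: &str, default_secs: u64| {
        secs_or_default(lookup(key).as_deref(), default_secs)
    };
    S3Timeouts {
        read: get("LIBSQL_BOTTOMLESS_S3_READ_TIMEOUT_SECS", 5),
        connect: get("LIBSQL_BOTTOMLESS_S3_CONNECT_TIMEOUT_SECS", 5),
        operation_attempt: get("LIBSQL_BOTTOMLESS_S3_OPERATION_ATTEMPT_TIMEOUT_SECS", 10),
    }
}

/// Convenience wrapper that reads the real environment.
pub fn timeouts_from_env() -> S3Timeouts {
    timeouts_from_lookup(|key| std::env::var(key).ok())
}
```

The resulting durations would then be handed to the SDK's `TimeoutConfig` builder (its `read_timeout`, `connect_timeout`, and `operation_attempt_timeout` setters) when building the `aws_sdk_s3::Config`, as the commit message describes.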

- Increase dataset to 20000 rows to ensure snapshot restore takes time
- Disconnect network immediately after starting sqld restore (500ms delay)
- Add stop/restart after network heals so sqld retries restore
- Remove wait_for_restore_start() since restore happens during startup
- Update assertion to match new row count

- Start network partition 500ms after restore begins (was 1s after waiting for restore log)
- Add 5s wait after reconnecting network for failed restore to time out
- Stop and restart sqld after network heals so it retries the failed restore
- Remove duplicate row count assertion (verify_integrity already checks count)
- Keep dataset at 1000 rows (moderate size, snapshot restore completes fast enough to not hang test)
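The partition sequence described above can be sketched as an ordered plan of Docker CLI calls and sleeps. This is only an illustration: the `Step` type, the function names, the `partition_for` parameter (the source does not say how long the partition is held), and the use of `docker stop` / `docker start` for the restart are all assumptions, not the contents of tests/bottomless/network_partition.rs.

```rust
use std::io;
use std::process::Command;
use std::thread;
use std::time::Duration;

/// One step of the hypothetical test plan.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Step {
    /// Arguments passed to the `docker` CLI.
    Docker(Vec<String>),
    /// Pause before the next step.
    Sleep(Duration),
}

fn docker(args: &[&str]) -> Step {
    Step::Docker(args.iter().map(|s| s.to_string()).collect())
}

/// Build the plan, assuming sqld has just been started and is restoring.
pub fn partition_plan(network: &str, sqld: &str, partition_for: Duration) -> Vec<Step> {
    vec![
        // Let the restore begin, then cut sqld off from the test network.
        Step::Sleep(Duration::from_millis(500)),
        docker(&["network", "disconnect", network, sqld]),
        // Hold the partition (duration is a caller-chosen assumption).
        Step::Sleep(partition_for),
        docker(&["network", "connect", network, sqld]),
        // Give the failed restore time to hit its timeouts.
        Step::Sleep(Duration::from_secs(5)),
        // Restart sqld so it retries the restore.
        docker(&["stop", sqld]),
        docker(&["start", sqld]),
    ]
}

/// Execute the plan, stopping at the first failing docker command.
pub fn run_plan(plan: &[Step]) -> io::Result<()> {
    for step in plan {
        match step {
            Step::Sleep(d) => thread::sleep(*d),
            Step::Docker(args) => {
                let status = Command::new("docker").args(args).status()?;
                if !status.success() {
                    return Err(io::Error::new(
                        io::ErrorKind::Other,
                        format!("docker {:?} exited with {}", args, status),
                    ));
                }
            }
        }
    }
    Ok(())
}
```

Separating plan construction from execution keeps the ordering and delays checkable without a Docker daemon; only `run_plan` needs one.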

aredridel added 2 commits June 1, 2026 11:59

aredridel force-pushed the as/timeouts branch from 9ecee7b to 9876625 Compare June 1, 2026 16:01

aredridel added 2 commits June 1, 2026 12:31

aredridel wants to merge 4 commits into tursodatabase:main from spice-labs-inc:as/timeouts
