feat(backup): route all backup traffic through rclone-gateway#1175
Open
DavidePrincipi wants to merge 26 commits into
- Two separate images:
  1. RCLONE_IMAGE for rclone-gateway.service (HAProxy frontend + Rclone REST/WebDAV backends)
  2. RESTIC_IMAGE for restic app-level backup clients
- Access to WebDAV and Restic on HTTP port 4694 requires authentication. Cluster nodes have unlimited access, whilst applications can only add new backup data with the REST protocol. Authentication and authorization layers are implemented with HAProxy.
- During restoration, the application is granted access to the source repository through the Redis HASH "private/nodes/restore_uuid".
- The rclonegwctl write-configuration command generates service configuration from Redis DB ACLs and keys. Output files: rclone.conf, modrepo.map, auth.map, haproxy.cfg, rclone-webdav.env.
- Additional "combined" Rclone remote presents all remotes under the same tree with a uniform three-level structure:
  1. repo uuid
  2. module image name, e.g. "traefik"
  3. module uuid
- The old rclone-webdav.service is replaced by rclone-gateway.service. The local repository originally served by rclone-webdav.service is now accessible through rclone-gateway on the node that hosts it.

Assisted-By: copilot:claude-sonnet-4.6
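The three-level layout of the "combined" remote can be illustrated with a small path builder. This is only a sketch: the function name `combined_repo_path` and the validation rules are assumptions for illustration, not code from this PR.

```python
from pathlib import PurePosixPath


def combined_repo_path(repo_uuid: str, image_name: str, module_uuid: str) -> str:
    """Build the three-level path: repo uuid / module image name / module uuid.

    Hypothetical helper: the PR only describes the layout, not this function.
    """
    for part in (repo_uuid, image_name, module_uuid):
        # Reject empty components and embedded separators so the tree
        # always has exactly three levels.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return str(PurePosixPath(repo_uuid, image_name, module_uuid))
```

For example, `combined_repo_path("r-uuid", "traefik", "m-uuid")` yields `r-uuid/traefik/m-uuid`.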
- Add per-node destination validation.
- Use stable UUIDs, generate rclone.conf and store it in new Redis HASH key private/nodes/backup_destination/rclone_conf. Hash keys are the UUIDs of destinations.
- Added cluster.backup.run_rclone() helper.
- Update add/alter actions to write public data to cluster/backup_repository/<UUID> and secrets to private/nodes/backup_destination/parameters/<UUID> with parallel node validation via validate-backup-destination action.
- Remove action cleans private keys.
- Replaced add-backup-repository/10validate with non-executable placeholder to work around legacy update bug.
Define a new core event to synchronize rclone-gateway.service configuration and reload it every time a backup destination changes.
- Grant node and module agents read-only access to cluster/*, node/*, module/*, and private/agents/* key spaces.
- Convert existing backup destination keys to new format.
- Trigger reload of rclone-gateway configuration on all nodes. Add acl-changed event handler to reload rclone-gateway.
- Update module database documentation with use_replica connection example.
Add "key" and "pass" to the list of attribute name suffixes that are considered as sensitive and therefore trigger the value obfuscation.
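A suffix-based obfuscation rule of this kind might look like the sketch below. Only the "key" and "pass" suffixes come from the commit message; the other suffixes, the `XXX` placeholder, and the function name are assumptions made for illustration.

```python
# "key" and "pass" are the suffixes named in the commit; the first three
# entries are assumed examples of pre-existing sensitive suffixes.
SENSITIVE_SUFFIXES = ("password", "secret", "token", "key", "pass")


def obfuscate(params: dict) -> dict:
    """Return a copy of params with sensitive values masked (hypothetical helper)."""
    masked = {}
    for name, value in params.items():
        if isinstance(value, dict):
            # Recurse into nested objects so deep attributes are covered too.
            masked[name] = obfuscate(value)
        elif name.lower().endswith(SENSITIVE_SUFFIXES):
            masked[name] = "XXX"
        else:
            masked[name] = value
    return masked
```

With this rule an attribute such as `aws_secret_access_key` or `smb_pass` is masked, while `url` is logged unchanged.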
Introduce schedule-backup command with start-timers, stop-timers, and list-timers subcommands. Timers are created as systemd transient units from backup schedules stored in Redis. Add backup-timers.service to manage timer lifecycle with redis.service. Assisted-by: copilot:claude-sonnet-4.6
Add exclusive flock to prevent concurrent runs (exit 3 if locked). Remove retention policy, rclone upload, and atexit status handler — these move to run-backup. Assisted-by: copilot:claude-sonnet-4.6
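The locking behaviour described above can be sketched with a non-blocking `fcntl.flock`. The exit code 3 comes from the commit message; the function names and the lock path handling are illustrative assumptions, not the PR's actual code.

```python
import fcntl
import sys

EXIT_LOCKED = 3  # exit code used when another backup run holds the lock


def acquire_backup_lock(path):
    """Try to take an exclusive, non-blocking lock; return the file or None."""
    lock_file = open(path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file


def main(lock_path):
    lock_file = acquire_backup_lock(lock_path)
    if lock_file is None:
        print("backup already running", file=sys.stderr)
        return EXIT_LOCKED
    try:
        # ... run the backup while the lock is held ...
        return 0
    finally:
        # Closing the file releases the flock.
        lock_file.close()
```

Because the lock is tied to the open file, it is released automatically if the process dies, which avoids stale lock files blocking later runs.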
Orchestrate per-module backups at the node level. Capture
module-backup JSON stdout, store status in Redis under
node/{nodeID}/backup_status/{backupID}, run retention
policy on success, and upload repopath.json metadata.
Retry modules returning exit code 3 (already running).
Assisted-by: copilot:claude-sonnet-4.6
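The retry of modules that return exit code 3 could be structured roughly as follows. This is a sketch under assumptions: `run_module` is a hypothetical callable standing in for the real module-backup invocation, and the attempt count and delay are arbitrary.

```python
import time

EXIT_LOCKED = 3  # module-backup exit code meaning "already running"


def run_with_retry(modules, run_module, attempts=3, delay=0.0):
    """Run each module backup, retrying those that report a held lock.

    run_module(module_id) -> exit code (hypothetical callable).
    Returns {module_id: last exit code}.
    """
    results = {}
    pending = list(modules)
    for attempt in range(attempts):
        retry = []
        for module_id in pending:
            code = run_module(module_id)
            results[module_id] = code
            if code == EXIT_LOCKED:
                retry.append(module_id)
        if not retry:
            break
        pending = retry
        if attempt < attempts - 1:
            time.sleep(delay)  # give the concurrent run time to finish
    return results
```

Modules that still return 3 after the last attempt keep that code in the result, so the caller can report them as conflicts rather than failures.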
Update list-backups to read from node/{nodeID}/backup_status/{backupID}
instead of the old module/{mid}/backup_status keys. Derive instance
lists from the cluster/backup/{id} instances field.
Assisted-by: copilot:claude-sonnet-4.6
Generate backup{id}.prom in /run/node_exporter from run-backup with
UNKNOWN=-1 on start, CONFLICT=2 in case of concurrent runs of the same
module, SUCCESS=1 on completion, FAILED=0 if any module fails. Remove
the old 10node_monitor event handler.
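One way to derive the final gauge value from per-module exit codes is sketched below. The four status values come from the commit message; the precedence (failure over conflict) and the handling of an empty result are assumptions, as is the function name.

```python
UNKNOWN, FAILED, SUCCESS, CONFLICT = -1, 0, 1, 2
EXIT_LOCKED = 3  # module exit code for "already running"


def backup_status(exit_codes):
    """Map {module_id: exit code} to a node_backup_status value.

    Assumed precedence: any real failure wins over a conflict.
    """
    if not exit_codes:
        return UNKNOWN
    codes = list(exit_codes.values())
    if any(code not in (0, EXIT_LOCKED) for code in codes):
        return FAILED
    if EXIT_LOCKED in codes:
        return CONFLICT
    return SUCCESS
```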
Remove the configure-backup action that created per-module systemd timer/service units. Add cleanup in 20restart_webdav to remove leftover backup*.timer and backup*.service files for both rootfull and rootless modules. Move backup-timers.service to top-level systemd dir with Requires=redis.service and ConditionPathExists guard. Wire it as Wants= dependency of rclone-gateway.service. Assisted-by: copilot:claude-sonnet-4.6
Update add-backup, alter-backup, remove-backup, and
run-backup cluster actions for the new backup model.
Remove the module-level run-backup action. Update
database documentation.
Define and document "backup-schedule-changed" event.
Migrate module/{mid}/backups SET keys into
cluster/backup/{bid} instances field. Migrate
module/{mid}/backup_status/{bid} keys to
node/{nodeID}/backup_status/{bid}.
Assisted-by: copilot:claude-sonnet-4.6
Move cluster backup upload from the cluster action to
run-backup. On the leader node, generate and upload
cluster-backup-{uuid}.json.gz.gpg before starting
module backups. Stub out 80upload_cluster_backup.
Assisted-by: copilot:claude-sonnet-4.6
- The image name was not prepended to the repository path. The value was incomplete/incorrect.
- The change has no impact, since the repository_path value is not actually used by the UI.
- Fixed related UI stories that still referenced an old value format that was never implemented.
The "dokuwiki1@" prefix was never implemented.
Bump cluster-backup dump version to 4. Include rclone_conf and restic_password in backup_repository export. Convert module IDs to UUIDs in backup schedules for portability. On restore-cluster, handle v3-to-v4 conversion for backup destinations and restart backup-timers. On restore-module, resolve UUIDs back to module IDs in backup schedules. Switch cluster-backup to privileged Redis connection. Reload rclone-gateway at end of restore-cluster. Skip schedules with broken destination references. Assisted-by: copilot:claude-sonnet-4.6
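The ID-to-UUID round trip for backup schedules can be sketched as two small mappings. The function names and the `module_uuids` lookup table are hypothetical; the PR does not show how the mapping is stored.

```python
def schedule_to_portable(instances, module_uuids):
    """Replace module IDs with UUIDs for export (hypothetical helper).

    module_uuids maps module ID -> module UUID; unknown IDs are skipped.
    """
    return [module_uuids[mid] for mid in instances if mid in module_uuids]


def schedule_from_portable(uuids, module_uuids):
    """Resolve exported UUIDs back to the module IDs of the target cluster."""
    by_uuid = {uuid: mid for mid, uuid in module_uuids.items()}
    return [by_uuid[uuid] for uuid in uuids if uuid in by_uuid]
```

Using UUIDs in the dump keeps the schedule valid even when a restored module is assigned a different ID on the target cluster.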
Refactor prepare_restic_command() to use a local REST server (rclone-gateway on port 4694) for all backup destinations, replacing per-backend credential handling (S3, B2, Azure, SMB, WebDAV). Authentication now uses REDIS_USER/REDIS_PASSWORD credentials; cluster agents fall back to node credentials since rclone-gateway doesn't know cluster creds.

Restic password is now fetched from Redis private key private/agents/backup_destination/restic_password/<repo> with a graceful fallback and deprecation warning when the caller passes an unprivileged connection.

Refactor list-backup-repositories and read-backup-repositories to use the new cluster.backup shared library. Parallelize gateway probing, rclone lsjson, and metadata fetches with ThreadPoolExecutor.

Assisted-by: copilot:claude-sonnet-4.6
Store rclone_conf and restic_password in private keys instead of saving the raw destination object. Handle dump version to generate rclone_conf from legacy v3 format. Use a pipeline for atomic writes and publish backup-destination-changed event. Assisted-by: copilot:claude-sonnet-4.6
Change backup-destination-changed event payload from destination_id (string) to destination_ids (list) to support bulk imports. Assisted-by: copilot:claude-sonnet-4.6
Use fromisoformat for timestamp parsing. Switch to returncode check to avoid stacktrace noise in logs. Use privileged Redis connection for private/* keys. Assisted-by: copilot:claude-sonnet-4.6
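As a small illustration of the timestamp parsing mentioned above, `datetime.fromisoformat` can turn start and end timestamps into a duration. The function name is hypothetical and the timestamp format is assumed to be ISO 8601 with an explicit UTC offset.

```python
from datetime import datetime


def backup_duration_seconds(start: str, end: str) -> float:
    """Return the elapsed seconds between two ISO 8601 timestamps."""
    started = datetime.fromisoformat(start)
    ended = datetime.fromisoformat(end)
    return (ended - started).total_seconds()
```

Note that a trailing `Z` suffix is only accepted by `fromisoformat` from Python 3.11 onwards; an explicit `+00:00` offset works on older versions too.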
The alias unit rclone-webdav.service is considered running during the core update, so service startup fails because the haproxy/ directory is missing. Create the haproxy/ directory and ensure rclone-webdav.service is properly stopped, then wait for the next acl-changed event handler to start it again.
Add a new "rclone" backup provider type that accepts raw rclone.conf content, either as plain text or base64-encoded. The configuration is normalized with a canonical section header and validated through configparser before use. Both add-backup-repository and alter-backup-repository actions handle the new provider with proper error handling and URL generation compatible with extract_rclone_basepath().

Hide secrets from list-backup-repositories output by adding a hide_secrets flag to parse_rclone_params(). The restic password is no longer returned either.

Rename the rclone_conf field to rclone_conf_secret in the validate-backup-destination action input schema to follow the naming convention for sensitive data.

Allow alter-backup-repository to preserve existing secrets when the client sends an empty string: the action reads the current value from Redis as fallback, for all provider types.

Store a basepath field in each backup repository Redis HASH to support rclone provider paths that are not encoded in the url.

Assisted-by: copilot:claude-sonnet-4.6
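The normalization of raw rclone.conf input might work along these lines. This is a sketch, not the PR's implementation: the section name `remote`, the function name, and the rule that exactly one section with a `type` option is required are all assumptions.

```python
import base64
import binascii
import configparser
import io

CANONICAL_SECTION = "remote"  # hypothetical name, not taken from the PR


def normalize_rclone_conf(raw, section=CANONICAL_SECTION):
    """Accept plain or base64 rclone.conf text; return it under one section."""
    text = raw.strip()
    try:
        # Plain config text contains characters outside the base64 alphabet,
        # so strict decoding fails and the input is kept as-is.
        compact = "".join(text.split())
        text = base64.b64decode(compact, validate=True).decode("utf-8").strip()
    except (binascii.Error, UnicodeDecodeError):
        pass
    if not any(line.lstrip().startswith("[") for line in text.splitlines()):
        # Headerless input: wrap the options in the canonical section.
        text = f"[{section}]\n{text}"
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)  # raises configparser.Error on malformed input
    names = parser.sections()
    if len(names) != 1:
        raise ValueError("expected exactly one remote section")
    options = dict(parser[names[0]])
    if "type" not in options:
        raise ValueError("missing 'type' option")
    canonical = configparser.ConfigParser(interpolation=None)
    canonical[section] = options
    buffer = io.StringIO()
    canonical.write(buffer)
    return buffer.getvalue()
```

Rewriting the section header means the rest of the code can refer to a single, predictable remote name regardless of what the user called it.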
Replace rclone-wrapper subprocess calls with direct WebDAV HTTP requests for uploading repopath JSON manifests and cluster backup GPG files. Add webdav_write_file() helper to cluster.backup module. Remove both rclone-wrapper scripts (agent bin/ and container bin/), now fully replaced by WebDAV calls. Assisted-by: copilot:claude-sonnet-4.6
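A direct WebDAV upload is essentially an authenticated HTTP PUT. The sketch below uses only the standard library; the function names, the `/webdav/` URL prefix, and the use of HTTP Basic authentication are assumptions, and the real `webdav_write_file()` helper may have a different signature.

```python
import base64
import urllib.parse
import urllib.request


def build_webdav_put(base_url, remote_path, data, user, password):
    """Prepare a PUT request that uploads data to a WebDAV path."""
    url = base_url.rstrip("/") + "/" + urllib.parse.quote(remote_path.lstrip("/"))
    request = urllib.request.Request(url, data=data, method="PUT")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/octet-stream")
    return request


def webdav_put(base_url, remote_path, data, user, password, timeout=30):
    """Send the upload and return the HTTP status code."""
    request = build_webdav_put(base_url, remote_path, data, user, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Compared with spawning a wrapper subprocess, a single HTTP request avoids starting a container or process for each small manifest upload.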
During Traefik restore, the sigterm handler must not trigger the module removal.
Return the snapshot backup size and start/end timestamps.
This was
linked to
issues
May 13, 2026
gsanchietti
reviewed
May 13, 2026
--volume=./rclone:/etc/rclone:ro,Z \
--volume=./haproxy:/etc/haproxy:ro,Z \
--volume=${BACKUP_VOLUME}:/srv/repo:z \
--mount=type=tmpfs,tmpfs-size=10M,destination=/var/lib/rclone,chown=true \
Member
Is this special option really needed?
backup_id = int(sys.argv[1])
rdb = agent.redis_connect(host='127.0.0.1') # Connect to local replica

lock_file = open(os.environ['AGENT_STATE_DIR'] + f"/.module-backup.lock", "w")
Member
Not a fan of locks.
Can it be avoided?
Comment on lines +327 to +334
_CRYPT_KEY = bytes([
    0x9c, 0x93, 0x5b, 0x48, 0x73, 0x0a, 0x55, 0x4d,
    0x6b, 0xfd, 0x7c, 0x63, 0xc8, 0x86, 0xa9, 0x2b,
    0xd3, 0x90, 0x19, 0x8e, 0xb8, 0x12, 0x8a, 0xfb,
    0xf4, 0xde, 0x16, 0x2b, 0x8b, 0x95, 0xf6, 0x38,
])

_AES_BLOCK_SIZE = 16 # aes.BlockSize in Go
Member
This is scary because it relies on rclone internals.
Can we avoid it?
#
ptasks = []
related_backups = []
# XXX use an index instead of scan
Member
Should we remove this comment?
Comment on lines +17 to +18
if proc.returncode != 0:
    sys.exit(1)
Member
Why not return the real exit code?
Suggested change
- if proc.returncode != 0:
-     sys.exit(1)
+ return proc.returncode
Comment on lines +100 to +104
systemctl enable --now \
    api-server.service \
    agent@cluster.service \
    agent@node.service \
    # end of service list
Member
This is a bit ugly, especially the last line with a comment.
Suggested change
- systemctl enable --now \
-     api-server.service \
-     agent@cluster.service \
-     agent@node.service \
-     # end of service list
+ systemctl enable --now
+     api-server.service \
+     agent@cluster.service \
+     agent@node.service
Comment on lines +37 to +49
def write_backup_prom(backup_id, backup_name, status):
    """Write a Prometheus .prom file for this backup's status."""
    output_file = os.path.join(PROM_DIR, f"backup{backup_id}.prom")
    content = (
        '# HELP node_backup_status Status of the backup (0 = failure, 1 = success, 2 = conflict, -1 = unknown)\n'
        '# TYPE node_backup_status gauge\n'
        f'node_backup_status{{id="{backup_id}",name="{prometheus_escape_label_value(backup_name)}"}} {status}\n'
    )
    os.makedirs(PROM_DIR, exist_ok=True)
    tmp_file = f"{output_file}-{os.getpid()}.tmp"
    with open(tmp_file, "w") as f:
        f.write(content)
    os.rename(tmp_file, output_file)
Member
Is this backward compatible?
@@ -0,0 +1,138 @@
#!/usr/bin/python3

#
Member
Does this change also affect migration and clone?
Modules no longer access backup destinations directly. A per-node rclone-gateway service (HAProxy + Rclone) acts as an authenticated HTTP proxy: nodes get full WebDAV and Restic REST access, while applications are restricted to their own repositories.

This removes secret credentials from module environments, centralizes destination configuration in private Redis keys, and enables node-level backup orchestration with run-backup.

Main changes:
- rclone-gateway replaces rclone-webdav, proxying Restic REST and WebDAV backends with Redis-based auth
- Destination secrets are stored under private/nodes/* and private/agents/*, inaccessible to modules
- run-backup iterates modules, retries locked backups, writes Prometheus metrics, uploads metadata and cluster backup via WebDAV
- schedule-backup manages transient timers from cluster/backup/* definitions
- Obfuscation of *key and *pass params in audit log