feat(backup): route all backup traffic through rclone-gateway #1175

Open
DavidePrincipi wants to merge 26 commits into main from feat-7814

@DavidePrincipi
Member

Modules no longer access backup destinations directly. A per-node rclone-gateway service (HAProxy + Rclone) acts as an authenticated HTTP proxy: nodes get full WebDAV and Restic REST access, while applications are restricted to their own repositories.

This removes secret credentials from module environments, centralizes destination configuration in private Redis keys, and enables node-level backup orchestration with run-backup.

Main changes:

  • rclone-gateway service replaces rclone-webdav, proxying Restic REST and WebDAV backends with Redis-based auth
  • Secrets isolation: rclone configs and Restic passwords stored in private/nodes/* and private/agents/*, inaccessible to modules
  • Node-level orchestration: run-backup iterates modules, retries locked backups, writes Prometheus metrics, uploads metadata and cluster backup via WebDAV
  • Systemd scheduling: schedule-backup manages transient timers from cluster/backup/* definitions
  • API hardening: obfuscate *key and *pass params in audit log
  • Migration: node Redis ACLs and backup keys updated for existing installations

- Two separate images: 1) RCLONE_IMAGE for rclone-gateway.service
  (HAProxy frontend + Rclone rest/webdav backends), and 2) RESTIC_IMAGE
  for restic app-level backup clients.

- Access to WebDAV and Restic on HTTP port 4694 requires
  authentication. Cluster nodes have unlimited access, whilst
  applications can only add new backup data with the Restic REST
  protocol. Authentication and authorization layers are implemented
  with HAProxy.

- During restoration, the application is granted access to the source
  repository through the Redis HASH "private/nodes/restore_uuid".

- The rclonegwctl write-configuration command generates service
  configuration from Redis DB acls and keys. Output files: rclone.conf,
  modrepo.map, auth.map, haproxy.cfg, rclone-webdav.env.

- Additional "combined" Rclone remote presents all remotes under the
  same tree with a uniform three level structure:

  1. repo uuid
  2. module image name, e.g. "traefik"
  3. module uuid

- The old rclone-webdav.service is replaced by rclone-gateway.service.
  The local repository originally served by rclone-webdav.service is
  now accessible through rclone-gateway on the node that hosts it.

Assisted-By: copilot:claude-sonnet-4.6
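The three-level layout of the "combined" remote described above can be illustrated with a small path builder. This is only a sketch: the function name `combined_repo_path` and the sample identifiers are hypothetical, and the gateway may compose or validate paths differently.

```python
import posixpath


def combined_repo_path(repo_uuid: str, image_name: str, module_uuid: str) -> str:
    """Join repo UUID / module image name / module UUID into one relative path."""
    parts = (repo_uuid, image_name, module_uuid)
    for part in parts:
        # Reject empty components and embedded slashes so the tree
        # always stays exactly three levels deep.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return posixpath.join(*parts)
```

For example, `combined_repo_path("repo-uuid", "traefik", "module-uuid")` yields `repo-uuid/traefik/module-uuid`.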
- Add per-node destination validation.

- Use stable UUIDs, generate rclone.conf and store it in new Redis HASH
  key private/nodes/backup_destination/rclone_conf. Hash keys are the
  UUIDs of destinations.

- Added cluster.backup.run_rclone() helper.

- Update add/alter actions to write public data to
  cluster/backup_repository/<UUID> and secrets to
  private/nodes/backup_destination/parameters/<UUID> with parallel node
  validation via validate-backup-destination action.

- The remove action cleans up private keys.

- Replaced add-backup-repository/10validate with non-executable
  placeholder to work around legacy update bug.

Define a new core event to synchronize rclone-gateway.service configuration and
reload it every time a backup destination changes.

- Grant node and module agents read-only access to
  cluster/*, node/*, module/*, and private/agents/*
  key spaces.

- Convert existing backup destination keys to new format.

- Trigger reload of rclone-gateway configuration on all
  nodes. Add acl-changed event handler to reload
  rclone-gateway.

- Update module database documentation with use_replica
  connection example.

Add "key" and "pass" to the list of attribute name suffixes that are
considered sensitive and therefore trigger value obfuscation.
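Suffix-based obfuscation of this kind might look like the sketch below. Only the "key" and "pass" suffixes come from the commit message; the other suffixes, the `"XXX"` mask, and the function name are assumptions made for illustration, not the API server's actual code.

```python
# Hypothetical suffix list: only "key" and "pass" are taken from the commit
# message; the remaining entries stand in for whatever was already there.
SENSITIVE_SUFFIXES = ("password", "secret", "token", "key", "pass")


def obfuscate(params: dict) -> dict:
    """Return a copy of params with sensitive-looking values masked."""
    masked = {}
    for name, value in params.items():
        if isinstance(value, dict):
            # Recurse so nested objects are masked as well.
            masked[name] = obfuscate(value)
        elif name.lower().endswith(SENSITIVE_SUFFIXES):
            masked[name] = "XXX"
        else:
            masked[name] = value
    return masked
```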
Introduce schedule-backup command with start-timers, stop-timers,
and list-timers subcommands. Timers are created as systemd
transient units from backup schedules stored in Redis.

Add backup-timers.service to manage timer lifecycle with
redis.service.

Assisted-by: copilot:claude-sonnet-4.6
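As a rough illustration of how a transient timer could be created, the sketch below only builds a `systemd-run` command line. The `--unit` and `--on-calendar` options are standard `systemd-run` options, but the unit naming scheme, the calendar expression, and the way `run-backup` is invoked are assumptions; the schedule-backup command may work differently.

```python
from typing import List


def timer_command(backup_id: int, on_calendar: str) -> List[str]:
    """Build a systemd-run invocation that creates a transient timer unit."""
    return [
        "systemd-run",
        f"--unit=backup{backup_id}",     # hypothetical unit name
        f"--on-calendar={on_calendar}",  # systemd calendar expression
        "run-backup",                    # assumed command line for the job
        str(backup_id),
    ]
```

The returned list could be passed to `subprocess.run()`; it is kept as plain data here so the sketch has no side effects.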
Add exclusive flock to prevent concurrent runs (exit 3 if
locked). Remove retention policy, rclone upload, and atexit status
handler — these move to run-backup.

Assisted-by: copilot:claude-sonnet-4.6
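The locking behaviour described above (an exclusive lock, with exit code 3 when it is already held) can be sketched with `fcntl.flock`. The helper names and the lock path argument are hypothetical; only the exit code and the use of an exclusive flock come from the commit message.

```python
import fcntl
import sys


def acquire_lock(path: str):
    """Take an exclusive, non-blocking flock; return None if already locked."""
    lock_file = open(path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file  # keep the object alive: closing it releases the lock


def main(lock_path: str) -> int:
    lock = acquire_lock(lock_path)
    if lock is None:
        print("backup already running", file=sys.stderr)
        return 3  # exit code reported when another run holds the lock
    try:
        ...  # the backup itself would run here
        return 0
    finally:
        lock.close()
```

A caller would typically end with `sys.exit(main(path))`, so that the orchestrator can tell a lock conflict from a real failure.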
Orchestrate per-module backups at the node level. Capture
module-backup JSON stdout, store status in Redis under
node/{nodeID}/backup_status/{backupID}, run retention
policy on success, and upload repopath.json metadata.

Retry modules returning exit code 3 (already running).

Assisted-by: copilot:claude-sonnet-4.6
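A retry loop for modules that report exit code 3 could be structured as below. This is a sketch under assumptions: `run_one` is a hypothetical callable that runs one module backup and returns its exit code, and the attempt count and delay are placeholders rather than the values used by run-backup.

```python
import time
from typing import Callable, Dict, Iterable

EXIT_LOCKED = 3  # module-backup exit code meaning "already running"


def run_with_retry(
    modules: Iterable[str],
    run_one: Callable[[str], int],
    attempts: int = 3,
    delay: float = 0.0,
) -> Dict[str, int]:
    """Run every module, retrying only those that were locked."""
    results: Dict[str, int] = {}
    pending = list(modules)
    for attempt in range(attempts):
        still_locked = []
        for module_id in pending:
            code = run_one(module_id)
            results[module_id] = code
            if code == EXIT_LOCKED:
                still_locked.append(module_id)
        pending = still_locked
        if not pending:
            break
        if attempt < attempts - 1:
            time.sleep(delay)  # give the concurrent run time to finish
    return results
```

Modules that are still locked after the last attempt keep exit code 3 in the result, so the caller can report them as conflicts.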
Update list-backups to read from node/{nodeID}/backup_status/{backupID}
instead of the old module/{mid}/backup_status keys. Derive instance
lists from the cluster/backup/{id} instances field.

Assisted-by: copilot:claude-sonnet-4.6
Generate backup{id}.prom in /run/node_exporter from run-backup with
UNKNOWN=-1 on start, CONFLICT=2 in case of concurrent runs of the same
module, SUCCESS=1 on completion, FAILED=0 if any module fails. Remove
the old 10node_monitor event handler.

Remove the configure-backup action that created per-module
systemd timer/service units. Add cleanup in 20restart_webdav
to remove leftover backup*.timer and backup*.service files
for both rootfull and rootless modules.

Move backup-timers.service to top-level systemd dir with
Requires=redis.service and ConditionPathExists guard.
Wire it as Wants= dependency of rclone-gateway.service.

Assisted-by: copilot:claude-sonnet-4.6
Update add-backup, alter-backup, remove-backup, and
run-backup cluster actions for the new backup model.
Remove the module-level run-backup action. Update
database documentation.

Define and document "backup-schedule-changed" event.

Migrate module/{mid}/backups SET keys into
cluster/backup/{bid} instances field. Migrate
module/{mid}/backup_status/{bid} keys to
node/{nodeID}/backup_status/{bid}.

Assisted-by: copilot:claude-sonnet-4.6
Move cluster backup upload from the cluster action to
run-backup. On the leader node, generate and upload
cluster-backup-{uuid}.json.gz.gpg before starting
module backups. Stub out 80upload_cluster_backup.

Assisted-by: copilot:claude-sonnet-4.6
- The image name was not prepended to the repository path. The value was
  incomplete/incorrect.
- The change has no impact, since the repository_path value is not actually used by the UI.
- Fixed related UI stories, which still referenced an old value format that was
  never implemented.

The "dokuwiki1@" prefix was never implemented.

Bump cluster-backup dump version to 4. Include rclone_conf
and restic_password in backup_repository export. Convert
module IDs to UUIDs in backup schedules for portability.

On restore-cluster, handle v3-to-v4 conversion for backup
destinations and restart backup-timers. On restore-module,
resolve UUIDs back to module IDs in backup schedules.

Switch cluster-backup to privileged Redis connection.
Reload rclone-gateway at end of restore-cluster.
Skip schedules with broken destination references.

Assisted-by: copilot:claude-sonnet-4.6
Refactor prepare_restic_command() to use a local REST server
(rclone-gateway on port 4694) for all backup destinations,
replacing per-backend credential handling (S3, B2, Azure, SMB,
WebDAV). Authentication now uses REDIS_USER/REDIS_PASSWORD
credentials; cluster agents fall back to node credentials
since rclone-gateway doesn't know cluster creds.
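To make the idea concrete, here is a minimal sketch of how a Restic REST repository URL pointing at the local gateway might be assembled. The `rest:http://user:password@host:port/path/` form is Restic's documented REST backend syntax; the function name, the loopback host, and the repository path layout are assumptions, not the behaviour of prepare_restic_command().

```python
import os
from urllib.parse import quote


def restic_rest_url(repo_path: str, env=os.environ,
                    host: str = "127.0.0.1", port: int = 4694) -> str:
    """Build a Restic REST repository URL with credentials taken from env."""
    # Percent-encode the credentials so characters such as "/" or "@"
    # cannot break the URL structure.
    user = quote(env["REDIS_USER"], safe="")
    password = quote(env["REDIS_PASSWORD"], safe="")
    return f"rest:http://{user}:{password}@{host}:{port}/{repo_path.strip('/')}/"
```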

Restic password is now fetched from Redis private key
private/agents/backup_destination/restic_password/<repo>
with a graceful fallback and deprecation warning when the
caller passes an unprivileged connection.

Refactor list-backup-repositories and read-backup-repositories
to use the new cluster.backup shared library. Parallelize
gateway probing, rclone lsjson, and metadata fetches with
ThreadPoolExecutor.

Assisted-by: copilot:claude-sonnet-4.6
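The parallel probing mentioned above can be sketched with `ThreadPoolExecutor` as follows. The `probe` callable and the result shape are hypothetical; the point is only that one slow or failing destination does not block or hide the others.

```python
from concurrent.futures import ThreadPoolExecutor


def probe_all(destinations, probe, max_workers=8):
    """Run probe(dest) for every destination in parallel.

    Returns a dict mapping each destination to its result, or to the
    exception raised while probing it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {dest: pool.submit(probe, dest) for dest in destinations}
        for dest, future in futures.items():
            try:
                results[dest] = future.result()
            except Exception as exc:  # keep going: report the error per destination
                results[dest] = exc
    return results
```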
Store rclone_conf and restic_password in private keys instead
of saving the raw destination object. Handle dump version to
generate rclone_conf from legacy v3 format. Use a pipeline for
atomic writes and publish backup-destination-changed event.

Assisted-by: copilot:claude-sonnet-4.6
Change backup-destination-changed event payload from
destination_id (string) to destination_ids (list) to
support bulk imports.

Assisted-by: copilot:claude-sonnet-4.6
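An event handler consuming the new payload might extract the IDs as in the sketch below. It assumes the payload is a JSON object; the fallback to the old `destination_id` key is an illustrative assumption and is not stated in the commit.

```python
import json


def destination_ids_from_event(raw_payload: str) -> list:
    """Return the list of destination IDs carried by the event payload."""
    payload = json.loads(raw_payload)
    if "destination_ids" in payload:
        return list(payload["destination_ids"])
    if "destination_id" in payload:  # legacy single-ID form (assumed fallback)
        return [payload["destination_id"]]
    return []
```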
Use fromisoformat for timestamp parsing. Switch to
returncode check to avoid stacktrace noise in logs.
Use privileged Redis connection for private/* keys.

Assisted-by: copilot:claude-sonnet-4.6
Alias unit rclone-webdav.service is considered running during core
update, hence service startup fails for a missing haproxy/ dir.

Create haproxy/ dir and ensure rclone-webdav.service is properly
stopped, waiting for the next acl-changed event handler to start it again.

Add a new "rclone" backup provider type that accepts raw rclone.conf
content, either as plain text or base64-encoded. The configuration
is normalized with a canonical section header and validated through
configparser before use.
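The normalization step could look roughly like the sketch below. It assumes the input describes a single remote, that base64 input decodes to UTF-8 text, and that a usable remote must define a `type` option; the function name and these checks are illustrative, not the code in this PR.

```python
import base64
import binascii
import configparser


def normalize_rclone_conf(raw: str, section: str) -> str:
    """Accept plain or base64 rclone.conf text; return it under [section]."""
    text = raw.strip()
    try:
        # Strict decoding: plain INI text contains characters such as "["
        # or spaces that are not valid base64, so it falls through.
        compact = "".join(text.split())
        text = base64.b64decode(compact, validate=True).decode("utf-8").strip()
    except (binascii.Error, UnicodeDecodeError):
        pass  # not base64: treat the input as plain text

    # Drop any existing section header and prepend the canonical one.
    body = [ln for ln in text.splitlines() if not ln.strip().startswith("[")]
    normalized = f"[{section}]\n" + "\n".join(body).strip() + "\n"

    # Validate: configparser raises configparser.Error on malformed input.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(normalized)
    if not parser.has_option(section, "type"):
        raise ValueError("rclone configuration lacks a 'type' option")
    return normalized
```

Interpolation is disabled so that `%` characters in secrets are not interpreted by configparser.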

Both add-backup-repository and alter-backup-repository actions
handle the new provider with proper error handling and URL
generation compatible with extract_rclone_basepath().

Hide secrets from list-backup-repositories output by adding a
hide_secrets flag to parse_rclone_params(). The restic password
is no longer returned either.

Rename the rclone_conf field to rclone_conf_secret in the
validate-backup-destination action input schema to follow the
naming convention for sensitive data.

Allow alter-backup-repository to preserve existing secrets when
the client sends an empty string: the action reads the current
value from Redis as fallback, for all provider types.
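The "empty string means keep the current secret" rule can be expressed as a small merge step. In this sketch `requested` is the action input, `stored` is the value previously read from Redis, and `secret_fields` lists the sensitive parameter names; all three names are hypothetical.

```python
def merge_secrets(requested: dict, stored: dict, secret_fields) -> dict:
    """Return requested parameters, keeping stored secrets where the client sent ''."""
    merged = dict(requested)
    for field in secret_fields:
        if merged.get(field) == "":
            # Empty string: the client did not change the secret, so fall
            # back to the value already stored.
            merged[field] = stored.get(field, "")
    return merged
```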

Store a basepath field in each backup repository Redis HASH to
support rclone provider paths that are not encoded in the url.

Assisted-by: copilot:claude-sonnet-4.6
Replace rclone-wrapper subprocess calls with direct WebDAV
HTTP requests for uploading repopath JSON manifests and
cluster backup GPG files. Add webdav_write_file() helper
to cluster.backup module.

Remove both rclone-wrapper scripts (agent bin/ and
container bin/), now fully replaced by WebDAV calls.

Assisted-by: copilot:claude-sonnet-4.6
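A direct WebDAV upload boils down to an authenticated HTTP PUT. The sketch below only builds the request with the standard library; the function name, the use of HTTP Basic authentication, and the content type are assumptions and do not reproduce the signature of webdav_write_file().

```python
import base64
import urllib.request


def build_webdav_put(url: str, data: bytes, user: str, password: str) -> urllib.request.Request:
    """Build a PUT request that uploads data to a WebDAV URL."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",  # assumed authentication scheme
            "Content-Type": "application/octet-stream",
        },
    )
```

The request would then be sent with `urllib.request.urlopen(request, timeout=...)`, which raises `urllib.error.HTTPError` on a non-2xx response.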
During Traefik restore, the sigterm handler must not trigger the module
removal.

Return the snapshot backup size and start/end timestamps.

--volume=./rclone:/etc/rclone:ro,Z \
--volume=./haproxy:/etc/haproxy:ro,Z \
--volume=${BACKUP_VOLUME}:/srv/repo:z \
--mount=type=tmpfs,tmpfs-size=10M,destination=/var/lib/rclone,chown=true \
Member

Is this special option really needed?

backup_id = int(sys.argv[1])
rdb = agent.redis_connect(host='127.0.0.1') # Connect to local replica

lock_file = open(os.environ['AGENT_STATE_DIR'] + f"/.module-backup.lock", "w")
Member

Not a fan of locks.
Can it be avoided?

Comment on lines +327 to +334
_CRYPT_KEY = bytes([
    0x9c, 0x93, 0x5b, 0x48, 0x73, 0x0a, 0x55, 0x4d,
    0x6b, 0xfd, 0x7c, 0x63, 0xc8, 0x86, 0xa9, 0x2b,
    0xd3, 0x90, 0x19, 0x8e, 0xb8, 0x12, 0x8a, 0xfb,
    0xf4, 0xde, 0x16, 0x2b, 0x8b, 0x95, 0xf6, 0x38,
])

_AES_BLOCK_SIZE = 16 # aes.BlockSize in Go
Member

This is scary because it relies on rclone internals.
Can we avoid it?

#
ptasks = []
related_backups = []
# XXX use an index instead of scan
Member

Should we remove this comment?

Comment on lines +17 to +18
if proc.returncode != 0:
    sys.exit(1)
Member

Why not return the real exit code?

Suggested change
- if proc.returncode != 0:
-     sys.exit(1)
+ return proc.returncode

Comment on lines +100 to +104
systemctl enable --now \
api-server.service \
agent@cluster.service \
agent@node.service \
# end of service list
Member

This is a bit ugly, especially the last line with a comment.

Suggested change
- systemctl enable --now \
- api-server.service \
- agent@cluster.service \
- agent@node.service \
- # end of service list
+ systemctl enable --now \
+ api-server.service \
+ agent@cluster.service \
+ agent@node.service

Comment on lines +37 to +49
def write_backup_prom(backup_id, backup_name, status):
    """Write a Prometheus .prom file for this backup's status."""
    output_file = os.path.join(PROM_DIR, f"backup{backup_id}.prom")
    content = (
        '# HELP node_backup_status Status of the backup (0 = failure, 1 = success, 2 = conflict, -1 = unknown)\n'
        '# TYPE node_backup_status gauge\n'
        f'node_backup_status{{id="{backup_id}",name="{prometheus_escape_label_value(backup_name)}"}} {status}\n'
    )
    os.makedirs(PROM_DIR, exist_ok=True)
    tmp_file = f"{output_file}-{os.getpid()}.tmp"
    with open(tmp_file, "w") as f:
        f.write(content)
    os.rename(tmp_file, output_file)
Member

Is this backward compatible?

@@ -0,0 +1,138 @@
#!/usr/bin/python3

#
Member

Does this change also affect migration and clone?

Development

Successfully merging this pull request may close these issues.

- Restore backup schedule within disaster recovery
- Validate write permissions on backup destination

2 participants