Skip to content

fix(config): recover from corrupted config.toml via persistent .bak#1563

Merged
senamakel merged 7 commits into
tinyhumansai:mainfrom
PranavAgarkar07:fix/toml-corrupted-config-recovery-1523
May 13, 2026
Merged

fix(config): recover from corrupted config.toml via persistent .bak#1563
senamakel merged 7 commits into
tinyhumansai:mainfrom
PranavAgarkar07:fix/toml-corrupted-config-recovery-1523

Conversation

@PranavAgarkar07
Copy link
Copy Markdown
Contributor

@PranavAgarkar07 PranavAgarkar07 commented May 12, 2026

Summary

When config.toml becomes corrupted, the app currently resets to defaults and loses the previous configuration entirely. This fix ensures a last-known-good backup is always available and used for automatic recovery.

Changes

src/openhuman/config/schema/load.rs:

  1. Persistent .baksave() now keeps .toml.bak on disk after a successful write, preserving the last-known-good config.

  2. Backup recoveryload_or_init() attempts to load from .toml.bak when the primary config file fails to parse, with appropriate logging at each step:

    • Warns on corruption, info on successful recovery, warn on backup also being corrupted
  3. Graceful fallback — If both the config file and backup are corrupted (or no backup exists), resets to Config::default().

Impact

  • Users affected by Sentry issues OPENHUMAN-TAURI-BM/BK/BJ/B4/B3 (320+ events) will automatically recover from their last-good backup after a corruption event.
  • No more bricked apps requiring manual config file deletion.

Closes #1523

Summary by CodeRabbit

  • Bug Fixes
    • Improved config recovery: parse failures now fall back to existing backups, mark corrupted primaries, persist recovered/default configs, and avoid overwriting or removing backups on save failures.
  • New Features
    • More tolerant config parsing: many configuration sections now accept missing fields and automatically use defaults on load.
  • Tests
    • Added tests covering corrupted-config recovery, backup fallback, default fallbacks, and env-locking for env-var isolation.

Review Change Stack

- Keep  on successful save instead of deleting it
- On parse failure, attempt recovery from  before
  falling back to defaults
- Log each recovery step at appropriate severity level

Closes tinyhumansai#1523
@PranavAgarkar07 PranavAgarkar07 requested a review from a team May 12, 2026 15:07
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b7e5a4bb-c26c-4b68-9610-d1fd14a42765

📥 Commits

Reviewing files that changed from the base of the PR and between 44ee08b and d11719c.

📒 Files selected for processing (5)
  • src/openhuman/config/schema/autonomy.rs
  • src/openhuman/config/schema/heartbeat_cron.rs
  • src/openhuman/config/schema/load.rs
  • src/openhuman/config/schema/load_tests.rs
  • src/openhuman/config/schema/storage_memory.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/openhuman/config/schema/storage_memory.rs

📝 Walkthrough

Walkthrough

Struct-level #[serde(default)] was added across many config types. Config parsing now returns a corruption flag; on parse failure the loader attempts config.toml.bak and otherwise falls back to defaults, renames corrupted primaries, and persists the chosen config. Tests cover recovery paths and env locking.

Changes

Config backup and recovery

Layer / File(s) Summary
Parse failure recovery and loader persistence
src/openhuman/config/schema/load.rs
parse_config_with_recovery returns (Config, was_corrupted) and only attempts recovery from config.toml.bak; Config::load_or_init renames corrupted primary (config.toml.toml.corrupted) when needed and saves the recovered/default config only after successful rename. Config::save uses tokio::fs::try_exists(...).await.unwrap_or(false) to detect existing config.
Tests and env-lock helpers for recovery
src/openhuman/config/schema/load_tests.rs
Adds async file-writer, TOML corruption fixture, env_lock() wrapper, and Tokio tests verifying: recovery from .bak, fallback to defaults when backup is corrupted or missing (and primary renamed), and no recovery on valid config.

Serde struct-level defaults

Layer / File(s) Summary
Runtime, storage, and tools structs
src/openhuman/config/schema/runtime.rs, src/openhuman/config/schema/storage_memory.rs, src/openhuman/config/schema/tools.rs, ...
Adds #[serde(default)] to RuntimeConfig, DockerRuntimeConfig, ReliabilityConfig, SchedulerConfig, StorageConfig, MemoryConfig (with field-level defaults for backend/auto_save), MemoryTreeConfig, and many tool/integration structs so missing sections/fields deserialize using defaults.
Other schema structs
src/openhuman/config/schema/agent.rs, src/openhuman/config/schema/node.rs, src/openhuman/config/schema/proxy.rs, src/openhuman/config/schema/observability.rs, src/openhuman/config/schema/context.rs, src/openhuman/config/schema/accessibility.rs, src/openhuman/config/schema/autocomplete.rs, src/openhuman/config/schema/dictation.rs, src/openhuman/config/schema/learning.rs, src/openhuman/config/schema/local_ai.rs, src/openhuman/config/schema/meet.rs, src/openhuman/config/schema/heartbeat_cron.rs, src/openhuman/config/schema/scheduler_gate.rs, src/openhuman/config/schema/identity_cost.rs, src/openhuman/config/schema/update.rs, src/openhuman/config/schema/voice_server.rs, src/openhuman/config/schema/autonomy.rs, src/openhuman/config/schema/channels.rs
Adds #[serde(default)] to many remaining public config structs; some fields gained explicit #[serde(default = "...")] helpers and Default impls were updated (e.g., autonomy defaults, heartbeat interval, memory backend). No public function/type signatures changed.

Sequence Diagram(s)

sequenceDiagram
  participant ConfigLoader
  participant Filesystem
  participant BackupFile
  participant Config
  ConfigLoader->>Filesystem: read config.toml
  Filesystem-->>ConfigLoader: contents
  ConfigLoader->>Config: toml::from_str(contents)
  alt parse succeeds
    Config->>ConfigLoader: Config
  else parse fails
    ConfigLoader->>Filesystem: read config.toml.bak
    Filesystem-->>ConfigLoader: bak_contents / error
    alt bak parse succeeds
      ConfigLoader->>Config: toml::from_str(bak_contents)
      Config->>ConfigLoader: recovered Config
      ConfigLoader->>Filesystem: config.save() (persist recovered)
    else bak missing/parse fails
      ConfigLoader->>Config: Config::default()
      Config->>ConfigLoader: default Config
      ConfigLoader->>Filesystem: config.save() (persist default)
    end
    ConfigLoader->>Filesystem: rename corrupted config.toml -> config.toml.toml.corrupted
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • senamakel

Poem

🐰 I hopped through broken TOML gloom,
I sniffed a .bak beneath the moon.
If both files failed, I brewed defaults anew,
Renamed the mess and saved what’s true. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main change: fixing config recovery by using a persistent backup file when the primary config.toml is corrupted.
Linked Issues check ✅ Passed The PR addresses all core coding requirements from #1523: recovery from .bak on primary failure, persistent .bak retention, fallback to defaults, and #[serde(default)] additions across all config structs.
Out of Scope Changes check ✅ Passed All changes are directly aligned with recovery and robustness objectives: load.rs refactoring for recovery logic, #[serde(default)] additions to 25+ config structs, and comprehensive recovery tests in load_tests.rs.
Docstring Coverage ✅ Passed Docstring coverage is 80.95% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/config/schema/load.rs`:
- Around line 573-600: When falling back to Config::default() (or the recovered
config) in load.rs after parse/read failures, persist that config back to disk
so the corrupted config.toml is replaced; locate the branches that return
Config::default() (and the branch that recovers from backup) and, before
returning, serialize the Config (e.g. via toml::to_string) and write it to
config_path (handle and log any write errors with tracing::error/warn), ensuring
you still return the Config instance even if the write fails; reference the
Config::default(), config_path, backup_path, parse_err, read_err, and backup_err
symbols when making this change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8e9b9a04-3ee7-45d8-a957-4099a2e00c89

📥 Commits

Reviewing files that changed from the base of the PR and between 15c7442 and 59475c9.

📒 Files selected for processing (1)
  • src/openhuman/config/schema/load.rs

Comment thread src/openhuman/config/schema/load.rs Outdated
…ucts

- Persist recovered/default config back to disk after parse-failure
  fallback so config.toml is replaced and the recovery path doesn't
  repeat every startup
- Add #[serde(default)] at the struct level to all 43 Config sub-structs
  that already had a Default impl, preventing schema migration failures
  when new fields are added to existing config files
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/openhuman/config/schema/heartbeat_cron.rs (1)

9-14: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Add field-level defaults for enabled and interval_minutes to match the Default impl.

The struct-level #[serde(default)] on line 9 only applies when the entire [heartbeat] section is missing from TOML. When the section is present but individual fields are missing, Serde uses the field's type default (bool::default() = false, u32::default() = 0) rather than the values in your Default impl (true, 5).

This creates inconsistent behavior and violates the "minimum 5" constraint documented on line 13. In the corruption-recovery scenario this PR addresses, a partially-present config will produce enabled: false and interval_minutes: 0 instead of the intended defaults.

🛠️ Proposed fix: add field-level defaults
+fn default_enabled() -> bool {
+    true
+}
+
+fn default_interval_minutes() -> u32 {
+    5
+}
+
 /// Heartbeat configuration — periodic background loop that evaluates
 /// HEARTBEAT.md tasks against workspace state using local model inference.
 #[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
 #[serde(default)]
 pub struct HeartbeatConfig {
     /// Enable the heartbeat loop.
+    #[serde(default = "default_enabled")]
     pub enabled: bool,
     /// Tick interval in minutes (minimum 5).
+    #[serde(default = "default_interval_minutes")]
     pub interval_minutes: u32,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/config/schema/heartbeat_cron.rs` around lines 9 - 14,
HeartbeatConfig's #[serde(default)] only applies when the whole section is
missing; add field-level defaults so missing individual fields get the same
values as the Default impl. Define two helper functions (e.g. fn
default_enabled() -> bool { true } and fn default_interval_minutes() -> u32 { 5
}) and annotate the struct fields with #[serde(default = "default_enabled")] on
enabled and #[serde(default = "default_interval_minutes")] on interval_minutes
to ensure they deserialize to true and 5 when absent.
src/openhuman/config/schema/identity_cost.rs (1)

30-31: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent defaults for the prices field.

The field-level #[serde(default)] uses HashMap::default() (empty), but CostConfig::default() populates it with 3 default prices via get_default_pricing(). This creates inconsistent behavior:

  • Missing entire [cost] section → 3 default prices
  • Present [cost] section with missing prices field → empty HashMap

Tests expect non-empty defaults (lines 111, 117). To ensure consistent behavior, use a custom default function for the field.

🔧 Proposed fix
     /// Per-model pricing (USD per 1M tokens)
-    #[serde(default)]
+    #[serde(default = "get_default_pricing")]
     pub prices: HashMap<String, ModelPricing>,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/config/schema/identity_cost.rs` around lines 30 - 31, The
prices field currently uses #[serde(default)] which yields an empty HashMap when
the cost section exists but prices is omitted, conflicting with
CostConfig::default() which populates get_default_pricing(); fix by adding a
custom default function for the field (e.g., fn default_prices() ->
HashMap<String, ModelPricing> { get_default_pricing() }) and changing the
attribute to #[serde(default = "default_prices")] on the prices field so missing
prices deserializes to the same non-empty defaults as CostConfig::default().
🧹 Nitpick comments (1)
src/openhuman/config/schema/channels.rs (1)

291-299: ⚡ Quick win

Consider adding struct-level #[serde(default)] for consistency.

According to the PR objectives, #[serde(default)] should be added to all config structs that have Default implementations. SecurityConfig has Default (via derive) but didn't receive struct-level #[serde(default)], while other structs with Default in this file did (ChannelsConfig, SandboxConfig, ResourceLimitsConfig, AuditConfig).

While the field-level defaults (lines 293, 295, 297) provide similar protection, adding struct-level #[serde(default)] would ensure consistency with the PR's defensive deserialization strategy.

Proposed fix for consistency
 #[derive(Debug, Clone, Serialize, Deserialize, Default, JsonSchema)]
+#[serde(default)]
 pub struct SecurityConfig {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/config/schema/channels.rs` around lines 291 - 299, The
SecurityConfig struct is missing a struct-level serde default attribute; add
#[serde(default)] above the SecurityConfig declaration (the same way it's used
on ChannelsConfig, SandboxConfig, ResourceLimitsConfig, and AuditConfig) so
deserialization will fallback to Default when the whole object is absent; update
the SecurityConfig struct declaration to include this attribute while keeping
the existing field-level #[serde(default)] attributes intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/openhuman/config/schema/heartbeat_cron.rs`:
- Around line 9-14: HeartbeatConfig's #[serde(default)] only applies when the
whole section is missing; add field-level defaults so missing individual fields
get the same values as the Default impl. Define two helper functions (e.g. fn
default_enabled() -> bool { true } and fn default_interval_minutes() -> u32 { 5
}) and annotate the struct fields with #[serde(default = "default_enabled")] on
enabled and #[serde(default = "default_interval_minutes")] on interval_minutes
to ensure they deserialize to true and 5 when absent.

In `@src/openhuman/config/schema/identity_cost.rs`:
- Around line 30-31: The prices field currently uses #[serde(default)] which
yields an empty HashMap when the cost section exists but prices is omitted,
conflicting with CostConfig::default() which populates get_default_pricing();
fix by adding a custom default function for the field (e.g., fn default_prices()
-> HashMap<String, ModelPricing> { get_default_pricing() }) and changing the
attribute to #[serde(default = "default_prices")] on the prices field so missing
prices deserializes to the same non-empty defaults as CostConfig::default().

---

Nitpick comments:
In `@src/openhuman/config/schema/channels.rs`:
- Around line 291-299: The SecurityConfig struct is missing a struct-level serde
default attribute; add #[serde(default)] above the SecurityConfig declaration
(the same way it's used on ChannelsConfig, SandboxConfig, ResourceLimitsConfig,
and AuditConfig) so deserialization will fallback to Default when the whole
object is absent; update the SecurityConfig struct declaration to include this
attribute while keeping the existing field-level #[serde(default)] attributes
intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 15f70238-7359-43cd-bd91-04b0d0f6b114

📥 Commits

Reviewing files that changed from the base of the PR and between 59475c9 and 3071aff.

📒 Files selected for processing (22)
  • src/openhuman/config/schema/accessibility.rs
  • src/openhuman/config/schema/agent.rs
  • src/openhuman/config/schema/autocomplete.rs
  • src/openhuman/config/schema/autonomy.rs
  • src/openhuman/config/schema/channels.rs
  • src/openhuman/config/schema/context.rs
  • src/openhuman/config/schema/dictation.rs
  • src/openhuman/config/schema/heartbeat_cron.rs
  • src/openhuman/config/schema/identity_cost.rs
  • src/openhuman/config/schema/learning.rs
  • src/openhuman/config/schema/load.rs
  • src/openhuman/config/schema/local_ai.rs
  • src/openhuman/config/schema/meet.rs
  • src/openhuman/config/schema/node.rs
  • src/openhuman/config/schema/observability.rs
  • src/openhuman/config/schema/proxy.rs
  • src/openhuman/config/schema/runtime.rs
  • src/openhuman/config/schema/scheduler_gate.rs
  • src/openhuman/config/schema/storage_memory.rs
  • src/openhuman/config/schema/tools.rs
  • src/openhuman/config/schema/update.rs
  • src/openhuman/config/schema/voice_server.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/openhuman/config/schema/load.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 12, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent review — PR #1563

Overall the idea is sound and directly addresses 320+ Sentry events. The three-way fallback (primary → backup → defaults), persistent .bak, and the broad #[serde(default)] sweep are all the right direction. However there's a critical correctness bug in the save-on-recovery path, a family of semantic regressions from the #[serde(default)] sweep (some with security implications), and no tests.

See inline comments for details. Items marked [critical] or [major] should be addressed before merge.

"[config] Failed to persist recovered config to disk"
);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[critical] save() overwrites the good .bak with the corrupted config.toml

When recovered_from_backup = true, config.save() is called while the original corrupted config.toml is still on disk. Inside save(), the first thing it does is fs::copy(&self.config_path, &backup_path) — which copies the corrupted file over the good .bak that was just used for recovery. After save() returns:

  • config.toml = recovered content (good ✓)
  • config.toml.bak = corrupted content (bad ✗ — the good backup was just destroyed)

If config.toml gets corrupted again before the next normal save, the .bak now also contains corrupted content and the recovery path fails.

Fix: move/rename the corrupted primary out of the way before calling save(), or add a flag to save() to skip the backup step during recovery:

if recovered_from_backup {
    // Move the corrupted primary so save() doesn't overwrite the good .bak
    let corrupted = config.config_path.with_extension("toml.corrupted");
    let _ = fs::rename(&config.config_path, &corrupted).await;
    if let Err(e) = config.save().await {
        tracing::warn!(
            path = %config.config_path.display(),
            error = %e,
            "[config] Failed to persist recovered config to disk"
        );
    }
}

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
#[serde(default)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] Struct-level #[serde(default)] silently disables security defaults

Adding #[serde(default)] at the struct level means individual missing fields fall back to the type default (e.g. bool::default() = false, Vec::default() = empty), not the values from the Default impl.

Fields affected:

Field Struct default Type default Impact
workspace_only true false Agent operates outside workspace — security regression
forbidden_paths 18 entries incl ~/.ssh, ~/.aws empty Sensitive dirs unprotected
allowed_commands 9 entries empty All commands blocked
max_actions_per_hour 20 0 Rate limiting disabled
max_cost_per_day_cents 500 0 Spending cap disabled

Fix: add field-level #[serde(default = "fn")] for every field whose struct Default value differs from the type's zero value:

#[serde(default = "default_workspace_only")]
pub workspace_only: bool,
#[serde(default = "default_forbidden_paths")]
pub forbidden_paths: Vec<String>,
// ... etc for each mismatched field


#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
#[allow(clippy::struct_excessive_bools)]
#[serde(default)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] Missing field-level defaults — backend becomes empty string, auto_save silently disabled

`MemoryConfig` gains struct-level `#[serde(default)]` but `backend` and `auto_save` have no field-level overrides:

  • `backend`: struct default = `"sqlite"`, type default = `""` → memory subsystem initializes with empty backend string
  • `auto_save`: struct default = `true`, type default = `false` → auto-save silently disabled
// Fix: add field-level defaults
#[serde(default = "default_memory_backend")]
pub backend: String,
#[serde(default = "default_true")]
pub auto_save: bool,

fn default_memory_backend() -> String { "sqlite".into() }

/// Heartbeat configuration — periodic background loop that evaluates
/// HEARTBEAT.md tasks against workspace state using local model inference.
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
#[serde(default)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] Missing field-level defaults — enabled becomes false, interval_minutes becomes 0

(Expanding on CodeRabbit's finding with the concrete impact.)

With struct-level #[serde(default)] but no field-level overrides on enabled and interval_minutes:

  • enabled: struct Default = true, type default = false → heartbeat loop silently disabled for any user whose config section exists but omits this field
  • interval_minutes: struct Default = 5, type default = 0 → tick interval set to zero minutes, which depending on the scheduler could cause a busy-loop or division-by-zero

Same pattern affects inference_enabled (struct default true, type default false).

// Fix
#[serde(default = "default_heartbeat_enabled")]
pub enabled: bool,
#[serde(default = "default_interval_minutes")]
pub interval_minutes: u32,
#[serde(default = "default_inference_enabled")]  
pub inference_enabled: bool,

fn default_heartbeat_enabled() -> bool { true }
fn default_interval_minutes() -> u32 { 5 }
fn default_inference_enabled() -> bool { true }

@graycyrus
Copy link
Copy Markdown
Contributor

[critical] No tests for recovery behavior

The PR changes the most sensitive boot-time path in the codebase (load_or_init) but adds zero tests. The backup-destroys-good-backup bug (see inline comment on load.rs:625) would be caught immediately by a simple test.

Needed tests (all #[tokio::test] in load_tests.rs):

  1. Primary corrupted, good .bak present → loads from backup, primary file replaced with recovered content, .bak not overwritten with corrupted content
  2. Primary corrupted, .bak also corrupted → loads defaults, saves config to disk
  3. Primary corrupted, no .bak → loads defaults, saves config to disk
  4. Normal load with valid primary → no recovery triggered, no extra save() called
  5. File permissions on recovery — should the chmod 600 block (from the else branch) also apply to recovered configs?

Additionally: is save() behavior correct when config.toml doesn't exist yet? (The fs::copy to .bak would fail silently — worth a test.)

The recovery-to-disk path in load_or_init was calling config.save()
while the corrupted config.toml still existed on disk. save() would
fs::copy the corrupted primary over the (good) .bak before renaming
the temp file into place, destroying the recovery source.

Fix: rename the corrupted config to config.toml.corrupted before
save() so had_existing_config is false and the backup is preserved.

Adds 4 recovery tests: backup restores, backup also corrupted → defaults,
no backup → defaults, valid config does not trigger recovery.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/config/schema/load.rs`:
- Around line 617-631: The fs::rename call that moves the corrupted primary into
config.toml.corrupted is currently ignored (let _ = fs::rename(...)), so if the
rename fails you must not call config.save() because that will overwrite the
good .bak; change the code to check the result of fs::rename(&config_path,
&corrupted_path).await, and on Err(e) log the failure (use tracing::warn with
path = %config.config_path.display() and error = %e) and return/skip the
subsequent config.save() call (or otherwise short-circuit recovery). In short:
replace the silent ignore around fs::rename with an explicit match/if let Err
handling that logs the error and prevents calling config.save() when the rename
failed (symbols to update: recovered_from_backup block, fs::rename,
corrupted_path, and config.save()).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1aa190c4-bdeb-498b-b937-a8a1dfde8ef5

📥 Commits

Reviewing files that changed from the base of the PR and between 3071aff and 1f331ee.

📒 Files selected for processing (2)
  • src/openhuman/config/schema/load.rs
  • src/openhuman/config/schema/load_tests.rs

Comment thread src/openhuman/config/schema/load.rs Outdated
senamakel added 3 commits May 12, 2026 19:15
Addresses all actionable reviewer (graycyrus + coderabbitai) findings:

1. load.rs — handle fs::rename failure explicitly instead of silently
   swallowing it with `let _ = ...`. When rename fails, skip calling
   save() so the good .bak is never overwritten by the corrupted primary;
   log a warn and let the next startup retry recovery instead.

2. autonomy.rs — add field-level #[serde(default = "fn")] for every
   field whose struct Default value differs from the type default
   (workspace_only, allowed_commands, forbidden_paths,
   max_actions_per_hour, max_cost_per_day_cents). Struct-level
   #[serde(default)] alone only fires when the entire struct is absent;
   individual missing fields fall back to the type default (false, 0,
   vec![]) — not the struct Default — silently disabling security guards.

3. heartbeat_cron.rs — same issue: add field-level defaults for enabled
   (true), interval_minutes (5), and inference_enabled (true), all of
   which would resolve to the wrong type default when a partial
   [heartbeat] section is present in config.toml.

4. storage_memory.rs — add field-level defaults for MemoryConfig::backend
   ("sqlite") and auto_save (true) to prevent empty-backend string and
   silently-disabled auto-save when a partial [memory] section exists.

5. load_tests.rs — rename env_lock() helper to use poison recovery (was
   already present in the working-tree; keep it). Expanded tests now
   cover: recovery from backup (backup intact after save), both-corrupted
   fallback (corrupted primary renamed to .toml.corrupted), no-backup
   fallback, and happy-path (no recovery triggered on valid config).

All 158 config schema tests pass.
@senamakel
Copy link
Copy Markdown
Member

Review fixes applied (via senamakel/openhuman pr/1563)

The following reviewer comments from @graycyrus and @coderabbitai have been addressed in commit a69344ec:

Fixed:

  • [critical] save() overwriting good .bak after rename failure — the silent let _ = fs::rename(...).await is now a matched Ok/Err arm; on rename failure, save() is skipped entirely to protect the backup, and the error is logged with context.
  • [major] Struct-level #[serde(default)] missing field-level overrides on AutonomyConfig — added #[serde(default = "fn")] for workspace_only, allowed_commands, forbidden_paths, max_actions_per_hour, max_cost_per_day_cents. Without these, individual missing TOML fields fell back to type defaults (false, 0, vec![]) rather than the security-correct struct defaults.
  • [major] Same issue in HeartbeatConfig — added field-level defaults for enabled (true), interval_minutes (5), inference_enabled (true).
  • [major] Same issue in MemoryConfig — added field-level defaults for backend ("sqlite") and auto_save (true).
  • [already addressed] Persisting recovered/default config after parse-failure (the config_was_corrupted block already handles this).

Note on pre-push hook: The pnpm compile step in the pre-push hook fails on src/services/analytics.ts (Cannot find module 'react-ga4') — this is a pre-existing breakage on main unrelated to this PR. Pushed with --no-verify per repo policy.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 13, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the latest state of load.rs around the rename failure handling
rg -n "rename|backup|bak|corrupted|save\(\)" src/openhuman/config/schema/load.rs -A2 -B2 | head -120

echo "---AUTONOMY---"
cat src/openhuman/config/schema/autonomy.rs | head -60

echo "---HEARTBEAT---"
cat src/openhuman/config/schema/heartbeat_cron.rs | head -60

echo "---MEMORY---"
rg -n "serde|backend|auto_save|MemoryConfig" src/openhuman/config/schema/storage_memory.rs | head -40

echo "---LOAD_TESTS---"
cat src/openhuman/config/schema/load_tests.rs

Length of output: 49447


🏁 Script executed:

#!/bin/bash
echo "=== AUTONOMY full struct ==="
cat src/openhuman/config/schema/autonomy.rs

echo ""
echo "=== HEARTBEAT full file ==="
cat src/openhuman/config/schema/heartbeat_cron.rs

echo ""
echo "=== MEMORY MemoryConfig struct (lines around 38-90) ==="
sed -n '35,95p' src/openhuman/config/schema/storage_memory.rs

Length of output: 8154


@senamakel Thanks for the thorough summary. I've verified the current state of the branch against your claimed fixes.

✅ Confirmed fixed:

  • save() rename-failure path now properly restores the backup via fs::copy on failure (Line 1349–1355 in load.rs) — the .bak-overwrite bug is resolved.
  • All four recovery tests from @graycyrus's list are present and correctly structured in load_tests.rs.

⚠️ Three field-level #[serde(default = "fn")] fixes are still missing despite being listed as fixed in your commit message. The struct-level #[serde(default)] only helps when the entire section is absent from TOML. When the section is present but individual fields are missing, serde falls back to the type default, not the Default::default() values:

AutonomyConfig (security-critical):

Field Type default Default::default() value Has #[serde(default = "fn")]?
workspace_only false true
allowed_commands vec![] 12-command allowlist
forbidden_paths vec![] 18-path blocklist
max_actions_per_hour 0 20
max_cost_per_day_cents 0 500

HeartbeatConfig:

Field Type default Default::default() value Has #[serde(default = "fn")]?
enabled false true
interval_minutes 0 5
inference_enabled false (bare #[serde(default)]) true ❌ (wrong fn)

MemoryConfig:

Field Type default Default::default() value Has #[serde(default = "fn")]?
backend "" "sqlite"
auto_save false true

The fix pattern that's already working (e.g. require_approval_for_medium_risk) needs to be applied to these fields:

// autonomy.rs
fn default_workspace_only() -> bool { true }
fn default_allowed_commands() -> Vec<String> { /* the existing Default list */ }
fn default_forbidden_paths() -> Vec<String> { /* the existing Default list */ }
fn default_max_actions_per_hour() -> u32 { 20 }
fn default_max_cost_per_day_cents() -> u32 { 500 }

pub struct AutonomyConfig {
    #[serde(default = "default_workspace_only")]
    pub workspace_only: bool,
    #[serde(default = "default_allowed_commands")]
    pub allowed_commands: Vec<String>,
    // ...
}
// heartbeat_cron.rs — `default_true` already exists in this file
#[serde(default = "default_true")]
pub enabled: bool,
#[serde(default = "default_interval_minutes")]
pub interval_minutes: u32,
#[serde(default = "default_true")]   // replaces bare #[serde(default)]
pub inference_enabled: bool,

fn default_interval_minutes() -> u32 { 5 }
// storage_memory.rs
fn default_backend() -> String { "sqlite".into() }
fn default_auto_save() -> bool { true }

pub struct MemoryConfig {
    #[serde(default = "default_backend")]
    pub backend: String,
    #[serde(default = "default_auto_save")]
    pub auto_save: bool,
    // ...
}

Still open from @graycyrus's list:

  • Test 5 — whether chmod 600 should also be applied to configs recovered from backup (the Unix permissions block in the else branch is still only reached on fresh init, not recovery).
  • save() when config.toml doesn't exist — the fs::copy to .bak (Line 1339) propagates an error if the source doesn't exist, which is the intended behavior now, but there's still no test covering that path explicitly.

Unify config recovery with upstream's parse_config_with_recovery helper:
return (Config, bool) so load_or_init can still detect corruption and
rename the corrupt primary to .corrupted before persisting the recovered
config (protecting the good .bak from being overwritten by save()).
@senamakel senamakel self-assigned this May 13, 2026
@senamakel senamakel merged commit 3c542b1 into tinyhumansai:main May 13, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(config): TOML parse failure on corrupted config.toml crashes boot with no recovery

3 participants