Skip to content

Conversation

@chrisgalanis
Copy link
Contributor

@chrisgalanis chrisgalanis commented Jan 5, 2026

This pull request introduces several improvements and enhancements to the OTA update and reboot process, focusing on reliability, observability, and configurability. Key changes include more robust time synchronization checks, improved handling of boot log capture, and better control over the reboot sequence to ensure devices do not inadvertently enter recovery mode.

OTA Update Flow Improvements:

  • Added a new --skip-time-sync-before-reboot flag to allow skipping the NTP time synchronization check before the first reboot, with clear logging and fallback behavior. [1] [2]
  • Created log directories as needed before OTA operations, improving robustness when writing logs.

Reboot and Recovery Handling:

  • The reboot process now waits for SSH disconnection before manipulating the recovery pin, ensuring the pin is held only during the actual shutdown window. The duration for holding the recovery pin is increased to 20 seconds, and a short delay is added after releasing the pin to allow for USB device re-enumeration.
  • The logic for setting the recovery pin during reboot is now parameterized based on the recovery state, improving clarity and correctness. [1] [2]
  • Increased the sleep duration after reboot from 4 to 10 seconds to provide more time for device stabilization.

Boot Log Capture Enhancements:

  • Boot log capture is now performed in a background task, running concurrently with SSH reconnection attempts. The capture waits until a login prompt is detected, writes data incrementally to disk, and provides detailed logging on success or failure. [1] [2] [3]
  • The boot log capture function is refactored into a static method for easier background invocation and improved testability.

Time Synchronization Robustness:

  • The time synchronization check now automatically detects and uses either timedatectl or chronyc, with command timeouts and more robust output parsing. If neither tool is found, a clear error is raised.
  • Time sync is now always checked after reboot and before starting the update, regardless of the pre-reboot check configuration.

Utility and Observability Improvements:

  • Added a utility method to wait for SSH disconnection, improving reliability in detecting device shutdown before proceeding with hardware-level operations.

These changes collectively improve the reliability, observability, and flexibility of the OTA update and device reboot process, making it easier to diagnose issues and adapt to different environments.


OTA update and logging enhancements:

  • Added --skip-time-sync-before-reboot flag to optionally skip NTP check before the first reboot, with clear logging and fallback. [1] [2]
  • Ensured log directories are created before writing logs, improving robustness.

Reboot sequence and recovery pin handling:

  • Waits for SSH disconnect before holding the recovery pin, increases hold duration to 20 seconds, and adds a delay after releasing the pin for USB re-enumeration.
  • Parameterized recovery pin state during reboot for clarity and correctness. [1] [2]
  • Increased post-reboot sleep from 4 to 10 seconds for device stabilization.

Boot log capture improvements:

  • Boot log capture runs in the background, waits for login prompt, writes incrementally to disk, and provides detailed success/failure logs. [1] [2] [3]
  • Refactored boot log capture into a static method for better background usage.

Time synchronization checks:

  • Automatically detects and uses timedatectl or chronyc for time sync, with timeouts and robust output parsing. Raises error if neither is found.
  • Always checks time sync after reboot, regardless of pre-reboot check.

Utility and reliability improvements:

  • Added utility to wait for SSH disconnection, improving reliability in detecting shutdown before hardware operations.

let _start_time = Instant::now();
info!("Starting OTA update to version: {}", self.target_version);

// Create log directory if it doesn't exist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can remove comments like this one, they don't bring much value

info!("Overlays wiped successfully, rebooting device");
info!("Overlays wiped successfully");

if !self.skip_time_sync_before_reboot {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need this I wonder ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, basically the problem was with the worldcoin-key-retrival service. With pearl, teh harware clock is not working so if it NTP server is not synced then worldcoin-attest is failing and also key-retrieval. This should be fixed with new key-retrieval fix, that is why i ketp it as a flag, so we can test older commits, that don't have key-retrival fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also not getting it sorry.

With pearl, teh harware clock is not working so if it NTP server is not synced then worldcoin-attest is failing and also key-retrieval.

That is not true. The clock will synchronize and https requests is going to succeed. And the delay is not big for that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default this is not correct. Because reboot is manually triggered faster than the NTP server syncrhonizes the clock....

// Brief delay to allow USB device to be re-enumerated and udev rules to apply
// after FTDI GPIO is released. The FTDI device detaches/reattaches kernel
// drivers which causes /dev/ttyUSB* to be recreated.
tokio::time::sleep(Duration::from_millis(200)).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timings like this might be weak spot and break often - but if works now i am fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested and look reliable enough. Even though if we don't ahve logs in the future we will now the root of failure and fix it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better then make it a constant

#[instrument(skip_all)]
async fn capture_boot_logs(&self, log_suffix: &str) -> Result<()> {
let platform_name = format!("{:?}", self.platform).to_lowercase();
async fn capture_boot_logs_static(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's bad practice to have static in the end of the name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix!

}

if start_time.elapsed() >= timeout && !found_login_prompt {
if found_login_prompt {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where do we set timeout for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout is the basically if the login prompth is fouund or not. If login prompt is never found, then the timeout of ssh will fail later so the process will exit with error

let result = session
.execute_command("TERM=dumb timedatectl status")
// Timeout for individual command execution (10 seconds is generous for timedatectl/chronyc)
const COMMAND_TIMEOUT: Duration = Duration::from_secs(10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to define it in top of the function for readibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix

hil/src/boot.rs Outdated
.await
.wrap_err("task panicked")??;
tokio::time::sleep(Duration::from_secs(4)).await;
tokio::time::sleep(Duration::from_secs(10)).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to the top of the file or maybe have as separate arg to cli so you don't tune it in sources every time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix!


use super::Ota;

const DELAY_CAPTURE_LOGS: u64 = 200;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: better to specify MS here as it's not clear what 200 is (seconds, ms, etc.)


const MAX_ATTEMPTS: u32 = 60; // 60 attempts = 2 minutes max wait
const SLEEP_DURATION: Duration = Duration::from_secs(2);
// Timeout for individual command execution (10 seconds is generous for timedatectl/chronyc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: (10 seconds is generous for timedatectl/chronyc) is not helpful as a comment

hil/src/boot.rs Outdated
.await
.wrap_err("task panicked")??;
tokio::time::sleep(Duration::from_secs(4)).await;
tokio::time::sleep(Duration::from_secs(10)).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are not using the const here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is only blocknig comment - everything else is great job!

@chrisgalanis chrisgalanis marked this pull request as draft January 14, 2026 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants