-
Notifications
You must be signed in to change notification settings - Fork 101
fix(ota-hil): set recovery mode before capture logs #903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
hil/src/commands/ota/mod.rs
Outdated
| let _start_time = Instant::now(); | ||
| info!("Starting OTA update to version: {}", self.target_version); | ||
|
|
||
| // Create log directory if it doesn't exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we can remove comments like this one, they don't bring much value
| info!("Overlays wiped successfully, rebooting device"); | ||
| info!("Overlays wiped successfully"); | ||
|
|
||
| if !self.skip_time_sync_before_reboot { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need this I wonder ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, basically the problem was with the worldcoin-key-retrival service. With pearl, teh harware clock is not working so if it NTP server is not synced then worldcoin-attest is failing and also key-retrieval. This should be fixed with new key-retrieval fix, that is why i ketp it as a flag, so we can test older commits, that don't have key-retrival fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also not getting it sorry.
With pearl, teh harware clock is not working so if it NTP server is not synced then worldcoin-attest is failing and also key-retrieval.
That is not true. The clock will synchronize and https requests is going to succeed. And the delay is not big for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default this is not correct. Because reboot is manually triggered faster than the NTP server syncrhonizes the clock....
hil/src/commands/ota/reboot.rs
Outdated
| // Brief delay to allow USB device to be re-enumerated and udev rules to apply | ||
| // after FTDI GPIO is released. The FTDI device detaches/reattaches kernel | ||
| // drivers which causes /dev/ttyUSB* to be recreated. | ||
| tokio::time::sleep(Duration::from_millis(200)).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timings like this might be weak spot and break often - but if works now i am fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tested and look reliable enough. Even though if we don't ahve logs in the future we will now the root of failure and fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better then make it a constant
hil/src/commands/ota/reboot.rs
Outdated
| #[instrument(skip_all)] | ||
| async fn capture_boot_logs(&self, log_suffix: &str) -> Result<()> { | ||
| let platform_name = format!("{:?}", self.platform).to_lowercase(); | ||
| async fn capture_boot_logs_static( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's bad practice to have static in the end of the name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix!
| } | ||
|
|
||
| if start_time.elapsed() >= timeout && !found_login_prompt { | ||
| if found_login_prompt { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where do we set timeout for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout is the basically if the login prompth is fouund or not. If login prompt is never found, then the timeout of ssh will fail later so the process will exit with error
hil/src/commands/ota/system.rs
Outdated
| let result = session | ||
| .execute_command("TERM=dumb timedatectl status") | ||
| // Timeout for individual command execution (10 seconds is generous for timedatectl/chronyc) | ||
| const COMMAND_TIMEOUT: Duration = Duration::from_secs(10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to define it in top of the function for readibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix
hil/src/boot.rs
Outdated
| .await | ||
| .wrap_err("task panicked")??; | ||
| tokio::time::sleep(Duration::from_secs(4)).await; | ||
| tokio::time::sleep(Duration::from_secs(10)).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to the top of the file or maybe have as separate arg to cli so you don't tune it in sources every time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix!
|
|
||
| use super::Ota; | ||
|
|
||
| const DELAY_CAPTURE_LOGS: u64 = 200; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: better to specify MS here as it's not clear what 200 is (seconds, ms, etc.)
|
|
||
| const MAX_ATTEMPTS: u32 = 60; // 60 attempts = 2 minutes max wait | ||
| const SLEEP_DURATION: Duration = Duration::from_secs(2); | ||
| // Timeout for individual command execution (10 seconds is generous for timedatectl/chronyc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: (10 seconds is generous for timedatectl/chronyc) is not helpful as a comment
hil/src/boot.rs
Outdated
| .await | ||
| .wrap_err("task panicked")??; | ||
| tokio::time::sleep(Duration::from_secs(4)).await; | ||
| tokio::time::sleep(Duration::from_secs(10)).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are not using the const here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only blocknig comment - everything else is great job!
c044b7c to
2c89aa4
Compare
6bed7d8 to
65dacd1
Compare
This pull request introduces several improvements and enhancements to the OTA update and reboot process, focusing on reliability, observability, and configurability. Key changes include more robust time synchronization checks, improved handling of boot log capture, and better control over the reboot sequence to ensure devices do not inadvertently enter recovery mode.
OTA Update Flow Improvements:
--skip-time-sync-before-rebootflag to allow skipping the NTP time synchronization check before the first reboot, with clear logging and fallback behavior. [1] [2]Reboot and Recovery Handling:
Boot Log Capture Enhancements:
Time Synchronization Robustness:
timedatectlorchronyc, with command timeouts and more robust output parsing. If neither tool is found, a clear error is raised.Utility and Observability Improvements:
These changes collectively improve the reliability, observability, and flexibility of the OTA update and device reboot process, making it easier to diagnose issues and adapt to different environments.
OTA update and logging enhancements:
--skip-time-sync-before-rebootflag to optionally skip NTP check before the first reboot, with clear logging and fallback. [1] [2]Reboot sequence and recovery pin handling:
Boot log capture improvements:
Time synchronization checks:
timedatectlorchronycfor time sync, with timeouts and robust output parsing. Raises error if neither is found.Utility and reliability improvements: