Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APP-6886] Add option to reboot device after a certain amount of time #51

Merged
merged 6 commits into from
Dec 17, 2024

Conversation

ale7714
Copy link
Member

@ale7714 ale7714 commented Dec 5, 2024

No description provided.

@ale7714 ale7714 force-pushed the reboot-offline-device branch from 1907a08 to cf3efec Compare December 5, 2024 22:05
@ale7714 ale7714 requested a review from Otterverse December 5, 2024 22:07

// If set, will reboot the device after it has been offline for this duration
// 0 will disable this feature.
DeviceRebootAfterOfflineMinutes int `json:"device_reboot_after_offline_minutes"`
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider doing device_reboot_after_offline_hours while not super granular. I also can't imagine you would want less than one hour

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minutes makes the most sense. All the timeouts should have the same "unit" and minutes feels correct. Also, with the way Timeout parsing works, you can use decimal units if needed. Or "1h"... take a look at the Timeout.Unmarshall() function. It treats bare numbers as seconds currently, but we can change that to minutes, but people can specify easier units if they want. "10h30m" and such (we don't have to document/advertise that extra feature!)

Copy link
Member

@Otterverse Otterverse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few notes before I hit the road... No need to wait for me if you need to merge this before I get back on Tuesday though.


// If set, will reboot the device after it has been offline for this duration
// 0 will disable this feature.
DeviceRebootAfterOfflineMinutes int `json:"device_reboot_after_offline_minutes"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer to see this be a Timeout instead of int as then the json gets parsed automatically for timeouts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to change the default parsing to "minutes" for the timeout values (currently it assumes bare numbers are seconds.)

Comment on lines 300 to 304
if conf.DeviceRebootAfterOfflineMinutes != 0 &&
conf.DeviceRebootAfterOfflineMinutes < int(time.Duration(conf.OfflineTimeout).Minutes()) &&
conf.DeviceRebootAfterOfflineMinutes < int(time.Duration(conf.UserTimeout).Minutes()) {
return &conf, errw.Errorf("device_reboot_after_offline_minutes cannot be less than offline_timeout or user_timeout")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer to see native time parsing, not converting stuff to ints. Look at the other timeout(s) above. If the type here is Timeout, all the parsing happens automagically, so no need to cast and stuff here. It can just follow the pattern of the other timeouts (directly above.)

Comment on lines 699 to 700
syscall.Sync()
err := syscall.Reboot(syscall.LINUX_REBOOT_CMD_RESTART)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels very dangerous. This feels like it'd be a rugpull to running processes. Equivalent to SYSREQ+b Does it actually do a normal shutdown? If not, maybe exec "systemctl reboot" to guarantee a normal restart.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe is the same according to the docs, underhood systemctl is running reboot but will update to use systemctl for consistency

https://www.man7.org/linux/man-pages/man1/systemctl.1.html


// If set, will reboot the device after it has been offline for this duration
// 0 will disable this feature.
DeviceRebootAfterOfflineMinutes int `json:"device_reboot_after_offline_minutes"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minutes makes the most sense. All the timeouts should have the same "unit" and minutes feels correct. Also, with the way Timeout parsing works, you can use decimal units if needed. Or "1h"... take a look at the Timeout.Unmarshall() function. It treats bare numbers as seconds currently, but we can change that to minutes, but people can specify easier units if they want. "10h30m" and such (we don't have to document/advertise that extra feature!)

Copy link
Member

@Otterverse Otterverse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More notes... sorry I didn't get all of these noted last time.

@@ -297,6 +297,12 @@ func ConfigFromJSON(defaultConf Config, jsonBytes []byte) (*Config, error) {
return &conf, errw.Errorf("timeout values cannot be less than %s", time.Duration(minTimeout))
}

if conf.DeviceRebootAfterOfflineMinutes != 0 &&
conf.DeviceRebootAfterOfflineMinutes < conf.OfflineTimeout &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be OR... this is an error if either timeout is shorter.

@@ -692,6 +693,16 @@ func (w *Provisioning) mainLoop(ctx context.Context) {
if pMode {
// complex logic, so wasting some variables for readability

if w.cfg.DeviceRebootAfterOfflineMinutes > 0 && lastOnline.Before(now.Add(time.Duration(w.cfg.DeviceRebootAfterOfflineMinutes)*-1)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't considering the logic for single_network (default) mode. There it may be offline (no internet) indefinitely, and needs to compare the timeout to the lastConnected time.

That's why I was suggesting putting this logic down where pMode would normally be started. That is, right above the w.StartProvisioning() call on line 756 below.

Just above that are calculations for hitOfflineTimeout using lastConnectivity which should properly switch it depending on the mode. You can reuse that same "lastConnectivity" to decide if it should reboot instead of starting provisioning.

Comment on lines 699 to 703
cmd := exec.Command("systemctl", "reboot")
output, err := cmd.CombinedOutput()
if err != nil {
w.logger.Error(errw.Wrapf(err, "running 'systemctl reboot' %s", output))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to wait for the reboot. The command will return instantly, but it can take a minute or more to shutdown all services, so it could be a moment before agent is shut down. If we don't wait here... then it'll go into the loop again and likely come right back here and try to reboot again... or start up provisioning or something.

@ale7714 ale7714 requested review from Otterverse and removed request for Otterverse December 11, 2024 19:07
@ale7714
Copy link
Member Author

ale7714 commented Dec 11, 2024

sorry removing review request because i kind of push some changes quickly but did not test or anything, and then got sidetracked

@ale7714 ale7714 requested a review from Otterverse December 11, 2024 20:54
@ale7714
Copy link
Member Author

ale7714 commented Dec 11, 2024

ready for review now

Copy link
Member

@Otterverse Otterverse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting there... couple more gotchas though.

if offlineRebootTimeout {
w.logger.Infof("device has been offline for more than %s minutes, rebooting", w.cfg.DeviceRebootAfterOfflineMinutes)

syscall.Sync() // flush file system buffers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be needed when letting a normal shutdown happen. Filesystems will get synced/unmounted in their own ordering.

if err != nil {
w.logger.Error(errw.Wrapf(err, "running 'systemctl reboot' %s", output))
}
time.Sleep(time.Second * 100) // systemd DefaultTimeoutStopSec defaults to 90 seconds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't do this, or agent itself won't shut down when called and will just hang until the sleep is done. It will also fail healthchecks if the loop pauses here for too long. There's a function for pausing the loop though. See line 623 above for the previous use of the following:

if !w.mainLoopHealth.Sleep(ctx, time.Second*100) {
	return
}
w.logger.Errorf("Failed to reboot after %s time!", time.Second * 100)			

May also need a slightly longer timeout, as I BELIEVE that systemd 90 second default is per service, so if it's shutting down multiple services, it can definitely take longer than that. I'd think 2 or maybe even 5 minutes, since this isn't going to handle the failure case (where reboot doesn't happen.) Regardless, I'd DEFINITELY suggest logging an error if the timeout expires (included in suggestion above... rewrite as needed) so the user knows things are off the rails.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, thanks i missed there's a sleep on the main loop

Copy link
Member

@Otterverse Otterverse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realized this was fixed, but didn't see a review request. LGTM anyway, but if you have more changes planned, go ahead.

@ale7714 ale7714 merged commit fc02271 into main Dec 17, 2024
2 checks passed
@ale7714 ale7714 deleted the reboot-offline-device branch December 17, 2024 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants