-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[APP-6886] Add option to reboot device after a certain amount of time #51
Conversation
1907a08
to
cf3efec
Compare
|
||
// If set, will reboot the device after it has been offline for this duration | ||
// 0 will disable this feature. | ||
DeviceRebootAfterOfflineMinutes int `json:"device_reboot_after_offline_minutes"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I consider doing device_reboot_after_offline_hours
while not super granular. I also can't imagine you would want less than one hour
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minutes makes the most sense. All the timeouts should have the same "unit" and minutes feels correct. Also, with the way Timeout parsing works, you can use decimal units if needed. Or "1h"... take a look at the Timeout.Unmarshall() function. It treats bare numbers as seconds currently, but we can change that to minutes, but people can specify easier units if they want. "10h30m" and such (we don't have to document/advertise that extra feature!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few notes before I hit the road... No need to wait for me if you need to merge this before I get back on Tuesday though.
|
||
// If set, will reboot the device after it has been offline for this duration | ||
// 0 will disable this feature. | ||
DeviceRebootAfterOfflineMinutes int `json:"device_reboot_after_offline_minutes"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer to see this be a Timeout
instead of int
as then the json gets parsed automatically for timeouts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need to change the default parsing to "minutes" for the timeout values (currently it assumes bare numbers are seconds.)
if conf.DeviceRebootAfterOfflineMinutes != 0 && | ||
conf.DeviceRebootAfterOfflineMinutes < int(time.Duration(conf.OfflineTimeout).Minutes()) && | ||
conf.DeviceRebootAfterOfflineMinutes < int(time.Duration(conf.UserTimeout).Minutes()) { | ||
return &conf, errw.Errorf("device_reboot_after_offline_minutes cannot be less than offline_timeout or user_timeout") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer to see native time parsing, not converting stuff to ints. Look at the other timeout(s) above. If the type here is Timeout, all the parsing happens automagically, so no need to cast and stuff here. It can just follow the pattern of the other timeouts (directly above.)
syscall.Sync() | ||
err := syscall.Reboot(syscall.LINUX_REBOOT_CMD_RESTART) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels very dangerous. This feels like it'd be a rugpull to running processes. Equivalent to SYSREQ+b
Does it actually do a normal shutdown? If not, maybe exec "systemctl reboot" to guarantee a normal restart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe is the same according to the docs, underhood systemctl is running reboot but will update to use systemctl for consistency
|
||
// If set, will reboot the device after it has been offline for this duration | ||
// 0 will disable this feature. | ||
DeviceRebootAfterOfflineMinutes int `json:"device_reboot_after_offline_minutes"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minutes makes the most sense. All the timeouts should have the same "unit" and minutes feels correct. Also, with the way Timeout parsing works, you can use decimal units if needed. Or "1h"... take a look at the Timeout.Unmarshall() function. It treats bare numbers as seconds currently, but we can change that to minutes, but people can specify easier units if they want. "10h30m" and such (we don't have to document/advertise that extra feature!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More notes... sorry I didn't get all of these noted last time.
@@ -297,6 +297,12 @@ func ConfigFromJSON(defaultConf Config, jsonBytes []byte) (*Config, error) { | |||
return &conf, errw.Errorf("timeout values cannot be less than %s", time.Duration(minTimeout)) | |||
} | |||
|
|||
if conf.DeviceRebootAfterOfflineMinutes != 0 && | |||
conf.DeviceRebootAfterOfflineMinutes < conf.OfflineTimeout && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be OR... this is an error if either timeout is shorter.
@@ -692,6 +693,16 @@ func (w *Provisioning) mainLoop(ctx context.Context) { | |||
if pMode { | |||
// complex logic, so wasting some variables for readability | |||
|
|||
if w.cfg.DeviceRebootAfterOfflineMinutes > 0 && lastOnline.Before(now.Add(time.Duration(w.cfg.DeviceRebootAfterOfflineMinutes)*-1)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't considering the logic for single_network (default) mode. There it may be offline (no internet) indefinitely, and needs to compare the timeout to the lastConnected time.
That's why I was suggesting putting this logic down where pMode would normally be started. That is, right above the w.StartProvisioning()
call on line 756 below.
Just above that are calculations for hitOfflineTimeout using lastConnectivity
which should properly switch it depending on the mode. You can reuse that same "lastConnectivity" to decide if it should reboot instead of starting provisioning.
cmd := exec.Command("systemctl", "reboot") | ||
output, err := cmd.CombinedOutput() | ||
if err != nil { | ||
w.logger.Error(errw.Wrapf(err, "running 'systemctl reboot' %s", output)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to wait for the reboot. The command will return instantly, but it can take a minute or more to shutdown all services, so it could be a moment before agent is shut down. If we don't wait here... then it'll go into the loop again and likely come right back here and try to reboot again... or start up provisioning or something.
sorry removing review request because i kind of push some changes quickly but did not test or anything, and then got sidetracked |
ready for review now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting there... couple more gotchas though.
if offlineRebootTimeout { | ||
w.logger.Infof("device has been offline for more than %s minutes, rebooting", w.cfg.DeviceRebootAfterOfflineMinutes) | ||
|
||
syscall.Sync() // flush file system buffers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be needed when letting a normal shutdown happen. Filesystems will get synced/unmounted in their own ordering.
if err != nil { | ||
w.logger.Error(errw.Wrapf(err, "running 'systemctl reboot' %s", output)) | ||
} | ||
time.Sleep(time.Second * 100) // systemd DefaultTimeoutStopSec defaults to 90 seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't do this, or agent itself won't shut down when called and will just hang until the sleep is done. It will also fail healthchecks if the loop pauses here for too long. There's a function for pausing the loop though. See line 623 above for the previous use of the following:
if !w.mainLoopHealth.Sleep(ctx, time.Second*100) {
return
}
w.logger.Errorf("Failed to reboot after %s time!", time.Second * 100)
May also need a slightly longer timeout, as I BELIEVE that systemd 90 second default is per service, so if it's shutting down multiple services, it can definitely take longer than that. I'd think 2 or maybe even 5 minutes, since this isn't going to handle the failure case (where reboot doesn't happen.) Regardless, I'd DEFINITELY suggest logging an error if the timeout expires (included in suggestion above... rewrite as needed) so the user knows things are off the rails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, thanks i missed there's a sleep on the main loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realized this was fixed, but didn't see a review request. LGTM anyway, but if you have more changes planned, go ahead.
No description provided.