Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RSDK-9591 - Kill all lingering module process before exiting #4657

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions module/modmanager/manager.go
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,19 @@
return err
}

// Kill kills module processes. This is best effort as we do not
// have a lock during this function. Taking the lock will mean that
// we may be blocked, and we do not want to be blocked.
func (mgr *Manager) Kill() {
if mgr.restartCtxCancel != nil {
mgr.restartCtxCancel()
}
mgr.modules.Range(func(_ string, mod *module) bool {
mod.killProcess()
return true
})
}

// Handles returns all the models for each module registered.
func (mgr *Manager) Handles() map[string]modlib.HandlerMap {
res := map[string]modlib.HandlerMap{}
Expand Down Expand Up @@ -1221,6 +1234,14 @@
return nil
}

func (m *module) killProcess() {
if m.process == nil {
return
}
m.logger.Infof("Killing module: %s process", m.cfg.Name)
m.process.Kill()

Check failure on line 1242 in module/modmanager/manager.go

View workflow job for this annotation

GitHub Actions / macos / build

m.process.Kill undefined (type pexec.ManagedProcess has no field or method Kill)
}

func (m *module) registerResources(mgr modmaninterface.ModuleManager) {
for api, models := range m.handles {
if _, ok := resource.LookupGenericAPIRegistration(api.API); !ok {
Expand Down
1 change: 1 addition & 0 deletions module/modmaninterface/interface.go
Original file line number Diff line number Diff line change
Expand Up @@ -30,4 +30,5 @@ type ModuleManager interface {
FirstRun(ctx context.Context, conf config.Module) error

Close(ctx context.Context) error
Kill()
}
8 changes: 8 additions & 0 deletions robot/impl/local_robot.go
Original file line number Diff line number Diff line change
Expand Up @@ -187,6 +187,14 @@ func (r *localRobot) Close(ctx context.Context) error {
return err
}

// Kill will attempt to kill any processes on the system started by the robot as quickly as possible.
// This operation is not clean and will not wait for completion.
func (r *localRobot) Kill() {
if r.manager != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we justify this isn't a data race?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r.manager could be nil if startup fails/hangs, but yes, it could also be a data race.

r.manager.Kill()
}
}

// StopAll cancels all current and outstanding operations for the robot and stops all actuators and movement.
func (r *localRobot) StopAll(ctx context.Context, extra map[resource.Name]map[string]interface{}) error {
// Stop all operations
Expand Down
10 changes: 10 additions & 0 deletions robot/impl/resource_manager.go
Original file line number Diff line number Diff line change
Expand Up @@ -599,6 +599,16 @@ func (manager *resourceManager) Close(ctx context.Context) error {
return allErrs
}

// Kill attempts to kill all module processes.
func (manager *resourceManager) Kill() {
// TODO: Kill processes in processManager as well.

// moduleManager may be nil in tests
if manager.moduleManager != nil {
manager.moduleManager.Kill()
}
}

// completeConfig process the tree in reverse order and attempts to build or reconfigure
// resources that are wrapped in a placeholderResource. this function will attempt to
// process resources concurrently when they do not depend on each other unless
Expand Down
1 change: 1 addition & 0 deletions robot/impl/resource_manager_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1325,6 +1325,7 @@ func (fp *fakeProcess) Start(ctx context.Context) error {
func (fp *fakeProcess) Stop() error {
return nil
}
func (fp *fakeProcess) Kill() {}

func (fp *fakeProcess) Status() error {
return nil
Expand Down
5 changes: 5 additions & 0 deletions robot/robot.go
Original file line number Diff line number Diff line change
Expand Up @@ -186,6 +186,11 @@ type LocalRobot interface {

// RestartAllowed returns whether the robot can safely be restarted.
RestartAllowed() bool

// Kill will attempt to kill any processes on the system started by the robot as quickly as possible.
// This operation is not clean and will not wait for completion.
// Only use this if comfortable with leaking resources (in cases where exiting the program as quickly as possible is desired).
Kill()
}

// A RemoteRobot is a Robot that was created through a connection.
Expand Down
8 changes: 7 additions & 1 deletion web/server/entrypoint.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ import (
"go.viam.com/rdk/config"
"go.viam.com/rdk/logging"
"go.viam.com/rdk/resource"
"go.viam.com/rdk/robot"
robotimpl "go.viam.com/rdk/robot/impl"
"go.viam.com/rdk/robot/web"
weboptions "go.viam.com/rdk/robot/web/options"
Expand Down Expand Up @@ -261,6 +262,8 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
forceShutdown := make(chan struct{})
defer func() { <-forceShutdown }()

var myRobot robot.LocalRobot

utils.PanicCapturingGo(func() {
defer close(forceShutdown)

Expand All @@ -280,6 +283,9 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
case <-doneServing:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're here, I'd recommend removing this whole case <-doneServing stuff (and incidentally the select statement) and move straight to the killing/logging.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is that? I thought the justification above for the code as it is makes sense

return true
default:
if myRobot != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If myRobot can be nil here -- this is a data race.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would that be an issue? I can see that myRobot could have started some processes but not yet returned, but I don't know if we can protect against that completely.

myRobot.Kill()
}
s.logger.Fatalw("server failed to cleanly shutdown after deadline", "deadline", hungShutdownDeadline)
return true
}
Expand Down Expand Up @@ -398,7 +404,7 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
robotOptions = append(robotOptions, robotimpl.WithFTDC())
}

myRobot, err := robotimpl.New(ctx, processedConfig, s.logger, robotOptions...)
myRobot, err = robotimpl.New(ctx, processedConfig, s.logger, robotOptions...)
if err != nil {
cancel()
return err
Expand Down
Loading