Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Euthanasia #16

Merged
merged 14 commits into from
Oct 31, 2019
Merged

Euthanasia #16

merged 14 commits into from
Oct 31, 2019

Conversation

1fish2
Copy link
Collaborator

@1fish2 1fish2 commented Oct 30, 2019

  • Implement per-task timeout, reading a :timeout parameter in seconds from the task spec; default = 1 hour.

  • Log the elapsed run time, whether it terminated by timeout, and the timeout parameter. This is useful feedback and needed to test the timeout feature.

    PROCESS terminated-by-timeout, elapsed 1.562448734S, timeout parameter 1s, exit code 137 (SIGKILL), error string ""
    PROCESS completed, elapsed 1.562448734S, timeout parameter 1s, exit code 0, error string ""
    
  • Fix impose a time limit on each task! #13 Each task can specify a time limit in seconds, default = 1 hour.

    • TODO: Pass the timeout parameter through Gaia.
  • Fix Handle task cancellation when the container hasn't started yet #15 Make it work if Sisyphus gets a task-cancellation request when there's no docker container process to kill, i.e. while pulling the docker image and the task input files.

  • TODO: Unconditionally push STDOUT + the completion log message like to a log dir. Push a requested STDOUT only on success, like other outputs.

  • Check and log the container's completion code and docker oom-killed?. Log the docker error-string.

    • TODO: Use error-string as another failure indicator?
  • Project version "0.0.18".

Code improvements:

  • Make task.clj entirely responsible for its docker process.
  • Partially disentangle three state things and disentangle task.clj vs. core.clj state.
  • Don't mutate the task spec.
  • Share the timer code and make cancelling a timer cancel the sleep without interrupting the delayed function.
  • Fix some race conditions.

1fish2 added 11 commits October 22, 2019 23:34
* In core/sisyphus-handle-kafka, the `task` ns owns the docker process so ask it to kill the process etc. rather than mucking about in there.
* In core/sisyphus-handle-kafka, ask the `task` ns to kill the docker process rather than mucking with its docker process.
* Don't clear the `state` info before `perform-task!` finishes up and clears the `state` info.
* Implement per-task timeout, reading a `:timeout` value from the task spec; default = 1 hour.
* Share the timer code and the kill-task code between kafka-kill-request and timeout.
* Use a future in a future so cancelling a timer can cancel the sleep but can't interrupt the delayed function.
* Append the elapsed run time to the log and an indication of whether it hit the timeout. This is useful feedback and needed to test out the timeout feature.
* Fix #13 Each task can specify a time limit in seconds, default = 1 hour.
  * TODO: Make Gaia pass through the timeout.
* Fix #15 Make it work if Sisyphus gets a task-cancellation request when there's no docker container process to kill, i.e. while pulling the docker image and the task input files.
* Closed some race condition windows.
* Move `:docker-id` from the task spec (which shouldn't change for a task) to the top level of the state-map `@state`. Make task.clj responsible for the `docker-id` et al rather than mixing it into core.clj.
* Check (and log) the container's completion code, docker `error-string`, and docker `oom-killed?`.
* A little cleanup.
* Project version "0.0.18".
* Also check the completion status (completed, terminated-by-timeout, terminated-by-request), docker's error-string, and docker's OOM Killed flag.
* Send more of that to Gaia for now, but later we'll trim the messages to Gaia.
... so Gaia can default timeout to `nil`.
Copy link
Member

@prismofeverything prismofeverything left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @1fish2, looks excellent! Thanks for detailing all of the states (termination and otherwise) we could encounter here. Let's maybe talk about the future-inside-future approach. Random comments below.

(and
id
(= id (:id message))
(:id message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is no longer checking the id matches, I assume this is happening somewhere else then?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. task.clj checks the id within a transaction that also depends on the FSM.

(task/status! (:kafka state) task "step-terminated" message)
(swap! (:state state) assoc :status :waiting :task {}))
(when (terminate? message)
(task/terminate-by-request! state id))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, this pulls the docker functionality entirely into the task ns, I like it!

; ; required:
; :id "id", :workflow "w", :name "n", :image "d", :command "c",
; ; optional:
; :inputs [i], :outputs [o], :timeout seconds}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good and should be declared somewhere. This is a pretty good guide to spec, which we could use for specific maps like sisyphus state and tasks: https://clojure.org/guides/spec

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Filed as #18

@@ -110,16 +115,18 @@
(let [docker (docker/connect! (:docker config))
storage (cloud/connect-storage! (:storage config))
rabbit (rabbit/connect! (:rabbit config))
; TODO: A state map containing a state map is too confusing. The key
; names are the only thing we have for navigating the data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed here. One way to handle this is to give each a specific name, for instance here the outer one we could call :sisyphus and the inner one we could call :task-info, or something.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I took some steps in that direction but am not attached to the names. Along the way I realized there are 3 things to distinguish: the Sisyphus server's immutable state map, the inner map containing the current state, and the atom itself.

@@ -171,27 +177,136 @@
(log/notice! "STEP FAILED" (:workflow task) (:name task) log)
(status!
kafka task "step-error"
{:code code
{:success success?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm don't we want to report the exit code here in case it is an informative error to the end user?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sisyphus logs the exit code to the user and soon it'll also go into the log file that it'll write, separate from conditionally writing stdout.

What changed here is Sisyphus takes more info into account to decide if the step succeeded or failed. The exit code is just partial info, so it no longer sends that to Gaia, which didn't use it.

We talked about trimming down the Sisyphus -> Gaia messages to the ones that Gaia uses now that logging is separate.

[milliseconds f]
(future
(Thread/sleep milliseconds)
(future (f))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might need more explanation or a different approach.... putting futures inside futures makes me nervous. What problem is this solving? Also, I think canceling the outer future would cancel the inner one also, so I'm not sure it is accomplishing what you want to accomplish here (but I would need to test that).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll try to test it. Figuring out a surefire approach could take deep Java concurrency details. Maybe end up with a ThreadPoolExecutor or something?

I'll document this more clearly than the current docstring. The point is we want to be able to cancel the sleep phase but not interrupt the execution of the delayed function f.

I'm under the impression (could be wrong!) that cancelling a Java future just interrupts that thread and interrupting a thread doesn't impact any threads it forked.

(assoc state-map :status :terminate-when-ready :action termination-status)

(= (:status state-map) :running)
(assoc state-map :status termination-status :action :kill)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice distinction here.


(defn terminate-by-request!
[sisy-state task-id]
(terminate! sisy-state task-id :terminated-by-request))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is becoming a serious state machine! Something like this could be useful, but maybe overkill? https://github.com/ztellman/automat

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could be simpler. It's synchronizing 3 threads that are collectively controlling docker and the command itself running in external processes. Or maybe it needs a concise FSM diagram.

@1fish2
Copy link
Collaborator Author

1fish2 commented Oct 31, 2019

The future-in-future does work! It's pretty simple, too.

@1fish2 1fish2 merged commit 0fc13e0 into master Oct 31, 2019
@1fish2 1fish2 deleted the euthanasia branch October 31, 2019 00:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle task cancellation when the container hasn't started yet impose a time limit on each task!
2 participants