-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Euthanasia #16
Euthanasia #16
Conversation
* In core/sisyphus-handle-kafka, the `task` ns owns the docker process so ask it to kill the process etc. rather than mucking about in there. * In core/sisyphus-handle-kafka, ask the `task` ns to kill the docker process rather than mucking with its docker process. * Don't clear the `state` info before `perform-task!` finishes up and clears the `state` info.
* Implement per-task timeout, reading a `:timeout` value from the task spec; default = 1 hour. * Share the timer code and the kill-task code between kafka-kill-request and timeout. * Use a future in a future so cancelling a timer can cancel the sleep but can't interrupt the delayed function. * Append the elapsed run time to the log and an indication of whether it hit the timeout. This is useful feedback and needed to test out the timeout feature.
* Fix #13 Each task can specify a time limit in seconds, default = 1 hour. * TODO: Make Gaia pass through the timeout. * Fix #15 Make it work if Sisyphus gets a task-cancellation request when there's no docker container process to kill, i.e. while pulling the docker image and the task input files. * Closed some race condition windows. * Move `:docker-id` from the task spec (which shouldn't change for a task) to the top level of the state-map `@state`. Make task.clj responsible for the `docker-id` et al rather than mixing it into core.clj. * Check (and log) the container's completion code, docker `error-string`, and docker `oom-killed?`. * A little cleanup. * Project version "0.0.18".
* Also check the completion status (completed, terminated-by-timeout, terminated-by-request), docker's error-string, and docker's OOM Killed flag. * Send more of that to Gaia for now, but later we'll trim the messages to Gaia.
... so Gaia can default timeout to `nil`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @1fish2, looks excellent! Thanks for detailing all of the states (termination and otherwise) we could encounter here. Let's maybe talk about the future-inside-future approach. Random comments below.
(and | ||
id | ||
(= id (:id message)) | ||
(:id message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is no longer checking the id matches, I assume this is happening somewhere else then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. task.clj checks the id within a transaction that also depends on the FSM.
(task/status! (:kafka state) task "step-terminated" message) | ||
(swap! (:state state) assoc :status :waiting :task {})) | ||
(when (terminate? message) | ||
(task/terminate-by-request! state id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, this pulls the docker functionality entirely into the task
ns, I like it!
; ; required: | ||
; :id "id", :workflow "w", :name "n", :image "d", :command "c", | ||
; ; optional: | ||
; :inputs [i], :outputs [o], :timeout seconds}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good and should be declared somewhere. This is a pretty good guide to spec, which we could use for specific maps like sisyphus state and tasks: https://clojure.org/guides/spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Filed as #18
@@ -110,16 +115,18 @@ | |||
(let [docker (docker/connect! (:docker config)) | |||
storage (cloud/connect-storage! (:storage config)) | |||
rabbit (rabbit/connect! (:rabbit config)) | |||
; TODO: A state map containing a state map is too confusing. The key | |||
; names are the only thing we have for navigating the data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed here. One way to handle this is to give each a specific name, for instance here the outer one we could call :sisyphus
and the inner one we could call :task-info
, or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I took some steps in that direction but am not attached to the names. Along the way I realized there are 3 things to distinguish: the Sisyphus server's immutable state map, the inner map containing the current state, and the atom itself.
src/sisyphus/task.clj
Outdated
@@ -171,27 +177,136 @@ | |||
(log/notice! "STEP FAILED" (:workflow task) (:name task) log) | |||
(status! | |||
kafka task "step-error" | |||
{:code code | |||
{:success success? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm don't we want to report the exit code here in case it is an informative error to the end user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sisyphus logs the exit code to the user and soon it'll also go into the log file that it'll write, separate from conditionally writing stdout.
What changed here is Sisyphus takes more info into account to decide if the step succeeded or failed. The exit code is just partial info, so it no longer sends that to Gaia, which didn't use it.
We talked about trimming down the Sisyphus -> Gaia messages to the ones that Gaia uses now that logging is separate.
[milliseconds f] | ||
(future | ||
(Thread/sleep milliseconds) | ||
(future (f)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might need more explanation or a different approach.... putting futures inside futures makes me nervous. What problem is this solving? Also, I think canceling the outer future would cancel the inner one also, so I'm not sure it is accomplishing what you want to accomplish here (but I would need to test that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'll try to test it. Figuring out a surefire approach could take deep Java concurrency details. Maybe end up with a ThreadPoolExecutor or something?
I'll document this more clearly than the current docstring. The point is we want to be able to cancel the sleep phase but not interrupt the execution of the delayed function f
.
I'm under the impression (could be wrong!) that cancelling a Java future just interrupts that thread and interrupting a thread doesn't impact any threads it forked.
(assoc state-map :status :terminate-when-ready :action termination-status) | ||
|
||
(= (:status state-map) :running) | ||
(assoc state-map :status termination-status :action :kill) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice distinction here.
|
||
(defn terminate-by-request! | ||
[sisy-state task-id] | ||
(terminate! sisy-state task-id :terminated-by-request)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is becoming a serious state machine! Something like this could be useful, but maybe overkill? https://github.com/ztellman/automat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could be simpler. It's synchronizing 3 threads that are collectively controlling docker and the command itself running in external processes. Or maybe it needs a concise FSM diagram.
The future-in-future does work! It's pretty simple, too. |
The double future tests out fine.
Implement per-task timeout, reading a
:timeout
parameter in seconds from the task spec; default = 1 hour.Log the elapsed run time, whether it terminated by timeout, and the timeout parameter. This is useful feedback and needed to test the timeout feature.
Fix impose a time limit on each task! #13 Each task can specify a time limit in seconds, default = 1 hour.
timeout
parameter through Gaia.Fix Handle task cancellation when the container hasn't started yet #15 Make it work if Sisyphus gets a task-cancellation request when there's no docker container process to kill, i.e. while pulling the docker image and the task input files.
TODO: Unconditionally push STDOUT + the completion log message like to a log dir. Push a requested STDOUT only on success, like other outputs.
Check and log the container's completion code and docker
oom-killed?
. Log the dockererror-string
.error-string
as another failure indicator?Project version "0.0.18".
Code improvements:
state
things and disentangle task.clj vs. core.clj state.