Euthanasia #16

1fish2 · 2019-10-30T20:26:11Z

Implement per-task timeout, reading a :timeout parameter in seconds from the task spec; default = 1 hour.

Log the elapsed run time, whether it terminated by timeout, and the timeout parameter. This is useful feedback and needed to test the timeout feature.

PROCESS terminated-by-timeout, elapsed 1.562448734S, timeout parameter 1s, exit code 137 (SIGKILL), error string ""
PROCESS completed, elapsed 1.562448734S, timeout parameter 1s, exit code 0, error string ""

Fix impose a time limit on each task! #13 Each task can specify a time limit in seconds, default = 1 hour.
- TODO: Pass the timeout parameter through Gaia.
Fix Handle task cancellation when the container hasn't started yet #15 Make it work if Sisyphus gets a task-cancellation request when there's no docker container process to kill, i.e. while pulling the docker image and the task input files.
TODO: Unconditionally push STDOUT + the completion log message like to a log dir. Push a requested STDOUT only on success, like other outputs.
Check and log the container's completion code and docker oom-killed?. Log the docker error-string.
- TODO: Use error-string as another failure indicator?
Project version "0.0.18".

Code improvements:

Make task.clj entirely responsible for its docker process.
Partially disentangle three state things and disentangle task.clj vs. core.clj state.
Don't mutate the task spec.
Share the timer code and make cancelling a timer cancel the sleep without interrupting the delayed function.
Fix some race conditions.

* In core/sisyphus-handle-kafka, the `task` ns owns the docker process so ask it to kill the process etc. rather than mucking about in there. * In core/sisyphus-handle-kafka, ask the `task` ns to kill the docker process rather than mucking with its docker process. * Don't clear the `state` info before `perform-task!` finishes up and clears the `state` info.

* Implement per-task timeout, reading a `:timeout` value from the task spec; default = 1 hour. * Share the timer code and the kill-task code between kafka-kill-request and timeout. * Use a future in a future so cancelling a timer can cancel the sleep but can't interrupt the delayed function. * Append the elapsed run time to the log and an indication of whether it hit the timeout. This is useful feedback and needed to test out the timeout feature.

* Fix #13 Each task can specify a time limit in seconds, default = 1 hour. * TODO: Make Gaia pass through the timeout. * Fix #15 Make it work if Sisyphus gets a task-cancellation request when there's no docker container process to kill, i.e. while pulling the docker image and the task input files. * Closed some race condition windows. * Move `:docker-id` from the task spec (which shouldn't change for a task) to the top level of the state-map `@state`. Make task.clj responsible for the `docker-id` et al rather than mixing it into core.clj. * Check (and log) the container's completion code, docker `error-string`, and docker `oom-killed?`. * A little cleanup. * Project version "0.0.18".

* Also check the completion status (completed, terminated-by-timeout, terminated-by-request), docker's error-string, and docker's OOM Killed flag. * Send more of that to Gaia for now, but later we'll trim the messages to Gaia.

... so Gaia can default timeout to `nil`.

prismofeverything

Hey @1fish2, looks excellent! Thanks for detailing all of the states (termination and otherwise) we could encounter here. Let's maybe talk about the future-inside-future approach. Random comments below.

prismofeverything · 2019-10-30T21:41:54Z

src/sisyphus/core.clj

  (and
-   id
-   (= id (:id message))
+   (:id message)


So this is no longer checking the id matches, I assume this is happening somewhere else then?

Yes. task.clj checks the id within a transaction that also depends on the FSM.

prismofeverything · 2019-10-30T21:42:40Z

src/sisyphus/core.clj

-        (task/status! (:kafka state) task "step-terminated" message)
-        (swap! (:state state) assoc :status :waiting :task {}))
+      (when (terminate? message)
+        (task/terminate-by-request! state id))


Ah I see, this pulls the docker functionality entirely into the task ns, I like it!

prismofeverything · 2019-10-30T21:47:20Z

src/sisyphus/core.clj

+  ;    ; required:
+  ;    :id "id", :workflow "w", :name "n", :image "d", :command "c",
+  ;    ; optional:
+  ;    :inputs [i], :outputs [o], :timeout seconds}.


This is good and should be declared somewhere. This is a pretty good guide to spec, which we could use for specific maps like sisyphus state and tasks: https://clojure.org/guides/spec

Cool. Filed as #18

prismofeverything · 2019-10-30T21:49:17Z

src/sisyphus/core.clj

@@ -110,16 +115,18 @@
  (let [docker (docker/connect! (:docker config))
        storage (cloud/connect-storage! (:storage config))
        rabbit (rabbit/connect! (:rabbit config))
+        ; TODO: A state map containing a state map is too confusing. The key
+        ; names are the only thing we have for navigating the data.


Agreed here. One way to handle this is to give each a specific name, for instance here the outer one we could call :sisyphus and the inner one we could call :task-info, or something.

Yes. I took some steps in that direction but am not attached to the names. Along the way I realized there are 3 things to distinguish: the Sisyphus server's immutable state map, the inner map containing the current state, and the atom itself.

prismofeverything · 2019-10-30T21:50:45Z

src/sisyphus/task.clj

@@ -171,27 +177,136 @@
      (log/notice! "STEP FAILED" (:workflow task) (:name task) log)
      (status!
       kafka task "step-error"
-       {:code code
+       {:success success?


Hmmmm don't we want to report the exit code here in case it is an informative error to the end user?

Sisyphus logs the exit code to the user and soon it'll also go into the log file that it'll write, separate from conditionally writing stdout.

What changed here is Sisyphus takes more info into account to decide if the step succeeded or failed. The exit code is just partial info, so it no longer sends that to Gaia, which didn't use it.

We talked about trimming down the Sisyphus -> Gaia messages to the ones that Gaia uses now that logging is separate.

prismofeverything · 2019-10-30T21:53:21Z

src/sisyphus/task.clj

+  [milliseconds f]
+  (future
+    (Thread/sleep milliseconds)
+    (future (f))))


This might need more explanation or a different approach.... putting futures inside futures makes me nervous. What problem is this solving? Also, I think canceling the outer future would cancel the inner one also, so I'm not sure it is accomplishing what you want to accomplish here (but I would need to test that).

Good point. I'll try to test it. Figuring out a surefire approach could take deep Java concurrency details. Maybe end up with a ThreadPoolExecutor or something?

I'll document this more clearly than the current docstring. The point is we want to be able to cancel the sleep phase but not interrupt the execution of the delayed function f.

I'm under the impression (could be wrong!) that cancelling a Java future just interrupts that thread and interrupting a thread doesn't impact any threads it forked.

prismofeverything · 2019-10-30T22:10:56Z

src/sisyphus/task.clj

+    (assoc state-map :status :terminate-when-ready :action termination-status)
+
+    (= (:status state-map) :running)
+    (assoc state-map :status termination-status :action :kill)


Nice distinction here.

prismofeverything · 2019-10-30T22:24:38Z

src/sisyphus/task.clj

+
+(defn terminate-by-request!
+  [sisy-state task-id]
+  (terminate! sisy-state task-id :terminated-by-request))


This is becoming a serious state machine! Something like this could be useful, but maybe overkill? https://github.com/ztellman/automat

Maybe this could be simpler. It's synchronizing 3 threads that are collectively controlling docker and the command itself running in external processes. Or maybe it needs a concise FSM diagram.

1fish2 · 2019-10-31T00:17:06Z

The future-in-future does work! It's pretty simple, too.

The double future tests out fine.

1fish2 added 11 commits October 22, 2019 23:34

log the elapsed-time note

3568865

more complete failure test than the exit code

d2f4a9c

* Also check the completion status (completed, terminated-by-timeout, terminated-by-request), docker's error-string, and docker's OOM Killed flag. * Send more of that to Gaia for now, but later we'll trim the messages to Gaia.

don't count on that docker error-string yet

d90e46d

(first (swap-vals! ...))

3a46c2a

improved state transactions

d337100

tune the log messages

f1349f9

interpret exit codd 137

856bdb6

()

d3f20e2

1fish2 requested a review from prismofeverything October 30, 2019 20:26

stronger defaulting for :timeout

551cda1

... so Gaia can default timeout to `nil`.

prismofeverything approved these changes Oct 30, 2019

View reviewed changes

1fish2 added 2 commits October 30, 2019 17:28

document the FSM and the double future

f4bbf58

The double future tests out fine.

less ALL CAPS shouting

20fdd56

1fish2 merged commit 0fc13e0 into master Oct 31, 2019

1fish2 deleted the euthanasia branch October 31, 2019 00:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Euthanasia #16

Euthanasia #16

1fish2 commented Oct 30, 2019

prismofeverything left a comment

prismofeverything Oct 30, 2019

1fish2 Oct 30, 2019

prismofeverything Oct 30, 2019

prismofeverything Oct 30, 2019

1fish2 Oct 30, 2019

prismofeverything Oct 30, 2019

1fish2 Oct 30, 2019

prismofeverything Oct 30, 2019

1fish2 Oct 30, 2019

prismofeverything Oct 30, 2019

1fish2 Oct 30, 2019

prismofeverything Oct 30, 2019

prismofeverything Oct 30, 2019

1fish2 Oct 30, 2019

1fish2 commented Oct 31, 2019

Euthanasia #16

Euthanasia #16

Conversation

1fish2 commented Oct 30, 2019

prismofeverything left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

1fish2 commented Oct 31, 2019