Full provenance of tasks executions #3447
pditommaso
started this conversation in
Ideas
Problem
The execution DAG created by Nextflow tracks the dependencies between processes, operators and any combination of them.
However, Nextflow is not able to track which task instance(s) triggered a specific task execution.
This means that it's possible to know, for example, that a process A is connected with a process B, e.g. A -> B. However, each of them can run n tasks, and it's not possible to establish which i-th task of A triggered the j-th task of B.
This information is important to precisely track the provenance of task executions and to solve other problems, like the runtime cleanup of the temporary files created by each task (#452).
Heuristic solution
A possible solution to this problem could consist in using the unique hash id associated with each task execution.
The task hash is used to allocate a scratch work directory where all output files are created. For example, a task having id 9a001d8539def2552dd604427ec9aee1 will create all output files in a directory having the path /some/work/dir/9a/001d8539def2552dd604427ec9aee1.
Therefore all downstream tasks receiving these outputs as input files could use the hash encoded in the file path to infer the task instance that created those files and re-create the task execution DAG.
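A minimal sketch of this path-based inference, in Java. It assumes the work directory layout shown in the example path above (a 2-character directory followed by a 30-character directory, both lowercase hex). The class TaskHashFromPath and its method are hypothetical helpers for illustration, not part of Nextflow.

```java
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, not part of Nextflow: recovers the task hash
// from the location of a file inside a task work directory.
class TaskHashFromPath {

    // Assumes the layout <work dir>/<2 hex chars>/<30 hex chars>/...,
    // as in the example path in the text.
    static Optional<String> taskHash(Path workDir, Path file) {
        Path base = workDir.toAbsolutePath().normalize();
        Path abs = file.toAbsolutePath().normalize();
        if (!abs.startsWith(base)) {
            return Optional.empty();   // file does not live under the work dir
        }
        Path rel = base.relativize(abs);
        if (rel.getNameCount() < 2) {
            return Optional.empty();   // too shallow to contain a task directory
        }
        String prefix = rel.getName(0).toString();
        String rest = rel.getName(1).toString();
        if (!prefix.matches("[0-9a-f]{2}") || !rest.matches("[0-9a-f]{30}")) {
            return Optional.empty();   // path does not follow the assumed layout
        }
        return Optional.of(prefix + rest);
    }
}
```

Calling taskHash with the work directory /some/work/dir and a file under /some/work/dir/9a/001d8539def2552dd604427ec9aee1/ would yield the task id from the example, while a file written anywhere outside the work directory yields an empty result.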
However, this approach does not provide a complete solution, because Nextflow processes can produce arbitrary values as outputs, not only file paths, and for such values the approach cannot be used.
A similar problem can arise when a process execution is chained with an operator that can change the output file path to an arbitrary one (e.g. collectFile), breaking the above assumption.
Object identity solution
This solution is to some extent similar to the previous approach; however, the main idea is to use the Java object identity associated with each input and output to track the relationship between tasks.
This could be done by storing in a dictionary structure the pair < object id, task id >, where object id is obtained from the identity hash code of the n-th output object, and task id is the Nextflow TaskRun.id attribute. Let's call this structure prov-map.
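A minimal sketch of such a structure, in Java, covering both the recording of an output and the lookup it is meant to support. The class ProvMap and its method names are hypothetical, and a plain long stands in for the task id. Note one deliberate substitution: instead of using raw identity hash codes as keys, the sketch uses java.util.IdentityHashMap, which compares keys by reference, because System.identityHashCode is not guaranteed to be unique across objects.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch of the prov-map; class and method names are illustrative.
class ProvMap {

    // IdentityHashMap compares keys by reference (==), so two equal but
    // distinct objects are tracked as separate entries.
    private final Map<Object, Long> taskByValue =
            Collections.synchronizedMap(new IdentityHashMap<>());

    // Record that `value` was emitted by the task with the given id.
    void record(Object value, long taskId) {
        taskByValue.put(value, taskId);
    }

    // Return the id of the task that emitted `value`, or null if unknown.
    Long producerOf(Object value) {
        return taskByValue.get(value);
    }
}
```

One trade-off of this sketch is that the map holds strong references to every recorded value, so entries would need to be removed explicitly once they are no longer needed.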
This information can be captured when the output of a task Tx is bound to the corresponding output channel, see here.
Correspondingly, when a new task execution Ty is triggered, each value in the list of inputs should be looked up in the prov-map using the input object's system identity as the access key.
If an entry is found, a relationship between the two tasks can be established and a direct edge Tx -> Ty can be added to the provenance graph.
The problem still remains for processes interleaved with operator execution, e.g. P1 -> map -> P2.
A similar approach could be taken, considering that all Nextflow operators are implemented as a DataflowProcessor class.
This class allows the use of a listener interface defined as shown below:
The methods beforeRun and afterRun can be used to propagate the task association to the corresponding output value. For example, let's consider the flow T1(x) -> map(x) -> T2(y):
1. T1 stores the pair < x, T1 > in the prov-map.
2. map receives the value x and invokes the beforeRun event listener.
3. The listener retrieves T1 using the prov-map.
4. afterRun is invoked and the pair < y, T1 > is stored in the prov-map.
5. T2 receives the value y and fetches the value T1 from the prov-map.
6. The edge T1 -> T2 is created.
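The flow above can be sketched in Java as follows. Everything here is hypothetical: OperatorProvenance and its methods are simplified stand-ins, not the GPars listener interface or the Nextflow API, and the operator is modelled as a plain function whose run is wrapped by the logic that beforeRun and afterRun would carry out.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// All names here are hypothetical; this is not the GPars or Nextflow API.
class OperatorProvenance {

    // prov-map: value (compared by reference) -> name of the task it descends from
    final Map<Object, String> provMap = new IdentityHashMap<>();
    // Edges of the provenance graph, e.g. "T1 -> T2"
    final List<String> edges = new ArrayList<>();

    // A task binds an output value: remember which task produced it.
    void taskEmits(String task, Object value) {
        provMap.put(value, task);
    }

    // Stand-in for one operator run wrapped by beforeRun/afterRun hooks.
    Object runOperator(Function<Object, Object> op, Object input) {
        String origin = provMap.get(input);   // beforeRun: look up the upstream task
        Object output = op.apply(input);
        if (origin != null) {
            provMap.put(output, origin);      // afterRun: carry the association over
        }
        return output;
    }

    // A task receives an input value: create the edge if its origin is known.
    void taskReceives(String task, Object value) {
        String origin = provMap.get(value);
        if (origin != null) {
            edges.add(origin + " -> " + task);
        }
    }
}
```

With this sketch, calling taskEmits("T1", x), then runOperator on x, then taskReceives("T2", y) on the operator's result records the edge T1 -> T2 even though y is a different object from x. Operators that combine several inputs into one output would presumably need a set of origins per value rather than a single one.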