Our current approach to distributing the workload on a cluster is based on MPI. Even #17 will use MPI for now.
The following is inspired by distributed systems. I'm not sure if this is actually necessary, but the impact on performance should be acceptable and fault tolerance would improve. Let me know what you think!
Problem:
- There is no fault tolerance.
- If one node crashes, the application goes down with it.
- No new nodes can be added; everything has to be static from the start.
Idea:
Use approaches that are common in distributed systems:
- The HeadNode is the server (there could even be mirrored backup HeadNodes!) and the only node with persistent state: results for sampling points and the set of currently working ComputeNodes.
- Each ComputeNode is an independent client and holds only soft state.
- When a ComputeNode starts, it looks for the HeadNode (either via a fixed hostname or some service-discovery broadcast).
- ComputeNodes request work from the HeadNode. When the work is done, they send the result back (as is done now).
- During a computation, each ComputeNode sends a heartbeat signal to the HeadNode. If the heartbeat stops, the HeadNode assumes the node is dead and reschedules the sampling point.
- For very long computations (or after repairing a broken ComputeNode), new ComputeNodes can be started while the computation is running and will simply request work from the HeadNode.
- Simple communication over UDP or TCP, no more MPI dependency (see the sketches after this list).
- It would also be possible (and very easy) to add ComputeNodes that use a CPU code or different accelerators; all ComputeNodes would simply have to implement the same network interface.
- In theory, this also allows scaling across multiple clusters or to home computers 👯
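A minimal sketch of what the ComputeNode side could look like, assuming TCP with one JSON message per line. The hostname `headnode`, the port, the message types (`request_work`, `no_work`, `result`, `heartbeat`) and the `compute()` call are all placeholders for illustration, not existing code:

```python
import json
import socket
import threading

HEAD_NODE = ("headnode", 5555)   # placeholder hostname and port
HEARTBEAT_INTERVAL = 10          # seconds, placeholder value

def send_msg(sock, msg):
    # One JSON object per line keeps the message framing trivial.
    sock.sendall((json.dumps(msg) + "\n").encode())

def ask_head_node(msg):
    # Short-lived connection per request; assumes the HeadNode answers
    # every message with exactly one JSON line (e.g. an ack or a work item).
    with socket.create_connection(HEAD_NODE) as s:
        send_msg(s, msg)
        return json.loads(s.makefile("r").readline())

def heartbeat_loop(stop):
    # Runs in a background thread while a sampling point is being computed.
    while not stop.wait(HEARTBEAT_INTERVAL):
        with socket.create_connection(HEAD_NODE) as s:
            send_msg(s, {"type": "heartbeat", "node": socket.gethostname()})

def main():
    while True:
        work = ask_head_node({"type": "request_work", "node": socket.gethostname()})
        if work["type"] == "no_work":
            break
        stop = threading.Event()
        threading.Thread(target=heartbeat_loop, args=(stop,), daemon=True).start()
        value = compute(work["point"])   # placeholder for the existing GPU computation
        stop.set()
        ask_head_node({"type": "result", "point": work["point"], "value": value})

if __name__ == "__main__":
    main()
```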
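On the HeadNode side, the rescheduling only needs to track which points are pending, which are in flight, and when the node working on each one was last heard from. A rough sketch under the same assumptions (data structures and timeout are made up; a real version would need locking around the shared state):

```python
import time
from collections import deque

HEARTBEAT_TIMEOUT = 30    # seconds, placeholder value

pending = deque()         # sampling points waiting to be handed out
in_flight = {}            # point -> (node, last_heartbeat_time)
results = {}              # point -> result; the only persistent state

def assign_work(node):
    # Hand out the next pending point, or None if there is nothing left.
    if not pending:
        return None
    point = pending.popleft()
    in_flight[point] = (node, time.monotonic())
    return point

def on_heartbeat(node):
    # Refresh the timestamp of every point this node is working on.
    for point, (owner, _) in list(in_flight.items()):
        if owner == node:
            in_flight[point] = (owner, time.monotonic())

def on_result(point, value):
    results[point] = value
    in_flight.pop(point, None)

def reschedule_dead_nodes():
    # Called periodically: any point whose worker stopped sending
    # heartbeats goes back into the pending queue.
    now = time.monotonic()
    for point, (owner, last_seen) in list(in_flight.items()):
        if now - last_seen > HEARTBEAT_TIMEOUT:
            del in_flight[point]
            pending.append(point)
```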
Problems / Thoughts:
- Communication might be a bit slower than with MPI. That might not matter, since individual sampling points take so much longer anyway.
- Could we find a way to implement this as an option alongside MPI? (I don't think so, since the structure would be quite different. Maybe after modularizing the code a lot.)