Our current approach to distributing the workload on a cluster is based on MPI. Even #17 will use MPI for now.
The following is inspired by distributed systems. I'm not sure if this is actually necessary, but the impact on performance should be acceptable and fault tolerance would improve. Let me know what you think!
Problem:
- There is no fault tolerance.
- If one node crashes, the application goes down with it.
- No new nodes can be added; everything has to be static from the start.
Idea:
Use approaches that are common in distributed systems:
- The HeadNode is the server (there could even be mirrored backup HeadNodes!) and the only node with persistent state: results for sampling points and the set of currently working ComputeNodes.
- Each ComputeNode is an independent client and holds only soft state.
- When a ComputeNode starts, it looks for the HeadNode (either via a fixed hostname or some service-discovery broadcast).
- ComputeNodes request work from the HeadNode. When the work is done, they send the result back (as is done now).
- During a computation, each ComputeNode sends a heartbeat signal to the HeadNode. If the heartbeat stops, the HeadNode assumes the node is dead and reschedules the sampling point.
- For very long computations (or after repairing a broken ComputeNode), new ComputeNodes can be started while the computation is running and will simply request work from the HeadNode.
- Simple communication over UDP or TCP, no more MPI dependency (see the sketches after this list).
- It would also be possible (and very easy) to add ComputeNodes that use a CPU code or different accelerators; all ComputeNodes would simply have to implement the same network interface.
- In theory, this also allows scaling across multiple clusters or to home computers 👯
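A minimal sketch of what the ComputeNode side could look like, assuming TCP with one JSON message per line. The hostname `headnode`, the port, the message types (`request_work`, `no_work`, `result`, `heartbeat`) and the `compute()` call are all placeholders for illustration, not existing code:

```python
import json
import socket
import threading

HEAD_NODE = ("headnode", 5555)   # placeholder hostname and port
HEARTBEAT_INTERVAL = 10          # seconds, placeholder value

def send_msg(sock, msg):
    # One JSON object per line keeps the message framing trivial.
    sock.sendall((json.dumps(msg) + "\n").encode())

def ask_head_node(msg):
    # Short-lived connection per request; assumes the HeadNode answers
    # every message with exactly one JSON line (e.g. an ack or a work item).
    with socket.create_connection(HEAD_NODE) as s:
        send_msg(s, msg)
        return json.loads(s.makefile("r").readline())

def heartbeat_loop(stop):
    # Runs in a background thread while a sampling point is being computed.
    while not stop.wait(HEARTBEAT_INTERVAL):
        with socket.create_connection(HEAD_NODE) as s:
            send_msg(s, {"type": "heartbeat", "node": socket.gethostname()})

def main():
    while True:
        work = ask_head_node({"type": "request_work", "node": socket.gethostname()})
        if work["type"] == "no_work":
            break
        stop = threading.Event()
        threading.Thread(target=heartbeat_loop, args=(stop,), daemon=True).start()
        value = compute(work["point"])   # placeholder for the existing GPU computation
        stop.set()
        ask_head_node({"type": "result", "point": work["point"], "value": value})

if __name__ == "__main__":
    main()
```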
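On the HeadNode side, the rescheduling only needs to track which points are pending, which are in flight, and when the node working on each one was last heard from. A rough sketch under the same assumptions (data structures and timeout are made up; a real version would need locking around the shared state):

```python
import time
from collections import deque

HEARTBEAT_TIMEOUT = 30    # seconds, placeholder value

pending = deque()         # sampling points waiting to be handed out
in_flight = {}            # point -> (node, last_heartbeat_time)
results = {}              # point -> result; the only persistent state

def assign_work(node):
    # Hand out the next pending point, or None if there is nothing left.
    if not pending:
        return None
    point = pending.popleft()
    in_flight[point] = (node, time.monotonic())
    return point

def on_heartbeat(node):
    # Refresh the timestamp of every point this node is working on.
    for point, (owner, _) in list(in_flight.items()):
        if owner == node:
            in_flight[point] = (owner, time.monotonic())

def on_result(point, value):
    results[point] = value
    in_flight.pop(point, None)

def reschedule_dead_nodes():
    # Called periodically: any point whose worker stopped sending
    # heartbeats goes back into the pending queue.
    now = time.monotonic()
    for point, (owner, last_seen) in list(in_flight.items()):
        if now - last_seen > HEARTBEAT_TIMEOUT:
            del in_flight[point]
            pending.append(point)
```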
Problems / Thoughts:
- Communication might be a bit slower than with MPI. That might not matter, since individual sampling points take so much longer anyway.
- Could we find a way to implement this as an option alongside MPI? (I don't think so, since the structure would be quite different. Maybe after modularizing the code a lot.)