LF should be resilient to failures when production code runs as federates. For example, when a server becomes unreachable (because of a network failure or a hardware failure on the machine), we should be able to spin up another server inside that POP, and the load balancer should be told to stop sending traffic to the non-responding server. If a whole POP becomes unreachable due to a network failure or a rare hardware fault, the system should be able to route and distribute its traffic to other POPs. Users should be able to promote a secondary machine to the primary role. RTI failures must not be forgotten either: we should plan a resilience mechanism that covers them as well.
The following user stories further break down the areas we need to concentrate on to achieve resiliency in a deployed set of LF reactors.
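To make the failover behavior above concrete, here is a minimal Python sketch of the routing decisions it implies. Everything in it (`Server`, `Pop`, `route`, `promote_secondary`, the `healthy` and `reachable` flags) is a hypothetical name chosen for illustration; it is not an LF API, and it assumes health information is already available from some external check.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Server:
    name: str
    healthy: bool = True          # assumed to be set by an external health check
    role: str = "secondary"       # "primary" or "secondary"


@dataclass
class Pop:
    name: str
    servers: List[Server] = field(default_factory=list)
    reachable: bool = True        # False when the whole POP is cut off

    def healthy_servers(self) -> List[Server]:
        # An unreachable POP contributes no servers at all.
        if not self.reachable:
            return []
        return [s for s in self.servers if s.healthy]


def route(pops: List[Pop], preferred: str) -> Optional[Server]:
    """Pick a healthy server, trying the preferred POP before any other."""
    # sorted() is stable, so non-preferred POPs keep their original order.
    ordered = sorted(pops, key=lambda p: p.name != preferred)
    for pop in ordered:
        candidates = pop.healthy_servers()
        if candidates:
            return candidates[0]
    return None  # nothing reachable anywhere


def promote_secondary(pop: Pop) -> Optional[Server]:
    """Promote the first healthy secondary if no healthy primary remains."""
    healthy = pop.healthy_servers()
    if any(s.role == "primary" for s in healthy):
        return None  # a primary is still serving; nothing to do
    for server in healthy:
        if server.role == "secondary":
            server.role = "primary"
            return server
    return None
```

In this sketch a non-responding server is simply skipped, an unreachable POP causes traffic to fall through to the next POP, and promotion is an explicit step that a user or supervisor would trigger. How the health flags get updated, and how a replacement server is spun up, is left open.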
Centralized or decentralized Coordination with Fast Mode Support - Brainstorming
Centralized coordination solves many of the problems in this Epic: it gives us consistency, and its fast mode support is a huge plus. However, it has no notion of handling federate failures. For example, if a POP running as a federate goes down, the whole topology comes to a halt. The RTI is also a single point of failure in centralized coordination.
Decentralized coordination, on the other hand, gives us resiliency against federate failures. If a federate goes down, the other federates can keep advancing by making assumptions about physical time. That reliance on physical time rules out fast mode support under decentralized coordination. On top of this, decentralized coordination is prone to inconsistent behavior, in contrast to centralized coordination.
We need to think through these coordination techniques and come up with a solution that handles these problems.
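The trade-off between the two coordination styles can be illustrated with a toy model. This is a deliberately simplified Python sketch, not the actual RTI or federate logic: `centralized_grant`, `decentralized_can_process`, and the `wait_bound` parameter are hypothetical names, and times are plain integers.

```python
from typing import Dict, Optional


def centralized_grant(reported: Dict[str, Optional[int]]) -> Optional[int]:
    """Return the logical time every federate may advance to, or None if stalled.

    `reported` maps each federate to the latest logical time it has completed.
    None marks a federate that has stopped responding.
    """
    if not reported or any(t is None for t in reported.values()):
        return None  # a single silent federate blocks the whole topology
    return min(reported.values())


def decentralized_can_process(tag: int, physical_now: int, wait_bound: int) -> bool:
    """Decide locally whether an event at `tag` may be processed.

    The federate assumes that any message with an earlier tag would have
    arrived once physical time has passed tag + wait_bound. If that
    assumption is wrong, behavior can become inconsistent.
    """
    return physical_now >= tag + wait_bound
```

In the centralized model a failed federate (a `None` entry) stalls everyone, which matches the halt described above. In the decentralized model each federate decides on its own, but the decision compares a logical tag against the physical clock, so it has no meaning when logical time is allowed to run ahead of physical time as in fast mode.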