EPIC-3: Failure Handling #74

OmerMajNition · 2024-10-17T11:39:26Z

LF should be resilient to failures when production code runs as federates. For example, when a server becomes unreachable (because of a network failure or a hardware failure on the machine), we should be able to spin up another server inside that POP, and the load balancer should be informed not to send traffic to the non-responding server. If a POP becomes unreachable due to a network failure or a rare hardware fault, the system should be able to route and distribute its traffic to other POPs. Users should be able to promote a secondary machine to the primary role. RTI failures must not be forgotten either: we should plan a resilience mechanism that covers them as well.
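To make the intended failover behavior concrete, here is a minimal Python sketch of the routing rule described above: prefer a healthy server in the requested POP, and fall back to other POPs only when that POP has no healthy server left. The names `Server`, `Pop`, and `route` are hypothetical and exist only for illustration; they are not part of LF or of any existing load balancer, and how a server gets marked unhealthy (heartbeats, timeouts) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    healthy: bool = True  # assumed to be updated by some health check (not shown)


@dataclass
class Pop:
    name: str
    servers: list = field(default_factory=list)

    def healthy_servers(self):
        """Servers in this POP that are currently considered reachable."""
        return [s for s in self.servers if s.healthy]


def route(pops, preferred):
    """Return (pop_name, server_name) for a request, or None if nothing is reachable.

    The preferred POP is tried first; the remaining POPs are tried in list order.
    """
    ordered = [p for p in pops if p.name == preferred]
    ordered += [p for p in pops if p.name != preferred]
    for pop in ordered:
        candidates = pop.healthy_servers()
        if candidates:
            return pop.name, candidates[0].name
    return None
```

With this rule, a single server failure is absorbed inside its POP, and only the loss of every server in a POP moves traffic elsewhere. Spinning up a replacement server and promoting a secondary machine to primary would be separate steps on top of this.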

The following user stories further break down the areas we need to concentrate on to achieve resiliency in a deployed set of LF reactors.

Centralized or decentralized Coordination with Fast Mode Support - Brainstorming

Centralized coordination solves many of the problems in this Epic: it provides consistency, and Fast mode support is a huge plus. However, it has no notion of handling federate failures. For example, if a POP running as a federate goes down, the whole topology comes to a halt. The RTI is also a single point of failure in centralized coordination.

On the other hand, decentralized coordination provides resiliency against federate failures. If a federate goes down, the other federates can keep advancing by making assumptions about physical time. This reliance on physical time rules out Fast mode support under decentralized coordination. In addition, decentralized coordination is prone to inconsistent behavior, in contrast to centralized coordination.
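The trade-off between the two approaches can be illustrated with a deliberately simplified Python sketch. It is not the LF runtime's actual algorithm: the function names, the `reported` dictionary, and the `wait` parameter are all assumptions made for illustration. The centralized rule only lets a tag advance once every federate has reported progress up to that tag, while the decentralized rule advances as soon as physical time has passed the tag plus a fixed wait.

```python
def centralized_can_advance(tag, reported):
    """Centralized rule (simplified): grant `tag` only if every federate
    has reported progress up to at least `tag`.

    `reported` maps federate name -> latest reported tag, or None if the
    federate is unreachable (hypothetical representation).
    """
    return all(t is not None and t >= tag for t in reported.values())


def decentralized_can_advance(tag, physical_now, wait):
    """Decentralized rule (simplified): advance once physical time has
    passed `tag + wait`, regardless of what other federates have reported.
    """
    return physical_now >= tag + wait


# One federate (pop_b) is down and has reported nothing.
reported = {"pop_a": 10, "pop_b": None}
blocked = not centralized_can_advance(5, reported)   # centralized stalls
proceeds = decentralized_can_advance(5, 8, 2)        # decentralized moves on
```

In this model, the failed federate blocks the centralized rule indefinitely, which matches the halt described above. The decentralized rule keeps going, but it depends on `physical_now`, so it has no meaning in Fast mode, and it may advance past a tag for which a slow (not failed) federate later delivers a message, which is where the inconsistency comes from.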

We need to think through these coordination techniques and come up with a solution to handle these problems.

OmerMajNition changed the title ~~User Epic 3: Failure Handling~~ EPIC-3: Failure Handling Oct 17, 2024
