EPIC-3: Failure Handling #74

Open
3 tasks
OmerMajNition opened this issue Oct 17, 2024 · 0 comments


LF should be resilient to failures when production code runs as federates. For example, when a server becomes unreachable (because of a network failure or a hardware failure on the machine), we should be able to spin up another server inside that POP, and the load balancer should be told not to send traffic to the non-responding server. If a whole POP becomes unreachable due to a network failure or a rare hardware fault, the system should be able to route and distribute its traffic to other POPs. Users should also be able to promote a secondary machine to the primary role. Finally, we should plan a resilience mechanism that covers RTI failures as well.
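
To make the server and POP failover requirements above concrete, here is a minimal Python sketch. It is not LF code and not a proposed design: `Server`, `Pop`, and `route` are hypothetical names, health is modeled as a simple flag that a health checker would set, and the policy (first healthy server, preferred POP first) is an assumption chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    healthy: bool = True  # assumed to be maintained by some external health check


@dataclass
class Pop:
    name: str
    servers: list = field(default_factory=list)

    def healthy_servers(self):
        # The load balancer only ever sees servers that still respond.
        return [s for s in self.servers if s.healthy]


def route(pops, preferred):
    """Pick (pop_name, server_name) for one request, or None if nothing is up.

    The preferred POP is tried first; if it has no healthy server,
    traffic falls through to the remaining POPs in list order.
    """
    ordered = [p for p in pops if p.name == preferred]
    ordered += [p for p in pops if p.name != preferred]
    for pop in ordered:
        candidates = pop.healthy_servers()
        if candidates:
            return pop.name, candidates[0].name
    return None


pops = [
    Pop("pop-a", [Server("a1"), Server("a2")]),
    Pop("pop-b", [Server("b1")]),
]
```

With this model, marking `a1` unhealthy moves traffic to `a2` inside the same POP, and marking both down moves it to `pop-b`. Spinning up a replacement server would amount to appending a new `Server` to `Pop.servers`.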

The following user stories further break down the areas we need to concentrate on to achieve resiliency in a deployed set of LF reactors.

Centralized or Decentralized Coordination with Fast Mode Support - Brainstorming

Centralized coordination solves many of the problems in this Epic: it gives us consistency, and its Fast mode support is a huge plus. However, it has no notion of handling federate failures. For example, if a POP running as a federate goes down, the whole topology comes to a halt. The RTI is also a single point of failure in centralized coordination.

Decentralized coordination, on the other hand, gives us resiliency against federate failures. If a federate goes down, the other federates can keep advancing by making assumptions about physical time. That reliance on physical time rules out Fast mode support under decentralized coordination. In addition, decentralized coordination is prone to inconsistent behavior when those assumptions are violated, in contrast to centralized coordination.
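
The trade-off between the two schemes can be illustrated with a toy model. This is a deliberately simplified sketch, not the actual RTI or federate logic: tags are plain integers, and the function names and the `sta` (safe-to-advance offset) parameter are assumptions introduced for illustration.

```python
def centralized_can_advance(tag, upstream_completed):
    """Toy centralized rule: advance to `tag` only once every upstream
    federate has reported completing a tag >= `tag`.

    A crashed federate stops reporting, so its entry never grows and
    every downstream federate stays blocked.
    """
    return all(done >= tag for done in upstream_completed.values())


def decentralized_can_advance(tag, physical_now, sta):
    """Toy decentralized rule: advance once physical time has passed
    `tag + sta`, without consulting any peer.

    A crashed peer cannot block this, but the rule is tied to the
    physical clock, so logical time cannot run ahead of it (no Fast mode).
    """
    return physical_now >= tag + sta


# pop_b crashed after completing tag 3; pop_a is healthy at tag 7.
reports = {"pop_a": 7, "pop_b": 3}
blocked = not centralized_can_advance(5, reports)                 # stuck behind pop_b
proceeds = decentralized_can_advance(5, physical_now=6, sta=1)    # moves on anyway
```

In the decentralized case the federate proceeds to tag 5 without ever hearing from `pop_b`. If a message from `pop_b` with an earlier tag arrives afterwards, it would be processed out of order, which is the inconsistency risk mentioned above.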

We need to think through these coordination techniques and come up with a solution to handle these problems.

@OmerMajNition OmerMajNition changed the title User Epic 3: Failure Handling EPIC-3: Failure Handling Oct 17, 2024