This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

SEP: Master cluster #72

Open · wants to merge 4 commits into base: master
94 changes: 94 additions & 0 deletions 0000-master-cluster.md

Reviewer:

Can you go into more detail on the future of the current HA methods with this in place, as well as the future of Syndic? Also, are there any potential pitfalls to look out for, such as network latency? What kind of throughput will this require? What about split-brain handling?

Contributor Author (@dwoz):

This work is not deprecating any of the current HA functionality nor is it deprecating Syndic.

The network will need to be reliable, and this is called out in the docs. If there is a split-brain problem, the network is not reliable.

@OrangeDog (Sep 11, 2023):

By that definition, no network is reliable. That's why we need HA solutions in the first place.
We at least need to know which way it's going to fail during a network partition and not do something unsafe.

Contributor Author (@dwoz):

As far as consistency and reliability go, there is a huge difference between local networks and WANs. With this design, if a master goes offline for some reason, there is no failure: any minion connections will be routed to a different master by the load balancer. The other masters will still try to forward events, and you will see timeouts in the logs.

Reviewer:

It isn't just about consistency and reliability. If the communication between masters can be broken without them showing as offline, it will happen. At the very least, what that looks like when it happens needs to be documented. I honestly don't think it will break much, since we don't do total bidirectional control, but it needs to be documented.

I can see this happening with the kind of engineer who loves segregating network traffic onto separate LANs: one network for minion communication, one for storage, one for master communication. Then all of a sudden the network admin has a spanning tree go haywire on the master communication network. Both masters will appear up to the minion, and storage still works.

Contributor Author (@dwoz, Sep 13, 2023):

Both masters would not appear up to a minion, because minions connect to the load balancer. I have not been able to break anything by taking masters offline. If you'd like to take the work for a spin and try to cause breakage, please feel free.

Reviewer:

Both masters would appear up to the load balancer too. The only connection that is broken in this scenario is master-master.

Contributor Author (@dwoz):

In the scenario described here, your salt CLI would fail to receive events because they are not being forwarded from the disconnected master. There will be errors in the logs on the disconnected master indicating that it is not able to forward its events to the other master. The job would still finish correctly, and the job cache would contain the correct results of the job.

@@ -0,0 +1,94 @@
- Feature Name: Master Cluster
- Start Date: 2023-08-09
- SEP Status: Draft
- SEP PR: (leave this empty)
- Salt Issue: (leave this empty)

# Summary
[summary]: #summary

Add the ability to create a cluster of Masters that run behind a load balancer.

# Motivation
[motivation]: #motivation

The current [high availability features](https://docs.saltproject.io/en/latest/topics/highavailability/index.html) in the Salt ecosystem allow minions to have backup masters. There are two flavors of multi-master that can be configured on a minion.

Minions can connect to [multiple masters simultaneously](https://docs.saltproject.io/en/latest/topics/tutorials/multimaster.html).

<img src='/diagrams/000-multi-master.png' width='400px'>

Minions can also be configured to connect to one master at a time [using fail over](https://docs.saltproject.io/en/latest/topics/tutorials/multimaster_pki.html#multiple-masters-for-a-minion).

<img src='/diagrams/000-multi-master-failover.png' width='400px'>

This results in jobs targeting lots of minions being pinned to a single master. Another drawback to the current HA implementation is that minions need to be re-configured to add or remove masters.


<img src='/diagrams/000-mm-large-job.png' width='400px'>

It would be much better if jobs could scale across multiple masters.


<img src='/diagrams/000-mc-large-job.png' width='400px'>

# Design
[design]: #detailed-design

In order to accomplish this, we will need to change the way jobs execute.
Currently, new jobs are sent directly to the publish server from the request
server.

<img src='/diagrams/000-current-job-pub.png' width='400px'>

If we forward IPC events between masters, the return flow can be shared, as shown below:


<img src='/diagrams/000-cluster-job-pub.png' width='400px'>

To get job publishes to work, we need to make sure publishes also travel over the IPC event bus.


<img src='/diagrams/000-cluster-fwd.png' width='400px'>
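
To make the forwarding concrete, here is an illustrative sketch (not taken from the implementation) of the two kinds of events that would travel between masters: the job publish and a minion's return. The `salt/job/<jid>/new` and `salt/job/<jid>/ret/<minion_id>` tags follow Salt's existing event conventions; the payload fields shown are assumptions for illustration only.

```yaml
# Hypothetical events forwarded over the master event bus.
# Field names are illustrative, not the actual wire format.

# Job publish, forwarded so every master publishes it to its connected minions:
- tag: salt/job/20230809123456789012/new
  data:
    jid: '20230809123456789012'
    fun: test.ping
    tgt: '*'
    user: root

# Job return, forwarded so the master that accepted the CLI request sees
# returns that arrived at a different master and can fill its job cache:
- tag: salt/job/20230809123456789012/ret/minion-01
  data:
    jid: '20230809123456789012'
    id: minion-01
    fun: test.ping
    return: true
    success: true
```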

Jobs can come and go through any of the masters in our pool. From a minion's perspective, all of the masters in the pool are identical. We can remove the need for minions to know about multiple masters by putting the pool behind a load balancer; minions will not need to be reconfigured to add master resources.


<img src='/diagrams/000-cluster-arch.png' width='400px'>
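
As a minimal sketch, assuming the load balancer is reachable at a hypothetical address such as `cluster-lb.example.com`, each minion's configuration only needs to point at that address; masters can be added to or removed from the pool without touching the minions:

```yaml
# /etc/salt/minion (hypothetical example)
# The minion only knows the load balancer's address, not the individual
# masters behind it.
master: cluster-lb.example.com
```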


Events from masters, including job returns, are sent to all masters in the cluster. This requires that all masters run on a stable local network.

<img src="/diagrams/000-master-event-bus.png" width="400px">

> [!IMPORTANT]
> The current work for this SEP can be found [here](https://github.com/saltstack/salt/pull/64936)

### Cluster communication

Each master in a cluster will retain its own public/private keypair as well as its own AES session key, stored in memory on the master. In addition, a new `cluster_pki_dir` configuration option is added. The cluster will maintain a cluster-wide public/private keypair and a cluster-wide AES session key, which will be used for minion communication. Each master in the cluster will publish a copy of its public key in `<cluster_pki_dir>/peers`. Minion public keys will also be stored in `cluster_pki_dir` when in cluster mode. The same code used for master/minion communication can be used to secure the master event bus.
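
As a rough sketch of how this might look in configuration and on disk: only `cluster_pki_dir` is named by this SEP, so the peer-list option and the paths below are illustrative placeholders rather than the final interface (see the linked PR for the actual implementation).

```yaml
# /etc/salt/master (hypothetical per-master sketch)
cluster_pki_dir: /var/cache/salt/master/cluster_pki   # named in this SEP
cluster_peers:                 # placeholder name for the other masters in the pool
  - 10.0.0.11
  - 10.0.0.12

# Illustrative resulting layout:
#   <cluster_pki_dir>/peers/10.0.0.11.pub   -> each peer master's published public key
#   <cluster_pki_dir>/minions/              -> accepted minion keys when in cluster mode
```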

Master event bus communications will be secured using each master's own keypair and AES session key. A new channel, `salt.channels.server`, will be added to handle the additional communications logic while leveraging our existing generic transports. This will also allow a cluster to take advantage of new transports (RabbitMQ) or additional functionality (client TLS certificates) as those features become available.

Communication with the minions connected to the cluster is secured with the cluster-wide keypair and AES session key.


## Alternatives
[alternatives]: #alternatives

We currently have two alternatives for achieving high availability. This is a
third, more robust approach that alleviates the issues with the current
options. It is not intended to deprecate the current HA functionality.


## Unresolved questions
[unresolved]: #unresolved-questions

None at this time.

# Drawbacks
[drawbacks]: #drawbacks

The biggest drawback is that we will need to maintain three ways of doing HA.
This adds complexity; however, if successful, we can potentially deprecate
some or all of the existing HA functionality.
Binary file added diagrams/000-cluster-arch.png
Binary file added diagrams/000-cluster-fwd.png
Binary file added diagrams/000-cluster-job-pub.png
Binary file added diagrams/000-current-job-pub.png
Binary file added diagrams/000-master-event-bus.png
Binary file added diagrams/000-mc-large-job.png
Binary file added diagrams/000-mm-large-job.png
Binary file added diagrams/000-multi-master-failover.png
Binary file added diagrams/000-multi-master.png