-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add rollout API spec #897
Conversation
d65db0e
to
195a9b0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments, PTAL 🙏
// StageRolloutConditionType identifies a specific condition of the StageRollout. | ||
type StageRolloutConditionType string | ||
|
||
const ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Ryan! Just for clarification, from the comments it seems that StageRolloutWaiting
is just a special case of StageRollingOut
being true?
5cc24ae
to
7fd558b
Compare
// the common post rollout tasks. This setting will override the common post rollout tasks if there are | ||
// duplicate type of tasks with different settings. | ||
// +optional | ||
PostRolloutTasks PostRolloutTask `json:"postRolloutTasks,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we will run the PostRolloutTasks first then the common PostRolloutTasks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the common one for now to simplify
} | ||
|
||
// StageRolloutConfig describes a single rollout stage group configuration. | ||
type StageRolloutConfig struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we rollout the selected clusters within a stage?
sequentially or concurrently?
type PostRolloutTask struct { | ||
// The name of the RolloutApprovalRequest we need to create to get the approval for the next stage. | ||
// An RolloutApprovalRequest object will be created if this is not empty. | ||
// +optional | ||
ApprovalRequestName string `json:"approvalTask,omitempty"` | ||
|
||
// The time to wait after all the clusters in the current stage complete the update | ||
// before we move to the next stage. | ||
// +optional | ||
WaitTime *metav1.Duration `json:"waitTime,omitempty"` | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type PostRolloutTask struct { | |
// The name of the RolloutApprovalRequest we need to create to get the approval for the next stage. | |
// An RolloutApprovalRequest object will be created if this is not empty. | |
// +optional | |
ApprovalRequestName string `json:"approvalTask,omitempty"` | |
// The time to wait after all the clusters in the current stage complete the update | |
// before we move to the next stage. | |
// +optional | |
WaitTime *metav1.Duration `json:"waitTime,omitempty"` | |
} | |
type PostRolloutTask struct { | |
type TaskType // approval, wait | |
// The time to wait after all the clusters in the current stage complete the update | |
// before we move to the next stage. | |
// +optional | |
WaitTime *metav1.Duration `json:"waitTime,omitempty"` | |
} |
PostRolloutTasks []PostRolloutTask
Then user can define their own sequence. Otherwise, it's hard to tell which one runs first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My current idea is that they run at the same time. For example, we don't wait before or after the approval, the wait time is the minimum time needed between stages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I thought about the type pattern. The problem with that is each type requires a distinct set of parameters which pretty much makes the taskType
field redundant.
// There can only be one active rollout for each ClusterResourcePlacement, | ||
// we will fail the rollout if there is already an active rollout. | ||
// +required | ||
ParentCRPName string `json:"parentCRPName"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm thinking of the other way. CRP refers to the StageRollout name.
StageRollout will be a common rollout configurations for multiple CRP.
All the rollout status will be part of the CRP status, like today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if that would work as different CRPs can pick different clusters which means the stages are different between them.
I am actually thinking the other way around, one CRP can have multiple rollouts just like our AKS rollout. We have daily releases to canary only and one or two weekly releases going on simultaneously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this field? This is auto created for a CRP. Its owner reference should contain the CRP. Just like pod. Do we need a deployment name field in a pod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not auto created for a CRP. This CR is created by user to trigger an update run.
33331bd
to
21783bf
Compare
|
||
// StagedRolloutRun defines a stage by stage rollout run that is applied to | ||
// the clusters that the corresponding ClusterResourcePlacement selects. | ||
type StagedRolloutRun struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RollingUpdate -> RollingUpdateConfig in the current API.
Use StagedUpdate -> StagedUpdateRun, StagedUpdateStrategy in the new API.
) | ||
|
||
// +genclient | ||
// +genclient:namespaced |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the resource namespaced if it is used by CRP which is not namespaced? How to decide which namespace a run should be in?
// +kubebuilder:resource:scope="Namespaced",categories={fleet,fleet-placement} | ||
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object | ||
|
||
// StagedRolloutRun defines a stage by stage rollout run that is applied to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who creates this resource?
I assume it is auto created whennever we need to roll out a change in CRP. If that's the case, CRP must contain StagedUpdateConfig (similar with RollingUpdateConfig), which includes at least one field, a reference to StagedUpdateStrategy (as strategy must be provide by customers, not auto generated by us).
// There can only be one active rollout for each ClusterResourcePlacement, | ||
// we will fail the rollout if there is already an active rollout. | ||
// +required | ||
ParentCRPName string `json:"parentCRPName"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this field? This is auto created for a CRP. Its owner reference should contain the CRP. Just like pod. Do we need a deployment name field in a pod?
// The name of the rollout strategy that specifies the stages and the sequence in which | ||
// the application will be updated on the member clusters. | ||
// +required | ||
StagedRolloutStrategy string `json:"stagedRolloutStrategy"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be the first field?
// - `ParentStageRollout` which points to its owner stage rollout. | ||
// - `TargetStage` which is the name of the stage that this approval request is for. | ||
// - `IsLatestRolloutApproval` which indicates whether this approval request is the latest one related to this rollout | ||
type RolloutApprovalRequest struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which namespace should the rollout approve request be in? Will it be in the same one as rollout update run? How do we manage access control of approve requests vs runs vs strategies?
21783bf
to
228f754
Compare
// The clusters in each stage are ordered following its order to be rolled out. | ||
// The updateRun will stop if the cluster stages change during the update. | ||
// +required | ||
ComputedStages [][]string `json:"currentStages"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use []StageStatus which contains each cluster's status. See this one: https://learn.microsoft.com/en-us/rest/api/fleet/update-runs/get?view=rest-fleet-2023-10-15&tabs=HTTP
// Therefore, we need to keep track of the clusters that we have applied the resources to but no long is selected. | ||
// We remove the application from these clusters after all the stages are completed and remove them from this list. | ||
// +required | ||
ToBeRevertedClusters []string `json:"toBeRevertedClusters"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DeletionStageStatus StageStatus
contains a list of clusters whose work will be deleted ordered by the given sort label key.
// The clusters in each stage are ordered following its order to be rolled out. | ||
// The updateRun will stop if the cluster stages change during the update. | ||
// +required | ||
ComputedStages [][]string `json:"currentStages"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add UpdateStatus at each level: cluster, stage, run.
// There can only be one active rollout for each ClusterResourcePlacement, | ||
// we will fail the rollout if there is already an active rollout. | ||
// +required | ||
ParentCRPName string `json:"parentCRPName"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not auto created for a CRP. This CR is created by user to trigger an update run.
// The clusters in each stage are ordered following its order to be rolled out. | ||
// The updateRun will stop if the cluster stages change during the update. | ||
// +required | ||
ComputedStages [][]string `json:"currentStages"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add cluster level detail
|
||
// The override snapshot index that this update run is based on. | ||
// +required | ||
OverrideSnapShotIndex string `json:"overrideSnapShotIndex"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the override, we don't support snapshot index because there will be always one snapshot per cro or ro. Here user needs to configure a list of ROs snapshot or CROs snapshot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do users find out which ROs and CROs apply to this given placement? Listing them all seems a very cumbersome task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has to go to the status field instead.
|
||
// The override snapshot index that this update run is based on. | ||
// +required | ||
OverrideSnapShotIndex string `json:"overrideSnapShotIndex"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do users find out which ROs and CROs apply to this given placement? Listing them all seems a very cumbersome task.
// Conditions is an array of current observed conditions for the specific type of post update task. | ||
// Known conditions are "ApprovalRequestCreated", "WaitTimeElapsed", and "ApprovalRequestApproved". | ||
// +kubebuilder:validation:Optional | ||
Conditions []metav1.Condition `json:"conditions"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you reuse UpdateStatus instead of Conditions? Maybe rename to StepStatus? Everything: a stage, a cluster, a after stage task is a step in the staged update run.
You also need to record start time and complete time of the step. This is especially useful for TimedWait so the user knows when the wait started and when it ended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convention in k8s is to use conditions to record information for each state (running, wait, completed). I think it's better to at least keep the overall state in condition instead of a single status. We can convert all the state back into conditions too but it's little bit too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this? We use conditions at the updaterun level, but use updatestatus for stages and below.
Better to have a start time and end time for these AfterStageTasks too. So it is better to use UpdateStatus instead conditions for AfterStageTasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition states (ApprovalRequestCreated, ApprovalRequestApproved, WaitTimeElapsed) in the afterStageTasks are different from the update status so it's hard to reuse it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NotStarted -> (created approval request) -> waiting -> (approve request) -> completed
NotStarted -> (start wait time) -> waiting -> (wait time has passed) -> completed
UpdateRun is a tree of steps: a step can be a run, a stage, a cluster, a after-stage task.
It is better to have the same state model for each step so customers don't need to learn new state/condition model for each step.
That's why we should have the same state for the run and each after-stage task as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fleet RP also have the wait time step and use the same state data model.
// for this stage. There cannot be overlapping clusters between stages when the stagedUpdateRun is created. | ||
// If the label selector is absent, all the selected clusters are selected. | ||
// +kubebuilder:validation:Optional | ||
LabelSelector *metav1.LabelSelector `json:"labelSelector"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are going to do an intersection between the clusters selected by the scheduler and the clusters selected here right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
// - `ParentStagedUpdateRun` which points to its owner stage update. | ||
// - `TargetStage` which is the name of the stage that this approval request is for. | ||
// - `IsLatestUpdateRunApproval` which indicates whether this approval request is the latest one related to this update run. | ||
type StageApprovalRequest struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little bit confused: how will this request get approved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
user needs to patch the "Approved" condition of the request as true
@@ -346,6 +368,12 @@ const ( | |||
// - "False" means the staged update run is not waiting between stages. | |||
// - "Unknown" means it is unknown. | |||
StagedUpdateRunConditionWaitingBetweenStage StagedUpdateRunConditionType = "WaitingBetweenStage" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the "Running" and "WaitingBetweenStage" conditions? they could be "reason" when the completed is "false".
"Running", "WaitingBetweenStage" and "Completed" all these conditions are delivering some duplicate information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With "running" and "waittingBetweenStage" conditions, we will get the information like last time the stage has completed. If we collapse them into one condition then we lost a bit of the history.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need both conditions and state? State serves the same purpose as conditions. State contains starttime and endtime.
RunStatus {
runState State
stages []StageStatus
deletionStage StageStatus
}
StageStatus {
stageName string
stageState State
clusters []ClusterStatus
afterStageTasks []TaskStatus
}
ClusterStatus {
clusterName string
clusterState State
}
TastkStatus {
taskType
taskState State
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's k8s convention in the status to use condition to show the state transition which is pretty much equivalent to the runState. I think using state is more efficient and acceptable as an internal field but we better to keep condition as the first layer fields status. If we really want to unify them then they all should be conditions like in CPR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with Liqian about the detailed stage information. They are already part of the StageStatus.
For the conditions in the "StagedUpdateRunStatus", can we treat it as overall condition/status? so that user can query this condition to know about whether the updateRun has completed or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the conditions in the "StagedUpdateRunStatus" are the overall condition/status of the updateRun.
dc4e9b0
to
2563c04
Compare
@@ -346,6 +368,12 @@ const ( | |||
// - "False" means the staged update run is not waiting between stages. | |||
// - "Unknown" means it is unknown. | |||
StagedUpdateRunConditionWaitingBetweenStage StagedUpdateRunConditionType = "WaitingBetweenStage" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need both conditions and state? State serves the same purpose as conditions. State contains starttime and endtime.
RunStatus {
runState State
stages []StageStatus
deletionStage StageStatus
}
StageStatus {
stageName string
stageState State
clusters []ClusterStatus
afterStageTasks []TaskStatus
}
ClusterStatus {
clusterName string
clusterState State
}
TastkStatus {
taskType
taskState State
}
type StagedUpdateRunSpec struct { | ||
// A reference to the placement that this update run is applied to. | ||
// There can be multiple active update runs for each placement but | ||
// it's up to the devOps to make sure they don't conflict with each other. | ||
// +kubebuilder:validation:Required | ||
PlacementRef PlacementReference `json:"placementRef"` | ||
|
||
// The resource snapshot index that this update run is based on. | ||
// The snapshot index of the selected resources to be updated across clusters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you give a concrete example on what you mean by the snaptshot index?
Something like:
e.g., for resource snapshots "xxx-123-1" and "xxx-123-2", "123" is the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that you used policysnaptshot name instead of index below. Is that because there are N resource snapshot for a single version of selected resources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, for each resource snapshot index, there are N resource snapshots. The index is "123" in the example you gave. There is only 1 policy snapshot object for a snapshot of the CRP placement policy.
// Conditions is an array of current observed conditions for the specific type of post update task. | ||
// Known conditions are "ApprovalRequestCreated", "WaitTimeElapsed", and "ApprovalRequestApproved". | ||
// +kubebuilder:validation:Optional | ||
Conditions []metav1.Condition `json:"conditions"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this? We use conditions at the updaterun level, but use updatestatus for stages and below.
Better to have a start time and end time for these AfterStageTasks too. So it is better to use UpdateStatus instead conditions for AfterStageTasks.
@@ -346,6 +368,12 @@ const ( | |||
// - "False" means the staged update run is not waiting between stages. | |||
// - "Unknown" means it is unknown. | |||
StagedUpdateRunConditionWaitingBetweenStage StagedUpdateRunConditionType = "WaitingBetweenStage" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with Liqian about the detailed stage information. They are already part of the StageStatus.
For the conditions in the "StagedUpdateRunStatus", can we treat it as overall condition/status? so that user can query this condition to know about whether the updateRun has completed or not.
// Conditions is an array of current observed conditions for the specific type of post update task. | ||
// Known conditions are "ApprovalRequestCreated", "WaitTimeElapsed", and "ApprovalRequestApproved". | ||
// +kubebuilder:validation:Optional | ||
Conditions []metav1.Condition `json:"conditions"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NotStarted -> (created approval request) -> waiting -> (approve request) -> completed
NotStarted -> (start wait time) -> waiting -> (wait time has passed) -> completed
UpdateRun is a tree of steps: a step can be a run, a stage, a cluster, a after-stage task.
It is better to have the same state model for each step so customers don't need to learn new state/condition model for each step.
That's why we should have the same state for the run and each after-stage task as well.
// Conditions is an array of current observed conditions for the specific type of post update task. | ||
// Known conditions are "ApprovalRequestCreated", "WaitTimeElapsed", and "ApprovalRequestApproved". | ||
// +kubebuilder:validation:Optional | ||
Conditions []metav1.Condition `json:"conditions"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fleet RP also have the wait time step and use the same state data model.
// - `ParentStagedUpdateRun` which points to its owner stage update. | ||
// - `TargetStage` which is the name of the stage that this approval request is for. | ||
// - `IsLatestUpdateRunApproval` which indicates whether this approval request is the latest one related to this update run. | ||
type StageApprovalRequest struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend to use ApprovalRequest instead of StageApprovalRequest.
Add TargetStep in the spec. A step can be a stage, a group or a cluster.
StepReference {
RunName
StageName
(future) GroupName
(future) ClusterName
}
b4e840f
to
0c9519e
Compare
e718378
to
cec70f3
Compare
Description of your changes
An overall idea of rollout API
Fixes #
I have:
make reviewable
to ensure this PR is ready for review.How has this code been tested
Special notes for your reviewer