Standalone medusa pod #980
Conversation
Codecov Report
@@ Coverage Diff @@
## main #980 +/- ##
==========================================
- Coverage 57.91% 57.45% -0.47%
==========================================
Files 98 99 +1
Lines 9655 10013 +358
==========================================
+ Hits 5592 5753 +161
- Misses 3594 3767 +173
- Partials 469 493 +24
I've left a few comments; I'm not yet sure how to test this or whether you're ready for me to kick off manual tests.
Suggestion: I have one more general comment too, which is non-blocking, but hrmmmm... The way you're creating the Medusa containers/deployment feels convoluted/tortured to me. Each of them seems to go through multiple phases of mutation, e.g. UpdateMedusaInitContainer, medusaInitContainerResources, CreateMedusaMainContainer, setImage, medusaVolumeMounts, GenerateMedusaVolumes, medusaEnvVars, etc. This is less declarative than I sense it could be.
Issue: There are also code paths that may not be needed anymore. E.g. do we still allow local storage here?
Suggestion: Maybe Medusa just has a lot of configurables, but I wonder if better API design on the CRD (so that it uses more k8s native types) together with more extensive usage of goalesce might allow for better configurability while simplifying the reconciliation code paths?
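To make the suggestion concrete, here is a non-authoritative sketch of the more declarative shape being hinted at: the whole main container built in one literal from an already-resolved config. The medusaConfig type and its field set are assumptions for illustration, not the PR's actual helpers.

```go
package medusa

import corev1 "k8s.io/api/core/v1"

// medusaConfig is a hypothetical, already-resolved view of the Medusa settings
// (defaults merged with user overrides), used purely for illustration.
type medusaConfig struct {
	Image          string
	Resources      corev1.ResourceRequirements
	EnvVars        []corev1.EnvVar
	VolumeMounts   []corev1.VolumeMount
	ReadinessProbe *corev1.Probe
	LivenessProbe  *corev1.Probe
}

// newMedusaMainContainer builds the container in one declarative literal
// instead of mutating it through several helper calls.
func newMedusaMainContainer(cfg medusaConfig) corev1.Container {
	return corev1.Container{
		Name:           "medusa",
		Image:          cfg.Image,
		Resources:      cfg.Resources,
		Env:            cfg.EnvVars,
		VolumeMounts:   cfg.VolumeMounts,
		ReadinessProbe: cfg.ReadinessProbe,
		LivenessProbe:  cfg.LivenessProbe,
	}
}
```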
- …ing part of labels directly
- Fix restore hostmap test
- Temporarily disable flaky test
- Replace the prepare_restore tasks with an internal computation of the restore mapping
- Use the new Medusa build
- Fix failing unit tests
- Fix tests with use of the right context for Medusa Standalone depl checks
- Change Medusa probe settings to use an actual Probe object and using goalesce to merge with the defaults
- Revert adding labels in the generic reconciler
},
Spec: appsv1.DeploymentSpec{
	Selector: &metav1.LabelSelector{
		MatchLabels: map[string]string{"app": medusa.MedusaStandaloneDeploymentName(kc.SanitizedName(), dcName)},
I'm a little confused about this one: shouldn't kc.SanitizedName just be kc.meta.name?
It should not. We name dependent objects based on the sanitized name, not the kc name. This way the objects have the same prefix names as the Cassandra statefulset pods.
My understanding was that the SanitizedName was for getting a DNS-compatible k8s resource name from a Cassandra cluster name (which can contain more characters). But we should be naming dependent objects based on the k8s resource names, shouldn't we?
As discussed offline, this is not the path we took with cluster/dc name overrides.
The SanitizedName will pick the override if it exists and sanitize it, or the resource name if not.
And all dependent objects will be named based on this.
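As an illustration of the naming scheme described here, a minimal sketch follows; the helper names, suffix, and sanitization rule are assumptions, not the operator's exact implementation.

```go
package naming

import (
	"regexp"
	"strings"
)

var invalidChars = regexp.MustCompile(`[^a-z0-9-]`)

// sanitizedName picks the cluster name override when one is set, otherwise the
// resource name, and strips characters that are not valid in k8s object names.
func sanitizedName(resourceName, clusterNameOverride string) string {
	name := resourceName
	if clusterNameOverride != "" {
		name = clusterNameOverride
	}
	return invalidChars.ReplaceAllString(strings.ToLower(name), "")
}

// medusaStandaloneDeploymentName illustrates how dependent objects share the
// sanitized prefix used by the Cassandra StatefulSet pods.
func medusaStandaloneDeploymentName(sanitizedClusterName, dcName string) string {
	return sanitizedClusterName + "-" + dcName + "-medusa-standalone"
}
```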
// It uses the Medusa client to get the host map for the restore operation, using the Medusa standalone service to get the backup topology.
func (r *MedusaRestoreJobReconciler) prepareRestore(ctx context.Context, request *medusa.RestoreRequest) (*medusav1alpha1.MedusaRestoreMapping, error) {
	medusaClient, err := r.ClientFactory.NewClient(medusaServiceUrl(
		request.Datacenter.Spec.ClusterName,
Observation: We really need to start distinguishing between the ClusterName for the k8s resource and the Cassandra cluster name. It's impossible to tell what is going on throughout these codebases sometimes :/
nitpick: The right label here is thought: instead of Observation: 😉
The backup and restore operations are directly related to a CassandraDatacenter object; they have no connection to a K8ssandraCluster object. The only way to get the cluster name, which is used to build the service URL, is .spec.ClusterName here.
I understand there can be some confusion, but both options we had came with their share of confusion; we just picked one over the other 😅
The only way to get the cluster name, which is used to build the service url, is .spec.ClusterName here.
Do you mean .spec.cassandra.clusterName? That can be set differently from the k8s resource name and doesn't need to bear any relation to it, AFAIK.
I believe you could possibly use .metadata.labels to get the parent cluster name.
Conversation resolved offline: we don't want to use the k8c resource name here; we really need the sanitized name.
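A minimal sketch of how the standalone service URL could be assembled from the sanitized name, per this resolution; the helper name, service-name suffix, and gRPC port below are illustrative assumptions, not the PR's actual medusaServiceUrl implementation.

```go
package medusa

import "fmt"

// medusaStandaloneServiceUrl builds an in-cluster DNS name for the standalone
// Medusa gRPC service from the sanitized cluster name and the dc name.
// The "-medusa-service" suffix and the port are assumptions for illustration.
func medusaStandaloneServiceUrl(sanitizedClusterName, dcName, namespace string, grpcPort int) string {
	return fmt.Sprintf("%s-%s-medusa-service.%s.svc:%d", sanitizedClusterName, dcName, namespace, grpcPort)
}
```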
pkg/k8ssandra/util.go
@@ -29,3 +35,27 @@ func GetDatacenterForDecommission(kc *api.K8ssandraCluster) string {
	return ""
}

func GetK8ssandraClusterFromCassDc(ctx context.Context, cassdc *cassdcapi.CassandraDatacenter, client client.Client, logger logr.Logger) *api.K8ssandraCluster {
Issue: this is wrong, and is causing the bug I found in testing. See my testing comments for more details.
Very good catch!!
We need to use the sanitized names instead for both the cluster and the dc, otherwise we'll end up with either the wrong name or a name that's incompatible with kubernetes.
I'll fix this.
See my comments above, I suspect we need to discuss the use of sanitizedName and similar in this PR. I think what you need to use is actually the cluster name from the labels throughout.
even worse, this is dead code :)
I'm removing this.
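For reference, the label-based lookup suggested earlier in this thread might look roughly like the sketch below; the label keys and import paths are assumptions and would need to match the part-of labels the operator actually sets on the CassandraDatacenter.

```go
package k8ssandra

import (
	"context"

	cassdcapi "github.com/k8ssandra/cass-operator/apis/cassandra/v1beta1"
	api "github.com/k8ssandra/k8ssandra-operator/apis/k8ssandra/v1alpha1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getParentK8ssandraCluster reads the parent cluster's name and namespace from
// labels on the CassandraDatacenter and fetches the K8ssandraCluster.
func getParentK8ssandraCluster(ctx context.Context, c client.Client, cassdc *cassdcapi.CassandraDatacenter) (*api.K8ssandraCluster, error) {
	labels := cassdc.GetLabels()
	key := types.NamespacedName{
		Name:      labels["k8ssandra.io/cluster-name"],      // assumed label key
		Namespace: labels["k8ssandra.io/cluster-namespace"], // assumed label key
	}
	kc := &api.K8ssandraCluster{}
	if err := c.Get(ctx, key, kc); err != nil {
		return nil, err
	}
	return kc, nil
}
```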
I've left a number of comments above re: the implementation details. Black box testing reveals a bug.
Procedure
I got the following error back from the k8ssandra-operator pod:
Somewhere (I think perhaps here) in the codebase you're using the …
SonarCloud Quality Gate failed. 0 Bugs. No Coverage information.
Oops, looks like I approved the old PR. My comments were as follows:
Approving this one now.
What this PR does:
The creation of the main Medusa container has been separated from its update process. The medusa.CreateMedusaMainContainer method now handles creation, while medusa.UpdateMedusaMainContainer only updates the created container.
The creation and updating of Medusa volumes has been improved. Volumes are generated using the medusa.GenerateMedusaVolumes method and then updated or added to the CassandraDatacenter configuration (dcConfig).
Both of these changes allow creating a standalone Medusa pod via the medusa.StandaloneMedusaDeployment method, by reusing the previously created Medusa main container and volumes.
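A rough sketch of that reuse, under assumed names and labels (not the PR's exact medusa.StandaloneMedusaDeployment code):

```go
package medusa

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// standaloneMedusaDeployment reuses an already-built Medusa container and its
// volumes to define the standalone Deployment. The name and "app" label key
// are illustrative assumptions.
func standaloneMedusaDeployment(name, namespace string, medusaContainer corev1.Container, volumes []corev1.Volume) *appsv1.Deployment {
	labels := map[string]string{"app": name}
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace, Labels: labels},
		Spec: appsv1.DeploymentSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{medusaContainer},
					Volumes:    volumes,
				},
			},
		},
	}
}
```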
14/6/23 update: The local storage option for Medusa has been removed. It was only available for testing purposes, but since we now use a separate pod to compute restore mappings, local storage is no longer an option. As such, no more additionalVolumes are added to the CassandraDatacenter definition as part of reconciling Medusa. This is a breaking change.
A standalone Medusa service is also created and reconciled. The medusa.StandaloneMedusaService method creates the service, which is then reconciled like other resources.
The reconciliation process for the Medusa ConfigMap has been updated. It now includes a key to the K8ssandraCluster resource, which can be used for owner references or other purposes.
Support was added for the definition of readiness and liveness probe settings for Medusa containers. This allows developers to specify custom probe parameters, such as initialDelaySeconds, periodSeconds, failureThreshold, successThreshold, and timeoutSeconds.
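As a rough illustration of that defaulting (the PR merges with goalesce; this sketch spells the merge out by hand, and the default values shown are assumptions):

```go
package medusa

import corev1 "k8s.io/api/core/v1"

// defaultMedusaProbe holds illustrative defaults; the PR's actual defaults and
// merge (done with goalesce) may differ.
func defaultMedusaProbe() *corev1.Probe {
	return &corev1.Probe{
		InitialDelaySeconds: 10,
		PeriodSeconds:       10,
		FailureThreshold:    10,
		SuccessThreshold:    1,
		TimeoutSeconds:      1,
	}
}

// mergeProbe overlays the non-zero fields of the user-supplied probe onto the
// defaults, field by field.
func mergeProbe(userProbe *corev1.Probe) *corev1.Probe {
	merged := defaultMedusaProbe()
	if userProbe == nil {
		return merged
	}
	if userProbe.InitialDelaySeconds != 0 {
		merged.InitialDelaySeconds = userProbe.InitialDelaySeconds
	}
	if userProbe.PeriodSeconds != 0 {
		merged.PeriodSeconds = userProbe.PeriodSeconds
	}
	if userProbe.FailureThreshold != 0 {
		merged.FailureThreshold = userProbe.FailureThreshold
	}
	if userProbe.SuccessThreshold != 0 {
		merged.SuccessThreshold = userProbe.SuccessThreshold
	}
	if userProbe.TimeoutSeconds != 0 {
		merged.TimeoutSeconds = userProbe.TimeoutSeconds
	}
	return merged
}
```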
In the cleanup.go file, the checkDeletion function now includes calls to the newly introduced deleteDeployments and deleteServices methods. These methods aim to delete all Kubernetes Services and Deployments associated with the K8ssandraCluster in question. If the deletion of any of these resources fails, an error is logged and the function will return true, indicating the presence of errors.
The deleteServices method lists all services part of the K8ssandraCluster, as determined by label selectors, and attempts to delete them. Similarly, the deleteDeployments method lists and tries to delete all deployments associated with a deleted K8ssandraCluster.
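A hedged sketch of that label-based cleanup, assuming controller-runtime's client and illustrative label inputs; deleteDeployments would follow the same pattern with an appsv1.DeploymentList.

```go
package k8ssandra

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteServicesSketch lists Services carrying the cluster's part-of labels in
// the given namespace and deletes each one, returning true if anything failed.
func deleteServicesSketch(ctx context.Context, c client.Client, namespace string, partOfLabels map[string]string) bool {
	hasErrors := false
	services := &corev1.ServiceList{}
	if err := c.List(ctx, services, client.InNamespace(namespace), client.MatchingLabels(partOfLabels)); err != nil {
		return true
	}
	for i := range services.Items {
		if err := c.Delete(ctx, &services.Items[i]); err != nil {
			hasErrors = true
		}
	}
	return hasErrors
}
```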
This pull request also introduces an enhancement to the generic reconciliation process. The changes aim to improve the labeling of reconciled objects, linking them back to the associated K8ssandra object: the ReconcileObject function signature has been updated to take an additional parameter, klKey. This parameter represents the key of a K8ssandra object, which includes its name and namespace.
Inside the ReconcileObject function, if a klKey is provided, the reconciled object is given a new set of labels. These labels are generated using the labels.PartOfLabels function with the provided klKey. These labels signify that the reconciled object is part of the associated K8ssandra object which then allows proper cleanup of resources upon deletion of the K8ssandraCluster object.
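An illustrative sketch of that labeling step, with assumed label keys (the actual labels.PartOfLabels helper may use different keys and values):

```go
package reconciliation

import (
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyPartOfLabels attaches labels tying the reconciled object back to the
// K8ssandraCluster identified by klKey, so cleanup can later select it.
func applyPartOfLabels(obj client.Object, klKey types.NamespacedName) {
	labels := obj.GetLabels()
	if labels == nil {
		labels = make(map[string]string)
	}
	labels["k8ssandra.io/cluster-name"] = klKey.Name           // assumed label key
	labels["k8ssandra.io/cluster-namespace"] = klKey.Namespace // assumed label key
	obj.SetLabels(labels)
}
```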
Which issue(s) this PR fixes:
Fixes #483
Fixes #486
Fixes #821
Fixes #990
Checklist