flux reconcile on helmrelease leaves erroneously modified object in bad state #267

Closed
scott-grimes opened this issue May 14, 2021 · 8 comments · Fixed by #823
Comments

@scott-grimes

Describe the bug

A HelmRelease contains a cronjob as part of a release. If we edit that cronjob in Kubernetes and then run flux reconcile -n <namespace> helmrelease <hr name> --with-source, the edited cronjob is not reverted to the correct cronjob from the chart.

To Reproduce

Steps to reproduce the behavior:

  1. HelmRelease is READY, and helm shows STATUS as deployed
$ kubectl get hr
NAME             READY   STATUS                             AGE
blerg           True    Release reconciliation succeeded   9d
$ helm ls
NAME          	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART               	APP VERSION
blerg        	prod     	224     	2021-05-14 19:30:24.051846927 +0000 UTC	deployed	blerg-0.2.0        	1.0
  2. I can see my cronjob spec in the cluster; it is exactly what helm says it should be
$ kubectl get cronjob myjob -o yaml
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    ...
###
$ helm get all blerg > manifests.yaml
# grep for CronJob
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    ...
  3. Manually edit the cronjob, adding a "command" section to the container spec
$ kubectl edit cronjob myjob
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    command:
    - /bin/bash
    - -c
    ...
  4. Reconcile, which should replace the edited job with the correct job
$ flux reconcile -n prod helmrelease blerg --with-source
► annotating GitRepository myrepo in flux-system namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ GitRepository reconciliation completed
✔ fetched revision master/<sha>
► annotating HelmRelease blerg in prod namespace
✔ HelmRelease annotated
◎ waiting for HelmRelease reconciliation
✔ HelmRelease reconciliation completed
✔ reconciled revision 0.2.0
  5. I can see reconciliation working, the update hooks are running, etc.
$ kubectl get hr 
NAME             READY     STATUS                             AGE
blerg           Unknown   Reconciliation in progress   
...
$ helm ls -a
NAME          	NAMESPACE	REVISION	UPDATED                                	STATUS         	CHART               	APP VERSION
blerg        	prod     	225     	2021-05-14 19:43:50.894886714 +0000 UTC	pending-upgrade	blerg-0.2.0        	1.0 
  6. I wait for helm to complete. HelmRelease is READY, and helm shows STATUS as deployed
$ kubectl get hr
NAME             READY   STATUS                             AGE
blerg           True    Release reconciliation succeeded   9d
$ helm ls
NAME          	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART               	APP VERSION
blerg        	prod     	225     	2021-05-14 19:43:50.894886714 +0000 UTC	deployed	blerg-0.2.0        	1.0        
  7. I can now look at the cronjob and see that my modification has remained in place, even though it is not supposed to be there.
$ kubectl get cronjob myjob -o yaml
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    command:
    - /bin/bash
    - -c
    ...
###
$ helm get all blerg > manifests.yaml
# grep for CronJob
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    # no "command" section
    ...

Expected behavior

The invalid cronjob should have been replaced with Helm's manifest.

Additional context

  • Kubernetes version:
    Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-14T05:14:17Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Git provider:
    Github

Below please provide the output of the following commands:

flux --version
flux version 0.11.0
flux check
► checking prerequisites
✗ flux 0.11.0 <0.13.4 (new version is available, please upgrade)
✔ kubectl 1.20.2 >=1.18.0-0
✔ Kubernetes 1.17.12-eks-7684af >=1.16.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.9.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.10.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.11.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.10.0
✔ all checks passed
kubectl -n <namespace> logs deploy/source-controller
{"level":"info","ts":"2021-05-14T19:53:48.826Z","logger":"controller.gitrepository","msg":"Reconciliation finished in 350.449894ms, next run in 1m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"...","namespace":"..."}
kubectl -n <namespace> logs deploy/kustomize-controller
{"level":"info","ts":"2021-05-14T19:54:37.116Z","logger":"controller.kustomization","msg":"Kustomization applied in 456.616953ms","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"...","namespace":"...","output":{"<all unchanged>}}
{"level":"info","ts":"2021-05-14T19:54:37.141Z","logger":"controller.kustomization","msg":"Reconciliation finished in 640.896612ms, next run in 3m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"...","namespace":"...","revision":"master/<sha>"}
@gladiatr72

gladiatr72 commented May 17, 2021

That's a Kubernetes thing. It has to do with how patches work (very non-automated/human friendly): a patch consisting of a partial manifest does not contain enough information to express anything other than "put this here" or "update the value of this key to a different value".

In lieu of requiring client processes (a category that includes the non-automated/human set) to submit complete manifests just to make a much smaller set of changes, the assumption is to leave manifest keys alone unless they are explicitly addressed in the update. If you happen to be one of the rare individuals who can actually generate a functioning patch by hand (or you are a kube/manifest patch-generator routine), I'm sure this is considered a useful and welcome feature.

If you want to do away with a key (command, in this case), you need to leave it defined but set it to whatever passes for the empty value of that key's type.

If you kubectl edit deployment thing, change the command key to look like command: [] and save, then (and only then) will the command key disappear.
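
For the cronjob in this report, a minimal sketch of that edit as described above (paths abbreviated with "..." in the same way as the snippets earlier in the issue):

$ kubectl edit cronjob myjob
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    command: []   # explicitly empty, so the key is removed rather than left untouched
    ...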

Note: another Kubernetes pointy bit: you can't have an empty container object (list, map, str) stored with a manifest. This means that you couldn't even use a simple kustomization patch to delete a thing; one of the more granular patch types must be employed to give substance to the nothingness you want to express. If you try with a simple strategic-merge patch (SMP), an empty key will work for its initial reconciliation, but subsequent reconciliations will fail because the manifest from the last update will show a key (with an empty value) that, according to Helm, should still exist.
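
A hypothetical sketch of the more granular route, using a JSON 6902 patch in a kustomization.yaml (the resource name and field path are illustrative only):

# kustomization.yaml (illustrative): a JSON 6902 "remove" op deletes the key outright,
# where a strategic-merge patch with an empty value fails on later reconciliations
patches:
  - target:
      kind: CronJob
      name: myjob
    patch: |
      - op: remove
        path: /spec/jobTemplate/spec/template/spec/containers/0/command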

What's even groovier: in cases where I've done exactly what you have described (usually in the name of troubleshooting), I've found it quite important to revert any manual changes to any helm-deployed object before resuming the HelmRelease, to avoid it becoming wedged. The only semi-reliable method I have found for escape (sketched in the commands after this list) is:

  • suspend the HelmRelease
  • delete the HelmRelease and its related HelmChart object
    • deleting just the HelmChart or just the HelmRelease does not help. I understand the purpose of the HelmChart object, but I'm not entirely certain why its presence blocks the recreation of the HelmRelease object on reconciliation.
  • Then flux reconcile kustomization ...
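
A hypothetical run of those steps for the release in this report (the HelmChart object's name and namespace depend on your layout; flux-system/prod-blerg is only the usual naming convention):

$ flux suspend helmrelease blerg -n prod
$ kubectl -n prod delete helmrelease blerg
$ kubectl -n flux-system delete helmchart prod-blerg
$ flux reconcile kustomization <the-kustomization-that-defines-blerg> --with-source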

I'm still on 0.11--I think I spotted something in a changelog or PR or some such indicating that suspended objects will no longer be affected by reconciliation requests. I'm not sure what the status is for that feature--in the future, YMMV :)

AND (big breath), it is a Helm thing for a similar reason: Helm keeps its manifests' state in the kube object store, and it becomes wedged when it sees anything that it didn't put in place; similar to an OS package manager, except the package manager usually doesn't soil its trousers when it comes across a configuration file a (human) operator may have changed.

flux/HelmRelease does have the option of specifying a force [bool] key in its rollback and install sub-sections, if you don't mind the potential for a short outage when the helm controller is unable to find another path to reconciliation.
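
For reference, a sketch of where that might live on the HelmRelease from this report, assuming the helm-controller v2beta1 API of that era (double-check which sub-sections accept force in your controller version):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: blerg
  namespace: prod
spec:
  # chart, interval and values omitted for brevity
  upgrade:
    force: true    # force resource updates through a replacement strategy
  rollback:
    force: true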

@scott-grimes
Author

Hmm, I'm not so sure. helm diff does pick up on the additional value and shows that it will be removed on upgrade, so the three-way merge appears to be working correctly. That leads me to think it's a problem with flux itself:

...
            containers:
-           - args:
+           - name: <container-name>
+             image: "<image>"
+             imagePullPolicy: Always
+             args:
+               - -c
+               - /app/scripts/run
+             command:
              - /bin/bash
-             - /app/scripts/run
              env:

@stefanprodan stefanprodan transferred this issue from fluxcd/flux2 May 25, 2021
@mmckane

mmckane commented Feb 9, 2022

I am seeing this as well on flux 0.20; it seems like diffing with HelmReleases is broken. It also seems that if I delete a resource manually, flux doesn't pick up that it should put it back, and won't until the chart version is modified or the git repo is updated when using reconcileStrategy: Revision.
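
For reference, a hypothetical placement of that setting inside a HelmRelease (names are illustrative):

spec:
  chart:
    spec:
      chart: ./charts/blerg
      reconcileStrategy: Revision   # rebuild the chart artifact on every new Git revision, not only on chart version bumps
      sourceRef:
        kind: GitRepository
        name: myrepo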

@zaggash

zaggash commented Mar 16, 2022

I am seeing this as well on flux 0.20; it seems like diffing with HelmReleases is broken. It also seems that if I delete a resource manually, flux doesn't pick up that it should put it back, and won't until the chart version is modified or the git repo is updated when using reconcileStrategy: Revision.

I experience the same with objects deleted from a HelmRelease, let's say an Ingress.
If I delete it, it is not recreated on a reconcile.

@monotek

monotek commented Mar 30, 2022

Same here.
IMHO helm-controller should act like the "helm upgrade" command, as that changes edited resources back to their desired state.

@kingdonb
Member

kingdonb commented Mar 31, 2022

If I delete it, it is not recreated on a reconcile.

@zaggash @monotek This is expected behavior. To understand why, you must understand how Helm resolves drift (and what obstacles it has placed in its own way with respect to that story).

Tl;dr: helm-controller doesn't reconcile drift in the cluster, except when it is performing an actual upgrade (and that drift in the cluster wouldn't currently be able to trigger any upgrade, because to the helm-controller it is perfectly indistinguishable from any side effects that might have been the intended result of post-render or lifecycle-hook behavior in the chart).

I am not certain whether these reports are the same as the original report, but the behavior that both @zaggash and @monotek are describing is already explained by #186, as I'll elaborate on below. (This is a complicated story, so save yourself and look the other way, unless the tl;dr above has made you hungry to understand what the problem is and why this issue can't easily be resolved.)

[snip] I moved this comment part to the #186 discussion because it's only distracting from the OP topic here.

I am virtually certain this does not have anything to do with the original issue, and I'm sorry for contributing to the noise, but this is a hard issue to explain and I'm sure it comes up all the time. Though I am afraid I have not quite understood the cronjob issue that started this thread, I'm doing my best to steer the discussion back on topic and find out whether any fix is possible (or whether we might have already fixed it somehow without noticing).

@monotek

monotek commented Mar 31, 2022

OK, I understand that continuously upgrading the Helm chart would be a mess secrets-wise, but it would still be nice if one could force such an upgrade via the "flux reconcile helmrelease" command somehow, so that, for example, deleted parts of a chart would be recreated without the need to delete the chart completely first.

@kingdonb
Member

Any change that adjusts the inputs or spec of the chart can trigger an upgrade without deleting anything; however, the point is well taken that it would be useful for the API client to have a direct force-upgrade button, whatever magic happens behind the curtain.
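
For example, a hypothetical nudge that adjusts the chart inputs and therefore triggers an upgrade (redeployNonce is purely illustrative; the chart does not need to know about it):

spec:
  values:
    redeployNonce: "2022-03-31"   # bumping any value changes the release inputs and makes helm-controller run an upgrade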

Moving my thoughts over to #186, where they are more relevant to the discussion; this issue report is about a different topic.
