flux reconcile on helmrelease leaves erroneously modified object in bad state #267

Closed
scott-grimes opened this issue May 14, 2021 · 8 comments · Fixed by #823
Comments

@scott-grimes

Describe the bug

A HelmRelease contains a cronjob as part of a release. If we edit that cronjob in Kubernetes and then run flux reconcile -n <namespace> helmrelease <hr name> --with-source, the edited cronjob is not reverted to the correct cronjob from the chart.

To Reproduce

Steps to reproduce the behavior:

  1. HelmRelease is READY, and helm shows STATUS as deployed
$ kubectl get hr
NAME             READY   STATUS                             AGE
blerg           True    Release reconciliation succeeded   9d
$ helm ls
NAME          	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART               	APP VERSION
blerg        	prod     	224     	2021-05-14 19:30:24.051846927 +0000 UTC	deployed	blerg-0.2.0        	1.0
  2. I can see my cronjob spec in the cluster; it is exactly what helm says it should be
$ kubectl get cronjob myjob -o yaml
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    ...
###
$ helm get all blerg > manifests.yaml
# grep for CronJob
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    ...
  3. Manually edit the cronjob, adding a "command" section to the container spec
$ kubectl edit cronjob myjob
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    command:
    - /bin/bash
    - -c
    ...
  4. Reconcile, which should replace the edited job with the correct job
$ flux reconcile -n prod helmrelease blerg --with-source
► annotating GitRepository myrepo in flux-system namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ GitRepository reconciliation completed
✔ fetched revision master/<sha>
► annotating HelmRelease blerg in prod namespace
✔ HelmRelease annotated
◎ waiting for HelmRelease reconciliation
✔ HelmRelease reconciliation completed
✔ reconciled revision 0.2.0
  5. I can see reconciliation working, the update hooks are running, etc.
$ kubectl get hr 
NAME             READY     STATUS                             AGE
blerg           Unknown   Reconciliation in progress   
...
$ helm ls -a
NAME          	NAMESPACE	REVISION	UPDATED                                	STATUS         	CHART               	APP VERSION
blerg        	prod     	225     	2021-05-14 19:43:50.894886714 +0000 UTC	pending-upgrade	blerg-0.2.0        	1.0 
  6. I wait for helm to complete. HelmRelease is READY, and helm shows STATUS as deployed
$ kubectl get hr
NAME             READY   STATUS                             AGE
blerg           True    Release reconciliation succeeded   9d
$ helm ls
NAME          	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART               	APP VERSION
blerg        	prod     	225     	2021-05-14 19:43:50.894886714 +0000 UTC	deployed	blerg-0.2.0        	1.0        
  7. I can now look at the cronjob and see that my modification has remained in place, even though it is not supposed to be there.
$ kubectl get cronjob myjob -o yaml
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    command:
    - /bin/bash
    - -c
    ...
###
$ helm get all blerg > manifests.yaml
# grep for CronJob
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    # no "command" section
    ...

Expected behavior

The invalid cronjob should have been replaced with Helm's manifest.

Additional context

  • Kubernetes version:
    Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-14T05:14:17Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Git provider:
    Github

Below please provide the output of the following commands:

flux --version
flux version 0.11.0
flux check
► checking prerequisites
✗ flux 0.11.0 <0.13.4 (new version is available, please upgrade)
✔ kubectl 1.20.2 >=1.18.0-0
✔ Kubernetes 1.17.12-eks-7684af >=1.16.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.9.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.10.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.11.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.10.0
✔ all checks passed
kubectl -n <namespace> logs deploy/source-controller
{"level":"info","ts":"2021-05-14T19:53:48.826Z","logger":"controller.gitrepository","msg":"Reconciliation finished in 350.449894ms, next run in 1m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"...","namespace":"..."}
kubectl -n <namespace> logs deploy/kustomize-controller
{"level":"info","ts":"2021-05-14T19:54:37.116Z","logger":"controller.kustomization","msg":"Kustomization applied in 456.616953ms","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"...","namespace":"...","output":{"<all unchanged>}}
{"level":"info","ts":"2021-05-14T19:54:37.141Z","logger":"controller.kustomization","msg":"Reconciliation finished in 640.896612ms, next run in 3m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"...","namespace":"...","revision":"master/<sha>"}
@gladiatr72

gladiatr72 commented May 17, 2021

That's a Kubernetes thing. It has to do with how patches work (very non-automated/human friendly): a patch consisting of a partial manifest does not contain enough information to express anything other than "put this here" or "update the value of this key to a different value".

In lieu of requiring client processes (a category that includes the non-automated/human set) to submit complete manifests just to make a much smaller set of changes, the assumption is to leave manifest keys alone unless they are explicitly addressed in the update. If you happen to be one of the rare individuals who can actually generate a functioning patch by hand (or you are a kube/manifest patch-generator routine), I'm sure this is considered a useful and welcome feature.

If you want to do away with a key (command, in this case), you need to leave it defined but set it to whatever passes for the empty value of that key's type.

If you kubectl edit deployment thing, change the command key to look like command: [] and save, then (and only then) will the command key disappear.
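
For the cronjob in this report, a minimal sketch of that edit as described above (paths abbreviated with "..." in the same way as the snippets earlier in the issue):

$ kubectl edit cronjob myjob
...
spec:
  containers:
  - args:
    - /bin/bash
    - /app/scripts/run
    command: []   # explicitly empty, so the key is removed rather than left untouched
    ...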

Note: another Kubernetes pointy bit: you can't have an empty container object (list, map, str) stored with a manifest. This means that you couldn't even use a simple kustomization patch to delete a thing; one of the more granular patch types must be employed to give substance to the nothingness you want to express. If you try with a simple strategic-merge patch (SMP), an empty key will work for its initial reconciliation, but subsequent reconciliations will fail because the manifest from the last update will show a key (with an empty value) that, according to Helm, should still exist.
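
A hypothetical sketch of the more granular route, using a JSON 6902 patch in a kustomization.yaml (the resource name and field path are illustrative only):

# kustomization.yaml (illustrative): a JSON 6902 "remove" op deletes the key outright,
# where a strategic-merge patch with an empty value fails on later reconciliations
patches:
  - target:
      kind: CronJob
      name: myjob
    patch: |
      - op: remove
        path: /spec/jobTemplate/spec/template/spec/containers/0/command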

What's even groovier: in cases where I've done exactly what you have described (usually in the name of troubleshooting), I've found it quite important to revert any manual changes to any helm-deployed object before resuming the HelmRelease, to avoid it becoming wedged. The only semi-reliable method I have found for escape (sketched in the commands after this list) is:

  • suspend the HelmRelease
  • delete the HelmRelease and its related HelmChart object
    • deleting just the HelmChart or just the HelmRelease does not help. I understand the purpose of the HelmChart object, but I'm not entirely certain why its presence blocks the recreation of the HelmRelease object on reconciliation.
  • Then flux reconcile kustomization ...
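
A hypothetical run of those steps for the release in this report (the HelmChart object's name and namespace depend on your layout; flux-system/prod-blerg is only the usual naming convention):

$ flux suspend helmrelease blerg -n prod
$ kubectl -n prod delete helmrelease blerg
$ kubectl -n flux-system delete helmchart prod-blerg
$ flux reconcile kustomization <the-kustomization-that-defines-blerg> --with-source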

I'm still on 0.11--I think I spotted something in a changelog or PR or some such indicating that suspended objects will no longer be affected by reconciliation requests. I'm not sure what the status is for that feature--in the future, YMMV :)

AND (big breath), it is a Helm thing for a similar reason: Helm keeps its manifests' state in the kube object store, and it becomes wedged when it sees anything that it didn't put in place; similar to an OS package manager, except the package manager usually doesn't soil its trousers when it comes across a configuration file a (human) operator may have changed.

flux/HelmRelease does have the option of specifying a force [bool] key in its rollback and install sub-sections, if you don't mind the potential for a short outage when the helm controller is unable to find another path to reconciliation.
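
For reference, a sketch of where that might live on the HelmRelease from this report, assuming the helm-controller v2beta1 API of that era (double-check which sub-sections accept force in your controller version):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: blerg
  namespace: prod
spec:
  # chart, interval and values omitted for brevity
  upgrade:
    force: true    # force resource updates through a replacement strategy
  rollback:
    force: true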

@scott-grimes
Author

Hmm, I'm not so sure. helm diff does pick up on the additional value and shows that it will be removed on upgrade, so the three-way merge appears to be working correctly. That leads me to think it's a problem with flux itself:

...
            containers:
-           - args:
+           - name: <container-name>
+             image: "<image>"
+             imagePullPolicy: Always
+             args:
+               - -c
+               - /app/scripts/run
+             command:
              - /bin/bash
-             - /app/scripts/run
              env:

@stefanprodan stefanprodan transferred this issue from fluxcd/flux2 May 25, 2021
@mmckane

mmckane commented Feb 9, 2022

I am seeing this as well on flux 0.20; it seems like diffing with HelmReleases is broken. It also seems that if I delete a resource manually, flux doesn't pick up that it should put it back, and won't until the chart version is modified or the git repo is updated when using reconcileStrategy: Revision.
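
For reference, a hypothetical placement of that setting inside a HelmRelease (names are illustrative):

spec:
  chart:
    spec:
      chart: ./charts/blerg
      reconcileStrategy: Revision   # rebuild the chart artifact on every new Git revision, not only on chart version bumps
      sourceRef:
        kind: GitRepository
        name: myrepo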

@zaggash

zaggash commented Mar 16, 2022

I am seeing this as well on flux 0.20; it seems like diffing with HelmReleases is broken. It also seems that if I delete a resource manually, flux doesn't pick up that it should put it back, and won't until the chart version is modified or the git repo is updated when using reconcileStrategy: Revision.

I experience the same with objects deleted from a HelmRelease, let's say an Ingress.
If I delete it, it is not recreated on a reconcile.

@monotek

monotek commented Mar 30, 2022

Same here.
IMHO helm-controller should act like the "helm upgrade" command, as that changes edited resources back to their desired state.

@kingdonb
Member

kingdonb commented Mar 31, 2022

If I delete it, it is not recreated on a reconcile.

@zaggash @monotek This is expected behavior. To understand why, you must understand how Helm resolves drift (and what obstacles it has placed in its own way with respect to that story).

Tl;dr: helm-controller doesn't reconcile drift in the cluster, except when it is performing an actual upgrade (and that drift in the cluster wouldn't currently be able to trigger any upgrade, because to the helm-controller it is perfectly indistinguishable from any side effects that might have been the intended result of post-render or lifecycle-hook behavior in the chart).

I am not certain whether these reports are the same as the original report, but the behavior that both @zaggash and @monotek are describing is already explained by #186, as I'll elaborate on below. (This is a complicated story, so save yourself and look the other way, unless the tl;dr above has made you hungry to understand what the problem is and why this issue can't easily be resolved.)

[snip] I moved this comment part to the #186 discussion because it's only distracting from the OP topic here.

I am virtually certain this does not have anything to do with the original issue, and I'm sorry for contributing to the noise, but this is a hard issue to explain and I'm sure it comes up all the time. Though I am afraid I have not quite understood the cronjob issue that started this thread, I'm doing my best to steer the discussion back on topic and find out whether any fix is possible (or whether we might have already fixed it somehow without noticing).

@monotek

monotek commented Mar 31, 2022

OK, I understand that continuously upgrading the Helm chart would be a mess secrets-wise, but it would still be nice if one could force such an upgrade via the "flux reconcile helmrelease" command somehow, so that, for example, deleted parts of a chart would be recreated without the need to delete the chart completely first.

@kingdonb
Member

Any change that adjusts the inputs or spec of the chart can trigger an upgrade without deleting anything; however, the point is well taken that it would be useful for the API client to have a direct force-upgrade button, whatever magic happens behind the curtain.
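
For example, a hypothetical nudge that adjusts the chart inputs and therefore triggers an upgrade (redeployNonce is purely illustrative; the chart does not need to know about it):

spec:
  values:
    redeployNonce: "2022-03-31"   # bumping any value changes the release inputs and makes helm-controller run an upgrade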

Moving my thoughts over to #186, where they are more relevant to the discussion; this issue report is about a different topic.
