
New cluster using Talos is not progressing beyond Machines in Provisioning stage. #37

Open
dhaugli opened this issue Jun 6, 2024 · 5 comments

Comments


dhaugli commented Jun 6, 2024

What happened:

The cluster is not coming up: the Harvester load balancer is not created, and the machines never leave the Provisioning state.
The machines are provisioned in Harvester and get IPs from my network. I can attach a console to them, though since it's Talos there is not much to see there.

Screenshot of the console of one of the Talos control-plane VMs:

[Screenshot 2024-06-06 232557: Talos control-plane VM console]

caph-provider logs:

 ERROR   failed to patch HarvesterMachine        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "7ec120a6-8a1e-40b1-98dd-3597ce44ca1c", "machine": "cluster-capi-mgmt-p-01/capi-mgmt-p-01-7shhp", "cluster": "cluster-capi-mgmt-p-01/capi-mgmt-p-01", "error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value", "errorCauses": [{"error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value"}]}
github.com/rancher-sandbox/cluster-api-provider-harvester/controllers.(*HarvesterMachineReconciler).Reconcile.func1
        /workspace/controllers/harvestermachine_controller.go:121
github.com/rancher-sandbox/cluster-api-provider-harvester/controllers.(*HarvesterMachineReconciler).Reconcile
        /workspace/controllers/harvestermachine_controller.go:198
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
2024-06-06T19:58:10Z    ERROR   Reconciler error        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "7ec120a6-8a1e-40b1-98dd-3597ce44ca1c", "error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value", "errorCauses": [{"error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value"}]}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
These two log entries keep repeating:
 2024-06-06T19:58:10Z    INFO    Reconciling HarvesterMachine ...        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "dc815768-5306-42cc-91c0-be802d85bc82"}
2024-06-06T19:58:10Z    INFO    Waiting for ProviderID to be set on Node resource in Workload Cluster ...       {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "dc815768-5306-42cc-91c0-be802d85bc82", "machine": "cluster-capi-mgmt-p-01/capi-mgmt-p-01-7shhp", "cluster": "cluster-capi-mgmt-p-01/capi-mgmt-p-01"}
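
For context, a hedged reading of the "ready: Required value" error above (an assumption drawn from the message itself, not verified against the CRD): the HarvesterMachine object appears to require a boolean ready field, so a patch that omits it is rejected even while the VM is running fine. A minimal sketch of the stanza in question:

# Assumption: minimal sketch only; the exact placement of the field is
# inferred from the validation error above, not from the CRD schema.
status:
  ready: false   # must be present (even when false) for the patch to validate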

capt-controller-manager logs:

I0606 19:58:08.737945       1 taloscontrolplane_controller.go:176] "controllers/TalosControlPlane: successfully updated control plane status" namespace="cluster-capi-mgmt-p-01" talosControlPlane="capi-mgmt-p-01" cluster="capi-mgmt-p-01"
I0606 19:58:08.739615       1 controller.go:327] "Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="b0b79408-8a41-43df-91ef-07fe7d36fa7c"
E0606 19:58:08.739746       1 controller.go:329] "Reconciler error" err="at least one machine should be provided" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="b0b79408-8a41-43df-91ef-07fe7d36fa7c"
I0606 19:58:08.749008       1 taloscontrolplane_controller.go:189] "reconcile TalosControlPlane" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="c37dc309-f8fb-42c7-a375-5faceb9019b9" cluster="capi-mgmt-p-01"
I0606 19:58:09.190175       1 scale.go:33] "controllers/TalosControlPlane: scaling up control plane" Desired=3 Existing=1
I0606 19:58:09.213294       1 taloscontrolplane_controller.go:152] "controllers/TalosControlPlane: attempting to set control plane status"
I0606 19:58:09.220900       1 taloscontrolplane_controller.go:564] "controllers/TalosControlPlane: failed to get kubeconfig for the cluster" error="failed to create cluster accessor: error creating client for remote cluster \"cluster-capi-mgmt-p-01/capi-mgmt-p-01\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://10.0.0.113:6443/api/v1?timeout=10s\": tls: failed to verify certificate: x509: certificate is valid for 10.0.0.3, 127.0.0.1, ::1, 10.0.0.5, 10.53.0.1, not 10.0.0.113"

cabpt-talos-bootstrap logs (I don't know if this is relevant):

I0606 19:58:09.206570       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.224117       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.243118       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.280372       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.341804       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.352557       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.439369       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.480714       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.539945       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.548156       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-n48cx"
I0606 19:58:09.717884       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-n48cx"
I0606 19:58:09.720944       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-7shhp"
I0606 19:58:09.756344       1 talosconfig_controller.go:223] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4/owner-name=capi-mgmt-p-01-n48cx: ignoring an already ready config"
I0606 19:58:09.765995       1 secrets.go:243] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4/owner-name=capi-mgmt-p-01-n48cx: updating talosconfig" endpoints=null secret="capi-mgmt-p-01-talosconfig"

What did you expect to happen:
I expected the CAPH provider to create the load balancer and proceed with creating the cluster.

How to reproduce it:

I added the Talos providers (bootstrap and control plane) and, of course, the Harvester provider.

Added four files plus the Harvester secret, with the following configuration:

cluster.yaml:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 172.16.0.0/20
    services:
      cidrBlocks:
        - 172.16.16.0/20
    serviceDomain: cluster.local
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: capi-mgmt-p-01
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: HarvesterCluster
    name: capi-mgmt-p-01

harvester-cluster.yaml:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HarvesterCluster
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  targetNamespace: cluster-capi-mgmt-p-01
  loadBalancerConfig:
    ipamType: pool
    ipPoolRef: k8s-api
  server: https://10.0.0.3
  identitySecret: 
    name: trollit-harvester-secret
    namespace: cluster-capi-mgmt-p-01

harvester-machinetemplate.yaml:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HarvesterMachineTemplate
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  template: 
    spec:
      cpu: 2
      memory: 8Gi
      sshUser: ubuntu
      sshKeyPair: default/david
      networks:
      -  cluster-capi-mgmt-p-01/capi-mgmt-network
      volumes:
      - volumeType: image 
        imageName: harvester-public/talos-1.7.4-metalqemu
        volumeSize: 50Gi
        bootOrder: 0

controlplane.yaml:

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  version: "v1.30.0"
  replicas: 3
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: HarvesterMachineTemplate
    name: capi-mgmt-p-01
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
      talosVersion: v1.7.4
      configPatches:
        - op: add
          path: /cluster/network
          value:
            cni:
              name: none

        - op: add
          path: /cluster/proxy
          value:
            disabled: true

        - op: add
          path: /cluster/network/podSubnets
          value:
            - 172.16.0.0/20

        - op: add
          path: /cluster/network/serviceSubnets
          value:
            - 172.16.16.0/20

        - op: add
          path: /machine/kubelet/extraArgs
          value:
            cloud-provider: external

        - op: add
          path: /machine/kubelet/nodeIP
          value:
            validSubnets:
              - 10.0.0.0/24

        - op: add
          path: /cluster/discovery
          value:
            enabled: false

        - op: add
          path: /machine/features/kubePrism
          value:
            enabled: true

        - op: add
          path: /cluster/apiServer/certSANs
          value:
            - 127.0.0.1

        - op: add
          path: /cluster/apiServer/extraArgs
          value:
            anonymous-auth: true

Anything else you would like to add:

I have tried switching the load balancer config from DHCP to ipPoolRef with a pre-configured IP pool; this also did not work. I think it is related to the LB never being provisioned in the first place.
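
For reference, a hedged sketch of the DHCP variant that was also tried; field names mirror the HarvesterCluster spec shown above, and only the relevant stanza is included:

# Sketch of the alternative loadBalancerConfig (DHCP instead of a
# pre-configured pool) that was tried; field names follow the spec above.
spec:
  loadBalancerConfig:
    ipamType: dhcp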


Environment:

  • talos controlplane provider version: 0.5.5
  • talos bootstrap provider version: 0.6.4
  • harvester cluster api provider: 0.1.2
  • harvester version installed on my HP server: 1.3.0
  • OS (e.g. from /etc/os-release):

ekarlso commented Jun 7, 2024

So after looking around and thinking a bit, I see that our CAPHV is waiting for the providerID to be set:

2024-06-07T06:13:18Z	INFO	Waiting for ProviderID to be set on Node resource in Workload Cluster ...	{"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-7d7pr","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-7d7pr", "reconcileID": "d258bfe4-ba85-4d61-92e0-d6ee8aced78d", "machine": "cluster-capi-mgmt-p-01/capi-mgmt-p-01-n48cx", "cluster": "cluster-capi-mgmt-p-01/capi-mgmt-p-01"}

I see that in your examples you are including the CPI as a DaemonSet. That means it will not be setting the providerID on Talos, since the cluster needs to be bootstrapped before the DaemonSet can start and the CPI can set the providerID:
https://github.com/rancher-sandbox/cluster-api-provider-harvester/blob/main/templates/cluster-template-kubeadm.yaml#L190
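
For illustration, a hedged sketch of the field in question on the workload-cluster Node object (the exact Harvester providerID format is an assumption, shown only as a placeholder). Because the CPI runs as a DaemonSet, it cannot start before the cluster is bootstrapped, so the field never gets populated and CAPHV keeps waiting:

# Assumption: illustrative only. A cloud provider normally populates
# spec.providerID on each workload-cluster Node; here the CPI DaemonSet
# cannot start before bootstrap completes, so the field stays empty.
apiVersion: v1
kind: Node
metadata:
  name: capi-mgmt-p-01-cp-0   # hypothetical node name
spec:
  providerID: harvester://<vm-identifier>   # placeholder; exact format not confirmed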

IMHO the controller should be able to get the HarvesterMachine into a state where the Machine object phases into Provisioned, so that other controllers, whether Talos' or any other, can work with it, which is the normal flow?

PatrickLaabs (Contributor) commented:

Hi @dhaugli,
which version of the rke2 controlplane and bootstrap provider are you using?


dhaugli commented Jun 8, 2024

> Hi @dhaugli, which version of the rke2 controlplane and bootstrap provider are you using?

We are using Talos bootstrap and Talos controlplane provider in this case.


dhaugli commented Jun 9, 2024

I have now followed the example from the templates, but it still doesn't work, and I think I know why: the CAPH controller doesn't propagate the IP addresses of the machines into the Machine object, like:

status:
  addresses:
  - address: <IP>
    type: ExternalIP
  - address: <IP>
    type: ExternalIP
  - address: <DNS NAME OF MACHINE>
    type: InternalDNS

For reference, the vSphere CAPI controller does this. Without it, the Talos bootstrap controller can't see the IP and can't continue the bootstrap process. But my machines do get IPs on my network, and the qemu agent does report them through Harvester.


dhaugli commented Jun 9, 2024

I found the issue with the CAPH controller, based on the Cluster API principles for how bootstrapping should work:

[image: Cluster API bootstrap flow diagram]

The CAPH controller does not mark the machine as ready in the infrastructure provider (even though it is running just fine as a VM in Harvester) because it is waiting for the ProviderID. Because of this, the LB is never created, and with Talos the nodes just end up waiting forever in the bootstrap process and never progress.
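
For illustration, a hedged sketch (field names assumed from the validation error and the Machine status example earlier in this thread) of the state the reporter expects the infrastructure provider to reach once the VM is running, independent of the ProviderID, so the Machine can phase into Provisioned:

# Assumption: sketch of the HarvesterMachine status the reporter expects CAPH
# to set as soon as the VM is up in Harvester, before any ProviderID exists.
status:
  ready: true
  addresses:
  - address: <IP reported by the qemu agent>
    type: ExternalIP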

My friend Endre just made a fix in our own image; it still doesn't work, but we are working on it as well.
