[Bug]: bind() to unix:/var/lib/nginx/nginx-status.sock failed (98: Address already in use) #6752

Open
granescb opened this issue Nov 5, 2024 · 10 comments
Labels
backlog Pull requests/issues that are backlog items bug An issue reporting a potential bug
Milestone

Comments

@granescb

granescb commented Nov 5, 2024

Version

3.6.2

What Kubernetes platforms are you running on?

EKS Amazon

Steps to reproduce

k8s EKS version: 1.31

Describe the bug:
Sometimes nginx-ingress-controller restarts the nginx process without cleaning up the socket files.
We first hit this problem during a mass node restart in the k8s cluster.
Since then it has happened at random, on weekends.

We noticed the problem in version 3.6.2. Before that we ran version 3.0.2 and never had this problem.

Deleting the Pod manually fixes it, but the problem can come back.

Here is the Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    meta.helm.sh/release-name: nginx-inc-ingress-controller
    meta.helm.sh/release-namespace: nginx-ingress
  labels:
    app: nginx-inc-ingress-controller
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: 3.6.1
    helm.sh/chart: nginx-ingress-1.3.1
  name: nginx-inc-ingress-controller
  namespace: nginx-ingress
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx-inc-ingress-controller
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        logs.improvado.io/app: nginx-ingress
        logs.improvado.io/format: json
        logs.improvado.io/ingress-class: nginx-stable
        prometheus.io/port: "9113"
        prometheus.io/scheme: http
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: nginx-inc-ingress-controller
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nginx-inc-ingress-controller
            topologyKey: kubernetes.io/hostname
      automountServiceAccountToken: true
      containers:
      - args:
        - -nginx-plus=false
        - -nginx-reload-timeout=60000
        - -enable-app-protect=false
        - -enable-app-protect-dos=false
        - -nginx-configmaps=$(POD_NAMESPACE)/nginx-inc-ingress-controller
        - -default-server-tls-secret=$(POD_NAMESPACE)/nginx-inc-ingress-controller-default-server-tls
        - -ingress-class=nginx-stable
        - -health-status=true
        - -health-status-uri=/-/health/lb
        - -nginx-debug=false
        - -v=1
        - -nginx-status=true
        - -nginx-status-port=8080
        - -nginx-status-allow-cidrs=127.0.0.1
        - -report-ingress-status
        - -enable-leader-election=true
        - -leader-election-lock-name=nginx-inc-ingress-controller-leader
        - -enable-prometheus-metrics=true
        - -prometheus-metrics-listen-port=9113
        - -prometheus-tls-secret=
        - -enable-service-insight=false
        - -service-insight-listen-port=9114
        - -service-insight-tls-secret=
        - -enable-custom-resources=true
        - -enable-snippets=true
        - -include-year=false
        - -disable-ipv6=false
        - -enable-tls-passthrough=false
        - -enable-cert-manager=false
        - -enable-oidc=false
        - -enable-external-dns=false
        - -default-http-listener-port=80
        - -default-https-listener-port=443
        - -ready-status=true
        - -ready-status-port=8081
        - -enable-latency-metrics=false
        - -ssl-dynamic-reload=true
        - -enable-telemetry-reporting=false
        - -weight-changes-dynamic-reload=false
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: 627003544259.dkr.ecr.us-east-1.amazonaws.com/nginx-inc-ingress:master-3.6.2-2-1
        imagePullPolicy: IfNotPresent
        name: ingress-controller
        ports:
        - containerPort: 80
          name: http
          protocol: TCP
        - containerPort: 443
          name: https
          protocol: TCP
        - containerPort: 9113
          name: prometheus
          protocol: TCP
        - containerPort: 8081
          name: readiness-port
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /nginx-ready
            port: readiness-port
            scheme: HTTP
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 1500m
            memory: 1500Mi
          requests:
            cpu: 100m
            memory: 1500Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 101
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/nginx
          name: nginx-etc
        - mountPath: /var/cache/nginx
          name: nginx-cache
        - mountPath: /var/lib/nginx
          name: nginx-lib
        - mountPath: /var/log/nginx
          name: nginx-log
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - cp
        - -vdR
        - /etc/nginx/.
        - /mnt/etc
        image: 627003544259.dkr.ecr.us-east-1.amazonaws.com/nginx-inc-ingress:master-3.6.2-2-1
        imagePullPolicy: IfNotPresent
        name: init-ingress-controller
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 101
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/etc
          name: nginx-etc
      nodeSelector:
        kubernetes.io/arch: amd64
      priorityClassName: cluster-application-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: nginx-inc-ingress-controller
      serviceAccountName: nginx-inc-ingress-controller
      terminationGracePeriodSeconds: 60
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300
      - effect: NoSchedule
        key: node.kubernetes.io/memory-pressure
        operator: Exists
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: nginx-inc-ingress-controller
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
      volumes:
      - emptyDir: {}
        name: nginx-etc
      - emptyDir: {}
        name: nginx-cache
      - emptyDir: {}
        name: nginx-lib
      - emptyDir: {}
        name: nginx-log

Logs with error:

2024-11-02 10:02:58.543	2024/11/02 06:02:56 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-status.sock failed (98: Address already in use)
2024-11-02 10:02:58.543	2024/11/02 06:02:56 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2024-11-02 10:02:58.543	2024/11/02 06:02:56 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2024-11-02 10:02:58.543	2024/11/02 06:02:56 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2024-11-02 10:02:58.543	2024/11/02 06:02:56 [notice] 18#18: try again to bind() after 500ms
2024-11-02 10:02:59.043	2024/11/02 06:02:56 [emerg] 18#18: still could not bind()
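
For readers unfamiliar with this error: `bind()` on a Unix domain socket fails with "Address already in use" whenever the socket path already exists on disk, even if no process is listening on it. The following is a minimal sketch of that behavior only, not nginx or controller code; the helper names `bind_unix` and `demo_stale_socket` are made up for illustration, and it assumes a Linux-like system with Python 3.9+.

```python
import errno
import os
import socket
import tempfile


def bind_unix(path: str) -> socket.socket:
    """Create a listening AF_UNIX stream socket bound to path."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def demo_stale_socket(directory: str) -> list[str]:
    """Return the outcome of three bind attempts against the same path."""
    path = os.path.join(directory, "nginx-status.sock")
    outcomes = []

    # First start: the path does not exist yet, so bind() succeeds.
    first = bind_unix(path)
    outcomes.append("ok")
    # Simulate an exit without cleanup: the descriptor is closed,
    # but the socket file is left behind on disk.
    first.close()

    # Restart: the leftover file makes bind() fail although nobody listens.
    try:
        bind_unix(path).close()
        outcomes.append("ok")
    except OSError as exc:
        outcomes.append(errno.errorcode.get(exc.errno, str(exc.errno)))

    # Removing the stale file lets the next bind() succeed again.
    os.unlink(path)
    bind_unix(path).close()
    outcomes.append("ok")
    os.unlink(path)
    return outcomes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_stale_socket(tmp))
```

The second attempt is the analogue of the `[emerg]` lines above: the file is still there from the previous run, so the new process cannot bind until something removes it.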

Here are the logs, showing a reconfiguration triggered by signal 1 (SIGHUP) followed by a crash loop with the socket-busy error:
Explore-logs-2024-11-05 18_40_57.txt

Expected behavior
The nginx-ingress controller pod keeps working.

@granescb granescb added bug An issue reporting a potential bug needs triage An issue that needs to be triaged labels Nov 5, 2024

github-actions bot commented Nov 5, 2024

Hi @granescb thanks for reporting!

Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this 🙂

Cheers!

@AlexFenlon
Contributor

Hi @granescb,

Can you give more details about the node restarts? Are the nodes on a scheduled restart?

Can you try setting readOnlyRootFilesystem to false and let us know if this changes the behaviour?

In the meantime, we will do our best to reproduce the issue and get back as soon as we can.

@AlexFenlon AlexFenlon added waiting for response Waiting for author's response and removed needs triage An issue that needs to be triaged labels Nov 5, 2024
@granescb
Author

granescb commented Nov 6, 2024

Hello @AlexFenlon
The node restart was caused by a cluster component update: k8s drained all the old nodes and migrated all pods to the new ones.
We did this on 4 k8s clusters but hit the ingress problem only on the biggest one, where 2 of 3 pods ended up in CrashLoopBackOff. That cluster has roughly 60 nodes and about 217 ingress resources, which may be related to the problem.

Yes, I can try readOnlyRootFilesystem=false, but only in the staging cluster. The main problem is that I don't know how to reproduce the issue, so I can't check whether changing readOnlyRootFilesystem solves it. I will try to reproduce the problem with the current settings first and then repeat the test with readOnlyRootFilesystem=false.
UPD: We set readOnlyRootFilesystem = true during the update from 3.0.2 to 3.6.2.

@granescb
Author

granescb commented Nov 6, 2024

I reproduced the same behavior by sending signal 1 (SIGHUP) from the k8s worker node.

  1. Run nginx-ingress-controller
  2. Exec into the k8s worker node
  3. Find the nginx master process with htop
  4. Kill it with signal 1 (SIGHUP)
  5. The pod is now in CrashLoopBackOff

2024/11/06 09:35:56 [notice] 14#14: signal 1 (SIGHUP) received from 24, reconfiguring
2024/11/06 09:35:56 [notice] 14#14: reconfiguring
2024/11/06 09:35:56 [warn] 14#14: duplicate MIME type "text/html" in /etc/nginx/nginx.conf:28
2024/11/06 09:35:56 [notice] 14#14: using the "epoll" event method
2024/11/06 09:35:56 [notice] 14#14: start worker processes
2024/11/06 09:35:56 [notice] 14#14: start worker process 25
2024/11/06 09:35:56 [notice] 14#14: start worker process 26
2024/11/06 09:35:56 [notice] 14#14: start worker process 27
2024/11/06 09:35:56 [notice] 14#14: start worker process 28
2024/11/06 09:35:56 [notice] 16#16: gracefully shutting down
2024/11/06 09:35:56 [notice] 17#17: gracefully shutting down
2024/11/06 09:35:56 [notice] 18#18: gracefully shutting down
2024/11/06 09:35:56 [notice] 15#15: gracefully shutting down
2024/11/06 09:35:56 [notice] 17#17: exiting
2024/11/06 09:35:56 [notice] 18#18: exiting
2024/11/06 09:35:56 [notice] 15#15: exiting
2024/11/06 09:35:56 [notice] 16#16: exiting
2024/11/06 09:35:56 [notice] 16#16: exit
2024/11/06 09:35:56 [notice] 15#15: exit
2024/11/06 09:35:56 [notice] 17#17: exit
2024/11/06 09:35:56 [notice] 18#18: exit
I1106 09:35:56.455595       1 event.go:377] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"nginx-ingress", Name:"nginx-inc-ingress-controller", UID:"a61edba5-ec9b-4024-8649-d88d6d932178", APIVersion:"v1", ResourceVersion:"404400560", FieldPath:""}): type: 'Normal' reason: 'Updated' Configuration from nginx-ingress/nginx-inc-ingress-controller was updated
I1106 09:35:56.455660       1 event.go:377] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"<namespace>", Name:"<ingress_name>", UID:"2d658178-96ee-4290-a8cb-a04de49f3150", APIVersion:"networking.k8s.io/v1", ResourceVersion:"381554759", FieldPath:""}): type: 'Normal' reason: 'AddedOrUpdated' Configuration for <namespace>/<ingress_name> was added or updated
2024/11/06 09:35:56 [notice] 14#14: signal 17 (SIGCHLD) received from 17
2024/11/06 09:35:56 [notice] 14#14: worker process 17 exited with code 0
2024/11/06 09:35:56 [notice] 14#14: signal 29 (SIGIO) received
2024/11/06 09:35:56 [notice] 14#14: signal 17 (SIGCHLD) received from 16
2024/11/06 09:35:56 [notice] 14#14: worker process 16 exited with code 0
2024/11/06 09:35:56 [notice] 14#14: signal 29 (SIGIO) received
2024/11/06 09:35:56 [notice] 14#14: signal 17 (SIGCHLD) received from 18
2024/11/06 09:35:56 [notice] 14#14: worker process 18 exited with code 0
2024/11/06 09:35:56 [notice] 14#14: worker process 15 exited with code 0
2024/11/06 09:35:56 [notice] 14#14: signal 29 (SIGIO) received
2024/11/06 09:35:56 [notice] 14#14: signal 17 (SIGCHLD) received from 15
E1106 09:39:46.190574       1 processes.go:39] unable to collect process metrics : unable to read file /proc/37/cmdline: open /proc/37/cmdline: no such file or directory

The Pod now restarts and goes into CrashLoopBackOff because of the busy-socket error.
The question is: who sends signal 1 in the production workload?

I also set readOnlyRootFilesystem=false and repeated the same test with signal 1: this time the pod just restarts and works fine. So it looks like this workaround will work for us.
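
On the question of who sends signal 1: nginx logs the sender's PID in the `received from N` part of the notice, so the logs themselves can narrow it down. Below is a rough sketch of pulling that out of the log lines, along with the socket paths that failed to bind. The regular expressions are written against the log lines quoted in this issue only, and `parse_signal`, `failed_socket_paths` and `SignalEvent` are made-up names; which process the PID refers to still has to be worked out separately.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Matches nginx notices such as:
#   [notice] 14#14: signal 1 (SIGHUP) received from 24, reconfiguring
#   [notice] 14#14: signal 29 (SIGIO) received
SIGNAL_RE = re.compile(
    r"\[notice\] (?P<pid>\d+)#\d+: signal (?P<num>\d+) \((?P<name>\w+)\) received"
    r"(?: from (?P<sender>\d+))?"
)

# Matches the bind failures quoted in this issue (errno 98 only).
BIND_RE = re.compile(
    r"\[emerg\] \d+#\d+: bind\(\) to unix:(?P<path>\S+) failed \(98:"
)


class SignalEvent(NamedTuple):
    receiver_pid: int
    number: int
    name: str
    sender_pid: Optional[int]  # None when nginx logged no sender


def parse_signal(line: str) -> Optional[SignalEvent]:
    """Return the signal described by an nginx log line, or None."""
    match = SIGNAL_RE.search(line)
    if match is None:
        return None
    sender = match.group("sender")
    return SignalEvent(
        receiver_pid=int(match.group("pid")),
        number=int(match.group("num")),
        name=match.group("name"),
        sender_pid=int(sender) if sender is not None else None,
    )


def failed_socket_paths(lines: Iterable[str]) -> list[str]:
    """Collect, in order and without duplicates, the socket paths that failed to bind."""
    paths: list[str] = []
    for line in lines:
        match = BIND_RE.search(line)
        if match and match.group("path") not in paths:
            paths.append(match.group("path"))
    return paths
```

Run over the log excerpt above, `parse_signal` reports PID 24 as the sender of the SIGHUP received by master process 14.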

@AlexFenlon
Contributor

If you are happy, we will close this for now.

@granescb
Author

granescb commented Nov 7, 2024

@AlexFenlon
No, we want to use this security feature but can't right now because of this bug.

Also, it looks like the same problem was reported about a year ago: #4604

@AlexFenlon
Contributor

Hi @granescb,

Thanks again for bringing this to our attention. We will investigate this again and get back to you.

@AlexFenlon AlexFenlon removed the waiting for response Waiting for author's response label Nov 12, 2024
@j1m-ryan j1m-ryan added the needs triage An issue that needs to be triaged label Nov 18, 2024
@j1m-ryan
Member

Hi @granescb, we are looking into this.
Are you using a particular type of node / machine OS?

@MarkTopping

MarkTopping commented Nov 25, 2024

Just noting that this bug was also reported here: #4370

Furthermore, I've run into this issue again with release 3.7.1. This release increased the memory consumption of the ingress controller Pods, which led to OOM kills, which in turn made this issue resurface on Pod restarts.

I'm going to raise a separate issue about the memory consumption, as I haven't spotted anyone else doing so yet.
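
Some context on why restarts matter here: an emptyDir volume lives as long as the Pod, so its contents survive a container restart (OOM kill included), which would be consistent with Pod deletion clearing the problem. One conceivable stopgap, until there is a fix in the controller, is to delete leftover socket files before nginx starts. The sketch below is only an illustration of that idea and not something the project ships; `remove_stale_sockets` is a made-up name, and it is only safe to run while no nginx process is using the directory.

```python
import os
import socket
import stat
import tempfile


def remove_stale_sockets(directory: str) -> list[str]:
    """Delete every Unix socket file directly inside directory.

    Returns the names of the removed entries, sorted. Regular files,
    subdirectories and symlinks are left alone.
    """
    removed = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        # lstat() so that symlinks are inspected, never followed.
        if stat.S_ISSOCK(os.lstat(path).st_mode):
            os.unlink(path)
            removed.append(entry)
    return removed


if __name__ == "__main__":
    # Self-contained demo in a temporary directory; in the setup from this
    # issue the directory would presumably be /var/lib/nginx.
    with tempfile.TemporaryDirectory() as tmp:
        leftover = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        leftover.bind(os.path.join(tmp, "nginx-status.sock"))
        leftover.close()  # the file stays behind after close()
        print(remove_stale_sockets(tmp))
```

Where such a step would run (an entrypoint wrapper, for example) is left open here, since it depends on how the image is built.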

@vepatel
Contributor

vepatel commented Dec 2, 2024

ref: #4604

@vepatel vepatel added this to the v4.1.0 milestone Dec 2, 2024
@jjngx jjngx added ready for refinement An issue that was triaged and it is ready to be refined backlog Pull requests/issues that are backlog items and removed needs triage An issue that needs to be triaged ready for refinement An issue that was triaged and it is ready to be refined labels Dec 16, 2024

6 participants