Director daemon doesn't act on DB connection loss #2909

Open
log1-c opened this issue Aug 14, 2024 · 1 comment
Labels
dev-call Issues and Pull Requests to be discussed at the Dev Call.

Comments

@log1-c

log1-c commented Aug 14, 2024

Not sure about the title, but here is what happened in our setup.

Setup:
Two webservers running the Director daemon as a systemd service. One server is in a public cloud, one is in a private cloud.
The connection to the database is handled via HAProxy to the three Galera cluster nodes. The web interface is behind a load balancer.

Normally the primary instance for the web interface (and thus the daemon) is the private cloud side.

Now there was a VPN connection issue leading to a connection loss for the private cloud side.
Icinga2 switched to the public cloud, icingaweb2 (the loadbalancer) switched to the public cloud, and with it the Director daemon.

But according to journalctl -u icinga-director, the daemon still lost its connection to the MySQL cluster. There are many "MySQL server has gone away" messages from our import & sync jobs.

journalctl -u icinga-director from private cloud host.txt
journalctl -u icinga-director from public cloud host.txt

But the systemctl status icinga-director output still says "running, db: connected":

icinga-director.service - Icinga Director - Monitoring Configuration
   Loaded: loaded (/etc/systemd/system/icinga-director.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2024-08-11 20:33:54 CEST; 2 days ago
     Docs: https://icinga.com/docs/director/latest/
 Main PID: 4020 (icingacli)
   Status: "running, db: connected"
    Tasks: 2 (limit: 24881)
   Memory: 125.2M
   CGroup: /system.slice/icinga-director.service
           ├─  4020 icinga::director: running, db: connected
           └─463212 icinga::director::job (Import all Sources)
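
The mismatch described here (a status line claiming "db: connected" while the journal reports "MySQL server has gone away") could be checked for from outside the daemon. The following is only a rough sketch of such a check, not something Director provides: the function name `status_contradicts_journal` is hypothetical, and it assumes the status text and journal lines have already been collected, for example from `systemctl status icinga-director` and `journalctl -u icinga-director`.

```python
def status_contradicts_journal(status_text, journal_lines,
                               marker="MySQL server has gone away"):
    """Return True if the status claims a DB connection while the journal shows losses.

    Hypothetical helper: both inputs are plain text gathered elsewhere,
    and the marker string is the message quoted in this issue.
    """
    claims_connected = "db: connected" in status_text
    saw_connection_loss = any(marker in line for line in journal_lines)
    return claims_connected and saw_connection_loss
```

A monitoring check built on this could alert whenever the function returns True, which would at least make the stale status visible until the daemon itself handles the situation.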

The issue is:
Our import & sync jobs aren't running, leading to a discrepancy between the monitored infrastructure and the monitoring view.

What I would have expected (one of those):

  • Director daemon retries the database connection so that the jobs can run again
  • Director daemon restarts automatically after a recognised error
  • Director daemon switches to the other running instance once it is reachable again
  • Director daemon stops
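
To illustrate the first and last expectations in the list above, here is a minimal sketch of "retry the connection, and stop if that keeps failing". It is generic Python, not Director code (Director is written in PHP), and every name in it (`run_with_reconnect`, `connect`, `job`, `max_attempts`, `base_delay`) is a hypothetical placeholder. It assumes a lost connection surfaces as a `ConnectionError`.

```python
import time


def run_with_reconnect(connect, job, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep):
    """Run job(conn), reconnecting with exponential backoff on connection loss.

    connect: hypothetical callable returning a fresh DB connection.
    job:     hypothetical callable that uses the connection.
    Raises RuntimeError once max_attempts is exhausted, so the caller
    (or a service manager) can stop or restart the process.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            conn = connect()  # always open a fresh connection per attempt
            return job(conn)
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # wait 1x, 2x, 4x, ... base_delay before the next attempt
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up after {max_attempts} attempts") from last_error
```

If the final error is allowed to terminate the process with a non-zero exit code, a systemd unit configured with `Restart=on-failure` would then restart it, which roughly corresponds to the second expectation.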

System:
OS: rhel8
Director Version 1.11.1

@lippserd
Member

@log1-c thanks for the issue and the logs. This does indeed look strange. Investigating will take some time though.

@lippserd lippserd added the dev-call Issues and Pull Requests to be discussed at the Dev Call. label Oct 25, 2024