Director daemon doesn't act on DB connection loss #2909

log1-c · 2024-08-14T12:28:42Z

Not sure about the title, but here is what happened in our setup.

Setup:
Two webservers running the Director daemon as a systemd service. One server is in a public cloud, one is in a private cloud.
The connection to the database is handled via HAProxy in front of the three Galera cluster nodes. The web interface is behind a load balancer.

Normally the primary instance for the web interface (and thus the daemon) is on the private cloud side.

Now there was a VPN connection issue leading to a connection loss for the private cloud side.
Icinga2 switched to the public cloud, icingaweb2 (the loadbalancer) switched to the public cloud, and with it the Director daemon.

But according to journalctl -u icinga-director, the daemon still lost its connection to the MySQL cluster. There are many "MySQL server has gone away" messages from our import & sync jobs.

journalctl -u icinga-director from private cloud host.txt
journalctl -u icinga-director from public cloud host.txt

But the systemctl status icinga-director output still says "running, db: connected":

icinga-director.service - Icinga Director - Monitoring Configuration
   Loaded: loaded (/etc/systemd/system/icinga-director.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2024-08-11 20:33:54 CEST; 2 days ago
     Docs: https://icinga.com/docs/director/latest/
 Main PID: 4020 (icingacli)
   Status: "running, db: connected"
    Tasks: 2 (limit: 24881)
   Memory: 125.2M
   CGroup: /system.slice/icinga-director.service
           ├─  4020 icinga::director: running, db: connected
           └─463212 icinga::director::job (Import all Sources)
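The mismatch between the status line and the journal could be detected from outside the daemon. The following is only a minimal sketch, not part of Director: the function name `status_contradicts_journal` and its `threshold` parameter are hypothetical, and it assumes the status text and the journal lines have already been collected (for example from `systemctl status` and `journalctl -u icinga-director`).

```python
from typing import Iterable

# Error text quoted from the journal messages described in this report.
GONE_AWAY = "MySQL server has gone away"


def status_contradicts_journal(
    status_line: str,
    journal_lines: Iterable[str],
    threshold: int = 1,  # hypothetical knob: how many errors count as a problem
) -> bool:
    """Return True when the status claims a DB connection but the
    journal contains at least `threshold` connection-loss messages."""
    claims_connected = "db: connected" in status_line
    errors = sum(1 for line in journal_lines if GONE_AWAY in line)
    return claims_connected and errors >= threshold
```

A monitoring check built on something like this could raise an alert or trigger a service restart when it returns True.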

The issue is:
Our import & sync jobs aren't running, leading to a discrepancy between the monitored infrastructure and the monitoring view.

What I would have expected (one of those):

Director daemon retries the database connection so that the jobs can run again
Director daemon automatically restarts because of a recognised error
Director daemon switches to the other running instance once it is reachable again
Director daemon stops
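To illustrate the first and last of these expectations together (retry the connection, and stop if that keeps failing), here is a minimal sketch of a reconnect loop with exponential backoff. It is not Director's code: `run_with_reconnect`, `connect`, `job`, and all parameters are hypothetical names, and it assumes a lost connection surfaces as Python's built-in `ConnectionError`.

```python
import time
from typing import Callable, Optional, TypeVar

C = TypeVar("C")  # connection type
R = TypeVar("R")  # job result type


def run_with_reconnect(
    connect: Callable[[], C],      # hypothetical: opens a fresh DB connection
    job: Callable[[C], R],         # hypothetical: e.g. an import or sync run
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Run job on a fresh connection, retrying with exponential backoff.

    Re-raises the last ConnectionError once max_attempts is exhausted,
    so a supervisor such as systemd can restart or stop the process.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[ConnectionError] = None
    for attempt in range(max_attempts):
        try:
            conn = connect()  # reconnect on every attempt
            return job(conn)
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    assert last_error is not None
    raise last_error
```

Letting the final error propagate, rather than swallowing it, means the process exits with a failure instead of continuing to report a healthy state.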

System:
OS: rhel8
Director Version 1.11.1


lippserd · 2024-10-25T12:05:50Z

@log1-c thanks for the issue and the logs. This does indeed look strange. Investigating will take some time though.

lippserd added the dev-call label (Issues and Pull Requests to be discussed at the Dev Call) on Oct 25, 2024
