Separate regex validation failures from `ssl_probe_success` metric #162

dragoangel · 2024-02-23T14:51:21Z

In my view, `up == 0` covers failed protocol regex checks better than `ssl_probe_success` does, because a failing `ssl_probe_success` should indicate a TLS problem rather than host unavailability or instability.
Ideally there would be a dedicated metric for protocol-specific regex checks that reports such failures in combination with `up == 0`, so that existing users' alerts still cover these issues as before, while newly configured alerts can distinguish a generally unavailable service from one that fails protocol-specific checks.


dragoangel · 2024-02-23T15:03:15Z

Another option is to add a label with the failure reason instead of creating additional metrics :)

It would be great if all the data now written to the logs about check failures were available via a metric label, so users could write alert descriptions that state the exact reason for the failure. For example:

read tcp x:x->x:x: i/o timeout
dial tcp x:x: operation was canceled
regex: ^x didn't match: ...
dial tcp x:25: connect: connection refused
...
etc

ribbybibby · 2024-03-21T13:42:56Z

The up metric records the success/failure of requests from Prometheus -> the exporter. I don't think it would be right for us to return a non-2xx response if the exporter is fine but the issue is with the upstream.

You make a good point about non-TLS related failures though. Perhaps we should ignore errors from the upstream as long as we can successfully establish a TLS connection and extract certificates?

dragoangel · 2024-03-21T13:47:58Z

My point is that I want to see that a host is down or timing out separately from SSL failures. Each could be its own explicit metric, but that requires a lot of alerts. Alternatively, the reason could be recorded as a label, so that a single alert covers all issues and reports the exact reason for the failure.

ribbybibby · 2024-03-21T13:51:07Z

What are some examples of SSL related errors? TLS verification failing? Is there anything else that can happen?

I suppose bugs in our regex matching for starttls? Or servers that do things in a way we haven't accounted for?

ribbybibby · 2024-03-21T14:01:14Z

Putting raw error log strings into metrics strikes me as the wrong approach. Metrics are not designed for that kind of information.

We could have some coarser labels like 'starttls' I guess? What would you actually use this delineation for though? How would you treat a host that is timing out vs a host that is failing the starttls handshake differently?

dragoangel · 2024-03-21T14:04:26Z

Yes, I would definitely treat a host with failed SSL differently from a host that is down, because in the latter case the server could be completely off. An alert that gives the exact reason for what is going on is always better than one that could have several different causes, because then you have to check all of them. And in a historical view you would know what happened without having to read the logs of a remote ssl exporter set up somewhere far away :)

dragoangel · 2024-03-21T15:02:53Z

> What are some examples of SSL related errors? TLS verification failing? Is there anything else that can happen?
>
> I suppose bugs in our regex matching for starttls? Or servers that do things in a way we haven't accounted for?

For example, the regex can fail to match in SMTP when the server is totally overloaded and does not return any data; it just opens the connection. I have seen this a couple of times.
