Separate regex validation failures from ssl_probe_success metric
#162
Comments
Another option is to add a label with the reason for the failure instead of creating other metrics :) It would be cool if all the data about check failures that is now written to logs were also available via a metric label, so users could write an alert description that indicates the exact reason for the failure.
You make a good point about non-TLS related failures though. Perhaps we should ignore errors from the upstream as long as we can successfully establish a TLS connection and extract certificates?
My point is that I want to see that a host is down or timing out separately from SSL failures. Each case could be its own metric, but that requires a lot of alerts. Alternatively, the reason could be recorded as a label, so that a single alert covers all issues and reports the exact reason for the failure.
What are some examples of SSL-related errors? TLS verification failing? Is there anything else that can happen? I suppose bugs in our regex matching for starttls? Or servers that do things in a way we haven't accounted for?
Putting raw error log strings into metrics strikes me as the wrong approach. Metrics are not designed for that kind of information. We could have some coarser labels like 'starttls' I guess? What would you actually use this delineation for though? How would you treat a host that is timing out vs a host that is failing the starttls handshake differently?
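One way to get coarser labels without putting raw error strings into metrics is to map each failure onto a small, fixed set of reason values. The following is a minimal Python sketch of that idea, not the exporter's actual code. The function name classify_failure, the StarttlsError class, and every label value are hypothetical.

```python
import socket
import ssl


class StarttlsError(Exception):
    """Hypothetical error for a failed protocol-level STARTTLS exchange."""


def classify_failure(exc: BaseException) -> str:
    """Map an exception onto a small, fixed set of reason labels.

    The label values are illustrative assumptions; keeping the set small
    bounds the number of time series such a label could create.
    """
    if isinstance(exc, StarttlsError):
        return "starttls"
    # Check certificate verification before the more general SSLError,
    # because SSLCertVerificationError is a subclass of SSLError.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "tls_verify"
    if isinstance(exc, ssl.SSLError):
        return "tls_handshake"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "connection_refused"
    if isinstance(exc, socket.gaierror):
        return "dns"
    return "other"
```

With a mapping like this, a single alert could include the reason in its description while the label stays limited to a handful of known values.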
Yes, I would definitely treat a host with failed SSL differently from a host that is down, because the latter could mean the server is completely off. An alert that gives the exact reason for what is going on is always better than one that could fire for several different reasons, because then you need to check all of them. And in a historical view, you would know what happened without going and reading the logs of a remote ssl exporter that was set up somewhere far away :)
For example, the regex can fail to match in SMTP when the server is totally overloaded and does not return any data; it just opens the connection. I have seen this a couple of times.
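The overloaded-server case above can be told apart from a real mismatch by checking whether any data arrived before applying the regex. This is a rough Python sketch under that assumption; check_banner and its return values are hypothetical names, and the pattern used in the usage note is only an example.

```python
import re


def check_banner(data: bytes, pattern: str) -> str:
    """Classify a protocol response before any TLS handshake is attempted."""
    if not data:
        # The connection opened but the server sent nothing back.
        return "no_data"
    if re.search(pattern.encode(), data) is None:
        # The server answered, but not with what the protocol check expects.
        return "regex_mismatch"
    return "ok"
```

For an SMTP greeting, check_banner(b"", r"^220 ") yields "no_data", while a reply such as b"554 busy\r\n" yields "regex_mismatch".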
From my view, up == 0 better covers failed protocol regex checks than ssl_probe_success, as a failed ssl_probe_success should indicate a TLS problem rather than host unavailability or instability.

Ideally there should be a dedicated metric for protocol-specific regexes that indicates such failures, in composition with up == 0, so that existing users' alerts still cover issues as before, while newly configured alerts can separate a generally unavailable service from a service that fails protocol-specific checks.
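A dedicated metric of this kind might look roughly like the sketch below, written in Python for illustration rather than taken from the exporter. ssl_probe_success is the metric named in this issue; ssl_probe_protocol_success and the render_metrics function are hypothetical names.

```python
def render_metrics(protocol_ok: bool, tls_ok: bool) -> str:
    """Render two gauges in the Prometheus text exposition format.

    ssl_probe_success appears in the issue; ssl_probe_protocol_success is a
    hypothetical name for the proposed protocol-check metric.
    """
    lines = [
        f"ssl_probe_protocol_success {int(protocol_ok)}",
        # TLS can only succeed if the protocol-specific step succeeded first,
        # so alerts on ssl_probe_success keep firing as they do today.
        f"ssl_probe_success {int(protocol_ok and tls_ok)}",
    ]
    return "\n".join(lines) + "\n"
```

In this sketch, ssl_probe_success still drops to 0 on a protocol failure, so existing alerts are unaffected, while a new alert can use the second gauge to tell a protocol-check failure from a TLS failure.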