Metrics help cluster admins verify that their cluster and the devices attached to it are in good working order, and they support capacity planning. They can also be used elsewhere, for example to define custom auto-scaling rules.
An Operator can expose its own custom metrics using the Prometheus Go client library. For an example of how this can be done, see the Ptemplate example operator implementation.
For an Operator to integrate seamlessly with the OpenShift admin dashboard, it is highly recommended to provide a Grafana dashboard spec in the Operator. This gives cluster admins an out-of-the-box view of the devices and a single pane of glass to examine in the event of problems.
For details on how to create a dashboard, see the Grafana documentation.
Once metrics are being passed to Prometheus, they can be used to trigger alerts. Alertmanager can then pick up those alerts and either wake someone up at 2 AM through its PagerDuty integration or, in the best case, trigger automated remediation.
This requires the creation of PrometheusRule objects with expressions (spec.groups[].rules[].expr) that reference the metric name.
For example, this rule will trigger an InstanceDown alert when the Ptemplate_consumers metric is zero for more than 1 minute:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert
  namespace: ns1
spec:
  groups:
  - name: example  # group name is illustrative
    rules:
    - alert: InstanceDown
      expr: Ptemplate_consumers == 0
      for: 1m
Configuring The Monitoring Stack In OpenShift
OpenShift PrometheusRule reference