Kubernetes distributed alert management with Prometheus Operator and Flux Notification Controller

Distributed alert management (DAM) automatically identifies violations of service level objectives and risky activity inside a cluster and its GitOps infrastructure. In my previous post, I presented a redundant monitoring infrastructure based on a variety of tools such as Grafana, Prometheus, Loki and Thanos. This article focuses on how to integrate continuous monitoring and alerting with a modern Kubernetes cluster.

To collect real-time information from a Kubernetes cluster we use Prometheus, a tool that scrapes metrics and stores them as time series data. Alertmanager then handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, SMS, an MS Teams channel and so on. It also takes care of silencing and inhibition of alerts.

Prometheus operator

All Alertmanager configuration is written in YAML, and Alertmanager itself is configured for high availability. The essential part of the solution is the Prometheus Operator. It introduces an Alertmanager resource, which allows users to declaratively describe an Alertmanager cluster. To successfully deploy an Alertmanager cluster, it is important to understand the contract between Prometheus and Alertmanager. Alertmanager is used to:

  • Deduplicate alerts received from Prometheus.
  • Silence alerts.
  • Route and send grouped notifications to various integrations (PagerDuty, OpsGenie, mail, chat, …).

First, we need to deploy an Alertmanager cluster:

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: example
spec:
  replicas: 3

Then watch the pods start:

kubectl get pods -l alertmanager=example -w
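
For Prometheus to deliver alerts to this cluster, the Prometheus resource has to reference an Alertmanager service. The following is a minimal sketch, not the author's exact setup. It assumes the Prometheus object is also named example, that everything runs in the prometheus namespace, and that the operator-created alertmanager-operated service with its web port is the target.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example            # assumption: name of the Prometheus object
  namespace: prometheus    # assumption: namespace used for monitoring
spec:
  alerting:
    alertmanagers:
    # Service that fronts the Alertmanager pods; alertmanager-operated is
    # the governing service the operator creates for Alertmanager objects.
    - namespace: prometheus
      name: alertmanager-operated
      port: web
```

With this in place, every replica of Prometheus sends firing alerts to all Alertmanager pods behind the service, and Alertmanager deduplicates them.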

Next, create an AlertmanagerConfig resource that sends notifications to the appropriate AWS Lambda webhook:

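apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example-alertmanagerconfig
spec:
  route:
    # Group alerts by name and control how often notifications are sent.
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'lambda-webhook'
    # Only alerts matching all of these label matchers take this route.
    matchers:
    - name: namespace
      matchType: "="
      value: "my-cluster-namespace"
    - name: severity
      matchType: "=~"
      value: "warning|critical|error"
  receivers:
  - name: 'lambda-webhook'
    webhookConfigs:
    # The URL is given inline. To keep it out of the manifest, use urlSecret
    # with the name of a Secret and the key inside it that holds the URL.
    - url: 'http://my-lambda-webhook/'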
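
An Alertmanager object only loads the AlertmanagerConfig resources that its spec.alertmanagerConfigSelector matches. The sketch below extends the Alertmanager resource from earlier. The label alertmanagerConfig: example is an assumption chosen for illustration, and the same label would have to be added under metadata.labels of the AlertmanagerConfig.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: example
spec:
  replicas: 3
  # Select AlertmanagerConfig objects carrying this label.
  # The label key and value are hypothetical; any label works as long as
  # the AlertmanagerConfig metadata carries the same one.
  alertmanagerConfigSelector:
    matchLabels:
      alertmanagerConfig: example
```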

The PrometheusRule CRD allows you to define alerting and recording rules. The operator knows which PrometheusRule objects to select for a given Prometheus based on the spec.ruleSelector field. By default, a Prometheus resource discovers only PrometheusRule resources in the same namespace. This can be refined with the ruleNamespaceSelector field:

  • To discover rules from all namespaces, pass an empty dict (ruleNamespaceSelector: {}).
  • To discover rules from all namespaces matching a certain label, use the matchLabels field.

An example PrometheusRule with a single alert that always fires:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    prometheus: example
    role: alert-rules
  name: prometheus-alerts
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: ExampleAlert
      expr: vector(1)

Then display the rules with:

kubectl get prometheusrule -n prometheus -o yaml
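
The selection described above can be sketched on the Prometheus side as follows. The ruleSelector labels are the ones carried by the PrometheusRule in this post. The Prometheus object name, its namespace and the namespace label monitoring: enabled are assumptions made for illustration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example            # assumption: name of the Prometheus object
  namespace: prometheus    # assumption: namespace used for monitoring
spec:
  # Pick up PrometheusRule objects that carry both labels.
  ruleSelector:
    matchLabels:
      prometheus: example
      role: alert-rules
  # Look for rules in every namespace labelled monitoring=enabled
  # (hypothetical label). Use {} instead to search all namespaces.
  ruleNamespaceSelector:
    matchLabels:
      monitoring: enabled
```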

Flux notification controller

The Notification Controller is a Kubernetes operator specialized in handling inbound and outbound events. For more information, see fluxcd.io. The controller handles events coming from external systems (GitHub, GitLab, Bitbucket, Harbor, Jenkins, etc.) and notifies the GitOps toolkit controllers about source changes. In the other direction, it dispatches events emitted by the toolkit controllers to external systems based on event severity and the objects involved.

We send events to Alertmanager using the dedicated Flux provider for it. Alertmanager distributes notifications according to the previous configuration (to an MS Teams channel). We catch errors related to all Flux objects. Below is a sample configuration for a test environment:

apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Provider
metadata:
  name: alertmanager
  namespace: prometheus-alerts
spec:
  type: alertmanager
  address: http://alertmanager-operated.prometheus:9093/api/v2/alerts/
---
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Alert
metadata:
  name: errors
  namespace: prometheus-alerts
spec:
  # Send matching events to the Provider defined above.
  providerRef:
    name: alertmanager
  eventSeverity: error
  # Watch every object of each Flux kind.
  eventSources:
  - kind: HelmRelease
    name: '*'
    namespace: prometheus-alerts
  - kind: ImagePolicy
    name: '*'
    namespace: prometheus-alerts
  - kind: Kustomization
    name: '*'
    namespace: prometheus-alerts
  - kind: ImageRepository
    name: '*'
  - kind: GitRepository
    name: '*'
    namespace: prometheus-alerts
  - kind: HelmChart
    name: '*'
    namespace: prometheus-alerts
  - kind: HelmRepository
    name: '*'
    namespace: prometheus-alerts
  - kind: OCIRepository
    name: '*'
    namespace: prometheus-alerts
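
On the Alertmanager side, the events forwarded by Flux can get their own route. The sketch below relies on the labels that the Flux documentation describes for the alertmanager provider (an alertname that starts with Flux, plus severity, kind, name and namespace); treat those label names as an assumption to verify against your Flux version. The resource name flux-alerts is hypothetical, and the receiver reuses the webhook URL from the earlier example.

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: flux-alerts              # hypothetical name
  namespace: prometheus-alerts   # assumption: same namespace as the Flux Alert
spec:
  route:
    receiver: 'lambda-webhook'
    # One notification per failing Flux object instead of one per event.
    groupBy: ['alertname', 'kind', 'name']
    matchers:
    # Assumed label values: Flux alert names begin with "Flux".
    - name: alertname
      matchType: "=~"
      value: "Flux.*"
    - name: severity
      matchType: "="
      value: "error"
  receivers:
  - name: 'lambda-webhook'
    webhookConfigs:
    - url: 'http://my-lambda-webhook/'
```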

In addition to the Flux Notification Controller, we use a Grafana dashboard to visualize cluster behavior over time. As a bonus, Grafana allows exploring logs from Loki and building graphs on top of Prometheus metrics.

I hope you liked the post. Please follow me on Twitter or LinkedIn and subscribe to the newsletter.
