Kubernetes distributed alert management with Prometheus Operator and Flux Notification Controller

Distributed alert management (DAM) makes it possible to automatically identify non-compliance with service level objectives and any risky activity inside a cluster and its GitOps infrastructure. In my previous post, I presented a redundant monitoring infrastructure based on a variety of tools such as Grafana, Prometheus, Loki and Thanos. This article focuses on a way to integrate continuous monitoring and alerting with a modern Kubernetes cluster.

To collect real-time information from the Kubernetes cluster we use Prometheus, a tool that scrapes and stores metrics as time series data. Alertmanager then handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, SMS, an MS Teams channel and so on. It also takes care of silencing and inhibition of alerts.
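
As a frame of reference, this is roughly what a minimal standalone alertmanager.yml expressing those concepts looks like (a sketch only; with the Prometheus Operator this file is generated from the CRDs shown later in this post, and the receiver URL is a placeholder):

route:
  group_by: ['alertname']       # alerts with the same name are grouped into one notification
  group_wait: 30s               # wait before sending the first notification for a new group
  repeat_interval: 12h          # resend while the alert keeps firing
  receiver: 'default-webhook'
receivers:
- name: 'default-webhook'
  webhook_configs:
  - url: 'http://example.internal/hook'       # placeholder receiver URL
inhibit_rules:
- source_matchers: ['severity="critical"']    # a firing critical alert...
  target_matchers: ['severity="warning"']     # ...suppresses warnings...
  equal: ['alertname']                        # ...with the same alertname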

Prometheus Operator

All alerts inside Alertmanager are configured in YAML, and Alertmanager itself is configured for high availability. The essential part of the solution is the Prometheus Operator. The Prometheus Operator introduces an Alertmanager resource, which allows users to declaratively describe an Alertmanager cluster. To successfully deploy an Alertmanager cluster, it is important to understand the contract between Prometheus and Alertmanager. Alertmanager is used to:

  • Deduplicate alerts received from Prometheus.
  • Silence alerts.
  • Route and send grouped notifications to various integrations (PagerDuty, OpsGenie, mail, chat, …).

First, we need to deploy an Alertmanager cluster:

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: example
spec:
  replicas: 3

Wait for the pods to become ready:

kubectl get pods -l alertmanager=example -w
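
With the Alertmanager pods running, Prometheus still has to be told where to send its alerts; this is the contract mentioned above. A minimal sketch of the corresponding Prometheus resource (the prometheus namespace and the operator-created alertmanager-operated Service are assumptions that match the rest of this post):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  serviceAccountName: prometheus   # assumes a service account with the usual Prometheus RBAC
  alerting:
    alertmanagers:
    - namespace: prometheus        # namespace the Alertmanager cluster runs in
      name: alertmanager-operated  # headless Service created by the operator
      port: web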

Next, create an AlertmanagerConfig resource that sends notifications to the appropriate AWS Lambda webhook:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example-alertmanagerconfig
spec:
  route:
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'lambda-webhook'
    matchers:
    - name: namespace
      matchType: =
      value: "my-cluster-namespace"
    - name: severity
      matchType: =~
      value: "warning|critical|error"
  receivers:
  - name: 'lambda-webhook'
    webhookConfigs:
    - urlSecret:
        name: alertmanager          # Secret holding the webhook URL
        key: lambda-webhook-url     # key whose value is the URL, e.g. http://my-lambda-webhook/
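
To check that the route and matchers behave as expected, you can post a synthetic alert straight to the Alertmanager API. A sketch, assuming the Alertmanager runs in the prometheus namespace behind the operator-created alertmanager-operated Service:

# Forward the Alertmanager API to localhost.
kubectl -n prometheus port-forward svc/alertmanager-operated 9093 &

# Post a fake alert whose labels match the namespace and severity matchers above;
# with that routing in place it should end up at the lambda-webhook receiver.
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","namespace":"my-cluster-namespace","severity":"warning"}}]'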

The PrometheusRule CRD allows you to define alerting and recording rules. The operator knows which PrometheusRule objects to select for a given Prometheus based on the spec.ruleSelector field. By default, the Prometheus resource discovers only PrometheusRule resources in the same namespace. This can be refined with the ruleNamespaceSelector field:

  • To discover rules from all namespaces, pass an empty dict (ruleNamespaceSelector: {}).
  • To discover rules from all namespaces matching a certain label, use the matchLabels field (see the sketch below).
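
For example, assuming the Prometheus resource named example from above, the two selectors could look like this (the team: frontend namespace label is purely illustrative):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  ruleSelector:               # pick up PrometheusRule objects carrying these labels
    matchLabels:
      prometheus: example
      role: alert-rules
  ruleNamespaceSelector:      # ...from every namespace labelled team: frontend
    matchLabels:
      team: frontend

A minimal PrometheusRule carrying those labels looks like this:
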
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    prometheus: example
    role: alert-rules
  name: prometheus-alerts
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: ExampleAlert
      expr: vector(1)

Then display the deployed rules with:

kubectl get prometheusrule -n prometheus -o yaml

Flux Notification Controller

The Notification Controller is a Kubernetes operator, specialized in handling inbound and outbound events. For more information go to fluxcd.io. The controller handles events coming from external systems (GitHub, GitLab, Bitbucket, Harbor, Jenkins, etc.) and notifies the GitOps toolkit controllers about source changes.

We send events to Alertmanager using the dedicated alertmanager provider. Alertmanager then distributes notifications according to the previous configuration (to an MS Teams channel). We catch errors related to all Flux objects. Below is a sample configuration for a test environment:

apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Provider
metadata:
  name: alertmanager
  namespace: prometheus-alerts
spec:
  type: alertmanager
  address: http://alertmanager-operated.prometheus:9093/api/v2/alerts/
---
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Alert
metadata:
  name: errors
  namespace: prometheus-alerts
spec:
  providerRef:
    name: alertmanager
  eventSeverity: error
  eventSources:
  - kind: HelmRelease
    name: '*'
    namespace: prometheus-alerts
  - kind: ImagePolicy
    name: '*'
    namespace: prometheus-alerts
  - kind: Kustomization
    name: '*'
    namespace: prometheus-alerts
  - kind: ImageRepository
    name: '*'
    namespace: prometheus-alerts
  - kind: GitRepository
    name: '*'
    namespace: prometheus-alerts
  - kind: HelmChart
    name: '*'
    namespace: prometheus-alerts
  - kind: HelmRepository
    name: '*'
    namespace: prometheus-alerts
  - kind: OCIRepository
    name: '*'
    namespace: prometheus-alerts
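
Once applied, a quick way to confirm that the Provider and the Alert have been reconciled is the flux CLI (resource names follow the manifests above):

flux -n prometheus-alerts get alert-providers
flux -n prometheus-alerts get alerts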

In addition to the Flux Notification Controller, we use dashboards in Grafana to visualize cluster behavior over time. As a bonus, Grafana allows exploring logs from Loki and building graphs on top of Prometheus metrics.

I hope you like the post. Please follow me on Twitter or LinkedIn and subscribe to the newsletter.
