Building a redundant EKS monitoring and alerting stack

Because Kubernetes containers are ultimately Linux processes, we can use our favorite tools to monitor and log cluster performance. In Kubernetes, application monitoring does not depend on a single monitoring solution. Each organization has its own requirements for monitoring sensitivity and for log ingestion, analysis and persistence. That is why it is important to build your own EKS monitoring and alerting stack that fits all of your requirements.

Monitoring and logging stack
Flow
- Inside cluster
- Outside cluster

In shared, multi-tenant clusters it is very important, on the one hand, to fully isolate the telemetry data of individual tenants (making it available only to the relevant tenant) and, on the other hand, to collect data from all areas of the cluster for administrative purposes.

Monitoring and logging stack

Prometheus

Prometheus is the monitoring solution used to check the health of the cluster and the services deployed on it. It collects and stores metrics as time series data: each sample is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. In the multi-tenant cluster there is a single Prometheus deployment for the whole cluster, which gathers information from various exporters. An exporter is a service that exposes metrics via an HTTP endpoint. Prometheus uses service discovery to automatically obtain the list of running exporters, then scrapes their metrics at a fixed interval (1 minute). Prometheus is paired with Alertmanager, which sends notifications for alerts that fire when the given criteria are met.
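To make the exporter idea more concrete, here is a minimal Python sketch of how a single sample with labels could be rendered in the Prometheus text exposition format, which is what a scrape of an exporter endpoint returns. The function name, metric name and label values are illustrative assumptions, not part of the stack described here.

```python
def render_sample(name, labels, value):
    """Render one sample as a line of the Prometheus text exposition format."""

    def escape(label_value):
        # Label values must escape backslash, double quote and newline.
        return (
            label_value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    if not labels:
        return f"{name} {value}"
    # Sort labels so the output is stable between scrapes.
    body = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
    return f"{name}{{{body}}} {value}"


if __name__ == "__main__":
    # Hypothetical metric exposed by an exporter in a tenant namespace.
    print(render_sample("http_requests_total", {"namespace": "tenant-a", "code": "200"}, 1027))
```

The metric name plus the full label set identifies one time series, so a `namespace` label like the one above is what later makes per-tenant filtering possible.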

Loki and Fluent Bit

Loki is a log aggregation system inspired by Prometheus. In this solution, a single Loki instance serves both administrative purposes and tenant usage (a multi-tenant setup). Fluent Bit, deployed as one instance per node, feeds the Loki instance and routes logs to the appropriate tenants based on namespace names. Tenant isolation is realized by loki-proxy, which allows each tenant to access only its own logs.
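The namespace-based routing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the namespace-to-tenant mapping and the `admin` fallback are hypothetical, and in the real stack Fluent Bit performs this step. The `X-Scope-OrgID` header and the JSON body shape are what Loki's push endpoint (`/loki/api/v1/push`) expects in multi-tenant mode.

```python
import json

# Hypothetical mapping; a real cluster would derive it from its own
# namespace naming convention or from namespace labels.
NAMESPACE_TENANTS = {
    "team-a-prod": "team-a",
    "team-a-dev": "team-a",
    "team-b-prod": "team-b",
}
ADMIN_TENANT = "admin"  # assumed fallback for system namespaces


def tenant_for(namespace):
    """Pick the Loki tenant for a namespace, falling back to the admin tenant."""
    return NAMESPACE_TENANTS.get(namespace, ADMIN_TENANT)


def build_push_request(namespace, pod, line, ts_ns):
    """Return (headers, body) for a POST to Loki's /loki/api/v1/push endpoint."""
    headers = {
        "Content-Type": "application/json",
        # Loki uses this header to separate tenants in multi-tenant mode.
        "X-Scope-OrgID": tenant_for(namespace),
    }
    payload = {
        "streams": [
            {
                "stream": {"namespace": namespace, "pod": pod},
                # Each value is [timestamp in nanoseconds as a string, log line].
                "values": [[str(ts_ns), line]],
            }
        ]
    }
    return headers, json.dumps(payload)


if __name__ == "__main__":
    headers, body = build_push_request("team-a-prod", "api-0", "request served", 1700000000000000000)
    print(headers["X-Scope-OrgID"], body)
```

A proxy in front of Loki can then enforce that a tenant's queries always carry that tenant's own ID.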

Thanos Metrics

Thanos is a fast, cost-effective and scalable monitoring solution and time series database. In the multi-tenant cluster, Thanos is used for tenant isolation at the storage level. Thanos can store metrics separately based on tenantId, allowing tenants to query only the metrics relevant to their applications. Additionally, there is another data store for admin purposes that aggregates metrics from all namespaces. It is used to visualise cluster health as a whole in one place. The metrics are stored in a Kubernetes Persistent Volume.

Grafana

Grafana is a visualization tool with a convenient set of ready-to-use and custom-made dashboards. Each tenant has its own Grafana deployment with automatically added data sources for Prometheus (via Thanos, using prometheus-proxy) and Loki (via loki-proxy). Grafana allows exploring logs from Loki and building graphs on top of Prometheus metrics.

Kiali and Jaeger

Kiali offers correlated views of metrics, logs, and tracing, as well as strong validations to pinpoint configuration issues. Kiali includes Grafana and Jaeger integrations. In the solution, it queries Jaeger metrics or traces and visualizes them on interactive dashboards.

Jaeger is a distributed tracing system used for monitoring and troubleshooting microservices, including distributed context propagation, root cause analysis, service dependency analysis, and performance optimization. Traces are visualized in Kiali, which queries them directly from Jaeger.

CloudTrail

Amazon EKS is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or an AWS service in Amazon EKS. CloudTrail captures all API calls for Amazon EKS as events. This includes calls from the Amazon EKS console and from code calls to the Amazon EKS API operations.

If you create a trail, you can enable continuous delivery of CloudTrail events to an Amazon S3 bucket. Using the information that CloudTrail collects, you can determine several details about a request. For example, you can determine when the request was made to Amazon EKS, the IP address where the request was made from, and who made the request.
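As a rough illustration of the details mentioned above, the following Python sketch pulls the "when", "where from" and "who" out of a single CloudTrail record. The field names (`eventTime`, `sourceIPAddress`, `userIdentity`, `eventName`, `eventSource`) are standard CloudTrail record fields; the function name and the sample values are made up for the example.

```python
def summarize_cloudtrail_event(record):
    """Extract who did what, when, and from where out of one CloudTrail record."""
    identity = record.get("userIdentity", {})
    return {
        "when": record.get("eventTime"),
        "source_ip": record.get("sourceIPAddress"),
        # Fall back to the identity type if no ARN is present.
        "who": identity.get("arn") or identity.get("type"),
        "action": record.get("eventName"),
        "service": record.get("eventSource"),
    }


if __name__ == "__main__":
    # Hypothetical record, trimmed to the fields used above.
    sample = {
        "eventTime": "2023-01-01T12:00:00Z",
        "eventSource": "eks.amazonaws.com",
        "eventName": "CreateCluster",
        "sourceIPAddress": "203.0.113.10",
        "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::111122223333:user/admin"},
    }
    print(summarize_cloudtrail_event(sample))
```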

CloudTrail is used by cluster admins as an additional layer of cluster monitoring that collects and processes base telemetry outside the cluster (in contrast to Prometheus and Loki, which both run inside the cluster). It collects only base cluster data with no container logs and is mainly used to monitor the status and availability of the internal cluster monitoring services.

AWS Lambda, SNS topic and notification

We can use Amazon Simple Notification Service (Amazon SNS) topics to process a variety of events from CloudTrail, Prometheus or an S3 bucket. Amazon SNS supports Lambda functions as a target for messages sent to a topic, so you can use a Lambda function to process SNS notifications. You can subscribe your function to topics and have it notify an email group, send a direct message or even place a call.
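A Lambda function subscribed to an SNS topic receives the messages under `Records[].Sns`. The sketch below, in Python, only parses those records and returns a summary; the default subject text is an assumption, and the actual delivery to email, chat or phone is deliberately left out because it depends on the channel you choose.

```python
import json


def lambda_handler(event, context):
    """Parse SNS records delivered to Lambda and collect them for notification."""
    notifications = []
    for record in event.get("Records", []):
        sns = record["Sns"]
        try:
            # Many publishers (CloudTrail among them) send JSON in the message body.
            message = json.loads(sns["Message"])
        except (TypeError, ValueError):
            # Keep plain-text messages as they are.
            message = {"raw": sns["Message"]}
        notifications.append(
            {
                "subject": sns.get("Subject") or "EKS monitoring event",
                "message": message,
            }
        )
    # A real function would forward each notification to its channel here.
    return {"count": len(notifications), "notifications": notifications}


if __name__ == "__main__":
    event = {"Records": [{"Sns": {"Subject": "Alert", "Message": "{\"eventName\": \"DeleteCluster\"}"}}]}
    print(lambda_handler(event, None))
```

Keeping the parsing separate from the delivery makes it easy to test the function locally with a hand-written event before wiring it to a topic.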

Flow

Now, let's look at how all of these tools are interconnected in one stack. One of the main aims is to provide high redundancy of log and metrics data and to keep working under high load without significant performance degradation.

Inside cluster

Jaeger collects service mesh trace data from Istio and passes it to Kiali, the central monitoring component. Kiali presents a high-level view of the namespaces accessible to the user. It combines service and application information, along with telemetry, validations, and health, to provide a summary of system behavior. From here users can perform namespace-level actions or quickly navigate to more detailed views.

The Kiali Graph offers a powerful visualization of mesh traffic. The topology combines real-time request traffic with Istio configuration information to present immediate insight into the behavior of the service mesh, allowing you to quickly pinpoint issues. The Istio configuration view provides advanced filtering and navigation for Istio configuration objects, such as Virtual Services and Gateways. Kiali provides inline config editing and powerful semantic validation for Istio resources.

Metrics are scraped by Prometheus via exporters from core K8s services and complementary components (such as Jaeger, Velero, GitOps tools like Argo CD or Flux, kubecost, the kured controller, etc.).

Logs are ingested from Pods by Fluent Bit into the multi-tenant instance of Loki. Fluent Bit is a lightweight and fast solution for sending logs to Loki. It requires fewer system resources than many other log collection agents, so it can process a large amount of log data with minimal impact on system performance. Fluent Bit acts as a bridge between your logs and Loki, which is a horizontally scalable, highly available, multi-tenant log aggregation system.

Outside cluster

Telemetry storage: Thanos stores and aggregates metrics in common storage, keeping the data in an S3 bucket. After 30 days, metrics data is downsampled to 5-minute intervals, and after 60 days it moves to Glacier. The Loki instance stores the cluster logs in the same way. This provides cost savings and good redundancy at the same time.
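The retention schedule above can be written down as a small Python sketch. The stage names, the inclusive day boundaries, the rule ID and the `metrics/` prefix are assumptions made for illustration. The dictionary follows the shape of an S3 lifecycle configuration, which could be passed as `LifecycleConfiguration` to boto3's `put_bucket_lifecycle_configuration`; that call is not made here.

```python
def storage_stage(age_days):
    """Map the age of metrics data to its storage stage (boundaries assumed inclusive)."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days >= 60:
        return "glacier"
    if age_days >= 30:
        return "s3-5m-downsampled"
    return "s3-raw"


def glacier_lifecycle(prefix, days=60):
    """Build an S3 lifecycle configuration that moves objects to Glacier."""
    return {
        "Rules": [
            {
                "ID": "archive-to-glacier",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


if __name__ == "__main__":
    for age in (1, 45, 90):
        print(age, storage_stage(age))
    print(glacier_lifecycle("metrics/"))
```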

Observability of the cluster components and individual services is provided in Grafana dashboards and, with more detail for traces, in Kiali.

The alerting mechanism is based on EKS CloudTrail events. An event is either sent to a notification channel or email directly by Lambda, or first processed and filtered through an SNS topic for higher redundancy and process isolation.

Custom Loki alerts. Alerting rules allow you to define alert conditions based on LogQL expressions and to send notifications about firing alerts to an external notification service.
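As an illustration of a LogQL-based alert condition, the Python sketch below builds a rule in the Prometheus-style rule shape that Loki's ruler uses (`alert`, `expr`, `for`, `labels`, `annotations`). The alert name, threshold, search string and label values are hypothetical, and the rule would still need to be serialized to YAML and loaded by the ruler.

```python
def error_rate_rule(namespace, needle="error", window="5m", threshold=10):
    """Build an alerting rule that fires when matching log lines exceed a rate."""
    # LogQL: per-second rate of lines containing `needle`, summed per namespace.
    expr = (
        f'sum by (namespace) (rate({{namespace="{namespace}"}} '
        f'|= "{needle}" [{window}])) > {threshold}'
    )
    return {
        "alert": "HighErrorLogRate",  # hypothetical alert name
        "expr": expr,
        "for": "10m",
        "labels": {"severity": "warning", "tenant": namespace},
        "annotations": {"summary": f"High rate of '{needle}' log lines in {namespace}"},
    }


if __name__ == "__main__":
    print(error_rate_rule("team-a")["expr"])
```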

The technology stack above is not comprehensive. It presents one workable way to integrate the monitoring and logging tools to achieve a good experience and redundancy.

I hope you liked the topic. If so, please follow me on Twitter and subscribe to our newsletter to keep in touch with the latest architecture trends.

Save your privacy, be ethical!