Amazon EKS cluster monitoring

This guide demonstrates how to monitor your Amazon Elastic Kubernetes Service (Amazon EKS) cluster with the Observability Accelerator's EKS monitoring module.

Overview

The EKS monitoring module uses a profile-driven architecture with three collector profiles:

| Profile | Backend | Collector | Best for |
|---|---|---|---|
| cloudwatch-otlp | Amazon CloudWatch | CW Agent EKS Add-on | CloudWatch-native observability with OTLP |
| managed-metrics | Amazon Managed Prometheus | AMP Managed Collector (agentless) | Agentless setup, no in-cluster collector to manage |
| self-managed-amp | Amazon Managed Prometheus | OpenTelemetry Collector (Helm) | Full control over the collection pipeline; traces and logs support |

All profiles deploy kube-state-metrics and node-exporter for infrastructure metrics, and provision Grafana dashboards for cluster visibility.

Prerequisites

Note

Make sure to complete the prerequisites section before proceeding.

Quick start — CloudWatch OTLP profile

This walkthrough uses the cloudwatch-otlp profile, which deploys the Amazon CloudWatch Observability EKS add-on for Container Insights, with an optional OTLP gateway for application metrics queryable via PromQL.

1. Clone and initialize

git clone https://github.com/aws-observability/terraform-aws-observability-accelerator.git
cd terraform-aws-observability-accelerator/examples/eks-cloudwatch-otlp
terraform init

2. Configure variables

export TF_VAR_eks_cluster_id=my-cluster
export TF_VAR_aws_region=us-east-1

3. Amazon Managed Grafana workspace (optional)

If you want Grafana dashboards, follow our helper guide to create a workspace, then:

export TF_VAR_grafana_endpoint="https://g-xxx.grafana-workspace.us-east-1.amazonaws.com"
export TF_VAR_grafana_api_key="glsa_xxx"
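The module's `providers = { grafana = grafana }` argument expects a configured Grafana provider in your root module. A minimal sketch, assuming the two values above are declared as Terraform variables (`var.grafana_endpoint`, `var.grafana_api_key`):

```hcl
# Grafana Terraform provider pointed at the Amazon Managed Grafana workspace.
# Assumes var.grafana_endpoint and var.grafana_api_key are declared in variables.tf.
provider "grafana" {
  url  = var.grafana_endpoint
  auth = var.grafana_api_key
}
```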

4. Deploy

terraform apply

5. Verification

# Check add-on and pods
kubectl get pods -n amazon-cloudwatch

# Check OTLP gateway (if enable_otlp_gateway = true)
kubectl get amazoncloudwatchagent cwa-otlp-gateway -n amazon-cloudwatch

Container Insights metrics appear in CloudWatch within 3-5 minutes. Application metrics sent to the OTLP gateway are queryable via PromQL in Grafana using the CloudWatch PromQL datasource.

Tip

This repository includes an AI agent guide (AGENT.md) that can walk you through the entire deployment conversationally — gathering prerequisites, running Terraform, and handing you working dashboard URLs.

For more details, see the CloudWatch OTLP guide.

Managed-metrics profile (agentless AMP scraper)

The managed-metrics profile uses the AMP Managed Collector, a fully managed, agentless scraper that runs outside your cluster. No OpenTelemetry Collector pods are deployed, and the profile is metrics-only (no traces or logs).

module "eks_monitoring" {
  source = "github.com/aws-observability/terraform-aws-observability-accelerator//modules/eks-monitoring?ref=v3.0.0"

  providers = { grafana = grafana }

  collector_profile          = "managed-metrics"
  eks_cluster_id             = var.eks_cluster_id
  scraper_subnet_ids         = var.scraper_subnet_ids         # >= 2 subnets in 2 AZs
  scraper_security_group_ids = var.scraper_security_group_ids
}

Note

The managed scraper requires at least 2 subnets in 2 distinct Availability Zones. See the managed-metrics example.
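The scraper inputs are ordinary Terraform variables; a minimal declaration sketch, with names matching the module call above:

```hcl
variable "scraper_subnet_ids" {
  description = "Subnet IDs for the AMP managed collector; at least 2 subnets across 2 distinct AZs"
  type        = list(string)
}

variable "scraper_security_group_ids" {
  description = "Security group IDs attached to the managed scraper's network interfaces"
  type        = list(string)
}
```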

Self-managed AMP profile

The self-managed-amp profile deploys an OpenTelemetry Collector via Helm to scrape Prometheus metrics and remote-write to Amazon Managed Prometheus. It supports metrics, traces (X-Ray), and logs (CloudWatch Logs).

git clone https://github.com/aws-observability/terraform-aws-observability-accelerator.git
cd terraform-aws-observability-accelerator/examples/eks-amp-otel
terraform init
export TF_VAR_eks_cluster_id=my-cluster
export TF_VAR_aws_region=us-west-2

By default the module creates a new AMP workspace. To use an existing one:

export TF_VAR_managed_prometheus_workspace_id=ws-xxx

Then set create_amp_workspace = false in your module call.

module "eks_monitoring" {
  source = "github.com/aws-observability/terraform-aws-observability-accelerator//modules/eks-monitoring?ref=v3.0.0"

  providers = { grafana = grafana }

  collector_profile = "self-managed-amp"
  eks_cluster_id    = var.eks_cluster_id
  enable_tracing    = true
  enable_logs       = true

  # To reuse an existing AMP workspace instead of creating one:
  # create_amp_workspace            = false
  # managed_prometheus_workspace_id = var.managed_prometheus_workspace_id
}

See the self-managed AMP example for a complete working configuration.

Dashboards

The module provisions Grafana dashboards via the grafana_dashboard Terraform resource. For the cloudwatch-otlp profile, dashboards include Container Insights Containers, Container Insights Nodes, GPU Fleet Utilization, and Unified Service Dashboard views. For AMP-backed profiles, dashboards cover cluster, namespace workloads, node-exporter, nodes, and workloads views.

You can control dashboard delivery with dashboard_delivery_method:

  • "terraform" (default) — the module provisions dashboards directly
  • "none" — skip provisioning; use the dashboard_sources and amp_datasource_config / cloudwatch_promql_datasource_config outputs to wire up your own GitOps pipeline (FluxCD, ArgoCD, etc.)
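For example, with delivery set to "none" you might re-export the module outputs for an external GitOps tool to consume (a sketch; output names follow the bullet above):

```hcl
# Surface the dashboard sources and datasource wiring for an external
# GitOps pipeline (FluxCD, ArgoCD, ...) instead of Terraform-managed dashboards.
output "dashboard_sources" {
  value = module.eks_monitoring.dashboard_sources
}

output "amp_datasource_config" {
  value = module.eks_monitoring.amp_datasource_config
}
```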

To override the default dashboard set, pass a custom dashboard_sources map:

module "eks_monitoring" {
  # ...
  dashboard_sources = {
    my-custom = "https://example.com/my-dashboard.json"
  }
}

Custom metrics and scrape jobs

To scrape additional workload metrics (Java/JMX, NGINX, Istio, your own apps), use the additional_scrape_jobs variable:

module "eks_monitoring" {
  # ...
  additional_scrape_jobs = [
    {
      job_name        = "my-app"
      scrape_interval = "30s"
      static_configs = [
        { targets = ["my-app.default.svc.cluster.local:8080"] }
      ]
    }
  ]
}

For the self-managed-amp and cloudwatch-otlp profiles, you can also pass arbitrary OTel Collector Helm values via helm_values for full pipeline customization.

For the managed-metrics profile, you can provide a complete custom Prometheus scrape configuration via scrape_configuration to override the defaults entirely.
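A sketch of a full override, assuming scrape_configuration accepts a Prometheus scrape configuration as a YAML string:

```hcl
module "eks_monitoring" {
  # ...
  collector_profile = "managed-metrics"

  # Replaces the module's default scrape configuration entirely.
  scrape_configuration = <<-YAML
    global:
      scrape_interval: 30s
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
  YAML
}
```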

AMP recording and alerting rules

When using an AMP-backed profile (managed-metrics or self-managed-amp), the module creates default infrastructure recording and alerting rules. You can extend them with custom rules:

module "eks_monitoring" {
  # ...
  enable_recording_rules = true
  enable_alerting_rules  = true

  custom_recording_rules = <<-YAML
    - record: my_custom:metric
      expr: sum(rate(http_requests_total[5m]))
  YAML

  custom_alerting_rules = <<-YAML
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
      for: 5m
      labels:
        severity: critical
  YAML
}

Note

To set up your alert receiver with Amazon SNS, follow this documentation.

Tracing and logs

The self-managed-amp profile supports traces and logs pipelines via the OpenTelemetry Collector:

  • Traces — enabled by default (enable_tracing = true), exported to AWS X-Ray via OTLP
  • Logs — enabled by default (enable_logs = true), exported to CloudWatch Logs via OTLP
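Either pipeline can be switched off independently; for example, to run this profile metrics-only:

```hcl
module "eks_monitoring" {
  # ...
  collector_profile = "self-managed-amp"
  enable_tracing    = false # drop the X-Ray traces pipeline
  enable_logs       = false # drop the CloudWatch Logs pipeline
}
```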

The cloudwatch-otlp profile includes traces and logs pipelines by default with no additional configuration needed.

The managed-metrics profile is metrics-only (no traces or logs).

For details on instrumenting your applications, see the tracing guide and logs guide.

Upgrading from v2.x

If you are migrating from v2.x, see the Upgrading to v3.0.0 guide for a complete list of removed variables, new requirements, and migration examples.