# Amazon EKS cluster monitoring
This guide demonstrates how to monitor your Amazon Elastic Kubernetes Service (Amazon EKS) cluster with the Observability Accelerator's EKS monitoring module.
## Overview
The EKS monitoring module uses a profile-driven architecture with three collector profiles:
| Profile | Backend | Collector | Best for |
|---|---|---|---|
| `cloudwatch-otlp` | Amazon CloudWatch | CW Agent EKS Add-on | CloudWatch-native observability with OTLP |
| `managed-metrics` | Amazon Managed Prometheus | AMP Managed Collector (agentless) | Agentless setup, no in-cluster collector to manage |
| `self-managed-amp` | Amazon Managed Prometheus | OpenTelemetry Collector (Helm) | Full control over collection pipeline, traces + logs support |
All profiles deploy kube-state-metrics and node-exporter for infrastructure metrics, and provision Grafana dashboards for cluster visibility.
## Prerequisites
> **Note:** Make sure to complete the prerequisites section before proceeding.
- An existing Amazon EKS cluster
- Terraform >= 1.5.0
- AWS CLI
- An Amazon Managed Grafana workspace (for dashboards)
## Quick start — CloudWatch OTLP profile
This walkthrough uses the `cloudwatch-otlp` profile, which deploys the Amazon CloudWatch Observability EKS add-on for Container Insights, with an optional OTLP gateway for application metrics queryable via PromQL.
### 1. Clone and initialize
```shell
git clone https://github.com/aws-observability/terraform-aws-observability-accelerator.git
cd terraform-aws-observability-accelerator/examples/eks-cloudwatch-otlp
terraform init
```
### 2. Configure variables
```shell
export TF_VAR_eks_cluster_id=my-cluster
export TF_VAR_aws_region=us-east-1
```
### 3. Amazon Managed Grafana workspace (optional)
If you want Grafana dashboards, follow our helper guide to create a workspace, then:
```shell
export TF_VAR_grafana_endpoint="https://g-xxx.grafana-workspace.us-east-1.amazonaws.com"
export TF_VAR_grafana_api_key="glsa_xxx"
```
### 4. Deploy
```shell
terraform apply
```
### 5. Verification
```shell
# Check the add-on pods
kubectl get pods -n amazon-cloudwatch

# Check the OTLP gateway (if enable_otlp_gateway = true)
kubectl get amazoncloudwatchagent cwa-otlp-gateway -n amazon-cloudwatch
```
Container Insights metrics appear in CloudWatch within 3-5 minutes. Application metrics sent to the OTLP gateway are queryable via PromQL in Grafana using the CloudWatch PromQL datasource.
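As a sketch of what that looks like in practice, a PromQL query along these lines could chart gateway-ingested application metrics in Grafana; the metric name is hypothetical and depends on what your application actually emits:

```promql
# Hypothetical application metric ingested via the OTLP gateway
sum(rate(http_server_requests_total[5m])) by (namespace)
```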
> **Tip:** This repository includes an AI agent guide (AGENT.md) that can walk you through the entire deployment conversationally — gathering prerequisites, running Terraform, and handing you working dashboard URLs.
For more details, see the CloudWatch OTLP guide.
## Managed-metrics profile (agentless AMP scraper)
The `managed-metrics` profile uses the AMP Managed Collector — a fully managed, agentless scraper that runs outside your cluster. No OpenTelemetry Collector pods are deployed, and the profile is metrics-only (no traces or logs).
```hcl
module "eks_monitoring" {
  source = "github.com/aws-observability/terraform-aws-observability-accelerator//modules/eks-monitoring?ref=v3.0.0"

  providers = { grafana = grafana }

  collector_profile = "managed-metrics"
  eks_cluster_id    = var.eks_cluster_id

  scraper_subnet_ids         = var.scraper_subnet_ids # >= 2 subnets in 2 AZs
  scraper_security_group_ids = var.scraper_security_group_ids
}
```
> **Note:** The managed scraper requires at least 2 subnets in 2 distinct Availability Zones. See the managed-metrics example.
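One way to satisfy this without hard-coding subnet IDs, sketched below with an assumed `var.vpc_id` and an assumed subnet tagging scheme, is to discover the subnets with a Terraform data source:

```hcl
# Sketch: discover private subnets for the managed scraper.
# Assumes subnets are tagged Tier = "private"; adjust to your VPC's tags.
data "aws_subnets" "scraper" {
  filter {
    name   = "vpc-id"
    values = [var.vpc_id]
  }
  tags = {
    Tier = "private"
  }
}

# Then pass them to the module:
#   scraper_subnet_ids = data.aws_subnets.scraper.ids
```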
## Self-managed AMP profile
The `self-managed-amp` profile deploys an OpenTelemetry Collector via Helm to scrape Prometheus metrics and remote-write them to Amazon Managed Prometheus. It supports metrics, traces (AWS X-Ray), and logs (CloudWatch Logs).
```shell
git clone https://github.com/aws-observability/terraform-aws-observability-accelerator.git
cd terraform-aws-observability-accelerator/examples/eks-amp-otel
terraform init

export TF_VAR_eks_cluster_id=my-cluster
export TF_VAR_aws_region=us-west-2
```
By default, the module creates a new AMP workspace. To use an existing one, set `create_amp_workspace = false` in your module call and export its workspace ID:

```shell
export TF_VAR_managed_prometheus_workspace_id=ws-xxx
```
```hcl
module "eks_monitoring" {
  source = "github.com/aws-observability/terraform-aws-observability-accelerator//modules/eks-monitoring?ref=v3.0.0"

  providers = { grafana = grafana }

  collector_profile = "self-managed-amp"
  eks_cluster_id    = var.eks_cluster_id

  enable_tracing = true
  enable_logs    = true
}
```
See the self-managed AMP example for a complete working configuration.
## Dashboards
The module provisions Grafana dashboards via the `grafana_dashboard` Terraform resource. For the `cloudwatch-otlp` profile, dashboards include Container Insights Containers, Container Insights Nodes, GPU Fleet Utilization, and Unified Service Dashboard views. For AMP-backed profiles, dashboards cover cluster, namespace workloads, node-exporter, nodes, and workloads views.
You can control dashboard delivery with `dashboard_delivery_method`:

- `"terraform"` (default) — the module provisions dashboards directly
- `"none"` — skip provisioning; use the `dashboard_sources` and `amp_datasource_config`/`cloudwatch_promql_datasource_config` outputs to wire up your own GitOps pipeline (FluxCD, ArgoCD, etc.)
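For the `"none"` path, a minimal sketch (output names taken from this guide; verify against the module's outputs) that re-exports the module outputs for an external pipeline to consume might look like:

```hcl
# Re-export dashboard JSON sources and datasource settings so a GitOps
# tool (FluxCD, ArgoCD, ...) can provision them outside Terraform.
output "dashboard_sources" {
  value = module.eks_monitoring.dashboard_sources
}

output "amp_datasource_config" {
  value = module.eks_monitoring.amp_datasource_config
}
```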
To override the default dashboard set, pass a custom `dashboard_sources` map:
```hcl
module "eks_monitoring" {
  # ...
  dashboard_sources = {
    "my-custom" = "https://example.com/my-dashboard.json"
  }
}
```
## Custom metrics and scrape jobs
To scrape additional workload metrics (Java/JMX, NGINX, Istio, your own apps), use the `additional_scrape_jobs` variable:
```hcl
module "eks_monitoring" {
  # ...
  additional_scrape_jobs = [
    {
      job_name        = "my-app"
      scrape_interval = "30s"
      static_configs = [
        { targets = ["my-app.default.svc.cluster.local:8080"] }
      ]
    }
  ]
}
```
For the `self-managed-amp` and `cloudwatch-otlp` profiles, you can also pass arbitrary OTel Collector Helm values via `helm_values` for full pipeline customization.
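As a sketch, assuming the chart follows common OTel Collector Helm conventions (the exact value keys depend on the chart and version — verify them against the chart's `values.yaml`), a `helm_values` override might look like:

```hcl
module "eks_monitoring" {
  # ...
  collector_profile = "self-managed-amp"

  # Hypothetical override: raise the collector's resource limits
  helm_values = {
    resources = {
      limits = {
        cpu    = "1"
        memory = "2Gi"
      }
    }
  }
}
```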
For the `managed-metrics` profile, you can provide a complete custom Prometheus scrape configuration via `scrape_configuration` to override the defaults entirely.
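A minimal sketch of such an override, using standard Prometheus scrape-config syntax (the job name and intervals here are illustrative):

```hcl
module "eks_monitoring" {
  # ...
  collector_profile = "managed-metrics"

  # Replaces the default scrape configuration entirely
  scrape_configuration = <<-YAML
    global:
      scrape_interval: 30s
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
  YAML
}
```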
## AMP recording and alerting rules
When using an AMP-backed profile (`managed-metrics` or `self-managed-amp`), the module creates default infrastructure recording and alerting rules. You can extend them with custom rules:
```hcl
module "eks_monitoring" {
  # ...
  enable_recording_rules = true
  enable_alerting_rules  = true

  custom_recording_rules = <<-YAML
    - record: my_custom:metric
      expr: sum(rate(http_requests_total[5m]))
  YAML

  custom_alerting_rules = <<-YAML
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
      for: 5m
      labels:
        severity: critical
  YAML
}
```
> **Note:** To set up your alert receiver with Amazon SNS, follow this documentation.
## Tracing and logs
The `self-managed-amp` profile supports traces and logs pipelines via the OpenTelemetry Collector:

- **Traces** — enabled by default (`enable_tracing = true`), exported to AWS X-Ray via OTLP
- **Logs** — enabled by default (`enable_logs = true`), exported to CloudWatch Logs via OTLP
The `cloudwatch-otlp` profile includes traces and logs pipelines by default, with no additional configuration needed.

The `managed-metrics` profile is metrics-only (no traces or logs).
For details on instrumenting your applications, see the tracing guide and logs guide.
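As a starting point, an OTLP-capable application can usually be pointed at an in-cluster collector with the standard OpenTelemetry SDK environment variables; the endpoint DNS name below is hypothetical and depends on how your collector service is exposed:

```shell
# Standard OTel SDK environment variables; endpoint and service name are examples
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector.observability.svc.cluster.local:4317"
export OTEL_SERVICE_NAME="my-app"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=dev"
```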
## Upgrading from v2.x
If you are migrating from v2.x, see the Upgrading to v3.0.0 guide for a complete list of removed variables, new requirements, and migration examples.