Amazon EKS cluster metrics
This example demonstrates how to monitor your Amazon Elastic Kubernetes Service (Amazon EKS) cluster with the Observability Accelerator's EKS monitoring module.
Monitoring Amazon Elastic Kubernetes Service (Amazon EKS) for metrics has two categories: the control plane and the Amazon EKS nodes (with Kubernetes objects). The Amazon EKS control plane consists of control plane nodes that run the Kubernetes software, such as etcd and the Kubernetes API server. To read more on the components of an Amazon EKS cluster, please read the service documentation.
The Amazon EKS infrastructure Terraform module focuses on metrics collection to Amazon Managed Service for Prometheus, using the AWS Distro for OpenTelemetry (ADOT) Operator for Amazon EKS. It deploys the node exporter and kube-state-metrics in your cluster.
It provides default dashboards to get comprehensive visibility into the health of your nodes, namespaces, pods, and kubelet operations. Finally, you get curated Prometheus recording rules and alerts to operate your cluster.
Optionally, you can also collect custom Prometheus metrics from applications running on your EKS cluster.
Prerequisites
Note
Make sure to complete the prerequisites section before proceeding.
Setup
1. Download sources and initialize Terraform
git clone https://github.com/aws-observability/terraform-aws-observability-accelerator.git
cd terraform-aws-observability-accelerator/examples/existing-cluster-with-base-and-infra
terraform init
2. AWS Region
Specify the AWS Region where the resources will be deployed:
export TF_VAR_aws_region=xxx
3. Amazon EKS Cluster
To run this example, you need to provide your EKS cluster name. If you don't have a cluster ready, visit this example first to create a new one.
Specify your cluster name:
export TF_VAR_eks_cluster_id=xxx
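If you are unsure of the exact cluster name, you can list the clusters visible to your credentials in the selected Region; a quick sketch with the AWS CLI:
aws eks list-clusters --query 'clusters' --output text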
4. Amazon Managed Service for Prometheus workspace (optional)
By default, we create an Amazon Managed Service for Prometheus workspace for you. However, if you have an existing workspace you want to reuse, edit the value below and run:
export TF_VAR_managed_prometheus_workspace_id=ws-xxx
To create a workspace outside of Terraform's state, simply run:
aws amp create-workspace --alias observability-accelerator --query 'workspaceId' --output text
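If you want to reuse an existing workspace but don't have its ID handy, you can also look it up by alias. The sketch below assumes the workspace was created with the observability-accelerator alias used above:
export TF_VAR_managed_prometheus_workspace_id=$(aws amp list-workspaces \
--alias observability-accelerator \
--query 'workspaces[0].workspaceId' \
--output text)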
5. Amazon Managed Grafana workspace
To visualize the metrics collected, you need an Amazon Managed Grafana workspace. If you have an existing workspace, create an environment variable as described below. To create a new workspace, visit our supporting example for Grafana.
Note
For the URL https://g-xyz.grafana-workspace.eu-central-1.amazonaws.com, the workspace ID would be g-xyz.
export TF_VAR_managed_grafana_workspace_id=g-xxx
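If you are not sure which workspace ID to use, you can list the Amazon Managed Grafana workspaces in your Region; a small sketch (the id and endpoint fields come from the list-workspaces output):
aws grafana list-workspaces --query 'workspaces[].[id,endpoint]' --output table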
6. Grafana authentication
Grafana Service Accounts and Service Account Tokens were introduced in Amazon Managed Grafana v9.4 and replace Grafana API Keys in v10.4. Amazon Managed Grafana provides new control plane APIs to automate their creation. If you are still using a workspace on Grafana v8.4, you can use a Grafana API Key.
As a security best practice, we will provide Terraform with a short-lived token to run the apply or destroy command.
Ensure you have the necessary IAM permissions: (CreateWorkspaceServiceAccount, CreateWorkspaceServiceAccountToken, DeleteWorkspaceServiceAccount, DeleteWorkspaceServiceAccountToken) for Service Accounts, and (CreateWorkspaceApiKey, DeleteWorkspaceApiKey) for Grafana API keys.
# skip this command if you already have a service account
GRAFANA_SA_ID=$(aws grafana create-workspace-service-account \
--workspace-id $TF_VAR_managed_grafana_workspace_id \
--grafana-role ADMIN \
--name terraform-accelerator-eks \
--query 'id' \
--output text)
# creates a new token for running Terraform
export TF_VAR_grafana_api_key=$(aws grafana create-workspace-service-account-token \
--workspace-id $TF_VAR_managed_grafana_workspace_id \
--name "observability-accelerator-$(date +%s)" \
--seconds-to-live 7200 \
--service-account-id $GRAFANA_SA_ID \
--query 'serviceAccountToken.key' \
--output text)
# alternative: if your workspace still uses Grafana API keys (v8.4)
export TF_VAR_grafana_api_key=$(aws grafana create-workspace-api-key \
--workspace-id $TF_VAR_managed_grafana_workspace_id \
--key-name "observability-accelerator-$(date +%s)" \
--key-role ADMIN \
--seconds-to-live 7200 \
--query key \
--output text)
Note
The grafana_api_key variable accepts either a Grafana API key or a service account token.
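Because the token is short lived, you may also want to revoke it and remove the temporary service account once Terraform has finished. The cleanup sketch below assumes the GRAFANA_SA_ID variable from the earlier command is still set:
# look up the token id (the create call only returns the secret key)
GRAFANA_SA_TOKEN_ID=$(aws grafana list-workspace-service-account-tokens \
--workspace-id $TF_VAR_managed_grafana_workspace_id \
--service-account-id $GRAFANA_SA_ID \
--query 'serviceAccountTokens[0].id' \
--output text)
# revoke the token, then delete the temporary service account
aws grafana delete-workspace-service-account-token \
--workspace-id $TF_VAR_managed_grafana_workspace_id \
--service-account-id $GRAFANA_SA_ID \
--token-id $GRAFANA_SA_TOKEN_ID
aws grafana delete-workspace-service-account \
--workspace-id $TF_VAR_managed_grafana_workspace_id \
--service-account-id $GRAFANA_SA_ID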
Deploy
Simply run this command to deploy the example:
terraform apply
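If you prefer to review the changes first, or to run the deployment non-interactively (for example in a pipeline), the standard Terraform workflow applies:
# review the planned changes before applying
terraform plan
# apply without the interactive confirmation prompt
terraform apply -auto-approve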
Visualization
1. Grafana dashboards
Log in to your Grafana workspace and navigate to the Dashboards panel. You should see a list of dashboards under the Observability Accelerator Dashboards folder.
Open a specific dashboard and you should be able to view its visualizations.
With v2.5 and above, the dashboards are managed by a Grafana Operator running in your cluster. To view all dashboards as Kubernetes objects from the cluster, run:
kubectl get grafanadashboards -A
NAMESPACE NAME AGE
grafana-operator cluster-grafanadashboard 138m
grafana-operator java-grafanadashboard 143m
grafana-operator kubelet-grafanadashboard 13h
grafana-operator namespace-workloads-grafanadashboard 13h
grafana-operator nginx-grafanadashboard 134m
grafana-operator node-exporter-grafanadashboard 13h
grafana-operator nodes-grafanadashboard 13h
grafana-operator workloads-grafanadashboard 13h
You can inspect more details per dashboard using this command:
kubectl describe grafanadashboards cluster-grafanadashboard -n grafana-operator
Grafana Operator and Flux always work together to synchronize your dashboards with Git. If you delete your dashboards by accident, they will be re-provisioned automatically.
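If you want to check that synchronization yourself, you can inspect the Flux custom resources in the cluster; a sketch assuming a standard Flux installation (GitRepository and Kustomization CRDs):
kubectl get gitrepositories,kustomizations -A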
2. Amazon Managed Service for Prometheus rules and alerts
Open the Amazon Managed Service for Prometheus console and view the details of your workspace. Under the Rules management tab, you should find new rules deployed.
Note
To set up your alert receiver with Amazon SNS, follow this documentation.
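As an illustration of what that setup can look like, the sketch below writes a minimal Alertmanager definition with an SNS receiver and uploads it with the AWS CLI. The topic ARN and Region are placeholders, the workspace ID is assumed to be in TF_VAR_managed_prometheus_workspace_id, and your workspace must be permitted to publish to the topic; follow the linked documentation for the authoritative steps.
# placeholder topic ARN and Region; replace with your own
cat > alertmanager.yaml <<'EOF'
alertmanager_config: |
  route:
    receiver: 'sns'
  receivers:
    - name: 'sns'
      sns_configs:
        - topic_arn: arn:aws:sns:eu-west-1:111122223333:my-alerts-topic
          sigv4:
            region: eu-west-1
EOF
aws amp create-alert-manager-definition \
--workspace-id $TF_VAR_managed_prometheus_workspace_id \
--data fileb://alertmanager.yaml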
Custom Prometheus metrics collection
In addition to the cluster metrics, if you are interested in collecting Prometheus metrics from your pods, you can set up custom metrics collection.
This instructs the ADOT collector to scrape your application metrics based on the configuration you provide. You can also exclude some of the metrics and save costs.
Using the example, you can edit examples/existing-cluster-with-base-and-infra/main.tf.
Inside the module "workloads_infra" block, add the following configuration (make sure the values match your use case):
enable_custom_metrics = true
custom_metrics_config = {
  custom_app_1 = {
    enableBasicAuth       = true
    path                  = "/metrics"
    basicAuthUsername     = "username"
    basicAuthPassword     = "password"
    ports                 = ".*:(8080)$"
    droppedSeriesPrefixes = "(unspecified.*)$"
  }
}
After applying Terraform, you can query Prometheus from Grafana for your application metrics, create alerts, and build your own dashboards. In the Explore section of Grafana, the following query returns the containers exposing metrics that matched the custom metrics collection, grouped by cluster and node.
sum(up{job="custom-metrics"}) by (container_name, cluster, nodename)