Troubleshooting guide for Amazon EKS monitoring module¶
Depending on your setup, you might face a few errors. If you encounter an error not listed here, please open an issue in the issues section.
This guide applies to the eks-monitoring Terraform module.
Cluster authentication issue¶
Error message¶
╷
│ Error: cluster-secretstore-sm failed to create kubernetes rest client for update of resource: Get "https://FINGERPRINT.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup F867DE6CE883F9595FC8A73D84FB9F83.gr7.us-east-1.eks.amazonaws.com on 192.168.4.1:53: no such host
│
│ with module.eks_monitoring.module.external_secrets[0].kubectl_manifest.cluster_secretstore,
│ on ../../modules/eks-monitoring/add-ons/external-secrets/main.tf line 59, in resource "kubectl_manifest" "cluster_secretstore":
│ 59: resource "kubectl_manifest" "cluster_secretstore" {
│
╵
╷
│ Error: grafana-operator/external-secrets-sm failed to create kubernetes rest client for update of resource: Get "https://FINGERPRINT.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup F867DE6CE883F9595FC8A73D84FB9F83.gr7.us-east-1.eks.amazonaws.com on 192.168.4.1:53: no such host
│
│ with module.eks_monitoring.module.external_secrets[0].kubectl_manifest.secret,
│ on ../../modules/eks-monitoring/add-ons/external-secrets/main.tf line 89, in resource "kubectl_manifest" "secret":
│ 89: resource "kubectl_manifest" "secret" {
Resolution¶
To provision the eks-monitoring module, the environment where you run terraform apply must be authenticated against your cluster, and that cluster must be your current context. To verify, you can run kubectl get nodes and check that you are pointed at the correct Amazon EKS cluster.
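For example, a quick sanity check with standard kubectl commands:
# show which context kubectl is currently using
kubectl config current-context
# confirm the nodes belong to the expected EKS cluster
kubectl get nodes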
To log in against the correct cluster, run:
aws eks update-kubeconfig --name <cluster name> --region <aws region>
Missing Grafana dashboards¶
A terraform apply can complete without apparent errors while your Grafana workspace still shows no dashboards. Several situations can lead to this, as described below. The best place to start is checking the logs of the grafana-operator, external-secrets, and flux-system pods.
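As a starting point, a minimal sketch of the log commands, assuming default namespaces and Deployment names (the external-secrets namespace and name in particular may differ in your setup):
# Grafana Operator logs
kubectl logs deployment/grafana-operator -n grafana-operator
# External Secrets Operator logs (namespace/name are assumptions)
kubectl logs deployment/external-secrets -n external-secrets
# Flux controllers that pull and apply the dashboard artifacts
kubectl logs deployment/source-controller -n flux-system
kubectl logs deployment/kustomize-controller -n flux-system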
Wrong Grafana workspace¶
It might happen that you provided the wrong Grafana workspace. One way to verify this is to run the following command:
kubectl describe grafanas external-grafana -n grafana-operator
You should see an output similar to this (truncated for brevity). Validate that the URL is correct. If it is not, re-running Terraform with the correct workspace ID and API key should fix the issue.
...
Spec:
  External:
    API Key:
      Key:   GF_SECURITY_ADMIN_APIKEY
      Name:  grafana-admin-credentials
    URL: https://g-workspaceid.grafana-workspace.eu-central-1.amazonaws.com
Status:
  Admin URL: https://g-workspaceid.grafana-workspace.eu-central-1.amazonaws.com
  Dashboards:
    grafana-operator/apiserver-troubleshooting-grafanadashboard/V3y_Zcb7k
    grafana-operator/apiserver-basic-grafanadashboard/R6abPf9Zz
    grafana-operator/java-grafanadashboard/m9mHfAy7ks
    grafana-operator/grafana-dashboards-adothealth/reshmanat
    grafana-operator/apiserver-advanced-grafanadashboard/09ec8aa1e996d6ffcd6817bbaff4db1b
    grafana-operator/nginx-grafanadashboard/nginx
    grafana-operator/kubelet-grafanadashboard/3138fa155d5915769fbded898ac09fd9
    grafana-operator/cluster-grafanadashboard/efa86fd1d0c121a26444b636a3f509a8
    grafana-operator/workloads-grafanadashboard/a164a7f0339f99e89cea5cb47e9be617
    grafana-operator/grafana-dashboards-kubeproxy/632e265de029684c40b21cb76bca4f94
    grafana-operator/nodes-grafanadashboard/200ac8fdbfbb74b39aff88118e4d1c2c
    grafana-operator/node-exporter-grafanadashboard/v8yDYJqnz
    grafana-operator/namespace-workloads-grafanadashboard/a87fb0d919ec0ea5f6543124e16c42a5
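To extract just the workspace URL instead of scanning the full output, you can use a jsonpath query; the field path mirrors the Spec section shown above:
kubectl get grafanas external-grafana -n grafana-operator -o jsonpath='{.spec.external.url}'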
Grafana API key expired¶
Check the logs of your Grafana operator pod using the commands below:
kubectl get pods -n grafana-operator
Output:
NAME READY STATUS RESTARTS AGE
grafana-operator-866d4446bb-nqq5c 1/1 Running 0 3h17m
kubectl logs grafana-operator-866d4446bb-nqq5c -n grafana-operator
Output:
1.6857285045556655e+09 ERROR error reconciling datasource {"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample-amp","namespace":"grafana-operator"}, "namespace": "grafana-operator", "name": "grafanadatasource-sample-amp", "reconcileID": "72cfd60c-a255-44a1-bfbd-88b0cbc4f90c", "datasource": "grafanadatasource-sample-amp", "grafana": "external-grafana", "error": "status: 401, body: {\"message\":\"Expired API key\"}\n"}
github.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile
If you observe the above Grafana API key error in the logs, your Grafana API key has expired. Use the following operational procedure to update your grafana-api-key:
- Create a new Grafana API key, making sure the key duration is not too short (see the CLI sketch after this list).
- Run Terraform with the new API key. Terraform will modify the AWS SSM Parameter used by the externalsecret.
- If the issue persists, you can force the synchronization by deleting the externalsecret Kubernetes object:
kubectl delete externalsecret/external-secrets-sm -n grafana-operator
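As an illustration, for an Amazon Managed Grafana workspace you can create the new API key with the AWS CLI; the workspace ID, key name, and lifetime below are placeholders to adapt:
aws grafana create-workspace-api-key \
  --workspace-id g-workspaceid \
  --key-name grafana-operator-key \
  --key-role ADMIN \
  --seconds-to-live 2592000   # 30 days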
Git repository errors¶
Flux is responsible for regularly pulling and synchronizing dashboards and artifacts into your EKS cluster. It might happen that its state gets corrupted.
You can check for such errors with the following command; the STATUS column will show an error if Flux cannot pull correctly:
kubectl get gitrepositories -n flux-system
NAME URL AGE READY STATUS
aws-observability-accelerator https://github.com/aws-observability/aws-observability-accelerator 6d12h True stored artifact for revision 'v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d'
Depending on the error, you can delete the repository and re-run Terraform to force the synchronization.
kubectl delete gitrepositories aws-observability-accelerator -n flux-system
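If you have the Flux CLI installed, you can also trigger a fresh pull without deleting the object, which is a gentler first step (assumes the default flux-system namespace):
flux reconcile source git aws-observability-accelerator -n flux-system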
If you believe this is a bug, please open an issue here.
Flux Kustomizations¶
After Flux pulls the repository into the cluster state, it applies Kustomizations to create Grafana data sources, folders, and dashboards.
- Check the Kustomization objects. Here you should see the dashboards you have enabled:
kubectl get kustomizations.kustomize.toolkit.fluxcd.io -A
NAMESPACE NAME AGE READY STATUS
flux-system grafana-dashboards-adothealth 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system grafana-dashboards-apiserver 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system grafana-dashboards-infrastructure 10d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system grafana-dashboards-java 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system grafana-dashboards-kubeproxy 10d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system grafana-dashboards-nginx 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
- To get more information on an error, you can view the Kustomize controller logs:
kubectl get pods -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-65cc46469f-nsqd5 1/1 Running 2 (13d ago) 27d
image-automation-controller-d8f7bfcb4-k2m9j 1/1 Running 2 (13d ago) 27d
image-reflector-controller-68979dfd49-wh25h 1/1 Running 2 (13d ago) 27d
kustomize-controller-767677f7f5-c5xsp 1/1 Running 5 (13d ago) 63d
notification-controller-55d8c759f5-7df5l 1/1 Running 5 (13d ago) 63d
source-controller-58c66d55cd-4j6bl 1/1 Running 5 (13d ago) 63d
kubectl logs -f -n flux-system kustomize-controller-767677f7f5-c5xsp
If you believe there is a bug, please open an issue here.
- Depending on the error, delete the Kustomization object and re-run Terraform:
kubectl delete kustomizations -n flux-system grafana-dashboards-apiserver
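Alternatively, with the Flux CLI you can force an immediate reconciliation of a single Kustomization, refreshing its source first (the object name is taken from the list above):
flux reconcile kustomization grafana-dashboards-apiserver -n flux-system --with-source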
Grafana dashboards errors¶
If all of the above seems normal, inspect the deployed dashboards by running this command:
kubectl get grafanadashboards -A
NAMESPACE NAME AGE
grafana-operator apiserver-advanced-grafanadashboard 18d
grafana-operator apiserver-basic-grafanadashboard 18d
grafana-operator apiserver-troubleshooting-grafanadashboard 18d
grafana-operator cluster-grafanadashboard 10d
grafana-operator grafana-dashboards-adothealth 18d
grafana-operator grafana-dashboards-kubeproxy 10d
grafana-operator java-grafanadashboard 18d
grafana-operator kubelet-grafanadashboard 10d
grafana-operator namespace-workloads-grafanadashboard 10d
grafana-operator nginx-grafanadashboard 18d
grafana-operator node-exporter-grafanadashboard 10d
grafana-operator nodes-grafanadashboard 10d
grafana-operator workloads-grafanadashboard 10d
- You can dive into the details of a dashboard by running:
kubectl describe grafanadashboards grafana-dashboards-kubeproxy -n grafana-operator
- Depending on the error, you can delete the dashboard object. In this case, you don't need to re-run Terraform, as the Flux Kustomization will force its recreation through the Grafana operator:
kubectl delete grafanadashboards grafana-dashboards-kubeproxy -n grafana-operator
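Recent events in the namespace can also surface reconciliation errors that the dashboard objects themselves don't show:
kubectl get events -n grafana-operator --sort-by='.lastTimestamp'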
If you believe there is a bug, please open an issue here.
Upgrade to v2.5 or later¶
v2.5.0 removes the dependency on the Terraform Grafana provider in the EKS monitoring module. As the Grafana Operator manages and syncs the Grafana contents, Terraform is no longer required in this context.
However, if you migrate from an earlier version, you might leave some data orphaned when the Grafana provider is dropped, and Terraform will throw an error. We have released v2.5.0-rc.1, which removes all the Grafana resources provisioned by Terraform in the EKS context without removing the provider configurations.
- Step 1: migrate to v2.5.0-rc.1 and run terraform apply
- Step 2: migrate to v2.5.0 or above
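A minimal sketch of the two-step upgrade, assuming you pin the module to a release tag in your Terraform configuration (adapt to however you reference the module):
# Step 1: update the module ref to v2.5.0-rc.1, then
terraform init -upgrade
terraform apply
# Step 2: update the module ref to v2.5.0 or above, then
terraform init -upgrade
terraform apply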