Monitoring NVIDIA GPU Workloads
GPUs play an integral part in data-intensive workloads. The eks-monitoring module of the Observability Accelerator provides the ability to deploy the NVIDIA DCGM Exporter Dashboard.
The dashboard uses metrics scraped from the /metrics endpoint, which is exposed when running the NVIDIA GPU Operator with the DCGM exporter and the NVSMI binary.
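To inspect the raw metrics the dashboard is built on, you can query the exporter's /metrics endpoint directly. The commands below are a minimal sketch and assume the GPU Operator has created an nvidia-dcgm-exporter service in the gpu-operator namespace on the exporter's default port 9400; adjust the names to match your cluster:

# Forward the DCGM exporter service to the local machine (service name and namespace assumed above)
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &

# Scrape the Prometheus-format metrics; series such as DCGM_FI_DEV_GPU_UTIL,
# DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_GPU_TEMP should appear in the output
curl -s http://localhost:9400/metrics | grep ^DCGM_FI_DEV | head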
Note
In order to make use of this dashboard, you will need a GPU-backed EKS cluster and the GPU Operator deployed. The recommended way of deploying the GPU Operator is the Data on EKS Blueprint.
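If you are not using the Data on EKS Blueprint, the GPU Operator can also be installed directly from NVIDIA's public Helm repository. This is only a minimal sketch with an arbitrary namespace and release name; the Data on EKS Blueprint remains the recommended path:

# Add NVIDIA's Helm repository and install the GPU Operator into its own namespace
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace --wait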
Deployment
The NVIDIA DCGM Exporter dashboard is enabled by default in the eks-monitoring module.
Dashboards
In order to start producing diagnostic metrics, you must first deploy the NVIDIA SMI binary. nvidia-smi (also NVSMI) provides monitoring and management capabilities for NVIDIA devices from the Fermi and later architecture families. The following pod runs nvidia-smi, which shows diagnostic information about all GPUs visible to the container:
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
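The pod runs nvidia-smi once and exits, so its diagnostic report can be read from the pod logs. A short check, assuming the pod name used above:

# Print the nvidia-smi report once the pod has completed
kubectl logs nvidia-smi

# Remove the pod when done
kubectl delete pod nvidia-smi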