EKS Cluster-Wide GPU Cost Attribution
This post walks through an end-to-end proof of concept (PoC) for GPU slice cost allocation on Amazon EKS.
Problem statement
When multiple tenants share GPU capacity (e.g., MIG slices), you need to answer:
- Who requested what share of GPU (by pod / namespace / BU)?
- Who actually used the GPU (and how much)?
- Given a “public” price like $12 per GPU-hour, how do we compute:
  - Allocated cost (based on requested share)
  - Effective cost (based on observed utilization)
  - Waste (allocated minus effective)
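To make the three quantities concrete, here is a minimal sketch in shell. It assumes the requested share is slices requested divided by slices per GPU, and that effective cost is allocated cost multiplied by average utilization (a fraction between 0 and 1). Those formulas are one reasonable reading of the definitions above, not something fixed at this point in the post, and `gpu_cost` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of any tool):
#   gpu_cost <slices_requested> <slices_per_gpu> <rate_per_gpu_hour> <hours> <utilization 0..1>
# Assumption: effective cost = allocated cost * average utilization.
gpu_cost() {
  awk -v s="$1" -v n="$2" -v r="$3" -v h="$4" -v u="$5" 'BEGIN {
    allocated = s * r * h / n        # requested share of the GPU, priced per hour
    effective = allocated * u        # portion of the allocation that was actually used
    waste     = allocated - effective
    printf "allocated=%.4f effective=%.4f waste=%.4f\n", allocated, effective, waste
  }'
}

# One slice out of 7, at $12 per GPU-hour, held for 7 hours at 25% utilization.
gpu_cost 1 7 12 7 0.25
# prints: allocated=12.0000 effective=3.0000 waste=9.0000
```

Holding one of seven slices for seven hours costs the same as one full GPU-hour, which is why the allocated figure comes out to exactly $12 here.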
Architecture (high level)

Prerequisites
AWS + EKS prerequisites
- AWS account with permission to create:
  - EKS clusters and nodegroups
  - IAM roles for service accounts (IRSA)
  - Amazon Managed Service for Prometheus (AMP) workspace
- Service quota and Availability Zone capacity for GPU instances in your Region
Variables used
export AWS_REGION="us-west-2"
export CLUSTER_NAME="gpu-cost-poc"
export AMP_ALIAS="gpu-cost-poc"
# Public/benchmark price per GPU-hour (USD) used for the demo (not yet taken from the AWS Cost and Usage Report, CUR)
export GPU_HOURLY_RATE="12"
# MIG profile for the PoC (e.g., an A100 40GB commonly supports 1g.5gb with 7 slices per GPU)
export MIG_PROFILE_LABEL="all-1g.5gb"
# IMPORTANT: in this PoC, MIG slices were exposed as nvidia.com/gpu (1 “gpu” == 1 MIG slice)
export MIG_RESOURCE_KEY="nvidia.com/gpu"
# For 1g.5gb on A100: typically 7 slices per physical GPU
export SLICES_PER_GPU="7"
# kube-state-metrics may “sanitize” extended resource names (e.g., nvidia.com/gpu can appear as nvidia_com_gpu), so match with a regex
export KSM_RESOURCE_REGEX='nvidia.*(gpu|mig).*'
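Two of these variables combine into the number the rest of the PoC depends on: the hourly price of a single MIG slice. The sketch below derives it and adds a small preflight check that the variables are set. Both `require_vars` and `slice_hourly_rate` are hypothetical helper names introduced here for illustration, and the fallback values in the last line simply mirror the exports above.

```shell
# Return non-zero if any named variable is unset or empty.
# Uses eval for indirect lookup so it also works in plain POSIX sh;
# only pass trusted, fixed variable names.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Price of one MIG slice for one hour: GPU rate divided by slices per GPU.
slice_hourly_rate() {
  awk -v r="$1" -v n="$2" 'BEGIN { printf "%.4f\n", r / n }'
}

# Suggested preflight before running the steps below:
#   require_vars AWS_REGION CLUSTER_NAME AMP_ALIAS GPU_HOURLY_RATE SLICES_PER_GPU || exit 1

slice_hourly_rate "${GPU_HOURLY_RATE:-12}" "${SLICES_PER_GPU:-7}"
# prints: 1.7143
```

With $12 per GPU-hour and 7 slices per GPU, each `nvidia.com/gpu` unit requested by a pod is billed at roughly $1.71 per hour.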
Step-by-step instructions
Step 1 — Create EKS cluster
List versions your eksctl supports:
eksctl utils describe cluster-versions
Create the cluster (omit --version to let eksctl pick a supported default):
eksctl create cluster \
  --name "$CLUSTER_NAME" \
  --region "$AWS_REGION" \
  --managed
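Before moving on, it can help to confirm that the control plane is up. A minimal sketch, assuming the AWS CLI is installed and configured for the same account: `cluster_status` and `require_active` are hypothetical helper names, while `aws eks describe-cluster` with `--query` and `--output text` is standard AWS CLI usage.

```shell
# Print the EKS control plane status (e.g., CREATING, ACTIVE).
cluster_status() {
  aws eks describe-cluster --name "$1" --region "$2" \
    --query 'cluster.status' --output text
}

# Succeed only if the cluster reports ACTIVE.
require_active() {
  status="$(cluster_status "$1" "$2")" || return 1
  if [ "$status" != "ACTIVE" ]; then
    echo "cluster $1 is $status, expected ACTIVE" >&2
    return 1
  fi
}

# Suggested usage after the create command returns:
#   require_active "$CLUSTER_NAME" "$AWS_REGION" && kubectl get nodes
```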
Step 2 — Add a “system” nodegroup (recommended)
This gives CoreDNS and operators general-purpose nodes to run on, so they do not have to occupy expensive GPU nodes.
eksctl create nodegroup \
  --cluster "$CLUSTER_NAME" \
  --region "$AWS_REGION" \
  --name "system-ng" \
  --node-type "m5.large" \
  --nodes 2 --nodes-min 2 --nodes-max 3
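To check that the system nodes joined the cluster, you can count the Ready nodes in the nodegroup. This sketch assumes the nodegroup is EKS-managed, in which case its nodes carry the `eks.amazonaws.com/nodegroup` label, and that `kubectl` already points at the new cluster. `ready_nodes_in_group` is a hypothetical helper name.

```shell
# Count nodes in a managed nodegroup whose STATUS column is exactly "Ready".
# Nodes that are NotReady or cordoned (Ready,SchedulingDisabled) are not counted.
ready_nodes_in_group() {
  kubectl get nodes -l "eks.amazonaws.com/nodegroup=$1" --no-headers 2>/dev/null \
    | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Suggested usage: expect at least the 2 nodes requested above.
#   [ "$(ready_nodes_in_group system-ng)" -ge 2 ] || echo "system-ng is not fully Ready yet" >&2
```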