GPU Metrics and Cost Allocation

GPU Metric K8s Integration


This guide describes how to set up a cluster and configure Prometheus to capture GPU utilization and GPU memory metrics.


To gather GPU telemetry from Kubernetes pods, we need to deploy two additional services:

  • nvidia-device-plugin

  • dcgm-exporter

This document describes how to set up these services in a new K8s installation. It is based on another guide, with several fixes applied to get a working result. For the original guide, please refer to

Prerequisites for this guide:

  • Helm installed

  • kubectl installed and configured
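
Before starting, it can help to confirm that both tools are actually on the PATH. The helper below is a minimal sketch and not part of the original guide; the function name missing_tools is hypothetical. It only checks that the binaries exist, not that kubectl is configured for the target cluster (running kubectl cluster-info covers that part).

```shell
# Print the name of every given tool that is not found on PATH.
# Hypothetical helper for illustration; prints nothing when all tools exist.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: an empty result means helm and kubectl are both installed.
#   missing_tools helm kubectl
```
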

Installation configuration

We tested the setup on EKS clusters running Kubernetes 1.18 and 1.21.

In the example, we use a p2.xlarge worker node to keep costs low.

List of Metrics

GPU Metrics

  • GPU utilization (in %).

  • Ratio of time the graphics engine is active (in %).

  • Ratio of the number of warps resident on an SM (in %). Similar to DCGM_FI_DEV_GPU_UTIL, but shows how effectively the resource is utilized.

Memory Metrics

  • Memory utilization (in %).

  • Framebuffer memory used (in MiB).

  • Total number of bytes transmitted through PCIe TX (RX) (in KB) via NVML.

Integration guide

  • Set up the nvidia-device-plugin service by running the following command

helm repo add nvdp && helm repo update && \
helm install --generate-name nvdp/nvidia-device-plugin
  • Check that the installation finished correctly and that an nvidia-device-plugin pod is running

% kubectl get pods -A
NAMESPACE     NAME                                    READY   STATUS    RESTARTS   AGE
kube-system   aws-node-4j682                          1/1     Running   0          75s
kube-system   coredns-f47955f89-gs6zk                 1/1     Running   0          8m5s
kube-system   coredns-f47955f89-xm6rd                 1/1     Running   0          8m5s
kube-system   kube-proxy-csdwp                        1/1     Running   0          2m19s
kube-system   nvidia-device-plugin-1633035998-2j2qp   1/1     Running   0          38s
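
The status check above can also be scripted instead of read by eye. The following is a minimal sketch, assuming the six-column layout that kubectl get pods -A prints in the listing above; the function name pods_not_running is hypothetical. It looks only at the STATUS column, so a pod that is Running but not yet READY is not reported.

```shell
# Read `kubectl get pods -A` output on stdin and print "namespace/name status"
# for every pod whose STATUS column is neither Running nor Completed.
# Hypothetical helper for illustration; the header row (line 1) is skipped.
pods_not_running() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage: an empty result means every listed pod is Running or Completed.
#   kubectl get pods -A | pods_not_running
```
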
  • Install the monitoring solution, consisting of kube-state-metrics and Prometheus. We use a predefined Helm chart to deploy the whole set of services. Apply the changes to kube-prometheus-stack.values as described in the source guide

helm repo add prometheus-community && \
helm repo update && \
helm install prometheus-community/kube-prometheus-stack \
--namespace kube-system --generate-name --values ./kube-prometheus-stack.values
  • Verify the installation

% kubectl get pods -A
NAMESPACE     NAME                                                              READY   STATUS    RESTARTS   AGE
kube-system   alertmanager-kube-prometheus-stack-1633-alertmanager-0            2/2     Running   0          49s
kube-system   aws-node-4j682                                                    1/1     Running   0          6m
kube-system   coredns-f47955f89-gs6zk                                           1/1     Running   0          12m
kube-system   coredns-f47955f89-xm6rd                                           1/1     Running   0          12m
kube-system   kube-prometheus-stack-1633-operator-8576fc8f45-64vpb              1/1     Running   0          52s
kube-system   kube-prometheus-stack-1633036072-grafana-778bcb548b-256nw         2/2     Running   0          52s
kube-system   kube-prometheus-stack-1633036072-kube-state-metrics-68c6b6dxj5s   1/1     Running   0          52s
kube-system   kube-prometheus-stack-1633036072-prometheus-node-exporter-w2k67   1/1     Running   0          53s
kube-system   kube-proxy-csdwp                                                  1/1     Running   0          7m4s
kube-system   nvidia-device-plugin-1633035998-2j2qp                             1/1     Running   0          5m23s
kube-system   prometheus-kube-prometheus-stack-1633-prometheus-0                2/2     Running   0          48s
  • Install the DCGM-Exporter service. Note that some metrics are disabled in the default installation. If you need a custom set of metrics, you need to rebuild the service's Docker image with your configuration. (See: ) I have used a pre-built Docker image from the community; the reference is in Appendix 1.

helm install --namespace kube-system --generate-name \
--values ./dcgm_vals.yaml gpu-helm-charts/dcgm-exporter
  • Verify the installation

% kubectl get pods -A
NAMESPACE     NAME                                                              READY   STATUS    RESTARTS   AGE
kube-system   alertmanager-kube-prometheus-stack-1633-alertmanager-0            2/2     Running   0          2m47s
kube-system   aws-node-4j682                                                    1/1     Running   0          7m58s
kube-system   coredns-f47955f89-gs6zk                                           1/1     Running   0          14m
kube-system   coredns-f47955f89-xm6rd                                           1/1     Running   0          14m
kube-system   dcgm-exporter-1633036367-nct2v                                    1/1     Running   0          67s
kube-system   kube-prometheus-stack-1633-operator-8576fc8f45-64vpb              1/1     Running   0          2m50s
kube-system   kube-prometheus-stack-1633036072-grafana-778bcb548b-256nw         2/2     Running   0          2m50s
kube-system   kube-prometheus-stack-1633036072-kube-state-metrics-68c6b6dxj5s   1/1     Running   0          2m50s
kube-system   kube-prometheus-stack-1633036072-prometheus-node-exporter-w2k67   1/1     Running   0          2m51s
kube-system   kube-proxy-csdwp                                                  1/1     Running   0          9m2s
kube-system   nvidia-device-plugin-1633035998-2j2qp                             1/1     Running   0          7m21s
kube-system   prometheus-kube-prometheus-stack-1633-prometheus-0                2/2     Running   0          2m46s
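
Besides the pod status, the exporter itself can be checked: it serves metrics in the Prometheus text format on port 9400 (the service port set in Appendix 1) under /metrics. The sketch below filters that text for a single metric and prints one value per series. The function name dcgm_metric_values is hypothetical, and the exact label set on each line depends on the exporter version, so the filter only relies on the metric name at the start of the line and the value in the last field.

```shell
# Read Prometheus text-format metrics on stdin and print the value of every
# series of the metric named in $1. Comment lines (# HELP, # TYPE) are skipped
# because they do not start with the metric name.
# Hypothetical helper for illustration.
dcgm_metric_values() {
  awk -v m="$1" 'index($0, m "{") == 1 || index($0, m " ") == 1 { print $NF }'
}

# Usage (assumes the pod name from the listing above and a free local port 9400):
#   kubectl -n kube-system port-forward pod/dcgm-exporter-1633036367-nct2v 9400:9400 &
#   curl -s http://localhost:9400/metrics | dcgm_metric_values DCGM_FI_DEV_GPU_UTIL
```
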
  • Deploy the video-analytics-demo service as a test GPU workload

helm fetch && \
helm install video-analytics-demo-0.1.4.tgz --generate-name
  • Wait 3-5 minutes, then delete the demo service

helm delete $(helm list | grep video-analytics-demo | awk '{print $1}')
  • After that, open Prometheus and check the DCGM_FI_DEV_GPU_UTIL metric
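
The same check can be done from the command line through the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The sketch below is a minimal example under stated assumptions: the function names prom_query_url and prom_query are hypothetical, and Prometheus is assumed to be reachable on localhost:9090, for example through a port-forward to the Prometheus pod from the listing above.

```shell
# Build the Prometheus HTTP API URL for an instant query.
# $1 = Prometheus base URL, $2 = metric name. Plain metric names need no URL
# encoding; more complex PromQL expressions would.
prom_query_url() {
  printf '%s/api/v1/query?query=%s\n' "${1%/}" "$2"
}

# Run the instant query with curl; the response is a JSON document.
prom_query() {
  curl -s "$(prom_query_url "$1" "$2")"
}

# Usage (assumes the pod name from the listing above and a free local port 9090):
#   kubectl -n kube-system port-forward pod/prometheus-kube-prometheus-stack-1633-prometheus-0 9090:9090 &
#   prom_query http://localhost:9090 DCGM_FI_DEV_GPU_UTIL
```

An empty result list in the response usually means Prometheus is not scraping the exporter yet, which points back to the kube-prometheus-stack.values changes.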


Appendix 1: dcgm_vals.yaml

# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# See the License for the specific language governing permissions and
# limitations under the License.

image:
  repository: shan100docker/dcgm-exporter
  pullPolicy: IfNotPresent
  tag: 2.1.8-2.4.0-rc.2-ubuntu18.04-v2

# Comment the following line to stop profiling metrics from DCGM
arguments: ["-f", "/etc/dcgm-exporter/dcp-metrics-included.csv"]
# NOTE: in general, add any command line arguments to arguments above
# and they will be passed through.
# Use "-r", "<HOST>:<PORT>" to connect to an already running hostengine
# Example arguments: ["-r", "host123:5555"]
# Use "-n" to remove the hostname tag from the output.
# Example arguments: ["-n"]
# Use "-d" to specify the devices to monitor. -d must be followed by a string
# in the following format: [f] or [g[:numeric_range][+]][i[:numeric_range]]
# Where a numeric range is something like 0-4 or 0,2,4, etc.
# Example arguments: ["-d", "g+i"] to monitor all GPUs and GPU instances or
# ["-d", "g:0-3"] to monitor GPUs 0-3.
# Use "-m" to specify the namespace and name of a configmap containing
# the watched exporter fields.
# Example arguments: ["-m", "default:exporter-metrics-config-map"]

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name:

podAnnotations: {}
podSecurityContext: {}
  # fsGroup: 2000

securityContext:
  runAsNonRoot: false
  runAsUser: 0
  capabilities:
    add: ["SYS_ADMIN"]
  # readOnlyRootFilesystem: true

service:
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  # Annotations to add to the service
  annotations: {}

resources: {}
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

serviceMonitor:
  enabled: true
  interval: 15s
  additionalLabels: {}
    #monitoring: prometheus

mapPodsMetrics: false

nodeSelector: {}
  #node: gpu

tolerations: []
#- operator: Exists

affinity: {}
  #nodeAffinity:
  #  requiredDuringSchedulingIgnoredDuringExecution:
  #    nodeSelectorTerms:
  #    - matchExpressions:
  #      - key: nvidia-gpu
  #        operator: Exists

extraHostVolumes: []
#- name: host-binaries
#  hostPath: /opt/bin

extraConfigMapVolumes: []
#- name: exporter-metrics-volume
#  configMap:
#    name: exporter-metrics-config-map

extraVolumeMounts: []
#- name: host-binaries
#  mountPath: /opt/bin
#  readOnly: true

extraEnv: []
#- name: EXTRA_VAR
#  value: "TheStringValue"

