5.4 KiB

nvidia collector

{ "nvidia": { "exclude_devices": [ "0", "1", "00000000:FF:01.0" ], "exclude_metrics": [ "nv_fb_mem_used", "nv_fan" ], "only_metrics": [ "nv_nvlink_ecc_errors_sum", "nv_nvlink_ecc_errors_sum_diff" ], "send_diff_values": true, "process_mig_devices": false, "use_pci_info_as_type_id": true, "add_pci_info_tag": false, "add_uuid_meta": false, "add_board_number_meta": false, "add_serial_meta": false, "use_uuid_for_mig_device": false, "use_slice_for_mig_device": false, "use_memory_info_v2": false } }

The nvidia collector gathers metrics from NVIDIA GPUs using the NVIDIA Management Library (NVML). It can be configured to exclude specific devices with the exclude_devices option, which accepts device indices (e.g., "0", "1") as used by nvmlDeviceGetHandleByIndex() or PCI addresses in NVML format (e.g., "%08X:%02X:%02X.0").

Both filtering mechanisms are supported:

  • exclude_metrics: Excludes the specified metrics.
  • only_metrics: If provided, only the listed metrics are collected. This takes precedence over exclude_metrics.

Differential Metrics

  • send_diff_values: When set to true, differential metrics (e.g., nv_ecc_corrected_error_diff, nv_nvlink_ecc_errors_sum_diff) are calculated and sent alongside absolute values. These represent the change since the last measurement.

MIG Devices

By default, only physical GPUs are monitored. To include MIG (Multi-Instance GPU) devices, set process_mig_devices to true. This adds tags stype=mig and stype-id=<mig_index> to MIG-specific metrics. The stype-id can be customized:

  • use_uuid_for_mig_device: Uses the MIG UUID (e.g., MIG-6a9f7cc8-6d5b-5ce0-92de-750edc4d8849).
  • use_slice_for_mig_device: Uses the MIG slice name (e.g., 1g.5gb).

Tags and Metadata

Metrics use type=accelerator as a tag. The type-id defaults to the device handle index. Additional options include:

  • use_pci_info_as_type_id: Uses the PCI ID (e.g., 00000000:FF:01.0) as type-id instead of the index.
  • add_pci_info_tag: Adds the PCI ID as a separate pci_identifier tag, while keeping the index as type-id.
  • add_uuid_meta, add_board_number_meta, add_serial_meta: Add UUID, board part number, or serial number to the metadata (not sent to sinks unless configured otherwise).

Memory Info

  • use_memory_info_v2: When true, uses nvmlDeviceGetMemoryInfo_v2 for more detailed memory metrics (i.e. mem_reserved not part of mem_used). Defaults to false, falling back to nvmlDeviceGetMemoryInfo.

Metrics

The following metrics are available. All nv_nvlink_* metrics are always delivered as _sum (aggregated across all NVLinks). If multiple devices are present, they are also provided as per-device metrics with stype=nvlink and stype-id=<link_number>.

Absolute Metrics

  • nv_util (unit: %)
  • nv_mem_util (unit: %)
  • nv_fb_mem_total (unit: MByte)
  • nv_fb_mem_used (unit: MByte)
  • nv_fb_mem_reserved (unit: MByte)
  • nv_bar1_mem_total (unit: MByte)
  • nv_bar1_mem_used (unit: MByte)
  • nv_temp (unit: degC)
  • nv_fan (unit: %)
  • nv_ecc_mode
  • nv_perf_state
  • nv_power_usage (unit: watts)
  • nv_graphics_clock (unit: MHz)
  • nv_sm_clock (unit: MHz)
  • nv_mem_clock (unit: MHz)
  • nv_video_clock (unit: MHz)
  • nv_max_graphics_clock (unit: MHz)
  • nv_max_sm_clock (unit: MHz)
  • nv_max_mem_clock (unit: MHz)
  • nv_max_video_clock (unit: MHz)
  • nv_ecc_uncorrected_error
  • nv_ecc_corrected_error
  • nv_power_max_limit (unit: watts)
  • nv_encoder_util (unit: %)
  • nv_decoder_util (unit: %)
  • nv_remapped_rows_corrected
  • nv_remapped_rows_uncorrected
  • nv_remapped_rows_pending
  • nv_remapped_rows_failure
  • nv_compute_processes
  • nv_graphics_processes
  • nv_violation_power (unit: sec)
  • nv_violation_thermal (unit: sec)
  • nv_violation_sync_boost (unit: sec)
  • nv_violation_board_limit (unit: sec)
  • nv_violation_low_util (unit: sec)
  • nv_violation_reliability (unit: sec)
  • nv_violation_below_app_clock (unit: sec)
  • nv_violation_below_base_clock (unit: sec)
  • nv_nvlink_crc_flit_errors
  • nv_nvlink_crc_errors
  • nv_nvlink_ecc_errors
  • nv_nvlink_replay_errors
  • nv_nvlink_recovery_errors
  • nv_nvlink_crc_flit_errors_sum
  • nv_nvlink_crc_errors_sum
  • nv_nvlink_ecc_errors_sum
  • nv_nvlink_replay_errors_sum
  • nv_nvlink_recovery_errors_sum

Differential Metrics (requires send_diff_values: true)

  • nv_ecc_uncorrected_error_diff
  • nv_ecc_corrected_error_diff
  • nv_remapped_rows_corrected_diff
  • nv_remapped_rows_uncorrected_diff
  • nv_remapped_rows_pending_diff
  • nv_remapped_rows_failure_diff
  • nv_violation_power_diff (unit: sec)
  • nv_violation_thermal_diff (unit: sec)
  • nv_violation_sync_boost_diff (unit: sec)
  • nv_violation_board_limit_diff (unit: sec)
  • nv_violation_low_util_diff (unit: sec)
  • nv_violation_reliability_diff (unit: sec)
  • nv_violation_below_app_clock_diff (unit: sec)
  • nv_violation_below_base_clock_diff (unit: sec)
  • nv_nvlink_crc_flit_errors_diff
  • nv_nvlink_crc_errors_diff
  • nv_nvlink_ecc_errors_diff
  • nv_nvlink_replay_errors_diff
  • nv_nvlink_recovery_errors_diff
  • nv_nvlink_crc_flit_errors_sum_diff
  • nv_nvlink_crc_errors_sum_diff
  • nv_nvlink_ecc_errors_sum_diff
  • nv_nvlink_replay_errors_sum_diff
  • nv_nvlink_recovery_errors_sum_diff