nvidia collector
{
  "nvidia": {
    "exclude_devices": [
      "0",
      "1",
      "00000000:FF:01.0"
    ],
    "exclude_metrics": [
      "nv_fb_mem_used",
      "nv_fan"
    ],
    "only_metrics": [
      "nv_nvlink_ecc_errors_sum",
      "nv_nvlink_ecc_errors_sum_diff"
    ],
    "send_diff_values": true,
    "process_mig_devices": false,
    "use_pci_info_as_type_id": true,
    "add_pci_info_tag": false,
    "add_uuid_meta": false,
    "add_board_number_meta": false,
    "add_serial_meta": false,
    "use_uuid_for_mig_device": false,
    "use_slice_for_mig_device": false,
    "use_memory_info_v2": false
  }
}
The nvidia collector gathers metrics from NVIDIA GPUs using the NVIDIA Management Library (NVML). Specific devices can be skipped with the exclude_devices option, which accepts either device indices (e.g., "0", "1") as used by nvmlDeviceGetHandleByIndex() or PCI addresses in NVML format ("%08X:%02X:%02X.0", e.g., "00000000:FF:01.0").
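The matching against exclude_devices can be pictured with a minimal sketch. It assumes the PCI address is built from domain, bus, and device numbers with the format string above; the function names and the way those fields are obtained are hypothetical and not taken from the collector itself.

```python
def pci_address(domain: int, bus: int, device: int) -> str:
    """Format PCI fields as an NVML-style address (hypothetical helper)."""
    return "%08X:%02X:%02X.0" % (domain, bus, device)


def is_excluded(index: int, pci: str, exclude_devices: list) -> bool:
    """Assumption: a device is skipped if its index or PCI address is listed."""
    return str(index) in exclude_devices or pci in exclude_devices


exclude = ["0", "1", "00000000:FF:01.0"]
addr = pci_address(0x0, 0xFF, 0x01)

print(addr)                                              # 00000000:FF:01.0
print(is_excluded(2, addr, exclude))                     # True (PCI address match)
print(is_excluded(2, pci_address(0, 0x3B, 0), exclude))  # False
```

Note that indices are compared as strings, matching the quoted values in the configuration example.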
Metrics can be filtered in two ways:
exclude_metrics: Excludes the specified metrics.
only_metrics: If provided, only the listed metrics are collected. This takes precedence over exclude_metrics.
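The precedence between the two lists can be sketched as follows. This is one reading of the description above, not the collector's actual code: a non-empty only_metrics list acts as an allow-list and exclude_metrics is then ignored. The function name metric_enabled is hypothetical.

```python
def metric_enabled(name, exclude_metrics, only_metrics):
    """Return True if the metric should be collected (illustrative only)."""
    if only_metrics:
        # Assumption: a non-empty allow-list overrides the exclude list.
        return name in only_metrics
    return name not in exclude_metrics


exclude = ["nv_fb_mem_used", "nv_fan"]
only = ["nv_nvlink_ecc_errors_sum", "nv_nvlink_ecc_errors_sum_diff"]

print(metric_enabled("nv_fan", exclude, []))                      # False
print(metric_enabled("nv_temp", exclude, []))                     # True
print(metric_enabled("nv_temp", exclude, only))                   # False
print(metric_enabled("nv_nvlink_ecc_errors_sum", exclude, only))  # True
```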
Differential Metrics
send_diff_values: When set to true, differential metrics (e.g., nv_ecc_corrected_error_diff, nv_nvlink_ecc_errors_sum_diff) are calculated and sent alongside the absolute values. Each represents the change since the last measurement.
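A differential value is simply the current reading minus the previous one. The sketch below illustrates the idea with a hypothetical DiffTracker class; it assumes no _diff value is produced for the very first reading, which the text above does not specify.

```python
class DiffTracker:
    """Remembers the last value per metric and derives <name>_diff values."""

    def __init__(self):
        self._last = {}

    def update(self, name, value):
        """Return the (metric_name, value) pairs to send for this reading."""
        out = [(name, value)]
        if name in self._last:
            # Change since the previous measurement of the same metric.
            out.append((name + "_diff", value - self._last[name]))
        self._last[name] = value
        return out


tracker = DiffTracker()
print(tracker.update("nv_ecc_corrected_error", 5))
# [('nv_ecc_corrected_error', 5)]
print(tracker.update("nv_ecc_corrected_error", 8))
# [('nv_ecc_corrected_error', 8), ('nv_ecc_corrected_error_diff', 3)]
```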
MIG Devices
By default, only physical GPUs are monitored. To include MIG (Multi-Instance GPU) devices, set process_mig_devices to true. This adds tags stype=mig and stype-id=<mig_index> to MIG-specific metrics. The stype-id can be customized:
use_uuid_for_mig_device: Uses the MIG UUID (e.g., MIG-6a9f7cc8-6d5b-5ce0-92de-750edc4d8849).
use_slice_for_mig_device: Uses the MIG slice name (e.g., 1g.5gb).
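How the stype-id for a MIG device might be chosen from these options is sketched below. The function mig_stype_id is hypothetical, and checking the UUID option before the slice option is an assumption; the text above does not say which wins if both are enabled.

```python
def mig_stype_id(config, mig_index, mig_uuid, slice_name):
    """Pick the stype-id value for a MIG device (illustrative only)."""
    if config.get("use_uuid_for_mig_device", False):
        return mig_uuid
    if config.get("use_slice_for_mig_device", False):
        return slice_name
    # Default described in the text: the MIG index.
    return str(mig_index)


uuid = "MIG-6a9f7cc8-6d5b-5ce0-92de-750edc4d8849"

print(mig_stype_id({}, 0, uuid, "1g.5gb"))                                  # 0
print(mig_stype_id({"use_slice_for_mig_device": True}, 0, uuid, "1g.5gb"))  # 1g.5gb
print(mig_stype_id({"use_uuid_for_mig_device": True}, 0, uuid, "1g.5gb"))
# MIG-6a9f7cc8-6d5b-5ce0-92de-750edc4d8849
```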
Tags and Metadata
Metrics use type=accelerator as a tag. The type-id defaults to the device handle index. Additional options include:
use_pci_info_as_type_id: Uses the PCI ID (e.g., 00000000:FF:01.0) as type-id instead of the index.
add_pci_info_tag: Adds the PCI ID as a separate pci_identifier tag, while keeping the index as type-id.
add_uuid_meta, add_board_number_meta, add_serial_meta: Add the UUID, board part number, or serial number to the metadata (not sent to sinks unless configured otherwise).
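The effect of the two PCI-related options on the tag set can be illustrated with a small sketch. The function build_tags is hypothetical, and the behavior when both options are enabled at once is an assumption, since the description above only covers each option on its own.

```python
def build_tags(config, index, pci_id):
    """Assemble the tags for one GPU from the options above (illustrative)."""
    tags = {"type": "accelerator", "type-id": str(index)}
    if config.get("use_pci_info_as_type_id", False):
        tags["type-id"] = pci_id
    if config.get("add_pci_info_tag", False):
        # Assumption: the extra tag is added regardless of the type-id choice.
        tags["pci_identifier"] = pci_id
    return tags


pci = "00000000:FF:01.0"

print(build_tags({}, 0, pci))
# {'type': 'accelerator', 'type-id': '0'}
print(build_tags({"use_pci_info_as_type_id": True}, 0, pci))
# {'type': 'accelerator', 'type-id': '00000000:FF:01.0'}
print(build_tags({"add_pci_info_tag": True}, 0, pci))
# {'type': 'accelerator', 'type-id': '0', 'pci_identifier': '00000000:FF:01.0'}
```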
Memory Info
use_memory_info_v2: When true, uses nvmlDeviceGetMemoryInfo_v2 for more detailed memory metrics (i.e., mem_reserved is not part of mem_used). Defaults to false, in which case nvmlDeviceGetMemoryInfo is used.
Metrics
The following metrics are available. All nv_nvlink_* metrics are always delivered as _sum (aggregated across all NVLinks). If multiple devices are present, they are also provided as per-link metrics with stype=nvlink and stype-id=<link_number>.
Absolute Metrics
nv_util (unit: %)
nv_mem_util (unit: %)
nv_fb_mem_total (unit: MByte)
nv_fb_mem_used (unit: MByte)
nv_fb_mem_reserved (unit: MByte)
nv_bar1_mem_total (unit: MByte)
nv_bar1_mem_used (unit: MByte)
nv_temp (unit: degC)
nv_fan (unit: %)
nv_ecc_mode
nv_perf_state
nv_power_usage (unit: watts)
nv_graphics_clock (unit: MHz)
nv_sm_clock (unit: MHz)
nv_mem_clock (unit: MHz)
nv_video_clock (unit: MHz)
nv_max_graphics_clock (unit: MHz)
nv_max_sm_clock (unit: MHz)
nv_max_mem_clock (unit: MHz)
nv_max_video_clock (unit: MHz)
nv_ecc_uncorrected_error
nv_ecc_corrected_error
nv_power_max_limit (unit: watts)
nv_encoder_util (unit: %)
nv_decoder_util (unit: %)
nv_remapped_rows_corrected
nv_remapped_rows_uncorrected
nv_remapped_rows_pending
nv_remapped_rows_failure
nv_compute_processes
nv_graphics_processes
nv_violation_power (unit: sec)
nv_violation_thermal (unit: sec)
nv_violation_sync_boost (unit: sec)
nv_violation_board_limit (unit: sec)
nv_violation_low_util (unit: sec)
nv_violation_reliability (unit: sec)
nv_violation_below_app_clock (unit: sec)
nv_violation_below_base_clock (unit: sec)
nv_nvlink_crc_flit_errors
nv_nvlink_crc_errors
nv_nvlink_ecc_errors
nv_nvlink_replay_errors
nv_nvlink_recovery_errors
nv_nvlink_crc_flit_errors_sum
nv_nvlink_crc_errors_sum
nv_nvlink_ecc_errors_sum
nv_nvlink_replay_errors_sum
nv_nvlink_recovery_errors_sum
Differential Metrics (requires send_diff_values: true)
nv_ecc_uncorrected_error_diff
nv_ecc_corrected_error_diff
nv_remapped_rows_corrected_diff
nv_remapped_rows_uncorrected_diff
nv_remapped_rows_pending_diff
nv_remapped_rows_failure_diff
nv_violation_power_diff (unit: sec)
nv_violation_thermal_diff (unit: sec)
nv_violation_sync_boost_diff (unit: sec)
nv_violation_board_limit_diff (unit: sec)
nv_violation_low_util_diff (unit: sec)
nv_violation_reliability_diff (unit: sec)
nv_violation_below_app_clock_diff (unit: sec)
nv_violation_below_base_clock_diff (unit: sec)
nv_nvlink_crc_flit_errors_diff
nv_nvlink_crc_errors_diff
nv_nvlink_ecc_errors_diff
nv_nvlink_replay_errors_diff
nv_nvlink_recovery_errors_diff
nv_nvlink_crc_flit_errors_sum_diff
nv_nvlink_crc_errors_sum_diff
nv_nvlink_ecc_errors_sum_diff
nv_nvlink_replay_errors_sum_diff
nv_nvlink_recovery_errors_sum_diff