mirror of https://github.com/ClusterCockpit/cc-metric-collector.git, synced 2025-11-04 02:35:07 +01:00

	Update NvidiaCollector with new metrics, MIG and NvLink support (#75)
		@@ -3,38 +3,72 @@

```json
  "nvidia": {
    "exclude_devices" : [
      "0","1"
    "exclude_devices": [
      "0","1", "00000000:FF:01.0"
    ],
    "exclude_metrics": [
      "nv_fb_memory",
      "nv_fb_mem_used",
      "nv_fan"
    ]
    ],
    "process_mig_devices": false,
    "use_pci_info_as_type_id": true,
    "add_pci_info_tag": false,
    "add_uuid_meta": false,
    "add_board_number_meta": false,
    "add_serial_meta": false
  }
```

The `nvidia` collector can be configured to leave out specific devices with the `exclude_devices` option. It accepts device indices as passed to the NVML function `nvmlDeviceGetHandleByIndex()` or PCI addresses in NVML format (`%08X:%02X:%02X.0`). Metrics (listed below) that should not be sent to the MetricRouter can be excluded with the `exclude_metrics` option. Usually only the physical GPUs are monitored. To monitor MIG devices as well, set `process_mig_devices`; metrics from MIG devices then carry the additional tags `stype=mig,stype-id=<mig_index>`.

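To illustrate the PCI address format, here is a minimal Python sketch (the collector itself is not written in Python). The helper names `nvml_pci_address` and `is_excluded` are hypothetical, and the exact-string matching rule is an assumption, not necessarily how the collector compares entries. Note that `%08X:%02X:%02X.0` yields eight uppercase hexadecimal digits for the PCI domain.

```python
def nvml_pci_address(domain: int, bus: int, device: int) -> str:
    """Format a PCI location with the NVML-style pattern %08X:%02X:%02X.0."""
    return "%08X:%02X:%02X.0" % (domain, bus, device)


def is_excluded(index: int, pci_address: str, exclude_devices: list[str]) -> bool:
    """Assumed rule: a device is skipped if its index (as a string)
    or its formatted PCI address appears verbatim in exclude_devices."""
    return str(index) in exclude_devices or pci_address in exclude_devices


# Hypothetical configuration values, chosen only for illustration.
exclude_devices = ["0", "1", "00000000:FF:01.0"]

addr = nvml_pci_address(0x0, 0xFF, 0x01)
print(addr)
print(is_excluded(3, addr, exclude_devices))
print(is_excluded(2, nvml_pci_address(0, 0x3B, 0), exclude_devices))
```

Under this assumed rule, an entry only matches if it is written exactly as the format string would produce it, including zero padding and letter case.
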
The metrics sent by the `nvidia` collector use `accelerator` as the `type` tag. By default, the `type-id` is the device handle index. With the `use_pci_info_as_type_id` option, the PCI ID is used instead. To send both values as tags, activate the `add_pci_info_tag` option: the device handle index is then used as `type-id` and the PCI ID is added as a separate `pci_identifier` tag.

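The effect of the two options on the tags can be sketched as follows. This is an illustrative Python model, not the collector's code; the function name `device_tags` is hypothetical, and the precedence when both options are enabled (here `use_pci_info_as_type_id` wins) is an assumption.

```python
def device_tags(index, pci_address,
                use_pci_info_as_type_id=False,
                add_pci_info_tag=False):
    """Build the tag set for one device from the two config options."""
    tags = {"type": "accelerator"}
    if use_pci_info_as_type_id:
        # The PCI ID replaces the device handle index as type-id.
        tags["type-id"] = pci_address
    else:
        # Default: the device handle index is the type-id.
        tags["type-id"] = str(index)
        if add_pci_info_tag:
            # The PCI ID is sent in addition, as a separate tag.
            tags["pci_identifier"] = pci_address
    return tags


pci = "00000000:3B:00.0"  # hypothetical PCI address
print(device_tags(0, pci))
print(device_tags(0, pci, use_pci_info_as_type_id=True))
print(device_tags(0, pci, add_pci_info_tag=True))
```
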
Optionally, the UUID, the board part number and the serial number can be added to the meta information (options `add_uuid_meta`, `add_board_number_meta` and `add_serial_meta`). Meta information is not sent to the sinks unless configured otherwise.

Metrics:
* `nv_util`
* `nv_mem_util`
* `nv_mem_total`
* `nv_fb_memory`
* `nv_fb_mem_total`
* `nv_fb_mem_used`
* `nv_bar1_mem_total`
* `nv_bar1_mem_used`
* `nv_temp`
* `nv_fan`
* `nv_ecc_mode`
* `nv_perf_state`
* `nv_power_usage_report`
* `nv_graphics_clock_report`
* `nv_sm_clock_report`
* `nv_mem_clock_report`
* `nv_power_usage`
* `nv_graphics_clock`
* `nv_sm_clock`
* `nv_mem_clock`
* `nv_video_clock`
* `nv_max_graphics_clock`
* `nv_max_sm_clock`
* `nv_max_mem_clock`
* `nv_ecc_db_error`
* `nv_ecc_sb_error`
* `nv_power_man_limit`
* `nv_max_video_clock`
* `nv_ecc_uncorrected_error`
* `nv_ecc_corrected_error`
* `nv_power_max_limit`
* `nv_encoder_util`
* `nv_decoder_util`
* `nv_remapped_rows_corrected`
* `nv_remapped_rows_uncorrected`
* `nv_remapped_rows_pending`
* `nv_remapped_rows_failure`
* `nv_compute_processes`
* `nv_graphics_processes`
* `nv_violation_power`
* `nv_violation_thermal`
* `nv_violation_sync_boost`
* `nv_violation_board_limit`
* `nv_violation_low_util`
* `nv_violation_reliability`
* `nv_violation_below_app_clock`
* `nv_violation_below_base_clock`
* `nv_nvlink_crc_flit_errors`
* `nv_nvlink_crc_errors`
* `nv_nvlink_ecc_errors`
* `nv_nvlink_replay_errors`
* `nv_nvlink_recovery_errors`

It uses a separate `type` in the metrics. The output metric looks like this:
`<name>,type=accelerator,type-id=<nvidia-gpu-id> value=<metric value> <timestamp>`

Some metrics add an additional sub-type tag (`stype`); for example, the `nv_nvlink_*` metrics set `stype=nvlink,stype-id=<link_number>`.
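
As a rough illustration of the resulting metric lines, the following Python sketch renders the `<name>,type=accelerator,type-id=<nvidia-gpu-id> value=<metric value> <timestamp>` layout with an optional sub-type. The function `format_metric`, the sample values and the tag order are assumptions made for this example; the actual serialization is done by the configured sinks and may order tags differently.

```python
def format_metric(name, type_id, value, timestamp, stype=None, stype_id=None):
    """Render one metric in the line layout described above."""
    tags = ["type=accelerator", f"type-id={type_id}"]
    if stype is not None:
        # Sub-type tags, e.g. stype=nvlink for the nv_nvlink_* metrics
        # or stype=mig for MIG devices.
        tags += [f"stype={stype}", f"stype-id={stype_id}"]
    return f"{name},{','.join(tags)} value={value} {timestamp}"


# Hypothetical sample values.
print(format_metric("nv_temp", 0, 45, 1700000000))
print(format_metric("nv_nvlink_crc_errors", 0, 0, 1700000000,
                    stype="nvlink", stype_id=2))
```
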