cc-metric-collector/collectors/rocmsmiMetric.md
Thomas Gruber e13695307f
AMD ROCm SMI collector (#77)
* Add collector for AMD ROCm SMI metrics

* Fix import path

* Fix imports

* Remove Board Number

* store GPU index explicitly

* Remove board number from description
2022-05-25 15:55:43 +02:00

1.6 KiB

rocm_smi collector

  "rocm_smi": {
    "exclude_devices": [
      "0","1", "0000000:ff:01.0"
    ],
    "exclude_metrics": [
      "rocm_mm_util",
      "rocm_temp_vrsoc"
    ],
    "use_pci_info_as_type_id": true,
    "add_pci_info_tag": false,
    "add_serial_meta": false,
  }

The rocm_smi collector can be configured to leave out specific devices with the exclude_devices option. It takes logical IDs in the list of available devices or the PCI address similar to NVML format (%08X:%02X:%02X.0). Metrics (listed below) that should not be sent to the MetricRouter can be excluded with the exclude_metrics option.

The metrics sent by the rocm_smi collector use accelerator as type tag. For the type-id, it uses the device handle index by default. With the use_pci_info_as_type_id option, the PCI ID is used instead. If both values should be added as tags, activate the add_pci_info_tag option. It uses the device handle index as type-id and adds the PCI ID as separate pci_identifier tag.

Optionally, it is possible to add the serial to the meta informations. They are not sent to the sinks (if not configured otherwise).

Metrics:

  • rocm_gfx_util
  • rocm_umc_util
  • rocm_mm_util
  • rocm_avg_power
  • rocm_temp_mem
  • rocm_temp_hotspot
  • rocm_temp_edge
  • rocm_temp_vrgfx
  • rocm_temp_vrsoc
  • rocm_temp_vrmem
  • rocm_gfx_clock
  • rocm_soc_clock
  • rocm_u_clock
  • rocm_v0_clock
  • rocm_v1_clock
  • rocm_d0_clock
  • rocm_d1_clock
  • rocm_temp_hbm

Some metrics add the additional sub type tag (stype) like the rocm_temp_hbm metrics set stype=device,stype-id=<HBM_slice_number>.