cc-metric-collector/collectors/nvidiaMetric.md
Thomas Gruber 8d85bd53f1
Merge latest development changes to main branch (#79)

Co-authored-by: Thomas Roehl <thomas.roehl@fau.de>

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Lou <lou.knauer@gmx.de>
2022-06-08 15:25:40 +02:00


nvidia collector

  "nvidia": {
    "exclude_devices": [
      "0", "1", "00000000:FF:01.0"
    ],
    "exclude_metrics": [
      "nv_fb_mem_used",
      "nv_fan"
    ],
    "process_mig_devices": false,
    "use_pci_info_as_type_id": true,
    "add_pci_info_tag": false,
    "add_uuid_meta": false,
    "add_board_number_meta": false,
    "add_serial_meta": false,
    "use_uuid_for_mig_device": false,
    "use_slice_for_mig_device": false
  }
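The configuration keys above map naturally onto a Go struct with JSON tags. The following is a minimal sketch, not the collector's actual source: the struct name NvidiaConfig, the helper parseNvidiaConfig and the shortened sampleConfig are hypothetical, and only the JSON keys are taken from the example above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NvidiaConfig is a hypothetical struct for illustration.
// Only the JSON keys come from the example configuration.
type NvidiaConfig struct {
	ExcludeDevices       []string `json:"exclude_devices,omitempty"`
	ExcludeMetrics       []string `json:"exclude_metrics,omitempty"`
	ProcessMigDevices    bool     `json:"process_mig_devices,omitempty"`
	UsePciInfoAsTypeId   bool     `json:"use_pci_info_as_type_id,omitempty"`
	AddPciInfoTag        bool     `json:"add_pci_info_tag,omitempty"`
	AddUuidMeta          bool     `json:"add_uuid_meta,omitempty"`
	AddBoardNumberMeta   bool     `json:"add_board_number_meta,omitempty"`
	AddSerialMeta        bool     `json:"add_serial_meta,omitempty"`
	UseUuidForMigDevice  bool     `json:"use_uuid_for_mig_device,omitempty"`
	UseSliceForMigDevice bool     `json:"use_slice_for_mig_device,omitempty"`
}

// parseNvidiaConfig extracts the "nvidia" entry from a JSON object that
// holds one entry per collector (assumed layout).
func parseNvidiaConfig(raw []byte) (NvidiaConfig, error) {
	var cfg NvidiaConfig
	var sections map[string]json.RawMessage
	if err := json.Unmarshal(raw, &sections); err != nil {
		return cfg, err
	}
	section, ok := sections["nvidia"]
	if !ok {
		return cfg, fmt.Errorf("no \"nvidia\" section in configuration")
	}
	err := json.Unmarshal(section, &cfg)
	return cfg, err
}

// sampleConfig is a shortened variant of the example configuration.
const sampleConfig = `{
  "nvidia": {
    "exclude_devices": ["0", "1"],
    "exclude_metrics": ["nv_fb_mem_used", "nv_fan"],
    "process_mig_devices": false,
    "use_pci_info_as_type_id": true
  }
}`

func main() {
	cfg, err := parseNvidiaConfig([]byte(sampleConfig))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Options missing from the JSON keep their Go zero value (false or an empty list) in this sketch.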

The nvidia collector can be configured to skip specific devices with the exclude_devices option. Each entry is either a device index as passed to the NVML function nvmlDeviceGetHandleByIndex() or a PCI address in NVML format (%08X:%02X:%02X.0). Metrics (listed below) that should not be sent to the MetricRouter can be excluded with the exclude_metrics option. Commonly, only the physical GPUs are monitored. To monitor MIG devices as well, set process_mig_devices; their metrics additionally carry the tags stype=mig,stype-id=<mig_index>. With the options use_uuid_for_mig_device and use_slice_for_mig_device, the <mig_index> can be replaced with the UUID (e.g. MIG-6a9f7cc8-6d5b-5ce0-92de-750edc4d8849) or the MIG slice name (e.g. 1g.5gb).
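To make the matching rules concrete, here is a minimal sketch of how a device could be checked against exclude_devices and how the stype-id of a MIG device could be chosen. The function names formatPCIID, isExcluded and migSubtypeID are hypothetical. The precedence of use_uuid_for_mig_device over use_slice_for_mig_device when both are set is an assumption, as is the exact string comparison.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatPCIID renders a PCI address in the NVML-style format named in the
// text: %08X:%02X:%02X.0 (domain, bus, device).
func formatPCIID(domain, bus, device uint32) string {
	return fmt.Sprintf("%08X:%02X:%02X.0", domain, bus, device)
}

// isExcluded reports whether a device appears in exclude_devices, either by
// its device index or by its PCI address. Exact string matching is assumed.
func isExcluded(exclude []string, index int, pciID string) bool {
	idx := strconv.Itoa(index)
	for _, entry := range exclude {
		if entry == idx || entry == pciID {
			return true
		}
	}
	return false
}

// migSubtypeID picks the stype-id for a MIG device. Assumption: the UUID
// option wins if both options are enabled.
func migSubtypeID(migIndex int, uuid, slice string, useUUID, useSlice bool) string {
	switch {
	case useUUID:
		return uuid
	case useSlice:
		return slice
	default:
		return strconv.Itoa(migIndex)
	}
}

func main() {
	exclude := []string{"0", "1", formatPCIID(0, 0xff, 1)}
	fmt.Println(isExcluded(exclude, 2, formatPCIID(0, 0x3b, 0)))
	fmt.Println(migSubtypeID(3, "MIG-6a9f7cc8-6d5b-5ce0-92de-750edc4d8849", "1g.5gb", false, true))
}
```

Because %08X and %02X print uppercase, zero-padded hexadecimal, a PCI entry in exclude_devices would have to be written the same way to match under this sketch.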

The metrics sent by the nvidia collector use accelerator as the type tag. For the type-id, the collector uses the device handle index by default. With the use_pci_info_as_type_id option, the PCI ID is used instead. To have both values as tags, activate the add_pci_info_tag option: the collector then uses the device handle index as type-id and adds the PCI ID as a separate pci_identifier tag.
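The tag rules in this paragraph can be sketched as a small function. The name deviceTags is hypothetical, and the behavior when both options are enabled (use_pci_info_as_type_id wins and no pci_identifier tag is added) is an assumption, since the text does not cover that case.

```go
package main

import (
	"fmt"
	"strconv"
)

// deviceTags builds the tag set for one GPU according to the two options.
// Assumption: add_pci_info_tag only takes effect when the PCI ID is not
// already used as type-id.
func deviceTags(index int, pciID string, usePCIAsTypeID, addPCITag bool) map[string]string {
	tags := map[string]string{
		"type":    "accelerator",
		"type-id": strconv.Itoa(index), // default: device handle index
	}
	if usePCIAsTypeID {
		tags["type-id"] = pciID
	} else if addPCITag {
		tags["pci_identifier"] = pciID
	}
	return tags
}

func main() {
	pci := "00000000:3B:00.0"
	fmt.Println(deviceTags(0, pci, false, false))
	fmt.Println(deviceTags(0, pci, true, false))
	fmt.Println(deviceTags(0, pci, false, true))
}
```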

Optionally, the UUID, the board part number and the serial number can be added to the meta information. Meta information is not sent to the sinks unless configured otherwise.

Metrics:

  • nv_util
  • nv_mem_util
  • nv_fb_mem_total
  • nv_fb_mem_used
  • nv_bar1_mem_total
  • nv_bar1_mem_used
  • nv_temp
  • nv_fan
  • nv_ecc_mode
  • nv_perf_state
  • nv_power_usage
  • nv_graphics_clock
  • nv_sm_clock
  • nv_mem_clock
  • nv_video_clock
  • nv_max_graphics_clock
  • nv_max_sm_clock
  • nv_max_mem_clock
  • nv_max_video_clock
  • nv_ecc_uncorrected_error
  • nv_ecc_corrected_error
  • nv_power_max_limit
  • nv_encoder_util
  • nv_decoder_util
  • nv_remapped_rows_corrected
  • nv_remapped_rows_uncorrected
  • nv_remapped_rows_pending
  • nv_remapped_rows_failure
  • nv_compute_processes
  • nv_graphics_processes
  • nv_violation_power
  • nv_violation_thermal
  • nv_violation_sync_boost
  • nv_violation_board_limit
  • nv_violation_low_util
  • nv_violation_reliability
  • nv_violation_below_app_clock
  • nv_violation_below_base_clock
  • nv_nvlink_crc_flit_errors
  • nv_nvlink_crc_errors
  • nv_nvlink_ecc_errors
  • nv_nvlink_replay_errors
  • nv_nvlink_recovery_errors

Some metrics carry an additional subtype tag (stype). For example, the nv_nvlink_* metrics set stype=nvlink,stype-id=<link_number>.
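As an illustration of the subtype tags, the sketch below turns one per-link counter array into one metric per NVLink and skips the metric entirely if it is listed in exclude_metrics. The Metric type and the nvlinkMetrics function are hypothetical and much simpler than a real metric message. Numbering links from 0 is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// Metric is a simplified, hypothetical stand-in for a collector metric.
type Metric struct {
	Name  string
	Tags  map[string]string
	Value uint64
}

// nvlinkMetrics emits one metric per link with stype=nvlink and
// stype-id=<link_number>. Excluded metric names produce no output.
func nvlinkMetrics(name string, perLink []uint64, excluded map[string]bool) []Metric {
	if excluded[name] {
		return nil
	}
	out := make([]Metric, 0, len(perLink))
	for link, value := range perLink {
		out = append(out, Metric{
			Name: name,
			Tags: map[string]string{
				"type":     "accelerator",
				"stype":    "nvlink",
				"stype-id": strconv.Itoa(link),
			},
			Value: value,
		})
	}
	return out
}

func main() {
	excluded := map[string]bool{"nv_fan": true}
	for _, m := range nvlinkMetrics("nv_nvlink_crc_errors", []uint64{0, 7}, excluded) {
		fmt.Println(m.Name, m.Tags["stype"], m.Tags["stype-id"], m.Value)
	}
}
```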