From a14210061cfa127484e0602dfb71f774a9ddde4d Mon Sep 17 00:00:00 2001
From: Thomas Roehl
Date: Mon, 27 Dec 2021 13:01:50 +0100
Subject: [PATCH] Use seperate page per collector

---
 collectors/README.md           | 360 ++-------------------------------
 collectors/cpustatMetric.md    |  23 +++
 collectors/customCmdMetric.md  |   0
 collectors/diskstatMetric.md   |  34 ++++
 collectors/infinibandMetric.md |  19 ++
 collectors/ipmiMetric.md       |   0
 collectors/likwidMetric.md     | 119 +++++++++++
 collectors/loadavgMetric.md    |  19 ++
 collectors/lustreMetric.md     |  29 +++
 collectors/memstatMetric.md    |  27 +++
 collectors/netstatMetric.md    |  21 ++
 collectors/nvidiaMetric.md     |  40 ++++
 collectors/tempMetric.md       |  22 ++
 collectors/topprocsMetric.md   |   0
 14 files changed, 367 insertions(+), 346 deletions(-)
 create mode 100644 collectors/cpustatMetric.md
 create mode 100644 collectors/customCmdMetric.md
 create mode 100644 collectors/diskstatMetric.md
 create mode 100644 collectors/infinibandMetric.md
 create mode 100644 collectors/ipmiMetric.md
 create mode 100644 collectors/likwidMetric.md
 create mode 100644 collectors/loadavgMetric.md
 create mode 100644 collectors/lustreMetric.md
 create mode 100644 collectors/memstatMetric.md
 create mode 100644 collectors/netstatMetric.md
 create mode 100644 collectors/nvidiaMetric.md
 create mode 100644 collectors/tempMetric.md
 create mode 100644 collectors/topprocsMetric.md

diff --git a/collectors/README.md b/collectors/README.md
index 09ffa06..3cbd40c 100644
--- a/collectors/README.md
+++ b/collectors/README.md
@@ -14,353 +14,21 @@ This folder contains the collectors for the cc-metric-collector.
 
 In contrast to the configuration files for sinks and receivers, the collectors configuration is not a list but a set of dicts. This is required because we didn't manage to partially read the type before loading the remaining configuration. We are eager to change this to the same format.
 
+# Available collectors
 
-## `memstat` collector
-
-```json
-  "memstat": {
-    "exclude_metrics": [
-      "mem_used"
-    ]
-  }
-```
-
-The `memstat` collector reads data from `/proc/meminfo` and outputs a handful **node** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
-
-
-Metrics:
-* `mem_total`
-* `mem_sreclaimable`
-* `mem_slab`
-* `mem_free`
-* `mem_buffers`
-* `mem_cached`
-* `mem_available`
-* `mem_shared`
-* `swap_total`
-* `swap_free`
-* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)
-
-## `loadavg` collector
-```json
-  "loadavg": {
-    "exclude_metrics": [
-      "proc_run"
-    ]
-  }
-```
-
-The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful **node** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
-
-Metrics:
-* `load_one`
-* `load_five`
-* `load_fifteen`
-* `proc_run`
-* `proc_total`
-
-## `netstat` collector
-```json
-  "netstat": {
-    "exclude_devices": [
-      "lo"
-    ]
-  }
-```
-
-The `netstat` collector reads data from `/proc/net/dev` and outputs a handful **node** metrics. If a device is not required, it can be excluded from forwarding it to the sink. Commonly the `lo` device should be excluded.
-
-Metrics:
-* `bytes_in`
-* `bytes_out`
-* `pkts_in`
-* `pkts_out`
-
-The device name is added as tag `device`.
-
-
-## `diskstat` collector
-
-```json
-  "diskstat": {
-    "exclude_metrics": [
-      "read_ms"
-    ],
-  }
-```
-
-The `netstat` collector reads data from `/proc/net/dev` and outputs a handful **node** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
-
-Metrics:
-* `reads`
-* `reads_merged`
-* `read_sectors`
-* `read_ms`
-* `writes`
-* `writes_merged`
-* `writes_sectors`
-* `writes_ms`
-* `ioops`
-* `ioops_ms`
-* `ioops_weighted_ms`
-* `discards`
-* `discards_merged`
-* `discards_sectors`
-* `discards_ms`
-* `flushes`
-* `flushes_ms`
-
-
-The device name is added as tag `device`.
-
-## `cpustat` collector
-```json
-  "netstat": {
-    "exclude_metrics": [
-      "cpu_idle"
-    ]
-  }
-```
-
-The `cpustat` collector reads data from `/proc/stats` and outputs a handful **node** and **hwthread** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
-
-Metrics:
-* `cpu_user`
-* `cpu_nice`
-* `cpu_system`
-* `cpu_idle`
-* `cpu_iowait`
-* `cpu_irq`
-* `cpu_softirq`
-* `cpu_steal`
-* `cpu_guest`
-* `cpu_guest_nice`
-
-## `ibstat` collector
-
-```json
-  "ibstat": {
-    "perfquery_path" : "",
-    "exclude_devices": [
-      "mlx4"
-    ]
-  }
-```
-
-The `ibstat` collector reads either data through the `perfquery` command or the sysfs files below `/sys/class/infiniband/`.
-
-Metrics:
-* `ib_recv`
-* `ib_xmit`
-
-
-## `lustrestat` collector
-
-```json
-  "lustrestat": {
-    "procfiles" : [
-      "/proc/fs/lustre/llite/lnec-XXXXXX/stats"
-    ],
-    "exclude_metrics": [
-      "setattr",
-      "getattr"
-    ]
-  }
-```
-
-The `lustrestat` collector reads from the procfs stat files for Lustre like `/proc/fs/lustre/llite/lnec-XXXXXX/stats`.
-
-Metrics:
-* `read_bytes`
-* `read_requests`
-* `write_bytes`
-* `write_requests`
-* `open`
-* `close`
-* `getattr`
-* `setattr`
-* `statfs`
-* `inode_permission`
-
-## `nvidia` collector
-
-```json
-  "lustrestat": {
-    "exclude_devices" : [
-      "0","1"
-    ],
-    "exclude_metrics": [
-      "fb_memory",
-      "fan"
-    ]
-  }
-```
-
-Metrics:
-* `util`
-* `mem_util`
-* `mem_total`
-* `fb_memory`
-* `temp`
-* `fan`
-* `ecc_mode`
-* `perf_state`
-* `power_usage_report`
-* `graphics_clock_report`
-* `sm_clock_report`
-* `mem_clock_report`
-* `max_graphics_clock`
-* `max_sm_clock`
-* `max_mem_clock`
-* `ecc_db_error`
-* `ecc_sb_error`
-* `power_man_limit`
-* `encoder_util`
-* `decoder_util`
-
-It uses a separate `type` in the metrics. The output metric looks like this:
-`,type=accelerator,type-id= value= `
-
-## `tempstat` collector
-
-```json
-  "lustrestat": {
-    "tag_override" : {
-      "" : {
-        "type" : "socket",
-        "type-id" : "0"
-      }
-    },
-    "exclude_metrics": [
-      "metric1",
-      "metric2"
-    ]
-  }
-```
-
-The `tempstat` collector reads the data from `/sys/class/hwmon//tempX_{input,label}`
-
-Metrics:
-* `temp_*`: The metric name is taken from the `label` files.
-
-## `likwid` collector
-```json
-  "likwid": {
-    "eventsets": [
-      {
-        "events": {
-          "FIXC1": "ACTUAL_CPU_CLOCK",
-          "FIXC2": "MAX_CPU_CLOCK",
-          "PMC0": "RETIRED_INSTRUCTIONS",
-          "PMC1": "CPU_CLOCKS_UNHALTED",
-          "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
-          "PMC3": "MERGE",
-          "DFC0": "DRAM_CHANNEL_0",
-          "DFC1": "DRAM_CHANNEL_1",
-          "DFC2": "DRAM_CHANNEL_2",
-          "DFC3": "DRAM_CHANNEL_3"
-        },
-        "metrics": [
-          {
-            "name": "ipc",
-            "calc": "PMC0/PMC1",
-            "socket_scope": false,
-            "publish": true
-          },
-          {
-            "name": "flops_any",
-            "calc": "0.000001*PMC2/time",
-            "socket_scope": false,
-            "publish": true
-          },
-          {
-            "name": "clock_mhz",
-            "calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
-            "socket_scope": false,
-            "publish": true
-          },
-          {
-            "name": "mem1",
-            "calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
-            "socket_scope": true,
-            "publish": false
-          }
-        ]
-      },
-      {
-        "events": {
-          "DFC0": "DRAM_CHANNEL_4",
-          "DFC1": "DRAM_CHANNEL_5",
-          "DFC2": "DRAM_CHANNEL_6",
-          "DFC3": "DRAM_CHANNEL_7",
-          "PWR0": "RAPL_CORE_ENERGY",
-          "PWR1": "RAPL_PKG_ENERGY"
-        },
-        "metrics": [
-          {
-            "name": "pwr_core",
-            "calc": "PWR0/time",
-            "socket_scope": false,
-            "publish": true
-          },
-          {
-            "name": "pwr_pkg",
-            "calc": "PWR1/time",
-            "socket_scope": true,
-            "publish": true
-          },
-          {
-            "name": "mem2",
-            "calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
-            "socket_scope": true,
-            "publish": false
-          }
-        ]
-      }
-    ],
-    "globalmetrics": [
-      {
-        "name": "mem_bw",
-        "calc": "mem1+mem2",
-        "socket_scope": true,
-        "publish": true
-      }
-    ]
-  }
-```
-
-_Example config suitable for AMD Zen3_
-
-The `likwid` collector reads hardware performance counters at a **hwthread** and **socket** level. The configuration looks quite complicated but it is basically copy&paste from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). The collector made multiple iterations and tried to use the performance groups but it lacked flexibility. The current way of configuration provides most flexibility.
-
-The logic is as following: There are multiple eventsets, each consisting of a list of counters+events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
-```
-EVENTSET                       -> "events": {
-FIXC1 ACTUAL_CPU_CLOCK         ->   "FIXC1": "ACTUAL_CPU_CLOCK",
-FIXC2 MAX_CPU_CLOCK            ->   "FIXC2": "MAX_CPU_CLOCK",
-PMC0 RETIRED_INSTRUCTIONS      ->   "PMC0" : "RETIRED_INSTRUCTIONS",
-PMC1 CPU_CLOCKS_UNHALTED       ->   "PMC1" : "CPU_CLOCKS_UNHALTED",
-PMC2 RETIRED_SSE_AVX_FLOPS_ALL ->   "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
-PMC3 MERGE                     ->   "PMC3": "MERGE",
-                               -> }
-```
-
-The metrics are following the same procedure:
-
-```
-METRICS       -> "metrics": [
-IPC PMC0/PMC1 ->   {
-              ->     "name" : "IPC",
-              ->     "calc" : "PMC0/PMC1",
-              ->     "socket_scope": false,
-              ->     "publish": true
-              ->   }
-              -> ]
-```
-
-The `socket_scope` option tells whether it is submitted per socket or per hwthread. If a metric is only used for internal calculations, you can set `publish = false`.
-
-Since some metrics can only be gathered in multiple measurements (like the memory bandwidth on AMD Zen3 chips), configure multiple eventsets like in the example config and use the `globalmetrics` section to combine them. **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.
+* [`cpustat`](./cpustatMetric.md)
+* [`memstat`](./memstatMetric.md)
+* [`diskstat`](./diskstatMetric.md)
+* [`loadavg`](./loadavgMetric.md)
+* [`netstat`](./netstatMetric.md)
+* [`ibstat`](./infinibandMetric.md)
+* [`tempstat`](./tempMetric.md)
+* [`lustrestat`](./lustreMetric.md)
+* [`likwid`](./likwidMetric.md)
+* [`nvidia`](./nvidiaMetric.md)
+* [`customcmd`](./customCmdMetric.md)
+* [`ipmistat`](./ipmiMetric.md)
+* [`topprocs`](./topprocsMetric.md)
 
 ## Todos
 
diff --git a/collectors/cpustatMetric.md b/collectors/cpustatMetric.md
new file mode 100644
index 0000000..604445a
--- /dev/null
+++ b/collectors/cpustatMetric.md
@@ -0,0 +1,23 @@
+
+## `cpustat` collector
+```json
+  "cpustat": {
+    "exclude_metrics": [
+      "cpu_idle"
+    ]
+  }
+```
+
+The `cpustat` collector reads data from `/proc/stat` and outputs a handful of **node** and **hwthread** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
+
+Metrics:
+* `cpu_user`
+* `cpu_nice`
+* `cpu_system`
+* `cpu_idle`
+* `cpu_iowait`
+* `cpu_irq`
+* `cpu_softirq`
+* `cpu_steal`
+* `cpu_guest`
+* `cpu_guest_nice`
diff --git a/collectors/customCmdMetric.md b/collectors/customCmdMetric.md
new file mode 100644
index 0000000..e69de29
diff --git a/collectors/diskstatMetric.md b/collectors/diskstatMetric.md
new file mode 100644
index 0000000..1ac341d
--- /dev/null
+++ b/collectors/diskstatMetric.md
@@ -0,0 +1,34 @@
+
+## `diskstat` collector
+
+```json
+  "diskstat": {
+    "exclude_metrics": [
+      "read_ms"
+    ]
+  }
+```
+
+The `diskstat` collector reads data from `/proc/diskstats` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
+
+Metrics:
+* `reads`
+* `reads_merged`
+* `read_sectors`
+* `read_ms`
+* `writes`
+* `writes_merged`
+* `writes_sectors`
+* `writes_ms`
+* `ioops`
+* `ioops_ms`
+* `ioops_weighted_ms`
+* `discards`
+* `discards_merged`
+* `discards_sectors`
+* `discards_ms`
+* `flushes`
+* `flushes_ms`
+
+The device name is added as tag `device`.
+
diff --git a/collectors/infinibandMetric.md b/collectors/infinibandMetric.md
new file mode 100644
index 0000000..e9ba043
--- /dev/null
+++ b/collectors/infinibandMetric.md
@@ -0,0 +1,19 @@
+
+## `ibstat` collector
+
+```json
+  "ibstat": {
+    "perfquery_path" : "",
+    "exclude_devices": [
+      "mlx4"
+    ]
+  }
+```
+
+The `ibstat` collector reads data either through the `perfquery` command or from the sysfs files below `/sys/class/infiniband/`.
+
+Metrics:
+* `ib_recv`
+* `ib_xmit`
+
+The collector adds a `device` tag to all metrics.
diff --git a/collectors/ipmiMetric.md b/collectors/ipmiMetric.md
new file mode 100644
index 0000000..e69de29
diff --git a/collectors/likwidMetric.md b/collectors/likwidMetric.md
new file mode 100644
index 0000000..08b917f
--- /dev/null
+++ b/collectors/likwidMetric.md
@@ -0,0 +1,119 @@
+
+## `likwid` collector
+```json
+  "likwid": {
+    "eventsets": [
+      {
+        "events": {
+          "FIXC1": "ACTUAL_CPU_CLOCK",
+          "FIXC2": "MAX_CPU_CLOCK",
+          "PMC0": "RETIRED_INSTRUCTIONS",
+          "PMC1": "CPU_CLOCKS_UNHALTED",
+          "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
+          "PMC3": "MERGE",
+          "DFC0": "DRAM_CHANNEL_0",
+          "DFC1": "DRAM_CHANNEL_1",
+          "DFC2": "DRAM_CHANNEL_2",
+          "DFC3": "DRAM_CHANNEL_3"
+        },
+        "metrics": [
+          {
+            "name": "ipc",
+            "calc": "PMC0/PMC1",
+            "socket_scope": false,
+            "publish": true
+          },
+          {
+            "name": "flops_any",
+            "calc": "0.000001*PMC2/time",
+            "socket_scope": false,
+            "publish": true
+          },
+          {
+            "name": "clock_mhz",
+            "calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
+            "socket_scope": false,
+            "publish": true
+          },
+          {
+            "name": "mem1",
+            "calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
+            "socket_scope": true,
+            "publish": false
+          }
+        ]
+      },
+      {
+        "events": {
+          "DFC0": "DRAM_CHANNEL_4",
+          "DFC1": "DRAM_CHANNEL_5",
+          "DFC2": "DRAM_CHANNEL_6",
+          "DFC3": "DRAM_CHANNEL_7",
+          "PWR0": "RAPL_CORE_ENERGY",
+          "PWR1": "RAPL_PKG_ENERGY"
+        },
+        "metrics": [
+          {
+            "name": "pwr_core",
+            "calc": "PWR0/time",
+            "socket_scope": false,
+            "publish": true
+          },
+          {
+            "name": "pwr_pkg",
+            "calc": "PWR1/time",
+            "socket_scope": true,
+            "publish": true
+          },
+          {
+            "name": "mem2",
+            "calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
+            "socket_scope": true,
+            "publish": false
+          }
+        ]
+      }
+    ],
+    "globalmetrics": [
+      {
+        "name": "mem_bw",
+        "calc": "mem1+mem2",
+        "socket_scope": true,
+        "publish": true
+      }
+    ]
+  }
+```
+
+_Example config suitable for AMD Zen3_
+
+The `likwid` collector reads hardware performance counters at the **hwthread** and **socket** level. The configuration looks quite complicated, but it is basically copy & paste from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). Earlier iterations of the collector tried to use the performance groups directly, but that approach lacked flexibility. The current way of configuration provides the most flexibility.
+
+The logic is as follows: There are multiple eventsets, each consisting of a list of counters+events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
+```
+EVENTSET                       -> "events": {
+FIXC1 ACTUAL_CPU_CLOCK         ->   "FIXC1": "ACTUAL_CPU_CLOCK",
+FIXC2 MAX_CPU_CLOCK            ->   "FIXC2": "MAX_CPU_CLOCK",
+PMC0 RETIRED_INSTRUCTIONS      ->   "PMC0" : "RETIRED_INSTRUCTIONS",
+PMC1 CPU_CLOCKS_UNHALTED       ->   "PMC1" : "CPU_CLOCKS_UNHALTED",
+PMC2 RETIRED_SSE_AVX_FLOPS_ALL ->   "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
+PMC3 MERGE                     ->   "PMC3": "MERGE",
+                               -> }
+```
+
+The metrics follow the same procedure:
+
+```
+METRICS       -> "metrics": [
+IPC PMC0/PMC1 ->   {
+              ->     "name" : "IPC",
+              ->     "calc" : "PMC0/PMC1",
+              ->     "socket_scope": false,
+              ->     "publish": true
+              ->   }
+              -> ]
+```
+
+The `socket_scope` option tells whether the metric is submitted per socket or per hwthread. If a metric is only used for internal calculations, you can set `"publish": false`.
+
+Since some metrics can only be gathered in multiple measurements (like the memory bandwidth on AMD Zen3 chips), configure multiple eventsets as in the example config and use the `globalmetrics` section to combine them. **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.
diff --git a/collectors/loadavgMetric.md b/collectors/loadavgMetric.md
new file mode 100644
index 0000000..d2b3f50
--- /dev/null
+++ b/collectors/loadavgMetric.md
@@ -0,0 +1,19 @@
+
+## `loadavg` collector
+
+```json
+  "loadavg": {
+    "exclude_metrics": [
+      "proc_run"
+    ]
+  }
+```
+
+The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
+
+Metrics:
+* `load_one`
+* `load_five`
+* `load_fifteen`
+* `proc_run`
+* `proc_total`
diff --git a/collectors/lustreMetric.md b/collectors/lustreMetric.md
new file mode 100644
index 0000000..0cb9fc8
--- /dev/null
+++ b/collectors/lustreMetric.md
@@ -0,0 +1,29 @@
+
+## `lustrestat` collector
+
+```json
+  "lustrestat": {
+    "procfiles" : [
+      "/proc/fs/lustre/llite/lnec-XXXXXX/stats"
+    ],
+    "exclude_metrics": [
+      "setattr",
+      "getattr"
+    ]
+  }
+```
+
+The `lustrestat` collector reads from the procfs stat files for Lustre like `/proc/fs/lustre/llite/lnec-XXXXXX/stats`.
+
+Metrics:
+* `read_bytes`
+* `read_requests`
+* `write_bytes`
+* `write_requests`
+* `open`
+* `close`
+* `getattr`
+* `setattr`
+* `statfs`
+* `inode_permission`
+
diff --git a/collectors/memstatMetric.md b/collectors/memstatMetric.md
new file mode 100644
index 0000000..4b7b8c7
--- /dev/null
+++ b/collectors/memstatMetric.md
@@ -0,0 +1,27 @@
+
+## `memstat` collector
+
+```json
+  "memstat": {
+    "exclude_metrics": [
+      "mem_used"
+    ]
+  }
+```
+
+The `memstat` collector reads data from `/proc/meminfo` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
+
+
+Metrics:
+* `mem_total`
+* `mem_sreclaimable`
+* `mem_slab`
+* `mem_free`
+* `mem_buffers`
+* `mem_cached`
+* `mem_available`
+* `mem_shared`
+* `swap_total`
+* `swap_free`
+* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)
+
diff --git a/collectors/netstatMetric.md b/collectors/netstatMetric.md
new file mode 100644
index 0000000..34a48fd
--- /dev/null
+++ b/collectors/netstatMetric.md
@@ -0,0 +1,21 @@
+
+## `netstat` collector
+
+```json
+  "netstat": {
+    "exclude_devices": [
+      "lo"
+    ]
+  }
+```
+
+The `netstat` collector reads data from `/proc/net/dev` and outputs a handful of **node** metrics. If a device is not required, it can be excluded from being forwarded to the sink. Commonly, the `lo` device should be excluded.
+
+Metrics:
+* `bytes_in`
+* `bytes_out`
+* `pkts_in`
+* `pkts_out`
+
+The device name is added as tag `device`.
+
diff --git a/collectors/nvidiaMetric.md b/collectors/nvidiaMetric.md
new file mode 100644
index 0000000..c774139
--- /dev/null
+++ b/collectors/nvidiaMetric.md
@@ -0,0 +1,40 @@
+
+## `nvidia` collector
+
+```json
+  "nvidia": {
+    "exclude_devices" : [
+      "0","1"
+    ],
+    "exclude_metrics": [
+      "fb_memory",
+      "fan"
+    ]
+  }
+```
+
+Metrics:
+* `util`
+* `mem_util`
+* `mem_total`
+* `fb_memory`
+* `temp`
+* `fan`
+* `ecc_mode`
+* `perf_state`
+* `power_usage_report`
+* `graphics_clock_report`
+* `sm_clock_report`
+* `mem_clock_report`
+* `max_graphics_clock`
+* `max_sm_clock`
+* `max_mem_clock`
+* `ecc_db_error`
+* `ecc_sb_error`
+* `power_man_limit`
+* `encoder_util`
+* `decoder_util`
+
+It uses a separate `type` in the metrics. The output metric looks like this:
+`,type=accelerator,type-id= value= `
+
diff --git a/collectors/tempMetric.md b/collectors/tempMetric.md
new file mode 100644
index 0000000..0e43de0
--- /dev/null
+++ b/collectors/tempMetric.md
@@ -0,0 +1,22 @@
+
+## `tempstat` collector
+
+```json
+  "tempstat": {
+    "tag_override" : {
+      "" : {
+        "type" : "socket",
+        "type-id" : "0"
+      }
+    },
+    "exclude_metrics": [
+      "metric1",
+      "metric2"
+    ]
+  }
+```
+
+The `tempstat` collector reads the data from `/sys/class/hwmon//tempX_{input,label}`.
+
+Metrics:
+* `temp_*`: The metric name is taken from the `label` files.
diff --git a/collectors/topprocsMetric.md b/collectors/topprocsMetric.md
new file mode 100644
index 0000000..e69de29