cc-metric-collector/collectors/likwidMetric.md
Thomas Gruber 200af84c54
Modularize the whole thing (#16)
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
2022-01-25 15:37:43 +01:00


## `likwid` collector
```json
"likwid": {
  "eventsets": [
    {
      "events": {
        "FIXC1": "ACTUAL_CPU_CLOCK",
        "FIXC2": "MAX_CPU_CLOCK",
        "PMC0": "RETIRED_INSTRUCTIONS",
        "PMC1": "CPU_CLOCKS_UNHALTED",
        "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
        "PMC3": "MERGE",
        "DFC0": "DRAM_CHANNEL_0",
        "DFC1": "DRAM_CHANNEL_1",
        "DFC2": "DRAM_CHANNEL_2",
        "DFC3": "DRAM_CHANNEL_3"
      },
      "metrics": [
        {
          "name": "ipc",
          "calc": "PMC0/PMC1",
          "socket_scope": false,
          "publish": true
        },
        {
          "name": "flops_any",
          "calc": "0.000001*PMC2/time",
          "socket_scope": false,
          "publish": true
        },
        {
          "name": "clock_mhz",
          "calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
          "socket_scope": false,
          "publish": true
        },
        {
          "name": "mem1",
          "calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
          "socket_scope": true,
          "publish": false
        }
      ]
    },
    {
      "events": {
        "DFC0": "DRAM_CHANNEL_4",
        "DFC1": "DRAM_CHANNEL_5",
        "DFC2": "DRAM_CHANNEL_6",
        "DFC3": "DRAM_CHANNEL_7",
        "PWR0": "RAPL_CORE_ENERGY",
        "PWR1": "RAPL_PKG_ENERGY"
      },
      "metrics": [
        {
          "name": "pwr_core",
          "calc": "PWR0/time",
          "socket_scope": false,
          "publish": true
        },
        {
          "name": "pwr_pkg",
          "calc": "PWR1/time",
          "socket_scope": true,
          "publish": true
        },
        {
          "name": "mem2",
          "calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
          "socket_scope": true,
          "publish": false
        }
      ]
    }
  ],
  "globalmetrics": [
    {
      "name": "mem_bw",
      "calc": "mem1+mem2",
      "socket_scope": true,
      "publish": true
    }
  ]
}
```
_Example config suitable for AMD Zen3_

The `likwid` collector reads hardware performance counters at the **hwthread** and **socket** level. The configuration looks complicated, but it is essentially copy and paste from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). Earlier iterations of the collector tried to use the performance groups directly, but that approach lacked flexibility. The current configuration scheme provides the most flexibility.

The logic is as follows: there are multiple eventsets, each consisting of a list of counter/event pairs and a list of metrics. If you compare a common performance group with the example config above, there is not much difference:
```
EVENTSET                         ->   "events": {
FIXC1 ACTUAL_CPU_CLOCK           ->     "FIXC1": "ACTUAL_CPU_CLOCK",
FIXC2 MAX_CPU_CLOCK              ->     "FIXC2": "MAX_CPU_CLOCK",
PMC0  RETIRED_INSTRUCTIONS       ->     "PMC0": "RETIRED_INSTRUCTIONS",
PMC1  CPU_CLOCKS_UNHALTED        ->     "PMC1": "CPU_CLOCKS_UNHALTED",
PMC2  RETIRED_SSE_AVX_FLOPS_ALL  ->     "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
PMC3  MERGE                      ->     "PMC3": "MERGE"
                                 ->   }
```
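
The mapping above is mechanical, so it can be scripted. The following is a minimal sketch, not part of the collector: the helper name `eventset_to_events` and the sample group text are hypothetical, and it assumes the usual performance group layout with `SHORT`, `EVENTSET`, `METRICS` and `LONG` section keywords.

```python
import json


def eventset_to_events(group_text):
    """Turn the EVENTSET section of a performance group into an "events" mapping.

    Hypothetical helper for illustration; it only understands lines of the
    form "<counter> <event>" between EVENTSET and the next section keyword.
    """
    events = {}
    in_eventset = False
    for raw in group_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "EVENTSET":
            in_eventset = True
            continue
        if line in ("METRICS", "LONG"):
            in_eventset = False
            continue
        if in_eventset:
            counter, event = line.split(None, 1)
            events[counter] = event.strip()
    return events


# Sample group text, made up for this example
group = """
SHORT Example group

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED

METRICS
IPC PMC0/PMC1
"""

events = eventset_to_events(group)
print(json.dumps({"events": events}, indent=2))
```

The printed object can be pasted into an eventset of the collector configuration and then extended by hand, for example with the `DFC*` counters.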
The metrics follow the same procedure:
```
METRICS          ->   "metrics": [
IPC PMC0/PMC1    ->     {
                 ->       "name": "IPC",
                 ->       "calc": "PMC0/PMC1",
                 ->       "socket_scope": false,
                 ->       "publish": true
                 ->     }
                 ->   ]
```
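
The metric translation can be sketched the same way. This is an illustration under assumptions, not collector code: `metrics_to_entries` is a hypothetical helper, it treats the last whitespace-separated token of each `METRICS` line as the formula, and the `socket_scope` and `publish` values it fills in are placeholder defaults that you would still review per metric.

```python
import json


def metrics_to_entries(group_text, socket_scope=False, publish=True):
    """Turn the METRICS section of a performance group into "metrics" entries.

    Hypothetical helper: the formula is taken to be the last token of a line,
    everything before it is the metric name.
    """
    entries = []
    in_metrics = False
    for raw in group_text.splitlines():
        line = raw.strip()
        if line == "METRICS":
            in_metrics = True
            continue
        if line in ("EVENTSET", "LONG"):
            in_metrics = False
            continue
        if in_metrics and line:
            name, calc = line.rsplit(None, 1)
            entries.append({
                "name": name,
                "calc": calc,
                "socket_scope": socket_scope,  # placeholder default
                "publish": publish,            # placeholder default
            })
    return entries


# Sample group text, made up for this example
group = """
EVENTSET
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED

METRICS
IPC PMC0/PMC1

LONG
Free-form description that must not be parsed as a metric.
"""

metrics = metrics_to_entries(group)
print(json.dumps({"metrics": metrics}, indent=2))
```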
The `socket_scope` option determines whether a metric is submitted per socket (`true`) or per hwthread (`false`). If a metric is only used for internal calculations, set `"publish": false`.
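
To make the roles of `calc` and `publish` concrete, here is a small sketch that evaluates a few of the formulas from the example config against made-up counter values and then keeps only the metrics marked for publishing. It is not the collector's own evaluator: the `evaluate` function, the sample values and the restriction to `+ - * /` are assumptions made for this illustration.

```python
import ast
import operator

# Arithmetic operators this sketch accepts in a "calc" string
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(calc, values):
    """Evaluate a formula such as "PMC0/PMC1" using the names in `values`."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.Name):
            return float(values[node.id])
        raise ValueError("unsupported expression element: " + ast.dump(node))
    return walk(ast.parse(calc, mode="eval"))


# Made-up counter readings for one measurement interval of 2 seconds
values = {
    "PMC0": 3.0e9, "PMC1": 2.0e9, "PMC2": 4.0e9,
    "DFC0": 1.0e6, "DFC1": 1.0e6, "DFC2": 1.0e6, "DFC3": 1.0e6,
    "time": 2.0,
}

metrics = [
    {"name": "ipc", "calc": "PMC0/PMC1", "publish": True},
    {"name": "flops_any", "calc": "0.000001*PMC2/time", "publish": True},
    {"name": "mem1", "calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time", "publish": False},
]

# Every metric is calculated, but only those with "publish": true are sent on
results = {m["name"]: evaluate(m["calc"], values) for m in metrics}
published = {m["name"]: results[m["name"]] for m in metrics if m["publish"]}

print(results)
print(published)
```

In this sketch `mem1` is still present in `results`, so a later calculation can use it, but it does not appear in `published`.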

Some metrics can only be gathered across multiple measurements (for example, the memory bandwidth on AMD Zen3 chips). In that case, configure multiple eventsets as in the example config and use the `globalmetrics` section to combine their results. **Be aware** that the combined value might be misleading: the behavior of a metric changes over time, so the individual measurements might capture different computing phases.
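
The caveat can be illustrated with numbers. The sketch below mirrors the `mem1`, `mem2` and `mem_bw` formulas from the example config in plain Python. The counter values, the one-second intervals and the helper name `mem_bandwidth` are all invented for this example, and it assumes the two eventsets are measured one after the other.

```python
def mem_bandwidth(dfc_counts, seconds):
    """Mirror of "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time" from the example config."""
    return 0.000001 * sum(dfc_counts) * 64.0 / seconds


# Interval A: eventset 1 counts DRAM channels 0-3 while the application
# happens to be in a memory-heavy phase (made-up counts).
mem1 = mem_bandwidth([5_000_000] * 4, 1.0)

# Interval B: eventset 2 counts DRAM channels 4-7, but by now the
# application has moved on to a compute-heavy phase (made-up counts).
mem2 = mem_bandwidth([1_000_000] * 4, 1.0)

# "globalmetrics" entry: "calc": "mem1+mem2"
mem_bw = mem1 + mem2

print("mem1 = %.1f, mem2 = %.1f, mem_bw = %.1f" % (mem1, mem2, mem_bw))
```

Here `mem_bw` adds a value from the memory-heavy phase to a value from the compute-heavy phase, so it describes neither phase accurately. The combination is most trustworthy when the workload behaves similarly across the measurement intervals.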