# CCMetric collectors

This folder contains the collectors for the cc-metric-collector.
## Configuration
```json
{
    "collector_type" : {
        <collector specific configuration>
    }
}
```
In contrast to the configuration files for sinks and receivers, the collector configuration is not a list but a set of dictionaries. This is required because we did not manage to partially read the type before loading the remaining configuration. We are eager to change this to the same format.
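For example, a configuration that enables the `memstat` and `loadavg` collectors shown below could look like this (the excluded metrics are illustrative):

```json
{
    "memstat": {
        "exclude_metrics": [
            "mem_used"
        ]
    },
    "loadavg": {
        "exclude_metrics": [
            "proc_run"
        ]
    }
}
```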
## `memstat` collector
"memstat": {
"exclude_metrics": [
"mem_used"
]
}
The `memstat` collector reads data from `/proc/meminfo` and outputs a handful of node metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
Metrics:

- `mem_total`
- `mem_sreclaimable`
- `mem_slab`
- `mem_free`
- `mem_buffers`
- `mem_cached`
- `mem_available`
- `mem_shared`
- `swap_total`
- `swap_free`
- `mem_used` = `mem_total - (mem_free + mem_buffers + mem_cached)`
## `loadavg` collector
"loadavg": {
"exclude_metrics": [
"proc_run"
]
}
The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful of node metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
Metrics:

- `load_one`
- `load_five`
- `load_fifteen`
- `proc_run`
- `proc_total`
## `netstat` collector
"netstat": {
"exclude_devices": [
"lo"
]
}
The `netstat` collector reads data from `/proc/net/dev` and outputs a handful of node metrics. If a device is not required, it can be excluded from being forwarded to the sink. Commonly the `lo` device should be excluded.
Metrics:

- `bytes_in`
- `bytes_out`
- `pkts_in`
- `pkts_out`

The device name is added as tag `device`.
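Analogous to the output format shown for the `nvidia` collector below, a published metric carries this tag; a line might look like this (the device name `eth0` is illustrative, additional tags are omitted):

```
bytes_in,device=eth0 value=<metric value> <timestamp>
```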
## `diskstat` collector
"diskstat": {
"exclude_metrics": [
"read_ms"
],
}
The `diskstat` collector reads data from `/proc/diskstats` and outputs a handful of node metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
Metrics:

- `reads`
- `reads_merged`
- `read_sectors`
- `read_ms`
- `writes`
- `writes_merged`
- `writes_sectors`
- `writes_ms`
- `ioops`
- `ioops_ms`
- `ioops_weighted_ms`
- `discards`
- `discards_merged`
- `discards_sectors`
- `discards_ms`
- `flushes`
- `flushes_ms`

The device name is added as tag `device`.
## `cpustat` collector
"netstat": {
"exclude_metrics": [
"cpu_idle"
]
}
The `cpustat` collector reads data from `/proc/stat` and outputs a handful of node and hwthread metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
Metrics:

- `cpu_user`
- `cpu_nice`
- `cpu_system`
- `cpu_idle`
- `cpu_iowait`
- `cpu_irq`
- `cpu_softirq`
- `cpu_steal`
- `cpu_guest`
- `cpu_guest_nice`
## `ibstat` collector
"ibstat": {
"perfquery_path" : "<path to perfquery command>",
"exclude_devices": [
"mlx4"
]
}
The `ibstat` collector reads the data either through the `perfquery` command or from the sysfs files below `/sys/class/infiniband/<device>`.
Metrics:

- `ib_recv`
- `ib_xmit`
## `lustrestat` collector
"lustrestat": {
"procfiles" : [
"/proc/fs/lustre/llite/lnec-XXXXXX/stats"
],
"exclude_metrics": [
"setattr",
"getattr"
]
}
The `lustrestat` collector reads from the procfs stat files for Lustre like `/proc/fs/lustre/llite/lnec-XXXXXX/stats`.
Metrics:

- `read_bytes`
- `read_requests`
- `write_bytes`
- `write_requests`
- `open`
- `close`
- `getattr`
- `setattr`
- `statfs`
- `inode_permission`
## `nvidia` collector
"lustrestat": {
"exclude_devices" : [
"0","1"
],
"exclude_metrics": [
"fb_memory",
"fan"
]
}
Metrics:

- `util`
- `mem_util`
- `mem_total`
- `fb_memory`
- `temp`
- `fan`
- `ecc_mode`
- `perf_state`
- `power_usage_report`
- `graphics_clock_report`
- `sm_clock_report`
- `mem_clock_report`
- `max_graphics_clock`
- `max_sm_clock`
- `max_mem_clock`
- `ecc_db_error`
- `ecc_sb_error`
- `power_man_limit`
- `encoder_util`
- `decoder_util`
It uses a separate `type` in the metrics. The output metric looks like this:

```
<name>,type=accelerator,type-id=<nvidia-gpu-id> value=<metric value> <timestamp>
```
## `tempstat` collector
"lustrestat": {
"tag_override" : {
"<device like hwmon1>" : {
"type" : "socket",
"type-id" : "0"
}
},
"exclude_metrics": [
"metric1",
"metric2"
]
}
The `tempstat` collector reads the data from `/sys/class/hwmon/<device>/tempX_{input,label}`.

Metrics:

- `temp_*`: The metric name is taken from the `label` files.
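Following the output format shown for the `nvidia` collector above, a metric from a device mapped via `tag_override` would carry the overridden tags; a line might look like this (the metric name is illustrative):

```
temp_<label>,type=socket,type-id=0 value=<metric value> <timestamp>
```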
## `likwid` collector
"likwid": {
"eventsets": [
{
"events": {
"FIXC1": "ACTUAL_CPU_CLOCK",
"FIXC2": "MAX_CPU_CLOCK",
"PMC0": "RETIRED_INSTRUCTIONS",
"PMC1": "CPU_CLOCKS_UNHALTED",
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
"PMC3": "MERGE",
"DFC0": "DRAM_CHANNEL_0",
"DFC1": "DRAM_CHANNEL_1",
"DFC2": "DRAM_CHANNEL_2",
"DFC3": "DRAM_CHANNEL_3"
},
"metrics": [
{
"name": "ipc",
"calc": "PMC0/PMC1",
"socket_scope": false,
"publish": true
},
{
"name": "flops_any",
"calc": "0.000001*PMC2/time",
"socket_scope": false,
"publish": true
},
{
"name": "clock_mhz",
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
"socket_scope": false,
"publish": true
},
{
"name": "mem1",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"socket_scope": true,
"publish": false
}
]
},
{
"events": {
"DFC0": "DRAM_CHANNEL_4",
"DFC1": "DRAM_CHANNEL_5",
"DFC2": "DRAM_CHANNEL_6",
"DFC3": "DRAM_CHANNEL_7",
"PWR0": "RAPL_CORE_ENERGY",
"PWR1": "RAPL_PKG_ENERGY"
},
"metrics": [
{
"name": "pwr_core",
"calc": "PWR0/time",
"socket_scope": false,
"publish": true
},
{
"name": "pwr_pkg",
"calc": "PWR1/time",
"socket_scope": true,
"publish": true
},
{
"name": "mem2",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"socket_scope": true,
"publish": false
}
]
}
],
"globalmetrics": [
{
"name": "mem_bw",
"calc": "mem1+mem2",
"socket_scope": true,
"publish": true
}
]
}
Example config suitable for AMD Zen3
The `likwid` collector reads hardware performance counters at hwthread and socket level. The configuration looks quite complicated, but it is basically copy&paste from LIKWID's performance groups. The collector went through multiple iterations that tried to use the performance groups directly, but this lacked flexibility. The current way of configuration provides the most flexibility.
The logic is as follows: there are multiple eventsets, each consisting of a list of counters+events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
```
EVENTSET                           ->   "events": {
FIXC1 ACTUAL_CPU_CLOCK             ->     "FIXC1": "ACTUAL_CPU_CLOCK",
FIXC2 MAX_CPU_CLOCK                ->     "FIXC2": "MAX_CPU_CLOCK",
PMC0  RETIRED_INSTRUCTIONS         ->     "PMC0": "RETIRED_INSTRUCTIONS",
PMC1  CPU_CLOCKS_UNHALTED          ->     "PMC1": "CPU_CLOCKS_UNHALTED",
PMC2  RETIRED_SSE_AVX_FLOPS_ALL    ->     "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
PMC3  MERGE                        ->     "PMC3": "MERGE",
                                   ->   }
```
The metrics follow the same procedure:
```
METRICS                            ->   "metrics": [
IPC   PMC0/PMC1                    ->     {
                                   ->       "name": "IPC",
                                   ->       "calc": "PMC0/PMC1",
                                   ->       "socket_scope": false,
                                   ->       "publish": true
                                   ->     }
                                   ->   ]
```
The `socket_scope` option tells whether the metric is submitted per socket or per hwthread. If a metric is only used for internal calculations, you can set `publish=false`.
Since some metrics can only be gathered with multiple measurements (like the memory bandwidth on AMD Zen3 chips), configure multiple eventsets as in the example config and use the `globalmetrics` section to combine them. Be aware that the combination might be misleading, because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.
## Todos

- Exclude devices for the `diskstat` collector
- Aggregate metrics to a higher topology entity (sum hwthread metrics to a socket metric, ...). Needs to be configurable.
## Contributing own collectors

A collector reads data from any source, parses it into metrics and submits these metrics to the `metric-collector`. A collector provides five functions:
- `Name() string`: Returns the name of the collector
- `Init(config json.RawMessage) error`: Initializes the collector using the given collector-specific config in JSON. Check whether needed files/commands exist, ...
- `Initialized() bool`: Checks whether the collector was initialized successfully
- `Read(duration time.Duration, output chan ccMetric.CCMetric)`: Reads, parses and submits data to the `output` channel as `CCMetric`. If the collector has to measure anything for some duration, use the provided function argument `duration`
- `Close()`: Closes down the collector
It is recommended to call `setup()` in the `Init()` function.
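As a sketch, a minimal collector implementing these five functions might look like the code below. The collector name `samplestat`, the configuration struct, and the import path of the `ccMetric` package are illustrative assumptions, not the project's actual code:

```go
package collectors

import (
	"encoding/json"
	"time"

	// Assumed import path for the CCMetric type; adjust to the project's layout.
	ccMetric "github.com/ClusterCockpit/cc-metric-collector/internal/ccmetric"
)

// sampleCollectorConfig mirrors the collector-specific JSON configuration.
type sampleCollectorConfig struct {
	ExcludeMetrics []string `json:"exclude_metrics"`
}

// SampleCollector is a hypothetical collector following the functions above.
type SampleCollector struct {
	name   string
	config sampleCollectorConfig
	init   bool
}

// Name returns the name of the collector.
func (c *SampleCollector) Name() string {
	return c.name
}

// Init parses the collector-specific configuration and checks that the
// required files/commands exist before marking the collector as initialized.
func (c *SampleCollector) Init(config json.RawMessage) error {
	c.name = "samplestat"
	if len(config) > 0 {
		if err := json.Unmarshal(config, &c.config); err != nil {
			return err
		}
	}
	// The project recommends calling setup() here (see above).
	// Also check that the data source (file, command, ...) is available.
	c.init = true
	return nil
}

// Initialized reports whether Init() completed successfully.
func (c *SampleCollector) Initialized() bool {
	return c.init
}

// Read reads and parses the data source and submits the resulting metrics
// to the output channel as CCMetric. Use duration if the collector has to
// measure over a time interval.
func (c *SampleCollector) Read(duration time.Duration, output chan ccMetric.CCMetric) {
	if !c.init {
		return
	}
	// Read the data source, skip metrics listed in c.config.ExcludeMetrics,
	// build CCMetric values and send them to the output channel.
}

// Close shuts the collector down.
func (c *SampleCollector) Close() {
	c.init = false
}
```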
Finally, the collector needs to be registered in `collectorManager.go`. There is a list of collectors called `AvailableCollectors` which is a map (`collector_type_string` -> pointer to `MetricCollector` interface). Add a new entry with a descriptive name and the new collector.
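For the hypothetical `SampleCollector` from the sketch above, the registration might look like this (existing entries abbreviated):

```go
// In collectorManager.go: map from the configuration key to the collector.
var AvailableCollectors = map[string]MetricCollector{
	// ... existing collectors ...
	"samplestat": &SampleCollector{},
}
```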