This folder contains the collectors for the cc-metric-collector.
metricCollector.go
The base class/configuration is located in metricCollector.go.
Collectors
- memstatMetric.go: Reads /proc/meminfo to calculate the node metric mem_used
- loadavgMetric.go: Reads /proc/loadavg and submits the node metrics:
  - load_one
  - load_five
  - load_fifteen
- netstatMetric.go: Reads /proc/net/dev and submits for all network devices (except loopback lo) the node metrics:
  - <dev>_bytes_in
  - <dev>_bytes_out
  - <dev>_pkts_in
  - <dev>_pkts_out
- lustreMetric.go: Reads Lustre's stats file /proc/fs/lustre/llite/lnec-XXXXXX/stats and submits the node metrics:
  - read_bytes
  - read_requests
  - write_bytes
  - write_requests
  - open
  - close
  - getattr
  - setattr
  - statfs
  - inode_permission
- infinibandMetric.go: Reads InfiniBand LID from /sys/class/infiniband/mlx4_0/ports/1/lid and uses the perfquery command to read the node metrics:
  - ib_recv
  - ib_xmit
- likwidMetric.go: Reads hardware performance events using LIKWID. It submits socket and cpu metrics:
  - mem_bw (socket)
  - power (socket, Sum of RAPL domains PKG and DRAM)
  - flops_dp (cpu)
  - flops_sp (cpu)
  - flops_any (cpu, 2*flops_dp + flops_sp)
  - cpi (cpu)
  - clock (cpu)
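As an illustration of what such a collector does with its input, the following is a minimal sketch of turning the content of /proc/loadavg into the three load metrics. It is not the code of loadavgMetric.go: the function name parseLoadavg and the float64 value type are assumptions made for this example, and the input is passed as a string so the sketch runs without the file being present.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLoadavg converts the content of /proc/loadavg into the node metrics
// load_one, load_five and load_fifteen. The function name and the float64
// value type are assumptions of this sketch.
func parseLoadavg(content string) (map[string]float64, error) {
	// The first three whitespace-separated fields are the 1, 5 and 15 minute averages.
	fields := strings.Fields(content)
	if len(fields) < 3 {
		return nil, fmt.Errorf("unexpected loadavg format: %q", content)
	}
	names := []string{"load_one", "load_five", "load_fifteen"}
	metrics := make(map[string]float64, len(names))
	for i, name := range names {
		value, err := strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %s: %w", name, err)
		}
		metrics[name] = value
	}
	return metrics, nil
}

func main() {
	// Sample line in the /proc/loadavg layout; a real collector would read the file.
	metrics, err := parseLoadavg("0.52 0.41 0.30 1/512 12345\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(metrics["load_one"], metrics["load_five"], metrics["load_fifteen"])
}
```

The other /proc based collectors follow the same read, split and convert pattern, only with different field layouts.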
LIKWID collector
Only the likwidMetric.go collector requires preparation steps. For this, the Makefile can be used. The LIKWID build needs to be configured:
- Version of LIKWID in LIKWID_VERSION
- Target user for LIKWID's accessdaemon in DAEMON_USER. The user has to have enough permissions to read the msr and pci device files
- Target group for LIKWID's accessdaemon in DAEMON_GROUP
The Makefile performs the following steps:
- Download the LIKWID tarball
- Unpack it
- Adjust the configuration for the LIKWID build
- Build it
- Copy all required files into collectors/likwid
- Install the accessdaemon with the suid bit set using sudo, also into collectors/likwid
Custom metrics for LIKWID
The likwidMetric.go collector uses its own performance group tree, which is copied from the LIKWID sources. By adding groups to this directory tree, you can use them in the collector. Additionally, you have to tell the collector which group to measure and which event count or derived metric should be used.
The collector contains a hash map with the groups and metrics (reduced set of metrics):
var likwid_metrics = map[string][]LikwidMetric{
	"MEM_DP":   {LikwidMetric{name: "mem_bw", search: "Memory bandwidth [MBytes/s]", socket_scope: true}},
	"FLOPS_SP": {LikwidMetric{name: "clock", search: "Clock [MHz]", socket_scope: false}},
}
The collector will measure both groups, MEM_DP and FLOPS_SP, for duration seconds (global config.json). It matches the LIKWID metric name against the search string and submits the value with the given name as field name in either the socket or the cpu metric, depending on the socket_scope flag.
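The matching step can be sketched as follows. This is an illustration under assumptions, not the collector's actual code: the helper submit, its parameters, the exact string comparison and the two-decimal string formatting are all hypothetical. Only the LikwidMetric fields and the map entries are taken from the snippet above.

```go
package main

import (
	"fmt"
	"strconv"
)

// LikwidMetric mirrors the fields used in the map shown above.
type LikwidMetric struct {
	name         string
	search       string
	socket_scope bool
}

var likwid_metrics = map[string][]LikwidMetric{
	"MEM_DP":   {LikwidMetric{name: "mem_bw", search: "Memory bandwidth [MBytes/s]", socket_scope: true}},
	"FLOPS_SP": {LikwidMetric{name: "clock", search: "Clock [MHz]", socket_scope: false}},
}

// submit is a hypothetical helper. It looks for a configured metric whose
// search string equals the name reported by LIKWID (exact comparison is an
// assumption) and stores the value under the metric's own name, either in the
// socket store or in the cpu store depending on socket_scope.
// It reports whether a configured metric matched.
func submit(metrics []LikwidMetric, likwidName string, id int, value float64,
	sockets, cpus map[int]map[string]string) bool {
	for _, m := range metrics {
		if m.search != likwidName {
			continue
		}
		target := cpus
		if m.socket_scope {
			target = sockets
		}
		if target[id] == nil {
			target[id] = make(map[string]string)
		}
		// The stores are string-valued, so the number is formatted as text.
		target[id][m.name] = strconv.FormatFloat(value, 'f', 2, 64)
		return true
	}
	return false
}

func main() {
	sockets := make(map[int]map[string]string)
	cpus := make(map[int]map[string]string)

	// Made-up measurement values for socket 0 and hardware thread 3.
	submit(likwid_metrics["MEM_DP"], "Memory bandwidth [MBytes/s]", 0, 1234.5, sockets, cpus)
	submit(likwid_metrics["FLOPS_SP"], "Clock [MHz]", 3, 2400, sockets, cpus)

	fmt.Println(sockets[0]["mem_bw"], cpus[3]["clock"])
}
```

Adding a custom metric then amounts to adding another LikwidMetric entry whose search string matches the metric name in the group file.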
Todos
- Aggregate a per-hwthread metric to a socket metric if socket_scope=true
- Add a JSON configuration file likwid.json and a suitable reader for the metrics and group tree path.
Contributing own collectors
A collector reads data from any source, parses it into metrics and submits these metrics to the metric-collector. A collector provides three functions:
- Init() error: Initializes the collector and its data structures.
- Read(duration time.Duration) error: Reads, parses and submits data. If the collector has to measure anything for some duration, use the provided function argument duration.
- Close(): Closes down the collector.
It is recommended to call setup() in the Init() function as it creates the required data structures.
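A minimal skeleton of a collector with these three functions might look like the sketch below. The interface name Collector, the type demoCollector, the metric demo_metric and the local setup() stand-in are hypothetical and exist only to make the example self-contained. In the real package, setup() is provided by the base implementation in metricCollector.go.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Collector lists the three functions described above.
// The interface name is an assumption of this sketch.
type Collector interface {
	Init() error
	Read(duration time.Duration) error
	Close()
}

// demoCollector is a hypothetical collector that submits one constant node metric.
type demoCollector struct {
	node        map[string]string
	initialized bool
}

// setup is a local stand-in for the setup() function mentioned above.
func (c *demoCollector) setup() {
	c.node = make(map[string]string)
}

// Init creates the data structures by calling setup().
func (c *demoCollector) Init() error {
	c.setup()
	c.initialized = true
	return nil
}

// Read would normally measure for the given duration, parse the data and
// submit it. Here it only stores a fixed value.
func (c *demoCollector) Read(duration time.Duration) error {
	if !c.initialized {
		return errors.New("Read called before Init")
	}
	c.node["demo_metric"] = "1.0"
	return nil
}

// Close marks the collector as shut down.
func (c *demoCollector) Close() {
	c.initialized = false
}

// Compile-time check that demoCollector provides all three functions.
var _ Collector = (*demoCollector)(nil)

func main() {
	var c Collector = &demoCollector{}
	if err := c.Init(); err != nil {
		fmt.Println("init failed:", err)
		return
	}
	defer c.Close()
	if err := c.Read(10 * time.Millisecond); err != nil {
		fmt.Println("read failed:", err)
	}
}
```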
Each collector contains data structures for the submission of metrics after calling setup() in Init():
- node (map[string]string): A key-value store for all metrics concerning the whole system
- sockets (map[int]map[string]string): One key-value store per CPU socket, like sockets[1]["testmetric"] = 1.0 for the second socket. You can either use len(sockets) to get the number of sockets or use SocketList().
- cpus (map[int]map[string]string): One key-value store per hardware thread, like cpus[12]["testmetric"] = 1.0. You can either use len(cpus) to get the number of hardware threads or use CpuList().
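To make the layout of the three stores concrete, here is a small sketch that allocates and fills them. The type store and the helpers newStore and sortedIDs are hypothetical. The socket and hardware thread IDs are passed in by hand here, whereas a real collector would obtain them from the package (for example through SocketList() and CpuList(), whose implementations are not shown in this document). Values are written as strings, assuming the string-valued map types listed above.

```go
package main

import (
	"fmt"
	"sort"
)

// store is a hypothetical stand-in for the data structures a collector holds
// after setup() has been called.
type store struct {
	node    map[string]string
	sockets map[int]map[string]string
	cpus    map[int]map[string]string
}

// newStore allocates one key-value store for the node, one per socket ID and
// one per hardware thread ID.
func newStore(socketIDs, cpuIDs []int) *store {
	s := &store{
		node:    make(map[string]string),
		sockets: make(map[int]map[string]string, len(socketIDs)),
		cpus:    make(map[int]map[string]string, len(cpuIDs)),
	}
	for _, id := range socketIDs {
		s.sockets[id] = make(map[string]string)
	}
	for _, id := range cpuIDs {
		s.cpus[id] = make(map[string]string)
	}
	return s
}

// sortedIDs returns the keys of a per-socket or per-cpu store in ascending
// order, since Go map iteration order is not defined.
func sortedIDs(m map[int]map[string]string) []int {
	ids := make([]int, 0, len(m))
	for id := range m {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}

func main() {
	// Made-up topology: two sockets and four hardware threads.
	s := newStore([]int{0, 1}, []int{0, 1, 2, 3})

	s.node["mem_used"] = "1024"        // metric for the whole system
	s.sockets[1]["testmetric"] = "1.0" // metric for the second socket
	s.cpus[3]["testmetric"] = "1.0"    // metric for hardware thread 3

	fmt.Println("sockets:", len(s.sockets), "hardware threads:", len(s.cpus))
	for _, id := range sortedIDs(s.sockets) {
		fmt.Println("socket", id, s.sockets[id])
	}
}
```

Allocating the inner maps up front matters: writing to sockets[1]["testmetric"] panics in Go if sockets[1] is still a nil map, which is one reason to let setup() create the structures in Init().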