Modularize the whole thing (#16)

* Use channels, add a metric router, split up configuration and use extended version of Influx line protocol internally

* Use central timer for collectors and router. Add expressions to router

* Add expression to router config

* Update entry points

* Start with README

* Update README for CCMetric

* Formatting

* Update README.md

* Add README for MultiChanTicker

* Add README for MultiChanTicker

* Update README.md

* Add README to metric router

* Update main README

* Remove SinkEntity type

* Update README for sinks

* Update go files

* Update README for receivers

* Update collectors README

* Update collectors README

* Use separate page per collector

* Fix for tempstat page

* Add docs for customcmd collector

* Add docs for ipmistat collector

* Add docs for topprocs collector

* Update customCmdMetric.md

* Use seconds when calculating LIKWID metrics

* Add IB metrics ib_recv_pkts and ib_xmit_pkts

* Drop domain part of host name

* Updated to latest stable version of likwid

* Define source code dependencies in Makefile

* Add GPFS / IBM Spectrum Scale collector

* Add vet and staticcheck make targets

* Add vet and staticcheck make targets

* Avoid go vet warning:
struct field tag `json:"..., omitempty"` not compatible with reflect.StructTag.Get: suspicious space in struct tag value
struct field tag `json:"...", omitempty` not compatible with reflect.StructTag.Get: key:"value" pairs not separated by spaces

* Add sample collector to README.md

* Add CPU frequency collector

* Avoid staticcheck warning: redundant return statement

* Avoid staticcheck warning: unnecessary assignment to the blank identifier

* Simplified code

* Add CPUFreqCollectorCpuinfo
a metric collector to measure the current frequency of the CPUs
as obtained from /proc/cpuinfo
Only measure on the first hyperthread

* Add collector for NFS clients

* Move publication of metrics into Flush() for NatsSink

* Update GitHub actions

* Refactoring

* Avoid vet warning: Println arg list ends with redundant newline

* Avoid vet warning struct field commands has json tag but is not exported

* Avoid vet warning: return copies lock value.

* Corrected typo

* Refactoring

* Add go sources in internal/...

* Bad separator in Makefile

* Fix Infiniband collector

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
This commit is contained in:
Thomas Gruber
2022-01-25 15:37:43 +01:00
committed by GitHub
parent 222862af32
commit 200af84c54
60 changed files with 2596 additions and 1105 deletions

@@ -1,288 +1,34 @@
# CCMetric collectors
This folder contains the collectors for the cc-metric-collector.
# `metricCollector.go`
The base class/configuration is located in `metricCollector.go`.
# Collectors
* `memstatMetric.go`: Reads `/proc/meminfo` to calculate **node** metrics. It also combines values to the metric `mem_used`
* `loadavgMetric.go`: Reads `/proc/loadavg` and submits **node** metrics.
* `netstatMetric.go`: Reads `/proc/net/dev` and submits **node** metrics for all network devices.
* `lustreMetric.go`: Reads Lustre's stats files and submits **node** metrics.
* `infinibandMetric.go`: Reads InfiniBand metrics. It uses the `perfquery` command to read the **node** metrics but can fall back to sysfs counters in case `perfquery` does not work.
* `likwidMetric.go`: Reads hardware performance events using LIKWID. It submits **socket** and **cpu** metrics
* `cpustatMetric.go`: Read CPU specific values from `/proc/stat`
* `topprocsMetric.go`: Reads the TopX processes by their CPU usage. X is configurable
* `nvidiaMetric.go`: Read data about Nvidia GPUs using the NVML library
* `tempMetric.go`: Read temperature data from `/sys/class/hwmon/hwmon*`
* `ipmiMetric.go`: Collect data from `ipmitool` or as fallback `ipmi-sensors`
* `customCmdMetric.go`: Run commands or read files and submit the output (output has to be in InfluxDB line protocol!)
If any of the collectors cannot be initialized, it is excluded from all further reads. For example, if the Lustre stat file is not a valid path, no Lustre-specific metrics are recorded.
# Collector configuration
# Configuration
```json
"collectors": [
"tempstat"
],
"collect_config": {
"tempstat": {
"tag_override": {
"hwmon0" : {
"type" : "socket",
"type-id" : "0"
},
"hwmon1" : {
"type" : "socket",
"type-id" : "1"
}
}
{
"collector_type" : {
<collector specific configuration>
}
}
}
```
The configuration of the collectors in the main config file consists of two parts: active collectors (`collectors`) and collector configuration (`collect_config`). At startup, all collectors in the `collectors` list are initialized and, if successfully initialized, added to the active collectors for metric retrieval. At initialization, the collector-specific configuration from the `collect_config` section is handed over. Each collector has its own configuration options; see the collector-specific section.
In contrast to the configuration files for sinks and receivers, the collectors configuration is not a list but a set of dicts. This is required because we didn't manage to partially read the type before loading the remaining configuration. We are eager to change this to the same format.
## `memstat`
# Available collectors
```json
"memstat": {
"exclude_metrics": [
"mem_used"
]
}
```
The `memstat` collector reads data from `/proc/meminfo` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics:
* `mem_total`
* `mem_sreclaimable`
* `mem_slab`
* `mem_free`
* `mem_buffers`
* `mem_cached`
* `mem_available`
* `mem_shared`
* `swap_total`
* `swap_free`
* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)
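The `mem_used` formula can be written out as a small Go sketch. The function name `memUsed` and the sample values are assumptions for illustration only.

```go
package main

import "fmt"

// memUsed follows the formula from the list above:
// mem_used = mem_total - (mem_free + mem_buffers + mem_cached)
func memUsed(total, free, buffers, cached int64) int64 {
	return total - (free + buffers + cached)
}

func main() {
	// Values in kB, made up for illustration.
	fmt.Println(memUsed(16000000, 4000000, 500000, 3500000))
}
```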
## `loadavg`
```json
"loadavg": {
"exclude_metrics": [
"proc_run"
]
}
```
The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics:
* `load_one`
* `load_five`
* `load_fifteen`
* `proc_run`
* `proc_total`
## `netstat`
```json
"netstat": {
"exclude_devices": [
"lo"
]
}
```
The `netstat` collector reads data from `/proc/net/dev` and outputs a handful of **node** metrics. If a device is not required, it can be excluded from forwarding it to the sink. Commonly the `lo` device should be excluded.
Metrics:
* `bytes_in`
* `bytes_out`
* `pkts_in`
* `pkts_out`
The device name is added as tag `device`.
## `diskstat`
```json
"diskstat": {
"exclude_metrics": [
"read_ms"
]
}
```
The `diskstat` collector reads data from `/proc/diskstats` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics:
* `reads`
* `reads_merged`
* `read_sectors`
* `read_ms`
* `writes`
* `writes_merged`
* `writes_sectors`
* `writes_ms`
* `ioops`
* `ioops_ms`
* `ioops_weighted_ms`
* `discards`
* `discards_merged`
* `discards_sectors`
* `discards_ms`
* `flushes`
* `flushes_ms`
The device name is added as tag `device`.
## `cpustat`
```json
"cpustat": {
"exclude_metrics": [
"cpu_idle"
]
}
```
The `cpustat` collector reads data from `/proc/stat` and outputs a handful of **node** and **hwthread** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics:
* `cpu_user`
* `cpu_nice`
* `cpu_system`
* `cpu_idle`
* `cpu_iowait`
* `cpu_irq`
* `cpu_softirq`
* `cpu_steal`
* `cpu_guest`
* `cpu_guest_nice`
## `likwid`
```json
"likwid": {
"eventsets": [
{
"events": {
"FIXC1": "ACTUAL_CPU_CLOCK",
"FIXC2": "MAX_CPU_CLOCK",
"PMC0": "RETIRED_INSTRUCTIONS",
"PMC1": "CPU_CLOCKS_UNHALTED",
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
"PMC3": "MERGE",
"DFC0": "DRAM_CHANNEL_0",
"DFC1": "DRAM_CHANNEL_1",
"DFC2": "DRAM_CHANNEL_2",
"DFC3": "DRAM_CHANNEL_3"
},
"metrics": [
{
"name": "ipc",
"calc": "PMC0/PMC1",
"socket_scope": false,
"publish": true
},
{
"name": "flops_any",
"calc": "0.000001*PMC2/time",
"socket_scope": false,
"publish": true
},
{
"name": "clock_mhz",
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
"socket_scope": false,
"publish": true
},
{
"name": "mem1",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"socket_scope": true,
"publish": false
}
]
},
{
"events": {
"DFC0": "DRAM_CHANNEL_4",
"DFC1": "DRAM_CHANNEL_5",
"DFC2": "DRAM_CHANNEL_6",
"DFC3": "DRAM_CHANNEL_7",
"PWR0": "RAPL_CORE_ENERGY",
"PWR1": "RAPL_PKG_ENERGY"
},
"metrics": [
{
"name": "pwr_core",
"calc": "PWR0/time",
"socket_scope": false,
"publish": true
},
{
"name": "pwr_pkg",
"calc": "PWR1/time",
"socket_scope": true,
"publish": true
},
{
"name": "mem2",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"socket_scope": true,
"publish": false
}
]
}
],
"globalmetrics": [
{
"name": "mem_bw",
"calc": "mem1+mem2",
"socket_scope": true,
"publish": true
}
]
}
```
_Example config suitable for AMD Zen3_
The `likwid` collector reads hardware performance counters at a **hwthread** and **socket** level. The configuration looks complicated, but it is basically copied from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). Earlier iterations of the collector tried to use the performance groups directly, but that approach lacked flexibility. The current configuration scheme provides the most flexibility.
The logic is as follows: There are multiple eventsets, each consisting of a list of counters+events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
```
EVENTSET -> "events": {
FIXC1 ACTUAL_CPU_CLOCK -> "FIXC1": "ACTUAL_CPU_CLOCK",
FIXC2 MAX_CPU_CLOCK -> "FIXC2": "MAX_CPU_CLOCK",
PMC0 RETIRED_INSTRUCTIONS -> "PMC0" : "RETIRED_INSTRUCTIONS",
PMC1 CPU_CLOCKS_UNHALTED -> "PMC1" : "CPU_CLOCKS_UNHALTED",
PMC2 RETIRED_SSE_AVX_FLOPS_ALL -> "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
PMC3 MERGE -> "PMC3": "MERGE",
-> }
```
The metrics follow the same procedure:
```
METRICS -> "metrics": [
IPC PMC0/PMC1 -> {
-> "name" : "IPC",
-> "calc" : "PMC0/PMC1",
-> "socket_scope": false,
-> "publish": true
-> }
-> ]
```
The `socket_scope` option tells whether it is submitted per socket or per hwthread. If a metric is only used for internal calculations, you can set `publish = false`.
Since some metrics can only be gathered in multiple measurements (like the memory bandwidth on AMD Zen3 chips), configure multiple eventsets like in the example config and use the `globalmetrics` section to combine them. **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.
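As a rough sketch, the configuration layout described above maps onto Go types like the following. The type and function names (`metricConfig`, `eventsetConfig`, `likwidConfig`, `parseLikwidConfig`) are assumptions for this example; only the JSON keys are taken from the example config.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types that mirror the JSON keys of the example configuration.
type metricConfig struct {
	Name        string `json:"name"`
	Calc        string `json:"calc"`
	SocketScope bool   `json:"socket_scope"`
	Publish     bool   `json:"publish"`
}

type eventsetConfig struct {
	Events  map[string]string `json:"events"`
	Metrics []metricConfig    `json:"metrics"`
}

type likwidConfig struct {
	Eventsets     []eventsetConfig `json:"eventsets"`
	GlobalMetrics []metricConfig   `json:"globalmetrics"`
}

// parseLikwidConfig decodes a likwid section into the types above.
func parseLikwidConfig(data []byte) (likwidConfig, error) {
	var cfg likwidConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	// Shortened version of the example config.
	input := []byte(`{
  "eventsets": [
    {
      "events": {"PMC0": "RETIRED_INSTRUCTIONS", "PMC1": "CPU_CLOCKS_UNHALTED"},
      "metrics": [
        {"name": "ipc", "calc": "PMC0/PMC1", "socket_scope": false, "publish": true}
      ]
    }
  ],
  "globalmetrics": []
}`)
	cfg, err := parseLikwidConfig(input)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, es := range cfg.Eventsets {
		for _, m := range es.Metrics {
			fmt.Printf("%s = %s (publish=%t)\n", m.Name, m.Calc, m.Publish)
		}
	}
}
```

Evaluating the `calc` expressions is left out on purpose; the sketch only shows how eventsets, metrics and global metrics nest.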
* [`cpustat`](./cpustatMetric.md)
* [`memstat`](./memstatMetric.md)
* [`diskstat`](./diskstatMetric.md)
* [`loadavg`](./loadavgMetric.md)
* [`netstat`](./netstatMetric.md)
* [`ibstat`](./infinibandMetric.md)
* [`tempstat`](./tempMetric.md)
* [`lustre`](./lustreMetric.md)
* [`likwid`](./likwidMetric.md)
* [`nvidia`](./nvidiaMetric.md)
* [`customcmd`](./customCmdMetric.md)
* [`ipmistat`](./ipmiMetric.md)
* [`topprocs`](./topprocsMetric.md)
## Todos
@@ -292,13 +38,15 @@ Since some metrics can only be gathered in multiple measurements (like the memor
# Contributing own collectors
A collector reads data from any source, parses it to metrics and submits these metrics to the `metric-collector`. A collector provides the following functions:
* `Init(config []byte) error`: Initializes the collector using the given collector-specific config in JSON.
* `Read(duration time.Duration, out *[]lp.MutableMetric) error`: Read, parse and submit data to the `out` list. If the collector has to measure anything for some duration, use the provided function argument `duration`.
* `Name() string`: Return the name of the collector
* `Init(config json.RawMessage) error`: Initializes the collector using the given collector-specific config in JSON. Check if needed files/commands exist, ...
* `Initialized() bool`: Check if a collector is successfully initialized
* `Read(duration time.Duration, output chan ccMetric.CCMetric)`: Read, parse and submit data to the `output` channel as [`CCMetric`](../internal/ccMetric/README.md). If the collector has to measure anything for some duration, use the provided function argument `duration`.
* `Close()`: Closes down the collector.
It is recommended to call `setup()` in the `Init()` function.
Finally, the collector needs to be registered in the `metric-collector.go`. There is a list of collectors called `Collectors` which is a map (string -> pointer to collector). Add a new entry with a descriptive name and the new collector.
Finally, the collector needs to be registered in the `collectorManager.go`. There is a list of collectors called `AvailableCollectors` which is a map (`collector_type_string` -> `pointer to MetricCollector interface`). Add a new entry with a descriptive name and the new collector.
## Sample collector
@@ -307,8 +55,9 @@ package collectors
import (
"encoding/json"
lp "github.com/influxdata/line-protocol"
"time"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
// Struct for the collector-specific JSON config
@@ -317,11 +66,11 @@ type SampleCollectorConfig struct {
}
type SampleCollector struct {
MetricCollector
metricCollector
config SampleCollectorConfig
}
func (m *SampleCollector) Init(config []byte) error {
func (m *SampleCollector) Init(config json.RawMessage) error {
m.name = "SampleCollector"
m.setup()
if len(config) > 0 {
@@ -330,11 +79,13 @@ func (m *SampleCollector) Init(config []byte) error {
return err
}
}
m.meta = map[string]string{"source": m.name, "group": "Sample"}
m.init = true
return nil
}
func (m *SampleCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *SampleCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -342,9 +93,9 @@ func (m *SampleCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
tags := map[string]string{"type" : "node"}
// Each metric has exactly one field: value !
value := map[string]interface{}{"value": int(x)}
y, err := lp.New("sample_metric", tags, value, time.Now())
y, err := lp.New("sample_metric", tags, m.meta, value, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}

@@ -0,0 +1,143 @@
package collectors
import (
"encoding/json"
"log"
"os"
"sync"
"time"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
mct "github.com/ClusterCockpit/cc-metric-collector/internal/multiChanTicker"
)
var AvailableCollectors = map[string]MetricCollector{
"likwid": &LikwidCollector{},
"loadavg": &LoadavgCollector{},
"memstat": &MemstatCollector{},
"netstat": &NetstatCollector{},
"ibstat": &InfinibandCollector{},
"lustrestat": &LustreCollector{},
"cpustat": &CpustatCollector{},
"topprocs": &TopProcsCollector{},
"nvidia": &NvidiaCollector{},
"customcmd": &CustomCmdCollector{},
"diskstat": &DiskstatCollector{},
"tempstat": &TempCollector{},
"ipmistat": &IpmiCollector{},
"gpfs": new(GpfsCollector),
"cpufreq": new(CPUFreqCollector),
"cpufreq_cpuinfo": new(CPUFreqCpuInfoCollector),
"nfsstat": new(NfsCollector),
}
type collectorManager struct {
collectors []MetricCollector
output chan lp.CCMetric
done chan bool
ticker mct.MultiChanTicker
duration time.Duration
wg *sync.WaitGroup
config map[string]json.RawMessage
}
type CollectorManager interface {
Init(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) error
AddOutput(output chan lp.CCMetric)
Start()
Close()
}
func (cm *collectorManager) Init(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) error {
cm.collectors = make([]MetricCollector, 0)
cm.output = nil
cm.done = make(chan bool)
cm.wg = wg
cm.ticker = ticker
cm.duration = duration
configFile, err := os.Open(collectConfigFile)
if err != nil {
log.Print(err.Error())
return err
}
defer configFile.Close()
jsonParser := json.NewDecoder(configFile)
err = jsonParser.Decode(&cm.config)
if err != nil {
log.Print(err.Error())
return err
}
for k, cfg := range cm.config {
log.Print(k, " ", cfg)
if _, found := AvailableCollectors[k]; !found {
log.Print("[CollectorManager] SKIP unknown collector ", k)
continue
}
c := AvailableCollectors[k]
err = c.Init(cfg)
if err != nil {
log.Print("[CollectorManager] Collector ", k, " initialization failed: ", err.Error())
continue
}
cm.collectors = append(cm.collectors, c)
}
return nil
}
func (cm *collectorManager) Start() {
cm.wg.Add(1)
tick := make(chan time.Time)
cm.ticker.AddChannel(tick)
go func() {
for {
// Wait for the next tick or the done signal; return ends the goroutine
select {
case <-cm.done:
for _, c := range cm.collectors {
c.Close()
}
cm.wg.Done()
log.Print("[CollectorManager] DONE\n")
return
case t := <-tick:
for _, c := range cm.collectors {
// Stop reading as soon as the done signal arrives
select {
case <-cm.done:
for _, c := range cm.collectors {
c.Close()
}
cm.wg.Done()
log.Print("[CollectorManager] DONE\n")
return
default:
log.Print("[CollectorManager] ", c.Name(), " ", t)
c.Read(cm.duration, cm.output)
}
}
}
}
log.Print("[CollectorManager] EXIT\n")
}()
log.Print("[CollectorManager] STARTED\n")
}
func (cm *collectorManager) AddOutput(output chan lp.CCMetric) {
cm.output = output
}
func (cm *collectorManager) Close() {
cm.done <- true
log.Print("[CollectorManager] CLOSE")
}
func New(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) (CollectorManager, error) {
cm := &collectorManager{}
err := cm.Init(ticker, duration, wg, collectConfigFile)
if err != nil {
return nil, err
}
return cm, err
}

@@ -2,14 +2,16 @@ package collectors
import (
"bufio"
"encoding/json"
"fmt"
"log"
"os"
"strconv"
"strings"
"time"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
lp "github.com/influxdata/line-protocol"
)
//
@@ -33,12 +35,16 @@ type CPUFreqCpuInfoCollectorTopology struct {
}
type CPUFreqCpuInfoCollector struct {
MetricCollector
metricCollector
topology []CPUFreqCpuInfoCollectorTopology
}
func (m *CPUFreqCpuInfoCollector) Init(config []byte) error {
func (m *CPUFreqCpuInfoCollector) Init(config json.RawMessage) error {
m.name = "CPUFreqCpuInfoCollector"
m.meta = map[string]string{
"source": m.name,
"group": "cpufreq",
}
const cpuInfoFile = "/proc/cpuinfo"
file, err := os.Open(cpuInfoFile)
@@ -145,7 +151,8 @@ func (m *CPUFreqCpuInfoCollector) Init(config []byte) error {
return nil
}
func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -174,9 +181,9 @@ func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, out *[]lp.Mutable
log.Printf("Failed to convert cpu MHz to float: %v", err)
return
}
y, err := lp.New("cpufreq", t.tagSet, map[string]interface{}{"value": value}, now)
y, err := lp.New("cpufreq", t.tagSet, m.meta, map[string]interface{}{"value": value}, now)
if err == nil {
*out = append(*out, y)
output <- y
}
}
processorCounter++

@@ -10,8 +10,7 @@ import (
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
"golang.org/x/sys/unix"
)
@@ -56,14 +55,14 @@ type CPUFreqCollectorTopology struct {
// See: https://www.kernel.org/doc/html/latest/admin-guide/pm/cpufreq.html
//
type CPUFreqCollector struct {
MetricCollector
metricCollector
topology []CPUFreqCollectorTopology
config struct {
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
}
}
func (m *CPUFreqCollector) Init(config []byte) error {
func (m *CPUFreqCollector) Init(config json.RawMessage) error {
m.name = "CPUFreqCollector"
m.setup()
if len(config) > 0 {
@@ -72,6 +71,10 @@ func (m *CPUFreqCollector) Init(config []byte) error {
return err
}
}
m.meta = map[string]string{
"source": m.name,
"group": "CPU Frequency",
}
// Loop for all CPU directories
baseDir := "/sys/devices/system/cpu"
@@ -179,7 +182,7 @@ func (m *CPUFreqCollector) Init(config []byte) error {
return nil
}
func (m *CPUFreqCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *CPUFreqCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -205,9 +208,9 @@ func (m *CPUFreqCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
continue
}
y, err := lp.New("cpufreq", t.tagSet, map[string]interface{}{"value": cpuFreq}, now)
y, err := lp.New("cpufreq", t.tagSet, m.meta, map[string]interface{}{"value": cpuFreq}, now)
if err == nil {
*out = append(*out, y)
output <- y
}
}
}

@@ -7,8 +7,7 @@ import (
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const CPUSTATFILE = `/proc/stat`
@@ -18,13 +17,14 @@ type CpustatCollectorConfig struct {
}
type CpustatCollector struct {
MetricCollector
metricCollector
config CpustatCollectorConfig
}
func (m *CpustatCollector) Init(config []byte) error {
func (m *CpustatCollector) Init(config json.RawMessage) error {
m.name = "CpustatCollector"
m.setup()
m.meta = map[string]string{"source": m.name, "group": "CPU"}
if len(config) > 0 {
err := json.Unmarshal(config, &m.config)
if err != nil {
@@ -35,7 +35,7 @@ func (m *CpustatCollector) Init(config []byte) error {
return nil
}
func ParseStatLine(line string, cpu int, exclude []string, out *[]lp.MutableMetric) {
func (c *CpustatCollector) parseStatLine(line string, cpu int, exclude []string, output chan lp.CCMetric) {
ls := strings.Fields(line)
matches := []string{"", "cpu_user", "cpu_nice", "cpu_system", "cpu_idle", "cpu_iowait", "cpu_irq", "cpu_softirq", "cpu_steal", "cpu_guest", "cpu_guest_nice"}
for _, ex := range exclude {
@@ -52,16 +52,16 @@ func ParseStatLine(line string, cpu int, exclude []string, out *[]lp.MutableMetr
if len(m) > 0 {
x, err := strconv.ParseInt(ls[i], 0, 64)
if err == nil {
y, err := lp.New(m, tags, map[string]interface{}{"value": int(x)}, time.Now())
y, err := lp.New(m, tags, c.meta, map[string]interface{}{"value": int(x)}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
}
}
func (m *CpustatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *CpustatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -78,11 +78,11 @@ func (m *CpustatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
}
ls := strings.Fields(line)
if strings.Compare(ls[0], "cpu") == 0 {
ParseStatLine(line, -1, m.config.ExcludeMetrics, out)
m.parseStatLine(line, -1, m.config.ExcludeMetrics, output)
} else if strings.HasPrefix(ls[0], "cpu") {
cpustr := strings.TrimLeft(ls[0], "cpu")
cpu, _ := strconv.Atoi(cpustr)
ParseStatLine(line, cpu, m.config.ExcludeMetrics, out)
m.parseStatLine(line, cpu, m.config.ExcludeMetrics, output)
}
}
}

@@ -0,0 +1,23 @@
## `cpustat` collector
```json
"cpustat": {
"exclude_metrics": [
"cpu_idle"
]
}
```
The `cpustat` collector reads data from `/proc/stat` and outputs a handful of **node** and **hwthread** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics:
* `cpu_user`
* `cpu_nice`
* `cpu_system`
* `cpu_idle`
* `cpu_iowait`
* `cpu_irq`
* `cpu_softirq`
* `cpu_steal`
* `cpu_guest`
* `cpu_guest_nice`
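The metric names above correspond, in order, to the numeric columns of a CPU line in `/proc/stat`. A minimal sketch of that mapping, assuming a hypothetical helper `parseStatLine` and a made-up sample line:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Metric names in the order of the columns that follow the CPU label.
var statFields = []string{
	"cpu_user", "cpu_nice", "cpu_system", "cpu_idle", "cpu_iowait",
	"cpu_irq", "cpu_softirq", "cpu_steal", "cpu_guest", "cpu_guest_nice",
}

// parseStatLine is a hypothetical helper that maps one CPU line to
// metric values. It returns the CPU label and the parsed columns.
func parseStatLine(line string) (string, map[string]int64) {
	fields := strings.Fields(line)
	values := make(map[string]int64)
	if len(fields) == 0 {
		return "", values
	}
	for i, name := range statFields {
		if i+1 >= len(fields) {
			break // older kernels provide fewer columns
		}
		if v, err := strconv.ParseInt(fields[i+1], 10, 64); err == nil {
			values[name] = v
		}
	}
	return fields[0], values
}

func main() {
	// Sample line made up for illustration.
	label, values := parseStatLine("cpu0 100 2 30 4000 5 0 1 0 0 0")
	fmt.Println(label, values["cpu_user"], values["cpu_idle"])
}
```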

@@ -9,7 +9,8 @@ import (
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
influx "github.com/influxdata/line-protocol"
)
const CUSTOMCMDPATH = `/home/unrz139/Work/cc-metric-collector/collectors/custom`
@@ -21,17 +22,18 @@ type CustomCmdCollectorConfig struct {
}
type CustomCmdCollector struct {
MetricCollector
handler *lp.MetricHandler
parser *lp.Parser
metricCollector
handler *influx.MetricHandler
parser *influx.Parser
config CustomCmdCollectorConfig
commands []string
files []string
}
func (m *CustomCmdCollector) Init(config []byte) error {
func (m *CustomCmdCollector) Init(config json.RawMessage) error {
var err error
m.name = "CustomCmdCollector"
m.meta = map[string]string{"source": m.name, "group": "Custom"}
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
@@ -61,8 +63,8 @@ func (m *CustomCmdCollector) Init(config []byte) error {
if len(m.files) == 0 && len(m.commands) == 0 {
return errors.New("No metrics to collect")
}
m.handler = lp.NewMetricHandler()
m.parser = lp.NewParser(m.handler)
m.handler = influx.NewMetricHandler()
m.parser = influx.NewParser(m.handler)
m.parser.SetTimeFunc(DefaultTime)
m.init = true
return nil
@@ -72,7 +74,7 @@ var DefaultTime = func() time.Time {
return time.Unix(42, 0)
}
func (m *CustomCmdCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *CustomCmdCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -95,9 +97,9 @@ func (m *CustomCmdCollector) Read(interval time.Duration, out *[]lp.MutableMetri
if skip {
continue
}
y, err := lp.New(c.Name(), Tags2Map(c), Fields2Map(c), c.Time())
y, err := lp.New(c.Name(), Tags2Map(c), m.meta, Fields2Map(c), c.Time())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
@@ -117,9 +119,9 @@ func (m *CustomCmdCollector) Read(interval time.Duration, out *[]lp.MutableMetri
if skip {
continue
}
y, err := lp.New(f.Name(), Tags2Map(f), Fields2Map(f), f.Time())
y, err := lp.New(f.Name(), Tags2Map(f), m.meta, Fields2Map(f), f.Time())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}

@@ -0,0 +1,20 @@
## `customcmd` collector
```json
"customcmd": {
"exclude_metrics": [
"mymetric"
],
"files" : [
"/var/run/myapp.metrics"
],
"commands" : [
"/usr/local/bin/getmetrics.pl"
]
}
```
The `customcmd` collector reads data from files and the output of executed commands. The files and commands can output multiple metrics (separated by newline) but they have to be in the [InfluxDB line protocol](https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/). If a metric is not parsable, it is skipped. If a metric is not required, it can be excluded from forwarding it to the sink.

@@ -2,9 +2,7 @@ package collectors
import (
"io/ioutil"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
// "log"
"encoding/json"
"errors"
@@ -21,14 +19,15 @@ type DiskstatCollectorConfig struct {
}
type DiskstatCollector struct {
MetricCollector
metricCollector
matches map[int]string
config DiskstatCollectorConfig
}
func (m *DiskstatCollector) Init(config []byte) error {
func (m *DiskstatCollector) Init(config json.RawMessage) error {
var err error
m.name = "DiskstatCollector"
m.meta = map[string]string{"source": m.name, "group": "Disk"}
m.setup()
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
@@ -73,7 +72,7 @@ func (m *DiskstatCollector) Init(config []byte) error {
return err
}
func (m *DiskstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *DiskstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
var lines []string
if !m.init {
return
@@ -101,9 +100,9 @@ func (m *DiskstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric
if idx < len(f) {
x, err := strconv.ParseInt(f[idx], 0, 64)
if err == nil {
y, err := lp.New(name, tags, map[string]interface{}{"value": int(x)}, time.Now())
y, err := lp.New(name, tags, m.meta, map[string]interface{}{"value": int(x)}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}

@@ -0,0 +1,34 @@
## `diskstat` collector
```json
"diskstat": {
"exclude_metrics": [
"read_ms"
]
}
```
The `diskstat` collector reads data from `/proc/diskstats` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics:
* `reads`
* `reads_merged`
* `read_sectors`
* `read_ms`
* `writes`
* `writes_merged`
* `writes_sectors`
* `writes_ms`
* `ioops`
* `ioops_ms`
* `ioops_weighted_ms`
* `discards`
* `discards_merged`
* `discards_sectors`
* `discards_ms`
* `flushes`
* `flushes_ms`
The device name is added as tag `device`.

@@ -13,18 +13,20 @@ import (
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
type GpfsCollector struct {
MetricCollector
metricCollector
tags map[string]string
config struct {
Mmpmon string `json:"mmpmon"`
}
}
func (m *GpfsCollector) Init(config []byte) error {
func (m *GpfsCollector) Init(config json.RawMessage) error {
var err error
m.name = "GpfsCollector"
m.setup()
@@ -40,6 +42,14 @@ func (m *GpfsCollector) Init(config []byte) error {
return err
}
}
m.meta = map[string]string{
"source": m.name,
"group": "GPFS",
}
m.tags = map[string]string{
"type": "node",
"filesystem": "",
}
// GPFS / IBM Spectrum Scale file system statistics can only be queried by user root
user, err := user.Current()
@@ -60,7 +70,7 @@ func (m *GpfsCollector) Init(config []byte) error {
return nil
}
func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *GpfsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -108,6 +118,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
continue
}
m.tags["filesystem"] = filesystem
// return code
rc, err := strconv.Atoi(key_value["_rc_"])
if err != nil {
@@ -140,17 +153,10 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
key_value["_br_"], err.Error())
continue
}
y, err := lp.New(
"gpfs_bytes_read",
map[string]string{
"filesystem": filesystem,
},
map[string]interface{}{
"value": bytesRead,
},
timestamp)
y, err := lp.New("gpfs_bytes_read", m.tags, m.meta, map[string]interface{}{"value": bytesRead}, timestamp)
if err == nil {
*out = append(*out, y)
output <- y
}
// bytes written
@@ -161,17 +167,10 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
key_value["_bw_"], err.Error())
continue
}
y, err = lp.New(
"gpfs_bytes_written",
map[string]string{
"filesystem": filesystem,
},
map[string]interface{}{
"value": bytesWritten,
},
timestamp)
y, err = lp.New("gpfs_bytes_written", m.tags, m.meta, map[string]interface{}{"value": bytesWritten}, timestamp)
if err == nil {
*out = append(*out, y)
output <- y
}
// number of opens
@@ -182,17 +181,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
key_value["_oc_"], err.Error())
continue
}
y, err = lp.New(
"gpfs_num_opens",
map[string]string{
"filesystem": filesystem,
},
map[string]interface{}{
"value": numOpens,
},
timestamp)
y, err = lp.New("gpfs_num_opens", m.tags, m.meta, map[string]interface{}{"value": numOpens}, timestamp)
if err == nil {
*out = append(*out, y)
output <- y
}
// number of closes
@@ -201,17 +192,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of closes: %s\n", err.Error())
continue
}
y, err = lp.New(
"gpfs_num_closes",
map[string]string{
"filesystem": filesystem,
},
map[string]interface{}{
"value": numCloses,
},
timestamp)
y, err = lp.New("gpfs_num_closes", m.tags, m.meta, map[string]interface{}{"value": numCloses}, timestamp)
if err == nil {
*out = append(*out, y)
output <- y
}
// number of reads
@@ -220,17 +203,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of reads: %s\n", err.Error())
continue
}
y, err = lp.New(
"gpfs_num_reads",
map[string]string{
"filesystem": filesystem,
},
map[string]interface{}{
"value": numReads,
},
timestamp)
y, err = lp.New("gpfs_num_reads", m.tags, m.meta, map[string]interface{}{"value": numReads}, timestamp)
if err == nil {
*out = append(*out, y)
output <- y
}
// number of writes
@@ -239,17 +214,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of writes: %s\n", err.Error())
continue
}
y, err = lp.New(
"gpfs_num_writes",
map[string]string{
"filesystem": filesystem,
},
map[string]interface{}{
"value": numWrites,
},
timestamp)
y, err = lp.New("gpfs_num_writes", m.tags, m.meta, map[string]interface{}{"value": numWrites}, timestamp)
if err == nil {
*out = append(*out, y)
output <- y
}
// number of read directories
@@ -258,17 +225,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of read directories: %s\n", err.Error())
continue
}
y, err = lp.New(
"gpfs_num_readdirs",
map[string]string{
"filesystem": filesystem,
},
map[string]interface{}{
"value": numReaddirs,
},
timestamp)
y, err = lp.New("gpfs_num_readdirs", m.tags, m.meta, map[string]interface{}{"value": numReaddirs}, timestamp)
if err == nil {
*out = append(*out, y)
output <- y
}
// Number of inode updates
@@ -277,17 +236,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert Number of inode updates: %s\n", err.Error())
continue
}
y, err = lp.New(
"gpfs_num_inode_updates",
map[string]string{
"filesystem": filesystem,
},
map[string]interface{}{
"value": numInodeUpdates,
},
timestamp)
y, err = lp.New("gpfs_num_inode_updates", m.tags, m.meta, map[string]interface{}{"value": numInodeUpdates}, timestamp)
if err == nil {
*out = append(*out, y)
output <- y
}
}
}


@@ -5,9 +5,7 @@ import (
"io/ioutil"
"log"
"os/exec"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
// "os"
"encoding/json"
"errors"
@@ -28,7 +26,7 @@ type InfinibandCollectorConfig struct {
}
type InfinibandCollector struct {
MetricCollector
metricCollector
tags map[string]string
lids map[string]map[string]string
config InfinibandCollectorConfig
@@ -56,11 +54,12 @@ func (m *InfinibandCollector) Help() {
fmt.Println("- ib_xmit_pkts")
}
func (m *InfinibandCollector) Init(config []byte) error {
func (m *InfinibandCollector) Init(config json.RawMessage) error {
var err error
m.name = "InfinibandCollector"
m.use_perfquery = false
m.setup()
m.meta = map[string]string{"source": m.name, "group": "Network"}
m.tags = map[string]string{"type": "node"}
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
@@ -117,7 +116,7 @@ func (m *InfinibandCollector) Init(config []byte) error {
return err
}
func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[string]string, out *[]lp.MutableMetric) error {
func (m *InfinibandCollector) doPerfQuery(cmd string, dev string, lid string, port string, tags map[string]string, output chan lp.CCMetric) error {
args := fmt.Sprintf("-r %s %s 0xf000", lid, port)
command := exec.Command(cmd, args)
@@ -134,9 +133,9 @@ func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[strin
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_recv", tags, map[string]interface{}{"value": float64(v)}, time.Now())
y, err := lp.New("ib_recv", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
@@ -144,9 +143,9 @@ func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[strin
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_xmit", tags, map[string]interface{}{"value": float64(v)}, time.Now())
y, err := lp.New("ib_xmit", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
@@ -154,9 +153,9 @@ func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[strin
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_recv_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
y, err := lp.New("ib_recv_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
@@ -164,9 +163,29 @@ func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[strin
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_xmit_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
y, err := lp.New("ib_xmit_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
if strings.HasPrefix(line, "PortRcvPkts") || strings.HasPrefix(line, "RcvPkts") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_recv_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
if strings.HasPrefix(line, "PortXmitPkts") || strings.HasPrefix(line, "XmtPkts") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_xmit_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
@@ -174,16 +193,16 @@ func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[strin
return nil
}
func DoSysfsRead(dev string, lid string, port string, tags map[string]string, out *[]lp.MutableMetric) error {
func (m *InfinibandCollector) doSysfsRead(dev string, lid string, port string, tags map[string]string, output chan lp.CCMetric) error {
path := fmt.Sprintf("%s/%s/ports/%s/counters/", string(IBBASEPATH), dev, port)
buffer, err := ioutil.ReadFile(fmt.Sprintf("%s/port_rcv_data", path))
if err == nil {
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_recv", tags, map[string]interface{}{"value": float64(v)}, time.Now())
y, err := lp.New("ib_recv", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
@@ -192,9 +211,9 @@ func DoSysfsRead(dev string, lid string, port string, tags map[string]string, ou
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_xmit", tags, map[string]interface{}{"value": float64(v)}, time.Now())
y, err := lp.New("ib_xmit", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
@@ -203,9 +222,9 @@ func DoSysfsRead(dev string, lid string, port string, tags map[string]string, ou
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_recv_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
y, err := lp.New("ib_recv_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
@@ -214,71 +233,29 @@ func DoSysfsRead(dev string, lid string, port string, tags map[string]string, ou
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_xmit_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
y, err := lp.New("ib_xmit_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
return nil
}
func (m *InfinibandCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *InfinibandCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if m.init {
for dev, ports := range m.lids {
for port, lid := range ports {
tags := map[string]string{"type": "node", "device": dev, "port": port}
if m.use_perfquery {
DoPerfQuery(m.config.PerfQueryPath, dev, lid, port, tags, out)
m.doPerfQuery(m.config.PerfQueryPath, dev, lid, port, tags, output)
} else {
DoSysfsRead(dev, lid, port, tags, out)
m.doSysfsRead(dev, lid, port, tags, output)
}
}
}
}
// buffer, err := ioutil.ReadFile(string(LIDFILE))
// if err != nil {
// log.Print(err)
// return
// }
// args := fmt.Sprintf("-r %s 1 0xf000", string(buffer))
// command := exec.Command(PERFQUERY, args)
// command.Wait()
// stdout, err := command.Output()
// if err != nil {
// log.Print(err)
// return
// }
// ll := strings.Split(string(stdout), "\n")
// for _, line := range ll {
// if strings.HasPrefix(line, "PortRcvData") || strings.HasPrefix(line, "RcvData") {
// lv := strings.Fields(line)
// v, err := strconv.ParseFloat(lv[1], 64)
// if err == nil {
// y, err := lp.New("ib_recv", m.tags, map[string]interface{}{"value": float64(v)}, time.Now())
// if err == nil {
// *out = append(*out, y)
// }
// }
// }
// if strings.HasPrefix(line, "PortXmitData") || strings.HasPrefix(line, "XmtData") {
// lv := strings.Fields(line)
// v, err := strconv.ParseFloat(lv[1], 64)
// if err == nil {
// y, err := lp.New("ib_xmit", m.tags, map[string]interface{}{"value": float64(v)}, time.Now())
// if err == nil {
// *out = append(*out, y)
// }
// }
// }
// }
}
func (m *InfinibandCollector) Close() {


@@ -0,0 +1,19 @@
## `ibstat` collector
```json
"ibstat": {
"perfquery_path" : "<path to perfquery command>",
"exclude_devices": [
"mlx4"
]
}
```
The `ibstat` collector reads data either through the `perfquery` command or from the sysfs files below `/sys/class/infiniband/<device>`.
Metrics:
* `ib_recv`
* `ib_xmit`
* `ib_recv_pkts`
* `ib_xmit_pkts`
The collector adds `device` and `port` tags to all metrics.
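The sysfs variant can be sketched as follows. This is a minimal illustration, not the collector's code: the helpers `counterPath` and `parseCounter` and the device name `mlx5_0` are hypothetical. Only the directory layout (`<base>/<device>/ports/<port>/counters/`) and the counter file name `port_rcv_data` are taken from `doSysfsRead`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// counterPath builds the sysfs location of one port counter.
// Hypothetical helper; the layout mirrors the path assembled in doSysfsRead.
func counterPath(base, dev, port, counter string) string {
	return fmt.Sprintf("%s/%s/ports/%s/counters/%s", base, dev, port, counter)
}

// parseCounter converts the raw file content (a number followed by a
// newline) into a float64, the value type the collector publishes.
func parseCounter(raw string) (float64, error) {
	return strconv.ParseFloat(strings.TrimSpace(raw), 64)
}

func main() {
	// "mlx5_0" is an assumed device name, used only for illustration.
	fmt.Println(counterPath("/sys/class/infiniband", "mlx5_0", "1", "port_rcv_data"))

	v, err := parseCounter("123456\n")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(v)
}
```

In a real collector the string passed to `parseCounter` would come from reading the file returned by `counterPath`.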


@@ -9,8 +9,7 @@ import (
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const IPMITOOL_PATH = `/usr/bin/ipmitool`
@@ -23,15 +22,16 @@ type IpmiCollectorConfig struct {
}
type IpmiCollector struct {
MetricCollector
metricCollector
tags map[string]string
matches map[string]string
config IpmiCollectorConfig
}
func (m *IpmiCollector) Init(config []byte) error {
func (m *IpmiCollector) Init(config json.RawMessage) error {
m.name = "IpmiCollector"
m.setup()
m.meta = map[string]string{"source": m.name, "group": "IPMI"}
if len(config) > 0 {
err := json.Unmarshal(config, &m.config)
if err != nil {
@@ -53,7 +53,7 @@ func (m *IpmiCollector) Init(config []byte) error {
return nil
}
func ReadIpmiTool(cmd string, out *[]lp.MutableMetric) {
func (m *IpmiCollector) readIpmiTool(cmd string, output chan lp.CCMetric) {
command := exec.Command(cmd, "sensor")
command.Wait()
stdout, err := command.Output()
@@ -74,24 +74,25 @@ func ReadIpmiTool(cmd string, out *[]lp.MutableMetric) {
name := strings.ToLower(strings.Replace(strings.Trim(lv[0], " "), " ", "_", -1))
unit := strings.Trim(lv[2], " ")
if unit == "Volts" {
unit = "V"
unit = "Volts"
} else if unit == "degrees C" {
unit = "C"
unit = "degC"
} else if unit == "degrees F" {
unit = "F"
unit = "degF"
} else if unit == "Watts" {
unit = "W"
unit = "Watts"
}
y, err := lp.New(name, map[string]string{"unit": unit, "type": "node"}, map[string]interface{}{"value": v}, time.Now())
y, err := lp.New(name, map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": v}, time.Now())
if err == nil {
*out = append(*out, y)
y.AddMeta("unit", unit)
output <- y
}
}
}
}
func ReadIpmiSensors(cmd string, out *[]lp.MutableMetric) {
func (m *IpmiCollector) readIpmiSensors(cmd string, output chan lp.CCMetric) {
command := exec.Command(cmd, "--comma-separated-output", "--sdr-cache-recreate")
command.Wait()
@@ -109,25 +110,28 @@ func ReadIpmiSensors(cmd string, out *[]lp.MutableMetric) {
v, err := strconv.ParseFloat(lv[3], 64)
if err == nil {
name := strings.ToLower(strings.Replace(lv[1], " ", "_", -1))
y, err := lp.New(name, map[string]string{"unit": lv[4], "type": "node"}, map[string]interface{}{"value": v}, time.Now())
y, err := lp.New(name, map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": v}, time.Now())
if err == nil {
*out = append(*out, y)
if len(lv) > 4 {
y.AddMeta("unit", lv[4])
}
output <- y
}
}
}
}
}
func (m *IpmiCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *IpmiCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if len(m.config.IpmitoolPath) > 0 {
_, err := os.Stat(m.config.IpmitoolPath)
if err == nil {
ReadIpmiTool(m.config.IpmitoolPath, out)
m.readIpmiTool(m.config.IpmitoolPath, output)
}
} else if len(m.config.IpmisensorsPath) > 0 {
_, err := os.Stat(m.config.IpmisensorsPath)
if err == nil {
ReadIpmiSensors(m.config.IpmisensorsPath, out)
m.readIpmiSensors(m.config.IpmisensorsPath, output)
}
}
}

16
collectors/ipmiMetric.md Normal file

@@ -0,0 +1,16 @@
## `ipmistat` collector
```json
"ipmistat": {
"ipmitool_path": "/path/to/ipmitool",
"ipmisensors_path": "/path/to/ipmi-sensors"
}
```
The `ipmistat` collector reads data from `ipmitool` (`ipmitool sensor`) or `ipmi-sensors` (`ipmi-sensors --sdr-cache-recreate --comma-separated-output`).
The available metrics depend on the output of the underlying tool, but typically include temperature, power and energy readings.
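As a rough sketch of how one line of `ipmitool sensor` output turns into a metric name, value and unit, consider the example below. It assumes the pipe-separated layout `name | value | unit | ...`; the sample line, the `sensorReading` type and the `parseSensorLine` helper are hypothetical and not part of the collector. The name normalization and the `degC`/`degF` unit mapping follow `readIpmiTool`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sensorReading is a hypothetical container for one parsed sensor line.
type sensorReading struct {
	Name  string
	Value float64
	Unit  string
}

// parseSensorLine parses one pipe-separated sensor line.
// Lines whose value is not numeric (for example "na") return an error.
func parseSensorLine(line string) (sensorReading, error) {
	lv := strings.Split(line, "|")
	if len(lv) < 3 {
		return sensorReading{}, fmt.Errorf("expected at least 3 fields, got %d", len(lv))
	}
	v, err := strconv.ParseFloat(strings.TrimSpace(lv[1]), 64)
	if err != nil {
		return sensorReading{}, err
	}
	// Lower-case the sensor name and replace spaces, as the collector does.
	name := strings.ToLower(strings.ReplaceAll(strings.TrimSpace(lv[0]), " ", "_"))
	unit := strings.TrimSpace(lv[2])
	switch unit {
	case "degrees C":
		unit = "degC"
	case "degrees F":
		unit = "degF"
	}
	return sensorReading{Name: name, Value: v, Unit: unit}, nil
}

func main() {
	// Illustrative input line, not captured from a real machine.
	r, err := parseSensorLine("CPU Temp | 45.000 | degrees C | ok")
	if err != nil {
		fmt.Println("skipping line:", err)
		return
	}
	fmt.Printf("%s=%v unit=%s\n", r.Name, r.Value, r.Unit)
}
```

Returning an error for non-numeric values lets the caller skip sensors that report no reading instead of publishing a bogus zero.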


@@ -20,16 +20,28 @@ import (
"strings"
"time"
"unsafe"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
"gopkg.in/Knetic/govaluate.v2"
)
type MetricScope int
const (
METRIC_SCOPE_HWTHREAD = iota
METRIC_SCOPE_SOCKET
METRIC_SCOPE_NUMA
METRIC_SCOPE_NODE
)
func (ms MetricScope) String() string {
return []string{"hwthread", "socket", "numa", "node"}[ms]
}
type LikwidCollectorMetricConfig struct {
Name string `json:"name"`
Calc string `json:"calc"`
Socket_scope bool `json:"socket_scope"`
Publish bool `json:"publish"`
Name string `json:"name"`
Calc string `json:"calc"`
Scope MetricScope `json:"socket_scope"`
Publish bool `json:"publish"`
}
type LikwidCollectorEventsetConfig struct {
@@ -45,7 +57,7 @@ type LikwidCollectorConfig struct {
}
type LikwidCollector struct {
MetricCollector
metricCollector
cpulist []C.int
sock2tid map[int]int
metrics map[C.int]map[string]int
@@ -105,7 +117,7 @@ func getSocketCpus() map[C.int]int {
return outmap
}
func (m *LikwidCollector) Init(config []byte) error {
func (m *LikwidCollector) Init(config json.RawMessage) error {
var ret C.int
m.name = "LikwidCollector"
if len(config) > 0 {
@@ -115,11 +127,13 @@ func (m *LikwidCollector) Init(config []byte) error {
}
}
m.setup()
m.meta = map[string]string{"source": m.name, "group": "PerfCounter"}
cpulist := CpuList()
m.cpulist = make([]C.int, len(cpulist))
slist := getSocketCpus()
m.sock2tid = make(map[int]int)
// m.numa2tid = make(map[int]int)
for i, c := range cpulist {
m.cpulist[i] = C.int(c)
if sid, found := slist[m.cpulist[i]]; found {
@@ -169,7 +183,7 @@ func (m *LikwidCollector) Init(config []byte) error {
return nil
}
func (m *LikwidCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *LikwidCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -246,24 +260,28 @@ func (m *LikwidCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
for _, metric := range evset.Metrics {
_, skip := stringArrayContains(m.config.ExcludeMetrics, metric.Name)
if metric.Publish && !skip {
if metric.Socket_scope {
if metric.Scope.String() == "socket" {
for sid, tid := range m.sock2tid {
y, err := lp.New(metric.Name,
map[string]string{"type": "socket", "type-id": fmt.Sprintf("%d", int(sid))},
map[string]string{"type": "socket",
"type-id": fmt.Sprintf("%d", int(sid))},
m.meta,
map[string]interface{}{"value": m.mresults[i][tid][metric.Name]},
time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
} else {
} else if metric.Scope.String() == "hwthread" {
for tid, cpu := range m.cpulist {
y, err := lp.New(metric.Name,
map[string]string{"type": "cpu", "type-id": fmt.Sprintf("%d", int(cpu))},
map[string]string{"type": "cpu",
"type-id": fmt.Sprintf("%d", int(cpu))},
m.meta,
map[string]interface{}{"value": m.mresults[i][tid][metric.Name]},
time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
@@ -273,24 +291,28 @@ func (m *LikwidCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
for _, metric := range m.config.Metrics {
_, skip := stringArrayContains(m.config.ExcludeMetrics, metric.Name)
if metric.Publish && !skip {
if metric.Socket_scope {
if metric.Scope.String() == "socket" {
for sid, tid := range m.sock2tid {
y, err := lp.New(metric.Name,
map[string]string{"type": "socket", "type-id": fmt.Sprintf("%d", int(sid))},
map[string]string{"type": "socket",
"type-id": fmt.Sprintf("%d", int(sid))},
m.meta,
map[string]interface{}{"value": m.gmresults[tid][metric.Name]},
time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
} else {
for tid, cpu := range m.cpulist {
y, err := lp.New(metric.Name,
map[string]string{"type": "cpu", "type-id": fmt.Sprintf("%d", int(cpu))},
map[string]string{"type": "cpu",
"type-id": fmt.Sprintf("%d", int(cpu))},
m.meta,
map[string]interface{}{"value": m.gmresults[tid][metric.Name]},
time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}

119
collectors/likwidMetric.md Normal file

@@ -0,0 +1,119 @@
## `likwid` collector
```json
"likwid": {
"eventsets": [
{
"events": {
"FIXC1": "ACTUAL_CPU_CLOCK",
"FIXC2": "MAX_CPU_CLOCK",
"PMC0": "RETIRED_INSTRUCTIONS",
"PMC1": "CPU_CLOCKS_UNHALTED",
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
"PMC3": "MERGE",
"DFC0": "DRAM_CHANNEL_0",
"DFC1": "DRAM_CHANNEL_1",
"DFC2": "DRAM_CHANNEL_2",
"DFC3": "DRAM_CHANNEL_3"
},
"metrics": [
{
"name": "ipc",
"calc": "PMC0/PMC1",
"socket_scope": false,
"publish": true
},
{
"name": "flops_any",
"calc": "0.000001*PMC2/time",
"socket_scope": false,
"publish": true
},
{
"name": "clock_mhz",
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
"socket_scope": false,
"publish": true
},
{
"name": "mem1",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"socket_scope": true,
"publish": false
}
]
},
{
"events": {
"DFC0": "DRAM_CHANNEL_4",
"DFC1": "DRAM_CHANNEL_5",
"DFC2": "DRAM_CHANNEL_6",
"DFC3": "DRAM_CHANNEL_7",
"PWR0": "RAPL_CORE_ENERGY",
"PWR1": "RAPL_PKG_ENERGY"
},
"metrics": [
{
"name": "pwr_core",
"calc": "PWR0/time",
"socket_scope": false,
"publish": true
},
{
"name": "pwr_pkg",
"calc": "PWR1/time",
"socket_scope": true,
"publish": true
},
{
"name": "mem2",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"socket_scope": true,
"publish": false
}
]
}
],
"globalmetrics": [
{
"name": "mem_bw",
"calc": "mem1+mem2",
"socket_scope": true,
"publish": true
}
]
}
```
_Example config suitable for AMD Zen3_
The `likwid` collector reads hardware performance counters at the **hwthread** and **socket** level. The configuration looks complicated, but it is essentially copied from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). Earlier iterations of the collector tried to use the performance groups directly, but that approach lacked flexibility. The current configuration scheme is the most flexible one.
The logic is as follows: there are multiple eventsets, each consisting of a list of counters and events plus a list of metrics. If you compare a common performance group with the example configuration above, there is not much difference:
```
EVENTSET -> "events": {
FIXC1 ACTUAL_CPU_CLOCK -> "FIXC1": "ACTUAL_CPU_CLOCK",
FIXC2 MAX_CPU_CLOCK -> "FIXC2": "MAX_CPU_CLOCK",
PMC0 RETIRED_INSTRUCTIONS -> "PMC0" : "RETIRED_INSTRUCTIONS",
PMC1 CPU_CLOCKS_UNHALTED -> "PMC1" : "CPU_CLOCKS_UNHALTED",
PMC2 RETIRED_SSE_AVX_FLOPS_ALL -> "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
PMC3 MERGE -> "PMC3": "MERGE",
-> }
```
The metrics follow the same procedure:
```
METRICS -> "metrics": [
IPC PMC0/PMC1 -> {
-> "name" : "IPC",
-> "calc" : "PMC0/PMC1",
-> "socket_scope": false,
-> "publish": true
-> }
-> ]
```
The `socket_scope` option specifies whether a metric is submitted per socket or per hwthread. If a metric is only used for internal calculations, set `"publish": false`.
Since some metrics can only be gathered in multiple measurements (like the memory bandwidth on AMD Zen3 chips), configure multiple eventsets like in the example config and use the `globalmetrics` section to combine them. **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.
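To make the `socket_scope` and `publish` semantics concrete, here is a small sketch that decodes a `metrics` list and reports which metrics would be submitted and at which scope. It follows the README's boolean `socket_scope` field; the `metricConfig` type and the `publishedMetrics` helper are hypothetical and only illustrate the rules described above, they are not the collector's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricConfig mirrors one entry of a "metrics" list from the example config.
type metricConfig struct {
	Name        string `json:"name"`
	Calc        string `json:"calc"`
	SocketScope bool   `json:"socket_scope"`
	Publish     bool   `json:"publish"`
}

// publishedMetrics returns "name:scope" for every metric with publish=true.
// Metrics with publish=false are kept for internal calculations only.
func publishedMetrics(raw []byte) ([]string, error) {
	var metrics []metricConfig
	if err := json.Unmarshal(raw, &metrics); err != nil {
		return nil, err
	}
	var out []string
	for _, m := range metrics {
		if !m.Publish {
			continue
		}
		scope := "hwthread"
		if m.SocketScope {
			scope = "socket"
		}
		out = append(out, m.Name+":"+scope)
	}
	return out, nil
}

func main() {
	raw := []byte(`[
		{"name": "ipc", "calc": "PMC0/PMC1", "socket_scope": false, "publish": true},
		{"name": "mem1", "calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time", "socket_scope": true, "publish": false},
		{"name": "pwr_pkg", "calc": "PWR1/time", "socket_scope": true, "publish": true}
	]`)
	names, err := publishedMetrics(raw)
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println(names)
}
```

Here `mem1` is dropped from the output because it only feeds a global metric such as `mem_bw`.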


@@ -6,8 +6,7 @@ import (
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const LOADAVGFILE = `/proc/loadavg`
@@ -17,14 +16,14 @@ type LoadavgCollectorConfig struct {
}
type LoadavgCollector struct {
MetricCollector
metricCollector
tags map[string]string
load_matches []string
proc_matches []string
config LoadavgCollectorConfig
}
func (m *LoadavgCollector) Init(config []byte) error {
func (m *LoadavgCollector) Init(config json.RawMessage) error {
m.name = "LoadavgCollector"
m.setup()
if len(config) > 0 {
@@ -33,6 +32,7 @@ func (m *LoadavgCollector) Init(config []byte) error {
return err
}
}
m.meta = map[string]string{"source": m.name, "group": "LOAD"}
m.tags = map[string]string{"type": "node"}
m.load_matches = []string{"load_one", "load_five", "load_fifteen"}
m.proc_matches = []string{"proc_run", "proc_total"}
@@ -40,7 +40,7 @@ func (m *LoadavgCollector) Init(config []byte) error {
return nil
}
func (m *LoadavgCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *LoadavgCollector) Read(interval time.Duration, output chan lp.CCMetric) {
var skip bool
if !m.init {
return
@@ -56,9 +56,9 @@ func (m *LoadavgCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
x, err := strconv.ParseFloat(ls[i], 64)
if err == nil {
_, skip = stringArrayContains(m.config.ExcludeMetrics, name)
y, err := lp.New(name, m.tags, map[string]interface{}{"value": float64(x)}, time.Now())
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": float64(x)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
}
@@ -67,9 +67,9 @@ func (m *LoadavgCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
x, err := strconv.ParseFloat(lv[i], 64)
if err == nil {
_, skip = stringArrayContains(m.config.ExcludeMetrics, name)
y, err := lp.New(name, m.tags, map[string]interface{}{"value": float64(x)}, time.Now())
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": float64(x)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
}


@@ -0,0 +1,19 @@
## `loadavg` collector
```json
"loadavg": {
"exclude_metrics": [
"proc_run"
]
}
```
The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful of **node** metrics. A metric that is not required can be excluded from being forwarded to the sink.
Metrics:
* `load_one`
* `load_five`
* `load_fifteen`
* `proc_run`
* `proc_total`
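The mapping from a `/proc/loadavg` line to the five metrics can be sketched like this. The helper `parseLoadavg` and the sample line are hypothetical; the sketch assumes the usual layout of three load averages followed by a `running/total` process field and is not the collector's own code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLoadavg maps one /proc/loadavg line to the metric names listed above.
func parseLoadavg(line string) (map[string]float64, error) {
	fields := strings.Fields(line)
	if len(fields) < 4 {
		return nil, fmt.Errorf("expected at least 4 fields, got %d", len(fields))
	}
	out := make(map[string]float64, 5)
	// The first three fields are the 1, 5 and 15 minute load averages.
	for i, name := range []string{"load_one", "load_five", "load_fifteen"} {
		v, err := strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return nil, err
		}
		out[name] = v
	}
	// The fourth field has the form "running/total".
	procs := strings.Split(fields[3], "/")
	if len(procs) != 2 {
		return nil, fmt.Errorf("unexpected process field %q", fields[3])
	}
	for i, name := range []string{"proc_run", "proc_total"} {
		v, err := strconv.ParseFloat(procs[i], 64)
		if err != nil {
			return nil, err
		}
		out[name] = v
	}
	return out, nil
}

func main() {
	// Illustrative sample line.
	m, err := parseLoadavg("0.20 0.18 0.12 1/80 11206")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(m["load_one"], m["proc_run"], m["proc_total"])
}
```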


@@ -8,8 +8,7 @@ import (
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const LUSTREFILE = `/proc/fs/lustre/llite/lnec-XXXXXX/stats`
@@ -20,14 +19,14 @@ type LustreCollectorConfig struct {
}
type LustreCollector struct {
MetricCollector
metricCollector
tags map[string]string
matches map[string]map[string]int
devices []string
config LustreCollectorConfig
}
func (m *LustreCollector) Init(config []byte) error {
func (m *LustreCollector) Init(config json.RawMessage) error {
var err error
m.name = "LustreCollector"
if len(config) > 0 {
@@ -38,6 +37,7 @@ func (m *LustreCollector) Init(config []byte) error {
}
m.setup()
m.tags = map[string]string{"type": "node"}
m.meta = map[string]string{"source": m.name, "group": "Lustre"}
m.matches = map[string]map[string]int{"read_bytes": {"read_bytes": 6, "read_requests": 1},
"write_bytes": {"write_bytes": 6, "write_requests": 1},
"open": {"open": 1},
@@ -64,7 +64,7 @@ func (m *LustreCollector) Init(config []byte) error {
return nil
}
func (m *LustreCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *LustreCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -88,9 +88,12 @@ func (m *LustreCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
}
x, err := strconv.ParseInt(lf[idx], 0, 64)
if err == nil {
y, err := lp.New(name, m.tags, map[string]interface{}{"value": x}, time.Now())
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": x}, time.Now())
if err == nil {
*out = append(*out, y)
if strings.Contains(name, "byte") {
y.AddMeta("unit", "Byte")
}
output <- y
}
}
}


@@ -0,0 +1,29 @@
## `lustrestat` collector
```json
"lustrestat": {
"procfiles" : [
"/proc/fs/lustre/llite/lnec-XXXXXX/stats"
],
"exclude_metrics": [
"setattr",
"getattr"
]
}
```
The `lustrestat` collector reads from the procfs stat files for Lustre like `/proc/fs/lustre/llite/lnec-XXXXXX/stats`.
Metrics:
* `read_bytes`
* `read_requests`
* `write_bytes`
* `write_requests`
* `open`
* `close`
* `getattr`
* `setattr`
* `statfs`
* `inode_permission`


@@ -9,8 +9,7 @@ import (
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const MEMSTATFILE = `/proc/meminfo`
@@ -20,14 +19,14 @@ type MemstatCollectorConfig struct {
}
type MemstatCollector struct {
MetricCollector
metricCollector
stats map[string]int64
tags map[string]string
matches map[string]string
config MemstatCollectorConfig
}
func (m *MemstatCollector) Init(config []byte) error {
func (m *MemstatCollector) Init(config json.RawMessage) error {
var err error
m.name = "MemstatCollector"
if len(config) > 0 {
@@ -36,6 +35,7 @@ func (m *MemstatCollector) Init(config []byte) error {
return err
}
}
m.meta = map[string]string{"source": m.name, "group": "Memory", "unit": "kByte"}
m.stats = make(map[string]int64)
m.matches = make(map[string]string)
m.tags = map[string]string{"type": "node"}
@@ -65,7 +65,7 @@ func (m *MemstatCollector) Init(config []byte) error {
return err
}
func (m *MemstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *MemstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -97,9 +97,9 @@ func (m *MemstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
log.Print(err)
continue
}
y, err := lp.New(name, m.tags, map[string]interface{}{"value": int(float64(m.stats[match]) * 1.0e-3)}, time.Now())
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": int(float64(m.stats[match]) * 1.0e-3)}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
@@ -108,18 +108,18 @@ func (m *MemstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
if _, cached := m.stats[`Cached`]; cached {
memUsed := m.stats[`MemTotal`] - (m.stats[`MemFree`] + m.stats[`Buffers`] + m.stats[`Cached`])
_, skip := stringArrayContains(m.config.ExcludeMetrics, "mem_used")
y, err := lp.New("mem_used", m.tags, map[string]interface{}{"value": int(float64(memUsed) * 1.0e-3)}, time.Now())
y, err := lp.New("mem_used", m.tags, m.meta, map[string]interface{}{"value": int(float64(memUsed) * 1.0e-3)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
}
}
if _, found := m.stats[`MemShared`]; found {
_, skip := stringArrayContains(m.config.ExcludeMetrics, "mem_shared")
y, err := lp.New("mem_shared", m.tags, map[string]interface{}{"value": int(float64(m.stats[`MemShared`]) * 1.0e-3)}, time.Now())
y, err := lp.New("mem_shared", m.tags, m.meta, map[string]interface{}{"value": int(float64(m.stats[`MemShared`]) * 1.0e-3)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
}


@@ -0,0 +1,27 @@
## `memstat` collector
```json
"memstat": {
"exclude_metrics": [
"mem_used"
]
}
```
The `memstat` collector reads data from `/proc/meminfo` and outputs a handful of **node** metrics. A metric that is not required can be excluded from being forwarded to the sink.
Metrics:
* `mem_total`
* `mem_sreclaimable`
* `mem_slab`
* `mem_free`
* `mem_buffers`
* `mem_cached`
* `mem_available`
* `mem_shared`
* `swap_total`
* `swap_free`
* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)
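The `mem_used` formula can be illustrated with a short sketch. The helpers `parseMeminfo` and `memUsed` are hypothetical, and the sample values are made up. The sketch works on the raw kB values from `/proc/meminfo` and leaves out the unit scaling the collector applies.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfo turns "Key:   value kB" lines into a map of raw values.
// Lines that do not match this shape are ignored.
func parseMeminfo(text string) map[string]int64 {
	stats := make(map[string]int64)
	for _, line := range strings.Split(text, "\n") {
		f := strings.Fields(line)
		if len(f) < 2 {
			continue
		}
		v, err := strconv.ParseInt(f[1], 10, 64)
		if err != nil {
			continue
		}
		stats[strings.TrimSuffix(f[0], ":")] = v
	}
	return stats
}

// memUsed applies the formula from the metric list:
// mem_total - (mem_free + mem_buffers + mem_cached).
func memUsed(stats map[string]int64) int64 {
	return stats["MemTotal"] - (stats["MemFree"] + stats["Buffers"] + stats["Cached"])
}

func main() {
	// Made-up sample values in kB.
	sample := "MemTotal: 16000 kB\nMemFree: 4000 kB\nBuffers: 1000 kB\nCached: 3000 kB\n"
	fmt.Println(memUsed(parseMeminfo(sample)))
}
```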


@@ -1,8 +1,10 @@
package collectors
import (
"encoding/json"
"errors"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
influx "github.com/influxdata/line-protocol"
"io/ioutil"
"log"
"strconv"
@@ -10,28 +12,30 @@ import (
"time"
)
type MetricGetter interface {
type MetricCollector interface {
Name() string
Init(config []byte) error
Init(config json.RawMessage) error
Initialized() bool
Read(time.Duration, *[]lp.MutableMetric)
Read(duration time.Duration, output chan lp.CCMetric)
Close()
}
type MetricCollector struct {
name string
init bool
type metricCollector struct {
output chan lp.CCMetric
name string
init bool
meta map[string]string
}
func (c *MetricCollector) Name() string {
func (c *metricCollector) Name() string {
return c.name
}
func (c *MetricCollector) setup() error {
func (c *metricCollector) setup() error {
return nil
}
func (c *MetricCollector) Initialized() bool {
func (c *metricCollector) Initialized() bool {
return c.init == true
}
@@ -103,7 +107,7 @@ func CpuList() []int {
return cpulist
}
func Tags2Map(metric lp.Metric) map[string]string {
func Tags2Map(metric influx.Metric) map[string]string {
tags := make(map[string]string)
for _, t := range metric.TagList() {
tags[t.Key] = t.Value
@@ -111,7 +115,7 @@ func Tags2Map(metric lp.Metric) map[string]string {
return tags
}
func Fields2Map(metric lp.Metric) map[string]interface{} {
func Fields2Map(metric influx.Metric) map[string]interface{} {
fields := make(map[string]interface{})
for _, f := range metric.FieldList() {
fields[f.Key] = f.Value


@@ -7,8 +7,7 @@ import (
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const NETSTATFILE = `/proc/net/dev`
@@ -18,14 +17,15 @@ type NetstatCollectorConfig struct {
}
type NetstatCollector struct {
MetricCollector
metricCollector
config NetstatCollectorConfig
matches map[int]string
}
func (m *NetstatCollector) Init(config []byte) error {
func (m *NetstatCollector) Init(config json.RawMessage) error {
m.name = "NetstatCollector"
m.setup()
m.meta = map[string]string{"source": m.name, "group": "Network"}
m.matches = map[int]string{
1: "bytes_in",
9: "bytes_out",
@@ -46,7 +46,7 @@ func (m *NetstatCollector) Init(config []byte) error {
return nil
}
func (m *NetstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *NetstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
data, err := ioutil.ReadFile(string(NETSTATFILE))
if err != nil {
log.Print(err.Error())
@@ -73,9 +73,15 @@ func (m *NetstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
for i, name := range m.matches {
v, err := strconv.ParseInt(f[i], 10, 0)
if err == nil {
y, err := lp.New(name, tags, map[string]interface{}{"value": int(float64(v) * 1.0e-3)}, time.Now())
y, err := lp.New(name, tags, m.meta, map[string]interface{}{"value": int(float64(v) * 1.0e-3)}, time.Now())
if err == nil {
*out = append(*out, y)
switch {
case strings.Contains(name, "byte"):
y.AddMeta("unit", "Byte")
case strings.Contains(name, "pkt"):
y.AddMeta("unit", "Packets")
}
output <- y
}
}
}


@@ -0,0 +1,21 @@
## `netstat` collector
```json
"netstat": {
"exclude_devices": [
"lo"
]
}
```
The `netstat` collector reads data from `/proc/net/dev` and outputs a handful of **node** metrics. A device that is not required can be excluded from being forwarded to the sink. The `lo` device should commonly be excluded.
Metrics:
* `bytes_in`
* `bytes_out`
* `pkts_in`
* `pkts_out`
The device name is added as tag `device`.
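A sketch of how one `/proc/net/dev` line maps to these metrics is shown below. The helper `parseNetDevLine` and the sample line are hypothetical. Column positions 1 and 9 (`bytes_in`, `bytes_out`) appear in the collector's `matches` table; positions 2 and 10 for the packet counters are an assumption based on the usual column order of the file. The sketch returns the raw counters without any scaling.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// netFields maps whitespace-separated column positions to metric names.
// Positions 2 and 10 are assumed, see the note above.
var netFields = map[int]string{1: "bytes_in", 2: "pkts_in", 9: "bytes_out", 10: "pkts_out"}

// parseNetDevLine returns the device name and its counters. The boolean is
// false for header lines, malformed lines and excluded devices.
func parseNetDevLine(line string, exclude []string) (string, map[string]int64, bool) {
	f := strings.Fields(line)
	if len(f) < 11 || !strings.HasSuffix(f[0], ":") {
		return "", nil, false
	}
	dev := strings.TrimSuffix(f[0], ":")
	for _, e := range exclude {
		if e == dev {
			return dev, nil, false
		}
	}
	values := make(map[string]int64, len(netFields))
	for idx, name := range netFields {
		v, err := strconv.ParseInt(f[idx], 10, 64)
		if err != nil {
			return dev, nil, false
		}
		values[name] = v
	}
	return dev, values, true
}

func main() {
	// Illustrative sample line with made-up counters.
	line := "  eth0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0"
	if dev, values, ok := parseNetDevLine(line, []string{"lo"}); ok {
		fmt.Println(dev, values["bytes_in"], values["bytes_out"])
	}
}
```

The returned device name is what ends up in the `device` tag.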

147
collectors/nfsMetric.go Normal file

@@ -0,0 +1,147 @@
package collectors
import (
"encoding/json"
"fmt"
"log"
// "os"
"os/exec"
"strconv"
"strings"
"time"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
type NfsCollectorData struct {
current int64
last int64
}
type NfsCollector struct {
metricCollector
tags map[string]string
config struct {
Nfsutils string `json:"nfsutils"`
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
}
data map[string]map[string]NfsCollectorData
}
func (m *NfsCollector) initStats() error {
// Output() starts the command and waits for it to finish; no separate Wait() is needed
cmd := exec.Command(m.config.Nfsutils, "-l")
buffer, err := cmd.Output()
if err == nil {
for _, line := range strings.Split(string(buffer), "\n") {
lf := strings.Fields(line)
if len(lf) != 5 {
continue
}
if _, exist := m.data[lf[1]]; !exist {
m.data[lf[1]] = make(map[string]NfsCollectorData)
}
name := strings.Trim(lf[3], ":")
if _, exist := m.data[lf[1]][name]; !exist {
value, err := strconv.ParseInt(lf[4], 0, 64)
if err == nil {
x := m.data[lf[1]][name]
x.current = value
x.last = 0
m.data[lf[1]][name] = x
}
}
}
}
return err
}
func (m *NfsCollector) updateStats() error {
// Output() starts the command and waits for it to finish; no separate Wait() is needed
cmd := exec.Command(m.config.Nfsutils, "-l")
buffer, err := cmd.Output()
if err == nil {
for _, line := range strings.Split(string(buffer), "\n") {
lf := strings.Fields(line)
if len(lf) != 5 {
continue
}
if _, exist := m.data[lf[1]]; !exist {
m.data[lf[1]] = make(map[string]NfsCollectorData)
}
name := strings.Trim(lf[3], ":")
if _, exist := m.data[lf[1]][name]; exist {
value, err := strconv.ParseInt(lf[4], 0, 64)
if err == nil {
x := m.data[lf[1]][name]
x.last = x.current
x.current = value
m.data[lf[1]][name] = x
}
}
}
}
return err
}
func (m *NfsCollector) Init(config json.RawMessage) error {
var err error
m.name = "NfsCollector"
m.setup()
// Set default nfsstat binary
m.config.Nfsutils = "/usr/sbin/nfsstat"
// Read JSON configuration
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
log.Print(err.Error())
return err
}
}
m.meta = map[string]string{
"source": m.name,
"group": "NFS",
}
m.tags = map[string]string{
"type": "node",
}
// Check if nfsstat is in executable search path
_, err = exec.LookPath(m.config.Nfsutils)
if err != nil {
return fmt.Errorf("NfsCollector.Init(): Failed to find nfsstat binary '%s': %v", m.config.Nfsutils, err)
}
m.data = make(map[string]map[string]NfsCollectorData)
m.initStats()
m.init = true
return nil
}
func (m *NfsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
timestamp := time.Now()
m.updateStats()
for version, metrics := range m.data {
for name, data := range metrics {
if _, skip := stringArrayContains(m.config.ExcludeMetrics, name); skip {
continue
}
value := data.current - data.last
y, err := lp.New(fmt.Sprintf("nfs_%s", name), m.tags, m.meta, map[string]interface{}{"value": value}, timestamp)
if err == nil {
y.AddMeta("version", version)
output <- y
}
}
}
}
func (m *NfsCollector) Close() {
m.init = false
}
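
The counter handling in `initStats`, `updateStats` and `Read` can be condensed into the following sketch. It assumes, as the collector does, that each relevant line of `nfsstat -l` has five whitespace-separated fields, with the version in the second field, the operation name (with a trailing colon) in the fourth and the counter in the fifth. The type `counter` and the function `updateCounters` are made up for this illustration.

```go
package main

import (
	"strconv"
	"strings"
)

// counter mirrors NfsCollectorData: the latest reading and the one before it.
type counter struct {
	current int64
	last    int64
}

// updateCounters is a hypothetical helper. It parses nfsstat-style output,
// shifts current into last and stores the new reading, keyed by
// "<version>/<operation>". It returns the per-interval deltas.
func updateCounters(data map[string]counter, output string) map[string]int64 {
	deltas := make(map[string]int64)
	for _, line := range strings.Split(output, "\n") {
		lf := strings.Fields(line)
		if len(lf) != 5 {
			continue // separator and empty lines
		}
		value, err := strconv.ParseInt(lf[4], 10, 64)
		if err != nil {
			continue
		}
		key := lf[1] + "/" + strings.TrimSuffix(lf[3], ":")
		c, seen := data[key]
		if !seen {
			// First reading: use it as the baseline so the first delta is zero.
			c.current = value
		}
		c.last = c.current
		c.current = value
		data[key] = c
		deltas[key] = c.current - c.last
	}
	return deltas
}
```

Unlike the collector, which only creates entries in `initStats`, this sketch treats the first reading of any counter as its baseline. The reported value is always the difference between two consecutive readings, not the absolute counter.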


@@ -6,9 +6,8 @@ import (
"fmt"
"log"
"time"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
"github.com/NVIDIA/go-nvml/pkg/nvml"
lp "github.com/influxdata/line-protocol"
)
type NvidiaCollectorConfig struct {
@@ -17,7 +16,7 @@ type NvidiaCollectorConfig struct {
}
type NvidiaCollector struct {
MetricCollector
metricCollector
num_gpus int
config NvidiaCollectorConfig
}
@@ -29,10 +28,11 @@ func (m *NvidiaCollector) CatchPanic() {
}
}
func (m *NvidiaCollector) Init(config []byte) error {
func (m *NvidiaCollector) Init(config json.RawMessage) error {
var err error
m.name = "NvidiaCollector"
m.setup()
m.meta = map[string]string{"source": m.name, "group": "Nvidia"}
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
@@ -55,7 +55,7 @@ func (m *NvidiaCollector) Init(config []byte) error {
return nil
}
func (m *NvidiaCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -74,14 +74,14 @@ func (m *NvidiaCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
util, ret := nvml.DeviceGetUtilizationRates(device)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "util")
y, err := lp.New("util", tags, map[string]interface{}{"value": float64(util.Gpu)}, time.Now())
y, err := lp.New("util", tags, m.meta, map[string]interface{}{"value": float64(util.Gpu)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
_, skip = stringArrayContains(m.config.ExcludeMetrics, "mem_util")
y, err = lp.New("mem_util", tags, map[string]interface{}{"value": float64(util.Memory)}, time.Now())
y, err = lp.New("mem_util", tags, m.meta, map[string]interface{}{"value": float64(util.Memory)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
@@ -89,174 +89,177 @@ func (m *NvidiaCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
if ret == nvml.SUCCESS {
t := float64(meminfo.Total) / (1024 * 1024)
_, skip = stringArrayContains(m.config.ExcludeMetrics, "mem_total")
y, err := lp.New("mem_total", tags, map[string]interface{}{"value": t}, time.Now())
y, err := lp.New("mem_total", tags, m.meta, map[string]interface{}{"value": t}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
y.AddMeta("unit", "MByte")
output <- y
}
f := float64(meminfo.Used) / (1024 * 1024)
_, skip = stringArrayContains(m.config.ExcludeMetrics, "fb_memory")
y, err = lp.New("fb_memory", tags, map[string]interface{}{"value": f}, time.Now())
y, err = lp.New("fb_memory", tags, m.meta, map[string]interface{}{"value": f}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
y.AddMeta("unit", "MByte")
output <- y
}
}
temp, ret := nvml.DeviceGetTemperature(device, nvml.TEMPERATURE_GPU)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "temp")
y, err := lp.New("temp", tags, map[string]interface{}{"value": float64(temp)}, time.Now())
y, err := lp.New("temp", tags, m.meta, map[string]interface{}{"value": float64(temp)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
y.AddMeta("unit", "degC")
output <- y
}
}
fan, ret := nvml.DeviceGetFanSpeed(device)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "fan")
y, err := lp.New("fan", tags, map[string]interface{}{"value": float64(fan)}, time.Now())
y, err := lp.New("fan", tags, m.meta, map[string]interface{}{"value": float64(fan)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
_, ecc_pend, ret := nvml.DeviceGetEccMode(device)
if ret == nvml.SUCCESS {
var y lp.MutableMetric
var y lp.CCMetric
var err error
switch ecc_pend {
case nvml.FEATURE_DISABLED:
y, err = lp.New("ecc_mode", tags, map[string]interface{}{"value": string("OFF")}, time.Now())
y, err = lp.New("ecc_mode", tags, m.meta, map[string]interface{}{"value": string("OFF")}, time.Now())
case nvml.FEATURE_ENABLED:
y, err = lp.New("ecc_mode", tags, map[string]interface{}{"value": string("ON")}, time.Now())
y, err = lp.New("ecc_mode", tags, m.meta, map[string]interface{}{"value": string("ON")}, time.Now())
default:
y, err = lp.New("ecc_mode", tags, map[string]interface{}{"value": string("UNKNOWN")}, time.Now())
y, err = lp.New("ecc_mode", tags, m.meta, map[string]interface{}{"value": string("UNKNOWN")}, time.Now())
}
_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_mode")
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
} else if ret == nvml.ERROR_NOT_SUPPORTED {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_mode")
y, err := lp.New("ecc_mode", tags, map[string]interface{}{"value": string("N/A")}, time.Now())
y, err := lp.New("ecc_mode", tags, m.meta, map[string]interface{}{"value": string("N/A")}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
pstate, ret := nvml.DeviceGetPerformanceState(device)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "perf_state")
y, err := lp.New("perf_state", tags, map[string]interface{}{"value": fmt.Sprintf("P%d", int(pstate))}, time.Now())
y, err := lp.New("perf_state", tags, m.meta, map[string]interface{}{"value": fmt.Sprintf("P%d", int(pstate))}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
power, ret := nvml.DeviceGetPowerUsage(device)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "power_usage_report")
y, err := lp.New("power_usage_report", tags, map[string]interface{}{"value": float64(power) / 1000}, time.Now())
y, err := lp.New("power_usage_report", tags, m.meta, map[string]interface{}{"value": float64(power) / 1000}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
gclk, ret := nvml.DeviceGetClockInfo(device, nvml.CLOCK_GRAPHICS)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "graphics_clock_report")
y, err := lp.New("graphics_clock_report", tags, map[string]interface{}{"value": float64(gclk)}, time.Now())
y, err := lp.New("graphics_clock_report", tags, m.meta, map[string]interface{}{"value": float64(gclk)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
smclk, ret := nvml.DeviceGetClockInfo(device, nvml.CLOCK_SM)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "sm_clock_report")
y, err := lp.New("sm_clock_report", tags, map[string]interface{}{"value": float64(smclk)}, time.Now())
y, err := lp.New("sm_clock_report", tags, m.meta, map[string]interface{}{"value": float64(smclk)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
memclk, ret := nvml.DeviceGetClockInfo(device, nvml.CLOCK_MEM)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "mem_clock_report")
y, err := lp.New("mem_clock_report", tags, map[string]interface{}{"value": float64(memclk)}, time.Now())
y, err := lp.New("mem_clock_report", tags, m.meta, map[string]interface{}{"value": float64(memclk)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
max_gclk, ret := nvml.DeviceGetMaxClockInfo(device, nvml.CLOCK_GRAPHICS)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "max_graphics_clock")
y, err := lp.New("max_graphics_clock", tags, map[string]interface{}{"value": float64(max_gclk)}, time.Now())
y, err := lp.New("max_graphics_clock", tags, m.meta, map[string]interface{}{"value": float64(max_gclk)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
max_smclk, ret := nvml.DeviceGetMaxClockInfo(device, nvml.CLOCK_SM)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "max_sm_clock")
y, err := lp.New("max_sm_clock", tags, map[string]interface{}{"value": float64(max_smclk)}, time.Now())
y, err := lp.New("max_sm_clock", tags, m.meta, map[string]interface{}{"value": float64(max_smclk)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
max_memclk, ret := nvml.DeviceGetMaxClockInfo(device, nvml.CLOCK_MEM)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "max_mem_clock")
y, err := lp.New("max_mem_clock", tags, map[string]interface{}{"value": float64(max_memclk)}, time.Now())
y, err := lp.New("max_mem_clock", tags, m.meta, map[string]interface{}{"value": float64(max_memclk)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
ecc_db, ret := nvml.DeviceGetTotalEccErrors(device, 1, 1)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_db_error")
y, err := lp.New("ecc_db_error", tags, map[string]interface{}{"value": float64(ecc_db)}, time.Now())
y, err := lp.New("ecc_db_error", tags, m.meta, map[string]interface{}{"value": float64(ecc_db)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
ecc_sb, ret := nvml.DeviceGetTotalEccErrors(device, 0, 1)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_sb_error")
y, err := lp.New("ecc_sb_error", tags, map[string]interface{}{"value": float64(ecc_sb)}, time.Now())
y, err := lp.New("ecc_sb_error", tags, m.meta, map[string]interface{}{"value": float64(ecc_sb)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
pwr_limit, ret := nvml.DeviceGetPowerManagementLimit(device)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "power_man_limit")
y, err := lp.New("power_man_limit", tags, map[string]interface{}{"value": float64(pwr_limit)}, time.Now())
y, err := lp.New("power_man_limit", tags, m.meta, map[string]interface{}{"value": float64(pwr_limit)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
enc_util, _, ret := nvml.DeviceGetEncoderUtilization(device)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "encoder_util")
y, err := lp.New("encoder_util", tags, map[string]interface{}{"value": float64(enc_util)}, time.Now())
y, err := lp.New("encoder_util", tags, m.meta, map[string]interface{}{"value": float64(enc_util)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
dec_util, _, ret := nvml.DeviceGetDecoderUtilization(device)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "decoder_util")
y, err := lp.New("decoder_util", tags, map[string]interface{}{"value": float64(dec_util)}, time.Now())
y, err := lp.New("decoder_util", tags, m.meta, map[string]interface{}{"value": float64(dec_util)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
output <- y
}
}
}


@@ -0,0 +1,40 @@
## `nvidia` collector
```json
"nvidia": {
"exclude_devices" : [
"0","1"
],
"exclude_metrics": [
"fb_memory",
"fan"
]
}
```
Metrics:
* `util`
* `mem_util`
* `mem_total`
* `fb_memory`
* `temp`
* `fan`
* `ecc_mode`
* `perf_state`
* `power_usage_report`
* `graphics_clock_report`
* `sm_clock_report`
* `mem_clock_report`
* `max_graphics_clock`
* `max_sm_clock`
* `max_mem_clock`
* `ecc_db_error`
* `ecc_sb_error`
* `power_man_limit`
* `encoder_util`
* `decoder_util`
Each metric is tagged with its own `type`, `accelerator`, and the GPU index as `type-id`. The output metric looks like this:
`<name>,type=accelerator,type-id=<nvidia-gpu-id> value=<metric value> <timestamp>`
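
To make the format above concrete, the following sketch renders one numeric metric as a string. The helper name `formatAcceleratorMetric` is hypothetical, and the nanosecond timestamp is an assumption based on the default precision of the Influx line protocol; the collector itself builds metrics with `lp.New` rather than formatting strings.

```go
package main

import "fmt"

// formatAcceleratorMetric is a hypothetical helper that renders one numeric
// metric in the shape shown above. unixNano is the timestamp in nanoseconds
// since the Unix epoch (an assumption, see the text above).
func formatAcceleratorMetric(name string, gpuID int, value float64, unixNano int64) string {
	return fmt.Sprintf("%s,type=accelerator,type-id=%d value=%g %d", name, gpuID, value, unixNano)
}
```

For example, `formatAcceleratorMetric("temp", 0, 45, 1640000000000000000)` gives `temp,type=accelerator,type-id=0 value=45 1640000000000000000`. String-valued metrics such as `ecc_mode` and `perf_state` would additionally need their value quoted.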


@@ -4,13 +4,13 @@ import (
"encoding/json"
"fmt"
"io/ioutil"
"log"
"os"
"path/filepath"
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const HWMON_PATH = `/sys/class/hwmon`
@@ -21,20 +21,21 @@ type TempCollectorConfig struct {
}
type TempCollector struct {
MetricCollector
metricCollector
config TempCollectorConfig
}
func (m *TempCollector) Init(config []byte) error {
func (m *TempCollector) Init(config json.RawMessage) error {
m.name = "TempCollector"
m.setup()
m.init = true
m.meta = map[string]string{"source": m.name, "group": "IPMI", "unit": "degC"}
if len(config) > 0 {
err := json.Unmarshal(config, &m.config)
if err != nil {
return err
}
}
m.init = true
return nil
}
@@ -74,7 +75,7 @@ func get_hwmon_sensors() (map[string]map[string]string, error) {
return sensors, nil
}
func (m *TempCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *TempCollector) Read(interval time.Duration, output chan lp.CCMetric) {
sensors, err := get_hwmon_sensors()
if err != nil {
@@ -89,15 +90,20 @@ func (m *TempCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
break
}
}
mname := strings.Replace(name, " ", "_", -1)
if !strings.Contains(mname, "temp") {
mname = fmt.Sprintf("temp_%s", mname)
}
buffer, err := ioutil.ReadFile(string(file))
if err != nil {
continue
}
x, err := strconv.ParseInt(strings.Replace(string(buffer), "\n", "", -1), 0, 64)
if err == nil {
y, err := lp.New(strings.ToLower(name), tags, map[string]interface{}{"value": float64(x) / 1000}, time.Now())
y, err := lp.New(strings.ToLower(mname), tags, m.meta, map[string]interface{}{"value": int(float64(x) / 1000)}, time.Now())
if err == nil {
*out = append(*out, y)
log.Print("[", m.name, "] ", y)
output <- y
}
}
}

collectors/tempMetric.md Normal file

@@ -0,0 +1,22 @@
## `tempstat` collector
```json
"tempstat": {
"tag_override" : {
"<device like hwmon1>" : {
"type" : "socket",
"type-id" : "0"
}
},
"exclude_metrics": [
"metric1",
"metric2"
]
}
```
The `tempstat` collector reads the data from `/sys/class/hwmon/<device>/tempX_{input,label}`.
Metrics:
* `temp_*`: The metric name is taken from the `label` files.
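
The naming and unit conversion can be sketched as follows. The helper names `tempMetricName` and `tempValue` are made up for this example, and trimming whitespace from the label is an assumption. The rest follows what the collector does: spaces become underscores, a `temp_` prefix is added when missing, the name is lowercased, and the millidegree reading from `tempX_input` is divided by 1000.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tempMetricName derives a metric name from the content of a tempX_label
// file: spaces become underscores, a "temp_" prefix is added when the label
// does not already contain "temp", and the result is lowercased.
func tempMetricName(label string) string {
	name := strings.ReplaceAll(strings.TrimSpace(label), " ", "_")
	if !strings.Contains(name, "temp") {
		name = fmt.Sprintf("temp_%s", name)
	}
	return strings.ToLower(name)
}

// tempValue converts the content of a tempX_input file (millidegrees
// Celsius as text) to whole degrees Celsius.
func tempValue(raw string) (int, error) {
	x, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return 0, err
	}
	return int(x / 1000), nil
}
```

A label of `Core 0` with an input of `45000` would therefore be reported as metric `temp_core_0` with value `45`.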


@@ -8,8 +8,7 @@ import (
"os/exec"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const MAX_NUM_PROCS = 10
@@ -20,15 +19,16 @@ type TopProcsCollectorConfig struct {
}
type TopProcsCollector struct {
MetricCollector
metricCollector
tags map[string]string
config TopProcsCollectorConfig
}
func (m *TopProcsCollector) Init(config []byte) error {
func (m *TopProcsCollector) Init(config json.RawMessage) error {
var err error
m.name = "TopProcsCollector"
m.tags = map[string]string{"type": "node"}
m.meta = map[string]string{"source": m.name, "group": "TopProcs"}
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
@@ -51,7 +51,7 @@ func (m *TopProcsCollector) Init(config []byte) error {
return nil
}
func (m *TopProcsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *TopProcsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -66,9 +66,9 @@ func (m *TopProcsCollector) Read(interval time.Duration, out *[]lp.MutableMetric
lines := strings.Split(string(stdout), "\n")
for i := 1; i < m.config.Num_procs+1 && i < len(lines); i++ {
name := fmt.Sprintf("topproc%d", i)
y, err := lp.New(name, m.tags, map[string]interface{}{"value": string(lines[i])}, time.Now())
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": string(lines[i])}, time.Now())
if err == nil {
*out = append(*out, y)
output <- y
}
}
}


@@ -0,0 +1,15 @@
## `topprocs` collector
```json
"topprocs": {
"num_procs": 5
}
```
The `topprocs` collector reports the top `num_procs` processes, sorted by CPU utilization (`ps -Ao comm --sort=-pcpu`).
In contrast to most other collectors, the metric value is a `string`.
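
A minimal sketch of the selection step, assuming the `ps` output starts with a single header line. The helpers `topN` and `readTopProcs` are hypothetical names for this illustration and are not the collector's own functions.

```go
package main

import (
	"os/exec"
	"strings"
)

// topN is a hypothetical helper. It takes the output of
// `ps -Ao comm --sort=-pcpu`, skips the header line and returns at most
// n command names, even if ps printed fewer lines than requested.
func topN(psOutput string, n int) []string {
	lines := strings.Split(strings.TrimSpace(psOutput), "\n")
	var procs []string
	for i := 1; i < len(lines) && len(procs) < n; i++ {
		procs = append(procs, strings.TrimSpace(lines[i]))
	}
	return procs
}

// readTopProcs runs ps and returns the n processes with the highest
// CPU utilization.
func readTopProcs(n int) ([]string, error) {
	stdout, err := exec.Command("ps", "-Ao", "comm", "--sort=-pcpu").Output()
	if err != nil {
		return nil, err
	}
	return topN(string(stdout), n), nil
}
```

Each returned name would become the string value of one metric, `topproc1` for the busiest process, `topproc2` for the next, and so on.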