mirror of
https://github.com/ClusterCockpit/cc-metric-collector.git
synced 2025-07-19 11:21:41 +02:00
Merge branch 'develop' into main
@@ -1,304 +1,59 @@
|
||||
# CCMetric collectors
|
||||
|
||||
This folder contains the collectors for the cc-metric-collector.
|
||||
|
||||
# `metricCollector.go`
|
||||
The base class/configuration is located in `metricCollector.go`.
|
||||
|
||||
# Collectors
|
||||
|
||||
* `memstatMetric.go`: Reads `/proc/meminfo` to calculate **node** metrics. It also combines values to the metric `mem_used`
|
||||
* `loadavgMetric.go`: Reads `/proc/loadavg` and submits **node** metrics:
|
||||
* `netstatMetric.go`: Reads `/proc/net/dev` and submits **node** metrics for all network devices.
|
||||
* `lustreMetric.go`: Reads Lustre's stats files and submits **node** metrics:
|
||||
* `infinibandMetric.go`: Reads InfiniBand metrics. It uses the `perfquery` command to read the **node** metrics but can fall back to sysfs counters if `perfquery` does not work.
|
||||
* `likwidMetric.go`: Reads hardware performance events using LIKWID. It submits **socket** and **cpu** metrics
|
||||
* `cpustatMetric.go`: Read CPU specific values from `/proc/stat`
|
||||
* `topprocsMetric.go`: Reads the TopX processes by their CPU usage. X is configurable
|
||||
* `nvidiaMetric.go`: Read data about Nvidia GPUs using the NVML library
|
||||
* `tempMetric.go`: Read temperature data from `/sys/class/hwmon/hwmon*`
|
||||
* `ipmiMetric.go`: Collect data from `ipmitool` or as fallback `ipmi-sensors`
|
||||
* `customCmdMetric.go`: Run commands or read files and submit the output (output has to be in InfluxDB line protocol!)
|
||||
|
||||
If any of the collectors cannot be initialized, it is excluded from all further reads. For example, if the Lustre stats file is not a valid path, no Lustre-specific metrics will be recorded.
|
||||
|
||||
# Collector configuration
|
||||
# Configuration
|
||||
|
||||
```json
|
||||
"collectors": [
|
||||
"tempstat"
|
||||
],
|
||||
"collect_config": {
|
||||
"tempstat": {
|
||||
"tag_override": {
|
||||
"hwmon0" : {
|
||||
"type" : "socket",
|
||||
"type-id" : "0"
|
||||
},
|
||||
"hwmon1" : {
|
||||
"type" : "socket",
|
||||
"type-id" : "1"
|
||||
}
|
||||
}
|
||||
{
|
||||
"collector_type" : {
|
||||
<collector specific configuration>
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The configuration of the collectors in the main config file consists of two parts: active collectors (`collectors`) and collector configuration (`collect_config`). At startup, all collectors in the `collectors` list are initialized and, if initialization succeeds, added to the active collectors for metric retrieval. During initialization, the collector-specific configuration from the `collect_config` section is handed over. Each collector has its own configuration options; see the collector-specific sections.
|
||||
In contrast to the configuration files for sinks and receivers, the collectors configuration is not a list but a set of dicts. This is required because we didn't manage to partially read the type before loading the remaining configuration. We are eager to change this to the same format.
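The "set of dicts" layout can be read in two steps: first decode the top level into a map from collector type to raw JSON, then hand each raw value to the matching collector. The following is a minimal sketch of that idea, not the project's actual code. The `collector` interface, the `memstat` and `broken` types, and the `initCollectors` helper are hypothetical stand-ins introduced only for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sort"
)

// collector is a simplified, hypothetical stand-in for the collector interface.
type collector interface {
	Init(config json.RawMessage) error
}

// memstat is a hypothetical collector that accepts an exclude list.
type memstat struct {
	ExcludeMetrics []string `json:"exclude_metrics"`
}

func (m *memstat) Init(config json.RawMessage) error {
	if len(config) == 0 {
		return nil
	}
	return json.Unmarshal(config, m)
}

// broken is a hypothetical collector whose initialization always fails,
// e.g. because a required file does not exist.
type broken struct{}

func (b *broken) Init(config json.RawMessage) error {
	return errors.New("required file not found")
}

// initCollectors decodes the configuration and returns the sorted names of
// all collectors that are known and initialized successfully.
func initCollectors(raw []byte, available map[string]collector) ([]string, error) {
	var cfg map[string]json.RawMessage
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	active := make([]string, 0, len(cfg))
	for name, collectorCfg := range cfg {
		c, found := available[name]
		if !found {
			continue // unknown collector type: skip
		}
		if err := c.Init(collectorCfg); err != nil {
			continue // failed initialization: excluded from later reads
		}
		active = append(active, name)
	}
	sort.Strings(active)
	return active, nil
}

func main() {
	raw := []byte(`{"memstat": {"exclude_metrics": ["mem_used"]}, "lustrestat": {}, "unknown": {}}`)
	available := map[string]collector{"memstat": &memstat{}, "lustrestat": &broken{}}
	active, err := initCollectors(raw, available)
	fmt.Println(active, err) // prints "[memstat] <nil>"
}
```

Keeping each collector's section as `json.RawMessage` defers parsing to the collector itself, so the top-level reader does not need to know any collector-specific options.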
|
||||
|
||||
## `memstat`
|
||||
# Available collectors
|
||||
|
||||
```json
|
||||
"memstat": {
|
||||
"exclude_metrics": [
|
||||
"mem_used"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `memstat` collector reads data from `/proc/meminfo` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
|
||||
|
||||
|
||||
Metrics:
|
||||
* `mem_total`
|
||||
* `mem_sreclaimable`
|
||||
* `mem_slab`
|
||||
* `mem_free`
|
||||
* `mem_buffers`
|
||||
* `mem_cached`
|
||||
* `mem_available`
|
||||
* `mem_shared`
|
||||
* `swap_total`
|
||||
* `swap_free`
|
||||
* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)
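The `mem_used` formula above can be illustrated with a short sketch. It assumes that `mem_total`, `mem_free`, `mem_buffers` and `mem_cached` correspond to the `MemTotal`, `MemFree`, `Buffers` and `Cached` entries of `/proc/meminfo`. The helper names `parseMeminfo` and `memUsed` are hypothetical and not taken from the collector.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfo parses text in /proc/meminfo format ("Key:   value kB")
// into a map from key to value. Lines that do not parse are skipped.
func parseMeminfo(text string) map[string]int64 {
	stats := make(map[string]int64)
	for _, line := range strings.Split(text, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		value, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[strings.TrimSuffix(fields[0], ":")] = value
	}
	return stats
}

// memUsed applies mem_used = mem_total - (mem_free + mem_buffers + mem_cached).
// The mapping to the /proc/meminfo keys is an assumption of this sketch.
func memUsed(stats map[string]int64) int64 {
	return stats["MemTotal"] - (stats["MemFree"] + stats["Buffers"] + stats["Cached"])
}

func main() {
	sample := "MemTotal:       16000000 kB\n" +
		"MemFree:         4000000 kB\n" +
		"Buffers:          500000 kB\n" +
		"Cached:          3500000 kB\n"
	fmt.Println("mem_used:", memUsed(parseMeminfo(sample)), "kB") // prints "mem_used: 8000000 kB"
}
```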
|
||||
|
||||
## `loadavg`
|
||||
```json
|
||||
"loadavg": {
|
||||
"exclude_metrics": [
|
||||
"proc_run"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
|
||||
|
||||
Metrics:
|
||||
* `load_one`
|
||||
* `load_five`
|
||||
* `load_fifteen`
|
||||
* `proc_run`
|
||||
* `proc_total`
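As a rough illustration of where these five values come from, the sketch below parses one line in `/proc/loadavg` format: three load averages followed by a `running/total` pair. The `loadavg` struct and `parseLoadavg` function are hypothetical names, and the assignment of fields to metric names is an assumption based on the metric list above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// loadavg holds the values of one /proc/loadavg line.
type loadavg struct {
	One, Five, Fifteen float64 // load_one, load_five, load_fifteen
	ProcRun, ProcTotal int64   // proc_run, proc_total
}

// parseLoadavg parses a line such as "0.20 0.18 0.12 1/80 11206".
func parseLoadavg(line string) (loadavg, error) {
	var l loadavg
	fields := strings.Fields(line)
	if len(fields) < 4 {
		return l, fmt.Errorf("unexpected format: %q", line)
	}
	// The first three fields are the 1, 5 and 15 minute load averages.
	targets := []*float64{&l.One, &l.Five, &l.Fifteen}
	for i, target := range targets {
		value, err := strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return l, err
		}
		*target = value
	}
	// The fourth field has the form "running/total".
	procs := strings.SplitN(fields[3], "/", 2)
	if len(procs) != 2 {
		return l, fmt.Errorf("unexpected process field: %q", fields[3])
	}
	var err error
	if l.ProcRun, err = strconv.ParseInt(procs[0], 10, 64); err != nil {
		return l, err
	}
	if l.ProcTotal, err = strconv.ParseInt(procs[1], 10, 64); err != nil {
		return l, err
	}
	return l, nil
}

func main() {
	l, err := parseLoadavg("0.20 0.18 0.12 1/80 11206")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", l)
}
```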
|
||||
|
||||
## `netstat`
|
||||
```json
|
||||
"netstat": {
|
||||
"exclude_devices": [
|
||||
"lo"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `netstat` collector reads data from `/proc/net/dev` and outputs a handful of **node** metrics. If a device is not required, it can be excluded from forwarding it to the sink. Commonly the `lo` device should be excluded.
|
||||
|
||||
Metrics:
|
||||
* `bytes_in`
|
||||
* `bytes_out`
|
||||
* `pkts_in`
|
||||
* `pkts_out`
|
||||
|
||||
The device name is added as tag `device`.
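To make the device exclusion and the four metrics concrete, here is a small sketch that parses one data line of `/proc/net/dev`. In that file, the receive counters come first (bytes, packets, ...) and the transmit counters start at the ninth column. The `netDevStats` type and the `parseNetDevLine` function are hypothetical names, not the collector's own.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// netDevStats holds the counters this sketch extracts for one device.
type netDevStats struct {
	Device   string
	BytesIn  int64 // bytes_in
	PktsIn   int64 // pkts_in
	BytesOut int64 // bytes_out
	PktsOut  int64 // pkts_out
}

// parseNetDevLine parses a "device: counters..." line. It returns false for
// header lines, malformed lines and devices listed in exclude.
func parseNetDevLine(line string, exclude map[string]bool) (netDevStats, bool) {
	parts := strings.SplitN(line, ":", 2)
	if len(parts) != 2 {
		return netDevStats{}, false // header lines contain no colon
	}
	dev := strings.TrimSpace(parts[0])
	if exclude[dev] {
		return netDevStats{}, false
	}
	fields := strings.Fields(parts[1])
	if len(fields) < 10 {
		return netDevStats{}, false
	}
	// Columns: 0 = rx bytes, 1 = rx packets, 8 = tx bytes, 9 = tx packets.
	values := make([]int64, 0, 4)
	for _, idx := range []int{0, 1, 8, 9} {
		value, err := strconv.ParseInt(fields[idx], 10, 64)
		if err != nil {
			return netDevStats{}, false
		}
		values = append(values, value)
	}
	return netDevStats{
		Device:   dev,
		BytesIn:  values[0],
		PktsIn:   values[1],
		BytesOut: values[2],
		PktsOut:  values[3],
	}, true
}

func main() {
	exclude := map[string]bool{"lo": true}
	lines := []string{
		"    lo: 500 5 0 0 0 0 0 0 500 5 0 0 0 0 0 0",
		"  eth0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0",
	}
	for _, line := range lines {
		if stats, ok := parseNetDevLine(line, exclude); ok {
			fmt.Printf("%+v\n", stats)
		}
	}
}
```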
|
||||
|
||||
|
||||
## `diskstat`
|
||||
|
||||
```json
|
||||
"diskstat": {
|
||||
"exclude_metrics": [
|
||||
"read_ms"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `diskstat` collector reads data from `/proc/diskstats` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
|
||||
|
||||
Metrics:
|
||||
* `reads`
|
||||
* `reads_merged`
|
||||
* `read_sectors`
|
||||
* `read_ms`
|
||||
* `writes`
|
||||
* `writes_merged`
|
||||
* `writes_sectors`
|
||||
* `writes_ms`
|
||||
* `ioops`
|
||||
* `ioops_ms`
|
||||
* `ioops_weighted_ms`
|
||||
* `discards`
|
||||
* `discards_merged`
|
||||
* `discards_sectors`
|
||||
* `discards_ms`
|
||||
* `flushes`
|
||||
* `flushes_ms`
|
||||
|
||||
|
||||
The device name is added as tag `device`.
|
||||
|
||||
## `cpustat`
|
||||
```json
|
||||
"cpustat": {
|
||||
"exclude_metrics": [
|
||||
"cpu_idle"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `cpustat` collector reads data from `/proc/stat` and outputs a handful of **node** and **hwthread** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
|
||||
|
||||
Metrics:
|
||||
* `cpu_user`
|
||||
* `cpu_nice`
|
||||
* `cpu_system`
|
||||
* `cpu_idle`
|
||||
* `cpu_iowait`
|
||||
* `cpu_irq`
|
||||
* `cpu_softirq`
|
||||
* `cpu_steal`
|
||||
* `cpu_guest`
|
||||
* `cpu_guest_nice`
|
||||
|
||||
## `likwid`
|
||||
```json
|
||||
"likwid": {
|
||||
"eventsets": [
|
||||
{
|
||||
"events": {
|
||||
"FIXC1": "ACTUAL_CPU_CLOCK",
|
||||
"FIXC2": "MAX_CPU_CLOCK",
|
||||
"PMC0": "RETIRED_INSTRUCTIONS",
|
||||
"PMC1": "CPU_CLOCKS_UNHALTED",
|
||||
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
|
||||
"PMC3": "MERGE",
|
||||
"DFC0": "DRAM_CHANNEL_0",
|
||||
"DFC1": "DRAM_CHANNEL_1",
|
||||
"DFC2": "DRAM_CHANNEL_2",
|
||||
"DFC3": "DRAM_CHANNEL_3"
|
||||
},
|
||||
"metrics": [
|
||||
{
|
||||
"name": "ipc",
|
||||
"calc": "PMC0/PMC1",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "flops_any",
|
||||
"calc": "0.000001*PMC2/time",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "clock_mhz",
|
||||
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "mem1",
|
||||
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
|
||||
"socket_scope": true,
|
||||
"publish": false
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"events": {
|
||||
"DFC0": "DRAM_CHANNEL_4",
|
||||
"DFC1": "DRAM_CHANNEL_5",
|
||||
"DFC2": "DRAM_CHANNEL_6",
|
||||
"DFC3": "DRAM_CHANNEL_7",
|
||||
"PWR0": "RAPL_CORE_ENERGY",
|
||||
"PWR1": "RAPL_PKG_ENERGY"
|
||||
},
|
||||
"metrics": [
|
||||
{
|
||||
"name": "pwr_core",
|
||||
"calc": "PWR0/time",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "pwr_pkg",
|
||||
"calc": "PWR1/time",
|
||||
"socket_scope": true,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "mem2",
|
||||
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
|
||||
"socket_scope": true,
|
||||
"publish": false
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"globalmetrics": [
|
||||
{
|
||||
"name": "mem_bw",
|
||||
"calc": "mem1+mem2",
|
||||
"socket_scope": true,
|
||||
"publish": true
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
_Example config suitable for AMD Zen3_
|
||||
|
||||
The `likwid` collector reads hardware performance counters at the **hwthread** and **socket** level. The configuration looks complicated, but it is essentially copied from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). Earlier iterations of the collector tried to use the performance groups directly, but that approach lacked flexibility. The current configuration format provides the most flexibility.
|
||||
|
||||
The logic is as follows: there are multiple eventsets, each consisting of a list of counters and events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
|
||||
```
|
||||
EVENTSET -> "events": {
|
||||
FIXC1 ACTUAL_CPU_CLOCK -> "FIXC1": "ACTUAL_CPU_CLOCK",
|
||||
FIXC2 MAX_CPU_CLOCK -> "FIXC2": "MAX_CPU_CLOCK",
|
||||
PMC0 RETIRED_INSTRUCTIONS -> "PMC0" : "RETIRED_INSTRUCTIONS",
|
||||
PMC1 CPU_CLOCKS_UNHALTED -> "PMC1" : "CPU_CLOCKS_UNHALTED",
|
||||
PMC2 RETIRED_SSE_AVX_FLOPS_ALL -> "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
|
||||
PMC3 MERGE -> "PMC3": "MERGE",
|
||||
-> }
|
||||
```
|
||||
|
||||
The metrics follow the same procedure:
|
||||
|
||||
```
|
||||
METRICS -> "metrics": [
|
||||
IPC PMC0/PMC1 -> {
|
||||
-> "name" : "IPC",
|
||||
-> "calc" : "PMC0/PMC1",
|
||||
-> "socket_scope": false,
|
||||
-> "publish": true
|
||||
-> }
|
||||
-> ]
|
||||
```
|
||||
|
||||
The `socket_scope` option determines whether a metric is submitted per socket (`true`) or per hwthread (`false`). If a metric is only used for internal calculations, set `"publish": false`.
|
||||
|
||||
Since some metrics can only be gathered in multiple measurements (like the memory bandwidth on AMD Zen3 chips), configure multiple eventsets like in the example config and use the `globalmetrics` section to combine them. **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.
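The structure of this configuration can be mirrored in Go types. The following sketch decodes a reduced version of the example config and lists the metrics that would be published. The JSON keys are taken from the example above, while the type names (`metricConfig`, `eventsetConfig`, `likwidConfig`) and the `publishedMetrics` helper are hypothetical and not necessarily what the collector uses internally.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricConfig describes one derived metric.
type metricConfig struct {
	Name        string `json:"name"`
	Calc        string `json:"calc"`
	SocketScope bool   `json:"socket_scope"`
	Publish     bool   `json:"publish"`
}

// eventsetConfig maps counters to events and lists the metrics derived from them.
type eventsetConfig struct {
	Events  map[string]string `json:"events"`
	Metrics []metricConfig    `json:"metrics"`
}

// likwidConfig is the top-level collector configuration.
type likwidConfig struct {
	Eventsets     []eventsetConfig `json:"eventsets"`
	GlobalMetrics []metricConfig   `json:"globalmetrics"`
}

// publishedMetrics returns the names of all metrics with "publish": true,
// eventset metrics first, then global metrics, in configuration order.
func publishedMetrics(raw []byte) ([]string, error) {
	var cfg likwidConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	names := []string{}
	for _, set := range cfg.Eventsets {
		for _, m := range set.Metrics {
			if m.Publish {
				names = append(names, m.Name)
			}
		}
	}
	for _, m := range cfg.GlobalMetrics {
		if m.Publish {
			names = append(names, m.Name)
		}
	}
	return names, nil
}

func main() {
	raw := []byte(`{
	  "eventsets": [
	    {"events": {"PMC0": "RETIRED_INSTRUCTIONS", "PMC1": "CPU_CLOCKS_UNHALTED"},
	     "metrics": [
	       {"name": "ipc", "calc": "PMC0/PMC1", "socket_scope": false, "publish": true},
	       {"name": "mem1", "calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time", "socket_scope": true, "publish": false}
	     ]}
	  ],
	  "globalmetrics": [
	    {"name": "mem_bw", "calc": "mem1+mem2", "socket_scope": true, "publish": true}
	  ]
	}`)
	names, err := publishedMetrics(raw)
	fmt.Println(names, err) // prints "[ipc mem_bw] <nil>"
}
```

In this reduced example, `mem1` is only an intermediate value for `mem_bw`, so it carries `"publish": false` and does not appear in the output.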
|
||||
* [`cpustat`](./cpustatMetric.md)
* [`memstat`](./memstatMetric.md)
* [`iostat`](./iostatMetric.md)
* [`diskstat`](./diskstatMetric.md)
* [`loadavg`](./loadavgMetric.md)
* [`netstat`](./netstatMetric.md)
* [`ibstat`](./infinibandMetric.md)
* [`ibstat_perfquery`](./infinibandPerfQueryMetric.md)
* [`tempstat`](./tempMetric.md)
* [`lustrestat`](./lustreMetric.md)
* [`likwid`](./likwidMetric.md)
* [`nvidia`](./nvidiaMetric.md)
* [`customcmd`](./customCmdMetric.md)
* [`ipmistat`](./ipmiMetric.md)
* [`topprocs`](./topprocsMetric.md)
* [`nfs3stat`](./nfs3Metric.md)
* [`nfs4stat`](./nfs4Metric.md)
* [`cpufreq`](./cpufreqMetric.md)
* [`cpufreq_cpuinfo`](./cpufreqCpuinfoMetric.md)
* [`numastat`](./numastatMetric.md)
* [`gpfs`](./gpfsMetric.md)
|
||||
|
||||
## Todos
|
||||
|
||||
* [ ] Exclude devices for `diskstat` collector
|
||||
* [ ] Aggregate metrics to a higher topology entity (sum hwthread metrics to socket metric, ...). Needs to be configurable
|
||||
|
||||
# Contributing own collectors
|
||||
A collector reads data from any source, parses it to metrics and submits these metrics to the `metric-collector`. A collector provides the following functions:
|
||||
|
||||
* `Init(config []byte) error`: Initializes the collector using the given collector-specific config in JSON.
|
||||
* `Read(duration time.Duration, out *[]lp.MutableMetric) error`: Read, parse and submit data to the `out` list. If the collector has to measure anything for some duration, use the provided function argument `duration`.
|
||||
* `Name() string`: Return the name of the collector
|
||||
* `Init(config json.RawMessage) error`: Initializes the collector using the given collector-specific config in JSON. Check if needed files/commands exists, ...
|
||||
* `Initialized() bool`: Check if a collector is successfully initialized
|
||||
* `Read(duration time.Duration, output chan ccMetric.CCMetric)`: Read, parse and submit data to the `output` channel as [`CCMetric`](../internal/ccMetric/README.md). If the collector has to measure anything for some duration, use the provided function argument `duration`.
|
||||
* `Close()`: Closes down the collector.
|
||||
|
||||
It is recommended to call `setup()` in the `Init()` function.
|
||||
|
||||
Finally, the collector needs to be registered in the `metric-collector.go`. There is a list of collectors called `Collectors` which is a map (string -> pointer to collector). Add a new entry with a descriptive name and the new collector.
|
||||
Finally, the collector needs to be registered in the `collectorManager.go`. There is a list of collectors called `AvailableCollectors` which is a map (`collector_type_string` -> `pointer to MetricCollector interface`). Add a new entry with a descriptive name and the new collector.
|
||||
|
||||
## Sample collector
|
||||
|
||||
@@ -307,8 +62,9 @@ package collectors
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
"time"
|
||||
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
// Struct for the collector-specific JSON config
|
||||
@@ -317,11 +73,11 @@ type SampleCollectorConfig struct {
|
||||
}
|
||||
|
||||
type SampleCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
config SampleCollectorConfig
|
||||
}
|
||||
|
||||
func (m *SampleCollector) Init(config []byte) error {
|
||||
func (m *SampleCollector) Init(config json.RawMessage) error {
|
||||
// Check if already initialized
|
||||
if m.init {
|
||||
return nil
|
||||
@@ -335,21 +91,28 @@ func (m *SampleCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{"source": m.name, "group": "Sample"}
|
||||
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *SampleCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *SampleCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
// tags for the metric, if type != node use proper type and type-id
|
||||
tags := map[string]string{"type" : "node"}
|
||||
|
||||
x, err := GetMetric()
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, fmt.Sprintf("Read(): %v", err))
|
||||
}
|
||||
|
||||
// Each metric has exactly one field: value !
|
||||
value := map[string]interface{}{"value": int(x)}
|
||||
y, err := lp.New("sample_metric", tags, value, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
value := map[string]interface{}{"value": int64(x)}
|
||||
if y, err := lp.New("sample_metric", tags, m.meta, value, time.Now()); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
|
173
collectors/collectorManager.go
Normal file
@@ -0,0 +1,173 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"os"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
mct "github.com/ClusterCockpit/cc-metric-collector/internal/multiChanTicker"
|
||||
)
|
||||
|
||||
// Map of all available metric collectors
|
||||
var AvailableCollectors = map[string]MetricCollector{
|
||||
|
||||
"likwid": new(LikwidCollector),
|
||||
"loadavg": new(LoadavgCollector),
|
||||
"memstat": new(MemstatCollector),
|
||||
"netstat": new(NetstatCollector),
|
||||
"ibstat": new(InfinibandCollector),
|
||||
"ibstat_perfquery": new(InfinibandPerfQueryCollector),
|
||||
"lustrestat": new(LustreCollector),
|
||||
"cpustat": new(CpustatCollector),
|
||||
"topprocs": new(TopProcsCollector),
|
||||
"nvidia": new(NvidiaCollector),
|
||||
"customcmd": new(CustomCmdCollector),
|
||||
"iostat": new(IOstatCollector),
|
||||
"diskstat": new(DiskstatCollector),
|
||||
"tempstat": new(TempCollector),
|
||||
"ipmistat": new(IpmiCollector),
|
||||
"gpfs": new(GpfsCollector),
|
||||
"cpufreq": new(CPUFreqCollector),
|
||||
"cpufreq_cpuinfo": new(CPUFreqCpuInfoCollector),
|
||||
"nfs3stat": new(Nfs3Collector),
|
||||
"nfs4stat": new(Nfs4Collector),
|
||||
"numastats": new(NUMAStatsCollector),
|
||||
}
|
||||
|
||||
// Metric collector manager data structure
|
||||
type collectorManager struct {
|
||||
collectors []MetricCollector // List of metric collectors to use
|
||||
output chan lp.CCMetric // Output channels
|
||||
done chan bool // channel to finish / stop metric collector manager
|
||||
ticker mct.MultiChanTicker // periodically ticking once each interval
|
||||
duration time.Duration // duration (for metrics that measure over a given duration)
|
||||
wg *sync.WaitGroup // wait group for all goroutines in cc-metric-collector
|
||||
config map[string]json.RawMessage // json encoded config for collector manager
|
||||
}
|
||||
|
||||
// Metric collector manager access functions
|
||||
type CollectorManager interface {
|
||||
Init(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) error
|
||||
AddOutput(output chan lp.CCMetric)
|
||||
Start()
|
||||
Close()
|
||||
}
|
||||
|
||||
// Init initializes a new metric collector manager by setting up:
|
||||
// * output channel
|
||||
// * done channel
|
||||
// * wait group synchronization for goroutines (from variable wg)
|
||||
// * ticker (from variable ticker)
|
||||
// * configuration (read from config file in variable collectConfigFile)
|
||||
// Initialization is done for all configured collectors
|
||||
func (cm *collectorManager) Init(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) error {
|
||||
cm.collectors = make([]MetricCollector, 0)
|
||||
cm.output = nil
|
||||
cm.done = make(chan bool)
|
||||
cm.wg = wg
|
||||
cm.ticker = ticker
|
||||
cm.duration = duration
|
||||
|
||||
// Read collector config file
|
||||
configFile, err := os.Open(collectConfigFile)
|
||||
if err != nil {
|
||||
cclog.Error(err.Error())
|
||||
return err
|
||||
}
|
||||
defer configFile.Close()
|
||||
jsonParser := json.NewDecoder(configFile)
|
||||
err = jsonParser.Decode(&cm.config)
|
||||
if err != nil {
|
||||
cclog.Error(err.Error())
|
||||
return err
|
||||
}
|
||||
|
||||
// Initialize configured collectors
|
||||
for collectorName, collectorCfg := range cm.config {
|
||||
if _, found := AvailableCollectors[collectorName]; !found {
|
||||
cclog.ComponentError("CollectorManager", "SKIP unknown collector", collectorName)
|
||||
continue
|
||||
}
|
||||
collector := AvailableCollectors[collectorName]
|
||||
|
||||
err = collector.Init(collectorCfg)
|
||||
if err != nil {
|
||||
cclog.ComponentError("CollectorManager", "Collector", collectorName, "initialization failed:", err.Error())
|
||||
continue
|
||||
}
|
||||
cclog.ComponentDebug("CollectorManager", "ADD COLLECTOR", collector.Name())
|
||||
cm.collectors = append(cm.collectors, collector)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// Start starts the metric collector manager
|
||||
func (cm *collectorManager) Start() {
|
||||
tick := make(chan time.Time)
|
||||
cm.ticker.AddChannel(tick)
|
||||
|
||||
cm.wg.Add(1)
|
||||
go func() {
|
||||
defer cm.wg.Done()
|
||||
// Collector manager is done
|
||||
done := func() {
|
||||
// close all metric collectors
|
||||
for _, c := range cm.collectors {
|
||||
c.Close()
|
||||
}
|
||||
close(cm.done)
|
||||
cclog.ComponentDebug("CollectorManager", "DONE")
|
||||
}
|
||||
|
||||
// Wait for done signal or timer event
|
||||
for {
|
||||
select {
|
||||
case <-cm.done:
|
||||
done()
|
||||
return
|
||||
case t := <-tick:
|
||||
for _, c := range cm.collectors {
|
||||
// Wait for done signal or execute the collector
|
||||
select {
|
||||
case <-cm.done:
|
||||
done()
|
||||
return
|
||||
default:
|
||||
// Read metrics from collector c
|
||||
cclog.ComponentDebug("CollectorManager", c.Name(), t)
|
||||
c.Read(cm.duration, cm.output)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}()
|
||||
|
||||
// Collector manager is started
|
||||
cclog.ComponentDebug("CollectorManager", "STARTED")
|
||||
}
|
||||
|
||||
// AddOutput adds the output channel to the metric collector manager
|
||||
func (cm *collectorManager) AddOutput(output chan lp.CCMetric) {
|
||||
cm.output = output
|
||||
}
|
||||
|
||||
// Close finishes / stops the metric collector manager
|
||||
func (cm *collectorManager) Close() {
|
||||
cclog.ComponentDebug("CollectorManager", "CLOSE")
|
||||
cm.done <- true
|
||||
// wait for close of channel cm.done
|
||||
<-cm.done
|
||||
}
|
||||
|
||||
// New creates a new initialized metric collector manager
|
||||
func New(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) (CollectorManager, error) {
|
||||
cm := new(collectorManager)
|
||||
err := cm.Init(ticker, duration, wg, collectConfigFile)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return cm, err
|
||||
}
|
@@ -2,14 +2,16 @@ package collectors
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"encoding/json"
|
||||
|
||||
"fmt"
|
||||
"log"
|
||||
"os"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
//
|
||||
@@ -21,41 +23,55 @@ import (
|
||||
type CPUFreqCpuInfoCollectorTopology struct {
|
||||
processor string // logical processor number (continuous, starting at 0)
|
||||
coreID string // socket local core ID
|
||||
coreID_int int
|
||||
coreID_int int64
|
||||
physicalPackageID string // socket / package ID
|
||||
physicalPackageID_int int
|
||||
physicalPackageID_int int64
|
||||
numPhysicalPackages string // number of sockets / packages
|
||||
numPhysicalPackages_int int
|
||||
numPhysicalPackages_int int64
|
||||
isHT bool
|
||||
numNonHT string // number of non hyperthreading processors
|
||||
numNonHT_int int
|
||||
numNonHT_int int64
|
||||
tagSet map[string]string
|
||||
}
|
||||
|
||||
type CPUFreqCpuInfoCollector struct {
|
||||
MetricCollector
|
||||
topology []CPUFreqCpuInfoCollectorTopology
|
||||
metricCollector
|
||||
topology []*CPUFreqCpuInfoCollectorTopology
|
||||
}
|
||||
|
||||
func (m *CPUFreqCpuInfoCollector) Init(config []byte) error {
|
||||
func (m *CPUFreqCpuInfoCollector) Init(config json.RawMessage) error {
|
||||
// Check if already initialized
|
||||
if m.init {
|
||||
return nil
|
||||
}
|
||||
|
||||
m.setup()
|
||||
|
||||
m.name = "CPUFreqCpuInfoCollector"
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "CPU",
|
||||
"unit": "MHz",
|
||||
}
|
||||
|
||||
const cpuInfoFile = "/proc/cpuinfo"
|
||||
file, err := os.Open(cpuInfoFile)
|
||||
if err != nil {
|
||||
return fmt.Errorf("Failed to open '%s': %v", cpuInfoFile, err)
|
||||
return fmt.Errorf("Failed to open file '%s': %v", cpuInfoFile, err)
|
||||
}
|
||||
defer file.Close()
|
||||
|
||||
// Collect topology information from file cpuinfo
|
||||
foundFreq := false
|
||||
processor := ""
|
||||
numNonHT_int := 0
|
||||
var numNonHT_int int64 = 0
|
||||
coreID := ""
|
||||
physicalPackageID := ""
|
||||
maxPhysicalPackageID := 0
|
||||
m.topology = make([]CPUFreqCpuInfoCollectorTopology, 0)
|
||||
var maxPhysicalPackageID int64 = 0
|
||||
m.topology = make([]*CPUFreqCpuInfoCollectorTopology, 0)
|
||||
coreSeenBefore := make(map[string]bool)
|
||||
|
||||
// Read cpuinfo file, line by line
|
||||
scanner := bufio.NewScanner(file)
|
||||
for scanner.Scan() {
|
||||
lineSplit := strings.Split(scanner.Text(), ":")
|
||||
@@ -81,39 +97,41 @@ func (m *CPUFreqCpuInfoCollector) Init(config []byte) error {
|
||||
len(coreID) > 0 &&
|
||||
len(physicalPackageID) > 0 {
|
||||
|
||||
coreID_int, err := strconv.Atoi(coreID)
|
||||
topology := new(CPUFreqCpuInfoCollectorTopology)
|
||||
|
||||
// Processor
|
||||
topology.processor = processor
|
||||
|
||||
// Core ID
|
||||
topology.coreID = coreID
|
||||
topology.coreID_int, err = strconv.ParseInt(coreID, 10, 64)
|
||||
if err != nil {
|
||||
return fmt.Errorf("Unable to convert coreID to int: %v", err)
|
||||
return fmt.Errorf("Unable to convert coreID '%s' to int64: %v", coreID, err)
|
||||
}
|
||||
physicalPackageID_int, err := strconv.Atoi(physicalPackageID)
|
||||
|
||||
// Physical package ID
|
||||
topology.physicalPackageID = physicalPackageID
|
||||
topology.physicalPackageID_int, err = strconv.ParseInt(physicalPackageID, 10, 64)
|
||||
if err != nil {
|
||||
return fmt.Errorf("Unable to convert physicalPackageID to int: %v", err)
|
||||
return fmt.Errorf("Unable to convert physicalPackageID '%s' to int64: %v", physicalPackageID, err)
|
||||
}
|
||||
|
||||
// increase maximum socket / package ID, when required
|
||||
if physicalPackageID_int > maxPhysicalPackageID {
|
||||
maxPhysicalPackageID = physicalPackageID_int
|
||||
if topology.physicalPackageID_int > maxPhysicalPackageID {
|
||||
maxPhysicalPackageID = topology.physicalPackageID_int
|
||||
}
|
||||
|
||||
// is hyperthread?
|
||||
globalID := physicalPackageID + ":" + coreID
|
||||
isHT := coreSeenBefore[globalID]
|
||||
topology.isHT = coreSeenBefore[globalID]
|
||||
coreSeenBefore[globalID] = true
|
||||
if !isHT {
|
||||
if !topology.isHT {
|
||||
// increase number of non-hyperthread cores
|
||||
numNonHT_int++
|
||||
}
|
||||
|
||||
// store collected topology information
|
||||
m.topology = append(
|
||||
m.topology,
|
||||
CPUFreqCpuInfoCollectorTopology{
|
||||
processor: processor,
|
||||
coreID: coreID,
|
||||
coreID_int: coreID_int,
|
||||
physicalPackageID: physicalPackageID,
|
||||
physicalPackageID_int: physicalPackageID_int,
|
||||
isHT: isHT,
|
||||
})
|
||||
m.topology = append(m.topology, topology)
|
||||
|
||||
// reset topology information
|
||||
foundFreq = false
|
||||
@@ -126,18 +144,15 @@ func (m *CPUFreqCpuInfoCollector) Init(config []byte) error {
|
||||
numPhysicalPackageID_int := maxPhysicalPackageID + 1
|
||||
numPhysicalPackageID := fmt.Sprint(numPhysicalPackageID_int)
|
||||
numNonHT := fmt.Sprint(numNonHT_int)
|
||||
for i := range m.topology {
|
||||
t := &m.topology[i]
|
||||
for _, t := range m.topology {
|
||||
t.numPhysicalPackages = numPhysicalPackageID
|
||||
t.numPhysicalPackages_int = numPhysicalPackageID_int
|
||||
t.numNonHT = numNonHT
|
||||
t.numNonHT_int = numNonHT_int
|
||||
t.tagSet = map[string]string{
|
||||
"type": "cpu",
|
||||
"type-id": t.processor,
|
||||
"num_core": t.numNonHT,
|
||||
"package_id": t.physicalPackageID,
|
||||
"num_package": t.numPhysicalPackages,
|
||||
"type": "cpu",
|
||||
"type-id": t.processor,
|
||||
"package_id": t.physicalPackageID,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -145,14 +160,18 @@ func (m *CPUFreqCpuInfoCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
// Check if already initialized
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
|
||||
const cpuInfoFile = "/proc/cpuinfo"
|
||||
file, err := os.Open(cpuInfoFile)
|
||||
if err != nil {
|
||||
log.Printf("Failed to open '%s': %v", cpuInfoFile, err)
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to open file '%s': %v", cpuInfoFile, err))
|
||||
return
|
||||
}
|
||||
defer file.Close()
|
||||
@@ -167,16 +186,17 @@ func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, out *[]lp.Mutable
|
||||
|
||||
// frequency
|
||||
if key == "cpu MHz" {
|
||||
t := &m.topology[processorCounter]
|
||||
t := m.topology[processorCounter]
|
||||
if !t.isHT {
|
||||
value, err := strconv.ParseFloat(strings.TrimSpace(lineSplit[1]), 64)
|
||||
if err != nil {
|
||||
log.Printf("Failed to convert cpu MHz to float: %v", err)
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert cpu MHz '%s' to float64: %v", lineSplit[1], err))
|
||||
return
|
||||
}
|
||||
y, err := lp.New("cpufreq", t.tagSet, map[string]interface{}{"value": value}, now)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
if y, err := lp.New("cpufreq", t.tagSet, m.meta, map[string]interface{}{"value": value}, now); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
processorCounter++
|
||||
|
10
collectors/cpufreqCpuinfoMetric.md
Normal file
@@ -0,0 +1,10 @@

## `cpufreq_cpuinfo` collector
```json
"cpufreq_cpuinfo": {}
```

The `cpufreq_cpuinfo` collector reads the clock frequency from `/proc/cpuinfo` and outputs a handful of **cpu** metrics.

Metrics:
* `cpufreq`
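
As a rough sketch of how the `cpufreq` value can be obtained, the example below scans text in `/proc/cpuinfo` format for `cpu MHz` lines and converts the values to floating point numbers, one per logical processor. The function name `parseCPUMHz` is hypothetical, and the sketch leaves out the topology handling (sockets, hyperthreads) that the collector performs.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUMHz returns the "cpu MHz" values found in cpuinfo text,
// in the order in which the processors are listed.
func parseCPUMHz(cpuinfo string) []float64 {
	var freqs []float64
	for _, line := range strings.Split(cpuinfo, "\n") {
		// Each entry has the form "key<tabs>: value".
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		if strings.TrimSpace(parts[0]) != "cpu MHz" {
			continue
		}
		value, err := strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)
		if err != nil {
			continue // skip values that cannot be converted
		}
		freqs = append(freqs, value)
	}
	return freqs
}

func main() {
	sample := "processor\t: 0\ncpu MHz\t\t: 2400.000\n\n" +
		"processor\t: 1\ncpu MHz\t\t: 1800.500\n"
	fmt.Println(parseCPUMHz(sample)) // prints "[2400 1800.5]"
}
```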
|
@@ -1,48 +1,30 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"log"
|
||||
"os"
|
||||
"io/ioutil"
|
||||
"path/filepath"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
"golang.org/x/sys/unix"
|
||||
)
|
||||
|
||||
//
|
||||
// readOneLine reads one line from a file.
|
||||
// It returns ok when file was successfully read.
|
||||
// In this case text contains the first line of the files contents.
|
||||
//
|
||||
func readOneLine(filename string) (text string, ok bool) {
|
||||
file, err := os.Open(filename)
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
defer file.Close()
|
||||
scanner := bufio.NewScanner(file)
|
||||
ok = scanner.Scan()
|
||||
text = scanner.Text()
|
||||
return
|
||||
}
|
||||
|
||||
type CPUFreqCollectorTopology struct {
|
||||
processor string // logical processor number (continuous, starting at 0)
|
||||
coreID string // socket local core ID
|
||||
coreID_int int
|
||||
coreID_int int64
|
||||
physicalPackageID string // socket / package ID
|
||||
physicalPackageID_int int
|
||||
physicalPackageID_int int64
|
||||
numPhysicalPackages string // number of sockets / packages
|
||||
numPhysicalPackages_int int
|
||||
numPhysicalPackages_int int64
|
||||
isHT bool
|
||||
numNonHT string // number of non hyperthreading processors
|
||||
numNonHT_int int
|
||||
numNonHT_int int64
|
||||
scalingCurFreqFile string
|
||||
tagSet map[string]string
|
||||
}
|
||||
@@ -56,14 +38,19 @@ type CPUFreqCollectorTopology struct {
|
||||
// See: https://www.kernel.org/doc/html/latest/admin-guide/pm/cpufreq.html
|
||||
//
|
||||
type CPUFreqCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
topology []CPUFreqCollectorTopology
|
||||
config struct {
|
||||
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
|
||||
}
|
||||
}
|
||||
|
||||
func (m *CPUFreqCollector) Init(config []byte) error {
|
||||
func (m *CPUFreqCollector) Init(config json.RawMessage) error {
|
||||
// Check if already initialized
|
||||
if m.init {
|
||||
return nil
|
||||
}
|
||||
|
||||
m.name = "CPUFreqCollector"
|
||||
m.setup()
|
||||
if len(config) > 0 {
|
||||
@@ -72,54 +59,61 @@ func (m *CPUFreqCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "CPU",
|
||||
"unit": "MHz",
|
||||
}
|
||||
|
||||
// Loop for all CPU directories
|
||||
baseDir := "/sys/devices/system/cpu"
|
||||
globPattern := filepath.Join(baseDir, "cpu[0-9]*")
|
||||
cpuDirs, err := filepath.Glob(globPattern)
|
||||
if err != nil {
|
||||
return fmt.Errorf("CPUFreqCollector.Init() unable to glob files with pattern %s: %v", globPattern, err)
|
||||
return fmt.Errorf("Unable to glob files with pattern '%s': %v", globPattern, err)
|
||||
}
|
||||
if cpuDirs == nil {
|
||||
return fmt.Errorf("CPUFreqCollector.Init() unable to find any files with pattern %s", globPattern)
|
||||
return fmt.Errorf("Unable to find any files with pattern '%s'", globPattern)
|
||||
}
|
||||
|
||||
// Initialize CPU topology
|
||||
m.topology = make([]CPUFreqCollectorTopology, len(cpuDirs))
|
||||
for _, cpuDir := range cpuDirs {
|
||||
processor := strings.TrimPrefix(cpuDir, "/sys/devices/system/cpu/cpu")
|
||||
processor_int, err := strconv.Atoi(processor)
|
||||
processor_int, err := strconv.ParseInt(processor, 10, 64)
|
||||
if err != nil {
|
||||
return fmt.Errorf("CPUFreqCollector.Init() unable to convert cpuID to int: %v", err)
|
||||
return fmt.Errorf("Unable to convert cpuID '%s' to int64: %v", processor, err)
|
||||
}
|
||||
|
||||
// Read package ID
|
||||
physicalPackageIDFile := filepath.Join(cpuDir, "topology", "physical_package_id")
|
||||
physicalPackageID, ok := readOneLine(physicalPackageIDFile)
|
||||
if !ok {
|
||||
return fmt.Errorf("CPUFreqCollector.Init() unable to read physical package ID from %s", physicalPackageIDFile)
|
||||
}
|
||||
physicalPackageID_int, err := strconv.Atoi(physicalPackageID)
|
||||
line, err := ioutil.ReadFile(physicalPackageIDFile)
|
||||
if err != nil {
|
||||
return fmt.Errorf("CPUFreqCollector.Init() unable to convert packageID to int: %v", err)
|
||||
return fmt.Errorf("Unable to read physical package ID from file '%s': %v", physicalPackageIDFile, err)
|
||||
}
|
||||
physicalPackageID := strings.TrimSpace(string(line))
|
||||
physicalPackageID_int, err := strconv.ParseInt(physicalPackageID, 10, 64)
|
||||
if err != nil {
|
||||
return fmt.Errorf("Unable to convert packageID '%s' to int64: %v", physicalPackageID, err)
|
||||
}
|
||||
|
||||
// Read core ID
|
||||
coreIDFile := filepath.Join(cpuDir, "topology", "core_id")
|
||||
coreID, ok := readOneLine(coreIDFile)
|
||||
if !ok {
|
||||
return fmt.Errorf("CPUFreqCollector.Init() unable to read core ID from %s", coreIDFile)
|
||||
}
|
||||
coreID_int, err := strconv.Atoi(coreID)
|
||||
line, err = ioutil.ReadFile(coreIDFile)
|
||||
if err != nil {
|
||||
return fmt.Errorf("CPUFreqCollector.Init() unable to convert coreID to int: %v", err)
|
||||
return fmt.Errorf("Unable to read core ID from file '%s': %v", coreIDFile, err)
|
||||
}
|
||||
coreID := strings.TrimSpace(string(line))
|
||||
coreID_int, err := strconv.ParseInt(coreID, 10, 64)
|
||||
if err != nil {
|
||||
return fmt.Errorf("Unable to convert coreID '%s' to int64: %v", coreID, err)
|
||||
}
|
||||
|
||||
// Check access to current frequency file
|
||||
scalingCurFreqFile := filepath.Join(cpuDir, "cpufreq", "scaling_cur_freq")
|
||||
err = unix.Access(scalingCurFreqFile, unix.R_OK)
|
||||
if err != nil {
|
||||
return fmt.Errorf("CPUFreqCollector.Init() unable to access %s: %v", scalingCurFreqFile, err)
|
||||
return fmt.Errorf("Unable to access file '%s': %v", scalingCurFreqFile, err)
|
||||
}
|
||||
|
||||
t := &m.topology[processor_int]
|
||||
@@ -142,8 +136,8 @@ func (m *CPUFreqCollector) Init(config []byte) error {
|
||||
}
|
||||
|
||||
// number of non hyper thread cores and packages / sockets
|
||||
numNonHT_int := 0
|
||||
maxPhysicalPackageID := 0
|
||||
var numNonHT_int int64 = 0
|
||||
var maxPhysicalPackageID int64 = 0
|
||||
for i := range m.topology {
|
||||
t := &m.topology[i]
|
||||
|
||||
@@ -167,11 +161,9 @@ func (m *CPUFreqCollector) Init(config []byte) error {
|
||||
t.numNonHT = numNonHT
|
||||
t.numNonHT_int = numNonHT_int
|
||||
t.tagSet = map[string]string{
|
||||
"type": "cpu",
|
||||
"type-id": t.processor,
|
||||
"num_core": t.numNonHT,
|
||||
"package_id": t.physicalPackageID,
|
||||
"num_package": t.numPhysicalPackages,
|
||||
"type": "cpu",
|
||||
"type-id": t.processor,
|
||||
"package_id": t.physicalPackageID,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -179,7 +171,8 @@ func (m *CPUFreqCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *CPUFreqCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *CPUFreqCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
// Check if already initialized
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -194,20 +187,23 @@ func (m *CPUFreqCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
}
|
||||
|
||||
// Read current frequency
|
||||
line, ok := readOneLine(t.scalingCurFreqFile)
|
||||
if !ok {
|
||||
log.Printf("CPUFreqCollector.Read(): Failed to read one line from file '%s'", t.scalingCurFreqFile)
|
||||
line, err := ioutil.ReadFile(t.scalingCurFreqFile)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to read file '%s': %v", t.scalingCurFreqFile, err))
|
||||
continue
|
||||
}
|
||||
cpuFreq, err := strconv.Atoi(line)
|
||||
cpuFreq, err := strconv.ParseInt(strings.TrimSpace(string(line)), 10, 64)
|
||||
if err != nil {
|
||||
log.Printf("CPUFreqCollector.Read(): Failed to convert CPU frequency '%s': %v", line, err)
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert CPU frequency '%s' to int64: %v", line, err))
|
||||
continue
|
||||
}
|
||||
|
||||
y, err := lp.New("cpufreq", t.tagSet, map[string]interface{}{"value": cpuFreq}, now)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
if y, err := lp.New("cpufreq", t.tagSet, m.meta, map[string]interface{}{"value": cpuFreq}, now); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
11
collectors/cpufreqMetric.md
Normal file
@@ -0,0 +1,11 @@
|
||||
## `cpufreq` collector
```json
  "cpufreq": {
    "exclude_metrics": []
  }
```

The `cpufreq` collector reads the clock frequency from `/sys/devices/system/cpu/cpu*/cpufreq` and outputs **cpu** metrics.

Metrics:
* `cpufreq`
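
As an illustration of the read path described above, the following is a minimal sketch of how a single `scaling_cur_freq` file could be read and converted, mirroring the `ReadFile`, `TrimSpace`, and `ParseInt` steps visible in the diff. The function names `parseScalingCurFreq` and `readScalingCurFreq` are hypothetical and not part of the collector.

```go
package main

import (
	"os"
	"strconv"
	"strings"
)

// parseScalingCurFreq converts the raw content of a scaling_cur_freq file
// (one decimal number followed by a newline) into an int64.
// Hypothetical helper, used here for illustration only.
func parseScalingCurFreq(raw []byte) (int64, error) {
	return strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
}

// readScalingCurFreq reads the file at path and parses its content.
// The caller supplies the path, for example
// /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq.
func readScalingCurFreq(path string) (int64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseScalingCurFreq(raw)
}
```

Keeping the parsing separate from the file access makes the conversion easy to check without a real sysfs tree.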
|
@@ -1,14 +1,16 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io/ioutil"
|
||||
"os"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const CPUSTATFILE = `/proc/stat`
|
||||
@@ -18,72 +20,129 @@ type CpustatCollectorConfig struct {
|
||||
}
|
||||
|
||||
type CpustatCollector struct {
|
||||
MetricCollector
|
||||
config CpustatCollectorConfig
|
||||
metricCollector
|
||||
config CpustatCollectorConfig
|
||||
matches map[string]int
|
||||
cputags map[string]map[string]string
|
||||
nodetags map[string]string
|
||||
num_cpus_metric lp.CCMetric
|
||||
}
|
||||
|
||||
func (m *CpustatCollector) Init(config []byte) error {
|
||||
func (m *CpustatCollector) Init(config json.RawMessage) error {
|
||||
m.name = "CpustatCollector"
|
||||
m.setup()
|
||||
m.meta = map[string]string{"source": m.name, "group": "CPU", "unit": "Percent"}
|
||||
m.nodetags = map[string]string{"type": "node"}
|
||||
if len(config) > 0 {
|
||||
err := json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
matches := map[string]int{
|
||||
"cpu_user": 1,
|
||||
"cpu_nice": 2,
|
||||
"cpu_system": 3,
|
||||
"cpu_idle": 4,
|
||||
"cpu_iowait": 5,
|
||||
"cpu_irq": 6,
|
||||
"cpu_softirq": 7,
|
||||
"cpu_steal": 8,
|
||||
"cpu_guest": 9,
|
||||
"cpu_guest_nice": 10,
|
||||
}
|
||||
|
||||
m.matches = make(map[string]int)
|
||||
for match, index := range matches {
|
||||
doExclude := false
|
||||
for _, exclude := range m.config.ExcludeMetrics {
|
||||
if match == exclude {
|
||||
doExclude = true
|
||||
break
|
||||
}
|
||||
}
|
||||
if !doExclude {
|
||||
m.matches[match] = index
|
||||
}
|
||||
}
|
||||
|
||||
// Check input file
|
||||
file, err := os.Open(string(CPUSTATFILE))
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
}
|
||||
defer file.Close()
|
||||
|
||||
// Pre-generate tags for all CPUs
|
||||
num_cpus := 0
|
||||
m.cputags = make(map[string]map[string]string)
|
||||
scanner := bufio.NewScanner(file)
|
||||
for scanner.Scan() {
|
||||
line := scanner.Text()
|
||||
linefields := strings.Fields(line)
|
||||
if strings.HasPrefix(linefields[0], "cpu") && strings.Compare(linefields[0], "cpu") != 0 {
|
||||
cpustr := strings.TrimLeft(linefields[0], "cpu")
|
||||
cpu, _ := strconv.Atoi(cpustr)
|
||||
m.cputags[linefields[0]] = map[string]string{"type": "cpu", "type-id": fmt.Sprintf("%d", cpu)}
|
||||
num_cpus++
|
||||
}
|
||||
}
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func ParseStatLine(line string, cpu int, exclude []string, out *[]lp.MutableMetric) {
|
||||
ls := strings.Fields(line)
|
||||
matches := []string{"", "cpu_user", "cpu_nice", "cpu_system", "cpu_idle", "cpu_iowait", "cpu_irq", "cpu_softirq", "cpu_steal", "cpu_guest", "cpu_guest_nice"}
|
||||
for _, ex := range exclude {
|
||||
matches, _ = RemoveFromStringList(matches, ex)
|
||||
}
|
||||
|
||||
var tags map[string]string
|
||||
if cpu < 0 {
|
||||
tags = map[string]string{"type": "node"}
|
||||
} else {
|
||||
tags = map[string]string{"type": "cpu", "type-id": fmt.Sprintf("%d", cpu)}
|
||||
}
|
||||
for i, m := range matches {
|
||||
if len(m) > 0 {
|
||||
x, err := strconv.ParseInt(ls[i], 0, 64)
|
||||
func (m *CpustatCollector) parseStatLine(linefields []string, tags map[string]string, output chan lp.CCMetric) {
|
||||
values := make(map[string]float64)
|
||||
total := 0.0
|
||||
for match, index := range m.matches {
|
||||
if len(match) > 0 {
|
||||
x, err := strconv.ParseInt(linefields[index], 0, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New(m, tags, map[string]interface{}{"value": int(x)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
values[match] = float64(x)
|
||||
total += values[match]
|
||||
}
|
||||
}
|
||||
}
|
||||
t := time.Now()
|
||||
for name, value := range values {
|
||||
y, err := lp.New(name, tags, m.meta, map[string]interface{}{"value": (value * 100.0) / total}, t)
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func (m *CpustatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *CpustatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
buffer, err := ioutil.ReadFile(string(CPUSTATFILE))
|
||||
|
||||
num_cpus := 0
|
||||
file, err := os.Open(string(CPUSTATFILE))
|
||||
if err != nil {
|
||||
return
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
}
|
||||
defer file.Close()
|
||||
|
||||
scanner := bufio.NewScanner(file)
|
||||
for scanner.Scan() {
|
||||
line := scanner.Text()
|
||||
linefields := strings.Fields(line)
|
||||
if strings.Compare(linefields[0], "cpu") == 0 {
|
||||
m.parseStatLine(linefields, m.nodetags, output)
|
||||
} else if strings.HasPrefix(linefields[0], "cpu") {
|
||||
m.parseStatLine(linefields, m.cputags[linefields[0]], output)
|
||||
num_cpus++
|
||||
}
|
||||
}
|
||||
|
||||
ll := strings.Split(string(buffer), "\n")
|
||||
for _, line := range ll {
|
||||
if len(line) == 0 {
|
||||
continue
|
||||
}
|
||||
ls := strings.Fields(line)
|
||||
if strings.Compare(ls[0], "cpu") == 0 {
|
||||
ParseStatLine(line, -1, m.config.ExcludeMetrics, out)
|
||||
} else if strings.HasPrefix(ls[0], "cpu") {
|
||||
cpustr := strings.TrimLeft(ls[0], "cpu")
|
||||
cpu, _ := strconv.Atoi(cpustr)
|
||||
ParseStatLine(line, cpu, m.config.ExcludeMetrics, out)
|
||||
}
|
||||
num_cpus_metric, err := lp.New("num_cpus",
|
||||
m.nodetags,
|
||||
m.meta,
|
||||
map[string]interface{}{"value": int(num_cpus)},
|
||||
time.Now(),
|
||||
)
|
||||
if err == nil {
|
||||
output <- num_cpus_metric
|
||||
}
|
||||
}
|
||||
|
||||
|
23
collectors/cpustatMetric.md
Normal file
@@ -0,0 +1,23 @@
|
||||
|
||||
## `cpustat` collector
```json
  "cpustat": {
    "exclude_metrics": [
      "cpu_idle"
    ]
  }
```

The `cpustat` collector reads data from `/proc/stat` and outputs a handful of **node** and **hwthread** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.

Metrics:
* `cpu_user`
* `cpu_nice`
* `cpu_system`
* `cpu_idle`
* `cpu_iowait`
* `cpu_irq`
* `cpu_softirq`
* `cpu_steal`
* `cpu_guest`
* `cpu_guest_nice`
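
The `parseStatLine` function in this diff reports each column as a percentage of the summed columns. The following is a minimal sketch of that calculation for one `/proc/stat` line. The names `statFields` and `statPercentages` are hypothetical, and unlike the collector the sketch does not apply `exclude_metrics`, so the total always covers every parsed column.

```go
package main

import (
	"strconv"
	"strings"
)

// statFields lists the columns that follow the "cpu" or "cpuN" label
// in a /proc/stat line, in file order.
var statFields = []string{
	"cpu_user", "cpu_nice", "cpu_system", "cpu_idle", "cpu_iowait",
	"cpu_irq", "cpu_softirq", "cpu_steal", "cpu_guest", "cpu_guest_nice",
}

// statPercentages takes one "cpu ..." line and returns every column as its
// share of the sum of all parsed columns, in percent.
func statPercentages(line string) map[string]float64 {
	fields := strings.Fields(line)
	values := make(map[string]float64)
	total := 0.0
	for i, name := range statFields {
		if i+1 >= len(fields) {
			break // older kernels provide fewer columns
		}
		x, err := strconv.ParseInt(fields[i+1], 10, 64)
		if err != nil {
			continue // skip columns that are not integers
		}
		values[name] = float64(x)
		total += float64(x)
	}
	if total == 0 {
		return values // avoid a division by zero
	}
	for name, v := range values {
		values[name] = v * 100.0 / total
	}
	return values
}
```

For the line `cpu 10 0 10 80 0 0 0 0 0 0` the sketch reports 10 percent for `cpu_user` and 80 percent for `cpu_idle`.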
|
@@ -9,7 +9,13 @@ import (
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
<<<<<<< HEAD
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
=======
|
||||
ccmetric "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
influx "github.com/influxdata/line-protocol"
|
||||
>>>>>>> develop
|
||||
)
|
||||
|
||||
const CUSTOMCMDPATH = `/home/unrz139/Work/cc-metric-collector/collectors/custom`
|
||||
@@ -21,17 +27,18 @@ type CustomCmdCollectorConfig struct {
|
||||
}
|
||||
|
||||
type CustomCmdCollector struct {
|
||||
MetricCollector
|
||||
handler *lp.MetricHandler
|
||||
parser *lp.Parser
|
||||
metricCollector
|
||||
handler *influx.MetricHandler
|
||||
parser *influx.Parser
|
||||
config CustomCmdCollectorConfig
|
||||
commands []string
|
||||
files []string
|
||||
}
|
||||
|
||||
func (m *CustomCmdCollector) Init(config []byte) error {
|
||||
func (m *CustomCmdCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "CustomCmdCollector"
|
||||
m.meta = map[string]string{"source": m.name, "group": "Custom"}
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
@@ -61,8 +68,8 @@ func (m *CustomCmdCollector) Init(config []byte) error {
|
||||
if len(m.files) == 0 && len(m.commands) == 0 {
|
||||
return errors.New("No metrics to collect")
|
||||
}
|
||||
m.handler = lp.NewMetricHandler()
|
||||
m.parser = lp.NewParser(m.handler)
|
||||
m.handler = influx.NewMetricHandler()
|
||||
m.parser = influx.NewParser(m.handler)
|
||||
m.parser.SetTimeFunc(DefaultTime)
|
||||
m.init = true
|
||||
return nil
|
||||
@@ -72,7 +79,7 @@ var DefaultTime = func() time.Time {
|
||||
return time.Unix(42, 0)
|
||||
}
|
||||
|
||||
func (m *CustomCmdCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *CustomCmdCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -95,9 +102,10 @@ func (m *CustomCmdCollector) Read(interval time.Duration, out *[]lp.MutableMetri
|
||||
if skip {
|
||||
continue
|
||||
}
|
||||
y, err := lp.New(c.Name(), Tags2Map(c), Fields2Map(c), c.Time())
|
||||
|
||||
y := ccmetric.FromInfluxMetric(c)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -117,9 +125,9 @@ func (m *CustomCmdCollector) Read(interval time.Duration, out *[]lp.MutableMetri
|
||||
if skip {
|
||||
continue
|
||||
}
|
||||
y, err := lp.New(f.Name(), Tags2Map(f), Fields2Map(f), f.Time())
|
||||
y := ccmetric.FromInfluxMetric(f)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
20
collectors/customCmdMetric.md
Normal file
@@ -0,0 +1,20 @@
|
||||
|
||||
## `customcmd` collector

```json
  "customcmd": {
    "exclude_metrics": [
      "mymetric"
    ],
    "files" : [
      "/var/run/myapp.metrics"
    ],
    "commands" : [
      "/usr/local/bin/getmetrics.pl"
    ]
  }
```

The `customcmd` collector reads data from files and from the output of executed commands. The files and commands can output multiple metrics (separated by newlines), but they have to be in the [InfluxDB line protocol](https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/). If a metric is not parsable, it is skipped. If a metric is not required, it can be excluded from being forwarded to the sink.
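
To illustrate the expected input, the following is a minimal sketch of how a command or file producer could format one metric as a line-protocol line with a single `value` field. The function `formatLine` is hypothetical and not part of the collector, and the sketch assumes names and tag values without spaces, commas, or equals signs, so it performs no escaping.

```go
package main

import (
	"sort"
	"strconv"
	"strings"
)

// formatLine renders one metric as
//   <name>,<tag>=<val>,... value=<v> <timestamp in ns>
// Tags are sorted by key so that the output is deterministic.
// No escaping is done; this is an illustration, not a full encoder.
func formatLine(name string, tags map[string]string, value float64, unixNano int64) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(name)
	for _, k := range keys {
		b.WriteString("," + k + "=" + tags[k])
	}
	b.WriteString(" value=" + strconv.FormatFloat(value, 'f', -1, 64))
	b.WriteString(" " + strconv.FormatInt(unixNano, 10))
	return b.String()
}
```

A producer would print one such line per metric, each terminated by a newline.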
|
||||
|
||||
|
@@ -1,113 +1,111 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"io/ioutil"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
|
||||
// "log"
|
||||
"bufio"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"strconv"
|
||||
"fmt"
|
||||
"os"
|
||||
"strings"
|
||||
"syscall"
|
||||
"time"
|
||||
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const DISKSTATFILE = `/proc/diskstats`
|
||||
const DISKSTAT_SYSFSPATH = `/sys/block`
|
||||
// "log"
|
||||
|
||||
const MOUNTFILE = `/proc/self/mounts`
|
||||
|
||||
type DiskstatCollectorConfig struct {
|
||||
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
|
||||
}
|
||||
|
||||
type DiskstatCollector struct {
|
||||
MetricCollector
|
||||
matches map[int]string
|
||||
config DiskstatCollectorConfig
|
||||
metricCollector
|
||||
//matches map[string]int
|
||||
config IOstatCollectorConfig
|
||||
//devices map[string]IOstatCollectorEntry
|
||||
}
|
||||
|
||||
func (m *DiskstatCollector) Init(config []byte) error {
|
||||
var err error
|
||||
func (m *DiskstatCollector) Init(config json.RawMessage) error {
|
||||
m.name = "DiskstatCollector"
|
||||
m.meta = map[string]string{"source": m.name, "group": "Disk"}
|
||||
m.setup()
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
err := json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
// https://www.kernel.org/doc/html/latest/admin-guide/iostats.html
|
||||
matches := map[int]string{
|
||||
3: "reads",
|
||||
4: "reads_merged",
|
||||
5: "read_sectors",
|
||||
6: "read_ms",
|
||||
7: "writes",
|
||||
8: "writes_merged",
|
||||
9: "writes_sectors",
|
||||
10: "writes_ms",
|
||||
11: "ioops",
|
||||
12: "ioops_ms",
|
||||
13: "ioops_weighted_ms",
|
||||
14: "discards",
|
||||
15: "discards_merged",
|
||||
16: "discards_sectors",
|
||||
17: "discards_ms",
|
||||
18: "flushes",
|
||||
19: "flushes_ms",
|
||||
file, err := os.Open(string(MOUNTFILE))
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
return err
|
||||
}
|
||||
m.matches = make(map[int]string)
|
||||
for k, v := range matches {
|
||||
_, skip := stringArrayContains(m.config.ExcludeMetrics, v)
|
||||
if !skip {
|
||||
m.matches[k] = v
|
||||
}
|
||||
}
|
||||
if len(m.matches) == 0 {
|
||||
return errors.New("No metrics to collect")
|
||||
}
|
||||
_, err = ioutil.ReadFile(string(DISKSTATFILE))
|
||||
if err == nil {
|
||||
m.init = true
|
||||
}
|
||||
return err
|
||||
defer file.Close()
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *DiskstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
var lines []string
|
||||
func (m *DiskstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
|
||||
buffer, err := ioutil.ReadFile(string(DISKSTATFILE))
|
||||
file, err := os.Open(string(MOUNTFILE))
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
return
|
||||
}
|
||||
lines = strings.Split(string(buffer), "\n")
|
||||
defer file.Close()
|
||||
|
||||
for _, line := range lines {
|
||||
part_max_used := uint64(0)
|
||||
scanner := bufio.NewScanner(file)
|
||||
for scanner.Scan() {
|
||||
line := scanner.Text()
|
||||
if len(line) == 0 {
|
||||
continue
|
||||
}
|
||||
f := strings.Fields(line)
|
||||
if strings.Contains(f[2], "loop") {
|
||||
if !strings.HasPrefix(line, "/dev") {
|
||||
continue
|
||||
}
|
||||
tags := map[string]string{
|
||||
"device": f[2],
|
||||
"type": "node",
|
||||
linefields := strings.Fields(line)
|
||||
if strings.Contains(linefields[0], "loop") {
|
||||
continue
|
||||
}
|
||||
for idx, name := range m.matches {
|
||||
if idx < len(f) {
|
||||
x, err := strconv.ParseInt(f[idx], 0, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New(name, tags, map[string]interface{}{"value": int(x)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
}
|
||||
}
|
||||
if strings.Contains(linefields[1], "boot") {
|
||||
continue
|
||||
}
|
||||
path := strings.Replace(linefields[1], `\040`, " ", -1)
|
||||
stat := syscall.Statfs_t{}
|
||||
err := syscall.Statfs(path, &stat)
|
||||
if err != nil {
|
||||
fmt.Println(err.Error())
|
||||
return
|
||||
}
|
||||
tags := map[string]string{"type": "node", "device": linefields[0]}
|
||||
total := (stat.Blocks * uint64(stat.Bsize)) / uint64(1000000000)
|
||||
y, err := lp.New("disk_total", tags, m.meta, map[string]interface{}{"value": total}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "GBytes")
|
||||
output <- y
|
||||
}
|
||||
free := (stat.Bfree * uint64(stat.Bsize)) / uint64(1000000000)
|
||||
y, err = lp.New("disk_free", tags, m.meta, map[string]interface{}{"value": free}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "GBytes")
|
||||
output <- y
|
||||
}
|
||||
perc := (100 * (total - free)) / total
|
||||
if perc > part_max_used {
|
||||
part_max_used = perc
|
||||
}
|
||||
}
|
||||
y, err := lp.New("part_max_used", map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": part_max_used}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "percent")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
|
21
collectors/diskstatMetric.md
Normal file
@@ -0,0 +1,21 @@
|
||||
|
||||
## `diskstat` collector

```json
  "diskstat": {
    "exclude_metrics": [
      "disk_total"
    ]
  }
```

The `diskstat` collector reads data from `/proc/self/mounts` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.

Metrics per device (with `device` tag):
* `disk_total` (unit `GBytes`)
* `disk_free` (unit `GBytes`)

Global metrics:
* `part_max_used` (unit `percent`)
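
The three metrics can be derived from the block counts that `statfs` reports for a mount point. The following is a minimal sketch of that arithmetic, following the integer calculation shown in the diff. The function `diskUsage` is hypothetical, and the check for a zero total is an addition of this sketch, since an integer division by zero would otherwise panic for file systems smaller than one GByte.

```go
package main

// bytesPerGByte uses the decimal definition (10^9 bytes), as in the diff.
const bytesPerGByte = 1000000000

// diskUsage converts statfs-style counters into whole GBytes and a used
// percentage. blocks and bfree are block counts, bsize is the block size
// in bytes. Hypothetical helper, used here for illustration only.
func diskUsage(blocks, bfree, bsize uint64) (total, free, usedPercent uint64) {
	total = blocks * bsize / bytesPerGByte
	free = bfree * bsize / bytesPerGByte
	if total > 0 {
		usedPercent = 100 * (total - free) / total
	}
	return total, free, usedPercent
}
```

`part_max_used` is then the maximum of `usedPercent` over all considered mount points.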
|
||||
|
||||
|
@@ -7,24 +7,32 @@ import (
|
||||
"fmt"
|
||||
"io/ioutil"
|
||||
"log"
|
||||
"os"
|
||||
"os/exec"
|
||||
"os/user"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
type GpfsCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
config struct {
|
||||
Mmpmon string `json:"mmpmon"`
|
||||
Mmpmon string `json:"mmpmon_path,omitempty"`
|
||||
ExcludeFilesystem []string `json:"exclude_filesystem,omitempty"`
|
||||
}
|
||||
skipFS map[string]struct{}
|
||||
}
|
||||
|
||||
func (m *GpfsCollector) Init(config []byte) error {
|
||||
func (m *GpfsCollector) Init(config json.RawMessage) error {
|
||||
// Check if already initialized
|
||||
if m.init {
|
||||
return nil
|
||||
}
|
||||
|
||||
var err error
|
||||
m.name = "GpfsCollector"
|
||||
m.setup()
|
||||
@@ -40,27 +48,40 @@ func (m *GpfsCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "GPFS",
|
||||
}
|
||||
m.tags = map[string]string{
|
||||
"type": "node",
|
||||
"filesystem": "",
|
||||
}
|
||||
m.skipFS = make(map[string]struct{})
|
||||
for _, fs := range m.config.ExcludeFilesystem {
|
||||
m.skipFS[fs] = struct{}{}
|
||||
}
|
||||
|
||||
// GPFS / IBM Spectrum Scale file system statistics can only be queried by user root
|
||||
user, err := user.Current()
|
||||
if err != nil {
|
||||
return fmt.Errorf("GpfsCollector.Init(): Failed to get current user: %v", err)
|
||||
return fmt.Errorf("Failed to get current user: %v", err)
|
||||
}
|
||||
if user.Uid != "0" {
|
||||
return fmt.Errorf("GpfsCollector.Init(): GPFS file system statistics can only be queried by user root")
|
||||
return fmt.Errorf("GPFS file system statistics can only be queried by user root")
|
||||
}
|
||||
|
||||
// Check if mmpmon is in executable search path
|
||||
_, err = exec.LookPath(m.config.Mmpmon)
|
||||
if err != nil {
|
||||
return fmt.Errorf("GpfsCollector.Init(): Failed to find mmpmon binary '%s': %v", m.config.Mmpmon, err)
|
||||
return fmt.Errorf("Failed to find mmpmon binary '%s': %v", m.config.Mmpmon, err)
|
||||
}
|
||||
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *GpfsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
// Check if already initialized
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -77,12 +98,15 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
cmd.Stderr = cmdStderr
|
||||
err := cmd.Run()
|
||||
if err != nil {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to execute command \"%s\": %s\n", cmd.String(), err.Error())
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): command exit code: \"%d\"\n", cmd.ProcessState.ExitCode())
|
||||
data, _ := ioutil.ReadAll(cmdStderr)
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): command stderr: \"%s\"\n", string(data))
|
||||
data, _ = ioutil.ReadAll(cmdStdout)
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): command stdout: \"%s\"\n", string(data))
|
||||
dataStdErr, _ := ioutil.ReadAll(cmdStderr)
|
||||
dataStdOut, _ := ioutil.ReadAll(cmdStdout)
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to execute command \"%s\": %v\n", cmd.String(), err),
|
||||
fmt.Sprintf("Read(): command exit code: \"%d\"\n", cmd.ProcessState.ExitCode()),
|
||||
fmt.Sprintf("Read(): command stderr: \"%s\"\n", string(dataStdErr)),
|
||||
fmt.Sprintf("Read(): command stdout: \"%s\"\n", string(dataStdOut)),
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
@@ -90,194 +114,163 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
scanner := bufio.NewScanner(cmdStdout)
|
||||
for scanner.Scan() {
|
||||
lineSplit := strings.Fields(scanner.Text())
|
||||
if lineSplit[0] == "_fs_io_s_" {
|
||||
key_value := make(map[string]string)
|
||||
for i := 1; i < len(lineSplit); i += 2 {
|
||||
key_value[lineSplit[i]] = lineSplit[i+1]
|
||||
}
|
||||
|
||||
// Ignore keys:
|
||||
// _n_: node IP address,
|
||||
// _nn_: node name,
|
||||
// _cl_: cluster name,
|
||||
// _d_: number of disks
|
||||
// Only process lines starting with _fs_io_s_
|
||||
if lineSplit[0] != "_fs_io_s_" {
|
||||
continue
|
||||
}
|
||||
|
||||
filesystem, ok := key_value["_fs_"]
|
||||
if !ok {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to get filesystem name.\n")
|
||||
continue
|
||||
}
|
||||
key_value := make(map[string]string)
|
||||
for i := 1; i < len(lineSplit); i += 2 {
|
||||
key_value[lineSplit[i]] = lineSplit[i+1]
|
||||
}
|
||||
|
||||
tagList := map[string]string{
|
||||
"type": "node",
|
||||
"filesystem": filesystem,
|
||||
}
|
||||
// Ignore keys:
|
||||
// _n_: node IP address,
|
||||
// _nn_: node name,
|
||||
// _cl_: cluster name,
|
||||
// _d_: number of disks
|
||||
|
||||
// return code
|
||||
rc, err := strconv.Atoi(key_value["_rc_"])
|
||||
if err != nil {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert return code: %s\n", err.Error())
|
||||
continue
|
||||
}
|
||||
if rc != 0 {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Filesystem %s not ok.", filesystem)
|
||||
continue
|
||||
}
|
||||
filesystem, ok := key_value["_fs_"]
|
||||
if !ok {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
"Read(): Failed to get filesystem name.")
|
||||
continue
|
||||
}
|
||||
|
||||
/* requires go 1.17
|
||||
// unix epoch in microseconds
|
||||
timestampInt, err := strconv.ParseInt(key_value["_t_"]+key_value["_tu_"], 10, 64)
|
||||
timestamp := time.UnixMicro(timestampInt)
|
||||
if err != nil {
|
||||
fmt.Fprintf(os.Stderr,
|
||||
"GpfsCollector.Read(): Failed to convert time stamp '%s': %s\n",
|
||||
key_value["_t_"]+key_value["_tu_"], err.Error())
|
||||
continue
|
||||
}
|
||||
*/
|
||||
timestamp := time.Now()
|
||||
// Skip excluded filesystems
|
||||
if _, skip := m.skipFS[filesystem]; skip {
|
||||
continue
|
||||
}
|
||||
|
||||
// bytes read
|
||||
bytesRead, err := strconv.ParseInt(key_value["_br_"], 10, 64)
|
||||
if err != nil {
|
||||
fmt.Fprintf(os.Stderr,
|
||||
"GpfsCollector.Read(): Failed to convert bytes read '%s': %s\n",
|
||||
key_value["_br_"], err.Error())
|
||||
continue
|
||||
}
|
||||
y, err := lp.New(
|
||||
"gpfs_bytes_read",
|
||||
tagList,
|
||||
map[string]interface{}{
|
||||
"value": bytesRead,
|
||||
},
|
||||
timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
m.tags["filesystem"] = filesystem
|
||||
|
||||
// bytes written
|
||||
bytesWritten, err := strconv.ParseInt(key_value["_bw_"], 10, 64)
|
||||
if err != nil {
|
||||
fmt.Fprintf(os.Stderr,
|
||||
"GpfsCollector.Read(): Failed to convert bytes written '%s': %s\n",
|
||||
key_value["_bw_"], err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_bytes_written",
|
||||
tagList,
|
||||
map[string]interface{}{
|
||||
"value": bytesWritten,
|
||||
},
|
||||
timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
// return code
|
||||
rc, err := strconv.Atoi(key_value["_rc_"])
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert return code '%s' to int: %v", key_value["_rc_"], err))
|
||||
continue
|
||||
}
|
||||
if rc != 0 {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Filesystem '%s' is not ok.", filesystem))
|
||||
continue
|
||||
}
|
||||
|
||||
// number of opens
|
||||
numOpens, err := strconv.ParseInt(key_value["_oc_"], 10, 64)
|
||||
if err != nil {
|
||||
fmt.Fprintf(os.Stderr,
|
||||
"GpfsCollector.Read(): Failed to convert number of opens '%s': %s\n",
|
||||
key_value["_oc_"], err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_opens",
|
||||
tagList,
|
||||
map[string]interface{}{
|
||||
"value": numOpens,
|
||||
},
|
||||
timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
sec, err := strconv.ParseInt(key_value["_t_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert seconds '%s' to int64: %v", key_value["_t_"], err))
|
||||
continue
|
||||
}
|
||||
msec, err := strconv.ParseInt(key_value["_tu_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert micro seconds '%s' to int64: %v", key_value["_tu_"], err))
|
||||
continue
|
||||
}
|
||||
timestamp := time.Unix(sec, msec*1000)
|
||||
|
||||
// number of closes
|
||||
numCloses, err := strconv.ParseInt(key_value["_cc_"], 10, 64)
|
||||
if err != nil {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of closes: %s\n", err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_closes",
|
||||
tagList,
|
||||
map[string]interface{}{
|
||||
"value": numCloses,
|
||||
},
|
||||
timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
// bytes read
|
||||
bytesRead, err := strconv.ParseInt(key_value["_br_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert bytes read '%s' to int64: %v", key_value["_br_"], err))
|
||||
continue
|
||||
}
|
||||
if y, err := lp.New("gpfs_bytes_read", m.tags, m.meta, map[string]interface{}{"value": bytesRead}, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
// number of reads
|
||||
numReads, err := strconv.ParseInt(key_value["_rdc_"], 10, 64)
|
||||
if err != nil {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of reads: %s\n", err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_reads",
|
||||
tagList,
|
||||
map[string]interface{}{
|
||||
"value": numReads,
|
||||
},
|
||||
timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
// bytes written
|
||||
bytesWritten, err := strconv.ParseInt(key_value["_bw_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert bytes written '%s' to int64: %v", key_value["_bw_"], err))
|
||||
continue
|
||||
}
|
||||
if y, err := lp.New("gpfs_bytes_written", m.tags, m.meta, map[string]interface{}{"value": bytesWritten}, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
// number of writes
|
||||
numWrites, err := strconv.ParseInt(key_value["_wc_"], 10, 64)
|
||||
if err != nil {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of writes: %s\n", err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_writes",
|
||||
tagList,
|
||||
map[string]interface{}{
|
||||
"value": numWrites,
|
||||
},
|
||||
timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
// number of opens
|
||||
numOpens, err := strconv.ParseInt(key_value["_oc_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert number of opens '%s' to int64: %v", key_value["_oc_"], err))
|
||||
continue
|
||||
}
|
||||
if y, err := lp.New("gpfs_num_opens", m.tags, m.meta, map[string]interface{}{"value": numOpens}, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
// number of read directories
|
||||
numReaddirs, err := strconv.ParseInt(key_value["_dir_"], 10, 64)
|
||||
if err != nil {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of read directories: %s\n", err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_readdirs",
|
||||
tagList,
|
||||
map[string]interface{}{
|
||||
"value": numReaddirs,
|
||||
},
|
||||
timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
// number of closes
|
||||
numCloses, err := strconv.ParseInt(key_value["_cc_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert number of closes: '%s' to int64: %v", key_value["_cc_"], err))
|
||||
continue
|
||||
}
|
||||
if y, err := lp.New("gpfs_num_closes", m.tags, m.meta, map[string]interface{}{"value": numCloses}, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
// Number of inode updates
|
||||
numInodeUpdates, err := strconv.ParseInt(key_value["_iu_"], 10, 64)
|
||||
if err != nil {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert Number of inode updates: %s\n", err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_inode_updates",
|
||||
tagList,
|
||||
map[string]interface{}{
|
||||
"value": numInodeUpdates,
|
||||
},
|
||||
timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
// number of reads
|
||||
numReads, err := strconv.ParseInt(key_value["_rdc_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert number of reads: '%s' to int64: %v", key_value["_rdc_"], err))
|
||||
continue
|
||||
}
|
||||
if y, err := lp.New("gpfs_num_reads", m.tags, m.meta, map[string]interface{}{"value": numReads}, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
// number of writes
|
||||
numWrites, err := strconv.ParseInt(key_value["_wc_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert number of writes: '%s' to int64: %v", key_value["_wc_"], err))
|
||||
continue
|
||||
}
|
||||
if y, err := lp.New("gpfs_num_writes", m.tags, m.meta, map[string]interface{}{"value": numWrites}, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
// number of read directories
|
||||
numReaddirs, err := strconv.ParseInt(key_value["_dir_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert number of read directories: '%s' to int64: %v", key_value["_dir_"], err))
|
||||
continue
|
||||
}
|
||||
if y, err := lp.New("gpfs_num_readdirs", m.tags, m.meta, map[string]interface{}{"value": numReaddirs}, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
// Number of inode updates
|
||||
numInodeUpdates, err := strconv.ParseInt(key_value["_iu_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert number of inode updates: '%s' to int64: %v", key_value["_iu_"], err))
|
||||
continue
|
||||
}
|
||||
if y, err := lp.New("gpfs_num_inode_updates", m.tags, m.meta, map[string]interface{}{"value": numInodeUpdates}, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
30
collectors/gpfsMetric.md
Normal file
@@ -0,0 +1,30 @@
|
||||
## `gpfs` collector
|
||||
|
||||
```json
|
||||
"gpfs": {
|
||||
"mmpmon_path": "/path/to/mmpmon",
|
||||
"exclude_filesystem": [
|
||||
"fs1"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `gpfs` collector uses the `mmpmon` command to read performance metrics for
|
||||
GPFS / IBM Spectrum Scale filesystems.
|
||||
|
||||
The reported filesystems can be filtered with the `exclude_filesystem` option
|
||||
in the configuration.
|
||||
|
||||
The path to the `mmpmon` command can be configured with the `mmpmon_path` option
|
||||
in the configuration.
|
||||
|
||||
Metrics:
|
||||
* `gpfs_bytes_read`
|
||||
* `gpfs_bytes_written`
|
||||
* `gpfs_num_opens`
|
||||
* `gpfs_num_closes`
|
||||
* `gpfs_num_reads`
|
||||
* `gpfs_num_writes`
|
||||
* `gpfs_num_readdirs`
|
||||
* `gpfs_num_inode_updates`
|
||||
|
||||
The collector adds a `filesystem` tag to all metrics.
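
The following is a minimal sketch, not the collector's actual implementation, of how one record could be turned into a key/value map and a timestamp. It assumes that a record starts with a marker field followed by alternating key/value fields; the helper names `parseKeyValue` and `recordTime` and the sample line are hypothetical. The `_t_` (seconds) and `_tu_` (microseconds) keys are the ones the collector code reads.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseKeyValue splits a whitespace-separated record into a key -> value map.
// Assumption: the first field is a record marker, followed by alternating
// key/value fields. A trailing key without a value is ignored.
func parseKeyValue(line string) map[string]string {
	fields := strings.Fields(line)
	kv := make(map[string]string)
	for i := 1; i+1 < len(fields); i += 2 {
		kv[fields[i]] = fields[i+1]
	}
	return kv
}

// recordTime builds a timestamp from the seconds (_t_) and
// microseconds (_tu_) fields of a record.
func recordTime(kv map[string]string) (time.Time, error) {
	sec, err := strconv.ParseInt(kv["_t_"], 10, 64)
	if err != nil {
		return time.Time{}, fmt.Errorf("failed to convert seconds '%s' to int64: %v", kv["_t_"], err)
	}
	usec, err := strconv.ParseInt(kv["_tu_"], 10, 64)
	if err != nil {
		return time.Time{}, fmt.Errorf("failed to convert microseconds '%s' to int64: %v", kv["_tu_"], err)
	}
	// time.Unix expects nanoseconds as its second argument.
	return time.Unix(sec, usec*1000), nil
}

func main() {
	// Hypothetical sample record for illustration only.
	line := "_fs_io_s_ _rc_ 0 _t_ 1640000000 _tu_ 250000 _br_ 4096"
	kv := parseKeyValue(line)
	ts, err := recordTime(kv)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(kv["_br_"], ts.Unix(), ts.Nanosecond())
}
```

Returning an error instead of logging inside the helper keeps the decision to skip a record with the caller, which matches the `continue` pattern used in the collector's read loop.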
|
@@ -3,282 +3,168 @@ package collectors
|
||||
import (
|
||||
"fmt"
|
||||
"io/ioutil"
|
||||
"log"
|
||||
"os/exec"
|
||||
"os"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
"golang.org/x/sys/unix"
|
||||
|
||||
// "os"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"path/filepath"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
const (
|
||||
IBBASEPATH = `/sys/class/infiniband/`
|
||||
PERFQUERY = `/usr/sbin/perfquery`
|
||||
)
|
||||
const IB_BASEPATH = `/sys/class/infiniband/`
|
||||
|
||||
type InfinibandCollectorConfig struct {
|
||||
ExcludeDevices []string `json:"exclude_devices,omitempty"`
|
||||
PerfQueryPath string `json:"perfquery_path"`
|
||||
type InfinibandCollectorInfo struct {
|
||||
LID string // IB local Identifier (LID)
|
||||
device string // IB device
|
||||
port string // IB device port
|
||||
portCounterFiles map[string]string // mapping counter name -> sysfs file
|
||||
tagSet map[string]string // corresponding tag list
|
||||
}
|
||||
|
||||
type InfinibandCollector struct {
|
||||
MetricCollector
|
||||
tags map[string]string
|
||||
lids map[string]map[string]string
|
||||
config InfinibandCollectorConfig
|
||||
use_perfquery bool
|
||||
metricCollector
|
||||
config struct {
|
||||
ExcludeDevices []string `json:"exclude_devices,omitempty"` // IB device to exclude e.g. mlx5_0
|
||||
}
|
||||
info []*InfinibandCollectorInfo
|
||||
}
|
||||
|
||||
func (m *InfinibandCollector) Help() {
|
||||
fmt.Println("This collector includes all devices that can be found below ", IBBASEPATH)
|
||||
fmt.Println("and where any of the ports provides a 'lid' file (glob ", IBBASEPATH, "/<dev>/ports/<port>/lid).")
|
||||
fmt.Println("The devices can be filtered with the 'exclude_devices' option in the configuration.")
|
||||
fmt.Println("For each found LIDs the collector calls the 'perfquery' command")
|
||||
fmt.Println("The path to the 'perfquery' command can be configured with the 'perfquery_path' option")
|
||||
fmt.Println("in the configuration")
|
||||
fmt.Println("")
|
||||
fmt.Println("Full configuration object:")
|
||||
fmt.Println("\"ibstat\" : {")
|
||||
fmt.Println(" \"perfquery_path\" : \"path/to/perfquery\" # if omitted, it searches in $PATH")
|
||||
fmt.Println(" \"exclude_devices\" : [\"dev1\"]")
|
||||
fmt.Println("}")
|
||||
fmt.Println("")
|
||||
fmt.Println("Metrics:")
|
||||
fmt.Println("- ib_recv")
|
||||
fmt.Println("- ib_xmit")
|
||||
fmt.Println("- ib_recv_pkts")
|
||||
fmt.Println("- ib_xmit_pkts")
|
||||
}
|
||||
// Init initializes the Infiniband collector by walking through files below IB_BASEPATH
|
||||
func (m *InfinibandCollector) Init(config json.RawMessage) error {
|
||||
|
||||
// Check if already initialized
|
||||
if m.init {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *InfinibandCollector) Init(config []byte) error {
|
||||
var err error
|
||||
m.name = "InfinibandCollector"
|
||||
m.use_perfquery = false
|
||||
m.setup()
|
||||
m.tags = map[string]string{"type": "node"}
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "Network",
|
||||
}
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
if len(m.config.PerfQueryPath) == 0 {
|
||||
path, err := exec.LookPath("perfquery")
|
||||
if err == nil {
|
||||
m.config.PerfQueryPath = path
|
||||
}
|
||||
}
|
||||
m.lids = make(map[string]map[string]string)
|
||||
p := fmt.Sprintf("%s/*/ports/*/lid", string(IBBASEPATH))
|
||||
files, err := filepath.Glob(p)
|
||||
for _, f := range files {
|
||||
lid, err := ioutil.ReadFile(f)
|
||||
if err == nil {
|
||||
plist := strings.Split(strings.Replace(f, string(IBBASEPATH), "", -1), "/")
|
||||
skip := false
|
||||
for _, d := range m.config.ExcludeDevices {
|
||||
if d == plist[0] {
|
||||
skip = true
|
||||
}
|
||||
}
|
||||
if !skip {
|
||||
m.lids[plist[0]] = make(map[string]string)
|
||||
m.lids[plist[0]][plist[2]] = string(lid)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
for _, ports := range m.lids {
|
||||
for port, lid := range ports {
|
||||
args := fmt.Sprintf("-r %s %s 0xf000", lid, port)
|
||||
command := exec.Command(m.config.PerfQueryPath, args)
|
||||
command.Wait()
|
||||
_, err := command.Output()
|
||||
if err == nil {
|
||||
m.use_perfquery = true
|
||||
}
|
||||
break
|
||||
}
|
||||
break
|
||||
}
|
||||
|
||||
if len(m.lids) > 0 {
|
||||
m.init = true
|
||||
} else {
|
||||
err = errors.New("No usable devices")
|
||||
}
|
||||
|
||||
return err
|
||||
}
|
||||
|
||||
func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[string]string, out *[]lp.MutableMetric) error {
|
||||
|
||||
args := fmt.Sprintf("-r %s %s 0xf000", lid, port)
|
||||
command := exec.Command(cmd, args)
|
||||
command.Wait()
|
||||
stdout, err := command.Output()
|
||||
// Loop for all InfiniBand directories
|
||||
globPattern := filepath.Join(IB_BASEPATH, "*", "ports", "*")
|
||||
ibDirs, err := filepath.Glob(globPattern)
|
||||
if err != nil {
|
||||
log.Print(err)
|
||||
return err
|
||||
return fmt.Errorf("Unable to glob files with pattern %s: %v", globPattern, err)
|
||||
}
|
||||
if ibDirs == nil {
|
||||
return fmt.Errorf("Unable to find any directories with pattern %s", globPattern)
|
||||
}
|
||||
ll := strings.Split(string(stdout), "\n")
|
||||
|
||||
for _, line := range ll {
|
||||
if strings.HasPrefix(line, "PortRcvData") || strings.HasPrefix(line, "RcvData") {
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
for _, path := range ibDirs {
|
||||
|
||||
// Skip, when no LID is assigned
|
||||
line, err := ioutil.ReadFile(filepath.Join(path, "lid"))
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
LID := strings.TrimSpace(string(line))
|
||||
if LID == "0x0" {
|
||||
continue
|
||||
}
|
||||
|
||||
// Get device and port component
|
||||
pathSplit := strings.Split(path, string(os.PathSeparator))
|
||||
device := pathSplit[4]
|
||||
port := pathSplit[6]
|
||||
|
||||
// Skip excluded devices
|
||||
skip := false
|
||||
for _, excludedDevice := range m.config.ExcludeDevices {
|
||||
if excludedDevice == device {
|
||||
skip = true
|
||||
break
|
||||
}
|
||||
}
|
||||
if strings.HasPrefix(line, "PortXmitData") || strings.HasPrefix(line, "XmtData") {
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
}
|
||||
}
|
||||
if strings.HasPrefix(line, "PortRcvPkts") || strings.HasPrefix(line, "RcvPkts") {
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
}
|
||||
}
|
||||
if strings.HasPrefix(line, "PortXmitPkts") || strings.HasPrefix(line, "XmtPkts") {
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
if skip {
|
||||
continue
|
||||
}
|
||||
|
||||
// Check access to counter files
|
||||
countersDir := filepath.Join(path, "counters")
|
||||
portCounterFiles := map[string]string{
|
||||
"ib_recv": filepath.Join(countersDir, "port_rcv_data"),
|
||||
"ib_xmit": filepath.Join(countersDir, "port_xmit_data"),
|
||||
"ib_recv_pkts": filepath.Join(countersDir, "port_rcv_packets"),
|
||||
"ib_xmit_pkts": filepath.Join(countersDir, "port_xmit_packets"),
|
||||
}
|
||||
for _, counterFile := range portCounterFiles {
|
||||
err := unix.Access(counterFile, unix.R_OK)
|
||||
if err != nil {
|
||||
return fmt.Errorf("Unable to access %s: %v", counterFile, err)
|
||||
}
|
||||
}
|
||||
|
||||
m.info = append(m.info,
|
||||
&InfinibandCollectorInfo{
|
||||
LID: LID,
|
||||
device: device,
|
||||
port: port,
|
||||
portCounterFiles: portCounterFiles,
|
||||
tagSet: map[string]string{
|
||||
"type": "node",
|
||||
"device": device,
|
||||
"port": port,
|
||||
"lid": LID,
|
||||
},
|
||||
})
|
||||
}
|
||||
|
||||
if len(m.info) == 0 {
|
||||
return fmt.Errorf("Found no IB devices")
|
||||
}
|
||||
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func DoSysfsRead(dev string, lid string, port string, tags map[string]string, out *[]lp.MutableMetric) error {
|
||||
path := fmt.Sprintf("%s/%s/ports/%s/counters/", string(IBBASEPATH), dev, port)
|
||||
buffer, err := ioutil.ReadFile(fmt.Sprintf("%s/port_rcv_data", path))
|
||||
if err == nil {
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
}
|
||||
}
|
||||
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_xmit_data", path))
|
||||
if err == nil {
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
}
|
||||
}
|
||||
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_rcv_packets", path))
|
||||
if err == nil {
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
}
|
||||
}
|
||||
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_xmit_packets", path))
|
||||
if err == nil {
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
// Read reads Infiniband counter files below IB_BASEPATH
|
||||
func (m *InfinibandCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
|
||||
func (m *InfinibandCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
|
||||
if m.init {
|
||||
for dev, ports := range m.lids {
|
||||
for port, lid := range ports {
|
||||
tags := map[string]string{"type": "node", "device": dev, "port": port}
|
||||
if m.use_perfquery {
|
||||
DoPerfQuery(m.config.PerfQueryPath, dev, lid, port, tags, out)
|
||||
} else {
|
||||
DoSysfsRead(dev, lid, port, tags, out)
|
||||
}
|
||||
}
|
||||
}
|
||||
// Check if already initialized
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
|
||||
// buffer, err := ioutil.ReadFile(string(LIDFILE))
|
||||
now := time.Now()
|
||||
for _, info := range m.info {
|
||||
for counterName, counterFile := range info.portCounterFiles {
|
||||
line, err := ioutil.ReadFile(counterFile)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to read from file '%s': %v", counterFile, err))
|
||||
continue
|
||||
}
|
||||
data := strings.TrimSpace(string(line))
|
||||
v, err := strconv.ParseInt(data, 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert Infiniband metric %s='%s' to int64: %v", counterName, data, err))
|
||||
continue
|
||||
}
|
||||
if y, err := lp.New(counterName, info.tagSet, m.meta, map[string]interface{}{"value": v}, now); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
// if err != nil {
|
||||
// log.Print(err)
|
||||
// return
|
||||
// }
|
||||
|
||||
// args := fmt.Sprintf("-r %s 1 0xf000", string(buffer))
|
||||
|
||||
// command := exec.Command(PERFQUERY, args)
|
||||
// command.Wait()
|
||||
// stdout, err := command.Output()
|
||||
// if err != nil {
|
||||
// log.Print(err)
|
||||
// return
|
||||
// }
|
||||
|
||||
// ll := strings.Split(string(stdout), "\n")
|
||||
|
||||
// for _, line := range ll {
|
||||
// if strings.HasPrefix(line, "PortRcvData") || strings.HasPrefix(line, "RcvData") {
|
||||
// lv := strings.Fields(line)
|
||||
// v, err := strconv.ParseFloat(lv[1], 64)
|
||||
// if err == nil {
|
||||
// y, err := lp.New("ib_recv", m.tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
// if err == nil {
|
||||
// *out = append(*out, y)
|
||||
// }
|
||||
// }
|
||||
// }
|
||||
// if strings.HasPrefix(line, "PortXmitData") || strings.HasPrefix(line, "XmtData") {
|
||||
// lv := strings.Fields(line)
|
||||
// v, err := strconv.ParseFloat(lv[1], 64)
|
||||
// if err == nil {
|
||||
// y, err := lp.New("ib_xmit", m.tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
// if err == nil {
|
||||
// *out = append(*out, y)
|
||||
// }
|
||||
// }
|
||||
// }
|
||||
// }
|
||||
}
|
||||
}
|
||||
|
||||
func (m *InfinibandCollector) Close() {
|
||||
|
26
collectors/infinibandMetric.md
Normal file
@@ -0,0 +1,26 @@
|
||||
|
||||
## `ibstat` collector
|
||||
|
||||
```json
|
||||
"ibstat": {
|
||||
"exclude_devices": [
|
||||
"mlx4"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `ibstat` collector includes all Infiniband devices that can be
|
||||
found below `/sys/class/infiniband/` and where any of the ports provides a
|
||||
LID file (`/sys/class/infiniband/<dev>/ports/<port>/lid`).
|
||||
|
||||
The devices can be filtered with the `exclude_devices` option in the configuration.
|
||||
|
||||
For each found LID the collector reads data through the sysfs files below `/sys/class/infiniband/<device>`.
|
||||
|
||||
Metrics:
|
||||
* `ib_recv`
|
||||
* `ib_xmit`
|
||||
* `ib_recv_pkts`
|
||||
* `ib_xmit_pkts`
|
||||
|
||||
The collector adds the `device`, `port` and `lid` tags to all metrics.
|
232
collectors/infinibandPerfQueryMetric.go
Normal file
@@ -0,0 +1,232 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"io/ioutil"
|
||||
"log"
|
||||
"os/exec"
|
||||
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
|
||||
// "os"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"path/filepath"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
const PERFQUERY = `/usr/sbin/perfquery`
|
||||
|
||||
type InfinibandPerfQueryCollector struct {
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
lids map[string]map[string]string
|
||||
config struct {
|
||||
ExcludeDevices []string `json:"exclude_devices,omitempty"`
|
||||
PerfQueryPath string `json:"perfquery_path"`
|
||||
}
|
||||
}
|
||||
|
||||
func (m *InfinibandPerfQueryCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "InfinibandCollectorPerfQuery"
|
||||
m.setup()
|
||||
m.meta = map[string]string{"source": m.name, "group": "Network"}
|
||||
m.tags = map[string]string{"type": "node"}
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
if len(m.config.PerfQueryPath) == 0 {
|
||||
path, err := exec.LookPath("perfquery")
|
||||
if err == nil {
|
||||
m.config.PerfQueryPath = path
|
||||
}
|
||||
}
|
||||
m.lids = make(map[string]map[string]string)
|
||||
p := fmt.Sprintf("%s/*/ports/*/lid", string(IB_BASEPATH))
|
||||
files, err := filepath.Glob(p)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
for _, f := range files {
|
||||
lid, err := ioutil.ReadFile(f)
|
||||
if err == nil {
|
||||
plist := strings.Split(strings.Replace(f, string(IB_BASEPATH), "", -1), "/")
|
||||
skip := false
|
||||
for _, d := range m.config.ExcludeDevices {
|
||||
if d == plist[0] {
|
||||
skip = true
|
||||
}
|
||||
}
|
||||
if !skip {
|
||||
m.lids[plist[0]] = make(map[string]string)
|
||||
m.lids[plist[0]][plist[2]] = string(lid)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
for _, ports := range m.lids {
|
||||
for port, lid := range ports {
|
||||
// Pass each argument separately; exec.Command does not split a single string
|
||||
command := exec.Command(m.config.PerfQueryPath, "-r", strings.TrimSpace(lid), port, "0xf000")
|
||||
_, err := command.Output()
|
||||
if err != nil {
|
||||
return fmt.Errorf("Failed to execute %s: %v", m.config.PerfQueryPath, err)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if len(m.lids) == 0 {
|
||||
return errors.New("No usable IB devices")
|
||||
}
|
||||
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *InfinibandPerfQueryCollector) doPerfQuery(cmd string, dev string, lid string, port string, tags map[string]string, output chan lp.CCMetric) error {
|
||||
|
||||
// Pass each argument separately; exec.Command does not split a single string
|
||||
command := exec.Command(cmd, "-r", strings.TrimSpace(lid), port, "0xf000")
|
||||
stdout, err := command.Output()
|
||||
if err != nil {
|
||||
log.Print(err)
|
||||
return err
|
||||
}
|
||||
ll := strings.Split(string(stdout), "\n")
|
||||
|
||||
for _, line := range ll {
|
||||
if strings.HasPrefix(line, "PortRcvData") || strings.HasPrefix(line, "RcvData") {
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
if strings.HasPrefix(line, "PortXmitData") || strings.HasPrefix(line, "XmtData") {
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
if strings.HasPrefix(line, "PortRcvPkts") || strings.HasPrefix(line, "RcvPkts") {
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
if strings.HasPrefix(line, "PortXmitPkts") || strings.HasPrefix(line, "XmtPkts") {
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *InfinibandPerfQueryCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
|
||||
if m.init {
|
||||
for dev, ports := range m.lids {
|
||||
for port, lid := range ports {
|
||||
tags := map[string]string{
|
||||
"type": "node",
|
||||
"device": dev,
|
||||
"port": port,
|
||||
"lid": lid}
|
||||
path := fmt.Sprintf("%s/%s/ports/%s/counters/", string(IB_BASEPATH), dev, port)
|
||||
buffer, err := ioutil.ReadFile(fmt.Sprintf("%s/port_rcv_data", path))
|
||||
if err == nil {
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_xmit_data", path))
|
||||
if err == nil {
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_rcv_packets", path))
|
||||
if err == nil {
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_xmit_packets", path))
|
||||
if err == nil {
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func (m *InfinibandPerfQueryCollector) Close() {
|
||||
m.init = false
|
||||
}
|
28
collectors/infinibandPerfQueryMetric.md
Normal file
@@ -0,0 +1,28 @@
|
||||
|
||||
## `ibstat_perfquery` collector
|
||||
|
||||
```json
|
||||
"ibstat_perfquery": {
|
||||
"perfquery_path": "/path/to/perfquery",
|
||||
"exclude_devices": [
|
||||
"mlx4"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `ibstat_perfquery` collector includes all Infiniband devices that can be
|
||||
found below `/sys/class/infiniband/` and where any of the ports provides a
|
||||
LID file (`/sys/class/infiniband/<dev>/ports/<port>/lid`)
|
||||
|
||||
The devices can be filtered with the `exclude_devices` option in the configuration.
|
||||
|
||||
For each found LID the collector calls the `perfquery` command. The path to the
|
||||
`perfquery` command can be configured with the `perfquery_path` option in the configuration.
|
||||
|
||||
Metrics:
|
||||
* `ib_recv`
|
||||
* `ib_xmit`
|
||||
* `ib_recv_pkts`
|
||||
* `ib_xmit_pkts`
|
||||
|
||||
The collector adds the `device`, `port` and `lid` tags to all metrics.
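
One detail worth illustrating is how the `perfquery` command line is assembled. `exec.Command` takes every argument as a separate string and does not split a single string on spaces, and a LID read from sysfs ends with a newline that should be trimmed first. The sketch below only builds and prints the command; the helper name `perfQueryArgs` and the LID and port values are placeholders chosen for this example.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// perfQueryArgs builds the argument list for one LID/port pair.
// The LID is trimmed because the content of a sysfs lid file ends
// with a newline.
func perfQueryArgs(lid string, port string) []string {
	return []string{"-r", strings.TrimSpace(lid), port, "0xf000"}
}

func main() {
	// Search $PATH, as the collector does when perfquery_path is not set.
	path, err := exec.LookPath("perfquery")
	if err != nil {
		fmt.Println("perfquery not found in $PATH")
		return
	}
	// Placeholder LID and port, for illustration only.
	cmd := exec.Command(path, perfQueryArgs("0x1\n", "1")...)
	// Print the command instead of running it to avoid touching real counters.
	fmt.Println(cmd.Args)
}
```

Keeping the argument construction in a small function makes it easy to check without having the InfiniBand tools installed.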
|
155
collectors/iostatMetric.go
Normal file
@@ -0,0 +1,155 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"os"
|
||||
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
|
||||
// "log"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
const IOSTATFILE = `/proc/diskstats`
|
||||
const IOSTAT_SYSFSPATH = `/sys/block`
|
||||
|
||||
type IOstatCollectorConfig struct {
|
||||
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
|
||||
}
|
||||
|
||||
type IOstatCollectorEntry struct {
|
||||
lastValues map[string]int64
|
||||
tags map[string]string
|
||||
}
|
||||
|
||||
type IOstatCollector struct {
|
||||
metricCollector
|
||||
matches map[string]int
|
||||
config IOstatCollectorConfig
|
||||
devices map[string]IOstatCollectorEntry
|
||||
}
|
||||
|
||||
func (m *IOstatCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "IOstatCollector"
|
||||
m.meta = map[string]string{"source": m.name, "group": "Disk"}
|
||||
m.setup()
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
// https://www.kernel.org/doc/html/latest/admin-guide/iostats.html
|
||||
matches := map[string]int{
|
||||
"io_reads": 3,
|
||||
"io_reads_merged": 4,
|
||||
"io_read_sectors": 5,
|
||||
"io_read_ms": 6,
|
||||
"io_writes": 7,
|
||||
"io_writes_merged": 8,
|
||||
"io_writes_sectors": 9,
|
||||
"io_writes_ms": 10,
|
||||
"io_ioops": 11,
|
||||
"io_ioops_ms": 12,
|
||||
"io_ioops_weighted_ms": 13,
|
||||
"io_discards": 14,
|
||||
"io_discards_merged": 15,
|
||||
"io_discards_sectors": 16,
|
||||
"io_discards_ms": 17,
|
||||
"io_flushes": 18,
|
||||
"io_flushes_ms": 19,
|
||||
}
|
||||
m.devices = make(map[string]IOstatCollectorEntry)
|
||||
m.matches = make(map[string]int)
|
||||
for k, v := range matches {
|
||||
if _, skip := stringArrayContains(m.config.ExcludeMetrics, k); !skip {
|
||||
m.matches[k] = v
|
||||
}
|
||||
}
|
||||
if len(m.matches) == 0 {
|
||||
return errors.New("no metrics to collect")
|
||||
}
|
||||
file, err := os.Open(string(IOSTATFILE))
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
return err
|
||||
}
|
||||
defer file.Close()
|
||||
|
||||
scanner := bufio.NewScanner(file)
|
||||
for scanner.Scan() {
|
||||
line := scanner.Text()
|
||||
linefields := strings.Fields(line)
|
||||
if len(linefields) < 3 {
|
||||
continue
|
||||
}
|
||||
device := linefields[2]
|
||||
if strings.Contains(device, "loop") {
|
||||
continue
|
||||
}
|
||||
values := make(map[string]int64)
|
||||
for name := range m.matches {
|
||||
values[name] = 0
|
||||
}
|
||||
m.devices[device] = IOstatCollectorEntry{
|
||||
tags: map[string]string{
|
||||
"device": linefields[2],
|
||||
"type": "node",
|
||||
},
|
||||
lastValues: values,
|
||||
}
|
||||
}
|
||||
m.init = true
|
||||
return err
|
||||
}
|
||||
|
||||
func (m *IOstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
|
||||
file, err := os.Open(string(IOSTATFILE))
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
return
|
||||
}
|
||||
defer file.Close()
|
||||
|
||||
scanner := bufio.NewScanner(file)
|
||||
for scanner.Scan() {
|
||||
line := scanner.Text()
|
||||
if len(line) == 0 {
|
||||
continue
|
||||
}
|
||||
linefields := strings.Fields(line)
|
||||
device := linefields[2]
|
||||
if strings.Contains(device, "loop") {
|
||||
continue
|
||||
}
|
||||
if _, ok := m.devices[device]; !ok {
|
||||
continue
|
||||
}
|
||||
entry := m.devices[device]
|
||||
for name, idx := range m.matches {
|
||||
if idx < len(linefields) {
|
||||
x, err := strconv.ParseInt(linefields[idx], 0, 64)
|
||||
if err == nil {
|
||||
diff := x - entry.lastValues[name]
|
||||
y, err := lp.New(name, entry.tags, m.meta, map[string]interface{}{"value": int(diff)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
entry.lastValues[name] = x
|
||||
}
|
||||
}
|
||||
}
|
||||
m.devices[device] = entry
|
||||
}
|
||||
}
|
||||
|
||||
func (m *IOstatCollector) Close() {
|
||||
m.init = false
|
||||
}
|
34
collectors/iostatMetric.md
Normal file
@@ -0,0 +1,34 @@
|
||||
|
||||
## `iostat` collector
|
||||
|
||||
```json
|
||||
"iostat": {
|
||||
"exclude_metrics": [
|
||||
"io_read_ms"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
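The `iostat` collector reads data from `/proc/diskstats` and outputs a handful of **node** metrics. Each metric is reported as the change of the corresponding counter since the previous read. If a metric is not required, it can be excluded so that it is not forwarded to the sink.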
|
||||
|
||||
Metrics:
|
||||
* `io_reads`
|
||||
* `io_reads_merged`
|
||||
* `io_read_sectors`
|
||||
* `io_read_ms`
|
||||
* `io_writes`
|
||||
* `io_writes_merged`
|
||||
* `io_writes_sectors`
|
||||
* `io_writes_ms`
|
||||
* `io_ioops`
|
||||
* `io_ioops_ms`
|
||||
* `io_ioops_weighted_ms`
|
||||
* `io_discards`
|
||||
* `io_discards_merged`
|
||||
* `io_discards_sectors`
|
||||
* `io_discards_ms`
|
||||
* `io_flushes`
|
||||
* `io_flushes_ms`
|
||||
|
||||
The device name is added as tag `device`. For more details, see https://www.kernel.org/doc/html/latest/admin-guide/iostats.html
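
As a rough illustration of how such metrics can be derived, the following Go sketch parses two samples of a `/proc/diskstats` line and computes the per-counter difference. The helper names (`parseDiskstatsLine`, `delta`), the reduced `fieldIndex` table and the sample lines are assumptions made for this example; they are not part of the collector.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fieldIndex maps a few metric names to their column in /proc/diskstats.
// Columns 0-2 hold major number, minor number and device name.
// (Reduced, hypothetical table for this sketch.)
var fieldIndex = map[string]int{
	"io_reads":  3,
	"io_writes": 7,
}

// parseDiskstatsLine returns the device name and the selected counters.
// Lines with fewer than three fields yield ok == false.
func parseDiskstatsLine(line string) (device string, values map[string]int64, ok bool) {
	fields := strings.Fields(line)
	if len(fields) < 3 {
		return "", nil, false
	}
	values = make(map[string]int64)
	for name, idx := range fieldIndex {
		if idx >= len(fields) {
			continue // older kernels expose fewer columns
		}
		v, err := strconv.ParseInt(fields[idx], 10, 64)
		if err != nil {
			continue // skip unparsable values instead of storing 0
		}
		values[name] = v
	}
	return fields[2], values, true
}

// delta returns the per-metric difference between two samples.
func delta(prev, cur map[string]int64) map[string]int64 {
	out := make(map[string]int64)
	for name, v := range cur {
		out[name] = v - prev[name]
	}
	return out
}

func main() {
	// Made-up sample lines, taken one interval apart.
	first := "   8       0 sda 100 5 2000 30 50 2 1000 20 0 40 50"
	second := "   8       0 sda 130 5 2600 35 70 2 1400 28 0 55 63"

	dev, prev, _ := parseDiskstatsLine(first)
	_, cur, _ := parseDiskstatsLine(second)
	d := delta(prev, cur)
	fmt.Println(dev, d["io_reads"], d["io_writes"])
}
```

Keeping the previous sample per device, as in this sketch, is what allows a collector to report rates of change rather than ever-growing counters.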
|
||||
|
@@ -10,11 +10,11 @@ import (
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const IPMITOOL_PATH = `/usr/bin/ipmitool`
|
||||
const IPMISENSORS_PATH = `/usr/sbin/ipmi-sensors`
|
||||
const IPMITOOL_PATH = `ipmitool`
|
||||
const IPMISENSORS_PATH = `ipmi-sensors`
|
||||
|
||||
type IpmiCollectorConfig struct {
|
||||
ExcludeDevices []string `json:"exclude_devices"`
|
||||
@@ -23,37 +23,44 @@ type IpmiCollectorConfig struct {
|
||||
}
|
||||
|
||||
type IpmiCollector struct {
|
||||
MetricCollector
|
||||
tags map[string]string
|
||||
matches map[string]string
|
||||
config IpmiCollectorConfig
|
||||
metricCollector
|
||||
//tags map[string]string
|
||||
//matches map[string]string
|
||||
config IpmiCollectorConfig
|
||||
ipmitool string
|
||||
ipmisensors string
|
||||
}
|
||||
|
||||
func (m *IpmiCollector) Init(config []byte) error {
|
||||
func (m *IpmiCollector) Init(config json.RawMessage) error {
|
||||
m.name = "IpmiCollector"
|
||||
m.setup()
|
||||
m.meta = map[string]string{"source": m.name, "group": "IPMI"}
|
||||
m.config.IpmitoolPath = string(IPMITOOL_PATH)
|
||||
m.config.IpmisensorsPath = string(IPMISENSORS_PATH)
|
||||
m.ipmitool = ""
|
||||
m.ipmisensors = ""
|
||||
if len(config) > 0 {
|
||||
err := json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
_, err1 := os.Stat(m.config.IpmitoolPath)
|
||||
_, err2 := os.Stat(m.config.IpmisensorsPath)
|
||||
if err1 != nil {
|
||||
m.config.IpmitoolPath = ""
|
||||
p, err := exec.LookPath(m.config.IpmitoolPath)
|
||||
if err == nil {
|
||||
m.ipmitool = p
|
||||
}
|
||||
if err2 != nil {
|
||||
m.config.IpmisensorsPath = ""
|
||||
p, err = exec.LookPath(m.config.IpmisensorsPath)
|
||||
if err == nil {
|
||||
m.ipmisensors = p
|
||||
}
|
||||
if err1 != nil && err2 != nil {
|
||||
if len(m.ipmitool) == 0 && len(m.ipmisensors) == 0 {
|
||||
return errors.New("No IPMI reader found")
|
||||
}
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func ReadIpmiTool(cmd string, out *[]lp.MutableMetric) {
|
||||
func (m *IpmiCollector) readIpmiTool(cmd string, output chan lp.CCMetric) {
|
||||
command := exec.Command(cmd, "sensor")
|
||||
command.Wait()
|
||||
stdout, err := command.Output()
|
||||
@@ -74,24 +81,25 @@ func ReadIpmiTool(cmd string, out *[]lp.MutableMetric) {
|
||||
name := strings.ToLower(strings.Replace(strings.Trim(lv[0], " "), " ", "_", -1))
|
||||
unit := strings.Trim(lv[2], " ")
|
||||
if unit == "Volts" {
|
||||
unit = "V"
|
||||
unit = "Volts"
|
||||
} else if unit == "degrees C" {
|
||||
unit = "C"
|
||||
unit = "degC"
|
||||
} else if unit == "degrees F" {
|
||||
unit = "F"
|
||||
unit = "degF"
|
||||
} else if unit == "Watts" {
|
||||
unit = "W"
|
||||
unit = "Watts"
|
||||
}
|
||||
|
||||
y, err := lp.New(name, map[string]string{"unit": unit, "type": "node"}, map[string]interface{}{"value": v}, time.Now())
|
||||
y, err := lp.New(name, map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": v}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
y.AddMeta("unit", unit)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func ReadIpmiSensors(cmd string, out *[]lp.MutableMetric) {
|
||||
func (m *IpmiCollector) readIpmiSensors(cmd string, output chan lp.CCMetric) {
|
||||
|
||||
command := exec.Command(cmd, "--comma-separated-output", "--sdr-cache-recreate")
|
||||
command.Wait()
|
||||
@@ -109,25 +117,28 @@ func ReadIpmiSensors(cmd string, out *[]lp.MutableMetric) {
|
||||
v, err := strconv.ParseFloat(lv[3], 64)
|
||||
if err == nil {
|
||||
name := strings.ToLower(strings.Replace(lv[1], " ", "_", -1))
|
||||
y, err := lp.New(name, map[string]string{"unit": lv[4], "type": "node"}, map[string]interface{}{"value": v}, time.Now())
|
||||
y, err := lp.New(name, map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": v}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
if len(lv) > 4 {
|
||||
y.AddMeta("unit", lv[4])
|
||||
}
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func (m *IpmiCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *IpmiCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if len(m.config.IpmitoolPath) > 0 {
|
||||
_, err := os.Stat(m.config.IpmitoolPath)
|
||||
if err == nil {
|
||||
ReadIpmiTool(m.config.IpmitoolPath, out)
|
||||
m.readIpmiTool(m.config.IpmitoolPath, output)
|
||||
}
|
||||
} else if len(m.config.IpmisensorsPath) > 0 {
|
||||
_, err := os.Stat(m.config.IpmisensorsPath)
|
||||
if err == nil {
|
||||
ReadIpmiSensors(m.config.IpmisensorsPath, out)
|
||||
m.readIpmiSensors(m.config.IpmisensorsPath, output)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
16
collectors/ipmiMetric.md
Normal file
@@ -0,0 +1,16 @@
|
||||
|
||||
## `ipmistat` collector
|
||||
|
||||
```json
|
||||
"ipmistat": {
|
||||
"ipmitool_path": "/path/to/ipmitool",
|
||||
"ipmisensors_path": "/path/to/ipmi-sensors"
|
||||
}
|
||||
```
|
||||
|
||||
The `ipmistat` collector reads data from `ipmitool` (`ipmitool sensor`) or `ipmi-sensors` (`ipmi-sensors --sdr-cache-recreate --comma-separated-output`).
|
||||
|
||||
The available metrics depend on the output of the underlying tool, but they typically include temperature, power and energy readings.
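
To make the parsing step more concrete, here is a minimal Go sketch that turns one pipe-separated line, in the style printed by `ipmitool sensor`, into a name, value and unit. The type `sensorReading`, the function `parseSensorLine` and the sample lines are hypothetical and only serve this example; the assumed column order (name, value, unit) should be checked against the output on the target system.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sensorReading is a hypothetical container for one parsed sensor line.
type sensorReading struct {
	Name  string
	Value float64
	Unit  string
}

// parseSensorLine parses a line of the assumed form
// "<name> | <value> | <unit> | ...". It returns false if the line
// has too few columns or the value is not numeric (e.g. "na").
func parseSensorLine(line string) (sensorReading, bool) {
	cols := strings.Split(line, "|")
	if len(cols) < 3 {
		return sensorReading{}, false
	}
	v, err := strconv.ParseFloat(strings.TrimSpace(cols[1]), 64)
	if err != nil {
		return sensorReading{}, false
	}
	// Normalize the sensor name: lower case, spaces replaced by underscores.
	name := strings.ToLower(strings.ReplaceAll(strings.TrimSpace(cols[0]), " ", "_"))
	// Shorten the temperature units; other units are passed through.
	unit := strings.TrimSpace(cols[2])
	switch unit {
	case "degrees C":
		unit = "degC"
	case "degrees F":
		unit = "degF"
	}
	return sensorReading{Name: name, Value: v, Unit: unit}, true
}

func main() {
	// Made-up sample lines.
	lines := []string{
		"CPU Temp         | 45.000     | degrees C  | ok",
		"Fan 1            | na         | RPM        | na",
	}
	for _, l := range lines {
		if r, ok := parseSensorLine(l); ok {
			fmt.Printf("%s=%v unit=%s\n", r.Name, r.Value, r.Unit)
		}
	}
}
```

Lines whose value cannot be parsed are dropped here rather than reported as zero, so an absent sensor does not show up as a misleading reading.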
|
||||
|
||||
|
||||
|
@@ -2,7 +2,7 @@ package collectors
|
||||
|
||||
/*
|
||||
#cgo CFLAGS: -I./likwid
|
||||
#cgo LDFLAGS: -L./likwid -llikwid -llikwid-hwloc -lm
|
||||
#cgo LDFLAGS: -L./likwid -llikwid -llikwid-hwloc -lm -Wl,--unresolved-symbols=ignore-in-object-files
|
||||
#include <stdlib.h>
|
||||
#include <likwid.h>
|
||||
*/
|
||||
@@ -13,55 +13,111 @@ import (
|
||||
"errors"
|
||||
"fmt"
|
||||
"io/ioutil"
|
||||
"log"
|
||||
"math"
|
||||
"os"
|
||||
"regexp"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
"unsafe"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
"gopkg.in/Knetic/govaluate.v2"
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
topo "github.com/ClusterCockpit/cc-metric-collector/internal/ccTopology"
|
||||
agg "github.com/ClusterCockpit/cc-metric-collector/internal/metricAggregator"
|
||||
"github.com/NVIDIA/go-nvml/pkg/dl"
|
||||
)
|
||||
|
||||
type MetricScope string
|
||||
|
||||
const (
|
||||
METRIC_SCOPE_HWTHREAD = iota
|
||||
METRIC_SCOPE_CORE
|
||||
METRIC_SCOPE_LLC
|
||||
METRIC_SCOPE_NUMA
|
||||
METRIC_SCOPE_DIE
|
||||
METRIC_SCOPE_SOCKET
|
||||
METRIC_SCOPE_NODE
|
||||
)
|
||||
|
||||
func (ms MetricScope) String() string {
|
||||
return string(ms)
|
||||
}
|
||||
|
||||
func (ms MetricScope) Likwid() string {
|
||||
LikwidDomains := map[string]string{
|
||||
"cpu": "",
|
||||
"core": "",
|
||||
"llc": "C",
|
||||
"numadomain": "M",
|
||||
"die": "D",
|
||||
"socket": "S",
|
||||
"node": "N",
|
||||
}
|
||||
return LikwidDomains[string(ms)]
|
||||
}
|
||||
|
||||
func (ms MetricScope) Granularity() int {
|
||||
for i, g := range GetAllMetricScopes() {
|
||||
if ms == g {
|
||||
return i
|
||||
}
|
||||
}
|
||||
return -1
|
||||
}
|
||||
|
||||
func GetAllMetricScopes() []MetricScope {
|
||||
return []MetricScope{"cpu" /*, "core", "llc", "numadomain", "die",*/, "socket", "node"}
|
||||
}
|
||||
|
||||
const (
|
||||
LIKWID_LIB_NAME = "liblikwid.so"
|
||||
LIKWID_LIB_DL_FLAGS = dl.RTLD_LAZY | dl.RTLD_GLOBAL
|
||||
)
|
||||
|
||||
type LikwidCollectorMetricConfig struct {
|
||||
Name string `json:"name"`
|
||||
Calc string `json:"calc"`
|
||||
Socket_scope bool `json:"socket_scope"`
|
||||
Publish bool `json:"publish"`
|
||||
Name string `json:"name"` // Name of the metric
|
||||
Calc string `json:"calc"` // Calculation for the metric using
|
||||
//Aggr string `json:"aggregation"` // if scope unequal to LIKWID metric scope, the values are combined (sum, min, max, mean or avg, median)
|
||||
Scope MetricScope `json:"scope"` // scope for calculation. subscopes are aggregated using the 'aggregation' function
|
||||
Publish bool `json:"publish"`
|
||||
granulatity MetricScope
|
||||
}
|
||||
|
||||
type LikwidCollectorEventsetConfig struct {
|
||||
Events map[string]string `json:"events"`
|
||||
Metrics []LikwidCollectorMetricConfig `json:"metrics"`
|
||||
Events map[string]string `json:"events"`
|
||||
granulatity map[string]MetricScope
|
||||
Metrics []LikwidCollectorMetricConfig `json:"metrics"`
|
||||
}
|
||||
|
||||
type LikwidCollectorConfig struct {
|
||||
Eventsets []LikwidCollectorEventsetConfig `json:"eventsets"`
|
||||
Metrics []LikwidCollectorMetricConfig `json:"globalmetrics"`
|
||||
ExcludeMetrics []string `json:"exclude_metrics"`
|
||||
ForceOverwrite bool `json:"force_overwrite"`
|
||||
Metrics []LikwidCollectorMetricConfig `json:"globalmetrics,omitempty"`
|
||||
ForceOverwrite bool `json:"force_overwrite,omitempty"`
|
||||
InvalidToZero bool `json:"invalid_to_zero,omitempty"`
|
||||
}
|
||||
|
||||
type LikwidCollector struct {
|
||||
MetricCollector
|
||||
cpulist []C.int
|
||||
sock2tid map[int]int
|
||||
metrics map[C.int]map[string]int
|
||||
groups []C.int
|
||||
config LikwidCollectorConfig
|
||||
results map[int]map[int]map[string]interface{}
|
||||
mresults map[int]map[int]map[string]float64
|
||||
gmresults map[int]map[string]float64
|
||||
basefreq float64
|
||||
metricCollector
|
||||
cpulist []C.int
|
||||
cpu2tid map[int]int
|
||||
sock2tid map[int]int
|
||||
scopeRespTids map[MetricScope]map[int]int
|
||||
metrics map[C.int]map[string]int
|
||||
groups []C.int
|
||||
config LikwidCollectorConfig
|
||||
results map[int]map[int]map[string]interface{}
|
||||
mresults map[int]map[int]map[string]float64
|
||||
gmresults map[int]map[string]float64
|
||||
basefreq float64
|
||||
running bool
|
||||
}
|
||||
|
||||
type LikwidMetric struct {
|
||||
name string
|
||||
search string
|
||||
socket_scope bool
|
||||
group_idx int
|
||||
name string
|
||||
search string
|
||||
scope MetricScope
|
||||
group_idx int
|
||||
}
|
||||
|
||||
func eventsToEventStr(events map[string]string) string {
|
||||
@@ -72,12 +128,27 @@ func eventsToEventStr(events map[string]string) string {
|
||||
return strings.Join(elist, ",")
|
||||
}
|
||||
|
||||
func getGranularity(counter, event string) MetricScope {
|
||||
if strings.HasPrefix(counter, "PMC") || strings.HasPrefix(counter, "FIXC") {
|
||||
return "cpu"
|
||||
} else if strings.Contains(counter, "BOX") || strings.Contains(counter, "DEV") {
|
||||
return "socket"
|
||||
} else if strings.HasPrefix(counter, "PWR") {
|
||||
if event == "RAPL_CORE_ENERGY" {
|
||||
return "cpu"
|
||||
} else {
|
||||
return "socket"
|
||||
}
|
||||
}
|
||||
return "unknown"
|
||||
}
|
||||
|
||||
func getBaseFreq() float64 {
|
||||
var freq float64 = math.NaN()
|
||||
C.power_init(0)
|
||||
info := C.get_powerInfo()
|
||||
if float64(info.baseFrequency) != 0 {
|
||||
freq = float64(info.baseFrequency)
|
||||
freq = float64(info.baseFrequency) * 1e3
|
||||
} else {
|
||||
buffer, err := ioutil.ReadFile("/sys/devices/system/cpu/cpu0/cpufreq/bios_limit")
|
||||
if err == nil {
|
||||
@@ -91,21 +162,102 @@ func getBaseFreq() float64 {
|
||||
return freq
|
||||
}
|
||||
|
||||
func getSocketCpus() map[C.int]int {
|
||||
slist := SocketList()
|
||||
var cpu C.int
|
||||
outmap := make(map[C.int]int)
|
||||
for _, s := range slist {
|
||||
t := C.CString(fmt.Sprintf("S%d", s))
|
||||
clen := C.cpustr_to_cpulist(t, &cpu, 1)
|
||||
if int(clen) == 1 {
|
||||
outmap[cpu] = s
|
||||
func (m *LikwidCollector) initGranularity() {
|
||||
splitRegex := regexp.MustCompile("[-+/*()]")
|
||||
for _, evset := range m.config.Eventsets {
|
||||
evset.granulatity = make(map[string]MetricScope)
|
||||
for counter, event := range evset.Events {
|
||||
gran := getGranularity(counter, event)
|
||||
if gran.Granularity() >= 0 {
|
||||
evset.granulatity[counter] = gran
|
||||
}
|
||||
}
|
||||
for i, metric := range evset.Metrics {
|
||||
s := splitRegex.Split(metric.Calc, -1)
|
||||
gran := MetricScope("cpu")
|
||||
evset.Metrics[i].granulatity = gran
|
||||
for _, x := range s {
|
||||
if _, ok := evset.Events[x]; ok {
|
||||
if evset.granulatity[x].Granularity() > gran.Granularity() {
|
||||
gran = evset.granulatity[x]
|
||||
}
|
||||
}
|
||||
}
|
||||
evset.Metrics[i].granulatity = gran
|
||||
}
|
||||
}
|
||||
return outmap
|
||||
for i, metric := range m.config.Metrics {
|
||||
s := splitRegex.Split(metric.Calc, -1)
|
||||
gran := MetricScope("cpu")
|
||||
m.config.Metrics[i].granulatity = gran
|
||||
for _, x := range s {
|
||||
for _, evset := range m.config.Eventsets {
|
||||
for _, m := range evset.Metrics {
|
||||
if m.Name == x && m.granulatity.Granularity() > gran.Granularity() {
|
||||
gran = m.granulatity
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
m.config.Metrics[i].granulatity = gran
|
||||
}
|
||||
}
|
||||
|
||||
func (m *LikwidCollector) Init(config []byte) error {
|
||||
type TopoResolveFunc func(cpuid int) int
|
||||
|
||||
func (m *LikwidCollector) getResponsiblities() map[MetricScope]map[int]int {
|
||||
get_cpus := func(scope MetricScope) map[int]int {
|
||||
var slist []int
|
||||
var cpu C.int
|
||||
var input func(index int) string
|
||||
switch scope {
|
||||
case "node":
|
||||
slist = []int{0}
|
||||
input = func(index int) string { return "N:0" }
|
||||
case "socket":
|
||||
input = func(index int) string { return fmt.Sprintf("%s%d:0", scope.Likwid(), index) }
|
||||
slist = topo.SocketList()
|
||||
// case "numadomain":
|
||||
// input = func(index int) string { return fmt.Sprintf("%s%d:0", scope.Likwid(), index) }
|
||||
// slist = topo.NumaNodeList()
|
||||
// cclog.Debug(scope, " ", input(0), " ", slist)
|
||||
// case "die":
|
||||
// input = func(index int) string { return fmt.Sprintf("%s%d:0", scope.Likwid(), index) }
|
||||
// slist = topo.DieList()
|
||||
// case "llc":
|
||||
// input = fmt.Sprintf("%s%d:0", scope.Likwid(), s)
|
||||
// slist = topo.LLCacheList()
|
||||
case "cpu":
|
||||
input = func(index int) string { return fmt.Sprintf("%d", index) }
|
||||
slist = topo.CpuList()
|
||||
case "hwthread":
|
||||
input = func(index int) string { return fmt.Sprintf("%d", index) }
|
||||
slist = topo.CpuList()
|
||||
}
|
||||
outmap := make(map[int]int)
|
||||
for _, s := range slist {
|
||||
t := C.CString(input(s))
|
||||
clen := C.cpustr_to_cpulist(t, &cpu, 1)
|
||||
if int(clen) == 1 {
|
||||
outmap[s] = m.cpu2tid[int(cpu)]
|
||||
} else {
|
||||
cclog.Error(fmt.Sprintf("Cannot determine responsible CPU for %s", input(s)))
|
||||
outmap[s] = -1
|
||||
}
|
||||
C.free(unsafe.Pointer(t))
|
||||
}
|
||||
return outmap
|
||||
}
|
||||
|
||||
scopes := GetAllMetricScopes()
|
||||
complete := make(map[MetricScope]map[int]int)
|
||||
for _, s := range scopes {
|
||||
complete[s] = get_cpus(s)
|
||||
}
|
||||
return complete
|
||||
}
|
||||
|
||||
func (m *LikwidCollector) Init(config json.RawMessage) error {
|
||||
var ret C.int
|
||||
m.name = "LikwidCollector"
|
||||
if len(config) > 0 {
|
||||
@@ -114,36 +266,78 @@ func (m *LikwidCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
lib := dl.New(LIKWID_LIB_NAME, LIKWID_LIB_DL_FLAGS)
|
||||
if lib == nil {
|
||||
return fmt.Errorf("error instantiating DynamicLibrary for %s", LIKWID_LIB_NAME)
|
||||
}
|
||||
if m.config.ForceOverwrite {
|
||||
cclog.ComponentDebug(m.name, "Set LIKWID_FORCE=1")
|
||||
os.Setenv("LIKWID_FORCE", "1")
|
||||
}
|
||||
m.setup()
|
||||
cpulist := CpuList()
|
||||
m.cpulist = make([]C.int, len(cpulist))
|
||||
slist := getSocketCpus()
|
||||
|
||||
m.sock2tid = make(map[int]int)
|
||||
m.meta = map[string]string{"source": m.name, "group": "PerfCounter"}
|
||||
cclog.ComponentDebug(m.name, "Get cpulist and init maps and lists")
|
||||
cpulist := topo.CpuList()
|
||||
m.cpulist = make([]C.int, len(cpulist))
|
||||
m.cpu2tid = make(map[int]int)
|
||||
for i, c := range cpulist {
|
||||
m.cpulist[i] = C.int(c)
|
||||
if sid, found := slist[m.cpulist[i]]; found {
|
||||
m.sock2tid[sid] = i
|
||||
}
|
||||
m.cpu2tid[c] = i
|
||||
|
||||
}
|
||||
m.results = make(map[int]map[int]map[string]interface{})
|
||||
m.mresults = make(map[int]map[int]map[string]float64)
|
||||
m.gmresults = make(map[int]map[string]float64)
|
||||
cclog.ComponentDebug(m.name, "initialize LIKWID topology")
|
||||
ret = C.topology_init()
|
||||
if ret != 0 {
|
||||
return errors.New("Failed to initialize LIKWID topology")
|
||||
}
|
||||
if m.config.ForceOverwrite {
|
||||
os.Setenv("LIKWID_FORCE", "1")
|
||||
err := errors.New("failed to initialize LIKWID topology")
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
return err
|
||||
}
|
||||
|
||||
// Determine which counter works at which level. PMC*: cpu, *BOX*: socket, ...
|
||||
m.initGranularity()
|
||||
// Generate map for MetricScope -> scope_id (like socket id) -> responsible id (offset in cpulist)
|
||||
m.scopeRespTids = m.getResponsiblities()
|
||||
|
||||
cclog.ComponentDebug(m.name, "initialize LIKWID perfmon module")
|
||||
ret = C.perfmon_init(C.int(len(m.cpulist)), &m.cpulist[0])
|
||||
if ret != 0 {
|
||||
C.topology_finalize()
|
||||
return errors.New("Failed to initialize LIKWID topology")
|
||||
err := errors.New("failed to initialize LIKWID perfmon module")
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
return err
|
||||
}
|
||||
|
||||
// This is for the global metrics computation test
|
||||
globalParams := make(map[string]interface{})
|
||||
globalParams["time"] = float64(1.0)
|
||||
globalParams["inverseClock"] = float64(1.0)
|
||||
// While adding the events, we test the metrics whether they can be computed at all
|
||||
for i, evset := range m.config.Eventsets {
|
||||
estr := eventsToEventStr(evset.Events)
|
||||
// Generate parameter list for the metric computing test
|
||||
params := make(map[string]interface{})
|
||||
params["time"] = float64(1.0)
|
||||
params["inverseClock"] = float64(1.0)
|
||||
for counter := range evset.Events {
|
||||
params[counter] = float64(1.0)
|
||||
}
|
||||
for _, metric := range evset.Metrics {
|
||||
// Try to evaluate the metric
|
||||
_, err := agg.EvalFloat64Condition(metric.Calc, params)
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, "Calculation for metric", metric.Name, "failed:", err.Error())
|
||||
continue
|
||||
}
|
||||
// If the metric is not in the parameter list for the global metrics, add it
|
||||
if _, ok := globalParams[metric.Name]; !ok {
|
||||
globalParams[metric.Name] = float64(1.0)
|
||||
}
|
||||
}
|
||||
// Now we add the list of events to likwid
|
||||
cstr := C.CString(estr)
|
||||
gid := C.perfmon_addEventSet(cstr)
|
||||
if gid >= 0 {
|
||||
@@ -155,153 +349,208 @@ func (m *LikwidCollector) Init(config []byte) error {
|
||||
for tid := range m.cpulist {
|
||||
m.results[i][tid] = make(map[string]interface{})
|
||||
m.mresults[i][tid] = make(map[string]float64)
|
||||
m.gmresults[tid] = make(map[string]float64)
|
||||
if i == 0 {
|
||||
m.gmresults[tid] = make(map[string]float64)
|
||||
}
|
||||
}
|
||||
}
|
||||
for _, metric := range m.config.Metrics {
|
||||
// Try to evaluate the global metric
|
||||
_, err := agg.EvalFloat64Condition(metric.Calc, globalParams)
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, "Calculation for metric", metric.Name, "failed:", err.Error())
|
||||
continue
|
||||
}
|
||||
}
|
||||
|
||||
// If no event set could be added, shut down LikwidCollector
|
||||
if len(m.groups) == 0 {
|
||||
C.perfmon_finalize()
|
||||
C.topology_finalize()
|
||||
return errors.New("No LIKWID performance group initialized")
|
||||
err := errors.New("no LIKWID performance group initialized")
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
return err
|
||||
}
|
||||
m.basefreq = getBaseFreq()
|
||||
cclog.ComponentDebug(m.name, "BaseFreq", m.basefreq)
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *LikwidCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
// take a measurement for 'interval' seconds of event set index 'group'
|
||||
func (m *LikwidCollector) takeMeasurement(group int, interval time.Duration) error {
|
||||
var ret C.int
|
||||
gid := m.groups[group]
|
||||
ret = C.perfmon_setupCounters(gid)
|
||||
if ret != 0 {
|
||||
gctr := C.GoString(C.perfmon_getGroupName(gid))
|
||||
err := fmt.Errorf("failed to setup performance group %d (%s)", gid, gctr)
|
||||
return err
|
||||
}
|
||||
ret = C.perfmon_startCounters()
|
||||
if ret != 0 {
|
||||
gctr := C.GoString(C.perfmon_getGroupName(gid))
|
||||
err := fmt.Errorf("failed to start performance group %d (%s)", gid, gctr)
|
||||
return err
|
||||
}
|
||||
m.running = true
|
||||
time.Sleep(interval)
|
||||
m.running = false
|
||||
ret = C.perfmon_stopCounters()
|
||||
if ret != 0 {
|
||||
gctr := C.GoString(C.perfmon_getGroupName(gid))
|
||||
err := fmt.Errorf("failed to stop performance group %d (%s)", gid, gctr)
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// Get all measurement results for an event set, derive the metric values from the measurement results and send them
|
||||
func (m *LikwidCollector) calcEventsetMetrics(group int, interval time.Duration, output chan lp.CCMetric) error {
|
||||
var eidx C.int
|
||||
evset := m.config.Eventsets[group]
|
||||
gid := m.groups[group]
|
||||
invClock := float64(1.0 / m.basefreq)
|
||||
|
||||
// Go over events and get the results
|
||||
for eidx = 0; int(eidx) < len(evset.Events); eidx++ {
|
||||
ctr := C.perfmon_getCounterName(gid, eidx)
|
||||
ev := C.perfmon_getEventName(gid, eidx)
|
||||
gctr := C.GoString(ctr)
|
||||
gev := C.GoString(ev)
|
||||
// MetricScope for the counter (and if needed the event)
|
||||
scope := getGranularity(gctr, gev)
|
||||
// Get the map scope-id -> tids
|
||||
// This way we read fewer counters, e.g. only the responsible hardware thread for a socket
|
||||
scopemap := m.scopeRespTids[scope]
|
||||
for _, tid := range scopemap {
|
||||
if tid >= 0 {
|
||||
m.results[group][tid]["time"] = interval.Seconds()
|
||||
m.results[group][tid]["inverseClock"] = invClock
|
||||
res := C.perfmon_getLastResult(gid, eidx, C.int(tid))
|
||||
m.results[group][tid][gctr] = float64(res)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Go over the event set metrics, derive the value out of the event:counter values and send it
|
||||
for _, metric := range evset.Metrics {
|
||||
// The metric scope is determined in the Init() function
|
||||
// Get the map scope-id -> tids
|
||||
scopemap := m.scopeRespTids[metric.Scope]
|
||||
for domain, tid := range scopemap {
|
||||
if tid >= 0 {
|
||||
value, err := agg.EvalFloat64Condition(metric.Calc, m.results[group][tid])
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, "Calculation for metric", metric.Name, "failed:", err.Error())
|
||||
continue
|
||||
}
|
||||
m.mresults[group][tid][metric.Name] = value
|
||||
if m.config.InvalidToZero && math.IsNaN(value) {
|
||||
value = 0.0
|
||||
}
|
||||
if m.config.InvalidToZero && math.IsInf(value, 0) {
|
||||
value = 0.0
|
||||
}
|
||||
// Now we have the result, send it with the proper tags
|
||||
if !math.IsNaN(value) {
|
||||
if metric.Publish {
|
||||
tags := map[string]string{"type": metric.Scope.String()}
|
||||
if metric.Scope != "node" {
|
||||
tags["type-id"] = fmt.Sprintf("%d", domain)
|
||||
}
|
||||
fields := map[string]interface{}{"value": value}
|
||||
y, err := lp.New(metric.Name, tags, m.meta, fields, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
// Go over the global metrics, derive the value out of the event sets' metric values and send it
|
||||
func (m *LikwidCollector) calcGlobalMetrics(interval time.Duration, output chan lp.CCMetric) error {
|
||||
for _, metric := range m.config.Metrics {
|
||||
scopemap := m.scopeRespTids[metric.Scope]
|
||||
for domain, tid := range scopemap {
|
||||
if tid >= 0 {
|
||||
// Here we generate parameter list
|
||||
params := make(map[string]interface{})
|
||||
for j := range m.groups {
|
||||
for mname, mres := range m.mresults[j][tid] {
|
||||
params[mname] = mres
|
||||
}
|
||||
}
|
||||
// Evaluate the metric
|
||||
value, err := agg.EvalFloat64Condition(metric.Calc, params)
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, "Calculation for metric", metric.Name, "failed:", err.Error())
|
||||
continue
|
||||
}
|
||||
m.gmresults[tid][metric.Name] = value
|
||||
if m.config.InvalidToZero && math.IsNaN(value) {
|
||||
value = 0.0
|
||||
}
|
||||
if m.config.InvalidToZero && math.IsInf(value, 0) {
|
||||
value = 0.0
|
||||
}
|
||||
// Now we have the result, send it with the proper tags
|
||||
if !math.IsNaN(value) {
|
||||
if metric.Publish {
|
||||
tags := map[string]string{"type": metric.Scope.String()}
|
||||
if metric.Scope != "node" {
|
||||
tags["type-id"] = fmt.Sprintf("%d", domain)
|
||||
}
|
||||
fields := map[string]interface{}{"value": value}
|
||||
y, err := lp.New(metric.Name, tags, m.meta, fields, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// main read function taking multiple measurement rounds, each 'interval' seconds long
|
||||
func (m *LikwidCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
var ret C.int
|
||||
|
||||
for i, gid := range m.groups {
|
||||
evset := m.config.Eventsets[i]
|
||||
ret = C.perfmon_setupCounters(gid)
|
||||
if ret != 0 {
|
||||
log.Print("Failed to setup performance group ", C.perfmon_getGroupName(gid))
|
||||
continue
|
||||
}
|
||||
ret = C.perfmon_startCounters()
|
||||
if ret != 0 {
|
||||
log.Print("Failed to start performance group ", C.perfmon_getGroupName(gid))
|
||||
continue
|
||||
}
|
||||
time.Sleep(interval)
|
||||
ret = C.perfmon_stopCounters()
|
||||
if ret != 0 {
|
||||
log.Print("Failed to stop performance group ", C.perfmon_getGroupName(gid))
|
||||
continue
|
||||
}
|
||||
var eidx C.int
|
||||
for tid := range m.cpulist {
|
||||
for eidx = 0; int(eidx) < len(evset.Events); eidx++ {
|
||||
ctr := C.perfmon_getCounterName(gid, eidx)
|
||||
gctr := C.GoString(ctr)
|
||||
res := C.perfmon_getLastResult(gid, eidx, C.int(tid))
|
||||
m.results[i][tid][gctr] = float64(res)
|
||||
}
|
||||
m.results[i][tid]["time"] = interval.Seconds()
|
||||
m.results[i][tid]["inverseClock"] = float64(1.0 / m.basefreq)
|
||||
for _, metric := range evset.Metrics {
|
||||
expression, err := govaluate.NewEvaluableExpression(metric.Calc)
|
||||
if err != nil {
|
||||
log.Print(err.Error())
|
||||
continue
|
||||
}
|
||||
result, err := expression.Evaluate(m.results[i][tid])
|
||||
if err != nil {
|
||||
log.Print(err.Error())
|
||||
continue
|
||||
}
|
||||
m.mresults[i][tid][metric.Name] = float64(result.(float64))
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
for _, metric := range m.config.Metrics {
|
||||
for tid := range m.cpulist {
|
||||
var params map[string]interface{}
|
||||
expression, err := govaluate.NewEvaluableExpression(metric.Calc)
|
||||
if err != nil {
|
||||
log.Print(err.Error())
|
||||
continue
|
||||
}
|
||||
params = make(map[string]interface{})
|
||||
for j := range m.groups {
|
||||
for mname, mres := range m.mresults[j][tid] {
|
||||
params[mname] = mres
|
||||
}
|
||||
}
|
||||
result, err := expression.Evaluate(params)
|
||||
if err != nil {
|
||||
log.Print(err.Error())
|
||||
continue
|
||||
}
|
||||
m.gmresults[tid][metric.Name] = float64(result.(float64))
|
||||
}
|
||||
}
|
||||
for i := range m.groups {
|
||||
evset := m.config.Eventsets[i]
|
||||
for _, metric := range evset.Metrics {
|
||||
_, skip := stringArrayContains(m.config.ExcludeMetrics, metric.Name)
|
||||
if metric.Publish && !skip {
|
||||
if metric.Socket_scope {
|
||||
for sid, tid := range m.sock2tid {
|
||||
y, err := lp.New(metric.Name,
|
||||
map[string]string{"type": "socket", "type-id": fmt.Sprintf("%d", int(sid))},
|
||||
map[string]interface{}{"value": m.mresults[i][tid][metric.Name]},
|
||||
time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
}
|
||||
} else {
|
||||
for tid, cpu := range m.cpulist {
|
||||
y, err := lp.New(metric.Name,
|
||||
map[string]string{"type": "cpu", "type-id": fmt.Sprintf("%d", int(cpu))},
|
||||
map[string]interface{}{"value": m.mresults[i][tid][metric.Name]},
|
||||
time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
for _, metric := range m.config.Metrics {
|
||||
_, skip := stringArrayContains(m.config.ExcludeMetrics, metric.Name)
|
||||
if metric.Publish && !skip {
|
||||
if metric.Socket_scope {
|
||||
for sid, tid := range m.sock2tid {
|
||||
y, err := lp.New(metric.Name,
|
||||
map[string]string{"type": "socket", "type-id": fmt.Sprintf("%d", int(sid))},
|
||||
map[string]interface{}{"value": m.gmresults[tid][metric.Name]},
|
||||
time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
}
|
||||
} else {
|
||||
for tid, cpu := range m.cpulist {
|
||||
y, err := lp.New(metric.Name,
|
||||
map[string]string{"type": "cpu", "type-id": fmt.Sprintf("%d", int(cpu))},
|
||||
map[string]interface{}{"value": m.gmresults[tid][metric.Name]},
|
||||
time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
}
|
||||
}
|
||||
// measure event set 'i' for 'interval' seconds
|
||||
err := m.takeMeasurement(i, interval)
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
return
|
||||
}
|
||||
// read measurements and derive event set metrics
|
||||
m.calcEventsetMetrics(i, interval, output)
|
||||
}
|
||||
// use the event set metrics to derive the global metrics
|
||||
m.calcGlobalMetrics(interval, output)
|
||||
}
|
||||
|
||||
func (m *LikwidCollector) Close() {
|
||||
if m.init {
|
||||
cclog.ComponentDebug(m.name, "Closing ...")
|
||||
m.init = false
|
||||
if m.running {
|
||||
cclog.ComponentDebug(m.name, "Stopping counters")
|
||||
C.perfmon_stopCounters()
|
||||
}
|
||||
cclog.ComponentDebug(m.name, "Finalize LIKWID perfmon module")
|
||||
C.perfmon_finalize()
|
||||
cclog.ComponentDebug(m.name, "Finalize LIKWID topology module")
|
||||
C.topology_finalize()
|
||||
cclog.ComponentDebug(m.name, "Closing done")
|
||||
}
|
||||
}
|
||||
|
148
collectors/likwidMetric.md
Normal file
@@ -0,0 +1,148 @@
|
||||
|
||||
## `likwid` collector
|
||||
|
||||
The `likwid` collector is probably the most complicated collector. The LIKWID library is included as a static library with *direct* access mode. The *direct* access mode is suitable if the daemon is executed as root. The static library does not contain the performance groups, so all information needs to be provided in the configuration.
|
||||
|
||||
The `likwid` configuration consists of two parts, the "eventsets" and "globalmetrics":
|
||||
- Each entry in the event set list has two parts, the "events" and a set of derivable "metrics". Each of the "events" is a counter:event pair in LIKWID's syntax. The "metrics" are a list of formulas to derive the metric value from the measurements of the "events". Each metric has a name, the formula, a scope and a publish flag. Counter names can be used like variables in the formulas, so `PMC0+PMC1` sums the measurements of the two events configured in the counters `PMC0` and `PMC1`. The scope tells the collector whether it is a metric for each hardware thread (`cpu`) or each CPU socket (`socket`). The publish flag tells the collector whether a metric should be sent to the router.
|
||||
- The global metrics are metrics which require data from all event set measurements to be derived. The inputs are the metrics in the event sets. Similar to the metrics in the event sets, the global metrics are defined by a name, a formula, a scope and a publish flag. See event set metrics for details. The only difference is that there is no access to the raw event measurements anymore but only to the metrics. So, the idea is to derive a metric in the "eventsets" section and reuse it in the "globalmetrics" part. If you need an event set metric only for deriving global metrics, disable its forwarding by setting its publish flag to `false`. **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.
|
||||
|
||||
Additional options:
|
||||
- `force_overwrite`: Same as setting `LIKWID_FORCE=1`. If counters are already in use, LIKWID overwrites their configuration to do its measurements.
|
||||
- `invalid_to_zero`: In some cases, the calculations result in `NaN` or `Inf`. With this option, all `NaN` and `Inf` values are replaced with `0.0`.
|
||||
|
||||
### Available metric scopes
|
||||
|
||||
Hardware performance counters are scattered all over the system nowadays. A counter covers a specific part of the system. While there are hardware thread specific counters for CPU cycles, instructions and so on, some others are specific to a whole CPU socket/package. To address that, the collector provides the specification of a 'scope' for each metric.
|
||||
|
||||
- `cpu` : One metric per CPU hardware thread with the tags `"type" : "cpu"` and `"type-id" : "$cpu_id"`
|
||||
- `socket` : One metric per CPU socket/package with the tags `"type" : "socket"` and `"type-id" : "$socket_id"`
|
||||
|
||||
**Note:** You cannot specify `socket` scope for a metric that is measured at `cpu` scope, so some kind of expert knowledge or lookup work in the [Likwid Wiki](https://github.com/RRZE-HPC/likwid/wiki) is required. Get the scope of each counter from the *Architecture* pages and as soon as one counter in a metric is socket-specific, the whole metric is socket-specific.
|
||||
|
||||
As a guideline:
|
||||
- All counters `FIXCx`, `PMCy` and `TMAz` have the scope `cpu`
|
||||
- All counter names containing `BOX` have the scope `socket`
|
||||
- All `PWRx` counters have scope `socket`, except `"PWR1" : "RAPL_CORE_ENERGY"`, which has `cpu` scope
|
||||
- All `DFCx` counters have scope `socket`
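
The guideline can be written down as a small lookup helper. This is only a sketch of the rules listed above; `counterScope` and `metricScope` are hypothetical names, and the LIKWID *Architecture* pages remain the authoritative source for a given system.

```go
package main

import (
	"fmt"
	"strings"
)

// counterScope applies the guideline to a single counter:event pair.
// It returns "" for counters the guideline does not cover.
func counterScope(counter, event string) string {
	switch {
	case counter == "PWR1" && event == "RAPL_CORE_ENERGY":
		return "cpu" // the one PWRx exception named in the guideline
	case strings.HasPrefix(counter, "FIXC"),
		strings.HasPrefix(counter, "PMC"),
		strings.HasPrefix(counter, "TMA"):
		return "cpu"
	case strings.Contains(counter, "BOX"),
		strings.HasPrefix(counter, "PWR"),
		strings.HasPrefix(counter, "DFC"):
		return "socket"
	}
	return ""
}

// metricScope returns "socket" as soon as one counter used by the metric
// is socket-specific, otherwise "cpu".
func metricScope(events map[string]string, counters []string) string {
	for _, c := range counters {
		if counterScope(c, events[c]) == "socket" {
			return "socket"
		}
	}
	return "cpu"
}

func main() {
	events := map[string]string{
		"PMC0": "RETIRED_INSTRUCTIONS",
		"PMC1": "CPU_CLOCKS_UNHALTED",
		"DFC0": "DRAM_CHANNEL_0",
	}
	fmt.Println(metricScope(events, []string{"PMC0", "PMC1"}))
	fmt.Println(metricScope(events, []string{"PMC0", "DFC0"}))
}
```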
|
||||
|
||||
|
||||
### Example configuration
|
||||
|
||||
|
||||
```json
|
||||
"likwid": {
|
||||
"force_overwrite" : false,
|
||||
"invalid_to_zero" : false,
|
||||
"eventsets": [
|
||||
{
|
||||
"events": {
|
||||
"FIXC1": "ACTUAL_CPU_CLOCK",
|
||||
"FIXC2": "MAX_CPU_CLOCK",
|
||||
"PMC0": "RETIRED_INSTRUCTIONS",
|
||||
"PMC1": "CPU_CLOCKS_UNHALTED",
|
||||
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
|
||||
"PMC3": "MERGE",
|
||||
"DFC0": "DRAM_CHANNEL_0",
|
||||
"DFC1": "DRAM_CHANNEL_1",
|
||||
"DFC2": "DRAM_CHANNEL_2",
|
||||
"DFC3": "DRAM_CHANNEL_3"
|
||||
},
|
||||
"metrics": [
|
||||
{
|
||||
"name": "ipc",
|
||||
"calc": "PMC0/PMC1",
|
||||
"scope": "cpu",
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "flops_any",
|
||||
"calc": "0.000001*PMC2/time",
|
||||
"scope": "cpu",
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "clock_mhz",
|
||||
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
|
||||
"scope": "cpu",
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "mem1",
|
||||
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
|
||||
"scope": "socket",
|
||||
"publish": false
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"events": {
|
||||
"DFC0": "DRAM_CHANNEL_4",
|
||||
"DFC1": "DRAM_CHANNEL_5",
|
||||
"DFC2": "DRAM_CHANNEL_6",
|
||||
"DFC3": "DRAM_CHANNEL_7",
|
||||
"PWR0": "RAPL_CORE_ENERGY",
|
||||
"PWR1": "RAPL_PKG_ENERGY"
|
||||
},
|
||||
"metrics": [
|
||||
{
|
||||
"name": "pwr_core",
|
||||
"calc": "PWR0/time",
|
||||
"scope": "socket",
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "pwr_pkg",
|
||||
"calc": "PWR1/time",
|
||||
"scope": "socket",
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "mem2",
|
||||
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
|
||||
"scope": "socket",
|
||||
"publish": false
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"globalmetrics": [
|
||||
{
|
||||
"name": "mem_bw",
|
||||
"calc": "mem1+mem2",
|
||||
"scope": "socket",
|
||||
"publish": true
|
||||
}
|
||||
]
|
||||
}
|
||||
```
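
To show how such a configuration maps onto data structures, here is a minimal sketch that decodes a shortened version of the example with `encoding/json`. The struct and function names (`likwidConfig`, `eventsetConfig`, `metricConfig`, `parseConfig`) are assumptions made for this illustration and are not taken from the collector's source. The sample holds only the inner object of the `"likwid"` entry.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types mirroring the JSON keys of the example configuration.
type metricConfig struct {
	Name    string `json:"name"`
	Calc    string `json:"calc"`
	Scope   string `json:"scope"`
	Publish bool   `json:"publish"`
}

type eventsetConfig struct {
	Events  map[string]string `json:"events"`
	Metrics []metricConfig    `json:"metrics"`
}

type likwidConfig struct {
	ForceOverwrite bool             `json:"force_overwrite"`
	Eventsets      []eventsetConfig `json:"eventsets"`
	GlobalMetrics  []metricConfig   `json:"globalmetrics"`
}

// sample is a shortened version of the example configuration.
const sample = `{
  "force_overwrite": false,
  "eventsets": [
    {
      "events": {"PMC0": "RETIRED_INSTRUCTIONS", "PMC1": "CPU_CLOCKS_UNHALTED"},
      "metrics": [
        {"name": "ipc", "calc": "PMC0/PMC1", "scope": "cpu", "publish": true}
      ]
    }
  ],
  "globalmetrics": [
    {"name": "mem_bw", "calc": "mem1+mem2", "scope": "socket", "publish": true}
  ]
}`

// parseConfig decodes the raw JSON into the hypothetical config type.
func parseConfig(raw []byte) (likwidConfig, error) {
	var cfg likwidConfig
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}

func main() {
	cfg, err := parseConfig([]byte(sample))
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	for i, es := range cfg.Eventsets {
		for _, m := range es.Metrics {
			fmt.Printf("eventset %d: %s = %s (%s)\n", i, m.Name, m.Calc, m.Scope)
		}
	}
	for _, m := range cfg.GlobalMetrics {
		fmt.Printf("global: %s = %s (%s)\n", m.Name, m.Calc, m.Scope)
	}
}
```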
|
||||
|
||||
### How to get the eventsets and metrics from LIKWID
|
||||
|
||||
The `likwid` collector reads hardware performance counters at a **cpu** and **socket** level. The configuration looks quite complicated but it is basically copy and paste from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). Earlier iterations of the collector tried to use the performance groups directly, but that approach lacked flexibility. The current way of configuration provides the most flexibility.
|
||||
|
||||
The logic is as follows: There are multiple eventsets, each consisting of a list of counters+events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
|
||||
```
|
||||
EVENTSET -> "events": {
|
||||
FIXC1 ACTUAL_CPU_CLOCK -> "FIXC1": "ACTUAL_CPU_CLOCK",
|
||||
FIXC2 MAX_CPU_CLOCK -> "FIXC2": "MAX_CPU_CLOCK",
|
||||
PMC0 RETIRED_INSTRUCTIONS -> "PMC0" : "RETIRED_INSTRUCTIONS",
|
||||
PMC1 CPU_CLOCKS_UNHALTED -> "PMC1" : "CPU_CLOCKS_UNHALTED",
|
||||
PMC2 RETIRED_SSE_AVX_FLOPS_ALL -> "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
|
||||
PMC3 MERGE -> "PMC3": "MERGE",
|
||||
-> }
|
||||
```
|
||||
|
||||
The metrics follow the same procedure:
|
||||
|
||||
```
|
||||
METRICS -> "metrics": [
|
||||
IPC PMC0/PMC1 -> {
|
||||
-> "name" : "IPC",
|
||||
-> "calc" : "PMC0/PMC1",
|
||||
-> "scope": "cpu",
|
||||
-> "publish": true
|
||||
-> }
|
||||
-> ]
|
||||
```
|
||||
|
@@ -2,29 +2,39 @@ package collectors
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io/ioutil"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const LOADAVGFILE = `/proc/loadavg`
|
||||
|
||||
type LoadavgCollectorConfig struct {
|
||||
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
|
||||
}
|
||||
//
|
||||
// LoadavgCollector collects:
|
||||
// * load average of last 1, 5 & 15 minutes
|
||||
// * number of processes currently runnable
|
||||
// * total number of processes in system
|
||||
//
|
||||
// See: https://www.kernel.org/doc/html/latest/filesystems/proc.html
|
||||
//
|
||||
const LOADAVGFILE = "/proc/loadavg"
|
||||
|
||||
type LoadavgCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
load_matches []string
|
||||
load_skips []bool
|
||||
proc_matches []string
|
||||
config LoadavgCollectorConfig
|
||||
proc_skips []bool
|
||||
config struct {
|
||||
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
|
||||
}
|
||||
}
|
||||
|
||||
func (m *LoadavgCollector) Init(config []byte) error {
|
||||
func (m *LoadavgCollector) Init(config json.RawMessage) error {
|
||||
m.name = "LoadavgCollector"
|
||||
m.setup()
|
||||
if len(config) > 0 {
|
||||
@@ -33,45 +43,82 @@ func (m *LoadavgCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "LOAD"}
|
||||
m.tags = map[string]string{"type": "node"}
|
||||
m.load_matches = []string{"load_one", "load_five", "load_fifteen"}
|
||||
m.proc_matches = []string{"proc_run", "proc_total"}
|
||||
m.load_matches = []string{
|
||||
"load_one",
|
||||
"load_five",
|
||||
"load_fifteen"}
|
||||
m.load_skips = make([]bool, len(m.load_matches))
|
||||
m.proc_matches = []string{
|
||||
"proc_run",
|
||||
"proc_total"}
|
||||
m.proc_skips = make([]bool, len(m.proc_matches))
|
||||
|
||||
for i, name := range m.load_matches {
|
||||
_, m.load_skips[i] = stringArrayContains(m.config.ExcludeMetrics, name)
|
||||
}
|
||||
for i, name := range m.proc_matches {
|
||||
_, m.proc_skips[i] = stringArrayContains(m.config.ExcludeMetrics, name)
|
||||
}
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *LoadavgCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
var skip bool
|
||||
func (m *LoadavgCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
buffer, err := ioutil.ReadFile(string(LOADAVGFILE))
|
||||
|
||||
buffer, err := ioutil.ReadFile(LOADAVGFILE)
|
||||
if err != nil {
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to read file '%s': %v", LOADAVGFILE, err))
|
||||
}
|
||||
return
|
||||
}
|
||||
now := time.Now()
|
||||
|
||||
// Load metrics
|
||||
ls := strings.Split(string(buffer), ` `)
|
||||
for i, name := range m.load_matches {
|
||||
x, err := strconv.ParseFloat(ls[i], 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert '%s' to float64: %v", ls[i], err))
|
||||
continue
|
||||
}
|
||||
if m.load_skips[i] {
|
||||
continue
|
||||
}
|
||||
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": x}, now)
|
||||
if err == nil {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, name)
|
||||
y, err := lp.New(name, m.tags, map[string]interface{}{"value": float64(x)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
// Process metrics
|
||||
lv := strings.Split(ls[3], `/`)
|
||||
for i, name := range m.proc_matches {
|
||||
x, err := strconv.ParseFloat(lv[i], 64)
|
||||
if err == nil {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, name)
|
||||
y, err := lp.New(name, m.tags, map[string]interface{}{"value": float64(x)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
x, err := strconv.ParseInt(lv[i], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert '%s' to int64: %v", lv[i], err))
|
||||
continue
|
||||
}
|
||||
if m.proc_skips[i] {
|
||||
continue
|
||||
}
|
||||
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": x}, now)
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
|
19
collectors/loadavgMetric.md
Normal file
@@ -0,0 +1,19 @@
|
||||
|
||||
## `loadavg` collector
|
||||
|
||||
```json
|
||||
"loadavg": {
|
||||
"exclude_metrics": [
|
||||
"proc_run"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
|
||||
|
||||
Metrics:
|
||||
* `load_one`
|
||||
* `load_five`
|
||||
* `load_fifteen`
|
||||
* `proc_run`
|
||||
* `proc_total`
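
How the five metrics relate to the content of `/proc/loadavg` can be sketched as follows. The type `loadavg` and the function `parseLoadavg` are hypothetical names for this illustration; the sketch parses a sample string instead of reading the file so that it runs anywhere.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// loadavg holds the values behind load_one, load_five, load_fifteen,
// proc_run and proc_total.
type loadavg struct {
	One, Five, Fifteen float64
	ProcRun, ProcTotal int64
}

// parseLoadavg parses one line in the format of /proc/loadavg,
// e.g. "0.52 0.41 0.37 2/1234 56789".
func parseLoadavg(content string) (loadavg, error) {
	var l loadavg
	fields := strings.Fields(content)
	if len(fields) < 4 {
		return l, fmt.Errorf("expected at least 4 fields, got %d", len(fields))
	}
	// The first three fields are the 1, 5 and 15 minute load averages.
	for i, dst := range []*float64{&l.One, &l.Five, &l.Fifteen} {
		v, err := strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return l, err
		}
		*dst = v
	}
	// The fourth field is "<runnable>/<total>" scheduling entities.
	procs := strings.Split(fields[3], "/")
	if len(procs) != 2 {
		return l, fmt.Errorf("unexpected process field %q", fields[3])
	}
	var err error
	if l.ProcRun, err = strconv.ParseInt(procs[0], 10, 64); err != nil {
		return l, err
	}
	if l.ProcTotal, err = strconv.ParseInt(procs[1], 10, 64); err != nil {
		return l, err
	}
	return l, nil
}

func main() {
	l, err := parseLoadavg("0.52 0.41 0.37 2/1234 56789\n")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", l)
}
```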
|
@@ -3,31 +3,84 @@ package collectors
|
||||
import (
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"io/ioutil"
|
||||
"log"
|
||||
"fmt"
|
||||
"os/exec"
|
||||
"os/user"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const LUSTREFILE = `/proc/fs/lustre/llite/lnec-XXXXXX/stats`
|
||||
const LUSTRE_SYSFS = `/sys/fs/lustre`
|
||||
const LCTL_CMD = `lctl`
|
||||
const LCTL_OPTION = `get_param`
|
||||
|
||||
type LustreCollectorConfig struct {
|
||||
Procfiles []string `json:"procfiles"`
|
||||
LCtlCommand string `json:"lctl_command"`
|
||||
ExcludeMetrics []string `json:"exclude_metrics"`
|
||||
SendAllMetrics bool `json:"send_all_metrics"`
|
||||
}
|
||||
|
||||
type LustreCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
matches map[string]map[string]int
|
||||
devices []string
|
||||
stats map[string]map[string]int64
|
||||
config LustreCollectorConfig
|
||||
lctl string
|
||||
}
|
||||
|
||||
func (m *LustreCollector) Init(config []byte) error {
|
||||
func (m *LustreCollector) getDeviceDataCommand(device string) []string {
|
||||
statsfile := fmt.Sprintf("llite.%s.stats", device)
|
||||
command := exec.Command(m.lctl, LCTL_OPTION, statsfile)
|
||||
command.Wait()
|
||||
stdout, _ := command.Output()
|
||||
return strings.Split(string(stdout), "\n")
|
||||
}
|
||||
|
||||
func (m *LustreCollector) getDevices() []string {
|
||||
devices := make([]string, 0)
|
||||
|
||||
// //Version reading devices from sysfs
|
||||
// globPattern := filepath.Join(LUSTRE_SYSFS, "llite/*/stats")
|
||||
// files, err := filepath.Glob(globPattern)
|
||||
// if err != nil {
|
||||
// return devices
|
||||
// }
|
||||
// for _, f := range files {
|
||||
// pathlist := strings.Split(f, "/")
|
||||
// devices = append(devices, pathlist[4])
|
||||
// }
|
||||
|
||||
data := m.getDeviceDataCommand("*")
|
||||
|
||||
for _, line := range data {
|
||||
if strings.HasPrefix(line, "llite") {
|
||||
linefields := strings.Split(line, ".")
|
||||
if len(linefields) > 2 {
|
||||
devices = append(devices, linefields[1])
|
||||
}
|
||||
}
|
||||
}
|
||||
return devices
|
||||
}
|
||||
|
||||
// //Version reading the stats data of a device from sysfs
|
||||
// func (m *LustreCollector) getDeviceDataSysfs(device string) []string {
|
||||
// llitedir := filepath.Join(LUSTRE_SYSFS, "llite")
|
||||
// devdir := filepath.Join(llitedir, device)
|
||||
// statsfile := filepath.Join(devdir, "stats")
|
||||
// buffer, err := ioutil.ReadFile(statsfile)
|
||||
// if err != nil {
|
||||
// return make([]string, 0)
|
||||
// }
|
||||
// return strings.Split(string(buffer), "\n")
|
||||
// }
|
||||
|
||||
func (m *LustreCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "LustreCollector"
|
||||
if len(config) > 0 {
|
||||
@@ -38,66 +91,120 @@ func (m *LustreCollector) Init(config []byte) error {
|
||||
}
|
||||
m.setup()
|
||||
m.tags = map[string]string{"type": "node"}
|
||||
m.matches = map[string]map[string]int{"read_bytes": {"read_bytes": 6, "read_requests": 1},
|
||||
"write_bytes": {"write_bytes": 6, "write_requests": 1},
|
||||
"open": {"open": 1},
|
||||
"close": {"close": 1},
|
||||
"setattr": {"setattr": 1},
|
||||
"getattr": {"getattr": 1},
|
||||
"statfs": {"statfs": 1},
|
||||
"inode_permission": {"inode_permission": 1}}
|
||||
m.devices = make([]string, 0)
|
||||
for _, p := range m.config.Procfiles {
|
||||
_, err := ioutil.ReadFile(p)
|
||||
if err == nil {
|
||||
m.devices = append(m.devices, p)
|
||||
} else {
|
||||
log.Print(err.Error())
|
||||
continue
|
||||
}
|
||||
m.meta = map[string]string{"source": m.name, "group": "Lustre"}
|
||||
defmatches := map[string]map[string]int{
|
||||
"read_bytes": {"lustre_read_bytes": 6, "lustre_read_requests": 1},
|
||||
"write_bytes": {"lustre_write_bytes": 6, "lustre_write_requests": 1},
|
||||
"open": {"lustre_open": 1},
|
||||
"close": {"lustre_close": 1},
|
||||
"setattr": {"lustre_setattr": 1},
|
||||
"getattr": {"lustre_getattr": 1},
|
||||
"statfs": {"lustre_statfs": 1},
|
||||
"inode_permission": {"lustre_inode_permission": 1}}
|
||||
|
||||
// Lustre file system statistics can only be queried by user root
|
||||
user, err := user.Current()
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, "Failed to get current user:", err.Error())
|
||||
return err
|
||||
}
|
||||
if user.Uid != "0" {
|
||||
cclog.ComponentError(m.name, "Lustre file system statistics can only be queried by user root")
|
||||
return errors.New("lustre file system statistics can only be queried by user root")
|
||||
}
|
||||
|
||||
if len(m.devices) == 0 {
|
||||
return errors.New("No metrics to collect")
|
||||
m.matches = make(map[string]map[string]int)
|
||||
for lineprefix, names := range defmatches {
|
||||
for metricname, offset := range names {
|
||||
_, skip := stringArrayContains(m.config.ExcludeMetrics, metricname)
|
||||
if skip {
|
||||
continue
|
||||
}
|
||||
if _, prefixExist := m.matches[lineprefix]; !prefixExist {
|
||||
m.matches[lineprefix] = make(map[string]int)
|
||||
}
|
||||
if _, metricExist := m.matches[lineprefix][metricname]; !metricExist {
|
||||
m.matches[lineprefix][metricname] = offset
|
||||
}
|
||||
}
|
||||
}
|
||||
p, err := exec.LookPath(m.config.LCtlCommand)
|
||||
if err != nil {
|
||||
p, err = exec.LookPath(LCTL_CMD)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.lctl = p
|
||||
|
||||
devices := m.getDevices()
|
||||
if len(devices) == 0 {
|
||||
return errors.New("no metrics to collect")
|
||||
}
|
||||
m.stats = make(map[string]map[string]int64)
|
||||
for _, d := range devices {
|
||||
m.stats[d] = make(map[string]int64)
|
||||
for _, names := range m.matches {
|
||||
for metricname := range names {
|
||||
m.stats[d][metricname] = 0
|
||||
}
|
||||
}
|
||||
}
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *LustreCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *LustreCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
for _, p := range m.devices {
|
||||
buffer, err := ioutil.ReadFile(p)
|
||||
for device, devData := range m.stats {
|
||||
stats := m.getDeviceDataCommand(device)
|
||||
processed := []string{}
|
||||
|
||||
if err != nil {
|
||||
log.Print(err)
|
||||
return
|
||||
}
|
||||
|
||||
for _, line := range strings.Split(string(buffer), "\n") {
|
||||
for _, line := range stats {
|
||||
lf := strings.Fields(line)
|
||||
if len(lf) > 1 {
|
||||
for match, fields := range m.matches {
|
||||
if lf[0] == match {
|
||||
for name, idx := range fields {
|
||||
_, skip := stringArrayContains(m.config.ExcludeMetrics, name)
|
||||
if skip {
|
||||
continue
|
||||
if fields, ok := m.matches[lf[0]]; ok {
|
||||
for name, idx := range fields {
|
||||
x, err := strconv.ParseInt(lf[idx], 0, 64)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
value := x - devData[name]
|
||||
devData[name] = x
|
||||
if value < 0 {
|
||||
value = 0
|
||||
}
|
||||
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": value}, time.Now())
|
||||
if err == nil {
|
||||
y.AddTag("device", device)
|
||||
if strings.Contains(name, "byte") {
|
||||
y.AddMeta("unit", "Byte")
|
||||
}
|
||||
x, err := strconv.ParseInt(lf[idx], 0, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New(name, m.tags, map[string]interface{}{"value": x}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
output <- y
|
||||
if m.config.SendAllMetrics {
|
||||
processed = append(processed, name)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
if m.config.SendAllMetrics {
|
||||
for name := range devData {
|
||||
if _, done := stringArrayContains(processed, name); !done {
|
||||
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": 0}, time.Now())
|
||||
if err == nil {
|
||||
y.AddTag("device", device)
|
||||
if strings.Contains(name, "byte") {
|
||||
y.AddMeta("unit", "Byte")
|
||||
}
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
29
collectors/lustreMetric.md
Normal file
@@ -0,0 +1,29 @@
|
||||
|
||||
## `lustrestat` collector
|
||||
|
||||
```json
|
||||
"lustrestat": {
|
||||
"procfiles" : [
|
||||
"/proc/fs/lustre/llite/lnec-XXXXXX/stats"
|
||||
],
|
||||
"exclude_metrics": [
|
||||
"setattr",
|
||||
"getattr"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `lustrestat` collector reads from Lustre's procfs stat files, such as `/proc/fs/lustre/llite/lnec-XXXXXX/stats`.
|
||||
|
||||
Metrics:
|
||||
* `read_bytes`
|
||||
* `read_requests`
|
||||
* `write_bytes`
|
||||
* `write_requests`
|
||||
* `open`
|
||||
* `close`
|
||||
* `getattr`
|
||||
* `setattr`
|
||||
* `statfs`
|
||||
* `inode_permission`
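
A rough sketch of how one line of such a stats file could be turned into a metric value is shown below. It assumes the line layout `name count samples [unit] min max sum`, which matches the column offsets 1 (requests) and 6 (bytes) used by the collector code; the sample line and the helper names `parseStatsLine` and `delta` are made up for the illustration. Because the counters are cumulative, the sketch also shows a per-interval difference that is clamped at zero.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseStatsLine returns the line's name and the integer found in column idx.
// ok is false if the column is missing or not a number.
func parseStatsLine(line string, idx int) (name string, value int64, ok bool) {
	f := strings.Fields(line)
	if len(f) <= idx {
		return "", 0, false
	}
	v, err := strconv.ParseInt(f[idx], 10, 64)
	if err != nil {
		return "", 0, false
	}
	return f[0], v, true
}

// delta returns the increase since the previous reading and stores the
// current value. Negative differences (e.g. after a counter reset) become 0.
func delta(prev map[string]int64, name string, current int64) int64 {
	d := current - prev[name]
	prev[name] = current
	if d < 0 {
		d = 0
	}
	return d
}

func main() {
	// Hypothetical sample line in the assumed layout.
	line := "read_bytes 12 samples [bytes] 4096 1048576 5242880"
	prev := map[string]int64{}

	if _, requests, ok := parseStatsLine(line, 1); ok {
		fmt.Println("read_requests:", delta(prev, "read_requests", requests))
	}
	if _, bytes, ok := parseStatsLine(line, 6); ok {
		fmt.Println("read_bytes:", delta(prev, "read_bytes", bytes))
	}
}
```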
|
||||
|
@@ -10,7 +10,7 @@ import (
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const MEMSTATFILE = `/proc/meminfo`
|
||||
@@ -20,14 +20,14 @@ type MemstatCollectorConfig struct {
|
||||
}
|
||||
|
||||
type MemstatCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
stats map[string]int64
|
||||
tags map[string]string
|
||||
matches map[string]string
|
||||
config MemstatCollectorConfig
|
||||
}
|
||||
|
||||
func (m *MemstatCollector) Init(config []byte) error {
|
||||
func (m *MemstatCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "MemstatCollector"
|
||||
if len(config) > 0 {
|
||||
@@ -36,6 +36,7 @@ func (m *MemstatCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{"source": m.name, "group": "Memory", "unit": "kByte"}
|
||||
m.stats = make(map[string]int64)
|
||||
m.matches = make(map[string]string)
|
||||
m.tags = map[string]string{"type": "node"}
|
||||
@@ -65,7 +66,7 @@ func (m *MemstatCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
|
||||
func (m *MemstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *MemstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -93,13 +94,13 @@ func (m *MemstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
|
||||
for match, name := range m.matches {
|
||||
if _, exists := m.stats[match]; !exists {
|
||||
err = errors.New(fmt.Sprintf("Parse error for %s : %s", match, name))
|
||||
err = fmt.Errorf("Parse error for %s : %s", match, name)
|
||||
log.Print(err)
|
||||
continue
|
||||
}
|
||||
y, err := lp.New(name, m.tags, map[string]interface{}{"value": int(float64(m.stats[match]) * 1.0e-3)}, time.Now())
|
||||
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": int(float64(m.stats[match]) * 1.0e-3)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
@@ -108,18 +109,18 @@ func (m *MemstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
if _, cached := m.stats[`Cached`]; cached {
|
||||
memUsed := m.stats[`MemTotal`] - (m.stats[`MemFree`] + m.stats[`Buffers`] + m.stats[`Cached`])
|
||||
_, skip := stringArrayContains(m.config.ExcludeMetrics, "mem_used")
|
||||
y, err := lp.New("mem_used", m.tags, map[string]interface{}{"value": int(float64(memUsed) * 1.0e-3)}, time.Now())
|
||||
y, err := lp.New("mem_used", m.tags, m.meta, map[string]interface{}{"value": int(float64(memUsed) * 1.0e-3)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
if _, found := m.stats[`MemShared`]; found {
|
||||
_, skip := stringArrayContains(m.config.ExcludeMetrics, "mem_shared")
|
||||
y, err := lp.New("mem_shared", m.tags, map[string]interface{}{"value": int(float64(m.stats[`MemShared`]) * 1.0e-3)}, time.Now())
|
||||
y, err := lp.New("mem_shared", m.tags, m.meta, map[string]interface{}{"value": int(float64(m.stats[`MemShared`]) * 1.0e-3)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
27
collectors/memstatMetric.md
Normal file
@@ -0,0 +1,27 @@
|
||||
|
||||
## `memstat` collector
|
||||
|
||||
```json
|
||||
"memstat": {
|
||||
"exclude_metrics": [
|
||||
"mem_used"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `memstat` collector reads data from `/proc/meminfo` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
|
||||
|
||||
|
||||
Metrics:
|
||||
* `mem_total`
|
||||
* `mem_sreclaimable`
|
||||
* `mem_slab`
|
||||
* `mem_free`
|
||||
* `mem_buffers`
|
||||
* `mem_cached`
|
||||
* `mem_available`
|
||||
* `mem_shared`
|
||||
* `swap_total`
|
||||
* `swap_free`
|
||||
* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)
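
The `mem_used` formula can be reproduced with a few lines of Go. This is a sketch only: `parseMeminfo` and `memUsed` are hypothetical helpers, the input is a sample string rather than the real file, and the values are kept in the unit used by `/proc/meminfo` (kB) without any further conversion.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfo turns "Key:   value kB" lines into a map from key to value.
// Lines that do not match this shape are skipped.
func parseMeminfo(content string) map[string]int64 {
	stats := make(map[string]int64)
	for _, line := range strings.Split(content, "\n") {
		f := strings.Fields(line)
		if len(f) < 2 {
			continue
		}
		v, err := strconv.ParseInt(f[1], 10, 64)
		if err != nil {
			continue
		}
		stats[strings.TrimSuffix(f[0], ":")] = v
	}
	return stats
}

// memUsed implements mem_total - (mem_free + mem_buffers + mem_cached).
func memUsed(stats map[string]int64) int64 {
	return stats["MemTotal"] - (stats["MemFree"] + stats["Buffers"] + stats["Cached"])
}

func main() {
	sample := `MemTotal:       16000000 kB
MemFree:         4000000 kB
Buffers:          500000 kB
Cached:          3500000 kB
`
	stats := parseMeminfo(sample)
	fmt.Println("mem_used (kB):", memUsed(stats))
}
```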
|
||||
|
@@ -1,40 +1,48 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"errors"
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io/ioutil"
|
||||
"log"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
type MetricGetter interface {
|
||||
type MetricCollector interface {
|
||||
Name() string
|
||||
Init(config []byte) error
|
||||
Init(config json.RawMessage) error
|
||||
Initialized() bool
|
||||
Read(time.Duration, *[]lp.MutableMetric)
|
||||
Read(duration time.Duration, output chan lp.CCMetric)
|
||||
Close()
|
||||
}
|
||||
|
||||
type MetricCollector struct {
|
||||
type metricCollector struct {
|
||||
name string
|
||||
init bool
|
||||
meta map[string]string
|
||||
}
|
||||
|
||||
func (c *MetricCollector) Name() string {
|
||||
// Name() returns the name of the metric collector
|
||||
func (c *metricCollector) Name() string {
|
||||
return c.name
|
||||
}
|
||||
|
||||
func (c *MetricCollector) setup() error {
|
||||
func (c *metricCollector) setup() error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (c *MetricCollector) Initialized() bool {
|
||||
return c.init == true
|
||||
// Initialized() indicates whether the metric collector has been initialized.
|
||||
func (c *metricCollector) Initialized() bool {
|
||||
return c.init
|
||||
}
|
||||
|
||||
// intArrayContains scans an array of ints for the value str.
|
||||
// If the specified value is found, the corresponding array index is returned.
|
||||
// The bool value is used to signal success or failure
|
||||
func intArrayContains(array []int, str int) (int, bool) {
|
||||
for i, a := range array {
|
||||
if a == str {
|
||||
@@ -44,6 +52,9 @@ func intArrayContains(array []int, str int) (int, bool) {
|
||||
return -1, false
|
||||
}
|
||||
|
||||
// stringArrayContains scans an array of strings for the value str.
|
||||
// If the specified value is found, the corresponding array index is returned.
|
||||
// The bool value is used to signal success or failure
|
||||
func stringArrayContains(array []string, str string) (int, bool) {
|
||||
for i, a := range array {
|
||||
if a == str {
|
||||
@@ -103,27 +114,13 @@ func CpuList() []int {
|
||||
return cpulist
|
||||
}
|
||||
|
||||
func Tags2Map(metric lp.Metric) map[string]string {
|
||||
tags := make(map[string]string)
|
||||
for _, t := range metric.TagList() {
|
||||
tags[t.Key] = t.Value
|
||||
}
|
||||
return tags
|
||||
}
|
||||
|
||||
func Fields2Map(metric lp.Metric) map[string]interface{} {
|
||||
fields := make(map[string]interface{})
|
||||
for _, f := range metric.FieldList() {
|
||||
fields[f.Key] = f.Value
|
||||
}
|
||||
return fields
|
||||
}
|
||||
|
||||
// RemoveFromStringList removes the string r from the array of strings s
|
||||
// If r is not contained in the array an error is returned
|
||||
func RemoveFromStringList(s []string, r string) ([]string, error) {
|
||||
for i, item := range s {
|
||||
if r == item {
|
||||
return append(s[:i], s[i+1:]...), nil
|
||||
}
|
||||
}
|
||||
return s, errors.New("No such string in list")
|
||||
return s, fmt.Errorf("No such string in list")
|
||||
}
|
||||
|
@@ -1,86 +1,138 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"encoding/json"
|
||||
"io/ioutil"
|
||||
"log"
|
||||
"errors"
|
||||
"os"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const NETSTATFILE = `/proc/net/dev`
|
||||
|
||||
type NetstatCollectorConfig struct {
|
||||
ExcludeDevices []string `json:"exclude_devices"`
|
||||
IncludeDevices []string `json:"include_devices"`
|
||||
}
|
||||
|
||||
type NetstatCollectorMetric struct {
|
||||
index int
|
||||
lastValue float64
|
||||
}
|
||||
|
||||
type NetstatCollector struct {
|
||||
MetricCollector
|
||||
config NetstatCollectorConfig
|
||||
matches map[int]string
|
||||
metricCollector
|
||||
config NetstatCollectorConfig
|
||||
matches map[string]map[string]NetstatCollectorMetric
|
||||
devtags map[string]map[string]string
|
||||
lastTimestamp time.Time
|
||||
}
|
||||
|
||||
func (m *NetstatCollector) Init(config []byte) error {
|
||||
func (m *NetstatCollector) Init(config json.RawMessage) error {
|
||||
m.name = "NetstatCollector"
|
||||
m.setup()
|
||||
m.matches = map[int]string{
|
||||
1: "bytes_in",
|
||||
9: "bytes_out",
|
||||
2: "pkts_in",
|
||||
10: "pkts_out",
|
||||
m.lastTimestamp = time.Now()
|
||||
m.meta = map[string]string{"source": m.name, "group": "Network"}
|
||||
m.devtags = make(map[string]map[string]string)
|
||||
nameIndexMap := map[string]int{
|
||||
"net_bytes_in": 1,
|
||||
"net_pkts_in": 2,
|
||||
"net_bytes_out": 9,
|
||||
"net_pkts_out": 10,
|
||||
}
|
||||
m.matches = make(map[string]map[string]NetstatCollectorMetric)
|
||||
if len(config) > 0 {
|
||||
err := json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
log.Print(err.Error())
|
||||
cclog.ComponentError(m.name, "Error reading config:", err.Error())
|
||||
return err
|
||||
}
|
||||
}
|
||||
_, err := ioutil.ReadFile(string(NETSTATFILE))
|
||||
if err == nil {
|
||||
m.init = true
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *NetstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
data, err := ioutil.ReadFile(string(NETSTATFILE))
|
||||
file, err := os.Open(string(NETSTATFILE))
|
||||
if err != nil {
|
||||
log.Print(err.Error())
|
||||
return
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
return err
|
||||
}
|
||||
defer file.Close()
|
||||
|
||||
lines := strings.Split(string(data), "\n")
|
||||
for _, l := range lines {
|
||||
scanner := bufio.NewScanner(file)
|
||||
for scanner.Scan() {
|
||||
l := scanner.Text()
|
||||
if !strings.Contains(l, ":") {
|
||||
continue
|
||||
}
|
||||
f := strings.Fields(l)
|
||||
dev := f[0][0 : len(f[0])-1]
|
||||
cont := false
|
||||
for _, d := range m.config.ExcludeDevices {
|
||||
if d == dev {
|
||||
cont = true
|
||||
dev := strings.Trim(f[0], ": ")
|
||||
if _, ok := stringArrayContains(m.config.IncludeDevices, dev); ok {
|
||||
m.matches[dev] = make(map[string]NetstatCollectorMetric)
|
||||
for name, idx := range nameIndexMap {
|
||||
m.matches[dev][name] = NetstatCollectorMetric{
|
||||
index: idx,
|
||||
lastValue: 0,
|
||||
}
|
||||
}
|
||||
m.devtags[dev] = map[string]string{"device": dev, "type": "node"}
|
||||
}
|
||||
if cont {
|
||||
}
|
||||
if len(m.devtags) == 0 {
|
||||
return errors.New("no devices to collect metrics found")
|
||||
}
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *NetstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
now := time.Now()
|
||||
file, err := os.Open(string(NETSTATFILE))
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
return
|
||||
}
|
||||
defer file.Close()
|
||||
tdiff := now.Sub(m.lastTimestamp)
|
||||
|
||||
scanner := bufio.NewScanner(file)
|
||||
for scanner.Scan() {
|
||||
l := scanner.Text()
|
||||
if !strings.Contains(l, ":") {
|
||||
continue
|
||||
}
|
||||
tags := map[string]string{"device": dev, "type": "node"}
|
||||
for i, name := range m.matches {
|
||||
v, err := strconv.ParseInt(f[i], 10, 0)
|
||||
if err == nil {
|
||||
y, err := lp.New(name, tags, map[string]interface{}{"value": int(float64(v) * 1.0e-3)}, time.Now())
|
||||
f := strings.Fields(l)
|
||||
dev := strings.Trim(f[0], ":")
|
||||
|
||||
if devmetrics, ok := m.matches[dev]; ok {
|
||||
for name, data := range devmetrics {
|
||||
v, err := strconv.ParseFloat(f[data.index], 64)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
vdiff := v - data.lastValue
|
||||
value := vdiff / tdiff.Seconds()
|
||||
if data.lastValue == 0 {
|
||||
value = 0
|
||||
}
|
||||
data.lastValue = v
|
||||
y, err := lp.New(name, m.devtags[dev], m.meta, map[string]interface{}{"value": value}, now)
|
||||
if err == nil {
|
||||
switch {
|
||||
case strings.Contains(name, "byte"):
|
||||
y.AddMeta("unit", "bytes/sec")
|
||||
case strings.Contains(name, "pkt"):
|
||||
y.AddMeta("unit", "packets/sec")
|
||||
}
|
||||
output <- y
|
||||
}
|
||||
devmetrics[name] = data
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
m.lastTimestamp = time.Now()
|
||||
}
|
||||
|
||||
func (m *NetstatCollector) Close() {
|
||||
|
21
collectors/netstatMetric.md
Normal file
@@ -0,0 +1,21 @@
|
||||
|
||||
## `netstat` collector
|
||||
|
||||
```json
|
||||
"netstat": {
|
||||
"include_devices": [
|
||||
"eth0"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `netstat` collector reads data from `/proc/net/dev` and outputs a handful of **node** metrics. Use the `include_devices` list to specify which network devices should be measured. **Note**: Most other collectors use an _exclude_ list instead of an include list.
|
||||
|
||||
Metrics:
|
||||
* `net_bytes_in` (`unit=bytes/sec`)
|
||||
* `net_bytes_out` (`unit=bytes/sec`)
|
||||
* `net_pkts_in` (`unit=packets/sec`)
|
||||
* `net_pkts_out` (`unit=packets/sec`)
|
||||
|
||||
The device name is added as tag `device`.
|
||||
|
39
collectors/nfs3Metric.md
Normal file
@@ -0,0 +1,39 @@
|
||||
|
||||
## `nfs3stat` collector
|
||||
|
||||
```json
|
||||
"nfs3stat": {
|
||||
"nfsstat" : "/path/to/nfsstat",
|
||||
"exclude_metrics": [
|
||||
"nfs3_total"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `nfs3stat` collector reads data from the `nfsstat` command and outputs a handful of **node** metrics. Metrics that are not required can be excluded from being forwarded to the sink. Per-mount-point metrics are currently not available.
|
||||
|
||||
|
||||
Metrics:
|
||||
* `nfs3_total`
|
||||
* `nfs3_null`
|
||||
* `nfs3_getattr`
|
||||
* `nfs3_setattr`
|
||||
* `nfs3_lookup`
|
||||
* `nfs3_access`
|
||||
* `nfs3_readlink`
|
||||
* `nfs3_read`
|
||||
* `nfs3_write`
|
||||
* `nfs3_create`
|
||||
* `nfs3_mkdir`
|
||||
* `nfs3_symlink`
|
||||
* `nfs3_remove`
|
||||
* `nfs3_rmdir`
|
||||
* `nfs3_rename`
|
||||
* `nfs3_link`
|
||||
* `nfs3_readdir`
|
||||
* `nfs3_readdirplus`
|
||||
* `nfs3_fsstat`
|
||||
* `nfs3_fsinfo`
|
||||
* `nfs3_pathconf`
|
||||
* `nfs3_commit`
|
||||
|
62
collectors/nfs4Metric.md
Normal file
@@ -0,0 +1,62 @@
|
||||
|
||||
## `nfs4stat` collector
|
||||
|
||||
```json
|
||||
"nfs4stat": {
|
||||
"nfsstat" : "/path/to/nfsstat",
|
||||
"exclude_metrics": [
|
||||
"nfs4_total"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `nfs4stat` collector reads data from the `nfsstat` command and outputs a handful of **node** metrics. Metrics that are not required can be excluded from being forwarded to the sink. Per-mount-point metrics are currently not available.
|
||||
|
||||
|
||||
Metrics:
|
||||
* `nfs4_total`
|
||||
* `nfs4_null`
|
||||
* `nfs4_read`
|
||||
* `nfs4_write`
|
||||
* `nfs4_commit`
|
||||
* `nfs4_open`
|
||||
* `nfs4_open_conf`
|
||||
* `nfs4_open_noat`
|
||||
* `nfs4_open_dgrd`
|
||||
* `nfs4_close`
|
||||
* `nfs4_setattr`
|
||||
* `nfs4_fsinfo`
|
||||
* `nfs4_renew`
|
||||
* `nfs4_setclntid`
|
||||
* `nfs4_confirm`
|
||||
* `nfs4_lock`
|
||||
* `nfs4_lockt`
|
||||
* `nfs4_locku`
|
||||
* `nfs4_access`
|
||||
* `nfs4_getattr`
|
||||
* `nfs4_lookup`
|
||||
* `nfs4_lookup_root`
|
||||
* `nfs4_remove`
|
||||
* `nfs4_rename`
|
||||
* `nfs4_link`
|
||||
* `nfs4_symlink`
|
||||
* `nfs4_create`
|
||||
* `nfs4_pathconf`
|
||||
* `nfs4_statfs`
|
||||
* `nfs4_readlink`
|
||||
* `nfs4_readdir`
|
||||
* `nfs4_server_caps`
|
||||
* `nfs4_delegreturn`
|
||||
* `nfs4_getacl`
|
||||
* `nfs4_setacl`
|
||||
* `nfs4_rel_lkowner`
|
||||
* `nfs4_exchange_id`
|
||||
* `nfs4_create_session`
|
||||
* `nfs4_destroy_session`
|
||||
* `nfs4_sequence`
|
||||
* `nfs4_get_lease_time`
|
||||
* `nfs4_reclaim_comp`
|
||||
* `nfs4_secinfo_no`
|
||||
* `nfs4_bind_conn_to_ses`
|
||||
|
||||
|
174
collectors/nfsMetric.go
Normal file
@@ -0,0 +1,174 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"log"
|
||||
|
||||
// "os"
|
||||
"os/exec"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
// First part contains the code for the general NfsCollector.
|
||||
// Later, the general NfsCollector is specialized into the Nfs3Collector and Nfs4Collector.
|
||||
|
||||
const NFSSTAT_EXEC = `nfsstat`
|
||||
|
||||
type NfsCollectorData struct {
|
||||
current int64
|
||||
last int64
|
||||
}
|
||||
|
||||
type nfsCollector struct {
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
version string
|
||||
config struct {
|
||||
Nfsstats string `json:"nfsstat"`
|
||||
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
|
||||
}
|
||||
data map[string]NfsCollectorData
|
||||
}
|
||||
|
||||
func (m *nfsCollector) initStats() error {
|
||||
cmd := exec.Command(m.config.Nfsstats, `-l`)
|
||||
// cmd.Output() starts the command and waits for it to finish, so no separate cmd.Wait() call is needed
|
||||
buffer, err := cmd.Output()
|
||||
if err == nil {
|
||||
for _, line := range strings.Split(string(buffer), "\n") {
|
||||
lf := strings.Fields(line)
|
||||
if len(lf) != 5 {
|
||||
continue
|
||||
}
|
||||
if lf[1] == m.version {
|
||||
name := strings.Trim(lf[3], ":")
|
||||
if _, exist := m.data[name]; !exist {
|
||||
value, err := strconv.ParseInt(lf[4], 0, 64)
|
||||
if err == nil {
|
||||
x := m.data[name]
|
||||
x.current = value
|
||||
x.last = 0
|
||||
m.data[name] = x
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return err
|
||||
}
|
||||
|
||||
func (m *nfsCollector) updateStats() error {
|
||||
cmd := exec.Command(m.config.Nfsstats, `-l`)
|
||||
// cmd.Output() starts the command and waits for it to finish, so no separate cmd.Wait() call is needed
|
||||
buffer, err := cmd.Output()
|
||||
if err == nil {
|
||||
for _, line := range strings.Split(string(buffer), "\n") {
|
||||
lf := strings.Fields(line)
|
||||
if len(lf) != 5 {
|
||||
continue
|
||||
}
|
||||
if lf[1] == m.version {
|
||||
name := strings.Trim(lf[3], ":")
|
||||
if _, exist := m.data[name]; exist {
|
||||
value, err := strconv.ParseInt(lf[4], 0, 64)
|
||||
if err == nil {
|
||||
x := m.data[name]
|
||||
x.last = x.current
|
||||
x.current = value
|
||||
m.data[name] = x
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return err
|
||||
}
|
||||
|
||||
func (m *nfsCollector) MainInit(config json.RawMessage) error {
|
||||
m.config.Nfsstats = string(NFSSTAT_EXEC)
|
||||
// Read JSON configuration
|
||||
if len(config) > 0 {
|
||||
err := json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
log.Print(err.Error())
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "NFS",
|
||||
}
|
||||
m.tags = map[string]string{
|
||||
"type": "node",
|
||||
}
|
||||
// Check if nfsstat is in executable search path
|
||||
_, err := exec.LookPath(m.config.Nfsstats)
|
||||
if err != nil {
|
||||
return fmt.Errorf("NfsCollector.Init(): Failed to find nfsstat binary '%s': %v", m.config.Nfsstats, err)
|
||||
}
|
||||
m.data = make(map[string]NfsCollectorData)
|
||||
m.initStats()
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *nfsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
timestamp := time.Now()
|
||||
|
||||
m.updateStats()
|
||||
prefix := ""
|
||||
switch m.version {
|
||||
case "v3":
|
||||
prefix = "nfs3"
|
||||
case "v4":
|
||||
prefix = "nfs4"
|
||||
default:
|
||||
prefix = "nfs"
|
||||
}
|
||||
|
||||
for name, data := range m.data {
|
||||
if _, skip := stringArrayContains(m.config.ExcludeMetrics, name); skip {
|
||||
continue
|
||||
}
|
||||
value := data.current - data.last
|
||||
y, err := lp.New(fmt.Sprintf("%s_%s", prefix, name), m.tags, m.meta, map[string]interface{}{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
y.AddMeta("version", m.version)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func (m *nfsCollector) Close() {
|
||||
m.init = false
|
||||
}
|
||||
|
||||
type Nfs3Collector struct {
|
||||
nfsCollector
|
||||
}
|
||||
|
||||
type Nfs4Collector struct {
|
||||
nfsCollector
|
||||
}
|
||||
|
||||
func (m *Nfs3Collector) Init(config json.RawMessage) error {
|
||||
m.name = "Nfs3Collector"
|
||||
m.version = `v3`
|
||||
m.setup()
|
||||
return m.MainInit(config)
|
||||
}
|
||||
|
||||
func (m *Nfs4Collector) Init(config json.RawMessage) error {
|
||||
m.name = "Nfs4Collector"
|
||||
m.version = `v4`
|
||||
m.setup()
|
||||
return m.MainInit(config)
|
||||
}
|
@@ -2,15 +2,16 @@ package collectors
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"log"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
//
|
||||
@@ -42,11 +43,11 @@ type NUMAStatsCollectorTopolgy struct {
|
||||
}
|
||||
|
||||
type NUMAStatsCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
topology []NUMAStatsCollectorTopolgy
|
||||
}
|
||||
|
||||
func (m *NUMAStatsCollector) Init(config []byte) error {
|
||||
func (m *NUMAStatsCollector) Init(config json.RawMessage) error {
|
||||
// Check if already initialized
|
||||
if m.init {
|
||||
return nil
|
||||
@@ -54,25 +55,29 @@ func (m *NUMAStatsCollector) Init(config []byte) error {
|
||||
|
||||
m.name = "NUMAStatsCollector"
|
||||
m.setup()
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "NUMA",
|
||||
}
|
||||
|
||||
// Loop for all NUMA node directories
|
||||
baseDir := "/sys/devices/system/node"
|
||||
globPattern := filepath.Join(baseDir, "node[0-9]*")
|
||||
base := "/sys/devices/system/node/node"
|
||||
globPattern := base + "[0-9]*"
|
||||
dirs, err := filepath.Glob(globPattern)
|
||||
if err != nil {
|
||||
return fmt.Errorf("unable to glob files with pattern %s", globPattern)
|
||||
return fmt.Errorf("unable to glob files with pattern '%s'", globPattern)
|
||||
}
|
||||
if dirs == nil {
|
||||
return fmt.Errorf("unable to find any files with pattern %s", globPattern)
|
||||
return fmt.Errorf("unable to find any files with pattern '%s'", globPattern)
|
||||
}
|
||||
m.topology = make([]NUMAStatsCollectorTopolgy, 0, len(dirs))
|
||||
for _, dir := range dirs {
|
||||
node := strings.TrimPrefix(dir, "/sys/devices/system/node/node")
|
||||
node := strings.TrimPrefix(dir, base)
|
||||
file := filepath.Join(dir, "numastat")
|
||||
m.topology = append(m.topology,
|
||||
NUMAStatsCollectorTopolgy{
|
||||
file: file,
|
||||
tagSet: map[string]string{"domain": node},
|
||||
tagSet: map[string]string{"memoryDomain": node},
|
||||
})
|
||||
}
|
||||
|
||||
@@ -80,7 +85,7 @@ func (m *NUMAStatsCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *NUMAStatsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *NUMAStatsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -92,9 +97,14 @@ func (m *NUMAStatsCollector) Read(interval time.Duration, out *[]lp.MutableMetri
|
||||
now := time.Now()
|
||||
file, err := os.Open(t.file)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to open file '%s': %v", t.file, err))
|
||||
return
|
||||
}
|
||||
scanner := bufio.NewScanner(file)
|
||||
|
||||
// Read line by line
|
||||
for scanner.Scan() {
|
||||
split := strings.Fields(scanner.Text())
|
||||
if len(split) != 2 {
|
||||
@@ -103,12 +113,20 @@ func (m *NUMAStatsCollector) Read(interval time.Duration, out *[]lp.MutableMetri
|
||||
key := split[0]
|
||||
value, err := strconv.ParseInt(split[1], 10, 64)
|
||||
if err != nil {
|
||||
log.Printf("failed to convert %s='%s' to int64: %v", key, split[1], err)
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert %s='%s' to int64: %v", key, split[1], err))
|
||||
continue
|
||||
}
|
||||
y, err := lp.New("numastats_"+key, t.tagSet, map[string]interface{}{"value": value}, now)
|
||||
y, err := lp.New(
|
||||
"numastats_"+key,
|
||||
t.tagSet,
|
||||
m.meta,
|
||||
map[string]interface{}{"value": value},
|
||||
now,
|
||||
)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
|
15
collectors/numastatsMetric.md
Normal file
@@ -0,0 +1,15 @@
|
||||
|
||||
## `numastat` collector
|
||||
```json
|
||||
"numastat": {}
|
||||
```
|
||||
|
||||
The `numastat` collector reads data from `/sys/devices/system/node/node*/numastat` and outputs a handful of **memoryDomain** metrics. See: https://www.kernel.org/doc/html/latest/admin-guide/numastat.html
|
||||
|
||||
Metrics:
|
||||
* `numastats_numa_hit`: A process wanted to allocate memory from this node, and succeeded.
|
||||
* `numastats_numa_miss`: A process wanted to allocate memory from another node, but ended up with memory from this node.
|
||||
* `numastats_numa_foreign`: A process wanted to allocate on this node, but ended up with memory from another node.
|
||||
* `numastats_local_node`: A process ran on this node's CPU, and got memory from this node.
|
||||
* `numastats_other_node`: A process ran on a different node's CPU, and got memory from this node.
|
||||
* `numastats_interleave_hit`: Interleaving wanted to allocate from this node and succeeded.
|
@@ -7,19 +7,28 @@ import (
|
||||
"log"
|
||||
"time"
|
||||
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
"github.com/NVIDIA/go-nvml/pkg/nvml"
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
)
|
||||
|
||||
type NvidiaCollectorConfig struct {
|
||||
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
|
||||
ExcludeDevices []string `json:"exclude_devices,omitempty"`
|
||||
AddPciInfoTag bool `json:"add_pci_info_tag,omitempty"`
|
||||
}
|
||||
|
||||
type NvidiaCollectorDevice struct {
|
||||
device nvml.Device
|
||||
excludeMetrics map[string]bool
|
||||
tags map[string]string
|
||||
}
|
||||
|
||||
type NvidiaCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
num_gpus int
|
||||
config NvidiaCollectorConfig
|
||||
gpus []NvidiaCollectorDevice
|
||||
}
|
||||
|
||||
func (m *NvidiaCollector) CatchPanic() {
|
||||
@@ -29,9 +38,10 @@ func (m *NvidiaCollector) CatchPanic() {
|
||||
}
|
||||
}
|
||||
|
||||
func (m *NvidiaCollector) Init(config []byte) error {
|
||||
func (m *NvidiaCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "NvidiaCollector"
|
||||
m.config.AddPciInfoTag = false
|
||||
m.setup()
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
@@ -39,224 +49,415 @@ func (m *NvidiaCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "Nvidia",
|
||||
}
|
||||
|
||||
m.num_gpus = 0
|
||||
defer m.CatchPanic()
|
||||
|
||||
// Initialize NVIDIA Management Library (NVML)
|
||||
ret := nvml.Init()
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentError(m.name, "Unable to initialize NVML", err.Error())
|
||||
return err
|
||||
}
|
||||
m.num_gpus, ret = nvml.DeviceGetCount()
|
||||
|
||||
// Number of NVIDIA GPUs
|
||||
num_gpus, ret := nvml.DeviceGetCount()
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentError(m.name, "Unable to get device count", err.Error())
|
||||
return err
|
||||
}
|
||||
|
||||
// For all GPUs
|
||||
m.gpus = make([]NvidiaCollectorDevice, num_gpus)
|
||||
for i := 0; i < num_gpus; i++ {
|
||||
g := &m.gpus[i]
|
||||
|
||||
// Skip excluded devices
|
||||
str_i := fmt.Sprintf("%d", i)
|
||||
if _, skip := stringArrayContains(m.config.ExcludeDevices, str_i); skip {
|
||||
continue
|
||||
}
|
||||
|
||||
// Get device handle
|
||||
device, ret := nvml.DeviceGetHandleByIndex(i)
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentError(m.name, "Unable to get device at index", i, ":", err.Error())
|
||||
return err
|
||||
}
|
||||
g.device = device
|
||||
|
||||
// Add tags
|
||||
g.tags = map[string]string{
|
||||
"type": "accelerator",
|
||||
"type-id": str_i,
|
||||
}
|
||||
|
||||
// Add excluded metrics
|
||||
g.excludeMetrics = map[string]bool{}
|
||||
for _, e := range m.config.ExcludeMetrics {
|
||||
g.excludeMetrics[e] = true
|
||||
}
|
||||
|
||||
// Add PCI info as tag
|
||||
if m.config.AddPciInfoTag {
|
||||
pciInfo, ret := nvml.DeviceGetPciInfo(g.device)
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentError(m.name, "Unable to get PCI info for device at index", i, ":", err.Error())
|
||||
return err
|
||||
}
|
||||
g.tags["pci_identifier"] = fmt.Sprintf(
|
||||
"%08X:%02X:%02X.0",
|
||||
pciInfo.Domain,
|
||||
pciInfo.Bus,
|
||||
pciInfo.Device)
|
||||
}
|
||||
}
|
||||
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *NvidiaCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
for i := 0; i < m.num_gpus; i++ {
|
||||
device, ret := nvml.DeviceGetHandleByIndex(i)
|
||||
if ret != nvml.SUCCESS {
|
||||
log.Fatalf("Unable to get device at index %d: %v", i, nvml.ErrorString(ret))
|
||||
return
|
||||
}
|
||||
_, skip := stringArrayContains(m.config.ExcludeDevices, fmt.Sprintf("%d", i))
|
||||
if skip {
|
||||
continue
|
||||
}
|
||||
tags := map[string]string{"type": "accelerator", "type-id": fmt.Sprintf("%d", i)}
|
||||
|
||||
util, ret := nvml.DeviceGetUtilizationRates(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "util")
|
||||
y, err := lp.New("util", tags, map[string]interface{}{"value": float64(util.Gpu)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "mem_util")
|
||||
y, err = lp.New("mem_util", tags, map[string]interface{}{"value": float64(util.Memory)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
for i := range m.gpus {
|
||||
device := &m.gpus[i]
|
||||
|
||||
if !device.excludeMetrics["nv_util"] || !device.excludeMetrics["nv_mem_util"] {
|
||||
// Retrieves the current utilization rates for the device's major subsystems.
|
||||
//
|
||||
// Available utilization rates
|
||||
// * Gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU.
|
||||
// * Memory: Percent of time over the past sample period during which global (device) memory was being read or written
|
||||
//
|
||||
// Note:
|
||||
// * During driver initialization when ECC is enabled one can see high GPU and Memory Utilization readings.
|
||||
// This is caused by ECC Memory Scrubbing mechanism that is performed during driver initialization.
|
||||
// * On MIG-enabled GPUs, querying device utilization rates is not currently supported.
|
||||
util, ret := nvml.DeviceGetUtilizationRates(device.device)
|
||||
if ret == nvml.SUCCESS {
|
||||
if !device.excludeMetrics["nv_util"] {
|
||||
y, err := lp.New("nv_util", device.tags, m.meta, map[string]interface{}{"value": float64(util.Gpu)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "%")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !device.excludeMetrics["nv_mem_util"] {
|
||||
y, err := lp.New("nv_mem_util", device.tags, m.meta, map[string]interface{}{"value": float64(util.Memory)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "%")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
meminfo, ret := nvml.DeviceGetMemoryInfo(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
t := float64(meminfo.Total) / (1024 * 1024)
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "mem_total")
|
||||
y, err := lp.New("mem_total", tags, map[string]interface{}{"value": t}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
f := float64(meminfo.Used) / (1024 * 1024)
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "fb_memory")
|
||||
y, err = lp.New("fb_memory", tags, map[string]interface{}{"value": f}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_mem_total"] || !device.excludeMetrics["nv_fb_memory"] {
|
||||
// Retrieves the amount of used, free and total memory available on the device, in bytes.
|
||||
//
|
||||
// Enabling ECC reduces the amount of total available memory, due to the extra required parity bits.
|
||||
//
|
||||
// The reported amount of used memory is equal to the sum of memory allocated by all active channels on the device.
|
||||
//
|
||||
// Available memory info:
|
||||
// * Free: Unallocated FB memory (in bytes).
|
||||
// * Total: Total installed FB memory (in bytes).
|
||||
// * Used: Allocated FB memory (in bytes). Note that the driver/GPU always sets aside a small amount of memory for bookkeeping.
|
||||
//
|
||||
// Note:
|
||||
// In MIG mode, if device handle is provided, the API returns aggregate information, only if the caller has appropriate privileges.
|
||||
// Per-instance information can be queried by using specific MIG device handles.
|
||||
meminfo, ret := nvml.DeviceGetMemoryInfo(device.device)
|
||||
if ret == nvml.SUCCESS {
|
||||
if !device.excludeMetrics["nv_mem_total"] {
|
||||
t := float64(meminfo.Total) / (1024 * 1024)
|
||||
y, err := lp.New("nv_mem_total", device.tags, m.meta, map[string]interface{}{"value": t}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "MByte")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
if !device.excludeMetrics["nv_fb_memory"] {
|
||||
f := float64(meminfo.Used) / (1024 * 1024)
|
||||
y, err := lp.New("nv_fb_memory", device.tags, m.meta, map[string]interface{}{"value": f}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "MByte")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
temp, ret := nvml.DeviceGetTemperature(device, nvml.TEMPERATURE_GPU)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "temp")
|
||||
y, err := lp.New("temp", tags, map[string]interface{}{"value": float64(temp)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_temp"] {
|
||||
// Retrieves the current temperature readings for the device, in degrees C.
|
||||
//
|
||||
// Available temperature sensors:
|
||||
// * TEMPERATURE_GPU: Temperature sensor for the GPU die.
|
||||
// * NVML_TEMPERATURE_COUNT
|
||||
temp, ret := nvml.DeviceGetTemperature(device.device, nvml.TEMPERATURE_GPU)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_temp", device.tags, m.meta, map[string]interface{}{"value": float64(temp)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "degC")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fan, ret := nvml.DeviceGetFanSpeed(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "fan")
|
||||
y, err := lp.New("fan", tags, map[string]interface{}{"value": float64(fan)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_fan"] {
|
||||
// Retrieves the intended operating speed of the device's fan.
|
||||
//
|
||||
// Note: The reported speed is the intended fan speed.
|
||||
// If the fan is physically blocked and unable to spin, the output will not match the actual fan speed.
|
||||
//
|
||||
// For all discrete products with dedicated fans.
|
||||
//
|
||||
// The fan speed is expressed as a percentage of the product's maximum noise tolerance fan speed.
|
||||
// This value may exceed 100% in certain cases.
|
||||
fan, ret := nvml.DeviceGetFanSpeed(device.device)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_fan", device.tags, m.meta, map[string]interface{}{"value": float64(fan)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "%")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
_, ecc_pend, ret := nvml.DeviceGetEccMode(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
var y lp.MutableMetric
|
||||
var err error
|
||||
switch ecc_pend {
|
||||
case nvml.FEATURE_DISABLED:
|
||||
y, err = lp.New("ecc_mode", tags, map[string]interface{}{"value": string("OFF")}, time.Now())
|
||||
case nvml.FEATURE_ENABLED:
|
||||
y, err = lp.New("ecc_mode", tags, map[string]interface{}{"value": string("ON")}, time.Now())
|
||||
default:
|
||||
y, err = lp.New("ecc_mode", tags, map[string]interface{}{"value": string("UNKNOWN")}, time.Now())
|
||||
}
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_mode")
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
} else if ret == nvml.ERROR_NOT_SUPPORTED {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_mode")
|
||||
y, err := lp.New("ecc_mode", tags, map[string]interface{}{"value": string("N/A")}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_ecc_mode"] {
|
||||
// Retrieves the current and pending ECC modes for the device.
|
||||
//
|
||||
// For Fermi or newer fully supported devices. Only applicable to devices with ECC.
|
||||
// Requires NVML_INFOROM_ECC version 1.0 or higher.
|
||||
//
|
||||
// Changing ECC modes requires a reboot.
|
||||
// The "pending" ECC mode refers to the target mode following the next reboot.
|
||||
_, ecc_pend, ret := nvml.DeviceGetEccMode(device.device)
|
||||
if ret == nvml.SUCCESS {
|
||||
var y lp.CCMetric
|
||||
var err error
|
||||
switch ecc_pend {
|
||||
case nvml.FEATURE_DISABLED:
|
||||
y, err = lp.New("nv_ecc_mode", device.tags, m.meta, map[string]interface{}{"value": "OFF"}, time.Now())
|
||||
case nvml.FEATURE_ENABLED:
|
||||
y, err = lp.New("nv_ecc_mode", device.tags, m.meta, map[string]interface{}{"value": "ON"}, time.Now())
|
||||
default:
|
||||
y, err = lp.New("nv_ecc_mode", device.tags, m.meta, map[string]interface{}{"value": "UNKNOWN"}, time.Now())
|
||||
}
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
} else if ret == nvml.ERROR_NOT_SUPPORTED {
|
||||
y, err := lp.New("nv_ecc_mode", device.tags, m.meta, map[string]interface{}{"value": "N/A"}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
pstate, ret := nvml.DeviceGetPerformanceState(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "perf_state")
|
||||
y, err := lp.New("perf_state", tags, map[string]interface{}{"value": fmt.Sprintf("P%d", int(pstate))}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_perf_state"] {
|
||||
// Retrieves the current performance state for the device.
|
||||
//
|
||||
// Allowed PStates:
|
||||
// 0: Maximum Performance.
|
||||
// ..
|
||||
// 15: Minimum Performance.
|
||||
// 32: Unknown performance state.
|
||||
pState, ret := nvml.DeviceGetPerformanceState(device.device)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_perf_state", device.tags, m.meta, map[string]interface{}{"value": fmt.Sprintf("P%d", int(pState))}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
power, ret := nvml.DeviceGetPowerUsage(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "power_usage_report")
|
||||
y, err := lp.New("power_usage_report", tags, map[string]interface{}{"value": float64(power) / 1000}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_power_usage_report"] {
|
||||
// Retrieves power usage for this GPU in milliwatts and its associated circuitry (e.g. memory)
|
||||
//
|
||||
// On Fermi and Kepler GPUs the reading is accurate to within +/- 5% of current power draw.
|
||||
//
|
||||
// It is only available if power management mode is supported
|
||||
power, ret := nvml.DeviceGetPowerUsage(device.device)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_power_usage_report", device.tags, m.meta, map[string]interface{}{"value": float64(power) / 1000}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "watts")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
gclk, ret := nvml.DeviceGetClockInfo(device, nvml.CLOCK_GRAPHICS)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "graphics_clock_report")
|
||||
y, err := lp.New("graphics_clock_report", tags, map[string]interface{}{"value": float64(gclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
// Retrieves the current clock speeds for the device.
|
||||
//
|
||||
// Available clock information:
|
||||
// * CLOCK_GRAPHICS: Graphics clock domain.
|
||||
// * CLOCK_SM: Streaming Multiprocessor clock domain.
|
||||
// * CLOCK_MEM: Memory clock domain.
|
||||
if !device.excludeMetrics["nv_graphics_clock_report"] {
|
||||
graphicsClock, ret := nvml.DeviceGetClockInfo(device.device, nvml.CLOCK_GRAPHICS)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_graphics_clock_report", device.tags, m.meta, map[string]interface{}{"value": float64(graphicsClock)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "MHz")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
smclk, ret := nvml.DeviceGetClockInfo(device, nvml.CLOCK_SM)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "sm_clock_report")
|
||||
y, err := lp.New("sm_clock_report", tags, map[string]interface{}{"value": float64(smclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_sm_clock_report"] {
|
||||
smClock, ret := nvml.DeviceGetClockInfo(device.device, nvml.CLOCK_SM)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_sm_clock_report", device.tags, m.meta, map[string]interface{}{"value": float64(smClock)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "MHz")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
memclk, ret := nvml.DeviceGetClockInfo(device, nvml.CLOCK_MEM)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "mem_clock_report")
|
||||
y, err := lp.New("mem_clock_report", tags, map[string]interface{}{"value": float64(memclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_mem_clock_report"] {
|
||||
memClock, ret := nvml.DeviceGetClockInfo(device.device, nvml.CLOCK_MEM)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_mem_clock_report", device.tags, m.meta, map[string]interface{}{"value": float64(memClock)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "MHz")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
max_gclk, ret := nvml.DeviceGetMaxClockInfo(device, nvml.CLOCK_GRAPHICS)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "max_graphics_clock")
|
||||
y, err := lp.New("max_graphics_clock", tags, map[string]interface{}{"value": float64(max_gclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
// Retrieves the maximum clock speeds for the device.
|
||||
//
|
||||
// Available clock information:
|
||||
// * CLOCK_GRAPHICS: Graphics clock domain.
|
||||
// * CLOCK_SM: Streaming multiprocessor clock domain.
|
||||
// * CLOCK_MEM: Memory clock domain.
|
||||
// * CLOCK_VIDEO: Video encoder/decoder clock domain.
|
||||
// * CLOCK_COUNT: Count of clock types.
|
||||
//
|
||||
// Note:
|
||||
// On GPUs from the Fermi family, current P0 clocks (reported by nvmlDeviceGetClockInfo) can differ from max clocks by a few MHz.
|
||||
if !device.excludeMetrics["nv_max_graphics_clock"] {
|
||||
max_gclk, ret := nvml.DeviceGetMaxClockInfo(device.device, nvml.CLOCK_GRAPHICS)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_max_graphics_clock", device.tags, m.meta, map[string]interface{}{"value": float64(max_gclk)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "MHz")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
max_smclk, ret := nvml.DeviceGetClockInfo(device, nvml.CLOCK_SM)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "max_sm_clock")
|
||||
y, err := lp.New("max_sm_clock", tags, map[string]interface{}{"value": float64(max_smclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_max_sm_clock"] {
|
||||
maxSmClock, ret := nvml.DeviceGetMaxClockInfo(device.device, nvml.CLOCK_SM)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_max_sm_clock", device.tags, m.meta, map[string]interface{}{"value": float64(maxSmClock)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "MHz")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
max_memclk, ret := nvml.DeviceGetClockInfo(device, nvml.CLOCK_MEM)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "max_mem_clock")
|
||||
y, err := lp.New("max_mem_clock", tags, map[string]interface{}{"value": float64(max_memclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_max_mem_clock"] {
|
||||
maxMemClock, ret := nvml.DeviceGetMaxClockInfo(device.device, nvml.CLOCK_MEM)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_max_mem_clock", device.tags, m.meta, map[string]interface{}{"value": float64(maxMemClock)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "MHz")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
ecc_db, ret := nvml.DeviceGetTotalEccErrors(device, 1, 1)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_db_error")
|
||||
y, err := lp.New("ecc_db_error", tags, map[string]interface{}{"value": float64(ecc_db)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_ecc_db_error"] {
|
||||
// Retrieves the total ECC error counts for the device.
|
||||
//
|
||||
// For Fermi or newer fully supported devices.
|
||||
// Only applicable to devices with ECC.
|
||||
// Requires NVML_INFOROM_ECC version 1.0 or higher.
|
||||
// Requires ECC Mode to be enabled.
|
||||
//
|
||||
// The total error count is the sum of errors across each of the separate memory systems,
|
||||
// i.e. the total set of errors across the entire device.
|
||||
ecc_db, ret := nvml.DeviceGetTotalEccErrors(device.device, nvml.MEMORY_ERROR_TYPE_UNCORRECTED, nvml.AGGREGATE_ECC)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_ecc_db_error", device.tags, m.meta, map[string]interface{}{"value": float64(ecc_db)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
ecc_sb, ret := nvml.DeviceGetTotalEccErrors(device, 0, 1)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_sb_error")
|
||||
y, err := lp.New("ecc_sb_error", tags, map[string]interface{}{"value": float64(ecc_sb)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_ecc_sb_error"] {
|
||||
ecc_sb, ret := nvml.DeviceGetTotalEccErrors(device.device, nvml.MEMORY_ERROR_TYPE_CORRECTED, nvml.AGGREGATE_ECC)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_ecc_sb_error", device.tags, m.meta, map[string]interface{}{"value": float64(ecc_sb)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
pwr_limit, ret := nvml.DeviceGetPowerManagementLimit(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "power_man_limit")
|
||||
y, err := lp.New("power_man_limit", tags, map[string]interface{}{"value": float64(pwr_limit)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_power_man_limit"] {
|
||||
// Retrieves the power management limit associated with this device.
|
||||
//
|
||||
// For Fermi or newer fully supported devices.
|
||||
//
|
||||
// The power limit defines the upper boundary for the card's power draw.
|
||||
// If the card's total power draw reaches this limit the power management algorithm kicks in.
|
||||
pwr_limit, ret := nvml.DeviceGetPowerManagementLimit(device.device)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_power_man_limit", device.tags, m.meta, map[string]interface{}{"value": float64(pwr_limit) / 1000}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "watts")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
enc_util, _, ret := nvml.DeviceGetEncoderUtilization(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "encoder_util")
|
||||
y, err := lp.New("encoder_util", tags, map[string]interface{}{"value": float64(enc_util)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_encoder_util"] {
|
||||
// Retrieves the current utilization and sampling size in microseconds for the Encoder
|
||||
//
|
||||
// For Kepler or newer fully supported devices.
|
||||
//
|
||||
// Note: On MIG-enabled GPUs, querying encoder utilization is not currently supported.
|
||||
enc_util, _, ret := nvml.DeviceGetEncoderUtilization(device.device)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_encoder_util", device.tags, m.meta, map[string]interface{}{"value": float64(enc_util)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "%")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
dec_util, _, ret := nvml.DeviceGetDecoderUtilization(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "decoder_util")
|
||||
y, err := lp.New("decoder_util", tags, map[string]interface{}{"value": float64(dec_util)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
if !device.excludeMetrics["nv_decoder_util"] {
|
||||
// Retrieves the current utilization and sampling size in microseconds for the Decoder
|
||||
//
|
||||
// For Kepler or newer fully supported devices.
|
||||
//
|
||||
// Note: On MIG-enabled GPUs, querying decoder utilization is not currently supported.
|
||||
dec_util, _, ret := nvml.DeviceGetDecoderUtilization(device.device)
|
||||
if ret == nvml.SUCCESS {
|
||||
y, err := lp.New("nv_decoder_util", device.tags, m.meta, map[string]interface{}{"value": float64(dec_util)}, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "%")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
40
collectors/nvidiaMetric.md
Normal file
@@ -0,0 +1,40 @@

## `nvidia` collector

```json
"nvidia": {
  "exclude_devices": [
    "0", "1"
  ],
  "exclude_metrics": [
    "nv_fb_memory",
    "nv_fan"
  ]
}
```

Metrics:
* `nv_util`
* `nv_mem_util`
* `nv_mem_total`
* `nv_fb_memory`
* `nv_temp`
* `nv_fan`
* `nv_ecc_mode`
* `nv_perf_state`
* `nv_power_usage_report`
* `nv_graphics_clock_report`
* `nv_sm_clock_report`
* `nv_mem_clock_report`
* `nv_max_graphics_clock`
* `nv_max_sm_clock`
* `nv_max_mem_clock`
* `nv_ecc_db_error`
* `nv_ecc_sb_error`
* `nv_power_man_limit`
* `nv_encoder_util`
* `nv_decoder_util`

The collector tags its metrics with a separate `type`, `accelerator`. An output metric looks like this:
`<name>,type=accelerator,type-id=<nvidia-gpu-id> value=<metric value> <timestamp>`
|
||||
|
@@ -4,104 +4,227 @@ import (
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io/ioutil"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const HWMON_PATH = `/sys/class/hwmon`
|
||||
// See: https://www.kernel.org/doc/html/latest/hwmon/sysfs-interface.html
|
||||
// /sys/class/hwmon/hwmon*/name -> coretemp
|
||||
// /sys/class/hwmon/hwmon*/temp*_label -> Core 0
|
||||
// /sys/class/hwmon/hwmon*/temp*_input -> 27800 = 27.8°C
|
||||
// /sys/class/hwmon/hwmon*/temp*_max -> 86000 = 86.0°C
|
||||
// /sys/class/hwmon/hwmon*/temp*_crit -> 100000 = 100.0°C
|
||||
|
||||
type TempCollectorConfig struct {
|
||||
ExcludeMetrics []string `json:"exclude_metrics"`
|
||||
TagOverride map[string]map[string]string `json:"tag_override"`
|
||||
type TempCollectorSensor struct {
|
||||
name string
|
||||
label string
|
||||
metricName string // Default: name_label
|
||||
file string
|
||||
maxTempName string
|
||||
maxTemp int64
|
||||
critTempName string
|
||||
critTemp int64
|
||||
tags map[string]string
|
||||
}
|
||||
|
||||
type TempCollector struct {
|
||||
MetricCollector
|
||||
config TempCollectorConfig
|
||||
metricCollector
|
||||
config struct {
|
||||
ExcludeMetrics []string `json:"exclude_metrics"`
|
||||
TagOverride map[string]map[string]string `json:"tag_override"`
|
||||
ReportMaxTemp bool `json:"report_max_temperature"`
|
||||
ReportCriticalTemp bool `json:"report_critical_temperature"`
|
||||
}
|
||||
sensors []*TempCollectorSensor
|
||||
}
|
||||
|
||||
func (m *TempCollector) Init(config []byte) error {
|
||||
func (m *TempCollector) Init(config json.RawMessage) error {
|
||||
// Check if already initialized
|
||||
if m.init {
|
||||
return nil
|
||||
}
|
||||
|
||||
m.name = "TempCollector"
|
||||
m.setup()
|
||||
m.init = true
|
||||
if len(config) > 0 {
|
||||
err := json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "IPMI",
|
||||
"unit": "degC",
|
||||
}
|
||||
|
||||
m.sensors = make([]*TempCollectorSensor, 0)
|
||||
|
||||
// Find all temperature sensor files
|
||||
globPattern := filepath.Join("/sys/class/hwmon", "*", "temp*_input")
|
||||
inputFiles, err := filepath.Glob(globPattern)
|
||||
if err != nil {
|
||||
return fmt.Errorf("Unable to glob files with pattern '%s': %v", globPattern, err)
|
||||
}
|
||||
if inputFiles == nil {
|
||||
return fmt.Errorf("Unable to find any files with pattern '%s'", globPattern)
|
||||
}
|
||||
|
||||
// Get sensor name for each temperature sensor file
|
||||
for _, file := range inputFiles {
|
||||
sensor := new(TempCollectorSensor)
|
||||
|
||||
// sensor name
|
||||
nameFile := filepath.Join(filepath.Dir(file), "name")
|
||||
name, err := ioutil.ReadFile(nameFile)
|
||||
if err == nil {
|
||||
sensor.name = strings.TrimSpace(string(name))
|
||||
}
|
||||
|
||||
// sensor label
|
||||
labelFile := strings.TrimSuffix(file, "_input") + "_label"
|
||||
label, err := ioutil.ReadFile(labelFile)
|
||||
if err == nil {
|
||||
sensor.label = strings.TrimSpace(string(label))
|
||||
}
|
||||
|
||||
// sensor metric name
|
||||
switch {
|
||||
case len(sensor.name) == 0 && len(sensor.label) == 0:
|
||||
continue
|
||||
case sensor.name == "coretemp" && strings.HasPrefix(sensor.label, "Core ") ||
|
||||
sensor.name == "coretemp" && strings.HasPrefix(sensor.label, "Package id "):
|
||||
sensor.metricName = "temp_" + sensor.label
|
||||
case len(sensor.name) != 0 && len(sensor.label) != 0:
|
||||
sensor.metricName = sensor.name + "_" + sensor.label
|
||||
case len(sensor.name) != 0:
|
||||
sensor.metricName = sensor.name
|
||||
case len(sensor.label) != 0:
|
||||
sensor.metricName = sensor.label
|
||||
}
|
||||
sensor.metricName = strings.ToLower(sensor.metricName)
|
||||
sensor.metricName = strings.Replace(sensor.metricName, " ", "_", -1)
|
||||
// Add temperature prefix, if required
|
||||
if !strings.Contains(sensor.metricName, "temp") {
|
||||
sensor.metricName = "temp_" + sensor.metricName
|
||||
}
|
||||
|
||||
// Sensor file
|
||||
sensor.file = file
|
||||
|
||||
// Sensor tags
|
||||
sensor.tags = map[string]string{
|
||||
"type": "node",
|
||||
}
|
||||
|
||||
// Apply tag override configuration
|
||||
for key, newtags := range m.config.TagOverride {
|
||||
if strings.Contains(sensor.file, key) {
|
||||
sensor.tags = newtags
|
||||
break
|
||||
}
|
||||
}
|
||||
|
||||
// max temperature
|
||||
if m.config.ReportMaxTemp {
|
||||
maxTempFile := strings.TrimSuffix(file, "_input") + "_max"
|
||||
if buffer, err := ioutil.ReadFile(maxTempFile); err == nil {
|
||||
if x, err := strconv.ParseInt(strings.TrimSpace(string(buffer)), 10, 64); err == nil {
|
||||
sensor.maxTempName = strings.Replace(sensor.metricName, "temp", "max_temp", 1)
|
||||
sensor.maxTemp = x / 1000
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// critical temperature
|
||||
if m.config.ReportCriticalTemp {
|
||||
criticalTempFile := strings.TrimSuffix(file, "_input") + "_crit"
|
||||
if buffer, err := ioutil.ReadFile(criticalTempFile); err == nil {
|
||||
if x, err := strconv.ParseInt(strings.TrimSpace(string(buffer)), 10, 64); err == nil {
|
||||
sensor.critTempName = strings.Replace(sensor.metricName, "temp", "crit_temp", 1)
|
||||
sensor.critTemp = x / 1000
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
m.sensors = append(m.sensors, sensor)
|
||||
}
|
||||
|
||||
// Empty sensors map
|
||||
if len(m.sensors) == 0 {
|
||||
return fmt.Errorf("No temperature sensors found")
|
||||
}
|
||||
|
||||
// Finished initialization
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func get_hwmon_sensors() (map[string]map[string]string, error) {
|
||||
var folders []string
|
||||
var sensors map[string]map[string]string
|
||||
sensors = make(map[string]map[string]string)
|
||||
err := filepath.Walk(HWMON_PATH, func(p string, info os.FileInfo, err error) error {
|
||||
if info.IsDir() {
|
||||
return nil
|
||||
}
|
||||
folders = append(folders, p)
|
||||
return nil
|
||||
})
|
||||
if err != nil {
|
||||
return sensors, err
|
||||
}
|
||||
func (m *TempCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
|
||||
for _, f := range folders {
|
||||
sensors[f] = make(map[string]string)
|
||||
myp := fmt.Sprintf("%s/", f)
|
||||
err := filepath.Walk(myp, func(path string, info os.FileInfo, err error) error {
|
||||
dir, fname := filepath.Split(path)
|
||||
if strings.Contains(fname, "temp") && strings.Contains(fname, "_input") {
|
||||
namefile := fmt.Sprintf("%s/%s", dir, strings.Replace(fname, "_input", "_label", -1))
|
||||
name, ierr := ioutil.ReadFile(namefile)
|
||||
if ierr == nil {
|
||||
sensors[f][strings.Replace(string(name), "\n", "", -1)] = path
|
||||
}
|
||||
}
|
||||
return nil
|
||||
})
|
||||
for _, sensor := range m.sensors {
|
||||
// Read sensor file
|
||||
buffer, err := ioutil.ReadFile(sensor.file)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to read file '%s': %v", sensor.file, err))
|
||||
continue
|
||||
}
|
||||
}
|
||||
return sensors, nil
|
||||
}
|
||||
x, err := strconv.ParseInt(strings.TrimSpace(string(buffer)), 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert temperature '%s' to int64: %v", buffer, err))
|
||||
continue
|
||||
}
|
||||
x /= 1000
|
||||
y, err := lp.New(
|
||||
sensor.metricName,
|
||||
sensor.tags,
|
||||
m.meta,
|
||||
map[string]interface{}{"value": x},
|
||||
time.Now(),
|
||||
)
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
func (m *TempCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
|
||||
sensors, err := get_hwmon_sensors()
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
for _, files := range sensors {
|
||||
for name, file := range files {
|
||||
tags := map[string]string{"type": "node"}
|
||||
for key, newtags := range m.config.TagOverride {
|
||||
if strings.Contains(file, key) {
|
||||
tags = newtags
|
||||
break
|
||||
}
|
||||
}
|
||||
buffer, err := ioutil.ReadFile(string(file))
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
x, err := strconv.ParseInt(strings.Replace(string(buffer), "\n", "", -1), 0, 64)
|
||||
// max temperature
|
||||
if m.config.ReportMaxTemp && sensor.maxTemp != 0 {
|
||||
y, err := lp.New(
|
||||
sensor.maxTempName,
|
||||
sensor.tags,
|
||||
m.meta,
|
||||
map[string]interface{}{"value": sensor.maxTemp},
|
||||
time.Now(),
|
||||
)
|
||||
if err == nil {
|
||||
y, err := lp.New(strings.ToLower(name), tags, map[string]interface{}{"value": float64(x) / 1000}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
}
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
// critical temperature
|
||||
if m.config.ReportCriticalTemp && sensor.critTemp != 0 {
|
||||
y, err := lp.New(
|
||||
sensor.critTempName,
|
||||
sensor.tags,
|
||||
m.meta,
|
||||
map[string]interface{}{"value": sensor.critTemp},
|
||||
time.Now(),
|
||||
)
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func (m *TempCollector) Close() {
|
||||
|
22
collectors/tempMetric.md
Normal file
@@ -0,0 +1,22 @@

## `tempstat` collector

```json
"tempstat": {
  "tag_override" : {
    "<device like hwmon1>" : {
      "type" : "socket",
      "type-id" : "0"
    }
  },
  "exclude_metrics": [
    "metric1",
    "metric2"
  ]
}
```
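
The `tag_override` lookup can be sketched as follows, modelled on the `strings.Contains(sensor.file, key)` check in `tempMetric.go`: a sensor keeps the default `type=node` tag unless an override key occurs in its file path. The function name `tagsFor` is an assumption for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// tagsFor returns the tags for a sensor file. The default is type=node;
// an override whose key occurs anywhere in the path replaces the tags.
// Hypothetical helper mirroring the substring match used by the collector.
func tagsFor(file string, overrides map[string]map[string]string) map[string]string {
	for key, newTags := range overrides {
		if strings.Contains(file, key) {
			return newTags
		}
	}
	return map[string]string{"type": "node"}
}

func main() {
	overrides := map[string]map[string]string{
		"hwmon1": {"type": "socket", "type-id": "0"},
	}
	fmt.Println(tagsFor("/sys/class/hwmon/hwmon1/temp1_input", overrides))
	fmt.Println(tagsFor("/sys/class/hwmon/hwmon2/temp1_input", overrides))
}
```

Because this is a substring match, the key `hwmon1` also matches paths under `hwmon10`. A more specific key such as `hwmon1/` avoids that on machines with many hwmon devices.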

The `tempstat` collector reads its data from `/sys/class/hwmon/<device>/tempX_{input,label}`.

Metrics:
* `temp_*`: The metric name is taken from the `label` files.
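
The naming rules can be sketched as a standalone function that follows the `Init` logic of `tempMetric.go` in this commit: it combines the device `name` file and the sensor `label` file, lowercases the result, replaces spaces with underscores, and adds a `temp_` prefix when the name does not already contain `temp`. The function name `sensorMetricName` is an assumption for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// sensorMetricName derives a metric name from the contents of the hwmon
// `name` file and the `temp*_label` file. It returns "" when both are
// empty, in which case the collector skips the sensor.
func sensorMetricName(name, label string) string {
	var metric string
	switch {
	case name == "" && label == "":
		return ""
	case name == "coretemp" &&
		(strings.HasPrefix(label, "Core ") || strings.HasPrefix(label, "Package id ")):
		metric = "temp_" + label
	case name != "" && label != "":
		metric = name + "_" + label
	case name != "":
		metric = name
	default:
		metric = label
	}
	metric = strings.ToLower(metric)
	metric = strings.ReplaceAll(metric, " ", "_")
	// Add the temperature prefix only if the name does not carry it yet.
	if !strings.Contains(metric, "temp") {
		metric = "temp_" + metric
	}
	return metric
}

func main() {
	fmt.Println(sensorMetricName("coretemp", "Core 0"))  // temp_core_0
	fmt.Println(sensorMetricName("nvme", "Composite"))   // temp_nvme_composite
	fmt.Println(sensorMetricName("acpitz", ""))          // temp_acpitz
}
```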
|
@@ -9,7 +9,7 @@ import (
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const MAX_NUM_PROCS = 10
|
||||
@@ -20,15 +20,16 @@ type TopProcsCollectorConfig struct {
|
||||
}
|
||||
|
||||
type TopProcsCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
config TopProcsCollectorConfig
|
||||
}
|
||||
|
||||
func (m *TopProcsCollector) Init(config []byte) error {
|
||||
func (m *TopProcsCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "TopProcsCollector"
|
||||
m.tags = map[string]string{"type": "node"}
|
||||
m.meta = map[string]string{"source": m.name, "group": "TopProcs"}
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
@@ -51,7 +52,7 @@ func (m *TopProcsCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *TopProcsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *TopProcsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -66,9 +67,9 @@ func (m *TopProcsCollector) Read(interval time.Duration, out *[]lp.MutableMetric
|
||||
lines := strings.Split(string(stdout), "\n")
|
||||
for i := 1; i < m.config.Num_procs+1; i++ {
|
||||
name := fmt.Sprintf("topproc%d", i)
|
||||
y, err := lp.New(name, m.tags, map[string]interface{}{"value": string(lines[i])}, time.Now())
|
||||
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": string(lines[i])}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
15
collectors/topprocsMetric.md
Normal file
@@ -0,0 +1,15 @@

## `topprocs` collector

```json
"topprocs": {
  "num_procs": 5
}
```

The `topprocs` collector reads the top X processes, sorted by CPU utilization (`ps -Ao comm --sort=-pcpu`), where X is set by `num_procs`.

In contrast to most other collectors, the metric value is a `string`.
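
A minimal sketch of the selection step, assuming the `ps` output has already been captured as a string: it skips the header line and returns at most `n` command names. The function name `topProcs` is hypothetical, and the bounds check is an addition of this sketch to cover outputs with fewer than `n` processes.

```go
package main

import (
	"fmt"
	"strings"
)

// topProcs returns up to n command names from the output of
// `ps -Ao comm --sort=-pcpu`. The first line is the column header.
func topProcs(psOutput string, n int) []string {
	lines := strings.Split(strings.TrimSpace(psOutput), "\n")
	var procs []string
	for _, line := range lines[1:] {
		if len(procs) >= n {
			break
		}
		procs = append(procs, strings.TrimSpace(line))
	}
	return procs
}

func main() {
	sample := "COMMAND\nfirefox\nbash\nsshd\n"
	for i, name := range topProcs(sample, 2) {
		// Metric names follow the topproc<i> scheme used by the collector.
		fmt.Printf("topproc%d value=%q\n", i+1, name)
	}
}
```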