Merge branch 'develop' into main

This commit is contained in:
Thomas Roehl
2022-02-21 14:32:24 +01:00
101 changed files with 8324 additions and 2641 deletions

View File

@@ -1,304 +1,59 @@
# CCMetric collectors
This folder contains the collectors for the cc-metric-collector.
# `metricCollector.go`
The base class/configuration is located in `metricCollector.go`.
# Collectors
* `memstatMetric.go`: Reads `/proc/meminfo` to calculate **node** metrics. It also combines values to the metric `mem_used`
* `loadavgMetric.go`: Reads `/proc/loadavg` and submits **node** metrics:
* `netstatMetric.go`: Reads `/proc/net/dev` and submits **node** metrics for all network devices.
* `lustreMetric.go`: Reads Lustre's stats files and submits **node** metrics:
* `infinibandMetric.go`: Reads InfiniBand metrics. It uses the `perfquery` command to read the **node** metrics but can fall back to sysfs counters in case `perfquery` does not work.
* `likwidMetric.go`: Reads hardware performance events using LIKWID. It submits **socket** and **cpu** metrics.
* `cpustatMetric.go`: Reads CPU-specific values from `/proc/stat`
* `topprocsMetric.go`: Reads the top X processes by CPU usage; X is configurable
* `nvidiaMetric.go`: Reads data about Nvidia GPUs using the NVML library
* `tempMetric.go`: Reads temperature data from `/sys/class/hwmon/hwmon*`
* `ipmiMetric.go`: Collects data from `ipmitool` or, as a fallback, from `ipmi-sensors`
* `customCmdMetric.go`: Runs commands or reads files and submits the output (which has to be in InfluxDB line protocol!)
If any of the collectors cannot be initialized, it is excluded from all further reads. For example, if the Lustre stats file is not a valid path, no Lustre-specific metrics will be recorded.
# Collector configuration
```json
"collectors": [
"tempstat"
],
"collect_config": {
"tempstat": {
"tag_override": {
"hwmon0" : {
"type" : "socket",
"type-id" : "0"
},
"hwmon1" : {
"type" : "socket",
"type-id" : "1"
}
}
}
}
```
Each collector entry in `collect_config` follows the generic pattern:
```json
{
"collector_type" : {
<collector specific configuration>
}
}
```
The configuration of the collectors in the main config file consists of two parts: the active collectors (`collectors`) and the collector configuration (`collect_config`). At startup, all collectors in the `collectors` list are initialized and, if successfully initialized, added to the active collectors for metric retrieval. At initialization, the collector-specific configuration from the `collect_config` section is handed over. Each collector has its own configuration options; check the collector-specific section.
In contrast to the configuration files for sinks and receivers, the collectors' configuration is not a list but a set of dicts. This is required because we did not manage to read the type separately before loading the remaining configuration. We are eager to change this to the same format.
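A minimal sketch of how this format is consumed, mirroring the collector manager shown later in this commit: the file decodes into a map from collector name to raw JSON, and each raw config is handed to the matching collector's `Init()` (the config snippet is made up).
```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Hypothetical config in the set-of-dicts format
	raw := []byte(`{"memstat": {"exclude_metrics": ["mem_used"]}}`)
	var config map[string]json.RawMessage
	if err := json.Unmarshal(raw, &config); err != nil {
		panic(err)
	}
	for name, collectorCfg := range config {
		// in collectorManager.go: AvailableCollectors[name].Init(collectorCfg)
		fmt.Printf("collector %s gets config %s\n", name, string(collectorCfg))
	}
}
```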
## `memstat`
```json
"memstat": {
"exclude_metrics": [
"mem_used"
]
}
```
The `memstat` collector reads data from `/proc/meminfo` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from forwarding to the sink. The derivation of `mem_used` is illustrated after the metric list below.
Metrics:
* `mem_total`
* `mem_sreclaimable`
* `mem_slab`
* `mem_free`
* `mem_buffers`
* `mem_cached`
* `mem_available`
* `mem_shared`
* `swap_total`
* `swap_free`
* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)
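For illustration, the `mem_used` derivation with made-up values (in kB):
```go
package main

import "fmt"

func main() {
	// Hypothetical values parsed from /proc/meminfo (kB)
	memTotal, memFree, memBuffers, memCached := int64(32000000), int64(12000000), int64(500000), int64(8000000)
	memUsed := memTotal - (memFree + memBuffers + memCached)
	fmt.Println("mem_used:", memUsed) // 11500000
}
```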
## `loadavg`
```json
"loadavg": {
"exclude_metrics": [
"proc_run"
]
}
```
The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from forwarding to the sink.
Metrics:
* `load_one`
* `load_five`
* `load_fifteen`
* `proc_run`
* `proc_total`
## `netstat`
```json
"netstat": {
"exclude_devices": [
"lo"
]
}
```
The `netstat` collector reads data from `/proc/net/dev` and outputs a handful of **node** metrics. If a device is not required, it can be excluded from forwarding to the sink. Commonly the `lo` device should be excluded.
Metrics:
* `bytes_in`
* `bytes_out`
* `pkts_in`
* `pkts_out`
The device name is added as tag `device`.
## `diskstat`
```json
"diskstat": {
"exclude_metrics": [
"read_ms"
]
}
```
The `diskstat` collector reads data from `/proc/diskstats` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from forwarding to the sink.
Metrics:
* `reads`
* `reads_merged`
* `read_sectors`
* `read_ms`
* `writes`
* `writes_merged`
* `writes_sectors`
* `writes_ms`
* `ioops`
* `ioops_ms`
* `ioops_weighted_ms`
* `discards`
* `discards_merged`
* `discards_sectors`
* `discards_ms`
* `flushes`
* `flushes_ms`
The device name is added as tag `device`.
## `cpustat`
```json
"netstat": {
"exclude_metrics": [
"cpu_idle"
]
}
```
The `cpustat` collector reads data from `/proc/stat` and outputs a handful of **node** and **hwthread** metrics. If a metric is not required, it can be excluded from forwarding to the sink. A small parsing sketch follows the metric list below.
Metrics:
* `cpu_user`
* `cpu_nice`
* `cpu_system`
* `cpu_idle`
* `cpu_iowait`
* `cpu_irq`
* `cpu_softirq`
* `cpu_steal`
* `cpu_guest`
* `cpu_guest_nice`
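As a rough sketch of the underlying parsing (made-up counter values; the collector code in this commit reports each counter as a percentage of the line total):
```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	// Made-up /proc/stat line for one hwthread
	line := "cpu0 4705 150 1120 16250 520 30 45 0 0 0"
	names := []string{"cpu_user", "cpu_nice", "cpu_system", "cpu_idle",
		"cpu_iowait", "cpu_irq", "cpu_softirq", "cpu_steal",
		"cpu_guest", "cpu_guest_nice"}
	fields := strings.Fields(line)
	values := make(map[string]float64)
	total := 0.0
	for i, name := range names {
		if x, err := strconv.ParseFloat(fields[i+1], 64); err == nil {
			values[name] = x
			total += x
		}
	}
	// Report each counter as a percentage of the line total
	for name, v := range values {
		fmt.Printf("%s %.2f%%\n", name, v*100.0/total)
	}
}
```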
## `likwid`
```json
"likwid": {
"eventsets": [
{
"events": {
"FIXC1": "ACTUAL_CPU_CLOCK",
"FIXC2": "MAX_CPU_CLOCK",
"PMC0": "RETIRED_INSTRUCTIONS",
"PMC1": "CPU_CLOCKS_UNHALTED",
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
"PMC3": "MERGE",
"DFC0": "DRAM_CHANNEL_0",
"DFC1": "DRAM_CHANNEL_1",
"DFC2": "DRAM_CHANNEL_2",
"DFC3": "DRAM_CHANNEL_3"
},
"metrics": [
{
"name": "ipc",
"calc": "PMC0/PMC1",
"socket_scope": false,
"publish": true
},
{
"name": "flops_any",
"calc": "0.000001*PMC2/time",
"socket_scope": false,
"publish": true
},
{
"name": "clock_mhz",
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
"socket_scope": false,
"publish": true
},
{
"name": "mem1",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"socket_scope": true,
"publish": false
}
]
},
{
"events": {
"DFC0": "DRAM_CHANNEL_4",
"DFC1": "DRAM_CHANNEL_5",
"DFC2": "DRAM_CHANNEL_6",
"DFC3": "DRAM_CHANNEL_7",
"PWR0": "RAPL_CORE_ENERGY",
"PWR1": "RAPL_PKG_ENERGY"
},
"metrics": [
{
"name": "pwr_core",
"calc": "PWR0/time",
"socket_scope": false,
"publish": true
},
{
"name": "pwr_pkg",
"calc": "PWR1/time",
"socket_scope": true,
"publish": true
},
{
"name": "mem2",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"socket_scope": true,
"publish": false
}
]
}
],
"globalmetrics": [
{
"name": "mem_bw",
"calc": "mem1+mem2",
"socket_scope": true,
"publish": true
}
]
}
```
_Example config suitable for AMD Zen3_
The `likwid` collector reads hardware performance counters on the **hwthread** and **socket** level. The configuration looks quite complicated, but it is basically copy&paste from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). The collector went through multiple iterations; earlier versions tried to use the performance groups directly but lacked flexibility. The current way of configuration provides the most flexibility.
The logic is as follows: there are multiple eventsets, each consisting of a list of counters plus events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
```
EVENTSET -> "events": {
FIXC1 ACTUAL_CPU_CLOCK -> "FIXC1": "ACTUAL_CPU_CLOCK",
FIXC2 MAX_CPU_CLOCK -> "FIXC2": "MAX_CPU_CLOCK",
PMC0 RETIRED_INSTRUCTIONS -> "PMC0" : "RETIRED_INSTRUCTIONS",
PMC1 CPU_CLOCKS_UNHALTED -> "PMC1" : "CPU_CLOCKS_UNHALTED",
PMC2 RETIRED_SSE_AVX_FLOPS_ALL -> "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
PMC3 MERGE -> "PMC3": "MERGE",
-> }
```
The metrics follow the same scheme:
```
METRICS -> "metrics": [
IPC PMC0/PMC1 -> {
-> "name" : "IPC",
-> "calc" : "PMC0/PMC1",
-> "socket_scope": false,
-> "publish": true
-> }
-> ]
```
The `socket_scope` option selects whether the metric is submitted per socket or per hwthread. If a metric is only used for internal calculations, you can set `publish = false`.
Since some metrics can only be gathered in multiple measurements (like the memory bandwidth on AMD Zen3 chips), configure multiple eventsets like in the example config and use the `globalmetrics` section to combine them. **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.
# Available collectors
* [`cpustat`](./cpustatMetric.md)
* [`memstat`](./memstatMetric.md)
* [`iostat`](./iostatMetric.md)
* [`diskstat`](./diskstatMetric.md)
* [`loadavg`](./loadavgMetric.md)
* [`netstat`](./netstatMetric.md)
* [`ibstat`](./infinibandMetric.md)
* [`ibstat_perfquery`](./infinibandPerfQueryMetric.md)
* [`tempstat`](./tempMetric.md)
* [`lustrestat`](./lustreMetric.md)
* [`likwid`](./likwidMetric.md)
* [`nvidia`](./nvidiaMetric.md)
* [`customcmd`](./customCmdMetric.md)
* [`ipmistat`](./ipmiMetric.md)
* [`topprocs`](./topprocsMetric.md)
* [`nfs3stat`](./nfs3Metric.md)
* [`nfs4stat`](./nfs4Metric.md)
* [`cpufreq`](./cpufreqMetric.md)
* [`cpufreq_cpuinfo`](./cpufreqCpuinfoMetric.md)
* [`numastats`](./numastatMetric.md)
* [`gpfs`](./gpfsMetric.md)
## Todos
* [ ] Exclude devices for `diskstat` collector
* [ ] Aggregate metrics to a higher topology entity (sum hwthread metrics to a socket metric, ...). Needs to be configurable
# Contributing own collectors
A collector reads data from any source, parses it into metrics and submits these metrics to the metric collector. A collector provides the following functions:
* `Name() string`: Return the name of the collector
* `Init(config json.RawMessage) error`: Initializes the collector using the given collector-specific config in JSON. Check whether needed files/commands exist, ...
* `Initialized() bool`: Check if a collector is successfully initialized
* `Read(duration time.Duration, output chan ccMetric.CCMetric)`: Read, parse and submit data to the `output` channel as [`CCMetric`](../internal/ccMetric/README.md). If the collector has to measure anything for some duration, use the provided function argument `duration`.
* `Close()`: Closes down the collector.
It is recommended to call `setup()` in the `Init()` function.
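Taken together, a collector satisfies an interface along these lines (a sketch derived from the list above; the authoritative definition lives in `metricCollector.go`):
```go
package collectors

import (
	"encoding/json"
	"time"

	lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)

// Sketch of the collector contract implied by the functions above
type MetricCollector interface {
	Name() string                                         // name of the collector
	Init(config json.RawMessage) error                    // initialize from collector-specific JSON config
	Initialized() bool                                    // true after a successful Init()
	Read(duration time.Duration, output chan lp.CCMetric) // read, parse and submit metrics
	Close()                                               // shut the collector down
}
```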
Finally, the collector needs to be registered in `collectorManager.go`. There is a list of collectors called `AvailableCollectors` which is a map (`collector_type_string` -> pointer to a `MetricCollector` interface implementation). Add a new entry with a descriptive name and the new collector.
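For illustration, registering the sample collector from the next section would add one entry to that map (the `"sample"` key is a made-up name):
```go
var AvailableCollectors = map[string]MetricCollector{
	// ... existing entries ...
	"sample": new(SampleCollector),
}
```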
## Sample collector
@@ -307,8 +62,9 @@ package collectors
import (
"encoding/json"
"fmt"
"time"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
// Struct for the collector-specific JSON config
@@ -317,11 +73,11 @@ type SampleCollectorConfig struct {
}
type SampleCollector struct {
metricCollector
config SampleCollectorConfig
}
func (m *SampleCollector) Init(config json.RawMessage) error {
// Check if already initialized
if m.init {
return nil
@@ -335,21 +91,28 @@ func (m *SampleCollector) Init(config []byte) error {
return err
}
}
m.meta = map[string]string{"source": m.name, "group": "Sample"}
m.init = true
return nil
}
func (m *SampleCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
// tags for the metric, if type != node use proper type and type-id
tags := map[string]string{"type" : "node"}
x, err := GetMetric()
if err != nil {
cclog.ComponentError(m.name, fmt.Sprintf("Read(): %v", err))
}
// Each metric has exactly one field: value !
value := map[string]interface{}{"value": int(x)}
y, err := lp.New("sample_metric", tags, value, time.Now())
if err == nil {
*out = append(*out, y)
value := map[string]interface{}{"value": int64(x)}
if y, err := lp.New("sample_metric", tags, m.meta, value, time.Now()); err == nil {
output <- y
}
}
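The sample omits `Close()`; a minimal sketch, assuming nothing needs to be torn down besides the init flag (the pattern most collectors in this folder follow):
```go
func (m *SampleCollector) Close() {
	// Mark the collector as uninitialized so Read() becomes a no-op
	m.init = false
}
```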

View File

@@ -0,0 +1,173 @@
package collectors
import (
"encoding/json"
"os"
"sync"
"time"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
mct "github.com/ClusterCockpit/cc-metric-collector/internal/multiChanTicker"
)
// Map of all available metric collectors
var AvailableCollectors = map[string]MetricCollector{
"likwid": new(LikwidCollector),
"loadavg": new(LoadavgCollector),
"memstat": new(MemstatCollector),
"netstat": new(NetstatCollector),
"ibstat": new(InfinibandCollector),
"ibstat_perfquery": new(InfinibandPerfQueryCollector),
"lustrestat": new(LustreCollector),
"cpustat": new(CpustatCollector),
"topprocs": new(TopProcsCollector),
"nvidia": new(NvidiaCollector),
"customcmd": new(CustomCmdCollector),
"iostat": new(IOstatCollector),
"diskstat": new(DiskstatCollector),
"tempstat": new(TempCollector),
"ipmistat": new(IpmiCollector),
"gpfs": new(GpfsCollector),
"cpufreq": new(CPUFreqCollector),
"cpufreq_cpuinfo": new(CPUFreqCpuInfoCollector),
"nfs3stat": new(Nfs3Collector),
"nfs4stat": new(Nfs4Collector),
"numastats": new(NUMAStatsCollector),
}
// Metric collector manager data structure
type collectorManager struct {
collectors []MetricCollector // List of metric collectors to use
output chan lp.CCMetric // Output channels
done chan bool // channel to finish / stop metric collector manager
ticker mct.MultiChanTicker // periodically ticking once each interval
duration time.Duration // duration (for metrics that measure over a given duration)
wg *sync.WaitGroup // wait group for all goroutines in cc-metric-collector
config map[string]json.RawMessage // json encoded config for collector manager
}
// Metric collector manager access functions
type CollectorManager interface {
Init(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) error
AddOutput(output chan lp.CCMetric)
Start()
Close()
}
// Init initializes a new metric collector manager by setting up:
// * output channel
// * done channel
// * wait group synchronization for goroutines (from variable wg)
// * ticker (from variable ticker)
// * configuration (read from config file in variable collectConfigFile)
// Initialization is done for all configured collectors
func (cm *collectorManager) Init(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) error {
cm.collectors = make([]MetricCollector, 0)
cm.output = nil
cm.done = make(chan bool)
cm.wg = wg
cm.ticker = ticker
cm.duration = duration
// Read collector config file
configFile, err := os.Open(collectConfigFile)
if err != nil {
cclog.Error(err.Error())
return err
}
defer configFile.Close()
jsonParser := json.NewDecoder(configFile)
err = jsonParser.Decode(&cm.config)
if err != nil {
cclog.Error(err.Error())
return err
}
// Initialize configured collectors
for collectorName, collectorCfg := range cm.config {
if _, found := AvailableCollectors[collectorName]; !found {
cclog.ComponentError("CollectorManager", "SKIP unknown collector", collectorName)
continue
}
collector := AvailableCollectors[collectorName]
err = collector.Init(collectorCfg)
if err != nil {
cclog.ComponentError("CollectorManager", "Collector", collectorName, "initialization failed:", err.Error())
continue
}
cclog.ComponentDebug("CollectorManager", "ADD COLLECTOR", collector.Name())
cm.collectors = append(cm.collectors, collector)
}
return nil
}
// Start starts the metric collector manager
func (cm *collectorManager) Start() {
tick := make(chan time.Time)
cm.ticker.AddChannel(tick)
cm.wg.Add(1)
go func() {
defer cm.wg.Done()
// Collector manager is done
done := func() {
// close all metric collectors
for _, c := range cm.collectors {
c.Close()
}
close(cm.done)
cclog.ComponentDebug("CollectorManager", "DONE")
}
// Wait for done signal or timer event
for {
select {
case <-cm.done:
done()
return
case t := <-tick:
for _, c := range cm.collectors {
// Wait for done signal or execute the collector
select {
case <-cm.done:
done()
return
default:
// Read metrics from collector c
cclog.ComponentDebug("CollectorManager", c.Name(), t)
c.Read(cm.duration, cm.output)
}
}
}
}
}()
// Collector manager is started
cclog.ComponentDebug("CollectorManager", "STARTED")
}
// AddOutput adds the output channel to the metric collector manager
func (cm *collectorManager) AddOutput(output chan lp.CCMetric) {
cm.output = output
}
// Close finishes / stops the metric collector manager
func (cm *collectorManager) Close() {
cclog.ComponentDebug("CollectorManager", "CLOSE")
cm.done <- true
// wait for close of channel cm.done
<-cm.done
}
// New creates a new initialized metric collector manager
func New(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) (CollectorManager, error) {
cm := new(collectorManager)
err := cm.Init(ticker, duration, wg, collectConfigFile)
if err != nil {
return nil, err
}
return cm, err
}
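A minimal usage sketch of this manager (the `mct.NewTicker` constructor name and the channel buffer size are assumptions; a `collectors.json` must exist):
```go
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/ClusterCockpit/cc-metric-collector/collectors"
	lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
	mct "github.com/ClusterCockpit/cc-metric-collector/internal/multiChanTicker"
)

func main() {
	var wg sync.WaitGroup
	ticker := mct.NewTicker(10 * time.Second) // assumed constructor
	output := make(chan lp.CCMetric, 100)

	manager, err := collectors.New(ticker, time.Second, &wg, "collectors.json")
	if err != nil {
		panic(err)
	}
	manager.AddOutput(output)
	manager.Start()

	// Consume a few metrics, then shut everything down
	for i := 0; i < 3; i++ {
		fmt.Println((<-output).Name())
	}
	manager.Close()
	wg.Wait()
}
```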

View File

@@ -2,14 +2,16 @@ package collectors
import (
"bufio"
"encoding/json"
"fmt"
"log"
"os"
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
//
@@ -21,41 +23,55 @@ import (
type CPUFreqCpuInfoCollectorTopology struct {
processor string // logical processor number (continuous, starting at 0)
coreID string // socket local core ID
coreID_int int64
physicalPackageID string // socket / package ID
physicalPackageID_int int64
numPhysicalPackages string // number of sockets / packages
numPhysicalPackages_int int64
isHT bool
numNonHT string // number of non hyperthreading processors
numNonHT_int int64
tagSet map[string]string
}
type CPUFreqCpuInfoCollector struct {
metricCollector
topology []*CPUFreqCpuInfoCollectorTopology
}
func (m *CPUFreqCpuInfoCollector) Init(config json.RawMessage) error {
// Check if already initialized
if m.init {
return nil
}
m.setup()
m.name = "CPUFreqCpuInfoCollector"
m.meta = map[string]string{
"source": m.name,
"group": "CPU",
"unit": "MHz",
}
const cpuInfoFile = "/proc/cpuinfo"
file, err := os.Open(cpuInfoFile)
if err != nil {
return fmt.Errorf("Failed to open '%s': %v", cpuInfoFile, err)
return fmt.Errorf("Failed to open file '%s': %v", cpuInfoFile, err)
}
defer file.Close()
// Collect topology information from file cpuinfo
foundFreq := false
processor := ""
var numNonHT_int int64 = 0
coreID := ""
physicalPackageID := ""
var maxPhysicalPackageID int64 = 0
m.topology = make([]*CPUFreqCpuInfoCollectorTopology, 0)
coreSeenBefore := make(map[string]bool)
// Read cpuinfo file, line by line
scanner := bufio.NewScanner(file)
for scanner.Scan() {
lineSplit := strings.Split(scanner.Text(), ":")
@@ -81,39 +97,41 @@ func (m *CPUFreqCpuInfoCollector) Init(config []byte) error {
len(coreID) > 0 &&
len(physicalPackageID) > 0 {
topology := new(CPUFreqCpuInfoCollectorTopology)
// Processor
topology.processor = processor
// Core ID
topology.coreID = coreID
topology.coreID_int, err = strconv.ParseInt(coreID, 10, 64)
if err != nil {
return fmt.Errorf("Unable to convert coreID to int: %v", err)
return fmt.Errorf("Unable to convert coreID '%s' to int64: %v", coreID, err)
}
// Physical package ID
topology.physicalPackageID = physicalPackageID
topology.physicalPackageID_int, err = strconv.ParseInt(physicalPackageID, 10, 64)
if err != nil {
return fmt.Errorf("Unable to convert physicalPackageID to int: %v", err)
return fmt.Errorf("Unable to convert physicalPackageID '%s' to int64: %v", physicalPackageID, err)
}
// increase maximum socket / package ID, when required
if topology.physicalPackageID_int > maxPhysicalPackageID {
maxPhysicalPackageID = topology.physicalPackageID_int
}
// is hyperthread?
globalID := physicalPackageID + ":" + coreID
topology.isHT = coreSeenBefore[globalID]
coreSeenBefore[globalID] = true
if !topology.isHT {
// increase number of non hyper thread cores
numNonHT_int++
}
// store collected topology information
m.topology = append(m.topology, topology)
// reset topology information
foundFreq = false
@@ -126,18 +144,15 @@ func (m *CPUFreqCpuInfoCollector) Init(config []byte) error {
numPhysicalPackageID_int := maxPhysicalPackageID + 1
numPhysicalPackageID := fmt.Sprint(numPhysicalPackageID_int)
numNonHT := fmt.Sprint(numNonHT_int)
for _, t := range m.topology {
t.numPhysicalPackages = numPhysicalPackageID
t.numPhysicalPackages_int = numPhysicalPackageID_int
t.numNonHT = numNonHT
t.numNonHT_int = numNonHT_int
t.tagSet = map[string]string{
"type": "cpu",
"type-id": t.processor,
"num_core": t.numNonHT,
"package_id": t.physicalPackageID,
"num_package": t.numPhysicalPackages,
"type": "cpu",
"type-id": t.processor,
"package_id": t.physicalPackageID,
}
}
@@ -145,14 +160,18 @@ func (m *CPUFreqCpuInfoCollector) Init(config []byte) error {
return nil
}
func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, output chan lp.CCMetric) {
// Check if already initialized
if !m.init {
return
}
const cpuInfoFile = "/proc/cpuinfo"
file, err := os.Open(cpuInfoFile)
if err != nil {
log.Printf("Failed to open '%s': %v", cpuInfoFile, err)
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to open file '%s': %v", cpuInfoFile, err))
return
}
defer file.Close()
@@ -167,16 +186,17 @@ func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, out *[]lp.Mutable
// frequency
if key == "cpu MHz" {
t := &m.topology[processorCounter]
t := m.topology[processorCounter]
if !t.isHT {
value, err := strconv.ParseFloat(strings.TrimSpace(lineSplit[1]), 64)
if err != nil {
log.Printf("Failed to convert cpu MHz to float: %v", err)
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert cpu MHz '%s' to float64: %v", lineSplit[1], err))
return
}
y, err := lp.New("cpufreq", t.tagSet, map[string]interface{}{"value": value}, now)
if err == nil {
*out = append(*out, y)
if y, err := lp.New("cpufreq", t.tagSet, m.meta, map[string]interface{}{"value": value}, now); err == nil {
output <- y
}
}
processorCounter++

View File

@@ -0,0 +1,10 @@
## `cpufreq_cpuinfo` collector
```json
"cpufreq_cpuinfo": {}
```
The `cpufreq_cpuinfo` collector reads the clock frequency from `/proc/cpuinfo` and outputs a handful of **cpu** metrics.
Metrics:
* `cpufreq`

View File

@@ -1,48 +1,30 @@
package collectors
import (
"bufio"
"encoding/json"
"fmt"
"log"
"os"
"io/ioutil"
"path/filepath"
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
"golang.org/x/sys/unix"
)
type CPUFreqCollectorTopology struct {
processor string // logical processor number (continuous, starting at 0)
coreID string // socket local core ID
coreID_int int64
physicalPackageID string // socket / package ID
physicalPackageID_int int64
numPhysicalPackages string // number of sockets / packages
numPhysicalPackages_int int64
isHT bool
numNonHT string // number of non hyperthreading processors
numNonHT_int int64
scalingCurFreqFile string
tagSet map[string]string
}
@@ -56,14 +38,19 @@ type CPUFreqCollectorTopology struct {
// See: https://www.kernel.org/doc/html/latest/admin-guide/pm/cpufreq.html
//
type CPUFreqCollector struct {
metricCollector
topology []CPUFreqCollectorTopology
config struct {
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
}
}
func (m *CPUFreqCollector) Init(config json.RawMessage) error {
// Check if already initialized
if m.init {
return nil
}
m.name = "CPUFreqCollector"
m.setup()
if len(config) > 0 {
@@ -72,54 +59,61 @@ func (m *CPUFreqCollector) Init(config []byte) error {
return err
}
}
m.meta = map[string]string{
"source": m.name,
"group": "CPU",
"unit": "MHz",
}
// Loop for all CPU directories
baseDir := "/sys/devices/system/cpu"
globPattern := filepath.Join(baseDir, "cpu[0-9]*")
cpuDirs, err := filepath.Glob(globPattern)
if err != nil {
return fmt.Errorf("CPUFreqCollector.Init() unable to glob files with pattern %s: %v", globPattern, err)
return fmt.Errorf("Unable to glob files with pattern '%s': %v", globPattern, err)
}
if cpuDirs == nil {
return fmt.Errorf("CPUFreqCollector.Init() unable to find any files with pattern %s", globPattern)
return fmt.Errorf("Unable to find any files with pattern '%s'", globPattern)
}
// Initialize CPU topology
m.topology = make([]CPUFreqCollectorTopology, len(cpuDirs))
for _, cpuDir := range cpuDirs {
processor := strings.TrimPrefix(cpuDir, "/sys/devices/system/cpu/cpu")
processor_int, err := strconv.ParseInt(processor, 10, 64)
if err != nil {
return fmt.Errorf("CPUFreqCollector.Init() unable to convert cpuID to int: %v", err)
return fmt.Errorf("Unable to convert cpuID '%s' to int64: %v", processor, err)
}
// Read package ID
physicalPackageIDFile := filepath.Join(cpuDir, "topology", "physical_package_id")
line, err := ioutil.ReadFile(physicalPackageIDFile)
if err != nil {
return fmt.Errorf("CPUFreqCollector.Init() unable to convert packageID to int: %v", err)
return fmt.Errorf("Unable to read physical package ID from file '%s': %v", physicalPackageIDFile, err)
}
physicalPackageID := strings.TrimSpace(string(line))
physicalPackageID_int, err := strconv.ParseInt(physicalPackageID, 10, 64)
if err != nil {
return fmt.Errorf("Unable to convert packageID '%s' to int64: %v", physicalPackageID, err)
}
// Read core ID
coreIDFile := filepath.Join(cpuDir, "topology", "core_id")
line, err = ioutil.ReadFile(coreIDFile)
if err != nil {
return fmt.Errorf("CPUFreqCollector.Init() unable to convert coreID to int: %v", err)
return fmt.Errorf("Unable to read core ID from file '%s': %v", coreIDFile, err)
}
coreID := strings.TrimSpace(string(line))
coreID_int, err := strconv.ParseInt(coreID, 10, 64)
if err != nil {
return fmt.Errorf("Unable to convert coreID '%s' to int64: %v", coreID, err)
}
// Check access to current frequency file
scalingCurFreqFile := filepath.Join(cpuDir, "cpufreq", "scaling_cur_freq")
err = unix.Access(scalingCurFreqFile, unix.R_OK)
if err != nil {
return fmt.Errorf("CPUFreqCollector.Init() unable to access %s: %v", scalingCurFreqFile, err)
return fmt.Errorf("Unable to access file '%s': %v", scalingCurFreqFile, err)
}
t := &m.topology[processor_int]
@@ -142,8 +136,8 @@ func (m *CPUFreqCollector) Init(config []byte) error {
}
// number of non hyper thread cores and packages / sockets
var numNonHT_int int64 = 0
var maxPhysicalPackageID int64 = 0
for i := range m.topology {
t := &m.topology[i]
@@ -167,11 +161,9 @@ func (m *CPUFreqCollector) Init(config []byte) error {
t.numNonHT = numNonHT
t.numNonHT_int = numNonHT_int
t.tagSet = map[string]string{
"type": "cpu",
"type-id": t.processor,
"num_core": t.numNonHT,
"package_id": t.physicalPackageID,
"num_package": t.numPhysicalPackages,
"type": "cpu",
"type-id": t.processor,
"package_id": t.physicalPackageID,
}
}
@@ -179,7 +171,8 @@ func (m *CPUFreqCollector) Init(config []byte) error {
return nil
}
func (m *CPUFreqCollector) Read(interval time.Duration, output chan lp.CCMetric) {
// Check if already initialized
if !m.init {
return
}
@@ -194,20 +187,23 @@ func (m *CPUFreqCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
}
// Read current frequency
line, err := ioutil.ReadFile(t.scalingCurFreqFile)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to read file '%s': %v", t.scalingCurFreqFile, err))
continue
}
cpuFreq, err := strconv.ParseInt(strings.TrimSpace(string(line)), 10, 64)
if err != nil {
log.Printf("CPUFreqCollector.Read(): Failed to convert CPU frequency '%s': %v", line, err)
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert CPU frequency '%s' to int64: %v", line, err))
continue
}
y, err := lp.New("cpufreq", t.tagSet, map[string]interface{}{"value": cpuFreq}, now)
if err == nil {
*out = append(*out, y)
if y, err := lp.New("cpufreq", t.tagSet, m.meta, map[string]interface{}{"value": cpuFreq}, now); err == nil {
output <- y
}
}
}

View File

@@ -0,0 +1,11 @@
## `cpufreq` collector
```json
"cpufreq": {
"exclude_metrics": []
}
```
The `cpufreq` collector reads the clock frequency from `/sys/devices/system/cpu/cpu*/cpufreq` and outputs a handful of **cpu** metrics.
Metrics:
* `cpufreq`

View File

@@ -1,14 +1,16 @@
package collectors
import (
"bufio"
"encoding/json"
"fmt"
"io/ioutil"
"os"
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const CPUSTATFILE = `/proc/stat`
@@ -18,72 +20,129 @@ type CpustatCollectorConfig struct {
}
type CpustatCollector struct {
metricCollector
config CpustatCollectorConfig
matches map[string]int
cputags map[string]map[string]string
nodetags map[string]string
num_cpus_metric lp.CCMetric
}
func (m *CpustatCollector) Init(config json.RawMessage) error {
m.name = "CpustatCollector"
m.setup()
m.meta = map[string]string{"source": m.name, "group": "CPU", "unit": "Percent"}
m.nodetags = map[string]string{"type": "node"}
if len(config) > 0 {
err := json.Unmarshal(config, &m.config)
if err != nil {
return err
}
}
matches := map[string]int{
"cpu_user": 1,
"cpu_nice": 2,
"cpu_system": 3,
"cpu_idle": 4,
"cpu_iowait": 5,
"cpu_irq": 6,
"cpu_softirq": 7,
"cpu_steal": 8,
"cpu_guest": 9,
"cpu_guest_nice": 10,
}
m.matches = make(map[string]int)
for match, index := range matches {
doExclude := false
for _, exclude := range m.config.ExcludeMetrics {
if match == exclude {
doExclude = true
break
}
}
if !doExclude {
m.matches[match] = index
}
}
// Check input file
file, err := os.Open(string(CPUSTATFILE))
if err != nil {
cclog.ComponentError(m.name, err.Error())
return err
}
defer file.Close()
// Pre-generate tags for all CPUs
num_cpus := 0
m.cputags = make(map[string]map[string]string)
scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Text()
linefields := strings.Fields(line)
if strings.HasPrefix(linefields[0], "cpu") && strings.Compare(linefields[0], "cpu") != 0 {
cpustr := strings.TrimLeft(linefields[0], "cpu")
cpu, _ := strconv.Atoi(cpustr)
m.cputags[linefields[0]] = map[string]string{"type": "cpu", "type-id": fmt.Sprintf("%d", cpu)}
num_cpus++
}
}
m.init = true
return nil
}
func (m *CpustatCollector) parseStatLine(linefields []string, tags map[string]string, output chan lp.CCMetric) {
values := make(map[string]float64)
total := 0.0
for match, index := range m.matches {
if len(match) > 0 {
x, err := strconv.ParseInt(linefields[index], 0, 64)
if err == nil {
values[match] = float64(x)
total += values[match]
}
}
}
t := time.Now()
for name, value := range values {
y, err := lp.New(name, tags, m.meta, map[string]interface{}{"value": (value * 100.0) / total}, t)
if err == nil {
output <- y
}
}
}
func (m *CpustatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
num_cpus := 0
file, err := os.Open(string(CPUSTATFILE))
if err != nil {
cclog.ComponentError(m.name, err.Error())
return
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Text()
linefields := strings.Fields(line)
if strings.Compare(linefields[0], "cpu") == 0 {
m.parseStatLine(linefields, m.nodetags, output)
} else if strings.HasPrefix(linefields[0], "cpu") {
m.parseStatLine(linefields, m.cputags[linefields[0]], output)
num_cpus++
}
}
num_cpus_metric, err := lp.New("num_cpus",
m.nodetags,
m.meta,
map[string]interface{}{"value": int(num_cpus)},
time.Now(),
)
if err == nil {
output <- num_cpus_metric
}
}

View File

@@ -0,0 +1,23 @@
## `cpustat` collector
```json
"cpustat": {
"exclude_metrics": [
"cpu_idle"
]
}
```
The `cpustat` collector reads data from `/proc/stat` and outputs a handful of **node** and **hwthread** metrics. The values are reported as percentages of the total CPU time. If a metric is not required, it can be excluded from forwarding to the sink.
Metrics:
* `cpu_user`
* `cpu_nice`
* `cpu_system`
* `cpu_idle`
* `cpu_iowait`
* `cpu_irq`
* `cpu_softirq`
* `cpu_steal`
* `cpu_guest`
* `cpu_guest_nice`

View File

@@ -9,7 +9,13 @@ import (
"strings"
"time"
<<<<<<< HEAD
lp "github.com/influxdata/line-protocol"
=======
ccmetric "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
influx "github.com/influxdata/line-protocol"
>>>>>>> develop
)
const CUSTOMCMDPATH = `/home/unrz139/Work/cc-metric-collector/collectors/custom`
@@ -21,17 +27,18 @@ type CustomCmdCollectorConfig struct {
}
type CustomCmdCollector struct {
metricCollector
handler *influx.MetricHandler
parser *influx.Parser
config CustomCmdCollectorConfig
commands []string
files []string
}
func (m *CustomCmdCollector) Init(config json.RawMessage) error {
var err error
m.name = "CustomCmdCollector"
m.meta = map[string]string{"source": m.name, "group": "Custom"}
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
@@ -61,8 +68,8 @@ func (m *CustomCmdCollector) Init(config []byte) error {
if len(m.files) == 0 && len(m.commands) == 0 {
return errors.New("No metrics to collect")
}
m.handler = influx.NewMetricHandler()
m.parser = influx.NewParser(m.handler)
m.parser.SetTimeFunc(DefaultTime)
m.init = true
return nil
@@ -72,7 +79,7 @@ var DefaultTime = func() time.Time {
return time.Unix(42, 0)
}
func (m *CustomCmdCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -95,9 +102,10 @@ func (m *CustomCmdCollector) Read(interval time.Duration, out *[]lp.MutableMetri
if skip {
continue
}
y := ccmetric.FromInfluxMetric(c)
if err == nil {
*out = append(*out, y)
output <- y
}
}
}
@@ -117,9 +125,9 @@ func (m *CustomCmdCollector) Read(interval time.Duration, out *[]lp.MutableMetri
if skip {
continue
}
y := ccmetric.FromInfluxMetric(f)
if err == nil {
*out = append(*out, y)
output <- y
}
}
}

View File

@@ -0,0 +1,20 @@
## `customcmd` collector
```json
"customcmd": {
"exclude_metrics": [
"mymetric"
],
"files" : [
"/var/run/myapp.metrics"
],
"commands" : [
"/usr/local/bin/getmetrics.pl"
]
}
```
The `customcmd` collector reads data from files and from the output of executed commands. The files and commands can output multiple metrics (separated by newlines), but they have to be in the [InfluxDB line protocol](https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/). If a metric is not parsable, it is skipped. If a metric is not required, it can be excluded from forwarding to the sink.
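For illustration, a file or command could emit lines like these (hypothetical metric names; measurement, tags, a single `value` field and an optional timestamp):
```
myapp_bandwidth,type=node,device=eth0 value=12.5 1660000000000000000
myapp_requests,type=node value=42i 1660000000000000000
```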

View File

@@ -1,113 +1,111 @@
package collectors
import (
"io/ioutil"
lp "github.com/influxdata/line-protocol"
// "log"
"bufio"
"encoding/json"
"errors"
"strconv"
"fmt"
"os"
"strings"
"syscall"
"time"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const MOUNTFILE = `/proc/self/mounts`
type DiskstatCollectorConfig struct {
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
}
type DiskstatCollector struct {
metricCollector
//matches map[string]int
config IOstatCollectorConfig
//devices map[string]IOstatCollectorEntry
}
func (m *DiskstatCollector) Init(config json.RawMessage) error {
m.name = "DiskstatCollector"
m.meta = map[string]string{"source": m.name, "group": "Disk"}
m.setup()
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
err := json.Unmarshal(config, &m.config)
if err != nil {
return err
}
}
file, err := os.Open(string(MOUNTFILE))
if err != nil {
cclog.ComponentError(m.name, err.Error())
return err
}
defer file.Close()
m.init = true
return nil
}
func (m *DiskstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
file, err := os.Open(string(MOUNTFILE))
if err != nil {
cclog.ComponentError(m.name, err.Error())
return
}
defer file.Close()
part_max_used := uint64(0)
scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Text()
if len(line) == 0 {
continue
}
if !strings.HasPrefix(line, "/dev") {
continue
}
linefields := strings.Fields(line)
if strings.Contains(linefields[0], "loop") {
continue
}
if strings.Contains(linefields[1], "boot") {
continue
}
path := strings.Replace(linefields[1], `\040`, " ", -1)
stat := syscall.Statfs_t{}
err := syscall.Statfs(path, &stat)
if err != nil {
fmt.Println(err.Error())
return
}
tags := map[string]string{"type": "node", "device": linefields[0]}
total := (stat.Blocks * uint64(stat.Bsize)) / uint64(1000000000)
y, err := lp.New("disk_total", tags, m.meta, map[string]interface{}{"value": total}, time.Now())
if err == nil {
y.AddMeta("unit", "GBytes")
output <- y
}
free := (stat.Bfree * uint64(stat.Bsize)) / uint64(1000000000)
y, err = lp.New("disk_free", tags, m.meta, map[string]interface{}{"value": free}, time.Now())
if err == nil {
y.AddMeta("unit", "GBytes")
output <- y
}
perc := (100 * (total - free)) / total
if perc > part_max_used {
part_max_used = perc
}
}
y, err := lp.New("part_max_used", map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": part_max_used}, time.Now())
if err == nil {
y.AddMeta("unit", "percent")
output <- y
}
}

View File

@@ -0,0 +1,21 @@
## `diskstat` collector
```json
"diskstat": {
"exclude_metrics": [
"disk_total"
]
}
```
The `diskstat` collector reads data from `/proc/self/mounts` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from forwarding to the sink.
Metrics per device (with `device` tag):
* `disk_total` (unit `GBytes`)
* `disk_free` (unit `GBytes`)
Global metrics:
* `part_max_used` (unit `percent`)

View File

@@ -7,24 +7,32 @@ import (
"fmt"
"io/ioutil"
"log"
"os"
"os/exec"
"os/user"
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
type GpfsCollector struct {
metricCollector
tags map[string]string
config struct {
Mmpmon string `json:"mmpmon"`
Mmpmon string `json:"mmpmon_path,omitempty"`
ExcludeFilesystem []string `json:"exclude_filesystem,omitempty"`
}
skipFS map[string]struct{}
}
func (m *GpfsCollector) Init(config json.RawMessage) error {
// Check if already initialized
if m.init {
return nil
}
var err error
m.name = "GpfsCollector"
m.setup()
@@ -40,27 +48,40 @@ func (m *GpfsCollector) Init(config []byte) error {
return err
}
}
m.meta = map[string]string{
"source": m.name,
"group": "GPFS",
}
m.tags = map[string]string{
"type": "node",
"filesystem": "",
}
m.skipFS = make(map[string]struct{})
for _, fs := range m.config.ExcludeFilesystem {
m.skipFS[fs] = struct{}{}
}
// GPFS / IBM Spectrum Scale file system statistics can only be queried by user root
user, err := user.Current()
if err != nil {
return fmt.Errorf("GpfsCollector.Init(): Failed to get current user: %v", err)
return fmt.Errorf("Failed to get current user: %v", err)
}
if user.Uid != "0" {
return fmt.Errorf("GpfsCollector.Init(): GPFS file system statistics can only be queried by user root")
return fmt.Errorf("GPFS file system statistics can only be queried by user root")
}
// Check if mmpmon is in executable search path
_, err = exec.LookPath(m.config.Mmpmon)
if err != nil {
return fmt.Errorf("GpfsCollector.Init(): Failed to find mmpmon binary '%s': %v", m.config.Mmpmon, err)
return fmt.Errorf("Failed to find mmpmon binary '%s': %v", m.config.Mmpmon, err)
}
m.init = true
return nil
}
func (m *GpfsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
// Check if already initialized
if !m.init {
return
}
@@ -77,12 +98,15 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
cmd.Stderr = cmdStderr
err := cmd.Run()
if err != nil {
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to execute command \"%s\": %s\n", cmd.String(), err.Error())
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): command exit code: \"%d\"\n", cmd.ProcessState.ExitCode())
data, _ := ioutil.ReadAll(cmdStderr)
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): command stderr: \"%s\"\n", string(data))
data, _ = ioutil.ReadAll(cmdStdout)
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): command stdout: \"%s\"\n", string(data))
dataStdErr, _ := ioutil.ReadAll(cmdStderr)
dataStdOut, _ := ioutil.ReadAll(cmdStdout)
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to execute command \"%s\": %v\n", cmd.String(), err),
fmt.Sprintf("Read(): command exit code: \"%d\"\n", cmd.ProcessState.ExitCode()),
fmt.Sprintf("Read(): command stderr: \"%s\"\n", string(dataStdErr)),
fmt.Sprintf("Read(): command stdout: \"%s\"\n", string(dataStdOut)),
)
return
}
@@ -90,194 +114,163 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
scanner := bufio.NewScanner(cmdStdout)
for scanner.Scan() {
lineSplit := strings.Fields(scanner.Text())
if lineSplit[0] == "_fs_io_s_" {
key_value := make(map[string]string)
for i := 1; i < len(lineSplit); i += 2 {
key_value[lineSplit[i]] = lineSplit[i+1]
}
// Ignore keys:
// _n_: node IP address,
// _nn_: node name,
// _cl_: cluster name,
// _d_: number of disks
// Only process lines starting with _fs_io_s_
if lineSplit[0] != "_fs_io_s_" {
continue
}
filesystem, ok := key_value["_fs_"]
if !ok {
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to get filesystem name.\n")
continue
}
key_value := make(map[string]string)
for i := 1; i < len(lineSplit); i += 2 {
key_value[lineSplit[i]] = lineSplit[i+1]
}
// Ignore keys:
// _n_: node IP address,
// _nn_: node name,
// _cl_: cluster name,
// _d_: number of disks
filesystem, ok := key_value["_fs_"]
if !ok {
cclog.ComponentError(
m.name,
"Read(): Failed to get filesystem name.")
continue
}
// Skip excluded filesystems
if _, skip := m.skipFS[filesystem]; skip {
continue
}
m.tags["filesystem"] = filesystem
// return code
rc, err := strconv.Atoi(key_value["_rc_"])
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert return code '%s' to int: %v", key_value["_rc_"], err))
continue
}
if rc != 0 {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Filesystem '%s' is not ok.", filesystem))
continue
}
sec, err := strconv.ParseInt(key_value["_t_"], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert seconds '%s' to int64: %v", key_value["_t_"], err))
continue
}
msec, err := strconv.ParseInt(key_value["_tu_"], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert micro seconds '%s' to int64: %v", key_value["_tu_"], err))
continue
}
timestamp := time.Unix(sec, msec*1000)
// bytes read
bytesRead, err := strconv.ParseInt(key_value["_br_"], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert bytes read '%s' to int64: %v", key_value["_br_"], err))
continue
}
if y, err := lp.New("gpfs_bytes_read", m.tags, m.meta, map[string]interface{}{"value": bytesRead}, timestamp); err == nil {
output <- y
}
// bytes written
bytesWritten, err := strconv.ParseInt(key_value["_bw_"], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert bytes written '%s' to int64: %v", key_value["_bw_"], err))
continue
}
if y, err := lp.New("gpfs_bytes_written", m.tags, m.meta, map[string]interface{}{"value": bytesWritten}, timestamp); err == nil {
output <- y
}
// number of opens
numOpens, err := strconv.ParseInt(key_value["_oc_"], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert number of opens '%s' to int64: %v", key_value["_oc_"], err))
continue
}
if y, err := lp.New("gpfs_num_opens", m.tags, m.meta, map[string]interface{}{"value": numOpens}, timestamp); err == nil {
output <- y
}
// number of closes
numCloses, err := strconv.ParseInt(key_value["_cc_"], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert number of closes: '%s' to int64: %v", key_value["_cc_"], err))
continue
}
if y, err := lp.New("gpfs_num_closes", m.tags, m.meta, map[string]interface{}{"value": numCloses}, timestamp); err == nil {
output <- y
}
// number of reads
numReads, err := strconv.ParseInt(key_value["_rdc_"], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert number of reads: '%s' to int64: %v", key_value["_rdc_"], err))
continue
}
if y, err := lp.New("gpfs_num_reads", m.tags, m.meta, map[string]interface{}{"value": numReads}, timestamp); err == nil {
output <- y
}
// number of writes
numWrites, err := strconv.ParseInt(key_value["_wc_"], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert number of writes: '%s' to int64: %v", key_value["_wc_"], err))
continue
}
if y, err := lp.New("gpfs_num_writes", m.tags, m.meta, map[string]interface{}{"value": numWrites}, timestamp); err == nil {
output <- y
}
// number of read directories
numReaddirs, err := strconv.ParseInt(key_value["_dir_"], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert number of read directories: '%s' to int64: %v", key_value["_dir_"], err))
continue
}
if y, err := lp.New("gpfs_num_readdirs", m.tags, m.meta, map[string]interface{}{"value": numReaddirs}, timestamp); err == nil {
output <- y
}
// Number of inode updates
numInodeUpdates, err := strconv.ParseInt(key_value["_iu_"], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert number of inode updates: '%s' to int: %v", key_value["_iu_"], err))
continue
}
if y, err := lp.New("gpfs_num_inode_updates", m.tags, m.meta, map[string]interface{}{"value": numInodeUpdates}, timestamp); err == nil {
output <- y
}
}
}

collectors/gpfsMetric.md
View File

@@ -0,0 +1,30 @@
## `gpfs` collector
```json
"ibstat": {
"mmpmon_path": "/path/to/mmpmon",
"exclude_filesystem": [
"fs1"
]
}
```
The `gpfs` collector uses the `mmpmon` command to read performance metrics for
GPFS / IBM Spectrum Scale filesystems.
The reported filesystems can be filtered with the `exclude_filesystem` option
in the configuration.
The path to the `mmpmon` command can be configured with the `mmpmon_path` option
in the configuration.
Metrics:
* `gpfs_bytes_read`
* `gpfs_bytes_written`
* `gpfs_num_opens`
* `gpfs_num_closes`
* `gpfs_num_reads`
* `gpfs_num_writes`
* `gpfs_num_readdirs`
* `gpfs_num_inode_updates`
The collector adds a `filesystem` tag to all metrics.

View File

@@ -3,282 +3,168 @@ package collectors
import (
"fmt"
"io/ioutil"
"log"
"os/exec"
"os"
lp "github.com/influxdata/line-protocol"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
"golang.org/x/sys/unix"
// "os"
"encoding/json"
"errors"
"path/filepath"
"strconv"
"strings"
"time"
)
const IB_BASEPATH = `/sys/class/infiniband/`
type InfinibandCollectorInfo struct {
LID string // IB local Identifier (LID)
device string // IB device
port string // IB device port
portCounterFiles map[string]string // mapping counter name -> sysfs file
tagSet map[string]string // corresponding tag list
}
type InfinibandCollector struct {
metricCollector
config struct {
ExcludeDevices []string `json:"exclude_devices,omitempty"` // IB device to exclude e.g. mlx5_0
}
info []*InfinibandCollectorInfo
}
func (m *InfinibandCollector) Help() {
fmt.Println("This collector includes all devices that can be found below ", IBBASEPATH)
fmt.Println("and where any of the ports provides a 'lid' file (glob ", IBBASEPATH, "/<dev>/ports/<port>/lid).")
fmt.Println("The devices can be filtered with the 'exclude_devices' option in the configuration.")
fmt.Println("For each found LIDs the collector calls the 'perfquery' command")
fmt.Println("The path to the 'perfquery' command can be configured with the 'perfquery_path' option")
fmt.Println("in the configuration")
fmt.Println("")
fmt.Println("Full configuration object:")
fmt.Println("\"ibstat\" : {")
fmt.Println(" \"perfquery_path\" : \"path/to/perfquery\" # if omitted, it searches in $PATH")
fmt.Println(" \"exclude_devices\" : [\"dev1\"]")
fmt.Println("}")
fmt.Println("")
fmt.Println("Metrics:")
fmt.Println("- ib_recv")
fmt.Println("- ib_xmit")
fmt.Println("- ib_recv_pkts")
fmt.Println("- ib_xmit_pkts")
}
// Init initializes the Infiniband collector by walking through files below IB_BASEPATH
func (m *InfinibandCollector) Init(config json.RawMessage) error {
// Check if already initialized
if m.init {
return nil
}
func (m *InfinibandCollector) Init(config []byte) error {
var err error
m.name = "InfinibandCollector"
m.use_perfquery = false
m.setup()
m.tags = map[string]string{"type": "node"}
m.meta = map[string]string{
"source": m.name,
"group": "Network",
}
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
return err
}
}
if len(m.config.PerfQueryPath) == 0 {
path, err := exec.LookPath("perfquery")
if err == nil {
m.config.PerfQueryPath = path
}
}
m.lids = make(map[string]map[string]string)
p := fmt.Sprintf("%s/*/ports/*/lid", string(IBBASEPATH))
files, err := filepath.Glob(p)
for _, f := range files {
lid, err := ioutil.ReadFile(f)
if err == nil {
plist := strings.Split(strings.Replace(f, string(IBBASEPATH), "", -1), "/")
skip := false
for _, d := range m.config.ExcludeDevices {
if d == plist[0] {
skip = true
}
}
if !skip {
m.lids[plist[0]] = make(map[string]string)
m.lids[plist[0]][plist[2]] = string(lid)
}
}
}
for _, ports := range m.lids {
for port, lid := range ports {
args := fmt.Sprintf("-r %s %s 0xf000", lid, port)
command := exec.Command(m.config.PerfQueryPath, args)
command.Wait()
_, err := command.Output()
if err == nil {
m.use_perfquery = true
}
break
}
break
}
if len(m.lids) > 0 {
m.init = true
} else {
err = errors.New("No usable devices")
}
return err
}
func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[string]string, out *[]lp.MutableMetric) error {
args := fmt.Sprintf("-r %s %s 0xf000", lid, port)
command := exec.Command(cmd, args)
command.Wait()
stdout, err := command.Output()
// Loop for all InfiniBand directories
globPattern := filepath.Join(IB_BASEPATH, "*", "ports", "*")
ibDirs, err := filepath.Glob(globPattern)
if err != nil {
log.Print(err)
return err
return fmt.Errorf("Unable to glob files with pattern %s: %v", globPattern, err)
}
if ibDirs == nil {
return fmt.Errorf("Unable to find any directories with pattern %s", globPattern)
}
ll := strings.Split(string(stdout), "\n")
for _, line := range ll {
if strings.HasPrefix(line, "PortRcvData") || strings.HasPrefix(line, "RcvData") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_recv", tags, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
}
for _, path := range ibDirs {
// Skip, when no LID is assigned
line, err := ioutil.ReadFile(filepath.Join(path, "lid"))
if err != nil {
continue
}
LID := strings.TrimSpace(string(line))
if LID == "0x0" {
continue
}
// Get device and port component
pathSplit := strings.Split(path, string(os.PathSeparator))
device := pathSplit[4]
port := pathSplit[6]
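// Example: "/sys/class/infiniband/mlx5_0/ports/1" splits into
// ["", "sys", "class", "infiniband", "mlx5_0", "ports", "1"],
// so index 4 holds the device and index 6 the port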
// Skip excluded devices
skip := false
for _, excludedDevice := range m.config.ExcludeDevices {
if excludedDevice == device {
skip = true
break
}
}
if strings.HasPrefix(line, "PortXmitData") || strings.HasPrefix(line, "XmtData") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_xmit", tags, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
}
}
}
if strings.HasPrefix(line, "PortRcvPkts") || strings.HasPrefix(line, "RcvPkts") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_recv_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
}
}
}
if strings.HasPrefix(line, "PortXmitPkts") || strings.HasPrefix(line, "XmtPkts") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_xmit_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
}
if skip {
continue
}
// Check access to counter files
countersDir := filepath.Join(path, "counters")
portCounterFiles := map[string]string{
"ib_recv": filepath.Join(countersDir, "port_rcv_data"),
"ib_xmit": filepath.Join(countersDir, "port_xmit_data"),
"ib_recv_pkts": filepath.Join(countersDir, "port_rcv_packets"),
"ib_xmit_pkts": filepath.Join(countersDir, "port_xmit_packets"),
}
for _, counterFile := range portCounterFiles {
err := unix.Access(counterFile, unix.R_OK)
if err != nil {
return fmt.Errorf("Unable to access %s: %v", counterFile, err)
}
}
m.info = append(m.info,
&InfinibandCollectorInfo{
LID: LID,
device: device,
port: port,
portCounterFiles: portCounterFiles,
tagSet: map[string]string{
"type": "node",
"device": device,
"port": port,
"lid": LID,
},
})
}
if len(m.info) == 0 {
return fmt.Errorf("Found no IB devices")
}
m.init = true
return nil
}
func DoSysfsRead(dev string, lid string, port string, tags map[string]string, out *[]lp.MutableMetric) error {
path := fmt.Sprintf("%s/%s/ports/%s/counters/", string(IBBASEPATH), dev, port)
buffer, err := ioutil.ReadFile(fmt.Sprintf("%s/port_rcv_data", path))
if err == nil {
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_recv", tags, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
}
}
}
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_xmit_data", path))
if err == nil {
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_xmit", tags, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
}
}
}
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_rcv_packets", path))
if err == nil {
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_recv_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
}
}
}
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_xmit_packets", path))
if err == nil {
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_xmit_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
*out = append(*out, y)
}
}
}
return nil
}
// Read reads Infiniband counter files below IB_BASEPATH
func (m *InfinibandCollector) Read(interval time.Duration, output chan lp.CCMetric) {
func (m *InfinibandCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
if m.init {
for dev, ports := range m.lids {
for port, lid := range ports {
tags := map[string]string{"type": "node", "device": dev, "port": port}
if m.use_perfquery {
DoPerfQuery(m.config.PerfQueryPath, dev, lid, port, tags, out)
} else {
DoSysfsRead(dev, lid, port, tags, out)
}
}
}
// Check if already initialized
if !m.init {
return
}
// buffer, err := ioutil.ReadFile(string(LIDFILE))
now := time.Now()
for _, info := range m.info {
for counterName, counterFile := range info.portCounterFiles {
line, err := ioutil.ReadFile(counterFile)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to read from file '%s': %v", counterFile, err))
continue
}
data := strings.TrimSpace(string(line))
v, err := strconv.ParseInt(data, 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert Infininiband metrice %s='%s' to int64: %v", counterName, data, err))
continue
}
if y, err := lp.New(counterName, info.tagSet, m.meta, map[string]interface{}{"value": v}, now); err == nil {
output <- y
}
}
// if err != nil {
// log.Print(err)
// return
// }
// args := fmt.Sprintf("-r %s 1 0xf000", string(buffer))
// command := exec.Command(PERFQUERY, args)
// command.Wait()
// stdout, err := command.Output()
// if err != nil {
// log.Print(err)
// return
// }
// ll := strings.Split(string(stdout), "\n")
// for _, line := range ll {
// if strings.HasPrefix(line, "PortRcvData") || strings.HasPrefix(line, "RcvData") {
// lv := strings.Fields(line)
// v, err := strconv.ParseFloat(lv[1], 64)
// if err == nil {
// y, err := lp.New("ib_recv", m.tags, map[string]interface{}{"value": float64(v)}, time.Now())
// if err == nil {
// *out = append(*out, y)
// }
// }
// }
// if strings.HasPrefix(line, "PortXmitData") || strings.HasPrefix(line, "XmtData") {
// lv := strings.Fields(line)
// v, err := strconv.ParseFloat(lv[1], 64)
// if err == nil {
// y, err := lp.New("ib_xmit", m.tags, map[string]interface{}{"value": float64(v)}, time.Now())
// if err == nil {
// *out = append(*out, y)
// }
// }
// }
// }
}
}
func (m *InfinibandCollector) Close() {


@@ -0,0 +1,26 @@
## `ibstat` collector
```json
"ibstat": {
"exclude_devices": [
"mlx4"
]
}
```
The `ibstat` collector includes all InfiniBand devices that can be
found below `/sys/class/infiniband/` and where any of the ports provides a
LID file (`/sys/class/infiniband/<dev>/ports/<port>/lid`).
The devices can be filtered with the `exclude_devices` option in the configuration.
For each found LID the collector reads data through the sysfs files below `/sys/class/infiniband/<device>`.
Metrics:
* `ib_recv`
* `ib_xmit`
* `ib_recv_pkts`
* `ib_xmit_pkts`
The collector adds the tags `device`, `port`, and `lid` to all metrics.
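For orientation, the sysfs files the collector reads typically look like this (device and port names are examples):
```
/sys/class/infiniband/mlx5_0/ports/1/lid
/sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data
/sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
/sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_packets
/sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_packets
```
Ports whose `lid` file contains `0x0` are skipped.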


@@ -0,0 +1,232 @@
package collectors
import (
"fmt"
"io/ioutil"
"log"
"os/exec"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
// "os"
"encoding/json"
"errors"
"path/filepath"
"strconv"
"strings"
"time"
)
const PERFQUERY = `/usr/sbin/perfquery`
type InfinibandPerfQueryCollector struct {
metricCollector
tags map[string]string
lids map[string]map[string]string
config struct {
ExcludeDevices []string `json:"exclude_devices,omitempty"`
PerfQueryPath string `json:"perfquery_path"`
}
}
func (m *InfinibandPerfQueryCollector) Init(config json.RawMessage) error {
var err error
m.name = "InfinibandCollectorPerfQuery"
m.setup()
m.meta = map[string]string{"source": m.name, "group": "Network"}
m.tags = map[string]string{"type": "node"}
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
return err
}
}
if len(m.config.PerfQueryPath) == 0 {
path, err := exec.LookPath("perfquery")
if err == nil {
m.config.PerfQueryPath = path
}
}
m.lids = make(map[string]map[string]string)
p := fmt.Sprintf("%s/*/ports/*/lid", string(IB_BASEPATH))
files, err := filepath.Glob(p)
if err != nil {
return err
}
for _, f := range files {
lid, err := ioutil.ReadFile(f)
if err == nil {
plist := strings.Split(strings.Replace(f, string(IB_BASEPATH), "", -1), "/")
skip := false
for _, d := range m.config.ExcludeDevices {
if d == plist[0] {
skip = true
}
}
if !skip {
m.lids[plist[0]] = make(map[string]string)
m.lids[plist[0]][plist[2]] = string(lid)
}
}
}
for _, ports := range m.lids {
for port, lid := range ports {
args := fmt.Sprintf("-r %s %s 0xf000", lid, port)
command := exec.Command(m.config.PerfQueryPath, args)
command.Wait()
_, err := command.Output()
if err != nil {
return fmt.Errorf("Failed to execute %s: %v", m.config.PerfQueryPath, err)
}
}
}
if len(m.lids) == 0 {
return errors.New("No usable IB devices")
}
m.init = true
return nil
}
func (m *InfinibandPerfQueryCollector) doPerfQuery(cmd string, dev string, lid string, port string, tags map[string]string, output chan lp.CCMetric) error {
args := fmt.Sprintf("-r %s %s 0xf000", lid, port)
command := exec.Command(cmd, args)
command.Wait()
stdout, err := command.Output()
if err != nil {
log.Print(err)
return err
}
ll := strings.Split(string(stdout), "\n")
for _, line := range ll {
if strings.HasPrefix(line, "PortRcvData") || strings.HasPrefix(line, "RcvData") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_recv", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
if strings.HasPrefix(line, "PortXmitData") || strings.HasPrefix(line, "XmtData") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_xmit", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
if strings.HasPrefix(line, "PortRcvPkts") || strings.HasPrefix(line, "RcvPkts") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_recv_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
if strings.HasPrefix(line, "PortXmitPkts") || strings.HasPrefix(line, "XmtPkts") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_xmit_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
if strings.HasPrefix(line, "PortRcvPkts") || strings.HasPrefix(line, "RcvPkts") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_recv_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
if strings.HasPrefix(line, "PortXmitPkts") || strings.HasPrefix(line, "XmtPkts") {
lv := strings.Fields(line)
v, err := strconv.ParseFloat(lv[1], 64)
if err == nil {
y, err := lp.New("ib_xmit_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
}
return nil
}
func (m *InfinibandPerfQueryCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if m.init {
for dev, ports := range m.lids {
for port, lid := range ports {
tags := map[string]string{
"type": "node",
"device": dev,
"port": port,
"lid": lid}
path := fmt.Sprintf("%s/%s/ports/%s/counters/", string(IB_BASEPATH), dev, port)
buffer, err := ioutil.ReadFile(fmt.Sprintf("%s/port_rcv_data", path))
if err == nil {
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_recv", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_xmit_data", path))
if err == nil {
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_xmit", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_rcv_packets", path))
if err == nil {
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_recv_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
buffer, err = ioutil.ReadFile(fmt.Sprintf("%s/port_xmit_packets", path))
if err == nil {
data := strings.Replace(string(buffer), "\n", "", -1)
v, err := strconv.ParseFloat(data, 64)
if err == nil {
y, err := lp.New("ib_xmit_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
if err == nil {
output <- y
}
}
}
}
}
}
}
func (m *InfinibandPerfQueryCollector) Close() {
m.init = false
}


@@ -0,0 +1,28 @@
## `ibstat_perfquery` collector
```json
"ibstat_perfquery": {
"perfquery_path": "/path/to/perfquery",
"exclude_devices": [
"mlx4"
]
}
```
The `ibstat_perfquery` collector includes all InfiniBand devices that can be
found below `/sys/class/infiniband/` and where any of the ports provides a
LID file (`/sys/class/infiniband/<dev>/ports/<port>/lid`).
The devices can be filtered with the `exclude_devices` option in the configuration.
For each found LID the collector calls the `perfquery` command; the path to the
`perfquery` command can be configured with the `perfquery_path` option in the configuration.
Metrics:
* `ib_recv`
* `ib_xmit`
* `ib_recv_pkts`
* `ib_xmit_pkts`
The collector adds the tags `device`, `port`, and `lid` to all metrics.
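The parser keys on the counter-name prefix of each output line and expects the name and value to be whitespace-separated. An illustrative dump it could handle (real `perfquery` output differs between versions) looks like:
```
PortXmitData 12345678
PortRcvData 23456789
PortXmitPkts 3456789
PortRcvPkts 4567890
```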

155
collectors/iostatMetric.go Normal file

@@ -0,0 +1,155 @@
package collectors
import (
"bufio"
"os"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
// "log"
"encoding/json"
"errors"
"strconv"
"strings"
"time"
)
const IOSTATFILE = `/proc/diskstats`
const IOSTAT_SYSFSPATH = `/sys/block`
type IOstatCollectorConfig struct {
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
}
type IOstatCollectorEntry struct {
lastValues map[string]int64
tags map[string]string
}
type IOstatCollector struct {
metricCollector
matches map[string]int
config IOstatCollectorConfig
devices map[string]IOstatCollectorEntry
}
func (m *IOstatCollector) Init(config json.RawMessage) error {
var err error
m.name = "IOstatCollector"
m.meta = map[string]string{"source": m.name, "group": "Disk"}
m.setup()
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
return err
}
}
// https://www.kernel.org/doc/html/latest/admin-guide/iostats.html
matches := map[string]int{
"io_reads": 3,
"io_reads_merged": 4,
"io_read_sectors": 5,
"io_read_ms": 6,
"io_writes": 7,
"io_writes_merged": 8,
"io_writes_sectors": 9,
"io_writes_ms": 10,
"io_ioops": 11,
"io_ioops_ms": 12,
"io_ioops_weighted_ms": 13,
"io_discards": 14,
"io_discards_merged": 15,
"io_discards_sectors": 16,
"io_discards_ms": 17,
"io_flushes": 18,
"io_flushes_ms": 19,
}
m.devices = make(map[string]IOstatCollectorEntry)
m.matches = make(map[string]int)
for k, v := range matches {
if _, skip := stringArrayContains(m.config.ExcludeMetrics, k); !skip {
m.matches[k] = v
}
}
if len(m.matches) == 0 {
return errors.New("no metrics to collect")
}
file, err := os.Open(string(IOSTATFILE))
if err != nil {
cclog.ComponentError(m.name, err.Error())
return err
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Text()
linefields := strings.Fields(line)
device := linefields[2]
if strings.Contains(device, "loop") {
continue
}
values := make(map[string]int64)
for m := range m.matches {
values[m] = 0
}
m.devices[device] = IOstatCollectorEntry{
tags: map[string]string{
"device": linefields[2],
"type": "node",
},
lastValues: values,
}
}
m.init = true
return err
}
func (m *IOstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
file, err := os.Open(string(IOSTATFILE))
if err != nil {
cclog.ComponentError(m.name, err.Error())
return
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Text()
if len(line) == 0 {
continue
}
linefields := strings.Fields(line)
device := linefields[2]
if strings.Contains(device, "loop") {
continue
}
if _, ok := m.devices[device]; !ok {
continue
}
entry := m.devices[device]
for name, idx := range m.matches {
if idx < len(linefields) {
x, err := strconv.ParseInt(linefields[idx], 0, 64)
if err == nil {
diff := x - entry.lastValues[name]
y, err := lp.New(name, entry.tags, m.meta, map[string]interface{}{"value": int(diff)}, time.Now())
if err == nil {
output <- y
}
}
entry.lastValues[name] = x
}
}
m.devices[device] = entry
}
}
func (m *IOstatCollector) Close() {
m.init = false
}


@@ -0,0 +1,34 @@
## `iostat` collector
```json
"iostat": {
"exclude_metrics": [
"read_ms"
],
}
```
The `iostat` collector reads data from `/proc/diskstats` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink. A sample input line is shown after the metric list.
Metrics:
* `io_reads`
* `io_reads_merged`
* `io_read_sectors`
* `io_read_ms`
* `io_writes`
* `io_writes_merged`
* `io_writes_sectors`
* `io_writes_ms`
* `io_ioops`
* `io_ioops_ms`
* `io_ioops_weighted_ms`
* `io_discards`
* `io_discards_merged`
* `io_discards_sectors`
* `io_discards_ms`
* `io_flushes`
* `io_flushes_ms`
The device name is added as tag `device`. For more details, see https://www.kernel.org/doc/html/latest/admin-guide/iostats.html
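An illustrative (shortened, made-up values) `/proc/diskstats` line:
```
259 0 nvme0n1 8626187 3252 543602970 1593305 1066375 650933 84110560 2223110 0 1135612 3816416 12000 0 98304 5600 700 90
```
Counting whitespace-separated fields from 0, field 3 (`8626187`) is `io_reads`, field 5 is `io_read_sectors`, field 7 is `io_writes`, and so on up to field 19 (`io_flushes_ms`); the collector reports the per-interval difference of these cumulative counters.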


@@ -10,11 +10,11 @@ import (
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const IPMITOOL_PATH = `/usr/bin/ipmitool`
const IPMISENSORS_PATH = `/usr/sbin/ipmi-sensors`
const IPMITOOL_PATH = `ipmitool`
const IPMISENSORS_PATH = `ipmi-sensors`
type IpmiCollectorConfig struct {
ExcludeDevices []string `json:"exclude_devices"`
@@ -23,37 +23,44 @@ type IpmiCollectorConfig struct {
}
type IpmiCollector struct {
MetricCollector
tags map[string]string
matches map[string]string
config IpmiCollectorConfig
metricCollector
//tags map[string]string
//matches map[string]string
config IpmiCollectorConfig
ipmitool string
ipmisensors string
}
func (m *IpmiCollector) Init(config []byte) error {
func (m *IpmiCollector) Init(config json.RawMessage) error {
m.name = "IpmiCollector"
m.setup()
m.meta = map[string]string{"source": m.name, "group": "IPMI"}
m.config.IpmitoolPath = string(IPMITOOL_PATH)
m.config.IpmisensorsPath = string(IPMISENSORS_PATH)
m.ipmitool = ""
m.ipmisensors = ""
if len(config) > 0 {
err := json.Unmarshal(config, &m.config)
if err != nil {
return err
}
}
_, err1 := os.Stat(m.config.IpmitoolPath)
_, err2 := os.Stat(m.config.IpmisensorsPath)
if err1 != nil {
m.config.IpmitoolPath = ""
p, err := exec.LookPath(m.config.IpmitoolPath)
if err == nil {
m.ipmitool = p
}
if err2 != nil {
m.config.IpmisensorsPath = ""
p, err = exec.LookPath(m.config.IpmisensorsPath)
if err == nil {
m.ipmisensors = p
}
if err1 != nil && err2 != nil {
if len(m.ipmitool) == 0 && len(m.ipmisensors) == 0 {
return errors.New("No IPMI reader found")
}
m.init = true
return nil
}
func ReadIpmiTool(cmd string, out *[]lp.MutableMetric) {
func (m *IpmiCollector) readIpmiTool(cmd string, output chan lp.CCMetric) {
command := exec.Command(cmd, "sensor")
command.Wait()
stdout, err := command.Output()
@@ -74,24 +81,25 @@ func ReadIpmiTool(cmd string, out *[]lp.MutableMetric) {
name := strings.ToLower(strings.Replace(strings.Trim(lv[0], " "), " ", "_", -1))
unit := strings.Trim(lv[2], " ")
if unit == "Volts" {
unit = "V"
unit = "Volts"
} else if unit == "degrees C" {
unit = "C"
unit = "degC"
} else if unit == "degrees F" {
unit = "F"
unit = "degF"
} else if unit == "Watts" {
unit = "W"
unit = "Watts"
}
y, err := lp.New(name, map[string]string{"unit": unit, "type": "node"}, map[string]interface{}{"value": v}, time.Now())
y, err := lp.New(name, map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": v}, time.Now())
if err == nil {
*out = append(*out, y)
y.AddMeta("unit", unit)
output <- y
}
}
}
}
func ReadIpmiSensors(cmd string, out *[]lp.MutableMetric) {
func (m *IpmiCollector) readIpmiSensors(cmd string, output chan lp.CCMetric) {
command := exec.Command(cmd, "--comma-separated-output", "--sdr-cache-recreate")
command.Wait()
@@ -109,25 +117,28 @@ func ReadIpmiSensors(cmd string, out *[]lp.MutableMetric) {
v, err := strconv.ParseFloat(lv[3], 64)
if err == nil {
name := strings.ToLower(strings.Replace(lv[1], " ", "_", -1))
y, err := lp.New(name, map[string]string{"unit": lv[4], "type": "node"}, map[string]interface{}{"value": v}, time.Now())
y, err := lp.New(name, map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": v}, time.Now())
if err == nil {
*out = append(*out, y)
if len(lv) > 4 {
y.AddMeta("unit", lv[4])
}
output <- y
}
}
}
}
}
func (m *IpmiCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
func (m *IpmiCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if len(m.config.IpmitoolPath) > 0 {
_, err := os.Stat(m.config.IpmitoolPath)
if err == nil {
ReadIpmiTool(m.config.IpmitoolPath, out)
m.readIpmiTool(m.config.IpmitoolPath, output)
}
} else if len(m.config.IpmisensorsPath) > 0 {
_, err := os.Stat(m.config.IpmisensorsPath)
if err == nil {
ReadIpmiSensors(m.config.IpmisensorsPath, out)
m.readIpmiSensors(m.config.IpmisensorsPath, output)
}
}
}

16
collectors/ipmiMetric.md Normal file

@@ -0,0 +1,16 @@
## `ipmistat` collector
```json
"ipmistat": {
"ipmitool_path": "/path/to/ipmitool",
"ipmisensors_path": "/path/to/ipmi-sensors",
}
```
The `ipmistat` collector reads data from `ipmitool` (`ipmitool sensor`) or `ipmi-sensors` (`ipmi-sensors --sdr-cache-recreate --comma-separated-output`).
The metrics depend on the output of the underlying tools but contain temperature, power and energy metrics.
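For illustration, `ipmitool sensor` prints one pipe-separated row per sensor; an example row (shortened, exact columns vary by BMC and tool version):
```
CPU1 Temp        | 38.000     | degrees C  | ok
```
The collector lowercases the sensor name and replaces spaces with underscores (here: `cpu1_temp`), parses the second column as the value, and normalizes the unit string (`degrees C` becomes `degC`).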


@@ -2,7 +2,7 @@ package collectors
/*
#cgo CFLAGS: -I./likwid
#cgo LDFLAGS: -L./likwid -llikwid -llikwid-hwloc -lm
#cgo LDFLAGS: -L./likwid -llikwid -llikwid-hwloc -lm -Wl,--unresolved-symbols=ignore-in-object-files
#include <stdlib.h>
#include <likwid.h>
*/
@@ -13,55 +13,111 @@ import (
"errors"
"fmt"
"io/ioutil"
"log"
"math"
"os"
"regexp"
"strconv"
"strings"
"time"
"unsafe"
lp "github.com/influxdata/line-protocol"
"gopkg.in/Knetic/govaluate.v2"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
topo "github.com/ClusterCockpit/cc-metric-collector/internal/ccTopology"
agg "github.com/ClusterCockpit/cc-metric-collector/internal/metricAggregator"
"github.com/NVIDIA/go-nvml/pkg/dl"
)
type MetricScope string
const (
METRIC_SCOPE_HWTHREAD = iota
METRIC_SCOPE_CORE
METRIC_SCOPE_LLC
METRIC_SCOPE_NUMA
METRIC_SCOPE_DIE
METRIC_SCOPE_SOCKET
METRIC_SCOPE_NODE
)
func (ms MetricScope) String() string {
return string(ms)
}
func (ms MetricScope) Likwid() string {
LikwidDomains := map[string]string{
"cpu": "",
"core": "",
"llc": "C",
"numadomain": "M",
"die": "D",
"socket": "S",
"node": "N",
}
return LikwidDomains[string(ms)]
}
func (ms MetricScope) Granularity() int {
for i, g := range GetAllMetricScopes() {
if ms == g {
return i
}
}
return -1
}
func GetAllMetricScopes() []MetricScope {
return []MetricScope{"cpu" /*, "core", "llc", "numadomain", "die",*/, "socket", "node"}
}
const (
LIKWID_LIB_NAME = "liblikwid.so"
LIKWID_LIB_DL_FLAGS = dl.RTLD_LAZY | dl.RTLD_GLOBAL
)
type LikwidCollectorMetricConfig struct {
Name string `json:"name"`
Calc string `json:"calc"`
Socket_scope bool `json:"socket_scope"`
Publish bool `json:"publish"`
Name string `json:"name"` // Name of the metric
Calc string `json:"calc"` // Calculation for the metric using
//Aggr string `json:"aggregation"` // if scope unequal to LIKWID metric scope, the values are combined (sum, min, max, mean or avg, median)
Scope MetricScope `json:"scope"` // scope for calculation. subscopes are aggregated using the 'aggregation' function
Publish bool `json:"publish"`
granulatity MetricScope
}
type LikwidCollectorEventsetConfig struct {
Events map[string]string `json:"events"`
Metrics []LikwidCollectorMetricConfig `json:"metrics"`
Events map[string]string `json:"events"`
granulatity map[string]MetricScope
Metrics []LikwidCollectorMetricConfig `json:"metrics"`
}
type LikwidCollectorConfig struct {
Eventsets []LikwidCollectorEventsetConfig `json:"eventsets"`
Metrics []LikwidCollectorMetricConfig `json:"globalmetrics"`
ExcludeMetrics []string `json:"exclude_metrics"`
ForceOverwrite bool `json:"force_overwrite"`
Metrics []LikwidCollectorMetricConfig `json:"globalmetrics,omitempty"`
ForceOverwrite bool `json:"force_overwrite,omitempty"`
InvalidToZero bool `json:"invalid_to_zero,omitempty"`
}
type LikwidCollector struct {
MetricCollector
cpulist []C.int
sock2tid map[int]int
metrics map[C.int]map[string]int
groups []C.int
config LikwidCollectorConfig
results map[int]map[int]map[string]interface{}
mresults map[int]map[int]map[string]float64
gmresults map[int]map[string]float64
basefreq float64
metricCollector
cpulist []C.int
cpu2tid map[int]int
sock2tid map[int]int
scopeRespTids map[MetricScope]map[int]int
metrics map[C.int]map[string]int
groups []C.int
config LikwidCollectorConfig
results map[int]map[int]map[string]interface{}
mresults map[int]map[int]map[string]float64
gmresults map[int]map[string]float64
basefreq float64
running bool
}
type LikwidMetric struct {
name string
search string
socket_scope bool
group_idx int
name string
search string
scope MetricScope
group_idx int
}
func eventsToEventStr(events map[string]string) string {
@@ -72,12 +128,27 @@ func eventsToEventStr(events map[string]string) string {
return strings.Join(elist, ",")
}
func getGranularity(counter, event string) MetricScope {
if strings.HasPrefix(counter, "PMC") || strings.HasPrefix(counter, "FIXC") {
return "cpu"
} else if strings.Contains(counter, "BOX") || strings.Contains(counter, "DEV") {
return "socket"
} else if strings.HasPrefix(counter, "PWR") {
if event == "RAPL_CORE_ENERGY" {
return "cpu"
} else {
return "socket"
}
}
return "unknown"
}
func getBaseFreq() float64 {
var freq float64 = math.NaN()
C.power_init(0)
info := C.get_powerInfo()
if float64(info.baseFrequency) != 0 {
freq = float64(info.baseFrequency)
freq = float64(info.baseFrequency) * 1e3
} else {
buffer, err := ioutil.ReadFile("/sys/devices/system/cpu/cpu0/cpufreq/bios_limit")
if err == nil {
@@ -91,21 +162,102 @@ func getBaseFreq() float64 {
return freq
}
func getSocketCpus() map[C.int]int {
slist := SocketList()
var cpu C.int
outmap := make(map[C.int]int)
for _, s := range slist {
t := C.CString(fmt.Sprintf("S%d", s))
clen := C.cpustr_to_cpulist(t, &cpu, 1)
if int(clen) == 1 {
outmap[cpu] = s
func (m *LikwidCollector) initGranularity() {
splitRegex := regexp.MustCompile("[+-/*()]")
for _, evset := range m.config.Eventsets {
evset.granulatity = make(map[string]MetricScope)
for counter, event := range evset.Events {
gran := getGranularity(counter, event)
if gran.Granularity() >= 0 {
evset.granulatity[counter] = gran
}
}
for i, metric := range evset.Metrics {
s := splitRegex.Split(metric.Calc, -1)
gran := MetricScope("cpu")
evset.Metrics[i].granulatity = gran
for _, x := range s {
if _, ok := evset.Events[x]; ok {
if evset.granulatity[x].Granularity() > gran.Granularity() {
gran = evset.granulatity[x]
}
}
}
evset.Metrics[i].granulatity = gran
}
}
return outmap
for i, metric := range m.config.Metrics {
s := splitRegex.Split(metric.Calc, -1)
gran := MetricScope("cpu")
m.config.Metrics[i].granulatity = gran
for _, x := range s {
for _, evset := range m.config.Eventsets {
for _, m := range evset.Metrics {
if m.Name == x && m.granulatity.Granularity() > gran.Granularity() {
gran = m.granulatity
}
}
}
}
m.config.Metrics[i].granulatity = gran
}
}
func (m *LikwidCollector) Init(config []byte) error {
type TopoResolveFunc func(cpuid int) int
func (m *LikwidCollector) getResponsiblities() map[MetricScope]map[int]int {
get_cpus := func(scope MetricScope) map[int]int {
var slist []int
var cpu C.int
var input func(index int) string
switch scope {
case "node":
slist = []int{0}
input = func(index int) string { return "N:0" }
case "socket":
input = func(index int) string { return fmt.Sprintf("%s%d:0", scope.Likwid(), index) }
slist = topo.SocketList()
// case "numadomain":
// input = func(index int) string { return fmt.Sprintf("%s%d:0", scope.Likwid(), index) }
// slist = topo.NumaNodeList()
// cclog.Debug(scope, " ", input(0), " ", slist)
// case "die":
// input = func(index int) string { return fmt.Sprintf("%s%d:0", scope.Likwid(), index) }
// slist = topo.DieList()
// case "llc":
// input = fmt.Sprintf("%s%d:0", scope.Likwid(), s)
// slist = topo.LLCacheList()
case "cpu":
input = func(index int) string { return fmt.Sprintf("%d", index) }
slist = topo.CpuList()
case "hwthread":
input = func(index int) string { return fmt.Sprintf("%d", index) }
slist = topo.CpuList()
}
outmap := make(map[int]int)
for _, s := range slist {
t := C.CString(input(s))
clen := C.cpustr_to_cpulist(t, &cpu, 1)
if int(clen) == 1 {
outmap[s] = m.cpu2tid[int(cpu)]
} else {
cclog.Error(fmt.Sprintf("Cannot determine responsible CPU for %s", input(s)))
outmap[s] = -1
}
C.free(unsafe.Pointer(t))
}
return outmap
}
scopes := GetAllMetricScopes()
complete := make(map[MetricScope]map[int]int)
for _, s := range scopes {
complete[s] = get_cpus(s)
}
return complete
}
func (m *LikwidCollector) Init(config json.RawMessage) error {
var ret C.int
m.name = "LikwidCollector"
if len(config) > 0 {
@@ -114,36 +266,78 @@ func (m *LikwidCollector) Init(config []byte) error {
return err
}
}
lib := dl.New(LIKWID_LIB_NAME, LIKWID_LIB_DL_FLAGS)
if lib == nil {
return fmt.Errorf("error instantiating DynamicLibrary for %s", LIKWID_LIB_NAME)
}
if m.config.ForceOverwrite {
cclog.ComponentDebug(m.name, "Set LIKWID_FORCE=1")
os.Setenv("LIKWID_FORCE", "1")
}
m.setup()
cpulist := CpuList()
m.cpulist = make([]C.int, len(cpulist))
slist := getSocketCpus()
m.sock2tid = make(map[int]int)
m.meta = map[string]string{"source": m.name, "group": "PerfCounter"}
cclog.ComponentDebug(m.name, "Get cpulist and init maps and lists")
cpulist := topo.CpuList()
m.cpulist = make([]C.int, len(cpulist))
m.cpu2tid = make(map[int]int)
for i, c := range cpulist {
m.cpulist[i] = C.int(c)
if sid, found := slist[m.cpulist[i]]; found {
m.sock2tid[sid] = i
}
m.cpu2tid[c] = i
}
m.results = make(map[int]map[int]map[string]interface{})
m.mresults = make(map[int]map[int]map[string]float64)
m.gmresults = make(map[int]map[string]float64)
cclog.ComponentDebug(m.name, "initialize LIKWID topology")
ret = C.topology_init()
if ret != 0 {
return errors.New("Failed to initialize LIKWID topology")
}
if m.config.ForceOverwrite {
os.Setenv("LIKWID_FORCE", "1")
err := errors.New("failed to initialize LIKWID topology")
cclog.ComponentError(m.name, err.Error())
return err
}
// Determine which counter works at which level. PMC*: cpu, *BOX*: socket, ...
m.initGranularity()
// Generate map for MetricScope -> scope_id (like socket id) -> responsible id (offset in cpulist)
m.scopeRespTids = m.getResponsiblities()
cclog.ComponentDebug(m.name, "initialize LIKWID perfmon module")
ret = C.perfmon_init(C.int(len(m.cpulist)), &m.cpulist[0])
if ret != 0 {
C.topology_finalize()
return errors.New("Failed to initialize LIKWID topology")
err := errors.New("failed to initialize LIKWID topology")
cclog.ComponentError(m.name, err.Error())
return err
}
// This is for the global metrics computation test
globalParams := make(map[string]interface{})
globalParams["time"] = float64(1.0)
globalParams["inverseClock"] = float64(1.0)
// While adding the events, we test the metrics whether they can be computed at all
for i, evset := range m.config.Eventsets {
estr := eventsToEventStr(evset.Events)
// Generate parameter list for the metric computing test
params := make(map[string]interface{})
params["time"] = float64(1.0)
params["inverseClock"] = float64(1.0)
for counter := range evset.Events {
params[counter] = float64(1.0)
}
for _, metric := range evset.Metrics {
// Try to evaluate the metric
_, err := agg.EvalFloat64Condition(metric.Calc, params)
if err != nil {
cclog.ComponentError(m.name, "Calculation for metric", metric.Name, "failed:", err.Error())
continue
}
// If the metric is not in the parameter list for the global metrics, add it
if _, ok := globalParams[metric.Name]; !ok {
globalParams[metric.Name] = float64(1.0)
}
}
// Now we add the list of events to likwid
cstr := C.CString(estr)
gid := C.perfmon_addEventSet(cstr)
if gid >= 0 {
@@ -155,153 +349,208 @@ func (m *LikwidCollector) Init(config []byte) error {
for tid := range m.cpulist {
m.results[i][tid] = make(map[string]interface{})
m.mresults[i][tid] = make(map[string]float64)
m.gmresults[tid] = make(map[string]float64)
if i == 0 {
m.gmresults[tid] = make(map[string]float64)
}
}
}
for _, metric := range m.config.Metrics {
// Try to evaluate the global metric
_, err := agg.EvalFloat64Condition(metric.Calc, globalParams)
if err != nil {
cclog.ComponentError(m.name, "Calculation for metric", metric.Name, "failed:", err.Error())
continue
}
}
// If no event set could be added, shut down LikwidCollector
if len(m.groups) == 0 {
C.perfmon_finalize()
C.topology_finalize()
return errors.New("No LIKWID performance group initialized")
err := errors.New("no LIKWID performance group initialized")
cclog.ComponentError(m.name, err.Error())
return err
}
m.basefreq = getBaseFreq()
cclog.ComponentDebug(m.name, "BaseFreq", m.basefreq)
m.init = true
return nil
}
func (m *LikwidCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
// take a measurement for 'interval' seconds of event set index 'group'
func (m *LikwidCollector) takeMeasurement(group int, interval time.Duration) error {
var ret C.int
gid := m.groups[group]
ret = C.perfmon_setupCounters(gid)
if ret != 0 {
gctr := C.GoString(C.perfmon_getGroupName(gid))
err := fmt.Errorf("failed to setup performance group %d (%s)", gid, gctr)
return err
}
ret = C.perfmon_startCounters()
if ret != 0 {
gctr := C.GoString(C.perfmon_getGroupName(gid))
err := fmt.Errorf("failed to start performance group %d (%s)", gid, gctr)
return err
}
m.running = true
time.Sleep(interval)
m.running = false
ret = C.perfmon_stopCounters()
if ret != 0 {
gctr := C.GoString(C.perfmon_getGroupName(gid))
err := fmt.Errorf("failed to stop performance group %d (%s)", gid, gctr)
return err
}
return nil
}
// Get all measurement results for an event set, derive the metric values out of the measurement results and send it
func (m *LikwidCollector) calcEventsetMetrics(group int, interval time.Duration, output chan lp.CCMetric) error {
var eidx C.int
evset := m.config.Eventsets[group]
gid := m.groups[group]
invClock := float64(1.0 / m.basefreq)
// Go over events and get the results
for eidx = 0; int(eidx) < len(evset.Events); eidx++ {
ctr := C.perfmon_getCounterName(gid, eidx)
ev := C.perfmon_getEventName(gid, eidx)
gctr := C.GoString(ctr)
gev := C.GoString(ev)
// MetricScope for the counter (and if needed the event)
scope := getGranularity(gctr, gev)
// Get the map scope-id -> tids
// This way we read less counters like only the responsible hardware thread for a socket
scopemap := m.scopeRespTids[scope]
for _, tid := range scopemap {
if tid >= 0 {
m.results[group][tid]["time"] = interval.Seconds()
m.results[group][tid]["inverseClock"] = invClock
res := C.perfmon_getLastResult(gid, eidx, C.int(tid))
m.results[group][tid][gctr] = float64(res)
}
}
}
// Go over the event set metrics, derive the value out of the event:counter values and send it
for _, metric := range evset.Metrics {
// The metric scope is determined in the Init() function
// Get the map scope-id -> tids
scopemap := m.scopeRespTids[metric.Scope]
for domain, tid := range scopemap {
if tid >= 0 {
value, err := agg.EvalFloat64Condition(metric.Calc, m.results[group][tid])
if err != nil {
cclog.ComponentError(m.name, "Calculation for metric", metric.Name, "failed:", err.Error())
continue
}
m.mresults[group][tid][metric.Name] = value
if m.config.InvalidToZero && math.IsNaN(value) {
value = 0.0
}
if m.config.InvalidToZero && math.IsInf(value, 0) {
value = 0.0
}
// Now we have the result, send it with the proper tags
if !math.IsNaN(value) {
if metric.Publish {
tags := map[string]string{"type": metric.Scope.String()}
if metric.Scope != "node" {
tags["type-id"] = fmt.Sprintf("%d", domain)
}
fields := map[string]interface{}{"value": value}
y, err := lp.New(metric.Name, tags, m.meta, fields, time.Now())
if err == nil {
output <- y
}
}
}
}
}
}
return nil
}
// Go over the global metrics, derive the value out of the event sets' metric values and send it
func (m *LikwidCollector) calcGlobalMetrics(interval time.Duration, output chan lp.CCMetric) error {
for _, metric := range m.config.Metrics {
scopemap := m.scopeRespTids[metric.Scope]
for domain, tid := range scopemap {
if tid >= 0 {
// Here we generate parameter list
params := make(map[string]interface{})
for j := range m.groups {
for mname, mres := range m.mresults[j][tid] {
params[mname] = mres
}
}
// Evaluate the metric
value, err := agg.EvalFloat64Condition(metric.Calc, params)
if err != nil {
cclog.ComponentError(m.name, "Calculation for metric", metric.Name, "failed:", err.Error())
continue
}
m.gmresults[tid][metric.Name] = value
if m.config.InvalidToZero && math.IsNaN(value) {
value = 0.0
}
if m.config.InvalidToZero && math.IsInf(value, 0) {
value = 0.0
}
// Now we have the result, send it with the proper tags
if !math.IsNaN(value) {
if metric.Publish {
tags := map[string]string{"type": metric.Scope.String()}
if metric.Scope != "node" {
tags["type-id"] = fmt.Sprintf("%d", domain)
}
fields := map[string]interface{}{"value": value}
y, err := lp.New(metric.Name, tags, m.meta, fields, time.Now())
if err == nil {
output <- y
}
}
}
}
}
}
return nil
}
// main read function taking multiple measurement rounds, each 'interval' seconds long
func (m *LikwidCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
var ret C.int
for i, gid := range m.groups {
evset := m.config.Eventsets[i]
ret = C.perfmon_setupCounters(gid)
if ret != 0 {
log.Print("Failed to setup performance group ", C.perfmon_getGroupName(gid))
continue
}
ret = C.perfmon_startCounters()
if ret != 0 {
log.Print("Failed to start performance group ", C.perfmon_getGroupName(gid))
continue
}
time.Sleep(interval)
ret = C.perfmon_stopCounters()
if ret != 0 {
log.Print("Failed to stop performance group ", C.perfmon_getGroupName(gid))
continue
}
var eidx C.int
for tid := range m.cpulist {
for eidx = 0; int(eidx) < len(evset.Events); eidx++ {
ctr := C.perfmon_getCounterName(gid, eidx)
gctr := C.GoString(ctr)
res := C.perfmon_getLastResult(gid, eidx, C.int(tid))
m.results[i][tid][gctr] = float64(res)
}
m.results[i][tid]["time"] = interval.Seconds()
m.results[i][tid]["inverseClock"] = float64(1.0 / m.basefreq)
for _, metric := range evset.Metrics {
expression, err := govaluate.NewEvaluableExpression(metric.Calc)
if err != nil {
log.Print(err.Error())
continue
}
result, err := expression.Evaluate(m.results[i][tid])
if err != nil {
log.Print(err.Error())
continue
}
m.mresults[i][tid][metric.Name] = float64(result.(float64))
}
}
}
for _, metric := range m.config.Metrics {
for tid := range m.cpulist {
var params map[string]interface{}
expression, err := govaluate.NewEvaluableExpression(metric.Calc)
if err != nil {
log.Print(err.Error())
continue
}
params = make(map[string]interface{})
for j := range m.groups {
for mname, mres := range m.mresults[j][tid] {
params[mname] = mres
}
}
result, err := expression.Evaluate(params)
if err != nil {
log.Print(err.Error())
continue
}
m.gmresults[tid][metric.Name] = float64(result.(float64))
}
}
for i := range m.groups {
evset := m.config.Eventsets[i]
for _, metric := range evset.Metrics {
_, skip := stringArrayContains(m.config.ExcludeMetrics, metric.Name)
if metric.Publish && !skip {
if metric.Socket_scope {
for sid, tid := range m.sock2tid {
y, err := lp.New(metric.Name,
map[string]string{"type": "socket", "type-id": fmt.Sprintf("%d", int(sid))},
map[string]interface{}{"value": m.mresults[i][tid][metric.Name]},
time.Now())
if err == nil {
*out = append(*out, y)
}
}
} else {
for tid, cpu := range m.cpulist {
y, err := lp.New(metric.Name,
map[string]string{"type": "cpu", "type-id": fmt.Sprintf("%d", int(cpu))},
map[string]interface{}{"value": m.mresults[i][tid][metric.Name]},
time.Now())
if err == nil {
*out = append(*out, y)
}
}
}
}
}
}
for _, metric := range m.config.Metrics {
_, skip := stringArrayContains(m.config.ExcludeMetrics, metric.Name)
if metric.Publish && !skip {
if metric.Socket_scope {
for sid, tid := range m.sock2tid {
y, err := lp.New(metric.Name,
map[string]string{"type": "socket", "type-id": fmt.Sprintf("%d", int(sid))},
map[string]interface{}{"value": m.gmresults[tid][metric.Name]},
time.Now())
if err == nil {
*out = append(*out, y)
}
}
} else {
for tid, cpu := range m.cpulist {
y, err := lp.New(metric.Name,
map[string]string{"type": "cpu", "type-id": fmt.Sprintf("%d", int(cpu))},
map[string]interface{}{"value": m.gmresults[tid][metric.Name]},
time.Now())
if err == nil {
*out = append(*out, y)
}
}
}
// measure event set 'i' for 'interval' seconds
err := m.takeMeasurement(i, interval)
if err != nil {
cclog.ComponentError(m.name, err.Error())
return
}
// read measurements and derive event set metrics
m.calcEventsetMetrics(i, interval, output)
}
// use the event set metrics to derive the global metrics
m.calcGlobalMetrics(interval, output)
}
func (m *LikwidCollector) Close() {
if m.init {
cclog.ComponentDebug(m.name, "Closing ...")
m.init = false
if m.running {
cclog.ComponentDebug(m.name, "Stopping counters")
C.perfmon_stopCounters()
}
cclog.ComponentDebug(m.name, "Finalize LIKWID perfmon module")
C.perfmon_finalize()
cclog.ComponentDebug(m.name, "Finalize LIKWID topology module")
C.topology_finalize()
cclog.ComponentDebug(m.name, "Closing done")
}
}

148
collectors/likwidMetric.md Normal file

@@ -0,0 +1,148 @@
## `likwid` collector
The `likwid` collector is probably the most complicated collector. The LIKWID library is included as a static library with *direct* access mode. The *direct* access mode is suitable if the daemon is executed by a root user. The static library does not contain the performance groups, so all information needs to be provided in the configuration.
The `likwid` configuration consists of two parts, the "eventsets" and "globalmetrics":
- An event set list itself has two parts, the "events" and a set of derivable "metrics". Each of the "events" is a counter:event pair in LIKWID's syntax. The "metrics" are a list of formulas to derive the metric value from the measurements of the "events". Each metric has a name, the formula, a scope and a publish flag. Counter names can be used like variables in the formulas, so `PMC0+PMC1` sums the measurements of the two events configured in the counters `PMC0` and `PMC1`. The scope tells the collector whether it is a metric for each hardware thread (`cpu`) or each CPU socket (`socket`). The last one is the publish flag, which tells the collector whether a metric should be sent to the router.
- The global metrics are metrics which require data from all event set measurements to be derived. The inputs are the metrics in the event sets. Similar to the metrics in the event sets, the global metrics are defined by a name, a formula, a scope and a publish flag. See event set metrics for details. The only difference is that there is no access to the raw event measurements anymore but only to the metrics. So, the idea is to derive a metric in the "eventsets" section and reuse it in the "globalmetrics" part. If you need a metric only for deriving the global metrics, disable forwarding of the event set metrics. **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.
Additional options:
- `force_overwrite`: Same as setting `LIKWID_FORCE=1`. In case counters are already in use, LIKWID overwrites their configuration to do its measurements.
- `invalid_to_zero`: In some cases, the calculations result in `NaN` or `Inf`. With this option, all `NaN` and `Inf` values are replaced with `0.0`.
### Available metric scopes
Hardware performance counters are scattered all over the system nowadays. A counter covers a specific part of the system. While there are hardware-thread-specific counters for CPU cycles, instructions and so on, others are specific to a whole CPU socket/package. To address that, the collector lets you specify a 'scope' for each metric.
- `cpu` : One metric per CPU hardware thread with the tags `"type" : "cpu"` and `"type-id" : "$cpu_id"`
- `socket` : One metric per CPU socket/package with the tags `"type" : "socket"` and `"type-id" : "$socket_id"`
**Note:** You cannot specify `socket` scope for a metric that is measured at `cpu` scope, so some expert knowledge or lookup work in the [Likwid Wiki](https://github.com/RRZE-HPC/likwid/wiki) is required. Get the scope of each counter from the *Architecture* pages; as soon as one counter in a metric is socket-specific, the whole metric is socket-specific (see the example after the guideline list).
As a guideline:
- All counters `FIXCx`, `PMCy` and `TMAz` have the scope `cpu`
- All counters names containing `BOX` have the scope `socket`
- All `PWRx` counters have scope `socket`, except `"PWR1" : "RAPL_CORE_ENERGY"` has `cpu` scope
- All `DFCx` counters have scope `socket`
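For example, a metric that combines the hardware-thread counter `PMC0` with the DRAM channel counter `DFC0` from the example configuration below must be declared with `socket` scope, because `DFC0` only exists per socket (the metric name is illustrative):
```json
{
  "name": "dram_reads_per_instr",
  "calc": "DFC0/PMC0",
  "scope": "socket",
  "publish": true
}
```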
### Example configuration
```json
"likwid": {
"force_overwrite" : false,
"nan_to_zero" : false,
"eventsets": [
{
"events": {
"FIXC1": "ACTUAL_CPU_CLOCK",
"FIXC2": "MAX_CPU_CLOCK",
"PMC0": "RETIRED_INSTRUCTIONS",
"PMC1": "CPU_CLOCKS_UNHALTED",
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
"PMC3": "MERGE",
"DFC0": "DRAM_CHANNEL_0",
"DFC1": "DRAM_CHANNEL_1",
"DFC2": "DRAM_CHANNEL_2",
"DFC3": "DRAM_CHANNEL_3"
},
"metrics": [
{
"name": "ipc",
"calc": "PMC0/PMC1",
"scope": "cpu",
"publish": true
},
{
"name": "flops_any",
"calc": "0.000001*PMC2/time",
"scope": "cpu",
"publish": true
},
{
"name": "clock_mhz",
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
"scope": "cpu",
"publish": true
},
{
"name": "mem1",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"scope": "socket",
"publish": false
}
]
},
{
"events": {
"DFC0": "DRAM_CHANNEL_4",
"DFC1": "DRAM_CHANNEL_5",
"DFC2": "DRAM_CHANNEL_6",
"DFC3": "DRAM_CHANNEL_7",
"PWR0": "RAPL_CORE_ENERGY",
"PWR1": "RAPL_PKG_ENERGY"
},
"metrics": [
{
"name": "pwr_core",
"calc": "PWR0/time",
"scope": "socket",
"publish": true
},
{
"name": "pwr_pkg",
"calc": "PWR1/time",
"scope": "socket",
"publish": true
},
{
"name": "mem2",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"scope": "socket",
"publish": false
}
]
}
],
"globalmetrics": [
{
"name": "mem_bw",
"calc": "mem1+mem2",
"scope": "socket",
"publish": true
}
]
}
```
### How to get the eventsets and metrics from LIKWID
The `likwid` collector reads hardware performance counters at a **cpu** and **socket** level. The configuration looks quite complicated, but it is basically copy&paste from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). The collector went through multiple iterations that tried to use the performance groups directly, but that approach lacked flexibility. The current way of configuring it provides the most flexibility.
The logic is as follows: there are multiple event sets, each consisting of a list of counters+events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
```
EVENTSET -> "events": {
FIXC1 ACTUAL_CPU_CLOCK -> "FIXC1": "ACTUAL_CPU_CLOCK",
FIXC2 MAX_CPU_CLOCK -> "FIXC2": "MAX_CPU_CLOCK",
PMC0 RETIRED_INSTRUCTIONS -> "PMC0" : "RETIRED_INSTRUCTIONS",
PMC1 CPU_CLOCKS_UNHALTED -> "PMC1" : "CPU_CLOCKS_UNHALTED",
PMC2 RETIRED_SSE_AVX_FLOPS_ALL -> "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
PMC3 MERGE -> "PMC3": "MERGE",
-> }
```
The metrics follow the same procedure:
```
METRICS -> "metrics": [
IPC PMC0/PMC1 -> {
-> "name" : "IPC",
-> "calc" : "PMC0/PMC1",
-> "scope": "cpu",
-> "publish": true
-> }
-> ]
```


@@ -2,29 +2,39 @@ package collectors
import (
"encoding/json"
"fmt"
"io/ioutil"
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const LOADAVGFILE = `/proc/loadavg`
type LoadavgCollectorConfig struct {
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
}
//
// LoadavgCollector collects:
// * load average of last 1, 5 & 15 minutes
// * number of processes currently runnable
// * total number of processes in system
//
// See: https://www.kernel.org/doc/html/latest/filesystems/proc.html
//
const LOADAVGFILE = "/proc/loadavg"
type LoadavgCollector struct {
MetricCollector
metricCollector
tags map[string]string
load_matches []string
load_skips []bool
proc_matches []string
config LoadavgCollectorConfig
proc_skips []bool
config struct {
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
}
}
func (m *LoadavgCollector) Init(config []byte) error {
func (m *LoadavgCollector) Init(config json.RawMessage) error {
m.name = "LoadavgCollector"
m.setup()
if len(config) > 0 {
@@ -33,45 +43,82 @@ func (m *LoadavgCollector) Init(config []byte) error {
return err
}
}
m.meta = map[string]string{
"source": m.name,
"group": "LOAD"}
m.tags = map[string]string{"type": "node"}
m.load_matches = []string{"load_one", "load_five", "load_fifteen"}
m.proc_matches = []string{"proc_run", "proc_total"}
m.load_matches = []string{
"load_one",
"load_five",
"load_fifteen"}
m.load_skips = make([]bool, len(m.load_matches))
m.proc_matches = []string{
"proc_run",
"proc_total"}
m.proc_skips = make([]bool, len(m.proc_matches))
for i, name := range m.load_matches {
_, m.load_skips[i] = stringArrayContains(m.config.ExcludeMetrics, name)
}
for i, name := range m.proc_matches {
_, m.proc_skips[i] = stringArrayContains(m.config.ExcludeMetrics, name)
}
m.init = true
return nil
}
func (m *LoadavgCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
var skip bool
func (m *LoadavgCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
buffer, err := ioutil.ReadFile(string(LOADAVGFILE))
buffer, err := ioutil.ReadFile(LOADAVGFILE)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to read file '%s': %v", LOADAVGFILE, err))
return
}
now := time.Now()
// Load metrics
ls := strings.Split(string(buffer), ` `)
for i, name := range m.load_matches {
x, err := strconv.ParseFloat(ls[i], 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert '%s' to float64: %v", ls[i], err))
continue
}
if m.load_skips[i] {
continue
}
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": x}, now)
if err == nil {
_, skip = stringArrayContains(m.config.ExcludeMetrics, name)
y, err := lp.New(name, m.tags, map[string]interface{}{"value": float64(x)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
}
output <- y
}
}
// Process metrics
lv := strings.Split(ls[3], `/`)
for i, name := range m.proc_matches {
x, err := strconv.ParseFloat(lv[i], 64)
if err == nil {
_, skip = stringArrayContains(m.config.ExcludeMetrics, name)
y, err := lp.New(name, m.tags, map[string]interface{}{"value": float64(x)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
}
x, err := strconv.ParseInt(lv[i], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert '%s' to float64: %v", lv[i], err))
continue
}
if m.proc_skips[i] {
continue
}
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": x}, now)
if err == nil {
output <- y
}
}
}


@@ -0,0 +1,19 @@
## `loadavg` collector
```json
"loadavg": {
"exclude_metrics": [
"proc_run"
]
}
```
The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink. A sample input line is shown after the metric list.
Metrics:
* `load_one`
* `load_five`
* `load_fifteen`
* `proc_run`
* `proc_total`
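A sample `/proc/loadavg` line:
```
0.52 0.45 0.39 2/1337 42424
```
The first three fields map to `load_one`, `load_five` and `load_fifteen`; the fourth field splits at `/` into `proc_run` (`2`) and `proc_total` (`1337`); the trailing last-PID field is ignored by the collector.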


@@ -3,31 +3,84 @@ package collectors
import (
"encoding/json"
"errors"
"io/ioutil"
"log"
"fmt"
"os/exec"
"os/user"
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const LUSTREFILE = `/proc/fs/lustre/llite/lnec-XXXXXX/stats`
const LUSTRE_SYSFS = `/sys/fs/lustre`
const LCTL_CMD = `lctl`
const LCTL_OPTION = `get_param`
type LustreCollectorConfig struct {
Procfiles []string `json:"procfiles"`
LCtlCommand string `json:"lctl_command"`
ExcludeMetrics []string `json:"exclude_metrics"`
SendAllMetrics bool `json:"send_all_metrics"`
}
type LustreCollector struct {
MetricCollector
metricCollector
tags map[string]string
matches map[string]map[string]int
devices []string
stats map[string]map[string]int64
config LustreCollectorConfig
lctl string
}
func (m *LustreCollector) Init(config []byte) error {
func (m *LustreCollector) getDeviceDataCommand(device string) []string {
statsfile := fmt.Sprintf("llite.%s.stats", device)
command := exec.Command(m.lctl, LCTL_OPTION, statsfile)
command.Wait()
stdout, _ := command.Output()
return strings.Split(string(stdout), "\n")
}
func (m *LustreCollector) getDevices() []string {
devices := make([]string, 0)
// //Version reading devices from sysfs
// globPattern := filepath.Join(LUSTRE_SYSFS, "llite/*/stats")
// files, err := filepath.Glob(globPattern)
// if err != nil {
// return devices
// }
// for _, f := range files {
// pathlist := strings.Split(f, "/")
// devices = append(devices, pathlist[4])
// }
data := m.getDeviceDataCommand("*")
for _, line := range data {
if strings.HasPrefix(line, "llite") {
linefields := strings.Split(line, ".")
if len(linefields) > 2 {
devices = append(devices, linefields[1])
}
}
}
return devices
}
// //Version reading the stats data of a device from sysfs
// func (m *LustreCollector) getDeviceDataSysfs(device string) []string {
// llitedir := filepath.Join(LUSTRE_SYSFS, "llite")
// devdir := filepath.Join(llitedir, device)
// statsfile := filepath.Join(devdir, "stats")
// buffer, err := ioutil.ReadFile(statsfile)
// if err != nil {
// return make([]string, 0)
// }
// return strings.Split(string(buffer), "\n")
// }
func (m *LustreCollector) Init(config json.RawMessage) error {
var err error
m.name = "LustreCollector"
if len(config) > 0 {
@@ -38,66 +91,120 @@ func (m *LustreCollector) Init(config []byte) error {
}
m.setup()
m.tags = map[string]string{"type": "node"}
m.meta = map[string]string{"source": m.name, "group": "Lustre"}
defmatches := map[string]map[string]int{
"read_bytes": {"lustre_read_bytes": 6, "lustre_read_requests": 1},
"write_bytes": {"lustre_write_bytes": 6, "lustre_write_requests": 1},
"open": {"lustre_open": 1},
"close": {"lustre_close": 1},
"setattr": {"lustre_setattr": 1},
"getattr": {"lustre_getattr": 1},
"statfs": {"lustre_statfs": 1},
"inode_permission": {"lustre_inode_permission": 1}}
// Lustre file system statistics can only be queried by user root
user, err := user.Current()
if err != nil {
cclog.ComponentError(m.name, "Failed to get current user:", err.Error())
return err
}
if user.Uid != "0" {
cclog.ComponentError(m.name, "Lustre file system statistics can only be queried by user root:", err.Error())
return err
}
m.matches = make(map[string]map[string]int)
for lineprefix, names := range defmatches {
for metricname, offset := range names {
_, skip := stringArrayContains(m.config.ExcludeMetrics, metricname)
if skip {
continue
}
if _, prefixExist := m.matches[lineprefix]; !prefixExist {
m.matches[lineprefix] = make(map[string]int)
}
if _, metricExist := m.matches[lineprefix][metricname]; !metricExist {
m.matches[lineprefix][metricname] = offset
}
}
}
p, err := exec.LookPath(m.config.LCtlCommand)
if err != nil {
p, err = exec.LookPath(LCTL_CMD)
if err != nil {
return err
}
}
m.lctl = p
devices := m.getDevices()
if len(devices) == 0 {
return errors.New("no metrics to collect")
}
m.stats = make(map[string]map[string]int64)
for _, d := range devices {
m.stats[d] = make(map[string]int64)
for _, names := range m.matches {
for metricname := range names {
m.stats[d][metricname] = 0
}
}
}
m.init = true
return nil
}
func (m *LustreCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
for device, devData := range m.stats {
stats := m.getDeviceDataCommand(device)
processed := []string{}
for _, line := range stats {
lf := strings.Fields(line)
if len(lf) > 1 {
if fields, ok := m.matches[lf[0]]; ok {
for name, idx := range fields {
x, err := strconv.ParseInt(lf[idx], 0, 64)
if err != nil {
continue
}
value := x - devData[name]
devData[name] = x
if value < 0 {
value = 0
}
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": value}, time.Now())
if err == nil {
y.AddTag("device", device)
if strings.Contains(name, "byte") {
y.AddMeta("unit", "Byte")
}
output <- y
if m.config.SendAllMetrics {
processed = append(processed, name)
}
}
}
}
}
}
if m.config.SendAllMetrics {
for name := range devData {
if _, done := stringArrayContains(processed, name); !done {
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": 0}, time.Now())
if err == nil {
y.AddTag("device", device)
if strings.Contains(name, "byte") {
y.AddMeta("unit", "Byte")
}
output <- y
}
}
}
}
}
}


@@ -0,0 +1,29 @@
## `lustrestat` collector
```json
"lustrestat": {
"procfiles" : [
"/proc/fs/lustre/llite/lnec-XXXXXX/stats"
],
"exclude_metrics": [
"setattr",
"getattr"
]
}
```
The `lustrestat` collector queries Lustre client statistics via `lctl get_param llite.*.stats`. Since these statistics can only be read by user root, the collector initializes only when run as root.
Metrics:
* `lustre_read_bytes`
* `lustre_read_requests`
* `lustre_write_bytes`
* `lustre_write_requests`
* `lustre_open`
* `lustre_close`
* `lustre_getattr`
* `lustre_setattr`
* `lustre_statfs`
* `lustre_inode_permission`
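For reference, a shortened, illustrative `lctl get_param llite.*.stats` output as parsed by the collector; per matched line, field 1 holds the request count and, for the `*_bytes` lines, field 6 holds the byte sum:

```
llite.lnec-XXXXXX.stats=
read_bytes          126 samples [bytes] 4096 1048576 134217728
write_bytes          53 samples [bytes] 4096 1048576 67108864
open                 12 samples [regs]
close                12 samples [regs]
```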


@@ -10,7 +10,7 @@ import (
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const MEMSTATFILE = `/proc/meminfo`
@@ -20,14 +20,14 @@ type MemstatCollectorConfig struct {
}
type MemstatCollector struct {
metricCollector
stats map[string]int64
tags map[string]string
matches map[string]string
config MemstatCollectorConfig
}
func (m *MemstatCollector) Init(config json.RawMessage) error {
var err error
m.name = "MemstatCollector"
if len(config) > 0 {
@@ -36,6 +36,7 @@ func (m *MemstatCollector) Init(config []byte) error {
return err
}
}
m.meta = map[string]string{"source": m.name, "group": "Memory", "unit": "kByte"}
m.stats = make(map[string]int64)
m.matches = make(map[string]string)
m.tags = map[string]string{"type": "node"}
@@ -65,7 +66,7 @@ func (m *MemstatCollector) Init(config []byte) error {
return err
}
func (m *MemstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -93,13 +94,13 @@ func (m *MemstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
for match, name := range m.matches {
if _, exists := m.stats[match]; !exists {
err = fmt.Errorf("Parse error for %s : %s", match, name)
log.Print(err)
continue
}
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": int(float64(m.stats[match]) * 1.0e-3)}, time.Now())
if err == nil {
output <- y
}
}
@@ -108,18 +109,18 @@ func (m *MemstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
if _, cached := m.stats[`Cached`]; cached {
memUsed := m.stats[`MemTotal`] - (m.stats[`MemFree`] + m.stats[`Buffers`] + m.stats[`Cached`])
_, skip := stringArrayContains(m.config.ExcludeMetrics, "mem_used")
y, err := lp.New("mem_used", m.tags, map[string]interface{}{"value": int(float64(memUsed) * 1.0e-3)}, time.Now())
y, err := lp.New("mem_used", m.tags, m.meta, map[string]interface{}{"value": int(float64(memUsed) * 1.0e-3)}, time.Now())
if err == nil && !skip {
output <- y
}
}
}
}
if _, found := m.stats[`MemShared`]; found {
_, skip := stringArrayContains(m.config.ExcludeMetrics, "mem_shared")
y, err := lp.New("mem_shared", m.tags, map[string]interface{}{"value": int(float64(m.stats[`MemShared`]) * 1.0e-3)}, time.Now())
y, err := lp.New("mem_shared", m.tags, m.meta, map[string]interface{}{"value": int(float64(m.stats[`MemShared`]) * 1.0e-3)}, time.Now())
if err == nil && !skip {
output <- y
}
}
}


@@ -0,0 +1,27 @@
## `memstat` collector
```json
"memstat": {
"exclude_metrics": [
"mem_used"
]
}
```
The `memstat` collector reads data from `/proc/meminfo` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
Metrics:
* `mem_total`
* `mem_sreclaimable`
* `mem_slab`
* `mem_free`
* `mem_buffers`
* `mem_cached`
* `mem_available`
* `mem_shared`
* `swap_total`
* `swap_free`
* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)
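As a worked example with illustrative values (all in kByte, as read from `/proc/meminfo`): with `MemTotal=65536000`, `MemFree=2048000`, `Buffers=512000` and `Cached=8192000`, the collector reports `mem_used = 65536000 - (2048000 + 512000 + 8192000) = 54784000`.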


@@ -1,40 +1,48 @@
package collectors
import (
"errors"
lp "github.com/influxdata/line-protocol"
"encoding/json"
"fmt"
"io/ioutil"
"log"
"strconv"
"strings"
"time"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
type MetricCollector interface {
Name() string
Init(config json.RawMessage) error
Initialized() bool
Read(duration time.Duration, output chan lp.CCMetric)
Close()
}
type metricCollector struct {
name string
init bool
meta map[string]string
}
// Name() returns the name of the metric collector
func (c *metricCollector) Name() string {
return c.name
}
func (c *metricCollector) setup() error {
return nil
}
// Initialized() indicates whether the metric collector has been initialized.
func (c *metricCollector) Initialized() bool {
return c.init
}
// intArrayContains scans an array of ints for the given value.
// If the value is found, the corresponding array index is returned.
// The bool value signals success or failure.
func intArrayContains(array []int, str int) (int, bool) {
for i, a := range array {
if a == str {
@@ -44,6 +52,9 @@ func intArrayContains(array []int, str int) (int, bool) {
return -1, false
}
// stringArrayContains scans an array of strings for the given value str.
// If the value is found, the corresponding array index is returned.
// The bool value signals success or failure.
func stringArrayContains(array []string, str string) (int, bool) {
for i, a := range array {
if a == str {
@@ -103,27 +114,13 @@ func CpuList() []int {
return cpulist
}
// RemoveFromStringList removes the string r from the array of strings s
// If r is not contained in the array an error is returned
func RemoveFromStringList(s []string, r string) ([]string, error) {
for i, item := range s {
if r == item {
return append(s[:i], s[i+1:]...), nil
}
}
return s, fmt.Errorf("No such string in list")
}
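// A minimal sketch (editor's illustration, not part of the repository) of a new
// collector: embed metricCollector and implement Init, Read and Close; Name()
// and Initialized() are inherited. DummyCollector and its metric are made up.
type DummyCollector struct {
metricCollector
}
func (m *DummyCollector) Init(config json.RawMessage) error {
m.name = "DummyCollector"
m.meta = map[string]string{"source": m.name, "group": "Dummy"}
m.init = true
return nil
}
func (m *DummyCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
// Emit a single constant node-level metric
y, err := lp.New("dummy", map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": 42}, time.Now())
if err == nil {
output <- y
}
}
func (m *DummyCollector) Close() {
m.init = false
}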


@@ -1,86 +1,138 @@
package collectors
import (
"bufio"
"encoding/json"
"io/ioutil"
"log"
"errors"
"os"
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const NETSTATFILE = `/proc/net/dev`
type NetstatCollectorConfig struct {
ExcludeDevices []string `json:"exclude_devices"`
IncludeDevices []string `json:"include_devices"`
}
type NetstatCollectorMetric struct {
index int
lastValue float64
}
type NetstatCollector struct {
metricCollector
config NetstatCollectorConfig
matches map[string]map[string]NetstatCollectorMetric
devtags map[string]map[string]string
lastTimestamp time.Time
}
func (m *NetstatCollector) Init(config json.RawMessage) error {
m.name = "NetstatCollector"
m.setup()
m.lastTimestamp = time.Now()
m.meta = map[string]string{"source": m.name, "group": "Network"}
m.devtags = make(map[string]map[string]string)
nameIndexMap := map[string]int{
"net_bytes_in": 1,
"net_pkts_in": 2,
"net_bytes_out": 9,
"net_pkts_out": 10,
}
m.matches = make(map[string]map[string]NetstatCollectorMetric)
if len(config) > 0 {
err := json.Unmarshal(config, &m.config)
if err != nil {
cclog.ComponentError(m.name, "Error reading config:", err.Error())
return err
}
}
file, err := os.Open(string(NETSTATFILE))
if err != nil {
cclog.ComponentError(m.name, err.Error())
return err
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
l := scanner.Text()
if !strings.Contains(l, ":") {
continue
}
f := strings.Fields(l)
dev := strings.Trim(f[0], ": ")
if _, ok := stringArrayContains(m.config.IncludeDevices, dev); ok {
m.matches[dev] = make(map[string]NetstatCollectorMetric)
for name, idx := range nameIndexMap {
m.matches[dev][name] = NetstatCollectorMetric{
index: idx,
lastValue: 0,
}
}
m.devtags[dev] = map[string]string{"device": dev, "type": "node"}
}
}
if len(m.devtags) == 0 {
return errors.New("no devices to collector metrics found")
}
m.init = true
return nil
}
func (m *NetstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
now := time.Now()
file, err := os.Open(string(NETSTATFILE))
if err != nil {
cclog.ComponentError(m.name, err.Error())
return
}
defer file.Close()
tdiff := now.Sub(m.lastTimestamp)
scanner := bufio.NewScanner(file)
for scanner.Scan() {
l := scanner.Text()
if !strings.Contains(l, ":") {
continue
}
f := strings.Fields(l)
dev := strings.Trim(f[0], ":")
if devmetrics, ok := m.matches[dev]; ok {
for name, data := range devmetrics {
v, err := strconv.ParseFloat(f[data.index], 64)
if err == nil {
vdiff := v - data.lastValue
value := vdiff / tdiff.Seconds()
if data.lastValue == 0 {
value = 0
}
data.lastValue = v
y, err := lp.New(name, m.devtags[dev], m.meta, map[string]interface{}{"value": value}, now)
if err == nil {
switch {
case strings.Contains(name, "byte"):
y.AddMeta("unit", "bytes/sec")
case strings.Contains(name, "pkt"):
y.AddMeta("unit", "packets/sec")
}
output <- y
}
devmetrics[name] = data
}
}
}
}
m.lastTimestamp = time.Now()
}
func (m *NetstatCollector) Close() {


@@ -0,0 +1,21 @@
## `netstat` collector
```json
"netstat": {
"include_devices": [
"eth0"
]
}
```
The `netstat` collector reads data from `/proc/net/dev` and outputs a handful of **node** metrics. The `include_devices` list specifies which network devices are measured. **Note**: Most other collectors use an _exclude_ list instead of an include list.
Metrics:
* `net_bytes_in` (`unit=bytes/sec`)
* `net_bytes_out` (`unit=bytes/sec`)
* `net_pkts_in` (`unit=packets/sec`)
* `net_pkts_out` (`unit=packets/sec`)
The device name is added as tag `device`.
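The metrics are reported as rates: for two readings `v_now` and `v_last` of a `/proc/net/dev` counter taken `t` seconds apart, the collector emits `(v_now - v_last) / t`. For example, 1500000 bytes received over a 10 s interval yield `net_bytes_in = 150000` (bytes/sec). The first reading after startup reports `0` because no previous value exists.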

collectors/nfs3Metric.md

@@ -0,0 +1,39 @@
## `nfs3stat` collector
```json
"nfs3stat": {
"nfsstat" : "/path/to/nfsstat",
"exclude_metrics": [
"nfs3_total"
]
}
```
The `nfs3stat` collector reads data from the `nfsstat` command and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink. It is currently not possible to get the metrics per mount point.
Metrics:
* `nfs3_total`
* `nfs3_null`
* `nfs3_getattr`
* `nfs3_setattr`
* `nfs3_lookup`
* `nfs3_access`
* `nfs3_readlink`
* `nfs3_read`
* `nfs3_write`
* `nfs3_create`
* `nfs3_mkdir`
* `nfs3_symlink`
* `nfs3_remove`
* `nfs3_rmdir`
* `nfs3_rename`
* `nfs3_link`
* `nfs3_readdir`
* `nfs3_readdirplus`
* `nfs3_fsstat`
* `nfs3_fsinfo`
* `nfs3_pathconf`
* `nfs3_commit`
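For reference, the collector parses the list output of `nfsstat -l`; a shortened, illustrative sample of the five-column lines it matches:

```
nfs v3 client        total:      418
nfs v3 client         read:      121
nfs v3 client        write:       75
```

The second column selects the NFS version, the fourth the metric name, and the fifth the counter value.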

collectors/nfs4Metric.md

@@ -0,0 +1,62 @@
## `nfs4stat` collector
```json
"nfs4stat": {
"nfsstat" : "/path/to/nfsstat",
"exclude_metrics": [
"nfs4_total"
]
}
```
The `nfs4stat` collector reads data from the `nfsstat` command and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink. It is currently not possible to get the metrics per mount point.
Metrics:
* `nfs4_total`
* `nfs4_null`
* `nfs4_read`
* `nfs4_write`
* `nfs4_commit`
* `nfs4_open`
* `nfs4_open_conf`
* `nfs4_open_noat`
* `nfs4_open_dgrd`
* `nfs4_close`
* `nfs4_setattr`
* `nfs4_fsinfo`
* `nfs4_renew`
* `nfs4_setclntid`
* `nfs4_confirm`
* `nfs4_lock`
* `nfs4_lockt`
* `nfs4_locku`
* `nfs4_access`
* `nfs4_getattr`
* `nfs4_lookup`
* `nfs4_lookup_root`
* `nfs4_remove`
* `nfs4_rename`
* `nfs4_link`
* `nfs4_symlink`
* `nfs4_create`
* `nfs4_pathconf`
* `nfs4_statfs`
* `nfs4_readlink`
* `nfs4_readdir`
* `nfs4_server_caps`
* `nfs4_delegreturn`
* `nfs4_getacl`
* `nfs4_setacl`
* `nfs4_rel_lkowner`
* `nfs4_exchange_id`
* `nfs4_create_session`
* `nfs4_destroy_session`
* `nfs4_sequence`
* `nfs4_get_lease_time`
* `nfs4_reclaim_comp`
* `nfs4_secinfo_no`
* `nfs4_bind_conn_to_ses`

collectors/nfsMetric.go

@@ -0,0 +1,174 @@
package collectors
import (
"encoding/json"
"fmt"
"log"
// "os"
"os/exec"
"strconv"
"strings"
"time"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
// The first part contains the code for the general nfsCollector.
// Below, the general nfsCollector is specialized into Nfs3Collector and Nfs4Collector.
const NFSSTAT_EXEC = `nfsstat`
type NfsCollectorData struct {
current int64
last int64
}
type nfsCollector struct {
metricCollector
tags map[string]string
version string
config struct {
Nfsstats string `json:"nfsstat"`
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
}
data map[string]NfsCollectorData
}
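// initStats runs `nfsstat -l` once and records the current counter values.
// Assumed line format (five whitespace-separated fields): "nfs <version> client <name>: <value>".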
func (m *nfsCollector) initStats() error {
cmd := exec.Command(m.config.Nfsstats, `-l`)
buffer, err := cmd.Output()
if err == nil {
for _, line := range strings.Split(string(buffer), "\n") {
lf := strings.Fields(line)
if len(lf) != 5 {
continue
}
if lf[1] == m.version {
name := strings.Trim(lf[3], ":")
if _, exist := m.data[name]; !exist {
value, err := strconv.ParseInt(lf[4], 0, 64)
if err == nil {
x := m.data[name]
x.current = value
x.last = 0
m.data[name] = x
}
}
}
}
}
return err
}
func (m *nfsCollector) updateStats() error {
cmd := exec.Command(m.config.Nfsstats, `-l`)
buffer, err := cmd.Output()
if err == nil {
for _, line := range strings.Split(string(buffer), "\n") {
lf := strings.Fields(line)
if len(lf) != 5 {
continue
}
if lf[1] == m.version {
name := strings.Trim(lf[3], ":")
if _, exist := m.data[name]; exist {
value, err := strconv.ParseInt(lf[4], 0, 64)
if err == nil {
x := m.data[name]
x.last = x.current
x.current = value
m.data[name] = x
}
}
}
}
}
return err
}
func (m *nfsCollector) MainInit(config json.RawMessage) error {
m.config.Nfsstats = string(NFSSTAT_EXEC)
// Read JSON configuration
if len(config) > 0 {
err := json.Unmarshal(config, &m.config)
if err != nil {
log.Print(err.Error())
return err
}
}
m.meta = map[string]string{
"source": m.name,
"group": "NFS",
}
m.tags = map[string]string{
"type": "node",
}
// Check if nfsstat is in executable search path
_, err := exec.LookPath(m.config.Nfsstats)
if err != nil {
return fmt.Errorf("NfsCollector.Init(): Failed to find nfsstat binary '%s': %v", m.config.Nfsstats, err)
}
m.data = make(map[string]NfsCollectorData)
m.initStats()
m.init = true
return nil
}
func (m *nfsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
timestamp := time.Now()
m.updateStats()
prefix := ""
switch m.version {
case "v3":
prefix = "nfs3"
case "v4":
prefix = "nfs4"
default:
prefix = "nfs"
}
for name, data := range m.data {
if _, skip := stringArrayContains(m.config.ExcludeMetrics, name); skip {
continue
}
value := data.current - data.last
y, err := lp.New(fmt.Sprintf("%s_%s", prefix, name), m.tags, m.meta, map[string]interface{}{"value": value}, timestamp)
if err == nil {
y.AddMeta("version", m.version)
output <- y
}
}
}
func (m *nfsCollector) Close() {
m.init = false
}
type Nfs3Collector struct {
nfsCollector
}
type Nfs4Collector struct {
nfsCollector
}
func (m *Nfs3Collector) Init(config json.RawMessage) error {
m.name = "Nfs3Collector"
m.version = `v3`
m.setup()
return m.MainInit(config)
}
func (m *Nfs4Collector) Init(config json.RawMessage) error {
m.name = "Nfs4Collector"
m.version = `v4`
m.setup()
return m.MainInit(config)
}


@@ -2,15 +2,16 @@ package collectors
import (
"bufio"
"encoding/json"
"fmt"
"log"
"os"
"path/filepath"
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
//
@@ -42,11 +43,11 @@ type NUMAStatsCollectorTopolgy struct {
}
type NUMAStatsCollector struct {
metricCollector
topology []NUMAStatsCollectorTopolgy
}
func (m *NUMAStatsCollector) Init(config json.RawMessage) error {
// Check if already initialized
if m.init {
return nil
@@ -54,25 +55,29 @@ func (m *NUMAStatsCollector) Init(config []byte) error {
m.name = "NUMAStatsCollector"
m.setup()
m.meta = map[string]string{
"source": m.name,
"group": "NUMA",
}
// Loop for all NUMA node directories
baseDir := "/sys/devices/system/node"
globPattern := filepath.Join(baseDir, "node[0-9]*")
base := "/sys/devices/system/node/node"
globPattern := base + "[0-9]*"
dirs, err := filepath.Glob(globPattern)
if err != nil {
return fmt.Errorf("unable to glob files with pattern %s", globPattern)
return fmt.Errorf("unable to glob files with pattern '%s'", globPattern)
}
if dirs == nil {
return fmt.Errorf("unable to find any files with pattern %s", globPattern)
return fmt.Errorf("unable to find any files with pattern '%s'", globPattern)
}
m.topology = make([]NUMAStatsCollectorTopolgy, 0, len(dirs))
for _, dir := range dirs {
node := strings.TrimPrefix(dir, "/sys/devices/system/node/node")
node := strings.TrimPrefix(dir, base)
file := filepath.Join(dir, "numastat")
m.topology = append(m.topology,
NUMAStatsCollectorTopolgy{
file: file,
tagSet: map[string]string{"domain": node},
tagSet: map[string]string{"memoryDomain": node},
})
}
@@ -80,7 +85,7 @@ func (m *NUMAStatsCollector) Init(config []byte) error {
return nil
}
func (m *NUMAStatsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -92,9 +97,14 @@ func (m *NUMAStatsCollector) Read(interval time.Duration, out *[]lp.MutableMetri
now := time.Now()
file, err := os.Open(t.file)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to open file '%s': %v", t.file, err))
return
}
scanner := bufio.NewScanner(file)
// Read line by line
for scanner.Scan() {
split := strings.Fields(scanner.Text())
if len(split) != 2 {
@@ -103,12 +113,20 @@ func (m *NUMAStatsCollector) Read(interval time.Duration, out *[]lp.MutableMetri
key := split[0]
value, err := strconv.ParseInt(split[1], 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert %s='%s' to int64: %v", key, split[1], err))
continue
}
y, err := lp.New("numastats_"+key, t.tagSet, map[string]interface{}{"value": value}, now)
y, err := lp.New(
"numastats_"+key,
t.tagSet,
m.meta,
map[string]interface{}{"value": value},
now,
)
if err == nil {
output <- y
}
}


@@ -0,0 +1,15 @@
## `numastat` collector
```json
"numastat": {}
```
The `numastat` collector reads data from `/sys/devices/system/node/node*/numastat` and outputs a handful of **memoryDomain** metrics. See: https://www.kernel.org/doc/html/latest/admin-guide/numastat.html
Metrics:
* `numastats_numa_hit`: A process wanted to allocate memory from this node, and succeeded.
* `numastats_numa_miss`: A process wanted to allocate memory from another node, but ended up with memory from this node.
* `numastats_numa_foreign`: A process wanted to allocate on this node, but ended up with memory from another node.
* `numastats_local_node`: A process ran on this node's CPU, and got memory from this node.
* `numastats_other_node`: A process ran on a different node's CPU, and got memory from this node.
* `numastats_interleave_hit`: Interleaving wanted to allocate from this node and succeeded.
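For illustration, a `numastat` file contains one `<key> <value>` pair per line, e.g.:

```
numa_hit 98218976
numa_miss 0
numa_foreign 0
interleave_hit 58407
local_node 98175286
other_node 43690
```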


@@ -7,19 +7,28 @@ import (
"log"
"time"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
"github.com/NVIDIA/go-nvml/pkg/nvml"
lp "github.com/influxdata/line-protocol"
)
type NvidiaCollectorConfig struct {
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
ExcludeDevices []string `json:"exclude_devices,omitempty"`
AddPciInfoTag bool `json:"add_pci_info_tag,omitempty"`
}
type NvidiaCollectorDevice struct {
device nvml.Device
excludeMetrics map[string]bool
tags map[string]string
}
type NvidiaCollector struct {
metricCollector
num_gpus int
config NvidiaCollectorConfig
gpus []NvidiaCollectorDevice
}
func (m *NvidiaCollector) CatchPanic() {
@@ -29,9 +38,10 @@ func (m *NvidiaCollector) CatchPanic() {
}
}
func (m *NvidiaCollector) Init(config json.RawMessage) error {
var err error
m.name = "NvidiaCollector"
m.config.AddPciInfoTag = false
m.setup()
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
@@ -39,224 +49,415 @@ func (m *NvidiaCollector) Init(config []byte) error {
return err
}
}
m.meta = map[string]string{
"source": m.name,
"group": "Nvidia",
}
m.num_gpus = 0
defer m.CatchPanic()
// Initialize NVIDIA Management Library (NVML)
ret := nvml.Init()
if ret != nvml.SUCCESS {
err = errors.New(nvml.ErrorString(ret))
cclog.ComponentError(m.name, "Unable to initialize NVML", err.Error())
return err
}
// Number of NVIDIA GPUs
num_gpus, ret := nvml.DeviceGetCount()
if ret != nvml.SUCCESS {
err = errors.New(nvml.ErrorString(ret))
cclog.ComponentError(m.name, "Unable to get device count", err.Error())
return err
}
// For all GPUs
m.gpus = make([]NvidiaCollectorDevice, num_gpus)
for i := 0; i < num_gpus; i++ {
g := &m.gpus[i]
// Skip excluded devices
str_i := fmt.Sprintf("%d", i)
if _, skip := stringArrayContains(m.config.ExcludeDevices, str_i); skip {
continue
}
// Get device handle
device, ret := nvml.DeviceGetHandleByIndex(i)
if ret != nvml.SUCCESS {
err = errors.New(nvml.ErrorString(ret))
cclog.ComponentError(m.name, "Unable to get device at index", i, ":", err.Error())
return err
}
g.device = device
// Add tags
g.tags = map[string]string{
"type": "accelerator",
"type-id": str_i,
}
// Add excluded metrics
g.excludeMetrics = map[string]bool{}
for _, e := range m.config.ExcludeMetrics {
g.excludeMetrics[e] = true
}
// Add PCI info as tag
if m.config.AddPciInfoTag {
pciInfo, ret := nvml.DeviceGetPciInfo(g.device)
if ret != nvml.SUCCESS {
err = errors.New(nvml.ErrorString(ret))
cclog.ComponentError(m.name, "Unable to get PCI info for device at index", i, ":", err.Error())
return err
}
g.tags["pci_identifier"] = fmt.Sprintf(
"%08X:%02X:%02X.0",
pciInfo.Domain,
pciInfo.Bus,
pciInfo.Device)
}
}
m.init = true
return nil
}
func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
for i := 0; i < m.num_gpus; i++ {
device, ret := nvml.DeviceGetHandleByIndex(i)
if ret != nvml.SUCCESS {
log.Fatalf("Unable to get device at index %d: %v", i, nvml.ErrorString(ret))
return
}
_, skip := stringArrayContains(m.config.ExcludeDevices, fmt.Sprintf("%d", i))
if skip {
continue
}
tags := map[string]string{"type": "accelerator", "type-id": fmt.Sprintf("%d", i)}
util, ret := nvml.DeviceGetUtilizationRates(device)
if ret == nvml.SUCCESS {
_, skip = stringArrayContains(m.config.ExcludeMetrics, "util")
y, err := lp.New("util", tags, map[string]interface{}{"value": float64(util.Gpu)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
}
_, skip = stringArrayContains(m.config.ExcludeMetrics, "mem_util")
y, err = lp.New("mem_util", tags, map[string]interface{}{"value": float64(util.Memory)}, time.Now())
if err == nil && !skip {
*out = append(*out, y)
for i := range m.gpus {
device := &m.gpus[i]
if !device.excludeMetrics["nv_util"] || !device.excludeMetrics["nv_mem_util"] {
// Retrieves the current utilization rates for the device's major subsystems.
//
// Available utilization rates
// * Gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU.
// * Memory: Percent of time over the past sample period during which global (device) memory was being read or written
//
// Note:
// * During driver initialization when ECC is enabled one can see high GPU and Memory Utilization readings.
// This is caused by ECC Memory Scrubbing mechanism that is performed during driver initialization.
// * On MIG-enabled GPUs, querying device utilization rates is not currently supported.
util, ret := nvml.DeviceGetUtilizationRates(device.device)
if ret == nvml.SUCCESS {
if !device.excludeMetrics["nv_util"] {
y, err := lp.New("nv_util", device.tags, m.meta, map[string]interface{}{"value": float64(util.Gpu)}, time.Now())
if err == nil {
y.AddMeta("unit", "%")
output <- y
}
}
if !device.excludeMetrics["nv_mem_util"] {
y, err := lp.New("nv_mem_util", device.tags, m.meta, map[string]interface{}{"value": float64(util.Memory)}, time.Now())
if err == nil {
y.AddMeta("unit", "%")
output <- y
}
}
}
}
if !device.excludeMetrics["nv_mem_total"] || !device.excludeMetrics["nv_fb_memory"] {
// Retrieves the amount of used, free and total memory available on the device, in bytes.
//
// Enabling ECC reduces the amount of total available memory, due to the extra required parity bits.
//
// The reported amount of used memory is equal to the sum of memory allocated by all active channels on the device.
//
// Available memory info:
// * Free: Unallocated FB memory (in bytes).
// * Total: Total installed FB memory (in bytes).
// * Used: Allocated FB memory (in bytes). Note that the driver/GPU always sets aside a small amount of memory for bookkeeping.
//
// Note:
// In MIG mode, if device handle is provided, the API returns aggregate information, only if the caller has appropriate privileges.
// Per-instance information can be queried by using specific MIG device handles.
meminfo, ret := nvml.DeviceGetMemoryInfo(device.device)
if ret == nvml.SUCCESS {
if !device.excludeMetrics["nv_mem_total"] {
t := float64(meminfo.Total) / (1024 * 1024)
y, err := lp.New("nv_mem_total", device.tags, m.meta, map[string]interface{}{"value": t}, time.Now())
if err == nil {
y.AddMeta("unit", "MByte")
output <- y
}
}
if !device.excludeMetrics["nv_fb_memory"] {
f := float64(meminfo.Used) / (1024 * 1024)
y, err := lp.New("nv_fb_memory", device.tags, m.meta, map[string]interface{}{"value": f}, time.Now())
if err == nil {
y.AddMeta("unit", "MByte")
output <- y
}
}
}
}
if !device.excludeMetrics["nv_temp"] {
// Retrieves the current temperature readings for the device, in degrees C.
//
// Available temperature sensors:
// * TEMPERATURE_GPU: Temperature sensor for the GPU die.
// * NVML_TEMPERATURE_COUNT
temp, ret := nvml.DeviceGetTemperature(device.device, nvml.TEMPERATURE_GPU)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_temp", device.tags, m.meta, map[string]interface{}{"value": float64(temp)}, time.Now())
if err == nil {
y.AddMeta("unit", "degC")
output <- y
}
}
}
if !device.excludeMetrics["nv_fan"] {
// Retrieves the intended operating speed of the device's fan.
//
// Note: The reported speed is the intended fan speed.
// If the fan is physically blocked and unable to spin, the output will not match the actual fan speed.
//
// For all discrete products with dedicated fans.
//
// The fan speed is expressed as a percentage of the product's maximum noise tolerance fan speed.
// This value may exceed 100% in certain cases.
fan, ret := nvml.DeviceGetFanSpeed(device.device)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_fan", device.tags, m.meta, map[string]interface{}{"value": float64(fan)}, time.Now())
if err == nil {
y.AddMeta("unit", "%")
output <- y
}
}
}
if !device.excludeMetrics["nv_ecc_mode"] {
// Retrieves the current and pending ECC modes for the device.
//
// For Fermi or newer fully supported devices. Only applicable to devices with ECC.
// Requires NVML_INFOROM_ECC version 1.0 or higher.
//
// Changing ECC modes requires a reboot.
// The "pending" ECC mode refers to the target mode following the next reboot.
_, ecc_pend, ret := nvml.DeviceGetEccMode(device.device)
if ret == nvml.SUCCESS {
var y lp.CCMetric
var err error
switch ecc_pend {
case nvml.FEATURE_DISABLED:
y, err = lp.New("nv_ecc_mode", device.tags, m.meta, map[string]interface{}{"value": "OFF"}, time.Now())
case nvml.FEATURE_ENABLED:
y, err = lp.New("nv_ecc_mode", device.tags, m.meta, map[string]interface{}{"value": "ON"}, time.Now())
default:
y, err = lp.New("nv_ecc_mode", device.tags, m.meta, map[string]interface{}{"value": "UNKNOWN"}, time.Now())
}
if err == nil {
output <- y
}
} else if ret == nvml.ERROR_NOT_SUPPORTED {
y, err := lp.New("nv_ecc_mode", device.tags, m.meta, map[string]interface{}{"value": "N/A"}, time.Now())
if err == nil {
output <- y
}
}
}
if !device.excludeMetrics["nv_perf_state"] {
// Retrieves the current performance state for the device.
//
// Allowed PStates:
// 0: Maximum Performance.
// ..
// 15: Minimum Performance.
// 32: Unknown performance state.
pState, ret := nvml.DeviceGetPerformanceState(device.device)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_perf_state", device.tags, m.meta, map[string]interface{}{"value": fmt.Sprintf("P%d", int(pState))}, time.Now())
if err == nil {
output <- y
}
}
}
if !device.excludeMetrics["nv_power_usage_report"] {
// Retrieves power usage for this GPU in milliwatts and its associated circuitry (e.g. memory)
//
// On Fermi and Kepler GPUs the reading is accurate to within +/- 5% of current power draw.
//
// It is only available if power management mode is supported
power, ret := nvml.DeviceGetPowerUsage(device.device)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_power_usage_report", device.tags, m.meta, map[string]interface{}{"value": float64(power) / 1000}, time.Now())
if err == nil {
y.AddMeta("unit", "watts")
output <- y
}
}
}
// Retrieves the current clock speeds for the device.
//
// Available clock information:
// * CLOCK_GRAPHICS: Graphics clock domain.
// * CLOCK_SM: Streaming Multiprocessor clock domain.
// * CLOCK_MEM: Memory clock domain.
if !device.excludeMetrics["nv_graphics_clock_report"] {
graphicsClock, ret := nvml.DeviceGetClockInfo(device.device, nvml.CLOCK_GRAPHICS)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_graphics_clock_report", device.tags, m.meta, map[string]interface{}{"value": float64(graphicsClock)}, time.Now())
if err == nil {
y.AddMeta("unit", "MHz")
output <- y
}
}
}
if !device.excludeMetrics["nv_sm_clock_report"] {
smClock, ret := nvml.DeviceGetClockInfo(device.device, nvml.CLOCK_SM)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_sm_clock_report", device.tags, m.meta, map[string]interface{}{"value": float64(smClock)}, time.Now())
if err == nil {
y.AddMeta("unit", "MHz")
output <- y
}
}
}
if !device.excludeMetrics["nv_mem_clock_report"] {
memClock, ret := nvml.DeviceGetClockInfo(device.device, nvml.CLOCK_MEM)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_mem_clock_report", device.tags, m.meta, map[string]interface{}{"value": float64(memClock)}, time.Now())
if err == nil {
y.AddMeta("unit", "MHz")
output <- y
}
}
}
// Retrieves the maximum clock speeds for the device.
//
// Available clock information:
// * CLOCK_GRAPHICS: Graphics clock domain.
// * CLOCK_SM: Streaming multiprocessor clock domain.
// * CLOCK_MEM: Memory clock domain.
// * CLOCK_VIDEO: Video encoder/decoder clock domain.
// * CLOCK_COUNT: Count of clock types.
//
// Note:
// On GPUs from the Fermi family, current P0 clocks (reported by nvmlDeviceGetClockInfo) can differ from max clocks by a few MHz.
if !device.excludeMetrics["nv_max_graphics_clock"] {
max_gclk, ret := nvml.DeviceGetMaxClockInfo(device.device, nvml.CLOCK_GRAPHICS)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_max_graphics_clock", device.tags, m.meta, map[string]interface{}{"value": float64(max_gclk)}, time.Now())
if err == nil {
y.AddMeta("unit", "MHz")
output <- y
}
}
}
if !device.excludeMetrics["nv_max_sm_clock"] {
maxSmClock, ret := nvml.DeviceGetMaxClockInfo(device.device, nvml.CLOCK_SM)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_max_sm_clock", device.tags, m.meta, map[string]interface{}{"value": float64(maxSmClock)}, time.Now())
if err == nil {
y.AddMeta("unit", "MHz")
output <- y
}
}
}
if !device.excludeMetrics["nv_max_mem_clock"] {
maxMemClock, ret := nvml.DeviceGetMaxClockInfo(device.device, nvml.CLOCK_MEM)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_max_mem_clock", device.tags, m.meta, map[string]interface{}{"value": float64(maxMemClock)}, time.Now())
if err == nil {
y.AddMeta("unit", "MHz")
output <- y
}
}
}
if !device.excludeMetrics["nv_ecc_db_error"] {
// Retrieves the total ECC error counts for the device.
//
// For Fermi or newer fully supported devices.
// Only applicable to devices with ECC.
// Requires NVML_INFOROM_ECC version 1.0 or higher.
// Requires ECC Mode to be enabled.
//
// The total error count is the sum of errors across each of the separate memory systems,
// i.e. the total set of errors across the entire device.
ecc_db, ret := nvml.DeviceGetTotalEccErrors(device.device, nvml.MEMORY_ERROR_TYPE_UNCORRECTED, nvml.AGGREGATE_ECC)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_ecc_db_error", device.tags, m.meta, map[string]interface{}{"value": float64(ecc_db)}, time.Now())
if err == nil {
output <- y
}
}
}
if !device.excludeMetrics["nv_ecc_sb_error"] {
ecc_sb, ret := nvml.DeviceGetTotalEccErrors(device.device, nvml.MEMORY_ERROR_TYPE_CORRECTED, nvml.AGGREGATE_ECC)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_ecc_sb_error", device.tags, m.meta, map[string]interface{}{"value": float64(ecc_sb)}, time.Now())
if err == nil {
output <- y
}
}
}
if !device.excludeMetrics["nv_power_man_limit"] {
// Retrieves the power management limit associated with this device.
//
// For Fermi or newer fully supported devices.
//
// The power limit defines the upper boundary for the card's power draw.
// If the card's total power draw reaches this limit the power management algorithm kicks in.
pwr_limit, ret := nvml.DeviceGetPowerManagementLimit(device.device)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_power_man_limit", device.tags, m.meta, map[string]interface{}{"value": float64(pwr_limit) / 1000}, time.Now())
if err == nil {
y.AddMeta("unit", "watts")
output <- y
}
}
}
if !device.excludeMetrics["nv_encoder_util"] {
// Retrieves the current utilization and sampling size in microseconds for the Encoder
//
// For Kepler or newer fully supported devices.
//
// Note: On MIG-enabled GPUs, querying encoder utilization is not currently supported.
enc_util, _, ret := nvml.DeviceGetEncoderUtilization(device.device)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_encoder_util", device.tags, m.meta, map[string]interface{}{"value": float64(enc_util)}, time.Now())
if err == nil {
y.AddMeta("unit", "%")
output <- y
}
}
}
if !device.excludeMetrics["nv_decoder_util"] {
// Retrieves the current utilization and sampling size in microseconds for the Decoder
//
// For Kepler or newer fully supported devices.
//
// Note: On MIG-enabled GPUs, querying decoder utilization is not currently supported.
dec_util, _, ret := nvml.DeviceGetDecoderUtilization(device.device)
if ret == nvml.SUCCESS {
y, err := lp.New("nv_decoder_util", device.tags, m.meta, map[string]interface{}{"value": float64(dec_util)}, time.Now())
if err == nil {
y.AddMeta("unit", "%")
output <- y
}
}
}
}


@@ -0,0 +1,40 @@
## `nvidia` collector
```json
"nvidia": {
"exclude_devices" : [
"0","1"
],
"exclude_metrics": [
"nv_fb_memory",
"nv_fan"
]
}
```
Metrics:
* `nv_util`
* `nv_mem_util`
* `nv_mem_total`
* `nv_fb_memory`
* `nv_temp`
* `nv_fan`
* `nv_ecc_mode`
* `nv_perf_state`
* `nv_power_usage_report`
* `nv_graphics_clock_report`
* `nv_sm_clock_report`
* `nv_mem_clock_report`
* `nv_max_graphics_clock`
* `nv_max_sm_clock`
* `nv_max_mem_clock`
* `nv_ecc_db_error`
* `nv_ecc_sb_error`
* `nv_power_man_limit`
* `nv_encoder_util`
* `nv_decoder_util`
It uses a separate `type` tag (`accelerator`) in the metrics. The output metric looks like this:
`<name>,type=accelerator,type-id=<nvidia-gpu-id> value=<metric value> <timestamp>`


@@ -4,104 +4,227 @@ import (
"encoding/json"
"fmt"
"io/ioutil"
"os"
"path/filepath"
"strconv"
"strings"
"time"
lp "github.com/influxdata/line-protocol"
cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const HWMON_PATH = `/sys/class/hwmon`
// See: https://www.kernel.org/doc/html/latest/hwmon/sysfs-interface.html
// /sys/class/hwmon/hwmon*/name -> coretemp
// /sys/class/hwmon/hwmon*/temp*_label -> Core 0
// /sys/class/hwmon/hwmon*/temp*_input -> 27800 = 27.8°C
// /sys/class/hwmon/hwmon*/temp*_max -> 86000 = 86.0°C
// /sys/class/hwmon/hwmon*/temp*_crit -> 100000 = 100.0°C
type TempCollectorSensor struct {
name string
label string
metricName string // Default: name_label
file string
maxTempName string
maxTemp int64
critTempName string
critTemp int64
tags map[string]string
}
type TempCollector struct {
metricCollector
config struct {
ExcludeMetrics []string `json:"exclude_metrics"`
TagOverride map[string]map[string]string `json:"tag_override"`
ReportMaxTemp bool `json:"report_max_temperature"`
ReportCriticalTemp bool `json:"report_critical_temperature"`
}
sensors []*TempCollectorSensor
}
func (m *TempCollector) Init(config json.RawMessage) error {
// Check if already initialized
if m.init {
return nil
}
m.name = "TempCollector"
m.setup()
if len(config) > 0 {
err := json.Unmarshal(config, &m.config)
if err != nil {
return err
}
}
m.meta = map[string]string{
"source": m.name,
"group": "IPMI",
"unit": "degC",
}
m.sensors = make([]*TempCollectorSensor, 0)
// Find all temperature sensor files
globPattern := filepath.Join("/sys/class/hwmon", "*", "temp*_input")
inputFiles, err := filepath.Glob(globPattern)
if err != nil {
return fmt.Errorf("Unable to glob files with pattern '%s': %v", globPattern, err)
}
if inputFiles == nil {
return fmt.Errorf("Unable to find any files with pattern '%s'", globPattern)
}
// Get sensor name for each temperature sensor file
for _, file := range inputFiles {
sensor := new(TempCollectorSensor)
// sensor name
nameFile := filepath.Join(filepath.Dir(file), "name")
name, err := ioutil.ReadFile(nameFile)
if err == nil {
sensor.name = strings.TrimSpace(string(name))
}
// sensor label
labelFile := strings.TrimSuffix(file, "_input") + "_label"
label, err := ioutil.ReadFile(labelFile)
if err == nil {
sensor.label = strings.TrimSpace(string(label))
}
// sensor metric name
switch {
case len(sensor.name) == 0 && len(sensor.label) == 0:
continue
case sensor.name == "coretemp" && strings.HasPrefix(sensor.label, "Core ") ||
sensor.name == "coretemp" && strings.HasPrefix(sensor.label, "Package id "):
sensor.metricName = "temp_" + sensor.label
case len(sensor.name) != 0 && len(sensor.label) != 0:
sensor.metricName = sensor.name + "_" + sensor.label
case len(sensor.name) != 0:
sensor.metricName = sensor.name
case len(sensor.label) != 0:
sensor.metricName = sensor.label
}
sensor.metricName = strings.ToLower(sensor.metricName)
sensor.metricName = strings.Replace(sensor.metricName, " ", "_", -1)
// Add temperature prefix, if required
if !strings.Contains(sensor.metricName, "temp") {
sensor.metricName = "temp_" + sensor.metricName
}
// Sensor file
sensor.file = file
// Sensor tags
sensor.tags = map[string]string{
"type": "node",
}
// Apply tag override configuration
for key, newtags := range m.config.TagOverride {
if strings.Contains(sensor.file, key) {
sensor.tags = newtags
break
}
}
// max temperature
if m.config.ReportMaxTemp {
maxTempFile := strings.TrimSuffix(file, "_input") + "_max"
if buffer, err := ioutil.ReadFile(maxTempFile); err == nil {
if x, err := strconv.ParseInt(strings.TrimSpace(string(buffer)), 10, 64); err == nil {
sensor.maxTempName = strings.Replace(sensor.metricName, "temp", "max_temp", 1)
sensor.maxTemp = x / 1000
}
}
}
// critical temperature
if m.config.ReportCriticalTemp {
criticalTempFile := strings.TrimSuffix(file, "_input") + "_crit"
if buffer, err := ioutil.ReadFile(criticalTempFile); err == nil {
if x, err := strconv.ParseInt(strings.TrimSpace(string(buffer)), 10, 64); err == nil {
sensor.critTempName = strings.Replace(sensor.metricName, "temp", "crit_temp", 1)
sensor.critTemp = x / 1000
}
}
}
m.sensors = append(m.sensors, sensor)
}
// Empty sensors map
if len(m.sensors) == 0 {
return fmt.Errorf("No temperature sensors found")
}
// Finished initialization
m.init = true
return nil
}
func (m *TempCollector) Read(interval time.Duration, output chan lp.CCMetric) {
for _, sensor := range m.sensors {
// Read sensor file
buffer, err := ioutil.ReadFile(sensor.file)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to read file '%s': %v", sensor.file, err))
continue
}
x, err := strconv.ParseInt(strings.TrimSpace(string(buffer)), 10, 64)
if err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("Read(): Failed to convert temperature '%s' to int64: %v", buffer, err))
continue
}
x /= 1000
y, err := lp.New(
sensor.metricName,
sensor.tags,
m.meta,
map[string]interface{}{"value": x},
time.Now(),
)
if err == nil {
output <- y
}
// max temperature
if m.config.ReportMaxTemp && sensor.maxTemp != 0 {
y, err := lp.New(
sensor.maxTempName,
sensor.tags,
m.meta,
map[string]interface{}{"value": sensor.maxTemp},
time.Now(),
)
if err == nil {
output <- y
}
}
// critical temperature
if m.config.ReportCriticalTemp && sensor.critTemp != 0 {
y, err := lp.New(
sensor.critTempName,
sensor.tags,
m.meta,
map[string]interface{}{"value": sensor.critTemp},
time.Now(),
)
if err == nil {
output <- y
}
}
}
}
func (m *TempCollector) Close() {

collectors/tempMetric.md

@@ -0,0 +1,22 @@
## `tempstat` collector
```json
"tempstat": {
"tag_override" : {
"<device like hwmon1>" : {
"type" : "socket",
"type-id" : "0"
}
},
"exclude_metrics": [
"metric1",
"metric2"
]
}
```
The `tempstat` collector reads the temperature data from `/sys/class/hwmon/<device>/tempX_{input,label}`.
Metrics:
* `temp_*`: The metric name is taken from the `label` files.
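For illustration, a hwmon sensor with `name` = `coretemp` and `temp1_label` = `Core 0` results in the metric `temp_core_0`: names and labels are lower-cased, spaces are replaced by `_`, and a `temp_` prefix is added if missing.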


@@ -9,7 +9,7 @@ import (
"strings"
"time"
lp "github.com/influxdata/line-protocol"
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)
const MAX_NUM_PROCS = 10
@@ -20,15 +20,16 @@ type TopProcsCollectorConfig struct {
}
type TopProcsCollector struct {
metricCollector
tags map[string]string
config TopProcsCollectorConfig
}
func (m *TopProcsCollector) Init(config json.RawMessage) error {
var err error
m.name = "TopProcsCollector"
m.tags = map[string]string{"type": "node"}
m.meta = map[string]string{"source": m.name, "group": "TopProcs"}
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
@@ -51,7 +52,7 @@ func (m *TopProcsCollector) Init(config []byte) error {
return nil
}
func (m *TopProcsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
@@ -66,9 +67,9 @@ func (m *TopProcsCollector) Read(interval time.Duration, out *[]lp.MutableMetric
lines := strings.Split(string(stdout), "\n")
for i := 1; i < m.config.Num_procs+1; i++ {
name := fmt.Sprintf("topproc%d", i)
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": string(lines[i])}, time.Now())
if err == nil {
output <- y
}
}
}


@@ -0,0 +1,15 @@
## `topprocs` collector
```json
"topprocs": {
"num_procs": 5
}
```
The `topprocs` collector reports the top X processes sorted by CPU utilization (`ps -Ao comm --sort=-pcpu`), where X is set by `num_procs`.
In contrast to most other collectors, the metric value is a `string`.
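For illustration, with `num_procs: 5` the collector emits the metrics `topproc1` through `topproc5`, each carrying a process command name as its string value, e.g. `topproc1,type=node value="firefox" <timestamp>`.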