mirror of
https://github.com/ClusterCockpit/cc-metric-collector.git
synced 2025-07-19 03:11:41 +02:00
Modularize the whole thing (#16)
* Use channels, add a metric router, split up configuration and use extended version of Influx line protocol internally
* Use central timer for collectors and router. Add expressions to router
* Add expression to router config
* Update entry points
* Start with README
* Update README for CCMetric
* Formatting
* Update README.md
* Add README for MultiChanTicker
* Add README for MultiChanTicker
* Update README.md
* Add README to metric router
* Update main README
* Remove SinkEntity type
* Update README for sinks
* Update go files
* Update README for receivers
* Update collectors README
* Update collectors README
* Use seperate page per collector
* Fix for tempstat page
* Add docs for customcmd collector
* Add docs for ipmistat collector
* Add docs for topprocs collector
* Update customCmdMetric.md
* Use seconds when calculating LIKWID metrics
* Add IB metrics ib_recv_pkts and ib_xmit_pkts
* Drop domain part of host name
* Updated to latest stable version of likwid
* Define source code dependencies in Makefile
* Add GPFS / IBM Spectrum Scale collector
* Add vet and staticcheck make targets
* Add vet and staticcheck make targets
* Avoid go vet warning: struct field tag `json:"..., omitempty"` not compatible with reflect.StructTag.Get: suspicious space in struct tag value struct field tag `json:"...", omitempty` not compatible with reflect.StructTag.Get: key:"value" pairs not separated by spaces
* Add sample collector to README.md
* Add CPU frequency collector
* Avoid staticcheck warning: redundant return statement
* Avoid staticcheck warning: unnecessary assignment to the blank identifier
* Simplified code
* Add CPUFreqCollectorCpuinfo a metric collector to measure the current frequency of the CPUs as obtained from /proc/cpuinfo Only measure on the first hyperthread
* Add collector for NFS clients
* Move publication of metrics into Flush() for NatsSink
* Update GitHub actions
* Refactoring
* Avoid vet warning: Println arg list ends with redundant newline
* Avoid vet warning struct field commands has json tag but is not exported
* Avoid vet warning: return copies lock value.
* Corrected typo
* Refactoring
* Add go sources in internal/...
* Bad separator in Makefile
* Fix Infiniband collector

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
@@ -1,288 +1,34 @@
|
||||
# CCMetric collectors
|
||||
|
||||
This folder contains the collectors for the cc-metric-collector.
|
||||
|
||||
# `metricCollector.go`
|
||||
The base class/configuration is located in `metricCollector.go`.
|
||||
|
||||
# Collectors
|
||||
|
||||
* `memstatMetric.go`: Reads `/proc/meminfo` to calculate **node** metrics. It also combines values to the metric `mem_used`
|
||||
* `loadavgMetric.go`: Reads `/proc/loadavg` and submits **node** metrics:
|
||||
* `netstatMetric.go`: Reads `/proc/net/dev` and submits **node** metrics for all network devices.
|
||||
* `lustreMetric.go`: Reads Lustre's stats files and submits **node** metrics:
|
||||
* `infinibandMetric.go`: Reads InfiniBand metrics. It uses the `perfquery` command to read the **node** metrics but can fall back to sysfs counters in case `perfquery` does not work.
|
||||
* `likwidMetric.go`: Reads hardware performance events using LIKWID. It submits **socket** and **cpu** metrics
|
||||
* `cpustatMetric.go`: Read CPU specific values from `/proc/stat`
|
||||
* `topprocsMetric.go`: Reads the TopX processes by their CPU usage. X is configurable
|
||||
* `nvidiaMetric.go`: Read data about Nvidia GPUs using the NVML library
|
||||
* `tempMetric.go`: Read temperature data from `/sys/class/hwmon/hwmon*`
|
||||
* `ipmiMetric.go`: Collect data from `ipmitool` or as fallback `ipmi-sensors`
|
||||
* `customCmdMetric.go`: Run commands or read files and submit the output (output has to be in InfluxDB line protocol!)
|
||||
|
||||
If a collector cannot be initialized, it is excluded from all further reads. For example, if the Lustre stats file is not a valid path, no Lustre-specific metrics are recorded.
|
||||
|
||||
# Collector configuration
|
||||
# Configuration
|
||||
|
||||
```json
|
||||
"collectors": [
|
||||
"tempstat"
|
||||
],
|
||||
"collect_config": {
|
||||
"tempstat": {
|
||||
"tag_override": {
|
||||
"hwmon0" : {
|
||||
"type" : "socket",
|
||||
"type-id" : "0"
|
||||
},
|
||||
"hwmon1" : {
|
||||
"type" : "socket",
|
||||
"type-id" : "1"
|
||||
}
|
||||
}
|
||||
{
|
||||
"collector_type" : {
|
||||
<collector specific configuration>
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The configuration of the collectors in the main config file consists of two parts: the active collectors (`collectors`) and the collector configuration (`collect_config`). At startup, all collectors in the `collectors` list are initialized and, if initialization succeeds, added to the active collectors for metric retrieval. During initialization, the collector-specific configuration from the `collect_config` section is handed over. Each collector has its own configuration options; see the collector-specific section.
|
||||
In contrast to the configuration files for sinks and receivers, the collectors configuration is not a list but a set of dicts. This is required because we have not yet found a way to partially read the type before loading the remaining configuration. We intend to change this to the same format.
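
As a minimal sketch of what "a set of dicts" means in practice, the configuration can be decoded into a map from collector type to its still-unparsed, collector-specific JSON. This mirrors the `map[string]json.RawMessage` field used by `collectorManager.go` in this commit; the helper name `parseCollectorConfig` and the sample input are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// parseCollectorConfig (hypothetical helper) decodes the collectors
// configuration into collector type -> raw collector-specific config.
// Each raw value can later be handed to that collector's Init().
func parseCollectorConfig(data []byte) (map[string]json.RawMessage, error) {
	var cfg map[string]json.RawMessage
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	// Sample input made up for illustration.
	data := []byte(`{"tempstat": {"tag_override": {"hwmon0": {"type": "socket", "type-id": "0"}}}, "loadavg": {}}`)
	cfg, err := parseCollectorConfig(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Sort the keys so the output order is stable.
	names := make([]string, 0, len(cfg))
	for name := range cfg {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s -> %s\n", name, cfg[name])
	}
}
```

Because the top level is an object rather than a list, each collector type can appear only once, and a JSON list is rejected at decode time.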
|
||||
|
||||
## `memstat`
|
||||
# Available collectors
|
||||
|
||||
```json
|
||||
"memstat": {
|
||||
"exclude_metrics": [
|
||||
"mem_used"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `memstat` collector reads data from `/proc/meminfo` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
|
||||
|
||||
|
||||
Metrics:
|
||||
* `mem_total`
|
||||
* `mem_sreclaimable`
|
||||
* `mem_slab`
|
||||
* `mem_free`
|
||||
* `mem_buffers`
|
||||
* `mem_cached`
|
||||
* `mem_available`
|
||||
* `mem_shared`
|
||||
* `swap_total`
|
||||
* `swap_free`
|
||||
* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)
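
The `mem_used` formula above can be sketched as follows. This is an illustration only, not the collector's code: it assumes that `mem_total`, `mem_free`, `mem_buffers` and `mem_cached` correspond to the `MemTotal`, `MemFree`, `Buffers` and `Cached` keys of `/proc/meminfo`, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfo turns /proc/meminfo-style text ("Key:   value kB") into a map.
// Lines that do not have a numeric second column are skipped.
func parseMeminfo(text string) map[string]int64 {
	stats := make(map[string]int64)
	for _, line := range strings.Split(text, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[strings.TrimSuffix(fields[0], ":")] = v
	}
	return stats
}

// memUsed applies mem_used = mem_total - (mem_free + mem_buffers + mem_cached).
// The key names are an assumption about how the metrics map to /proc/meminfo.
func memUsed(stats map[string]int64) int64 {
	return stats["MemTotal"] - (stats["MemFree"] + stats["Buffers"] + stats["Cached"])
}

func main() {
	// Made-up sample values in kB.
	sample := "MemTotal:       16000000 kB\nMemFree:         4000000 kB\nBuffers:          500000 kB\nCached:          3500000 kB\n"
	fmt.Println("mem_used:", memUsed(parseMeminfo(sample)))
}
```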
|
||||
|
||||
## `loadavg`
|
||||
```json
|
||||
"loadavg": {
|
||||
"exclude_metrics": [
|
||||
"proc_run"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
|
||||
|
||||
Metrics:
|
||||
* `load_one`
|
||||
* `load_five`
|
||||
* `load_fifteen`
|
||||
* `proc_run`
|
||||
* `proc_total`
|
||||
|
||||
## `netstat`
|
||||
```json
|
||||
"netstat": {
|
||||
"exclude_devices": [
|
||||
"lo"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `netstat` collector reads data from `/proc/net/dev` and outputs a handful of **node** metrics. If a device is not required, it can be excluded from being forwarded to the sink. Commonly, the `lo` device should be excluded.
|
||||
|
||||
Metrics:
|
||||
* `bytes_in`
|
||||
* `bytes_out`
|
||||
* `pkts_in`
|
||||
* `pkts_out`
|
||||
|
||||
The device name is added as tag `device`.
|
||||
|
||||
|
||||
## `diskstat`
|
||||
|
||||
```json
|
||||
"diskstat": {
|
||||
"exclude_metrics": [
|
||||
"read_ms"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `diskstat` collector reads data from `/proc/diskstats` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
|
||||
|
||||
Metrics:
|
||||
* `reads`
|
||||
* `reads_merged`
|
||||
* `read_sectors`
|
||||
* `read_ms`
|
||||
* `writes`
|
||||
* `writes_merged`
|
||||
* `writes_sectors`
|
||||
* `writes_ms`
|
||||
* `ioops`
|
||||
* `ioops_ms`
|
||||
* `ioops_weighted_ms`
|
||||
* `discards`
|
||||
* `discards_merged`
|
||||
* `discards_sectors`
|
||||
* `discards_ms`
|
||||
* `flushes`
|
||||
* `flushes_ms`
|
||||
|
||||
|
||||
The device name is added as tag `device`.
|
||||
|
||||
## `cpustat`
|
||||
```json
"cpustat": {
|
||||
"exclude_metrics": [
|
||||
"cpu_idle"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `cpustat` collector reads data from `/proc/stat` and outputs a handful of **node** and **hwthread** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
|
||||
|
||||
Metrics:
|
||||
* `cpu_user`
|
||||
* `cpu_nice`
|
||||
* `cpu_system`
|
||||
* `cpu_idle`
|
||||
* `cpu_iowait`
|
||||
* `cpu_irq`
|
||||
* `cpu_softirq`
|
||||
* `cpu_steal`
|
||||
* `cpu_guest`
|
||||
* `cpu_guest_nice`
|
||||
|
||||
## `likwid`
|
||||
```json
|
||||
"likwid": {
|
||||
"eventsets": [
|
||||
{
|
||||
"events": {
|
||||
"FIXC1": "ACTUAL_CPU_CLOCK",
|
||||
"FIXC2": "MAX_CPU_CLOCK",
|
||||
"PMC0": "RETIRED_INSTRUCTIONS",
|
||||
"PMC1": "CPU_CLOCKS_UNHALTED",
|
||||
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
|
||||
"PMC3": "MERGE",
|
||||
"DFC0": "DRAM_CHANNEL_0",
|
||||
"DFC1": "DRAM_CHANNEL_1",
|
||||
"DFC2": "DRAM_CHANNEL_2",
|
||||
"DFC3": "DRAM_CHANNEL_3"
|
||||
},
|
||||
"metrics": [
|
||||
{
|
||||
"name": "ipc",
|
||||
"calc": "PMC0/PMC1",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "flops_any",
|
||||
"calc": "0.000001*PMC2/time",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "clock_mhz",
|
||||
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "mem1",
|
||||
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
|
||||
"socket_scope": true,
|
||||
"publish": false
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"events": {
|
||||
"DFC0": "DRAM_CHANNEL_4",
|
||||
"DFC1": "DRAM_CHANNEL_5",
|
||||
"DFC2": "DRAM_CHANNEL_6",
|
||||
"DFC3": "DRAM_CHANNEL_7",
|
||||
"PWR0": "RAPL_CORE_ENERGY",
|
||||
"PWR1": "RAPL_PKG_ENERGY"
|
||||
},
|
||||
"metrics": [
|
||||
{
|
||||
"name": "pwr_core",
|
||||
"calc": "PWR0/time",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "pwr_pkg",
|
||||
"calc": "PWR1/time",
|
||||
"socket_scope": true,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "mem2",
|
||||
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
|
||||
"socket_scope": true,
|
||||
"publish": false
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"globalmetrics": [
|
||||
{
|
||||
"name": "mem_bw",
|
||||
"calc": "mem1+mem2",
|
||||
"socket_scope": true,
|
||||
"publish": true
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
_Example config suitable for AMD Zen3_
|
||||
|
||||
The `likwid` collector reads hardware performance counters at the **hwthread** and **socket** level. The configuration looks complicated, but it is mostly copied from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). Earlier iterations of the collector tried to use the performance groups directly, but that approach lacked flexibility. The current way of configuration provides the most flexibility.
|
||||
|
||||
The logic is as follows: there are multiple eventsets, each consisting of a list of counters and events and a list of metrics. If you compare a common performance group with the example settings above, there is not much difference:
|
||||
```
|
||||
EVENTSET -> "events": {
|
||||
FIXC1 ACTUAL_CPU_CLOCK -> "FIXC1": "ACTUAL_CPU_CLOCK",
|
||||
FIXC2 MAX_CPU_CLOCK -> "FIXC2": "MAX_CPU_CLOCK",
|
||||
PMC0 RETIRED_INSTRUCTIONS -> "PMC0" : "RETIRED_INSTRUCTIONS",
|
||||
PMC1 CPU_CLOCKS_UNHALTED -> "PMC1" : "CPU_CLOCKS_UNHALTED",
|
||||
PMC2 RETIRED_SSE_AVX_FLOPS_ALL -> "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
|
||||
PMC3 MERGE -> "PMC3": "MERGE",
|
||||
-> }
|
||||
```
|
||||
|
||||
The metrics follow the same procedure:
|
||||
|
||||
```
|
||||
METRICS -> "metrics": [
|
||||
IPC PMC0/PMC1 -> {
|
||||
-> "name" : "IPC",
|
||||
-> "calc" : "PMC0/PMC1",
|
||||
-> "socket_scope": false,
|
||||
-> "publish": true
|
||||
-> }
|
||||
-> ]
|
||||
```
|
||||
|
||||
The `socket_scope` option specifies whether a metric is submitted per socket or per hwthread. If a metric is only used for internal calculations, you can set `publish` to `false`.
|
||||
|
||||
Since some metrics can only be gathered across multiple measurements (like the memory bandwidth on AMD Zen3 chips), configure multiple eventsets as in the example config and use the `globalmetrics` section to combine them. **Be aware** that the combination might be misleading, because the behavior of a metric changes over time and the individual measurements might cover different computing phases.
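
To make the combination concrete, the following sketch hard-codes the arithmetic of the example config: `mem1` and `mem2` each use the `calc` string `0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time`, and the global metric `mem_bw` is their sum. The function name and the counter values are hypothetical; the real collector evaluates the `calc` strings from the configuration rather than compiled-in code.

```go
package main

import "fmt"

// memBandwidth mirrors the calc string
// "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time" from the example config:
// the sum of four DRAM channel counts, times 64 bytes, scaled by 1e-6,
// divided by the measurement time in seconds.
func memBandwidth(counts [4]float64, seconds float64) float64 {
	var sum float64
	for _, c := range counts {
		sum += c
	}
	return 0.000001 * sum * 64.0 / seconds
}

func main() {
	// Made-up counter readings from two eventsets that were measured
	// one after the other (DRAM channels 0-3, then channels 4-7).
	mem1 := memBandwidth([4]float64{1e6, 2e6, 1e6, 2e6}, 1.0)
	mem2 := memBandwidth([4]float64{2e6, 1e6, 2e6, 1e6}, 1.0)

	// globalmetrics: "mem_bw" = "mem1+mem2"
	memBW := mem1 + mem2
	fmt.Printf("mem1=%.1f mem2=%.1f mem_bw=%.1f\n", mem1, mem2, memBW)
}
```

Because the two eventsets are not measured at the same time, `mem_bw` adds values from different time windows, which is the source of the caveat above.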
|
||||
* [`cpustat`](./cpustatMetric.md)
|
||||
* [`memstat`](./memstatMetric.md)
|
||||
* [`diskstat`](./diskstatMetric.md)
|
||||
* [`loadavg`](./loadavgMetric.md)
|
||||
* [`netstat`](./netstatMetric.md)
|
||||
* [`ibstat`](./infinibandMetric.md)
|
||||
* [`tempstat`](./tempMetric.md)
|
||||
* [`lustre`](./lustreMetric.md)
|
||||
* [`likwid`](./likwidMetric.md)
|
||||
* [`nvidia`](./nvidiaMetric.md)
|
||||
* [`customcmd`](./customCmdMetric.md)
|
||||
* [`ipmistat`](./ipmiMetric.md)
|
||||
* [`topprocs`](./topprocsMetric.md)
|
||||
|
||||
## Todos
|
||||
|
||||
@@ -292,13 +38,15 @@ Since some metrics can only be gathered in multiple measurements (like the memor
|
||||
# Contributing own collectors
|
||||
A collector reads data from any source, parses it into metrics and submits these metrics to the `metric-collector`. A collector provides the following functions:
|
||||
|
||||
* `Init(config []byte) error`: Initializes the collector using the given collector-specific config in JSON.
|
||||
* `Read(duration time.Duration, out *[]lp.MutableMetric) error`: Read, parse and submit data to the `out` list. If the collector has to measure anything for some duration, use the provided function argument `duration`.
|
||||
* `Name() string`: Return the name of the collector
|
||||
* `Init(config json.RawMessage) error`: Initializes the collector using the given collector-specific config in JSON. Check if needed files/commands exists, ...
|
||||
* `Initialized() bool`: Check if a collector is successfully initialized
|
||||
* `Read(duration time.Duration, output chan ccMetric.CCMetric)`: Read, parse and submit data to the `output` channel as [`CCMetric`](../internal/ccMetric/README.md). If the collector has to measure anything for some duration, use the provided function argument `duration`.
|
||||
* `Close()`: Closes down the collector.
|
||||
|
||||
It is recommended to call `setup()` in the `Init()` function.
|
||||
|
||||
Finally, the collector needs to be registered in the `metric-collector.go`. There is a list of collectors called `Collectors` which is a map (string -> pointer to collector). Add a new entry with a descriptive name and the new collector.
|
||||
Finally, the collector needs to be registered in the `collectorManager.go`. There is a list of collectors called `AvailableCollectors` which is a map (`collector_type_string` -> `pointer to MetricCollector interface`). Add a new entry with a descriptive name and the new collector.
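
The interface and registration step described above can be sketched in a self-contained form. To keep the example runnable without the repository, `Metric` is a simplified, hypothetical stand-in for `ccMetric.CCMetric`, and `availableCollectors` stands in for the `AvailableCollectors` map; the method set follows the list above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Metric is a simplified stand-in for ccMetric.CCMetric (assumption).
type Metric struct {
	Name  string
	Tags  map[string]string
	Value interface{}
	Time  time.Time
}

// MetricCollector lists the functions a collector provides.
type MetricCollector interface {
	Name() string
	Init(config json.RawMessage) error
	Initialized() bool
	Read(duration time.Duration, output chan Metric)
	Close()
}

// sampleCollector is a hypothetical collector used only for illustration.
type sampleCollector struct {
	name   string
	init   bool
	config struct {
		ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
	}
}

func (c *sampleCollector) Name() string { return c.name }

func (c *sampleCollector) Init(config json.RawMessage) error {
	c.name = "SampleCollector"
	if len(config) > 0 {
		if err := json.Unmarshal(config, &c.config); err != nil {
			return err
		}
	}
	c.init = true
	return nil
}

func (c *sampleCollector) Initialized() bool { return c.init }

func (c *sampleCollector) Read(duration time.Duration, output chan Metric) {
	if !c.init {
		return
	}
	// Each metric carries exactly one value.
	output <- Metric{Name: "sample_metric", Tags: map[string]string{"type": "node"}, Value: 42, Time: time.Now()}
}

func (c *sampleCollector) Close() { c.init = false }

// Registration: collector type string -> collector instance.
var availableCollectors = map[string]MetricCollector{
	"sample": &sampleCollector{},
}

func main() {
	c := availableCollectors["sample"]
	if err := c.Init(json.RawMessage(`{}`)); err != nil {
		fmt.Println("init failed:", err)
		return
	}
	out := make(chan Metric, 1) // buffered so Read does not block here
	c.Read(time.Second, out)
	m := <-out
	fmt.Println(c.Name(), m.Name, m.Value)
	c.Close()
}
```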
|
||||
|
||||
## Sample collector
|
||||
|
||||
@@ -307,8 +55,9 @@ package collectors
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
"time"
|
||||
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
// Struct for the collector-specific JSON config
|
||||
@@ -317,11 +66,11 @@ type SampleCollectorConfig struct {
|
||||
}
|
||||
|
||||
type SampleCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
config SampleCollectorConfig
|
||||
}
|
||||
|
||||
func (m *SampleCollector) Init(config []byte) error {
|
||||
func (m *SampleCollector) Init(config json.RawMessage) error {
|
||||
m.name = "SampleCollector"
|
||||
m.setup()
|
||||
if len(config) > 0 {
|
||||
@@ -330,11 +79,13 @@ func (m *SampleCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{"source": m.name, "group": "Sample"}
|
||||
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *SampleCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *SampleCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -342,9 +93,9 @@ func (m *SampleCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
tags := map[string]string{"type" : "node"}
|
||||
// Each metric has exactly one field: value !
|
||||
value := map[string]interface{}{"value": int(x)}
|
||||
y, err := lp.New("sample_metric", tags, value, time.Now())
|
||||
y, err := lp.New("sample_metric", tags, m.meta, value, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
|
143
collectors/collectorManager.go
Normal file
@@ -0,0 +1,143 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"log"
|
||||
"os"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
mct "github.com/ClusterCockpit/cc-metric-collector/internal/multiChanTicker"
|
||||
)
|
||||
|
||||
var AvailableCollectors = map[string]MetricCollector{
|
||||
|
||||
"likwid": &LikwidCollector{},
|
||||
"loadavg": &LoadavgCollector{},
|
||||
"memstat": &MemstatCollector{},
|
||||
"netstat": &NetstatCollector{},
|
||||
"ibstat": &InfinibandCollector{},
|
||||
"lustrestat": &LustreCollector{},
|
||||
"cpustat": &CpustatCollector{},
|
||||
"topprocs": &TopProcsCollector{},
|
||||
"nvidia": &NvidiaCollector{},
|
||||
"customcmd": &CustomCmdCollector{},
|
||||
"diskstat": &DiskstatCollector{},
|
||||
"tempstat": &TempCollector{},
|
||||
"ipmistat": &IpmiCollector{},
|
||||
"gpfs": new(GpfsCollector),
|
||||
"cpufreq": new(CPUFreqCollector),
|
||||
"cpufreq_cpuinfo": new(CPUFreqCpuInfoCollector),
|
||||
"nfsstat": new(NfsCollector),
|
||||
}
|
||||
|
||||
type collectorManager struct {
|
||||
collectors []MetricCollector
|
||||
output chan lp.CCMetric
|
||||
done chan bool
|
||||
ticker mct.MultiChanTicker
|
||||
duration time.Duration
|
||||
wg *sync.WaitGroup
|
||||
config map[string]json.RawMessage
|
||||
}
|
||||
|
||||
type CollectorManager interface {
|
||||
Init(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) error
|
||||
AddOutput(output chan lp.CCMetric)
|
||||
Start()
|
||||
Close()
|
||||
}
|
||||
|
||||
func (cm *collectorManager) Init(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) error {
|
||||
cm.collectors = make([]MetricCollector, 0)
|
||||
cm.output = nil
|
||||
cm.done = make(chan bool)
|
||||
cm.wg = wg
|
||||
cm.ticker = ticker
|
||||
cm.duration = duration
|
||||
configFile, err := os.Open(collectConfigFile)
|
||||
if err != nil {
|
||||
log.Print(err.Error())
|
||||
return err
|
||||
}
|
||||
defer configFile.Close()
|
||||
jsonParser := json.NewDecoder(configFile)
|
||||
err = jsonParser.Decode(&cm.config)
|
||||
if err != nil {
|
||||
log.Print(err.Error())
|
||||
return err
|
||||
}
|
||||
for k, cfg := range cm.config {
|
||||
log.Print(k, " ", cfg)
|
||||
if _, found := AvailableCollectors[k]; !found {
|
||||
log.Print("[CollectorManager] SKIP unknown collector ", k)
|
||||
continue
|
||||
}
|
||||
c := AvailableCollectors[k]
|
||||
|
||||
err = c.Init(cfg)
|
||||
if err != nil {
|
||||
log.Print("[CollectorManager] Collector ", k, " initialization failed: ", err.Error())
|
||||
continue
|
||||
}
|
||||
cm.collectors = append(cm.collectors, c)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (cm *collectorManager) Start() {
|
||||
cm.wg.Add(1)
|
||||
tick := make(chan time.Time)
|
||||
cm.ticker.AddChannel(tick)
|
||||
	go func() {
		for {
			select {
			case <-cm.done:
				for _, c := range cm.collectors {
					c.Close()
				}
				cm.wg.Done()
				log.Print("[CollectorManager] DONE\n")
				// return instead of a labeled break: a break on a label
				// attached to the select would only leave the select,
				// not the surrounding for loop
				return
			case t := <-tick:
				for _, c := range cm.collectors {
					select {
					case <-cm.done:
						for _, c := range cm.collectors {
							c.Close()
						}
						cm.wg.Done()
						log.Print("[CollectorManager] DONE\n")
						return
					default:
						log.Print("[CollectorManager] ", c.Name(), " ", t)
						c.Read(cm.duration, cm.output)
					}
				}
			}
		}
	}()
|
||||
log.Print("[CollectorManager] STARTED\n")
|
||||
}
|
||||
|
||||
func (cm *collectorManager) AddOutput(output chan lp.CCMetric) {
|
||||
cm.output = output
|
||||
}
|
||||
|
||||
func (cm *collectorManager) Close() {
|
||||
cm.done <- true
|
||||
log.Print("[CollectorManager] CLOSE")
|
||||
}
|
||||
|
||||
func New(ticker mct.MultiChanTicker, duration time.Duration, wg *sync.WaitGroup, collectConfigFile string) (CollectorManager, error) {
|
||||
cm := &collectorManager{}
|
||||
err := cm.Init(ticker, duration, wg, collectConfigFile)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return cm, err
|
||||
}
|
@@ -2,14 +2,16 @@ package collectors
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"encoding/json"
|
||||
|
||||
"fmt"
|
||||
"log"
|
||||
"os"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
)
|
||||
|
||||
//
|
||||
@@ -33,12 +35,16 @@ type CPUFreqCpuInfoCollectorTopology struct {
|
||||
}
|
||||
|
||||
type CPUFreqCpuInfoCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
topology []CPUFreqCpuInfoCollectorTopology
|
||||
}
|
||||
|
||||
func (m *CPUFreqCpuInfoCollector) Init(config []byte) error {
|
||||
func (m *CPUFreqCpuInfoCollector) Init(config json.RawMessage) error {
|
||||
m.name = "CPUFreqCpuInfoCollector"
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "cpufreq",
|
||||
}
|
||||
|
||||
const cpuInfoFile = "/proc/cpuinfo"
|
||||
file, err := os.Open(cpuInfoFile)
|
||||
@@ -145,7 +151,8 @@ func (m *CPUFreqCpuInfoCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
|
||||
func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -174,9 +181,9 @@ func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, out *[]lp.Mutable
|
||||
log.Printf("Failed to convert cpu MHz to float: %v", err)
|
||||
return
|
||||
}
|
||||
y, err := lp.New("cpufreq", t.tagSet, map[string]interface{}{"value": value}, now)
|
||||
y, err := lp.New("cpufreq", t.tagSet, m.meta, map[string]interface{}{"value": value}, now)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
processorCounter++
|
||||
|
@@ -10,8 +10,7 @@ import (
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
"golang.org/x/sys/unix"
|
||||
)
|
||||
|
||||
@@ -56,14 +55,14 @@ type CPUFreqCollectorTopology struct {
|
||||
// See: https://www.kernel.org/doc/html/latest/admin-guide/pm/cpufreq.html
|
||||
//
|
||||
type CPUFreqCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
topology []CPUFreqCollectorTopology
|
||||
config struct {
|
||||
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
|
||||
}
|
||||
}
|
||||
|
||||
func (m *CPUFreqCollector) Init(config []byte) error {
|
||||
func (m *CPUFreqCollector) Init(config json.RawMessage) error {
|
||||
m.name = "CPUFreqCollector"
|
||||
m.setup()
|
||||
if len(config) > 0 {
|
||||
@@ -72,6 +71,10 @@ func (m *CPUFreqCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "CPU Frequency",
|
||||
}
|
||||
|
||||
// Loop for all CPU directories
|
||||
baseDir := "/sys/devices/system/cpu"
|
||||
@@ -179,7 +182,7 @@ func (m *CPUFreqCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *CPUFreqCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *CPUFreqCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -205,9 +208,9 @@ func (m *CPUFreqCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
continue
|
||||
}
|
||||
|
||||
y, err := lp.New("cpufreq", t.tagSet, map[string]interface{}{"value": cpuFreq}, now)
|
||||
y, err := lp.New("cpufreq", t.tagSet, m.meta, map[string]interface{}{"value": cpuFreq}, now)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@@ -7,8 +7,7 @@ import (
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const CPUSTATFILE = `/proc/stat`
|
||||
@@ -18,13 +17,14 @@ type CpustatCollectorConfig struct {
|
||||
}
|
||||
|
||||
type CpustatCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
config CpustatCollectorConfig
|
||||
}
|
||||
|
||||
func (m *CpustatCollector) Init(config []byte) error {
|
||||
func (m *CpustatCollector) Init(config json.RawMessage) error {
|
||||
m.name = "CpustatCollector"
|
||||
m.setup()
|
||||
m.meta = map[string]string{"source": m.name, "group": "CPU"}
|
||||
if len(config) > 0 {
|
||||
err := json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
@@ -35,7 +35,7 @@ func (m *CpustatCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func ParseStatLine(line string, cpu int, exclude []string, out *[]lp.MutableMetric) {
|
||||
func (c *CpustatCollector) parseStatLine(line string, cpu int, exclude []string, output chan lp.CCMetric) {
|
||||
ls := strings.Fields(line)
|
||||
matches := []string{"", "cpu_user", "cpu_nice", "cpu_system", "cpu_idle", "cpu_iowait", "cpu_irq", "cpu_softirq", "cpu_steal", "cpu_guest", "cpu_guest_nice"}
|
||||
for _, ex := range exclude {
|
||||
@@ -52,16 +52,16 @@ func ParseStatLine(line string, cpu int, exclude []string, out *[]lp.MutableMetr
|
||||
if len(m) > 0 {
|
||||
x, err := strconv.ParseInt(ls[i], 0, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New(m, tags, map[string]interface{}{"value": int(x)}, time.Now())
|
||||
y, err := lp.New(m, tags, c.meta, map[string]interface{}{"value": int(x)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func (m *CpustatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *CpustatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -78,11 +78,11 @@ func (m *CpustatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
}
|
||||
ls := strings.Fields(line)
|
||||
if strings.Compare(ls[0], "cpu") == 0 {
|
||||
ParseStatLine(line, -1, m.config.ExcludeMetrics, out)
|
||||
m.parseStatLine(line, -1, m.config.ExcludeMetrics, output)
|
||||
} else if strings.HasPrefix(ls[0], "cpu") {
|
||||
cpustr := strings.TrimLeft(ls[0], "cpu")
|
||||
cpu, _ := strconv.Atoi(cpustr)
|
||||
ParseStatLine(line, cpu, m.config.ExcludeMetrics, out)
|
||||
m.parseStatLine(line, cpu, m.config.ExcludeMetrics, output)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
23
collectors/cpustatMetric.md
Normal file
@@ -0,0 +1,23 @@
|
||||
|
||||
## `cpustat` collector
|
||||
```json
"cpustat": {
|
||||
"exclude_metrics": [
|
||||
"cpu_idle"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `cpustat` collector reads data from `/proc/stat` and outputs a handful of **node** and **hwthread** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
|
||||
|
||||
Metrics:
|
||||
* `cpu_user`
|
||||
* `cpu_nice`
|
||||
* `cpu_system`
|
||||
* `cpu_idle`
|
||||
* `cpu_iowait`
|
||||
* `cpu_irq`
|
||||
* `cpu_softirq`
|
||||
* `cpu_steal`
|
||||
* `cpu_guest`
|
||||
* `cpu_guest_nice`
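
The mapping from the columns of a `/proc/stat` line to the metric names above can be sketched as follows. This is a simplified illustration, not the collector's code: the function returns a map instead of sending metrics to the output channel, and the sample line is made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// statFields lists the metric names in the column order of a "cpu" line
// in /proc/stat (the first column is the "cpu" or "cpuN" label).
var statFields = []string{
	"cpu_user", "cpu_nice", "cpu_system", "cpu_idle", "cpu_iowait",
	"cpu_irq", "cpu_softirq", "cpu_steal", "cpu_guest", "cpu_guest_nice",
}

// parseStatLine maps the numeric columns of one line to metric names,
// skipping everything listed in exclude (as in "exclude_metrics").
func parseStatLine(line string, exclude []string) map[string]int64 {
	skip := make(map[string]bool, len(exclude))
	for _, e := range exclude {
		skip[e] = true
	}
	fields := strings.Fields(line)
	out := make(map[string]int64)
	for i, name := range statFields {
		if skip[name] || i+1 >= len(fields) {
			continue
		}
		v, err := strconv.ParseInt(fields[i+1], 10, 64)
		if err != nil {
			continue
		}
		out[name] = v
	}
	return out
}

func main() {
	line := "cpu0 4705 150 1120 16250 520 30 45 0 0 0" // made-up sample
	metrics := parseStatLine(line, []string{"cpu_idle"})
	for _, name := range statFields {
		if v, ok := metrics[name]; ok {
			fmt.Println(name, v)
		}
	}
}
```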
|
@@ -9,7 +9,8 @@ import (
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
influx "github.com/influxdata/line-protocol"
|
||||
)
|
||||
|
||||
const CUSTOMCMDPATH = `/home/unrz139/Work/cc-metric-collector/collectors/custom`
|
||||
@@ -21,17 +22,18 @@ type CustomCmdCollectorConfig struct {
|
||||
}
|
||||
|
||||
type CustomCmdCollector struct {
|
||||
MetricCollector
|
||||
handler *lp.MetricHandler
|
||||
parser *lp.Parser
|
||||
metricCollector
|
||||
handler *influx.MetricHandler
|
||||
parser *influx.Parser
|
||||
config CustomCmdCollectorConfig
|
||||
commands []string
|
||||
files []string
|
||||
}
|
||||
|
||||
func (m *CustomCmdCollector) Init(config []byte) error {
|
||||
func (m *CustomCmdCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "CustomCmdCollector"
|
||||
m.meta = map[string]string{"source": m.name, "group": "Custom"}
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
@@ -61,8 +63,8 @@ func (m *CustomCmdCollector) Init(config []byte) error {
|
||||
if len(m.files) == 0 && len(m.commands) == 0 {
|
||||
return errors.New("No metrics to collect")
|
||||
}
|
||||
m.handler = lp.NewMetricHandler()
|
||||
m.parser = lp.NewParser(m.handler)
|
||||
m.handler = influx.NewMetricHandler()
|
||||
m.parser = influx.NewParser(m.handler)
|
||||
m.parser.SetTimeFunc(DefaultTime)
|
||||
m.init = true
|
||||
return nil
|
||||
@@ -72,7 +74,7 @@ var DefaultTime = func() time.Time {
|
||||
return time.Unix(42, 0)
|
||||
}
|
||||
|
||||
func (m *CustomCmdCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *CustomCmdCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -95,9 +97,9 @@ func (m *CustomCmdCollector) Read(interval time.Duration, out *[]lp.MutableMetri
|
||||
if skip {
|
||||
continue
|
||||
}
|
||||
y, err := lp.New(c.Name(), Tags2Map(c), Fields2Map(c), c.Time())
|
||||
y, err := lp.New(c.Name(), Tags2Map(c), m.meta, Fields2Map(c), c.Time())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -117,9 +119,9 @@ func (m *CustomCmdCollector) Read(interval time.Duration, out *[]lp.MutableMetri
|
||||
if skip {
|
||||
continue
|
||||
}
|
||||
y, err := lp.New(f.Name(), Tags2Map(f), Fields2Map(f), f.Time())
|
||||
y, err := lp.New(f.Name(), Tags2Map(f), m.meta, Fields2Map(f), f.Time())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
20
collectors/customCmdMetric.md
Normal file
@@ -0,0 +1,20 @@
|
||||
|
||||
## `customcmd` collector
|
||||
|
||||
```json
|
||||
"customcmd": {
|
||||
"exclude_metrics": [
|
||||
"mymetric"
|
||||
],
|
||||
"files" : [
|
||||
"/var/run/myapp.metrics"
|
||||
],
|
||||
"commands" : [
|
||||
"/usr/local/bin/getmetrics.pl"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The `customcmd` collector reads data from files and from the output of executed commands. The files and commands can output multiple metrics (separated by newlines), but they have to be in the [InfluxDB line protocol](https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/). If a metric is not parsable, it is skipped. If a metric is not required, it can be excluded from being forwarded to the sink.
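
For orientation, a command or script feeding this collector could produce its lines along the following lines. This is a minimal sketch with a hypothetical `formatLine` helper; it emits a single `value` field and does not escape spaces or commas in names and tag values, which the line protocol requires for real-world data.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
	"time"
)

// formatLine renders one metric as a line protocol record:
//   <name>,<tag>=<val>,... value=<v> <timestamp in ns>
// Tags are written in sorted key order. No escaping is performed.
func formatLine(name string, tags map[string]string, value float64, tsNanos int64) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(name)
	for _, k := range keys {
		fmt.Fprintf(&b, ",%s=%s", k, tags[k])
	}
	fmt.Fprintf(&b, " value=%s %d", strconv.FormatFloat(value, 'f', -1, 64), tsNanos)
	return b.String()
}

func main() {
	// One metric per line, separated by newlines.
	now := time.Now().UnixNano()
	fmt.Println(formatLine("mymetric", map[string]string{"type": "node"}, 1.5, now))
	fmt.Println(formatLine("othermetric", map[string]string{"type": "node"}, 42, now))
}
```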
|
||||
|
||||
|
@@ -2,9 +2,7 @@ package collectors
|
||||
|
||||
import (
|
||||
"io/ioutil"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
// "log"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
@@ -21,14 +19,15 @@ type DiskstatCollectorConfig struct {
|
||||
}
|
||||
|
||||
type DiskstatCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
matches map[int]string
|
||||
config DiskstatCollectorConfig
|
||||
}
|
||||
|
||||
func (m *DiskstatCollector) Init(config []byte) error {
|
||||
func (m *DiskstatCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "DiskstatCollector"
|
||||
m.meta = map[string]string{"source": m.name, "group": "Disk"}
|
||||
m.setup()
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
@@ -73,7 +72,7 @@ func (m *DiskstatCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
|
||||
func (m *DiskstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *DiskstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
var lines []string
|
||||
if !m.init {
|
||||
return
|
||||
@@ -101,9 +100,9 @@ func (m *DiskstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric
|
||||
if idx < len(f) {
|
||||
x, err := strconv.ParseInt(f[idx], 0, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New(name, tags, map[string]interface{}{"value": int(x)}, time.Now())
|
||||
y, err := lp.New(name, tags, m.meta, map[string]interface{}{"value": int(x)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
34
collectors/diskstatMetric.md
Normal file
@@ -0,0 +1,34 @@
## `diskstat` collector

```json
  "diskstat": {
    "exclude_metrics": [
      "read_ms"
    ]
  }
```

The `diskstat` collector reads data from `/proc/diskstats` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.

Metrics:
* `reads`
* `reads_merged`
* `read_sectors`
* `read_ms`
* `writes`
* `writes_merged`
* `writes_sectors`
* `writes_ms`
* `ioops`
* `ioops_ms`
* `ioops_weighted_ms`
* `discards`
* `discards_merged`
* `discards_sectors`
* `discards_ms`
* `flushes`
* `flushes_ms`

The device name is added as tag `device`.
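The mapping from `/proc/diskstats` columns to the metric names above can be sketched as follows. This is a simplified illustration rather than the collector's code: `fieldNames` and `parseDiskstatLine` are hypothetical names, only the first five counters are mapped, and the sample line is made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fieldNames maps column indices of a /proc/diskstats line to metric
// names. Columns 0-2 hold the major number, minor number and device name.
var fieldNames = map[int]string{
	3: "reads",
	4: "reads_merged",
	5: "read_sectors",
	6: "read_ms",
	7: "writes",
}

// parseDiskstatLine returns the device name and the mapped counters,
// leaving out every metric listed in exclude.
func parseDiskstatLine(line string, exclude []string) (string, map[string]int64, error) {
	f := strings.Fields(line)
	if len(f) < 3 {
		return "", nil, fmt.Errorf("short line: %q", line)
	}
	skip := make(map[string]bool)
	for _, e := range exclude {
		skip[e] = true
	}
	values := make(map[string]int64)
	for idx, name := range fieldNames {
		if skip[name] || idx >= len(f) {
			continue
		}
		x, err := strconv.ParseInt(f[idx], 10, 64)
		if err != nil {
			continue // unparsable counters are skipped
		}
		values[name] = x
	}
	return f[2], values, nil
}

func main() {
	// Made-up sample line in the /proc/diskstats layout.
	line := "   8       0 sda 1200 35 98000 640 800 12 45000 910 0 1500 1550"
	dev, values, err := parseDiskstatLine(line, []string{"read_ms"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("device:", dev, "metrics:", values)
}
```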
|
@@ -13,18 +13,20 @@ import (
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
type GpfsCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
|
||||
config struct {
|
||||
Mmpmon string `json:"mmpmon"`
|
||||
}
|
||||
}
|
||||
|
||||
func (m *GpfsCollector) Init(config []byte) error {
|
||||
|
||||
func (m *GpfsCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "GpfsCollector"
|
||||
m.setup()
|
||||
@@ -40,6 +42,14 @@ func (m *GpfsCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "GPFS",
|
||||
}
|
||||
m.tags = map[string]string{
|
||||
"type": "node",
|
||||
"filesystem": "",
|
||||
}
|
||||
|
||||
// GPFS / IBM Spectrum Scale file system statistics can only be queried by user root
|
||||
user, err := user.Current()
|
||||
@@ -60,7 +70,7 @@ func (m *GpfsCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *GpfsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -108,6 +118,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
continue
|
||||
}
|
||||
|
||||
m.tags["filesystem"] = filesystem
|
||||
|
||||
|
||||
// return code
|
||||
rc, err := strconv.Atoi(key_value["_rc_"])
|
||||
if err != nil {
|
||||
@@ -140,17 +153,10 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
key_value["_br_"], err.Error())
|
||||
continue
|
||||
}
|
||||
y, err := lp.New(
|
||||
"gpfs_bytes_read",
|
||||
map[string]string{
|
||||
"filesystem": filesystem,
|
||||
},
|
||||
map[string]interface{}{
|
||||
"value": bytesRead,
|
||||
},
|
||||
timestamp)
|
||||
|
||||
y, err := lp.New("gpfs_bytes_read", m.tags, m.meta, map[string]interface{}{"value": bytesRead}, timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
|
||||
// bytes written
|
||||
@@ -161,17 +167,10 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
key_value["_bw_"], err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_bytes_written",
|
||||
map[string]string{
|
||||
"filesystem": filesystem,
|
||||
},
|
||||
map[string]interface{}{
|
||||
"value": bytesWritten,
|
||||
},
|
||||
timestamp)
|
||||
|
||||
y, err = lp.New("gpfs_bytes_written", m.tags, m.meta, map[string]interface{}{"value": bytesWritten}, timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
|
||||
// number of opens
|
||||
@@ -182,17 +181,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
key_value["_oc_"], err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_opens",
|
||||
map[string]string{
|
||||
"filesystem": filesystem,
|
||||
},
|
||||
map[string]interface{}{
|
||||
"value": numOpens,
|
||||
},
|
||||
timestamp)
|
||||
y, err = lp.New("gpfs_num_opens", m.tags, m.meta, map[string]interface{}{"value": numOpens}, timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
|
||||
// number of closes
|
||||
@@ -201,17 +192,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of closes: %s\n", err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_closes",
|
||||
map[string]string{
|
||||
"filesystem": filesystem,
|
||||
},
|
||||
map[string]interface{}{
|
||||
"value": numCloses,
|
||||
},
|
||||
timestamp)
|
||||
y, err = lp.New("gpfs_num_closes", m.tags, m.meta, map[string]interface{}{"value": numCloses}, timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
|
||||
// number of reads
|
||||
@@ -220,17 +203,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of reads: %s\n", err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_reads",
|
||||
map[string]string{
|
||||
"filesystem": filesystem,
|
||||
},
|
||||
map[string]interface{}{
|
||||
"value": numReads,
|
||||
},
|
||||
timestamp)
|
||||
y, err = lp.New("gpfs_num_reads", m.tags, m.meta, map[string]interface{}{"value": numReads}, timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
|
||||
// number of writes
|
||||
@@ -239,17 +214,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of writes: %s\n", err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_writes",
|
||||
map[string]string{
|
||||
"filesystem": filesystem,
|
||||
},
|
||||
map[string]interface{}{
|
||||
"value": numWrites,
|
||||
},
|
||||
timestamp)
|
||||
y, err = lp.New("gpfs_num_writes", m.tags, m.meta, map[string]interface{}{"value": numWrites}, timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
|
||||
// number of read directories
|
||||
@@ -258,17 +225,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert number of read directories: %s\n", err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_readdirs",
|
||||
map[string]string{
|
||||
"filesystem": filesystem,
|
||||
},
|
||||
map[string]interface{}{
|
||||
"value": numReaddirs,
|
||||
},
|
||||
timestamp)
|
||||
y, err = lp.New("gpfs_num_readdirs", m.tags, m.meta, map[string]interface{}{"value": numReaddirs}, timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
|
||||
// Number of inode updates
|
||||
@@ -277,17 +236,9 @@ func (m *GpfsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
fmt.Fprintf(os.Stderr, "GpfsCollector.Read(): Failed to convert Number of inode updates: %s\n", err.Error())
|
||||
continue
|
||||
}
|
||||
y, err = lp.New(
|
||||
"gpfs_num_inode_updates",
|
||||
map[string]string{
|
||||
"filesystem": filesystem,
|
||||
},
|
||||
map[string]interface{}{
|
||||
"value": numInodeUpdates,
|
||||
},
|
||||
timestamp)
|
||||
y, err = lp.New("gpfs_num_inode_updates", m.tags, m.meta, map[string]interface{}{"value": numInodeUpdates}, timestamp)
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@@ -5,9 +5,7 @@ import (
|
||||
"io/ioutil"
|
||||
"log"
|
||||
"os/exec"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
// "os"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
@@ -28,7 +26,7 @@ type InfinibandCollectorConfig struct {
|
||||
}
|
||||
|
||||
type InfinibandCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
lids map[string]map[string]string
|
||||
config InfinibandCollectorConfig
|
||||
@@ -56,11 +54,12 @@ func (m *InfinibandCollector) Help() {
|
||||
fmt.Println("- ib_xmit_pkts")
|
||||
}
|
||||
|
||||
func (m *InfinibandCollector) Init(config []byte) error {
|
||||
func (m *InfinibandCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "InfinibandCollector"
|
||||
m.use_perfquery = false
|
||||
m.setup()
|
||||
m.meta = map[string]string{"source": m.name, "group": "Network"}
|
||||
m.tags = map[string]string{"type": "node"}
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
@@ -117,7 +116,7 @@ func (m *InfinibandCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
|
||||
func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[string]string, out *[]lp.MutableMetric) error {
|
||||
func (m *InfinibandCollector) doPerfQuery(cmd string, dev string, lid string, port string, tags map[string]string, output chan lp.CCMetric) error {
|
||||
|
||||
args := fmt.Sprintf("-r %s %s 0xf000", lid, port)
|
||||
command := exec.Command(cmd, args)
|
||||
@@ -134,9 +133,9 @@ func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[strin
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
y, err := lp.New("ib_recv", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -144,9 +143,9 @@ func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[strin
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
y, err := lp.New("ib_xmit", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -154,9 +153,9 @@ func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[strin
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
y, err := lp.New("ib_recv_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -164,9 +163,29 @@ func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[strin
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
y, err := lp.New("ib_xmit_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
if strings.HasPrefix(line, "PortRcvPkts") || strings.HasPrefix(line, "RcvPkts") {
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
if strings.HasPrefix(line, "PortXmitPkts") || strings.HasPrefix(line, "XmtPkts") {
|
||||
lv := strings.Fields(line)
|
||||
v, err := strconv.ParseFloat(lv[1], 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -174,16 +193,16 @@ func DoPerfQuery(cmd string, dev string, lid string, port string, tags map[strin
|
||||
return nil
|
||||
}
|
||||
|
||||
func DoSysfsRead(dev string, lid string, port string, tags map[string]string, out *[]lp.MutableMetric) error {
|
||||
func (m *InfinibandCollector) doSysfsRead(dev string, lid string, port string, tags map[string]string, output chan lp.CCMetric) error {
|
||||
path := fmt.Sprintf("%s/%s/ports/%s/counters/", string(IBBASEPATH), dev, port)
|
||||
buffer, err := ioutil.ReadFile(fmt.Sprintf("%s/port_rcv_data", path))
|
||||
if err == nil {
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
y, err := lp.New("ib_recv", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -192,9 +211,9 @@ func DoSysfsRead(dev string, lid string, port string, tags map[string]string, ou
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
y, err := lp.New("ib_xmit", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -203,9 +222,9 @@ func DoSysfsRead(dev string, lid string, port string, tags map[string]string, ou
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_recv_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
y, err := lp.New("ib_recv_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -214,71 +233,29 @@ func DoSysfsRead(dev string, lid string, port string, tags map[string]string, ou
|
||||
data := strings.Replace(string(buffer), "\n", "", -1)
|
||||
v, err := strconv.ParseFloat(data, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New("ib_xmit_pkts", tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
y, err := lp.New("ib_xmit_pkts", tags, m.meta, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *InfinibandCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *InfinibandCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
|
||||
if m.init {
|
||||
for dev, ports := range m.lids {
|
||||
for port, lid := range ports {
|
||||
tags := map[string]string{"type": "node", "device": dev, "port": port}
|
||||
if m.use_perfquery {
|
||||
DoPerfQuery(m.config.PerfQueryPath, dev, lid, port, tags, out)
|
||||
m.doPerfQuery(m.config.PerfQueryPath, dev, lid, port, tags, output)
|
||||
} else {
|
||||
DoSysfsRead(dev, lid, port, tags, out)
|
||||
m.doSysfsRead(dev, lid, port, tags, output)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// buffer, err := ioutil.ReadFile(string(LIDFILE))
|
||||
|
||||
// if err != nil {
|
||||
// log.Print(err)
|
||||
// return
|
||||
// }
|
||||
|
||||
// args := fmt.Sprintf("-r %s 1 0xf000", string(buffer))
|
||||
|
||||
// command := exec.Command(PERFQUERY, args)
|
||||
// command.Wait()
|
||||
// stdout, err := command.Output()
|
||||
// if err != nil {
|
||||
// log.Print(err)
|
||||
// return
|
||||
// }
|
||||
|
||||
// ll := strings.Split(string(stdout), "\n")
|
||||
|
||||
// for _, line := range ll {
|
||||
// if strings.HasPrefix(line, "PortRcvData") || strings.HasPrefix(line, "RcvData") {
|
||||
// lv := strings.Fields(line)
|
||||
// v, err := strconv.ParseFloat(lv[1], 64)
|
||||
// if err == nil {
|
||||
// y, err := lp.New("ib_recv", m.tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
// if err == nil {
|
||||
// *out = append(*out, y)
|
||||
// }
|
||||
// }
|
||||
// }
|
||||
// if strings.HasPrefix(line, "PortXmitData") || strings.HasPrefix(line, "XmtData") {
|
||||
// lv := strings.Fields(line)
|
||||
// v, err := strconv.ParseFloat(lv[1], 64)
|
||||
// if err == nil {
|
||||
// y, err := lp.New("ib_xmit", m.tags, map[string]interface{}{"value": float64(v)}, time.Now())
|
||||
// if err == nil {
|
||||
// *out = append(*out, y)
|
||||
// }
|
||||
// }
|
||||
// }
|
||||
// }
|
||||
}
|
||||
|
||||
func (m *InfinibandCollector) Close() {
|
||||
|
19
collectors/infinibandMetric.md
Normal file
@@ -0,0 +1,19 @@
## `ibstat` collector

```json
  "ibstat": {
    "perfquery_path" : "<path to perfquery command>",
    "exclude_devices": [
      "mlx4"
    ]
  }
```

The `ibstat` collector reads data either through the `perfquery` command or from the sysfs files below `/sys/class/infiniband/<device>`.

Metrics:
* `ib_recv`
* `ib_xmit`

The collector adds a `device` tag to all metrics.
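The sysfs variant boils down to reading a file that contains a single number. A minimal sketch of that step, assuming the `<device>/ports/<port>/counters/` layout used in the diff above; `counterPath`, `parseCounter` and `readCounter` are hypothetical helper names, and `mlx5_0` and port `1` are placeholders.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// counterPath builds the sysfs path of one port counter below base.
func counterPath(base, dev, port, counter string) string {
	return filepath.Join(base, dev, "ports", port, "counters", counter)
}

// parseCounter converts the raw file content (a number followed by a
// newline) into a float64.
func parseCounter(raw []byte) (float64, error) {
	return strconv.ParseFloat(strings.TrimSpace(string(raw)), 64)
}

// readCounter reads and parses one counter file.
func readCounter(path string) (float64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseCounter(raw)
}

func main() {
	// Device and port are placeholders; adjust them to the local system.
	p := counterPath("/sys/class/infiniband", "mlx5_0", "1", "port_rcv_data")
	v, err := readCounter(p)
	if err != nil {
		fmt.Println("cannot read counter:", err)
		return
	}
	fmt.Println("ib_recv", v)
}
```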
|
@@ -9,8 +9,7 @@ import (
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const IPMITOOL_PATH = `/usr/bin/ipmitool`
|
||||
@@ -23,15 +22,16 @@ type IpmiCollectorConfig struct {
|
||||
}
|
||||
|
||||
type IpmiCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
matches map[string]string
|
||||
config IpmiCollectorConfig
|
||||
}
|
||||
|
||||
func (m *IpmiCollector) Init(config []byte) error {
|
||||
func (m *IpmiCollector) Init(config json.RawMessage) error {
|
||||
m.name = "IpmiCollector"
|
||||
m.setup()
|
||||
m.meta = map[string]string{"source": m.name, "group": "IPMI"}
|
||||
if len(config) > 0 {
|
||||
err := json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
@@ -53,7 +53,7 @@ func (m *IpmiCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func ReadIpmiTool(cmd string, out *[]lp.MutableMetric) {
|
||||
func (m *IpmiCollector) readIpmiTool(cmd string, output chan lp.CCMetric) {
|
||||
command := exec.Command(cmd, "sensor")
|
||||
command.Wait()
|
||||
stdout, err := command.Output()
|
||||
@@ -74,24 +74,25 @@ func ReadIpmiTool(cmd string, out *[]lp.MutableMetric) {
|
||||
name := strings.ToLower(strings.Replace(strings.Trim(lv[0], " "), " ", "_", -1))
|
||||
unit := strings.Trim(lv[2], " ")
|
||||
if unit == "Volts" {
|
||||
unit = "V"
|
||||
unit = "Volts"
|
||||
} else if unit == "degrees C" {
|
||||
unit = "C"
|
||||
unit = "degC"
|
||||
} else if unit == "degrees F" {
|
||||
unit = "F"
|
||||
unit = "degF"
|
||||
} else if unit == "Watts" {
|
||||
unit = "W"
|
||||
unit = "Watts"
|
||||
}
|
||||
|
||||
y, err := lp.New(name, map[string]string{"unit": unit, "type": "node"}, map[string]interface{}{"value": v}, time.Now())
|
||||
y, err := lp.New(name, map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": v}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
y.AddMeta("unit", unit)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func ReadIpmiSensors(cmd string, out *[]lp.MutableMetric) {
|
||||
func (m *IpmiCollector) readIpmiSensors(cmd string, output chan lp.CCMetric) {
|
||||
|
||||
command := exec.Command(cmd, "--comma-separated-output", "--sdr-cache-recreate")
|
||||
command.Wait()
|
||||
@@ -109,25 +110,28 @@ func ReadIpmiSensors(cmd string, out *[]lp.MutableMetric) {
|
||||
v, err := strconv.ParseFloat(lv[3], 64)
|
||||
if err == nil {
|
||||
name := strings.ToLower(strings.Replace(lv[1], " ", "_", -1))
|
||||
y, err := lp.New(name, map[string]string{"unit": lv[4], "type": "node"}, map[string]interface{}{"value": v}, time.Now())
|
||||
y, err := lp.New(name, map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": v}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
if len(lv) > 4 {
|
||||
y.AddMeta("unit", lv[4])
|
||||
}
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func (m *IpmiCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *IpmiCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if len(m.config.IpmitoolPath) > 0 {
|
||||
_, err := os.Stat(m.config.IpmitoolPath)
|
||||
if err == nil {
|
||||
ReadIpmiTool(m.config.IpmitoolPath, out)
|
||||
m.readIpmiTool(m.config.IpmitoolPath, output)
|
||||
}
|
||||
} else if len(m.config.IpmisensorsPath) > 0 {
|
||||
_, err := os.Stat(m.config.IpmisensorsPath)
|
||||
if err == nil {
|
||||
ReadIpmiSensors(m.config.IpmisensorsPath, out)
|
||||
m.readIpmiSensors(m.config.IpmisensorsPath, output)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
16
collectors/ipmiMetric.md
Normal file
@@ -0,0 +1,16 @@
## `ipmistat` collector

```json
  "ipmistat": {
    "ipmitool_path": "/path/to/ipmitool",
    "ipmisensors_path": "/path/to/ipmi-sensors"
  }
```

The `ipmistat` collector reads data from `ipmitool` (`ipmitool sensor`) or `ipmi-sensors` (`ipmi-sensors --sdr-cache-recreate --comma-separated-output`).

The metrics depend on the output of the underlying tools but include temperature, power and energy metrics.
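To make the `ipmitool sensor` path concrete, the sketch below parses one `|`-separated output line into a metric name, value and unit. It follows the steps visible in the diff above (split on `|`, lower-case the name, shorten the temperature units), but `parseSensorLine` is a hypothetical helper and the sample lines are made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSensorLine parses one line of `ipmitool sensor` output.
// ok is false if the line is too short or the reading is not a number.
func parseSensorLine(line string) (name string, value float64, unit string, ok bool) {
	lv := strings.Split(line, "|")
	if len(lv) < 3 {
		return "", 0, "", false
	}
	v, err := strconv.ParseFloat(strings.TrimSpace(lv[1]), 64)
	if err != nil {
		return "", 0, "", false // e.g. "na" for sensors without a reading
	}
	name = strings.ToLower(strings.ReplaceAll(strings.TrimSpace(lv[0]), " ", "_"))
	unit = strings.TrimSpace(lv[2])
	switch unit {
	case "degrees C":
		unit = "degC"
	case "degrees F":
		unit = "degF"
	}
	return name, v, unit, true
}

func main() {
	lines := []string{
		"CPU Temp         | 45.000     | degrees C  | ok",
		"FAN3             | na         |            | na",
	}
	for _, l := range lines {
		if name, v, unit, ok := parseSensorLine(l); ok {
			fmt.Println(name, v, unit)
		}
	}
}
```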
|
@@ -20,16 +20,28 @@ import (
|
||||
"strings"
|
||||
"time"
|
||||
"unsafe"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
"gopkg.in/Knetic/govaluate.v2"
|
||||
)
|
||||
|
||||
type MetricScope int
|
||||
|
||||
const (
|
||||
METRIC_SCOPE_HWTHREAD = iota
|
||||
METRIC_SCOPE_SOCKET
|
||||
METRIC_SCOPE_NUMA
|
||||
METRIC_SCOPE_NODE
|
||||
)
|
||||
|
||||
func (ms MetricScope) String() string {
|
||||
return []string{"hwthread", "socket", "numa", "node"}[ms]
|
||||
}
|
||||
|
||||
type LikwidCollectorMetricConfig struct {
|
||||
Name string `json:"name"`
|
||||
Calc string `json:"calc"`
|
||||
Socket_scope bool `json:"socket_scope"`
|
||||
Publish bool `json:"publish"`
|
||||
Name string `json:"name"`
|
||||
Calc string `json:"calc"`
|
||||
Scope MetricScope `json:"socket_scope"`
|
||||
Publish bool `json:"publish"`
|
||||
}
|
||||
|
||||
type LikwidCollectorEventsetConfig struct {
|
||||
@@ -45,7 +57,7 @@ type LikwidCollectorConfig struct {
|
||||
}
|
||||
|
||||
type LikwidCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
cpulist []C.int
|
||||
sock2tid map[int]int
|
||||
metrics map[C.int]map[string]int
|
||||
@@ -105,7 +117,7 @@ func getSocketCpus() map[C.int]int {
|
||||
return outmap
|
||||
}
|
||||
|
||||
func (m *LikwidCollector) Init(config []byte) error {
|
||||
func (m *LikwidCollector) Init(config json.RawMessage) error {
|
||||
var ret C.int
|
||||
m.name = "LikwidCollector"
|
||||
if len(config) > 0 {
|
||||
@@ -115,11 +127,13 @@ func (m *LikwidCollector) Init(config []byte) error {
|
||||
}
|
||||
}
|
||||
m.setup()
|
||||
m.meta = map[string]string{"source": m.name, "group": "PerfCounter"}
|
||||
cpulist := CpuList()
|
||||
m.cpulist = make([]C.int, len(cpulist))
|
||||
slist := getSocketCpus()
|
||||
|
||||
m.sock2tid = make(map[int]int)
|
||||
// m.numa2tid = make(map[int]int)
|
||||
for i, c := range cpulist {
|
||||
m.cpulist[i] = C.int(c)
|
||||
if sid, found := slist[m.cpulist[i]]; found {
|
||||
@@ -169,7 +183,7 @@ func (m *LikwidCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *LikwidCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *LikwidCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -246,24 +260,28 @@ func (m *LikwidCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
for _, metric := range evset.Metrics {
|
||||
_, skip := stringArrayContains(m.config.ExcludeMetrics, metric.Name)
|
||||
if metric.Publish && !skip {
|
||||
if metric.Socket_scope {
|
||||
if metric.Scope.String() == "socket" {
|
||||
for sid, tid := range m.sock2tid {
|
||||
y, err := lp.New(metric.Name,
|
||||
map[string]string{"type": "socket", "type-id": fmt.Sprintf("%d", int(sid))},
|
||||
map[string]string{"type": "socket",
|
||||
"type-id": fmt.Sprintf("%d", int(sid))},
|
||||
m.meta,
|
||||
map[string]interface{}{"value": m.mresults[i][tid][metric.Name]},
|
||||
time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
} else {
|
||||
} else if metric.Scope.String() == "hwthread" {
|
||||
for tid, cpu := range m.cpulist {
|
||||
y, err := lp.New(metric.Name,
|
||||
map[string]string{"type": "cpu", "type-id": fmt.Sprintf("%d", int(cpu))},
|
||||
map[string]string{"type": "cpu",
|
||||
"type-id": fmt.Sprintf("%d", int(cpu))},
|
||||
m.meta,
|
||||
map[string]interface{}{"value": m.mresults[i][tid][metric.Name]},
|
||||
time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -273,24 +291,28 @@ func (m *LikwidCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
for _, metric := range m.config.Metrics {
|
||||
_, skip := stringArrayContains(m.config.ExcludeMetrics, metric.Name)
|
||||
if metric.Publish && !skip {
|
||||
if metric.Socket_scope {
|
||||
if metric.Scope.String() == "socket" {
|
||||
for sid, tid := range m.sock2tid {
|
||||
y, err := lp.New(metric.Name,
|
||||
map[string]string{"type": "socket", "type-id": fmt.Sprintf("%d", int(sid))},
|
||||
map[string]string{"type": "socket",
|
||||
"type-id": fmt.Sprintf("%d", int(sid))},
|
||||
m.meta,
|
||||
map[string]interface{}{"value": m.gmresults[tid][metric.Name]},
|
||||
time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
} else {
|
||||
for tid, cpu := range m.cpulist {
|
||||
y, err := lp.New(metric.Name,
|
||||
map[string]string{"type": "cpu", "type-id": fmt.Sprintf("%d", int(cpu))},
|
||||
map[string]string{"type": "cpu",
|
||||
"type-id": fmt.Sprintf("%d", int(cpu))},
|
||||
m.meta,
|
||||
map[string]interface{}{"value": m.gmresults[tid][metric.Name]},
|
||||
time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
119
collectors/likwidMetric.md
Normal file
@@ -0,0 +1,119 @@
|
||||
|
||||
## `likwid` collector
|
||||
```json
|
||||
"likwid": {
|
||||
"eventsets": [
|
||||
{
|
||||
"events": {
|
||||
"FIXC1": "ACTUAL_CPU_CLOCK",
|
||||
"FIXC2": "MAX_CPU_CLOCK",
|
||||
"PMC0": "RETIRED_INSTRUCTIONS",
|
||||
"PMC1": "CPU_CLOCKS_UNHALTED",
|
||||
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
|
||||
"PMC3": "MERGE",
|
||||
"DFC0": "DRAM_CHANNEL_0",
|
||||
"DFC1": "DRAM_CHANNEL_1",
|
||||
"DFC2": "DRAM_CHANNEL_2",
|
||||
"DFC3": "DRAM_CHANNEL_3"
|
||||
},
|
||||
"metrics": [
|
||||
{
|
||||
"name": "ipc",
|
||||
"calc": "PMC0/PMC1",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "flops_any",
|
||||
"calc": "0.000001*PMC2/time",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "clock_mhz",
|
||||
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "mem1",
|
||||
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
|
||||
"socket_scope": true,
|
||||
"publish": false
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"events": {
|
||||
"DFC0": "DRAM_CHANNEL_4",
|
||||
"DFC1": "DRAM_CHANNEL_5",
|
||||
"DFC2": "DRAM_CHANNEL_6",
|
||||
"DFC3": "DRAM_CHANNEL_7",
|
||||
"PWR0": "RAPL_CORE_ENERGY",
|
||||
"PWR1": "RAPL_PKG_ENERGY"
|
||||
},
|
||||
"metrics": [
|
||||
{
|
||||
"name": "pwr_core",
|
||||
"calc": "PWR0/time",
|
||||
"socket_scope": false,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "pwr_pkg",
|
||||
"calc": "PWR1/time",
|
||||
"socket_scope": true,
|
||||
"publish": true
|
||||
},
|
||||
{
|
||||
"name": "mem2",
|
||||
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
|
||||
"socket_scope": true,
|
||||
"publish": false
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"globalmetrics": [
|
||||
{
|
||||
"name": "mem_bw",
|
||||
"calc": "mem1+mem2",
|
||||
"socket_scope": true,
|
||||
"publish": true
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||

_Example config suitable for AMD Zen3_

The `likwid` collector reads hardware performance counters at the **hwthread** and **socket** level. The configuration looks complicated, but it is largely copy and paste from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). Earlier iterations of the collector tried to use the performance groups directly, but that approach lacked flexibility. The current configuration scheme provides the most flexibility.

The logic is as follows: there are multiple eventsets, each consisting of a list of counters+events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
```
EVENTSET                         ->   "events": {
FIXC1 ACTUAL_CPU_CLOCK           ->     "FIXC1": "ACTUAL_CPU_CLOCK",
FIXC2 MAX_CPU_CLOCK              ->     "FIXC2": "MAX_CPU_CLOCK",
PMC0  RETIRED_INSTRUCTIONS       ->     "PMC0" : "RETIRED_INSTRUCTIONS",
PMC1  CPU_CLOCKS_UNHALTED        ->     "PMC1" : "CPU_CLOCKS_UNHALTED",
PMC2  RETIRED_SSE_AVX_FLOPS_ALL  ->     "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
PMC3  MERGE                      ->     "PMC3": "MERGE",
                                 ->   }
```

The metrics follow the same procedure:

```
METRICS          ->   "metrics": [
IPC PMC0/PMC1    ->     {
                 ->       "name" : "IPC",
                 ->       "calc" : "PMC0/PMC1",
                 ->       "socket_scope": false,
                 ->       "publish": true
                 ->     }
                 ->   ]
```

The `socket_scope` option determines whether a metric is submitted per socket (`true`) or per hwthread (`false`). If a metric is only used for internal calculations, set `"publish": false`.

Some metrics can only be gathered in multiple measurements (like the memory bandwidth on AMD Zen3 chips). For these, configure multiple eventsets as in the example config and use the `globalmetrics` section to combine them. **Be aware** that the combination might be misleading: the eventsets are measured one after another, so the individual measurements might cover different computing phases of the application.
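
To make the `calc` expressions more concrete, here is a minimal Go sketch of what the `mem2` and `mem_bw` entries from the example config compute. It is not the collector's implementation (the collector evaluates the expression strings at runtime). The function name `memBandwidthMBs`, the counter readings, and the assumption that `mem1` covers DRAM channels 0-3 are all hypothetical. The factor `64.0` is taken from the config and is presumably the number of bytes per counted event.

```go
package main

import "fmt"

// memBandwidthMBs mirrors the "calc" expression
//   0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time
// from the example config: four DRAM channel counters, scaled to MByte/s.
func memBandwidthMBs(dfc [4]float64, seconds float64) float64 {
	sum := 0.0
	for _, c := range dfc {
		sum += c
	}
	return 0.000001 * sum * 64.0 / seconds
}

func main() {
	// Hypothetical counter readings from two eventsets that were
	// measured one after another, each over a 2 second interval.
	mem1 := memBandwidthMBs([4]float64{1e6, 1e6, 1e6, 1e6}, 2.0) // assumed: channels 0-3
	mem2 := memBandwidthMBs([4]float64{2e6, 2e6, 2e6, 2e6}, 2.0) // channels 4-7

	// The globalmetrics entry "mem_bw" uses the calc string "mem1+mem2".
	fmt.Printf("mem1=%.1f mem2=%.1f mem_bw=%.1f MByte/s\n", mem1, mem2, mem1+mem2)
}
```

Because `mem1` and `mem2` stem from different measurement intervals, their sum is only meaningful if the workload behaves similarly in both intervals.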
|
@@ -6,8 +6,7 @@ import (
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const LOADAVGFILE = `/proc/loadavg`
|
||||
@@ -17,14 +16,14 @@ type LoadavgCollectorConfig struct {
|
||||
}
|
||||
|
||||
type LoadavgCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
load_matches []string
|
||||
proc_matches []string
|
||||
config LoadavgCollectorConfig
|
||||
}
|
||||
|
||||
func (m *LoadavgCollector) Init(config []byte) error {
|
||||
func (m *LoadavgCollector) Init(config json.RawMessage) error {
|
||||
m.name = "LoadavgCollector"
|
||||
m.setup()
|
||||
if len(config) > 0 {
|
||||
@@ -33,6 +32,7 @@ func (m *LoadavgCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{"source": m.name, "group": "LOAD"}
|
||||
m.tags = map[string]string{"type": "node"}
|
||||
m.load_matches = []string{"load_one", "load_five", "load_fifteen"}
|
||||
m.proc_matches = []string{"proc_run", "proc_total"}
|
||||
@@ -40,7 +40,7 @@ func (m *LoadavgCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *LoadavgCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *LoadavgCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
var skip bool
|
||||
if !m.init {
|
||||
return
|
||||
@@ -56,9 +56,9 @@ func (m *LoadavgCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
x, err := strconv.ParseFloat(ls[i], 64)
|
||||
if err == nil {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, name)
|
||||
y, err := lp.New(name, m.tags, map[string]interface{}{"value": float64(x)}, time.Now())
|
||||
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": float64(x)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -67,9 +67,9 @@ func (m *LoadavgCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
x, err := strconv.ParseFloat(lv[i], 64)
|
||||
if err == nil {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, name)
|
||||
y, err := lp.New(name, m.tags, map[string]interface{}{"value": float64(x)}, time.Now())
|
||||
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": float64(x)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
19
collectors/loadavgMetric.md
Normal file
@@ -0,0 +1,19 @@

## `loadavg` collector

```json
"loadavg": {
  "exclude_metrics": [
    "proc_run"
  ]
}
```

The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful of **node** metrics. Metrics that are not required can be excluded so that they are not forwarded to the sink.

Metrics:
* `load_one`
* `load_five`
* `load_fifteen`
* `proc_run`
* `proc_total`
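
The following Go sketch shows how the five metrics relate to the fields of a `/proc/loadavg` line. It is an illustration only, not the collector's code: the helper name `parseLoadavg` and the sample line are assumptions, and the exclusion check is simplified to a map lookup.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLoadavg splits one line in /proc/loadavg format, e.g.
// "0.20 0.18 0.12 1/80 11206", into the five metrics of the collector.
func parseLoadavg(line string) (map[string]float64, error) {
	fields := strings.Fields(line)
	if len(fields) < 4 {
		return nil, fmt.Errorf("unexpected loadavg format: %q", line)
	}
	out := make(map[string]float64)
	// The first three fields are the 1, 5 and 15 minute load averages.
	for i, name := range []string{"load_one", "load_five", "load_fifteen"} {
		v, err := strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return nil, err
		}
		out[name] = v
	}
	// The fourth field has the form "<running>/<total>".
	procs := strings.Split(fields[3], "/")
	if len(procs) != 2 {
		return nil, fmt.Errorf("unexpected process field: %q", fields[3])
	}
	for i, name := range []string{"proc_run", "proc_total"} {
		v, err := strconv.ParseFloat(procs[i], 64)
		if err != nil {
			return nil, err
		}
		out[name] = v
	}
	return out, nil
}

func main() {
	metrics, err := parseLoadavg("0.20 0.18 0.12 1/80 11206")
	if err != nil {
		fmt.Println(err)
		return
	}
	// Mimic "exclude_metrics": ["proc_run"] from the example config.
	exclude := map[string]bool{"proc_run": true}
	for _, name := range []string{"load_one", "load_five", "load_fifteen", "proc_run", "proc_total"} {
		if exclude[name] {
			continue
		}
		fmt.Printf("%s value=%v\n", name, metrics[name])
	}
}
```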
|
@@ -8,8 +8,7 @@ import (
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const LUSTREFILE = `/proc/fs/lustre/llite/lnec-XXXXXX/stats`
|
||||
@@ -20,14 +19,14 @@ type LustreCollectorConfig struct {
|
||||
}
|
||||
|
||||
type LustreCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
matches map[string]map[string]int
|
||||
devices []string
|
||||
config LustreCollectorConfig
|
||||
}
|
||||
|
||||
func (m *LustreCollector) Init(config []byte) error {
|
||||
func (m *LustreCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "LustreCollector"
|
||||
if len(config) > 0 {
|
||||
@@ -38,6 +37,7 @@ func (m *LustreCollector) Init(config []byte) error {
|
||||
}
|
||||
m.setup()
|
||||
m.tags = map[string]string{"type": "node"}
|
||||
m.meta = map[string]string{"source": m.name, "group": "Lustre"}
|
||||
m.matches = map[string]map[string]int{"read_bytes": {"read_bytes": 6, "read_requests": 1},
|
||||
"write_bytes": {"write_bytes": 6, "write_requests": 1},
|
||||
"open": {"open": 1},
|
||||
@@ -64,7 +64,7 @@ func (m *LustreCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *LustreCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *LustreCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -88,9 +88,12 @@ func (m *LustreCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
}
|
||||
x, err := strconv.ParseInt(lf[idx], 0, 64)
|
||||
if err == nil {
|
||||
y, err := lp.New(name, m.tags, map[string]interface{}{"value": x}, time.Now())
|
||||
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": x}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
if strings.Contains(name, "byte") {
|
||||
y.AddMeta("unit", "Byte")
|
||||
}
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
29
collectors/lustreMetric.md
Normal file
@@ -0,0 +1,29 @@

## `lustrestat` collector

```json
"lustrestat": {
  "procfiles" : [
    "/proc/fs/lustre/llite/lnec-XXXXXX/stats"
  ],
  "exclude_metrics": [
    "setattr",
    "getattr"
  ]
}
```

The `lustrestat` collector reads from the procfs stat files for Lustre like `/proc/fs/lustre/llite/lnec-XXXXXX/stats`.

Metrics:
* `read_bytes`
* `read_requests`
* `write_bytes`
* `write_requests`
* `open`
* `close`
* `getattr`
* `setattr`
* `statfs`
* `inode_permission`

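
The collector code maps the first field of each stats line to one or more metrics together with the field index each value is read from (for example `read_bytes` from index 6 and `read_requests` from index 1). The Go sketch below illustrates that lookup. The helper name `parseStatsLine` and the sample lines are hypothetical, and the column layout of the sample lines is an assumption about the Lustre stats format rather than something stated in this document.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// matches maps the first field of a stats line to the metrics derived from
// that line and the field index of each value (subset of the collector's table).
var matches = map[string]map[string]int{
	"read_bytes":  {"read_bytes": 6, "read_requests": 1},
	"write_bytes": {"write_bytes": 6, "write_requests": 1},
	"open":        {"open": 1},
}

// parseStatsLine returns all metrics that can be derived from a single line.
// Lines with an unknown key or too few fields yield an empty map.
func parseStatsLine(line string) map[string]int64 {
	out := make(map[string]int64)
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return out
	}
	for name, idx := range matches[fields[0]] {
		if idx >= len(fields) {
			continue
		}
		if v, err := strconv.ParseInt(fields[idx], 0, 64); err == nil {
			out[name] = v
		}
	}
	return out
}

func main() {
	// Hypothetical stats lines; the column layout is an assumption.
	lines := []string{
		"read_bytes 17 samples [bytes] 4096 1048576 9437184",
		"open 42 samples [regs]",
	}
	for _, l := range lines {
		fmt.Println(parseStatsLine(l))
	}
}
```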
|
@@ -9,8 +9,7 @@ import (
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const MEMSTATFILE = `/proc/meminfo`
|
||||
@@ -20,14 +19,14 @@ type MemstatCollectorConfig struct {
|
||||
}
|
||||
|
||||
type MemstatCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
stats map[string]int64
|
||||
tags map[string]string
|
||||
matches map[string]string
|
||||
config MemstatCollectorConfig
|
||||
}
|
||||
|
||||
func (m *MemstatCollector) Init(config []byte) error {
|
||||
func (m *MemstatCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "MemstatCollector"
|
||||
if len(config) > 0 {
|
||||
@@ -36,6 +35,7 @@ func (m *MemstatCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{"source": m.name, "group": "Memory", "unit": "kByte"}
|
||||
m.stats = make(map[string]int64)
|
||||
m.matches = make(map[string]string)
|
||||
m.tags = map[string]string{"type": "node"}
|
||||
@@ -65,7 +65,7 @@ func (m *MemstatCollector) Init(config []byte) error {
|
||||
return err
|
||||
}
|
||||
|
||||
func (m *MemstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *MemstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -97,9 +97,9 @@ func (m *MemstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
log.Print(err)
|
||||
continue
|
||||
}
|
||||
y, err := lp.New(name, m.tags, map[string]interface{}{"value": int(float64(m.stats[match]) * 1.0e-3)}, time.Now())
|
||||
y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": int(float64(m.stats[match]) * 1.0e-3)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
@@ -108,18 +108,18 @@ func (m *MemstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
if _, cached := m.stats[`Cached`]; cached {
|
||||
memUsed := m.stats[`MemTotal`] - (m.stats[`MemFree`] + m.stats[`Buffers`] + m.stats[`Cached`])
|
||||
_, skip := stringArrayContains(m.config.ExcludeMetrics, "mem_used")
|
||||
y, err := lp.New("mem_used", m.tags, map[string]interface{}{"value": int(float64(memUsed) * 1.0e-3)}, time.Now())
|
||||
y, err := lp.New("mem_used", m.tags, m.meta, map[string]interface{}{"value": int(float64(memUsed) * 1.0e-3)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
if _, found := m.stats[`MemShared`]; found {
|
||||
_, skip := stringArrayContains(m.config.ExcludeMetrics, "mem_shared")
|
||||
y, err := lp.New("mem_shared", m.tags, map[string]interface{}{"value": int(float64(m.stats[`MemShared`]) * 1.0e-3)}, time.Now())
|
||||
y, err := lp.New("mem_shared", m.tags, m.meta, map[string]interface{}{"value": int(float64(m.stats[`MemShared`]) * 1.0e-3)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
27
collectors/memstatMetric.md
Normal file
@@ -0,0 +1,27 @@

## `memstat` collector

```json
"memstat": {
  "exclude_metrics": [
    "mem_used"
  ]
}
```

The `memstat` collector reads data from `/proc/meminfo` and outputs a handful of **node** metrics. Metrics that are not required can be excluded so that they are not forwarded to the sink.

Metrics:
* `mem_total`
* `mem_sreclaimable`
* `mem_slab`
* `mem_free`
* `mem_buffers`
* `mem_cached`
* `mem_available`
* `mem_shared`
* `swap_total`
* `swap_free`
* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)

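
As a worked example of the `mem_used` formula, the following Go sketch parses text in `/proc/meminfo` format and derives the value. It is a simplified illustration, not the collector's code: the helper names `parseMeminfo` and `memUsed` and the sample numbers are made up, and the sketch keeps the raw kB values, whereas the collector code in this change scales values by `1.0e-3` before publishing them.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfo turns lines like "MemTotal:  16000000 kB" into a map
// from the key (without the trailing colon) to the numeric value.
func parseMeminfo(text string) map[string]int64 {
	stats := make(map[string]int64)
	for _, line := range strings.Split(text, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[strings.TrimSuffix(fields[0], ":")] = v
	}
	return stats
}

// memUsed implements mem_total - (mem_free + mem_buffers + mem_cached).
func memUsed(stats map[string]int64) int64 {
	return stats["MemTotal"] - (stats["MemFree"] + stats["Buffers"] + stats["Cached"])
}

func main() {
	// Made-up sample values in the /proc/meminfo layout.
	sample := `MemTotal:       16000000 kB
MemFree:         4000000 kB
Buffers:          500000 kB
Cached:          3500000 kB`
	stats := parseMeminfo(sample)
	fmt.Printf("mem_used=%d kB\n", memUsed(stats))
}
```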
|
@@ -1,8 +1,10 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"errors"
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
influx "github.com/influxdata/line-protocol"
|
||||
"io/ioutil"
|
||||
"log"
|
||||
"strconv"
|
||||
@@ -10,28 +12,30 @@ import (
|
||||
"time"
|
||||
)
|
||||
|
||||
type MetricGetter interface {
|
||||
type MetricCollector interface {
|
||||
Name() string
|
||||
Init(config []byte) error
|
||||
Init(config json.RawMessage) error
|
||||
Initialized() bool
|
||||
Read(time.Duration, *[]lp.MutableMetric)
|
||||
Read(duration time.Duration, output chan lp.CCMetric)
|
||||
Close()
|
||||
}
|
||||
|
||||
type MetricCollector struct {
|
||||
name string
|
||||
init bool
|
||||
type metricCollector struct {
|
||||
output chan lp.CCMetric
|
||||
name string
|
||||
init bool
|
||||
meta map[string]string
|
||||
}
|
||||
|
||||
func (c *MetricCollector) Name() string {
|
||||
func (c *metricCollector) Name() string {
|
||||
return c.name
|
||||
}
|
||||
|
||||
func (c *MetricCollector) setup() error {
|
||||
func (c *metricCollector) setup() error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (c *MetricCollector) Initialized() bool {
|
||||
func (c *metricCollector) Initialized() bool {
|
||||
return c.init == true
|
||||
}
|
||||
|
||||
@@ -103,7 +107,7 @@ func CpuList() []int {
|
||||
return cpulist
|
||||
}
|
||||
|
||||
func Tags2Map(metric lp.Metric) map[string]string {
|
||||
func Tags2Map(metric influx.Metric) map[string]string {
|
||||
tags := make(map[string]string)
|
||||
for _, t := range metric.TagList() {
|
||||
tags[t.Key] = t.Value
|
||||
@@ -111,7 +115,7 @@ func Tags2Map(metric lp.Metric) map[string]string {
|
||||
return tags
|
||||
}
|
||||
|
||||
func Fields2Map(metric lp.Metric) map[string]interface{} {
|
||||
func Fields2Map(metric influx.Metric) map[string]interface{} {
|
||||
fields := make(map[string]interface{})
|
||||
for _, f := range metric.FieldList() {
|
||||
fields[f.Key] = f.Value
|
||||
|
@@ -7,8 +7,7 @@ import (
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
const NETSTATFILE = `/proc/net/dev`
|
||||
@@ -18,14 +17,15 @@ type NetstatCollectorConfig struct {
|
||||
}
|
||||
|
||||
type NetstatCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
config NetstatCollectorConfig
|
||||
matches map[int]string
|
||||
}
|
||||
|
||||
func (m *NetstatCollector) Init(config []byte) error {
|
||||
func (m *NetstatCollector) Init(config json.RawMessage) error {
|
||||
m.name = "NetstatCollector"
|
||||
m.setup()
|
||||
m.meta = map[string]string{"source": m.name, "group": "Network"}
|
||||
m.matches = map[int]string{
|
||||
1: "bytes_in",
|
||||
9: "bytes_out",
|
||||
@@ -46,7 +46,7 @@ func (m *NetstatCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *NetstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *NetstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
data, err := ioutil.ReadFile(string(NETSTATFILE))
|
||||
if err != nil {
|
||||
log.Print(err.Error())
|
||||
@@ -73,9 +73,15 @@ func (m *NetstatCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
for i, name := range m.matches {
|
||||
v, err := strconv.ParseInt(f[i], 10, 0)
|
||||
if err == nil {
|
||||
y, err := lp.New(name, tags, map[string]interface{}{"value": int(float64(v) * 1.0e-3)}, time.Now())
|
||||
y, err := lp.New(name, tags, m.meta, map[string]interface{}{"value": int(float64(v) * 1.0e-3)}, time.Now())
|
||||
if err == nil {
|
||||
*out = append(*out, y)
|
||||
switch {
|
||||
case strings.Contains(name, "byte"):
|
||||
y.AddMeta("unit", "Byte")
|
||||
case strings.Contains(name, "pkt"):
|
||||
y.AddMeta("unit", "Packets")
|
||||
}
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
21
collectors/netstatMetric.md
Normal file
@@ -0,0 +1,21 @@

## `netstat` collector

```json
"netstat": {
  "exclude_devices": [
    "lo"
  ]
}
```

The `netstat` collector reads data from `/proc/net/dev` and outputs a handful of **node** metrics. Devices that are not required can be excluded so that their metrics are not forwarded to the sink. The `lo` device should commonly be excluded.

Metrics:
* `bytes_in`
* `bytes_out`
* `pkts_in`
* `pkts_out`

The device name is added as tag `device`.

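
The Go sketch below illustrates how one line of `/proc/net/dev` could be turned into the four metrics while honoring `exclude_devices`. It is not the collector's implementation. The field indices 1 (`bytes_in`) and 9 (`bytes_out`) appear in the collector code of this change; the indices 2 and 10 for the packet counters are an assumption based on the usual `/proc/net/dev` column order. The helper name `parseNetDevLine` and the sample lines are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// matches maps field indices (device name at index 0) to metric names.
// Indices 2 and 10 are assumptions, see the text above.
var matches = map[int]string{
	1:  "bytes_in",
	2:  "pkts_in",
	9:  "bytes_out",
	10: "pkts_out",
}

// parseNetDevLine returns the device name and its metrics. The boolean is
// false for excluded devices and for lines that yield no numeric values.
func parseNetDevLine(line string, excludeDevices []string) (string, map[string]int64, bool) {
	// Replace the colon after the device name so the name is its own field.
	f := strings.Fields(strings.Replace(line, ":", " ", 1))
	if len(f) < 11 {
		return "", nil, false
	}
	for _, dev := range excludeDevices {
		if dev == f[0] {
			return f[0], nil, false
		}
	}
	values := make(map[string]int64)
	for idx, name := range matches {
		if v, err := strconv.ParseInt(f[idx], 10, 64); err == nil {
			values[name] = v
		}
	}
	return f[0], values, len(values) > 0
}

func main() {
	// Made-up sample lines in the /proc/net/dev layout.
	lines := []string{
		"    lo: 500 5 0 0 0 0 0 0 500 5 0 0 0 0 0 0",
		"  eth0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0",
	}
	for _, l := range lines {
		if dev, values, ok := parseNetDevLine(l, []string{"lo"}); ok {
			fmt.Printf("device=%s %v\n", dev, values)
		}
	}
}
```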
|
147
collectors/nfsMetric.go
Normal file
@@ -0,0 +1,147 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"log"
|
||||
|
||||
// "os"
|
||||
"os/exec"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
)
|
||||
|
||||
type NfsCollectorData struct {
|
||||
current int64
|
||||
last int64
|
||||
}
|
||||
|
||||
type NfsCollector struct {
|
||||
metricCollector
|
||||
tags map[string]string
|
||||
config struct {
|
||||
Nfsutils string `json:"nfsutils"`
|
||||
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
|
||||
}
|
||||
data map[string]map[string]NfsCollectorData
|
||||
}
|
||||
|
||||
func (m *NfsCollector) initStats() error {
|
||||
cmd := exec.Command(m.config.Nfsutils, "-l")
|
||||
cmd.Wait()
|
||||
buffer, err := cmd.Output()
|
||||
if err == nil {
|
||||
for _, line := range strings.Split(string(buffer), "\n") {
|
||||
lf := strings.Fields(line)
|
||||
if len(lf) != 5 {
|
||||
continue
|
||||
}
|
||||
if _, exist := m.data[lf[1]]; !exist {
|
||||
m.data[lf[1]] = make(map[string]NfsCollectorData)
|
||||
}
|
||||
name := strings.Trim(lf[3], ":")
|
||||
if _, exist := m.data[lf[1]][name]; !exist {
|
||||
value, err := strconv.ParseInt(lf[4], 0, 64)
|
||||
if err == nil {
|
||||
x := m.data[lf[1]][name]
|
||||
x.current = value
|
||||
x.last = 0
|
||||
m.data[lf[1]][name] = x
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return err
|
||||
}
|
||||
|
||||
func (m *NfsCollector) updateStats() error {
|
||||
cmd := exec.Command(m.config.Nfsutils, "-l")
|
||||
cmd.Wait()
|
||||
buffer, err := cmd.Output()
|
||||
if err == nil {
|
||||
for _, line := range strings.Split(string(buffer), "\n") {
|
||||
lf := strings.Fields(line)
|
||||
if len(lf) != 5 {
|
||||
continue
|
||||
}
|
||||
if _, exist := m.data[lf[1]]; !exist {
|
||||
m.data[lf[1]] = make(map[string]NfsCollectorData)
|
||||
}
|
||||
name := strings.Trim(lf[3], ":")
|
||||
if _, exist := m.data[lf[1]][name]; exist {
|
||||
value, err := strconv.ParseInt(lf[4], 0, 64)
|
||||
if err == nil {
|
||||
x := m.data[lf[1]][name]
|
||||
x.last = x.current
|
||||
x.current = value
|
||||
m.data[lf[1]][name] = x
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return err
|
||||
}
|
||||
|
||||
func (m *NfsCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "NfsCollector"
|
||||
m.setup()
|
||||
|
||||
// Set default nfsstat binary
|
||||
m.config.Nfsutils = "/usr/sbin/nfsstat"
|
||||
|
||||
// Read JSON configuration
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
log.Print(err.Error())
|
||||
return err
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "NFS",
|
||||
}
|
||||
m.tags = map[string]string{
|
||||
"type": "node",
|
||||
}
|
||||
// Check if nfsstat is in executable search path
|
||||
_, err = exec.LookPath(m.config.Nfsutils)
|
||||
if err != nil {
|
||||
return fmt.Errorf("NfsCollector.Init(): Failed to find nfsstat binary '%s': %v", m.config.Nfsutils, err)
|
||||
}
|
||||
m.data = make(map[string]map[string]NfsCollectorData)
|
||||
m.initStats()
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *NfsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
timestamp := time.Now()
|
||||
|
||||
m.updateStats()
|
||||
|
||||
for version, metrics := range m.data {
|
||||
for name, data := range metrics {
|
||||
if _, skip := stringArrayContains(m.config.ExcludeMetrics, name); skip {
|
||||
continue
|
||||
}
|
||||
value := data.current - data.last
|
||||
y, err := lp.New(fmt.Sprintf("nfs_%s", name), m.tags, m.meta, map[string]interface{}{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
y.AddMeta("version", version)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func (m *NfsCollector) Close() {
|
||||
m.init = false
|
||||
}
|
@@ -6,9 +6,8 @@ import (
|
||||
"fmt"
|
||||
"log"
|
||||
"time"
|
||||
|
||||
lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
|
||||
"github.com/NVIDIA/go-nvml/pkg/nvml"
|
||||
lp "github.com/influxdata/line-protocol"
|
||||
)
|
||||
|
||||
type NvidiaCollectorConfig struct {
|
||||
@@ -17,7 +16,7 @@ type NvidiaCollectorConfig struct {
|
||||
}
|
||||
|
||||
type NvidiaCollector struct {
|
||||
MetricCollector
|
||||
metricCollector
|
||||
num_gpus int
|
||||
config NvidiaCollectorConfig
|
||||
}
|
||||
@@ -29,10 +28,11 @@ func (m *NvidiaCollector) CatchPanic() {
|
||||
}
|
||||
}
|
||||
|
||||
func (m *NvidiaCollector) Init(config []byte) error {
|
||||
func (m *NvidiaCollector) Init(config json.RawMessage) error {
|
||||
var err error
|
||||
m.name = "NvidiaCollector"
|
||||
m.setup()
|
||||
m.meta = map[string]string{"source": m.name, "group": "Nvidia"}
|
||||
if len(config) > 0 {
|
||||
err = json.Unmarshal(config, &m.config)
|
||||
if err != nil {
|
||||
@@ -55,7 +55,7 @@ func (m *NvidiaCollector) Init(config []byte) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func (m *NvidiaCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
|
||||
func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMetric) {
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
@@ -74,14 +74,14 @@ func (m *NvidiaCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
util, ret := nvml.DeviceGetUtilizationRates(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "util")
|
||||
y, err := lp.New("util", tags, map[string]interface{}{"value": float64(util.Gpu)}, time.Now())
|
||||
y, err := lp.New("util", tags, m.meta, map[string]interface{}{"value": float64(util.Gpu)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "mem_util")
|
||||
y, err = lp.New("mem_util", tags, map[string]interface{}{"value": float64(util.Memory)}, time.Now())
|
||||
y, err = lp.New("mem_util", tags, m.meta, map[string]interface{}{"value": float64(util.Memory)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
@@ -89,174 +89,177 @@ func (m *NvidiaCollector) Read(interval time.Duration, out *[]lp.MutableMetric)
|
||||
if ret == nvml.SUCCESS {
|
||||
t := float64(meminfo.Total) / (1024 * 1024)
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "mem_total")
|
||||
y, err := lp.New("mem_total", tags, map[string]interface{}{"value": t}, time.Now())
|
||||
y, err := lp.New("mem_total", tags, m.meta, map[string]interface{}{"value": t}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
y.AddMeta("unit", "MByte")
|
||||
output <- y
|
||||
}
|
||||
f := float64(meminfo.Used) / (1024 * 1024)
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "fb_memory")
|
||||
y, err = lp.New("fb_memory", tags, map[string]interface{}{"value": f}, time.Now())
|
||||
y, err = lp.New("fb_memory", tags, m.meta, map[string]interface{}{"value": f}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
y.AddMeta("unit", "MByte")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
temp, ret := nvml.DeviceGetTemperature(device, nvml.TEMPERATURE_GPU)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "temp")
|
||||
y, err := lp.New("temp", tags, map[string]interface{}{"value": float64(temp)}, time.Now())
|
||||
y, err := lp.New("temp", tags, m.meta, map[string]interface{}{"value": float64(temp)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
y.AddMeta("unit", "degC")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
fan, ret := nvml.DeviceGetFanSpeed(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "fan")
|
||||
y, err := lp.New("fan", tags, map[string]interface{}{"value": float64(fan)}, time.Now())
|
||||
y, err := lp.New("fan", tags, m.meta, map[string]interface{}{"value": float64(fan)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
_, ecc_pend, ret := nvml.DeviceGetEccMode(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
var y lp.MutableMetric
|
||||
var y lp.CCMetric
|
||||
var err error
|
||||
switch ecc_pend {
|
||||
case nvml.FEATURE_DISABLED:
|
||||
y, err = lp.New("ecc_mode", tags, map[string]interface{}{"value": string("OFF")}, time.Now())
|
||||
y, err = lp.New("ecc_mode", tags, m.meta, map[string]interface{}{"value": string("OFF")}, time.Now())
|
||||
case nvml.FEATURE_ENABLED:
|
||||
y, err = lp.New("ecc_mode", tags, map[string]interface{}{"value": string("ON")}, time.Now())
|
||||
y, err = lp.New("ecc_mode", tags, m.meta, map[string]interface{}{"value": string("ON")}, time.Now())
|
||||
default:
|
||||
y, err = lp.New("ecc_mode", tags, map[string]interface{}{"value": string("UNKNOWN")}, time.Now())
|
||||
y, err = lp.New("ecc_mode", tags, m.meta, map[string]interface{}{"value": string("UNKNOWN")}, time.Now())
|
||||
}
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_mode")
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
} else if ret == nvml.ERROR_NOT_SUPPORTED {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_mode")
|
||||
y, err := lp.New("ecc_mode", tags, map[string]interface{}{"value": string("N/A")}, time.Now())
|
||||
y, err := lp.New("ecc_mode", tags, m.meta, map[string]interface{}{"value": string("N/A")}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
pstate, ret := nvml.DeviceGetPerformanceState(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "perf_state")
|
||||
y, err := lp.New("perf_state", tags, map[string]interface{}{"value": fmt.Sprintf("P%d", int(pstate))}, time.Now())
|
||||
y, err := lp.New("perf_state", tags, m.meta, map[string]interface{}{"value": fmt.Sprintf("P%d", int(pstate))}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
power, ret := nvml.DeviceGetPowerUsage(device)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "power_usage_report")
|
||||
y, err := lp.New("power_usage_report", tags, map[string]interface{}{"value": float64(power) / 1000}, time.Now())
|
||||
y, err := lp.New("power_usage_report", tags, m.meta, map[string]interface{}{"value": float64(power) / 1000}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
gclk, ret := nvml.DeviceGetClockInfo(device, nvml.CLOCK_GRAPHICS)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "graphics_clock_report")
|
||||
y, err := lp.New("graphics_clock_report", tags, map[string]interface{}{"value": float64(gclk)}, time.Now())
|
||||
y, err := lp.New("graphics_clock_report", tags, m.meta, map[string]interface{}{"value": float64(gclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
smclk, ret := nvml.DeviceGetClockInfo(device, nvml.CLOCK_SM)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "sm_clock_report")
|
||||
y, err := lp.New("sm_clock_report", tags, map[string]interface{}{"value": float64(smclk)}, time.Now())
|
||||
y, err := lp.New("sm_clock_report", tags, m.meta, map[string]interface{}{"value": float64(smclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
memclk, ret := nvml.DeviceGetClockInfo(device, nvml.CLOCK_MEM)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "mem_clock_report")
|
||||
y, err := lp.New("mem_clock_report", tags, map[string]interface{}{"value": float64(memclk)}, time.Now())
|
||||
y, err := lp.New("mem_clock_report", tags, m.meta, map[string]interface{}{"value": float64(memclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
max_gclk, ret := nvml.DeviceGetMaxClockInfo(device, nvml.CLOCK_GRAPHICS)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "max_graphics_clock")
|
||||
y, err := lp.New("max_graphics_clock", tags, map[string]interface{}{"value": float64(max_gclk)}, time.Now())
|
||||
y, err := lp.New("max_graphics_clock", tags, m.meta, map[string]interface{}{"value": float64(max_gclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
max_smclk, ret := nvml.DeviceGetMaxClockInfo(device, nvml.CLOCK_SM)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "max_sm_clock")
|
||||
y, err := lp.New("max_sm_clock", tags, map[string]interface{}{"value": float64(max_smclk)}, time.Now())
|
||||
y, err := lp.New("max_sm_clock", tags, m.meta, map[string]interface{}{"value": float64(max_smclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
max_memclk, ret := nvml.DeviceGetMaxClockInfo(device, nvml.CLOCK_MEM)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "max_mem_clock")
|
||||
y, err := lp.New("max_mem_clock", tags, map[string]interface{}{"value": float64(max_memclk)}, time.Now())
|
||||
y, err := lp.New("max_mem_clock", tags, m.meta, map[string]interface{}{"value": float64(max_memclk)}, time.Now())
|
||||
if err == nil && !skip {
|
||||
*out = append(*out, y)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
ecc_db, ret := nvml.DeviceGetTotalEccErrors(device, 1, 1)
|
||||
if ret == nvml.SUCCESS {
|
||||
_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_db_error")
|
||||
y, err := lp.New("ecc_db_error", tags, map[string]interface{}{"value": float64(ecc_db)}, time.Now())
|
||||
+			y, err := lp.New("ecc_db_error", tags, m.meta, map[string]interface{}{"value": float64(ecc_db)}, time.Now())
 			if err == nil && !skip {
-				*out = append(*out, y)
+				output <- y
 			}
 		}

 		ecc_sb, ret := nvml.DeviceGetTotalEccErrors(device, 0, 1)
 		if ret == nvml.SUCCESS {
 			_, skip = stringArrayContains(m.config.ExcludeMetrics, "ecc_sb_error")
-			y, err := lp.New("ecc_sb_error", tags, map[string]interface{}{"value": float64(ecc_sb)}, time.Now())
+			y, err := lp.New("ecc_sb_error", tags, m.meta, map[string]interface{}{"value": float64(ecc_sb)}, time.Now())
 			if err == nil && !skip {
-				*out = append(*out, y)
+				output <- y
 			}
 		}

 		pwr_limit, ret := nvml.DeviceGetPowerManagementLimit(device)
 		if ret == nvml.SUCCESS {
 			_, skip = stringArrayContains(m.config.ExcludeMetrics, "power_man_limit")
-			y, err := lp.New("power_man_limit", tags, map[string]interface{}{"value": float64(pwr_limit)}, time.Now())
+			y, err := lp.New("power_man_limit", tags, m.meta, map[string]interface{}{"value": float64(pwr_limit)}, time.Now())
 			if err == nil && !skip {
-				*out = append(*out, y)
+				output <- y
 			}
 		}

 		enc_util, _, ret := nvml.DeviceGetEncoderUtilization(device)
 		if ret == nvml.SUCCESS {
 			_, skip = stringArrayContains(m.config.ExcludeMetrics, "encoder_util")
-			y, err := lp.New("encoder_util", tags, map[string]interface{}{"value": float64(enc_util)}, time.Now())
+			y, err := lp.New("encoder_util", tags, m.meta, map[string]interface{}{"value": float64(enc_util)}, time.Now())
 			if err == nil && !skip {
-				*out = append(*out, y)
+				output <- y
 			}
 		}

 		dec_util, _, ret := nvml.DeviceGetDecoderUtilization(device)
 		if ret == nvml.SUCCESS {
 			_, skip = stringArrayContains(m.config.ExcludeMetrics, "decoder_util")
-			y, err := lp.New("decoder_util", tags, map[string]interface{}{"value": float64(dec_util)}, time.Now())
+			y, err := lp.New("decoder_util", tags, m.meta, map[string]interface{}{"value": float64(dec_util)}, time.Now())
 			if err == nil && !skip {
-				*out = append(*out, y)
+				output <- y
 			}
 		}
 	}
40
collectors/nvidiaMetric.md
Normal file
@@ -0,0 +1,40 @@
## `nvidia` collector

```json
  "nvidia": {
    "exclude_devices" : [
      "0","1"
    ],
    "exclude_metrics": [
      "fb_memory",
      "fan"
    ]
  }
```

Metrics:
* `util`
* `mem_util`
* `mem_total`
* `fb_memory`
* `temp`
* `fan`
* `ecc_mode`
* `perf_state`
* `power_usage_report`
* `graphics_clock_report`
* `sm_clock_report`
* `mem_clock_report`
* `max_graphics_clock`
* `max_sm_clock`
* `max_mem_clock`
* `ecc_db_error`
* `ecc_sb_error`
* `power_man_limit`
* `encoder_util`
* `decoder_util`

Each metric carries its own `type` tag (`accelerator`) and the GPU ID as `type-id`. The output metric looks like this:
`<name>,type=accelerator,type-id=<nvidia-gpu-id> value=<metric value> <timestamp>`
@@ -4,13 +4,13 @@ import (
 	"encoding/json"
 	"fmt"
 	"io/ioutil"
+	"log"
 	"os"
 	"path/filepath"
 	"strconv"
 	"strings"
 	"time"
-
-	lp "github.com/influxdata/line-protocol"
+	lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
 )

 const HWMON_PATH = `/sys/class/hwmon`
@@ -21,20 +21,21 @@ type TempCollectorConfig struct {
 }

 type TempCollector struct {
-	MetricCollector
+	metricCollector
 	config TempCollectorConfig
 }

-func (m *TempCollector) Init(config []byte) error {
+func (m *TempCollector) Init(config json.RawMessage) error {
 	m.name = "TempCollector"
 	m.setup()
-	m.init = true
+	m.meta = map[string]string{"source": m.name, "group": "IPMI", "unit": "degC"}
 	if len(config) > 0 {
 		err := json.Unmarshal(config, &m.config)
 		if err != nil {
 			return err
 		}
 	}
+	m.init = true
 	return nil
 }

@@ -74,7 +75,7 @@ func get_hwmon_sensors() (map[string]map[string]string, error) {
 	return sensors, nil
 }

-func (m *TempCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
+func (m *TempCollector) Read(interval time.Duration, output chan lp.CCMetric) {

 	sensors, err := get_hwmon_sensors()
 	if err != nil {
@@ -89,15 +90,20 @@ func (m *TempCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
 					break
 				}
 			}
+			mname := strings.Replace(name, " ", "_", -1)
+			if !strings.Contains(mname, "temp") {
+				mname = fmt.Sprintf("temp_%s", mname)
+			}
 			buffer, err := ioutil.ReadFile(string(file))
 			if err != nil {
 				continue
 			}
 			x, err := strconv.ParseInt(strings.Replace(string(buffer), "\n", "", -1), 0, 64)
 			if err == nil {
-				y, err := lp.New(strings.ToLower(name), tags, map[string]interface{}{"value": float64(x) / 1000}, time.Now())
+				y, err := lp.New(strings.ToLower(mname), tags, m.meta, map[string]interface{}{"value": int(float64(x) / 1000)}, time.Now())
 				if err == nil {
-					*out = append(*out, y)
+					log.Print("[", m.name, "] ", y)
+					output <- y
 				}
 			}
 		}
22
collectors/tempMetric.md
Normal file
@@ -0,0 +1,22 @@
## `tempstat` collector

```json
  "tempstat": {
    "tag_override" : {
      "<device like hwmon1>" : {
        "type" : "socket",
        "type-id" : "0"
      }
    },
    "exclude_metrics": [
      "metric1",
      "metric2"
    ]
  }
```

The `tempstat` collector reads the data from `/sys/class/hwmon/<device>/tempX_{input,label}`. The raw `input` values are in millidegrees and are reported as whole degrees Celsius.

Metrics:
* `temp_*`: The metric name is taken from the `label` files. Spaces become underscores, `temp_` is prefixed unless the label already contains `temp`, and the result is lowercased.
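
The naming and scaling steps can be sketched in isolation as follows. This mirrors the string handling visible in the collector's `Read` loop, but the function names `tempMetricName` and `parseMilliDegrees` are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tempMetricName derives the metric name from a hwmon label:
// spaces become underscores, "temp_" is prefixed when the label
// does not already contain "temp", and the result is lowercased.
// Hypothetical helper name.
func tempMetricName(label string) string {
	mname := strings.Replace(label, " ", "_", -1)
	if !strings.Contains(mname, "temp") {
		mname = fmt.Sprintf("temp_%s", mname)
	}
	return strings.ToLower(mname)
}

// parseMilliDegrees converts the raw content of a tempX_input file
// (millidegrees Celsius) into whole degrees. Hypothetical helper name.
func parseMilliDegrees(raw string) (int, error) {
	x, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return 0, err
	}
	return int(float64(x) / 1000), nil
}

func main() {
	deg, err := parseMilliDegrees("45000\n")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(tempMetricName("Package id 0"), deg)
}
```

Note that the `temp` check is case-sensitive and runs before lowercasing, so a label such as `Temp sensor` would still receive the `temp_` prefix.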
@@ -8,8 +8,7 @@ import (
 	"os/exec"
 	"strings"
 	"time"
-
-	lp "github.com/influxdata/line-protocol"
+	lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
 )

 const MAX_NUM_PROCS = 10
@@ -20,15 +19,16 @@ type TopProcsCollectorConfig struct {
 }

 type TopProcsCollector struct {
-	MetricCollector
+	metricCollector
 	tags   map[string]string
 	config TopProcsCollectorConfig
 }

-func (m *TopProcsCollector) Init(config []byte) error {
+func (m *TopProcsCollector) Init(config json.RawMessage) error {
 	var err error
 	m.name = "TopProcsCollector"
 	m.tags = map[string]string{"type": "node"}
+	m.meta = map[string]string{"source": m.name, "group": "TopProcs"}
 	if len(config) > 0 {
 		err = json.Unmarshal(config, &m.config)
 		if err != nil {
@@ -51,7 +51,7 @@ func (m *TopProcsCollector) Init(config []byte) error {
 	return nil
 }

-func (m *TopProcsCollector) Read(interval time.Duration, out *[]lp.MutableMetric) {
+func (m *TopProcsCollector) Read(interval time.Duration, output chan lp.CCMetric) {
 	if !m.init {
 		return
 	}
@@ -66,9 +66,9 @@ func (m *TopProcsCollector) Read(interval time.Duration, out *[]lp.MutableMetric
 	lines := strings.Split(string(stdout), "\n")
 	for i := 1; i < m.config.Num_procs+1; i++ {
 		name := fmt.Sprintf("topproc%d", i)
-		y, err := lp.New(name, m.tags, map[string]interface{}{"value": string(lines[i])}, time.Now())
+		y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": string(lines[i])}, time.Now())
 		if err == nil {
-			*out = append(*out, y)
+			output <- y
 		}
 	}
 }
15
collectors/topprocsMetric.md
Normal file
@@ -0,0 +1,15 @@
## `topprocs` collector

```json
  "topprocs": {
    "num_procs": 5
  }
```

The `topprocs` collector reports the top `num_procs` processes, sorted by CPU utilization as listed by `ps -Ao comm --sort=-pcpu`. Each process is sent as its own metric named `topproc<rank>`.

In contrast to most other collectors, the metric value is a `string` (the command name).
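
As an illustration of how the `ps` output maps to metrics, the sketch below parses a captured output string instead of running the command. The type `topProc` and the function `parseTopProcs` are hypothetical, and the bounds check and empty-line skip are additions of this sketch for hosts that report fewer than `num_procs` lines.

```go
package main

import (
	"fmt"
	"strings"
)

// topProc pairs a metric name with its string value (hypothetical type).
type topProc struct {
	Name  string
	Value string
}

// parseTopProcs extracts up to numProcs command names from the output of
// `ps -Ao comm --sort=-pcpu`. Line 0 is the header, so ranks start at line 1.
func parseTopProcs(psOutput string, numProcs int) []topProc {
	lines := strings.Split(psOutput, "\n")
	var procs []topProc
	for i := 1; i <= numProcs && i < len(lines); i++ {
		cmd := strings.TrimSpace(lines[i])
		if cmd == "" {
			continue // skip the empty element after a trailing newline
		}
		procs = append(procs, topProc{Name: fmt.Sprintf("topproc%d", i), Value: cmd})
	}
	return procs
}

func main() {
	sample := "COMMAND\nfirefox\ngo\nbash\n"
	for _, p := range parseTopProcs(sample, 2) {
		fmt.Printf("%s value=%q\n", p.Name, p.Value)
	}
}
```

In a live setting the input string could come from `exec.Command("ps", "-Ao", "comm", "--sort=-pcpu").Output()`.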