Modularize the whole thing (#16)

* Use channels, add a metric router, split up configuration and use extended version of Influx line protocol internally

* Use central timer for collectors and router. Add expressions to router

* Add expression to router config

* Update entry points

* Start with README

* Update README for CCMetric

* Formatting

* Update README.md

* Add README for MultiChanTicker

* Add README for MultiChanTicker

* Update README.md

* Add README to metric router

* Update main README

* Remove SinkEntity type

* Update README for sinks

* Update go files

* Update README for receivers

* Update collectors README

* Update collectors README

* Use separate page per collector

* Fix for tempstat page

* Add docs for customcmd collector

* Add docs for ipmistat collector

* Add docs for topprocs collector

* Update customCmdMetric.md

* Use seconds when calculating LIKWID metrics

* Add IB metrics ib_recv_pkts and ib_xmit_pkts

* Drop domain part of host name

* Updated to latest stable version of likwid

* Define source code dependencies in Makefile

* Add GPFS / IBM Spectrum Scale collector

* Add vet and staticcheck make targets

* Add vet and staticcheck make targets

* Avoid go vet warning:
struct field tag `json:"..., omitempty"` not compatible with reflect.StructTag.Get: suspicious space in struct tag value
struct field tag `json:"...", omitempty` not compatible with reflect.StructTag.Get: key:"value" pairs not separated by spaces

* Add sample collector to README.md

* Add CPU frequency collector

* Avoid staticcheck warning: redundant return statement

* Avoid staticcheck warning: unnecessary assignment to the blank identifier

* Simplified code

* Add CPUFreqCollectorCpuinfo
a metric collector to measure the current frequency of the CPUs
as obtained from /proc/cpuinfo
Only measure on the first hyperthread

* Add collector for NFS clients

* Move publication of metrics into Flush() for NatsSink

* Update GitHub actions

* Refactoring

* Avoid vet warning: Println arg list ends with redundant newline

* Avoid vet warning struct field commands has json tag but is not exported

* Avoid vet warning: return copies lock value.

* Corrected typo

* Refactoring

* Add go sources in internal/...

* Bad separator in Makefile

* Fix Infiniband collector

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Author: Thomas Gruber
Date: 2022-01-25 15:37:43 +01:00
Committed by: GitHub
Parent: 222862af32
Commit: 200af84c54
60 changed files with 2596 additions and 1105 deletions


@@ -1,288 +1,34 @@
# CCMetric collectors
This folder contains the collectors for the cc-metric-collector.
# `metricCollector.go`
The base class/configuration is located in `metricCollector.go`.
# Collectors
* `memstatMetric.go`: Reads `/proc/meminfo` to calculate **node** metrics. It also combines values into the metric `mem_used`.
* `loadavgMetric.go`: Reads `/proc/loadavg` and submits **node** metrics.
* `netstatMetric.go`: Reads `/proc/net/dev` and submits the values of all network devices as **node** metrics.
* `lustreMetric.go`: Reads Lustre's stats files and submits **node** metrics.
* `infinibandMetric.go`: Reads InfiniBand metrics. It uses the `perfquery` command to read the **node** metrics but can fall back to sysfs counters in case `perfquery` does not work.
* `likwidMetric.go`: Reads hardware performance events using LIKWID. It submits **socket** and **cpu** metrics.
* `cpustatMetric.go`: Reads CPU-specific values from `/proc/stat`.
* `topprocsMetric.go`: Reads the top X processes by their CPU usage. X is configurable.
* `nvidiaMetric.go`: Reads data about Nvidia GPUs using the NVML library.
* `tempMetric.go`: Reads temperature data from `/sys/class/hwmon/hwmon*`.
* `ipmiMetric.go`: Collects data from `ipmitool` or, as a fallback, from `ipmi-sensors`.
* `customCmdMetric.go`: Runs commands or reads files and submits the output (the output has to be in InfluxDB line protocol!).

If any of the collectors cannot be initialized, it is excluded from all further reads. For example, if the Lustre stats file is not a valid path, no Lustre-specific metrics will be recorded. A minimal, self-contained Go sketch of this skip-on-failure behavior follows (assumed illustration, not the project's actual code; the `Collector` interface and the two toy collectors are invented for the example):
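```go
package main

import (
	"errors"
	"fmt"
)

// Invented minimal interface for this sketch.
type Collector interface {
	Name() string
	Init() error
}

type lustreCollector struct{ statFile string }

func (l *lustreCollector) Name() string { return "lustre" }
func (l *lustreCollector) Init() error {
	if l.statFile == "" {
		return errors.New("lustre stats file not found")
	}
	return nil
}

type loadavgCollector struct{}

func (l *loadavgCollector) Name() string { return "loadavg" }
func (l *loadavgCollector) Init() error  { return nil }

func main() {
	candidates := []Collector{&lustreCollector{}, &loadavgCollector{}}
	active := []Collector{}
	for _, c := range candidates {
		if err := c.Init(); err != nil {
			// Failed initialization: the collector is excluded from all further reads.
			fmt.Printf("collector %s disabled: %v\n", c.Name(), err)
			continue
		}
		active = append(active, c)
	}
	fmt.Printf("active collectors: %d\n", len(active))
}
```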
# Collector configuration
```json
"collectors": [
    "tempstat"
],
"collect_config": {
    "tempstat": {
        "tag_override": {
            "hwmon0" : {
                "type" : "socket",
                "type-id" : "0"
            },
            "hwmon1" : {
                "type" : "socket",
                "type-id" : "1"
            }
        }
    }
}
```
Each collector-specific entry in `collect_config` follows the generic pattern:
```json
{
    "collector_type" : {
        <collector specific configuration>
    }
}
```
The configuration of the collectors in the main config file consists of two parts: the list of active collectors (`collectors`) and the collector configuration (`collect_config`). At startup, all collectors in the `collectors` list are initialized and, if successfully initialized, added to the active collectors for metric retrieval. At initialization the collector-specific configuration from the `collect_config` section is handed over. Each collector has its own configuration options; check the collector-specific sections.
In contrast to the configuration files for sinks and receivers, the collectors configuration is not a list but a set of dicts. This is required because we did not manage to partially read the type before loading the remaining configuration. We are eager to change this to the same format.
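A minimal sketch of how such a set of dicts can be decoded in two stages with `json.RawMessage` (hypothetical standalone code, not the project's implementation): the outer map yields the collector type, and each inner raw blob is later handed to the matching collector's `Init()`.
```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	raw := []byte(`{
		"tempstat": {"tag_override": {"hwmon0": {"type": "socket", "type-id": "0"}}},
		"loadavg":  {"exclude_metrics": ["proc_run"]}
	}`)

	// Stage 1: split the configuration into one raw JSON blob per collector type.
	var perCollector map[string]json.RawMessage
	if err := json.Unmarshal(raw, &perCollector); err != nil {
		panic(err)
	}

	// Stage 2: each blob would be passed unparsed to the collector's Init().
	for name, cfg := range perCollector {
		fmt.Printf("collector %q gets config %s\n", name, string(cfg))
	}
}
```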
## `memstat`
```json
"memstat": {
"exclude_metrics": [
"mem_used"
]
}
```
The `memstat` collector reads data from `/proc/meminfo` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
Metrics:
* `mem_total`
* `mem_sreclaimable`
* `mem_slab`
* `mem_free`
* `mem_buffers`
* `mem_cached`
* `mem_available`
* `mem_shared`
* `swap_total`
* `swap_free`
* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)
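The `mem_used` derivation can be illustrated with a short standalone sketch (not the collector's actual code) that parses `/proc/meminfo` and applies the formula above, using the kernel's field names `MemTotal`, `MemFree`, `Buffers` and `Cached`:
```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	vals := map[string]int64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text()) // e.g. "MemTotal:  32768 kB"
		if len(fields) < 2 {
			continue
		}
		key := strings.TrimSuffix(fields[0], ":")
		if v, err := strconv.ParseInt(fields[1], 10, 64); err == nil {
			vals[key] = v // value in kB
		}
	}
	memUsed := vals["MemTotal"] - (vals["MemFree"] + vals["Buffers"] + vals["Cached"])
	fmt.Printf("mem_used = %d kB\n", memUsed)
}
```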
## `loadavg`
```json
"loadavg": {
"exclude_metrics": [
"proc_run"
]
}
```
The `loadavg` collector reads data from `/proc/loadavg` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
Metrics:
* `load_one`
* `load_five`
* `load_fifteen`
* `proc_run`
* `proc_total`
## `netstat`
```json
"netstat": {
"exclude_devices": [
"lo"
]
}
```
The `netstat` collector reads data from `/proc/net/dev` and outputs a handful of **node** metrics. If a device is not required, it can be excluded from being forwarded to the sink. Commonly the `lo` device should be excluded.
Metrics:
* `bytes_in`
* `bytes_out`
* `pkts_in`
* `pkts_out`
The device name is added as tag `device`.
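A standalone sketch (not the collector's actual code) of how `/proc/net/dev` can be parsed into per-device counters, with the device name attached as the `device` tag and excluded devices skipped:
```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("/proc/net/dev")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	exclude := map[string]bool{"lo": true} // e.g. taken from "exclude_devices"
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, ":") {
			continue // skip the two header lines
		}
		parts := strings.SplitN(line, ":", 2)
		dev := strings.TrimSpace(parts[0])
		if exclude[dev] {
			continue
		}
		fields := strings.Fields(parts[1])
		if len(fields) < 10 {
			continue
		}
		bytesIn, _ := strconv.ParseUint(fields[0], 10, 64)
		pktsIn, _ := strconv.ParseUint(fields[1], 10, 64)
		bytesOut, _ := strconv.ParseUint(fields[8], 10, 64)
		pktsOut, _ := strconv.ParseUint(fields[9], 10, 64)
		fmt.Printf("device=%s bytes_in=%d pkts_in=%d bytes_out=%d pkts_out=%d\n",
			dev, bytesIn, pktsIn, bytesOut, pktsOut)
	}
}
```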
## `diskstat`
```json
"diskstat": {
"exclude_metrics": [
"read_ms"
]
}
```
The `diskstat` collector reads data from `/proc/diskstats` and outputs a handful of **node** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
Metrics:
* `reads`
* `reads_merged`
* `read_sectors`
* `read_ms`
* `writes`
* `writes_merged`
* `writes_sectors`
* `writes_ms`
* `ioops`
* `ioops_ms`
* `ioops_weighted_ms`
* `discards`
* `discards_merged`
* `discards_sectors`
* `discards_ms`
* `flushes`
* `flushes_ms`
The device name is added as tag `device`.
## `cpustat`
```json
"netstat": {
"exclude_metrics": [
"cpu_idle"
]
}
```
The `cpustat` collector reads data from `/proc/stat` and outputs a handful of **node** and **hwthread** metrics. If a metric is not required, it can be excluded from being forwarded to the sink.
Metrics:
* `cpu_user`
* `cpu_nice`
* `cpu_system`
* `cpu_idle`
* `cpu_iowait`
* `cpu_irq`
* `cpu_softirq`
* `cpu_steal`
* `cpu_guest`
* `cpu_guest_nice`
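How the aggregated `cpu` line (node scope) and the `cpuN` lines (hwthread scope) in `/proc/stat` map to these metrics can be sketched as follows (standalone illustration, not the collector's actual code):
```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/stat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	names := []string{"cpu_user", "cpu_nice", "cpu_system", "cpu_idle",
		"cpu_iowait", "cpu_irq", "cpu_softirq", "cpu_steal",
		"cpu_guest", "cpu_guest_nice"}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 || !strings.HasPrefix(fields[0], "cpu") {
			continue
		}
		// "cpu" (aggregated) -> type=node, "cpu0", "cpu1", ... -> type=hwthread
		scope := "node"
		if fields[0] != "cpu" {
			scope = "hwthread " + strings.TrimPrefix(fields[0], "cpu")
		}
		for i, name := range names {
			if i+1 < len(fields) {
				fmt.Printf("%s (%s) = %s\n", name, scope, fields[i+1])
			}
		}
	}
}
```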
## `likwid`
```json
"likwid": {
"eventsets": [
{
"events": {
"FIXC1": "ACTUAL_CPU_CLOCK",
"FIXC2": "MAX_CPU_CLOCK",
"PMC0": "RETIRED_INSTRUCTIONS",
"PMC1": "CPU_CLOCKS_UNHALTED",
"PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
"PMC3": "MERGE",
"DFC0": "DRAM_CHANNEL_0",
"DFC1": "DRAM_CHANNEL_1",
"DFC2": "DRAM_CHANNEL_2",
"DFC3": "DRAM_CHANNEL_3"
},
"metrics": [
{
"name": "ipc",
"calc": "PMC0/PMC1",
"socket_scope": false,
"publish": true
},
{
"name": "flops_any",
"calc": "0.000001*PMC2/time",
"socket_scope": false,
"publish": true
},
{
"name": "clock_mhz",
"calc": "0.000001*(FIXC1/FIXC2)/inverseClock",
"socket_scope": false,
"publish": true
},
{
"name": "mem1",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"socket_scope": true,
"publish": false
}
]
},
{
"events": {
"DFC0": "DRAM_CHANNEL_4",
"DFC1": "DRAM_CHANNEL_5",
"DFC2": "DRAM_CHANNEL_6",
"DFC3": "DRAM_CHANNEL_7",
"PWR0": "RAPL_CORE_ENERGY",
"PWR1": "RAPL_PKG_ENERGY"
},
"metrics": [
{
"name": "pwr_core",
"calc": "PWR0/time",
"socket_scope": false,
"publish": true
},
{
"name": "pwr_pkg",
"calc": "PWR1/time",
"socket_scope": true,
"publish": true
},
{
"name": "mem2",
"calc": "0.000001*(DFC0+DFC1+DFC2+DFC3)*64.0/time",
"socket_scope": true,
"publish": false
}
]
}
],
"globalmetrics": [
{
"name": "mem_bw",
"calc": "mem1+mem2",
"socket_scope": true,
"publish": true
}
]
}
```
_Example config suitable for AMD Zen3_
The `likwid` collector reads hardware performance counters on the **hwthread** and **socket** level. The configuration looks quite complicated, but it is basically copy&paste from [LIKWID's performance groups](https://github.com/RRZE-HPC/likwid/tree/master/groups). The collector went through multiple iterations that tried to use the performance groups directly, but this lacked flexibility. The current way of configuration provides the most flexibility.
The logic is as follows: there are multiple eventsets, each consisting of a list of counters plus events and a list of metrics. If you compare a common performance group with the example setting above, there is not much difference:
```
EVENTSET -> "events": {
FIXC1 ACTUAL_CPU_CLOCK -> "FIXC1": "ACTUAL_CPU_CLOCK",
FIXC2 MAX_CPU_CLOCK -> "FIXC2": "MAX_CPU_CLOCK",
PMC0 RETIRED_INSTRUCTIONS -> "PMC0" : "RETIRED_INSTRUCTIONS",
PMC1 CPU_CLOCKS_UNHALTED -> "PMC1" : "CPU_CLOCKS_UNHALTED",
PMC2 RETIRED_SSE_AVX_FLOPS_ALL -> "PMC2": "RETIRED_SSE_AVX_FLOPS_ALL",
PMC3 MERGE -> "PMC3": "MERGE",
-> }
```
The metrics follow the same pattern:
```
METRICS -> "metrics": [
IPC PMC0/PMC1 -> {
-> "name" : "IPC",
-> "calc" : "PMC0/PMC1",
-> "socket_scope": false,
-> "publish": true
-> }
-> ]
```
The `socket_scope` option tells whether the metric is submitted per socket or per hwthread. If a metric is only used for internal calculations, you can set `publish = false`.
Since some metrics can only be gathered with multiple measurements (like the memory bandwidth on AMD Zen3 chips), configure multiple eventsets as in the example config and use the `globalmetrics` section to combine them. **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.
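To make the `calc` expressions concrete, here is a tiny standalone sketch (hand-rolled, not the collector's actual expression engine; the counter values are invented) that evaluates the `ipc` and `flops_any` formulas from the example above:
```go
package main

import "fmt"

func main() {
	// Invented counter readings for one measurement interval.
	counters := map[string]float64{
		"PMC0": 2.0e9, // RETIRED_INSTRUCTIONS
		"PMC1": 1.6e9, // CPU_CLOCKS_UNHALTED
		"PMC2": 3.2e9, // RETIRED_SSE_AVX_FLOPS_ALL
	}
	time := 1.0 // measurement duration in seconds

	ipc := counters["PMC0"] / counters["PMC1"]     // "PMC0/PMC1"
	flopsAny := 0.000001 * counters["PMC2"] / time // "0.000001*PMC2/time" -> MFlop/s
	fmt.Printf("ipc = %.2f, flops_any = %.1f MFlop/s\n", ipc, flopsAny)
}
```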
# Available collectors
* [`cpustat`](./cpustatMetric.md)
* [`memstat`](./memstatMetric.md)
* [`diskstat`](./diskstatMetric.md)
* [`loadavg`](./loadavgMetric.md)
* [`netstat`](./netstatMetric.md)
* [`ibstat`](./infinibandMetric.md)
* [`tempstat`](./tempMetric.md)
* [`lustre`](./lustreMetric.md)
* [`likwid`](./likwidMetric.md)
* [`nvidia`](./nvidiaMetric.md)
* [`customcmd`](./customCmdMetric.md)
* [`ipmistat`](./ipmiMetric.md)
* [`topprocs`](./topprocsMetric.md)
## Todos
@@ -292,13 +38,15 @@ Since some metrics can only be gathered in multiple measurements (like the memor
# Contributing own collectors
A collector reads data from any source, parses it into metrics and submits these metrics to the `metric-collector`. A collector provides the following functions:
* `Name() string`: Return the name of the collector.
* `Init(config json.RawMessage) error`: Initializes the collector using the given collector-specific config in JSON. Check whether needed files/commands exist, ...
* `Initialized() bool`: Check if the collector is successfully initialized.
* `Read(duration time.Duration, output chan ccMetric.CCMetric)`: Read, parse and submit data to the `output` channel as [`CCMetric`](../internal/ccMetric/README.md). If the collector has to measure anything for some duration, use the provided function argument `duration`.
* `Close()`: Closes down the collector.
It is recommended to call `setup()` in the `Init()` function.
Finally, the collector needs to be registered in `collectorManager.go`. There is a list of collectors called `AvailableCollectors` which is a map (`collector_type_string` -> `pointer to MetricCollector interface`). Add a new entry with a descriptive name and the new collector.
## Sample collector
@@ -307,8 +55,9 @@ package collectors
```go
package collectors

import (
	"encoding/json"
	"time"

	lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
)

// Struct for the collector-specific JSON config
type SampleCollectorConfig struct {
	// collector-specific config options (elided in this diff excerpt)
}

type SampleCollector struct {
	metricCollector
	config SampleCollectorConfig
}

func (m *SampleCollector) Init(config json.RawMessage) error {
	m.name = "SampleCollector"
	m.setup()
	if len(config) > 0 {
		err := json.Unmarshal(config, &m.config)
		if err != nil {
			return err
		}
	}
	m.meta = map[string]string{"source": m.name, "group": "Sample"}
	m.init = true
	return nil
}

func (m *SampleCollector) Read(interval time.Duration, output chan lp.CCMetric) {
	if !m.init {
		return
	}
	// ... read the value x from the data source (elided in this diff excerpt) ...
	// tags for the metric, if type != node use proper type and type-id
	tags := map[string]string{"type": "node"}
	// Each metric has exactly one field: value !
	value := map[string]interface{}{"value": int(x)}
	y, err := lp.New("sample_metric", tags, m.meta, value, time.Now())
	if err == nil {
		output <- y
	}
}
```
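To make the new collector selectable from the configuration, add it to the `AvailableCollectors` map described above; a sketch (the `"sample"` key is an assumed example name):
```go
// In collectorManager.go (sketch; the "sample" key is an assumed name):
var AvailableCollectors = map[string]MetricCollector{
	// ... existing collectors ...
	"sample": &SampleCollector{},
}
```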