Compare commits

28 Commits

Author SHA1 Message Date
Thomas Gruber
9e321e0766 Merge branch 'develop' into cc_lib_switch 2025-04-16 12:58:29 +02:00
Thomas Roehl
813804ae2d Update CI 2025-03-18 14:32:29 +01:00
Thomas Roehl
da91813a81 Fix ccLogger import path 2025-03-18 14:28:46 +01:00
Thomas Roehl
6ea79b0099 Use receiver, sinks, ccLogger and ccConfig from cc-lib 2025-03-18 14:21:58 +01:00
Thomas Roehl
b5520efc25 Fix artifacts in netstat collector of not done cc-lib switch 2025-03-15 04:02:26 +01:00
Thomas Roehl
d2b1bad1b8 Fix artifacts of not done cc-lib switch 2025-03-15 04:01:01 +01:00
Thomas Roehl
01ff8b2e9b Remove local development path 2025-02-24 18:35:16 +01:00
Thomas Roehl
a476f1753e Change to ccMessage from cc-lib 2025-02-24 18:29:27 +01:00
brinkcoder
0e57c8db1c Add derived_values for numastats (#134)
* Check creation of CCMessage in NATS receiver

* add derived_values for numastats

* change to ccMessage

* remove vim command artefact

---------

Co-authored-by: Thomas Roehl <thomas.roehl@fau.de>
Co-authored-by: exterr2f <Robert.Externbrink@rub.de>
Co-authored-by: Thomas Gruber <Thomas.Roehl@googlemail.com>
2025-02-19 11:35:32 +01:00
brinkcoder
f2f38c81af Add exclude_devices to iostat (#133)
* Check creation of CCMessage in NATS receiver

* add exclude_device for iostatMetric

* add md file

---------

Co-authored-by: Thomas Roehl <thomas.roehl@fau.de>
Co-authored-by: exterr2f <Robert.Externbrink@rub.de>
Co-authored-by: Thomas Gruber <Thomas.Roehl@googlemail.com>
2025-02-19 11:34:56 +01:00
brinkcoder
f9acc51a50 Add derived values for nfsiostat (#132)
* Check creation of CCMessage in NATS receiver

* add derived_values for nfsiostatMetric

---------

Co-authored-by: Thomas Roehl <thomas.roehl@fau.de>
Co-authored-by: exterr2f <Robert.Externbrink@rub.de>
Co-authored-by: Thomas Gruber <Thomas.Roehl@googlemail.com>
2025-02-19 11:34:06 +01:00
brinkcoder
87346e2eae Fix excluded metrics for diskstat and add exclude_mounts (#131)
* Check creation of CCMessage in NATS receiver

* fix excluded metrics and add optional mountpoint exclude

---------

Co-authored-by: Thomas Roehl <thomas.roehl@fau.de>
Co-authored-by: exterr2f <Robert.Externbrink@rub.de>
Co-authored-by: Thomas Gruber <Thomas.Roehl@googlemail.com>
2025-02-19 11:33:13 +01:00
brinkcoder
0f92f10b66 Add optional interface alias in netstat (#130)
* Check creation of CCMessage in NATS receiver

* add optional interface aliases for netstatMetric

* small fix

---------

Co-authored-by: Thomas Roehl <thomas.roehl@fau.de>
Co-authored-by: exterr2f <Robert.Externbrink@rub.de>
Co-authored-by: Thomas Gruber <Thomas.Roehl@googlemail.com>
2025-02-19 11:32:15 +01:00
Michael Panzlaff
6901b06e44 Rename 'process_message' to 'process_messages' in metricRouter config
This makes the behavior more consistent with the other modules, which
have their MessageProcessor named 'process_messages'. This most likely
was just a typo.
2025-02-03 15:23:51 +01:00
Thomas Roehl
7b343d0bab Use CCMessage FromBytes instead of Influx's decoder 2024-12-27 15:22:59 +00:00
Thomas Roehl
7d3180b526 Check creation of CCMessage in NATS receiver 2024-12-27 15:00:48 +00:00
Thomas Roehl
70a6afc549 Generate HUGO inputs out of Markdown files 2024-12-23 17:55:48 +01:00
Thomas Roehl
e02a018327 Mark all JSON config fields of message processor as omitempty 2024-12-23 17:52:34 +01:00
Thomas Roehl
bcecdd033b Fix documentation of RAPL collector 2024-12-23 17:51:43 +01:00
Thomas Roehl
2645ffeff3 Merge branch 'main' into develop 2024-12-21 02:39:08 +01:00
Thomas Roehl
e968aa1991 Fix wrongly named packages 2024-12-20 20:33:10 +01:00
Thomas Gruber
d2a38e3844 Merge branch 'main' into develop 2024-12-20 20:27:48 +01:00
Thomas Roehl
1f35f6d3ca Fix wrongly named packages 2024-12-20 20:26:38 +01:00
Thomas Roehl
7e6870c7b3 Add golang-race for UBI9 and Alma9 2024-12-20 20:15:59 +01:00
Thomas Roehl
d881093524 Install go-toolkit to fulfill build requirements for RPM 2024-12-20 20:12:03 +01:00
Thomas Roehl
c01096c157 use go-toolkit for RPM builds 2024-12-20 18:49:28 +01:00
Thomas Roehl
3d70c8afc9 Remove condition around BuildRequires and use go-toolkit for RPM builds 2024-12-20 18:43:21 +01:00
Thomas Roehl
7ee85a07dc Remove go-toolkit as build requirement for RPM builds if run in CI 2024-12-20 18:28:32 +01:00
35 changed files with 54 additions and 457 deletions

@@ -1,17 +1,6 @@
-<!--
----
-title: cc-metric-collector
-description: Metric collecting node agent
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/_index.md
----
--->
 # cc-metric-collector
-A node agent for measuring, processing and forwarding node level metrics. It is part of the [ClusterCockpit ecosystem](https://clustercockpit.org/docs/overview/).
+A node agent for measuring, processing and forwarding node level metrics. It is part of the [ClusterCockpit ecosystem](./docs/introduction.md).
 The metric collector sends (and receives) metric in the [InfluxDB line protocol](https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/) as it provides flexibility while providing a separation between tags (like index columns in relational databases) and fields (like data columns).
@@ -46,8 +35,8 @@ The `interval` defines how often the metrics should be read and send to the sink
 See the component READMEs for their configuration:
 * [`collectors`](./collectors/README.md)
-* [`sinks`](https://github.com/ClusterCockpit/cc-lib/blob/main/sinks/README.md)
-* [`receivers`](https://github.com/ClusterCockpit/cc-lib/blob/main/receivers/README.md)
+* [`sinks`](./sinks/README.md)
+* [`receivers`](./receivers/README.md)
 * [`router`](./internal/metricRouter/README.md)
 # Installation
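
Note on the README text in this file: it describes the InfluxDB line protocol as separating tags from fields. As a rough illustration only, the sketch below renders one metric by hand. The `formatLine` helper, the metric name and the tag keys are hypothetical and not taken from the repository, and the escaping rules of the real protocol are ignored.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatLine is a hypothetical helper that renders one metric as
// "<name>,<tag>=<v>,... value=<v> <unix-nanoseconds>".
// It does not implement the escaping rules of the real line protocol.
func formatLine(name string, tags map[string]string, value float64, tsNano int64) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys) // sort tag keys so the output is deterministic

	var sb strings.Builder
	sb.WriteString(name)
	for _, k := range keys {
		sb.WriteString("," + k + "=" + tags[k]) // tags: indexed metadata
	}
	// fields: the measured data, here a single field called "value"
	sb.WriteString(fmt.Sprintf(" value=%g %d", value, tsNano))
	return sb.String()
}

func main() {
	tags := map[string]string{"hostname": "node01", "type": "node"}
	fmt.Println(formatLine("cpu_used", tags, 12.5, 1700000000000000000))
}
```

Tags come before the first space and fields after it, which is the separation the README refers to.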

@@ -1,14 +1,3 @@
-<!--
----
-title: Metric Collectors
-description: Metric collectors for cc-metric-collector
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/_index.md
----
--->
 # CCMetric collectors
 This folder contains the collectors for the cc-metric-collector.
@@ -34,6 +23,7 @@ In contrast to the configuration files for sinks and receivers, the collectors c
 * [`loadavg`](./loadavgMetric.md)
 * [`netstat`](./netstatMetric.md)
 * [`ibstat`](./infinibandMetric.md)
+* [`ibstat_perfquery`](./infinibandPerfQueryMetric.md)
 * [`tempstat`](./tempMetric.md)
 * [`lustrestat`](./lustreMetric.md)
 * [`likwid`](./likwidMetric.md)
@@ -43,10 +33,8 @@ In contrast to the configuration files for sinks and receivers, the collectors c
 * [`topprocs`](./topprocsMetric.md)
 * [`nfs3stat`](./nfs3Metric.md)
 * [`nfs4stat`](./nfs4Metric.md)
-* [`nfsiostat`](./nfsiostatMetric.md)
 * [`cpufreq`](./cpufreqMetric.md)
 * [`cpufreq_cpuinfo`](./cpufreqCpuinfoMetric.md)
-* [`schedstat`](./schedstatMetric.md)
 * [`numastats`](./numastatsMetric.md)
 * [`gpfs`](./gpfsMetric.md)
 * [`beegfs_meta`](./beegfsmetaMetric.md)
@@ -63,7 +51,7 @@ A collector reads data from any source, parses it to metrics and submits these m
 * `Name() string`: Return the name of the collector
 * `Init(config json.RawMessage) error`: Initializes the collector using the given collector-specific config in JSON. Check if needed files/commands exists, ...
 * `Initialized() bool`: Check if a collector is successfully initialized
-* `Read(duration time.Duration, output chan ccMessage.CCMessage)`: Read, parse and submit data to the `output` channel as [`CCMessage`](https://github.com/ClusterCockpit/cc-lib/blob/main/ccMessage/README.md). If the collector has to measure anything for some duration, use the provided function argument `duration`.
+* `Read(duration time.Duration, output chan ccMetric.CCMetric)`: Read, parse and submit data to the `output` channel as [`CCMetric`](../internal/ccMetric/README.md). If the collector has to measure anything for some duration, use the provided function argument `duration`.
 * `Close()`: Closes down the collector.
 It is recommanded to call `setup()` in the `Init()` function.
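
Note on the interface listed in the last hunk: the five methods can be sketched as a Go interface with a trivial implementation. This is a minimal sketch under stated assumptions, not the repository's code. `Message` is a local stand-in for the real message type (`CCMessage` or `CCMetric`, depending on the side of the diff), and `constCollector` with its `value` config key is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Message is a hypothetical stand-in for the real message type.
type Message struct {
	Name  string
	Value float64
	Time  time.Time
}

// Collector mirrors the five methods listed in the README hunk.
type Collector interface {
	Name() string
	Init(config json.RawMessage) error
	Initialized() bool
	Read(duration time.Duration, output chan Message)
	Close()
}

// constCollector is a hypothetical collector that emits one fixed value.
type constCollector struct {
	init   bool
	config struct {
		Value float64 `json:"value"`
	}
}

func (c *constCollector) Name() string { return "const" }

// Init parses the collector-specific JSON config and marks the collector ready.
func (c *constCollector) Init(config json.RawMessage) error {
	if len(config) > 0 {
		if err := json.Unmarshal(config, &c.config); err != nil {
			return err
		}
	}
	c.init = true
	return nil
}

func (c *constCollector) Initialized() bool { return c.init }

// Read submits one message to the output channel; the duration is unused here.
func (c *constCollector) Read(_ time.Duration, output chan Message) {
	if !c.init {
		return
	}
	output <- Message{Name: "const_value", Value: c.config.Value, Time: time.Now()}
}

func (c *constCollector) Close() { c.init = false }

func main() {
	var c Collector = &constCollector{}
	if err := c.Init(json.RawMessage(`{"value": 3}`)); err != nil {
		panic(err)
	}
	out := make(chan Message, 1)
	c.Read(time.Second, out)
	fmt.Println((<-out).Value)
	c.Close()
}
```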

@@ -1,17 +1,5 @@
-<!--
----
-title: BeeGFS metadata metric collector
-description: Collect metadata clientstats for `BeeGFS on Demand`
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/beegfsmeta.md
----
--->
 ## `BeeGFS on Demand` collector
-This Collector is to collect `BeeGFS on Demand` (BeeOND) metadata clientstats.
+This Collector is to collect BeeGFS on Demand (BeeOND) metadata clientstats.
 ```json
 "beegfs_meta": {
@@ -84,4 +72,4 @@ Available Metrics:
 * setXA
 * mirror
 The collector adds a `filesystem` tag to all metrics

@@ -1,14 +1,3 @@
-<!--
----
-title: "BeeGFS on Demand metric collector"
-description: Collect performance metrics for BeeGFS filesystems
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/beegfsstorage.md
----
--->
 ## `BeeGFS on Demand` collector
 This Collector is to collect BeeGFS on Demand (BeeOND) storage stats.
@@ -63,4 +52,4 @@ Available Metrics:
 * "unlnk"
 The collector adds a `filesystem` tag to all metrics

@@ -1,14 +1,3 @@
-<!--
----
-title: CPU frequency metric collector through cpuinfo
-description: Collect the CPU frequency from `/proc/cpuinfo`
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/cpufreq_cpuinfo.md
----
--->
 ## `cpufreq_cpuinfo` collector
 ```json

@@ -1,14 +1,3 @@
-<!--
----
-title: CPU frequency metric collector through sysfs
-description: Collect the CPU frequency metrics from `/sys/.../cpu/.../cpufreq`
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/cpufreq.md
----
--->
 ## `cpufreq_cpuinfo` collector
 ```json

@@ -1,14 +1,3 @@
-<!--
----
-title: CPU usage metric collector
-description: Collect CPU metrics from `/proc/stat`
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/cpustat.md
----
--->
 ## `cpustat` collector
@@ -35,4 +24,4 @@ Metrics:
 * `cpu_guest` with `unit=Percent`
 * `cpu_guest_nice` with `unit=Percent`
 * `cpu_used` = `cpu_* - cpu_idle` with `unit=Percent`
 * `num_cpus`

@@ -1,13 +1,3 @@
-<!--
----
-title: CustomCommand metric collector
-description: Collect messages from custom command or files
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/customcmd.md
----
--->
 ## `customcmd` collector

@@ -1,13 +1,3 @@
-<!--
----
-title: Disk usage statistics metric collector
-description: Collect metrics for various filesystems from `/proc/self/mounts`
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/diskstat.md
----
--->
 ## `diskstat` collector

@@ -1,14 +1,3 @@
-<!--
----
-title: GPFS collector
-description: Collect infos about GPFS filesystems
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/gpfs.md
----
--->
 ## `gpfs` collector
 ```json

@@ -1,13 +1,3 @@
-<!--
----
-title: InfiniBand Metric collector
-description: Collect metrics for InfiniBand devices
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/infiniband.md
----
--->
 ## `ibstat` collector

@@ -1,13 +1,3 @@
-<!--
----
-title: IOStat Metric collector
-description: Collect metrics from `/proc/diskstats`
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/iostat.md
----
--->
 ## `iostat` collector

@@ -1,13 +1,3 @@
-<!--
----
-title: IPMI Metric collector
-description: Collect metrics using ipmitool or ipmi-sensors
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/ipmi.md
----
--->
 ## `ipmistat` collector

@@ -190,8 +190,12 @@ func getBaseFreq() float64 {
 	}
 	if math.IsNaN(freq) {
-		C.timer_init()
-		freq = float64(C.timer_getCycleClock()) / 1e3
+		C.power_init(0)
+		info := C.get_powerInfo()
+		if float64(info.baseFrequency) != 0 {
+			freq = float64(info.baseFrequency)
+		}
+		C.power_finalize()
 	}
 	return freq * 1e3
 }
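
Note on this hunk: the added branch only overwrites `freq` when the fallback reports a non-zero base frequency, so `freq` stays NaN when both sources fail and the function then returns NaN. The sketch below reproduces that control flow in plain Go, without cgo. `baseFreq` and its two parameters are hypothetical stand-ins for the primary reading and for `info.baseFrequency`.

```go
package main

import (
	"fmt"
	"math"
)

// baseFreq mirrors the shape of the added branch: start from a primary
// reading (NaN when unavailable) and accept the fallback only if it is
// non-zero. Both parameters are hypothetical stand-ins for the cgo calls.
func baseFreq(primary, fallback float64) float64 {
	freq := primary
	if math.IsNaN(freq) {
		if fallback != 0 {
			freq = fallback
		}
	}
	// NaN * 1e3 is still NaN, so callers have to check the result.
	return freq * 1e3
}

func main() {
	fmt.Println(baseFreq(2400, 0))                   // primary reading is used
	fmt.Println(baseFreq(math.NaN(), 2100))          // fallback is used
	fmt.Println(math.IsNaN(baseFreq(math.NaN(), 0))) // both unavailable
}
```

A caller that needs a usable number would check `math.IsNaN` on the result before publishing it as a metric.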

@@ -1,13 +1,3 @@
-<!--
----
-title: LIKWID collector
-description: Collect hardware performance events and metrics using LIKWID
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/likwid.md
----
--->
 ## `likwid` collector

@@ -1,14 +1,3 @@
-<!--
----
-title: Load average metric collector
-description: Collect metrics from `/proc/loadavg`
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/loadavg.md
----
--->
 ## `loadavg` collector

@@ -1,14 +1,3 @@
-<!--
----
-title: Lustre filesystem metric collector
-description: Collect metrics for Lustre filesystems
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/lustre.md
----
--->
 ## `lustrestat` collector
@@ -54,4 +43,4 @@ Metrics:
 * `lustre_statfs_diff` (if `send_diff_values == true`)
 * `lustre_inode_permission_diff` (if `send_diff_values == true`)
 This collector adds an `device` tag.

@@ -1,14 +1,3 @@
-<!--
----
-title: Memory statistics metric collector
-description: Collect metrics from `/proc/meminfo`
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/memstat.md
----
--->
 ## `memstat` collector

@@ -1,13 +1,3 @@
-<!--
----
-title: Network device metric collector
-description: Collect metrics for network devices through procfs
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/netstat.md
----
--->
 ## `netstat` collector
@@ -38,4 +28,4 @@ Metrics:
 * `net_pkts_in_bw` (`unit=packets/sec` if `send_derived_values == true`)
 * `net_pkts_out_bw` (`unit=packets/sec` if `send_derived_values == true`)
 The device name is added as tag `stype=network,stype-id=<device>`.

@@ -1,14 +1,3 @@
-<!--
----
-title: NFS network filesystem (v3) metric collector
-description: Collect metrics for NFS network filesystems in version 3
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/nfs3.md
----
--->
 ## `nfs3stat` collector

@@ -1,14 +1,3 @@
-<!--
----
-title: NFS network filesystem (v4) metric collector
-description: Collect metrics for NFS network filesystems in version 4
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/nfs4.md
----
--->
 ## `nfs4stat` collector

@@ -171,7 +171,7 @@ func (m *NfsIOStatCollector) Read(interval time.Duration, output chan lp.CCMessa
 			}
 		}
 		if !found {
-			delete(m.data, mntpoint)
+			m.data[mntpoint] = nil
 		}
 	}
 }
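
Note on this hunk: `delete(m.data, mntpoint)` and `m.data[mntpoint] = nil` are not equivalent. The first removes the key, the second keeps the key with a nil value, so the map still lists mount points that have disappeared. A minimal sketch of the difference follows. The value type `map[string]int64` and the helper `markMissing` are assumptions made for illustration; the real type of `m.data` is not visible in the hunk.

```go
package main

import "fmt"

// markMissing handles mount points that are absent from `seen`, either by
// deleting the key or by storing nil under it. Hypothetical helper.
func markMissing(data map[string]map[string]int64, seen map[string]bool, useDelete bool) {
	for mnt := range data {
		if seen[mnt] {
			continue
		}
		if useDelete {
			delete(data, mnt) // deleting during range is permitted in Go
		} else {
			data[mnt] = nil // key stays, value becomes a nil map
		}
	}
}

func main() {
	seen := map[string]bool{"/home": true}

	a := map[string]map[string]int64{"/home": {"read": 1}, "/scratch": {"read": 2}}
	markMissing(a, seen, true)
	_, okA := a["/scratch"]
	fmt.Println(len(a), okA)

	b := map[string]map[string]int64{"/home": {"read": 1}, "/scratch": {"read": 2}}
	markMissing(b, seen, false)
	_, okB := b["/scratch"]
	fmt.Println(len(b), okB)
}
```

With the nil variant, later code that ranges over the map has to skip nil entries, and writing into a nil inner map would panic.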

@@ -1,14 +1,3 @@
-<!--
----
-title: NFS network filesystem metrics from procfs
-description: Collect NFS network filesystem metrics for mounts from `/proc/self/mountstats`
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/nfsio.md
----
--->
 ## `nfsiostat` collector
 ```json

@@ -1,13 +1,3 @@
-<!--
----
-title: NUMAStat collector
-description: Collect infos about NUMA domains
-categories: [cc-metric-collector]
-tags: ['Admin']
-weight: 2
-hugo_path: docs/reference/cc-metric-collector/collectors/numastat.md
----
--->
 ## `numastat` collector

@@ -27,12 +27,10 @@ type NvidiaCollectorConfig struct {
 }
 type NvidiaCollectorDevice struct {
 	device nvml.Device
 	excludeMetrics map[string]bool
 	tags map[string]string
 	meta map[string]string
-	lastEnergyReading uint64
-	lastEnergyTimestamp time.Time
 }
 type NvidiaCollector struct {
@@ -151,8 +149,6 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
 		// Add device handle
 		g.device = device
-		g.lastEnergyReading = 0
-		g.lastEnergyTimestamp = time.Now()
 		// Add tags
 		g.tags = map[string]string{
@@ -210,7 +206,7 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
 	return nil
 }
-func readMemoryInfo(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readMemoryInfo(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_fb_mem_total"] || !device.excludeMetrics["nv_fb_mem_used"] || !device.excludeMetrics["nv_fb_mem_reserved"] {
 		var total uint64
 		var used uint64
@@ -254,7 +250,7 @@ func readMemoryInfo(device *NvidiaCollectorDevice, output chan lp.CCMessage) err
 	return nil
 }
-func readBarMemoryInfo(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readBarMemoryInfo(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_bar1_mem_total"] || !device.excludeMetrics["nv_bar1_mem_used"] {
 		meminfo, ret := nvml.DeviceGetBAR1MemoryInfo(device.device)
 		if ret != nvml.SUCCESS {
@@ -281,7 +277,7 @@ func readBarMemoryInfo(device *NvidiaCollectorDevice, output chan lp.CCMessage)
 	return nil
 }
-func readUtilization(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readUtilization(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	isMig, ret := nvml.DeviceIsMigDeviceHandle(device.device)
 	if ret != nvml.SUCCESS {
 		err := errors.New(nvml.ErrorString(ret))
@@ -323,7 +319,7 @@ func readUtilization(device *NvidiaCollectorDevice, output chan lp.CCMessage) er
 	return nil
 }
-func readTemp(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readTemp(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_temp"] {
 		// Retrieves the current temperature readings for the device, in degrees C.
 		//
@@ -342,7 +338,7 @@ func readTemp(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	return nil
 }
-func readFan(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readFan(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_fan"] {
 		// Retrieves the intended operating speed of the device's fan.
 		//
@@ -365,7 +361,7 @@ func readFan(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	return nil
 }
-// func readFans(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+// func readFans(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 // if !device.excludeMetrics["nv_fan"] {
 // numFans, ret := nvml.DeviceGetNumFans(device.device)
 // if ret == nvml.SUCCESS {
@@ -386,7 +382,7 @@ func readFan(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 // return nil
 // }
-func readEccMode(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readEccMode(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_ecc_mode"] {
 		// Retrieves the current and pending ECC modes for the device.
 		//
@@ -420,7 +416,7 @@ func readEccMode(device *NvidiaCollectorDevice, output chan lp.CCMessage) error
 	return nil
 }
-func readPerfState(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readPerfState(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_perf_state"] {
 		// Retrieves the current performance state for the device.
 		//
@@ -440,16 +436,13 @@ func readPerfState(device *NvidiaCollectorDevice, output chan lp.CCMessage) erro
 	return nil
 }
-func readPowerUsage(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readPowerUsage(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_power_usage"] {
 		// Retrieves power usage for this GPU in milliwatts and its associated circuitry (e.g. memory)
 		//
 		// On Fermi and Kepler GPUs the reading is accurate to within +/- 5% of current power draw.
-		// On Ampere (except GA100) or newer GPUs, the API returns power averaged over 1 sec interval.
-		// On GA100 and older architectures, instantaneous power is returned.
 		//
-		// It is only available if power management mode is supported.
+		// It is only available if power management mode is supported
 		mode, ret := nvml.DeviceGetPowerManagementMode(device.device)
 		if ret != nvml.SUCCESS {
 			return nil
@@ -468,54 +461,7 @@ func readPowerUsage(device *NvidiaCollectorDevice, output chan lp.CCMessage) err
 	return nil
 }
-func readEnergyConsumption(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
-	// Retrieves total energy consumption for this GPU in millijoules (mJ) since the driver was last reloaded
-	// For Volta or newer fully supported devices.
-	if (!device.excludeMetrics["nv_energy"]) && (!device.excludeMetrics["nv_energy_abs"]) && (!device.excludeMetrics["nv_average_power"]) {
-		now := time.Now()
-		mode, ret := nvml.DeviceGetPowerManagementMode(device.device)
-		if ret != nvml.SUCCESS {
-			return nil
-		}
-		if mode == nvml.FEATURE_ENABLED {
-			energy, ret := nvml.DeviceGetTotalEnergyConsumption(device.device)
-			if ret == nvml.SUCCESS {
-				if device.lastEnergyReading != 0 {
-					if !device.excludeMetrics["nv_energy"] {
-						y, err := lp.NewMetric("nv_energy", device.tags, device.meta, (energy-device.lastEnergyReading)/1000, now)
-						if err == nil {
-							y.AddMeta("unit", "Joules")
-							output <- y
-						}
-					}
-					if !device.excludeMetrics["nv_average_power"] {
-						energyDiff := (energy - device.lastEnergyReading) / 1000
-						timeDiff := now.Sub(device.lastEnergyTimestamp)
-						y, err := lp.NewMetric("nv_average_power", device.tags, device.meta, energyDiff/uint64(timeDiff.Seconds()), now)
-						if err == nil {
-							y.AddMeta("unit", "watts")
-							output <- y
-						}
-					}
-				}
-				if !device.excludeMetrics["nv_energy_abs"] {
-					y, err := lp.NewMetric("nv_energy_abs", device.tags, device.meta, energy/1000, now)
-					if err == nil {
-						y.AddMeta("unit", "Joules")
-						output <- y
-					}
-				}
-				device.lastEnergyReading = energy
-				device.lastEnergyTimestamp = time.Now()
-			}
-		}
-	}
-	return nil
-}
-func readClocks(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readClocks(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	// Retrieves the current clock speeds for the device.
 	//
 	// Available clock information:
@@ -567,7 +513,7 @@ func readClocks(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	return nil
 }
-func readMaxClocks(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readMaxClocks(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	// Retrieves the maximum clock speeds for the device.
 	//
 	// Available clock information:
@@ -625,7 +571,7 @@ func readMaxClocks(device *NvidiaCollectorDevice, output chan lp.CCMessage) erro
 	return nil
 }
-func readEccErrors(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readEccErrors(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_ecc_uncorrected_error"] {
 		// Retrieves the total ECC error counts for the device.
 		//
@@ -656,7 +602,7 @@ func readEccErrors(device *NvidiaCollectorDevice, output chan lp.CCMessage) erro
 	return nil
 }
-func readPowerLimit(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readPowerLimit(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_power_max_limit"] {
 		// Retrieves the power management limit associated with this device.
 		//
@@ -676,7 +622,7 @@ func readPowerLimit(device *NvidiaCollectorDevice, output chan lp.CCMessage) err
 	return nil
 }
-func readEncUtilization(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readEncUtilization(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	isMig, ret := nvml.DeviceIsMigDeviceHandle(device.device)
 	if ret != nvml.SUCCESS {
 		err := errors.New(nvml.ErrorString(ret))
@@ -703,7 +649,7 @@ func readEncUtilization(device *NvidiaCollectorDevice, output chan lp.CCMessage)
 	return nil
 }
-func readDecUtilization(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readDecUtilization(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	isMig, ret := nvml.DeviceIsMigDeviceHandle(device.device)
 	if ret != nvml.SUCCESS {
 		err := errors.New(nvml.ErrorString(ret))
@@ -730,7 +676,7 @@ func readDecUtilization(device *NvidiaCollectorDevice, output chan lp.CCMessage)
 	return nil
 }
-func readRemappedRows(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readRemappedRows(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_remapped_rows_corrected"] ||
 		!device.excludeMetrics["nv_remapped_rows_uncorrected"] ||
 		!device.excludeMetrics["nv_remapped_rows_pending"] ||
@@ -783,7 +729,7 @@ func readRemappedRows(device *NvidiaCollectorDevice, output chan lp.CCMessage) e
 	return nil
 }
-func readProcessCounts(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readProcessCounts(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_compute_processes"] {
 		// Get information about processes with a compute context on a device
 		//
@@ -875,7 +821,7 @@ func readProcessCounts(device *NvidiaCollectorDevice, output chan lp.CCMessage)
 	return nil
 }
-func readViolationStats(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readViolationStats(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	var violTime nvml.ViolationTime
 	var ret nvml.Return
@@ -989,7 +935,7 @@ func readViolationStats(device *NvidiaCollectorDevice, output chan lp.CCMessage)
 	return nil
 }
-func readNVLinkStats(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readNVLinkStats(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	// Retrieves the specified error counter value
 	// Please refer to \a nvmlNvLinkErrorCounter_t for error counters that are available
 	//
@@ -1124,7 +1070,7 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMessage)
 		return
 	}
-	readAll := func(device *NvidiaCollectorDevice, output chan lp.CCMessage) {
+	readAll := func(device NvidiaCollectorDevice, output chan lp.CCMessage) {
 		name, ret := nvml.DeviceGetName(device.device)
 		if ret != nvml.SUCCESS {
 			name = "NoName"
@@ -1164,11 +1110,6 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMessage)
 			cclog.ComponentDebug(m.name, "readPowerUsage for device", name, "failed")
 		}
-		err = readEnergyConsumption(device, output)
-		if err != nil {
-			cclog.ComponentDebug(m.name, "readEnergyConsumption for device", name, "failed")
-		}
 		err = readClocks(device, output)
 		if err != nil {
 			cclog.ComponentDebug(m.name, "readClocks for device", name, "failed")
@@ -1228,7 +1169,7 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMessage)
 	// Actual read loop over all attached Nvidia GPUs
 	for i := 0; i < m.num_gpus; i++ {
-		readAll(&m.gpus[i], output)
+		readAll(m.gpus[i], output)
 		// Iterate over all MIG devices if any
 		if m.config.ProcessMigDevices {
@@ -1302,7 +1243,7 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMessage)
 				}
 			}
-			readAll(&migDevice, output)
+			readAll(migDevice, output)
 		}
 	}
 }


@@ -1,13 +1,3 @@
<!--
---
title: "Nvidia NVML metric collector"
description: Collect metrics for Nvidia GPUs using the NVML
categories: [cc-metric-collector]
tags: ['Admin']
weight: 2
hugo_path: docs/reference/cc-metric-collector/collectors/nvidia.md
---
-->
## `nvidia` collector ## `nvidia` collector
@@ -82,8 +72,5 @@ Metrics:
* `nv_nvlink_ecc_errors` * `nv_nvlink_ecc_errors`
* `nv_nvlink_replay_errors` * `nv_nvlink_replay_errors`
* `nv_nvlink_recovery_errors` * `nv_nvlink_recovery_errors`
* `nv_energy`
* `nv_energy_abs`
* `nv_average_power`
Some metrics add the additional sub type tag (`stype`) like the `nv_nvlink_*` metrics set `stype=nvlink,stype-id=<link_number>`. Some metrics add the additional sub type tag (`stype`) like the `nv_nvlink_*` metrics set `stype=nvlink,stype-id=<link_number>`.


@@ -1,14 +1,3 @@
<!--
---
title: RAPL metric collector
description: Collect energy data through the RAPL sysfs interface
categories: [cc-metric-collector]
tags: ['Admin']
weight: 2
hugo_path: docs/reference/cc-metric-collector/collectors/rapl.md
---
-->
## `rapl` collector ## `rapl` collector
This collector reads running average power limit (RAPL) monitoring attributes to compute average power consumption metrics. See <https://www.kernel.org/doc/html/latest/power/powercap/powercap.html#monitoring-attributes>. This collector reads running average power limit (RAPL) monitoring attributes to compute average power consumption metrics. See <https://www.kernel.org/doc/html/latest/power/powercap/powercap.html#monitoring-attributes>.


@@ -1,14 +1,3 @@
<!--
---
title: "ROCm SMI metric collector"
description: Collect metrics for AMD GPUs using the SMI library
categories: [cc-metric-collector]
tags: ['Admin']
weight: 2
hugo_path: docs/reference/cc-metric-collector/collectors/rocmsmi.md
---
-->
## `rocm_smi` collector ## `rocm_smi` collector


@@ -1,13 +1,3 @@
<!--
---
title: SchedStat Metric collector
description: Collect metrics from `/proc/schedstat`
categories: [cc-metric-collector]
tags: ['Admin']
weight: 2
hugo_path: docs/reference/cc-metric-collector/collectors/schedstat.md
---
-->
## `schedstat` collector ## `schedstat` collector
```json ```json
@@ -18,4 +8,4 @@ hugo_path: docs/reference/cc-metric-collector/collectors/schedstat.md
The `schedstat` collector reads data from /proc/schedstat and calculates a load value, separated by hwthread. This might be useful to detect bad cpu pinning on shared nodes etc. The `schedstat` collector reads data from /proc/schedstat and calculates a load value, separated by hwthread. This might be useful to detect bad cpu pinning on shared nodes etc.
Metric: Metric:
* `cpu_load_core` * `cpu_load_core`


@@ -1,14 +1,3 @@
<!--
---
title: Self-monitoring metric collector
description: Collect metrics from the execution of cc-metric-collector itself
categories: [cc-metric-collector]
tags: ['Admin']
weight: 2
hugo_path: docs/reference/cc-metric-collector/collectors/self.md
---
-->
## `self` collector ## `self` collector
```json ```json


@@ -1,14 +1,3 @@
<!--
---
title: Temperature metric collector
description: Collect thermal metrics from `/sys/class/hwmon/*`
categories: [cc-metric-collector]
tags: ['Admin']
weight: 2
hugo_path: docs/reference/cc-metric-collector/collectors/temp.md
---
-->
## `tempstat` collector ## `tempstat` collector


@@ -1,15 +1,3 @@
<!--
---
title: TopProcs collector
description: Collect infos about most CPU-consuming processes
categories: [cc-metric-collector]
tags: ['Admin']
weight: 2
hugo_path: docs/reference/cc-metric-collector/collectors/topprocs.md
---
-->
## `topprocs` collector ## `topprocs` collector


@@ -1,14 +1,3 @@
<!--
---
title: Metric Aggregator
description: Subsystem for evaluating expressions on metrics (deprecated)
categories: [cc-metric-collector]
tags: ['Developer']
weight: 1
hugo_path: docs/reference/cc-metric-collector/internal/metricaggregator/_index.md
---
-->
# The MetricAggregator # The MetricAggregator
In some cases, further combination of metrics or raw values is required. For that strings like `foo + 1` with runtime dependent `foo` need to be evaluated. The MetricAggregator relies on the [`gval`](https://github.com/PaesslerAG/gval) Golang package to perform all expression evaluation. The `gval` package provides the basic arithmetic operations but the MetricAggregator defines additional ones. In some cases, further combination of metrics or raw values is required. For that strings like `foo + 1` with runtime dependent `foo` need to be evaluated. The MetricAggregator relies on the [`gval`](https://github.com/PaesslerAG/gval) Golang package to perform all expression evaluation. The `gval` package provides the basic arithmetic operations but the MetricAggregator defines additional ones.
@@ -46,4 +35,4 @@ The MetricAggregator provides these functions additional to the `Full` language
## Limitations ## Limitations
- Since the metrics are written in JSON files which do not allow `""` without proper escaping inside of JSON strings, you have to use `''` for strings. - Since the metrics are written in JSON files which do not allow `""` without proper escaping inside of JSON strings, you have to use `''` for strings.
- Since `\` is interpreted by JSON as an escape character, it cannot be used in metrics. But it is required to write regular expressions. So instead of `\`, use `%` and the MetricAggregator replaces them after reading the JSON file. - Since `\` is interpreted by JSON as an escape character, it cannot be used in metrics. But it is required to write regular expressions. So instead of `\`, use `%` and the MetricAggregator replaces them after reading the JSON file.


@@ -1,22 +1,11 @@
<!--
---
title: Message Router
description: Routing component inside cc-metric-collector
categories: [cc-metric-collector]
tags: ['Developer']
weight: 1
hugo_path: docs/reference/cc-metric-collector/internal/metricrouter/_index.md
---
-->
# CC Metric Router # CC Metric Router
The CCMetric router sits in between the collectors and the sinks and can be used to add and remove tags to/from traversing [CCMessages](https://pkg.go.dev/github.com/ClusterCockpit/cc-lib/ccMessage). The CCMetric router sits in between the collectors and the sinks and can be used to add and remove tags to/from traversing [CCMessages](https://pkg.go.dev/github.com/ClusterCockpit/cc-energy-manager@v0.0.0-20240919152819-92a17f2da4f7/pkg/cc-message.
# Configuration # Configuration
**Note**: Use the [message processor configuration](https://github.com/ClusterCockpit/cc-lib/blob/main/messageProcessor/README.md) with option `process_messages`. **Note**: Use the [message processor configuration](../../pkg/messageProcessor/README.md) with option `process_messages`.
```json ```json
{ {
@@ -80,7 +69,7 @@ The CCMetric router sits in between the collectors and the sinks and can be used
There are three main options `add_tags`, `delete_tags` and `interval_timestamp`. `add_tags` and `delete_tags` are lists consisting of dicts with `key`, `value` and `if`. The `value` can be omitted in the `delete_tags` part as it only uses the `key` for removal. The `interval_timestamp` setting means that a unique timestamp is applied to all metrics traversing the router during an interval. There are three main options `add_tags`, `delete_tags` and `interval_timestamp`. `add_tags` and `delete_tags` are lists consisting of dicts with `key`, `value` and `if`. The `value` can be omitted in the `delete_tags` part as it only uses the `key` for removal. The `interval_timestamp` setting means that a unique timestamp is applied to all metrics traversing the router during an interval.
**Note**: Use the [message processor configuration](https://github.com/ClusterCockpit/cc-lib/blob/main/messageProcessor/README.md) (option `process_messages`) instead of `add_tags`, `delete_tags`, `drop_metrics`, `drop_metrics_if`, `rename_metrics`, `normalize_units` and `change_unit_prefix`. These options are deprecated and will be removed in future versions. Until then, they are added to the message processor. **Note**: Use the [message processor configuration](../../pkg/messageProcessor/README.md) (option `process_messages`) instead of `add_tags`, `delete_tags`, `drop_metrics`, `drop_metrics_if`, `rename_metrics`, `normalize_units` and `change_unit_prefix`. These options are deprecated and will be removed in future versions. Until then, they are added to the message processor.
# Processing order in the router # Processing order in the router
@@ -236,13 +225,13 @@ __deprecated__
The cc-metric-collector tries to read the data from the system as it is reported. If available, it tries to read the metric unit from the system as well (e.g. from `/proc/meminfo`). The problem is that, depending on the source, the metric units are named differently. Just think about `byte`, `Byte`, `B`, `bytes`, ... The cc-metric-collector tries to read the data from the system as it is reported. If available, it tries to read the metric unit from the system as well (e.g. from `/proc/meminfo`). The problem is that, depending on the source, the metric units are named differently. Just think about `byte`, `Byte`, `B`, `bytes`, ...
The [cc-units](https://github.com/ClusterCockpit/cc-lib/ccUnits) package provides us a normalization option to use the same metric unit name for all metrics. If this option is set to true, all `unit` meta tags are normalized. The [cc-units](https://github.com/ClusterCockpit/cc-units) package provides us a normalization option to use the same metric unit name for all metrics. If this option is set to true, all `unit` meta tags are normalized.
## The `change_unit_prefix` section ## The `change_unit_prefix` section
__deprecated__ __deprecated__
It is often the case that metrics are reported by the system using a rather outdated unit prefix (for example, `/proc/meminfo` still uses kByte although current memory sizes are in the GByte range). If you want to change the prefix of a unit, you can do that with the help of [cc-units](https://github.com/ClusterCockpit/cc-lib/ccUnits). The setting works on the metric name and requires the new prefix for the metric. The cc-units package determines the scaling factor. It is often the case that metrics are reported by the system using a rather outdated unit prefix (for example, `/proc/meminfo` still uses kByte although current memory sizes are in the GByte range). If you want to change the prefix of a unit, you can do that with the help of [cc-units](https://github.com/ClusterCockpit/cc-units). The setting works on the metric name and requires the new prefix for the metric. The cc-units package determines the scaling factor.
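
To illustrate what "determines the scaling factor" means, here is a minimal sketch of a prefix change from kByte to GByte. It does not use the cc-units API: the `prefixFactor` table and the `scaleFactor` function are hypothetical, and decimal (power-of-ten) prefixes are assumed.

```go
package main

import "fmt"

// prefixFactor is a hypothetical lookup table, assuming decimal prefixes.
// It is not taken from the cc-units package.
var prefixFactor = map[string]float64{
	"":  1,
	"K": 1e3,
	"M": 1e6,
	"G": 1e9,
}

// scaleFactor returns the factor a value has to be multiplied with
// when its unit prefix changes from `from` to `to`.
func scaleFactor(from, to string) (float64, error) {
	f, ok := prefixFactor[from]
	if !ok {
		return 0, fmt.Errorf("unknown prefix %q", from)
	}
	t, ok := prefixFactor[to]
	if !ok {
		return 0, fmt.Errorf("unknown prefix %q", to)
	}
	return f / t, nil
}

func main() {
	factor, err := scaleFactor("K", "G")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Example value in kByte, made up for illustration.
	memTotalKByte := 16384000.0
	fmt.Printf("%.3f GByte\n", memTotalKByte*factor)
}
```

The metric value is multiplied by the factor and the `unit` meta tag is updated to the new prefix; the sketch only shows the arithmetic part.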
# Aggregate metric values of the current interval with the `interval_aggregates` option # Aggregate metric values of the current interval with the `interval_aggregates` option
@@ -274,7 +263,7 @@ The above configuration, collects all metric values for metrics evaluating `if`
If you are not interested in the input metrics `sub_metric_%d+` at all, you can add the same condition used here to the `drop_metrics_if` section to drop them. If you are not interested in the input metrics `sub_metric_%d+` at all, you can add the same condition used here to the `drop_metrics_if` section to drop them.
Use cases for `interval_aggregates`: Use cases for `interval_aggregates`:
- Combine multiple metrics of a collector into a new one, as the [MemstatCollector](../../collectors/memstatMetric.md) does for `mem_used`: - Combine multiple metrics of a collector into a new one, as the [MemstatCollector](../../collectors/memstatMetric.md) does for `mem_used`)):
```json ```json
{ {
"name" : "mem_used", "name" : "mem_used",


@@ -1,14 +1,3 @@
<!--
---
title: Multi-channel Ticker
description: Timer ticker that sends out the tick to multiple channels
categories: [cc-metric-collector]
tags: ['Developer']
weight: 1
hugo_path: docs/reference/cc-metric-collector/pkg/multichanticker/_index.md
---
-->
# MultiChanTicker # MultiChanTicker
The idea of this ticker is to multiply the output channels. The original Golang `time.Ticker` provides only a single output channel, so each tick can only be received by a single receiver. This ticker allows adding multiple channels, all of which are notified of each time tick. The idea of this ticker is to multiply the output channels. The original Golang `time.Ticker` provides only a single output channel, so each tick can only be received by a single receiver. This ticker allows adding multiple channels, all of which are notified of each time tick.