Add energy metrics to Nvidia collector README

Add energy metrics from NVML to Nvidia NVML collector
Fix URL to new location of cc-units
2025-07-21 12:21:41 +02:00 · 2025-04-28 15:36:04 +00:00 · 2025-04-28 15:29:13 +00:00 · 2025-04-22 12:48:15 +02:00 · 2025-04-17 11:38:03 +02:00 · 2025-04-17 11:37:47 +02:00
33 changed files with 452 additions and 47 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,17 @@
+<!--
+---
+title: cc-metric-collector
+description: Metric collecting node agent
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/_index.md
+---
+-->
+
 # cc-metric-collector

-A node agent for measuring, processing and forwarding node level metrics. It is part of the [ClusterCockpit ecosystem](./docs/introduction.md).
+A node agent for measuring, processing and forwarding node level metrics. It is part of the [ClusterCockpit ecosystem](https://clustercockpit.org/docs/overview/).

 The metric collector sends (and receives) metrics in the [InfluxDB line protocol](https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/), which offers flexibility while separating tags (like index columns in relational databases) from fields (like data columns).

@@ -35,8 +46,8 @@ The `interval` defines how often the metrics should be read and send to the sink
 See the component READMEs for their configuration:

 * [`collectors`](./collectors/README.md)
-* [`sinks`](./sinks/README.md)
-* [`receivers`](./receivers/README.md)
+* [`sinks`](https://github.com/ClusterCockpit/cc-lib/blob/main/sinks/README.md)
+* [`receivers`](https://github.com/ClusterCockpit/cc-lib/blob/main/receivers/README.md)
 * [`router`](./internal/metricRouter/README.md)

 # Installation
--- a/collectors/README.md
+++ b/collectors/README.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: Metric Collectors
+description: Metric collectors for cc-metric-collector
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/_index.md
+---
+-->
+
 # CCMetric collectors

 This folder contains the collectors for the cc-metric-collector.
@@ -23,7 +34,6 @@ In contrast to the configuration files for sinks and receivers, the collectors c
 * [`loadavg`](./loadavgMetric.md)
 * [`netstat`](./netstatMetric.md)
 * [`ibstat`](./infinibandMetric.md)
-* [`ibstat_perfquery`](./infinibandPerfQueryMetric.md)
 * [`tempstat`](./tempMetric.md)
 * [`lustrestat`](./lustreMetric.md)
 * [`likwid`](./likwidMetric.md)
@@ -53,7 +63,7 @@ A collector reads data from any source, parses it to metrics and submits these m
 * `Name() string`: Return the name of the collector
 * `Init(config json.RawMessage) error`: Initializes the collector using the given collector-specific config in JSON. Check if needed files/commands exist, ...
 * `Initialized() bool`: Check if a collector is successfully initialized
-* `Read(duration time.Duration, output chan ccMetric.CCMetric)`: Read, parse and submit data to the `output` channel as [`CCMetric`](../internal/ccMetric/README.md). If the collector has to measure anything for some duration, use the provided function argument `duration`.
+* `Read(duration time.Duration, output chan ccMessage.CCMessage)`: Read, parse and submit data to the `output` channel as [`CCMessage`](https://github.com/ClusterCockpit/cc-lib/blob/main/ccMessage/README.md). If the collector has to measure anything for some duration, use the provided function argument `duration`.
 * `Close()`: Closes down the collector.

 It is recommended to call `setup()` in the `Init()` function.
--- a/collectors/beegfsmetaMetric.md
+++ b/collectors/beegfsmetaMetric.md
@@ -1,5 +1,17 @@
+<!--
+---
+title: BeeGFS metadata metric collector
+description: Collect metadata clientstats for `BeeGFS on Demand`
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/beegfsmeta.md
+---
+-->
+
+
 ## `BeeGFS on Demand` collector
-This Collector is to collect BeeGFS on Demand (BeeOND) metadata clientstats.
+This Collector is to collect `BeeGFS on Demand` (BeeOND) metadata clientstats.

 ```json
  "beegfs_meta": {
--- a/collectors/beegfsstorageMetric.md
+++ b/collectors/beegfsstorageMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: "BeeGFS on Demand metric collector"
+description: Collect performance metrics for BeeGFS filesystems
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/beegfsstorage.md
+---
+-->
+
 ## `BeeGFS on Demand` collector
 This Collector is to collect BeeGFS on Demand (BeeOND) storage stats.

--- a/collectors/cpufreqCpuinfoMetric.md
+++ b/collectors/cpufreqCpuinfoMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: CPU frequency metric collector through cpuinfo
+description: Collect the CPU frequency from `/proc/cpuinfo`
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/cpufreq_cpuinfo.md
+---
+-->
+
 ## `cpufreq_cpuinfo` collector

 ```json
--- a/collectors/cpufreqMetric.md
+++ b/collectors/cpufreqMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: CPU frequency metric collector through sysfs
+description: Collect the CPU frequency metrics from `/sys/.../cpu/.../cpufreq`
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/cpufreq.md
+---
+-->
+
 ## `cpufreq_cpuinfo` collector

 ```json
--- a/collectors/cpustatMetric.md
+++ b/collectors/cpustatMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: CPU usage metric collector
+description: Collect CPU metrics from `/proc/stat`
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/cpustat.md
+---
+-->
+

 ## `cpustat` collector

--- a/collectors/customCmdMetric.md
+++ b/collectors/customCmdMetric.md
@@ -1,3 +1,13 @@
+<!--
+---
+title: CustomCommand metric collector
+description: Collect messages from custom command or files
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/customcmd.md
+---
+-->

 ## `customcmd` collector

--- a/collectors/diskstatMetric.md
+++ b/collectors/diskstatMetric.md
@@ -1,3 +1,13 @@
+<!--
+---
+title: Disk usage statistics metric collector
+description: Collect metrics for various filesystems from `/proc/self/mounts`
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/diskstat.md
+---
+-->

 ## `diskstat` collector

--- a/collectors/gpfsMetric.md
+++ b/collectors/gpfsMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: GPFS collector
+description: Collect infos about GPFS filesystems
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/gpfs.md
+---
+-->
+
 ## `gpfs` collector

 ```json
--- a/collectors/infinibandMetric.md
+++ b/collectors/infinibandMetric.md
@@ -1,3 +1,13 @@
+<!--
+---
+title: InfiniBand Metric collector
+description: Collect metrics for InfiniBand devices
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/infiniband.md
+---
+-->

 ## `ibstat` collector

--- a/collectors/iostatMetric.md
+++ b/collectors/iostatMetric.md
@@ -1,3 +1,13 @@
+<!--
+---
+title: IOStat Metric collector
+description: Collect metrics from `/proc/diskstats`
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/iostat.md
+---
+-->

 ## `iostat` collector

--- a/collectors/ipmiMetric.md
+++ b/collectors/ipmiMetric.md
@@ -1,3 +1,13 @@
+<!--
+---
+title: IPMI Metric collector
+description: Collect metrics using ipmitool or ipmi-sensors
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/ipmi.md
+---
+-->

 ## `ipmistat` collector

--- a/collectors/likwidMetric.md
+++ b/collectors/likwidMetric.md
@@ -1,3 +1,13 @@
+<!--
+---
+title: LIKWID collector
+description: Collect hardware performance events and metrics using LIKWID
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/likwid.md
+---
+-->

 ## `likwid` collector

--- a/collectors/loadavgMetric.md
+++ b/collectors/loadavgMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: Load average metric collector
+description: Collect metrics from `/proc/loadavg`
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/loadavg.md
+---
+-->
+

 ## `loadavg` collector

--- a/collectors/lustreMetric.md
+++ b/collectors/lustreMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: Lustre filesystem metric collector
+description: Collect metrics for Lustre filesystems
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/lustre.md
+---
+-->
+

 ## `lustrestat` collector

--- a/collectors/memstatMetric.md
+++ b/collectors/memstatMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: Memory statistics metric collector
+description: Collect metrics from `/proc/meminfo`
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/memstat.md
+---
+-->
+

 ## `memstat` collector

--- a/collectors/netstatMetric.md
+++ b/collectors/netstatMetric.md
@@ -1,3 +1,13 @@
+<!--
+---
+title: Network device metric collector
+description: Collect metrics for network devices through procfs
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/netstat.md
+---
+-->

 ## `netstat` collector

--- a/collectors/nfs3Metric.md
+++ b/collectors/nfs3Metric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: NFS network filesystem (v3) metric collector
+description: Collect metrics for NFS network filesystems in version 3
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/nfs3.md
+---
+-->
+

 ## `nfs3stat` collector

--- a/collectors/nfs4Metric.md
+++ b/collectors/nfs4Metric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: NFS network filesystem (v4) metric collector
+description: Collect metrics for NFS network filesystems in version 4
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/nfs4.md
+---
+-->
+

 ## `nfs4stat` collector

--- a/collectors/nfsiostatMetric.md
+++ b/collectors/nfsiostatMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: NFS network filesystem metrics from procfs
+description: Collect NFS network filesystem metrics for mounts from `/proc/self/mountstats`
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/nfsio.md
+---
+-->
+
 ## `nfsiostat` collector

 ```json
--- a/collectors/numastatsMetric.md
+++ b/collectors/numastatsMetric.md
@@ -1,3 +1,13 @@
+<!--
+---
+title: NUMAStat collector
+description: Collect infos about NUMA domains
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/numastat.md
+---
+-->

 ## `numastat` collector

--- a/collectors/nvidiaMetric.go
+++ b/collectors/nvidiaMetric.go
@@ -31,6 +31,8 @@ type NvidiaCollectorDevice struct {
 	excludeMetrics      map[string]bool
 	tags                map[string]string
 	meta                map[string]string
+	lastEnergyReading   uint64
+	lastEnergyTimestamp time.Time
 }

 type NvidiaCollector struct {
@@ -149,6 +151,8 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {

 		// Add device handle
 		g.device = device
+		g.lastEnergyReading = 0
+		g.lastEnergyTimestamp = time.Now()

 		// Add tags
 		g.tags = map[string]string{
@@ -206,7 +210,7 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
 	return nil
 }

-func readMemoryInfo(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readMemoryInfo(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_fb_mem_total"] || !device.excludeMetrics["nv_fb_mem_used"] || !device.excludeMetrics["nv_fb_mem_reserved"] {
 		var total uint64
 		var used uint64
@@ -250,7 +254,7 @@ func readMemoryInfo(device NvidiaCollectorDevice, output chan lp.CCMessage) erro
 	return nil
 }

-func readBarMemoryInfo(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readBarMemoryInfo(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_bar1_mem_total"] || !device.excludeMetrics["nv_bar1_mem_used"] {
 		meminfo, ret := nvml.DeviceGetBAR1MemoryInfo(device.device)
 		if ret != nvml.SUCCESS {
@@ -277,7 +281,7 @@ func readBarMemoryInfo(device NvidiaCollectorDevice, output chan lp.CCMessage) e
 	return nil
 }

-func readUtilization(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readUtilization(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	isMig, ret := nvml.DeviceIsMigDeviceHandle(device.device)
 	if ret != nvml.SUCCESS {
 		err := errors.New(nvml.ErrorString(ret))
@@ -319,7 +323,7 @@ func readUtilization(device NvidiaCollectorDevice, output chan lp.CCMessage) err
 	return nil
 }

-func readTemp(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readTemp(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_temp"] {
 		// Retrieves the current temperature readings for the device, in degrees C.
 		//
@@ -338,7 +342,7 @@ func readTemp(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	return nil
 }

-func readFan(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readFan(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_fan"] {
 		// Retrieves the intended operating speed of the device's fan.
 		//
@@ -361,7 +365,7 @@ func readFan(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	return nil
 }

-// func readFans(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+// func readFans(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 // 	if !device.excludeMetrics["nv_fan"] {
 // 		numFans, ret := nvml.DeviceGetNumFans(device.device)
 // 		if ret == nvml.SUCCESS {
@@ -382,7 +386,7 @@ func readFan(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 // 	return nil
 // }

-func readEccMode(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readEccMode(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_ecc_mode"] {
 		// Retrieves the current and pending ECC modes for the device.
 		//
@@ -416,7 +420,7 @@ func readEccMode(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	return nil
 }

-func readPerfState(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readPerfState(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_perf_state"] {
 		// Retrieves the current performance state for the device.
 		//
@@ -436,13 +440,16 @@ func readPerfState(device NvidiaCollectorDevice, output chan lp.CCMessage) error
 	return nil
 }

-func readPowerUsage(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readPowerUsage(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_power_usage"] {
 		// Retrieves power usage for this GPU in milliwatts and its associated circuitry (e.g. memory)
 		//
 		// On Fermi and Kepler GPUs the reading is accurate to within +/- 5% of current power draw.
+		// On Ampere (except GA100) or newer GPUs, the API returns power averaged over 1 sec interval.
+		// On GA100 and older architectures, instantaneous power is returned.
 		//
-		// It is only available if power management mode is supported
+		// It is only available if power management mode is supported.
+
 		mode, ret := nvml.DeviceGetPowerManagementMode(device.device)
 		if ret != nvml.SUCCESS {
 			return nil
@@ -461,7 +468,54 @@ func readPowerUsage(device NvidiaCollectorDevice, output chan lp.CCMessage) erro
 	return nil
 }

-func readClocks(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readEnergyConsumption(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
+	// Retrieves total energy consumption for this GPU in millijoules (mJ) since the driver was last reloaded
+
+	// For Volta or newer fully supported devices.
+	if !device.excludeMetrics["nv_energy"] || !device.excludeMetrics["nv_energy_abs"] || !device.excludeMetrics["nv_average_power"] {
+		now := time.Now()
+		mode, ret := nvml.DeviceGetPowerManagementMode(device.device)
+		if ret != nvml.SUCCESS {
+			return nil
+		}
+		if mode == nvml.FEATURE_ENABLED {
+			energy, ret := nvml.DeviceGetTotalEnergyConsumption(device.device)
+			if ret == nvml.SUCCESS {
+				if device.lastEnergyReading != 0 && energy >= device.lastEnergyReading { // skip first sample and counter resets
+					if !device.excludeMetrics["nv_energy"] {
+						y, err := lp.NewMetric("nv_energy", device.tags, device.meta, (energy-device.lastEnergyReading)/1000, now)
+						if err == nil {
+							y.AddMeta("unit", "Joules")
+							output <- y
+						}
+					}
+					if timeDiff := now.Sub(device.lastEnergyTimestamp).Seconds(); !device.excludeMetrics["nv_average_power"] && timeDiff > 0 {
+						// Average power in watts: energy delta (mJ converted to J) divided by elapsed seconds.
+						energyDiff := float64(energy-device.lastEnergyReading) / 1000.0
+						avgPower := energyDiff / timeDiff
+						y, err := lp.NewMetric("nv_average_power", device.tags, device.meta, avgPower, now)
+						if err == nil {
+							y.AddMeta("unit", "watts")
+							output <- y
+						}
+					}
+				}
+				if !device.excludeMetrics["nv_energy_abs"] {
+					y, err := lp.NewMetric("nv_energy_abs", device.tags, device.meta, energy/1000, now)
+					if err == nil {
+						y.AddMeta("unit", "Joules")
+						output <- y
+					}
+				}
+				device.lastEnergyReading = energy
+				device.lastEnergyTimestamp = now
+			}
+		}
+	}
+	return nil
+}
+
+func readClocks(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	// Retrieves the current clock speeds for the device.
 	//
 	// Available clock information:
@@ -513,7 +567,7 @@ func readClocks(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	return nil
 }

-func readMaxClocks(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readMaxClocks(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	// Retrieves the maximum clock speeds for the device.
 	//
 	// Available clock information:
@@ -571,7 +625,7 @@ func readMaxClocks(device NvidiaCollectorDevice, output chan lp.CCMessage) error
 	return nil
 }

-func readEccErrors(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readEccErrors(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_ecc_uncorrected_error"] {
 		// Retrieves the total ECC error counts for the device.
 		//
@@ -602,7 +656,7 @@ func readEccErrors(device NvidiaCollectorDevice, output chan lp.CCMessage) error
 	return nil
 }

-func readPowerLimit(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readPowerLimit(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_power_max_limit"] {
 		// Retrieves the power management limit associated with this device.
 		//
@@ -622,7 +676,7 @@ func readPowerLimit(device NvidiaCollectorDevice, output chan lp.CCMessage) erro
 	return nil
 }

-func readEncUtilization(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readEncUtilization(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	isMig, ret := nvml.DeviceIsMigDeviceHandle(device.device)
 	if ret != nvml.SUCCESS {
 		err := errors.New(nvml.ErrorString(ret))
@@ -649,7 +703,7 @@ func readEncUtilization(device NvidiaCollectorDevice, output chan lp.CCMessage)
 	return nil
 }

-func readDecUtilization(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readDecUtilization(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	isMig, ret := nvml.DeviceIsMigDeviceHandle(device.device)
 	if ret != nvml.SUCCESS {
 		err := errors.New(nvml.ErrorString(ret))
@@ -676,7 +730,7 @@ func readDecUtilization(device NvidiaCollectorDevice, output chan lp.CCMessage)
 	return nil
 }

-func readRemappedRows(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readRemappedRows(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_remapped_rows_corrected"] ||
 		!device.excludeMetrics["nv_remapped_rows_uncorrected"] ||
 		!device.excludeMetrics["nv_remapped_rows_pending"] ||
@@ -729,7 +783,7 @@ func readRemappedRows(device NvidiaCollectorDevice, output chan lp.CCMessage) er
 	return nil
 }

-func readProcessCounts(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readProcessCounts(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	if !device.excludeMetrics["nv_compute_processes"] {
 		// Get information about processes with a compute context on a device
 		//
@@ -821,7 +875,7 @@ func readProcessCounts(device NvidiaCollectorDevice, output chan lp.CCMessage) e
 	return nil
 }

-func readViolationStats(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readViolationStats(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	var violTime nvml.ViolationTime
 	var ret nvml.Return

@@ -935,7 +989,7 @@ func readViolationStats(device NvidiaCollectorDevice, output chan lp.CCMessage)
 	return nil
 }

-func readNVLinkStats(device NvidiaCollectorDevice, output chan lp.CCMessage) error {
+func readNVLinkStats(device *NvidiaCollectorDevice, output chan lp.CCMessage) error {
 	// Retrieves the specified error counter value
 	// Please refer to \a nvmlNvLinkErrorCounter_t for error counters that are available
 	//
@@ -1070,7 +1124,7 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMessage)
 		return
 	}

-	readAll := func(device NvidiaCollectorDevice, output chan lp.CCMessage) {
+	readAll := func(device *NvidiaCollectorDevice, output chan lp.CCMessage) {
 		name, ret := nvml.DeviceGetName(device.device)
 		if ret != nvml.SUCCESS {
 			name = "NoName"
@@ -1110,6 +1164,11 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMessage)
 			cclog.ComponentDebug(m.name, "readPowerUsage for device", name, "failed")
 		}

+		err = readEnergyConsumption(device, output)
+		if err != nil {
+			cclog.ComponentDebug(m.name, "readEnergyConsumption for device", name, "failed")
+		}
+
 		err = readClocks(device, output)
 		if err != nil {
 			cclog.ComponentDebug(m.name, "readClocks for device", name, "failed")
@@ -1169,7 +1228,7 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMessage)
 	// Actual read loop over all attached Nvidia GPUs
 	for i := 0; i < m.num_gpus; i++ {

-		readAll(m.gpus[i], output)
+		readAll(&m.gpus[i], output)

 		// Iterate over all MIG devices if any
 		if m.config.ProcessMigDevices {
@@ -1243,7 +1302,7 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMessage)
 					}
 				}

-				readAll(migDevice, output)
+				readAll(&migDevice, output)
 			}
 		}
 	}
--- a/collectors/nvidiaMetric.md
+++ b/collectors/nvidiaMetric.md
@@ -1,3 +1,13 @@
+<!--
+---
+title: "Nvidia NVML metric collector"
+description: Collect metrics for Nvidia GPUs using the NVML
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/nvidia.md
+---
+-->

 ## `nvidia` collector

@@ -72,5 +82,8 @@ Metrics:
 * `nv_nvlink_ecc_errors`
 * `nv_nvlink_replay_errors`
 * `nv_nvlink_recovery_errors`
+* `nv_energy`
+* `nv_energy_abs`
+* `nv_average_power`

 Some metrics add the additional sub type tag (`stype`) like the `nv_nvlink_*` metrics set `stype=nvlink,stype-id=<link_number>`. 
--- a/collectors/raplMetric.md
+++ b/collectors/raplMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: RAPL metric collector
+description: Collect energy data through the RAPL sysfs interface
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/rapl.md
+---
+-->
+
 ## `rapl` collector

 This collector reads running average power limit (RAPL) monitoring attributes to compute average power consumption metrics. See <https://www.kernel.org/doc/html/latest/power/powercap/powercap.html#monitoring-attributes>.
--- a/collectors/rocmsmiMetric.md
+++ b/collectors/rocmsmiMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: "ROCm SMI metric collector"
+description: Collect metrics for AMD GPUs using the SMI library
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/rocmsmi.md
+---
+-->
+

 ## `rocm_smi` collector

--- a/collectors/schedstatMetric.md
+++ b/collectors/schedstatMetric.md
@@ -1,3 +1,13 @@
+<!--
+---
+title: SchedStat Metric collector
+description: Collect metrics from `/proc/schedstat`
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/schedstat.md
+---
+-->

 ## `schedstat` collector
 ```json
--- a/collectors/selfMetric.md
+++ b/collectors/selfMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: Self-monitoring metric collector
+description: Collect metrics from the execution of cc-metric-collector itself
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/self.md
+---
+-->
+
 ## `self` collector

 ```json
--- a/collectors/tempMetric.md
+++ b/collectors/tempMetric.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: Temperature metric collector
+description: Collect thermal metrics from `/sys/class/hwmon/*`
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/temp.md
+---
+-->
+

 ## `tempstat` collector

--- a/collectors/topprocsMetric.md
+++ b/collectors/topprocsMetric.md
@@ -1,3 +1,15 @@
+<!--
+---
+title: TopProcs collector
+description: Collect infos about most CPU-consuming processes
+categories: [cc-metric-collector]
+tags: ['Admin']
+weight: 2
+hugo_path: docs/reference/cc-metric-collector/collectors/topprocs.md
+---
+-->
+
+

 ## `topprocs` collector

--- a/internal/metricAggregator/README.md
+++ b/internal/metricAggregator/README.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: Metric Aggregator
+description: Subsystem for evaluating expressions on metrics (deprecated)
+categories: [cc-metric-collector]
+tags: ['Developer']
+weight: 1
+hugo_path: docs/reference/cc-metric-collector/internal/metricaggregator/_index.md
+---
+-->
+
 # The MetricAggregator

 In some cases, further combination of metrics or raw values is required. For that strings like `foo + 1` with runtime dependent `foo` need to be evaluated. The MetricAggregator relies on the [`gval`](https://github.com/PaesslerAG/gval) Golang package to perform all expression evaluation. The `gval` package provides the basic arithmetic operations but the MetricAggregator defines additional ones.
--- a/internal/metricRouter/README.md
+++ b/internal/metricRouter/README.md
@@ -1,11 +1,22 @@
+<!--
+---
+title: Message Router
+description: Routing component inside cc-metric-collector
+categories: [cc-metric-collector]
+tags: ['Developer']
+weight: 1
+hugo_path: docs/reference/cc-metric-collector/internal/metricrouter/_index.md
+---
+-->
+
 # CC Metric Router

-The CCMetric router sits in between the collectors and the sinks and can be used to add and remove tags to/from traversing [CCMessages](https://pkg.go.dev/github.com/ClusterCockpit/cc-energy-manager@v0.0.0-20240919152819-92a17f2da4f7/pkg/cc-message.
+The CCMetric router sits in between the collectors and the sinks and can be used to add and remove tags to/from traversing [CCMessages](https://pkg.go.dev/github.com/ClusterCockpit/cc-lib/ccMessage).


 # Configuration

-**Note**: Use the [message processor configuration](../../pkg/messageProcessor/README.md) with option `process_messages`.
+**Note**: Use the [message processor configuration](https://github.com/ClusterCockpit/cc-lib/blob/main/messageProcessor/README.md) with option `process_messages`.

 ```json
 {
@@ -69,7 +80,7 @@ The CCMetric router sits in between the collectors and the sinks and can be used

 There are three main options `add_tags`, `delete_tags` and `interval_timestamp`. `add_tags` and `delete_tags` are lists consisting of dicts with `key`, `value` and `if`. The `value` can be omitted in the `delete_tags` part as it only uses the `key` for removal. The `interval_timestamp` setting means that a unique timestamp is applied to all metrics traversing the router during an interval.

-**Note**: Use the [message processor configuration](../../pkg/messageProcessor/README.md) (option `process_messages`) instead of `add_tags`, `delete_tags`, `drop_metrics`, `drop_metrics_if`, `rename_metrics`, `normalize_units` and `change_unit_prefix`. These options are deprecated and will be removed in future versions. Until then, they are added to the message processor.
+**Note**: Use the [message processor configuration](https://github.com/ClusterCockpit/cc-lib/blob/main/messageProcessor/README.md) (option `process_messages`) instead of `add_tags`, `delete_tags`, `drop_metrics`, `drop_metrics_if`, `rename_metrics`, `normalize_units` and `change_unit_prefix`. These options are deprecated and will be removed in future versions. Until then, they are added to the message processor.

 # Processing order in the router

@@ -225,13 +236,13 @@ __deprecated__


 The cc-metric-collector tries to read the data from the system as it is reported. If available, it tries to read the metric unit from the system as well (e.g. from `/proc/meminfo`). The problem is that, depending on the source, the metric units are named differently. Just think about `byte`, `Byte`, `B`, `bytes`, ...
-The [cc-units](https://github.com/ClusterCockpit/cc-units) package provides us a normalization option to use the same metric unit name for all metrics. It this option is set to true, all `unit` meta tags are normalized.
+The [cc-units](https://github.com/ClusterCockpit/cc-lib/ccUnits) package provides a normalization option to use the same metric unit name for all metrics. If this option is set to true, all `unit` meta tags are normalized.

 ## The `change_unit_prefix` section

 __deprecated__

-It is often the case that metrics are reported by the system using a rather outdated unit prefix (like `/proc/meminfo` still uses kByte despite current memory sizes are in the GByte range). If you want to change the prefix of a unit, you can do that with the help of [cc-units](https://github.com/ClusterCockpit/cc-units). The setting works on the metric name and requires the new prefix for the metric. The cc-units package determines the scaling factor.
+Metrics are often reported by the system with a rather outdated unit prefix (for example, `/proc/meminfo` still uses kByte although current memory sizes are in the GByte range). If you want to change the prefix of a unit, you can do that with the help of [cc-units](https://github.com/ClusterCockpit/cc-lib/ccUnits). The setting works on the metric name and requires the new prefix for the metric. The cc-units package determines the scaling factor.
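
To illustrate what "determines the scaling factor" means, here is a minimal sketch of a prefix change. It does not use the cc-units API; `prefixFactor` and `changePrefix` are hypothetical names, and the sketch assumes decimal (power-of-ten) prefixes.

```go
package main

import "fmt"

// prefixFactor maps a few decimal unit prefixes to their power-of-ten factor.
// Assumption for illustration: prefixes are decimal, not binary (Ki, Mi, ...).
var prefixFactor = map[string]float64{"": 1, "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}

// changePrefix rescales value from one prefix to another,
// e.g. from kByte to GByte.
func changePrefix(value float64, from, to string) (float64, error) {
	f, ok := prefixFactor[from]
	if !ok {
		return 0, fmt.Errorf("unknown prefix %q", from)
	}
	t, ok := prefixFactor[to]
	if !ok {
		return 0, fmt.Errorf("unknown prefix %q", to)
	}
	return value * f / t, nil
}

func main() {
	// 16,000,000 kByte expressed in GByte.
	v, err := changePrefix(16_000_000, "k", "G")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.0f GByte\n", v)
}
```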

 # Aggregate metric values of the current interval with the `interval_aggregates` option

@@ -263,7 +274,7 @@ The above configuration, collects all metric values for metrics evaluating `if`
 If you are not interested in the input metrics `sub_metric_%d+` at all, you can add the same condition used here to the `drop_metrics_if` section to drop them.

 Use cases for `interval_aggregates`:
- Combine multiple metrics of the a collector to a new one like the [MemstatCollector](../../collectors/memstatMetric.md) does it for `mem_used`)):
+- Combine multiple metrics of a collector into a new one, as the [MemstatCollector](../../collectors/memstatMetric.md) does for `mem_used`:
 ```json
  {
    "name" : "mem_used",
--- a/pkg/multiChanTicker/README.md
+++ b/pkg/multiChanTicker/README.md
@@ -1,3 +1,14 @@
+<!--
+---
+title: Multi-channel Ticker
+description: Timer ticker that sends out the tick to multiple channels
+categories: [cc-metric-collector]
+tags: ['Developer']
+weight: 1
+hugo_path: docs/reference/cc-metric-collector/pkg/multichanticker/_index.md
+---
+-->
+
 # MultiChanTicker

 The idea of this ticker is to multiply the output channels. The original Golang `time.Ticker` provides only a single output channel, so the signal can only be received by a single consumer. This ticker allows adding multiple channels, all of which are notified about the time tick.
Author	SHA1	Message	Date
Thomas Roehl	3877e4a0b6	Add energy metrics to Nvidia collector README	2025-04-28 15:36:04 +00:00
Thomas Roehl	a606a3af01	Add energy metrics from NVML to Nvidia NVML collector	2025-04-28 15:29:13 +00:00
Thomas Roehl	f8b2ac0d2c	Fix URL to new location of cc-units	2025-04-22 12:48:15 +02:00
Thomas Roehl	ec34b40295	Merge branch 'main' of github.com:ClusterCockpit/cc-metric-collector	2025-04-17 11:38:03 +02:00
Thomas Gruber	03cd965099	Merge develop into main for documentation (#143 ) * Fix Release part * Fix Release part * Update Hugo integration (#142)	2025-04-17 11:37:47 +02:00