Merge current development version into main (#48)

* DiskstatCollector: cast part_max_used metric to int
* Add uint types to GangliaSink and LibgangliaSink
* Use new sink instances to allow multiple of same sink type
* Update sink README and SampleSink
* Use new receiver instances to allow multiple of same receiver type
* Fix metric scope in likwid configuration script
* Mention likwid config script in LikwidCollector README
* Refactor: Embed Init() into New() function
* Refactor: Embed Init() into New() function
* Fix: MetricReceiver uses uninitialized values, when initialization fails
* Use Ganglia configuration (#44)
* Copy all metric configurations from original Ganglia code
* Use metric configurations from Ganglia for some metrics
* Format value string also for known metrics
* Numa-aware memstat collector (#45)
* Add samples for collectors, sinks and receivers
* Ping InfluxDB server after connecting to recognize faulty connections
* Add sink for Prometheus monitoring system (#46)
* Add sink for Prometheus monitoring system
* Add prometheus sink to README
* Add scraper for Prometheus clients (#47)

Co-authored-by: Holger Obermaier <holgerob@gmx.de>
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
2025-11-04 18:55:05 +01:00 · 2022-02-25 14:49:49 +01:00
parent b4cc6d54ea
commit d98076c792
27 changed files with 1509 additions and 517 deletions
--- a/collectors/diskstatMetric.go
+++ b/collectors/diskstatMetric.go
@@ -102,7 +102,7 @@ func (m *DiskstatCollector) Read(interval time.Duration, output chan lp.CCMetric
 			part_max_used = perc
 		}
 	}
-	y, err := lp.New("part_max_used", map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": part_max_used}, time.Now())
+	y, err := lp.New("part_max_used", map[string]string{"type": "node"}, m.meta, map[string]interface{}{"value": int(part_max_used)}, time.Now())
 	if err == nil {
 		y.AddMeta("unit", "percent")
 		output <- y
--- a/collectors/likwidMetric.md
+++ b/collectors/likwidMetric.md
@@ -4,7 +4,7 @@
 The `likwid` collector is probably the most complicated collector. The LIKWID library is included as static library with *direct* access mode. The *direct* access mode is suitable if the daemon is executed by a root user. The static library does not contain the performance groups, so all information needs to be provided in the configuration.

 The `likwid` configuration consists of two parts, the "eventsets" and "globalmetrics":
- An event set list itself has two parts, the "events" and a set of derivable "metrics". Each of the "events" is a counter:event pair in LIKWID's syntax. The "metrics" are a list of formulas to derive the metric value from the measurements of the "events". Each metric has a name, the formula, a scope and a publish flag. A counter names can be used like variables in the formulas, so `PMC0+PMC1` sums the measurements for the both events configured in the counters `PMC0` and `PMC1`. The scope tells the Collector whether it is a metric for each hardware thread (`cpu`) or each CPU socket (`socket`). The last one is the publishing flag. It tells the collector whether a metric should be sent to the router.
+- An event set list itself has two parts, the "events" and a set of derivable "metrics". Each of the "events" is a counter:event pair in LIKWID's syntax. The "metrics" are a list of formulas to derive the metric value from the measurements of the "events". Each metric has a name, the formula, a scope and a publish flag. Counter names can be used like variables in the formulas, so `PMC0+PMC1` sums the measurements for both events configured in the counters `PMC0` and `PMC1`. The scope tells the collector whether it is a metric for each hardware thread (`cpu`) or each CPU socket (`socket`). The last one is the publishing flag. It tells the collector whether a metric should be sent to the router.
 - The global metrics are metrics which require data from all event set measurements to be derived. The inputs are the metrics in the event sets. Similar to the metrics in the event sets, the global metrics are defined by a name, a formula, a scope and a publish flag. See event set metrics for details. The only difference is that there is no access to the raw event measurements anymore but only to the metrics. So, the idea is to derive a metric in the "eventsets" section and reuse it in the "globalmetrics" part. If you need a metric only for deriving the global metrics, disable forwarding of the event set metrics. **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.

 Additional options:
@@ -26,6 +26,42 @@ As a guideline:
 - All `PWRx` counters have scope `socket`, except `"PWR1" : "RAPL_CORE_ENERGY"` has `cpu` scope
 - All `DFCx` counters have scope `socket`

+### Help with the configuration
+
+The configuration for the `likwid` collector is quite complicated. Most users don't use LIKWID with the event:counter notation but rely on the performance groups defined by the LIKWID team for each architecture. In order to help with the `likwid` collector configuration, we included a script `scripts/likwid_perfgroup_to_cc_config.py` that creates the configuration of an `eventset` from a performance group (using a LIKWID installation in `$PATH`):
+```
+$ likwid-perfctr -i
+[...]
+short name: ICX
+[...]
+$ likwid-perfctr -a
+[...]
+MEM_DP
+MEM
+FLOPS_SP
+CLOCK
+[...]
+$ scripts/likwid_perfgroup_to_cc_config.py ICX MEM_DP
+{
+  "events": {
+    "FIXC0": "INSTR_RETIRED_ANY",
+    "..." : "..."
+  },
+  "metrics" : [
+    {
+      "calc": "time",
+      "name": "Runtime (RDTSC) [s]",
+      "publish": true,
+      "scope": "hwthread"
+    },
+    {
+      "..." : "..."
+    }
+  ]
+}
+```
+
+You can copy this JSON and add it to the `eventsets` list. If you specify multiple event sets, you can add globally derived metrics in the extra `globalmetrics` section with the metric names as variables.

 ### Example configuration

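The event set JSON printed by the script in the README hunk above has a simple shape: an `events` map and a `metrics` list. A minimal Go sketch for decoding and inspecting such a document follows. The `Metric` and `EventSet` types and the `parseEventSet` helper are hypothetical names introduced for illustration, derived only from the JSON keys shown above; they are not the collector's own types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metric is a hypothetical type mirroring one entry of the "metrics" list.
type Metric struct {
	Name    string `json:"name"`
	Calc    string `json:"calc"`
	Scope   string `json:"scope"`
	Publish bool   `json:"publish"`
}

// EventSet is a hypothetical type mirroring the script output:
// counter name -> event name, plus the derived metrics.
type EventSet struct {
	Events  map[string]string `json:"events"`
	Metrics []Metric          `json:"metrics"`
}

// parseEventSet decodes one event set from JSON.
func parseEventSet(data []byte) (EventSet, error) {
	var es EventSet
	err := json.Unmarshal(data, &es)
	return es, err
}

func main() {
	raw := []byte(`{
  "events": {"FIXC0": "INSTR_RETIRED_ANY"},
  "metrics": [
    {"calc": "time", "name": "Runtime (RDTSC) [s]", "publish": true, "scope": "hwthread"}
  ]
}`)
	es, err := parseEventSet(raw)
	if err != nil {
		fmt.Println("invalid event set:", err)
		return
	}
	fmt.Println(es.Events["FIXC0"], len(es.Metrics), es.Metrics[0].Scope)
}
```

Decoding the output before pasting it into the configuration is a cheap way to catch malformed JSON early.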
--- a/collectors/memstatMetric.go
+++ b/collectors/memstatMetric.go
@@ -1,35 +1,76 @@
 package collectors

 import (
+	"bufio"
 	"encoding/json"
 	"errors"
 	"fmt"
-	"io/ioutil"
-	"log"
+	"os"
+	"path/filepath"
+	"regexp"
 	"strconv"
 	"strings"
 	"time"

+	cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
 	lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
 )

-const MEMSTATFILE = `/proc/meminfo`
+const MEMSTATFILE = "/proc/meminfo"
+const NUMA_MEMSTAT_BASE = "/sys/devices/system/node"

 type MemstatCollectorConfig struct {
 	ExcludeMetrics []string `json:"exclude_metrics"`
+	NodeStats      bool     `json:"node_stats,omitempty"`
+	NumaStats      bool     `json:"numa_stats,omitempty"`
+}
+
+type MemstatCollectorNode struct {
+	file string
+	tags map[string]string
 }

 type MemstatCollector struct {
 	metricCollector
-	stats   map[string]int64
-	tags    map[string]string
-	matches map[string]string
-	config  MemstatCollectorConfig
+	stats     map[string]int64
+	tags      map[string]string
+	matches   map[string]string
+	config    MemstatCollectorConfig
+	nodefiles map[int]MemstatCollectorNode
+}
+
+func getStats(filename string) map[string]float64 {
+	stats := make(map[string]float64)
+	file, err := os.Open(filename)
+	if err != nil {
+		cclog.Error(err.Error())
+	}
+	defer file.Close()
+
+	scanner := bufio.NewScanner(file)
+	for scanner.Scan() {
+		line := scanner.Text()
+		linefields := strings.Fields(line)
+		if len(linefields) == 3 {
+			v, err := strconv.ParseFloat(linefields[1], 64)
+			if err == nil {
+				stats[strings.Trim(linefields[0], ":")] = v
+			}
+		} else if len(linefields) == 5 {
+			v, err := strconv.ParseFloat(linefields[3], 64)
+			if err == nil {
+				stats[strings.Trim(linefields[2], ":")] = v
+			}
+		}
+	}
+	return stats
 }

 func (m *MemstatCollector) Init(config json.RawMessage) error {
 	var err error
 	m.name = "MemstatCollector"
+	m.config.NodeStats = true
+	m.config.NumaStats = false
 	if len(config) > 0 {
 		err = json.Unmarshal(config, &m.config)
 		if err != nil {
@@ -40,7 +81,8 @@ func (m *MemstatCollector) Init(config json.RawMessage) error {
 	m.stats = make(map[string]int64)
 	m.matches = make(map[string]string)
 	m.tags = map[string]string{"type": "node"}
-	matches := map[string]string{`MemTotal`: "mem_total",
+	matches := map[string]string{
+		"MemTotal":     "mem_total",
 		"SwapTotal":    "swap_total",
 		"SReclaimable": "mem_sreclaimable",
 		"Slab":         "mem_slab",
@@ -48,7 +90,9 @@ func (m *MemstatCollector) Init(config json.RawMessage) error {
 		"Buffers":      "mem_buffers",
 		"Cached":       "mem_cached",
 		"MemAvailable": "mem_available",
-		"SwapFree":     "swap_free"}
+		"SwapFree":     "swap_free",
+		"MemShared":    "mem_shared",
+	}
 	for k, v := range matches {
 		_, skip := stringArrayContains(m.config.ExcludeMetrics, k)
 		if !skip {
@@ -56,13 +100,44 @@ func (m *MemstatCollector) Init(config json.RawMessage) error {
 		}
 	}
 	if len(m.matches) == 0 {
-		return errors.New("No metrics to collect")
+		return errors.New("no metrics to collect")
 	}
 	m.setup()
-	_, err = ioutil.ReadFile(string(MEMSTATFILE))
-	if err == nil {
-		m.init = true
+
+	if m.config.NodeStats {
+		if stats := getStats(MEMSTATFILE); len(stats) == 0 {
+			return fmt.Errorf("cannot read data from file %s", MEMSTATFILE)
+		}
 	}
+
+	if m.config.NumaStats {
+		globPattern := filepath.Join(NUMA_MEMSTAT_BASE, "node[0-9]*", "meminfo")
+		regex := regexp.MustCompile(filepath.Join(NUMA_MEMSTAT_BASE, "node(\\d+)", "meminfo"))
+		files, err := filepath.Glob(globPattern)
+		if err == nil {
+			m.nodefiles = make(map[int]MemstatCollectorNode)
+			for _, f := range files {
+				if stats := getStats(f); len(stats) == 0 {
+					return fmt.Errorf("cannot read data from file %s", f)
+				}
+				rematch := regex.FindStringSubmatch(f)
+				if len(rematch) == 2 {
+					id, err := strconv.Atoi(rematch[1])
+					if err == nil {
+						f := MemstatCollectorNode{
+							file: f,
+							tags: map[string]string{
+								"type":    "memoryDomain",
+								"type-id": fmt.Sprintf("%d", id),
+							},
+						}
+						m.nodefiles[id] = f
+					}
+				}
+			}
+		}
+	}
+	m.init = true
 	return err
 }

@@ -71,56 +146,41 @@ func (m *MemstatCollector) Read(interval time.Duration, output chan lp.CCMetric)
 		return
 	}

-	buffer, err := ioutil.ReadFile(string(MEMSTATFILE))
-	if err != nil {
-		log.Print(err)
-		return
-	}
-
-	ll := strings.Split(string(buffer), "\n")
-	for _, line := range ll {
-		ls := strings.Split(line, `:`)
-		if len(ls) > 1 {
-			lv := strings.Fields(ls[1])
-			m.stats[ls[0]], err = strconv.ParseInt(lv[0], 0, 64)
+	sendStats := func(stats map[string]float64, tags map[string]string) {
+		for match, name := range m.matches {
+			var value float64 = 0
+			if v, ok := stats[match]; ok {
+				value = v
+			}
+			y, err := lp.New(name, tags, m.meta, map[string]interface{}{"value": value}, time.Now())
+			if err == nil {
+				output <- y
+			}
 		}
-	}
-
-	if _, exists := m.stats[`MemTotal`]; !exists {
-		err = errors.New("Parse error")
-		log.Print(err)
-		return
-	}
-
-	for match, name := range m.matches {
-		if _, exists := m.stats[match]; !exists {
-			err = fmt.Errorf("Parse error for %s : %s", match, name)
-			log.Print(err)
-			continue
-		}
-		y, err := lp.New(name, m.tags, m.meta, map[string]interface{}{"value": int(float64(m.stats[match]) * 1.0e-3)}, time.Now())
-		if err == nil {
-			output <- y
-		}
-	}
-
-	if _, free := m.stats[`MemFree`]; free {
-		if _, buffers := m.stats[`Buffers`]; buffers {
-			if _, cached := m.stats[`Cached`]; cached {
-				memUsed := m.stats[`MemTotal`] - (m.stats[`MemFree`] + m.stats[`Buffers`] + m.stats[`Cached`])
-				_, skip := stringArrayContains(m.config.ExcludeMetrics, "mem_used")
-				y, err := lp.New("mem_used", m.tags, m.meta, map[string]interface{}{"value": int(float64(memUsed) * 1.0e-3)}, time.Now())
-				if err == nil && !skip {
-					output <- y
+		if _, skip := stringArrayContains(m.config.ExcludeMetrics, "mem_used"); !skip {
+			if freeVal, free := stats["MemFree"]; free {
+				if bufVal, buffers := stats["Buffers"]; buffers {
+					if cacheVal, cached := stats["Cached"]; cached {
+						memUsed := stats["MemTotal"] - (freeVal + bufVal + cacheVal)
+						y, err := lp.New("mem_used", tags, m.meta, map[string]interface{}{"value": memUsed}, time.Now())
+						if err == nil {
+							output <- y
+						}
+					}
 				}
 			}
 		}
 	}
-	if _, found := m.stats[`MemShared`]; found {
-		_, skip := stringArrayContains(m.config.ExcludeMetrics, "mem_shared")
-		y, err := lp.New("mem_shared", m.tags, m.meta, map[string]interface{}{"value": int(float64(m.stats[`MemShared`]) * 1.0e-3)}, time.Now())
-		if err == nil && !skip {
-			output <- y
+
+	if m.config.NodeStats {
+		nodestats := getStats(MEMSTATFILE)
+		sendStats(nodestats, m.tags)
+	}
+
+	if m.config.NumaStats {
+		for _, nodeConf := range m.nodefiles {
+			stats := getStats(nodeConf.file)
+			sendStats(stats, nodeConf.tags)
 		}
 	}
 }
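The field-count parsing in `getStats` can be tried in isolation. A minimal sketch, assuming node-wide lines look like `MemTotal: 16384 kB` (three fields) and per-NUMA-domain lines look like `Node 0 MemTotal: 8192 kB` (five fields, with the metric name in the third field). `parseMeminfo` is a hypothetical helper that reads from a string instead of a file, so it runs without access to `/proc` or `/sys`:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfo (hypothetical helper) extracts "key -> value" pairs from
// meminfo-style text. Lines with any other field count are skipped.
func parseMeminfo(text string) map[string]float64 {
	stats := make(map[string]float64)
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		var key, raw string
		switch len(fields) {
		case 3: // "MemTotal: 16384 kB"
			key, raw = fields[0], fields[1]
		case 5: // "Node 0 MemTotal: 8192 kB"
			key, raw = fields[2], fields[3]
		default:
			continue
		}
		if v, err := strconv.ParseFloat(raw, 64); err == nil {
			stats[strings.TrimSuffix(key, ":")] = v
		}
	}
	return stats
}

func main() {
	node := parseMeminfo("MemTotal: 16384 kB\nMemFree: 4096 kB\nHugePages_Total: 0\n")
	numa := parseMeminfo("Node 0 MemTotal: 8192 kB\nNode 0 MemFree: 1024 kB\n")
	fmt.Println(node["MemTotal"], node["MemFree"], len(node))
	fmt.Println(numa["MemTotal"], numa["MemFree"], len(numa))
}
```

Lines without a unit, such as `HugePages_Total: 0`, have only two fields and are ignored by this scheme.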
--- a/collectors/sampleMetric.go
+++ b/collectors/sampleMetric.go
@@ -0,0 +1,81 @@
+package collectors
+
+import (
+	"encoding/json"
+	"time"
+
+	cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
+	lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
+)
+
+// These are the fields we read from the JSON configuration
+type SampleCollectorConfig struct {
+	Interval string `json:"interval"`
+}
+
+// This contains all variables we need during execution and the variables
+// defined by metricCollector (name, init, ...)
+type SampleCollector struct {
+	metricCollector
+	config SampleCollectorConfig // the configuration structure
+	meta   map[string]string          // default meta information
+	tags   map[string]string          // default tags
+}
+
+func (m *SampleCollector) Init(config json.RawMessage) error {
+	var err error = nil
+	// Always set the name early in Init() to use it in cclog.Component* functions
+	m.name = "SampleCollector"
+	// This is for later use, also call it early
+	m.setup()
+	// Define meta information sent with each metric
+	// (Can also be dynamic or this is the basic set with extension through AddMeta())
+	m.meta = map[string]string{"source": m.name, "group": "SAMPLE"}
+	// Define tags sent with each metric
+	// The 'type' tag is always needed, it defines the granularity of the metric
+	// node -> whole system
+	// socket -> CPU socket (requires socket ID as 'type-id' tag)
+	// cpu -> single CPU hardware thread (requires cpu ID as 'type-id' tag)
+	m.tags = map[string]string{"type": "node"}
+	// Read in the JSON configuration
+	if len(config) > 0 {
+		err = json.Unmarshal(config, &m.config)
+		if err != nil {
+			cclog.ComponentError(m.name, "Error reading config:", err.Error())
+			return err
+		}
+	}
+
+	// Set up everything that the collector requires during the Read() execution
+	// Check files required, test execution of some commands, create data structure
+	// for all topological entities (sockets, NUMA domains, ...)
+	// Return some useful error message in case of any failures
+
+	// Set this flag only if everything is initialized properly, all required files exist, ...
+	m.init = true
+	return err
+}
+
+func (m *SampleCollector) Read(interval time.Duration, output chan lp.CCMetric) {
+	// Create a sample metric
+	timestamp := time.Now()
+
+	value := 1.0
+	// If you want to measure something for a specific amount of time, use interval
+	// start := readState()
+	// time.Sleep(interval)
+	// stop := readState()
+	// value = (stop - start) / interval.Seconds()
+
+	y, err := lp.New("sample_metric", m.tags, m.meta, map[string]interface{}{"value": value}, timestamp)
+	if err == nil {
+		// Send it to output channel
+		output <- y
+	}
+
+}
+
+func (m *SampleCollector) Close() {
+	// Unset flag
+	m.init = false
+}
--- a/collectors/sampleTimerMetric.go
+++ b/collectors/sampleTimerMetric.go
@@ -0,0 +1,122 @@
+package collectors
+
+import (
+	"encoding/json"
+	"sync"
+	"time"
+
+	cclog "github.com/ClusterCockpit/cc-metric-collector/internal/ccLogger"
+	lp "github.com/ClusterCockpit/cc-metric-collector/internal/ccMetric"
+)
+
+// These are the fields we read from the JSON configuration
+type SampleTimerCollectorConfig struct {
+	Interval string `json:"interval"`
+}
+
+// This contains all variables we need during execution and the variables
+// defined by metricCollector (name, init, ...)
+type SampleTimerCollector struct {
+	metricCollector
+	wg       sync.WaitGroup             // sync group for management
+	done     chan bool                  // channel for management
+	meta     map[string]string          // default meta information
+	tags     map[string]string          // default tags
+	config   SampleTimerCollectorConfig // the configuration structure
+	interval time.Duration              // the interval parsed from configuration
+	ticker   *time.Ticker               // own timer
+	output   chan lp.CCMetric           // own internal output channel
+}
+
+func (m *SampleTimerCollector) Init(config json.RawMessage) error {
+	var err error = nil
+	// Always set the name early in Init() to use it in cclog.Component* functions
+	m.name = "SampleTimerCollector"
+	// This is for later use, also call it early
+	m.setup()
+	// Define meta information sent with each metric
+	// (Can also be dynamic or this is the basic set with extension through AddMeta())
+	m.meta = map[string]string{"source": m.name, "group": "SAMPLE"}
+	// Define tags sent with each metric
+	// The 'type' tag is always needed, it defines the granularity of the metric
+	// node -> whole system
+	// socket -> CPU socket (requires socket ID as 'type-id' tag)
+	// cpu -> single CPU hardware thread (requires cpu ID as 'type-id' tag)
+	m.tags = map[string]string{"type": "node"}
+	// Read in the JSON configuration
+	if len(config) > 0 {
+		err = json.Unmarshal(config, &m.config)
+		if err != nil {
+			cclog.ComponentError(m.name, "Error reading config:", err.Error())
+			return err
+		}
+	}
+	// Parse the read interval duration
+	m.interval, err = time.ParseDuration(m.config.Interval)
+	if err != nil {
+		cclog.ComponentError(m.name, "Error parsing interval:", err.Error())
+		return err
+	}
+
+	// Storage for output channel
+	m.output = nil
+	// Management channel for the timer function.
+	m.done = make(chan bool)
+	// Create the own ticker
+	m.ticker = time.NewTicker(m.interval)
+
+	// Start the timer loop with return functionality by sending 'true' to the done channel
+	m.wg.Add(1)
+	go func() {
+		for {
+			select {
+			case <-m.done:
+				cclog.ComponentDebug(m.name, "Closing...")
+				m.wg.Done()
+				return // exit the timer loop
+			case timestamp := <-m.ticker.C:
+				// Runs on every tick; a no-op until the first Read() provides the output channel
+				if m.output != nil {
+					m.ReadMetrics(timestamp)
+				}
+			}
+		}
+	}()
+
+	// Set this flag only if everything is initialized properly, all required files exist, ...
+	m.init = true
+	return err
+}
+
+// This function is called at each interval timer tick
+func (m *SampleTimerCollector) ReadMetrics(timestamp time.Time) {
+	// Create a sample metric
+
+	value := 1.0
+
+	// If you want to measure something for a specific amount of time, use interval
+	// start := readState()
+	// time.Sleep(interval)
+	// stop := readState()
+	// value = (stop - start) / interval.Seconds()
+
+	y, err := lp.New("sample_metric", m.tags, m.meta, map[string]interface{}{"value": value}, timestamp)
+	if err == nil && m.output != nil {
+		// Send it to output channel if we have a valid channel
+		m.output <- y
+	}
+}
+
+func (m *SampleTimerCollector) Read(interval time.Duration, output chan lp.CCMetric) {
+	// Capture output channel
+	m.output = output
+}
+
+func (m *SampleTimerCollector) Close() {
+	// Send signal to the timer loop to stop it
+	m.done <- true
+	// Wait until the timer loop is done
+	m.wg.Wait()
+	// Unset flag
+	m.init = false
+}