mirror of
https://github.com/ClusterCockpit/cc-metric-collector.git
synced 2026-06-10 13:57:30 +02:00
Compare commits
27 Commits
v0.7.5
...
cclog_upda
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
c9dfca622f | ||
|
|
3b0638e815 | ||
|
|
037b4f1526 | ||
|
|
bed5491068 | ||
|
|
a2eba41150 | ||
|
|
5d55ee7a77 | ||
|
|
5938368a76 | ||
|
|
077204d39f | ||
|
|
dcc9746df4 | ||
|
|
2c51a3ed72 | ||
|
|
656ea73d12 | ||
|
|
330f923596 | ||
|
|
8e58072ff6 | ||
|
|
0f6fee9db4 | ||
|
|
7585ee7289 | ||
|
|
30b2eb69dd | ||
|
|
2a51bd17f3 | ||
|
|
34d3d8970e | ||
|
|
50c7eba192 | ||
|
|
d215cabb3e | ||
|
|
86da3c15f7 | ||
|
|
93cd397b79 | ||
|
|
65b9c0ea14 | ||
|
|
0ecf06cee7 | ||
|
|
9eaf77db4f | ||
|
|
7cb5d1b47a | ||
|
|
319e71a853 |
4
.github/workflows/Release.yml
vendored
4
.github/workflows/Release.yml
vendored
@@ -270,7 +270,7 @@ jobs:
|
||||
- name: Install development packages
|
||||
run: |
|
||||
apt update && apt --assume-yes upgrade
|
||||
apt --assume-yes install build-essential sed git wget bash
|
||||
apt --assume-yes install build-essential sed git wget bash libdrm-dev
|
||||
# Checkout git repository and submodules
|
||||
# fetch-depth must be 0 to use git describe
|
||||
# See: https://github.com/marketplace/actions/checkout
|
||||
@@ -321,7 +321,7 @@ jobs:
|
||||
- name: Install development packages
|
||||
run: |
|
||||
apt update && apt --assume-yes upgrade
|
||||
apt --assume-yes install build-essential sed git wget bash
|
||||
apt --assume-yes install build-essential sed git wget bash libdrm-dev
|
||||
# Checkout git repository and submodules
|
||||
# fetch-depth must be 0 to use git describe
|
||||
# See: https://github.com/marketplace/actions/checkout
|
||||
|
||||
18
README.md
18
README.md
@@ -11,13 +11,9 @@ hugo_path: docs/reference/cc-metric-collector/_index.md
|
||||
|
||||
# cc-metric-collector
|
||||
|
||||
A node agent for measuring, processing and forwarding node level metrics. It is part of the [ClusterCockpit ecosystem](https://clustercockpit.org/docs/overview/).
|
||||
A node agent for measuring, processing and forwarding node level metrics. It is part of the [ClusterCockpit ecosystem](https://clustercockpit.org/docs/overview/).
|
||||
|
||||
The metric collector sends (and receives) metric in the [InfluxDB line protocol](https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/) as it provides flexibility while providing a separation between tags (like index columns in relational databases) and fields (like data columns).
|
||||
|
||||
There is a single timer loop that triggers all collectors serially, collects the collectors' data and sends the metrics to the sink. This is done as all data is submitted with a single time stamp. The sinks currently use mostly blocking APIs.
|
||||
|
||||
The receiver runs as a go routine side-by-side with the timer loop and asynchronously forwards received metrics to the sink.
|
||||
The `cc-metric-collector` sends (and maybe receives) metrics in the [InfluxDB line protocol](https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/) as it provides flexibility while providing a separation between tags (like index columns in relational databases) and fields (like data columns). The `cc-metric-collector` consists of 4 components: collectors, router, sinks and receivers. The collectors read data from the current system and submit metrics to the router. The router can be configured to manipulate the metrics before forwarding them to the sinks. The receivers are also attached to the router like the collectors but they receive data from external source like other `cc-metric-collector` instances.
|
||||
|
||||
|
||||
[](https://doi.org/10.5281/zenodo.7438287)
|
||||
@@ -43,7 +39,7 @@ There is a main configuration file with basic settings that point to the other c
|
||||
}
|
||||
```
|
||||
|
||||
The `interval` defines how often the metrics should be read and send to the sink. The `duration` tells collectors how long one measurement has to take. This is important for some collectors, like the `likwid` collector. For more information, see [here](./docs/configuration.md).
|
||||
The `interval` defines how often the metrics should be read and send to the sink(s). The `duration` tells the collectors how long one measurement has to take. This is important for some collectors, like the `likwid` collector. For more information, see [here](./docs/configuration.md).
|
||||
|
||||
See the component READMEs for their configuration:
|
||||
|
||||
@@ -57,7 +53,7 @@ See the component READMEs for their configuration:
|
||||
```
|
||||
$ git clone git@github.com:ClusterCockpit/cc-metric-collector.git
|
||||
$ make (downloads LIKWID, builds it as static library with 'direct' accessmode and copies all required files for the collector)
|
||||
$ go get (requires at least golang 1.16)
|
||||
$ go get
|
||||
$ make
|
||||
```
|
||||
|
||||
@@ -67,11 +63,13 @@ For more information, see [here](./docs/building.md).
|
||||
|
||||
```
|
||||
$ ./cc-metric-collector --help
|
||||
Usage of metric-collector:
|
||||
Usage of ./cc-metric-collector:
|
||||
-config string
|
||||
Path to configuration file (default "./config.json")
|
||||
-log string
|
||||
Path for logfile (default "stderr")
|
||||
-loglevel string
|
||||
Set log level (default "info")
|
||||
-once
|
||||
Run all collectors only once
|
||||
```
|
||||
@@ -114,7 +112,7 @@ flowchart TD
|
||||
|
||||
# Contributing
|
||||
|
||||
The ClusterCockpit ecosystem is designed to be used by different HPC computing centers. Since configurations and setups differ between the centers, the centers likely have to put some work into the cc-metric-collector to gather all desired metrics.
|
||||
The ClusterCockpit ecosystem is designed to be used by different HPC computing centers. Since configurations and setups differ between the centers, the centers likely have to put some work into `cc-metric-collector` to gather all desired metrics.
|
||||
|
||||
You are free to open an issue to request a collector but we would also be happy about PRs.
|
||||
|
||||
|
||||
@@ -132,11 +132,11 @@ func mainFunc() int {
|
||||
if len(rcfg.ConfigFile.Interval) > 0 {
|
||||
t, err := time.ParseDuration(rcfg.ConfigFile.Interval)
|
||||
if err != nil {
|
||||
cclog.Error("Configuration value 'interval' no valid duration")
|
||||
cclog.Errorf("Configuration value interval=%s no valid duration", rcfg.ConfigFile.Interval)
|
||||
}
|
||||
rcfg.Interval = t
|
||||
if rcfg.Interval == 0 {
|
||||
cclog.Error("Configuration value 'interval' must be greater than zero")
|
||||
cclog.Errorf("Configuration value interval=%s must be greater than zero", rcfg.ConfigFile.Interval)
|
||||
return 1
|
||||
}
|
||||
}
|
||||
@@ -145,11 +145,11 @@ func mainFunc() int {
|
||||
if len(rcfg.ConfigFile.Duration) > 0 {
|
||||
t, err := time.ParseDuration(rcfg.ConfigFile.Duration)
|
||||
if err != nil {
|
||||
cclog.Error("Configuration value 'duration' no valid duration")
|
||||
cclog.Error("Configuration value duration=%s no valid duration", rcfg.ConfigFile.Duration)
|
||||
}
|
||||
rcfg.Duration = t
|
||||
if rcfg.Duration == 0 {
|
||||
cclog.Error("Configuration value 'duration' must be greater than zero")
|
||||
cclog.Error("Configuration value duration=%s must be greater than zero", rcfg.ConfigFile.Duration)
|
||||
return 1
|
||||
}
|
||||
}
|
||||
|
||||
@@ -209,16 +209,16 @@ func (m *BeegfsMetaCollector) Read(interval time.Duration, output chan lp.CCMess
|
||||
} else {
|
||||
f1, err := strconv.ParseFloat(m.matches["other"], 32)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Metric (other): Failed to convert str written '%s' to float: %v", m.matches["other"], err))
|
||||
"Metric (other): Failed to convert str written '%s' to float: %v", m.matches["other"], err)
|
||||
continue
|
||||
}
|
||||
f2, err := strconv.ParseFloat(split[i], 32)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Metric (other): Failed to convert str written '%s' to float: %v", m.matches["other"], err))
|
||||
"Metric (other): Failed to convert str written '%s' to float: %v", m.matches["other"], err)
|
||||
continue
|
||||
}
|
||||
m.matches["beegfs_cstorage_other"] = fmt.Sprintf("%f", f1+f2)
|
||||
@@ -227,8 +227,7 @@ func (m *BeegfsMetaCollector) Read(interval time.Duration, output chan lp.CCMess
|
||||
|
||||
for key, data := range m.matches {
|
||||
value, _ := strconv.ParseFloat(data, 32)
|
||||
y, err := lp.NewMessage(key, m.tags, m.meta, map[string]any{"value": value}, time.Now())
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric(key, m.tags, m.meta, value, time.Now()); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
@@ -200,16 +200,16 @@ func (m *BeegfsStorageCollector) Read(interval time.Duration, output chan lp.CCM
|
||||
} else {
|
||||
f1, err := strconv.ParseFloat(m.matches["other"], 32)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Metric (other): Failed to convert str written '%s' to float: %v", m.matches["other"], err))
|
||||
"Metric (other): Failed to convert str written '%s' to float: %v", m.matches["other"], err)
|
||||
continue
|
||||
}
|
||||
f2, err := strconv.ParseFloat(split[i], 32)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Metric (other): Failed to convert str written '%s' to float: %v", m.matches["other"], err))
|
||||
"Metric (other): Failed to convert str written '%s' to float: %v", m.matches["other"], err)
|
||||
continue
|
||||
}
|
||||
m.matches["beegfs_cstorage_other"] = fmt.Sprintf("%f", f1+f2)
|
||||
@@ -218,8 +218,7 @@ func (m *BeegfsStorageCollector) Read(interval time.Duration, output chan lp.CCM
|
||||
|
||||
for key, data := range m.matches {
|
||||
value, _ := strconv.ParseFloat(data, 32)
|
||||
y, err := lp.NewMessage(key, m.tags, m.meta, map[string]any{"value": value}, time.Now())
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric(key, m.tags, m.meta, value, time.Now()); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
@@ -50,6 +50,7 @@ var AvailableCollectors = map[string]MetricCollector{
|
||||
"nfsiostat": new(NfsIOStatCollector),
|
||||
"slurm_cgroup": new(SlurmCgroupCollector),
|
||||
"smartmon": new(SmartMonCollector),
|
||||
"nvidia_gpm": new(NvidiaGPMCollector),
|
||||
}
|
||||
|
||||
// Metric collector manager data structure
|
||||
@@ -99,17 +100,17 @@ func (cm *collectorManager) Init(ticker mct.MultiChanTicker, duration time.Durat
|
||||
// Initialize configured collectors
|
||||
for collectorName, collectorCfg := range cm.config {
|
||||
if _, found := AvailableCollectors[collectorName]; !found {
|
||||
cclog.ComponentError("CollectorManager", "SKIP unknown collector", collectorName)
|
||||
cclog.ComponentErrorf("CollectorManager", "SKIP unknown collector %s", collectorName)
|
||||
continue
|
||||
}
|
||||
collector := AvailableCollectors[collectorName]
|
||||
|
||||
err := collector.Init(collectorCfg)
|
||||
if err != nil {
|
||||
cclog.ComponentError("CollectorManager", fmt.Sprintf("Collector %s initialization failed: %v", collectorName, err))
|
||||
cclog.ComponentErrorf("CollectorManager", "Collector %s initialization failed: %v", collectorName, err)
|
||||
continue
|
||||
}
|
||||
cclog.ComponentDebug("CollectorManager", "ADD COLLECTOR", collector.Name())
|
||||
cclog.ComponentDebugf("CollectorManager", "ADD COLLECTOR %s", collector.Name())
|
||||
if collector.Parallel() {
|
||||
cm.collectors = append(cm.collectors, collector)
|
||||
} else {
|
||||
@@ -155,7 +156,7 @@ func (cm *collectorManager) Start() {
|
||||
return
|
||||
default:
|
||||
// Read metrics from collector c via goroutine
|
||||
cclog.ComponentDebug("CollectorManager", c.Name(), t)
|
||||
cclog.ComponentDebugf("CollectorManager: Read %s at %v", c.Name(), t)
|
||||
cm.collector_wg.Add(1)
|
||||
go func(myc MetricCollector) {
|
||||
myc.Read(cm.duration, cm.output)
|
||||
@@ -173,7 +174,7 @@ func (cm *collectorManager) Start() {
|
||||
return
|
||||
default:
|
||||
// Read metrics from collector c
|
||||
cclog.ComponentDebug("CollectorManager", c.Name(), t)
|
||||
cclog.ComponentDebugf("CollectorManager: Read %s at %v", c.Name(), t)
|
||||
c.Read(cm.duration, cm.output)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -139,16 +139,16 @@ func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, output chan lp.CC
|
||||
const cpuInfoFile = "/proc/cpuinfo"
|
||||
file, err := os.Open(cpuInfoFile)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to open file '%s': %v", cpuInfoFile, err))
|
||||
"Read(): Failed to open file '%s': %v", cpuInfoFile, err)
|
||||
return
|
||||
}
|
||||
defer func() {
|
||||
if err := file.Close(); err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to close file '%s': %v", cpuInfoFile, err))
|
||||
"Read(): Failed to close file '%s': %v", cpuInfoFile, err)
|
||||
}
|
||||
}()
|
||||
|
||||
@@ -166,12 +166,12 @@ func (m *CPUFreqCpuInfoCollector) Read(interval time.Duration, output chan lp.CC
|
||||
if !t.isHT {
|
||||
value, err := strconv.ParseFloat(strings.TrimSpace(lineSplit[1]), 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert cpu MHz '%s' to float64: %v", lineSplit[1], err))
|
||||
"Read(): Failed to convert cpu MHz '%s' to float64: %v", lineSplit[1], err)
|
||||
return
|
||||
}
|
||||
if y, err := lp.NewMessage("cpufreq", t.tagSet, m.meta, map[string]any{"value": value}, now); err == nil {
|
||||
if y, err := lp.NewMetric("cpufreq", t.tagSet, m.meta, value, now); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
@@ -95,10 +95,7 @@ func (m *CPUFreqCollector) Init(config json.RawMessage) error {
|
||||
}
|
||||
|
||||
// Initialized
|
||||
cclog.ComponentDebug(
|
||||
m.name,
|
||||
"initialized",
|
||||
len(m.topology), "non-hyper-threading CPUs")
|
||||
cclog.ComponentDebugf(m.name, "initialized %d non-hyper-threading CPUs")
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
@@ -116,20 +113,18 @@ func (m *CPUFreqCollector) Read(interval time.Duration, output chan lp.CCMessage
|
||||
// Read current frequency
|
||||
line, err := os.ReadFile(t.scalingCurFreqFile)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to read file '%s': %v", t.scalingCurFreqFile, err))
|
||||
cclog.ComponentErrorf(
|
||||
m.name, "Read(): Failed to read file '%s': %v", t.scalingCurFreqFile, err)
|
||||
continue
|
||||
}
|
||||
cpuFreq, err := strconv.ParseInt(strings.TrimSpace(string(line)), 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert CPU frequency '%s' to int64: %v", line, err))
|
||||
m.name, "Read(): Failed to convert CPU frequency '%s' to int64: %v", line, err)
|
||||
continue
|
||||
}
|
||||
|
||||
if y, err := lp.NewMessage("cpufreq", t.tagSet, m.meta, map[string]any{"value": cpuFreq}, now); err == nil {
|
||||
if y, err := lp.NewMetric("cpufreq", t.tagSet, m.meta, cpuFreq, now); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
@@ -27,6 +27,7 @@ const CPUSTATFILE = `/proc/stat`
|
||||
|
||||
type CpustatCollectorConfig struct {
|
||||
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
|
||||
excludeNumCPUs bool
|
||||
}
|
||||
|
||||
type CpustatCollector struct {
|
||||
@@ -79,6 +80,7 @@ func (m *CpustatCollector) Init(config json.RawMessage) error {
|
||||
m.matches[match] = index
|
||||
}
|
||||
}
|
||||
m.config.excludeNumCPUs = slices.Contains(m.config.ExcludeMetrics, "num_cpus")
|
||||
|
||||
// Check input file
|
||||
file, err := os.Open(CPUSTATFILE)
|
||||
@@ -95,11 +97,13 @@ func (m *CpustatCollector) Init(config json.RawMessage) error {
|
||||
line := scanner.Text()
|
||||
linefields := strings.Fields(line)
|
||||
if strings.Compare(linefields[0], "cpu") == 0 {
|
||||
// Kernel system statistics for all CPUs
|
||||
m.olddata["cpu"] = make(map[string]int64)
|
||||
for k, v := range m.matches {
|
||||
m.olddata["cpu"][k], _ = strconv.ParseInt(linefields[v], 0, 64)
|
||||
}
|
||||
} else if strings.HasPrefix(linefields[0], "cpu") && strings.Compare(linefields[0], "cpu") != 0 {
|
||||
// Kernel system statistics per CPU
|
||||
cpustr := strings.TrimLeft(linefields[0], "cpu")
|
||||
cpu, _ := strconv.Atoi(cpustr)
|
||||
m.cputags[linefields[0]] = map[string]string{
|
||||
@@ -141,7 +145,7 @@ func (m *CpustatCollector) parseStatLine(linefields []string, tags map[string]st
|
||||
sum := float64(0)
|
||||
for name, value := range values {
|
||||
sum += value
|
||||
y, err := lp.NewMessage(name, tags, m.meta, map[string]any{"value": value * 100}, now)
|
||||
y, err := lp.NewMetric(name, tags, m.meta, value*100, now)
|
||||
if err == nil {
|
||||
y.AddTag("unit", "Percent")
|
||||
output <- y
|
||||
@@ -149,7 +153,7 @@ func (m *CpustatCollector) parseStatLine(linefields []string, tags map[string]st
|
||||
}
|
||||
if v, ok := values["cpu_idle"]; ok {
|
||||
sum -= v
|
||||
y, err := lp.NewMessage("cpu_used", tags, m.meta, map[string]any{"value": sum * 100}, now)
|
||||
y, err := lp.NewMetric("cpu_used", tags, m.meta, sum*100, now)
|
||||
if err == nil {
|
||||
y.AddTag("unit", "Percent")
|
||||
output <- y
|
||||
@@ -167,15 +171,15 @@ func (m *CpustatCollector) Read(interval time.Duration, output chan lp.CCMessage
|
||||
|
||||
file, err := os.Open(CPUSTATFILE)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to open file '%s': %v", CPUSTATFILE, err))
|
||||
"Read(): Failed to open file '%s': %v", CPUSTATFILE, err)
|
||||
}
|
||||
defer func() {
|
||||
if err := file.Close(); err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to close file '%s': %v", string(CPUSTATFILE), err))
|
||||
"Read(): Failed to close file '%s': %v", string(CPUSTATFILE), err)
|
||||
}
|
||||
}()
|
||||
|
||||
@@ -191,14 +195,10 @@ func (m *CpustatCollector) Read(interval time.Duration, output chan lp.CCMessage
|
||||
}
|
||||
}
|
||||
|
||||
num_cpus_metric, err := lp.NewMessage("num_cpus",
|
||||
m.nodetags,
|
||||
m.meta,
|
||||
map[string]any{"value": num_cpus},
|
||||
now,
|
||||
)
|
||||
if err == nil {
|
||||
output <- num_cpus_metric
|
||||
if !m.config.excludeNumCPUs {
|
||||
if num_cpus_metric, err := lp.NewMetric("num_cpus", m.nodetags, m.meta, num_cpus, now); err == nil {
|
||||
output <- num_cpus_metric
|
||||
}
|
||||
}
|
||||
|
||||
m.lastTimestamp = now
|
||||
|
||||
@@ -64,9 +64,9 @@ func (m *CustomCmdCollector) Init(config json.RawMessage) error {
|
||||
cmdFields := strings.Fields(c)
|
||||
command := exec.Command(cmdFields[0], cmdFields[1:]...)
|
||||
if _, err := command.Output(); err != nil {
|
||||
cclog.ComponentWarn(
|
||||
cclog.ComponentWarnf(
|
||||
m.name,
|
||||
fmt.Sprintf("%s Init(): Execution of command \"%s\" failed: %v", m.name, command.String(), err))
|
||||
"%s Init(): Execution of command \"%s\" failed: %v", m.name, command.String(), err)
|
||||
continue
|
||||
}
|
||||
m.cmdFieldsSlice = append(m.cmdFieldsSlice, cmdFields)
|
||||
@@ -77,7 +77,7 @@ func (m *CustomCmdCollector) Init(config json.RawMessage) error {
|
||||
if _, err := os.ReadFile(fileName); err != nil {
|
||||
cclog.ComponentWarn(
|
||||
m.name,
|
||||
fmt.Sprintf("%s Init(): Reading of file \"%s\" failed: %v", m.name, fileName, err))
|
||||
"%s Init(): Reading of file \"%s\" failed: %v", m.name, fileName, err)
|
||||
continue
|
||||
}
|
||||
m.files = append(m.files, fileName)
|
||||
@@ -100,20 +100,18 @@ func (m *CustomCmdCollector) Read(interval time.Duration, output chan lp.CCMessa
|
||||
command := exec.Command(cmdFields[0], cmdFields[1:]...)
|
||||
stdout, err := command.Output()
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to read command output for command \"%s\": %v", command.String(), err),
|
||||
)
|
||||
"Read(): Failed to read command output for command \"%s\": %v", command.String(), err)
|
||||
continue
|
||||
}
|
||||
|
||||
// Read and decode influxDB line-protocol from command output
|
||||
metrics, err := lp.FromBytes(stdout)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to decode influx Message: %v", err),
|
||||
)
|
||||
"Read(): Failed to decode influx Message: %v", err)
|
||||
continue
|
||||
}
|
||||
for _, metric := range metrics {
|
||||
@@ -128,20 +126,18 @@ func (m *CustomCmdCollector) Read(interval time.Duration, output chan lp.CCMessa
|
||||
for _, filename := range m.files {
|
||||
input, err := os.ReadFile(filename)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to read file \"%s\": %v\n", filename, err),
|
||||
)
|
||||
"Read(): Failed to read file \"%s\": %v\n", filename, err)
|
||||
continue
|
||||
}
|
||||
|
||||
// Read and decode influxDB line-protocol from file
|
||||
metrics, err := lp.FromBytes(input)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to decode influx Message: %v", err),
|
||||
)
|
||||
"Read(): Failed to decode influx Message: %v", err)
|
||||
continue
|
||||
}
|
||||
for _, metric := range metrics {
|
||||
|
||||
@@ -77,16 +77,16 @@ func (m *DiskstatCollector) Read(interval time.Duration, output chan lp.CCMessag
|
||||
|
||||
file, err := os.Open(MOUNTFILE)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to open file '%s': %v", MOUNTFILE, err))
|
||||
"Read(): Failed to open file '%s': %v", MOUNTFILE, err)
|
||||
return
|
||||
}
|
||||
defer func() {
|
||||
if err := file.Close(); err != nil {
|
||||
cclog.ComponentError(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to close file '%s': %v", MOUNTFILE, err))
|
||||
"Read(): Failed to close file '%s': %v", MOUNTFILE, err)
|
||||
}
|
||||
}()
|
||||
|
||||
@@ -128,30 +128,14 @@ mountLoop:
|
||||
tags := map[string]string{"type": "node", "device": linefields[0]}
|
||||
total := (stat.Blocks * uint64(stat.Bsize)) / uint64(1000_000_000)
|
||||
if m.allowedMetrics["disk_total"] {
|
||||
y, err := lp.NewMessage(
|
||||
"disk_total",
|
||||
tags,
|
||||
m.meta,
|
||||
map[string]any{
|
||||
"value": total,
|
||||
},
|
||||
time.Now())
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("disk_total", tags, m.meta, total, time.Now()); err == nil {
|
||||
y.AddMeta("unit", "GBytes")
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
free := (stat.Bfree * uint64(stat.Bsize)) / uint64(1000_000_000)
|
||||
if m.allowedMetrics["disk_free"] {
|
||||
y, err := lp.NewMessage(
|
||||
"disk_free",
|
||||
tags,
|
||||
m.meta,
|
||||
map[string]any{
|
||||
"value": free,
|
||||
},
|
||||
time.Now())
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("disk_free", tags, m.meta, free, time.Now()); err == nil {
|
||||
y.AddMeta("unit", "GBytes")
|
||||
output <- y
|
||||
}
|
||||
@@ -164,16 +148,7 @@ mountLoop:
|
||||
}
|
||||
}
|
||||
if m.allowedMetrics["part_max_used"] {
|
||||
y, err := lp.NewMessage(
|
||||
"part_max_used",
|
||||
map[string]string{
|
||||
"type": "node",
|
||||
},
|
||||
m.meta,
|
||||
map[string]any{
|
||||
"value": int(part_max_used),
|
||||
},
|
||||
time.Now())
|
||||
y, err := lp.NewMetric("part_max_used", map[string]string{"type": "node"}, m.meta, int(part_max_used), time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", "percent")
|
||||
output <- y
|
||||
|
||||
@@ -371,7 +371,7 @@ func (m *GpfsCollector) Init(config json.RawMessage) error {
|
||||
if err != nil {
|
||||
// if using sudo, exec.lookPath will return EACCES (file mode r-x------), this can be ignored
|
||||
if m.config.Sudo && errors.Is(err, syscall.EACCES) {
|
||||
cclog.ComponentWarn(m.name, fmt.Sprintf("got error looking for mmpmon binary '%s': %v . This is expected when using sudo, continuing.", m.config.Mmpmon, err))
|
||||
cclog.ComponentWarnf(m.name, "got error looking for mmpmon binary '%s': %v . This is expected when using sudo, continuing.", m.config.Mmpmon, err)
|
||||
// the file was given in the config, use it
|
||||
p = m.config.Mmpmon
|
||||
} else {
|
||||
@@ -517,23 +517,23 @@ func (m *GpfsCollector) Read(interval time.Duration, output chan lp.CCMessage) {
|
||||
// return code
|
||||
rc, err := strconv.Atoi(key_value["_rc_"])
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, fmt.Sprintf("Read(): Failed to convert return code '%s' to int: %v", key_value["_rc_"], err))
|
||||
cclog.ComponentErrorf(m.name, "Read(): Failed to convert return code '%s' to int: %v", key_value["_rc_"], err)
|
||||
continue
|
||||
}
|
||||
if rc != 0 {
|
||||
cclog.ComponentError(m.name, fmt.Sprintf("Read(): Filesystem '%s' is not ok.", filesystem))
|
||||
cclog.ComponentErrorf(m.name, "Read(): Filesystem '%s' is not ok.", filesystem)
|
||||
continue
|
||||
}
|
||||
|
||||
// timestamp
|
||||
sec, err := strconv.ParseInt(key_value["_t_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, fmt.Sprintf("Read(): Failed to convert seconds '%s' to int64: %v", key_value["_t_"], err))
|
||||
cclog.ComponentErrorf(m.name, "Read(): Failed to convert seconds '%s' to int64: %v", key_value["_t_"], err)
|
||||
continue
|
||||
}
|
||||
msec, err := strconv.ParseInt(key_value["_tu_"], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, fmt.Sprintf("Read(): Failed to convert micro seconds '%s' to int64: %v", key_value["_tu_"], err))
|
||||
cclog.ComponentErrorf(m.name, "Read(): Failed to convert micro seconds '%s' to int64: %v", key_value["_tu_"], err)
|
||||
continue
|
||||
}
|
||||
timestamp := time.Unix(sec, msec*1000)
|
||||
@@ -551,7 +551,7 @@ func (m *GpfsCollector) Read(interval time.Duration, output chan lp.CCMessage) {
|
||||
for _, metric := range GpfsAbsMetrics {
|
||||
value, err := strconv.ParseInt(key_value[metric.prefix], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, fmt.Sprintf("Read(): Failed to convert %s '%s' to int64: %v", metric.desc, key_value[metric.prefix], err))
|
||||
cclog.ComponentErrorf(m.name, "Read(): Failed to convert %s '%s' to int64: %v", metric.desc, key_value[metric.prefix], err)
|
||||
continue
|
||||
}
|
||||
newstate[metric.prefix] = value
|
||||
@@ -636,7 +636,7 @@ func (m *GpfsCollector) Read(interval time.Duration, output chan lp.CCMessage) {
|
||||
}
|
||||
} else {
|
||||
// the value could not be computed correctly
|
||||
cclog.ComponentWarn(m.name, fmt.Sprintf("Read(): Could not compute value for filesystem %s of metric %s: vold_ok = %t, vnew_ok = %t", filesystem, metric.name, vold_ok, vnew_ok))
|
||||
cclog.ComponentWarnf(m.name, "Read(): Could not compute value for filesystem %s of metric %s: vold_ok = %t, vnew_ok = %t", filesystem, metric.name, vold_ok, vnew_ok)
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -81,3 +81,16 @@ Metrics:
|
||||
* `gpfs_metaops_rate` (if `send_total_values == true` and `send_derived_values == true`)
|
||||
|
||||
The collector adds a `filesystem` tag to all metrics
|
||||
|
||||
`mmpmon` typically require root to run.
|
||||
In order to run `cc-metric-collector` without root priviliges, you can enable `use_sudo`.
|
||||
Add a file like this in `/etc/sudoers.d/` to allow `cc-metric-collector` to run the required command:
|
||||
|
||||
```
|
||||
# Do not log the following sudo commands from monitoring, since this causes a lot of log spam.
|
||||
# However keep log_denied enabled, to detect failures
|
||||
Defaults: monitoring !log_allowed, !pam_session
|
||||
|
||||
# Allow to use mmpmon
|
||||
monitoring ALL = (root) NOPASSWD:/absolute/path/to/mmpmon -p -s
|
||||
```
|
||||
@@ -23,21 +23,29 @@ import (
|
||||
"golang.org/x/sys/unix"
|
||||
)
|
||||
|
||||
const IB_BASEPATH = "/sys/class/infiniband/"
|
||||
// See: https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-class-infiniband
|
||||
const (
|
||||
ibBasePath = "/sys/class/infiniband/"
|
||||
ibDataUnit = "bytes"
|
||||
ibDataRateUnit = ibDataUnit + "/sec"
|
||||
ibPkgUnit = "packets"
|
||||
ibPkgRateUnit = ibPkgUnit + "/sec"
|
||||
)
|
||||
|
||||
type InfinibandCollectorMetric struct {
|
||||
name string
|
||||
path string
|
||||
unit string
|
||||
scale int64
|
||||
addToIBTotal bool
|
||||
addToIBTotalPkgs bool
|
||||
currentState int64
|
||||
lastState int64
|
||||
name string
|
||||
path string
|
||||
unit string
|
||||
unitRates string
|
||||
scaleByFourLanes bool
|
||||
addToIBTotal bool
|
||||
addToIBTotalPkgs bool
|
||||
lastState uint64
|
||||
lastStateAvailable bool
|
||||
}
|
||||
|
||||
type InfinibandCollectorInfo struct {
|
||||
LID string // IB local Identifier (LID)
|
||||
lid string // IB local Identifier (LID)
|
||||
device string // IB device
|
||||
port string // IB device port
|
||||
portCounterFiles []InfinibandCollectorMetric // mapping counter name -> InfinibandCollectorMetric
|
||||
@@ -57,7 +65,7 @@ type InfinibandCollector struct {
|
||||
lastTimestamp time.Time // Store time stamp of last tick to derive bandwidths
|
||||
}
|
||||
|
||||
// Init initializes the Infiniband collector by walking through files below IB_BASEPATH
|
||||
// Init initializes the Infiniband collector by walking through files below ibBasePath
|
||||
func (m *InfinibandCollector) Init(config json.RawMessage) error {
|
||||
// Check if already initialized
|
||||
if m.init {
|
||||
@@ -88,7 +96,7 @@ func (m *InfinibandCollector) Init(config json.RawMessage) error {
|
||||
}
|
||||
|
||||
// Loop for all InfiniBand directories
|
||||
globPattern := filepath.Join(IB_BASEPATH, "*", "ports", "*")
|
||||
globPattern := filepath.Join(ibBasePath, "*", "ports", "*")
|
||||
ibDirs, err := filepath.Glob(globPattern)
|
||||
if err != nil {
|
||||
return fmt.Errorf("%s Init(): unable to glob files with pattern %s: %w", m.name, globPattern, err)
|
||||
@@ -123,36 +131,42 @@ func (m *InfinibandCollector) Init(config json.RawMessage) error {
|
||||
countersDir := filepath.Join(path, "counters")
|
||||
portCounterFiles := []InfinibandCollectorMetric{
|
||||
{
|
||||
name: "ib_recv",
|
||||
path: filepath.Join(countersDir, "port_rcv_data"),
|
||||
unit: "bytes",
|
||||
scale: 4,
|
||||
addToIBTotal: true,
|
||||
lastState: -1,
|
||||
// Total number of data octets, divided by 4 (lanes), received on all VLs.
|
||||
// This is 64 bit counter
|
||||
name: "ib_recv",
|
||||
path: filepath.Join(countersDir, "port_rcv_data"),
|
||||
unit: ibDataUnit,
|
||||
unitRates: ibDataRateUnit,
|
||||
scaleByFourLanes: true,
|
||||
addToIBTotal: true,
|
||||
},
|
||||
{
|
||||
name: "ib_xmit",
|
||||
path: filepath.Join(countersDir, "port_xmit_data"),
|
||||
unit: "bytes",
|
||||
scale: 4,
|
||||
addToIBTotal: true,
|
||||
lastState: -1,
|
||||
// Total number of data octets, divided by 4 (lanes), transmitted on all VLs.
|
||||
// This is 64 bit counter
|
||||
name: "ib_xmit",
|
||||
path: filepath.Join(countersDir, "port_xmit_data"),
|
||||
unit: ibDataUnit,
|
||||
unitRates: ibDataRateUnit,
|
||||
scaleByFourLanes: true,
|
||||
addToIBTotal: true,
|
||||
},
|
||||
{
|
||||
// Total number of packets received on all VLs from this port (this may include packets containing Errors.
|
||||
// This is 64 bit counter.
|
||||
name: "ib_recv_pkts",
|
||||
path: filepath.Join(countersDir, "port_rcv_packets"),
|
||||
unit: "packets",
|
||||
scale: 1,
|
||||
unit: ibPkgUnit,
|
||||
unitRates: ibPkgRateUnit,
|
||||
addToIBTotalPkgs: true,
|
||||
lastState: -1,
|
||||
},
|
||||
{
|
||||
// Total number of packets transmitted on all VLs from this port. This may include packets with errors.
|
||||
// This is 64 bit counter.
|
||||
name: "ib_xmit_pkts",
|
||||
path: filepath.Join(countersDir, "port_xmit_packets"),
|
||||
unit: "packets",
|
||||
scale: 1,
|
||||
unit: ibPkgUnit,
|
||||
unitRates: ibPkgRateUnit,
|
||||
addToIBTotalPkgs: true,
|
||||
lastState: -1,
|
||||
},
|
||||
}
|
||||
for _, counter := range portCounterFiles {
|
||||
@@ -164,7 +178,7 @@ func (m *InfinibandCollector) Init(config json.RawMessage) error {
|
||||
|
||||
m.info = append(m.info,
|
||||
InfinibandCollectorInfo{
|
||||
LID: LID,
|
||||
lid: LID,
|
||||
device: device,
|
||||
port: port,
|
||||
portCounterFiles: portCounterFiles,
|
||||
@@ -185,7 +199,7 @@ func (m *InfinibandCollector) Init(config json.RawMessage) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
// Read reads Infiniband counter files below IB_BASEPATH
|
||||
// Read reads Infiniband counter files below ibBasePath
|
||||
func (m *InfinibandCollector) Read(interval time.Duration, output chan lp.CCMessage) {
|
||||
// Check if already initialized
|
||||
if !m.init {
|
||||
@@ -202,44 +216,42 @@ func (m *InfinibandCollector) Read(interval time.Duration, output chan lp.CCMess
|
||||
for i := range m.info {
|
||||
info := &m.info[i]
|
||||
|
||||
var ib_total, ib_total_pkts int64
|
||||
var ibTotal, ibTotalPkts uint64 // sum of xmit and recv counters
|
||||
var ibTotalBw, ibTotalPktsBw float64 // sum of xmit and recv rates
|
||||
var ibTotalBwAvailable, ibTotalPktsBwAvailable bool
|
||||
for i := range info.portCounterFiles {
|
||||
counterDef := &info.portCounterFiles[i]
|
||||
|
||||
// Read counter file
|
||||
line, err := os.ReadFile(counterDef.path)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to read from file '%s': %v", counterDef.path, err))
|
||||
"Read(): Failed to read from file '%s': %v", counterDef.path, err)
|
||||
// Current counter can not be saved as last state
|
||||
counterDef.lastStateAvailable = false
|
||||
continue
|
||||
}
|
||||
data := strings.TrimSpace(string(line))
|
||||
|
||||
// convert counter to int64
|
||||
v, err := strconv.ParseInt(data, 10, 64)
|
||||
// convert counter to uint64
|
||||
vRawCounter, err := strconv.ParseUint(data, 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert Infininiband metrice %s='%s' to int64: %v", counterDef.name, data, err))
|
||||
"Read(): Failed to convert Infininiband metrice %s='%s' to uint64: %v", counterDef.name, data, err)
|
||||
// Current counter can not be saved as last state
|
||||
counterDef.lastStateAvailable = false
|
||||
continue
|
||||
}
|
||||
// Scale raw value
|
||||
v *= counterDef.scale
|
||||
|
||||
// Save current state
|
||||
counterDef.currentState = v
|
||||
vScaledCounter := vRawCounter
|
||||
if counterDef.scaleByFourLanes {
|
||||
vScaledCounter *= uint64(4)
|
||||
}
|
||||
|
||||
// Send absolut values
|
||||
if m.config.SendAbsoluteValues {
|
||||
if y, err := lp.NewMessage(
|
||||
counterDef.name,
|
||||
info.tagSet,
|
||||
m.meta,
|
||||
map[string]any{
|
||||
"value": counterDef.currentState,
|
||||
},
|
||||
now); err == nil {
|
||||
if y, err := lp.NewMetric(counterDef.name, info.tagSet, m.meta, vScaledCounter, now); err == nil {
|
||||
y.AddMeta("unit", counterDef.unit)
|
||||
output <- y
|
||||
}
|
||||
@@ -247,60 +259,75 @@ func (m *InfinibandCollector) Read(interval time.Duration, output chan lp.CCMess
|
||||
|
||||
// Send derived values
|
||||
if m.config.SendDerivedValues {
|
||||
if counterDef.lastState >= 0 {
|
||||
rate := float64((counterDef.currentState - counterDef.lastState)) / timeDiff
|
||||
if y, err := lp.NewMessage(
|
||||
counterDef.name+"_bw",
|
||||
info.tagSet,
|
||||
m.meta,
|
||||
map[string]any{
|
||||
"value": rate,
|
||||
},
|
||||
now); err == nil {
|
||||
y.AddMeta("unit", counterDef.unit+"/sec")
|
||||
if counterDef.lastStateAvailable {
|
||||
var rate float64
|
||||
// uint64 subtraction handles wraparound automatically
|
||||
// in case vRawCounter < counterDef.lastState we would compute:
|
||||
// math.MaxUint64 - lastState + vRawCounter + 1
|
||||
// = (2^64 - 1) - lastState + vRawCounter + 1
|
||||
// = 2^64 - lastState + vRawCounter
|
||||
// ≡ vRawCounter - lastState (mod 2^64)
|
||||
rate = float64(vRawCounter-counterDef.lastState) / timeDiff
|
||||
if counterDef.scaleByFourLanes {
|
||||
rate *= float64(4)
|
||||
}
|
||||
if y, err := lp.NewMetric(counterDef.name+"_bw", info.tagSet, m.meta, rate, now); err == nil {
|
||||
y.AddMeta("unit", counterDef.unitRates)
|
||||
output <- y
|
||||
}
|
||||
|
||||
// Sum up rates for total rates
|
||||
if m.config.SendTotalValues {
|
||||
switch {
|
||||
case counterDef.addToIBTotal:
|
||||
ibTotalBw += rate
|
||||
ibTotalBwAvailable = true
|
||||
case counterDef.addToIBTotalPkgs:
|
||||
ibTotalPktsBw += rate
|
||||
ibTotalPktsBwAvailable = true
|
||||
}
|
||||
}
|
||||
}
|
||||
counterDef.lastState = counterDef.currentState
|
||||
counterDef.lastState = vRawCounter
|
||||
counterDef.lastStateAvailable = true
|
||||
}
|
||||
|
||||
// Sum up total values
|
||||
if m.config.SendTotalValues {
|
||||
switch {
|
||||
case counterDef.addToIBTotal:
|
||||
ib_total += counterDef.currentState
|
||||
ibTotal += vScaledCounter
|
||||
case counterDef.addToIBTotalPkgs:
|
||||
ib_total_pkts += counterDef.currentState
|
||||
ibTotalPkts += vScaledCounter
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Send total values
|
||||
if m.config.SendTotalValues {
|
||||
if y, err := lp.NewMessage(
|
||||
"ib_total",
|
||||
info.tagSet,
|
||||
m.meta,
|
||||
map[string]any{
|
||||
"value": ib_total,
|
||||
},
|
||||
now); err == nil {
|
||||
y.AddMeta("unit", "bytes")
|
||||
if y, err := lp.NewMetric("ib_total", info.tagSet, m.meta, ibTotal, now); err == nil {
|
||||
y.AddMeta("unit", ibDataUnit)
|
||||
output <- y
|
||||
}
|
||||
|
||||
if y, err := lp.NewMessage(
|
||||
"ib_total_pkts",
|
||||
info.tagSet,
|
||||
m.meta,
|
||||
map[string]any{
|
||||
"value": ib_total_pkts,
|
||||
},
|
||||
now); err == nil {
|
||||
y.AddMeta("unit", "packets")
|
||||
if y, err := lp.NewMetric("ib_total_pkts", info.tagSet, m.meta, ibTotalPkts, now); err == nil {
|
||||
y.AddMeta("unit", ibPkgUnit)
|
||||
output <- y
|
||||
}
|
||||
|
||||
if m.config.SendDerivedValues && ibTotalBwAvailable {
|
||||
if y, err := lp.NewMetric("ib_total_bw", info.tagSet, m.meta, ibTotalBw, now); err == nil {
|
||||
y.AddMeta("unit", ibDataRateUnit)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
if m.config.SendDerivedValues && ibTotalPktsBwAvailable {
|
||||
if y, err := lp.NewMetric("ib_total_pkts_bw", info.tagSet, m.meta, ibTotalPktsBw, now); err == nil {
|
||||
y.AddMeta("unit", ibPkgRateUnit)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -41,5 +41,7 @@ Metrics:
|
||||
* `ib_xmit_bw` (if `send_derived_values == true`)
|
||||
* `ib_recv_pkts_bw` (if `send_derived_values == true`)
|
||||
* `ib_xmit_pkts_bw` (if `send_derived_values == true`)
|
||||
* `ib_total_bw` (if `send_total_values == true` and `send_derived_values == true`)
|
||||
* `ib_total_pkts_bw` (if `send_total_values == true` and `send_derived_values == true`)
|
||||
|
||||
The collector adds a `device` tag to all metrics
|
||||
|
||||
@@ -145,16 +145,16 @@ func (m *IOstatCollector) Read(interval time.Duration, output chan lp.CCMessage)
|
||||
|
||||
file, err := os.Open(IOSTATFILE)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to open file '%s': %v", IOSTATFILE, err))
|
||||
"Read(): Failed to open file '%s': %v", IOSTATFILE, err)
|
||||
return
|
||||
}
|
||||
defer func() {
|
||||
if err := file.Close(); err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to close file '%s': %v", IOSTATFILE, err))
|
||||
"Read(): Failed to close file '%s': %v", IOSTATFILE, err)
|
||||
}
|
||||
}()
|
||||
|
||||
|
||||
@@ -28,10 +28,9 @@ type IpmiCollector struct {
|
||||
metricCollector
|
||||
|
||||
config struct {
|
||||
ExcludeDevices []string `json:"exclude_devices"`
|
||||
IpmitoolPath string `json:"ipmitool_path"`
|
||||
IpmisensorsPath string `json:"ipmisensors_path"`
|
||||
Sudo bool `json:"use_sudo"`
|
||||
IpmitoolPath string `json:"ipmitool_path"`
|
||||
IpmisensorsPath string `json:"ipmisensors_path"`
|
||||
Sudo bool `json:"use_sudo"`
|
||||
}
|
||||
|
||||
ipmitool string
|
||||
@@ -158,7 +157,7 @@ func (m *IpmiCollector) readIpmiTool(output chan lp.CCMessage) error {
|
||||
unit = "Watts"
|
||||
}
|
||||
|
||||
y, err := lp.NewMessage(name, map[string]string{"type": "node"}, m.meta, map[string]any{"value": v}, time.Now())
|
||||
y, err := lp.NewMetric(name, map[string]string{"type": "node"}, m.meta, v, time.Now())
|
||||
if err != nil {
|
||||
cclog.ComponentErrorf(m.name, "Failed to create message: %v", err)
|
||||
continue
|
||||
@@ -210,7 +209,7 @@ func (m *IpmiCollector) readIpmiSensors(output chan lp.CCMessage) error {
|
||||
continue
|
||||
}
|
||||
name := strings.ToLower(strings.ReplaceAll(lv[1], " ", "_"))
|
||||
y, err := lp.NewMessage(name, map[string]string{"type": "node"}, m.meta, map[string]any{"value": v}, time.Now())
|
||||
y, err := lp.NewMetric(name, map[string]string{"type": "node"}, m.meta, v, time.Now())
|
||||
if err != nil {
|
||||
cclog.ComponentErrorf(m.name, "Failed to create message: %v", err)
|
||||
continue
|
||||
|
||||
@@ -23,9 +23,9 @@ The `ipmistat` collector reads data from `ipmitool` (`ipmitool sensor`) or `ipmi
|
||||
|
||||
The metrics depend on the output of the underlying tools but contain temperature, power and energy metrics.
|
||||
|
||||
ipmitool and ipmi-sensors typically require root to run.
|
||||
In order to cc-metric-collector without root priviliges, you can enable `use_sudo`.
|
||||
Add a file like this in /etc/sudoers.d/ to allow cc-metric-collector to run this command:
|
||||
`ipmitool` and `ipmi-sensors` typically require root to run.
|
||||
In order to run `cc-metric-collector` without root priviliges, you can enable `use_sudo`.
|
||||
Add a file like this in `/etc/sudoers.d/` to allow `cc-metric-collector` to run the required commands:
|
||||
|
||||
```
|
||||
# Do not log the following sudo commands from monitoring, since this causes a lot of log spam.
|
||||
|
||||
@@ -12,6 +12,12 @@ package collectors
|
||||
#cgo LDFLAGS: -Wl,--unresolved-symbols=ignore-in-object-files
|
||||
#include <stdlib.h>
|
||||
#include <likwid.h>
|
||||
|
||||
|
||||
int cc_add_hwthread(int cpu_id) {
|
||||
return HPMaddThread(cpu_id);
|
||||
}
|
||||
|
||||
*/
|
||||
import "C"
|
||||
|
||||
@@ -261,12 +267,12 @@ func (m *LikwidCollector) Init(config json.RawMessage) error {
|
||||
}
|
||||
for _, metric := range evset.Metrics {
|
||||
// Try to evaluate the metric
|
||||
cclog.ComponentDebug(m.name, "Checking", metric.Name)
|
||||
cclog.ComponentDebugf(m.name, "Checking %s", metric.Name)
|
||||
if !checkMetricType(metric.Type) {
|
||||
cclog.ComponentError(m.name, "Metric", metric.Name, "uses invalid type", metric.Type)
|
||||
cclog.ComponentErrorf(m.name, "Metric %s uses invalid type %s", metric.Name, metric.Type)
|
||||
metric.Calc = ""
|
||||
} else if !testLikwidMetricFormula(metric.Calc, params) {
|
||||
cclog.ComponentError(m.name, "Metric", metric.Name, "cannot be calculated with given counters")
|
||||
cclog.ComponentError(m.name, "Metric %s cannot be calculated with given counters", metric.Name)
|
||||
metric.Calc = ""
|
||||
} else {
|
||||
globalParams = append(globalParams, metric.Name)
|
||||
@@ -281,13 +287,13 @@ func (m *LikwidCollector) Init(config json.RawMessage) error {
|
||||
for _, metric := range m.config.Metrics {
|
||||
// Try to evaluate the global metric
|
||||
if !checkMetricType(metric.Type) {
|
||||
cclog.ComponentError(m.name, "Metric", metric.Name, "uses invalid type", metric.Type)
|
||||
cclog.ComponentErrorf(m.name, "Metric %s uses invalid type %s", metric.Name, metric.Type)
|
||||
metric.Calc = ""
|
||||
} else if !testLikwidMetricFormula(metric.Calc, globalParams) {
|
||||
cclog.ComponentError(m.name, "Metric", metric.Name, "cannot be calculated with given counters")
|
||||
cclog.ComponentError(m.name, "Metric %s cannot be calculated with given counters", metric.Name)
|
||||
metric.Calc = ""
|
||||
} else if !checkMetricType(metric.Type) {
|
||||
cclog.ComponentError(m.name, "Metric", metric.Name, "has invalid type")
|
||||
cclog.ComponentError(m.name, "Metric %s has invalid type", metric.Name)
|
||||
metric.Calc = ""
|
||||
} else {
|
||||
totalMetrics++
|
||||
@@ -328,7 +334,7 @@ func (m *LikwidCollector) Init(config json.RawMessage) error {
|
||||
for _, c := range m.cpulist {
|
||||
m.measureThread.Call(
|
||||
func() {
|
||||
retCode := C.HPMaddThread(C.uint32_t(c))
|
||||
retCode := C.cc_add_hwthread(C.int(c))
|
||||
if retCode != 0 {
|
||||
err := fmt.Errorf("C.HPMaddThread(%v) failed with return code %v", c, retCode)
|
||||
cclog.ComponentError(m.name, err.Error())
|
||||
@@ -375,16 +381,16 @@ func (m *LikwidCollector) takeMeasurement(evidx int, evset LikwidEventsetConfig,
|
||||
// Watch changes for the lock file ()
|
||||
watcher, err := fsnotify.NewWatcher()
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("takeMeasurement(): Failed to create a new fsnotify.Watcher: %v", err))
|
||||
"takeMeasurement(): Failed to create a new fsnotify.Watcher: %v", err)
|
||||
return true, err
|
||||
}
|
||||
defer func() {
|
||||
if err := watcher.Close(); err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("takeMeasurement(): Failed to close fsnotify.Watcher: %v", err))
|
||||
"takeMeasurement(): Failed to close fsnotify.Watcher: %v", err)
|
||||
}
|
||||
}()
|
||||
if len(m.config.LockfilePath) > 0 {
|
||||
@@ -597,7 +603,7 @@ func (m *LikwidCollector) calcEventsetMetrics(evset LikwidEventsetConfig, interv
|
||||
if tid >= 0 && len(metric.Calc) > 0 {
|
||||
value, err := agg.EvalFloat64Condition(metric.Calc, evset.results[tid])
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, "Calculation for metric", metric.Name, "failed:", err.Error())
|
||||
cclog.ComponentErrorf(m.name, "Calculation for metric %s failed: %s", metric.Name, err.Error())
|
||||
value = 0.0
|
||||
}
|
||||
if m.config.InvalidToZero && (math.IsNaN(value) || math.IsInf(value, 0)) {
|
||||
@@ -762,7 +768,7 @@ func (m *LikwidCollector) calcGlobalMetrics(groups []LikwidEventsetConfig, inter
|
||||
// Evaluate the metric
|
||||
value, err := agg.EvalFloat64Condition(metric.Calc, params)
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, "Calculation for metric", metric.Name, "failed:", err.Error())
|
||||
cclog.ComponentErrorf(m.name, "Calculation for metric %s failed: %s", metric.Name, err.Error())
|
||||
value = 0.0
|
||||
}
|
||||
if m.config.InvalidToZero && (math.IsNaN(value) || math.IsInf(value, 0)) {
|
||||
|
||||
@@ -89,9 +89,9 @@ func (m *LoadavgCollector) Read(interval time.Duration, output chan lp.CCMessage
|
||||
}
|
||||
buffer, err := os.ReadFile(LOADAVGFILE)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to read file '%s': %v", LOADAVGFILE, err))
|
||||
"Read(): Failed to read file '%s': %v", LOADAVGFILE, err)
|
||||
return
|
||||
}
|
||||
now := time.Now()
|
||||
@@ -101,15 +101,15 @@ func (m *LoadavgCollector) Read(interval time.Duration, output chan lp.CCMessage
|
||||
for i, name := range m.load_matches {
|
||||
x, err := strconv.ParseFloat(ls[i], 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert '%s' to float64: %v", ls[i], err))
|
||||
"Read(): Failed to convert '%s' to float64: %v", ls[i], err)
|
||||
continue
|
||||
}
|
||||
if m.load_skips[i] {
|
||||
continue
|
||||
}
|
||||
y, err := lp.NewMessage(name, m.tags, m.meta, map[string]any{"value": x}, now)
|
||||
y, err := lp.NewMetric(name, m.tags, m.meta, x, now)
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
@@ -120,15 +120,15 @@ func (m *LoadavgCollector) Read(interval time.Duration, output chan lp.CCMessage
|
||||
for i, name := range m.proc_matches {
|
||||
x, err := strconv.ParseInt(lv[i], 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert '%s' to float64: %v", lv[i], err))
|
||||
"Read(): Failed to convert '%s' to float64: %v", lv[i], err)
|
||||
continue
|
||||
}
|
||||
if m.proc_skips[i] {
|
||||
continue
|
||||
}
|
||||
y, err := lp.NewMessage(name, m.tags, m.meta, map[string]any{"value": x}, now)
|
||||
y, err := lp.NewMetric(name, m.tags, m.meta, x, now)
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
@@ -55,3 +55,16 @@ Metrics:
|
||||
* `lustre_inode_permission_diff` (if `send_diff_values == true`)
|
||||
|
||||
This collector adds an `device` tag.
|
||||
|
||||
`lctl` typically require root to run.
|
||||
In order to run `cc-metric-collector` without root priviliges, you can enable `use_sudo`.
|
||||
Add a file like this in `/etc/sudoers.d/` to allow `cc-metric-collector` to run the required command:
|
||||
|
||||
```
|
||||
# Do not log the following sudo commands from monitoring, since this causes a lot of log spam.
|
||||
# However keep log_denied enabled, to detect failures
|
||||
Defaults: monitoring !log_allowed, !pam_session
|
||||
|
||||
# Allow to use lctl
|
||||
monitoring ALL = (root) NOPASSWD:/absolute/path/to/lctl get_param llite.*.stats
|
||||
```
|
||||
@@ -72,7 +72,8 @@ func getStats(filename string) map[string]MemstatStats {
|
||||
for scanner.Scan() {
|
||||
line := scanner.Text()
|
||||
linefields := strings.Fields(line)
|
||||
if len(linefields) == 3 {
|
||||
switch len(linefields) {
|
||||
case 3:
|
||||
v, err := strconv.ParseFloat(linefields[1], 64)
|
||||
if err == nil {
|
||||
stats[strings.Trim(linefields[0], ":")] = MemstatStats{
|
||||
@@ -80,10 +81,10 @@ func getStats(filename string) map[string]MemstatStats {
|
||||
unit: linefields[2],
|
||||
}
|
||||
}
|
||||
} else if len(linefields) == 5 {
|
||||
case 5:
|
||||
v, err := strconv.ParseFloat(linefields[3], 64)
|
||||
if err == nil {
|
||||
cclog.ComponentDebug("getStats", strings.Trim(linefields[2], ":"), v, linefields[4])
|
||||
cclog.ComponentDebug("MemstatCollector", "getStats %s value %v unit %s", strings.Trim(linefields[2], ":"), v, linefields[4])
|
||||
stats[strings.Trim(linefields[2], ":")] = MemstatStats{
|
||||
value: v,
|
||||
unit: linefields[4],
|
||||
@@ -106,7 +107,10 @@ func (m *MemstatCollector) Init(config json.RawMessage) error {
|
||||
return fmt.Errorf("%s Init(): Error decoding JSON config: %w", m.name, err)
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{"source": m.name, "group": "Memory"}
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "Memory",
|
||||
}
|
||||
m.stats = make(map[string]int64)
|
||||
m.matches = make(map[string]string)
|
||||
m.tags = map[string]string{"type": "node"}
|
||||
@@ -145,7 +149,7 @@ func (m *MemstatCollector) Init(config json.RawMessage) error {
|
||||
"KernelStack": "mem_kernelstack",
|
||||
}
|
||||
for k, v := range matches {
|
||||
if !slices.Contains(m.config.ExcludeMetrics, k) {
|
||||
if !slices.Contains(m.config.ExcludeMetrics, v) {
|
||||
m.matches[k] = v
|
||||
}
|
||||
}
|
||||
@@ -153,7 +157,7 @@ func (m *MemstatCollector) Init(config json.RawMessage) error {
|
||||
if !slices.Contains(m.config.ExcludeMetrics, "mem_used") {
|
||||
m.sendMemUsed = true
|
||||
}
|
||||
if len(m.matches) == 0 {
|
||||
if len(m.matches) == 0 && !m.sendMemUsed {
|
||||
return fmt.Errorf("%s Init(): no metrics to collect", m.name)
|
||||
}
|
||||
if err := m.setup(); err != nil {
|
||||
@@ -213,7 +217,7 @@ func (m *MemstatCollector) Read(interval time.Duration, output chan lp.CCMessage
|
||||
}
|
||||
}
|
||||
|
||||
y, err := lp.NewMessage(name, tags, m.meta, map[string]any{"value": value}, time.Now())
|
||||
y, err := lp.NewMetric(name, tags, m.meta, value, time.Now())
|
||||
if err == nil {
|
||||
if len(unit) > 0 {
|
||||
y.AddMeta("unit", unit)
|
||||
@@ -252,7 +256,7 @@ func (m *MemstatCollector) Read(interval time.Duration, output chan lp.CCMessage
|
||||
}
|
||||
}
|
||||
}
|
||||
y, err := lp.NewMessage("mem_used", tags, m.meta, map[string]any{"value": memUsed}, time.Now())
|
||||
y, err := lp.NewMetric("mem_used", tags, m.meta, memUsed, time.Now())
|
||||
if err == nil {
|
||||
if len(unit) > 0 {
|
||||
y.AddMeta("unit", unit)
|
||||
|
||||
@@ -222,16 +222,16 @@ func (m *NetstatCollector) Read(interval time.Duration, output chan lp.CCMessage
|
||||
|
||||
file, err := os.Open(NETSTATFILE)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to open file '%s': %v", NETSTATFILE, err))
|
||||
"Read(): Failed to open file '%s': %v", NETSTATFILE, err)
|
||||
return
|
||||
}
|
||||
defer func() {
|
||||
if err := file.Close(); err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to close file '%s': %v", NETSTATFILE, err))
|
||||
"Read(): Failed to close file '%s': %v", NETSTATFILE, err)
|
||||
}
|
||||
}()
|
||||
|
||||
@@ -262,14 +262,14 @@ func (m *NetstatCollector) Read(interval time.Duration, output chan lp.CCMessage
|
||||
continue
|
||||
}
|
||||
if m.config.SendAbsoluteValues {
|
||||
if y, err := lp.NewMessage(metric.name, metric.tags, metric.meta, map[string]any{"value": v}, now); err == nil {
|
||||
if y, err := lp.NewMetric(metric.name, metric.tags, metric.meta, v, now); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if m.config.SendDerivedValues {
|
||||
if metric.lastValue >= 0 {
|
||||
rate := float64(v-metric.lastValue) / timeDiff
|
||||
if y, err := lp.NewMessage(metric.name+"_bw", metric.tags, metric.meta_rates, map[string]any{"value": rate}, now); err == nil {
|
||||
if y, err := lp.NewMetric(metric.name+"_bw", metric.tags, metric.meta_rates, rate, now); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
@@ -125,10 +125,9 @@ func (m *nfsCollector) Read(interval time.Duration, output chan lp.CCMessage) {
|
||||
timestamp := time.Now()
|
||||
|
||||
if err := m.updateStats(); err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): updateStats() failed: %v", err),
|
||||
)
|
||||
"Read(): updateStats() failed: %v", err)
|
||||
return
|
||||
}
|
||||
var prefix string
|
||||
@@ -146,14 +145,13 @@ func (m *nfsCollector) Read(interval time.Duration, output chan lp.CCMessage) {
|
||||
continue
|
||||
}
|
||||
|
||||
valueMap := make(map[string]any)
|
||||
if data.current >= 0 && data.last >= 0 {
|
||||
valueMap["value"] = data.current - data.last
|
||||
}
|
||||
y, err := lp.NewMessage(fmt.Sprintf("%s_%s", prefix, name), m.tags, m.meta, valueMap, timestamp)
|
||||
if err == nil {
|
||||
y.AddMeta("version", m.version)
|
||||
output <- y
|
||||
value := data.current - data.last
|
||||
y, err := lp.NewMetric(fmt.Sprintf("%s_%s", prefix, name), m.tags, m.meta, value, timestamp)
|
||||
if err == nil {
|
||||
y.AddMeta("version", m.version)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -145,14 +145,7 @@ func (m *NfsIOStatCollector) Read(interval time.Duration, output chan lp.CCMessa
|
||||
if old, ok := m.data[mntpoint]; ok {
|
||||
for name, newVal := range values {
|
||||
if m.config.SendAbsoluteValues {
|
||||
msg, err := lp.NewMessage(
|
||||
"nfsio_"+name,
|
||||
m.tags,
|
||||
m.meta,
|
||||
map[string]any{
|
||||
"value": newVal,
|
||||
},
|
||||
now)
|
||||
msg, err := lp.NewMetric("nfsio_"+name, m.tags, m.meta, newVal, now)
|
||||
if err == nil {
|
||||
msg.AddTag("stype", "filesystem")
|
||||
msg.AddTag("stype-id", mntpoint)
|
||||
@@ -161,7 +154,7 @@ func (m *NfsIOStatCollector) Read(interval time.Duration, output chan lp.CCMessa
|
||||
}
|
||||
if m.config.SendDerivedValues {
|
||||
rate := float64(newVal-old[name]) / timeDiff
|
||||
msg, err := lp.NewMessage(fmt.Sprintf("nfsio_%s_bw", name), m.tags, m.meta, map[string]any{"value": rate}, now)
|
||||
msg, err := lp.NewMetric(fmt.Sprintf("nfsio_%s_bw", name), m.tags, m.meta, rate, now)
|
||||
if err == nil {
|
||||
if strings.HasPrefix(name, "page") {
|
||||
msg.AddMeta("unit", "4K_pages/s")
|
||||
|
||||
@@ -117,7 +117,7 @@ func (m *NUMAStatsCollector) Init(config json.RawMessage) error {
|
||||
}
|
||||
|
||||
// Initialized
|
||||
cclog.ComponentDebug(m.name, "initialized", len(m.topology), "NUMA domains")
|
||||
cclog.ComponentDebugf(m.name, "initialized %d NUMA domains", len(m.topology))
|
||||
m.init = true
|
||||
return nil
|
||||
}
|
||||
|
||||
396
collectors/nvidiaGPMMetric.go
Normal file
396
collectors/nvidiaGPMMetric.go
Normal file
@@ -0,0 +1,396 @@
|
||||
package collectors
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"fmt"
|
||||
"slices"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-lib/v2/ccMessage"
|
||||
"github.com/NVIDIA/go-nvml/pkg/nvml"
|
||||
)
|
||||
|
||||
type NvidiaGPMMetricDef struct {
|
||||
name string
|
||||
outname string
|
||||
id nvml.GpmMetricId
|
||||
unit string
|
||||
}
|
||||
|
||||
var NvidiaGPMMetrics []NvidiaGPMMetricDef = []NvidiaGPMMetricDef{
|
||||
{
|
||||
name: "GRAPHICS_UTIL",
|
||||
outname: "nv_gpm_graphics_util",
|
||||
id: nvml.GPM_METRIC_GRAPHICS_UTIL,
|
||||
unit: "%",
|
||||
},
|
||||
{
|
||||
name: "SM_UTIL",
|
||||
outname: "nv_gpm_sm_util",
|
||||
id: nvml.GPM_METRIC_SM_UTIL,
|
||||
unit: "%",
|
||||
},
|
||||
{
|
||||
name: "SM_OCCUPANCY",
|
||||
outname: "nv_gpm_sm_occupancy",
|
||||
id: nvml.GPM_METRIC_SM_OCCUPANCY,
|
||||
unit: "%",
|
||||
},
|
||||
{
|
||||
name: "INTEGER_UTIL",
|
||||
outname: "nv_gpm_integer_util",
|
||||
id: nvml.GPM_METRIC_INTEGER_UTIL,
|
||||
unit: "%",
|
||||
},
|
||||
{
|
||||
name: "ANY_TENSOR_UTIL",
|
||||
outname: "nv_gpm_any_tensor_util",
|
||||
id: nvml.GPM_METRIC_ANY_TENSOR_UTIL,
|
||||
unit: "%",
|
||||
},
|
||||
{
|
||||
name: "DFMA_TENSOR_UTIL",
|
||||
outname: "nv_gpm_dfma_tensor_util",
|
||||
id: nvml.GPM_METRIC_DFMA_TENSOR_UTIL,
|
||||
unit: "%",
|
||||
},
|
||||
{
|
||||
name: "HMMA_TENSOR_UTIL",
|
||||
outname: "nv_gpm_hmma_tensor_util",
|
||||
id: nvml.GPM_METRIC_HMMA_TENSOR_UTIL,
|
||||
unit: "%",
|
||||
},
|
||||
{
|
||||
name: "IMMA_TENSOR_UTIL",
|
||||
outname: "nv_gpm_imma_tensor_util",
|
||||
id: nvml.GPM_METRIC_IMMA_TENSOR_UTIL,
|
||||
unit: "%",
|
||||
},
|
||||
{
|
||||
name: "DRAM_BW_UTIL",
|
||||
outname: "nv_gpm_dram_bw_util",
|
||||
id: nvml.GPM_METRIC_DRAM_BW_UTIL,
|
||||
unit: "%",
|
||||
},
|
||||
{
|
||||
name: "FP64_UTIL",
|
||||
outname: "nv_gpm_fp64_util",
|
||||
id: nvml.GPM_METRIC_FP64_UTIL,
|
||||
unit: "%",
|
||||
},
|
||||
{
|
||||
name: "FP32_UTIL",
|
||||
outname: "nv_gpm_fp32_util",
|
||||
id: nvml.GPM_METRIC_FP32_UTIL,
|
||||
unit: "%",
|
||||
},
|
||||
{
|
||||
name: "FP16_UTIL",
|
||||
outname: "nv_gpm_fp16_util",
|
||||
id: nvml.GPM_METRIC_FP16_UTIL,
|
||||
unit: "%",
|
||||
},
|
||||
}
|
||||
|
||||
type NvidiaGPMCollectorConfig struct {
|
||||
Metrics []string `json:"metrics,omitempty"`
|
||||
ExcludeDevices []string `json:"exclude_devices,omitempty"`
|
||||
AddPciInfoTag bool `json:"add_pci_info_tag,omitempty"`
|
||||
UsePciInfoAsTypeId bool `json:"use_pci_info_as_type_id,omitempty"`
|
||||
AddUuidMeta bool `json:"add_uuid_meta,omitempty"`
|
||||
AddBoardNumberMeta bool `json:"add_board_number_meta,omitempty"`
|
||||
AddSerialMeta bool `json:"add_serial_meta,omitempty"`
|
||||
ProcessMigDevices bool `json:"process_mig_devices,omitempty"`
|
||||
UseUuidForMigDevices bool `json:"use_uuid_for_mig_device,omitempty"`
|
||||
UseSliceForMigDevices bool `json:"use_slice_for_mig_device,omitempty"`
|
||||
}
|
||||
|
||||
type NvidiaGPMCollectorDevice struct {
|
||||
device nvml.Device
|
||||
tags map[string]string
|
||||
meta map[string]string
|
||||
startTime time.Time
|
||||
endTime time.Time
|
||||
measurement nvml.GpmMetricsGetType
|
||||
metricsLookup map[int]NvidiaGPMMetricDef
|
||||
}
|
||||
|
||||
type NvidiaGPMCollector struct {
|
||||
metricCollector
|
||||
|
||||
config NvidiaGPMCollectorConfig
|
||||
gpus []NvidiaGPMCollectorDevice
|
||||
num_gpus int
|
||||
}
|
||||
|
||||
func (m *NvidiaGPMCollector) Init(config json.RawMessage) error {
|
||||
var err error = nil
|
||||
m.name = "NvidiaGPMCollector"
|
||||
m.parallel = true
|
||||
if err := m.setup(); err != nil {
|
||||
return fmt.Errorf("%s Init(): setup() call failed: %w", m.name, err)
|
||||
}
|
||||
if len(config) > 0 {
|
||||
d := json.NewDecoder(strings.NewReader(string(config)))
|
||||
d.DisallowUnknownFields()
|
||||
if err = d.Decode(&m.config); err != nil {
|
||||
return fmt.Errorf("%s Init(): Error decoding JSON config: %w", m.name, err)
|
||||
}
|
||||
}
|
||||
m.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "NvidiaGPM",
|
||||
}
|
||||
|
||||
// Initialize NVIDIA Management Library (NVML)
|
||||
ret := nvml.Init()
|
||||
|
||||
// Error: NVML library not found
|
||||
// (nvml.ErrorString can not be used in this case)
|
||||
if ret == nvml.ERROR_LIBRARY_NOT_FOUND {
|
||||
return fmt.Errorf("%s Init(): NVML library not found", m.name)
|
||||
}
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
return fmt.Errorf("%s Init(): Unable to initialize NVML: %w", m.name, err)
|
||||
}
|
||||
|
||||
// Number of NVIDIA GPUs
|
||||
num_gpus, ret := nvml.DeviceGetCount()
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
return fmt.Errorf("%s Init(): Unable to get device count: %w", m.name, err)
|
||||
}
|
||||
|
||||
// For all GPUs
|
||||
m.gpus = make([]NvidiaGPMCollectorDevice, 0, num_gpus)
|
||||
for i := range num_gpus {
|
||||
|
||||
// Skip excluded devices by ID
|
||||
str_i := strconv.Itoa(i)
|
||||
if slices.Contains(m.config.ExcludeDevices, str_i) {
|
||||
cclog.ComponentDebugf(m.name, "Skipping excluded device %s", str_i)
|
||||
continue
|
||||
}
|
||||
|
||||
// Get device handle
|
||||
device, ret := nvml.DeviceGetHandleByIndex(i)
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentErrorf(m.name, "Unable to get device at index %d: %s", i, err.Error())
|
||||
continue
|
||||
}
|
||||
|
||||
supportInfo, ret := nvml.GpmQueryDeviceSupport(device)
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentErrorf(m.name, "Unable to query GPM support for device at index %d: %s", i, err.Error())
|
||||
continue
|
||||
} else {
|
||||
if supportInfo.IsSupportedDevice == uint32(nvml.FEATURE_DISABLED) {
|
||||
cclog.ComponentErrorf(m.name, "Device at index %d does not support GPM metrics", i)
|
||||
continue
|
||||
}
|
||||
}
|
||||
|
||||
stream, ret := nvml.GpmQueryIfStreamingEnabled(device)
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentErrorf(m.name, "Unable to query GPM streaming for device at index %d: %s", i, err.Error())
|
||||
continue
|
||||
} else {
|
||||
if stream == uint32(nvml.FEATURE_DISABLED) {
|
||||
ret = nvml.GpmSetStreamingEnabled(device, uint32(nvml.FEATURE_ENABLED))
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentErrorf(m.name, "Unable to set streaming mode for device at index %d: %s", i, err.Error())
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Get device's PCI info
|
||||
pciInfo, ret := nvml.DeviceGetPciInfo(device)
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentErrorf(m.name, "Unable to get PCI info for device at index %d: %s", i, err.Error())
|
||||
continue
|
||||
}
|
||||
// Create PCI ID in the common format used by the NVML.
|
||||
pci_id := fmt.Sprintf(
|
||||
nvml.DEVICE_PCI_BUS_ID_FMT,
|
||||
pciInfo.Domain,
|
||||
pciInfo.Bus,
|
||||
pciInfo.Device)
|
||||
|
||||
// Skip excluded devices specified by PCI ID
|
||||
if slices.Contains(m.config.ExcludeDevices, pci_id) {
|
||||
cclog.ComponentDebugf(m.name, "Skipping excluded device %s", pci_id)
|
||||
continue
|
||||
}
|
||||
ss, nvmlErr := nvml.GpmSampleAlloc()
|
||||
if nvmlErr != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentErrorf(m.name, "Failed to allocate GPM sample for device %d: %s", i, err.Error())
|
||||
continue
|
||||
}
|
||||
es, nvmlErr := nvml.GpmSampleAlloc()
|
||||
if nvmlErr != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentErrorf(m.name, "Failed to allocate GPM sample for device %d: %s", i, err.Error())
|
||||
continue
|
||||
}
|
||||
|
||||
// Select which value to use as 'type-id'.
|
||||
// The PCI ID is commonly required in SLURM environments because the
|
||||
// numberic IDs used by SLURM and the ones used by NVML might differ
|
||||
// depending on the job type. The PCI ID is more reliable but is commonly
|
||||
// not recorded for a job, so it must be added manually in prologue or epilogue
|
||||
// e.g. to the comment field
|
||||
tid := str_i
|
||||
if m.config.UsePciInfoAsTypeId {
|
||||
tid = pci_id
|
||||
}
|
||||
|
||||
// Now we got all infos together, populate the device list
|
||||
g := NvidiaGPMCollectorDevice{}
|
||||
|
||||
// Add device handle
|
||||
g.device = device
|
||||
|
||||
// Add tags
|
||||
g.tags = map[string]string{
|
||||
"type": "accelerator",
|
||||
"type-id": tid,
|
||||
}
|
||||
|
||||
// Add PCI info as tag if not already used as 'type-id'
|
||||
if m.config.AddPciInfoTag && !m.config.UsePciInfoAsTypeId {
|
||||
g.tags["pci_identifier"] = pci_id
|
||||
}
|
||||
|
||||
g.meta = map[string]string{
|
||||
"source": m.name,
|
||||
"group": "Nvidia",
|
||||
}
|
||||
|
||||
if m.config.AddBoardNumberMeta {
|
||||
board, ret := nvml.DeviceGetBoardPartNumber(device)
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentError(m.name, "Unable to get boart part number for device at index", i, ":", err.Error())
|
||||
} else {
|
||||
g.meta["board_number"] = board
|
||||
}
|
||||
}
|
||||
if m.config.AddSerialMeta {
|
||||
serial, ret := nvml.DeviceGetSerial(device)
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentError(m.name, "Unable to get serial number for device at index", i, ":", err.Error())
|
||||
} else {
|
||||
g.meta["serial"] = serial
|
||||
}
|
||||
}
|
||||
if m.config.AddUuidMeta {
|
||||
uuid, ret := nvml.DeviceGetUUID(device)
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentError(m.name, "Unable to get UUID for device at index", i, ":", err.Error())
|
||||
} else {
|
||||
g.meta["uuid"] = uuid
|
||||
}
|
||||
}
|
||||
|
||||
g.measurement.Sample1 = ss
|
||||
g.measurement.Sample2 = es
|
||||
g.measurement.Version = nvml.GPM_METRICS_GET_VERSION
|
||||
g.metricsLookup = make(map[int]NvidiaGPMMetricDef)
|
||||
metIdx := 0
|
||||
for _, inmetric := range m.config.Metrics {
|
||||
for _, defmetric := range NvidiaGPMMetrics {
|
||||
if inmetric == defmetric.outname || inmetric == defmetric.name {
|
||||
g.measurement.Metrics[metIdx] = nvml.GpmMetric{
|
||||
MetricId: uint32(defmetric.id),
|
||||
}
|
||||
g.metricsLookup[metIdx] = defmetric
|
||||
metIdx += 1
|
||||
}
|
||||
}
|
||||
}
|
||||
g.measurement.NumMetrics = uint32(metIdx)
|
||||
m.gpus = append(m.gpus, g)
|
||||
}
|
||||
cclog.ComponentDebugf(m.name, "Found %d Nvidia GPUs with GPM support", len(m.gpus))
|
||||
m.num_gpus = len(m.gpus)
|
||||
m.init = true
|
||||
return err
|
||||
}
|
||||
|
||||
func (m *NvidiaGPMCollector) Read(interval time.Duration, output chan lp.CCMessage) {
|
||||
var err error
|
||||
if !m.init {
|
||||
return
|
||||
}
|
||||
for i, gpu := range m.gpus {
|
||||
gpu.startTime = time.Now()
|
||||
nvmlErr := gpu.measurement.Sample1.Get(gpu.device)
|
||||
if nvmlErr != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(nvmlErr))
|
||||
cclog.ComponentError(m.name, "Unable to get start GPM sample for device at index", i, ":", err.Error())
|
||||
continue
|
||||
}
|
||||
}
|
||||
time.Sleep(interval)
|
||||
|
||||
for i, gpu := range m.gpus {
|
||||
gpu.endTime = time.Now()
|
||||
nvmlErr := gpu.measurement.Sample2.Get(gpu.device)
|
||||
if nvmlErr != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(nvmlErr))
|
||||
cclog.ComponentError(m.name, "Unable to get stop GPM sample for device at index", i, ":", err.Error())
|
||||
continue
|
||||
}
|
||||
}
|
||||
|
||||
for i, gpu := range m.gpus {
|
||||
nvmlErr := nvml.GpmMetricsGet(&gpu.measurement)
|
||||
if nvmlErr != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(nvmlErr))
|
||||
cclog.ComponentError(m.name, "Unable to get evaluate GPM sample for device at index", i, ":", err.Error())
|
||||
continue
|
||||
}
|
||||
for idx, metricDef := range gpu.metricsLookup {
|
||||
y, err := lp.NewMetric(metricDef.outname, gpu.tags, gpu.meta, gpu.measurement.Metrics[idx].Value, time.Now())
|
||||
if err == nil {
|
||||
y.AddMeta("unit", metricDef.unit)
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func (m *NvidiaGPMCollector) Close() {
|
||||
if m.init {
|
||||
for i, gpu := range m.gpus {
|
||||
ret := gpu.measurement.Sample1.Free()
|
||||
if ret != nvml.SUCCESS {
|
||||
err := errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentErrorf(m.name, "Unable to free start sample for device at index %d: %s", i, err.Error())
|
||||
}
|
||||
ret = gpu.measurement.Sample2.Free()
|
||||
if ret != nvml.SUCCESS {
|
||||
err := errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentErrorf(m.name, "Unable to free stop sample for device at index %d: %s", i, err.Error())
|
||||
}
|
||||
}
|
||||
if ret := nvml.Shutdown(); ret != nvml.SUCCESS {
|
||||
cclog.ComponentError(m.name, "nvml.Shutdown() not successful")
|
||||
}
|
||||
m.init = false
|
||||
}
|
||||
}
|
||||
54
collectors/nvidiaGPMMetric.md
Normal file
54
collectors/nvidiaGPMMetric.md
Normal file
@@ -0,0 +1,54 @@
|
||||
<!--
|
||||
---
|
||||
title: "Nvidia NVML GPM metric collector"
|
||||
description: Collect metrics for Nvidia GPUs using the NVML GPM interface
|
||||
categories: [cc-metric-collector]
|
||||
tags: ['Admin']
|
||||
weight: 2
|
||||
hugo_path: docs/reference/cc-metric-collector/collectors/nvidiaGPM.md
|
||||
---
|
||||
-->
|
||||
|
||||
## `nvidiaGPM` collector
|
||||
|
||||
```json
|
||||
"nvidia_gpm": {
|
||||
"metrics": [
|
||||
"nv_fb_mem_used",
|
||||
"nv_fan"
|
||||
],
|
||||
"exclude_devices": [
|
||||
"0","1", "0000000:ff:01.0"
|
||||
],
|
||||
|
||||
"process_mig_devices": false,
|
||||
"use_pci_info_as_type_id": true,
|
||||
"add_pci_info_tag": false,
|
||||
"add_uuid_meta": false,
|
||||
"add_board_number_meta": false,
|
||||
"add_serial_meta": false,
|
||||
"use_uuid_for_mig_device": false,
|
||||
"use_slice_for_mig_device": false
|
||||
}
|
||||
```
|
||||
|
||||
The `nvidia_gpm` collector can be configured to leave out specific devices with the `exclude_devices` option. It takes IDs as supplied to the NVML with `nvmlDeviceGetHandleByIndex()` or the PCI address in NVML format (`%08X:%02X:%02X.0`). Commonly only the physical GPUs are monitored. If MIG devices should be analyzed as well, set `process_mig_devices` (adds `stype=mig,stype-id=<mig_index>`). With the options `use_uuid_for_mig_device` and `use_slice_for_mig_device`, the `<mig_index>` can be replaced with the UUID (e.g. `MIG-6a9f7cc8-6d5b-5ce0-92de-750edc4d8849`) or the MIG slice name (e.g. `1g.5gb`).
|
||||
|
||||
The metrics sent by the `nvidia_gpm` collector use `accelerator` as `type` tag. For the `type-id`, it uses the device handle index by default. With the `use_pci_info_as_type_id` option, the PCI ID is used instead. If both values should be added as tags, activate the `add_pci_info_tag` option. It uses the device handle index as `type-id` and adds the PCI ID as separate `pci_identifier` tag.
|
||||
|
||||
Optionally, it is possible to add the UUID, the board part number and the serial to the meta informations. They are not sent to the sinks (if not configured otherwise).
|
||||
|
||||
|
||||
Available Metrics:
|
||||
* `nv_gpm_graphics_util`
|
||||
* `nv_gpm_sm_util`
|
||||
* `nv_gpm_sm_occupancy`
|
||||
* `nv_gpm_integer_util`
|
||||
* `nv_gpm_any_tensor_util`
|
||||
* `nv_gpm_dfma_tensor_util`
|
||||
* `nv_gpm_hmma_tensor_util`
|
||||
* `nv_gpm_imma_tensor_util`
|
||||
* `nv_gpm_dram_bw_util`
|
||||
* `nv_gpm_fp64_util`
|
||||
* `nv_gpm_fp32_util`
|
||||
* `nv_gpm_fp16_util`
|
||||
@@ -113,7 +113,7 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
|
||||
// Skip excluded devices by ID
|
||||
str_i := strconv.Itoa(i)
|
||||
if slices.Contains(m.config.ExcludeDevices, str_i) {
|
||||
cclog.ComponentDebug(m.name, "Skipping excluded device", str_i)
|
||||
cclog.ComponentDebugf(m.name, "Skipping excluded device %s", str_i)
|
||||
continue
|
||||
}
|
||||
|
||||
@@ -121,7 +121,7 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
|
||||
device, ret := nvml.DeviceGetHandleByIndex(i)
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentError(m.name, "Unable to get device at index", i, ":", err.Error())
|
||||
cclog.ComponentErrorf(m.name, "Unable to get device at index %d: %s", i, err.Error())
|
||||
continue
|
||||
}
|
||||
|
||||
@@ -129,7 +129,7 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
|
||||
pciInfo, ret := nvml.DeviceGetPciInfo(device)
|
||||
if ret != nvml.SUCCESS {
|
||||
err = errors.New(nvml.ErrorString(ret))
|
||||
cclog.ComponentError(m.name, "Unable to get PCI info for device at index", i, ":", err.Error())
|
||||
cclog.ComponentErrorf(m.name, "Unable to get PCI info for device at index %d: %s", i, err.Error())
|
||||
continue
|
||||
}
|
||||
// Create PCI ID in the common format used by the NVML.
|
||||
@@ -141,7 +141,7 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
|
||||
|
||||
// Skip excluded devices specified by PCI ID
|
||||
if slices.Contains(m.config.ExcludeDevices, pci_id) {
|
||||
cclog.ComponentDebug(m.name, "Skipping excluded device", pci_id)
|
||||
cclog.ComponentDebugf(m.name, "Skipping excluded device %s", pci_id)
|
||||
continue
|
||||
}
|
||||
|
||||
@@ -183,7 +183,7 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
|
||||
if m.config.AddBoardNumberMeta {
|
||||
board, ret := nvml.DeviceGetBoardPartNumber(device)
|
||||
if ret != nvml.SUCCESS {
|
||||
cclog.ComponentError(m.name, "Unable to get boart part number for device at index", i, ":", err.Error())
|
||||
cclog.ComponentErrorf(m.name, "Unable to get boart part number for device at index %d: %s", i, err.Error())
|
||||
} else {
|
||||
g.meta["board_number"] = board
|
||||
}
|
||||
@@ -191,7 +191,7 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
|
||||
if m.config.AddSerialMeta {
|
||||
serial, ret := nvml.DeviceGetSerial(device)
|
||||
if ret != nvml.SUCCESS {
|
||||
cclog.ComponentError(m.name, "Unable to get serial number for device at index", i, ":", err.Error())
|
||||
cclog.ComponentErrorf(m.name, "Unable to get serial number for device at index %d: %s", i, err.Error())
|
||||
} else {
|
||||
g.meta["serial"] = serial
|
||||
}
|
||||
@@ -199,7 +199,7 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
|
||||
if m.config.AddUuidMeta {
|
||||
uuid, ret := nvml.DeviceGetUUID(device)
|
||||
if ret != nvml.SUCCESS {
|
||||
cclog.ComponentError(m.name, "Unable to get UUID for device at index", i, ":", err.Error())
|
||||
cclog.ComponentErrorf(m.name, "Unable to get UUID for device at index %d: %s", i, err.Error())
|
||||
} else {
|
||||
g.meta["uuid"] = uuid
|
||||
}
|
||||
@@ -1128,97 +1128,97 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMessage)
|
||||
}
|
||||
err = readMemoryInfo(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readMemoryInfo for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readMemoryInfo for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readUtilization(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readUtilization for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readUtilization for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readTemp(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readTemp for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readTemp for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readFan(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readFan for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readFan for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readEccMode(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readEccMode for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readEccMode for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readPerfState(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readPerfState for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readPerfState for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readPowerUsage(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readPowerUsage for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readPowerUsage for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readEnergyConsumption(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readEnergyConsumption for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readEnergyConsumption for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readClocks(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readClocks for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readClocks for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readMaxClocks(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readMaxClocks for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readMaxClocks for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readEccErrors(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readEccErrors for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readEccErrors for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readPowerLimit(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readPowerLimit for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readPowerLimit for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readEncUtilization(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readEncUtilization for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readEncUtilization for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readDecUtilization(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readDecUtilization for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readDecUtilization for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readRemappedRows(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readRemappedRows for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readRemappedRows for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readBarMemoryInfo(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readBarMemoryInfo for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readBarMemoryInfo for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readProcessCounts(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readProcessCounts for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readProcessCounts for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readViolationStats(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readViolationStats for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readViolationStats for device %s failed", name)
|
||||
}
|
||||
|
||||
err = readNVLinkStats(device, output)
|
||||
if err != nil {
|
||||
cclog.ComponentDebug(m.name, "readNVLinkStats for device", name, "failed")
|
||||
cclog.ComponentDebugf(m.name, "readNVLinkStats for device %s failed", name)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1244,7 +1244,7 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMessage)
|
||||
if maxMig == 0 {
|
||||
continue
|
||||
}
|
||||
cclog.ComponentDebug(m.name, "Reading MIG devices for GPU", i)
|
||||
cclog.ComponentDebugf(m.name, "Reading MIG devices for GPU %d", i)
|
||||
|
||||
for j := range maxMig {
|
||||
mdev, ret := nvml.DeviceGetMigDeviceHandleByIndex(m.gpus[i].device, j)
|
||||
@@ -1268,7 +1268,7 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMessage)
|
||||
if m.config.UseUuidForMigDevices {
|
||||
uuid, ret := nvml.DeviceGetUUID(mdev)
|
||||
if ret != nvml.SUCCESS {
|
||||
cclog.ComponentError(m.name, "Unable to get UUID for mig device at index", j, ":", err.Error())
|
||||
cclog.ComponentErrorf(m.name, "Unable to get UUID for mig device at index %d: %s", j, err.Error())
|
||||
} else {
|
||||
migDevice.tags["stype-id"] = uuid
|
||||
}
|
||||
|
||||
@@ -208,11 +208,10 @@ func (m *RAPLCollector) Init(config json.RawMessage) error {
|
||||
}
|
||||
|
||||
// Initialized
|
||||
cclog.ComponentDebug(
|
||||
cclog.ComponentDebugf(
|
||||
m.name,
|
||||
"initialized",
|
||||
len(m.RAPLZoneInfo),
|
||||
"zones with running average power limit (RAPL) monitoring attributes")
|
||||
"initialized %d zones with running average power limit (RAPL) monitoring attributes",
|
||||
len(m.RAPLZoneInfo))
|
||||
m.init = true
|
||||
|
||||
return err
|
||||
@@ -242,12 +241,7 @@ func (m *RAPLCollector) Read(interval time.Duration, output chan lp.CCMessage) {
|
||||
timeDiff := energyTimestamp.Sub(p.energyTimestamp)
|
||||
averagePower := float64(energyDiff) / float64(timeDiff.Microseconds())
|
||||
|
||||
y, err := lp.NewMessage(
|
||||
"rapl_average_power",
|
||||
p.tags,
|
||||
m.meta,
|
||||
map[string]any{"value": averagePower},
|
||||
energyTimestamp)
|
||||
y, err := lp.NewMetric("rapl_average_power", p.tags, m.meta, averagePower, energyTimestamp)
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
@@ -124,7 +124,7 @@ func (m *RocmSmiCollector) Init(config json.RawMessage) error {
|
||||
if m.config.AddSerialMeta {
|
||||
serial, ret := rocm_smi.DeviceGetSerialNumber(device)
|
||||
if ret != rocm_smi.STATUS_SUCCESS {
|
||||
cclog.ComponentError(m.name, "Unable to get serial number for device at index", i, ":", rocm_smi.StatusStringNoError(ret))
|
||||
cclog.ComponentErrorf(m.name, "Unable to get serial number for device at index %d: %s", i, rocm_smi.StatusStringNoError(ret))
|
||||
} else {
|
||||
dev.meta["serial"] = serial
|
||||
}
|
||||
@@ -152,134 +152,116 @@ func (m *RocmSmiCollector) Read(interval time.Duration, output chan lp.CCMessage
|
||||
for _, dev := range m.devices {
|
||||
metrics, ret := rocm_smi.DeviceGetMetrics(dev.device)
|
||||
if ret != rocm_smi.STATUS_SUCCESS {
|
||||
cclog.ComponentError(m.name, "Unable to get metrics for device at index", dev.index, ":", rocm_smi.StatusStringNoError(ret))
|
||||
cclog.ComponentErrorf(m.name, "Unable to get metrics for device at index %d: %s", dev.index, rocm_smi.StatusStringNoError(ret))
|
||||
continue
|
||||
}
|
||||
|
||||
if !dev.excludeMetrics["rocm_gfx_util"] {
|
||||
value := metrics.Average_gfx_activity
|
||||
y, err := lp.NewMessage("rocm_gfx_util", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_gfx_util", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_umc_util"] {
|
||||
value := metrics.Average_umc_activity
|
||||
y, err := lp.NewMessage("rocm_umc_util", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_umc_util", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_mm_util"] {
|
||||
value := metrics.Average_mm_activity
|
||||
y, err := lp.NewMessage("rocm_mm_util", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_mm_util", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_avg_power"] {
|
||||
value := metrics.Average_socket_power
|
||||
y, err := lp.NewMessage("rocm_avg_power", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_avg_power", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_temp_mem"] {
|
||||
value := metrics.Temperature_mem
|
||||
y, err := lp.NewMessage("rocm_temp_mem", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_temp_mem", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_temp_hotspot"] {
|
||||
value := metrics.Temperature_hotspot
|
||||
y, err := lp.NewMessage("rocm_temp_hotspot", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_temp_hotspot", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_temp_edge"] {
|
||||
value := metrics.Temperature_edge
|
||||
y, err := lp.NewMessage("rocm_temp_edge", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_temp_edge", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_temp_vrgfx"] {
|
||||
value := metrics.Temperature_vrgfx
|
||||
y, err := lp.NewMessage("rocm_temp_vrgfx", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_temp_vrgfx", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_temp_vrsoc"] {
|
||||
value := metrics.Temperature_vrsoc
|
||||
y, err := lp.NewMessage("rocm_temp_vrsoc", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_temp_vrsoc", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_temp_vrmem"] {
|
||||
value := metrics.Temperature_vrmem
|
||||
y, err := lp.NewMessage("rocm_temp_vrmem", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_temp_vrmem", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_gfx_clock"] {
|
||||
value := metrics.Average_gfxclk_frequency
|
||||
y, err := lp.NewMessage("rocm_gfx_clock", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_gfx_clock", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_soc_clock"] {
|
||||
value := metrics.Average_socclk_frequency
|
||||
y, err := lp.NewMessage("rocm_soc_clock", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_soc_clock", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_u_clock"] {
|
||||
value := metrics.Average_uclk_frequency
|
||||
y, err := lp.NewMessage("rocm_u_clock", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_u_clock", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_v0_clock"] {
|
||||
value := metrics.Average_vclk0_frequency
|
||||
y, err := lp.NewMessage("rocm_v0_clock", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_v0_clock", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_v1_clock"] {
|
||||
value := metrics.Average_vclk1_frequency
|
||||
y, err := lp.NewMessage("rocm_v1_clock", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_v1_clock", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_d0_clock"] {
|
||||
value := metrics.Average_dclk0_frequency
|
||||
y, err := lp.NewMessage("rocm_d0_clock", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_d0_clock", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_d1_clock"] {
|
||||
value := metrics.Average_dclk1_frequency
|
||||
y, err := lp.NewMessage("rocm_d1_clock", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_d1_clock", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
if !dev.excludeMetrics["rocm_temp_hbm"] {
|
||||
for i := range rocm_smi.NUM_HBM_INSTANCES {
|
||||
value := metrics.Temperature_hbm[i]
|
||||
y, err := lp.NewMessage("rocm_temp_hbm", dev.tags, dev.meta, map[string]any{"value": value}, timestamp)
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric("rocm_temp_hbm", dev.tags, dev.meta, value, timestamp); err == nil {
|
||||
y.AddTag("stype", "device")
|
||||
y.AddTag("stype-id", strconv.Itoa(i))
|
||||
output <- y
|
||||
|
||||
@@ -147,15 +147,15 @@ func (m *SchedstatCollector) Read(interval time.Duration, output chan lp.CCMessa
|
||||
|
||||
file, err := os.Open(SCHEDSTATFILE)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to open file '%s': %v", SCHEDSTATFILE, err))
|
||||
"Read(): Failed to open file '%s': %v", SCHEDSTATFILE, err)
|
||||
}
|
||||
defer func() {
|
||||
if err := file.Close(); err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to close file '%s': %v", SCHEDSTATFILE, err))
|
||||
"Read(): Failed to close file '%s': %v", SCHEDSTATFILE, err)
|
||||
}
|
||||
}()
|
||||
|
||||
|
||||
@@ -240,7 +240,7 @@ func (m *SlurmCgroupCollector) Read(interval time.Duration, output chan lp.CCMes
|
||||
globPattern := filepath.Join(m.cgroupBase, "job_*")
|
||||
jobDirs, err := filepath.Glob(globPattern)
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, "Error globbing job directories:", err.Error())
|
||||
cclog.ComponentErrorf(m.name, "Error globbing job directories: %s", err.Error())
|
||||
return
|
||||
}
|
||||
|
||||
@@ -249,7 +249,7 @@ func (m *SlurmCgroupCollector) Read(interval time.Duration, output chan lp.CCMes
|
||||
|
||||
jobdata, err := m.ReadJobData(jKey)
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, "Error reading job data for", jKey, ":", err.Error())
|
||||
cclog.ComponentError(m.name, "Error reading job data for %s: %s", jKey, err.Error())
|
||||
continue
|
||||
}
|
||||
|
||||
|
||||
@@ -228,12 +228,12 @@ func (m *SmartMonCollector) Read(interval time.Duration, output chan lp.CCMessag
|
||||
|
||||
stdout, err := command.Output()
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, "cannot read data for device", d.Name)
|
||||
cclog.ComponentErrorf(m.name, "cannot read data for device %s", d.Name)
|
||||
continue
|
||||
}
|
||||
err = json.Unmarshal(stdout, &data)
|
||||
if err != nil {
|
||||
cclog.ComponentError(m.name, "cannot unmarshal data for device", d.Name)
|
||||
cclog.ComponentErrorf(m.name, "cannot unmarshal data for device %s", d.Name)
|
||||
continue
|
||||
}
|
||||
if !m.excludeMetric.temp {
|
||||
|
||||
@@ -50,3 +50,18 @@ Metrics:
|
||||
* `smartmon_errlog_entries`: Error log entries
|
||||
* `smartmon_warn_temp_time`: Time above the warning temperature threshold
|
||||
* `smartmon_crit_comp_time`: Time above the critical composite temperature threshold
|
||||
|
||||
`smartctl` typically require root to run.
|
||||
In order to run `cc-metric-collector` without root priviliges, you can enable `use_sudo`.
|
||||
Add a file like this in `/etc/sudoers.d/` to allow `cc-metric-collector` to run the required command:
|
||||
|
||||
```
|
||||
# Do not log the following sudo commands from monitoring, since this causes a lot of log spam.
|
||||
# However keep log_denied enabled, to detect failures
|
||||
Defaults: monitoring !log_allowed, !pam_session
|
||||
|
||||
# Allow to use lctl
|
||||
monitoring ALL = (root) NOPASSWD:/absolute/path/to/smartctl --json=c --device=* "--all" *
|
||||
# Or add individual rules for each device
|
||||
# monitoring ALL = (root) NOPASSWD:/absolute/path/to/smartctl --json=c --device=<device_type> "--all" <device>
|
||||
```
|
||||
@@ -188,39 +188,27 @@ func (m *TempCollector) Read(interval time.Duration, output chan lp.CCMessage) {
|
||||
// Read sensor file
|
||||
buffer, err := os.ReadFile(sensor.file)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to read file '%s': %v", sensor.file, err))
|
||||
"Read(): Failed to read file '%s': %v", sensor.file, err)
|
||||
continue
|
||||
}
|
||||
x, err := strconv.ParseInt(strings.TrimSpace(string(buffer)), 10, 64)
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to convert temperature '%s' to int64: %v", buffer, err))
|
||||
"Read(): Failed to convert temperature '%s' to int64: %v", buffer, err)
|
||||
continue
|
||||
}
|
||||
x /= 1000
|
||||
y, err := lp.NewMessage(
|
||||
sensor.metricName,
|
||||
sensor.tags,
|
||||
m.meta,
|
||||
map[string]any{"value": x},
|
||||
time.Now(),
|
||||
)
|
||||
y, err := lp.NewMetric(sensor.metricName, sensor.tags, m.meta, x, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
// max temperature
|
||||
if m.config.ReportMaxTemp && sensor.maxTemp != 0 {
|
||||
y, err := lp.NewMessage(
|
||||
sensor.maxTempName,
|
||||
sensor.tags,
|
||||
m.meta,
|
||||
map[string]any{"value": sensor.maxTemp},
|
||||
time.Now(),
|
||||
)
|
||||
y, err := lp.NewMetric(sensor.maxTempName, sensor.tags, m.meta, sensor.maxTemp, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
@@ -228,13 +216,7 @@ func (m *TempCollector) Read(interval time.Duration, output chan lp.CCMessage) {
|
||||
|
||||
// critical temperature
|
||||
if m.config.ReportCriticalTemp && sensor.critTemp != 0 {
|
||||
y, err := lp.NewMessage(
|
||||
sensor.critTempName,
|
||||
sensor.tags,
|
||||
m.meta,
|
||||
map[string]any{"value": sensor.critTemp},
|
||||
time.Now(),
|
||||
)
|
||||
y, err := lp.NewMetric(sensor.critTempName, sensor.tags, m.meta, sensor.critTemp, time.Now())
|
||||
if err == nil {
|
||||
output <- y
|
||||
}
|
||||
|
||||
@@ -77,24 +77,16 @@ func (m *TopProcsCollector) Read(interval time.Duration, output chan lp.CCMessag
|
||||
command := exec.Command("ps", "-Ao", "comm", "--sort=-pcpu")
|
||||
stdout, err := command.Output()
|
||||
if err != nil {
|
||||
cclog.ComponentError(
|
||||
cclog.ComponentErrorf(
|
||||
m.name,
|
||||
fmt.Sprintf("Read(): Failed to read output from command \"%s\": %v", command.String(), err))
|
||||
"Read(): Failed to read output from command \"%s\": %v", command.String(), err)
|
||||
return
|
||||
}
|
||||
|
||||
lines := strings.Split(string(stdout), "\n")
|
||||
for i := 1; i < m.config.Num_procs+1; i++ {
|
||||
name := fmt.Sprintf("topproc%d", i)
|
||||
y, err := lp.NewMessage(
|
||||
name,
|
||||
m.tags,
|
||||
m.meta,
|
||||
map[string]any{
|
||||
"value": lines[i],
|
||||
},
|
||||
time.Now())
|
||||
if err == nil {
|
||||
if y, err := lp.NewMetric(name, m.tags, m.meta, lines[i], time.Now()); err == nil {
|
||||
output <- y
|
||||
}
|
||||
}
|
||||
|
||||
28
go.mod
28
go.mod
@@ -3,14 +3,14 @@ module github.com/ClusterCockpit/cc-metric-collector
|
||||
go 1.25.0
|
||||
|
||||
require (
|
||||
github.com/ClusterCockpit/cc-lib/v2 v2.11.0
|
||||
github.com/ClusterCockpit/go-rocm-smi v0.3.0
|
||||
github.com/ClusterCockpit/cc-lib/v2 v2.12.0
|
||||
github.com/ClusterCockpit/go-rocm-smi v0.4.0
|
||||
github.com/NVIDIA/go-nvml v0.13.0-1
|
||||
github.com/PaesslerAG/gval v1.2.4
|
||||
github.com/fsnotify/fsnotify v1.9.0
|
||||
github.com/tklauser/go-sysconf v0.3.16
|
||||
golang.design/x/thread v0.0.0-20210122121316-335e9adffdf1
|
||||
golang.org/x/sys v0.42.0
|
||||
github.com/fsnotify/fsnotify v1.10.1
|
||||
github.com/tklauser/go-sysconf v0.4.0
|
||||
golang.design/x/thread v0.3.2
|
||||
golang.org/x/sys v0.45.0
|
||||
)
|
||||
|
||||
require (
|
||||
@@ -18,27 +18,29 @@ require (
|
||||
github.com/apapsch/go-jsonmerge/v2 v2.0.0 // indirect
|
||||
github.com/beorn7/perks v1.0.1 // indirect
|
||||
github.com/cespare/xxhash/v2 v2.3.0 // indirect
|
||||
github.com/coder/websocket v1.8.14 // indirect
|
||||
github.com/expr-lang/expr v1.17.8 // indirect
|
||||
github.com/google/uuid v1.6.0 // indirect
|
||||
github.com/gorilla/mux v1.8.1 // indirect
|
||||
github.com/influxdata/influxdb-client-go/v2 v2.14.0 // indirect
|
||||
github.com/influxdata/line-protocol v0.0.0-20210922203350-b1ad95c89adf // indirect
|
||||
github.com/klauspost/compress v1.18.4 // indirect
|
||||
github.com/klauspost/compress v1.18.5 // indirect
|
||||
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
|
||||
github.com/nats-io/nats.go v1.49.0 // indirect
|
||||
github.com/nats-io/nats.go v1.51.0 // indirect
|
||||
github.com/nats-io/nkeys v0.4.15 // indirect
|
||||
github.com/nats-io/nuid v1.0.1 // indirect
|
||||
github.com/oapi-codegen/runtime v1.3.0 // indirect
|
||||
github.com/oapi-codegen/runtime v1.4.0 // indirect
|
||||
github.com/prometheus/client_golang v1.23.2 // indirect
|
||||
github.com/prometheus/client_model v0.6.2 // indirect
|
||||
github.com/prometheus/common v0.67.5 // indirect
|
||||
github.com/prometheus/procfs v0.20.1 // indirect
|
||||
github.com/questdb/go-questdb-client/v4 v4.2.0 // indirect
|
||||
github.com/santhosh-tekuri/jsonschema/v5 v5.3.1 // indirect
|
||||
github.com/shopspring/decimal v1.4.0 // indirect
|
||||
github.com/stmcginnis/gofish v0.21.4 // indirect
|
||||
github.com/tklauser/numcpus v0.11.0 // indirect
|
||||
github.com/stmcginnis/gofish v0.21.6 // indirect
|
||||
github.com/tklauser/numcpus v0.12.0 // indirect
|
||||
go.yaml.in/yaml/v2 v2.4.4 // indirect
|
||||
golang.org/x/crypto v0.49.0 // indirect
|
||||
golang.org/x/net v0.52.0 // indirect
|
||||
golang.org/x/crypto v0.50.0 // indirect
|
||||
golang.org/x/net v0.53.0 // indirect
|
||||
google.golang.org/protobuf v1.36.11 // indirect
|
||||
)
|
||||
|
||||
159
go.sum
159
go.sum
@@ -1,10 +1,17 @@
|
||||
github.com/ClusterCockpit/cc-lib/v2 v2.11.0 h1:LaLs4J0b7FArIXT8byMUcIcUr55R5obATjVi7qI02r4=
|
||||
github.com/ClusterCockpit/cc-lib/v2 v2.11.0/go.mod h1:Oj+N2lpFqiBOBzjfrLIGJ2YSWT400TX4M0ii4lNl81A=
|
||||
dario.cat/mergo v1.0.0 h1:AGCNq9Evsj31mOgNPcLyXc+4PNABt905YmuqPYYpBWk=
|
||||
dario.cat/mergo v1.0.0/go.mod h1:uNxQE+84aUszobStD9th8a29P2fMDhsBdgRYvZOxGmk=
|
||||
github.com/Azure/go-ansiterm v0.0.0-20230124172434-306776ec8161 h1:L/gRVlceqvL25UVaW/CKtUDjefjrs0SPonmDGUVOYP0=
|
||||
github.com/Azure/go-ansiterm v0.0.0-20230124172434-306776ec8161/go.mod h1:xomTg63KZ2rFqZQzSB4Vz2SUXa1BpHTVz9L5PTmPC4E=
|
||||
github.com/ClusterCockpit/cc-lib/v2 v2.12.0 h1:ZbGD68nDniuvzFjJCdyYawpCBrabdSyWOg5FFSyFbjQ=
|
||||
github.com/ClusterCockpit/cc-lib/v2 v2.12.0/go.mod h1:ml8xtcYa5WhPM7JDQ+M9/R9ZBxITCR/5xqGJ//GxXJI=
|
||||
github.com/ClusterCockpit/cc-line-protocol/v2 v2.4.0 h1:hIzxgTBWcmCIHtoDKDkSCsKCOCOwUC34sFsbD2wcW0Q=
|
||||
github.com/ClusterCockpit/cc-line-protocol/v2 v2.4.0/go.mod h1:y42qUu+YFmu5fdNuUAS4VbbIKxVjxCvbVqFdpdh8ahY=
|
||||
github.com/ClusterCockpit/go-rocm-smi v0.3.0 h1:1qZnSpG7/NyLtc7AjqnUL9Jb8xtqG1nMVgp69rJfaR8=
|
||||
github.com/ClusterCockpit/go-rocm-smi v0.3.0/go.mod h1:+I3UMeX3OlizXDf1WpGD43W4KGZZGVSGmny6rTeOnWA=
|
||||
github.com/NVIDIA/go-nvml v0.11.6-0/go.mod h1:hy7HYeQy335x6nEss0Ne3PYqleRa6Ct+VKD9RQ4nyFs=
|
||||
github.com/ClusterCockpit/go-rocm-smi v0.4.0 h1:3+bEPrSkjEJcOtt+qBUX48ugDVlOFaKUnXHTef2Ve2Q=
|
||||
github.com/ClusterCockpit/go-rocm-smi v0.4.0/go.mod h1:c19u5vBCcgb7DjL4EWTGSGpo6c79d07r4rxD50z25ng=
|
||||
github.com/Microsoft/go-winio v0.6.1 h1:9/kr64B9VUZrLm5YYwbGtUJnMgqWVOdUAXu6Migciow=
|
||||
github.com/Microsoft/go-winio v0.6.1/go.mod h1:LRdKpFKfdobln8UmuiYcKPot9D2v6svN5+sAH+4kjUM=
|
||||
github.com/Microsoft/hcsshim v0.11.4 h1:68vKo2VN8DE9AdN4tnkWnmdhqdbpUFM8OF3Airm7fz8=
|
||||
github.com/Microsoft/hcsshim v0.11.4/go.mod h1:smjE4dvqPX9Zldna+t5FG3rnoHhaB7QYxPRqGcpAD9w=
|
||||
github.com/NVIDIA/go-nvml v0.13.0-1 h1:OLX8Jq3dONuPOQPC7rndB6+iDmDakw0XTYgzMxObkEw=
|
||||
github.com/NVIDIA/go-nvml v0.13.0-1/go.mod h1:+KNA7c7gIBH7SKSJ1ntlwkfN80zdx8ovl4hrK3LmPt4=
|
||||
github.com/PaesslerAG/gval v1.2.4 h1:rhX7MpjJlcxYwL2eTTYIOBUyEKZ+A96T9vQySWkVUiU=
|
||||
@@ -12,28 +19,52 @@ github.com/PaesslerAG/gval v1.2.4/go.mod h1:XRFLwvmkTEdYziLdaCeCa5ImcGVrfQbeNUbV
|
||||
github.com/PaesslerAG/jsonpath v0.1.0 h1:gADYeifvlqK3R3i2cR5B4DGgxLXIPb3TRTH1mGi0jPI=
|
||||
github.com/PaesslerAG/jsonpath v0.1.0/go.mod h1:4BzmtoM/PI8fPO4aQGIusjGxGir2BzcV0grWtFzq1Y8=
|
||||
github.com/RaveNoX/go-jsoncommentstrip v1.0.0/go.mod h1:78ihd09MekBnJnxpICcwzCMzGrKSKYe4AqU6PDYYpjk=
|
||||
github.com/antithesishq/antithesis-sdk-go v0.5.0-default-no-op h1:Ucf+QxEKMbPogRO5guBNe5cgd9uZgfoJLOYs8WWhtjM=
|
||||
github.com/antithesishq/antithesis-sdk-go v0.5.0-default-no-op/go.mod h1:IUpT2DPAKh6i/YhSbt6Gl3v2yvUZjmKncl7U91fup7E=
|
||||
github.com/antithesishq/antithesis-sdk-go v0.6.0-default-no-op h1:kpBdlEPbRvff0mDD1gk7o9BhI16b9p5yYAXRlidpqJE=
|
||||
github.com/antithesishq/antithesis-sdk-go v0.6.0-default-no-op/go.mod h1:IUpT2DPAKh6i/YhSbt6Gl3v2yvUZjmKncl7U91fup7E=
|
||||
github.com/apapsch/go-jsonmerge/v2 v2.0.0 h1:axGnT1gRIfimI7gJifB699GoE/oq+F2MU7Dml6nw9rQ=
|
||||
github.com/apapsch/go-jsonmerge/v2 v2.0.0/go.mod h1:lvDnEdqiQrp0O42VQGgmlKpxL1AP2+08jFMw88y4klk=
|
||||
github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM=
|
||||
github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw=
|
||||
github.com/bmatcuk/doublestar v1.1.1/go.mod h1:UD6OnuiIn0yFxxA2le/rnRU1G4RaI4UvFv1sNto9p6w=
|
||||
github.com/cenkalti/backoff/v4 v4.2.1 h1:y4OZtCnogmCPw98Zjyt5a6+QwPLGkiQsYW5oUqylYbM=
|
||||
github.com/cenkalti/backoff/v4 v4.2.1/go.mod h1:Y3VNntkOUPxTVeUxJ/G5vcM//AlwfmyYozVcomhLiZE=
|
||||
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
|
||||
github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs=
|
||||
github.com/coder/websocket v1.8.14 h1:9L0p0iKiNOibykf283eHkKUHHrpG7f65OE3BhhO7v9g=
|
||||
github.com/coder/websocket v1.8.14/go.mod h1:NX3SzP+inril6yawo5CQXx8+fk145lPDC6pumgx0mVg=
|
||||
github.com/containerd/containerd v1.7.12 h1:+KQsnv4VnzyxWcfO9mlxxELaoztsDEjOuCMPAuPqgU0=
|
||||
github.com/containerd/containerd v1.7.12/go.mod h1:/5OMpE1p0ylxtEUGY8kuCYkDRzJm9NO1TFMWjUpdevk=
|
||||
github.com/containerd/log v0.1.0 h1:TCJt7ioM2cr/tfR8GPbGf9/VRAX8D2B4PjzCpfX540I=
|
||||
github.com/containerd/log v0.1.0/go.mod h1:VRRf09a7mHDIRezVKTRCrOq78v577GXq3bSa3EhrzVo=
|
||||
github.com/cpuguy83/dockercfg v0.3.1 h1:/FpZ+JaygUR/lZP2NlFI2DVfrOEMAIKP5wWEJdoYe9E=
|
||||
github.com/cpuguy83/dockercfg v0.3.1/go.mod h1:sugsbF4//dDlL/i+S+rtpIWp+5h0BHJHfjj5/jFyUJc=
|
||||
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
|
||||
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/docker/distribution v2.8.2+incompatible h1:T3de5rq0dB1j30rp0sA2rER+m322EBzniBPB6ZIzuh8=
|
||||
github.com/docker/distribution v2.8.2+incompatible/go.mod h1:J2gT2udsDAN96Uj4KfcMRqY0/ypR+oyYUYmja8H+y+w=
|
||||
github.com/docker/docker v24.0.9+incompatible h1:HPGzNmwfLZWdxHqK9/II92pyi1EpYKsAqcl4G0Of9v0=
|
||||
github.com/docker/docker v24.0.9+incompatible/go.mod h1:eEKB0N0r5NX/I1kEveEz05bcu8tLC/8azJZsviup8Sk=
|
||||
github.com/docker/go-connections v0.5.0 h1:USnMq7hx7gwdVZq1L49hLXaFtUdTADjXGp+uj1Br63c=
|
||||
github.com/docker/go-connections v0.5.0/go.mod h1:ov60Kzw0kKElRwhNs9UlUHAE/F9Fe6GLaXnqyDdmEXc=
|
||||
github.com/docker/go-units v0.5.0 h1:69rxXcBk27SvSaaxTtLh/8llcHD8vYHT7WSdRZ/jvr4=
|
||||
github.com/docker/go-units v0.5.0/go.mod h1:fgPhTUdO+D/Jk86RDLlptpiXQzgHJF7gydDDbaIK4Dk=
|
||||
github.com/expr-lang/expr v1.17.8 h1:W1loDTT+0PQf5YteHSTpju2qfUfNoBt4yw9+wOEU9VM=
|
||||
github.com/expr-lang/expr v1.17.8/go.mod h1:8/vRC7+7HBzESEqt5kKpYXxrxkr31SaO8r40VO/1IT4=
|
||||
github.com/frankban/quicktest v1.13.0 h1:yNZif1OkDfNoDfb9zZa9aXIpejNR4F23Wely0c+Qdqk=
|
||||
github.com/frankban/quicktest v1.13.0/go.mod h1:qLE0fzW0VuyUAJgPU19zByoIr0HtCHN/r/VLSOOIySU=
|
||||
github.com/fsnotify/fsnotify v1.9.0 h1:2Ml+OJNzbYCTzsxtv8vKSFD9PbJjmhYF14k/jKC7S9k=
|
||||
github.com/fsnotify/fsnotify v1.9.0/go.mod h1:8jBTzvmWwFyi3Pb8djgCCO5IBqzKJ/Jwo8TRcHyHii0=
|
||||
github.com/fsnotify/fsnotify v1.10.1 h1:b0/UzAf9yR5rhf3RPm9gf3ehBPpf0oZKIjtpKrx59Ho=
|
||||
github.com/fsnotify/fsnotify v1.10.1/go.mod h1:TLheqan6HD6GBK6PrDWyDPBaEV8LspOxvPSjC+bVfgo=
|
||||
github.com/go-ole/go-ole v1.3.0 h1:Dt6ye7+vXGIKZ7Xtk4s6/xVdGDQynvom7xCFEdWr6uE=
|
||||
github.com/go-ole/go-ole v1.3.0/go.mod h1:5LS6F96DhAwUc7C+1HLexzMXY1xGRSryjyPPKW6zv78=
|
||||
github.com/gogo/protobuf v1.3.2 h1:Ov1cvc58UF3b5XjBnZv7+opcTcQFZebYjWzi34vdm4Q=
|
||||
github.com/gogo/protobuf v1.3.2/go.mod h1:P1XiOD3dCwIKUDQYPy72D8LYyHL2YPYrpS2s69NZV8Q=
|
||||
github.com/golang/protobuf v1.5.3 h1:KhyjKVUg7Usr/dYsdSqoFveMYd5ko72D+zANwlG1mmg=
|
||||
github.com/golang/protobuf v1.5.3/go.mod h1:XVQd3VNwM+JqD3oG2Ue2ip4fOMUkwXdXDdiuN0vRsmY=
|
||||
github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8=
|
||||
github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU=
|
||||
github.com/google/go-tpm v0.9.7 h1:u89J4tUUeDTlH8xxC3CTW7OHZjbjKoHdQ9W7gCUhtxA=
|
||||
github.com/google/go-tpm v0.9.7/go.mod h1:h9jEsEECg7gtLis0upRBQU+GhYVH6jMjrFxI8u6bVUY=
|
||||
github.com/google/go-tpm v0.9.8 h1:slArAR9Ft+1ybZu0lBwpSmpwhRXaa85hWtMinMyRAWo=
|
||||
github.com/google/go-tpm v0.9.8/go.mod h1:h9jEsEECg7gtLis0upRBQU+GhYVH6jMjrFxI8u6bVUY=
|
||||
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
|
||||
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
|
||||
github.com/gorilla/mux v1.8.1 h1:TuBL49tXwgrFYWhqrNgrUNEY92u81SPhu7sTdzQEiWY=
|
||||
@@ -45,32 +76,54 @@ github.com/influxdata/line-protocol v0.0.0-20210922203350-b1ad95c89adf/go.mod h1
|
||||
github.com/influxdata/line-protocol-corpus v0.0.0-20210922080147-aa28ccfb8937 h1:MHJNQ+p99hFATQm6ORoLmpUCF7ovjwEFshs/NHzAbig=
|
||||
github.com/influxdata/line-protocol-corpus v0.0.0-20210922080147-aa28ccfb8937/go.mod h1:BKR9c0uHSmRgM/se9JhFHtTT7JTO67X23MtKMHtZcpo=
|
||||
github.com/juju/gnuflag v0.0.0-20171113085948-2ce1bb71843d/go.mod h1:2PavIy+JPciBPrBUjwbNvtwB6RQlve+hkpll6QSNmOE=
|
||||
github.com/klauspost/compress v1.18.4 h1:RPhnKRAQ4Fh8zU2FY/6ZFDwTVTxgJ/EMydqSTzE9a2c=
|
||||
github.com/klauspost/compress v1.18.4/go.mod h1:R0h/fSBs8DE4ENlcrlib3PsXS61voFxhIs2DeRhCvJ4=
|
||||
github.com/klauspost/compress v1.18.5 h1:/h1gH5Ce+VWNLSWqPzOVn6XBO+vJbCNGvjoaGBFW2IE=
|
||||
github.com/klauspost/compress v1.18.5/go.mod h1:cwPg85FWrGar70rWktvGQj8/hthj3wpl0PGDogxkrSQ=
|
||||
github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
|
||||
github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk=
|
||||
github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
|
||||
github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE=
|
||||
github.com/kylelemons/godebug v1.1.0 h1:RPNrshWIDI6G2gRW9EHilWtl7Z6Sb1BR0xunSBf0SNc=
|
||||
github.com/kylelemons/godebug v1.1.0/go.mod h1:9/0rRGxNHcop5bhtWyNeEfOS8JIWk580+fNqagV/RAw=
|
||||
github.com/lufia/plan9stats v0.0.0-20230326075908-cb1d2100619a h1:N9zuLhTvBSRt0gWSiJswwQ2HqDmtX/ZCDJURnKUt1Ik=
|
||||
github.com/lufia/plan9stats v0.0.0-20230326075908-cb1d2100619a/go.mod h1:JKx41uQRwqlTZabZc+kILPrO/3jlKnQ2Z8b7YiVw5cE=
|
||||
github.com/magiconair/properties v1.8.7 h1:IeQXZAiQcpL9mgcAe1Nu6cX9LLw6ExEHKjN0VQdvPDY=
|
||||
github.com/magiconair/properties v1.8.7/go.mod h1:Dhd985XPs7jluiymwWYZ0G4Z61jb3vdS329zhj2hYo0=
|
||||
github.com/minio/highwayhash v1.0.4-0.20251030100505-070ab1a87a76 h1:KGuD/pM2JpL9FAYvBrnBBeENKZNh6eNtjqytV6TYjnk=
|
||||
github.com/minio/highwayhash v1.0.4-0.20251030100505-070ab1a87a76/go.mod h1:GGYsuwP/fPD6Y9hMiXuapVvlIUEhFhMTh0rxU3ik1LQ=
|
||||
github.com/moby/patternmatcher v0.6.0 h1:GmP9lR19aU5GqSSFko+5pRqHi+Ohk1O69aFiKkVGiPk=
|
||||
github.com/moby/patternmatcher v0.6.0/go.mod h1:hDPoyOpDY7OrrMDLaYoY3hf52gNCR/YOUYxkhApJIxc=
|
||||
github.com/moby/sys/sequential v0.5.0 h1:OPvI35Lzn9K04PBbCLW0g4LcFAJgHsvXsRyewg5lXtc=
|
||||
github.com/moby/sys/sequential v0.5.0/go.mod h1:tH2cOOs5V9MlPiXcQzRC+eEyab644PWKGRYaaV5ZZlo=
|
||||
github.com/moby/term v0.5.0 h1:xt8Q1nalod/v7BqbG21f8mQPqH+xAaC9C3N3wfWbVP0=
|
||||
github.com/moby/term v0.5.0/go.mod h1:8FzsFHVUBGZdbDsJw/ot+X+d5HLUbvklYLJ9uGfcI3Y=
|
||||
github.com/morikuni/aec v1.0.0 h1:nP9CBfwrvYnBRgY6qfDQkygYDmYwOilePFkwzv4dU8A=
|
||||
github.com/morikuni/aec v1.0.0/go.mod h1:BbKIizmSmc5MMPqRYbxO4ZU0S0+P200+tUnFx7PXmsc=
|
||||
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 h1:C3w9PqII01/Oq1c1nUAm88MOHcQC9l5mIlSMApZMrHA=
|
||||
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822/go.mod h1:+n7T8mK8HuQTcFwEeznm/DIxMOiR9yIdICNftLE1DvQ=
|
||||
github.com/nats-io/jwt/v2 v2.8.0 h1:K7uzyz50+yGZDO5o772eRE7atlcSEENpL7P+b74JV1g=
|
||||
github.com/nats-io/jwt/v2 v2.8.0/go.mod h1:me11pOkwObtcBNR8AiMrUbtVOUGkqYjMQZ6jnSdVUIA=
|
||||
github.com/nats-io/nats-server/v2 v2.12.3 h1:KRv+1n7lddMVgkJPQer+pt36TcO0ENxjilBmeWdjcHs=
|
||||
github.com/nats-io/nats-server/v2 v2.12.3/go.mod h1:MQXjG9WjyXKz9koWzUc3jYUMKD8x3CLmTNy91IQQz3Y=
|
||||
github.com/nats-io/nats.go v1.49.0 h1:yh/WvY59gXqYpgl33ZI+XoVPKyut/IcEaqtsiuTJpoE=
|
||||
github.com/nats-io/nats.go v1.49.0/go.mod h1:fDCn3mN5cY8HooHwE2ukiLb4p4G4ImmzvXyJt+tGwdw=
|
||||
github.com/nats-io/jwt/v2 v2.8.1 h1:V0xpGuD/N8Mi+fQNDynXohVvp7ZztevW5io8CUWlPmU=
|
||||
github.com/nats-io/jwt/v2 v2.8.1/go.mod h1:nWnOEEiVMiKHQpnAy4eXlizVEtSfzacZ1Q43LIRavZg=
|
||||
github.com/nats-io/nats-server/v2 v2.12.7 h1:prQ9cPiWHcnwfT81Wi5lU9LL8TLY+7pxDru6fQYLCQQ=
|
||||
github.com/nats-io/nats-server/v2 v2.12.7/go.mod h1:dOnmkprKMluTmTF7/QHZioxlau3sKHUM/LBPy9AiBPw=
|
||||
github.com/nats-io/nats.go v1.51.0 h1:ByW84XTz6W03GSSsygsZcA+xgKK8vPGaa/FCAAEHnAI=
|
||||
github.com/nats-io/nats.go v1.51.0/go.mod h1:26HypzazeOkyO3/mqd1zZd53STJN0EjCYF9Uy2ZOBno=
|
||||
github.com/nats-io/nkeys v0.4.15 h1:JACV5jRVO9V856KOapQ7x+EY8Jo3qw1vJt/9Jpwzkk4=
|
||||
github.com/nats-io/nkeys v0.4.15/go.mod h1:CpMchTXC9fxA5zrMo4KpySxNjiDVvr8ANOSZdiNfUrs=
|
||||
github.com/nats-io/nuid v1.0.1 h1:5iA8DT8V7q8WK2EScv2padNa/rTESc1KdnPw4TC2paw=
|
||||
github.com/nats-io/nuid v1.0.1/go.mod h1:19wcPz3Ph3q0Jbyiqsd0kePYG7A95tJPxeL+1OSON2c=
|
||||
github.com/oapi-codegen/runtime v1.3.0 h1:vyK1zc0gDWWXgk2xoQa4+X4RNNc5SL2RbTpJS/4vMYA=
|
||||
github.com/oapi-codegen/runtime v1.3.0/go.mod h1:kOdeacKy7t40Rclb1je37ZLFboFxh+YLy0zaPCMibPY=
|
||||
github.com/oapi-codegen/runtime v1.4.0 h1:KLOSFOp7UzkbS7Cs1ms6NBEKYr0WmH2wZG0KKbd2er4=
|
||||
github.com/oapi-codegen/runtime v1.4.0/go.mod h1:5sw5fxCDmnOzKNYmkVNF8d34kyUeejJEY8HNT2WaPec=
|
||||
github.com/opencontainers/go-digest v1.0.0 h1:apOUWs51W5PlhuyGyz9FCeeBIOUDA/6nW8Oi/yOhh5U=
|
||||
github.com/opencontainers/go-digest v1.0.0/go.mod h1:0JzlMkj0TRzQZfJkVvzbP0HBR3IKzErnv2BNG4W4MAM=
|
||||
github.com/opencontainers/image-spec v1.1.0-rc5 h1:Ygwkfw9bpDvs+c9E34SdgGOj41dX/cbdlwvlWt0pnFI=
|
||||
github.com/opencontainers/image-spec v1.1.0-rc5/go.mod h1:X4pATf0uXsnn3g5aiGIsVnJBR4mxhKzfwmvK/B2NTm8=
|
||||
github.com/opencontainers/runc v1.1.5 h1:L44KXEpKmfWDcS02aeGm8QNTFXTo2D+8MYGDIJ/GDEs=
|
||||
github.com/opencontainers/runc v1.1.5/go.mod h1:1J5XiS+vdZ3wCyZybsuxXZWGrgSr8fFJHLXuG2PsnNg=
|
||||
github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
|
||||
github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
|
||||
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
|
||||
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
|
||||
github.com/power-devops/perfstat v0.0.0-20221212215047-62379fc7944b h1:0LFwY6Q3gMACTjAbMZBjXAqTOzOwFaj2Ld6cjeQ7Rig=
|
||||
github.com/power-devops/perfstat v0.0.0-20221212215047-62379fc7944b/go.mod h1:OmDBASR4679mdNQnz2pUhc2G8CO2JrUAVFDRBDP/hJE=
|
||||
github.com/prometheus/client_golang v1.23.2 h1:Je96obch5RDVy3FDMndoUsjAhG5Edi49h0RJWRi/o0o=
|
||||
github.com/prometheus/client_golang v1.23.2/go.mod h1:Tb1a6LWHB3/SPIzCoaDXI4I8UHKeFTEQ1YCr+0Gyqmg=
|
||||
github.com/prometheus/client_model v0.6.2 h1:oBsgwpGs7iVziMvrGhE53c/GrLUsZdHnqNwqPLxwZyk=
|
||||
@@ -79,40 +132,70 @@ github.com/prometheus/common v0.67.5 h1:pIgK94WWlQt1WLwAC5j2ynLaBRDiinoAb86HZHTU
|
||||
github.com/prometheus/common v0.67.5/go.mod h1:SjE/0MzDEEAyrdr5Gqc6G+sXI67maCxzaT3A2+HqjUw=
|
||||
github.com/prometheus/procfs v0.20.1 h1:XwbrGOIplXW/AU3YhIhLODXMJYyC1isLFfYCsTEycfc=
|
||||
github.com/prometheus/procfs v0.20.1/go.mod h1:o9EMBZGRyvDrSPH1RqdxhojkuXstoe4UlK79eF5TGGo=
|
||||
github.com/questdb/go-questdb-client/v4 v4.2.0 h1:+d0HJwCjUWMj7zmY6qmhoqTJzTyoYKl+LSTYGN0T8T8=
|
||||
github.com/questdb/go-questdb-client/v4 v4.2.0/go.mod h1:/2x93LK1wjM4JX/b5c6q7Yqk22htjWY1lE6p1X8iLbE=
|
||||
github.com/rogpeppe/go-internal v1.10.0 h1:TMyTOH3F/DB16zRVcYyreMH6GnZZrwQVAoYjRBZyWFQ=
|
||||
github.com/rogpeppe/go-internal v1.10.0/go.mod h1:UQnix2H7Ngw/k4C5ijL5+65zddjncjaFoBhdsK/akog=
|
||||
github.com/santhosh-tekuri/jsonschema/v5 v5.3.1 h1:lZUw3E0/J3roVtGQ+SCrUrg3ON6NgVqpn3+iol9aGu4=
|
||||
github.com/santhosh-tekuri/jsonschema/v5 v5.3.1/go.mod h1:uToXkOrWAZ6/Oc07xWQrPOhJotwFIyu2bBVN41fcDUY=
|
||||
github.com/shirou/gopsutil/v3 v3.23.12 h1:z90NtUkp3bMtmICZKpC4+WaknU1eXtp5vtbQ11DgpE4=
|
||||
github.com/shirou/gopsutil/v3 v3.23.12/go.mod h1:1FrWgea594Jp7qmjHUUPlJDTPgcsb9mGnXDxavtikzM=
|
||||
github.com/shoenig/go-m1cpu v0.1.6 h1:nxdKQNcEB6vzgA2E2bvzKIYRuNj7XNJ4S/aRSwKzFtM=
|
||||
github.com/shoenig/go-m1cpu v0.1.6/go.mod h1:1JJMcUBvfNwpq05QDQVAnx3gUHr9IYF7GNg9SUEw2VQ=
|
||||
github.com/shopspring/decimal v1.3.1/go.mod h1:DKyhrW/HYNuLGql+MJL6WCR6knT2jwCFRcu2hWCYk4o=
|
||||
github.com/shopspring/decimal v1.4.0 h1:bxl37RwXBklmTi0C79JfXCEBD1cqqHt0bbgBAGFp81k=
|
||||
github.com/shopspring/decimal v1.4.0/go.mod h1:gawqmDU56v4yIKSwfBSFip1HdCCXN8/+DMd9qYNcwME=
|
||||
github.com/sirupsen/logrus v1.9.3 h1:dueUQJ1C2q9oE3F7wvmSGAaVtTmUizReu6fjN8uqzbQ=
|
||||
github.com/sirupsen/logrus v1.9.3/go.mod h1:naHLuLoDiP4jHNo9R0sCBMtWGeIprob74mVsIT4qYEQ=
|
||||
github.com/spkg/bom v0.0.0-20160624110644-59b7046e48ad/go.mod h1:qLr4V1qq6nMqFKkMo8ZTx3f+BZEkzsRUY10Xsm2mwU0=
|
||||
github.com/stmcginnis/gofish v0.21.4 h1:daexK8sh31CgeSMkPUNs21HWHHA9ecCPJPyLCTxukCg=
|
||||
github.com/stmcginnis/gofish v0.21.4/go.mod h1:PzF5i8ecRG9A2ol8XT64npKUunyraJ+7t0kYMpQAtqU=
|
||||
github.com/stmcginnis/gofish v0.21.6 h1:jK3TGD6VANaAHKHypVNfD6io2nPrU+6eF8X4qARsTlY=
|
||||
github.com/stmcginnis/gofish v0.21.6/go.mod h1:PzF5i8ecRG9A2ol8XT64npKUunyraJ+7t0kYMpQAtqU=
|
||||
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
|
||||
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
|
||||
github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
|
||||
github.com/stretchr/objx v0.5.2/go.mod h1:FRsXN1f5AsAjCGJKqEizvkpNtU+EGNCLh3NxZ/8L+MA=
|
||||
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
|
||||
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
|
||||
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
|
||||
github.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=
|
||||
github.com/stretchr/testify v1.10.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=
|
||||
github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
|
||||
github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
|
||||
github.com/tklauser/go-sysconf v0.3.16 h1:frioLaCQSsF5Cy1jgRBrzr6t502KIIwQ0MArYICU0nA=
|
||||
github.com/tklauser/go-sysconf v0.3.16/go.mod h1:/qNL9xxDhc7tx3HSRsLWNnuzbVfh3e7gh/BmM179nYI=
|
||||
github.com/tklauser/numcpus v0.11.0 h1:nSTwhKH5e1dMNsCdVBukSZrURJRoHbSEQjdEbY+9RXw=
|
||||
github.com/tklauser/numcpus v0.11.0/go.mod h1:z+LwcLq54uWZTX0u/bGobaV34u6V7KNlTZejzM6/3MQ=
|
||||
github.com/testcontainers/testcontainers-go v0.26.0 h1:uqcYdoOHBy1ca7gKODfBd9uTHVK3a7UL848z09MVZ0c=
|
||||
github.com/testcontainers/testcontainers-go v0.26.0/go.mod h1:ICriE9bLX5CLxL9OFQ2N+2N+f+803LNJ1utJb1+Inx0=
|
||||
github.com/tklauser/go-sysconf v0.4.0 h1:7H0uAN+7RkwWRaxhYXDLqa5V3LPrJeV8wmD9dRUgPQU=
|
||||
github.com/tklauser/go-sysconf v0.4.0/go.mod h1:8mTNWyog7H+MpKijp4VmKJAd2bbYQ2zuUwkYRbUArPI=
|
||||
github.com/tklauser/numcpus v0.12.0 h1:NR85qdvHA9pFse3x3weVZ0r0ST8R6l5RHbZrlRaqob4=
|
||||
github.com/tklauser/numcpus v0.12.0/go.mod h1:ABHeXzJnr/qqwguhClkZKT1/8VABcYrsyUiUGobwWJg=
|
||||
github.com/yusufpapurcu/wmi v1.2.3 h1:E1ctvB7uKFMOJw3fdOW32DwGE9I7t++CRUEMKvFoFiw=
|
||||
github.com/yusufpapurcu/wmi v1.2.3/go.mod h1:SBZ9tNy3G9/m5Oi98Zks0QjeHVDvuK0qfxQmPyzfmi0=
|
||||
go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto=
|
||||
go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE=
|
||||
go.yaml.in/yaml/v2 v2.4.4 h1:tuyd0P+2Ont/d6e2rl3be67goVK4R6deVxCUX5vyPaQ=
|
||||
go.yaml.in/yaml/v2 v2.4.4/go.mod h1:gMZqIpDtDqOfM0uNfy0SkpRhvUryYH0Z6wdMYcacYXQ=
|
||||
golang.design/x/thread v0.0.0-20210122121316-335e9adffdf1 h1:P7S/GeHBAFEZIYp0ePPs2kHXoazz8q2KsyxHyQVGCJg=
|
||||
golang.design/x/thread v0.0.0-20210122121316-335e9adffdf1/go.mod h1:9CWpnTUmlQkfdpdutA1nNf4iE5lAVt3QZOu0Z6hahBE=
|
||||
golang.org/x/crypto v0.49.0 h1:+Ng2ULVvLHnJ/ZFEq4KdcDd/cfjrrjjNSXNzxg0Y4U4=
|
||||
golang.org/x/crypto v0.49.0/go.mod h1:ErX4dUh2UM+CFYiXZRTcMpEcN8b/1gxEuv3nODoYtCA=
|
||||
golang.org/x/net v0.52.0 h1:He/TN1l0e4mmR3QqHMT2Xab3Aj3L9qjbhRm78/6jrW0=
|
||||
golang.org/x/net v0.52.0/go.mod h1:R1MAz7uMZxVMualyPXb+VaqGSa3LIaUqk0eEt3w36Sw=
|
||||
golang.org/x/sys v0.0.0-20210122093101-04d7465088b8/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
|
||||
golang.org/x/sys v0.42.0 h1:omrd2nAlyT5ESRdCLYdm3+fMfNFE/+Rf4bDIQImRJeo=
|
||||
golang.org/x/sys v0.42.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
|
||||
golang.org/x/time v0.14.0 h1:MRx4UaLrDotUKUdCIqzPC48t1Y9hANFKIRpNx+Te8PI=
|
||||
golang.org/x/time v0.14.0/go.mod h1:eL/Oa2bBBK0TkX57Fyni+NgnyQQN4LitPmob2Hjnqw4=
|
||||
golang.design/x/thread v0.3.2 h1:FmD1glspGrQCe6FuQLmSrT6wz2CSzq7vKVDluyiMnqo=
|
||||
golang.design/x/thread v0.3.2/go.mod h1:6+Hi2rMOgMHZdKDWaqNHyWtoFUx1HxZ06LfHPh5Z/hQ=
|
||||
golang.org/x/crypto v0.50.0 h1:zO47/JPrL6vsNkINmLoo/PH1gcxpls50DNogFvB5ZGI=
|
||||
golang.org/x/crypto v0.50.0/go.mod h1:3muZ7vA7PBCE6xgPX7nkzzjiUq87kRItoJQM1Yo8S+Q=
|
||||
golang.org/x/exp v0.0.0-20231005195138-3e424a577f31 h1:9k5exFQKQglLo+RoP+4zMjOFE14P6+vyR0baDAi0Rcs=
|
||||
golang.org/x/exp v0.0.0-20231005195138-3e424a577f31/go.mod h1:S2oDrQGGwySpoQPVqRShND87VCbxmc6bL1Yd2oYrm6k=
|
||||
golang.org/x/mod v0.13.0 h1:I/DsJXRlw/8l/0c24sM9yb0T4z9liZTduXvdAWYiysY=
|
||||
golang.org/x/mod v0.13.0/go.mod h1:hTbmBsO62+eylJbnUtE2MGJUyE7QWk4xUqPFrRgJ+7c=
|
||||
golang.org/x/net v0.53.0 h1:d+qAbo5L0orcWAr0a9JweQpjXF19LMXJE8Ey7hwOdUA=
|
||||
golang.org/x/net v0.53.0/go.mod h1:JvMuJH7rrdiCfbeHoo3fCQU24Lf5JJwT9W3sJFulfgs=
|
||||
golang.org/x/sys v0.45.0 h1:dO4czNzziLiiXplLQgBCEpCvXQ3dnkn0SdaZSYdQ+FY=
|
||||
golang.org/x/sys v0.45.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
|
||||
golang.org/x/time v0.15.0 h1:bbrp8t3bGUeFOx08pvsMYRTCVSMk89u4tKbNOZbp88U=
|
||||
golang.org/x/time v0.15.0/go.mod h1:Y4YMaQmXwGQZoFaVFk4YpCt4FLQMYKZe9oeV/f4MSno=
|
||||
golang.org/x/tools v0.14.0 h1:jvNa2pY0M4r62jkRQ6RwEZZyPcymeL9XZMLBbV7U2nc=
|
||||
golang.org/x/tools v0.14.0/go.mod h1:uYBEerGOWcJyEORxN+Ek8+TT266gXkNlHdJBwexUsBg=
|
||||
google.golang.org/genproto/googleapis/rpc v0.0.0-20231002182017-d307bd883b97 h1:6GQBEOdGkX6MMTLT9V+TjtIRZCw9VPD5Z+yHY9wMgS0=
|
||||
google.golang.org/genproto/googleapis/rpc v0.0.0-20231002182017-d307bd883b97/go.mod h1:v7nGkzlmW8P3n/bKmWBn2WpBjpOEx8Q6gMueudAmKfY=
|
||||
google.golang.org/grpc v1.58.3 h1:BjnpXut1btbtgN/6sp+brB2Kbm2LjNXnidYujAVbSoQ=
|
||||
google.golang.org/grpc v1.58.3/go.mod h1:tgX3ZQDlNJGU96V6yHh1T/JeoBQ2TXdr43YbYSsCJk0=
|
||||
google.golang.org/protobuf v1.36.11 h1:fV6ZwhNocDyBLK0dj+fg8ektcVegBBuEolpbTQyBNVE=
|
||||
google.golang.org/protobuf v1.36.11/go.mod h1:HTf+CrKn2C3g5S8VImy6tdcUvCska2kB7j23XfzDpco=
|
||||
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
|
||||
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
|
||||
@@ -35,18 +35,18 @@ type metricRouterTagConfig struct {
|
||||
|
||||
// Metric router configuration
|
||||
type metricRouterConfig struct {
|
||||
HostnameTagName string `json:"hostname_tag"` // Key name used when adding the hostname to a metric (default 'hostname')
|
||||
AddTags []metricRouterTagConfig `json:"add_tags"` // List of tags that are added when the condition is met
|
||||
DelTags []metricRouterTagConfig `json:"delete_tags"` // List of tags that are removed when the condition is met
|
||||
IntervalAgg []agg.MetricAggregatorIntervalConfig `json:"interval_aggregates"` // List of aggregation function processed at the end of an interval
|
||||
DropMetrics []string `json:"drop_metrics"` // List of metric names to drop. For fine-grained dropping use drop_metrics_if
|
||||
DropMetricsIf []string `json:"drop_metrics_if"` // List of evaluatable terms to drop metrics
|
||||
RenameMetrics map[string]string `json:"rename_metrics"` // Map to rename metric name from key to value
|
||||
IntervalStamp bool `json:"interval_timestamp"` // Update timestamp periodically by ticker each interval?
|
||||
NumCacheIntervals int `json:"num_cache_intervals"` // Number of intervals of cached metrics for evaluation
|
||||
MaxForward int `json:"max_forward"` // Number of maximal forwarded metrics at one select
|
||||
NormalizeUnits bool `json:"normalize_units"` // Check unit meta flag and normalize it using cc-units
|
||||
ChangeUnitPrefix map[string]string `json:"change_unit_prefix"` // Add prefix that should be applied to the metrics
|
||||
HostnameTagName string `json:"hostname_tag,omitempty"` // Key name used when adding the hostname to a metric (default 'hostname')
|
||||
AddTags []metricRouterTagConfig `json:"add_tags,omitempty"` // List of tags that are added when the condition is met
|
||||
DelTags []metricRouterTagConfig `json:"delete_tags,omitempty"` // List of tags that are removed when the condition is met
|
||||
IntervalAgg []agg.MetricAggregatorIntervalConfig `json:"interval_aggregates,omitempty"` // List of aggregation function processed at the end of an interval
|
||||
DropMetrics []string `json:"drop_metrics,omitempty"` // List of metric names to drop. For fine-grained dropping use drop_metrics_if
|
||||
DropMetricsIf []string `json:"drop_metrics_if,omitempty"` // List of evaluatable terms to drop metrics
|
||||
RenameMetrics map[string]string `json:"rename_metrics,omitempty"` // Map to rename metric name from key to value
|
||||
IntervalStamp bool `json:"interval_timestamp,omitempty"` // Update timestamp periodically by ticker each interval?
|
||||
NumCacheIntervals int `json:"num_cache_intervals,omitempty"` // Number of intervals of cached metrics for evaluation
|
||||
MaxForward int `json:"max_forward,omitempty"` // Number of maximal forwarded metrics at one select
|
||||
NormalizeUnits bool `json:"normalize_units,omitempty"` // Check unit meta flag and normalize it using cc-units
|
||||
ChangeUnitPrefix map[string]string `json:"change_unit_prefix,omitempty"` // Add prefix that should be applied to the metrics
|
||||
MessageProcessor json.RawMessage `json:"process_messages,omitempty"`
|
||||
}
|
||||
|
||||
@@ -297,7 +297,7 @@ func (r *metricRouter) Start() {
|
||||
|
||||
case timestamp := <-timeChan:
|
||||
r.timestamp = timestamp
|
||||
cclog.ComponentDebug("MetricRouter", "Update timestamp", r.timestamp.UnixNano())
|
||||
cclog.ComponentDebugf("MetricRouter", "Update timestamp %d", r.timestamp.UnixNano())
|
||||
|
||||
case p := <-r.coll_input:
|
||||
coll_forward(p)
|
||||
|
||||
@@ -6,7 +6,7 @@ Installed-Size: {INSTALLED_SIZE}
|
||||
Architecture: {ARCH}
|
||||
Maintainer: thomas.gruber@fau.de
|
||||
Depends: libc6 (>= 2.2.1)
|
||||
Build-Depends: debhelper-compat (= 13), git, golang-go
|
||||
Build-Depends: debhelper-compat (= 13), git, golang-go, libdrm-dev
|
||||
Description: Metric collection daemon from the ClusterCockpit suite
|
||||
Homepage: https://github.com/ClusterCockpit/cc-metric-collector
|
||||
Source: cc-metric-collector
|
||||
|
||||
@@ -29,7 +29,7 @@ make
|
||||
|
||||
|
||||
%install
|
||||
install -Dpm 0750 %{name} %{buildroot}%{_bindir}/%{name}
|
||||
install -Dpm 0755 %{name} %{buildroot}%{_bindir}/%{name}
|
||||
install -Dpm 0600 example-configs/config.json %{buildroot}%{_sysconfdir}/%{name}/%{name}.json
|
||||
install -Dpm 0600 example-configs/collectors.json %{buildroot}%{_sysconfdir}/%{name}/collectors.json
|
||||
install -Dpm 0600 example-configs/sinks.json %{buildroot}%{_sysconfdir}/%{name}/sinks.json
|
||||
@@ -54,7 +54,7 @@ install -Dpm 0644 scripts/%{name}.sysusers %{buildroot}%{_sysusersdir}/%{name}.c
|
||||
|
||||
%files
|
||||
# Binary
|
||||
%attr(-,clustercockpit,clustercockpit) %{_bindir}/%{name}
|
||||
%attr(-,root,root) %{_bindir}/%{name}
|
||||
# Config
|
||||
%dir %{_sysconfdir}/%{name}
|
||||
%attr(0600,clustercockpit,clustercockpit) %config(noreplace) %{_sysconfdir}/%{name}/%{name}.json
|
||||
|
||||
Reference in New Issue
Block a user