Compare commits

...

15 Commits

Author SHA1 Message Date
dependabot[bot]
528fe3bb48 Bump golang.org/x/sys from 0.42.0 to 0.43.0
Bumps [golang.org/x/sys](https://github.com/golang/sys) from 0.42.0 to 0.43.0.
- [Commits](https://github.com/golang/sys/compare/v0.42.0...v0.43.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sys
  dependency-version: 0.43.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-13 03:24:33 +00:00
Michael Panzlaff
86da3c15f7 rpm: The main binary should be owner by root
The system user should not be allowed to modify the ccmc binary.
2026-04-08 16:46:19 +02:00
Michael Panzlaff
93cd397b79 Revert "rpm: chown on /usr/bin/cc-metric-collector is unnecessary"
This reverts commit 65b9c0ea14.
2026-04-08 16:45:57 +02:00
Michael Panzlaff
65b9c0ea14 rpm: chown on /usr/bin/cc-metric-collector is unnecessary
The file belongs to root otherwise. The monitoring user can already
execute it. The monitoring user should not be allowed to change the
file, which is slightly more restricting. However it is in line with
what 99.9% of packages will do.
2026-04-08 15:56:11 +02:00
dependabot[bot]
0ecf06cee7 Bump github.com/ClusterCockpit/go-rocm-smi from 0.3.0 to 0.4.0
Bumps [github.com/ClusterCockpit/go-rocm-smi](https://github.com/ClusterCockpit/go-rocm-smi) from 0.3.0 to 0.4.0.
- [Commits](https://github.com/ClusterCockpit/go-rocm-smi/compare/v0.3...v0.4.0)

---
updated-dependencies:
- dependency-name: github.com/ClusterCockpit/go-rocm-smi
  dependency-version: 0.4.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-30 13:08:36 +02:00
Thomas Roehl
9eaf77db4f Update README.md 2026-03-24 19:50:51 +01:00
Thomas Roehl
7cb5d1b47a Add/update sudo configuration to all collectors with 'use_sudo' 2026-03-24 19:50:41 +01:00
Thomas Roehl
319e71a853 IpmiCollector: Remove unused configuration 'exclude_devices' 2026-03-24 19:48:34 +01:00
Michael Panzlaff
1251f9ef6b Merge pull request #207 from ClusterCockpit/ipmi-sudo
Add IPMI sudo support
2026-03-24 15:32:37 +01:00
Michael Panzlaff
f816f4991b ipmi: refactor and add sudo support 2026-03-24 15:06:47 +01:00
Michael Panzlaff
e40816eb17 ipmi: refactor and add sudo support 2026-03-24 14:24:35 +01:00
Michael Panzlaff
b947f98459 update cc-lib to v2.11.0 2026-03-24 14:24:25 +01:00
dependabot[bot]
c328fbf05a Bump github.com/ClusterCockpit/go-rocm-smi from 0.3.0 to 0.4.0
Bumps [github.com/ClusterCockpit/go-rocm-smi](https://github.com/ClusterCockpit/go-rocm-smi) from 0.3.0 to 0.4.0.
- [Commits](https://github.com/ClusterCockpit/go-rocm-smi/compare/v0.3...v0.4.0)

---
updated-dependencies:
- dependency-name: github.com/ClusterCockpit/go-rocm-smi
  dependency-version: 0.4.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-23 13:00:23 +01:00
dependabot[bot]
37ec7c19e6 Bump github.com/ClusterCockpit/cc-lib/v2 from 2.8.2 to 2.10.0
Bumps [github.com/ClusterCockpit/cc-lib/v2](https://github.com/ClusterCockpit/cc-lib) from 2.8.2 to 2.10.0.
- [Release notes](https://github.com/ClusterCockpit/cc-lib/releases)
- [Commits](https://github.com/ClusterCockpit/cc-lib/compare/v2.8.2...v2.10.0)

---
updated-dependencies:
- dependency-name: github.com/ClusterCockpit/cc-lib/v2
  dependency-version: 2.10.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-23 12:48:02 +01:00
Thomas Roehl
13fc8a53d3 Memstat: Fix mem_shared and add more metrics 2026-03-17 18:07:30 +01:00
11 changed files with 288 additions and 143 deletions

View File

@@ -11,13 +11,9 @@ hugo_path: docs/reference/cc-metric-collector/_index.md
# cc-metric-collector
A node agent for measuring, processing and forwarding node level metrics. It is part of the [ClusterCockpit ecosystem](https://clustercockpit.org/docs/overview/).
A node agent for measuring, processing and forwarding node level metrics. It is part of the [ClusterCockpit ecosystem](https://clustercockpit.org/docs/overview/).
The metric collector sends (and receives) metric in the [InfluxDB line protocol](https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/) as it provides flexibility while providing a separation between tags (like index columns in relational databases) and fields (like data columns).
There is a single timer loop that triggers all collectors serially, collects the collectors' data and sends the metrics to the sink. This is done as all data is submitted with a single time stamp. The sinks currently use mostly blocking APIs.
The receiver runs as a go routine side-by-side with the timer loop and asynchronously forwards received metrics to the sink.
The `cc-metric-collector` sends (and maybe receives) metrics in the [InfluxDB line protocol](https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/) as it provides flexibility while providing a separation between tags (like index columns in relational databases) and fields (like data columns). The `cc-metric-collector` consists of 4 components: collectors, router, sinks and receivers. The collectors read data from the current system and submit metrics to the router. The router can be configured to manipulate the metrics before forwarding them to the sinks. The receivers are also attached to the router like the collectors but they receive data from external source like other `cc-metric-collector` instances.
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7438287.svg)](https://doi.org/10.5281/zenodo.7438287)
@@ -43,7 +39,7 @@ There is a main configuration file with basic settings that point to the other c
}
```
The `interval` defines how often the metrics should be read and send to the sink. The `duration` tells collectors how long one measurement has to take. This is important for some collectors, like the `likwid` collector. For more information, see [here](./docs/configuration.md).
The `interval` defines how often the metrics should be read and send to the sink(s). The `duration` tells the collectors how long one measurement has to take. This is important for some collectors, like the `likwid` collector. For more information, see [here](./docs/configuration.md).
See the component READMEs for their configuration:
@@ -57,7 +53,7 @@ See the component READMEs for their configuration:
```
$ git clone git@github.com:ClusterCockpit/cc-metric-collector.git
$ make (downloads LIKWID, builds it as static library with 'direct' accessmode and copies all required files for the collector)
$ go get (requires at least golang 1.16)
$ go get
$ make
```
@@ -67,11 +63,13 @@ For more information, see [here](./docs/building.md).
```
$ ./cc-metric-collector --help
Usage of metric-collector:
Usage of ./cc-metric-collector:
-config string
Path to configuration file (default "./config.json")
-log string
Path for logfile (default "stderr")
-loglevel string
Set log level (default "info")
-once
Run all collectors only once
```
@@ -114,7 +112,7 @@ flowchart TD
# Contributing
The ClusterCockpit ecosystem is designed to be used by different HPC computing centers. Since configurations and setups differ between the centers, the centers likely have to put some work into the cc-metric-collector to gather all desired metrics.
The ClusterCockpit ecosystem is designed to be used by different HPC computing centers. Since configurations and setups differ between the centers, the centers likely have to put some work into `cc-metric-collector` to gather all desired metrics.
You are free to open an issue to request a collector but we would also be happy about PRs.

View File

@@ -81,3 +81,16 @@ Metrics:
* `gpfs_metaops_rate` (if `send_total_values == true` and `send_derived_values == true`)
The collector adds a `filesystem` tag to all metrics
`mmpmon` typically require root to run.
In order to run `cc-metric-collector` without root priviliges, you can enable `use_sudo`.
Add a file like this in `/etc/sudoers.d/` to allow `cc-metric-collector` to run the required command:
```
# Do not log the following sudo commands from monitoring, since this causes a lot of log spam.
# However keep log_denied enabled, to detect failures
Defaults: monitoring !log_allowed, !pam_session
# Allow to use mmpmon
monitoring ALL = (root) NOPASSWD:/absolute/path/to/mmpmon -p -s
```

View File

@@ -28,10 +28,11 @@ type IpmiCollector struct {
metricCollector
config struct {
ExcludeDevices []string `json:"exclude_devices"`
IpmitoolPath string `json:"ipmitool_path"`
IpmisensorsPath string `json:"ipmisensors_path"`
Sudo bool `json:"use_sudo"`
}
ipmitool string
ipmisensors string
}
@@ -54,6 +55,7 @@ func (m *IpmiCollector) Init(config json.RawMessage) error {
// default path to IPMI tools
m.config.IpmitoolPath = "ipmitool"
m.config.IpmisensorsPath = "ipmi-sensors"
if len(config) > 0 {
d := json.NewDecoder(bytes.NewReader(config))
d.DisallowUnknownFields()
@@ -61,51 +63,67 @@ func (m *IpmiCollector) Init(config json.RawMessage) error {
return fmt.Errorf("%s Init(): Error decoding JSON config: %w", m.name, err)
}
}
// Check if executables ipmitool or ipmisensors are found
p, err := exec.LookPath(m.config.IpmitoolPath)
if err == nil {
command := exec.Command(p)
err := command.Run()
if err != nil {
cclog.ComponentError(m.name, fmt.Sprintf("Failed to execute %s: %s", p, err.Error()))
m.ipmitool = ""
} else {
m.ipmitool = p
m.ipmitool = m.config.IpmitoolPath
m.ipmisensors = m.config.IpmisensorsPath
// Test if any of the supported backends work
var dummyChan chan lp.CCMessage
dummyConsumer := func() {
for range dummyChan {
}
}
p, err = exec.LookPath(m.config.IpmisensorsPath)
if err == nil {
command := exec.Command(p)
err := command.Run()
if err != nil {
cclog.ComponentError(m.name, fmt.Sprintf("Failed to execute %s: %s", p, err.Error()))
m.ipmisensors = ""
} else {
m.ipmisensors = p
}
}
if len(m.ipmitool) == 0 && len(m.ipmisensors) == 0 {
return fmt.Errorf("%s Init(): no usable IPMI reader found", m.name)
}
m.init = true
return nil
// Test if ipmi-sensors works (preferred over ipmitool, because it's faster)
var ipmiSensorsErr error
if _, ipmiSensorsErr = exec.LookPath(m.ipmisensors); ipmiSensorsErr == nil {
dummyChan = make(chan lp.CCMessage)
go dummyConsumer()
ipmiSensorsErr = m.readIpmiSensors(dummyChan)
close(dummyChan)
if ipmiSensorsErr == nil {
cclog.ComponentDebugf(m.name, "Using ipmi-sensors for ipmistat collector")
m.init = true
return nil
}
}
cclog.ComponentDebugf(m.name, "Unable to use ipmi-sensors for ipmistat collector: %v", ipmiSensorsErr)
m.ipmisensors = ""
// Test if ipmitool works (may be very slow)
var ipmiToolErr error
if _, ipmiToolErr = exec.LookPath(m.ipmitool); ipmiToolErr == nil {
dummyChan = make(chan lp.CCMessage)
go dummyConsumer()
ipmiToolErr = m.readIpmiTool(dummyChan)
close(dummyChan)
if ipmiToolErr == nil {
cclog.ComponentDebugf(m.name, "Using ipmitool for ipmistat collector")
m.init = true
return nil
}
}
m.ipmitool = ""
cclog.ComponentDebugf(m.name, "Unable to use ipmitool for ipmistat collector: %v", ipmiToolErr)
return fmt.Errorf("unable to init neither ipmitool (%w) nor ipmi-sensors (%w)", ipmiToolErr, ipmiSensorsErr)
}
func (m *IpmiCollector) readIpmiTool(cmd string, output chan lp.CCMessage) {
func (m *IpmiCollector) readIpmiTool(output chan lp.CCMessage) error {
// Setup ipmitool command
command := exec.Command(cmd, "sensor")
argv := make([]string, 0)
if m.config.Sudo {
argv = append(argv, "sudo", "-n")
}
argv = append(argv, m.ipmitool, "sensor")
command := exec.Command(argv[0], argv[1:]...)
stdout, _ := command.StdoutPipe()
errBuf := new(bytes.Buffer)
command.Stderr = errBuf
// start command
if err := command.Start(); err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("readIpmiTool(): Failed to start command \"%s\": %v", command.String(), err),
)
return
return fmt.Errorf("failed to start command '%s': %w", command.String(), err)
}
// Read command output
@@ -115,86 +133,100 @@ func (m *IpmiCollector) readIpmiTool(cmd string, output chan lp.CCMessage) {
if len(lv) < 3 {
continue
}
v, err := strconv.ParseFloat(strings.TrimSpace(lv[1]), 64)
if err == nil {
name := strings.ToLower(strings.ReplaceAll(strings.TrimSpace(lv[0]), " ", "_"))
unit := strings.TrimSpace(lv[2])
switch unit {
case "Volts":
unit = "Volts"
case "degrees C":
unit = "degC"
case "degrees F":
unit = "degF"
case "Watts":
unit = "Watts"
}
y, err := lp.NewMessage(name, map[string]string{"type": "node"}, m.meta, map[string]any{"value": v}, time.Now())
if err == nil {
y.AddMeta("unit", unit)
output <- y
}
if strings.TrimSpace(lv[1]) == "0x0" || strings.TrimSpace(lv[1]) == "na" {
// Ignore known non-float values
continue
}
v, err := strconv.ParseFloat(strings.TrimSpace(lv[1]), 64)
if err != nil {
cclog.ComponentErrorf(m.name, "Failed to parse float '%s': %v", lv[1], err)
continue
}
name := strings.ToLower(strings.ReplaceAll(strings.TrimSpace(lv[0]), " ", "_"))
unit := strings.TrimSpace(lv[2])
switch unit {
case "Volts":
unit = "Volts"
case "degrees C":
unit = "degC"
case "degrees F":
unit = "degF"
case "Watts":
unit = "Watts"
}
y, err := lp.NewMessage(name, map[string]string{"type": "node"}, m.meta, map[string]any{"value": v}, time.Now())
if err != nil {
cclog.ComponentErrorf(m.name, "Failed to create message: %v", err)
continue
}
y.AddMeta("unit", unit)
output <- y
}
// Wait for command end
if err := command.Wait(); err != nil {
errMsg, _ := io.ReadAll(errBuf)
cclog.ComponentError(
m.name,
fmt.Sprintf("readIpmiTool(): Failed to wait for the end of command \"%s\": %v\n", command.String(), err),
)
cclog.ComponentError(m.name, fmt.Sprintf("readIpmiTool(): command stderr: \"%s\"\n", strings.TrimSpace(string(errMsg))))
return
return fmt.Errorf("failed to complete command '%s': %w (stderr: %s)", command.String(), err, strings.TrimSpace(string(errMsg)))
}
return nil
}
func (m *IpmiCollector) readIpmiSensors(cmd string, output chan lp.CCMessage) {
func (m *IpmiCollector) readIpmiSensors(output chan lp.CCMessage) error {
// Setup ipmisensors command
command := exec.Command(cmd, "--comma-separated-output", "--sdr-cache-recreate")
argv := make([]string, 0)
if m.config.Sudo {
argv = append(argv, "sudo", "-n")
}
argv = append(argv, m.ipmisensors, "--comma-separated-output", "--sdr-cache-recreate")
command := exec.Command(argv[0], argv[1:]...)
stdout, _ := command.StdoutPipe()
errBuf := new(bytes.Buffer)
command.Stderr = errBuf
// start command
if err := command.Start(); err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("readIpmiSensors(): Failed to start command \"%s\": %v", command.String(), err),
)
return
return fmt.Errorf("failed to start command '%s': %w", command.String(), err)
}
// Read command output
scanner := bufio.NewScanner(stdout)
for scanner.Scan() {
lv := strings.Split(scanner.Text(), ",")
if len(lv) > 3 {
v, err := strconv.ParseFloat(lv[3], 64)
if err == nil {
name := strings.ToLower(strings.ReplaceAll(lv[1], " ", "_"))
y, err := lp.NewMessage(name, map[string]string{"type": "node"}, m.meta, map[string]any{"value": v}, time.Now())
if err == nil {
if len(lv) > 4 {
y.AddMeta("unit", lv[4])
}
output <- y
}
}
if len(lv) <= 3 {
continue
}
if lv[3] == "N/A" || lv[3] == "Reading" {
// Ignore known non-float values
continue
}
v, err := strconv.ParseFloat(strings.TrimSpace(lv[3]), 64)
if err != nil {
cclog.ComponentErrorf(m.name, "Failed to parse float '%s': %v", lv[3], err)
continue
}
name := strings.ToLower(strings.ReplaceAll(lv[1], " ", "_"))
y, err := lp.NewMessage(name, map[string]string{"type": "node"}, m.meta, map[string]any{"value": v}, time.Now())
if err != nil {
cclog.ComponentErrorf(m.name, "Failed to create message: %v", err)
continue
}
if len(lv) > 4 {
y.AddMeta("unit", lv[4])
}
output <- y
}
// Wait for command end
if err := command.Wait(); err != nil {
errMsg, _ := io.ReadAll(errBuf)
cclog.ComponentError(
m.name,
fmt.Sprintf("readIpmiSensors(): Failed to wait for the end of command \"%s\": %v\n", command.String(), err),
)
cclog.ComponentError(m.name, fmt.Sprintf("readIpmiSensors(): command stderr: \"%s\"\n", strings.TrimSpace(string(errMsg))))
return
return fmt.Errorf("failed to complete command '%s': %w (stderr: %s)", command.String(), err, strings.TrimSpace(string(errMsg)))
}
return nil
}
func (m *IpmiCollector) Read(interval time.Duration, output chan lp.CCMessage) {
@@ -203,10 +235,16 @@ func (m *IpmiCollector) Read(interval time.Duration, output chan lp.CCMessage) {
return
}
if len(m.config.IpmitoolPath) > 0 {
m.readIpmiTool(m.config.IpmitoolPath, output)
} else if len(m.config.IpmisensorsPath) > 0 {
m.readIpmiSensors(m.config.IpmisensorsPath, output)
if len(m.ipmisensors) > 0 {
err := m.readIpmiSensors(output)
if err != nil {
cclog.ComponentErrorf(m.name, "readIpmiSensors() failed: %v", err)
}
} else if len(m.ipmitool) > 0 {
err := m.readIpmiTool(output)
if err != nil {
cclog.ComponentErrorf(m.name, "readIpmiTool() failed: %v", err)
}
}
}

View File

@@ -14,10 +14,25 @@ hugo_path: docs/reference/cc-metric-collector/collectors/ipmi.md
```json
"ipmistat": {
"ipmitool_path": "/path/to/ipmitool",
"ipmisensors_path": "/path/to/ipmi-sensors"
"ipmisensors_path": "/path/to/ipmi-sensors",
"use_sudo": true
}
```
The `ipmistat` collector reads data from `ipmitool` (`ipmitool sensor`) or `ipmi-sensors` (`ipmi-sensors --sdr-cache-recreate --comma-separated-output`).
The metrics depend on the output of the underlying tools but contain temperature, power and energy metrics.
`ipmitool` and `ipmi-sensors` typically require root to run.
In order to run `cc-metric-collector` without root priviliges, you can enable `use_sudo`.
Add a file like this in `/etc/sudoers.d/` to allow `cc-metric-collector` to run the required commands:
```
# Do not log the following sudo commands from monitoring, since this causes a lot of log spam.
# However keep log_denied enabled, to detect failures
Defaults: monitoring !log_allowed, !pam_session
# Allow to use ipmitool and ipmi-sensors
monitoring ALL = (root) NOPASSWD:/usr/bin/ipmitool sensor
monitoring ALL = (root) NOPASSWD:/usr/sbin/ipmi-sensors --comma-separated-output --sdr-cache-recreate
```

View File

@@ -55,3 +55,16 @@ Metrics:
* `lustre_inode_permission_diff` (if `send_diff_values == true`)
This collector adds an `device` tag.
`lctl` typically require root to run.
In order to run `cc-metric-collector` without root priviliges, you can enable `use_sudo`.
Add a file like this in `/etc/sudoers.d/` to allow `cc-metric-collector` to run the required command:
```
# Do not log the following sudo commands from monitoring, since this causes a lot of log spam.
# However keep log_denied enabled, to detect failures
Defaults: monitoring !log_allowed, !pam_session
# Allow to use lctl
monitoring ALL = (root) NOPASSWD:/absolute/path/to/lctl get_param llite.*.stats
```

View File

@@ -111,16 +111,38 @@ func (m *MemstatCollector) Init(config json.RawMessage) error {
m.matches = make(map[string]string)
m.tags = map[string]string{"type": "node"}
matches := map[string]string{
"MemTotal": "mem_total",
"SwapTotal": "swap_total",
"SReclaimable": "mem_sreclaimable",
"Slab": "mem_slab",
"MemFree": "mem_free",
"Buffers": "mem_buffers",
"Cached": "mem_cached",
"MemAvailable": "mem_available",
"SwapFree": "swap_free",
"MemShared": "mem_shared",
"MemTotal": "mem_total",
"SwapTotal": "swap_total",
"SReclaimable": "mem_sreclaimable",
"Slab": "mem_slab",
"MemFree": "mem_free",
"Buffers": "mem_buffers",
"Cached": "mem_cached",
"MemAvailable": "mem_available",
"SwapFree": "swap_free",
"Shmem": "mem_shared",
"Active": "mem_active",
"Inactive": "mem_inactive",
"Dirty": "mem_dirty",
"Writeback": "mem_writeback",
"AnonPages": "mem_anon_pages",
"Mapped": "mem_mapped",
"VmallocTotal": "mem_vmalloc_total",
"AnonHugePages": "mem_anon_hugepages",
"ShmemHugePages": "mem_shared_hugepages",
"ShmemPmdMapped": "mem_shared_pmd_mapped",
"HugePages_Total": "mem_hugepages_total",
"HugePages_Free": "mem_hugepages_free",
"HugePages_Rsvd": "mem_hugepages_reserved",
"HugePages_Surp": "mem_hugepages_surplus",
"Hugepagesize": "mem_hugepages_size",
"DirectMap4k": "mem_direct_mapped_4k",
"DirectMap4M": "mem_direct_mapped_4m",
"DirectMap2M": "mem_direct_mapped_2m",
"DirectMap1G": "mem_direct_mapped_1g",
"Mlocked": "mem_locked",
"PageTables": "mem_pagetables",
"KernelStack": "mem_kernelstack",
}
for k, v := range matches {
if !slices.Contains(m.config.ExcludeMetrics, k) {
@@ -221,6 +243,12 @@ func (m *MemstatCollector) Read(interval time.Duration, output chan lp.CCMessage
unit = cacheVal.unit
}
}
if shmemVal, shmem := stats["Shmem"]; shmem {
memUsed -= shmemVal.value
if len(shmemVal.unit) > 0 && len(unit) == 0 {
unit = shmemVal.unit
}
}
}
}
}

View File

@@ -32,7 +32,29 @@ Metrics:
* `mem_cached`
* `mem_available`
* `mem_shared`
* `mem_active`
* `mem_inactive`
* `mem_dirty`
* `mem_writeback`
* `mem_anon_pages`
* `mem_mapped`
* `mem_vmalloc_total`
* `mem_anon_hugepages`
* `mem_shared_hugepages`
* `mem_shared_pmd_mapped`
* `mem_hugepages_total`
* `mem_hugepages_free`
* `mem_hugepages_reserved`
* `mem_hugepages_surplus`
* `mem_hugepages_size`
* `mem_direct_mapped_4k`
* `mem_direct_mapped_2m`
* `mem_direct_mapped_4m`
* `mem_direct_mapped_1g`
* `mem_locked`
* `mem_pagetables`
* `mem_kernelstack`
* `swap_total`
* `swap_free`
* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached`)
* `mem_used` = `mem_total` - (`mem_free` + `mem_buffers` + `mem_cached` + `mem_shared`)

View File

@@ -50,3 +50,18 @@ Metrics:
* `smartmon_errlog_entries`: Error log entries
* `smartmon_warn_temp_time`: Time above the warning temperature threshold
* `smartmon_crit_comp_time`: Time above the critical composite temperature threshold
`smartctl` typically require root to run.
In order to run `cc-metric-collector` without root priviliges, you can enable `use_sudo`.
Add a file like this in `/etc/sudoers.d/` to allow `cc-metric-collector` to run the required command:
```
# Do not log the following sudo commands from monitoring, since this causes a lot of log spam.
# However keep log_denied enabled, to detect failures
Defaults: monitoring !log_allowed, !pam_session
# Allow to use lctl
monitoring ALL = (root) NOPASSWD:/absolute/path/to/smartctl --json=c --device=* "--all" *
# Or add individual rules for each device
# monitoring ALL = (root) NOPASSWD:/absolute/path/to/smartctl --json=c --device=<device_type> "--all" <device>
```

18
go.mod
View File

@@ -3,14 +3,14 @@ module github.com/ClusterCockpit/cc-metric-collector
go 1.25.0
require (
github.com/ClusterCockpit/cc-lib/v2 v2.8.2
github.com/ClusterCockpit/go-rocm-smi v0.3.0
github.com/ClusterCockpit/cc-lib/v2 v2.11.0
github.com/ClusterCockpit/go-rocm-smi v0.4.0
github.com/NVIDIA/go-nvml v0.13.0-1
github.com/PaesslerAG/gval v1.2.4
github.com/fsnotify/fsnotify v1.9.0
github.com/tklauser/go-sysconf v0.3.16
golang.design/x/thread v0.0.0-20210122121316-335e9adffdf1
golang.org/x/sys v0.42.0
golang.org/x/sys v0.43.0
)
require (
@@ -28,17 +28,17 @@ require (
github.com/nats-io/nats.go v1.49.0 // indirect
github.com/nats-io/nkeys v0.4.15 // indirect
github.com/nats-io/nuid v1.0.1 // indirect
github.com/oapi-codegen/runtime v1.2.0 // indirect
github.com/oapi-codegen/runtime v1.3.0 // indirect
github.com/prometheus/client_golang v1.23.2 // indirect
github.com/prometheus/client_model v0.6.2 // indirect
github.com/prometheus/common v0.67.5 // indirect
github.com/prometheus/procfs v0.20.0 // indirect
github.com/prometheus/procfs v0.20.1 // indirect
github.com/santhosh-tekuri/jsonschema/v5 v5.3.1 // indirect
github.com/shopspring/decimal v1.4.0 // indirect
github.com/stmcginnis/gofish v0.21.3 // indirect
github.com/stmcginnis/gofish v0.21.4 // indirect
github.com/tklauser/numcpus v0.11.0 // indirect
go.yaml.in/yaml/v2 v2.4.3 // indirect
golang.org/x/crypto v0.48.0 // indirect
golang.org/x/net v0.51.0 // indirect
go.yaml.in/yaml/v2 v2.4.4 // indirect
golang.org/x/crypto v0.49.0 // indirect
golang.org/x/net v0.52.0 // indirect
google.golang.org/protobuf v1.36.11 // indirect
)

49
go.sum
View File

@@ -1,12 +1,9 @@
github.com/ClusterCockpit/cc-lib/v2 v2.8.0 h1:ROduRzRuusi+6kLB991AAu3Pp2AHOasQJFJc7JU/n/E=
github.com/ClusterCockpit/cc-lib/v2 v2.8.0/go.mod h1:FwD8vnTIbBM3ngeLNKmCvp9FoSjQZm7xnuaVxEKR23o=
github.com/ClusterCockpit/cc-lib/v2 v2.8.2 h1:rCLZk8wz8yq8xBnBEdVKigvA2ngR8dPmHbEFwxxb3jw=
github.com/ClusterCockpit/cc-lib/v2 v2.8.2/go.mod h1:FwD8vnTIbBM3ngeLNKmCvp9FoSjQZm7xnuaVxEKR23o=
github.com/ClusterCockpit/cc-lib/v2 v2.11.0 h1:LaLs4J0b7FArIXT8byMUcIcUr55R5obATjVi7qI02r4=
github.com/ClusterCockpit/cc-lib/v2 v2.11.0/go.mod h1:Oj+N2lpFqiBOBzjfrLIGJ2YSWT400TX4M0ii4lNl81A=
github.com/ClusterCockpit/cc-line-protocol/v2 v2.4.0 h1:hIzxgTBWcmCIHtoDKDkSCsKCOCOwUC34sFsbD2wcW0Q=
github.com/ClusterCockpit/cc-line-protocol/v2 v2.4.0/go.mod h1:y42qUu+YFmu5fdNuUAS4VbbIKxVjxCvbVqFdpdh8ahY=
github.com/ClusterCockpit/go-rocm-smi v0.3.0 h1:1qZnSpG7/NyLtc7AjqnUL9Jb8xtqG1nMVgp69rJfaR8=
github.com/ClusterCockpit/go-rocm-smi v0.3.0/go.mod h1:+I3UMeX3OlizXDf1WpGD43W4KGZZGVSGmny6rTeOnWA=
github.com/NVIDIA/go-nvml v0.11.6-0/go.mod h1:hy7HYeQy335x6nEss0Ne3PYqleRa6Ct+VKD9RQ4nyFs=
github.com/ClusterCockpit/go-rocm-smi v0.4.0 h1:3+bEPrSkjEJcOtt+qBUX48ugDVlOFaKUnXHTef2Ve2Q=
github.com/ClusterCockpit/go-rocm-smi v0.4.0/go.mod h1:c19u5vBCcgb7DjL4EWTGSGpo6c79d07r4rxD50z25ng=
github.com/NVIDIA/go-nvml v0.13.0-1 h1:OLX8Jq3dONuPOQPC7rndB6+iDmDakw0XTYgzMxObkEw=
github.com/NVIDIA/go-nvml v0.13.0-1/go.mod h1:+KNA7c7gIBH7SKSJ1ntlwkfN80zdx8ovl4hrK3LmPt4=
github.com/PaesslerAG/gval v1.2.4 h1:rhX7MpjJlcxYwL2eTTYIOBUyEKZ+A96T9vQySWkVUiU=
@@ -69,8 +66,8 @@ github.com/nats-io/nkeys v0.4.15 h1:JACV5jRVO9V856KOapQ7x+EY8Jo3qw1vJt/9Jpwzkk4=
github.com/nats-io/nkeys v0.4.15/go.mod h1:CpMchTXC9fxA5zrMo4KpySxNjiDVvr8ANOSZdiNfUrs=
github.com/nats-io/nuid v1.0.1 h1:5iA8DT8V7q8WK2EScv2padNa/rTESc1KdnPw4TC2paw=
github.com/nats-io/nuid v1.0.1/go.mod h1:19wcPz3Ph3q0Jbyiqsd0kePYG7A95tJPxeL+1OSON2c=
github.com/oapi-codegen/runtime v1.2.0 h1:RvKc1CVS1QeKSNzO97FBQbSMZyQ8s6rZd+LpmzwHMP4=
github.com/oapi-codegen/runtime v1.2.0/go.mod h1:Y7ZhmmlE8ikZOmuHRRndiIm7nf3xcVv+YMweKgG1DT0=
github.com/oapi-codegen/runtime v1.3.0 h1:vyK1zc0gDWWXgk2xoQa4+X4RNNc5SL2RbTpJS/4vMYA=
github.com/oapi-codegen/runtime v1.3.0/go.mod h1:kOdeacKy7t40Rclb1je37ZLFboFxh+YLy0zaPCMibPY=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/prometheus/client_golang v1.23.2 h1:Je96obch5RDVy3FDMndoUsjAhG5Edi49h0RJWRi/o0o=
@@ -79,8 +76,8 @@ github.com/prometheus/client_model v0.6.2 h1:oBsgwpGs7iVziMvrGhE53c/GrLUsZdHnqNw
github.com/prometheus/client_model v0.6.2/go.mod h1:y3m2F6Gdpfy6Ut/GBsUqTWZqCUvMVzSfMLjcu6wAwpE=
github.com/prometheus/common v0.67.5 h1:pIgK94WWlQt1WLwAC5j2ynLaBRDiinoAb86HZHTUGI4=
github.com/prometheus/common v0.67.5/go.mod h1:SjE/0MzDEEAyrdr5Gqc6G+sXI67maCxzaT3A2+HqjUw=
github.com/prometheus/procfs v0.20.0 h1:AA7aCvjxwAquZAlonN7888f2u4IN8WVeFgBi4k82M4Q=
github.com/prometheus/procfs v0.20.0/go.mod h1:o9EMBZGRyvDrSPH1RqdxhojkuXstoe4UlK79eF5TGGo=
github.com/prometheus/procfs v0.20.1 h1:XwbrGOIplXW/AU3YhIhLODXMJYyC1isLFfYCsTEycfc=
github.com/prometheus/procfs v0.20.1/go.mod h1:o9EMBZGRyvDrSPH1RqdxhojkuXstoe4UlK79eF5TGGo=
github.com/rogpeppe/go-internal v1.10.0 h1:TMyTOH3F/DB16zRVcYyreMH6GnZZrwQVAoYjRBZyWFQ=
github.com/rogpeppe/go-internal v1.10.0/go.mod h1:UQnix2H7Ngw/k4C5ijL5+65zddjncjaFoBhdsK/akog=
github.com/santhosh-tekuri/jsonschema/v5 v5.3.1 h1:lZUw3E0/J3roVtGQ+SCrUrg3ON6NgVqpn3+iol9aGu4=
@@ -89,10 +86,17 @@ github.com/shopspring/decimal v1.3.1/go.mod h1:DKyhrW/HYNuLGql+MJL6WCR6knT2jwCFR
github.com/shopspring/decimal v1.4.0 h1:bxl37RwXBklmTi0C79JfXCEBD1cqqHt0bbgBAGFp81k=
github.com/shopspring/decimal v1.4.0/go.mod h1:gawqmDU56v4yIKSwfBSFip1HdCCXN8/+DMd9qYNcwME=
github.com/spkg/bom v0.0.0-20160624110644-59b7046e48ad/go.mod h1:qLr4V1qq6nMqFKkMo8ZTx3f+BZEkzsRUY10Xsm2mwU0=
github.com/stmcginnis/gofish v0.21.3 h1:EBLCHfORnbx7MPw7lplOOVe9QAD1T3XRVz6+a1Z4z5Q=
github.com/stmcginnis/gofish v0.21.3/go.mod h1:PzF5i8ecRG9A2ol8XT64npKUunyraJ+7t0kYMpQAtqU=
github.com/stmcginnis/gofish v0.21.4 h1:daexK8sh31CgeSMkPUNs21HWHHA9ecCPJPyLCTxukCg=
github.com/stmcginnis/gofish v0.21.4/go.mod h1:PzF5i8ecRG9A2ol8XT64npKUunyraJ+7t0kYMpQAtqU=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
github.com/stretchr/objx v0.5.2/go.mod h1:FRsXN1f5AsAjCGJKqEizvkpNtU+EGNCLh3NxZ/8L+MA=
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
github.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=
github.com/stretchr/testify v1.10.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=
github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
github.com/tklauser/go-sysconf v0.3.16 h1:frioLaCQSsF5Cy1jgRBrzr6t502KIIwQ0MArYICU0nA=
@@ -101,23 +105,22 @@ github.com/tklauser/numcpus v0.11.0 h1:nSTwhKH5e1dMNsCdVBukSZrURJRoHbSEQjdEbY+9R
github.com/tklauser/numcpus v0.11.0/go.mod h1:z+LwcLq54uWZTX0u/bGobaV34u6V7KNlTZejzM6/3MQ=
go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto=
go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE=
go.yaml.in/yaml/v2 v2.4.3 h1:6gvOSjQoTB3vt1l+CU+tSyi/HOjfOjRLJ4YwYZGwRO0=
go.yaml.in/yaml/v2 v2.4.3/go.mod h1:zSxWcmIDjOzPXpjlTTbAsKokqkDNAVtZO0WOMiT90s8=
go.yaml.in/yaml/v2 v2.4.4 h1:tuyd0P+2Ont/d6e2rl3be67goVK4R6deVxCUX5vyPaQ=
go.yaml.in/yaml/v2 v2.4.4/go.mod h1:gMZqIpDtDqOfM0uNfy0SkpRhvUryYH0Z6wdMYcacYXQ=
golang.design/x/thread v0.0.0-20210122121316-335e9adffdf1 h1:P7S/GeHBAFEZIYp0ePPs2kHXoazz8q2KsyxHyQVGCJg=
golang.design/x/thread v0.0.0-20210122121316-335e9adffdf1/go.mod h1:9CWpnTUmlQkfdpdutA1nNf4iE5lAVt3QZOu0Z6hahBE=
golang.org/x/crypto v0.48.0 h1:/VRzVqiRSggnhY7gNRxPauEQ5Drw9haKdM0jqfcCFts=
golang.org/x/crypto v0.48.0/go.mod h1:r0kV5h3qnFPlQnBSrULhlsRfryS2pmewsg+XfMgkVos=
golang.org/x/net v0.51.0 h1:94R/GTO7mt3/4wIKpcR5gkGmRLOuE/2hNGeWq/GBIFo=
golang.org/x/net v0.51.0/go.mod h1:aamm+2QF5ogm02fjy5Bb7CQ0WMt1/WVM7FtyaTLlA9Y=
golang.org/x/crypto v0.49.0 h1:+Ng2ULVvLHnJ/ZFEq4KdcDd/cfjrrjjNSXNzxg0Y4U4=
golang.org/x/crypto v0.49.0/go.mod h1:ErX4dUh2UM+CFYiXZRTcMpEcN8b/1gxEuv3nODoYtCA=
golang.org/x/net v0.52.0 h1:He/TN1l0e4mmR3QqHMT2Xab3Aj3L9qjbhRm78/6jrW0=
golang.org/x/net v0.52.0/go.mod h1:R1MAz7uMZxVMualyPXb+VaqGSa3LIaUqk0eEt3w36Sw=
golang.org/x/sys v0.0.0-20210122093101-04d7465088b8/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.42.0 h1:omrd2nAlyT5ESRdCLYdm3+fMfNFE/+Rf4bDIQImRJeo=
golang.org/x/sys v0.42.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
golang.org/x/sys v0.43.0 h1:Rlag2XtaFTxp19wS8MXlJwTvoh8ArU6ezoyFsMyCTNI=
golang.org/x/sys v0.43.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
golang.org/x/time v0.14.0 h1:MRx4UaLrDotUKUdCIqzPC48t1Y9hANFKIRpNx+Te8PI=
golang.org/x/time v0.14.0/go.mod h1:eL/Oa2bBBK0TkX57Fyni+NgnyQQN4LitPmob2Hjnqw4=
google.golang.org/protobuf v1.36.11 h1:fV6ZwhNocDyBLK0dj+fg8ektcVegBBuEolpbTQyBNVE=
google.golang.org/protobuf v1.36.11/go.mod h1:HTf+CrKn2C3g5S8VImy6tdcUvCska2kB7j23XfzDpco=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=

View File

@@ -54,7 +54,7 @@ install -Dpm 0644 scripts/%{name}.sysusers %{buildroot}%{_sysusersdir}/%{name}.c
%files
# Binary
%attr(-,clustercockpit,clustercockpit) %{_bindir}/%{name}
%attr(-,root,root) %{_bindir}/%{name}
# Config
%dir %{_sysconfdir}/%{name}
%attr(0600,clustercockpit,clustercockpit) %config(noreplace) %{_sysconfdir}/%{name}/%{name}.json