Compare commits

124 Commits

Author SHA1 Message Date
Thomas Roehl
2505b2f20b Add power averager to Nvidia GPU collector 2024-02-22 20:30:34 +01:00
Thomas Roehl
656e5899b0 Merge branch 'develop' of github.com:ClusterCockpit/cc-metric-collector into develop 2023-12-29 14:53:12 +01:00
Thomas Roehl
9b671ce68f Add comment about precision requirement for cc-metric-store 2023-12-11 16:06:28 +01:00
Thomas Roehl
226e8425cb Allow selection of timestamp precision in HttpSink 2023-12-11 14:57:06 +01:00
Thomas Gruber
a37f6603c8 Update cc-metric-collector.init 2023-12-11 13:47:53 +01:00
Thomas Roehl
78902305e8 Merge branch 'develop' of github.com:ClusterCockpit/cc-metric-collector into develop 2023-12-08 15:11:40 +01:00
Thomas Roehl
c8a91903f6 Add nfsiostat to list of collectors 2023-11-30 14:43:02 +01:00
Obihörnchen
5d19c31fa8 Fix %sysusers_create_package args (#108)
%sysusers_create_package requires two arguments. See: https://github.com/systemd/systemd/blob/main/src/rpm/macros.systemd.in#L165
2023-11-22 14:13:00 +01:00
Holger Obermaier
4bee75d4b5 Allow more than one background send operation 2023-10-13 15:15:10 +02:00
Holger Obermaier
78fac33a06 Add config option for HTTP request timeout and Retry interval 2023-10-13 15:00:06 +02:00
Holger Obermaier
0b509ca9e4 Be more strict, when parsing json 2023-10-13 09:53:49 +02:00
Holger Obermaier
595399e7d9 Add config option for HTTP keep-alives 2023-10-11 17:28:16 +02:00
Holger Obermaier
f059d52d43 Use DefaultServeMux instead of github.com/gorilla/mux 2023-10-11 17:19:39 +02:00
Holger Obermaier
b618e81cbb Add asynchron send of encoder metrics 2023-10-11 14:55:52 +02:00
Holger Obermaier
8837400bf2 Add config option to specify whether to use GZip compression in influx write requests 2023-10-09 16:57:26 +02:00
Holger Obermaier
3be11984f2 Fix: Corrected unlock access to batch slice 2023-10-09 16:48:42 +02:00
Holger Obermaier
dd40c852ca Stop flush timer when flushing immediately 2023-10-09 11:01:01 +02:00
Holger Obermaier
39ae211530 Go pkg update 2023-10-09 10:16:53 +02:00
Holger Obermaier
a4d7593af5 Use line protocol encoder 2023-10-09 10:12:14 +02:00
Holger Obermaier
fd1cdc5c07 Reverted previous changes.
Made the code too complex without much advantage
2023-10-06 16:56:30 +02:00
Holger Obermaier
94c88f23df Be more verbose in error messages 2023-10-05 17:22:30 +02:00
Holger Obermaier
9dae829f9d Wait for concurrent flush operations to finish 2023-10-05 16:44:03 +02:00
Thomas Roehl
b0f0462995 Use stype and stype-id for the NIC in NetstatCollector 2023-10-05 09:12:24 +02:00
Holger Obermaier
778bb62602 Cleanup 2023-10-04 16:24:39 +02:00
Holger Obermaier
5aa9603c01 Add batch_size config 2023-10-04 12:37:25 +02:00
Holger Obermaier
0db1cda27f Do not store unused topology information 2023-10-02 11:30:06 +02:00
Holger Obermaier
013ae7ec6d Reuse ccTopology functionality 2023-10-02 10:57:50 +02:00
Holger Obermaier
9f65365f9d Add Influx client options 2023-09-29 10:36:42 +02:00
Holger Obermaier
1e606a1aa1 Reuse flush timer 2023-09-26 15:04:39 +02:00
Holger Obermaier
19ec6d06db Use generic package maps to clone maps 2023-09-26 11:49:19 +02:00
Holger Obermaier
553fcff468 Add documentation for send_*_total values 2023-09-21 10:44:14 +02:00
Holger Obermaier
7b5a4caf6a Avoid unnecessary memory allocations 2023-09-21 10:19:25 +02:00
Holger Obermaier
a401e4cdd1 Add basic authentication support 2023-09-20 17:41:12 +02:00
Holger Obermaier
94d5822426 Add basic authentication support 2023-09-20 17:33:08 +02:00
Holger Obermaier
f6b5f7fb07 Add config option idle_timeout 2023-09-20 16:39:03 +02:00
Holger Obermaier
0c95db50ad Fix: Error NVML library not found did crash
cc-metric-collector with "SIGSEGV: segmentation violation"
2023-09-20 13:45:38 +02:00
Holger Obermaier
75b705aa87 Avoid package cmp to allow builds with golang v1.20 2023-09-19 17:00:16 +02:00
Holger Obermaier
8da5c692bb Increase golang version requirement to 1.20. 2023-09-19 16:22:36 +02:00
Holger Obermaier
42a9423203 Use slice to store lexically ordered key-value pairs 2023-09-19 14:48:11 +02:00
Holger Obermaier
c87c77a810 Only access meta data, when it gets used as tag 2023-09-19 14:06:20 +02:00
Holger Obermaier
c472029c2d Add tags in lexical order as required by AddTag() 2023-09-19 13:33:25 +02:00
Holger Obermaier
9e73849081 Use a lock for the flush timer 2023-09-19 12:57:43 +02:00
Holger Obermaier
d1a960e6e1 Add some basic debugging documentation 2023-09-18 17:03:18 +02:00
Holger Obermaier
9530c489b5 Add some basic debugging documentation 2023-09-18 16:51:37 +02:00
Holger Obermaier
64ffa3d23e Allow other fields not only field "value" 2023-09-18 16:35:56 +02:00
Holger Obermaier
3f4b11db47 github.com/influxdata/line-protocol -> github.com/influxdata/line-protocol/v2/lineprotocol 2023-09-18 16:03:57 +02:00
Holger Obermaier
fd227ed8b3 Add some comments 2023-09-18 15:37:52 +02:00
Holger Obermaier
2d41531b51 Corrected spelling 2023-09-18 14:52:09 +02:00
Holger Obermaier
e34b0166f9 github.com/influxdata/line-protocol -> github.com/influxdata/line-protocol/v2/lineprotocol 2023-09-15 15:59:03 +02:00
Holger Obermaier
baa45b833b Fix http server addr format 2023-09-15 14:12:59 +02:00
Holger Obermaier
aac475fc98 Add links to ipmi and redfish receivers 2023-09-15 13:33:47 +02:00
Holger Obermaier
2dfeac8ce8 Fix Ubuntu version number 2023-09-13 13:36:27 +02:00
Holger Obermaier
6a4731ab7e Update golang version 2023-09-13 13:32:44 +02:00
Holger Obermaier
be68aeb44f Pipe golang tar package directly to tar 2023-09-13 13:23:37 +02:00
Holger Obermaier
9975ee6e00 Upgrade Ubuntu focal -> jammy 2023-09-13 13:20:00 +02:00
Holger Obermaier
38478ce8c2 Remove golang versions before 1.20 2023-09-13 13:07:50 +02:00
Holger Obermaier
9c9fd59ed0 Use dnf to download golang 2023-09-13 12:56:22 +02:00
Holger Obermaier
5895490b53 Switch to golang 1.20 for RHEL based distributions 2023-09-13 12:48:39 +02:00
Holger Obermaier
4e08acf509 Add release build jobs to runonce.yml 2023-09-13 12:29:00 +02:00
Holger Obermaier
e1bb3dbef6 Add workflow_dispatch to allow manual run of workflow 2023-09-13 11:31:57 +02:00
Holger Obermaier
562bcbf486 Add workflow_dispatch to allow manual run of workflow 2023-09-13 11:23:09 +02:00
Holger Obermaier
262a119413 Switch to setup-go action version 4 2023-09-13 11:16:40 +02:00
Holger Obermaier
609cafeb2c Switch to checkout action version 4 2023-09-13 11:11:27 +02:00
Holger Obermaier
6dc4e7708a Add build with golang 1.21 2023-09-13 11:03:36 +02:00
Holger Obermaier
3fdb60d708 Updated go packages 2023-09-12 16:17:24 +02:00
Holger Obermaier
12130361fd Add comments 2023-09-12 11:28:57 +02:00
Holger Obermaier
faad23ed64 Remove unused variable gmresults 2023-09-12 10:48:51 +02:00
Holger Obermaier
674e78b3d0 Send all metrics with same time stamp
calcGlobalMetrics only does computation; counter measurement is done beforehand
2023-09-12 10:45:50 +02:00
Holger Obermaier
302e42d1d0 Input parameters should be float64 when evaluating to float64 2023-09-12 10:35:36 +02:00
Holger Obermaier
1aca1b6caf Send all metrics with same time stamp
calcEventsetMetrics only does computation; counter measurement is done beforehand
2023-09-12 10:18:55 +02:00
Holger Obermaier
1b60935f38 Allow to send total values per core, socket and node 2023-09-11 16:26:15 +02:00
Holger Obermaier
188f0261b5 Reduce number of required slices 2023-09-11 13:02:22 +02:00
Holger Obermaier
1b06270e9b Replace deprecated thread_siblings_list by core_cpus_list 2023-09-08 11:34:07 +02:00
Holger Obermaier
f3ffa29a37 Add Simultaneous Multithreading siblings 2023-09-08 11:07:37 +02:00
Holger Obermaier
7246278745 Correctly handle lists from /sys 2023-09-08 10:39:41 +02:00
Holger Obermaier
e5173bb9a2 Lookup all information from /sys/devices/system/cpu, /proc/cpuinfo is not portable 2023-09-08 10:09:04 +02:00
Holger Obermaier
fd56a14eb6 Lookup core ID from /sys/devices/system/cpu, /proc/cpuinfo is not portable 2023-09-08 09:25:03 +02:00
Holger Obermaier
35c20110ca Add comment 2023-09-07 14:08:03 +02:00
Holger Obermaier
a871753bdf Cleanup 2023-09-07 14:06:57 +02:00
Holger Obermaier
fbf178326a Add NumaDomainList and SMTList 2023-09-07 13:49:22 +02:00
Holger Obermaier
8fedef9024 Add DieList 2023-09-07 13:04:16 +02:00
Holger Obermaier
094f124a18 Avoid slice cloning. Directly use the cache 2023-09-07 11:45:38 +02:00
Holger Obermaier
1f5856c671 Reuse information from /proc/cpuinfo 2023-09-07 10:24:43 +02:00
Holger Obermaier
ae106566dd Use init function to initialize cache structure to avoid multithreading problems 2023-09-07 10:11:20 +02:00
Holger Obermaier
b3922b3255 Cleanup 2023-09-06 17:29:37 +02:00
Holger Obermaier
5fa53a7ab8 Cache CpuData 2023-09-06 16:46:30 +02:00
Holger Obermaier
3ac1ada204 Add caching 2023-09-06 16:19:16 +02:00
Holger Obermaier
2dc78ee0aa Avoid type conversion by using Atoi
Avoid copying structs by using pointer access
Increase readability with CamelCase variable names
2023-09-06 15:28:49 +02:00
Holger Obermaier
4b16ca4a30 Fix function getNumaDomain, it always returned 0 2023-09-06 11:35:45 +02:00
Holger Obermaier
6a2b74b0dc Use CamelCase 2023-09-06 10:44:23 +02:00
Holger Obermaier
3171792bd6 Use CamelCase 2023-09-06 10:37:57 +02:00
Holger Obermaier
99ccc04933 Read file line by line 2023-09-06 10:15:17 +02:00
Holger Obermaier
34436ac261 Read file line by line 2023-09-06 10:09:53 +02:00
Holger Obermaier
ae44b7f826 Read file line by line 2023-09-06 10:03:33 +02:00
Holger Obermaier
0cf32d5988 Switch to package slices from the golang 1.21 default library 2023-09-06 09:45:01 +02:00
Holger Obermaier
013aa9ec92 ioutil.ReadFile is deprecated: As of Go 1.16, this function simply calls os.ReadFile 2023-09-05 17:41:08 +02:00
Thomas Roehl
62720dec13 Fix path after installation to /usr/bin 2023-08-31 15:12:43 +02:00
Thomas Roehl
c64943a954 Merge branch 'develop' of github.com:ClusterCockpit/cc-metric-collector into develop 2023-08-31 15:07:32 +02:00
Thomas Roehl
6eea1325bf Add safe.directory to Release action 2023-08-29 15:38:44 +02:00
Thomas Roehl
e205c16cdb Merge branch 'main' into develop 2023-08-29 14:14:59 +02:00
Holger Obermaier
fa755ae401 Fixed initialization: Initialization and measurements should run in the same thread 2023-08-25 08:26:05 +02:00
Holger Obermaier
1b97953cdb Completely avoid memory allocations in infinibandMetric read() 2023-08-21 10:09:21 +02:00
Thomas Gruber
fc19b2b9a5 Update likwidMetric.go
Fixes a potential bug when `fsnotify.NewWatcher()` fails with an error
2023-08-18 11:27:47 +02:00
Holger Obermaier
e425b2c38e Add aggregated metrics.
Add missing units
2023-08-18 10:39:43 +02:00
Holger Obermaier
f5d2d27090 Compute metrics ib_total and ib_total_pkts 2023-08-17 16:46:53 +02:00
Holger Obermaier
41ea9139c6 Use simpler sort function 2023-08-17 15:13:31 +02:00
Holger Obermaier
da946472df Remove old entries from go.sum 2023-08-17 15:12:37 +02:00
Holger Obermaier
0ffbedb3ec For older versions of go slices is not part of the installation 2023-08-17 15:05:13 +02:00
Holger Obermaier
eafeea1a76 Use generic function to compute median 2023-08-17 14:46:22 +02:00
Holger Obermaier
fcda7a6921 Add error value to sumAnyType 2023-08-17 13:55:31 +02:00
Holger Obermaier
a25f4f8b8d Use generic function to compute average 2023-08-17 13:50:46 +02:00
Holger Obermaier
ceff67085b Use generic function to compute maximum 2023-08-17 11:46:15 +02:00
Holger Obermaier
ec86a83a27 Use generic function to compute minimum 2023-08-17 11:41:26 +02:00
Holger Obermaier
89c93185d4 Add missing case for type []int32 2023-08-17 11:24:06 +02:00
Holger Obermaier
c3004f8c6d Use generic function to simplify code 2023-08-17 10:20:47 +02:00
Holger Obermaier
a1c2c3856d Allow values to be a slice of type float64, float32, int, int64, int32, bool 2023-08-17 09:48:41 +02:00
Holger Obermaier
fa8dd5992d Allow sum function to handle non float types 2023-08-17 08:16:19 +02:00
Holger Obermaier
0b28c55162 Use only as many arguments as required 2023-08-17 08:03:33 +02:00
Holger Obermaier
fb480993ed Simplify Makefile 2023-08-16 15:40:33 +02:00
Thomas Röhl
ef49701f14 Use not a pointer to line-protocol.Encoder 2023-07-17 18:02:50 +02:00
Thomas Röhl
34bc23fbbd Update fsnotify in LIKWID Collector 2023-07-17 18:01:49 +02:00
Thomas Gruber
a7e8a1dfb5 Update runonce.yml with Golang 1.20 2023-07-17 15:23:21 +02:00
Thomas Röhl
547e2546c7 Update to line-protocol/v2 2023-07-17 15:20:12 +02:00
Thomas Röhl
e7b77f7721 Add cpu_used (all-cpu_idle) to CpustatCollector 2023-04-05 11:20:09 +02:00
6 changed files with 216 additions and 71 deletions

@@ -195,7 +195,7 @@ jobs:
Release:
runs-on: ubuntu-latest
# We need the RPMs, so add dependency
-needs: [AlmaLinux-RPM-build, UBI-8-RPM-build, Ubuntu-jammy-build]
+needs: [AlmaLinux-RPM-build, UBI-8-RPM-build, Ubuntu-focal-build]
steps:
# See: https://github.com/actions/download-artifact

@@ -6,6 +6,7 @@ import (
"fmt"
"log"
"strings"
"sync"
"time"
cclog "github.com/ClusterCockpit/cc-metric-collector/pkg/ccLogger"
@@ -24,6 +25,81 @@ type NvidiaCollectorConfig struct {
ProcessMigDevices bool `json:"process_mig_devices,omitempty"`
UseUuidForMigDevices bool `json:"use_uuid_for_mig_device,omitempty"`
UseSliceForMigDevices bool `json:"use_slice_for_mig_device,omitempty"`
AveragePowerInterval string `json:"average_power_interval,omitempty"`
}
type powerAverager struct {
device nvml.Device
interval time.Duration
done chan bool
wg sync.WaitGroup
powerSum float64
powerSamples int
ticker *time.Ticker
running bool
}
type PowerAverager interface {
Start()
IsRunning() bool
Get() float64
Close()
}
func (pa *powerAverager) IsRunning() bool {
return pa.running
}
func (pa *powerAverager) Start() {
pa.wg.Add(1)
go func(avger *powerAverager) {
avger.running = true
avger.ticker = time.NewTicker(avger.interval)
for {
select {
case <-avger.done:
avger.wg.Done()
avger.running = false
return
case <-avger.ticker.C:
power, ret := nvml.DeviceGetPowerUsage(avger.device)
if ret == nvml.SUCCESS {
avger.powerSum += float64(power) / 1000
avger.powerSamples += 1
}
}
}
}(pa)
}
func (pa *powerAverager) Get() float64 {
avg := float64(0)
if pa.powerSamples > 0 {
pa.ticker.Stop()
avg = pa.powerSum / float64(pa.powerSamples)
pa.powerSum = 0
pa.powerSamples = 0
pa.ticker.Reset(pa.interval)
}
return avg
}
func (pa *powerAverager) Close() {
pa.done <- true
pa.wg.Wait()
pa.running = false
}
func NewPowerAverager(device nvml.Device, interval time.Duration) (PowerAverager, error) {
pa := new(powerAverager)
pa.device = device
pa.interval = interval
pa.done = make(chan bool)
pa.powerSamples = 0
pa.powerSum = 0
pa.running = false
return pa, nil
}
type NvidiaCollectorDevice struct {
@@ -31,6 +107,8 @@ type NvidiaCollectorDevice struct {
excludeMetrics map[string]bool
tags map[string]string
meta map[string]string
powerInterval time.Duration
averager PowerAverager
}
type NvidiaCollector struct {
@@ -55,6 +133,7 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
m.config.ProcessMigDevices = false
m.config.UseUuidForMigDevices = false
m.config.UseSliceForMigDevices = false
m.config.AveragePowerInterval = ""
m.setup()
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
@@ -93,6 +172,16 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
return err
}
powerDur := time.Duration(0)
if len(m.config.AveragePowerInterval) > 0 {
d, err := time.ParseDuration(m.config.AveragePowerInterval)
if err != nil {
cclog.ComponentError(m.name, "Unable to parse average_power_interval ", m.config.AveragePowerInterval, ":", err.Error())
return err
}
powerDur = d
}
// For all GPUs
idx := 0
m.gpus = make([]NvidiaCollectorDevice, num_gpus)
@@ -197,6 +286,15 @@ func (m *NvidiaCollector) Init(config json.RawMessage) error {
g.excludeMetrics[e] = true
}
if powerDur > 0 {
a, err := NewPowerAverager(g.device, powerDur)
if err != nil {
cclog.ComponentError(m.name, "Failed to initialize power averager for device at index", i, ":", err.Error())
} else {
g.averager = a
}
}
// Increment the index for the next device
idx++
}
@@ -436,6 +534,21 @@ func readPerfState(device NvidiaCollectorDevice, output chan lp.CCMetric) error
return nil
}
func readPowerUsageAverage(device NvidiaCollectorDevice, output chan lp.CCMetric) error {
if !device.excludeMetrics["nv_power_usage_avg"] && device.averager != nil {
if !device.averager.IsRunning() {
device.averager.Start()
} else {
y, err := lp.New("nv_power_usage_avg", device.tags, device.meta, map[string]interface{}{"value": device.averager.Get()}, time.Now())
if err == nil {
y.AddMeta("unit", "watts")
output <- y
}
}
}
return nil
}
func readPowerUsage(device NvidiaCollectorDevice, output chan lp.CCMetric) error {
if !device.excludeMetrics["nv_power_usage"] {
// Retrieves power usage for this GPU in milliwatts and its associated circuitry (e.g. memory)
@@ -1022,95 +1135,100 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMetric)
if ret != nvml.SUCCESS {
name = "NoName"
}
-err = readMemoryInfo(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readMemoryInfo for device", name, "failed")
-}
+// err = readMemoryInfo(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readMemoryInfo for device", name, "failed")
+// }
-err = readUtilization(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readUtilization for device", name, "failed")
-}
+// err = readUtilization(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readUtilization for device", name, "failed")
+// }
-err = readTemp(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readTemp for device", name, "failed")
-}
+// err = readTemp(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readTemp for device", name, "failed")
+// }
-err = readFan(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readFan for device", name, "failed")
-}
+// err = readFan(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readFan for device", name, "failed")
+// }
-err = readEccMode(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readEccMode for device", name, "failed")
-}
+// err = readEccMode(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readEccMode for device", name, "failed")
+// }
-err = readPerfState(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readPerfState for device", name, "failed")
-}
+// err = readPerfState(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readPerfState for device", name, "failed")
+// }
err = readPowerUsage(device, output)
if err != nil {
cclog.ComponentDebug(m.name, "readPowerUsage for device", name, "failed")
}
-err = readClocks(device, output)
+err = readPowerUsageAverage(device, output)
if err != nil {
-cclog.ComponentDebug(m.name, "readClocks for device", name, "failed")
+cclog.ComponentDebug(m.name, "readPowerUsageAverage for device", name, "failed")
}
-err = readMaxClocks(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readMaxClocks for device", name, "failed")
-}
+// err = readClocks(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readClocks for device", name, "failed")
+// }
-err = readEccErrors(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readEccErrors for device", name, "failed")
-}
+// err = readMaxClocks(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readMaxClocks for device", name, "failed")
+// }
-err = readPowerLimit(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readPowerLimit for device", name, "failed")
-}
+// err = readEccErrors(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readEccErrors for device", name, "failed")
+// }
-err = readEncUtilization(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readEncUtilization for device", name, "failed")
-}
+// err = readPowerLimit(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readPowerLimit for device", name, "failed")
+// }
-err = readDecUtilization(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readDecUtilization for device", name, "failed")
-}
+// err = readEncUtilization(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readEncUtilization for device", name, "failed")
+// }
-err = readRemappedRows(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readRemappedRows for device", name, "failed")
-}
+// err = readDecUtilization(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readDecUtilization for device", name, "failed")
+// }
-err = readBarMemoryInfo(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readBarMemoryInfo for device", name, "failed")
-}
+// err = readRemappedRows(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readRemappedRows for device", name, "failed")
+// }
-err = readProcessCounts(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readProcessCounts for device", name, "failed")
-}
+// err = readBarMemoryInfo(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readBarMemoryInfo for device", name, "failed")
+// }
-err = readViolationStats(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readViolationStats for device", name, "failed")
-}
+// err = readProcessCounts(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readProcessCounts for device", name, "failed")
+// }
-err = readNVLinkStats(device, output)
-if err != nil {
-cclog.ComponentDebug(m.name, "readNVLinkStats for device", name, "failed")
-}
+// err = readViolationStats(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readViolationStats for device", name, "failed")
+// }
+// err = readNVLinkStats(device, output)
+// if err != nil {
+// cclog.ComponentDebug(m.name, "readNVLinkStats for device", name, "failed")
+// }
}
// Actual read loop over all attached Nvidia GPUs
@@ -1198,6 +1316,9 @@ func (m *NvidiaCollector) Read(interval time.Duration, output chan lp.CCMetric)
func (m *NvidiaCollector) Close() {
if m.init {
for i := 0; i < m.num_gpus; i++ {
m.gpus[i].averager.Close()
}
nvml.Shutdown()
m.init = false
}
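
Review note on the `Close()` hunk above: `m.gpus[i].averager` is only set when `average_power_interval` is configured, so the unconditional `averager.Close()` call would dereference a nil interface otherwise. An averager that was created but never started would also make `Close()` wait on a send that nobody receives. A guarded loop could look like the sketch below. It is a sketch under assumptions: `device`, `fakeAverager` and `closeAveragers` are illustrative names, and only the `IsRunning`/`Close` part of the `PowerAverager` interface is mirrored.

```go
package main

import "fmt"

// averager mirrors the IsRunning/Close part of the PowerAverager interface.
type averager interface {
	IsRunning() bool
	Close()
}

// device is a reduced stand-in for NvidiaCollectorDevice.
type device struct {
	averager averager // nil when averaging is not configured
}

// fakeAverager is a stand-in used only for this example.
type fakeAverager struct{ running, closed bool }

func (f *fakeAverager) IsRunning() bool { return f.running }
func (f *fakeAverager) Close()          { f.running, f.closed = false, true }

// closeAveragers closes only averagers that exist and are running,
// and reports how many were closed.
func closeAveragers(devs []device) int {
	n := 0
	for i := range devs {
		a := devs[i].averager
		if a != nil && a.IsRunning() {
			a.Close()
			n++
		}
	}
	return n
}

func main() {
	devs := []device{
		{},                          // averaging disabled
		{averager: &fakeAverager{}}, // configured, never started
		{averager: &fakeAverager{running: true}},
	}
	fmt.Println(closeAveragers(devs)) // prints 1
}
```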

@@ -25,7 +25,7 @@ CC_USER=clustercockpit
CC_GROUP=clustercockpit
CONF_DIR=/etc/cc-metric-collector
PID_FILE=/var/run/$NAME.pid
-DAEMON=/usr/sbin/$NAME
+DAEMON=/usr/bin/$NAME
CONF_FILE=${CONF_DIR}/cc-metric-collector.json
umask 0027

@@ -45,6 +45,9 @@ type HttpSinkConfig struct {
// Maximum number of retries to connect to the http server (default: 3)
MaxRetries int `json:"max_retries,omitempty"`
// Timestamp precision
Precision string `json:"precision,omitempty"`
}
type key_value_pair struct {
@@ -141,7 +144,7 @@ func (s *HttpSink) Write(m lp.CCMetric) error {
// Check that encoding worked
if err != nil {
-return fmt.Errorf("Encoding failed: %v", err)
+return fmt.Errorf("encoding failed: %v", err)
}
if s.config.flushDelay == 0 {
@@ -268,6 +271,7 @@ func NewHttpSink(name string, config json.RawMessage) (Sink, error) {
s.config.Timeout = "5s"
s.config.FlushDelay = "5s"
s.config.MaxRetries = 3
s.config.Precision = "ns"
cclog.ComponentDebug(s.name, "Init()")
// Read config
@@ -315,6 +319,19 @@ func NewHttpSink(name string, config json.RawMessage) (Sink, error) {
cclog.ComponentDebug(s.name, "Init(): flushDelay", t)
}
}
precision := influx.Nanosecond
if len(s.config.Precision) > 0 {
switch s.config.Precision {
case "s":
precision = influx.Second
case "ms":
precision = influx.Millisecond
case "us":
precision = influx.Microsecond
case "ns":
precision = influx.Nanosecond
}
}
// Create http client
s.client = &http.Client{
@@ -326,7 +343,7 @@ func NewHttpSink(name string, config json.RawMessage) (Sink, error) {
}
// Configure influx line protocol encoder
-s.encoder.SetPrecision(influx.Nanosecond)
+s.encoder.SetPrecision(precision)
s.extended_tag_list = make([]key_value_pair, 0)
return s, nil

@@ -18,7 +18,8 @@ The `http` sink uses POST requests to a HTTP server to submit the metrics in the
"timeout": "5s",
"idle_connection_timeout" : "5s",
"flush_delay": "2s",
-"batch_size": 1000
+"batch_size": 1000,
+"precision": "s"
}
}
```
@@ -34,3 +35,8 @@ The `http` sink uses POST requests to a HTTP server to submit the metrics in the
- `idle_connection_timeout`: Timeout for idle connections (default '120s'). Should be larger than the measurement interval to keep the connection open
- `flush_delay`: Batch all writes arriving during this duration (default '1s', batching can be disabled by setting it to 0)
- `batch_size`: Maximal batch size. If `batch_size` is reached before the end of `flush_delay`, the metrics are sent without further delay
- `precision`: Precision of the timestamp. Valid values are 's', 'ms', 'us' and 'ns'. (default is 'ns')
### Using HttpSink for communication with cc-metric-store
The cc-metric-store only accepts metrics with a timestamp precision in seconds, so it is required to set `"precision": "s"`.

@@ -25,3 +25,4 @@ The `nats` sink publishes all metrics into a NATS network. The publishing key is
- `user`: Username for basic authentication
- `password`: Password for basic authentication
- `meta_as_tags`: print all meta information as tags in the output (optional)