280 Commits

Author SHA1 Message Date
brinkcoder
f8f2475ae0 switch to cc-lib. docs: consistency for list 2025-03-05 01:35:21 +01:00
brinkcoder
5dfae0a48e add only_metrics, exclude_metrics, absoulte_values, diff_values and derived_values. docs: consistency for list 2025-03-05 01:34:50 +01:00
brinkcoder
1bd9d389ad add only_metrics, diff_values and derived_values. docs: clarify filter logic, consistency for list 2025-03-05 01:32:03 +01:00
brinkcoder
170a358d79 add only_metrics, diff_values and derived_values. docs: clarify filter logic, consistency for list 2025-03-05 01:30:41 +01:00
brinkcoder
0e7b4fc936 add only_metrics, exclude_metrics, interface_alias. docs: consistency 2025-03-05 01:15:18 +01:00
brinkcoder
478c5159f5 switch to cc-lib 2025-03-05 01:13:17 +01:00
brinkcoder
8de639be08 add only_metrics. docs: consistency 2025-03-05 01:12:46 +01:00
brinkcoder
04fa267d9d add only_metrics. docs: clarify usage of filtering, consistency for metric list and units 2025-03-05 01:11:59 +01:00
brinkcoder
2d792684ff add only_metrics 2025-03-05 01:07:26 +01:00
brinkcoder
9a2898a6a3 switch to cc-lib 2025-03-05 01:06:40 +01:00
brinkcoder
58e9fae966 add exclude_metrics, only_metrics. bugfix: check if ipmitool is present with -V. docs: clarifications 2025-03-05 01:06:17 +01:00
brinkcoder
636c3f312d add only_metrics, diff_values and derived_values 2025-03-05 01:04:06 +01:00
brinkcoder
4702ab1570 add only_metrics. docs: add units 2025-03-05 01:02:29 +01:00
brinkcoder
bcfe2b522a change README metric list from * to - and apply minor fixes 2025-03-05 01:00:21 +01:00
brinkcoder
5a26c37e0d add only_metrics and exclude_mounts 2025-03-05 00:59:44 +01:00
brinkcoder
75a51474b9 add only_metrics, diff_values and derived_values. docs: clarify usage 2025-03-05 00:58:29 +01:00
brinkcoder
f64182630b add only_metrics, diff_values and derived_values. docs: clarify usage 2025-03-05 00:58:06 +01:00
brinkcoder
6d9153b25e add only_metrics. docs: consistency for metric list and unit specification 2025-03-05 00:55:48 +01:00
brinkcoder
2914f42356 bugfix: output of freq changed to MHz instead of GHz. removed exclude_metrics (only one metric anyway) 2025-03-05 00:50:26 +01:00
brinkcoder
6c6fdd8ae7 small changes 2025-03-05 00:48:23 +01:00
brinkcoder
76e36373f1 switch to cc-lib 2025-03-05 00:45:45 +01:00
brinkcoder
16ed2c82e6 change README metric list from * to - and apply minor fixes 2025-03-05 00:45:10 +01:00
brinkcoder
4126132d64 change README metric list from * to - and apply minor fixes 2025-03-05 00:42:23 +01:00
brinkcoder
a001acd2b2 sort metrics in categories and remove ibstat_perfquery 2025-03-05 00:35:42 +01:00
Thomas Gruber
7840de7b82
Merge develop branch into main (#123)
* Add cpu_used (all-cpu_idle) to CpustatCollector

* Update cc-metric-collector.init

* Allow selection of timestamp precision in HttpSink

* Add comment about precision requirement for cc-metric-store

* Fix for API changes in gofish@v0.15.0

* Update requirements to latest version

* Read sensors through redfish

* Update golang toolchain to 1.21

* Remove stray error check

* Update main config in configuration.md

* Update Release action to use golang 1.22 stable release, no golang RPMs anymore

* Update runonce action to use golang 1.22 stable release, no golang RPMs anymore

* Update README.md

Use right JSON type in configuration

* Update sink's README

* Test whether ipmitool or ipmi-sensors can be executed without errors

* Little fixes to the prometheus sink (#115)

* Add uint64 to float64 cast option

* Add prometheus sink to the list of available sinks

* Add aggregated counters by gpu for nvlink errors

---------

Co-authored-by: Michael Schwarz <schwarz@uni-paderborn.de>

* Ccmessage migration (#119)

* Add cpu_used (all-cpu_idle) to CpustatCollector

* Update cc-metric-collector.init

* Allow selection of timestamp precision in HttpSink

* Add comment about precision requirement for cc-metric-store

* Fix for API changes in gofish@v0.15.0

* Update requirements to latest version

* Read sensors through redfish

* Update golang toolchain to 1.21

* Remove stray error check

* Update main config in configuration.md

* Update Release action to use golang 1.22 stable release, no golang RPMs anymore

* Update runonce action to use golang 1.22 stable release, no golang RPMs anymore

* Switch to CCMessage for all files.

---------

Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>

* Switch to ccmessage also for latest additions in nvidiaMetric

* New Message processor (#118)

* Add cpu_used (all-cpu_idle) to CpustatCollector

* Update cc-metric-collector.init

* Allow selection of timestamp precision in HttpSink

* Add comment about precision requirement for cc-metric-store

* Fix for API changes in gofish@v0.15.0

* Update requirements to latest version

* Read sensors through redfish

* Update golang toolchain to 1.21

* Remove stray error check

* Update main config in configuration.md

* Update Release action to use golang 1.22 stable release, no golang RPMs anymore

* Update runonce action to use golang 1.22 stable release, no golang RPMs anymore

* New message processor to check whether a message should be dropped or manipulate it in flight

* Create a copy of message before manipulation

---------

Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>

* Update collector's Makefile and go.mod/sum files

* Use message processor in router, all sinks and all receivers

* Add support for credential file (NKEY) to NATS sink and receiver

* Fix JSON keys in message processor configuration

* Update docs for message processor, router and the default router config file

* Add link to expr syntax and fix regex matching docs

* Update sample collectors

* Minor style change in collector manager

* Some helpers for ccTopology

* LIKWID collector: write log owner change only once

* Fix for metrics without units and reduce debugging messages for messageProcessor

* Use shorted hostname for hostname added by router

* Define default port for NATS

* CPUstat collector: only add unit for applicable metrics

* Add precision option to all sinks using Influx's encoder

* Add message processor to all sink documentation

* Add units to documentation of cpustat collector

---------

Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: oscarminus <me@oscarminus.de>
Co-authored-by: Michael Schwarz <schwarz@uni-paderborn.de>
2024-12-19 23:00:14 +01:00
Thomas Gruber
8e8be09ed9
Merge latest commits from develop to main branch (#114)
* Add cpu_used (all-cpu_idle) to CpustatCollector

* Update cc-metric-collector.init

* Allow selection of timestamp precision in HttpSink

* Add comment about precision requirement for cc-metric-store

* Fix for API changes in gofish@v0.15.0

* Update requirements to latest version

* Read sensors through redfish

* Update golang toolchain to 1.21

* Remove stray error check

* Update main config in configuration.md

* Update Release action to use golang 1.22 stable release, no golang RPMs anymore

* Update runonce action to use golang 1.22 stable release, no golang RPMs anymore

* Update README.md

Use right JSON type in configuration

* Update sink's README

* Test whether ipmitool or ipmi-sensors can be executed without errors

---------

Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
2024-11-20 16:22:39 +01:00
brinkcoder
c96021c7cc
Fix: Create lock file if it does not exist in likwidMetric.go (#120)
Co-authored-by: exterr2f <Robert.Externbrink@rub.de>
2024-11-14 16:20:47 +01:00
Thomas Gruber
8f336c1bb7
Update likwidMetric.md 2024-10-08 13:36:46 +02:00
Thomas Gruber
7d3f67f15b
Update likwidMetric.md 2024-10-07 14:09:09 +02:00
Thomas Gruber
2e7990f87d
Update likwidMetric.md 2024-04-18 13:14:32 +02:00
Thomas Gruber
6ab45dd3ec
Merge develop into main (#109)
* Add cpu_used (all-cpu_idle) to CpustatCollector

* Update to line-protocol/v2

* Update runonce.yml with Golang 1.20

* Update fsnotify in LIKWID Collector

* Use not a pointer to line-protocol.Encoder

* Simplify Makefile

* Use only as many arguments as required

* Allow sum function to handle non float types

* Allow values to be a slice of type float64, float32, int, int64, int32, bool

* Use generic function to simplify code

* Add missing case for type []int32

* Use generic function to compute minimum

* Use generic function to compute maximum

* Use generic function to compute average

* Add error value to sumAnyType

* Use generic function to compute median

* For older versions of go slices is not part of the installation

* Remove old entries from go.sum

* Use simpler sort function

* Compute metrics ib_total and ib_total_pkts

* Add aggregated metrics.
Add missing units

* Update likwidMetric.go

Fixes a potential bug when `fsnotify.NewWatcher()` fails with an error

* Completly avoid memory allocations in infinibandMetric read()

* Fixed initialization: Initalization and measurements should run in the same thread

* Add safe.directory to Release action

* Fix path after installation to /usr/bin after installation

* ioutil.ReadFile is deprecated: As of Go 1.16, this function simply calls os.ReadFile

* Switch to package slices from the golang 1.21 default library

* Read file line by line

* Read file line by line

* Read file line by line

* Use CamelCase

* Use CamelCase

* Fix function getNumaDomain, it always returned 0

* Avoid type conversion by using Atoi
Avoid copying structs by using pointer access
Increase readability with CamelCase variable names

* Add caching

* Cache CpuData

* Cleanup

* Use init function to initalize cache structure to avoid multi threading problems

* Reuse information from /proc/cpuinfo

* Avoid slice cloning. Directly use the cache

* Add DieList

* Add NumaDomainList and SMTList

* Cleanup

* Add comment

* Lookup core ID from /sys/devices/system/cpu, /proc/cpuinfo is not portable

* Lookup all information from /sys/devices/system/cpu, /proc/cpuinfo is not portable

* Correctly handle lists from /sys

* Add Simultaneous Multithreading siblings

* Replace deprecated thread_siblings_list by core_cpus_list

* Reduce number of required slices

* Allow to send total values per core, socket and node

* Send all metrics with same time stamp
calcEventsetMetrics does only computiation, counter measurement is done before

* Input parameters should be float64 when evaluating to float64

* Send all metrics with same time stamp
calcGlobalMetrics does only computiation, counter measurement is done before

* Remove unused variable gmresults

* Add comments

* Updated go packages

* Add build with golang 1.21

* Switch to checkout action version 4

* Switch to setup-go action version 4

* Add workflow_dispatch to allow manual run of workflow

* Add workflow_dispatch to allow manual run of workflow

* Add release build jobs to runonce.yml

* Switch to golang 1.20 for RHEL based distributions

* Use dnf to download golang

* Remove golang versions before 1.20

* Upgrade Ubuntu focal -> jammy

* Pipe golang tar package directly to tar

* Update golang version

* Fix Ubuntu version number

* Add links to ipmi and redfish receivers

* Fix http server addr format

* github.com/influxdata/line-protocol -> github.com/influxdata/line-protocol/v2/lineprotocol

* Corrected spelling

* Add some comments

* github.com/influxdata/line-protocol -> github.com/influxdata/line-protocol/v2/lineprotocol

* Allow other fields not only field "value"

* Add some basic debugging documentation

* Add some basic debugging documentation

* Use a lock for the flush timer

* Add tags in lexical order as required by AddTag()

* Only access meta data, when it gets used as tag

* Use slice to store lexialicly orderd key value pairs

* Increase golang version requirement to 1.20.

* Avoid package cmp to allow builds with golang v1.20

* Fix: Error NVML library not found did crash
cc-metric-collector with "SIGSEGV: segmentation violation"

* Add config option idle_timeout

* Add basic authentication support

* Add basic authentication support

* Avoid unneccessary memory allocations

* Add documentation for send_*_total values

* Use generic package maps to clone maps

* Reuse flush timer

* Add Influx client options

* Reuse ccTopology functionality

* Do not store unused topology information

* Add batch_size config

* Cleanup

* Use stype and stype-id for the NIC in NetstatCollector

* Wait for concurrent flush operations to finish

* Be more verbose in error messages

* Reverted previous changes.
Made the code to complex without much advantages

* Use line protocol encoder

* Go pkg update

* Stop flush timer, when immediatelly flushing

* Fix: Corrected unlock access to batch slice

* Add config option to specify whether to use GZip compression in influx write requests

* Add asynchron send of encoder metrics

* Use DefaultServeMux instead of github.com/gorilla/mux

* Add config option for HTTP keep-alives

* Be more strict, when parsing json

* Add config option for HTTP request timeout and Retry interval

* Allow more then one background send operation

* Fix %sysusers_create_package args (#108)

%sysusers_create_package requires two arguments. See: https://github.com/systemd/systemd/blob/main/src/rpm/macros.systemd.in#L165

* Add nfsiostat to list of collectors

---------

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Holger Obermaier <holgerob@gmx.de>
Co-authored-by: Obihörnchen <obihoernchende@gmail.com>
2023-12-04 12:21:26 +01:00
Thomas Gruber
195d0794b0
Merge develop branch into main (#106)
* Add cpu_used (all-cpu_idle) to CpustatCollector

* Update to line-protocol/v2

* Update runonce.yml with Golang 1.20

* Update fsnotify in LIKWID Collector

* Use not a pointer to line-protocol.Encoder

* Simplify Makefile

* Use only as many arguments as required

* Allow sum function to handle non float types

* Allow values to be a slice of type float64, float32, int, int64, int32, bool

* Use generic function to simplify code

* Add missing case for type []int32

* Use generic function to compute minimum

* Use generic function to compute maximum

* Use generic function to compute average

* Add error value to sumAnyType

* Use generic function to compute median

* For older versions of go slices is not part of the installation

* Remove old entries from go.sum

* Use simpler sort function

* Compute metrics ib_total and ib_total_pkts

* Add aggregated metrics.
Add missing units

* Update likwidMetric.go

Fixes a potential bug when `fsnotify.NewWatcher()` fails with an error

* Completly avoid memory allocations in infinibandMetric read()

* Fixed initialization: Initalization and measurements should run in the same thread

---------

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
2023-08-29 14:12:49 +02:00
fodinabor
ec570f884c
Use customcmd commands if they did not error. (#101)
* Merge develop and main (#99)

* InfiniBandCollector: Scale raw readings from octets to bytes

* Fix clock frequency coming from LikwidCollector and update docs

* Build DEB package for Ubuntu 20.04 for releases

* Fix memstat collector with numa_stats option

* Remove useless prints from MemstatCollector

* Replace ioutils with os and io (#87)

* Use lower case for error strings in RocmSmiCollector

* move maybe-usable-by-other-cc-components to pkg. Fix all files to use the new paths (#88)

* Add collector for monitoring the execution of cc-metric-collector itself (#81)

* Add collector to monitor execution of cc-metric-collector itself

* Register SelfCollector

* Fix import paths for moved packages

* Check if at least one CPU with frequency information was detected

* Correct type: /proc/stats -> /proc/stat

* Update README.md

* Run ipmitool asynchron.  Improved error handling.

* Corrected some typos

* Add running average power limit (RAPL) metric collector

* Add running average power limit (RAPL) metric collector

* Do not mess up with the orignal configuration

* * Corrected json config in numastatsMetric.md
* Added some debug output to numastatsMetric.go

* Fixed computing number of physical packages for non continous physical package IDs (e.g. on Ampere Altra Q80-30)

* Fix kernel panic for receiver config with missing receiver type

* Add receiver to gather remote IPMI sensor metrics

* Added config option to add ipmi-sensors command line options

* Add documentaion for IPMI receiver

* Update to latest version of included go modules

* Add go.mod to App dependency

* Try to use common metric tags across hardware vendors

* Add IPMI metric: current

* remove prefix enumeration like 01-...

* Add IPMI receiver example configuration to receivers.json

* Minimal formating changes

* Add hostlist package

* Added tests for hostlist Expand()

* Use package hostlist to expand a host list

* Use package hostlist to expand a host list

* Some servers return "ConsumedPowerWatt":65535 instead of "ConsumedPowerWatt":null

* Updated to latest package versions

* Do not allow unknown fields in JSON configuration file

* Add workflow to customize packages to docs

* NFS I/O Stats Collector (#91)

* Initial version

* Delete values for vanished mount points and  comments

* Fix for Likwid collector (#95)

* Run LIKWID in separate thread and check metric type

* Change LIKWID collector documentation to use 'type' instead of 'scope'

* Re-initialize LIKWID after one read is missing due to lock toggle

* Register cc-metric-collector at Zenodo (#93)

* Add initial version of Zenodo project file

* Orcid ID added

* Update .zenodo.json

Co-authored-by: Holger Obermaier <holger.obermaier@kit.edu>

* Update ipmiMetric.go

* Use latest LIKWID version for builds

* Update README.md

* Remove development stuff from Makefile

* Add Requires(pre) to RPM SPEC file

* Use curly brackets in packaging make targets

* Fix for LIKWID collector with separate measurement thread and inotify watcher on the LIKWID lock (#97)

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>

* Update likwid_perfgroup_to_cc_config.py

* Use customcmd commands if they did not error.

---------

Co-authored-by: Thomas Gruber <Thomas.Roehl@googlemail.com>
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
2023-02-28 12:02:01 +01:00
Thomas Gruber
b0423b842d
Merge branch 'main' into develop 2022-12-20 13:02:31 +01:00
Thomas Gruber
6c10c9741a
Fix for LIKWID collector with separate measurement thread and inotify watcher on the LIKWID lock (#97) 2022-12-20 12:59:33 +01:00
Thomas Roehl
2bd386dae7 Use latest LIKWID version for builds 2022-12-14 17:43:41 +01:00
Thomas Gruber
162cce0fda
Merge develop branch into main (#96)
* InfiniBandCollector: Scale raw readings from octets to bytes

* Fix clock frequency coming from LikwidCollector and update docs

* Build DEB package for Ubuntu 20.04 for releases

* Fix memstat collector with numa_stats option

* Remove useless prints from MemstatCollector

* Replace ioutils with os and io (#87)

* Use lower case for error strings in RocmSmiCollector

* move maybe-usable-by-other-cc-components to pkg. Fix all files to use the new paths (#88)

* Add collector for monitoring the execution of cc-metric-collector itself (#81)

* Add collector to monitor execution of cc-metric-collector itself

* Register SelfCollector

* Fix import paths for moved packages

* Check if at least one CPU with frequency information was detected

* Correct type: /proc/stats -> /proc/stat

* Update README.md

* Run ipmitool asynchron.  Improved error handling.

* Corrected some typos

* Add running average power limit (RAPL) metric collector

* Add running average power limit (RAPL) metric collector

* Do not mess up with the orignal configuration

* * Corrected json config in numastatsMetric.md
* Added some debug output to numastatsMetric.go

* Fixed computing number of physical packages for non continous physical package IDs (e.g. on Ampere Altra Q80-30)

* Fix kernel panic for receiver config with missing receiver type

* Add receiver to gather remote IPMI sensor metrics

* Added config option to add ipmi-sensors command line options

* Add documentaion for IPMI receiver

* Update to latest version of included go modules

* Add go.mod to App dependency

* Try to use common metric tags across hardware vendors

* Add IPMI metric: current

* remove prefix enumeration like 01-...

* Add IPMI receiver example configuration to receivers.json

* Minimal formating changes

* Add hostlist package

* Added tests for hostlist Expand()

* Use package hostlist to expand a host list

* Use package hostlist to expand a host list

* Some servers return "ConsumedPowerWatt":65535 instead of "ConsumedPowerWatt":null

* Updated to latest package versions

* Do not allow unknown fields in JSON configuration file

* Add workflow to customize packages to docs

* NFS I/O Stats Collector (#91)

* Initial version

* Delete values for vanished mount points and  comments

* Fix for Likwid collector (#95)

* Run LIKWID in separate thread and check metric type

* Change LIKWID collector documentation to use 'type' instead of 'scope'

* Re-initialize LIKWID after one read is missing due to lock toggle

* Register cc-metric-collector at Zenodo (#93)

* Add initial version of Zenodo project file

* Orcid ID added

* Update .zenodo.json

Co-authored-by: Holger Obermaier <holger.obermaier@kit.edu>

* Update ipmiMetric.go

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
2022-12-14 17:02:39 +01:00
Thomas Gruber
155d1b9acf
Update ipmiMetric.go 2022-12-14 17:00:09 +01:00
Thomas Gruber
c9b9752b6a
Merge branch 'main' into develop 2022-12-14 16:58:12 +01:00
Thomas Gruber
efd4f5feb4
Fix for Likwid collector (#95)
* Run LIKWID in separate thread and check metric type

* Change LIKWID collector documentation to use 'type' instead of 'scope'

* Re-initialize LIKWID after one read is missing due to lock toggle
2022-12-14 16:53:08 +01:00
Thomas Gruber
a1f4dd6a6c
NFS I/O Stats Collector (#91)
* Initial version

* Delete values for vanished mount points and  comments
2022-12-14 16:52:53 +01:00
Holger Obermaier
5918f96fd8 Minimal formating changes 2022-11-24 09:48:44 +01:00
Holger Obermaier
7bb80780e0 Fixed computing number of physical packages for non continous physical package IDs (e.g. on Ampere Altra Q80-30) 2022-11-16 14:58:11 +01:00
Holger Obermaier
e66d52bb32 * Corrected json config in numastatsMetric.md
* Added some debug output to numastatsMetric.go
2022-11-16 14:10:25 +01:00
Holger Obermaier
9840d0193d Do not mess up with the orignal configuration 2022-11-16 09:37:40 +01:00
Holger Obermaier
ce7eef8d30 Add running average power limit (RAPL) metric collector 2022-11-15 17:15:27 +01:00
Holger Obermaier
92e45ca62c Add running average power limit (RAPL) metric collector 2022-11-15 17:09:26 +01:00
Holger Obermaier
fd10a279fc Corrected some typos 2022-11-14 09:35:02 +01:00
Holger Obermaier
9e63d0ea59 Run ipmitool asynchron. Improved error handling. 2022-11-11 16:16:14 +01:00
Holger Obermaier
deb1bcfa2f Correct type: /proc/stats -> /proc/stat 2022-10-13 15:01:39 +02:00