Compare commits

..

16 Commits

Author SHA1 Message Date
Thomas Gruber
94b086acf0 Develop (#102)
* InfiniBandCollector: Scale raw readings from octets to bytes

* Fix clock frequency coming from LikwidCollector and update docs

* Build DEB package for Ubuntu 20.04 for releases

* Fix memstat collector with numa_stats option

* Remove useless prints from MemstatCollector

* Replace ioutils with os and io (#87)

* Use lower case for error strings in RocmSmiCollector

* move maybe-usable-by-other-cc-components to pkg. Fix all files to use the new paths (#88)

* Add collector for monitoring the execution of cc-metric-collector itself (#81)

* Add collector to monitor execution of cc-metric-collector itself

* Register SelfCollector

* Fix import paths for moved packages

* Check if at least one CPU with frequency information was detected

* Correct type: /proc/stats -> /proc/stat

* Update README.md

* Run ipmitool asynchron.  Improved error handling.

* Corrected some typos

* Add running average power limit (RAPL) metric collector

* Add running average power limit (RAPL) metric collector

* Do not mess up with the orignal configuration

* * Corrected json config in numastatsMetric.md
* Added some debug output to numastatsMetric.go

* Fixed computing number of physical packages for non continous physical package IDs (e.g. on Ampere Altra Q80-30)

* Fix kernel panic for receiver config with missing receiver type

* Add receiver to gather remote IPMI sensor metrics

* Added config option to add ipmi-sensors command line options

* Add documentaion for IPMI receiver

* Update to latest version of included go modules

* Add go.mod to App dependency

* Try to use common metric tags across hardware vendors

* Add IPMI metric: current

* remove prefix enumeration like 01-...

* Add IPMI receiver example configuration to receivers.json

* Minimal formating changes

* Add hostlist package

* Added tests for hostlist Expand()

* Use package hostlist to expand a host list

* Use package hostlist to expand a host list

* Some servers return "ConsumedPowerWatt":65535 instead of "ConsumedPowerWatt":null

* Updated to latest package versions

* Do not allow unknown fields in JSON configuration file

* Add workflow to customize packages to docs

* NFS I/O Stats Collector (#91)

* Initial version

* Delete values for vanished mount points and  comments

* Fix for Likwid collector (#95)

* Run LIKWID in separate thread and check metric type

* Change LIKWID collector documentation to use 'type' instead of 'scope'

* Re-initialize LIKWID after one read is missing due to lock toggle

* Register cc-metric-collector at Zenodo (#93)

* Add initial version of Zenodo project file

* Orcid ID added

* Update .zenodo.json

Co-authored-by: Holger Obermaier <holger.obermaier@kit.edu>

* Update ipmiMetric.go

* Use latest LIKWID version for builds

* Update README.md

* Remove development stuff from Makefile

* Add Requires(pre) to RPM SPEC file

* Use curly brackets in packaging make targets

* Fix for LIKWID collector with separate measurement thread and inotify watcher on the LIKWID lock (#97)

* Debian does not like underscores in the version

* Update cc-metric-collector.service

Remove dependency services not used by cc-metric-collector

* Add new requirements to module file

* Use customcmd commands if they did not error. (#101)

* Merge develop and main (#99)

* InfiniBandCollector: Scale raw readings from octets to bytes

* Fix clock frequency coming from LikwidCollector and update docs

* Build DEB package for Ubuntu 20.04 for releases

* Fix memstat collector with numa_stats option

* Remove useless prints from MemstatCollector

* Replace ioutils with os and io (#87)

* Use lower case for error strings in RocmSmiCollector

* move maybe-usable-by-other-cc-components to pkg. Fix all files to use the new paths (#88)

* Add collector for monitoring the execution of cc-metric-collector itself (#81)

* Add collector to monitor execution of cc-metric-collector itself

* Register SelfCollector

* Fix import paths for moved packages

* Check if at least one CPU with frequency information was detected

* Correct type: /proc/stats -> /proc/stat

* Update README.md

* Run ipmitool asynchron.  Improved error handling.

* Corrected some typos

* Add running average power limit (RAPL) metric collector

* Add running average power limit (RAPL) metric collector

* Do not mess up with the orignal configuration

* * Corrected json config in numastatsMetric.md
* Added some debug output to numastatsMetric.go

* Fixed computing number of physical packages for non continous physical package IDs (e.g. on Ampere Altra Q80-30)

* Fix kernel panic for receiver config with missing receiver type

* Add receiver to gather remote IPMI sensor metrics

* Added config option to add ipmi-sensors command line options

* Add documentaion for IPMI receiver

* Update to latest version of included go modules

* Add go.mod to App dependency

* Try to use common metric tags across hardware vendors

* Add IPMI metric: current

* remove prefix enumeration like 01-...

* Add IPMI receiver example configuration to receivers.json

* Minimal formating changes

* Add hostlist package

* Added tests for hostlist Expand()

* Use package hostlist to expand a host list

* Use package hostlist to expand a host list

* Some servers return "ConsumedPowerWatt":65535 instead of "ConsumedPowerWatt":null

* Updated to latest package versions

* Do not allow unknown fields in JSON configuration file

* Add workflow to customize packages to docs

* NFS I/O Stats Collector (#91)

* Initial version

* Delete values for vanished mount points and  comments

* Fix for Likwid collector (#95)

* Run LIKWID in separate thread and check metric type

* Change LIKWID collector documentation to use 'type' instead of 'scope'

* Re-initialize LIKWID after one read is missing due to lock toggle

* Register cc-metric-collector at Zenodo (#93)

* Add initial version of Zenodo project file

* Orcid ID added

* Update .zenodo.json

Co-authored-by: Holger Obermaier <holger.obermaier@kit.edu>

* Update ipmiMetric.go

* Use latest LIKWID version for builds

* Update README.md

* Remove development stuff from Makefile

* Add Requires(pre) to RPM SPEC file

* Use curly brackets in packaging make targets

* Fix for LIKWID collector with separate measurement thread and inotify watcher on the LIKWID lock (#97)

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>

* Update likwid_perfgroup_to_cc_config.py

* Use customcmd commands if they did not error.

---------

Co-authored-by: Thomas Gruber <Thomas.Roehl@googlemail.com>
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>

---------

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
Co-authored-by: fodinabor <5982050+fodinabor@users.noreply.github.com>
2023-03-20 15:17:24 +01:00
Thomas Gruber
abd49a377c Update likwid_perfgroup_to_cc_config.py 2023-01-26 10:21:45 +07:00
Thomas Gruber
84e019c693 Merge develop and main (#99)
* InfiniBandCollector: Scale raw readings from octets to bytes

* Fix clock frequency coming from LikwidCollector and update docs

* Build DEB package for Ubuntu 20.04 for releases

* Fix memstat collector with numa_stats option

* Remove useless prints from MemstatCollector

* Replace ioutils with os and io (#87)

* Use lower case for error strings in RocmSmiCollector

* move maybe-usable-by-other-cc-components to pkg. Fix all files to use the new paths (#88)

* Add collector for monitoring the execution of cc-metric-collector itself (#81)

* Add collector to monitor execution of cc-metric-collector itself

* Register SelfCollector

* Fix import paths for moved packages

* Check if at least one CPU with frequency information was detected

* Correct type: /proc/stats -> /proc/stat

* Update README.md

* Run ipmitool asynchron.  Improved error handling.

* Corrected some typos

* Add running average power limit (RAPL) metric collector

* Add running average power limit (RAPL) metric collector

* Do not mess up with the orignal configuration

* * Corrected json config in numastatsMetric.md
* Added some debug output to numastatsMetric.go

* Fixed computing number of physical packages for non continous physical package IDs (e.g. on Ampere Altra Q80-30)

* Fix kernel panic for receiver config with missing receiver type

* Add receiver to gather remote IPMI sensor metrics

* Added config option to add ipmi-sensors command line options

* Add documentaion for IPMI receiver

* Update to latest version of included go modules

* Add go.mod to App dependency

* Try to use common metric tags across hardware vendors

* Add IPMI metric: current

* remove prefix enumeration like 01-...

* Add IPMI receiver example configuration to receivers.json

* Minimal formating changes

* Add hostlist package

* Added tests for hostlist Expand()

* Use package hostlist to expand a host list

* Use package hostlist to expand a host list

* Some servers return "ConsumedPowerWatt":65535 instead of "ConsumedPowerWatt":null

* Updated to latest package versions

* Do not allow unknown fields in JSON configuration file

* Add workflow to customize packages to docs

* NFS I/O Stats Collector (#91)

* Initial version

* Delete values for vanished mount points and  comments

* Fix for Likwid collector (#95)

* Run LIKWID in separate thread and check metric type

* Change LIKWID collector documentation to use 'type' instead of 'scope'

* Re-initialize LIKWID after one read is missing due to lock toggle

* Register cc-metric-collector at Zenodo (#93)

* Add initial version of Zenodo project file

* Orcid ID added

* Update .zenodo.json

Co-authored-by: Holger Obermaier <holger.obermaier@kit.edu>

* Update ipmiMetric.go

* Use latest LIKWID version for builds

* Update README.md

* Remove development stuff from Makefile

* Add Requires(pre) to RPM SPEC file

* Use curly brackets in packaging make targets

* Fix for LIKWID collector with separate measurement thread and inotify watcher on the LIKWID lock (#97)

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
2022-12-20 13:08:04 +01:00
Thomas Gruber
ff0833c413 Push LIKWID collector fix into main (#98)
* InfiniBandCollector: Scale raw readings from octets to bytes

* Fix clock frequency coming from LikwidCollector and update docs

* Build DEB package for Ubuntu 20.04 for releases

* Fix memstat collector with numa_stats option

* Remove useless prints from MemstatCollector

* Replace ioutils with os and io (#87)

* Use lower case for error strings in RocmSmiCollector

* move maybe-usable-by-other-cc-components to pkg. Fix all files to use the new paths (#88)

* Add collector for monitoring the execution of cc-metric-collector itself (#81)

* Add collector to monitor execution of cc-metric-collector itself

* Register SelfCollector

* Fix import paths for moved packages

* Check if at least one CPU with frequency information was detected

* Correct type: /proc/stats -> /proc/stat

* Update README.md

* Run ipmitool asynchron.  Improved error handling.

* Corrected some typos

* Add running average power limit (RAPL) metric collector

* Add running average power limit (RAPL) metric collector

* Do not mess up with the orignal configuration

* * Corrected json config in numastatsMetric.md
* Added some debug output to numastatsMetric.go

* Fixed computing number of physical packages for non continous physical package IDs (e.g. on Ampere Altra Q80-30)

* Fix kernel panic for receiver config with missing receiver type

* Add receiver to gather remote IPMI sensor metrics

* Added config option to add ipmi-sensors command line options

* Add documentaion for IPMI receiver

* Update to latest version of included go modules

* Add go.mod to App dependency

* Try to use common metric tags across hardware vendors

* Add IPMI metric: current

* remove prefix enumeration like 01-...

* Add IPMI receiver example configuration to receivers.json

* Minimal formating changes

* Add hostlist package

* Added tests for hostlist Expand()

* Use package hostlist to expand a host list

* Use package hostlist to expand a host list

* Some servers return "ConsumedPowerWatt":65535 instead of "ConsumedPowerWatt":null

* Updated to latest package versions

* Do not allow unknown fields in JSON configuration file

* Add workflow to customize packages to docs

* NFS I/O Stats Collector (#91)

* Initial version

* Delete values for vanished mount points and  comments

* Fix for Likwid collector (#95)

* Run LIKWID in separate thread and check metric type

* Change LIKWID collector documentation to use 'type' instead of 'scope'

* Re-initialize LIKWID after one read is missing due to lock toggle

* Register cc-metric-collector at Zenodo (#93)

* Add initial version of Zenodo project file

* Orcid ID added

* Update .zenodo.json

Co-authored-by: Holger Obermaier <holger.obermaier@kit.edu>

* Update ipmiMetric.go

* Use latest LIKWID version for builds

* Update README.md

* Remove development stuff from Makefile

* Add Requires(pre) to RPM SPEC file

* Use curly brackets in packaging make targets

* Fix for LIKWID collector with separate measurement thread and inotify watcher on the LIKWID lock (#97)

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
2022-12-20 13:04:24 +01:00
Thomas Gruber
162cce0fda Merge develop branch into main (#96)
* InfiniBandCollector: Scale raw readings from octets to bytes

* Fix clock frequency coming from LikwidCollector and update docs

* Build DEB package for Ubuntu 20.04 for releases

* Fix memstat collector with numa_stats option

* Remove useless prints from MemstatCollector

* Replace ioutils with os and io (#87)

* Use lower case for error strings in RocmSmiCollector

* move maybe-usable-by-other-cc-components to pkg. Fix all files to use the new paths (#88)

* Add collector for monitoring the execution of cc-metric-collector itself (#81)

* Add collector to monitor execution of cc-metric-collector itself

* Register SelfCollector

* Fix import paths for moved packages

* Check if at least one CPU with frequency information was detected

* Correct type: /proc/stats -> /proc/stat

* Update README.md

* Run ipmitool asynchron.  Improved error handling.

* Corrected some typos

* Add running average power limit (RAPL) metric collector

* Add running average power limit (RAPL) metric collector

* Do not mess up with the orignal configuration

* * Corrected json config in numastatsMetric.md
* Added some debug output to numastatsMetric.go

* Fixed computing number of physical packages for non continous physical package IDs (e.g. on Ampere Altra Q80-30)

* Fix kernel panic for receiver config with missing receiver type

* Add receiver to gather remote IPMI sensor metrics

* Added config option to add ipmi-sensors command line options

* Add documentaion for IPMI receiver

* Update to latest version of included go modules

* Add go.mod to App dependency

* Try to use common metric tags across hardware vendors

* Add IPMI metric: current

* remove prefix enumeration like 01-...

* Add IPMI receiver example configuration to receivers.json

* Minimal formating changes

* Add hostlist package

* Added tests for hostlist Expand()

* Use package hostlist to expand a host list

* Use package hostlist to expand a host list

* Some servers return "ConsumedPowerWatt":65535 instead of "ConsumedPowerWatt":null

* Updated to latest package versions

* Do not allow unknown fields in JSON configuration file

* Add workflow to customize packages to docs

* NFS I/O Stats Collector (#91)

* Initial version

* Delete values for vanished mount points and  comments

* Fix for Likwid collector (#95)

* Run LIKWID in separate thread and check metric type

* Change LIKWID collector documentation to use 'type' instead of 'scope'

* Re-initialize LIKWID after one read is missing due to lock toggle

* Register cc-metric-collector at Zenodo (#93)

* Add initial version of Zenodo project file

* Orcid ID added

* Update .zenodo.json

Co-authored-by: Holger Obermaier <holger.obermaier@kit.edu>

* Update ipmiMetric.go

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
2022-12-14 17:02:39 +01:00
Thomas Gruber
f0da07310b Update README.md 2022-11-04 14:53:08 +01:00
Thomas Gruber
0f35469168 Update httpSink.md 2022-11-04 14:52:05 +01:00
Thomas Roehl
e79601e2e8 Try fixing DEB package 2022-10-13 16:49:58 +02:00
Thomas Roehl
317d36c9dd Try fixing DEB package 2022-10-13 16:46:54 +02:00
Thomas Roehl
821d104656 Try fixing DEB package 2022-10-13 16:42:04 +02:00
Thomas Gruber
be20f956c2 Add latest development to main branch (#89)
* InfiniBandCollector: Scale raw readings from octets to bytes

* Fix clock frequency coming from LikwidCollector and update docs

* Build DEB package for Ubuntu 20.04 for releases

* Fix memstat collector with numa_stats option

* Remove useless prints from MemstatCollector

* Replace ioutils with os and io (#87)

* Use lower case for error strings in RocmSmiCollector

* move maybe-usable-by-other-cc-components to pkg. Fix all files to use the new paths (#88)

* Add collector for monitoring the execution of cc-metric-collector itself (#81)

* Add collector to monitor execution of cc-metric-collector itself

* Register SelfCollector

* Fix import paths for moved packages
2022-10-10 12:23:51 +02:00
Thomas Gruber
5b6a2b9018 Merge latest fixed from develop to main (#85)
* InfiniBandCollector: Scale raw readings from octets to bytes

* Fix clock frequency coming from LikwidCollector and update docs
2022-09-12 12:54:40 +02:00
Thomas Roehl
3438972237 Merge branch 'develop' into main 2022-09-07 15:11:26 +02:00
oscarminus
88fabc2e83 cpustatMetric.go: Use derived values instead of absolute values (#83)
* cpustatMetric.go: Use derived values instead of absolute values

  The values in /proc/stat are absolute counters related to the boot
  time of the system. To obtain a utilization of the CPU, the changes
  in the counters must be derived according to time. To take only the
  absolute values leads to the fact that changes in the utilization,
  straight with larger values, do not become visible.

* Add new collector for /proc/schedstat

  The `schedstat` collector reads data from /proc/schedstat and calculates
  a load value, separated by hwthread. This might be useful to detect bad
  cpu pinning on shared nodes etc.

Co-authored-by: Michael Schwarz <post@michael-schwarz.name>
2022-09-07 14:09:29 +02:00
Thomas Gruber
b3c27e0af5 Merge latest development changes (#80)
* Cleanup: Remove unused code

* Use Golang duration parser for 'interval' and 'duration'
 in main config

* Update handling of LIKWID headers. Download only if not already present in the system. Fixes #73

* Units with cc-units (#64)

* Add option to normalize units with cc-unit

* Add unit conversion to router

* Add option to change unit prefix in the router

* Add to MetricRouter README

* Add order of operations in router to README

* Use second add_tags/del_tags only if metric gets renamed

* Skip disks in DiskstatCollector that have size=0

* Check readability of sensor files in TempCollector

* Fix for --once option

* Rename `cpu` type to `hwthread` (#69)

* Rename 'cpu' type to 'hwthread' to avoid naming clashes with MetricStore and CC-Webfrontend

* Collectors in parallel (#74)

* Provide info to CollectorManager whether the collector can be executed in parallel with others

* Split serial and parallel collectors. Read in parallel first

* Update NvidiaCollector with new metrics, MIG and NvLink support (#75)

* CC topology module update (#76)

* Rename CPU to hardware thread, write some comments

* Do renaming in other parts

* Remove CpuList and SocketList function from metricCollector. Available in ccTopology

* Option to use MIG UUID as subtype-id in NvidiaCollector

* Option to use MIG slice name as subtype-id in NvidiaCollector

* MetricRouter: Fix JSON in README

* Fix for Github Action to really use the selected version

* Remove Ganglia installation in runonce Action and add Go 1.18

* Fix daemon options in init script

* Add separate go.mod files to use it with deprecated 1.16

* Minor updates for Makefiles

* fix string comparison

* AMD ROCm SMI collector (#77)

* Add collector for AMD ROCm SMI metrics

* Fix import path

* Fix imports

* Remove Board Number

* store GPU index explicitly

* Remove board number from description

* Use http instead of ftp to download likwid

* Fix serial number in rocmCollector

* Improved http sink (#78)

* automatic flush in NatsSink

* tweak default options of HttpSink

* shorter cirt. section and retries for HttpSink

* fix error handling

* Remove file added by mistake.

* Use http instead of ftp to download likwid

* Fix serial number in rocmCollector

Co-authored-by: Thomas Roehl <thomas.roehl@fau.de>

* Fix: When sending metrics failed the batch size could be exceeded

* Improved dropping of metrics failed to send

* Add memstats and topprocs metric

* Updated to latest modules

* Check that at least one sink is running

* Add drop rate, when send buffer is full

* Allow only one timer at a time

* Use mutex to ensure only on flush timer is running

* Fix for NvidiaCollector when devices are not in MiG mode

* Remove Golang version 1.16 an 1.17 from Action. Latest commits require Golang 1.18

* Use Golang 1.18 in Release action to build RPMs

* Change unit of CpufreqCollector to Hz. That's what the sysfs outputs

* Make wget quiet in Release action to reduce log size

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Lou <lou.knauer@gmx.de>
2022-07-13 10:09:49 +02:00
Thomas Roehl
2adf9484a3 Redo fix for NvidiaCollector and MiG. Got lost somehow 2022-07-12 12:31:24 +02:00
45 changed files with 2255 additions and 1075 deletions

29
.zenodo.json Normal file
View File

@@ -0,0 +1,29 @@
{
"title": "cc-metric-collector",
"description": "Monitoring agent for ClusterCockpit.",
"creators": [
{
"affiliation": "Regionales Rechenzentrum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg",
"name": "Thomas Gruber",
"orcid": "0000-0001-5560-6964"
},
{
"affiliation": "Steinbuch Centre for Computing, Karlsruher Institut für Technologie",
"name": "Holger Obermaier",
"orcid": "0000-0002-6830-6626"
}
],
"upload_type": "software",
"license": "MIT",
"access_right": "open",
"keywords": [
"performance-monitoring",
"cluster-monitoring",
"open-source"
],
"communities": [
{
"identifier": "clustercockpit"
}
]
}

View File

@@ -22,7 +22,7 @@ GOBIN = $(shell which go)
.PHONY: all
all: $(APP)
$(APP): $(GOSRC)
$(APP): $(GOSRC) go.mod
make -C collectors
$(GOBIN) get
$(GOBIN) build -o $(APP) $(GOSRC_APP)
@@ -84,7 +84,7 @@ RPM: scripts/cc-metric-collector.spec
@COMMITISH="HEAD"
@VERS=$$(git describe --tags $${COMMITISH})
@VERS=$${VERS#v}
@VERS=$$(echo $$VERS | sed -e s+'-'+'_'+g)
@VERS=$$(echo $${VERS} | sed -e s+'-'+'_'+g)
@eval $$(rpmspec --query --queryformat "NAME='%{name}' VERSION='%{version}' RELEASE='%{release}' NVR='%{NVR}' NVRA='%{NVRA}'" --define="VERS $${VERS}" "$${SPECFILE}")
@PREFIX="$${NAME}-$${VERSION}"
@FORMAT="tar.gz"
@@ -96,10 +96,8 @@ RPM: scripts/cc-metric-collector.spec
@if [[ "$${GITHUB_ACTIONS}" == true ]]; then
@ RPMFILE="$${RPMDIR}/$${ARCH}/$${NVRA}.rpm"
@ SRPMFILE="$${SRPMDIR}/$${NVR}.src.rpm"
@ echo "RPM: $${RPMFILE}"
@ echo "SRPM: $${SRPMFILE}"
@ echo "::set-output name=SRPM::$${SRPMFILE}"
@ echo "::set-output name=RPM::$${RPMFILE}"
@ echo "SRPM=$${SRPMFILE}" >> $${GITHUB_OUTPUT}
@ echo "RPM=$${RPMFILE}" >> $${GITHUB_OUTPUT}
@fi
.PHONY: DEB
@@ -108,21 +106,24 @@ DEB: scripts/cc-metric-collector.deb.control $(APP)
@WORKSPACE=$${PWD}/.dpkgbuild
@DEBIANDIR=$${WORKSPACE}/debian
@DEBIANBINDIR=$${WORKSPACE}/DEBIAN
@mkdir --parents --verbose $$WORKSPACE $$DEBIANBINDIR
@mkdir --parents --verbose $${WORKSPACE} $${DEBIANBINDIR}
#@mkdir --parents --verbose $$DEBIANDIR
@CONTROLFILE="$${BASEDIR}/scripts/cc-metric-collector.deb.control"
@COMMITISH="HEAD"
@VERS=$$(git describe --tags --abbrev=0 $${COMMITISH})
@if [ -z "$${VERS}" ]; then VERS=${GITHUB_REF_NAME}; fi
@VERS=$${VERS#v}
@VERS=$$(echo $$VERS | sed -e s+'-'+'_'+g)
@ARCH=$$(uname -m)
@ARCH=$$(echo $$ARCH | sed -e s+'_'+'-'+g)
@ARCH=$$(echo $${ARCH} | sed -e s+'_'+'-'+g)
@if [ "$${ARCH}" = "x86-64" ]; then ARCH=amd64; fi
@PREFIX="$${NAME}-$${VERSION}_$${ARCH}"
@SIZE_BYTES=$$(du -bcs --exclude=.dpkgbuild "$$WORKSPACE"/ | awk '{print $$1}' | head -1 | sed -e 's/^0\+//')
@SIZE="$$(awk -v size="$$SIZE_BYTES" 'BEGIN {print (size/1024)+1}' | awk '{print int($$0)}')"
#@sed -e s+"{VERSION}"+"$$VERS"+g -e s+"{INSTALLED_SIZE}"+"$$SIZE"+g -e s+"{ARCH}"+"$$ARCH"+g $$CONTROLFILE > $${DEBIANDIR}/control
@sed -e s+"{VERSION}"+"$$VERS"+g -e s+"{INSTALLED_SIZE}"+"$$SIZE"+g -e s+"{ARCH}"+"$$ARCH"+g $$CONTROLFILE > $${DEBIANBINDIR}/control
@SIZE_BYTES=$$(du -bcs --exclude=.dpkgbuild "$${WORKSPACE}"/ | awk '{print $$1}' | head -1 | sed -e 's/^0\+//')
@SIZE="$$(awk -v size="$${SIZE_BYTES}" 'BEGIN {print (size/1024)+1}' | awk '{print int($$0)}')"
@sed -e s+"{VERSION}"+"$${VERS}"+g -e s+"{INSTALLED_SIZE}"+"$${SIZE}"+g -e s+"{ARCH}"+"$${ARCH}"+g $${CONTROLFILE} > $${DEBIANBINDIR}/control
@make PREFIX=$${WORKSPACE} install
@DEB_FILE="cc-metric-collector_$${VERS}_$${ARCH}.deb"
@dpkg-deb -b $${WORKSPACE} "$$DEB_FILE"
@dpkg-deb -b $${WORKSPACE} "$${DEB_FILE}"
@if [ "$${GITHUB_ACTIONS}" = "true" ]; then
@ echo "DEB=$${DEB_FILE}" >> $${GITHUB_OUTPUT}
@fi
@rm -r "$${WORKSPACE}"

View File

@@ -8,6 +8,10 @@ There is a single timer loop that triggers all collectors serially, collects the
The receiver runs as a go routine side-by-side with the timer loop and asynchronously forwards received metrics to the sink.
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7438287.svg)](https://doi.org/10.5281/zenodo.7438287)
# Configuration
Configuration is implemented using a single json document that is distributed over network and may be persisted as file.

View File

@@ -1,5 +1,5 @@
# LIKWID version
LIKWID_VERSION = 5.2.1
LIKWID_VERSION = 5.2.2
LIKWID_INSTALLED_FOLDER=$(shell dirname $(shell which likwid-topology 2>/dev/null) 2>/dev/null)
LIKWID_FOLDER="$(shell pwd)/likwid"

View File

@@ -51,7 +51,7 @@ A collector reads data from any source, parses it to metrics and submits these m
* `Name() string`: Return the name of the collector
* `Init(config json.RawMessage) error`: Initializes the collector using the given collector-specific config in JSON. Check if needed files/commands exists, ...
* `Initialized() bool`: Check if a collector is successfully initialized
* `Read(duration time.Duration, output chan ccMetric.CCMetric)`: Read, parse and submit data to the `output` channel as [`CCMetric`](../internal/ccMetric/README.md). If the collector has to measure anything for some duration, use the provided function argument `duration`.
* `Read(duration time.Duration, output chan ccMetric.CCMetric)`: Read, parse and submit data to the `output` channel as [`CCMetric`](../internal/ccMetric/README.md). If the collector has to measure anything for some duration, use the provided function argument `duration`.
* `Close()`: Closes down the collector.
It is recommanded to call `setup()` in the `Init()` function.

View File

@@ -20,7 +20,6 @@ var AvailableCollectors = map[string]MetricCollector{
"netstat": new(NetstatCollector),
"ibstat": new(InfinibandCollector),
"lustrestat": new(LustreCollector),
"lustre_jobstat": new(LustreJobstatCollector),
"cpustat": new(CpustatCollector),
"topprocs": new(TopProcsCollector),
"nvidia": new(NvidiaCollector),
@@ -37,7 +36,9 @@ var AvailableCollectors = map[string]MetricCollector{
"numastats": new(NUMAStatsCollector),
"beegfs_meta": new(BeegfsMetaCollector),
"beegfs_storage": new(BeegfsStorageCollector),
"rapl": new(RAPLCollector),
"rocm_smi": new(RocmSmiCollector),
"self": new(SelfCollector),
"schedstat": new(SchedstatCollector),
}

View File

@@ -142,6 +142,11 @@ func (m *CPUFreqCpuInfoCollector) Init(config json.RawMessage) error {
}
}
// Check if at least one CPU with frequency information was detected
if len(m.topology) == 0 {
return fmt.Errorf("No CPU frequency info found in %s", cpuInfoFile)
}
numPhysicalPackageID_int := maxPhysicalPackageID + 1
numPhysicalPackageID := fmt.Sprint(numPhysicalPackageID_int)
numNonHT := fmt.Sprint(numNonHT_int)

View File

@@ -23,20 +23,18 @@ type CPUFreqCollectorTopology struct {
numPhysicalPackages string // number of sockets / packages
numPhysicalPackages_int int64
isHT bool
numNonHT string // number of non hyperthreading processors
numNonHT string // number of non hyper-threading processors
numNonHT_int int64
scalingCurFreqFile string
tagSet map[string]string
}
//
// CPUFreqCollector
// a metric collector to measure the current frequency of the CPUs
// as obtained from the hardware (in KHz)
// Only measure on the first hyper thread
// Only measure on the first hyper-thread
//
// See: https://www.kernel.org/doc/html/latest/admin-guide/pm/cpufreq.html
//
type CPUFreqCollector struct {
metricCollector
topology []CPUFreqCollectorTopology
@@ -126,7 +124,7 @@ func (m *CPUFreqCollector) Init(config json.RawMessage) error {
t.scalingCurFreqFile = scalingCurFreqFile
}
// is processor a hyperthread?
// is processor a hyper-thread?
coreSeenBefore := make(map[string]bool)
for i := range m.topology {
t := &m.topology[i]
@@ -136,23 +134,20 @@ func (m *CPUFreqCollector) Init(config json.RawMessage) error {
coreSeenBefore[globalID] = true
}
// number of non hyper thread cores and packages / sockets
// number of non hyper-thread cores and packages / sockets
var numNonHT_int int64 = 0
var maxPhysicalPackageID int64 = 0
PhysicalPackageIDs := make(map[int64]struct{})
for i := range m.topology {
t := &m.topology[i]
// Update maxPackageID
if t.physicalPackageID_int > maxPhysicalPackageID {
maxPhysicalPackageID = t.physicalPackageID_int
}
if !t.isHT {
numNonHT_int++
}
PhysicalPackageIDs[t.physicalPackageID_int] = struct{}{}
}
numPhysicalPackageID_int := maxPhysicalPackageID + 1
numPhysicalPackageID_int := int64(len(PhysicalPackageIDs))
numPhysicalPackageID := fmt.Sprint(numPhysicalPackageID_int)
numNonHT := fmt.Sprint(numNonHT_int)
for i := range m.topology {
@@ -168,6 +163,13 @@ func (m *CPUFreqCollector) Init(config json.RawMessage) error {
}
}
// Initialized
cclog.ComponentDebug(
m.name,
"initialized",
numPhysicalPackageID_int, "physical packages,",
len(cpuDirs), "CPUs,",
numNonHT, "non-hyper-threading CPUs")
m.init = true
return nil
}
@@ -182,7 +184,7 @@ func (m *CPUFreqCollector) Read(interval time.Duration, output chan lp.CCMetric)
for i := range m.topology {
t := &m.topology[i]
// skip hyperthreads
// skip hyper-threads
if t.isHT {
continue
}

View File

@@ -1,4 +1,5 @@
## `cpufreq_cpuinfo` collector
```json
"cpufreq": {
"exclude_metrics": []
@@ -8,4 +9,5 @@
The `cpufreq` collector reads the clock frequency from `/sys/devices/system/cpu/cpu*/cpufreq` and outputs a handful **hwthread** metrics.
Metrics:
* `cpufreq`
* `cpufreq`

View File

@@ -1,5 +1,6 @@
## `cpustat` collector
```json
"cpustat": {
"exclude_metrics": [
@@ -8,9 +9,10 @@
}
```
The `cpustat` collector reads data from `/proc/stats` and outputs a handful **node** and **hwthread** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
The `cpustat` collector reads data from `/proc/stat` and outputs a handful **node** and **hwthread** metrics. If a metric is not required, it can be excluded from forwarding it to the sink.
Metrics:
* `cpu_user`
* `cpu_nice`
* `cpu_system`

View File

@@ -48,7 +48,7 @@ func (m *CustomCmdCollector) Init(config json.RawMessage) error {
command := exec.Command(cmdfields[0], strings.Join(cmdfields[1:], " "))
command.Wait()
_, err = command.Output()
if err != nil {
if err == nil {
m.commands = append(m.commands, c)
}
}

View File

@@ -1,51 +1,57 @@
package collectors
import (
"bufio"
"bytes"
"encoding/json"
"errors"
"fmt"
"io"
"log"
"os"
"os/exec"
"strconv"
"strings"
"time"
cclog "github.com/ClusterCockpit/cc-metric-collector/pkg/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
)
const IPMITOOL_PATH = `ipmitool`
const IPMISENSORS_PATH = `ipmi-sensors`
type IpmiCollectorConfig struct {
ExcludeDevices []string `json:"exclude_devices"`
IpmitoolPath string `json:"ipmitool_path"`
IpmisensorsPath string `json:"ipmisensors_path"`
}
type IpmiCollector struct {
metricCollector
//tags map[string]string
//matches map[string]string
config IpmiCollectorConfig
config struct {
ExcludeDevices []string `json:"exclude_devices"`
IpmitoolPath string `json:"ipmitool_path"`
IpmisensorsPath string `json:"ipmisensors_path"`
}
ipmitool string
ipmisensors string
}
func (m *IpmiCollector) Init(config json.RawMessage) error {
// Check if already initialized
if m.init {
return nil
}
m.name = "IpmiCollector"
m.setup()
m.parallel = true
m.meta = map[string]string{"source": m.name, "group": "IPMI"}
m.config.IpmitoolPath = string(IPMITOOL_PATH)
m.config.IpmisensorsPath = string(IPMISENSORS_PATH)
m.ipmitool = ""
m.ipmisensors = ""
m.meta = map[string]string{
"source": m.name,
"group": "IPMI",
}
// default path to IPMI tools
m.config.IpmitoolPath = "ipmitool"
m.config.IpmisensorsPath = "ipmi-sensors"
if len(config) > 0 {
err := json.Unmarshal(config, &m.config)
if err != nil {
return err
}
}
// Check if executables ipmitool or ipmisensors are found
p, err := exec.LookPath(m.config.IpmitoolPath)
if err == nil {
m.ipmitool = p
@@ -62,25 +68,33 @@ func (m *IpmiCollector) Init(config json.RawMessage) error {
}
func (m *IpmiCollector) readIpmiTool(cmd string, output chan lp.CCMetric) {
// Setup ipmitool command
command := exec.Command(cmd, "sensor")
command.Wait()
stdout, err := command.Output()
if err != nil {
log.Print(err)
stdout, _ := command.StdoutPipe()
errBuf := new(bytes.Buffer)
command.Stderr = errBuf
// start command
if err := command.Start(); err != nil {
cclog.ComponentError(
m.name,
fmt.Sprintf("readIpmiTool(): Failed to start command \"%s\": %v", command.String(), err),
)
return
}
ll := strings.Split(string(stdout), "\n")
for _, line := range ll {
lv := strings.Split(line, "|")
// Read command output
scanner := bufio.NewScanner(stdout)
for scanner.Scan() {
lv := strings.Split(scanner.Text(), "|")
if len(lv) < 3 {
continue
}
v, err := strconv.ParseFloat(strings.Trim(lv[1], " "), 64)
v, err := strconv.ParseFloat(strings.TrimSpace(lv[1]), 64)
if err == nil {
name := strings.ToLower(strings.Replace(strings.Trim(lv[0], " "), " ", "_", -1))
unit := strings.Trim(lv[2], " ")
name := strings.ToLower(strings.Replace(strings.TrimSpace(lv[0]), " ", "_", -1))
unit := strings.TrimSpace(lv[2])
if unit == "Volts" {
unit = "Volts"
} else if unit == "degrees C" {
@@ -98,6 +112,17 @@ func (m *IpmiCollector) readIpmiTool(cmd string, output chan lp.CCMetric) {
}
}
}
// Wait for command end
if err := command.Wait(); err != nil {
errMsg, _ := io.ReadAll(errBuf)
cclog.ComponentError(
m.name,
fmt.Sprintf("readIpmiTool(): Failed to wait for the end of command \"%s\": %v\n", command.String(), err),
fmt.Sprintf("readIpmiTool(): command stderr: \"%s\"\n", string(errMsg)),
)
return
}
}
func (m *IpmiCollector) readIpmiSensors(cmd string, output chan lp.CCMetric) {
@@ -131,16 +156,16 @@ func (m *IpmiCollector) readIpmiSensors(cmd string, output chan lp.CCMetric) {
}
func (m *IpmiCollector) Read(interval time.Duration, output chan lp.CCMetric) {
// Check if already initialized
if !m.init {
return
}
if len(m.config.IpmitoolPath) > 0 {
_, err := os.Stat(m.config.IpmitoolPath)
if err == nil {
m.readIpmiTool(m.config.IpmitoolPath, output)
}
m.readIpmiTool(m.config.IpmitoolPath, output)
} else if len(m.config.IpmisensorsPath) > 0 {
_, err := os.Stat(m.config.IpmisensorsPath)
if err == nil {
m.readIpmiSensors(m.config.IpmisensorsPath, output)
}
m.readIpmiSensors(m.config.IpmisensorsPath, output)
}
}

View File

@@ -8,9 +8,6 @@
}
```
The `ipmistat` collector reads data from `ipmitool` (`ipmitool sensor`) or `ipmi-sensors` (`ipmi-sensors --sdr-cache-recreate --comma-separated-output`).
The `ipmistat` collector reads data from `ipmitool` (`ipmitool sensor`) or `ipmi-sensors` (`ipmi-sensors --sdr-cache-recreate --comma-separated-output`).
The metrics depend on the output of the underlying tools but contain temperature, power and energy metrics.

View File

@@ -15,6 +15,7 @@ import (
"math"
"os"
"os/signal"
"os/user"
"sort"
"strconv"
"strings"
@@ -28,12 +29,15 @@ import (
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
topo "github.com/ClusterCockpit/cc-metric-collector/pkg/ccTopology"
"github.com/NVIDIA/go-nvml/pkg/dl"
"golang.design/x/thread"
fsnotify "gopkg.in/fsnotify.v0"
)
const (
LIKWID_LIB_NAME = "liblikwid.so"
LIKWID_LIB_DL_FLAGS = dl.RTLD_LAZY | dl.RTLD_GLOBAL
LIKWID_DEF_ACCESSMODE = "direct"
LIKWID_DEF_LOCKFILE = "/var/run/likwid.lock"
)
type LikwidCollectorMetricConfig struct {
@@ -67,22 +71,25 @@ type LikwidCollectorConfig struct {
AccessMode string `json:"access_mode,omitempty"`
DaemonPath string `json:"accessdaemon_path,omitempty"`
LibraryPath string `json:"liblikwid_path,omitempty"`
LockfilePath string `json:"lockfile_path,omitempty"`
}
type LikwidCollector struct {
metricCollector
cpulist []C.int
cpu2tid map[int]int
sock2tid map[int]int
metrics map[C.int]map[string]int
groups []C.int
config LikwidCollectorConfig
gmresults map[int]map[string]float64
basefreq float64
running bool
initialized bool
likwidGroups map[C.int]LikwidEventsetConfig
lock sync.Mutex
cpulist []C.int
cpu2tid map[int]int
sock2tid map[int]int
metrics map[C.int]map[string]int
groups []C.int
config LikwidCollectorConfig
gmresults map[int]map[string]float64
basefreq float64
running bool
initialized bool
needs_reinit bool
likwidGroups map[C.int]LikwidEventsetConfig
lock sync.Mutex
measureThread thread.Thread
}
type LikwidMetric struct {
@@ -92,6 +99,18 @@ type LikwidMetric struct {
group_idx int
}
func checkMetricType(t string) bool {
valid := map[string]bool{
"node": true,
"socket": true,
"hwthread": true,
"core": true,
"memoryDomain": true,
}
_, ok := valid[t]
return ok
}
func eventsToEventStr(events map[string]string) string {
elist := make([]string, 0)
for k, v := range events {
@@ -179,9 +198,11 @@ func (m *LikwidCollector) Init(config json.RawMessage) error {
m.name = "LikwidCollector"
m.parallel = false
m.initialized = false
m.needs_reinit = true
m.running = false
m.config.AccessMode = LIKWID_DEF_ACCESSMODE
m.config.LibraryPath = LIKWID_LIB_NAME
m.config.LockfilePath = LIKWID_DEF_LOCKFILE
if len(config) > 0 {
err := json.Unmarshal(config, &m.config)
if err != nil {
@@ -239,12 +260,16 @@ func (m *LikwidCollector) Init(config json.RawMessage) error {
}
for _, metric := range evset.Metrics {
// Try to evaluate the metric
if testLikwidMetricFormula(metric.Calc, params) {
// Add the computable metric to the parameter list for the global metrics
cclog.ComponentDebug(m.name, "Checking", metric.Name)
if !checkMetricType(metric.Type) {
cclog.ComponentError(m.name, "Metric", metric.Name, "uses invalid type", metric.Type)
metric.Calc = ""
} else if !testLikwidMetricFormula(metric.Calc, params) {
cclog.ComponentError(m.name, "Metric", metric.Name, "cannot be calculated with given counters")
metric.Calc = ""
} else {
globalParams = append(globalParams, metric.Name)
totalMetrics++
} else {
metric.Calc = ""
}
}
} else {
@@ -254,8 +279,14 @@ func (m *LikwidCollector) Init(config json.RawMessage) error {
}
for _, metric := range m.config.Metrics {
// Try to evaluate the global metric
if !testLikwidMetricFormula(metric.Calc, globalParams) {
cclog.ComponentError(m.name, "Calculation for metric", metric.Name, "failed")
if !checkMetricType(metric.Type) {
cclog.ComponentError(m.name, "Metric", metric.Name, "uses invalid type", metric.Type)
metric.Calc = ""
} else if !testLikwidMetricFormula(metric.Calc, globalParams) {
cclog.ComponentError(m.name, "Metric", metric.Name, "cannot be calculated with given counters")
metric.Calc = ""
} else if !checkMetricType(metric.Type) {
cclog.ComponentError(m.name, "Metric", metric.Name, "has invalid type")
metric.Calc = ""
} else {
totalMetrics++
@@ -268,56 +299,194 @@ func (m *LikwidCollector) Init(config json.RawMessage) error {
cclog.ComponentError(m.name, err.Error())
return err
}
ret := C.topology_init()
if ret != 0 {
err := errors.New("failed to initialize topology module")
cclog.ComponentError(m.name, err.Error())
return err
}
switch m.config.AccessMode {
case "direct":
C.HPMmode(0)
case "accessdaemon":
if len(m.config.DaemonPath) > 0 {
p := os.Getenv("PATH")
os.Setenv("PATH", m.config.DaemonPath+":"+p)
}
C.HPMmode(1)
for _, c := range m.cpulist {
C.HPMaddThread(c)
}
}
m.sock2tid = make(map[int]int)
tmp := make([]C.int, 1)
for _, sid := range topo.SocketList() {
cstr := C.CString(fmt.Sprintf("S%d:0", sid))
ret = C.cpustr_to_cpulist(cstr, &tmp[0], 1)
if ret > 0 {
m.sock2tid[sid] = m.cpu2tid[int(tmp[0])]
}
C.free(unsafe.Pointer(cstr))
}
m.basefreq = getBaseFreq()
m.measureThread = thread.New()
m.init = true
return nil
}
// take a measurement for 'interval' seconds of event set index 'group'
func (m *LikwidCollector) takeMeasurement(evset LikwidEventsetConfig, interval time.Duration) (bool, error) {
func (m *LikwidCollector) takeMeasurement(evidx int, evset LikwidEventsetConfig, interval time.Duration) (bool, error) {
var ret C.int
m.lock.Lock()
if m.initialized {
ret = C.perfmon_setupCounters(evset.gid)
if ret != 0 {
var err error = nil
var skip bool = false
if ret == -37 {
skip = true
} else {
err = fmt.Errorf("failed to setup performance group %d", evset.gid)
}
m.lock.Unlock()
return skip, err
var gid C.int = -1
sigchan := make(chan os.Signal, 1)
watcher, err := fsnotify.NewWatcher()
if err != nil {
cclog.ComponentError(m.name, err.Error())
}
defer watcher.Close()
if len(m.config.LockfilePath) > 0 {
info, err := os.Stat(m.config.LockfilePath)
if err != nil {
return true, err
}
ret = C.perfmon_startCounters()
if ret != 0 {
var err error = nil
var skip bool = false
if ret == -37 {
skip = true
stat := info.Sys().(*syscall.Stat_t)
if stat.Uid != uint32(os.Getuid()) {
usr, err := user.LookupId(strconv.FormatUint(uint64(stat.Uid), 10))
if err == nil {
return true, fmt.Errorf("Access to performance counters locked by %s", usr.Username)
} else {
err = fmt.Errorf("failed to setup performance group %d", evset.gid)
return true, fmt.Errorf("Access to performance counters locked by %d", stat.Uid)
}
m.lock.Unlock()
return skip, err
}
m.running = true
time.Sleep(interval)
m.running = false
ret = C.perfmon_stopCounters()
if ret != 0 {
var err error = nil
var skip bool = false
if ret == -37 {
skip = true
} else {
err = fmt.Errorf("failed to setup performance group %d", evset.gid)
}
m.lock.Unlock()
return skip, err
err = watcher.Watch(m.config.LockfilePath)
if err != nil {
cclog.ComponentError(m.name, err.Error())
}
}
m.lock.Unlock()
m.lock.Lock()
defer m.lock.Unlock()
select {
case e := <-watcher.Event:
ret = -1
if !e.IsAttrib() {
ret = C.perfmon_init(C.int(len(m.cpulist)), &m.cpulist[0])
}
default:
ret = C.perfmon_init(C.int(len(m.cpulist)), &m.cpulist[0])
}
if ret != 0 {
return true, fmt.Errorf("failed to initialize library, error %d", ret)
}
signal.Notify(sigchan, os.Interrupt)
signal.Notify(sigchan, syscall.SIGCHLD)
select {
case <-sigchan:
gid = -1
case e := <-watcher.Event:
gid = -1
if !e.IsAttrib() {
gid = C.perfmon_addEventSet(evset.estr)
}
default:
gid = C.perfmon_addEventSet(evset.estr)
}
if gid < 0 {
return true, fmt.Errorf("failed to add events %s, error %d", evset.go_estr, gid)
} else {
evset.gid = gid
//m.likwidGroups[gid] = evset
}
select {
case <-sigchan:
ret = -1
case e := <-watcher.Event:
if !e.IsAttrib() {
ret = C.perfmon_setupCounters(gid)
}
default:
ret = C.perfmon_setupCounters(gid)
}
if ret != 0 {
return true, fmt.Errorf("failed to setup events '%s', error %d", evset.go_estr, ret)
}
select {
case <-sigchan:
ret = -1
case e := <-watcher.Event:
if !e.IsAttrib() {
ret = C.perfmon_startCounters()
}
default:
ret = C.perfmon_startCounters()
}
if ret != 0 {
return true, fmt.Errorf("failed to start events '%s', error %d", evset.go_estr, ret)
}
select {
case <-sigchan:
ret = -1
case e := <-watcher.Event:
if !e.IsAttrib() {
ret = C.perfmon_readCounters()
}
default:
ret = C.perfmon_readCounters()
}
if ret != 0 {
return true, fmt.Errorf("failed to read events '%s', error %d", evset.go_estr, ret)
}
time.Sleep(interval)
select {
case <-sigchan:
ret = -1
case e := <-watcher.Event:
if !e.IsAttrib() {
ret = C.perfmon_readCounters()
}
default:
ret = C.perfmon_readCounters()
}
if ret != 0 {
return true, fmt.Errorf("failed to read events '%s', error %d", evset.go_estr, ret)
}
for eidx, counter := range evset.eorder {
gctr := C.GoString(counter)
for _, tid := range m.cpu2tid {
res := C.perfmon_getLastResult(gid, C.int(eidx), C.int(tid))
fres := float64(res)
if m.config.InvalidToZero && (math.IsNaN(fres) || math.IsInf(fres, 0)) {
fres = 0.0
}
evset.results[tid][gctr] = fres
}
}
for _, tid := range m.cpu2tid {
evset.results[tid]["time"] = float64(C.perfmon_getLastTimeOfGroup(gid))
}
select {
case <-sigchan:
ret = -1
case e := <-watcher.Event:
if !e.IsAttrib() {
ret = C.perfmon_stopCounters()
}
default:
ret = C.perfmon_stopCounters()
}
if ret != 0 {
return true, fmt.Errorf("failed to stop events '%s', error %d", evset.go_estr, ret)
}
signal.Stop(sigchan)
select {
case e := <-watcher.Event:
if !e.IsAttrib() {
C.perfmon_finalize()
}
default:
C.perfmon_finalize()
}
return false, nil
}
@@ -325,19 +494,8 @@ func (m *LikwidCollector) takeMeasurement(evset LikwidEventsetConfig, interval t
func (m *LikwidCollector) calcEventsetMetrics(evset LikwidEventsetConfig, interval time.Duration, output chan lp.CCMetric) error {
invClock := float64(1.0 / m.basefreq)
// Go over events and get the results
for eidx, counter := range evset.eorder {
gctr := C.GoString(counter)
for _, tid := range m.cpu2tid {
res := C.perfmon_getLastResult(evset.gid, C.int(eidx), C.int(tid))
fres := float64(res)
if m.config.InvalidToZero && (math.IsNaN(fres) || math.IsInf(fres, 0)) {
fres = 0.0
}
evset.results[tid][gctr] = fres
evset.results[tid]["time"] = interval.Seconds()
evset.results[tid]["inverseClock"] = invClock
}
for _, tid := range m.cpu2tid {
evset.results[tid]["inverseClock"] = invClock
}
// Go over the event set metrics, derive the value out of the event:counter values and send it
@@ -383,7 +541,7 @@ func (m *LikwidCollector) calcEventsetMetrics(evset LikwidEventsetConfig, interv
}
// Go over the global metrics, derive the value out of the event sets' metric values and send it
func (m *LikwidCollector) calcGlobalMetrics(interval time.Duration, output chan lp.CCMetric) error {
func (m *LikwidCollector) calcGlobalMetrics(groups []LikwidEventsetConfig, interval time.Duration, output chan lp.CCMetric) error {
for _, metric := range m.config.Metrics {
scopemap := m.cpu2tid
if metric.Type == "socket" {
@@ -393,7 +551,7 @@ func (m *LikwidCollector) calcGlobalMetrics(interval time.Duration, output chan
if tid >= 0 {
// Here we generate parameter list
params := make(map[string]interface{})
for _, evset := range m.likwidGroups {
for _, evset := range groups {
for mname, mres := range evset.metrics[tid] {
params[mname] = mres
}
@@ -407,7 +565,7 @@ func (m *LikwidCollector) calcGlobalMetrics(interval time.Duration, output chan
if m.config.InvalidToZero && (math.IsNaN(value) || math.IsInf(value, 0)) {
value = 0.0
}
m.gmresults[tid][metric.Name] = value
//m.gmresults[tid][metric.Name] = value
// Now we have the result, send it with the proper tags
if !math.IsNaN(value) {
if metric.Publish {
@@ -431,163 +589,53 @@ func (m *LikwidCollector) calcGlobalMetrics(interval time.Duration, output chan
return nil
}
func (m *LikwidCollector) LateInit() error {
var ret C.int
if m.initialized {
return nil
}
switch m.config.AccessMode {
case "direct":
C.HPMmode(0)
case "accessdaemon":
if len(m.config.DaemonPath) > 0 {
p := os.Getenv("PATH")
os.Setenv("PATH", m.config.DaemonPath+":"+p)
}
C.HPMmode(1)
}
cclog.ComponentDebug(m.name, "initialize LIKWID topology")
ret = C.topology_init()
if ret != 0 {
err := errors.New("failed to initialize LIKWID topology")
cclog.ComponentError(m.name, err.Error())
return err
}
m.sock2tid = make(map[int]int)
tmp := make([]C.int, 1)
for _, sid := range topo.SocketList() {
cstr := C.CString(fmt.Sprintf("S%d:0", sid))
ret = C.cpustr_to_cpulist(cstr, &tmp[0], 1)
if ret > 0 {
m.sock2tid[sid] = m.cpu2tid[int(tmp[0])]
}
C.free(unsafe.Pointer(cstr))
}
func (m *LikwidCollector) ReadThread(interval time.Duration, output chan lp.CCMetric) {
var err error = nil
groups := make([]LikwidEventsetConfig, 0)
m.basefreq = getBaseFreq()
cclog.ComponentDebug(m.name, "BaseFreq", m.basefreq)
cclog.ComponentDebug(m.name, "initialize LIKWID perfmon module")
ret = C.perfmon_init(C.int(len(m.cpulist)), &m.cpulist[0])
if ret != 0 {
var err error = nil
C.topology_finalize()
if ret != -22 {
err = errors.New("failed to initialize LIKWID perfmon")
cclog.ComponentError(m.name, err.Error())
} else {
err = errors.New("access to LIKWID perfmon locked")
}
return err
}
// While adding the events, we test the metrics whether they can be computed at all
for i, evset := range m.config.Eventsets {
var gid C.int
if len(evset.Events) > 0 {
skip := false
likwidGroup := genLikwidEventSet(evset)
for _, g := range m.likwidGroups {
if likwidGroup.go_estr == g.go_estr {
skip = true
break
}
for evidx, evset := range m.config.Eventsets {
e := genLikwidEventSet(evset)
e.internal = evidx
skip := false
if !skip {
// measure event set 'i' for 'interval' seconds
skip, err = m.takeMeasurement(evidx, e, interval)
if err != nil {
cclog.ComponentError(m.name, err.Error())
return
}
if skip {
continue
}
// Now we add the list of events to likwid
gid = C.perfmon_addEventSet(likwidGroup.estr)
if gid >= 0 {
likwidGroup.gid = gid
likwidGroup.internal = i
m.likwidGroups[gid] = likwidGroup
}
} else {
cclog.ComponentError(m.name, "Invalid Likwid eventset config, no events given")
continue
}
if !skip {
// read measurements and derive event set metrics
m.calcEventsetMetrics(e, interval, output)
}
groups = append(groups, e)
}
// If no event set could be added, shut down LikwidCollector
if len(m.likwidGroups) == 0 {
C.perfmon_finalize()
C.topology_finalize()
err := errors.New("no LIKWID performance group initialized")
cclog.ComponentError(m.name, err.Error())
return err
}
sigchan := make(chan os.Signal, 1)
signal.Notify(sigchan, syscall.SIGCHLD)
signal.Notify(sigchan, os.Interrupt)
go func() {
<-sigchan
signal.Stop(sigchan)
m.initialized = false
}()
m.initialized = true
return nil
// calculate global metrics
m.calcGlobalMetrics(groups, interval, output)
}
// main read function taking multiple measurement rounds, each 'interval' seconds long
func (m *LikwidCollector) Read(interval time.Duration, output chan lp.CCMetric) {
var skip bool = false
var err error
//var skip bool = false
//var err error
if !m.init {
return
}
if !m.initialized {
m.lock.Lock()
err = m.LateInit()
if err != nil {
m.lock.Unlock()
return
}
m.initialized = true
m.lock.Unlock()
}
if m.initialized && !skip {
for _, evset := range m.likwidGroups {
if !skip {
// measure event set 'i' for 'interval' seconds
skip, err = m.takeMeasurement(evset, interval)
if err != nil {
cclog.ComponentError(m.name, err.Error())
return
}
}
if !skip {
// read measurements and derive event set metrics
m.calcEventsetMetrics(evset, interval, output)
}
}
if !skip {
// use the event set metrics to derive the global metrics
m.calcGlobalMetrics(interval, output)
}
}
m.measureThread.Call(func() {
m.ReadThread(interval, output)
})
}
func (m *LikwidCollector) Close() {
if m.init {
m.init = false
cclog.ComponentDebug(m.name, "Closing ...")
m.lock.Lock()
if m.initialized {
cclog.ComponentDebug(m.name, "Finalize LIKWID perfmon module")
C.perfmon_finalize()
m.initialized = false
}
m.measureThread.Terminate()
m.initialized = false
m.lock.Unlock()
cclog.ComponentDebug(m.name, "Finalize LIKWID topology module")
C.topology_finalize()
cclog.ComponentDebug(m.name, "Closing done")
}
}

View File

@@ -10,6 +10,7 @@ The `likwid` collector is probably the most complicated collector. The LIKWID li
"liblikwid_path" : "/path/to/liblikwid.so",
"accessdaemon_path" : "/folder/that/contains/likwid-accessD",
"access_mode" : "direct or accessdaemon or perf_event",
"lockfile_path" : "/var/run/likwid.lock",
"eventsets": [
{
"events" : {
@@ -41,7 +42,7 @@ The `likwid` collector is probably the most complicated collector. The LIKWID li
The `likwid` configuration consists of two parts, the `eventsets` and `globalmetrics`:
- An event set list itself has two parts, the `events` and a set of derivable `metrics`. Each of the `events` is a `counter:event` pair in LIKWID's syntax. The `metrics` are a list of formulas to derive the metric value from the measurements of the `events`' values. Each metric has a name, the formula, a type and a publish flag. There is an optional `unit` field. Counter names can be used like variables in the formulas, so `PMC0+PMC1` sums the measurements for the both events configured in the counters `PMC0` and `PMC1`. You can optionally use `time` for the measurement time and `inverseClock` for `1.0/baseCpuFrequency`. The type tells the LikwidCollector whether it is a metric for each hardware thread (`cpu`) or each CPU socket (`socket`). You may specify a unit for the metric with `unit`. The last one is the publishing flag. It tells the LikwidCollector whether a metric should be sent to the router or is only used internally to compute a global metric.
- The `globalmetrics` are metrics which require data from multiple event set measurements to be derived. The inputs are the metrics in the event sets. Similar to the metrics in the event sets, the global metrics are defined by a name, a formula, a scope and a publish flag. See event set metrics for details. The only difference is that there is no access to the raw event measurements anymore but only to the metrics. Also `time` and `inverseClock` cannot be used anymore. So, the idea is to derive a metric in the `eventsets` section and reuse it in the `globalmetrics` part. If you need a metric only for deriving the global metrics, disable forwarding of the event set metrics (`"publish": false`). **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases. Similar to the metrics in the eventset, you can specify a metric unit with the `unit` field.
- The `globalmetrics` are metrics which require data from multiple event set measurements to be derived. The inputs are the metrics in the event sets. Similar to the metrics in the event sets, the global metrics are defined by a name, a formula, a type and a publish flag. See event set metrics for details. The only difference is that there is no access to the raw event measurements anymore but only to the metrics. Also `time` and `inverseClock` cannot be used anymore. So, the idea is to derive a metric in the `eventsets` section and reuse it in the `globalmetrics` part. If you need a metric only for deriving the global metrics, disable forwarding of the event set metrics (`"publish": false`). **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases. Similar to the metrics in the eventset, you can specify a metric unit with the `unit` field.
Additional options:
- `force_overwrite`: Same as setting `LIKWID_FORCE=1`. In case counters are already in-use, LIKWID overwrites their configuration to do its measurements
@@ -49,21 +50,22 @@ Additional options:
- `access_mode`: Specify LIKWID access mode: `direct` for direct register access as root user or `accessdaemon`. The access mode `perf_event` is current untested.
- `accessdaemon_path`: Folder of the accessDaemon `likwid-accessD` (like `/usr/local/sbin`)
- `liblikwid_path`: Location of `liblikwid.so` including file name like `/usr/local/lib/liblikwid.so`
- `lockfile_path`: Location of LIKWID's lock file if multiple tools should access the hardware counters. Default `/var/run/likwid.lock`
### Available metric scopes
### Available metric types
Hardware performance counters are scattered all over the system nowadays. A counter coveres a specific part of the system. While there are hardware thread specific counter for CPU cycles, instructions and so on, some others are specific for a whole CPU socket/package. To address that, the LikwidCollector provides the specification of a `type` for each metric.
- `hwthread` : One metric per CPU hardware thread with the tags `"type" : "hwthread"` and `"type-id" : "$hwthread_id"`
- `socket` : One metric per CPU socket/package with the tags `"type" : "socket"` and `"type-id" : "$socket_id"`
**Note:** You cannot specify `socket` scope for a metric that is measured at `hwthread` scope, so some kind of expert knowledge or lookup work in the [Likwid Wiki](https://github.com/RRZE-HPC/likwid/wiki) is required. Get the scope of each counter from the *Architecture* pages and as soon as one counter in a metric is socket-specific, the whole metric is socket-specific.
**Note:** You cannot specify `socket` type for a metric that is measured at `hwthread` type, so some kind of expert knowledge or lookup work in the [Likwid Wiki](https://github.com/RRZE-HPC/likwid/wiki) is required. Get the type of each counter from the *Architecture* pages and as soon as one counter in a metric is socket-specific, the whole metric is socket-specific.
As a guideline:
- All counters `FIXCx`, `PMCy` and `TMAz` have the scope `hwthread`
- All counters names containing `BOX` have the scope `socket`
- All `PWRx` counters have scope `socket`, except `"PWR1" : "RAPL_CORE_ENERGY"` has `hwthread` scope
- All `DFCx` counters have scope `socket`
- All counters `FIXCx`, `PMCy` and `TMAz` have the type `hwthread`
- All counters names containing `BOX` have the type `socket`
- All `PWRx` counters have type `socket`, except `"PWR1" : "RAPL_CORE_ENERGY"` has `hwthread` type
- All `DFCx` counters have type `socket`
### Help with the configuration
@@ -93,7 +95,7 @@ $ scripts/likwid_perfgroup_to_cc_config.py ICX MEM_DP
"name": "Runtime (RDTSC) [s]",
"publish": true,
"unit": "seconds"
"scope": "hwthread"
"type": "hwthread"
},
{
"..." : "..."
@@ -245,7 +247,7 @@ METRICS -> "metrics": [
IPC PMC0/PMC1 -> {
-> "name" : "IPC",
-> "calc" : "PMC0/PMC1",
-> "scope": "hwthread",
-> "type": "hwthread",
-> "publish": true
-> }
-> ]

View File

@@ -1,270 +0,0 @@
package collectors
import (
"encoding/json"
"errors"
"os/exec"
"regexp"
"strings"
"time"
cclog "github.com/ClusterCockpit/cc-metric-collector/pkg/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
)
type LustreJobstatCollectorConfig struct {
LCtlCommand string `json:"lctl_command,omitempty"`
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
Sudo bool `json:"use_sudo,omitempty"`
SendAbsoluteValues bool `json:"send_abs_values,omitempty"`
SendDerivedValues bool `json:"send_derived_values,omitempty"`
SendDiffValues bool `json:"send_diff_values,omitempty"`
JobRegex string `json:"jobid_regex,omitempty"`
}
type LustreJobstatCollector struct {
metricCollector
tags map[string]string
config LustreJobstatCollectorConfig
lctl string
sudoCmd string
lastTimestamp time.Time // Store time stamp of last tick to derive bandwidths
definitions []LustreMetricDefinition // Combined list without excluded metrics
//stats map[string]map[string]int64 // Data for last value per device and metric
lastMdtData *map[string]map[string]LustreMetricData
lastObdfilterData *map[string]map[string]LustreMetricData
jobidRegex *regexp.Regexp
}
var defaultJobidRegex = `^(?P<jobid>[\d\w\.]+)$`
var LustreMetricJobstatsDefinition = []LustreMetricDefinition{
{
name: "lustre_job_read_samples",
lineprefix: "read",
offsetname: "samples",
unit: "requests",
calc: "none",
},
{
name: "lustre_job_read_min_bytes",
lineprefix: "read_bytes",
offsetname: "min",
unit: "bytes",
calc: "none",
},
{
name: "lustre_job_read_max_bytes",
lineprefix: "read_bytes",
offsetname: "max",
unit: "bytes",
calc: "none",
},
}
func (m *LustreJobstatCollector) executeLustreCommand(option string) []string {
return executeLustreCommand(m.sudoCmd, m.lctl, LCTL_OPTION, option, m.config.Sudo)
}
func (m *LustreJobstatCollector) Init(config json.RawMessage) error {
var err error
m.name = "LustreJobstatCollector"
m.parallel = true
m.config.JobRegex = defaultJobidRegex
m.config.SendAbsoluteValues = true
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
return err
}
}
m.setup()
m.tags = map[string]string{"type": "jobid"}
m.meta = map[string]string{"source": m.name, "group": "Lustre", "scope": "job"}
// Lustre file system statistics can only be queried by user root
// or with password-less sudo
// if !m.config.Sudo {
// user, err := user.Current()
// if err != nil {
// cclog.ComponentError(m.name, "Failed to get current user:", err.Error())
// return err
// }
// if user.Uid != "0" {
// cclog.ComponentError(m.name, "Lustre file system statistics can only be queried by user root")
// return err
// }
// } else {
// p, err := exec.LookPath("sudo")
// if err != nil {
// cclog.ComponentError(m.name, "Cannot find 'sudo'")
// return err
// }
// m.sudoCmd = p
// }
p, err := exec.LookPath(m.config.LCtlCommand)
if err != nil {
p, err = exec.LookPath(LCTL_CMD)
if err != nil {
return err
}
}
m.lctl = p
m.definitions = make([]LustreMetricDefinition, 0)
if m.config.SendAbsoluteValues {
for _, def := range LustreMetricJobstatsDefinition {
if _, skip := stringArrayContains(m.config.ExcludeMetrics, def.name); !skip {
m.definitions = append(m.definitions, def)
}
}
}
if len(m.definitions) == 0 {
return errors.New("no metrics to collect")
}
x := make(map[string]map[string]LustreMetricData)
m.lastMdtData = &x
x = make(map[string]map[string]LustreMetricData)
m.lastObdfilterData = &x
if len(m.config.JobRegex) > 0 {
jregex := strings.ReplaceAll(m.config.JobRegex, "%", "\\")
r, err := regexp.Compile(jregex)
if err == nil {
m.jobidRegex = r
} else {
cclog.ComponentError(m.name, "Cannot compile jobid regex")
return err
}
}
m.lastTimestamp = time.Now()
m.init = true
return nil
}
func (m *LustreJobstatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
if !m.init {
return
}
getValue := func(data map[string]map[string]LustreMetricData, device string, jobid string, operation string, field string) int64 {
var value int64 = -1
if ddata, ok := data[device]; ok {
if jdata, ok := ddata[jobid]; ok {
if opdata, ok := jdata.op_data[operation]; ok {
if v, ok := opdata[field]; ok {
value = v
}
}
}
}
return value
}
jobIdToTags := func(jobregex *regexp.Regexp, job string) map[string]string {
tags := make(map[string]string)
groups := jobregex.SubexpNames()
for _, match := range jobregex.FindAllStringSubmatch(job, -1) {
for groupIdx, group := range match {
if len(groups[groupIdx]) > 0 {
tags[groups[groupIdx]] = group
}
}
}
return tags
}
generateMetric := func(definition LustreMetricDefinition, data map[string]map[string]LustreMetricData, last map[string]map[string]LustreMetricData, now time.Time) {
tdiff := now.Sub(m.lastTimestamp)
for dev, ddata := range data {
for jobid, jdata := range ddata {
jobtags := jobIdToTags(m.jobidRegex, jobid)
if _, ok := jobtags["jobid"]; !ok {
continue
}
cur := getValue(data, dev, jobid, definition.lineprefix, definition.offsetname)
old := getValue(last, dev, jobid, definition.lineprefix, definition.offsetname)
var x interface{} = -1
var valid = false
switch definition.calc {
case "none":
x = cur
valid = true
case "difference":
if len(last) > 0 {
if old >= 0 {
x = cur - old
valid = true
}
}
case "derivative":
if len(last) > 0 {
if old >= 0 {
x = float64(cur-old) / tdiff.Seconds()
valid = true
}
}
}
if valid {
y, err := lp.New(definition.name, m.tags, m.meta, map[string]interface{}{"value": x}, now)
if err == nil {
y.AddTag("stype", "device")
y.AddTag("stype-id", dev)
if j, ok := jobtags["jobid"]; ok {
y.AddTag("type-id", j)
} else {
y.AddTag("type-id", jobid)
}
for k, v := range jobtags {
switch k {
case "jobid":
case "hostname":
y.AddTag("hostname", v)
default:
y.AddMeta(k, v)
}
}
if len(definition.unit) > 0 {
y.AddMeta("unit", definition.unit)
} else {
if unit, ok := jdata.op_units[definition.lineprefix]; ok {
y.AddMeta("unit", unit)
}
}
output <- y
}
}
}
}
}
now := time.Now()
mdt_lines := m.executeLustreCommand("mdt.*.job_stats")
if len(mdt_lines) > 0 {
mdt_data := readCommandOutput(mdt_lines)
for _, def := range m.definitions {
generateMetric(def, mdt_data, *m.lastMdtData, now)
}
m.lastMdtData = &mdt_data
}
obdfilter_lines := m.executeLustreCommand("obdfilter.*.job_stats")
if len(obdfilter_lines) > 0 {
obdfilter_data := readCommandOutput(obdfilter_lines)
for _, def := range m.definitions {
generateMetric(def, obdfilter_data, *m.lastObdfilterData, now)
}
m.lastObdfilterData = &obdfilter_data
}
m.lastTimestamp = now
}
func (m *LustreJobstatCollector) Close() {
m.init = false
}

View File

@@ -1,35 +0,0 @@
## `lustre_jobstat` collector
**Note**: This collector is meant to run on the Lustre servers, **not** the clients
The Lustre filesystem provides a feature (`job_stats`) to group processes on client side with an identifier string (like a compute job with its jobid) and retrieve the file system operation counts on the server side. Check the section [How to configure `job_stats`]() for more information.
### Configuration
```json
"lustre_jobstat_": {
"lctl_command": "/path/to/lctl",
"use_sudo": false,
"exclude_metrics": [
"setattr",
"getattr"
],
"send_abs_values" : true,
"jobid_regex" : "^(?P<jobid>[%d%w%.]+)$"
}
```
The `lustre_jobstat` collector uses the `lctl` application with the `get_param` option to get all `mdt.*.job_stats` and `obdfilter.*.job_stats` metrics. These metrics are only available for root users. If password-less sudo is configured, you can enable `sudo` in the configuration. In the `exclude_metrics` list, some metrics can be excluded to reduce network traffic and storage. With the `send_abs_values` flag, the collector sends absolute values for the configured metrics. The `jobid_regex` can be used to split the Lustre `job_stats` identifier into multiple parts. Since JSON cannot handle strings like `\d`, use `%` instead of `\`.
Metrics:
- `lustre_job_read_samples` (unit: `requests`)
- `lustre_job_read_min_bytes` (unit: `bytes`)
- `lustre_job_read_max_bytes` (unit: `bytes`)
The collector adds the tags: `type=jobid,typeid=<jobid_from_regex>,stype=device,stype=<device_name_from_output>`.
The collector adds the mega information: `unit=<unit>,scope=job`
### How to configure `job_stats`

View File

@@ -14,6 +14,10 @@ import (
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
)
const LUSTRE_SYSFS = `/sys/fs/lustre`
const LCTL_CMD = `lctl`
const LCTL_OPTION = `get_param`
type LustreCollectorConfig struct {
LCtlCommand string `json:"lctl_command,omitempty"`
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
@@ -23,6 +27,14 @@ type LustreCollectorConfig struct {
SendDiffValues bool `json:"send_diff_values,omitempty"`
}
type LustreMetricDefinition struct {
name string
lineprefix string
lineoffset int
unit string
calc string
}
type LustreCollector struct {
metricCollector
tags map[string]string
@@ -34,209 +46,17 @@ type LustreCollector struct {
stats map[string]map[string]int64 // Data for last value per device and metric
}
var LustreAbsMetrics = []LustreMetricDefinition{
{
name: "lustre_read_requests",
lineprefix: "read_bytes",
lineoffset: 1,
offsetname: "samples",
unit: "requests",
calc: "none",
},
{
name: "lustre_write_requests",
lineprefix: "write_bytes",
lineoffset: 1,
offsetname: "samples",
unit: "requests",
calc: "none",
},
{
name: "lustre_read_bytes",
lineprefix: "read_bytes",
lineoffset: 6,
offsetname: "sum",
unit: "bytes",
calc: "none",
},
{
name: "lustre_write_bytes",
lineprefix: "write_bytes",
lineoffset: 6,
offsetname: "sum",
unit: "bytes",
calc: "none",
},
{
name: "lustre_open",
lineprefix: "open",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "none",
},
{
name: "lustre_close",
lineprefix: "close",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "none",
},
{
name: "lustre_setattr",
lineprefix: "setattr",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "none",
},
{
name: "lustre_getattr",
lineprefix: "getattr",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "none",
},
{
name: "lustre_statfs",
lineprefix: "statfs",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "none",
},
{
name: "lustre_inode_permission",
lineprefix: "inode_permission",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "none",
},
}
var LustreDiffMetrics = []LustreMetricDefinition{
{
name: "lustre_read_requests_diff",
lineprefix: "read_bytes",
lineoffset: 1,
offsetname: "samples",
unit: "requests",
calc: "difference",
},
{
name: "lustre_write_requests_diff",
lineprefix: "write_bytes",
lineoffset: 1,
offsetname: "samples",
unit: "requests",
calc: "difference",
},
{
name: "lustre_read_bytes_diff",
lineprefix: "read_bytes",
lineoffset: 6,
offsetname: "sum",
unit: "bytes",
calc: "difference",
},
{
name: "lustre_write_bytes_diff",
lineprefix: "write_bytes",
lineoffset: 6,
offsetname: "sum",
unit: "bytes",
calc: "difference",
},
{
name: "lustre_open_diff",
lineprefix: "open",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "difference",
},
{
name: "lustre_close_diff",
lineprefix: "close",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "difference",
},
{
name: "lustre_setattr_diff",
lineprefix: "setattr",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "difference",
},
{
name: "lustre_getattr_diff",
lineprefix: "getattr",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "difference",
},
{
name: "lustre_statfs_diff",
lineprefix: "statfs",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "difference",
},
{
name: "lustre_inode_permission_diff",
lineprefix: "inode_permission",
lineoffset: 1,
offsetname: "samples",
unit: "",
calc: "difference",
},
}
var LustreDeriveMetrics = []LustreMetricDefinition{
{
name: "lustre_read_requests_rate",
lineprefix: "read_bytes",
lineoffset: 1,
offsetname: "samples",
unit: "requests/sec",
calc: "derivative",
},
{
name: "lustre_write_requests_rate",
lineprefix: "write_bytes",
lineoffset: 1,
offsetname: "samples",
unit: "requests/sec",
calc: "derivative",
},
{
name: "lustre_read_bw",
lineprefix: "read_bytes",
lineoffset: 6,
offsetname: "sum",
unit: "bytes/sec",
calc: "derivative",
},
{
name: "lustre_write_bw",
lineprefix: "write_bytes",
lineoffset: 6,
offsetname: "sum",
unit: "bytes/sec",
calc: "derivative",
},
}
func (m *LustreCollector) getDeviceDataCommand(device string) []string {
return executeLustreCommand(m.sudoCmd, m.lctl, LCTL_OPTION, fmt.Sprintf("llite.%s.stats", device), m.config.Sudo)
var command *exec.Cmd
statsfile := fmt.Sprintf("llite.%s.stats", device)
if m.config.Sudo {
command = exec.Command(m.sudoCmd, m.lctl, LCTL_OPTION, statsfile)
} else {
command = exec.Command(m.lctl, LCTL_OPTION, statsfile)
}
command.Wait()
stdout, _ := command.Output()
return strings.Split(string(stdout), "\n")
}
func (m *LustreCollector) getDevices() []string {
@@ -288,6 +108,183 @@ func getMetricData(lines []string, prefix string, offset int) (int64, error) {
// return strings.Split(string(buffer), "\n")
// }
var LustreAbsMetrics = []LustreMetricDefinition{
{
name: "lustre_read_requests",
lineprefix: "read_bytes",
lineoffset: 1,
unit: "requests",
calc: "none",
},
{
name: "lustre_write_requests",
lineprefix: "write_bytes",
lineoffset: 1,
unit: "requests",
calc: "none",
},
{
name: "lustre_read_bytes",
lineprefix: "read_bytes",
lineoffset: 6,
unit: "bytes",
calc: "none",
},
{
name: "lustre_write_bytes",
lineprefix: "write_bytes",
lineoffset: 6,
unit: "bytes",
calc: "none",
},
{
name: "lustre_open",
lineprefix: "open",
lineoffset: 1,
unit: "",
calc: "none",
},
{
name: "lustre_close",
lineprefix: "close",
lineoffset: 1,
unit: "",
calc: "none",
},
{
name: "lustre_setattr",
lineprefix: "setattr",
lineoffset: 1,
unit: "",
calc: "none",
},
{
name: "lustre_getattr",
lineprefix: "getattr",
lineoffset: 1,
unit: "",
calc: "none",
},
{
name: "lustre_statfs",
lineprefix: "statfs",
lineoffset: 1,
unit: "",
calc: "none",
},
{
name: "lustre_inode_permission",
lineprefix: "inode_permission",
lineoffset: 1,
unit: "",
calc: "none",
},
}
var LustreDiffMetrics = []LustreMetricDefinition{
{
name: "lustre_read_requests_diff",
lineprefix: "read_bytes",
lineoffset: 1,
unit: "requests",
calc: "difference",
},
{
name: "lustre_write_requests_diff",
lineprefix: "write_bytes",
lineoffset: 1,
unit: "requests",
calc: "difference",
},
{
name: "lustre_read_bytes_diff",
lineprefix: "read_bytes",
lineoffset: 6,
unit: "bytes",
calc: "difference",
},
{
name: "lustre_write_bytes_diff",
lineprefix: "write_bytes",
lineoffset: 6,
unit: "bytes",
calc: "difference",
},
{
name: "lustre_open_diff",
lineprefix: "open",
lineoffset: 1,
unit: "",
calc: "difference",
},
{
name: "lustre_close_diff",
lineprefix: "close",
lineoffset: 1,
unit: "",
calc: "difference",
},
{
name: "lustre_setattr_diff",
lineprefix: "setattr",
lineoffset: 1,
unit: "",
calc: "difference",
},
{
name: "lustre_getattr_diff",
lineprefix: "getattr",
lineoffset: 1,
unit: "",
calc: "difference",
},
{
name: "lustre_statfs_diff",
lineprefix: "statfs",
lineoffset: 1,
unit: "",
calc: "difference",
},
{
name: "lustre_inode_permission_diff",
lineprefix: "inode_permission",
lineoffset: 1,
unit: "",
calc: "difference",
},
}
var LustreDeriveMetrics = []LustreMetricDefinition{
{
name: "lustre_read_requests_rate",
lineprefix: "read_bytes",
lineoffset: 1,
unit: "requests/sec",
calc: "derivative",
},
{
name: "lustre_write_requests_rate",
lineprefix: "write_bytes",
lineoffset: 1,
unit: "requests/sec",
calc: "derivative",
},
{
name: "lustre_read_bw",
lineprefix: "read_bytes",
lineoffset: 6,
unit: "bytes/sec",
calc: "derivative",
},
{
name: "lustre_write_bw",
lineprefix: "write_bytes",
lineoffset: 6,
unit: "bytes/sec",
calc: "derivative",
},
}
func (m *LustreCollector) Init(config json.RawMessage) error {
var err error
m.name = "LustreCollector"
@@ -300,7 +297,7 @@ func (m *LustreCollector) Init(config json.RawMessage) error {
}
m.setup()
m.tags = map[string]string{"type": "node"}
m.meta = map[string]string{"source": m.name, "group": "Lustre", "scope": "node"}
m.meta = map[string]string{"source": m.name, "group": "Lustre"}
// Lustre file system statistics can only be queried by user root
// or with password-less sudo

View File

@@ -1,190 +0,0 @@
package collectors
import (
"fmt"
"os/exec"
"regexp"
"strconv"
"strings"
)
const LUSTRE_SYSFS = `/sys/fs/lustre`
const LCTL_CMD = `lctl`
const LCTL_OPTION = `get_param`
type LustreMetricDefinition struct {
name string
lineprefix string
lineoffset int
offsetname string
unit string
calc string
}
type LustreMetricData struct {
sample_time int64
start_time int64
elapsed_time int64
op_data map[string]map[string]int64
op_units map[string]string
sample_time_unit string
start_time_unit string
elapsed_time_unit string
}
var devicePattern = regexp.MustCompile(`^[\w\d\-_]+\.([\w\d\-_]+)\.[\w\d\-_]+=$`)
var jobPattern = regexp.MustCompile(`^-\s*job_id:\s*([\w\d\-_\.:]+)$`)
var snapshotPattern = regexp.MustCompile(`^\s*snapshot_time\s*:\s*([\d\.]+)\s*([\w\d\-_\.]*)$`)
var startPattern = regexp.MustCompile(`^\s*start_time\s*:\s*([\d\.]+)\s*([\w\d\-_\.]*)$`)
var elapsedPattern = regexp.MustCompile(`^\s*elapsed_time\s*:\s*([\d\.]+)\s*([\w\d\-_\.]*)$`)
var linePattern = regexp.MustCompile(`^\s*([\w\d\-_\.]+):\s*\{\s*samples:\s*([\d\.]+),\s*unit:\s*([\w\d\-_\.]+),\s*min:\s*([\d\.]+),\s*max:\s*([\d\.]+),\s*sum:\s*([\d\.]+),\s*sumsq:\s*([\d\.]+)\s*\}`)
func executeLustreCommand(sudo, lctl, option, search string, use_sudo bool) []string {
var command *exec.Cmd
if use_sudo {
command = exec.Command(sudo, lctl, option, search)
} else {
command = exec.Command(lctl, option, search)
}
command.Wait()
stdout, _ := command.Output()
return strings.Split(string(stdout), "\n")
}
func splitTree(lines []string, splitRegex *regexp.Regexp) map[string][]string {
entries := make(map[string][]string)
ent_lines := make([]int, 0)
for i, l := range lines {
m := splitRegex.FindStringSubmatch(l)
if len(m) == 2 {
ent_lines = append(ent_lines, i)
}
}
if len(ent_lines) > 0 {
for i, idx := range ent_lines[:len(ent_lines)-1] {
m := splitRegex.FindStringSubmatch(lines[idx])
entries[m[1]] = lines[idx+1 : ent_lines[i+1]]
}
last := ent_lines[len(ent_lines)-1]
m := splitRegex.FindStringSubmatch(lines[last])
entries[m[1]] = lines[last:]
}
return entries
}
func readDevices(lines []string) map[string][]string {
return splitTree(lines, devicePattern)
}
func readJobs(lines []string) map[string][]string {
return splitTree(lines, jobPattern)
}
func readJobdata(lines []string) LustreMetricData {
jobdata := LustreMetricData{
op_data: make(map[string]map[string]int64),
op_units: make(map[string]string),
sample_time: 0,
sample_time_unit: "nsec",
start_time: 0,
start_time_unit: "nsec",
elapsed_time: 0,
elapsed_time_unit: "nsec",
}
parseTime := func(value, unit string) int64 {
var t int64 = 0
if len(unit) == 0 {
unit = "secs"
}
values := strings.Split(value, ".")
units := strings.Split(unit, ".")
if len(values) != len(units) {
fmt.Printf("Invalid time specification '%s' and '%s'\n", value, unit)
}
for i, v := range values {
if len(units) > i {
s, err := strconv.ParseInt(v, 10, 64)
if err == nil {
switch units[i] {
case "secs":
t += s * 1e9
case "msecs":
t += s * 1e6
case "usecs":
t += s * 1e3
case "nsecs":
t += s
}
}
}
}
return t
}
parseNumber := func(value string) int64 {
s, err := strconv.ParseInt(value, 10, 64)
if err == nil {
return s
}
return 0
}
for _, l := range lines {
if jobdata.sample_time == 0 {
m := snapshotPattern.FindStringSubmatch(l)
if len(m) == 3 {
if len(m[2]) > 0 {
jobdata.sample_time = parseTime(m[1], m[2])
} else {
jobdata.sample_time = parseTime(m[1], "secs")
}
}
}
if jobdata.start_time == 0 {
m := startPattern.FindStringSubmatch(l)
if len(m) == 3 {
if len(m[2]) > 0 {
jobdata.start_time = parseTime(m[1], m[2])
} else {
jobdata.start_time = parseTime(m[1], "secs")
}
}
}
if jobdata.elapsed_time == 0 {
m := elapsedPattern.FindStringSubmatch(l)
if len(m) == 3 {
if len(m[2]) > 0 {
jobdata.elapsed_time = parseTime(m[1], m[2])
} else {
jobdata.elapsed_time = parseTime(m[1], "secs")
}
}
}
m := linePattern.FindStringSubmatch(l)
if len(m) == 8 {
jobdata.op_units[m[1]] = m[3]
jobdata.op_data[m[1]] = map[string]int64{
"samples": parseNumber(m[2]),
"min": parseNumber(m[4]),
"max": parseNumber(m[5]),
"sum": parseNumber(m[6]),
"sumsq": parseNumber(m[7]),
}
}
}
return jobdata
}
func readCommandOutput(lines []string) map[string]map[string]LustreMetricData {
var data map[string]map[string]LustreMetricData = make(map[string]map[string]LustreMetricData)
devs := readDevices(lines)
for d, ddata := range devs {
data[d] = make(map[string]LustreMetricData)
jobs := readJobs(ddata)
for j, jdata := range jobs {
x := readJobdata(jdata)
data[d][j] = x
}
}
return data
}

View File

@@ -0,0 +1,166 @@
package collectors
import (
"encoding/json"
"fmt"
"os"
"regexp"
"strconv"
"strings"
"time"
cclog "github.com/ClusterCockpit/cc-metric-collector/pkg/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
)
// These are the fields we read from the JSON configuration
type NfsIOStatCollectorConfig struct {
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
ExcludeFilesystem []string `json:"exclude_filesystem,omitempty"`
UseServerAddressAsSType bool `json:"use_server_as_stype,omitempty"`
}
// This contains all variables we need during execution and the variables
// defined by metricCollector (name, init, ...)
type NfsIOStatCollector struct {
metricCollector
config NfsIOStatCollectorConfig // the configuration structure
meta map[string]string // default meta information
tags map[string]string // default tags
data map[string]map[string]int64 // data storage for difference calculation
key string // which device info should be used as subtype ID? 'server' or 'mntpoint', see NfsIOStatCollectorConfig.UseServerAddressAsSType
}
var deviceRegex = regexp.MustCompile(`device (?P<server>[^ ]+) mounted on (?P<mntpoint>[^ ]+) with fstype nfs(?P<version>\d*) statvers=[\d\.]+`)
var bytesRegex = regexp.MustCompile(`\s+bytes:\s+(?P<nread>[^ ]+) (?P<nwrite>[^ ]+) (?P<dread>[^ ]+) (?P<dwrite>[^ ]+) (?P<nfsread>[^ ]+) (?P<nfswrite>[^ ]+) (?P<pageread>[^ ]+) (?P<pagewrite>[^ ]+)`)
func resolve_regex_fields(s string, regex *regexp.Regexp) map[string]string {
fields := make(map[string]string)
groups := regex.SubexpNames()
for _, match := range regex.FindAllStringSubmatch(s, -1) {
for groupIdx, group := range match {
if len(groups[groupIdx]) > 0 {
fields[groups[groupIdx]] = group
}
}
}
return fields
}
func (m *NfsIOStatCollector) readNfsiostats() map[string]map[string]int64 {
data := make(map[string]map[string]int64)
filename := "/proc/self/mountstats"
stats, err := os.ReadFile(filename)
if err != nil {
return data
}
lines := strings.Split(string(stats), "\n")
var current map[string]string = nil
for _, l := range lines {
// Is this a device line with mount point, remote target and NFS version?
dev := resolve_regex_fields(l, deviceRegex)
if len(dev) > 0 {
if _, ok := stringArrayContains(m.config.ExcludeFilesystem, dev[m.key]); !ok {
current = dev
if len(current["version"]) == 0 {
current["version"] = "3"
}
}
}
if len(current) > 0 {
// Byte line parsing (if found the device for it)
bytes := resolve_regex_fields(l, bytesRegex)
if len(bytes) > 0 {
data[current[m.key]] = make(map[string]int64)
for name, sval := range bytes {
if _, ok := stringArrayContains(m.config.ExcludeMetrics, name); !ok {
val, err := strconv.ParseInt(sval, 10, 64)
if err == nil {
data[current[m.key]][name] = val
}
}
}
current = nil
}
}
}
return data
}
func (m *NfsIOStatCollector) Init(config json.RawMessage) error {
var err error = nil
m.name = "NfsIOStatCollector"
m.setup()
m.parallel = true
m.meta = map[string]string{"source": m.name, "group": "NFS", "unit": "bytes"}
m.tags = map[string]string{"type": "node"}
m.config.UseServerAddressAsSType = false
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
cclog.ComponentError(m.name, "Error reading config:", err.Error())
return err
}
}
m.key = "mntpoint"
if m.config.UseServerAddressAsSType {
m.key = "server"
}
m.data = m.readNfsiostats()
m.init = true
return err
}
func (m *NfsIOStatCollector) Read(interval time.Duration, output chan lp.CCMetric) {
timestamp := time.Now()
// Get the current values for all mountpoints
newdata := m.readNfsiostats()
for mntpoint, values := range newdata {
// Was the mount point already present in the last iteration
if old, ok := m.data[mntpoint]; ok {
// Calculate the difference of old and new values
for i := range values {
x := values[i] - old[i]
y, err := lp.New(fmt.Sprintf("nfsio_%s", i), m.tags, m.meta, map[string]interface{}{"value": x}, timestamp)
if err == nil {
if strings.HasPrefix(i, "page") {
y.AddMeta("unit", "4K_Pages")
}
y.AddTag("stype", "filesystem")
y.AddTag("stype-id", mntpoint)
// Send it to output channel
output <- y
}
// Update old to the new value for the next iteration
old[i] = values[i]
}
} else {
// First time we see this mount point, store all values
m.data[mntpoint] = values
}
}
// Reset entries that do not exist anymore
for mntpoint := range m.data {
found := false
for new := range newdata {
if new == mntpoint {
found = true
break
}
}
if !found {
m.data[mntpoint] = nil
}
}
}
func (m *NfsIOStatCollector) Close() {
// Unset flag
m.init = false
}

View File

@@ -0,0 +1,27 @@
## `nfsiostat` collector
```json
"nfsiostat": {
"exclude_metrics": [
"nfsio_oread"
],
"exclude_filesystems" : [
"/mnt",
],
"use_server_as_stype": false
}
```
The `nfsiostat` collector reads data from `/proc/self/mountstats` and outputs a handful **node** metrics for each NFS filesystem. If a metric or filesystem is not required, it can be excluded from forwarding it to the sink.
Metrics:
* `nfsio_nread`: Bytes transferred by normal `read()` calls
* `nfsio_nwrite`: Bytes transferred by normal `write()` calls
* `nfsio_oread`: Bytes transferred by `read()` calls with `O_DIRECT`
* `nfsio_owrite`: Bytes transferred by `write()` calls with `O_DIRECT`
* `nfsio_pageread`: Pages transferred by `read()` calls
* `nfsio_pagewrite`: Pages transferred by `write()` calls
* `nfsio_nfsread`: Bytes transferred for reading from the server
* `nfsio_nfswrite`: Pages transferred by writing to the server
The `nfsiostat` collector adds the mountpoint to the tags as `stype=filesystem,stype-id=<mountpoint>`. If the server address should be used instead of the mountpoint, use the `use_server_as_stype` config setting.

View File

@@ -14,29 +14,38 @@ import (
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
)
//
// Numa policy hit/miss statistics
// Non-Uniform Memory Access (NUMA) policy hit/miss statistics
//
// numa_hit:
// A process wanted to allocate memory from this node, and succeeded.
//
// A process wanted to allocate memory from this node, and succeeded.
//
// numa_miss:
// A process wanted to allocate memory from another node,
// but ended up with memory from this node.
//
// A process wanted to allocate memory from another node,
// but ended up with memory from this node.
//
// numa_foreign:
// A process wanted to allocate on this node,
// but ended up with memory from another node.
//
// A process wanted to allocate on this node,
// but ended up with memory from another node.
//
// local_node:
// A process ran on this node's CPU,
// and got memory from this node.
//
// A process ran on this node's CPU,
// and got memory from this node.
//
// other_node:
// A process ran on a different node's CPU
// and got memory from this node.
//
// A process ran on a different node's CPU
// and got memory from this node.
//
// interleave_hit:
// Interleaving wanted to allocate from this node
// and succeeded.
//
// Interleaving wanted to allocate from this node
// and succeeded.
//
// See: https://www.kernel.org/doc/html/latest/admin-guide/numastat.html
//
type NUMAStatsCollectorTopolgy struct {
file string
tagSet map[string]string
@@ -82,6 +91,8 @@ func (m *NUMAStatsCollector) Init(config json.RawMessage) error {
})
}
// Initialized
cclog.ComponentDebug(m.name, "initialized", len(m.topology), "NUMA domains")
m.init = true
return nil
}

View File

@@ -1,15 +1,17 @@
## `numastat` collector
```json
"numastat": {}
"numastats": {}
```
The `numastat` collector reads data from `/sys/devices/system/node/node*/numastat` and outputs a handful **memoryDomain** metrics. See: https://www.kernel.org/doc/html/latest/admin-guide/numastat.html
The `numastat` collector reads data from `/sys/devices/system/node/node*/numastat` and outputs a handful **memoryDomain** metrics. See: <https://www.kernel.org/doc/html/latest/admin-guide/numastat.html>
Metrics:
* `numastats_numa_hit`: A process wanted to allocate memory from this node, and succeeded.
* `numastats_numa_miss`: A process wanted to allocate memory from another node, but ended up with memory from this node.
* `numastats_numa_foreign`: A process wanted to allocate on this node, but ended up with memory from another node.
* `numastats_local_node`: A process ran on this node's CPU, and got memory from this node.
* `numastats_other_node`: A process ran on a different node's CPU, and got memory from this node.
* `numastats_interleave_hit`: Interleaving wanted to allocate from this node and succeeded.
* `numastats_interleave_hit`: Interleaving wanted to allocate from this node and succeeded.

262
collectors/raplMetric.go Normal file
View File

@@ -0,0 +1,262 @@
package collectors
import (
"encoding/json"
"fmt"
"os"
"path/filepath"
"strconv"
"strings"
"time"
cclog "github.com/ClusterCockpit/cc-metric-collector/pkg/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
)
// running average power limit (RAPL) monitoring attributes for a zone
type RAPLZoneInfo struct {
// tags describing the RAPL zone:
// * zone_name, subzone_name: e.g. psys, dram, core, uncore, package-0
// * zone_id: e.g. 0:1 (zone 0 sub zone 1)
tags map[string]string
energyFilepath string // path to a file containing the zones current energy counter in micro joules
energy int64 // current reading of the energy counter in micro joules
energyTimestamp time.Time // timestamp when energy counter was read
maxEnergyRange int64 // Range of the above energy counter in micro-joules
}
type RAPLCollector struct {
metricCollector
config struct {
// Exclude IDs for RAPL zones, e.g.
// * 0 for zone 0
// * 0:1 for zone 0 subzone 1
ExcludeByID []string `json:"exclude_device_by_id,omitempty"`
// Exclude names for RAPL zones, e.g. psys, dram, core, uncore, package-0
ExcludeByName []string `json:"exclude_device_by_name,omitempty"`
}
RAPLZoneInfo []RAPLZoneInfo
meta map[string]string // default meta information
}
// Init initializes the running average power limit (RAPL) collector
func (m *RAPLCollector) Init(config json.RawMessage) error {
// Check if already initialized
if m.init {
return nil
}
var err error = nil
m.name = "RAPLCollector"
m.setup()
m.parallel = true
m.meta = map[string]string{
"source": m.name,
"group": "energy",
"unit": "Watt",
}
// Read in the JSON configuration
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
cclog.ComponentError(m.name, "Error reading config:", err.Error())
return err
}
}
// Configure excluded RAPL zones
isIDExcluded := make(map[string]bool)
if m.config.ExcludeByID != nil {
for _, ID := range m.config.ExcludeByID {
isIDExcluded[ID] = true
}
}
isNameExcluded := make(map[string]bool)
if m.config.ExcludeByName != nil {
for _, name := range m.config.ExcludeByName {
isNameExcluded[name] = true
}
}
// readZoneInfo reads RAPL monitoring attributes for a zone given by zonePath
// See: https://www.kernel.org/doc/html/latest/power/powercap/powercap.html#monitoring-attributes
readZoneInfo := func(zonePath string) (z struct {
name string // zones name e.g. psys, dram, core, uncore, package-0
energyFilepath string // path to a file containing the zones current energy counter in micro joules
energy int64 // current reading of the energy counter in micro joules
energyTimestamp time.Time // timestamp when energy counter was read
maxEnergyRange int64 // Range of the above energy counter in micro-joules
ok bool // Are all information available?
}) {
// zones name e.g. psys, dram, core, uncore, package-0
foundName := false
if v, err :=
os.ReadFile(
filepath.Join(zonePath, "name")); err == nil {
foundName = true
z.name = strings.TrimSpace(string(v))
}
// path to a file containing the zones current energy counter in micro joules
z.energyFilepath = filepath.Join(zonePath, "energy_uj")
// current reading of the energy counter in micro joules
foundEnergy := false
if v, err := os.ReadFile(z.energyFilepath); err == nil {
// timestamp when energy counter was read
z.energyTimestamp = time.Now()
if i, err := strconv.ParseInt(strings.TrimSpace(string(v)), 10, 64); err == nil {
foundEnergy = true
z.energy = i
}
}
// Range of the above energy counter in micro-joules
foundMaxEnergyRange := false
if v, err :=
os.ReadFile(
filepath.Join(zonePath, "max_energy_range_uj")); err == nil {
if i, err := strconv.ParseInt(strings.TrimSpace(string(v)), 10, 64); err == nil {
foundMaxEnergyRange = true
z.maxEnergyRange = i
}
}
// Are all information available?
z.ok = foundName && foundEnergy && foundMaxEnergyRange
return
}
powerCapPrefix := "/sys/devices/virtual/powercap"
controlType := "intel-rapl"
controlTypePath := filepath.Join(powerCapPrefix, controlType)
// Find all RAPL zones
zonePrefix := filepath.Join(controlTypePath, controlType+":")
zonesPath, err := filepath.Glob(zonePrefix + "*")
if err != nil || zonesPath == nil {
return fmt.Errorf("unable to find any zones under %s", controlTypePath)
}
for _, zonePath := range zonesPath {
zoneID := strings.TrimPrefix(zonePath, zonePrefix)
z := readZoneInfo(zonePath)
if z.ok &&
!isIDExcluded[zoneID] &&
!isNameExcluded[z.name] {
// Add RAPL monitoring attributes for a zone
m.RAPLZoneInfo =
append(
m.RAPLZoneInfo,
RAPLZoneInfo{
tags: map[string]string{
"id": zoneID,
"zone_name": z.name,
},
energyFilepath: z.energyFilepath,
energy: z.energy,
energyTimestamp: z.energyTimestamp,
maxEnergyRange: z.maxEnergyRange,
})
}
// find all sub zones for the given zone
subZonePrefix := filepath.Join(zonePath, controlType+":"+zoneID+":")
subZonesPath, err := filepath.Glob(subZonePrefix + "*")
if err != nil || subZonesPath == nil {
continue
}
for _, subZonePath := range subZonesPath {
subZoneID := strings.TrimPrefix(subZonePath, subZonePrefix)
sz := readZoneInfo(subZonePath)
if len(zoneID) > 0 && len(z.name) > 0 &&
sz.ok &&
!isIDExcluded[zoneID+":"+subZoneID] &&
!isNameExcluded[sz.name] {
m.RAPLZoneInfo =
append(
m.RAPLZoneInfo,
RAPLZoneInfo{
tags: map[string]string{
"id": zoneID + ":" + subZoneID,
"zone_name": z.name,
"sub_zone_name": sz.name,
},
energyFilepath: sz.energyFilepath,
energy: sz.energy,
energyTimestamp: sz.energyTimestamp,
maxEnergyRange: sz.maxEnergyRange,
})
}
}
}
if m.RAPLZoneInfo == nil {
return fmt.Errorf("no running average power limit (RAPL) device found in %s", controlTypePath)
}
// Initialized
cclog.ComponentDebug(
m.name,
"initialized",
len(m.RAPLZoneInfo),
"zones with running average power limit (RAPL) monitoring attributes")
m.init = true
return err
}
// Read reads running average power limit (RAPL) monitoring attributes for all initialized zones
// See: https://www.kernel.org/doc/html/latest/power/powercap/powercap.html#monitoring-attributes
func (m *RAPLCollector) Read(interval time.Duration, output chan lp.CCMetric) {
for i := range m.RAPLZoneInfo {
p := &m.RAPLZoneInfo[i]
// Read current value of the energy counter in micro joules
if v, err := os.ReadFile(p.energyFilepath); err == nil {
energyTimestamp := time.Now()
if i, err := strconv.ParseInt(strings.TrimSpace(string(v)), 10, 64); err == nil {
energy := i
// Compute average power (Δ energy / Δ time)
energyDiff := energy - p.energy
if energyDiff < 0 {
// Handle overflow:
// ( p.maxEnergyRange - p.energy ) + energy
// = p.maxEnergyRange + ( energy - p.energy )
// = p.maxEnergyRange + diffEnergy
energyDiff += p.maxEnergyRange
}
timeDiff := energyTimestamp.Sub(p.energyTimestamp)
averagePower := float64(energyDiff) / float64(timeDiff.Microseconds())
y, err := lp.New(
"rapl_average_power",
p.tags,
m.meta,
map[string]interface{}{"value": averagePower},
energyTimestamp)
if err == nil {
output <- y
}
// Save current energy counter state
p.energy = energy
p.energyTimestamp = energyTimestamp
}
}
}
}
// Close closes running average power limit (RAPL) metric collector
func (m *RAPLCollector) Close() {
// Unset flag
m.init = false
}

18
collectors/raplMetric.md Normal file
View File

@@ -0,0 +1,18 @@
# Running average power limit (RAPL) metric collector
This collector reads running average power limit (RAPL) monitoring attributes to compute average power consumption metrics. See <https://www.kernel.org/doc/html/latest/power/powercap/powercap.html#monitoring-attributes>.
The Likwid metric collector provides similar functionality.
## Configuration
```json
"rapl": {
"exclude_device_by_id": ["0:1", "0:2"],
"exclude_device_by_name": ["psys"]
}
```
## Metrics
* `rapl_average_power`: average power consumption in Watt. The average is computed over the entire runtime from the last measurement to the current measurement

View File

@@ -17,9 +17,9 @@ type SampleCollectorConfig struct {
// defined by metricCollector (name, init, ...)
type SampleCollector struct {
metricCollector
config SampleTimerCollectorConfig // the configuration structure
meta map[string]string // default meta information
tags map[string]string // default tags
config SampleCollectorConfig // the configuration structure
meta map[string]string // default meta information
tags map[string]string // default tags
}
// Functions to implement MetricCollector interface
@@ -36,14 +36,14 @@ func (m *SampleCollector) Init(config json.RawMessage) error {
// This is for later use, also call it early
m.setup()
// Tell whether the collector should be run in parallel with others (reading files, ...)
// or it should be run serially, mostly for collectors acutally doing measurements
// or it should be run serially, mostly for collectors actually doing measurements
// because they should not measure the execution of the other collectors
m.parallel = true
// Define meta information sent with each metric
// (Can also be dynamic or this is the basic set with extension through AddMeta())
m.meta = map[string]string{"source": m.name, "group": "SAMPLE"}
// Define tags sent with each metric
// The 'type' tag is always needed, it defines the granulatity of the metric
// The 'type' tag is always needed, it defines the granularity of the metric
// node -> whole system
// socket -> CPU socket (requires socket ID as 'type-id' tag)
// die -> CPU die (requires CPU die ID as 'type-id' tag)

View File

@@ -38,7 +38,7 @@ func (m *SampleTimerCollector) Init(name string, config json.RawMessage) error {
// (Can also be dynamic or this is the basic set with extension through AddMeta())
m.meta = map[string]string{"source": m.name, "group": "SAMPLE"}
// Define tags sent with each metric
// The 'type' tag is always needed, it defines the granulatity of the metric
// The 'type' tag is always needed, it defines the granularity of the metric
// node -> whole system
// socket -> CPU socket (requires socket ID as 'type-id' tag)
// cpu -> single CPU hardware thread (requires cpu ID as 'type-id' tag)
@@ -60,7 +60,7 @@ func (m *SampleTimerCollector) Init(name string, config json.RawMessage) error {
// Storage for output channel
m.output = nil
// Mangement channel for the timer function.
// Management channel for the timer function.
m.done = make(chan bool)
// Create the own ticker
m.ticker = time.NewTicker(m.interval)
@@ -94,7 +94,7 @@ func (m *SampleTimerCollector) ReadMetrics(timestamp time.Time) {
value := 1.0
// If you want to measure something for a specific amout of time, use interval
// If you want to measure something for a specific amount of time, use interval
// start := readState()
// time.Sleep(interval)
// stop := readState()

144
collectors/selfMetric.go Normal file
View File

@@ -0,0 +1,144 @@
package collectors
import (
"encoding/json"
"runtime"
"syscall"
"time"
cclog "github.com/ClusterCockpit/cc-metric-collector/pkg/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
)
type SelfCollectorConfig struct {
MemStats bool `json:"read_mem_stats"`
GoRoutines bool `json:"read_goroutines"`
CgoCalls bool `json:"read_cgo_calls"`
Rusage bool `json:"read_rusage"`
}
type SelfCollector struct {
metricCollector
config SelfCollectorConfig // the configuration structure
meta map[string]string // default meta information
tags map[string]string // default tags
}
func (m *SelfCollector) Init(config json.RawMessage) error {
var err error = nil
m.name = "SelfCollector"
m.setup()
m.parallel = true
m.meta = map[string]string{"source": m.name, "group": "Self"}
m.tags = map[string]string{"type": "node"}
if len(config) > 0 {
err = json.Unmarshal(config, &m.config)
if err != nil {
cclog.ComponentError(m.name, "Error reading config:", err.Error())
return err
}
}
m.init = true
return err
}
func (m *SelfCollector) Read(interval time.Duration, output chan lp.CCMetric) {
timestamp := time.Now()
if m.config.MemStats {
var memstats runtime.MemStats
runtime.ReadMemStats(&memstats)
y, err := lp.New("total_alloc", m.tags, m.meta, map[string]interface{}{"value": memstats.TotalAlloc}, timestamp)
if err == nil {
y.AddMeta("unit", "Bytes")
output <- y
}
y, err = lp.New("heap_alloc", m.tags, m.meta, map[string]interface{}{"value": memstats.HeapAlloc}, timestamp)
if err == nil {
y.AddMeta("unit", "Bytes")
output <- y
}
y, err = lp.New("heap_sys", m.tags, m.meta, map[string]interface{}{"value": memstats.HeapSys}, timestamp)
if err == nil {
y.AddMeta("unit", "Bytes")
output <- y
}
y, err = lp.New("heap_idle", m.tags, m.meta, map[string]interface{}{"value": memstats.HeapIdle}, timestamp)
if err == nil {
y.AddMeta("unit", "Bytes")
output <- y
}
y, err = lp.New("heap_inuse", m.tags, m.meta, map[string]interface{}{"value": memstats.HeapInuse}, timestamp)
if err == nil {
y.AddMeta("unit", "Bytes")
output <- y
}
y, err = lp.New("heap_released", m.tags, m.meta, map[string]interface{}{"value": memstats.HeapReleased}, timestamp)
if err == nil {
y.AddMeta("unit", "Bytes")
output <- y
}
y, err = lp.New("heap_objects", m.tags, m.meta, map[string]interface{}{"value": memstats.HeapObjects}, timestamp)
if err == nil {
output <- y
}
}
if m.config.GoRoutines {
y, err := lp.New("num_goroutines", m.tags, m.meta, map[string]interface{}{"value": runtime.NumGoroutine()}, timestamp)
if err == nil {
output <- y
}
}
if m.config.CgoCalls {
y, err := lp.New("num_cgo_calls", m.tags, m.meta, map[string]interface{}{"value": runtime.NumCgoCall()}, timestamp)
if err == nil {
output <- y
}
}
if m.config.Rusage {
var rusage syscall.Rusage
err := syscall.Getrusage(syscall.RUSAGE_SELF, &rusage)
if err == nil {
sec, nsec := rusage.Utime.Unix()
t := float64(sec) + (float64(nsec) * 1e-9)
y, err := lp.New("rusage_user_time", m.tags, m.meta, map[string]interface{}{"value": t}, timestamp)
if err == nil {
y.AddMeta("unit", "seconds")
output <- y
}
sec, nsec = rusage.Stime.Unix()
t = float64(sec) + (float64(nsec) * 1e-9)
y, err = lp.New("rusage_system_time", m.tags, m.meta, map[string]interface{}{"value": t}, timestamp)
if err == nil {
y.AddMeta("unit", "seconds")
output <- y
}
y, err = lp.New("rusage_vol_ctx_switch", m.tags, m.meta, map[string]interface{}{"value": rusage.Nvcsw}, timestamp)
if err == nil {
output <- y
}
y, err = lp.New("rusage_invol_ctx_switch", m.tags, m.meta, map[string]interface{}{"value": rusage.Nivcsw}, timestamp)
if err == nil {
output <- y
}
y, err = lp.New("rusage_signals", m.tags, m.meta, map[string]interface{}{"value": rusage.Nsignals}, timestamp)
if err == nil {
output <- y
}
y, err = lp.New("rusage_major_pgfaults", m.tags, m.meta, map[string]interface{}{"value": rusage.Majflt}, timestamp)
if err == nil {
output <- y
}
y, err = lp.New("rusage_minor_pgfaults", m.tags, m.meta, map[string]interface{}{"value": rusage.Minflt}, timestamp)
if err == nil {
output <- y
}
}
}
}
func (m *SelfCollector) Close() {
m.init = false
}

34
collectors/selfMetric.md Normal file
View File

@@ -0,0 +1,34 @@
## `self` collector
```json
"self": {
"read_mem_stats" : true,
"read_goroutines" : true,
"read_cgo_calls" : true,
"read_rusage" : true
}
```
The `self` collector reads the data from the `runtime` and `syscall` packages, so monitors the execution of the cc-metric-collector itself.
Metrics:
* If `read_mem_stats == true`:
* `total_alloc`: The metric reports cumulative bytes allocated for heap objects.
* `heap_alloc`: The metric reports bytes of allocated heap objects.
* `heap_sys`: The metric reports bytes of heap memory obtained from the OS.
* `heap_idle`: The metric reports bytes in idle (unused) spans.
* `heap_inuse`: The metric reports bytes in in-use spans.
* `heap_released`: The metric reports bytes of physical memory returned to the OS.
* `heap_objects`: The metric reports the number of allocated heap objects.
* If `read_goroutines == true`:
* `num_goroutines`: The metric reports the number of goroutines that currently exist.
* If `read_cgo_calls == true`:
* `num_cgo_calls`: The metric reports the number of cgo calls made by the current process.
* If `read_rusage == true`:
* `rusage_user_time`: The metric reports the amount of time that this process has been scheduled in user mode.
* `rusage_system_time`: The metric reports the amount of time that this process has been scheduled in kernel mode.
* `rusage_vol_ctx_switch`: The metric reports the amount of voluntary context switches.
* `rusage_invol_ctx_switch`: The metric reports the amount of involuntary context switches.
* `rusage_signals`: The metric reports the number of signals received.
* `rusage_major_pgfaults`: The metric reports the number of major faults the process has made which have required loading a memory page from disk.
* `rusage_minor_pgfaults`: The metric reports the number of minor faults the process has made which have not required loading a memory page from disk.

View File

@@ -37,7 +37,9 @@ $ install --mode 644 \
$ systemctl enable cc-metric-collector
```
## RPM
## Packaging
### RPM
In order to get a RPM packages for cc-metric-collector, just use:
@@ -47,7 +49,7 @@ $ make RPM
It uses the RPM SPEC file `scripts/cc-metric-collector.spec` and requires the RPM tools (`rpm` and `rpmspec`) and `git`.
## DEB
### DEB
In order to get very simple Debian packages for cc-metric-collector, just use:
@@ -57,4 +59,16 @@ $ make DEB
It uses the DEB control file `scripts/cc-metric-collector.control` and requires `dpkg-deb`, `awk`, `sed` and `git`. It creates only a binary deb package.
_This option is not well tested and therefore experimental_
_This option is not well tested and therefore experimental_
### Customizing RPMs or DEB packages
If you want to customize the RPMs or DEB packages for your local system, use the following workflow.
- (if there is already a fork in the private account, delete it and wait until Github realizes the deletion)
- Fork the cc-metric-collector repository (if Github hasn't realized it, it creates a fork named cc-metric-collector2)
- Go to private cc-metric-collector repository and enable Github Actions
- Do changes to the scripts, code, ... Commit and push your changes.
- Tag the new commit with `v0.x.y-<myversion>` (`git tag v0.x.y-<myversion>`)
- Push tags to repository (`git push --tags`)
- Wait until the Release action finishes. It creates fresh RPMs and DEBs in your private repository on the Releases page.

36
go.mod
View File

@@ -6,35 +6,37 @@ require (
github.com/ClusterCockpit/cc-units v0.3.0
github.com/ClusterCockpit/go-rocm-smi v0.3.0
github.com/NVIDIA/go-nvml v0.11.6-0
github.com/PaesslerAG/gval v1.2.0
github.com/PaesslerAG/gval v1.2.1
github.com/gorilla/mux v1.8.0
github.com/influxdata/influxdb-client-go/v2 v2.9.1
github.com/influxdata/influxdb-client-go/v2 v2.12.1
github.com/influxdata/line-protocol v0.0.0-20210922203350-b1ad95c89adf
github.com/nats-io/nats.go v1.16.0
github.com/prometheus/client_golang v1.12.2
github.com/nats-io/nats.go v1.22.1
github.com/prometheus/client_golang v1.14.0
github.com/stmcginnis/gofish v0.13.0
github.com/tklauser/go-sysconf v0.3.10
golang.org/x/sys v0.0.0-20220712014510-0a85c31ab51e
github.com/tklauser/go-sysconf v0.3.11
golang.design/x/thread v0.0.0-20210122121316-335e9adffdf1
golang.org/x/sys v0.3.0
gopkg.in/fsnotify.v0 v0.9.3
)
require (
github.com/apapsch/go-jsonmerge/v2 v2.0.0 // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/cespare/xxhash/v2 v2.1.2 // indirect
github.com/deepmap/oapi-codegen v1.11.0 // indirect
github.com/cespare/xxhash/v2 v2.2.0 // indirect
github.com/deepmap/oapi-codegen v1.12.4 // indirect
github.com/golang/protobuf v1.5.2 // indirect
github.com/google/uuid v1.3.0 // indirect
github.com/matttproud/golang_protobuf_extensions v1.0.1 // indirect
github.com/matttproud/golang_protobuf_extensions v1.0.4 // indirect
github.com/nats-io/nats-server/v2 v2.8.4 // indirect
github.com/nats-io/nkeys v0.3.0 // indirect
github.com/nats-io/nuid v1.0.1 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/prometheus/client_model v0.2.0 // indirect
github.com/prometheus/common v0.37.0 // indirect
github.com/prometheus/procfs v0.7.3 // indirect
github.com/prometheus/client_model v0.3.0 // indirect
github.com/prometheus/common v0.39.0 // indirect
github.com/prometheus/procfs v0.9.0 // indirect
github.com/shopspring/decimal v1.3.1 // indirect
github.com/tklauser/numcpus v0.4.0 // indirect
golang.org/x/crypto v0.0.0-20220622213112-05595931fe9d // indirect
golang.org/x/net v0.0.0-20220708220712-1185a9018129 // indirect
google.golang.org/protobuf v1.28.0 // indirect
gopkg.in/yaml.v2 v2.4.0 // indirect
github.com/tklauser/numcpus v0.6.0 // indirect
golang.org/x/crypto v0.4.0 // indirect
golang.org/x/net v0.4.0 // indirect
google.golang.org/protobuf v1.28.1 // indirect
)

View File

@@ -9,8 +9,8 @@ import (
cclog "github.com/ClusterCockpit/cc-metric-collector/pkg/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
agg "github.com/ClusterCockpit/cc-metric-collector/internal/metricAggregator"
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
mct "github.com/ClusterCockpit/cc-metric-collector/pkg/multiChanTicker"
units "github.com/ClusterCockpit/cc-units"
)
@@ -281,9 +281,7 @@ func (r *metricRouter) Start() {
// Foward message received from collector channel
coll_forward := func(p lp.CCMetric) {
// receive from metric collector
if !p.HasTag(r.config.HostnameTagName) {
p.AddTag(r.config.HostnameTagName, r.hostname)
}
p.AddTag(r.config.HostnameTagName, r.hostname)
if r.config.IntervalStamp {
p.SetTime(r.timestamp)
}
@@ -312,9 +310,7 @@ func (r *metricRouter) Start() {
cache_forward := func(p lp.CCMetric) {
// receive from metric collector
if !r.dropMetric(p) {
if !p.HasTag(r.config.HostnameTagName) {
p.AddTag(r.config.HostnameTagName, r.hostname)
}
p.AddTag(r.config.HostnameTagName, r.hostname)
forward(p)
}
}

125
pkg/hostlist/hostlist.go Normal file
View File

@@ -0,0 +1,125 @@
package hostlist
import (
"fmt"
"regexp"
"sort"
"strconv"
"strings"
)
func Expand(in string) (result []string, err error) {
// Create ranges regular expression
reStNumber := "[[:digit:]]+"
reStRange := reStNumber + "-" + reStNumber
reStOptionalNumberOrRange := "(" + reStNumber + ",|" + reStRange + ",)*"
reStNumberOrRange := "(" + reStNumber + "|" + reStRange + ")"
reStBraceLeft := "[[]"
reStBraceRight := "[]]"
reStRanges := reStBraceLeft +
reStOptionalNumberOrRange +
reStNumberOrRange +
reStBraceRight
reRanges := regexp.MustCompile(reStRanges)
// Create host list regular expression
reStDNSChars := "[a-zA-Z0-9-]+"
reStPrefix := "^(" + reStDNSChars + ")"
reStOptionalSuffix := "(" + reStDNSChars + ")?"
re := regexp.MustCompile(reStPrefix + "([[][0-9,-]+[]])?" + reStOptionalSuffix)
// Remove all delimiters from the input
in = strings.TrimLeft(in, ", ")
for len(in) > 0 {
if v := re.FindStringSubmatch(in); v != nil {
// Remove matched part from the input
lenPrefix := len(v[0])
in = in[lenPrefix:]
// Remove all delimiters from the input
in = strings.TrimLeft(in, ", ")
// matched prefix, range and suffix
hlPrefix := v[1]
hlRanges := v[2]
hlSuffix := v[3]
// Single node without ranges
if hlRanges == "" {
result = append(result, hlPrefix)
continue
}
// Node with ranges
if v := reRanges.FindStringSubmatch(hlRanges); v != nil {
// Remove braces
hlRanges = hlRanges[1 : len(hlRanges)-1]
// Split host ranges at ,
for _, hlRange := range strings.Split(hlRanges, ",") {
// Split host range at -
RangeStartEnd := strings.Split(hlRange, "-")
// Range is only a single number
if len(RangeStartEnd) == 1 {
result = append(result, hlPrefix+RangeStartEnd[0]+hlSuffix)
continue
}
// Range has a start and an end
widthRangeStart := len(RangeStartEnd[0])
widthRangeEnd := len(RangeStartEnd[1])
iStart, _ := strconv.ParseUint(RangeStartEnd[0], 10, 64)
iEnd, _ := strconv.ParseUint(RangeStartEnd[1], 10, 64)
if iStart > iEnd {
return nil, fmt.Errorf("single range start is greater than end: %s", hlRange)
}
// Create print format string for range numbers
doPadding := widthRangeStart == widthRangeEnd
widthPadding := widthRangeStart
var formatString string
if doPadding {
formatString = "%0" + fmt.Sprint(widthPadding) + "d"
} else {
formatString = "%d"
}
formatString = hlPrefix + formatString + hlSuffix
// Add nodes from this range
for i := iStart; i <= iEnd; i++ {
result = append(result, fmt.Sprintf(formatString, i))
}
}
} else {
return nil, fmt.Errorf("not at hostlist range: %s", hlRanges)
}
} else {
return nil, fmt.Errorf("not a hostlist: %s", in)
}
}
if result != nil {
// sort
sort.Strings(result)
// uniq
previous := 1
for current := 1; current < len(result); current++ {
if result[current-1] != result[current] {
if previous != current {
result[previous] = result[current]
}
previous++
}
}
result = result[:previous]
}
return
}

View File

@@ -0,0 +1,126 @@
package hostlist
import (
"testing"
)
func TestExpand(t *testing.T) {
// Compare two slices of strings
equal := func(a, b []string) bool {
if len(a) != len(b) {
return false
}
for i, v := range a {
if v != b[i] {
return false
}
}
return true
}
type testDefinition struct {
input string
resultExpected []string
errorExpected bool
}
expandTests := []testDefinition{
{
// Single node
input: "n1",
resultExpected: []string{"n1"},
errorExpected: false,
},
{
// Single node, duplicated
input: "n1,n1",
resultExpected: []string{"n1"},
errorExpected: false,
},
{
// Single node with padding
input: "n[01]",
resultExpected: []string{"n01"},
errorExpected: false,
},
{
// Single node with suffix
input: "n[01]-p",
resultExpected: []string{"n01-p"},
errorExpected: false,
},
{
// Multiple nodes with a single range
input: "n[1-2]",
resultExpected: []string{"n1", "n2"},
errorExpected: false,
},
{
// Multiple nodes with a single range and a single index
input: "n[1-2,3]",
resultExpected: []string{"n1", "n2", "n3"},
errorExpected: false,
},
{
// Multiple nodes with different prefixes
input: "n[1-2],m[1,2]",
resultExpected: []string{"m1", "m2", "n1", "n2"},
errorExpected: false,
},
{
// Multiple nodes with different suffixes
input: "n[1-2]-p,n[1,2]-q",
resultExpected: []string{"n1-p", "n1-q", "n2-p", "n2-q"},
errorExpected: false,
},
{
// Multiple nodes with and without node ranges
input: " n09, n[01-04,06-07,09] , , n10,n04",
resultExpected: []string{"n01", "n02", "n03", "n04", "n06", "n07", "n09", "n10"},
errorExpected: false,
},
{
// Forbidden DNS character
input: "n@",
resultExpected: []string{},
errorExpected: true,
},
{
// Forbidden range
input: "n[1-2-2,3]",
resultExpected: []string{},
errorExpected: true,
},
{
// Forbidden range limits
input: "n[2-1]",
resultExpected: []string{},
errorExpected: true,
},
}
for _, expandTest := range expandTests {
result, err := Expand(expandTest.input)
hasError := err != nil
if hasError != expandTest.errorExpected && hasError {
t.Errorf("Expand('%s') failed: unexpected error '%v'",
expandTest.input, err)
continue
}
if hasError != expandTest.errorExpected && !hasError {
t.Errorf("Expand('%s') did not fail as expected: got result '%+v'",
expandTest.input, result)
continue
}
if !hasError && !equal(result, expandTest.resultExpected) {
t.Errorf("Expand('%s') failed: got result '%+v', expected result '%v'",
expandTest.input, result, expandTest.resultExpected)
continue
}
t.Logf("Checked hostlist.Expand('%s'): result = '%+v', err = '%v'",
expandTest.input, result, err)
}
}

View File

@@ -1,25 +1,44 @@
{
"natsrecv" : {
"natsrecv": {
"type": "nats",
"address": "nats://my-url",
"port" : "4222",
"port": "4222",
"database": "testcluster"
},
"redfish_recv": {
"type": "redfish",
"endpoint": "https://%h-bmc",
"client_config": [
{
"hostname": "my-host-1",
"host_list": "my-host-1-[1-2]",
"username": "username-1",
"password": "password-1",
"endpoint": "https://my-endpoint-1"
"password": "password-1"
},
{
"host_list": "my-host-2-[1,2]",
"username": "username-2",
"password": "password-2"
}
]
},
"ipmi_recv": {
"type": "ipmi",
"endpoint": "ipmi-sensors://%h-ipmi",
"exclude_metrics": [
"fan_speed",
"voltage"
],
"client_config": [
{
"username": "username-1",
"password": "password-1",
"host_list": "my-host-1-[1-2]"
},
{
"hostname": "my-host-2",
"username": "username-2",
"password": "password-2",
"endpoint": "https://my-endpoint-2"
"host_list": "my-host-2-[1,2]"
}
]
}
}
}

534
receivers/ipmiReceiver.go Normal file
View File

@@ -0,0 +1,534 @@
package receivers
import (
"bufio"
"bytes"
"encoding/json"
"fmt"
"io"
"os/exec"
"regexp"
"strconv"
"strings"
"sync"
"time"
cclog "github.com/ClusterCockpit/cc-metric-collector/pkg/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
"github.com/ClusterCockpit/cc-metric-collector/pkg/hostlist"
)
type IPMIReceiverClientConfig struct {
// Hostname the IPMI service belongs to
Protocol string // Protocol / tool to use for IPMI sensor reading
DriverType string // Out of band IPMI driver
Fanout int // Maximum number of simultaneous IPMI connections
NumHosts int // Number of remote IPMI devices with the same configuration
IPMIHosts string // List of remote IPMI devices to communicate with
IPMI2HostMapping map[string]string // Mapping between IPMI device name and host name
Username string // User name to authenticate with
Password string // Password to use for authentication
CLIOptions []string // Additional command line options for ipmi-sensors
isExcluded map[string]bool // is metric excluded
}
type IPMIReceiver struct {
receiver
config struct {
Interval time.Duration
// Client config for each IPMI hosts
ClientConfigs []IPMIReceiverClientConfig
}
// Storage for static information
meta map[string]string
done chan bool // channel to finish / stop IPMI receiver
wg sync.WaitGroup // wait group for IPMI receiver
}
// doReadMetrics reads metrics from all configure IPMI hosts.
func (r *IPMIReceiver) doReadMetric() {
for i := range r.config.ClientConfigs {
clientConfig := &r.config.ClientConfigs[i]
var cmd_options []string
if clientConfig.Protocol == "ipmi-sensors" {
cmd_options = append(cmd_options,
"--always-prefix",
"--sdr-cache-recreate",
// Attempt to interpret OEM data, such as event data, sensor readings, or general extra info
"--interpret-oem-data",
// Ignore not-available (i.e. N/A) sensors in output
"--ignore-not-available-sensors",
// Ignore unrecognized sensor events
"--ignore-unrecognized-events",
// Output fields in comma separated format
"--comma-separated-output",
// Do not output column headers
"--no-header-output",
// Output non-abbreviated units (e.g. 'Amps' instead of 'A').
// May aid in disambiguation of units (e.g. 'C' for Celsius or Coulombs).
"--non-abbreviated-units",
"--fanout", fmt.Sprint(clientConfig.Fanout),
"--driver-type", clientConfig.DriverType,
"--hostname", clientConfig.IPMIHosts,
"--username", clientConfig.Username,
"--password", clientConfig.Password,
)
cmd_options := append(cmd_options, clientConfig.CLIOptions...)
command := exec.Command("ipmi-sensors", cmd_options...)
stdout, _ := command.StdoutPipe()
errBuf := new(bytes.Buffer)
command.Stderr = errBuf
// start command
if err := command.Start(); err != nil {
cclog.ComponentError(
r.name,
fmt.Sprintf("doReadMetric(): Failed to start command \"%s\": %v", command.String(), err),
)
continue
}
// Read command output
const (
idxID = iota
idxName
idxType
idxReading
idxUnits
idxEvent
)
numPrefixRegex := regexp.MustCompile("^[[:digit:]][[:digit:]]-(.*)$")
scanner := bufio.NewScanner(stdout)
for scanner.Scan() {
// Read host
v1 := strings.Split(scanner.Text(), ": ")
if len(v1) != 2 {
continue
}
host, ok := clientConfig.IPMI2HostMapping[v1[0]]
if !ok {
continue
}
// Read sensors
v2 := strings.Split(v1[1], ",")
if len(v2) != 6 {
continue
}
// Skip sensors with non available sensor readings
if v2[idxReading] == "N/A" {
continue
}
metric := strings.ToLower(v2[idxType])
name := strings.ToLower(
strings.Replace(
strings.TrimSpace(
v2[idxName]), " ", "_", -1))
// remove prefix enumeration like 01-...
if v := numPrefixRegex.FindStringSubmatch(name); v != nil {
name = v[1]
}
unit := v2[idxUnits]
if unit == "Watts" {
// Power
metric = "power"
name = strings.TrimSuffix(name, "_power")
name = strings.TrimSuffix(name, "_pwr")
name = strings.TrimPrefix(name, "pwr_")
} else if metric == "voltage" &&
unit == "Volts" {
// Voltage
name = strings.TrimPrefix(name, "volt_")
} else if metric == "current" &&
unit == "Amps" {
// Current
unit = "Ampere"
} else if metric == "temperature" &&
unit == "degrees C" {
// Temperature
name = strings.TrimSuffix(name, "_temp")
unit = "degC"
} else if metric == "temperature" &&
unit == "degrees F" {
// Temperature
name = strings.TrimSuffix(name, "_temp")
unit = "degF"
} else if metric == "fan" && unit == "RPM" {
// Fan speed
metric = "fan_speed"
name = strings.TrimSuffix(name, "_tach")
name = strings.TrimPrefix(name, "spd_")
} else if (metric == "cooling device" ||
metric == "other units based sensor") &&
name == "system_air_flow" &&
unit == "CFM" {
// Air flow
metric = "air_flow"
name = strings.TrimSuffix(name, "_air_flow")
unit = "CubicFeetPerMinute"
} else if (metric == "processor" ||
metric == "other units based sensor") &&
(name == "cpu_utilization" ||
name == "io_utilization" ||
name == "mem_utilization" ||
name == "sys_utilization") &&
(unit == "unspecified" ||
unit == "%") {
// Utilization
metric = "utilization"
name = strings.TrimSuffix(name, "_utilization")
unit = "percent"
} else {
if false {
// Debug output for unprocessed metrics
fmt.Printf(
"host: '%s', metric: '%s', name: '%s', unit: '%s'\n",
host, metric, name, unit)
}
continue
}
// Skip excluded metrics
if clientConfig.isExcluded[metric] {
continue
}
// Parse sensor value
value, err := strconv.ParseFloat(v2[idxReading], 64)
if err != nil {
continue
}
y, err := lp.New(
metric,
map[string]string{
"hostname": host,
"type": "node",
"name": name,
},
map[string]string{
"source": r.name,
"group": "IPMI",
"unit": unit,
},
map[string]interface{}{
"value": value,
},
time.Now())
if err == nil {
r.sink <- y
}
}
// Wait for command end
if err := command.Wait(); err != nil {
errMsg, _ := io.ReadAll(errBuf)
cclog.ComponentError(
r.name,
fmt.Sprintf("doReadMetric(): Failed to wait for the end of command \"%s\": %v\n",
strings.Replace(command.String(), clientConfig.Password, "<PW>", -1), err),
fmt.Sprintf("doReadMetric(): command stderr: \"%s\"\n", string(errMsg)),
)
}
}
}
}
func (r *IPMIReceiver) Start() {
cclog.ComponentDebug(r.name, "START")
// Start IPMI receiver
r.wg.Add(1)
go func() {
defer r.wg.Done()
// Create ticker
ticker := time.NewTicker(r.config.Interval)
defer ticker.Stop()
for {
r.doReadMetric()
select {
case tickerTime := <-ticker.C:
// Check if we missed the ticker event
if since := time.Since(tickerTime); since > 5*time.Second {
cclog.ComponentInfo(r.name, "Missed ticker event for more then", since)
}
// process ticker event -> continue
continue
case <-r.done:
// process done event
return
}
}
}()
cclog.ComponentDebug(r.name, "STARTED")
}
// Close receiver: close network connection, close files, close libraries, ...
func (r *IPMIReceiver) Close() {
cclog.ComponentDebug(r.name, "CLOSE")
// Send the signal and wait
close(r.done)
r.wg.Wait()
cclog.ComponentDebug(r.name, "DONE")
}
// NewIPMIReceiver creates a new instance of the redfish receiver
// Initialize the receiver by giving it a name and reading in the config JSON
func NewIPMIReceiver(name string, config json.RawMessage) (Receiver, error) {
r := new(IPMIReceiver)
// Config options from config file
configJSON := struct {
Type string `json:"type"`
// How often the IPMI sensor metrics should be read and send to the sink (default: 30 s)
IntervalString string `json:"interval,omitempty"`
// Maximum number of simultaneous IPMI connections (default: 64)
Fanout int `json:"fanout,omitempty"`
// Out of band IPMI driver (default: LAN_2_0)
DriverType string `json:"driver_type,omitempty"`
// Default client username, password and endpoint
Username *string `json:"username"` // User name to authenticate with
Password *string `json:"password"` // Password to use for authentication
Endpoint *string `json:"endpoint"` // URL of the IPMI device
// Globally excluded metrics
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
ClientConfigs []struct {
Fanout int `json:"fanout,omitempty"` // Maximum number of simultaneous IPMI connections (default: 64)
DriverType string `json:"driver_type,omitempty"` // Out of band IPMI driver (default: LAN_2_0)
HostList string `json:"host_list"` // List of hosts with the same client configuration
Username *string `json:"username"` // User name to authenticate with
Password *string `json:"password"` // Password to use for authentication
Endpoint *string `json:"endpoint"` // URL of the IPMI service
// Per client excluded metrics
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
// Additional command line options for ipmi-sensors
CLIOptions []string `json:"cli_options,omitempty"`
} `json:"client_config"`
}{
// Set defaults values
// Allow overwriting these defaults by reading config JSON
Fanout: 64,
DriverType: "LAN_2_0",
IntervalString: "30s",
}
// Set name of IPMIReceiver
r.name = fmt.Sprintf("IPMIReceiver(%s)", name)
// Create done channel
r.done = make(chan bool)
// Set static information
r.meta = map[string]string{"source": r.name}
// Read the IPMI receiver specific JSON config
if len(config) > 0 {
d := json.NewDecoder(bytes.NewReader(config))
d.DisallowUnknownFields()
if err := d.Decode(&configJSON); err != nil {
cclog.ComponentError(r.name, "Error reading config:", err.Error())
return nil, err
}
}
// Convert interval string representation to duration
var err error
r.config.Interval, err = time.ParseDuration(configJSON.IntervalString)
if err != nil {
err := fmt.Errorf(
"Failed to parse duration string interval='%s': %w",
configJSON.IntervalString,
err,
)
cclog.Error(r.name, err)
return nil, err
}
// Create client config from JSON config
totalNumHosts := 0
for i := range configJSON.ClientConfigs {
clientConfigJSON := &configJSON.ClientConfigs[i]
var endpoint string
if clientConfigJSON.Endpoint != nil {
endpoint = *clientConfigJSON.Endpoint
} else if configJSON.Endpoint != nil {
endpoint = *configJSON.Endpoint
} else {
err := fmt.Errorf("client config number %v requires endpoint", i)
cclog.ComponentError(r.name, err)
return nil, err
}
fanout := configJSON.Fanout
if clientConfigJSON.Fanout != 0 {
fanout = clientConfigJSON.Fanout
}
driverType := configJSON.DriverType
if clientConfigJSON.DriverType != "" {
driverType = clientConfigJSON.DriverType
}
if driverType != "LAN" && driverType != "LAN_2_0" {
err := fmt.Errorf("client config number %v has invalid driver type %s", i, driverType)
cclog.ComponentError(r.name, err)
return nil, err
}
var protocol string
var host_pattern string
if e := strings.Split(endpoint, "://"); len(e) == 2 {
protocol = e[0]
host_pattern = e[1]
} else {
err := fmt.Errorf("client config number %v has invalid endpoint %s", i, endpoint)
cclog.ComponentError(r.name, err)
return nil, err
}
var username string
if clientConfigJSON.Username != nil {
username = *clientConfigJSON.Username
} else if configJSON.Username != nil {
username = *configJSON.Username
} else {
err := fmt.Errorf("client config number %v requires username", i)
cclog.ComponentError(r.name, err)
return nil, err
}
var password string
if clientConfigJSON.Password != nil {
password = *clientConfigJSON.Password
} else if configJSON.Password != nil {
password = *configJSON.Password
} else {
err := fmt.Errorf("client config number %v requires password", i)
cclog.ComponentError(r.name, err)
return nil, err
}
// Create mapping between IPMI host name and node host name
// This also guaranties that all IPMI host names are unique
ipmi2HostMapping := make(map[string]string)
hostList, err := hostlist.Expand(clientConfigJSON.HostList)
if err != nil {
err := fmt.Errorf("client config number %d failed to parse host list %s: %v",
i, clientConfigJSON.HostList, err)
cclog.ComponentError(r.name, err)
return nil, err
}
for _, host := range hostList {
ipmiHost := strings.Replace(host_pattern, "%h", host, -1)
ipmi2HostMapping[ipmiHost] = host
}
numHosts := len(ipmi2HostMapping)
totalNumHosts += numHosts
ipmiHostList := make([]string, 0, numHosts)
for ipmiHost := range ipmi2HostMapping {
ipmiHostList = append(ipmiHostList, ipmiHost)
}
// Additional command line options
for _, v := range clientConfigJSON.CLIOptions {
switch {
case v == "-u" || strings.HasPrefix(v, "--username"):
err := fmt.Errorf("client config number %v: do not set username in cli_options. Use json config username instead", i)
cclog.ComponentError(r.name, err)
return nil, err
case v == "-p" || strings.HasPrefix(v, "--password"):
err := fmt.Errorf("client config number %v: do not set password in cli_options. Use json config password instead", i)
cclog.ComponentError(r.name, err)
return nil, err
case v == "-h" || strings.HasPrefix(v, "--hostname"):
err := fmt.Errorf("client config number %v: do not set hostname in cli_options. Use json config host_list instead", i)
cclog.ComponentError(r.name, err)
return nil, err
case v == "-D" || strings.HasPrefix(v, "--driver-type"):
err := fmt.Errorf("client config number %v: do not set driver type in cli_options. Use json config driver_type instead", i)
cclog.ComponentError(r.name, err)
return nil, err
case v == "-F" || strings.HasPrefix(v, " --fanout"):
err := fmt.Errorf("client config number %v: do not set fanout in cli_options. Use json config fanout instead", i)
cclog.ComponentError(r.name, err)
return nil, err
case v == "--always-prefix" ||
v == "--sdr-cache-recreate" ||
v == "--interpret-oem-data" ||
v == "--ignore-not-available-sensors" ||
v == "--ignore-unrecognized-events" ||
v == "--comma-separated-output" ||
v == "--no-header-output" ||
v == "--non-abbreviated-units":
err := fmt.Errorf("client config number %v: Do not use option %s in cli_options, it is used internally", i, v)
cclog.ComponentError(r.name, err)
return nil, err
}
}
cliOptions := make([]string, 0)
cliOptions = append(cliOptions, clientConfigJSON.CLIOptions...)
// Is metrics excluded globally or per client
isExcluded := make(map[string]bool)
for _, key := range clientConfigJSON.ExcludeMetrics {
isExcluded[key] = true
}
for _, key := range configJSON.ExcludeMetrics {
isExcluded[key] = true
}
r.config.ClientConfigs = append(
r.config.ClientConfigs,
IPMIReceiverClientConfig{
Protocol: protocol,
Fanout: fanout,
DriverType: driverType,
NumHosts: numHosts,
IPMIHosts: strings.Join(ipmiHostList, ","),
IPMI2HostMapping: ipmi2HostMapping,
Username: username,
Password: password,
CLIOptions: cliOptions,
isExcluded: isExcluded,
})
}
if totalNumHosts == 0 {
err := fmt.Errorf("at least one IPMI host config is required")
cclog.ComponentError(r.name, err)
return nil, err
}
cclog.ComponentInfo(r.name, "monitoring", totalNumHosts, "IPMI hosts")
return r, nil
}

48
receivers/ipmiReceiver.md Normal file
View File

@@ -0,0 +1,48 @@
## IPMI Receiver
The IPMI Receiver uses `ipmi-sensors` from the [FreeIPMI](https://www.gnu.org/software/freeipmi/) project to read IPMI sensor readings and sensor data repository (SDR) information. The available metrics depend on the sensors provided by the hardware vendor but typically contain temperature, fan speed, voltage and power metrics.
### Configuration structure
```json
{
"<IPMI receiver name>": {
"type": "ipmi",
"interval": "30s",
"fanout": 256,
"username": "<Username>",
"password": "<Password>",
"endpoint": "ipmi-sensors://%h-bmc",
"exclude_metrics": [ "fan_speed", "voltage" ],
"client_config": [
{
"host_list": "n[1,2-4]"
},
{
"host_list": "n[5-6]",
"driver_type": "LAN",
"cli_options": [ "--workaround-flags=..." ],
"password": "<Password 2>"
}
]
}
}
```
Global settings:
- `interval`: How often the IPMI sensor metrics should be read and send to the sink (default: 30 s)
Global and per IPMI device settings (per IPMI device settings overwrite the global settings):
- `exclude_metrics`: list of excluded metrics e.g. fan_speed, power, temperature, utilization, voltage
- `fanout`: Maximum number of simultaneous IPMI connections (default: 64)
- `driver_type`: Out of band IPMI driver (default: LAN_2_0)
- `username`: User name to authenticate with
- `password`: Password to use for authentication
- `endpoint`: URL of the IPMI device (placeholder `%h` gets replaced by the hostname)
Per IPMI device settings:
- `host_list`: List of hosts with the same client configuration
- `cli_options`: Additional command line options for ipmi-sensors

View File

@@ -2,6 +2,7 @@ package receivers
import (
"encoding/json"
"fmt"
"os"
"sync"
@@ -10,6 +11,7 @@ import (
)
var AvailableReceivers = map[string]func(name string, config json.RawMessage) (Receiver, error){
"ipmi": NewIPMIReceiver,
"nats": NewNatsReceiver,
"redfish": NewRedfishReceiver,
}
@@ -71,9 +73,13 @@ func (rm *receiveManager) AddInput(name string, rawConfig json.RawMessage) error
cclog.ComponentError("ReceiveManager", "SKIP", config.Type, "JSON config error:", err.Error())
return err
}
if config.Type == "" {
cclog.ComponentError("ReceiveManager", "SKIP", "JSON config for receiver", name, "does not contain a receiver type")
return fmt.Errorf("JSON config for receiver %s does not contain a receiver type", name)
}
if _, found := AvailableReceivers[config.Type]; !found {
cclog.ComponentError("ReceiveManager", "SKIP", config.Type, "unknown receiver:", err.Error())
return err
cclog.ComponentError("ReceiveManager", "SKIP", "unknown receiver type:", config.Type)
return fmt.Errorf("unknown receiver type: %s", config.Type)
}
r, err := AvailableReceivers[config.Type](name, rawConfig)
if err != nil {

View File

@@ -1,9 +1,11 @@
package receivers
import (
"bytes"
"crypto/tls"
"encoding/json"
"fmt"
"io"
"net/http"
"strconv"
"strings"
@@ -12,6 +14,7 @@ import (
cclog "github.com/ClusterCockpit/cc-metric-collector/pkg/ccLogger"
lp "github.com/ClusterCockpit/cc-metric-collector/pkg/ccMetric"
"github.com/ClusterCockpit/cc-metric-collector/pkg/hostlist"
// See: https://pkg.go.dev/github.com/stmcginnis/gofish
"github.com/stmcginnis/gofish"
@@ -346,9 +349,17 @@ func (r *RedfishReceiver) readProcessorMetrics(
// This property shall contain the temperature, in Celsius, of the processor.
TemperatureCelsius float32 `json:"TemperatureCelsius"`
}
err = json.NewDecoder(resp.Body).Decode(&processorMetrics)
body, err := io.ReadAll(resp.Body)
if err != nil {
return fmt.Errorf("unable to decode JSON for processor metrics: %+w", err)
return fmt.Errorf("unable to read JSON for processor metrics: %+w", err)
}
err = json.Unmarshal(body, &processorMetrics)
if err != nil {
return fmt.Errorf(
"unable to unmarshal JSON='%s' for processor metrics: %+w",
string(body),
err,
)
}
processorMetrics.SetClient(processor.Client)
@@ -380,7 +391,9 @@ func (r *RedfishReceiver) readProcessorMetrics(
namePower := "consumed_power"
if !clientConfig.isExcluded[namePower] {
if !clientConfig.isExcluded[namePower] &&
// Some servers return "ConsumedPowerWatt":65535 instead of "ConsumedPowerWatt":null
processorMetrics.ConsumedPowerWatt != 65535 {
y, err := lp.New(namePower, tags, metaPower,
map[string]interface{}{
"value": processorMetrics.ConsumedPowerWatt,
@@ -630,10 +643,10 @@ func NewRedfishReceiver(name string, config json.RawMessage) (Receiver, error) {
ExcludeMetrics []string `json:"exclude_metrics,omitempty"`
ClientConfigs []struct {
HostList []string `json:"host_list"` // List of hosts with the same client configuration
Username *string `json:"username"` // User name to authenticate with
Password *string `json:"password"` // Password to use for authentication
Endpoint *string `json:"endpoint"` // URL of the redfish service
HostList string `json:"host_list"` // List of hosts with the same client configuration
Username *string `json:"username"` // User name to authenticate with
Password *string `json:"password"` // Password to use for authentication
Endpoint *string `json:"endpoint"` // URL of the redfish service
// Per client disable collection of power,processor or thermal metrics
DisablePowerMetrics bool `json:"disable_power_metrics"`
@@ -660,8 +673,9 @@ func NewRedfishReceiver(name string, config json.RawMessage) (Receiver, error) {
// Read the redfish receiver specific JSON config
if len(config) > 0 {
err := json.Unmarshal(config, &configJSON)
if err != nil {
d := json.NewDecoder(bytes.NewReader(config))
d.DisallowUnknownFields()
if err := d.Decode(&configJSON); err != nil {
cclog.ComponentError(r.name, "Error reading config:", err.Error())
return nil, err
}
@@ -763,7 +777,14 @@ func NewRedfishReceiver(name string, config json.RawMessage) (Receiver, error) {
isExcluded[key] = true
}
for _, host := range clientConfigJSON.HostList {
hostList, err := hostlist.Expand(clientConfigJSON.HostList)
if err != nil {
err := fmt.Errorf("client config number %d failed to parse host list %s: %v",
i, clientConfigJSON.HostList, err)
cclog.ComponentError(r.name, err)
return nil, err
}
for _, host := range hostList {
// Endpoint of the redfish service
endpoint := strings.Replace(endpoint_pattern, "%h", host, -1)

View File

@@ -8,22 +8,22 @@ The Redfish receiver uses the [Redfish (specification)](https://www.dmtf.org/sta
{
"<redfish receiver name>": {
"type": "redfish",
"username": "<user A>",
"password": "<password A>",
"username": "<Username>",
"password": "<Password>",
"endpoint": "https://%h-bmc",
"exclude_metrics": [ "min_consumed_watts" ],
"client_config": [
{
"host_list": [ "<host 1>", "<host 2>" ]
"host_list": "n[1,2-4]"
},
{
"host_list": [ "<host 3>", "<host 4>" ]
"host_list": "n5"
"disable_power_metrics": true
},
{
"host_list": [ "<host 5>" ],
"username": "<user B>",
"password": "<password B>",
"host_list": "n6" ],
"username": "<Username 2>",
"password": "<Password 2>",
"endpoint": "https://%h-BMC",
"disable_thermal_metrics": true
}

View File

@@ -3,7 +3,6 @@ Description=ClusterCockpit metric collector
Documentation=https://github.com/ClusterCockpit/cc-metric-collector
Wants=network-online.target
After=network-online.target
After=postgresql.service mariadb.service mysql.service
[Service]
EnvironmentFile=/etc/default/cc-metric-collector

View File

@@ -10,6 +10,8 @@ BuildRequires: go-toolset
BuildRequires: systemd-rpm-macros
# for header downloads
BuildRequires: wget
# Recommended when using the sysusers_create_package macro
Requires(pre): /usr/bin/systemd-sysusers
Provides: %{name} = %{version}

View File

@@ -45,7 +45,7 @@ def group_to_json(groupfile):
if "PWR" in calc:
scope = "socket"
m = {"name" : metric, "calc": calc, "scope" : scope, "publish" : True}
m = {"name" : metric, "calc": calc, "type" : scope, "publish" : True}
metrics.append(m)
return {"events" : events, "metrics" : metrics}

View File

@@ -20,7 +20,9 @@ The configuration file for the sinks is a list of configurations. The `type` fie
[
"mystdout" : {
"type" : "stdout",
"meta_as_tags" : false
"meta_as_tags" : [
"unit"
]
},
"metricstore" : {
"type" : "http",
@@ -103,4 +105,4 @@ func NewSampleSink(name string, config json.RawMessage) (Sink, error) {
return s, err
}
```
```

View File

@@ -8,7 +8,9 @@ The `http` sink uses POST requests to a HTTP server to submit the metrics in the
{
"<name>": {
"type": "http",
"meta_as_tags" : true,
"meta_as_tags" : [
"meta-key"
],
"url" : "https://my-monitoring.example.com:1234/api/write",
"jwt" : "blabla.blabla.blabla",
"timeout": "5s",
@@ -20,7 +22,7 @@ The `http` sink uses POST requests to a HTTP server to submit the metrics in the
```
- `type`: makes the sink an `http` sink
- `meta_as_tags`: print all meta information as tags in the output (optional)
- `meta_as_tags`: Move specific meta information to the tags in the output (optional)
- `url`: The full URL of the endpoint
- `jwt`: JSON web tokens for authentification (Using the *Bearer* scheme)
- `timeout`: General timeout for the HTTP client (default '5s')