Commit Graph

490 Commits

Author SHA1 Message Date
oscarminus 88fabc2e83 cpustatMetric.go: Use derived values instead of absolute values (#83)
* cpustatMetric.go: Use derived values instead of absolute values

  The values in /proc/stat are absolute counters that accumulate from
  the boot time of the system. To obtain the CPU utilization, the
  change in the counters over time must be derived. Using only the
  absolute values means that changes in the utilization, especially
  once the counters have grown large, do not become visible.

* Add new collector for /proc/schedstat

  The `schedstat` collector reads data from /proc/schedstat and calculates
  a load value, separated by hwthread. This might be useful to detect bad
  cpu pinning on shared nodes etc.

Co-authored-by: Michael Schwarz <post@michael-schwarz.name>
2022-09-07 14:09:29 +02:00
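
The idea behind the cpustatMetric.go change above (deriving utilization from counter differences rather than reporting the absolute counters) can be sketched as follows. This is a minimal illustration and not the collector's actual code: the function names `parseCPULine` and `utilization` and the sample lines are hypothetical, and the sketch assumes the usual /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPULine splits one "cpuN ..." line of /proc/stat into its name and
// its counters. (Hypothetical helper, not taken from the collector.)
func parseCPULine(line string) (string, []uint64, error) {
	fields := strings.Fields(line)
	if len(fields) < 5 || !strings.HasPrefix(fields[0], "cpu") {
		return "", nil, fmt.Errorf("not a cpu line: %q", line)
	}
	counters := make([]uint64, 0, len(fields)-1)
	for _, f := range fields[1:] {
		v, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return "", nil, err
		}
		counters = append(counters, v)
	}
	return fields[0], counters, nil
}

// utilization returns the busy share in percent between two samples.
// Only the first eight counters are summed, because guest time is already
// included in user and nice. Index 3 is idle, index 4 is iowait.
func utilization(prev, cur []uint64) float64 {
	var total, idle uint64
	for i := 0; i < len(cur) && i < len(prev) && i < 8; i++ {
		if cur[i] < prev[i] {
			continue // counter went backwards, skip this field
		}
		d := cur[i] - prev[i]
		total += d
		if i == 3 || i == 4 {
			idle += d
		}
	}
	if total == 0 {
		return 0
	}
	return 100 * float64(total-idle) / float64(total)
}

func main() {
	// Two made-up samples of the same hardware thread, taken one interval apart.
	name, prev, _ := parseCPULine("cpu0 100 0 50 800 50 0 0 0")
	_, cur, _ := parseCPULine("cpu0 150 0 70 860 60 0 0 0")
	fmt.Printf("%s: %.1f%% busy\n", name, utilization(prev, cur))
}
```

Because both samples are divided by the total number of ticks elapsed between them, the result does not depend on how large the absolute counters have become since boot.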
Thomas Gruber b3c27e0af5 Merge latest development changes (#80)
* Cleanup: Remove unused code

* Use Golang duration parser for 'interval' and 'duration' in main config

* Update handling of LIKWID headers. Download only if not already present in the system. Fixes #73

* Units with cc-units (#64)

* Add option to normalize units with cc-unit

* Add unit conversion to router

* Add option to change unit prefix in the router

* Add to MetricRouter README

* Add order of operations in router to README

* Use second add_tags/del_tags only if metric gets renamed

* Skip disks in DiskstatCollector that have size=0

* Check readability of sensor files in TempCollector

* Fix for --once option

* Rename `cpu` type to `hwthread` (#69)

* Rename 'cpu' type to 'hwthread' to avoid naming clashes with MetricStore and CC-Webfrontend

* Collectors in parallel (#74)

* Provide info to CollectorManager whether the collector can be executed in parallel with others

* Split serial and parallel collectors. Read in parallel first

* Update NvidiaCollector with new metrics, MIG and NvLink support (#75)

* CC topology module update (#76)

* Rename CPU to hardware thread, write some comments

* Do renaming in other parts

* Remove CpuList and SocketList function from metricCollector. Available in ccTopology

* Option to use MIG UUID as subtype-id in NvidiaCollector

* Option to use MIG slice name as subtype-id in NvidiaCollector

* MetricRouter: Fix JSON in README

* Fix for Github Action to really use the selected version

* Remove Ganglia installation in runonce Action and add Go 1.18

* Fix daemon options in init script

* Add separate go.mod files to use it with deprecated 1.16

* Minor updates for Makefiles

* fix string comparison

* AMD ROCm SMI collector (#77)

* Add collector for AMD ROCm SMI metrics

* Fix import path

* Fix imports

* Remove Board Number

* store GPU index explicitly

* Remove board number from description

* Use http instead of ftp to download likwid

* Fix serial number in rocmCollector

* Improved http sink (#78)

* automatic flush in NatsSink

* tweak default options of HttpSink

* shorter crit. section and retries for HttpSink

* fix error handling

* Remove file added by mistake.

* Use http instead of ftp to download likwid

* Fix serial number in rocmCollector

Co-authored-by: Thomas Roehl <thomas.roehl@fau.de>

* Fix: When sending metrics failed the batch size could be exceeded

* Improved dropping of metrics failed to send

* Add memstats and topprocs metric

* Updated to latest modules

* Check that at least one sink is running

* Add drop rate, when send buffer is full

* Allow only one timer at a time

* Use mutex to ensure only one flush timer is running

* Fix for NvidiaCollector when devices are not in MiG mode

* Remove Golang version 1.16 and 1.17 from Action. Latest commits require Golang 1.18

* Use Golang 1.18 in Release action to build RPMs

* Change unit of CpufreqCollector to Hz. That's what the sysfs outputs

* Make wget quiet in Release action to reduce log size

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Lou <lou.knauer@gmx.de>
2022-07-13 10:09:49 +02:00
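
The change "Use Golang duration parser for 'interval' and 'duration' in main config" listed above refers to Go's standard `time.ParseDuration`. A minimal sketch of that approach follows; the struct `mainConfig`, the function `parseMainConfig`, and the JSON layout are assumptions made for illustration, not the project's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// mainConfig is a hypothetical stand-in for the main configuration; only the
// two keys named in the commit message are modeled here.
type mainConfig struct {
	Interval string `json:"interval"`
	Duration string `json:"duration"`
}

// parseMainConfig decodes the JSON and converts both strings with
// time.ParseDuration, which accepts values such as "10s", "1m30s" or "500ms".
func parseMainConfig(raw []byte) (interval, duration time.Duration, err error) {
	var cfg mainConfig
	if err = json.Unmarshal(raw, &cfg); err != nil {
		return 0, 0, err
	}
	if interval, err = time.ParseDuration(cfg.Interval); err != nil {
		return 0, 0, fmt.Errorf("invalid interval: %w", err)
	}
	if duration, err = time.ParseDuration(cfg.Duration); err != nil {
		return 0, 0, fmt.Errorf("invalid duration: %w", err)
	}
	return interval, duration, nil
}

func main() {
	interval, duration, err := parseMainConfig([]byte(`{"interval": "10s", "duration": "1s"}`))
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(interval, duration)
}
```

A bare number without a unit (for example "10") is rejected by `time.ParseDuration`, so a configuration written for an integer-seconds format would need a unit suffix added.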
Thomas Roehl 2adf9484a3 Redo fix for NvidiaCollector and MiG. Got lost somehow v0.6.1 2022-07-12 12:31:24 +02:00
Thomas Roehl 31a38bc17d Update release action v0.6.0 2022-06-09 14:36:25 +02:00
Thomas Roehl dbdec1eab8 Merge branch 'main' of github.com:ClusterCockpit/cc-metric-collector into main 2022-06-09 12:46:47 +02:00
Thomas Gruber 0d31ec481b Update Release.yml 2022-06-09 12:42:11 +02:00
Thomas Roehl e22c3287e9 Merge branch 'main' of github.com:ClusterCockpit/cc-metric-collector into main 2022-06-08 15:26:05 +02:00
Thomas Gruber 8d85bd53f1 Merge latest development changes to main branch (#79)
* Cleanup: Remove unused code

* Use Golang duration parser for 'interval' and 'duration' in main config

* Update handling of LIKWID headers. Download only if not already present in the system. Fixes #73

* Units with cc-units (#64)

* Add option to normalize units with cc-unit

* Add unit conversion to router

* Add option to change unit prefix in the router

* Add to MetricRouter README

* Add order of operations in router to README

* Use second add_tags/del_tags only if metric gets renamed

* Skip disks in DiskstatCollector that have size=0

* Check readability of sensor files in TempCollector

* Fix for --once option

* Rename `cpu` type to `hwthread` (#69)

* Rename 'cpu' type to 'hwthread' to avoid naming clashes with MetricStore and CC-Webfrontend

* Collectors in parallel (#74)

* Provide info to CollectorManager whether the collector can be executed in parallel with others

* Split serial and parallel collectors. Read in parallel first

* Update NvidiaCollector with new metrics, MIG and NvLink support (#75)

* CC topology module update (#76)

* Rename CPU to hardware thread, write some comments

* Do renaming in other parts

* Remove CpuList and SocketList function from metricCollector. Available in ccTopology

* Option to use MIG UUID as subtype-id in NvidiaCollector

* Option to use MIG slice name as subtype-id in NvidiaCollector

* MetricRouter: Fix JSON in README

* Fix for Github Action to really use the selected version

* Remove Ganglia installation in runonce Action and add Go 1.18

* Fix daemon options in init script

* Add separate go.mod files to use it with deprecated 1.16

* Minor updates for Makefiles

* fix string comparison

* AMD ROCm SMI collector (#77)

* Add collector for AMD ROCm SMI metrics

* Fix import path

* Fix imports

* Remove Board Number

* store GPU index explicitly

* Remove board number from description

* Use http instead of ftp to download likwid

* Fix serial number in rocmCollector

* Improved http sink (#78)

* automatic flush in NatsSink

* tweak default options of HttpSink

* shorter crit. section and retries for HttpSink

* fix error handling

* Remove file added by mistake.

* Use http instead of ftp to download likwid

* Fix serial number in rocmCollector

Co-authored-by: Thomas Roehl <thomas.roehl@fau.de>

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Lou <lou.knauer@gmx.de>
2022-06-08 15:25:40 +02:00
Holger Obermaier 186a62a86b Fix: influx sink ignores config batch_size.
Feature: Add Redfish receiver
2022-05-04 13:00:56 +02:00
Holger Obermaier e098c33179 Add some golang debug options 2022-05-04 12:48:46 +02:00
Thomas Roehl 38d4e0a730 Merge branch 'develop' of github.com:ClusterCockpit/cc-metric-collector into develop 2022-05-04 11:54:55 +02:00
Thomas Roehl 54d14519ca Skip mount points in DiskstatCollector if statfs() call does not work (bind mounts, ...) 2022-05-04 11:54:34 +02:00
Holger Obermaier c35ac9dba8 Flush if batch size is reached 2022-05-04 11:28:06 +02:00
Holger Obermaier c019f8e7ad Reuse tags and meta data tags 2022-05-03 17:55:33 +02:00
Holger Obermaier fb6f6a4daa Fix GPFS collector last state handling 2022-05-02 16:57:19 +02:00
Holger Obermaier 9d6d0dbd93 Delete empty tags and meta data tags 2022-04-20 14:39:26 +02:00
Holger Obermaier c2d4272fdf Clear workerInput channel after done event 2022-04-20 12:36:45 +02:00
Holger Obermaier 8c73095548 Allow to shutdown redfish receiver during metric read 2022-04-20 09:58:02 +02:00
Holger Obermaier 31c5c89a5a Fix: Close done channel 2022-04-19 14:01:23 +02:00
Holger Obermaier bf9c7e1830 Update requirements 2022-04-19 12:15:51 +02:00
Holger Obermaier 48d34bf564 Adopt sinks.json for new meta_as_tags usage 2022-04-19 12:06:53 +02:00
Holger Obermaier a1d85fa886 Add redfish receiver 2022-04-19 12:05:03 +02:00
Holger Obermaier 96ee16398e Removed unused done channel and wg wait group 2022-04-19 11:53:11 +02:00
Holger Obermaier e7b8088c41 Extended go routine use case in sample receiver 2022-04-19 11:42:46 +02:00
Thomas Roehl 017cd58247 Updating page for LikwidCollector 2022-04-05 10:57:09 +02:00
Thomas Roehl 36dd440864 Merge branch 'develop' into main v0.5.1 2022-04-04 15:16:33 +02:00
Thomas Roehl 7b098e0b1b Fix for missing metrics in LikwidCollector if hwthread is inactive 2022-04-04 15:16:11 +02:00
Thomas Roehl 229a57b16a Merge branch 'develop' into main 2022-04-04 11:49:33 +02:00
Thomas Roehl 70a9530aba Set WriteFailedCallback to get some error message 2022-04-04 11:48:54 +02:00
Thomas Roehl 2f0b6057ca Merge branch 'develop' into main
- MetricRouter: Fix interval_timestamp option
- InfluxSink & InfluxAsyncSink: Add own flush mechanism
- InfluxSink & InfluxAsyncSink: Use default options, overwrite if configured otherwise
- InfinibandCollector: Add units to metrics
- LikwidCollector: Fix for dying accessdaemon
2022-04-04 02:59:42 +02:00
Thomas Roehl 69f7c19659 InfluxAsyncSink: Add custom flush mechanism 2022-04-04 02:56:23 +02:00
Thomas Roehl ecdb4c1bcf Add debug message when updating interval_timestep 2022-04-04 02:55:44 +02:00
Thomas Roehl 4d5b1adbc8 Fix for interval_timestamp option 2022-04-04 02:26:04 +02:00
Thomas Roehl 28348bd108 InfluxSink: Use batch&flush logic from HttpSink 2022-04-01 18:37:45 +02:00
Thomas Roehl a3b9d8a90b HttpSink: Use sink name in error outputs 2022-04-01 18:36:54 +02:00
Thomas Roehl 7e43e9171e Use default options. Overwrite if anything is configured differently. Use seconds as precision 2022-04-01 17:26:56 +02:00
Thomas Roehl 5d25a7bf12 Add units to InfiniBandCollector 2022-04-01 17:14:26 +02:00
Thomas Roehl 83b4343310 Likwid receives signal at first Read, check when re-initializing 2022-04-01 17:10:31 +02:00
Thomas Roehl f1d3cabdc6 Merge branch 'develop' of github.com:ClusterCockpit/cc-metric-collector into develop 2022-04-01 12:45:25 +02:00
Thomas Röhl 4763733d8d Merge branch 'develop' into main v0.5 2022-03-31 11:57:19 +02:00
Thomas Gruber 2a014b6fba Read unit of values from /proc/meminfo (#68) 2022-03-31 11:56:31 +02:00
Thomas Röhl 16e898ecca Merge branch 'develop' into main 2022-03-31 11:47:02 +02:00
Thomas Roehl 50479f9325 Move all LIKWID related stuff to late initialization routine 2022-03-24 18:12:23 +01:00
Thomas Roehl e0e91844bc Use late initialization of LIKWID and catch access daemon death. Fixes #70 and fixes #71. 2022-03-24 17:56:51 +01:00
Thomas Roehl 296225f3a8 Always export all metrics in NfsCollectors 2022-03-24 13:50:35 +01:00
Thomas Roehl 43bcce6fb5 Merge branch 'develop' of github.com:ClusterCockpit/cc-metric-collector into develop 2022-03-22 18:05:38 +01:00
Thomas Roehl 622e94ae0e Fix DieList() if system does not support dies. Explicitly set entries in CpuData list 2022-03-22 15:58:10 +01:00
Thomas Roehl c506114480 Add processing order to MetricRouter README and add missing options 2022-03-18 12:29:00 +01:00
Thomas Roehl 657543dded Ensure max_forward is at least 1 2022-03-18 12:28:52 +01:00
Thomas Roehl 4851382ad7 Merge branch 'develop' into main v0.4 2022-03-16 19:08:13 +01:00