* cpustatMetric.go: Use derived values instead of absolute values
The values in /proc/stat are absolute counters accumulated since the
system was booted. To obtain the CPU utilization, the change in the
counters must be taken over time. Using only the absolute values hides
changes in utilization, especially once the counters have grown large.
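As a rough illustration of the point above, the following sketch compares the ratio of the absolute counters with the ratio of their deltas. The `cpuSample` type, the `utilization` function, and the sample numbers are hypothetical and are not taken from cpustatMetric.go.

```go
package main

import "fmt"

// cpuSample is a hypothetical snapshot of one /proc/stat CPU line,
// reduced to the busy jiffies and the total jiffies since boot.
type cpuSample struct {
	busy  uint64
	total uint64
}

// utilization returns the busy share in percent between two snapshots.
// It uses the counter deltas, not the absolute values.
func utilization(prev, cur cpuSample) float64 {
	dTotal := cur.total - prev.total
	if dTotal == 0 {
		return 0 // no time has passed between the snapshots
	}
	return 100 * float64(cur.busy-prev.busy) / float64(dTotal)
}

func main() {
	// Made-up values: a long-running, mostly idle system that is
	// suddenly busy during the last measurement interval.
	prev := cpuSample{busy: 1_000_000, total: 4_000_000}
	cur := cpuSample{busy: 1_000_900, total: 4_001_000}

	fmt.Printf("from absolute values: %.2f%%\n", 100*float64(cur.busy)/float64(cur.total))
	fmt.Printf("from deltas:          %.2f%%\n", utilization(prev, cur))
}
```

With large counters the first number barely moves between intervals, while the second reflects only what happened in the last interval.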
* Add new collector for /proc/schedstat
The `schedstat` collector reads data from /proc/schedstat and calculates
a load value per hwthread. This can help to detect bad CPU pinning,
for example on shared nodes.
Co-authored-by: Michael Schwarz <post@michael-schwarz.name>
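A minimal sketch of how such a per-hwthread load could be derived, assuming that the seventh value after the `cpuN` label is the cumulative run time in nanoseconds (as described for schedstat version 15). The function names `parseRunning` and `load` and the sample lines are hypothetical and not the collector's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRunning extracts the hwthread id and the cumulative run time
// from one "cpuN ..." line of /proc/schedstat.
// Assumption: the seventh value after the label is the run time in ns.
func parseRunning(line string) (string, int64, error) {
	f := strings.Fields(line)
	if len(f) < 8 || !strings.HasPrefix(f[0], "cpu") {
		return "", 0, fmt.Errorf("not a cpu line: %q", line)
	}
	running, err := strconv.ParseInt(f[7], 10, 64)
	if err != nil {
		return "", 0, err
	}
	return strings.TrimPrefix(f[0], "cpu"), running, nil
}

// load is the fraction of the elapsed wall time that tasks spent
// running on the hwthread between two readings.
func load(prevNs, curNs int64, elapsed time.Duration) float64 {
	if elapsed <= 0 {
		return 0
	}
	return float64(curNs-prevNs) / float64(elapsed.Nanoseconds())
}

func main() {
	// Two made-up readings of the same hwthread, taken one second apart.
	id, prev, _ := parseRunning("cpu3 0 0 0 0 0 0 5000000000 100 42")
	_, cur, _ := parseRunning("cpu3 0 0 0 0 0 0 5500000000 120 50")
	fmt.Printf("hwthread %s load: %.2f\n", id, load(prev, cur, time.Second))
}
```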
* Cleanup: Remove unused code
* Use Golang duration parser for 'interval' and 'duration'
in main config
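For illustration, the standard library's `time.ParseDuration` accepts strings such as `10s` or `1m30s`. The sketch below assumes the two keys are stored as strings in a JSON config; the `mainConfig` struct and `parseTimes` helper are hypothetical names, not the project's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// mainConfig is a hypothetical excerpt of the main config with the
// two keys named in the commit message.
type mainConfig struct {
	Interval string `json:"interval"`
	Duration string `json:"duration"`
}

// parseTimes decodes the config and converts both values with
// time.ParseDuration, so they must carry a unit suffix.
func parseTimes(raw []byte) (interval, duration time.Duration, err error) {
	var cfg mainConfig
	if err = json.Unmarshal(raw, &cfg); err != nil {
		return 0, 0, err
	}
	if interval, err = time.ParseDuration(cfg.Interval); err != nil {
		return 0, 0, fmt.Errorf("interval: %w", err)
	}
	if duration, err = time.ParseDuration(cfg.Duration); err != nil {
		return 0, 0, fmt.Errorf("duration: %w", err)
	}
	return interval, duration, nil
}

func main() {
	interval, duration, err := parseTimes([]byte(`{"interval": "10s", "duration": "1s"}`))
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(interval, duration)
}
```

A value without a unit, such as `"10"`, is rejected by `time.ParseDuration`, so existing configs with bare numbers would need a suffix under this scheme.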
* Update handling of LIKWID headers. Download only if not already present in the system. Fixes #73
* Units with cc-units (#64)
* Add option to normalize units with cc-unit
* Add unit conversion to router
* Add option to change unit prefix in the router
* Add to MetricRouter README
* Add order of operations in router to README
* Use second add_tags/del_tags only if metric gets renamed
* Skip disks in DiskstatCollector that have size=0
* Check readability of sensor files in TempCollector
* Fix for --once option
* Rename `cpu` type to `hwthread` (#69)
* Rename 'cpu' type to 'hwthread' to avoid naming clashes with MetricStore and CC-Webfrontend
* Collectors in parallel (#74)
* Provide info to CollectorManager whether the collector can be executed in parallel with others
* Split serial and parallel collectors. Read in parallel first
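The split described above could look roughly like the following sketch: collectors that report they are safe to run concurrently are read in goroutines first, then the remaining ones are read one after another. The `collector` interface, the `fake` type, and `readAll` are hypothetical and only stand in for the real CollectorManager.

```go
package main

import (
	"fmt"
	"sync"
)

// collector is a hypothetical, reduced collector interface.
type collector interface {
	Name() string
	Parallel() bool // true if it may run concurrently with others
	Read() int
}

// fake is a stand-in collector returning a fixed value.
type fake struct {
	name     string
	parallel bool
	value    int
}

func (f fake) Name() string   { return f.name }
func (f fake) Parallel() bool { return f.parallel }
func (f fake) Read() int      { return f.value }

// readAll reads all parallel-capable collectors concurrently first,
// waits for them, and then reads the serial collectors in order.
func readAll(cs []collector) map[string]int {
	out := make(map[string]int, len(cs))
	var mu sync.Mutex
	var wg sync.WaitGroup

	store := func(c collector) {
		v := c.Read()
		mu.Lock()
		out[c.Name()] = v
		mu.Unlock()
	}

	for _, c := range cs {
		if c.Parallel() {
			wg.Add(1)
			go func(c collector) {
				defer wg.Done()
				store(c)
			}(c)
		}
	}
	wg.Wait()

	for _, c := range cs {
		if !c.Parallel() {
			store(c)
		}
	}
	return out
}

func main() {
	cs := []collector{
		fake{"loadavg", true, 1},
		fake{"memstat", true, 2},
		fake{"likwid", false, 3},
	}
	fmt.Println(readAll(cs))
}
```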
* Update NvidiaCollector with new metrics, MIG and NvLink support (#75)
* CC topology module update (#76)
* Rename CPU to hardware thread, write some comments
* Do renaming in other parts
* Remove CpuList and SocketList function from metricCollector. Available in ccTopology
* Option to use MIG UUID as subtype-id in NvidiaCollector
* Option to use MIG slice name as subtype-id in NvidiaCollector
* MetricRouter: Fix JSON in README
* Fix for GitHub Action to really use the selected version
* Remove Ganglia installation in runonce Action and add Go 1.18
* Fix daemon options in init script
* Add separate go.mod files for use with the deprecated Go 1.16
* Minor updates for Makefiles
* fix string comparison
* AMD ROCm SMI collector (#77)
* Add collector for AMD ROCm SMI metrics
* Fix import path
* Fix imports
* Remove Board Number
* store GPU index explicitly
* Remove board number from description
* Use http instead of ftp to download likwid
* Fix serial number in rocmCollector
* Improved http sink (#78)
* automatic flush in NatsSink
* tweak default options of HttpSink
* shorter crit. section and retries for HttpSink
* fix error handling
* Remove file added by mistake.
Co-authored-by: Thomas Roehl <thomas.roehl@fau.de>
* Fix: When sending metrics failed the batch size could be exceeded
* Improved dropping of metrics that failed to send
* Add memstats and topprocs metric
* Updated to latest modules
* Check that at least one sink is running
* Add drop rate, when send buffer is full
* Allow only one timer at a time
* Use mutex to ensure only one flush timer is running
* Fix for NvidiaCollector when devices are not in MiG mode
* Remove Golang versions 1.16 and 1.17 from Action. Latest commits require Golang 1.18
* Use Golang 1.18 in Release action to build RPMs
* Change unit of CpufreqCollector to Hz. That's what the sysfs outputs
* Make wget quiet in Release action to reduce log size
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Lou <lou.knauer@gmx.de>
* Add time-based derivatives (e.g. bandwidth) to some collectors
* Add documentation
* Add comments
* Fix: Only compute rates with a valid previous state
* Only compute rates with a valid previous state
* Define const values for net/dev fields
* Set default config values
* Add comments
* Refactor: Consolidate data structures
* Refactor: Avoid struct deep copy
* Refactor: Avoid redundant tag maps
* Refactor: Use int64 type for absolute values
* Update LustreCollector
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
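The "valid previous state" rule for time-based derivatives can be sketched as follows. This is an illustration only: the `counter` type, the `-1` sentinel for "no previous reading", and the extra check for a counter that went backwards are assumptions, not the collectors' actual code.

```go
package main

import (
	"fmt"
	"time"
)

// counter keeps the last absolute reading of one metric.
// Assumption: last < 0 means that no previous reading exists yet.
type counter struct {
	last int64
}

// rate stores the current reading and returns the change per second.
// The second return value is false when no rate can be computed:
// on the first reading, after a counter reset, or without elapsed time.
func (c *counter) rate(cur int64, elapsed time.Duration) (float64, bool) {
	prev := c.last
	c.last = cur
	if prev < 0 || cur < prev || elapsed <= 0 {
		return 0, false
	}
	return float64(cur-prev) / elapsed.Seconds(), true
}

func main() {
	rx := counter{last: -1}

	// The first reading only initializes the state.
	if _, ok := rx.rate(1000, 2*time.Second); !ok {
		fmt.Println("first reading: no rate yet")
	}
	// The second reading yields a bandwidth-like value in bytes/s.
	if r, ok := rx.rate(3000, 2*time.Second); ok {
		fmt.Printf("rate: %.0f bytes/s\n", r)
	}
}
```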
* added beegfs collectors to collectors/README.md
* added beegfs collectors and docs
* added new beegfs collectors to AvailableCollectors list
* Feedback implemented
* changed error type
* changed error to only return
* changed beegfs lookup path
* fixed typo in md files
Co-authored-by: Mehmet Soysal <mehmet.soysal@kit.edu>
* Check whether LIKWID library is present
* Generalize nan_to_zero option to invalid_to_zero including +Inf, -Inf and NaN
* Remove double error printing and return if measurements do not work
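A minimal sketch of what an `invalid_to_zero` style filter might do, using only the standard `math` package. The function name `invalidToZero` is hypothetical and the sketch is not the collector's actual implementation.

```go
package main

import (
	"fmt"
	"math"
)

// invalidToZero replaces NaN, +Inf and -Inf with 0 and passes all
// other values through unchanged.
func invalidToZero(v float64) float64 {
	if math.IsNaN(v) || math.IsInf(v, 0) {
		return 0
	}
	return v
}

func main() {
	values := []float64{1.5, math.NaN(), math.Inf(1), math.Inf(-1)}
	for _, v := range values {
		fmt.Println(v, "->", invalidToZero(v))
	}
}
```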