A node agent for measuring, processing and forwarding node level metrics
Thomas Gruber 84e019c693
Merge develop and main (#99)
* InfiniBandCollector: Scale raw readings from octets to bytes

* Fix clock frequency coming from LikwidCollector and update docs

* Build DEB package for Ubuntu 20.04 for releases

* Fix memstat collector with numa_stats option

* Remove useless prints from MemstatCollector

* Replace ioutils with os and io (#87)

* Use lower case for error strings in RocmSmiCollector

* move maybe-usable-by-other-cc-components to pkg. Fix all files to use the new paths (#88)

* Add collector for monitoring the execution of cc-metric-collector itself (#81)

* Add collector to monitor execution of cc-metric-collector itself

* Register SelfCollector

* Fix import paths for moved packages

* Check if at least one CPU with frequency information was detected

* Correct typo: /proc/stats -> /proc/stat

* Update README.md

* Run ipmitool asynchronously. Improved error handling.

* Corrected some typos

* Add running average power limit (RAPL) metric collector

* Do not mess with the original configuration

* Corrected JSON config in numastatsMetric.md
* Added some debug output to numastatsMetric.go

* Fixed computing number of physical packages for non-continuous physical package IDs (e.g. on Ampere Altra Q80-30)

* Fix kernel panic for receiver config with missing receiver type

* Add receiver to gather remote IPMI sensor metrics

* Added config option to add ipmi-sensors command line options

* Add documentation for IPMI receiver

* Update to latest version of included go modules

* Add go.mod to App dependency

* Try to use common metric tags across hardware vendors

* Add IPMI metric: current

* remove prefix enumeration like 01-...

* Add IPMI receiver example configuration to receivers.json

* Minimal formatting changes

* Add hostlist package

* Added tests for hostlist Expand()

* Use package hostlist to expand a host list

* Some servers return "ConsumedPowerWatt":65535 instead of "ConsumedPowerWatt":null

* Updated to latest package versions

* Do not allow unknown fields in JSON configuration file

* Add workflow to customize packages to docs

* NFS I/O Stats Collector (#91)

* Initial version

* Delete values for vanished mount points and  comments

* Fix for Likwid collector (#95)

* Run LIKWID in separate thread and check metric type

* Change LIKWID collector documentation to use 'type' instead of 'scope'

* Re-initialize LIKWID after one read is missing due to lock toggle

* Register cc-metric-collector at Zenodo (#93)

* Add initial version of Zenodo project file

* Orcid ID added

* Update .zenodo.json

Co-authored-by: Holger Obermaier <holger.obermaier@kit.edu>

* Update ipmiMetric.go

* Use latest LIKWID version for builds

* Update README.md

* Remove development stuff from Makefile

* Add Requires(pre) to RPM SPEC file

* Use curly brackets in packaging make targets

* Fix for LIKWID collector with separate measurement thread and inotify watcher on the LIKWID lock (#97)

Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
2022-12-20 13:08:04 +01:00
.github Add latest development to main branch (#89) 2022-10-10 12:23:51 +02:00
collectors Push LIKWID collector fix into main (#98) 2022-12-20 13:04:24 +01:00
docs Merge develop branch into main (#96) 2022-12-14 17:02:39 +01:00
internal Add latest development to main branch (#89) 2022-10-10 12:23:51 +02:00
pkg Merge develop branch into main (#96) 2022-12-14 17:02:39 +01:00
receivers Merge develop branch into main (#96) 2022-12-14 17:02:39 +01:00
scripts Push LIKWID collector fix into main (#98) 2022-12-20 13:04:24 +01:00
sinks Update README.md 2022-11-04 14:53:08 +01:00
.gitignore Initial commit 2021-02-16 16:24:11 +01:00
.gitmodules Ganglia sink using libganglia.so directly (#35) 2022-02-16 18:33:46 +01:00
.zenodo.json Merge develop branch into main (#96) 2022-12-14 17:02:39 +01:00
cc-metric-collector.go Add latest development to main branch (#89) 2022-10-10 12:23:51 +02:00
collectors.json Merge latest development changes (#80) 2022-07-13 10:09:49 +02:00
config.json Merge latest development changes to main branch (#79) 2022-06-08 15:25:40 +02:00
go.mod Merge develop branch into main (#96) 2022-12-14 17:02:39 +01:00
go.sum Add latest development to main branch (#89) 2022-10-10 12:23:51 +02:00
LICENSE Initial commit 2021-02-16 16:24:11 +01:00
Makefile Push LIKWID collector fix into main (#98) 2022-12-20 13:04:24 +01:00
README.md Push LIKWID collector fix into main (#98) 2022-12-20 13:04:24 +01:00
receivers.json Merge develop branch into main (#96) 2022-12-14 17:02:39 +01:00
router.json Modularize the whole thing (#16) 2022-01-25 15:37:43 +01:00
sinks.json Adopt sinks.json for new meta_as_tags usage 2022-04-19 12:06:53 +02:00

cc-metric-collector

A node agent for measuring, processing and forwarding node level metrics. It is part of the ClusterCockpit ecosystem.

The metric collector sends (and receives) metrics in the InfluxDB line protocol, which is flexible and separates tags (like index columns in relational databases) from fields (like data columns).
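To make the tag/field separation concrete, here is a minimal sketch of rendering one metric as a line-protocol string. The function name `formatLine`, the metric name, the tag names and the single `value` field are assumptions chosen for illustration, not the collector's actual code, and escaping of special characters is left out.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatLine renders one metric in the InfluxDB line protocol layout:
//   <measurement>[,<tag>=<value>...] <field>=<value> <timestamp>
// Hypothetical helper: escaping of spaces, commas and equal signs is omitted.
func formatLine(name string, tags map[string]string, value float64, unixNano int64) string {
	// Sort tag keys so the output is deterministic.
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(name)
	for _, k := range keys {
		fmt.Fprintf(&b, ",%s=%s", k, tags[k])
	}
	// A single field named "value" is an assumption of this sketch.
	fmt.Fprintf(&b, " value=%g %d", value, unixNano)
	return b.String()
}

func main() {
	tags := map[string]string{"hostname": "node01", "type": "node"}
	fmt.Println(formatLine("cpu_load", tags, 1.5, 1671537600000000000))
}
```

Tags end up in the comma-separated part directly after the measurement name, while the measured number is carried as a field after the first space.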

A single timer loop triggers all collectors serially, gathers their data and sends the metrics to the sink. The collectors run in one loop so that all data of an iteration is submitted with a single timestamp. The sinks currently use mostly blocking APIs.

The receiver runs as a goroutine side by side with the timer loop and asynchronously forwards received metrics to the sink.

DOI

Configuration

Configuration is implemented using a single JSON document that is distributed over the network and may be persisted as a file. Supported metrics are documented here.

There is a main configuration file with basic settings that point to the other configuration files for the different components.

{
  "sinks": "sinks.json",
  "collectors" : "collectors.json",
  "receivers" : "receivers.json",
  "router" : "router.json",
  "interval": "10s",
  "duration": "1s"
}

The interval defines how often the metrics are read and sent to the sink. The duration tells the collectors how long one measurement should take. This matters for collectors that measure over a time window, such as the likwid collector. For more information, see here.
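A minimal sketch of reading such a main configuration in Go follows. The struct `mainConfig` and the function `parseConfig` are hypothetical names for this example. Rejecting unknown keys mirrors the "Do not allow unknown fields in JSON configuration file" change listed in the commit message; parsing the two time values with `time.ParseDuration` is an assumption based on their format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// mainConfig mirrors the keys of the main configuration file.
type mainConfig struct {
	Sinks      string `json:"sinks"`
	Collectors string `json:"collectors"`
	Receivers  string `json:"receivers"`
	Router     string `json:"router"`
	Interval   string `json:"interval"`
	Duration   string `json:"duration"`
}

// parseConfig decodes the document, rejects unknown keys and converts
// the interval and duration strings into time.Duration values.
func parseConfig(doc string) (mainConfig, time.Duration, time.Duration, error) {
	var cfg mainConfig
	dec := json.NewDecoder(strings.NewReader(doc))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&cfg); err != nil {
		return cfg, 0, 0, err
	}
	interval, err := time.ParseDuration(cfg.Interval)
	if err != nil {
		return cfg, 0, 0, fmt.Errorf("interval: %w", err)
	}
	duration, err := time.ParseDuration(cfg.Duration)
	if err != nil {
		return cfg, 0, 0, fmt.Errorf("duration: %w", err)
	}
	return cfg, interval, duration, nil
}

func main() {
	doc := `{
	  "sinks": "sinks.json",
	  "collectors": "collectors.json",
	  "receivers": "receivers.json",
	  "router": "router.json",
	  "interval": "10s",
	  "duration": "1s"
	}`
	cfg, interval, duration, err := parseConfig(doc)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(cfg.Collectors, interval, duration)
}
```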

See the component READMEs for their configuration:

Installation

$ git clone git@github.com:ClusterCockpit/cc-metric-collector.git
$ cd cc-metric-collector
$ make    # downloads LIKWID, builds it as a static library with 'direct' access mode and copies all required files for the collector
$ go get  # requires at least Go 1.16
$ make

For more information, see here.

Running

$ ./cc-metric-collector --help
Usage of metric-collector:
  -config string
    	Path to configuration file (default "./config.json")
  -log string
    	Path for logfile (default "stderr")
  -once
    	Run all collectors only once
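For readers unfamiliar with Go command-line handling, the options listed in the help output could be declared with the standard `flag` package roughly like this. The `options` struct and `parseArgs` function are hypothetical and only reproduce the three flags and defaults shown in the help text; the example path passed in `main` is made up.

```go
package main

import (
	"flag"
	"fmt"
)

// options holds the three command-line settings from the help output.
type options struct {
	ConfigFile string
	LogFile    string
	Once       bool
}

// parseArgs parses the given arguments with the defaults from the help text.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("metric-collector", flag.ContinueOnError)
	fs.StringVar(&o.ConfigFile, "config", "./config.json", "Path to configuration file")
	fs.StringVar(&o.LogFile, "log", "stderr", "Path for logfile")
	fs.BoolVar(&o.Once, "once", false, "Run all collectors only once")
	err := fs.Parse(args)
	return o, err
}

func main() {
	// Hypothetical invocation: run every collector once with a custom config path.
	o, err := parseArgs([]string{"-once", "-config", "/tmp/example-config.json"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```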

Scenarios

The metric collector was designed with flexibility in mind, so it can be used in many scenarios. Here are a few:

flowchart TD
  subgraph a ["Cluster A"]
  nodeA[NodeA with CC collector]
  nodeB[NodeB with CC collector]
  nodeC[NodeC with CC collector]
  end
  a --> db[(Database)]
  db <--> ccweb("Webfrontend")

flowchart TD
  subgraph a [ClusterA]
  direction LR
  nodeA[NodeA with CC collector]
  nodeB[NodeB with CC collector]
  nodeC[NodeC with CC collector]
  end
  subgraph b [ClusterB]
  direction LR
  nodeD[NodeD with CC collector]
  nodeE[NodeE with CC collector]
  nodeF[NodeF with CC collector]
  end
  a --> ccrecv{"CC collector as receiver"}
  b --> ccrecv
  ccrecv --> db[("Database1")]
  ccrecv -.-> db2[("Database2")]
  db <-.-> ccweb("Webfrontend")

Contributing

The ClusterCockpit ecosystem is designed to be used by different HPC centers. Since configurations and setups differ between centers, each center will likely have to put some work into cc-metric-collector to gather all desired metrics.

Feel free to open an issue to request a collector. Pull requests are also welcome.

Contact