A node agent for measuring, processing and forwarding node level metrics
Go to file
Thomas Gruber 7840de7b82
Merge develop branch into main (#123)
* Add cpu_used (all-cpu_idle) to CpustatCollector

* Update cc-metric-collector.init

* Allow selection of timestamp precision in HttpSink

* Add comment about precision requirement for cc-metric-store

* Fix for API changes in gofish@v0.15.0

* Update requirements to latest version

* Read sensors through redfish

* Update golang toolchain to 1.21

* Remove stray error check

* Update main config in configuration.md

* Update Release action to use golang 1.22 stable release, no golang RPMs anymore

* Update runonce action to use golang 1.22 stable release, no golang RPMs anymore

* Update README.md

Use right JSON type in configuration

* Update sink's README

* Test whether ipmitool or ipmi-sensors can be executed without errors

* Little fixes to the prometheus sink (#115)

* Add uint64 to float64 cast option

* Add prometheus sink to the list of available sinks

* Add aggregated counters by gpu for nvlink errors

---------

Co-authored-by: Michael Schwarz <schwarz@uni-paderborn.de>

* Ccmessage migration (#119)

* Add cpu_used (all-cpu_idle) to CpustatCollector

* Update cc-metric-collector.init

* Allow selection of timestamp precision in HttpSink

* Add comment about precision requirement for cc-metric-store

* Fix for API changes in gofish@v0.15.0

* Update requirements to latest version

* Read sensors through redfish

* Update golang toolchain to 1.21

* Remove stray error check

* Update main config in configuration.md

* Update Release action to use golang 1.22 stable release, no golang RPMs anymore

* Update runonce action to use golang 1.22 stable release, no golang RPMs anymore

* Switch to CCMessage for all files.

---------

Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>

* Switch to ccmessage also for latest additions in nvidiaMetric

* New Message processor (#118)

* Add cpu_used (all-cpu_idle) to CpustatCollector

* Update cc-metric-collector.init

* Allow selection of timestamp precision in HttpSink

* Add comment about precision requirement for cc-metric-store

* Fix for API changes in gofish@v0.15.0

* Update requirements to latest version

* Read sensors through redfish

* Update golang toolchain to 1.21

* Remove stray error check

* Update main config in configuration.md

* Update Release action to use golang 1.22 stable release, no golang RPMs anymore

* Update runonce action to use golang 1.22 stable release, no golang RPMs anymore

* New message processor to check whether a message should be dropped or manipulate it in flight

* Create a copy of message before manipulation

---------

Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>

* Update collector's Makefile and go.mod/sum files

* Use message processor in router, all sinks and all receivers

* Add support for credential file (NKEY) to NATS sink and receiver

* Fix JSON keys in message processor configuration

* Update docs for message processor, router and the default router config file

* Add link to expr syntax and fix regex matching docs

* Update sample collectors

* Minor style change in collector manager

* Some helpers for ccTopology

* LIKWID collector: write log owner change only once

* Fix for metrics without units and reduce debugging messages for messageProcessor

* Use shorted hostname for hostname added by router

* Define default port for NATS

* CPUstat collector: only add unit for applicable metrics

* Add precision option to all sinks using Influx's encoder

* Add message processor to all sink documentation

* Add units to documentation of cpustat collector

---------

Co-authored-by: Holger Obermaier <Holger.Obermaier@kit.edu>
Co-authored-by: Holger Obermaier <40787752+ho-ob@users.noreply.github.com>
Co-authored-by: oscarminus <me@oscarminus.de>
Co-authored-by: Michael Schwarz <schwarz@uni-paderborn.de>
2024-12-19 23:00:14 +01:00
.github Merge develop branch into main (#123) 2024-12-19 23:00:14 +01:00
collectors Merge develop branch into main (#123) 2024-12-19 23:00:14 +01:00
docs Merge latest commits from develop to main branch (#114) 2024-11-20 16:22:39 +01:00
internal Merge develop branch into main (#123) 2024-12-19 23:00:14 +01:00
pkg Merge develop branch into main (#123) 2024-12-19 23:00:14 +01:00
receivers Merge develop branch into main (#123) 2024-12-19 23:00:14 +01:00
scripts Merge latest commits from develop to main branch (#114) 2024-11-20 16:22:39 +01:00
sinks Merge develop branch into main (#123) 2024-12-19 23:00:14 +01:00
.gitignore Initial commit 2021-02-16 16:24:11 +01:00
.gitmodules Ganglia sink using libganglia.so directly (#35) 2022-02-16 18:33:46 +01:00
.zenodo.json Register cc-metric-collector at Zenodo (#93) 2022-12-14 16:53:44 +01:00
cc-metric-collector.go Merge develop branch into main (#123) 2024-12-19 23:00:14 +01:00
collectors.json Add memstats and topprocs metric 2022-06-23 11:44:06 +02:00
config.json Use Golang duration parser for 'interval' and 'duration' 2022-05-13 12:33:33 +02:00
go.mod Merge develop branch into main (#123) 2024-12-19 23:00:14 +01:00
go.sum Merge develop branch into main (#123) 2024-12-19 23:00:14 +01:00
LICENSE Initial commit 2021-02-16 16:24:11 +01:00
Makefile Debian does not like underscores in the version 2022-12-20 13:35:21 +01:00
README.md Update README.md 2022-12-14 18:47:32 +01:00
receivers.json Use package hostlist to expand a host list 2022-12-01 09:48:34 +01:00
router.json Merge develop branch into main (#123) 2024-12-19 23:00:14 +01:00
sinks.json Adopt sinks.json for new meta_as_tags usage 2022-04-19 12:06:53 +02:00

cc-metric-collector

A node agent for measuring, processing and forwarding node level metrics. It is part of the ClusterCockpit ecosystem.

The metric collector sends (and receives) metric in the InfluxDB line protocol as it provides flexibility while providing a separation between tags (like index columns in relational databases) and fields (like data columns).

There is a single timer loop that triggers all collectors serially, collects the collectors' data and sends the metrics to the sink. This is done as all data is submitted with a single time stamp. The sinks currently use mostly blocking APIs.

The receiver runs as a go routine side-by-side with the timer loop and asynchronously forwards received metrics to the sink.

DOI

Configuration

Configuration is implemented using a single json document that is distributed over network and may be persisted as file. Supported metrics are documented here.

There is a main configuration file with basic settings that point to the other configuration files for the different components.

{
  "sinks": "sinks.json",
  "collectors" : "collectors.json",
  "receivers" : "receivers.json",
  "router" : "router.json",
  "interval": "10s",
  "duration": "1s"
}

The interval defines how often the metrics should be read and send to the sink. The duration tells collectors how long one measurement has to take. This is important for some collectors, like the likwid collector. For more information, see here.

See the component READMEs for their configuration:

Installation

$ git clone git@github.com:ClusterCockpit/cc-metric-collector.git
$ make (downloads LIKWID, builds it as static library with 'direct' accessmode and copies all required files for the collector)
$ go get (requires at least golang 1.16)
$ make

For more information, see here.

Running

$ ./cc-metric-collector --help
Usage of metric-collector:
  -config string
    	Path to configuration file (default "./config.json")
  -log string
    	Path for logfile (default "stderr")
  -once
    	Run all collectors only once

Scenarios

The metric collector was designed with flexibility in mind, so it can be used in many scenarios. Here are a few:

flowchart TD
  subgraph a ["Cluster A"]
  nodeA[NodeA with CC collector]
  nodeB[NodeB with CC collector]
  nodeC[NodeC with CC collector]
  end
  a --> db[(Database)]
  db <--> ccweb("Webfrontend")
flowchart TD
  subgraph a [ClusterA]
  direction LR
  nodeA[NodeA with CC collector]
  nodeB[NodeB with CC collector]
  nodeC[NodeC with CC collector]
  end
  subgraph b [ClusterB]
  direction LR
  nodeD[NodeD with CC collector]
  nodeE[NodeE with CC collector]
  nodeF[NodeF with CC collector]
  end
  a --> ccrecv{"CC collector as receiver"}
  b --> ccrecv
  ccrecv --> db[("Database1")]
  ccrecv -.-> db2[("Database2")]
  db <-.-> ccweb("Webfrontend")

Contributing

The ClusterCockpit ecosystem is designed to be used by different HPC computing centers. Since configurations and setups differ between the centers, the centers likely have to put some work into the cc-metric-collector to gather all desired metrics.

You are free to open an issue to request a collector but we would also be happy about PRs.

Contact