mirror of
https://github.com/ClusterCockpit/cc-metric-store.git
synced 2024-11-10 05:07:25 +01:00
153 lines
5.3 KiB
Markdown
153 lines
5.3 KiB
Markdown
# ClusterCockpit Metric Store
|
|
|
|
![test workflow](https://github.com/ClusterCockpit/cc-metric-store/actions/workflows/test.yml/badge.svg)
|
|
|
|
Barely unusable yet. Go look at the [GitHub Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress overview.
|
|
|
|
### REST API Endpoints
|
|
|
|
_TODO... (For now, look at the examples below)_
|
|
|
|
### Run tests
|
|
|
|
Some benchmarks concurrently access the `MemoryStore`, so enabling the
|
|
[Race Detector](https://golang.org/doc/articles/race_detector) might be useful.
|
|
The benchmarks also work as tests as they do check if the returned values are as
|
|
expected.
|
|
|
|
```sh
|
|
# Tests only
|
|
go test -v ./...
|
|
|
|
# Benchmarks as well
|
|
go test -bench=. -race -v ./...
|
|
```
|
|
|
|
### What are these selectors mentioned in the code and API?
|
|
|
|
Tags in InfluxDB are used to build indexes over the stored data. InfluxDB-Tags have no
|
|
relation to each other, they do not depend on each other and have no hierarchy.
|
|
Different tags build up different indexes.
|
|
|
|
This project also works as a time-series database and uses the InfluxDB line protocol.
|
|
Unlike InfluxDB, the data is indexed by one single strictly hierarchical tree structure.
|
|
A selector is build out of the tags in the InfluxDB line protocol, and can be used to select
|
|
a node (not in the sense of a compute node, can also be a socket, cpu, ...) in that tree.
|
|
The implementation calls those nodes `level` to avoid confusion. It is impossible to access data
|
|
only by knowing the *socket* or *cpu* tag, all higher up levels have to be specified as well.
|
|
|
|
Metrics have to be specified in advance! Those are taken from the *fields* of a line-protocol message.
|
|
New levels will be created on the fly at any depth, meaning that the clusters, hosts, sockets, number of cpus,
|
|
and so on do *not* have to be known at startup. Every level can hold all kinds of metrics. If a level is asked for
|
|
metrics it does not have itself, *all* child-levels will be asked for their values for that metric and
|
|
the data will be aggregated on a per-timestep basis.
|
|
|
|
A.t.m., the *cc-metric-collector* does not provide a *socket* tag for *cpu* metrics, so currently,
|
|
the tree structure looks like this (example selector: `["cluster1", "host1", "cpu", "0"]`):
|
|
|
|
- cluster1
|
|
- host1
|
|
- socket
|
|
- 0
|
|
- 1
|
|
- cpu
|
|
- 0
|
|
- 1
|
|
- 2
|
|
- ...
|
|
- host2
|
|
- ...
|
|
- cluster2
|
|
- ...
|
|
|
|
The plan is later to have the structure look like this (for this, the socket of every cpu must be kown in advance or provided as tag, example selector: `["cluster1", "host1", "socket0", "cpu0"]`):
|
|
|
|
- cluster1
|
|
- host1
|
|
- socket0
|
|
- cpu0
|
|
- cpu1
|
|
- cpu2
|
|
- ...
|
|
- socket1
|
|
- cpu8
|
|
- cpu9
|
|
- cpu10
|
|
- ...
|
|
- ...
|
|
- host2
|
|
- ...
|
|
- cluster2
|
|
- ...
|
|
|
|
### Config file
|
|
|
|
- `metrics`: Map of metric-name to objects with the following properties
|
|
- `frequency`: Timestep/Interval/Resolution of this metric
|
|
- `aggregation`: Can be `"sum"`, `"avg"` or `null`.
|
|
- `null` means "horizontal" aggregation is disabled
|
|
- `"sum"` means that values from the child levels are summed up for the parent level
|
|
- `"avg"` means that values from the child levels are averaged for the parent level
|
|
- `scope`: Unused at the moment, should be something like `"node"`, `"socket"` or `"cpu"`
|
|
- `nats`: Url of NATS.io server (The `updates` channel will be subscribed for metrics)
|
|
- `archive-root`: Directory to be used as archive (__Unimplemented__)
|
|
- `restore-last-hours`: After restart, load data from the past *X* hours back to memory (__Unimplemented__)
|
|
- `checkpoint-interval-hours`: Every *X* hours, write currently held data to disk (__Unimplemented__)
|
|
|
|
### Test the complete setup (excluding ClusterCockpit itself)
|
|
|
|
First, get a NATS server running:
|
|
|
|
```sh
|
|
# Only needed once, downloads the docker image
|
|
docker pull nats:latest
|
|
|
|
# Start the NATS server
|
|
docker run -p 4222:4222 -ti nats:latest
|
|
```
|
|
|
|
Second, build and start start the [cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector) using the following as `config.json`:
|
|
|
|
```json
|
|
{
|
|
"sink": {
|
|
"type": "nats",
|
|
"host": "localhost",
|
|
"port": "4222",
|
|
"database": "updates"
|
|
},
|
|
"interval" : 3,
|
|
"duration" : 1,
|
|
"collectors": [ "likwid", "loadavg" ],
|
|
"default_tags": { "cluster": "testcluster" },
|
|
"receiver": { "type": "none" }
|
|
}
|
|
```
|
|
|
|
Third, build and start the metric store. For this example here, the `config.json` file
|
|
already in the repository should work just fine.
|
|
|
|
```sh
|
|
# Assuming you have a clone of this repo in ./cc-metric-store:
|
|
cd cc-metric-store
|
|
go get
|
|
go build
|
|
./cc-metric-store
|
|
```
|
|
|
|
And finally, use the API to fetch some data:
|
|
|
|
```sh
|
|
# If the collector and store and nats-server have been running for at least 60 seconds on the same host, you may run:
|
|
curl -D - "http://localhost:8080/api/$(expr $(date +%s) - 60)/$(date +%s)/timeseries" -d "[ { \"selector\": [\"testcluster\", \"$(hostname)\"], \"metrics\": [\"load_one\"] } ]"
|
|
|
|
# Get flops_any for all CPUs:
|
|
curl -D - "http://localhost:8080/api/$(expr $(date +%s) - 60)/$(date +%s)/timeseries" -d "[ { \"selector\": [\"testcluster\", \"$(hostname)\", \"cpu\"], \"metrics\": [\"flops_any\"] } ]"
|
|
|
|
# Get flops_any for CPU 0:
|
|
curl -D - "http://localhost:8080/api/$(expr $(date +%s) - 60)/$(date +%s)/timeseries" -d "[ { \"selector\": [\"testcluster\", \"$(hostname)\", \"cpu\", \"0\"], \"metrics\": [\"flops_any\"] } ]"
|
|
|
|
# ...
|
|
```
|
|
|