# ClusterCockpit Metric Store

[![Build & Test](https://github.com/ClusterCockpit/cc-metric-store/actions/workflows/test.yml/badge.svg)](https://github.com/ClusterCockpit/cc-metric-store/actions/workflows/test.yml)

The cc-metric-store provides a simple in-memory time series database for storing
metrics of cluster nodes at preconfigured intervals. It is meant to be used as
part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). Because all
data is kept in memory (and only written to disk as compressed JSON for long-term
storage), access is very fast. It also provides topology-aware
aggregations over time _and_ over nodes/sockets/cpus.

There are major limitations: data is only written to disk at periodic
checkpoints, not as soon as it is received. Also, only data within the fixed,
configured duration is stored and available.

See the [GitHub
Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress
overview. The [NATS.io](https://nats.io/)-based write endpoint consumes messages in [this
format of the InfluxDB line
protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).

## REST API Endpoints

The REST API is documented in [swagger.json](./api/swagger.json). You can
explore and try the REST API using the integrated [SwaggerUI web
interface](http://localhost:8082/swagger).

## Run tests

Some benchmarks concurrently access the `MemoryStore`, so enabling the
[Race Detector](https://golang.org/doc/articles/race_detector) might be useful.
The benchmarks also serve as tests, since they check that the returned values
are as expected.

```sh
# Tests only
go test -v ./...

# Benchmarks as well
go test -bench=. -race -v ./...
```

## What are these selectors mentioned in the code?

The cc-metric-store works as a time-series database and uses the InfluxDB line
protocol as input format. Unlike InfluxDB, the data is indexed by one single
strictly hierarchical tree structure. A selector is built out of the tags in the
InfluxDB line protocol, and can be used to select a node (not in the sense of a
compute node; it can also be a socket, cpu, ...) in that tree. The implementation
calls those nodes `level` to avoid confusion. It is impossible to access data
only by knowing the _socket_ or _cpu_ tag; all higher levels have to be
specified as well.

This is what the hierarchy currently looks like:

- cluster1
  - host1
    - socket0
    - socket1
    - ...
    - cpu1
    - cpu2
    - cpu3
    - cpu4
    - ...
    - gpu1
    - gpu2
  - host2
  - ...
- cluster2
- ...

Example selectors:

1. `["cluster1", "host1", "cpu0"]`: Select only the cpu0 of host1 in cluster1
2. `["cluster1", "host1", ["cpu4", "cpu5", "cpu6", "cpu7"]]`: Select only CPUs 4-7 of host1 in cluster1
3. `["cluster1", "host1"]`: Select the complete node. If querying for a CPU-specific metric such as floats, all CPUs are implied

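
The lookup rules above can be sketched in Go as a small tree walk. The `level` type shown here, its fields, and the `insert`, `find` and `sum` helpers are hypothetical and far simpler than the store's real data structures: a single number stands in for a metric buffer, and only plain string selectors (not the nested list form of example 2) are handled.

```go
package main

import "fmt"

// level is one node of the hierarchy (a cluster, host, socket, cpu, ...).
// The type and its fields are illustrative, not the store's actual types.
type level struct {
	children map[string]*level
	value    float64 // a single sample stands in for a metric buffer
	hasValue bool
}

func newLevel() *level { return &level{children: map[string]*level{}} }

// insert walks (and creates) the path named by selector and stores v there.
func (l *level) insert(selector []string, v float64) {
	for _, name := range selector {
		child, ok := l.children[name]
		if !ok {
			child = newLevel()
			l.children[name] = child
		}
		l = child
	}
	l.value, l.hasValue = v, true
}

// find resolves a selector starting at the root; every higher level must be
// named, so a selector that skips levels does not match anything.
func (l *level) find(selector []string) *level {
	for _, name := range selector {
		child, ok := l.children[name]
		if !ok {
			return nil
		}
		l = child
	}
	return l
}

// sum adds up all values at or below l, the way a "sum" aggregation
// would combine child levels into their parent.
func (l *level) sum() float64 {
	total := 0.0
	if l.hasValue {
		total += l.value
	}
	for _, c := range l.children {
		total += c.sum()
	}
	return total
}

func main() {
	root := newLevel()
	root.insert([]string{"cluster1", "host1", "cpu0"}, 1.5)
	root.insert([]string{"cluster1", "host1", "cpu1"}, 2.5)

	fmt.Println(root.find([]string{"cluster1", "host1", "cpu0"}).value) // a single CPU
	fmt.Println(root.find([]string{"cluster1", "host1"}).sum())         // the whole host, all CPUs implied
	fmt.Println(root.find([]string{"cpu0"}) == nil)                     // the cpu tag alone selects nothing
}
```
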
## Config file

All durations are specified as strings that will be parsed [like
this](https://pkg.go.dev/time#ParseDuration) (allowed suffixes: `s`, `m`, `h`,
...).

- `metrics`: Map of metric-name to objects with the following properties
  - `frequency`: Timestep/Interval/Resolution of this metric
  - `aggregation`: Can be `"sum"`, `"avg"` or `null`
    - `null` means aggregation across nodes is forbidden for this metric
    - `"sum"` means that values from the child levels are summed up for the parent level
    - `"avg"` means that values from the child levels are averaged for the parent level
  - `scope`: Unused at the moment, should be something like `"node"`, `"socket"` or `"hwthread"`
- `nats`:
  - `address`: URL of NATS.io server, example: "nats://localhost:4222"
  - `username` and `password`: Optional, if provided use those for the connection
  - `subscriptions`:
    - `subscribe-to`: Where to expect the measurements to be published
    - `cluster-tag`: Default value for the cluster tag
- `http-api`:
  - `address`: Address to bind to, for example `0.0.0.0:8080`
  - `https-cert-file` and `https-key-file`: Optional, if provided enable HTTPS using those files as certificate/key
- `jwt-public-key`: Base64 encoded string, use this to verify requests to the HTTP API
- `retention-on-memory`: Keep all values in memory for at least that amount of time
- `checkpoints`:
  - `interval`: Do checkpoints every X seconds/minutes/hours
  - `directory`: Path to a directory
  - `restore`: After a restart, load the last X seconds/minutes/hours of data back into memory
- `archive`:
  - `interval`: Move and compress all checkpoints not needed anymore every X seconds/minutes/hours
  - `directory`: Path to a directory

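
As a minimal sketch of how some of these keys fit together, the Go program below decodes a small sample document and checks its duration strings. The `Config` struct layout, the `load` helper, and every value in `sample` (metric names, paths, durations) are assumptions for illustration only. The `nats`, `jwt-public-key`, `frequency` and `scope` keys are deliberately left out, because the list above does not pin down their exact JSON shape; the `config.json` in the repository remains the reference.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Config mirrors a subset of the keys listed above. The layout is an
// assumption for illustration; keys not modelled here are ignored.
type Config struct {
	Metrics map[string]struct {
		Aggregation *string `json:"aggregation"` // nil when the JSON value is null
	} `json:"metrics"`
	HTTPAPI struct {
		Address string `json:"address"`
	} `json:"http-api"`
	RetentionOnMemory string `json:"retention-on-memory"`
	Checkpoints       struct {
		Interval  string `json:"interval"`
		Directory string `json:"directory"`
		Restore   string `json:"restore"`
	} `json:"checkpoints"`
	Archive struct {
		Interval  string `json:"interval"`
		Directory string `json:"directory"`
	} `json:"archive"`
}

// sample uses made-up metric names, paths and durations.
const sample = `{
  "metrics": {
    "load_one": { "aggregation": null },
    "flops_any": { "aggregation": "sum" }
  },
  "http-api": { "address": "0.0.0.0:8080" },
  "retention-on-memory": "48h",
  "checkpoints": { "interval": "12h", "directory": "./var/checkpoints", "restore": "48h" },
  "archive": { "interval": "168h", "directory": "./var/archive" }
}`

// load decodes the JSON and verifies that every duration string parses.
func load(data []byte) (*Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	durations := []string{c.RetentionOnMemory, c.Checkpoints.Interval, c.Checkpoints.Restore, c.Archive.Interval}
	for _, d := range durations {
		if _, err := time.ParseDuration(d); err != nil {
			return nil, fmt.Errorf("bad duration %q: %w", d, err)
		}
	}
	return &c, nil
}

func main() {
	c, err := load([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(c.HTTPAPI.Address, c.RetentionOnMemory, c.Metrics["load_one"].Aggregation == nil)
}
```
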
## Test the complete setup (excluding cc-backend itself)

There are two ways of sending data to the cc-metric-store, both of which are
supported by the
[cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector).
This example uses NATS; the alternative is to use HTTP.

First, start a NATS server:

```sh
# Only needed once, downloads the docker image
docker pull nats:latest

# Start the NATS server
docker run -p 4222:4222 -ti nats:latest
```

Second, build and start the
[cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector)
using the following as Sink-Config:

```json
{
  "type": "nats",
  "host": "localhost",
  "port": "4222",
  "database": "updates"
}
```

Third, build and start the metric store. For this example, the
`config.json` file already in the repository should work just fine.

```sh
# Assuming you have a clone of this repo in ./cc-metric-store:
cd cc-metric-store
make
./cc-metric-store
```

And finally, use the API to fetch some data. The API is protected by JWT-based
authentication if `jwt-public-key` is set in `config.json`. You can use this JWT
for testing:
`eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw`

```sh
JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"

# If the collector and store and nats-server have been running for at least 60 seconds on the same host, you may run:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/query" -d "{ \"cluster\": \"testcluster\", \"from\": $(expr $(date +%s) - 60), \"to\": $(date +%s), \"queries\": [{
  \"metric\": \"load_one\",
  \"host\": \"$(hostname)\"
}] }"

# ...
```

For debugging there is a debug endpoint to dump the current content to stdout:

```sh
JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"

# If the collector and store and nats-server have been running for at least 60 seconds on the same host, you may run:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/debug"

# ...
```