cc-metric-store/README.md

173 lines
9.1 KiB
Markdown
Raw Normal View History

2021-06-09 06:03:31 +02:00
# ClusterCockpit Metric Store
2021-09-01 08:48:35 +02:00
[![Build & Test](https://github.com/ClusterCockpit/cc-metric-store/actions/workflows/test.yml/badge.svg)](https://github.com/ClusterCockpit/cc-metric-store/actions/workflows/test.yml)
2021-08-31 15:18:06 +02:00
2022-01-07 08:49:13 +01:00
The cc-metric-store provides a simple in-memory time series database for storing metrics of cluster nodes at preconfigured intervals. It is meant to be used as part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all data is kept in-memory (but written to disk as compressed JSON for long term storage), accessing it is very fast. It also provides aggregations over time *and* nodes/sockets/cpus.
Go look at the `TODO.md` file and the [GitHub Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress overview. Things work, but are not properly tested.
2021-10-12 13:26:54 +02:00
The [NATS.io](https://nats.io/) based writing endpoint consumes messages in [this format of the InfluxDB line protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).
2021-08-24 10:41:30 +02:00
### REST API Endpoints
In case `jwt-public-key` is a non-empty string in the `config.json` file, the API is protected by JWT based authentication. The signing algorithm has to be `Ed25519`, but no
fields are required in the JWT payload. Expiration will be checked if specified. The JWT has to be provided using the HTTP `Authorization` header.
All but one endpoints use *selectors* to access the data. A selector must be an array of strings or another array of strings. Examples are provided below.
In the requests, `to` and `from` have to be UNIX timestamps in seconds. The response might also contain `from`/`to` timestamps. They can differ from those in the request,
if there was not data for a section of the requested data.
1. `POST /api/<from>/<to>/timeseries`
- Request-Body: `{ "selectors": [<sel1>, <sel2>, <sel3>, ...], "metrics": ["flops_any", "load_one", ...] }`
- The response will be a JSON array, each entry in the array corresponding to the selector found at that index in the request's `selectors` array
- Each array entry will be a map from every requested metric to this: `{ "from": Timestamp, "to": Timestamp, "data": Array of Floats }`
- Some values in `data` might be `null` if there is no data available for that time slot
2. `POST /api/<from>/<to>/stats`
- The Request-Body shall be the same as for a `timeseries` query
- The response will be a JSON array, each entry in the array corresponding to the selector found at that index in the request's `selectors` array
- Each array entry will be a map from every requested metric to this: `{ "from": Timestamp, "to": Timestamp, "samples": Int, "avg": Float, "min": Float, "max": Float }`
- If the `samples` value is 0, the statistics should be ignored.
3. `POST /api/<to>/free`
- Request-Body: Array of selectors
- This request will free up and release all data older than `to` for all nodes specified by the selectors
4. `GET /api/{cluster}/peek`
- Return a map from every node in the specified cluster to a map from every metric to the newest value available for that metric
- All cpu/socket level metrics are aggregated to the node level
2021-10-12 13:26:54 +02:00
5. `POST /api/write`
- You can send lines of the InfluxDB line protocol to this endpoint and they will be written to the store (Basically an alternative to NATS)
2021-08-24 10:41:30 +02:00
### Run tests
2021-08-31 15:18:06 +02:00
Some benchmarks concurrently access the `MemoryStore`, so enabling the
[Race Detector](https://golang.org/doc/articles/race_detector) might be useful.
The benchmarks also work as tests as they do check if the returned values are as
expected.
2021-08-24 10:41:30 +02:00
```sh
2021-08-31 15:18:06 +02:00
# Tests only
go test -v ./...
# Benchmarks as well
go test -bench=. -race -v ./...
2021-08-24 10:41:30 +02:00
```
2021-08-31 15:18:06 +02:00
### What are these selectors mentioned in the code and API?
Tags in InfluxDB are used to build indexes over the stored data. InfluxDB-Tags have no
relation to each other, they do not depend on each other and have no hierarchy.
2021-09-01 08:48:35 +02:00
Different tags build up different indexes (I am no expert at all, but this is how i think they work).
2021-08-31 15:18:06 +02:00
This project also works as a time-series database and uses the InfluxDB line protocol.
Unlike InfluxDB, the data is indexed by one single strictly hierarchical tree structure.
A selector is build out of the tags in the InfluxDB line protocol, and can be used to select
a node (not in the sense of a compute node, can also be a socket, cpu, ...) in that tree.
The implementation calls those nodes `level` to avoid confusion.
It is impossible to access data only by knowing the *socket* or *cpu* tag, all higher up levels have to be specified as well.
2021-08-31 15:18:06 +02:00
This is what the hierarchy currently looks like:
2021-08-31 15:18:06 +02:00
- cluster1
- host1
- socket0
- socket1
2021-09-08 12:26:22 +02:00
- ...
- cpu1
- cpu2
- cpu3
- cpu4
2021-08-31 15:18:06 +02:00
- ...
- host2
- ...
- cluster2
- ...
Example selectors:
1. `["cluster1", "host1", "cpu0"]`: Select only the cpu0 of host1 in cluster1
2. `["cluster1", "host1", ["cpu4", "cpu5", "cpu6", "cpu7"]]`: Select only CPUs 4-7 of host1 in cluster1
3. `["cluster1", "host1"]`: Select the complete node. If querying for a CPU-specific metric such as floats, all CPUs are implied
2021-08-31 15:18:06 +02:00
### Config file
All durations are specified in seconds.
2021-08-31 15:18:06 +02:00
- `metrics`: Map of metric-name to objects with the following properties
2021-09-01 08:48:35 +02:00
- `frequency`: Timestep/Interval/Resolution of this metric (In seconds)
- `aggregation`: Can be `"sum"`, `"avg"` or `null`
- `null` means aggregation across nodes is forbidden for this metric
2021-08-31 15:18:06 +02:00
- `"sum"` means that values from the child levels are summed up for the parent level
- `"avg"` means that values from the child levels are averaged for the parent level
- `scope`: Unused at the moment, should be something like `"node"`, `"socket"` or `"cpu"`
2021-10-12 13:26:54 +02:00
- `nats`: Url of NATS.io server (The `updates` channel will be subscribed for metrics), example: "nats://localhost:4222"
- `http-api-address`: Where to listen via HTTP, example: ":8080"
- `jwt-public-key`: Base64 encoded string, use this to verify requests to the HTTP API
2021-10-12 13:26:54 +02:00
- `retention-on-memory`: Keep all values in memory for at least that amount of seconds
2021-08-31 15:18:06 +02:00
2021-08-24 10:41:30 +02:00
### Test the complete setup (excluding ClusterCockpit itself)
First, get a NATS server running:
```sh
# Only needed once, downloads the docker image
docker pull nats:latest
# Start the NATS server
docker run -p 4222:4222 -ti nats:latest
```
2021-09-01 08:48:35 +02:00
Second, build and start the [cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector) using the following as `config.json`:
2021-08-24 10:41:30 +02:00
```json
{
"sink": {
"type": "nats",
"host": "localhost",
"port": "4222",
"database": "updates"
},
"interval" : 3,
"duration" : 1,
"collectors": [ "likwid", "loadavg" ],
"default_tags": { "cluster": "testcluster" },
"receiver": { "type": "none" }
}
```
2021-08-31 15:18:06 +02:00
Third, build and start the metric store. For this example here, the `config.json` file
already in the repository should work just fine.
2021-08-24 10:41:30 +02:00
```sh
# Assuming you have a clone of this repo in ./cc-metric-store:
cd cc-metric-store
go get
go build
./cc-metric-store
```
And finally, use the API to fetch some data. The API is protected by JWT based authentication if `jwt-public-key` is set in `config.json`. You can use this JWT for testing: `eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw`
2021-08-24 10:41:30 +02:00
```sh
JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
2021-08-24 10:41:30 +02:00
# If the collector and store and nats-server have been running for at least 60 seconds on the same host, you may run:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/$(expr $(date +%s) - 60)/$(date +%s)/timeseries" -d "{ \"selectors\": [[\"testcluster\", \"$(hostname)\"]], \"metrics\": [\"load_one\"] }"
2021-08-24 10:41:30 +02:00
2021-08-31 15:18:06 +02:00
# Get flops_any for all CPUs:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/$(expr $(date +%s) - 60)/$(date +%s)/timeseries" -d "{ \"selectors\": [[\"testcluster\", \"$(hostname)\"]], \"metrics\": [\"flops_any\"] }"
2021-08-24 10:41:30 +02:00
2021-08-31 15:18:06 +02:00
# Get flops_any for CPU 0:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/$(expr $(date +%s) - 60)/$(date +%s)/timeseries" -d "{ \"selectors\": [[\"testcluster\", \"$(hostname)\", \"cpu0\"]], \"metrics\": [\"flops_any\"] }"
2021-09-08 12:26:22 +02:00
# Get flops_any for CPU 0, 1, 2 and 3:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/$(expr $(date +%s) - 60)/$(date +%s)/timeseries" -d "{ \"selectors\": [[\"testcluster\", \"$(hostname)\", [\"cpu0\", \"cpu1\", \"cpu2\", \"cpu3\"]]], \"metrics\": [\"flops_any\"] }"
2021-09-01 08:48:35 +02:00
# Stats for load_one and proc_run:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/$(expr $(date +%s) - 60)/$(date +%s)/stats" -d "{ \"selectors\": [[\"testcluster\", \"$(hostname)\"]], \"metrics\": [\"load_one\", \"proc_run\"] }"
2021-09-01 08:48:35 +02:00
# Stats for *all* CPUs aggregated both from CPU to node and over time:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/$(expr $(date +%s) - 60)/$(date +%s)/stats" -d "{ \"selectors\": [[\"testcluster\", \"$(hostname)\"]], \"metrics\": [\"flops_sp\", \"flops_dp\"] }"
2021-09-01 08:48:35 +02:00
2021-08-24 10:41:30 +02:00
# ...
```