Merge pull request #28 from ClusterCockpit/hotfix

Update README and Remove TODO
Jan Eitzinger 2024-10-23 17:06:07 +02:00 committed by GitHub
commit 171d298b4c
6 changed files with 54 additions and 111 deletions

README.md

@@ -6,25 +6,41 @@ The cc-metric-store provides a simple in-memory time series database for storing
metrics of cluster nodes at preconfigured intervals. It is meant to be used as
part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all
data is kept in-memory (but written to disk as compressed JSON for long term
-storage), accessing it is very fast. It also provides aggregations over time
-_and_ nodes/sockets/cpus.
+storage), accessing it is very fast. It also provides topology-aware
+aggregations over time _and_ nodes/sockets/cpus.
There are major limitations: Data only gets written to disk at periodic
-checkpoints, not as soon as it is received.
+checkpoints, not as soon as it is received. Also, only data within the fixed,
+configured retention duration is stored and available.
-Go look at the `TODO.md` file and the [GitHub
+Go look at the [GitHub
Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress
-overview. Things work, but are not properly tested. The
-[NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
+overview. The [NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
format of the InfluxDB line
protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).
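
To give a feel for that format, a single measurement could be published like
this (a hedged sketch: the subject, metric name, and tag values are
illustrative, and the `nats` CLI is assumed to be available):

```sh
# Publish one measurement in the alternative InfluxDB line-protocol format.
# The subject must match the `subscribe-to` value in the NATS subscription config.
nats pub updates \
  'flops_any,cluster=cluster1,hostname=host1,type=node value=42.0 1700000000'
```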
+## Building
+
+`cc-metric-store` can be built using the provided `Makefile`.
+It supports the following targets (a typical invocation is sketched after the list):
+
+- `make`: Build the application, copy an example configuration file, and
+  generate checkpoint folders if required.
+- `make clean`: Clean the Go build cache and the application binary.
+- `make distclean`: In addition to the `clean` target, also remove the `./var`
+  folder.
+- `make swagger`: Regenerate the Swagger files from the source comments.
+- `make test`: Run tests and basic checks.
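
A typical build-and-run sequence might look like this (a sketch; it assumes
the defaults produced by `make`, including a `config.json` in the working
directory):

```sh
# Build the binary, example configuration, and checkpoint folders
make
# Start the server (it listens on the address configured under http-api)
./cc-metric-store
```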
+## REST API Endpoints
+
+The REST API is documented in [swagger.json](./api/swagger.json). You can
+explore and try the REST API using the integrated [SwaggerUI web
+interface](http://localhost:8082/swagger).
+
+For more information on the `cc-metric-store` REST API, have a look at the
+ClusterCockpit documentation [website](https://clustercockpit.org/docs/reference/cc-metric-store/ccms-rest-api/).
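
As a quick illustration of the API (a hedged sketch: `$JWT`, the cluster,
host, and metric names are placeholders, and the payload fields should be
double-checked against `swagger.json`):

```sh
# Query one metric for one host over a Unix-timestamp interval.
curl -X GET 'http://localhost:8082/api/query/' \
  -H "Authorization: Bearer $JWT" \
  -H 'Content-Type: application/json' \
  -d '{
    "cluster": "cluster1",
    "from": 1700000000,
    "to": 1700003600,
    "queries": [{ "metric": "flops_any", "host": "host1" }]
  }'
```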
## Run tests
Some benchmarks concurrently access the `MemoryStore`, so enabling the
@@ -42,19 +58,14 @@ go test -bench=. -race -v ./...
## What are these selectors mentioned in the code?
-Tags in InfluxDB are used to build indexes over the stored data. InfluxDB-Tags
-have no relation to each other, they do not depend on each other and have no
-hierarchy. Different tags build up different indexes (I am no expert at all, but
-this is how i think they work).
-
-This project also works as a time-series database and uses the InfluxDB line
-protocol. Unlike InfluxDB, the data is indexed by one single strictly
-hierarchical tree structure. A selector is build out of the tags in the InfluxDB
-line protocol, and can be used to select a node (not in the sense of a compute
-node, can also be a socket, cpu, ...) in that tree. The implementation calls
-those nodes `level` to avoid confusion. It is impossible to access data only by
-knowing the _socket_ or _cpu_ tag, all higher up levels have to be specified as
-well.
+The cc-metric-store works as a time-series database and uses the InfluxDB line
+protocol as its input format. Unlike InfluxDB, the data is indexed by one
+single, strictly hierarchical tree structure. A selector is built from the
+tags in the InfluxDB line protocol, and can be used to select a node (not in
+the sense of a compute node; it can also be a socket, cpu, ...) in that tree.
+The implementation calls those nodes `level` to avoid confusion. It is
+impossible to access data only by knowing the _socket_ or _cpu_ tag; all
+higher-up levels have to be specified as well.
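
For illustration (an editor's example, not part of the diff): a measurement
tagged `cluster=cluster1,hostname=host1,type=cpu,type-id=0` ends up at the
level path `cluster1 -> host1 -> cpu0`, so the selector
`["cluster1", "host1", "cpu0"]` addresses exactly that level, while
`["cluster1", "host1"]` selects the whole host.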
This is what the hierarchy currently looks like:
@@ -68,6 +79,8 @@ This is what the hierarchy currently looks like:
- cpu3
- cpu4
- ...
+- gpu1
+- gpu2
- host2
- ...
- cluster2
@@ -81,42 +94,14 @@ Example selectors:
## Config file
-All durations are specified as strings that will be parsed [like
-this](https://pkg.go.dev/time#ParseDuration) (Allowed suffixes: `s`, `m`, `h`,
-...).
-
-- `metrics`: Map of metric-name to objects with the following properties
-  - `frequency`: Timestep/Interval/Resolution of this metric
-  - `aggregation`: Can be `"sum"`, `"avg"` or `null`
-    - `null` means aggregation across nodes is forbidden for this metric
-    - `"sum"` means that values from the child levels are summed up for the parent level
-    - `"avg"` means that values from the child levels are averaged for the parent level
-  - `scope`: Unused at the moment, should be something like `"node"`, `"socket"` or `"hwthread"`
-- `nats`:
-  - `address`: URL of the NATS.io server, example: "nats://localhost:4222"
-  - `username` and `password`: Optional, if provided use those for the connection
-  - `subscriptions`:
-    - `subscribe-to`: Where to expect the measurements to be published
-    - `cluster-tag`: Default value for the cluster tag
-- `http-api`:
-  - `address`: Address to bind to, for example `0.0.0.0:8080`
-  - `https-cert-file` and `https-key-file`: Optional, if provided enable HTTPS using those files as certificate/key
-  - `jwt-public-key`: Base64 encoded string, use this to verify requests to the HTTP API
-- `retention-on-memory`: Keep all values in memory for at least that amount of time
-- `checkpoints`:
-  - `interval`: Do checkpoints every X seconds/minutes/hours
-  - `directory`: Path to a directory
-  - `restore`: After a restart, load the last X seconds/minutes/hours of data back into memory
-- `archive`:
-  - `interval`: Move and compress all checkpoints not needed anymore every X seconds/minutes/hours
-  - `directory`: Path to a directory
+You can find the configuration options on the ClusterCockpit [website](https://clustercockpit.org/docs/reference/cc-metric-store/ccms-configuration/).
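
For orientation, a minimal configuration might look like the sketch below. It
only uses keys from the option list removed above; all values are
illustrative, and the linked website remains the authoritative reference:

```sh
# A hedged sketch of a minimal config.json; keys follow the removed
# option list above, all values are made up for illustration.
cat > config.json <<'EOF'
{
  "metrics": {
    "flops_any": { "frequency": "15s", "aggregation": "avg" }
  },
  "retention-on-memory": "48h",
  "checkpoints": {
    "interval": "12h",
    "directory": "./var/checkpoints",
    "restore": "48h"
  },
  "archive": {
    "interval": "168h",
    "directory": "./var/archive"
  },
  "http-api": {
    "address": "0.0.0.0:8080"
  }
}
EOF
```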
## Test the complete setup (excluding cc-backend itself)
There are two ways for sending data to the cc-metric-store, both of which are
supported by the
[cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector).
-This example uses Nats, the alternative is to use HTTP.
+This example uses NATS; the alternative is to use HTTP.
```sh
# Only needed once, downloads the docker image

TODO.md

@@ -1,51 +0,0 @@
-# Possible Tasks and Improvements
-
-Importance:
-
-- **I** Important
-- **N** Nice to have
-- **W** Won't do. Probably not necessary.
-
-- Benchmarking
-  - Benchmark and compare common time-series DBs with our data and our queries (N)
-- Web interface
-  - Provide a simple HTTP endpoint with a status and debug view (start with
-    Basic Authentication)
-- Configuration
-  - Consolidate configuration with cc-backend, remove redundant information
-  - Support to receive configuration via NATS channel
-- Memory management
-  - To overcome garbage collection overhead: Reimplement in Rust (N)
-  - Request memory directly batchwise via mmap (started in branch) (W)
-- Archive
-  - S3 backend for archive (I)
-  - Store information in each buffer if already archived (N)
-  - Do not create new checkpoint if all buffers already archived (N)
-- Checkpoints
-  - S3 backend for checkpoints (I)
-  - Combine checkpoints into larger files (I)
-  - Binary checkpoints (started in branch) (W)
-- API
-  - Redesign query interface (N)
-  - Provide an endpoint for node health based on received metric data (I)
-  - Introduce JWT authentication for REST and NATS (I)
-- Testing
-  - General tests (I)
-  - Test data generator for regression tests (I)
-  - Check for corner cases that should fail gracefully (N)
-  - Write more realistic `ToArchive`/`FromArchive` tests (N)
-- Aggregation
-  - Calculate averages buffer-wise as soon as full, average weighted by length of buffer (N)
-  - Only the head-buffer needs to be fully traversed (N)
-  - If aggregating over hwthreads/cores/sockets, cache those results and reuse
-    some of that for new queries aggregating only over the newer data (W)
-- Core functionality
-  - Implement a health checker component that provides information to the web
-    interface and REST API (I)
-  - Support units for metrics, including the ability to request unit conversions (I)
-- Compression
-  - Enable compression for HTTP API requests (N)
-  - Enable compression for checkpoints/archive (I)
-- Sampling
-  - Support data resampling to reduce data points (I)
-  - Use resampling algorithms that preserve min/max as far as possible (I)

api/swagger.json

@@ -24,7 +24,7 @@
"ApiKeyAuth": []
}
],
"description": "Write metrics to store",
"description": "This endpoint allows the users to print the content of",
"produces": [
"application/json"
],
@@ -81,6 +81,7 @@
"ApiKeyAuth": []
}
],
"description": "This endpoint allows the users to free the Buffers from the",
"produces": [
"application/json"
],
@@ -136,7 +137,7 @@
"ApiKeyAuth": []
}
],
"description": "Query metrics.",
"description": "This endpoint allows the users to retrieve data from the",
"consumes": [
"application/json"
],

api/swagger.yaml

@@ -106,7 +106,7 @@ info:
paths:
/debug/:
post:
-description: Write metrics to store
+description: This endpoint allows the users to print the content of
parameters:
- description: Selector
in: query
@@ -142,6 +142,7 @@ paths:
- debug
/free/:
post:
+description: This endpoint allows the users to free the Buffers from the
parameters:
- description: up to timestamp
in: query
@@ -178,7 +179,7 @@
get:
consumes:
- application/json
-description: Query metrics.
+description: This endpoint allows the users to retrieve data from the
parameters:
- description: API query payload object
in: body


@@ -127,7 +127,10 @@ func (data *ApiMetricData) PadDataWithNull(ms *memorystore.MemoryStore, from, to
// handleFree godoc
// @summary
// @tags free
-// @description
+// @description This endpoint allows the users to free the Buffers from the
+// metric store. It offers the users a way to remove them systematically and
+// also to prune the data under a node, if they do not want to remove the
+// whole node.
// @produce json
// @param to query string false "up to timestamp"
// @success 200 {string} string "ok"
@@ -182,9 +185,9 @@ func handleFree(rw http.ResponseWriter, r *http.Request) {
}
// handleWrite godoc
-// @summary Receive metrics in line-protocol
+// @summary Receive metrics in InfluxDB line-protocol
// @tags write
-// @description Receives metrics in the influx line-protocol using [this format](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md)
+// @description Write data to the in-memory store in the InfluxDB line-protocol using [this format](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md)
// @accept plain
// @produce json
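
For orientation, writing via HTTP might look like this (a hedged sketch: the
`/api/write/` route, port, and `cluster` query parameter are assumptions based
on the defaults above, and the measurement line is made up):

```sh
# Push one line-protocol measurement to the HTTP write endpoint.
curl -X POST 'http://localhost:8082/api/write/?cluster=cluster1' \
  -H "Authorization: Bearer $JWT" \
  --data-binary 'flops_any,hostname=host1,type=node value=42.0 1700000000'
```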
@@ -245,7 +248,9 @@ type ApiQuery struct {
// handleQuery godoc
// @summary Query metrics
// @tags query
-// @description Query metrics.
+// @description This endpoint allows the users to retrieve data from the
+// in-memory database. The CCMS will return data in JSON format for the
+// specified interval requested by the user.
// @accept json
// @produce json
// @param request body api.ApiQueryRequest true "API query payload object"
@@ -383,7 +388,8 @@ func handleQuery(rw http.ResponseWriter, r *http.Request) {
// handleDebug godoc
// @summary Debug endpoint
// @tags debug
-// @description Write metrics to store
+// @description This endpoint allows the users to print the content of
+// nodes/clusters/metrics to review the state of the data.
// @produce json
// @param selector query string false "Selector"
// @success 200 {string} string "Debug dump"
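
A matching debug request might then look like this (hedged: the selector
syntax in the query parameter is an assumption; the SwaggerUI is
authoritative):

```sh
# Dump the current state of the tree below the given selector.
curl -H "Authorization: Bearer $JWT" \
  'http://localhost:8082/api/debug/?selector=cluster1:host1'
```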


@@ -30,7 +30,7 @@ const docTemplate = `{
"ApiKeyAuth": []
}
],
"description": "Write metrics to store",
"description": "This endpoint allows the users to print the content of",
"produces": [
"application/json"
],
@@ -87,6 +87,7 @@ const docTemplate = `{
"ApiKeyAuth": []
}
],
"description": "This endpoint allows the users to free the Buffers from the",
"produces": [
"application/json"
],
@@ -142,7 +143,7 @@ const docTemplate = `{
"ApiKeyAuth": []
}
],
"description": "Query metrics.",
"description": "This endpoint allows the users to retrieve data from the",
"consumes": [
"application/json"
],