mirror of
https://github.com/ClusterCockpit/cc-metric-store.git
synced 2024-12-26 00:49:05 +01:00
Merge pull request #28 from ClusterCockpit/hotfix
Update README and Remove TODO
commit
171d298b4c
83
README.md
83
README.md
@ -6,25 +6,41 @@ The cc-metric-store provides a simple in-memory time series database for storing
|
|||||||
metrics of cluster nodes at preconfigured intervals. It is meant to be used as
|
metrics of cluster nodes at preconfigured intervals. It is meant to be used as
|
||||||
part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all
|
part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all
|
||||||
data is kept in-memory (but written to disk as compressed JSON for long term
|
data is kept in-memory (but written to disk as compressed JSON for long term
|
||||||
storage), accessing it is very fast. It also provides aggregations over time
|
storage), accessing it is very fast. It also provides topology aware
|
||||||
_and_ nodes/sockets/cpus.
|
aggregations over time _and_ nodes/sockets/cpus.
|
||||||
|
|
||||||
There are major limitations: Data only gets written to disk at periodic
|
There are major limitations: Data only gets written to disk at periodic
|
||||||
checkpoints, not as soon as it is received.
|
checkpoints, not as soon as it is received. Also only the fixed configured
|
||||||
|
duration is stored and available.
|
||||||
|
|
||||||
Go look at the `TODO.md` file and the [GitHub
|
Go look at the [GitHub
|
||||||
Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress
|
Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress
|
||||||
overview. Things work, but are not properly tested. The
|
overview. The [NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
|
||||||
[NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
|
|
||||||
format of the InfluxDB line
|
format of the InfluxDB line
|
||||||
protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).
|
protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).
|
||||||
|
|
||||||
|
## Building
|
||||||
|
|
||||||
|
`cc-metric-store` can be built using the provided `Makefile`.
|
||||||
|
It supports the following targets:
|
||||||
|
|
||||||
|
- `make`: Build the application, copy an example configuration file and generate
|
||||||
|
checkpoint folders if required.
|
||||||
|
- `make clean`: Clean the golang build cache and application binary
|
||||||
|
- `make distclean`: In addition to the clean target also remove the `./var`
|
||||||
|
folder
|
||||||
|
- `make swagger`: Regenerate the Swagger files from the source comments.
|
||||||
|
- `make test`: Run tests and basic checks.
|
||||||
|
|
||||||
## REST API Endpoints
|
## REST API Endpoints
|
||||||
|
|
||||||
The REST API is documented in [swagger.json](./api/swagger.json). You can
|
The REST API is documented in [swagger.json](./api/swagger.json). You can
|
||||||
explore and try the REST API using the integrated [SwaggerUI web
|
explore and try the REST API using the integrated [SwaggerUI web
|
||||||
interface](http://localhost:8082/swagger).
|
interface](http://localhost:8082/swagger).
|
||||||
|
|
||||||
|
For more information on the `cc-metric-store` REST API have a look at the
|
||||||
|
ClusterCockpit documentation [website](https://clustercockpit.org/docs/reference/cc-metric-store/ccms-rest-api/).
|
||||||
|
|
||||||
## Run tests
|
## Run tests
|
||||||
|
|
||||||
Some benchmarks concurrently access the `MemoryStore`, so enabling the
|
Some benchmarks concurrently access the `MemoryStore`, so enabling the
|
||||||
@ -42,19 +58,14 @@ go test -bench=. -race -v ./...
|
|||||||
|
|
||||||
## What are these selectors mentioned in the code?
|
## What are these selectors mentioned in the code?
|
||||||
|
|
||||||
Tags in InfluxDB are used to build indexes over the stored data. InfluxDB-Tags
|
The cc-metric-store works as a time-series database and uses the InfluxDB line
|
||||||
have no relation to each other, they do not depend on each other and have no
|
protocol as input format. Unlike InfluxDB, the data is indexed by one single
|
||||||
hierarchy. Different tags build up different indexes (I am no expert at all, but
|
strictly hierarchical tree structure. A selector is built out of the tags in the
|
||||||
this is how i think they work).
|
InfluxDB line protocol, and can be used to select a node (not in the sense of a
|
||||||
|
compute node, can also be a socket, cpu, ...) in that tree. The implementation
|
||||||
This project also works as a time-series database and uses the InfluxDB line
|
calls those nodes `level` to avoid confusion. It is impossible to access data
|
||||||
protocol. Unlike InfluxDB, the data is indexed by one single strictly
|
only by knowing the _socket_ or _cpu_ tag, all higher up levels have to be
|
||||||
hierarchical tree structure. A selector is built out of the tags in the InfluxDB
|
specified as well.
|
||||||
line protocol, and can be used to select a node (not in the sense of a compute
|
|
||||||
node, can also be a socket, cpu, ...) in that tree. The implementation calls
|
|
||||||
those nodes `level` to avoid confusion. It is impossible to access data only by
|
|
||||||
knowing the _socket_ or _cpu_ tag, all higher up levels have to be specified as
|
|
||||||
well.
|
|
||||||
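To illustrate why a selector has to name every level from the top down, here is a minimal sketch of a strictly hierarchical tree. The `Level` type, its methods, and the names `socket0` and `cpu1` are hypothetical and chosen for illustration only; they are not the project's actual implementation.

```go
package main

import "fmt"

// Level is a hypothetical stand-in for one node of the hierarchy
// (a cluster, host, socket, cpu, ...). It is not the project's real type.
type Level struct {
	Children map[string]*Level
}

// NewLevel returns an empty level without children.
func NewLevel() *Level {
	return &Level{Children: make(map[string]*Level)}
}

// Insert creates every level along the selector path and returns the last one.
func (l *Level) Insert(selector []string) *Level {
	cur := l
	for _, name := range selector {
		next, ok := cur.Children[name]
		if !ok {
			next = NewLevel()
			cur.Children[name] = next
		}
		cur = next
	}
	return cur
}

// Find walks the tree from the root, one selector element per level.
// It returns nil as soon as one element does not match a child.
func (l *Level) Find(selector []string) *Level {
	cur := l
	for _, name := range selector {
		next, ok := cur.Children[name]
		if !ok {
			return nil
		}
		cur = next
	}
	return cur
}

func main() {
	root := NewLevel()
	root.Insert([]string{"cluster1", "host1", "socket0", "cpu1"})

	// The full path resolves to a level.
	fmt.Println(root.Find([]string{"cluster1", "host1", "socket0", "cpu1"}) != nil)
	// The cpu name alone does not, because lookup always starts at the root.
	fmt.Println(root.Find([]string{"cpu1"}) != nil)
}
```

Because there is a single tree and no secondary index, a lookup can only succeed when it starts at the root and follows one child per selector element.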
|
|
||||||
This is what the hierarchy currently looks like:
|
This is what the hierarchy currently looks like:
|
||||||
|
|
||||||
@ -68,6 +79,8 @@ This is what the hierarchy currently looks like:
|
|||||||
- cpu3
|
- cpu3
|
||||||
- cpu4
|
- cpu4
|
||||||
- ...
|
- ...
|
||||||
|
- gpu1
|
||||||
|
- gpu2
|
||||||
- host2
|
- host2
|
||||||
- ...
|
- ...
|
||||||
- cluster2
|
- cluster2
|
||||||
@ -81,42 +94,14 @@ Example selectors:
|
|||||||
|
|
||||||
## Config file
|
## Config file
|
||||||
|
|
||||||
All durations are specified as string that will be parsed [like
|
You can find the configuration options on the ClusterCockpit [website](https://clustercockpit.org/docs/reference/cc-metric-store/ccms-configuration/).
|
||||||
this](https://pkg.go.dev/time#ParseDuration) (Allowed suffixes: `s`, `m`, `h`,
|
|
||||||
...).
|
|
||||||
|
|
||||||
- `metrics`: Map of metric-name to objects with the following properties
|
|
||||||
- `frequency`: Timestep/Interval/Resolution of this metric
|
|
||||||
- `aggregation`: Can be `"sum"`, `"avg"` or `null`
|
|
||||||
- `null` means aggregation across nodes is forbidden for this metric
|
|
||||||
- `"sum"` means that values from the child levels are summed up for the parent level
|
|
||||||
- `"avg"` means that values from the child levels are averaged for the parent level
|
|
||||||
- `scope`: Unused at the moment, should be something like `"node"`, `"socket"` or `"hwthread"`
|
|
||||||
- `nats`:
|
|
||||||
- `address`: Url of NATS.io server, example: "nats://localhost:4222"
|
|
||||||
- `username` and `password`: Optional, if provided use those for the connection
|
|
||||||
- `subscriptions`:
|
|
||||||
- `subscribe-to`: Where to expect the measurements to be published
|
|
||||||
- `cluster-tag`: Default value for the cluster tag
|
|
||||||
- `http-api`:
|
|
||||||
- `address`: Address to bind to, for example `0.0.0.0:8080`
|
|
||||||
- `https-cert-file` and `https-key-file`: Optional, if provided enable HTTPS using those files as certificate/key
|
|
||||||
- `jwt-public-key`: Base64 encoded string, use this to verify requests to the HTTP API
|
|
||||||
- `retention-on-memory`: Keep all values in memory for at least that amount of time
|
|
||||||
- `checkpoints`:
|
|
||||||
- `interval`: Do checkpoints every X seconds/minutes/hours
|
|
||||||
- `directory`: Path to a directory
|
|
||||||
- `restore`: After a restart, load the last X seconds/minutes/hours of data back into memory
|
|
||||||
- `archive`:
|
|
||||||
- `interval`: Move and compress all checkpoints not needed anymore every X seconds/minutes/hours
|
|
||||||
- `directory`: Path to a directory
|
|
||||||
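The `aggregation` options described in the list above can be sketched as follows. This is only an illustration of the stated semantics; the function `Aggregate` is hypothetical, and the empty string stands in for the JSON value `null`.

```go
package main

import (
	"errors"
	"fmt"
)

// Aggregate combines the values of the child levels into one value for the
// parent level. It is a hypothetical sketch, not the project's implementation.
// The strategy is "sum", "avg", or "" (standing in for JSON null).
func Aggregate(strategy string, children []float64) (float64, error) {
	if len(children) == 0 {
		return 0, errors.New("no child values to aggregate")
	}

	total := 0.0
	for _, v := range children {
		total += v
	}

	switch strategy {
	case "sum":
		return total, nil
	case "avg":
		return total / float64(len(children)), nil
	case "":
		return 0, errors.New("aggregation is forbidden for this metric")
	default:
		return 0, fmt.Errorf("unknown aggregation strategy %q", strategy)
	}
}

func main() {
	perSocket := []float64{1, 2, 3}

	for _, strategy := range []string{"sum", "avg", ""} {
		v, err := Aggregate(strategy, perSocket)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(strategy, "=", v)
	}
}
```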
|
|
||||||
## Test the complete setup (excluding cc-backend itself)
|
## Test the complete setup (excluding cc-backend itself)
|
||||||
|
|
||||||
There are two ways for sending data to the cc-metric-store, both of which are
|
There are two ways for sending data to the cc-metric-store, both of which are
|
||||||
supported by the
|
supported by the
|
||||||
[cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector).
|
[cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector).
|
||||||
This example uses Nats, the alternative is to use HTTP.
|
This example uses NATS, the alternative is to use HTTP.
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
# Only needed once, downloads the docker image
|
# Only needed once, downloads the docker image
|
||||||
|
51
TODO.md
51
TODO.md
@ -1,51 +0,0 @@
|
|||||||
# Possible Tasks and Improvements
|
|
||||||
|
|
||||||
Importance:
|
|
||||||
|
|
||||||
- **I** Important
|
|
||||||
- **N** Nice to have
|
|
||||||
- **W** Won't do. Probably not necessary.
|
|
||||||
|
|
||||||
- Benchmarking
|
|
||||||
- Benchmark and compare common timeseries DBs with our data and our queries (N)
|
|
||||||
- Web interface
|
|
||||||
- Provide simple http endpoint with a status and debug view (Start with Basic
|
|
||||||
Authentication)
|
|
||||||
- Configuration
|
|
||||||
- Consolidate configuration with cc-backend, remove redundant information
|
|
||||||
- Support to receive configuration via NATS channel
|
|
||||||
- Memory management
|
|
||||||
- To overcome garbage collection overhead: Reimplement in Rust (N)
|
|
||||||
- Request memory directly batchwise via mmap (started in branch) (W)
|
|
||||||
- Archive
|
|
||||||
- S3 backend for archive (I)
|
|
||||||
- Store information in each buffer if already archived (N)
|
|
||||||
- Do not create new checkpoint if all buffers already archived (N)
|
|
||||||
- Checkpoints
|
|
||||||
- S3 backend for checkpoints (I)
|
|
||||||
- Combine checkpoints into larger files (I)
|
|
||||||
- Binary checkpoints (started in branch) (W)
|
|
||||||
- API
|
|
||||||
- Redesign query interface (N)
|
|
||||||
- Provide an endpoint for node health based on received metric data (I)
|
|
||||||
- Introduce JWT authentication for REST and NATS (I)
|
|
||||||
- Testing
|
|
||||||
- General tests (I)
|
|
||||||
- Test data generator for regression tests (I)
|
|
||||||
- Check for corner cases that should fail gracefully (N)
|
|
||||||
- Write a more realistic `ToArchive`/`FromArchive` Tests (N)
|
|
||||||
- Aggregation
|
|
||||||
- Calculate averages buffer-wise as soon as full, average weighted by length of buffer (N)
|
|
||||||
- Only the head-buffer needs to be fully traversed (N)
|
|
||||||
- If aggregating over hwthreads/cores/sockets cache those results and reuse
|
|
||||||
some of that for new queries aggregating only over the newer data (W)
|
|
||||||
- Core functionality
|
|
||||||
- Implement a health checker component that provides information to the web
|
|
||||||
interface and REST API (I)
|
|
||||||
- Support units for metrics including to request unit conversions (I)
|
|
||||||
- Compression
|
|
||||||
- Enable compression for http API requests (N)
|
|
||||||
- Enable compression for checkpoints/archive (I)
|
|
||||||
- Sampling
|
|
||||||
- Support data re sampling to reduce data points (I)
|
|
||||||
- Use re sampling algorithms that preserve min/max as far as possible (I)
|
|
@ -24,7 +24,7 @@
|
|||||||
"ApiKeyAuth": []
|
"ApiKeyAuth": []
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"description": "Write metrics to store",
|
"description": "This endpoint allows the users to print the content of",
|
||||||
"produces": [
|
"produces": [
|
||||||
"application/json"
|
"application/json"
|
||||||
],
|
],
|
||||||
@ -81,6 +81,7 @@
|
|||||||
"ApiKeyAuth": []
|
"ApiKeyAuth": []
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
|
"description": "This endpoint allows the users to free the Buffers from the",
|
||||||
"produces": [
|
"produces": [
|
||||||
"application/json"
|
"application/json"
|
||||||
],
|
],
|
||||||
@ -136,7 +137,7 @@
|
|||||||
"ApiKeyAuth": []
|
"ApiKeyAuth": []
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"description": "Query metrics.",
|
"description": "This endpoint allows the users to retrieve data from the",
|
||||||
"consumes": [
|
"consumes": [
|
||||||
"application/json"
|
"application/json"
|
||||||
],
|
],
|
||||||
|
@ -106,7 +106,7 @@ info:
|
|||||||
paths:
|
paths:
|
||||||
/debug/:
|
/debug/:
|
||||||
post:
|
post:
|
||||||
description: Write metrics to store
|
description: This endpoint allows the users to print the content of
|
||||||
parameters:
|
parameters:
|
||||||
- description: Selector
|
- description: Selector
|
||||||
in: query
|
in: query
|
||||||
@ -142,6 +142,7 @@ paths:
|
|||||||
- debug
|
- debug
|
||||||
/free/:
|
/free/:
|
||||||
post:
|
post:
|
||||||
|
description: This endpoint allows the users to free the Buffers from the
|
||||||
parameters:
|
parameters:
|
||||||
- description: up to timestamp
|
- description: up to timestamp
|
||||||
in: query
|
in: query
|
||||||
@ -178,7 +179,7 @@ paths:
|
|||||||
get:
|
get:
|
||||||
consumes:
|
consumes:
|
||||||
- application/json
|
- application/json
|
||||||
description: Query metrics.
|
description: This endpoint allows the users to retrieve data from the
|
||||||
parameters:
|
parameters:
|
||||||
- description: API query payload object
|
- description: API query payload object
|
||||||
in: body
|
in: body
|
||||||
|
@ -127,7 +127,10 @@ func (data *ApiMetricData) PadDataWithNull(ms *memorystore.MemoryStore, from, to
|
|||||||
// handleFree godoc
|
// handleFree godoc
|
||||||
// @summary
|
// @summary
|
||||||
// @tags free
|
// @tags free
|
||||||
// @description
|
// @description This endpoint allows the users to free the Buffers from the
|
||||||
|
// metric store. This endpoint lets users remove them systematically
|
||||||
|
// and also allows them to prune the data under a node, if they do not want to
|
||||||
|
// remove the whole node.
|
||||||
// @produce json
|
// @produce json
|
||||||
// @param to query string false "up to timestamp"
|
// @param to query string false "up to timestamp"
|
||||||
// @success 200 {string} string "ok"
|
// @success 200 {string} string "ok"
|
||||||
@ -182,9 +185,9 @@ func handleFree(rw http.ResponseWriter, r *http.Request) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// handleWrite godoc
|
// handleWrite godoc
|
||||||
// @summary Receive metrics in line-protocol
|
// @summary Receive metrics in InfluxDB line-protocol
|
||||||
// @tags write
|
// @tags write
|
||||||
// @description Receives metrics in the influx line-protocol using [this format](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md)
|
// @description Write data to the in-memory store in the InfluxDB line-protocol using [this format](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md)
|
||||||
|
|
||||||
// @accept plain
|
// @accept plain
|
||||||
// @produce json
|
// @produce json
|
||||||
@ -245,7 +248,9 @@ type ApiQuery struct {
|
|||||||
// handleQuery godoc
|
// handleQuery godoc
|
||||||
// @summary Query metrics
|
// @summary Query metrics
|
||||||
// @tags query
|
// @tags query
|
||||||
// @description Query metrics.
|
// @description This endpoint allows the users to retrieve data from the
|
||||||
|
// in-memory database. The CCMS will return data in JSON format for the
|
||||||
|
// specified interval requested by the user
|
||||||
// @accept json
|
// @accept json
|
||||||
// @produce json
|
// @produce json
|
||||||
// @param request body api.ApiQueryRequest true "API query payload object"
|
// @param request body api.ApiQueryRequest true "API query payload object"
|
||||||
@ -383,7 +388,8 @@ func handleQuery(rw http.ResponseWriter, r *http.Request) {
|
|||||||
// handleDebug godoc
|
// handleDebug godoc
|
||||||
// @summary Debug endpoint
|
// @summary Debug endpoint
|
||||||
// @tags debug
|
// @tags debug
|
||||||
// @description Write metrics to store
|
// @description This endpoint allows the users to print the content of
|
||||||
|
// nodes/clusters/metrics to review the state of the data.
|
||||||
// @produce json
|
// @produce json
|
||||||
// @param selector query string false "Selector"
|
// @param selector query string false "Selector"
|
||||||
// @success 200 {string} string "Debug dump"
|
// @success 200 {string} string "Debug dump"
|
||||||
|
@ -30,7 +30,7 @@ const docTemplate = `{
|
|||||||
"ApiKeyAuth": []
|
"ApiKeyAuth": []
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"description": "Write metrics to store",
|
"description": "This endpoint allows the users to print the content of",
|
||||||
"produces": [
|
"produces": [
|
||||||
"application/json"
|
"application/json"
|
||||||
],
|
],
|
||||||
@ -87,6 +87,7 @@ const docTemplate = `{
|
|||||||
"ApiKeyAuth": []
|
"ApiKeyAuth": []
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
|
"description": "This endpoint allows the users to free the Buffers from the",
|
||||||
"produces": [
|
"produces": [
|
||||||
"application/json"
|
"application/json"
|
||||||
],
|
],
|
||||||
@ -142,7 +143,7 @@ const docTemplate = `{
|
|||||||
"ApiKeyAuth": []
|
"ApiKeyAuth": []
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"description": "Query metrics.",
|
"description": "This endpoint allows the users to retrieve data from the",
|
||||||
"consumes": [
|
"consumes": [
|
||||||
"application/json"
|
"application/json"
|
||||||
],
|
],
|
||||||
|