Merge pull request #28 from ClusterCockpit/hotfix

Update README and Remove TODO
Jan Eitzinger 2024-10-23 17:06:07 +02:00 committed by GitHub
commit 171d298b4c
6 changed files with 54 additions and 111 deletions

README.md

@@ -6,25 +6,41 @@ The cc-metric-store provides a simple in-memory time series database for storing
metrics of cluster nodes at preconfigured intervals. It is meant to be used as
part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all
data is kept in-memory (but written to disk as compressed JSON for long term
-storage), accessing it is very fast. It also provides aggregations over time
-_and_ nodes/sockets/cpus.
+storage), accessing it is very fast. It also provides topology-aware
+aggregations over time _and_ nodes/sockets/cpus.
There are major limitations: Data only gets written to disk at periodic
-checkpoints, not as soon as it is received.
+checkpoints, not as soon as it is received. Also, only data within the fixed,
+configured retention duration is stored and available.
-Go look at the `TODO.md` file and the [GitHub
+Go look at the [GitHub
Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress
-overview. Things work, but are not properly tested. The
-[NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
+overview. The [NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
format of the InfluxDB line
protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).
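
To give a feel for that format, a single measurement could be published like
this (a hedged sketch: the subject, metric name, and tag values are
illustrative, and the `nats` CLI is assumed to be available):

```sh
# Publish one measurement in the alternative InfluxDB line-protocol format.
# The subject must match the `subscribe-to` value in the NATS subscription config.
nats pub updates \
  'flops_any,cluster=cluster1,hostname=host1,type=node value=42.0 1700000000'
```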
+## Building
+
+`cc-metric-store` can be built using the provided `Makefile`.
+It supports the following targets (a typical invocation is sketched after the list):
+
+- `make`: Build the application, copy an example configuration file, and
+  generate checkpoint folders if required.
+- `make clean`: Clean the Go build cache and the application binary.
+- `make distclean`: In addition to the `clean` target, also remove the `./var`
+  folder.
+- `make swagger`: Regenerate the Swagger files from the source comments.
+- `make test`: Run tests and basic checks.
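
A typical build-and-run sequence might look like this (a sketch; it assumes
the defaults produced by `make`, including a `config.json` in the working
directory):

```sh
# Build the binary, example configuration, and checkpoint folders
make
# Start the server (it listens on the address configured under http-api)
./cc-metric-store
```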
+## REST API Endpoints
+
+The REST API is documented in [swagger.json](./api/swagger.json). You can
+explore and try the REST API using the integrated [SwaggerUI web
+interface](http://localhost:8082/swagger).
+
+For more information on the `cc-metric-store` REST API, have a look at the
+ClusterCockpit documentation [website](https://clustercockpit.org/docs/reference/cc-metric-store/ccms-rest-api/).
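
As a quick illustration of the API (a hedged sketch: `$JWT`, the cluster,
host, and metric names are placeholders, and the payload fields should be
double-checked against `swagger.json`):

```sh
# Query one metric for one host over a Unix-timestamp interval.
curl -X GET 'http://localhost:8082/api/query/' \
  -H "Authorization: Bearer $JWT" \
  -H 'Content-Type: application/json' \
  -d '{
    "cluster": "cluster1",
    "from": 1700000000,
    "to": 1700003600,
    "queries": [{ "metric": "flops_any", "host": "host1" }]
  }'
```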
## Run tests
Some benchmarks concurrently access the `MemoryStore`, so enabling the
@@ -42,19 +58,14 @@ go test -bench=. -race -v ./...
## What are these selectors mentioned in the code?
-Tags in InfluxDB are used to build indexes over the stored data. InfluxDB-Tags
-have no relation to each other, they do not depend on each other and have no
-hierarchy. Different tags build up different indexes (I am no expert at all, but
-this is how i think they work).
-
-This project also works as a time-series database and uses the InfluxDB line
-protocol. Unlike InfluxDB, the data is indexed by one single strictly
-hierarchical tree structure. A selector is build out of the tags in the InfluxDB
-line protocol, and can be used to select a node (not in the sense of a compute
-node, can also be a socket, cpu, ...) in that tree. The implementation calls
-those nodes `level` to avoid confusion. It is impossible to access data only by
-knowing the _socket_ or _cpu_ tag, all higher up levels have to be specified as
-well.
+The cc-metric-store works as a time-series database and uses the InfluxDB line
+protocol as its input format. Unlike InfluxDB, the data is indexed by one
+single, strictly hierarchical tree structure. A selector is built from the
+tags in the InfluxDB line protocol, and can be used to select a node (not in
+the sense of a compute node; it can also be a socket, cpu, ...) in that tree.
+The implementation calls those nodes `level` to avoid confusion. It is
+impossible to access data only by knowing the _socket_ or _cpu_ tag; all
+higher-up levels have to be specified as well.
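
For illustration (an editor's example, not part of the diff): a measurement
tagged `cluster=cluster1,hostname=host1,type=cpu,type-id=0` ends up at the
level path `cluster1 -> host1 -> cpu0`, so the selector
`["cluster1", "host1", "cpu0"]` addresses exactly that level, while
`["cluster1", "host1"]` selects the whole host.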
This is what the hierarchy currently looks like:
@@ -68,6 +79,8 @@ This is what the hierarchy currently looks like:
- cpu3
- cpu4
- ...
+- gpu1
+- gpu2
- host2
- ...
- cluster2
@@ -81,42 +94,14 @@ Example selectors:
## Config file
-All durations are specified as strings that will be parsed [like
-this](https://pkg.go.dev/time#ParseDuration) (Allowed suffixes: `s`, `m`, `h`,
-...).
-
-- `metrics`: Map of metric-name to objects with the following properties
-  - `frequency`: Timestep/Interval/Resolution of this metric
-  - `aggregation`: Can be `"sum"`, `"avg"` or `null`
-    - `null` means aggregation across nodes is forbidden for this metric
-    - `"sum"` means that values from the child levels are summed up for the parent level
-    - `"avg"` means that values from the child levels are averaged for the parent level
-  - `scope`: Unused at the moment, should be something like `"node"`, `"socket"` or `"hwthread"`
-- `nats`:
-  - `address`: URL of the NATS.io server, example: "nats://localhost:4222"
-  - `username` and `password`: Optional, if provided use those for the connection
-  - `subscriptions`:
-    - `subscribe-to`: Where to expect the measurements to be published
-    - `cluster-tag`: Default value for the cluster tag
-- `http-api`:
-  - `address`: Address to bind to, for example `0.0.0.0:8080`
-  - `https-cert-file` and `https-key-file`: Optional, if provided enable HTTPS using those files as certificate/key
-  - `jwt-public-key`: Base64 encoded string, use this to verify requests to the HTTP API
-- `retention-on-memory`: Keep all values in memory for at least that amount of time
-- `checkpoints`:
-  - `interval`: Do checkpoints every X seconds/minutes/hours
-  - `directory`: Path to a directory
-  - `restore`: After a restart, load the last X seconds/minutes/hours of data back into memory
-- `archive`:
-  - `interval`: Move and compress all checkpoints not needed anymore every X seconds/minutes/hours
-  - `directory`: Path to a directory
+You can find the configuration options on the ClusterCockpit [website](https://clustercockpit.org/docs/reference/cc-metric-store/ccms-configuration/).
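
For orientation, a minimal configuration might look like the sketch below. It
only uses keys from the option list removed above; all values are
illustrative, and the linked website remains the authoritative reference:

```sh
# A hedged sketch of a minimal config.json; keys follow the removed
# option list above, all values are made up for illustration.
cat > config.json <<'EOF'
{
  "metrics": {
    "flops_any": { "frequency": "15s", "aggregation": "avg" }
  },
  "retention-on-memory": "48h",
  "checkpoints": {
    "interval": "12h",
    "directory": "./var/checkpoints",
    "restore": "48h"
  },
  "archive": {
    "interval": "168h",
    "directory": "./var/archive"
  },
  "http-api": {
    "address": "0.0.0.0:8080"
  }
}
EOF
```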
## Test the complete setup (excluding cc-backend itself)
There are two ways for sending data to the cc-metric-store, both of which are
supported by the
[cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector).
-This example uses Nats, the alternative is to use HTTP.
+This example uses NATS; the alternative is to use HTTP.
```sh
# Only needed once, downloads the docker image

TODO.md

@@ -1,51 +0,0 @@
-# Possible Tasks and Improvements
-
-Importance:
-
-- **I** Important
-- **N** Nice to have
-- **W** Won't do. Probably not necessary.
-
-- Benchmarking
-  - Benchmark and compare common time-series DBs with our data and our queries (N)
-- Web interface
-  - Provide a simple HTTP endpoint with a status and debug view (start with
-    Basic Authentication)
-- Configuration
-  - Consolidate configuration with cc-backend, remove redundant information
-  - Support to receive configuration via NATS channel
-- Memory management
-  - To overcome garbage collection overhead: Reimplement in Rust (N)
-  - Request memory directly batchwise via mmap (started in branch) (W)
-- Archive
-  - S3 backend for archive (I)
-  - Store information in each buffer if already archived (N)
-  - Do not create new checkpoint if all buffers already archived (N)
-- Checkpoints
-  - S3 backend for checkpoints (I)
-  - Combine checkpoints into larger files (I)
-  - Binary checkpoints (started in branch) (W)
-- API
-  - Redesign query interface (N)
-  - Provide an endpoint for node health based on received metric data (I)
-  - Introduce JWT authentication for REST and NATS (I)
-- Testing
-  - General tests (I)
-  - Test data generator for regression tests (I)
-  - Check for corner cases that should fail gracefully (N)
-  - Write more realistic `ToArchive`/`FromArchive` tests (N)
-- Aggregation
-  - Calculate averages buffer-wise as soon as full, average weighted by length of buffer (N)
-  - Only the head-buffer needs to be fully traversed (N)
-  - If aggregating over hwthreads/cores/sockets, cache those results and reuse
-    some of that for new queries aggregating only over the newer data (W)
-- Core functionality
-  - Implement a health checker component that provides information to the web
-    interface and REST API (I)
-  - Support units for metrics, including the ability to request unit conversions (I)
-- Compression
-  - Enable compression for HTTP API requests (N)
-  - Enable compression for checkpoints/archive (I)
-- Sampling
-  - Support data resampling to reduce data points (I)
-  - Use resampling algorithms that preserve min/max as far as possible (I)

api/swagger.json

@@ -24,7 +24,7 @@
"ApiKeyAuth": []
}
],
"description": "Write metrics to store",
"description": "This endpoint allows the users to print the content of",
"produces": [
"application/json"
],
@@ -81,6 +81,7 @@
"ApiKeyAuth": []
}
],
"description": "This endpoint allows the users to free the Buffers from the",
"produces": [
"application/json"
],
@@ -136,7 +137,7 @@
"ApiKeyAuth": []
}
],
"description": "Query metrics.",
"description": "This endpoint allows the users to retrieve data from the",
"consumes": [
"application/json"
],

api/swagger.yaml

@@ -106,7 +106,7 @@ info:
paths:
/debug/:
post:
-description: Write metrics to store
+description: This endpoint allows the users to print the content of
parameters:
- description: Selector
in: query
@@ -142,6 +142,7 @@ paths:
- debug
/free/:
post:
+description: This endpoint allows the users to free the Buffers from the
parameters:
- description: up to timestamp
in: query
@@ -178,7 +179,7 @@
get:
consumes:
- application/json
-description: Query metrics.
+description: This endpoint allows the users to retrieve data from the
parameters:
- description: API query payload object
in: body


@@ -127,7 +127,10 @@ func (data *ApiMetricData) PadDataWithNull(ms *memorystore.MemoryStore, from, to
// handleFree godoc
// @summary
// @tags free
-// @description
+// @description This endpoint allows the users to free the Buffers from the
+// metric store. It offers the users a way to remove them systematically and
+// also to prune the data under a node, if they do not want to remove the
+// whole node.
// @produce json
// @param to query string false "up to timestamp"
// @success 200 {string} string "ok"
@@ -182,9 +185,9 @@ func handleFree(rw http.ResponseWriter, r *http.Request) {
}
// handleWrite godoc
-// @summary Receive metrics in line-protocol
+// @summary Receive metrics in InfluxDB line-protocol
// @tags write
-// @description Receives metrics in the influx line-protocol using [this format](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md)
+// @description Write data to the in-memory store in the InfluxDB line-protocol using [this format](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md)
// @accept plain
// @produce json
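
For orientation, writing via HTTP might look like this (a hedged sketch: the
`/api/write/` route, port, and `cluster` query parameter are assumptions based
on the defaults above, and the measurement line is made up):

```sh
# Push one line-protocol measurement to the HTTP write endpoint.
curl -X POST 'http://localhost:8082/api/write/?cluster=cluster1' \
  -H "Authorization: Bearer $JWT" \
  --data-binary 'flops_any,hostname=host1,type=node value=42.0 1700000000'
```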
@@ -245,7 +248,9 @@ type ApiQuery struct {
// handleQuery godoc
// @summary Query metrics
// @tags query
-// @description Query metrics.
+// @description This endpoint allows the users to retrieve data from the
+// in-memory database. The CCMS will return data in JSON format for the
+// specified interval requested by the user.
// @accept json
// @produce json
// @param request body api.ApiQueryRequest true "API query payload object"
@@ -383,7 +388,8 @@ func handleQuery(rw http.ResponseWriter, r *http.Request) {
// handleDebug godoc
// @summary Debug endpoint
// @tags debug
-// @description Write metrics to store
+// @description This endpoint allows the users to print the content of
+// nodes/clusters/metrics to review the state of the data.
// @produce json
// @param selector query string false "Selector"
// @success 200 {string} string "Debug dump"
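
A matching debug request might then look like this (hedged: the selector
syntax in the query parameter is an assumption; the SwaggerUI is
authoritative):

```sh
# Dump the current state of the tree below the given selector.
curl -H "Authorization: Bearer $JWT" \
  'http://localhost:8082/api/debug/?selector=cluster1:host1'
```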


@@ -30,7 +30,7 @@ const docTemplate = `{
"ApiKeyAuth": []
}
],
"description": "Write metrics to store",
"description": "This endpoint allows the users to print the content of",
"produces": [
"application/json"
],
@@ -87,6 +87,7 @@ const docTemplate = `{
"ApiKeyAuth": []
}
],
"description": "This endpoint allows the users to free the Buffers from the",
"produces": [
"application/json"
],
@@ -142,7 +143,7 @@ const docTemplate = `{
"ApiKeyAuth": []
}
],
"description": "Query metrics.",
"description": "This endpoint allows the users to retrieve data from the",
"consumes": [
"application/json"
],