Update README and Remove TODO

2026-06-23 03:50:38 +02:00 · 2024-10-23 13:19:18 +02:00
parent 699bde372d
commit 53cb497e0c
2 changed files with 17 additions and 71 deletions
@@ -6,16 +6,16 @@ The cc-metric-store provides a simple in-memory time series database for storing
 metrics of cluster nodes at preconfigured intervals. It is meant to be used as
 part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all
 data is kept in-memory (but written to disk as compressed JSON for long term
-storage), accessing it is very fast. It also provides aggregations over time
+storage), accessing it is very fast. It also provides topology aware
-_and_ nodes/sockets/cpus.
+aggregations over time _and_ nodes/sockets/cpus.
 There are major limitations: Data only gets written to disk at periodic
-checkpoints, not as soon as it is received.
+checkpoints, not as soon as it is received. Also only the fixed configured
+duration is stored and available.
-Go look at the `TODO.md` file and the [GitHub
+Go look at the [GitHub
 Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress
-overview. Things work, but are not properly tested. The
+overview. The [NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
-[NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
 format of the InfluxDB line
 protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).
@@ -42,19 +42,14 @@ go test -bench=. -race -v ./...
 ## What are these selectors mentioned in the code?
-Tags in InfluxDB are used to build indexes over the stored data. InfluxDB-Tags
+The cc-metric-store works as a time-series database and uses the InfluxDB line
-have no relation to each other, they do not depend on each other and have no
+protocol as input format. Unlike InfluxDB, the data is indexed by one single
-hierarchy. Different tags build up different indexes (I am no expert at all, but
+strictly hierarchical tree structure. A selector is build out of the tags in the
-this is how i think they work).
+InfluxDB line protocol, and can be used to select a node (not in the sense of a
-
+compute node, can also be a socket, cpu, ...) in that tree. The implementation
-This project also works as a time-series database and uses the InfluxDB line
+calls those nodes `level` to avoid confusion. It is impossible to access data
-protocol. Unlike InfluxDB, the data is indexed by one single strictly
+only by knowing the _socket_ or _cpu_ tag, all higher up levels have to be
-hierarchical tree structure. A selector is build out of the tags in the InfluxDB
+specified as well.
-line protocol, and can be used to select a node (not in the sense of a compute
-node, can also be a socket, cpu, ...) in that tree. The implementation calls
-those nodes `level` to avoid confusion. It is impossible to access data only by
-knowing the _socket_ or _cpu_ tag, all higher up levels have to be specified as
-well.
 This is what the hierarchy currently looks like:
@@ -68,6 +63,8 @@ This is what the hierarchy currently looks like:
    - cpu3
    - cpu4
    - ...
+   - gpu1
+   - gpu2
  - host2
  - ...
 - cluster2
@@ -116,7 +113,7 @@ this](https://pkg.go.dev/time#ParseDuration) (Allowed suffixes: `s`, `m`, `h`,
 There are two ways for sending data to the cc-metric-store, both of which are
 supported by the
 [cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector).
-This example uses Nats, the alternative is to use HTTP.
+This example uses NATS, the alternative is to use HTTP.
 ```sh
 # Only needed once, downloads the docker image
@@ -1,51 +0,0 @@
-# Possible Tasks and Improvements
-Importance:
-- **I** Important
-- **N** Nice to have
-- **W** Won't do. Probably not necessary.
-- Benchmarking
- - Benchmark and compare common timeseries DBs with our data and our queries (N)
-- Web interface
- - Provide simple http endpoint with a status and debug view (Start with Basic
-   Authentication)
-- Configuration
- - Consolidate configuration with cc-backend, remove redundant information
- - Support to receive configuration via NATS channel
-- Memory management
- - To overcome garbage collection overhead: Reimplement in Rust (N)
- - Request memory directly batchwise via mmap (started in branch) (W)
-- Archive
- - S3 backend for archive (I)
- - Store information in each buffer if already archived (N)
- - Do not create new checkpoint if all buffers already archived (N)
-- Checkpoints
- - S3 backend for checkpoints (I)
- - Combine checkpoints into larger files (I)
- - Binary checkpoints (started in branch) (W)
-- API
- - Redesign query interface (N)
- - Provide an endpoint for node health based on received metric data (I)
- - Introduce JWT authentication for REST and NATS (I)
-- Testing
- - General tests (I)
- - Test data generator for regression tests (I)
- - Check for corner cases that should fail gracefully (N)
- - Write a more realistic `ToArchive`/`FromArchive` Tests (N)
-- Aggregation
- - Calculate averages buffer-wise as soon as full, average weighted by length of buffer (N)
- - Only the head-buffer needs to be fully traversed (N)
- - If aggregating over hwthreads/cores/sockets cache those results and reuse
-   some of that for new queries aggregating only over the newer data (W)
-- Core functionality
- - Implement a health checker component that provides information to the web
-   interface and REST API (I)
- - Support units for metrics including to request unit conversions (I)
-- Compression
- - Enable compression for http API requests (N)
- - Enable compression for checkpoints/archive (I)
-- Sampling
- - Support data re sampling to reduce data points (I)
- - Use re sampling algorithms that preserve min/max as far as possible (I)