From 53cb497e0c96ae42c4a0b3b2b412230226277e48 Mon Sep 17 00:00:00 2001
From: Jan Eitzinger
Date: Wed, 23 Oct 2024 13:19:18 +0200
Subject: [PATCH] Update README and Remove TODO

---
 README.md | 37 +++++++++++++++++--------------------
 TODO.md   | 51 ---------------------------------------------------
 2 files changed, 17 insertions(+), 71 deletions(-)
 delete mode 100644 TODO.md

diff --git a/README.md b/README.md
index a8a9487..af8ce30 100644
--- a/README.md
+++ b/README.md
@@ -6,16 +6,16 @@ The cc-metric-store provides a simple in-memory time series database for storing
 metrics of cluster nodes at preconfigured intervals. It is meant to be used as
 part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all
 data is kept in-memory (but written to disk as compressed JSON for long term
-storage), accessing it is very fast. It also provides aggregations over time
-_and_ nodes/sockets/cpus.
+storage), accessing it is very fast. It also provides topology aware
+aggregations over time _and_ nodes/sockets/cpus.
 
 There are major limitations: Data only gets written to disk at periodic
-checkpoints, not as soon as it is received.
+checkpoints, not as soon as it is received. Also only the fixed configured
+duration is stored and available.
 
-Go look at the `TODO.md` file and the [GitHub
+Go look at the [GitHub
 Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress
-overview. Things work, but are not properly tested. The
-[NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
+overview. The [NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
 format of the InfluxDB line
 protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).
 
@@ -42,19 +42,14 @@ go test -bench=. -race -v ./...
 
 ## What are these selectors mentioned in the code?
 
-Tags in InfluxDB are used to build indexes over the stored data. InfluxDB-Tags
-have no relation to each other, they do not depend on each other and have no
-hierarchy. Different tags build up different indexes (I am no expert at all, but
-this is how i think they work).
-
-This project also works as a time-series database and uses the InfluxDB line
-protocol. Unlike InfluxDB, the data is indexed by one single strictly
-hierarchical tree structure. A selector is build out of the tags in the InfluxDB
-line protocol, and can be used to select a node (not in the sense of a compute
-node, can also be a socket, cpu, ...) in that tree. The implementation calls
-those nodes `level` to avoid confusion. It is impossible to access data only by
-knowing the _socket_ or _cpu_ tag, all higher up levels have to be specified as
-well.
+The cc-metric-store works as a time-series database and uses the InfluxDB line
+protocol as input format. Unlike InfluxDB, the data is indexed by one single
+strictly hierarchical tree structure. A selector is build out of the tags in the
+InfluxDB line protocol, and can be used to select a node (not in the sense of a
+compute node, can also be a socket, cpu, ...) in that tree. The implementation
+calls those nodes `level` to avoid confusion. It is impossible to access data
+only by knowing the _socket_ or _cpu_ tag, all higher up levels have to be
+specified as well.
 
 This is what the hierarchy currently looks like:
 
@@ -68,6 +63,8 @@ This is what the hierarchy currently looks like:
     - cpu3
     - cpu4
     - ...
+    - gpu1
+    - gpu2
   - host2
   - ...
 - cluster2
@@ -116,7 +113,7 @@ this](https://pkg.go.dev/time#ParseDuration) (Allowed suffixes: `s`, `m`, `h`,
 There are two ways for sending data to the cc-metric-store, both of which are
 supported by the
 [cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector).
-This example uses Nats, the alternative is to use HTTP.
+This example uses NATS, the alternative is to use HTTP.
 
 ```sh
 # Only needed once, downloads the docker image
diff --git a/TODO.md b/TODO.md
deleted file mode 100644
index 8c1a36d..0000000
--- a/TODO.md
+++ /dev/null
@@ -1,51 +0,0 @@
-# Possible Tasks and Improvements
-
-Importance:
-
-- **I** Important
-- **N** Nice to have
-- **W** Won't do. Probably not necessary.
-
-- Benchmarking
-  - Benchmark and compare common timeseries DBs with our data and our queries (N)
-- Web interface
-  - Provide simple http endpoint with a status and debug view (Start with Basic
-    Authentication)
-- Configuration
-  - Consolidate configuration with cc-backend, remove redundant information
-  - Support to receive configuration via NATS channel
-- Memory management
-  - To overcome garbage collection overhead: Reimplement in Rust (N)
-  - Request memory directly batchwise via mmap (started in branch) (W)
-- Archive
-  - S3 backend for archive (I)
-  - Store information in each buffer if already archived (N)
-  - Do not create new checkpoint if all buffers already archived (N)
-- Checkpoints
-  - S3 backend for checkpoints (I)
-  - Combine checkpoints into larger files (I)
-  - Binary checkpoints (started in branch) (W)
-- API
-  - Redesign query interface (N)
-  - Provide an endpoint for node health based on received metric data (I)
-  - Introduce JWT authentication for REST and NATS (I)
-- Testing
-  - General tests (I)
-  - Test data generator for regression tests (I)
-  - Check for corner cases that should fail gracefully (N)
-  - Write a more realistic `ToArchive`/`FromArchive` Tests (N)
-- Aggregation
-  - Calculate averages buffer-wise as soon as full, average weighted by length of buffer (N)
-  - Only the head-buffer needs to be fully traversed (N)
-  - If aggregating over hwthreads/cores/sockets cache those results and reuse
-    some of that for new queries aggregating only over the newer data (W)
-- Core functionality
-  - Implement a health checker component that provides information to the web
-    interface and REST API (I)
-  - Support units for metrics including to request unit conversions (I)
-- Compression
-  - Enable compression for http API requests (N)
-  - Enable compression for checkpoints/archive (I)
-- Sampling
-  - Support data re sampling to reduce data points (I)
-  - Use re sampling algorithms that preserve min/max as far as possible (I)