Update README and Remove TODO

Jan Eitzinger 2024-10-23 13:19:18 +02:00
parent 699bde372d
commit 53cb497e0c
Signed by: moebiusband
GPG Key ID: 2574BA29B90D6DD5
2 changed files with 17 additions and 71 deletions

@ -6,16 +6,16 @@ The cc-metric-store provides a simple in-memory time series database for storing
metrics of cluster nodes at preconfigured intervals. It is meant to be used as metrics of cluster nodes at preconfigured intervals. It is meant to be used as
part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all
data is kept in-memory (but written to disk as compressed JSON for long term data is kept in-memory (but written to disk as compressed JSON for long term
storage), accessing it is very fast. It also provides aggregations over time storage), accessing it is very fast. It also provides topology aware
_and_ nodes/sockets/cpus. aggregations over time _and_ nodes/sockets/cpus.
There are major limitations: Data only gets written to disk at periodic There are major limitations: Data only gets written to disk at periodic
checkpoints, not as soon as it is received. checkpoints, not as soon as it is received. Also only the fixed configured
duration is stored and available.
Go look at the `TODO.md` file and the [GitHub Go look at the [GitHub
Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress
overview. Things work, but are not properly tested. The overview. The [NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
[NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
format of the InfluxDB line format of the InfluxDB line
protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md). protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).
@ -42,19 +42,14 @@ go test -bench=. -race -v ./...
## What are these selectors mentioned in the code? ## What are these selectors mentioned in the code?
Tags in InfluxDB are used to build indexes over the stored data. InfluxDB-Tags The cc-metric-store works as a time-series database and uses the InfluxDB line
have no relation to each other, they do not depend on each other and have no protocol as input format. Unlike InfluxDB, the data is indexed by one single
hierarchy. Different tags build up different indexes (I am no expert at all, but strictly hierarchical tree structure. A selector is build out of the tags in the
this is how i think they work). InfluxDB line protocol, and can be used to select a node (not in the sense of a
compute node, can also be a socket, cpu, ...) in that tree. The implementation
This project also works as a time-series database and uses the InfluxDB line calls those nodes `level` to avoid confusion. It is impossible to access data
protocol. Unlike InfluxDB, the data is indexed by one single strictly only by knowing the _socket_ or _cpu_ tag, all higher up levels have to be
hierarchical tree structure. A selector is build out of the tags in the InfluxDB specified as well.
line protocol, and can be used to select a node (not in the sense of a compute
node, can also be a socket, cpu, ...) in that tree. The implementation calls
those nodes `level` to avoid confusion. It is impossible to access data only by
knowing the _socket_ or _cpu_ tag, all higher up levels have to be specified as
well.
This is what the hierarchy currently looks like: This is what the hierarchy currently looks like:
@ -68,6 +63,8 @@ This is what the hierarchy currently looks like:
- cpu3 - cpu3
- cpu4 - cpu4
- ... - ...
- gpu1
- gpu2
- host2 - host2
- ... - ...
- cluster2 - cluster2
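The selector lookup described in the README text above can be sketched roughly as follows. This is a minimal illustration, not the cc-metric-store implementation: the `level` type, its fields, and the `findOrCreate` and `find` helpers are hypothetical, and the names `cluster1`, `host1`, and `socket0` are assumed for the example. It shows only why a selector has to list every level from the root downward.

```go
package main

import "fmt"

// level is a hypothetical stand-in for one node of the hierarchy
// (cluster, host, socket, cpu, ...). It is not the real type.
type level struct {
	children map[string]*level
	values   map[string][]float64 // metric name -> samples
}

func newLevel() *level {
	return &level{
		children: map[string]*level{},
		values:   map[string][]float64{},
	}
}

// findOrCreate walks the selector from this level downward and
// creates any level that does not exist yet.
func (l *level) findOrCreate(selector []string) *level {
	cur := l
	for _, name := range selector {
		next, ok := cur.children[name]
		if !ok {
			next = newLevel()
			cur.children[name] = next
		}
		cur = next
	}
	return cur
}

// find returns the level addressed by selector, or nil if any
// element of the path is missing.
func (l *level) find(selector []string) *level {
	cur := l
	for _, name := range selector {
		next, ok := cur.children[name]
		if !ok {
			return nil
		}
		cur = next
	}
	return cur
}

func main() {
	root := newLevel()

	// Store one sample for a cpu below an assumed cluster/host/socket path.
	cpu := root.findOrCreate([]string{"cluster1", "host1", "socket0", "cpu3"})
	cpu.values["flops"] = append(cpu.values["flops"], 42.0)

	// The full path resolves; the cpu tag on its own does not.
	fmt.Println(root.find([]string{"cluster1", "host1", "socket0", "cpu3"}) != nil)
	fmt.Println(root.find([]string{"cpu3"}) != nil)
}
```

Because each level only knows its direct children, a lookup is a plain walk along the path. This matches the restriction stated above: a _socket_ or _cpu_ tag alone is not enough to reach the data.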
@ -116,7 +113,7 @@ this](https://pkg.go.dev/time#ParseDuration) (Allowed suffixes: `s`, `m`, `h`,
There are two ways for sending data to the cc-metric-store, both of which are There are two ways for sending data to the cc-metric-store, both of which are
supported by the supported by the
[cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector). [cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector).
This example uses Nats, the alternative is to use HTTP. This example uses NATS, the alternative is to use HTTP.
```sh ```sh
# Only needed once, downloads the docker image # Only needed once, downloads the docker image

51
TODO.md
@ -1,51 +0,0 @@
# Possible Tasks and Improvements
Importance:
- **I** Important
- **N** Nice to have
- **W** Won't do. Probably not necessary.
- Benchmarking
- Benchmark and compare common timeseries DBs with our data and our queries (N)
- Web interface
- Provide simple http endpoint with a status and debug view (Start with Basic
Authentication)
- Configuration
- Consolidate configuration with cc-backend, remove redundant information
- Support to receive configuration via NATS channel
- Memory management
- To overcome garbage collection overhead: Reimplement in Rust (N)
- Request memory directly batchwise via mmap (started in branch) (W)
- Archive
- S3 backend for archive (I)
- Store information in each buffer if already archived (N)
- Do not create new checkpoint if all buffers already archived (N)
- Checkpoints
- S3 backend for checkpoints (I)
- Combine checkpoints into larger files (I)
- Binary checkpoints (started in branch) (W)
- API
- Redesign query interface (N)
- Provide an endpoint for node health based on received metric data (I)
- Introduce JWT authentication for REST and NATS (I)
- Testing
- General tests (I)
- Test data generator for regression tests (I)
- Check for corner cases that should fail gracefully (N)
- Write a more realistic `ToArchive`/`FromArchive` Tests (N)
- Aggregation
- Calculate averages buffer-wise as soon as full, average weighted by length of buffer (N)
- Only the head-buffer needs to be fully traversed (N)
- If aggregating over hwthreads/cores/sockets cache those results and reuse
some of that for new queries aggregating only over the newer data (W)
- Core functionality
- Implement a health checker component that provides information to the web
interface and REST API (I)
- Support units for metrics including to request unit conversions (I)
- Compression
- Enable compression for http API requests (N)
- Enable compression for checkpoints/archive (I)
- Sampling
- Support data re sampling to reduce data points (I)
- Use re sampling algorithms that preserve min/max as far as possible (I)