Update README and Remove TODO

Jan Eitzinger 2024-10-23 13:19:18 +02:00
parent 699bde372d
commit 53cb497e0c
Signed by: moebiusband
GPG Key ID: 2574BA29B90D6DD5
2 changed files with 17 additions and 71 deletions

README.md

@@ -6,16 +6,16 @@ The cc-metric-store provides a simple in-memory time series database for storing
metrics of cluster nodes at preconfigured intervals. It is meant to be used as
part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all
data is kept in-memory (but written to disk as compressed JSON for long term
-storage), accessing it is very fast. It also provides aggregations over time
-_and_ nodes/sockets/cpus.
+storage), accessing it is very fast. It also provides topology aware
+aggregations over time _and_ nodes/sockets/cpus.

There are major limitations: Data only gets written to disk at periodic
-checkpoints, not as soon as it is received.
+checkpoints, not as soon as it is received. Also only the fixed configured
+duration is stored and available.

-Go look at the `TODO.md` file and the [GitHub
+Go look at the [GitHub
Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress
-overview. Things work, but are not properly tested. The
-[NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
+overview. The [NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
format of the InfluxDB line
protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).
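
For orientation, one metric update in that line-protocol format might look roughly like the line below. The metric name, tag values, and timestamp are made up; the authoritative tag and field names are those defined in the linked specification.

```
flops_any,cluster=testcluster,hostname=node001,type=socket,type-id=0 value=0.5 1729685958000000000
```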
@@ -42,19 +42,14 @@ go test -bench=. -race -v ./...

## What are these selectors mentioned in the code?

-Tags in InfluxDB are used to build indexes over the stored data. InfluxDB-Tags
-have no relation to each other, they do not depend on each other and have no
-hierarchy. Different tags build up different indexes (I am no expert at all, but
-this is how i think they work).
-
-This project also works as a time-series database and uses the InfluxDB line
-protocol. Unlike InfluxDB, the data is indexed by one single strictly
-hierarchical tree structure. A selector is build out of the tags in the InfluxDB
-line protocol, and can be used to select a node (not in the sense of a compute
-node, can also be a socket, cpu, ...) in that tree. The implementation calls
-those nodes `level` to avoid confusion. It is impossible to access data only by
-knowing the _socket_ or _cpu_ tag, all higher up levels have to be specified as
-well.
+The cc-metric-store works as a time-series database and uses the InfluxDB line
+protocol as input format. Unlike InfluxDB, the data is indexed by one single
+strictly hierarchical tree structure. A selector is built out of the tags in the
+InfluxDB line protocol, and can be used to select a node (not in the sense of a
+compute node, can also be a socket, cpu, ...) in that tree. The implementation
+calls those nodes `level` to avoid confusion. It is impossible to access data
+only by knowing the _socket_ or _cpu_ tag, all higher up levels have to be
+specified as well.

This is what the hierarchy currently looks like:
@@ -68,6 +63,8 @@ This is what the hierarchy currently looks like:
- cpu3
- cpu4
- ...
+ - gpu1
+ - gpu2
- host2
- ...
- cluster2
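
To make the selector idea concrete: selecting a cpu level means naming every level above it, from the cluster down. A sketch, assuming illustrative names like `cluster1`, `host1`, and `socket0` from the hierarchy in the README (the concrete query syntax is defined by the HTTP API):

```
["cluster1", "host1", "socket0", "cpu0"]   # selects the cpu0 level: every level above it is named
["cpu0"]                                   # not possible: the host and socket levels above it are missing
```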
@@ -116,7 +113,7 @@ this](https://pkg.go.dev/time#ParseDuration) (Allowed suffixes: `s`, `m`, `h`,
There are two ways for sending data to the cc-metric-store, both of which are
supported by the
[cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector).
-This example uses Nats, the alternative is to use HTTP.
+This example uses NATS, the alternative is to use HTTP.

```sh
# Only needed once, downloads the docker image

TODO.md

@@ -1,51 +0,0 @@
# Possible Tasks and Improvements

Importance:

- **I** Important
- **N** Nice to have
- **W** Won't do. Probably not necessary.

- Benchmarking
  - Benchmark and compare common time-series DBs with our data and our queries (N)
- Web interface
  - Provide a simple HTTP endpoint with a status and debug view (start with Basic
    Authentication)
- Configuration
  - Consolidate configuration with cc-backend, remove redundant information
  - Support receiving configuration via a NATS channel
- Memory management
  - To overcome garbage collection overhead: reimplement in Rust (N)
  - Request memory directly batch-wise via mmap (started in branch) (W)
- Archive
  - S3 backend for archive (I)
  - Store information in each buffer if already archived (N)
  - Do not create a new checkpoint if all buffers are already archived (N)
- Checkpoints
  - S3 backend for checkpoints (I)
  - Combine checkpoints into larger files (I)
  - Binary checkpoints (started in branch) (W)
- API
  - Redesign query interface (N)
  - Provide an endpoint for node health based on received metric data (I)
  - Introduce JWT authentication for REST and NATS (I)
- Testing
  - General tests (I)
  - Test data generator for regression tests (I)
  - Check for corner cases that should fail gracefully (N)
  - Write more realistic `ToArchive`/`FromArchive` tests (N)
- Aggregation
  - Calculate averages buffer-wise as soon as a buffer is full, with the overall
    average weighted by buffer length (N); see the sketch after this list
  - Only the head buffer needs to be fully traversed (N)
  - If aggregating over hwthreads/cores/sockets, cache those results and reuse
    some of them for new queries aggregating only over the newer data (W)
- Core functionality
  - Implement a health-checker component that provides information to the web
    interface and REST API (I)
  - Support units for metrics, including requesting unit conversions (I)
- Compression
  - Enable compression for HTTP API requests (N)
  - Enable compression for checkpoints/archive (I)
- Sampling
  - Support data resampling to reduce data points (I)
  - Use resampling algorithms that preserve min/max as far as possible (I)
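
For the weighted-average idea in the Aggregation item above, a minimal Go sketch (the `buffer` type and `weightedAvg` function are hypothetical illustrations, not the cc-metric-store internals): each buffer's average contributes in proportion to the number of samples it holds, so only per-buffer summaries are needed once a buffer is full.

```go
package main

import "fmt"

// buffer is a hypothetical per-buffer summary: the average of its samples
// and how many samples it holds.
type buffer struct {
	avg float64
	n   int
}

// weightedAvg combines per-buffer averages into one overall average,
// weighting each buffer by its sample count.
func weightedAvg(bufs []buffer) float64 {
	var sum float64
	var total int
	for _, b := range bufs {
		sum += b.avg * float64(b.n)
		total += b.n
	}
	if total == 0 {
		return 0
	}
	return sum / float64(total)
}

func main() {
	bufs := []buffer{{avg: 2.0, n: 100}, {avg: 4.0, n: 50}}
	fmt.Println(weightedAvg(bufs)) // (2*100 + 4*50) / 150 = 2.666...
}
```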