mirror of https://github.com/ClusterCockpit/cc-metric-store.git (synced 2024-12-26 00:49:05 +01:00)
Update README and Remove TODO
parent 699bde372d
commit 53cb497e0c
README.md: 37 lines changed
@@ -6,16 +6,16 @@ The cc-metric-store provides a simple in-memory time series database for storing
 metrics of cluster nodes at preconfigured intervals. It is meant to be used as
 part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all
 data is kept in-memory (but written to disk as compressed JSON for long term
-storage), accessing it is very fast. It also provides aggregations over time
-_and_ nodes/sockets/cpus.
+storage), accessing it is very fast. It also provides topology aware
+aggregations over time _and_ nodes/sockets/cpus.
 
 There are major limitations: Data only gets written to disk at periodic
-checkpoints, not as soon as it is received.
+checkpoints, not as soon as it is received. Also only the fixed configured
+duration is stored and available.
 
-Go look at the `TODO.md` file and the [GitHub
+Go look at the [GitHub
 Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress
-overview. Things work, but are not properly tested. The
-[NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
+overview. The [NATS.io](https://nats.io/) based writing endpoint consumes messages in [this
 format of the InfluxDB line
 protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).
 
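The two limitations named in the hunk above (only a fixed, configured duration is kept, and data reaches disk only at periodic checkpoints) can be illustrated with a small sketch. This is not the project's implementation: the `buffer` type, its `retention` field, and the `write`, `unsaved`, and `demo` functions are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"time"
)

// sample is one timestamped measurement.
type sample struct {
	ts    time.Time
	value float64
}

// buffer is a hypothetical in-memory series with a fixed retention window.
// It is an illustration only, not a type from cc-metric-store.
type buffer struct {
	retention time.Duration
	samples   []sample
}

// write appends a sample and drops everything older than the retention window.
func (b *buffer) write(ts time.Time, v float64) {
	b.samples = append(b.samples, sample{ts, v})
	cutoff := ts.Add(-b.retention)
	i := 0
	for i < len(b.samples) && b.samples[i].ts.Before(cutoff) {
		i++
	}
	b.samples = b.samples[i:]
}

// unsaved counts samples received after the last checkpoint. These exist
// only in memory and would be lost if the process stopped right now.
func (b *buffer) unsaved(lastCheckpoint time.Time) int {
	n := 0
	for _, s := range b.samples {
		if s.ts.After(lastCheckpoint) {
			n++
		}
	}
	return n
}

// demo writes five samples one minute apart into a buffer that retains
// three minutes, assuming the last checkpoint happened at minute two.
func demo() (kept, atRisk int) {
	start := time.Unix(0, 0)
	b := &buffer{retention: 3 * time.Minute}
	for i := 0; i < 5; i++ {
		b.write(start.Add(time.Duration(i)*time.Minute), float64(i))
	}
	return len(b.samples), b.unsaved(start.Add(2 * time.Minute))
}

func main() {
	kept, atRisk := demo()
	fmt.Println(kept, atRisk) // prints "4 2"
}
```

Of the five samples, the oldest has fallen out of the retention window, and the two newest have not yet been covered by a checkpoint.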
@@ -42,19 +42,14 @@ go test -bench=. -race -v ./...
 
 ## What are these selectors mentioned in the code?
 
-Tags in InfluxDB are used to build indexes over the stored data. InfluxDB-Tags
-have no relation to each other, they do not depend on each other and have no
-hierarchy. Different tags build up different indexes (I am no expert at all, but
-this is how i think they work).
-
-This project also works as a time-series database and uses the InfluxDB line
-protocol. Unlike InfluxDB, the data is indexed by one single strictly
-hierarchical tree structure. A selector is build out of the tags in the InfluxDB
-line protocol, and can be used to select a node (not in the sense of a compute
-node, can also be a socket, cpu, ...) in that tree. The implementation calls
-those nodes `level` to avoid confusion. It is impossible to access data only by
-knowing the _socket_ or _cpu_ tag, all higher up levels have to be specified as
-well.
+The cc-metric-store works as a time-series database and uses the InfluxDB line
+protocol as input format. Unlike InfluxDB, the data is indexed by one single
+strictly hierarchical tree structure. A selector is build out of the tags in the
+InfluxDB line protocol, and can be used to select a node (not in the sense of a
+compute node, can also be a socket, cpu, ...) in that tree. The implementation
+calls those nodes `level` to avoid confusion. It is impossible to access data
+only by knowing the _socket_ or _cpu_ tag, all higher up levels have to be
+specified as well.
 
 This is what the hierarchy currently looks like:
 
@@ -68,6 +63,8 @@ This is what the hierarchy currently looks like:
     - cpu3
     - cpu4
     - ...
+    - gpu1
+    - gpu2
   - host2
   - ...
 - cluster2
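The selector concept described in the README text above can be sketched as a lookup in a strictly hierarchical tree. This is a minimal illustration under stated assumptions, not the project's code: the `Level` type and the `NewLevel`, `FindOrCreate`, and `Find` names are hypothetical, and the level names are taken from the hierarchy listing.

```go
package main

import "fmt"

// Level is one node of the hierarchy (cluster, host, socket, cpu, ...).
// The type and its methods are hypothetical and only illustrate the idea.
type Level struct {
	children map[string]*Level
	metrics  map[string][]float64
}

// NewLevel returns an empty level with initialized maps.
func NewLevel() *Level {
	return &Level{
		children: map[string]*Level{},
		metrics:  map[string][]float64{},
	}
}

// FindOrCreate walks the selector from this level downwards and creates
// any level that does not exist yet.
func (l *Level) FindOrCreate(selector []string) *Level {
	cur := l
	for _, name := range selector {
		child, ok := cur.children[name]
		if !ok {
			child = NewLevel()
			cur.children[name] = child
		}
		cur = child
	}
	return cur
}

// Find returns the level addressed by the selector, or nil if any element
// of the path is missing. The selector must start at the top of the tree.
func (l *Level) Find(selector []string) *Level {
	cur := l
	for _, name := range selector {
		child, ok := cur.children[name]
		if !ok {
			return nil
		}
		cur = child
	}
	return cur
}

func main() {
	root := NewLevel()
	cpu := root.FindOrCreate([]string{"cluster1", "host1", "cpu3"})
	cpu.metrics["flops"] = append(cpu.metrics["flops"], 42.0)

	// The full path resolves to the level that holds the data.
	fmt.Println(root.Find([]string{"cluster1", "host1", "cpu3"}) != nil) // prints "true"
	// A lower level on its own does not resolve: the higher levels are required.
	fmt.Println(root.Find([]string{"cpu3"}) != nil) // prints "false"
}
```

Because every lookup starts at the root, there is exactly one index; this is why knowing only the _cpu_ or _socket_ tag is not enough to reach the data.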
@@ -116,7 +113,7 @@ this](https://pkg.go.dev/time#ParseDuration) (Allowed suffixes: `s`, `m`, `h`,
 There are two ways for sending data to the cc-metric-store, both of which are
 supported by the
 [cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector).
-This example uses Nats, the alternative is to use HTTP.
+This example uses NATS, the alternative is to use HTTP.
 
 ```sh
 # Only needed once, downloads the docker image
TODO.md: 51 lines changed
@@ -1,51 +0,0 @@
-# Possible Tasks and Improvements
-
-Importance:
-
-- **I** Important
-- **N** Nice to have
-- **W** Won't do. Probably not necessary.
-
-- Benchmarking
-  - Benchmark and compare common timeseries DBs with our data and our queries (N)
-- Web interface
-  - Provide simple http endpoint with a status and debug view (Start with Basic
-    Authentication)
-- Configuration
-  - Consolidate configuration with cc-backend, remove redundant information
-  - Support to receive configuration via NATS channel
-- Memory management
-  - To overcome garbage collection overhead: Reimplement in Rust (N)
-  - Request memory directly batchwise via mmap (started in branch) (W)
-- Archive
-  - S3 backend for archive (I)
-  - Store information in each buffer if already archived (N)
-  - Do not create new checkpoint if all buffers already archived (N)
-- Checkpoints
-  - S3 backend for checkpoints (I)
-  - Combine checkpoints into larger files (I)
-  - Binary checkpoints (started in branch) (W)
-- API
-  - Redesign query interface (N)
-  - Provide an endpoint for node health based on received metric data (I)
-  - Introduce JWT authentication for REST and NATS (I)
-- Testing
-  - General tests (I)
-  - Test data generator for regression tests (I)
-  - Check for corner cases that should fail gracefully (N)
-  - Write a more realistic `ToArchive`/`FromArchive` Tests (N)
-- Aggregation
-  - Calculate averages buffer-wise as soon as full, average weighted by length of buffer (N)
-  - Only the head-buffer needs to be fully traversed (N)
-  - If aggregating over hwthreads/cores/sockets cache those results and reuse
-    some of that for new queries aggregating only over the newer data (W)
-- Core functionality
-  - Implement a health checker component that provides information to the web
-    interface and REST API (I)
-  - Support units for metrics including to request unit conversions (I)
-- Compression
-  - Enable compression for http API requests (N)
-  - Enable compression for checkpoints/archive (I)
-- Sampling
-  - Support data re sampling to reduce data points (I)
-  - Use re sampling algorithms that preserve min/max as far as possible (I)