A simple in-memory metric store

ClusterCockpit Metric Store


The cc-metric-store provides a simple in-memory time series database for storing metrics of cluster nodes at preconfigured intervals. It is meant to be used as part of the ClusterCockpit suite. Because all data is kept in memory (and additionally written to disk as compressed JSON for long-term storage), accessing it is very fast. It also provides aggregations over time and over nodes/sockets/cpus.

See the TODO.md file and the GitHub Issues for a progress overview. Things work, but are not properly tested. The NATS.io-based writing endpoint consumes messages in the InfluxDB line protocol format.
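To make the message format more concrete, here is a minimal sketch of how a single sample could be rendered as an InfluxDB line protocol line. The helper formatLine, the field name "value", the tag names and the second-resolution timestamp are assumptions for illustration, not taken from this repository. Escaping of special characters is not handled.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatLine renders one sample in InfluxDB line protocol:
//   <measurement>,<tag>=<value>,... <field>=<value> <timestamp>
// Assumptions of this sketch: a single field called "value" and a
// timestamp in seconds. Special characters are not escaped.
func formatLine(metric string, tags map[string]string, value float64, unixSeconds int64) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic tag order

	var sb strings.Builder
	sb.WriteString(metric)
	for _, k := range keys {
		sb.WriteString(",")
		sb.WriteString(k)
		sb.WriteString("=")
		sb.WriteString(tags[k])
	}
	fmt.Fprintf(&sb, " value=%g %d", value, unixSeconds)
	return sb.String()
}

func main() {
	tags := map[string]string{"cluster": "testcluster", "hostname": "host1"}
	fmt.Println(formatLine("load_one", tags, 1.5, 1642671790))
}
```

A line built this way would be published to the NATS subject the store subscribes to. The tags are what the store turns into a selector.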

REST API Endpoints

The REST API is documented in openapi.yaml in the OpenAPI 3.0 format.

Run tests

Some benchmarks access the MemoryStore concurrently, so enabling the race detector might be useful. The benchmarks also work as tests, since they check that the returned values are as expected.

# Tests only
go test -v ./...

# Benchmarks as well
go test -bench=. -race -v ./...

What are these selectors mentioned in the code?

Tags in InfluxDB are used to build indexes over the stored data. InfluxDB tags have no relation to each other: they do not depend on each other and have no hierarchy. Different tags build up different indexes (I am no expert at all, but this is how I think they work).

This project also works as a time-series database and uses the InfluxDB line protocol. Unlike InfluxDB, the data is indexed by a single, strictly hierarchical tree structure. A selector is built out of the tags in the InfluxDB line protocol and can be used to select a node in that tree (not in the sense of a compute node; it can also be a socket, cpu, ...). The implementation calls those nodes levels to avoid confusion. It is impossible to access data by knowing only the socket or cpu tag; all higher levels have to be specified as well.

This is what the hierarchy currently looks like:

  • cluster1
    • host1
      • socket0
      • socket1
      • ...
      • cpu1
      • cpu2
      • cpu3
      • cpu4
      • ...
    • host2
    • ...
  • cluster2
  • ...

Example selectors:

  1. ["cluster1", "host1", "cpu0"]: Select only the cpu0 of host1 in cluster1
  2. ["cluster1", "host1", ["cpu4", "cpu5", "cpu6", "cpu7"]]: Select only CPUs 4-7 of host1 in cluster1
  3. ["cluster1", "host1"]: Select the complete node. If querying for a CPU-specific metric such as floats, all CPUs are implied
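The following sketch illustrates how such a selector could be resolved against a hierarchical tree. The types Level and SelectorElement and the function find are hypothetical names chosen for this example and are not necessarily how selector.go implements it. Each selector element is either a single name or a group of names, and resolution always starts at the root.

```go
package main

import "fmt"

// Level is one node of the hierarchy (cluster, host, socket, cpu, ...).
// Hypothetical type for illustration only.
type Level struct {
	Name     string
	Children map[string]*Level
}

// SelectorElement is either a single name or a group of names.
type SelectorElement struct {
	Name  string
	Group []string
}

func (e SelectorElement) names() []string {
	if len(e.Group) > 0 {
		return e.Group
	}
	return []string{e.Name}
}

// find walks down from root, one selector element per tree level, and
// returns every level that matches the complete selector.
func find(root *Level, selector []SelectorElement) []*Level {
	current := []*Level{root}
	for _, elem := range selector {
		var next []*Level
		for _, lvl := range current {
			for _, name := range elem.names() {
				if child, ok := lvl.Children[name]; ok {
					next = append(next, child)
				}
			}
		}
		current = next
	}
	return current
}

func newLevel(name string, children ...*Level) *Level {
	l := &Level{Name: name, Children: map[string]*Level{}}
	for _, c := range children {
		l.Children[c.Name] = c
	}
	return l
}

// exampleTree builds a small part of the hierarchy shown above.
func exampleTree() *Level {
	host1 := newLevel("host1",
		newLevel("socket0"), newLevel("socket1"),
		newLevel("cpu4"), newLevel("cpu5"))
	return newLevel("root", newLevel("cluster1", host1))
}

func main() {
	sel := []SelectorElement{
		{Name: "cluster1"}, {Name: "host1"}, {Group: []string{"cpu4", "cpu5"}},
	}
	for _, lvl := range find(exampleTree(), sel) {
		fmt.Println(lvl.Name)
	}
}
```

Because the walk always starts at the root, a selector such as ["cpu4"] on its own matches nothing in this sketch, which mirrors the rule that all higher levels have to be specified.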

Config file

All durations are specified in seconds.

  • metrics: Map of metric-name to objects with the following properties
    • frequency: Timestep/Interval/Resolution of this metric (In seconds)
    • aggregation: Can be "sum", "avg" or null
      • null means aggregation across nodes is forbidden for this metric
      • "sum" means that values from the child levels are summed up for the parent level
      • "avg" means that values from the child levels are averaged for the parent level
    • scope: Unused at the moment, should be something like "node", "socket" or "cpu"
  • nats: URL of the NATS.io server (the store subscribes to the updates channel for metrics), example: "nats://localhost:4222"
  • http-api-address: Where to listen via HTTP, example: ":8080"
  • jwt-public-key: Base64 encoded string, use this to verify requests to the HTTP API
  • retention-on-memory: Keep all values in memory for at least this many seconds
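As a rough illustration of the options above, this sketch decodes such a config with encoding/json and shows how the "sum" and "avg" aggregations could be applied to values from child levels. The struct names, the helper functions, the metric name flops_any and all values in the sample are assumptions for this example. The config.json in the repository is the authoritative reference.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// MetricConfig and Config mirror the documented keys; the Go names are
// chosen for this sketch only.
type MetricConfig struct {
	Frequency   int64   `json:"frequency"`   // seconds
	Aggregation *string `json:"aggregation"` // nil when the JSON value is null
	Scope       string  `json:"scope"`
}

type Config struct {
	Metrics           map[string]MetricConfig `json:"metrics"`
	Nats              string                  `json:"nats"`
	HTTPApiAddress    string                  `json:"http-api-address"`
	JwtPublicKey      string                  `json:"jwt-public-key"`
	RetentionOnMemory int64                   `json:"retention-on-memory"` // seconds
}

func parseConfig(data []byte) (Config, error) {
	var c Config
	err := json.Unmarshal(data, &c)
	return c, err
}

// aggregate combines the values of child levels into one parent value.
func aggregate(kind *string, values []float64) (float64, error) {
	if kind == nil {
		return 0, errors.New("aggregation is forbidden for this metric")
	}
	if len(values) == 0 {
		return 0, errors.New("no values to aggregate")
	}
	sum := 0.0
	for _, v := range values {
		sum += v
	}
	switch *kind {
	case "sum":
		return sum, nil
	case "avg":
		return sum / float64(len(values)), nil
	}
	return 0, fmt.Errorf("unknown aggregation %q", *kind)
}

// sampleConfig uses made-up values for illustration.
const sampleConfig = `{
  "metrics": {
    "load_one":  { "frequency": 3, "aggregation": null,  "scope": "node" },
    "flops_any": { "frequency": 3, "aggregation": "avg", "scope": "cpu" }
  },
  "nats": "nats://localhost:4222",
  "http-api-address": ":8080",
  "jwt-public-key": "",
  "retention-on-memory": 86400
}`

func main() {
	cfg, err := parseConfig([]byte(sampleConfig))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	avg, err := aggregate(cfg.Metrics["flops_any"].Aggregation, []float64{1, 2, 3})
	fmt.Println(avg, err)
	_, err = aggregate(cfg.Metrics["load_one"].Aggregation, []float64{1, 2, 3})
	fmt.Println(err)
}
```

Using a pointer for aggregation keeps JSON null distinguishable from an empty string, so the "forbidden" case stays explicit.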

Test the complete setup (excluding ClusterCockpit itself)

First, get a NATS server running:

# Only needed once, downloads the docker image
docker pull nats:latest

# Start the NATS server
docker run -p 4222:4222 -ti nats:latest

Second, build and start the cc-metric-collector using the following as config.json:

{
    "sink": {
        "type": "nats",
        "host": "localhost",
        "port": "4222",
        "database": "updates"
    },
    "interval" : 3,
    "duration" : 1,
    "collectors": [ "likwid", "loadavg" ],
    "default_tags": { "cluster": "testcluster" },
    "receiver": { "type": "none" }
}

Third, build and start the metric store. For this example here, the config.json file already in the repository should work just fine.

# Assuming you have a clone of this repo in ./cc-metric-store:
cd cc-metric-store
go get
go build
./cc-metric-store

And finally, use the API to fetch some data. The API is protected by JWT based authentication if jwt-public-key is set in config.json. You can use this JWT for testing: eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw

JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"

# If the collector and store and nats-server have been running for at least 60 seconds on the same host, you may run:
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/query" -d "{ \"cluster\": \"testcluster\", \"from\": $(expr $(date +%s) - 60), \"to\": $(date +%s), \"queries\": [{
  \"metric\": \"load_one\",
  \"hostname\": \"$(hostname)\"
}] }"

# ...
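
For clients written in Go, the same request can be sketched with net/http. The request fields (cluster, from, to, queries, metric, hostname), the /api/query path and the Bearer header are taken from the curl example above. The type and function names are hypothetical, and the response is not decoded here because its schema is defined in openapi.yaml.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type query struct {
	Metric   string `json:"metric"`
	Hostname string `json:"hostname"`
}

type queryRequest struct {
	Cluster string  `json:"cluster"`
	From    int64   `json:"from"` // Unix timestamp in seconds
	To      int64   `json:"to"`   // Unix timestamp in seconds
	Queries []query `json:"queries"`
}

// queryBody builds the JSON body for the last windowSeconds before nowUnix.
func queryBody(cluster, metric, hostname string, nowUnix, windowSeconds int64) ([]byte, error) {
	return json.Marshal(queryRequest{
		Cluster: cluster,
		From:    nowUnix - windowSeconds,
		To:      nowUnix,
		Queries: []query{{Metric: metric, Hostname: hostname}},
	})
}

// newQueryRequest prepares the POST request without sending it.
func newQueryRequest(baseURL, jwt string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/query", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+jwt)
	return req, nil
}

func main() {
	body, err := queryBody("testcluster", "load_one", "host1", time.Now().Unix(), 60)
	if err != nil {
		fmt.Println(err)
		return
	}
	req, err := newQueryRequest("http://localhost:8080", "<your JWT>", body)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(req.Method, req.URL)
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```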