Initial complete draft of updated line protocol specs

Jan Eitzinger 2025-02-20 06:30:00 +01:00
parent 917e17d9bf
commit 4594c99395


@ -15,8 +15,9 @@ tags. All metrics/events have the following format (if written to `stdout`):
Where `<tag set>` and `<field set>` are comma-separated lists of `key=value`
entries. As a mental model, think of tags as `indices` in the database for
faster lookup and of the `<field set>` as values.
The timestamp is UNIX epoch time in seconds!
We are using the tag set to add metadata information and the field for the
We are using the tag set to add metadata information and one field for the
payload.
**Remark**: In the first iteration, we only sent metrics (number values) but we
@ -26,15 +27,22 @@ text was changed accordingly. The update is backward-compatible, for metrics
## Message categories
There exist the following line line-protocol message flavors:
There are four line-protocol message flavors:
- Metric: The field key is `value`, measurement = metric name
- Event: The field key is `event`, Events are actionable informations, measurement = event subtype (job, phases, ?? ), Additional tag function=<string>
- Log message: The field key is `log`. Log messages are purely informational,
measurement = [ccb, ccms, ccmc, ccem, ccnc], Additional tag loglevel
- Control message: The field key is `control`, measurement = knob name (rapl, freq, prefetcher, topology, config), Additional tags: method=[get,put]
- **Metric**: The `field` key is `value`, the `measurement` is the metric name.
- **Event**: The `field` key is `event`. Events carry actionable information. The
`measurement` is set to an event class (job, slurm, status, phases, ?? ). The additional tag
`function` indicates the purpose, similar to a REST endpoint (for the job
class this can be start_job and stop_job).
- **Log**: The `field` key is `log`. Log messages are purely informational.
The `measurement` is set to the component identifier [ccb, ccms, ccmc, ccem,
ccnc]. The additional tag `loglevel` sets the log level (debug, info, warn,
error).
- **Control**: The `field` key is `control`, the `measurement` is set to a
control class (rapl, freq, prefetcher, topology, config). The additional tag
`method` is one of [GET,PUT].
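
The mapping from field key to message category listed above can be sketched in Python as follows. This is only an illustration and not taken from the ClusterCockpit code base: the names `CATEGORY_BY_FIELD_KEY`, `EXTRA_TAG_BY_CATEGORY`, `classify` and `missing_extra_tag` are hypothetical.

```python
from typing import Mapping, Optional

# Field key -> message category, following the list above.
CATEGORY_BY_FIELD_KEY = {
    "value": "metric",
    "event": "event",
    "log": "log",
    "control": "control",
}

# Additional tag each category carries (metrics have none).
EXTRA_TAG_BY_CATEGORY = {
    "event": "function",
    "log": "loglevel",
    "control": "method",
}


def classify(field_key: str) -> str:
    """Return the message category for a field key (hypothetical helper)."""
    try:
        return CATEGORY_BY_FIELD_KEY[field_key]
    except KeyError:
        raise ValueError(f"unknown field key: {field_key!r}") from None


def missing_extra_tag(field_key: str, tags: Mapping[str, str]) -> Optional[str]:
    """Return the name of the category-specific tag if it is absent, else None."""
    extra = EXTRA_TAG_BY_CATEGORY.get(classify(field_key))
    if extra is not None and extra not in tags:
        return extra
    return None
```

A receiver could call `classify("log")` to decide how to route a message and use `missing_extra_tag` to reject messages that lack `function`, `loglevel` or `method`.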
## Messaging
## Messaging subjects
ClusterCockpit uses the NATS messaging network, with the option to support other
messaging frameworks in the future. To distinguish between different message
@ -45,19 +53,16 @@ subject hierarchy tree is used:
<cluster name>. |
--- metrics
|
--- events.[job]
--- events.[job, slurm]
|
--- log.[ccb, ccms, ccmc, ccem, ccnc]
|
--- control.[get,put]
--- control.[get, put]
```
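
As a rough sketch, a subject string for the hierarchy above could be assembled like this. It assumes that the levels of the tree are joined with dots, as is usual for NATS subjects, and that the bracketed names are the only allowed sub-levels. The helper `subject` and the table `ALLOWED_SUBTOPICS` are hypothetical names, not part of any ClusterCockpit API.

```python
# Allowed sub-levels per category, read off the subject tree above.
# An empty tuple means the category has no sub-level.
ALLOWED_SUBTOPICS = {
    "metrics": (),
    "events": ("job", "slurm"),
    "log": ("ccb", "ccms", "ccmc", "ccem", "ccnc"),
    "control": ("get", "put"),
}


def subject(cluster: str, category: str, subtopic: str = "") -> str:
    """Build '<cluster name>.<category>[.<subtopic>]' (assumed layout)."""
    allowed = ALLOWED_SUBTOPICS.get(category)
    if allowed is None:
        raise ValueError(f"unknown category: {category!r}")
    if not allowed:
        if subtopic:
            raise ValueError(f"{category!r} takes no sub-level")
        return f"{cluster}.{category}"
    if subtopic not in allowed:
        raise ValueError(f"{subtopic!r} is not valid below {category!r}")
    return f"{cluster}.{category}.{subtopic}"
```

For a cluster named `alex`, `subject("alex", "events", "job")` yields `alex.events.job`.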
## Points generic for all message categories
## Rules valid for all message categories
In ClusterCockpit we limit the flexibility of the InfluxData line-protocol
slightly. The idea is to keep the format usable by different components.
Each message is identifiable by the `measurement` (= metric name), the
Each message is identifiable by the `measurement` and the tags
`hostname`, `type` and, if required, `type-id`.
### Mandatory tags per message
@ -76,7 +81,7 @@ Each message is identifiable by the `measurement` (= metric name), the
Although no `type-id` is required if `type=node`, it is recommended to send `type=node,type-id=0`.
#### Optional tags depending on the message
#### Optional tags depending on the message type
In some cases, optional tags are required like `filesystem`, `device` or
`version`. While you are free to do that, the ClusterCockpit components in the
@ -109,25 +114,47 @@ ClusterCockpit ecosystem
- `clock`: CPU clock in `MHz`
- ...
FIXME: What about the unit??
For the whole list, see [job-data schema](../../datastructures/job-data.schema.json)
Example:
```txt
flops_any,hostname=e1208,type=core,type-id=23 value=1203.3 1740027951
```
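
A metric line like the one above could be produced with a few lines of Python. This is a minimal sketch: the function name `metric_line` and its parameters are made up for illustration, and no escaping of special characters in tag values is attempted.

```python
import time
from typing import Optional, Union


def metric_line(
    name: str,
    hostname: str,
    type_: str,
    type_id: Union[int, str],
    value: Union[int, float],
    timestamp: Optional[int] = None,
) -> str:
    """Format one metric message: '<measurement>,<tag set> value=<v> <ts>'."""
    if timestamp is None:
        # The timestamp is UNIX epoch time in seconds.
        timestamp = int(time.time())
    tags = f"hostname={hostname},type={type_},type-id={type_id}"
    return f"{name},{tags} value={value} {timestamp}"


print(metric_line("flops_any", "e1208", "core", 23, 1203.3, 1740027951))
```

Called with the values shown, the sketch reproduces the example line; omitting `timestamp` stamps the message with the current time.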
#### Events
**Identification:** `event="X"` field with `"X"` being a string
The name (measurement) of the event message can further specialize the purpose
(similar to REST endpoints), e.g. `start_job`, and `stop_job` for events of type
job.
**Identification:** Field `event="X"` with `"X"` being the payload string.
The name (measurement) of the event message indicates the event
class. The `function` tag specifies the purpose (similar to REST endpoints), e.g.
`start_job` and `stop_job` for events of class job.
Example start job event:
TBD
Example:
```txt
job,hostname=mngmt02,type=node,type-id=0,function=stop_job event={"jobId": 69, "cluster": "ccfront", "stopTime": 1738842306, "jobState": "completed"} 1740027951
```
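
The following Python sketch shows one way to serialize a JSON payload into an event message. It rests on an assumption: the payload is written as a quoted string field (`event="..."`, matching the identification rule), with `"` and `\` backslash-escaped as the InfluxData line protocol requires for string field values. The example above shows the payload unquoted, so the exact wire format should be checked against the receiving component. `escape_string_field` and `event_line` are hypothetical names.

```python
import json


def escape_string_field(text: str) -> str:
    """Escape backslashes and double quotes for a quoted string field."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def event_line(
    event_class: str,
    hostname: str,
    function: str,
    payload: dict,
    timestamp: int,
) -> str:
    """Format an event message with a JSON payload (assumed quoting)."""
    # Compact JSON keeps the line short; spaces inside quotes would also be fine.
    body = json.dumps(payload, separators=(",", ":"))
    tags = f"hostname={hostname},type=node,type-id=0,function={function}"
    return f'{event_class},{tags} event="{escape_string_field(body)}" {timestamp}'


line = event_line(
    "job",
    "mngmt02",
    "stop_job",
    {"jobId": 69, "cluster": "ccfront", "stopTime": 1738842306, "jobState": "completed"},
    1740027951,
)
print(line)
```

A receiver would reverse the escaping before handing the field value to a JSON parser.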
#### Controls
**Identification:**
**Identification:** Field `control="X"` with `"X"` being the control request. `measurement` is
set to a control class, the tag `method` is either `GET` or `PUT`.
- `control="X"` field with `"X"` being a string
- `method` tag is either `GET` or `PUT`
Example:
```txt
rapl,hostname=e1208,type=socket,type-id=2,method=GET control=intel.pkg.energy_status 1740027951
```
#### Logs
**Identification:** `log="X"` field with `"X"` being a string
**Identification:** `log="X"` field with `"X"` being the log message. The `measurement` is
set to the source component ID, and the tag `loglevel` is one of debug, info, warn,
error.
Example:
```txt
ccb,hostname=server01,type=node,type-id=1,loglevel=info log="component: archiver cluster: alex jobId: 232383 - archiving finished" 1740027951
```