Initial complete draft of updated line protocol specs

Jan Eitzinger 2025-02-20 06:30:00 +01:00
parent 917e17d9bf
commit 4594c99395


@@ -15,8 +15,9 @@ tags. All metrics/events have the following format (if written to `stdout`):
 Where `<tag set>` and `<field set>` are comma-separated lists of `key=value`
 entries. As a mental model, think of tags as `indices` in the database for
 faster lookup and the `<field set>` as values.
+The timestamp is UNIX epoch time in seconds!
-We are using the tag set to add metadata information and the field for the
+We are using the tag set to add metadata information and one field for the
 payload.
 **Remark**: In the first iteration, we only sent metrics (number values) but we
@@ -26,15 +27,22 @@ text was changed accordingly. The update is backward-compatible, for metrics
 ## Message categories
-There exist the following line line-protocol message flavors:
-- Metric: The field key is `value`, measurement = metric name
-- Event: The field key is `event`, Events are actionable informations, measurement = event subtype (job, phases, ?? ), Additional tag function=<string>
-- Log message: The field key is `log`. Log messages are purely informational,
-measurement = [ccb, ccms, ccmc, ccem, ccnc], Additional tag loglevel
-- Control message: The field key is `control`, measurement = knob name (rapl, freq, prefetcher, topology, config), Additional tags: method=[get,put]
+There are four line-protocol message flavors:
+- **Metric**: The `field` key is `value`, the `measurement` is the metric name
+- **Event**: The `field` key is `event`. Events are actionable information. The
+`measurement` is set to an event class (job, slurm, status, phases, ?? ). Additional tag
+`function` to indicate the purpose, similar to a REST endpoint (for the job
+class this can be start_job and stop_job).
+- **Log**: The `field` key is `log`. Log messages are purely informational.
+The `measurement` is set to the component identifier [ccb, ccms, ccmc, ccem,
+ccnc]. Additional tag `loglevel` to set the log level (debug, info, warn,
+error).
+- **Control**: The `field` key is `control`, the `measurement` is set to a
+control class (rapl, freq, prefetcher, topology, config). Additional tag
+`method` with one of [GET,PUT].
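The four flavors in the updated list differ only in the key of their single field, so a receiver can tell them apart from that key alone. A minimal sketch of this idea follows; the names `FIELD_KEY_TO_CATEGORY` and `classify_field` are hypothetical and not part of the specification.

```python
# Field key -> message category, as given in the updated list of flavors.
# The dictionary and function names are illustrative assumptions.
FIELD_KEY_TO_CATEGORY = {
    "value": "metric",
    "event": "event",
    "log": "log",
    "control": "control",
}


def classify_field(field_set: str) -> str:
    """Return the message category for a single `key=value` field set."""
    key, sep, _ = field_set.partition("=")
    if not sep or key not in FIELD_KEY_TO_CATEGORY:
        raise ValueError(f"unknown field key: {key!r}")
    return FIELD_KEY_TO_CATEGORY[key]
```

For instance, `classify_field("value=1203.3")` yields `"metric"`, while a field set with any other key raises `ValueError`.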
-## Messaging
+## Messaging subjects
 ClusterCockpit uses the NATS messaging network, with the option to support other
 messaging frameworks in the future. To distinguish between different message
@@ -45,19 +53,16 @@ subject hierarchy tree is used:
 <cluster name>. |
                 --- metrics
                 |
-                --- events.[job]
+                --- events.[job, slurm]
                 |
                 --- log.[ccb, ccms, ccmc, ccem, ccnc]
                 |
-                --- control.[get,put]
+                --- control.[get, put]
``` ```
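The subject tree can be read as a recipe for building NATS subject strings, where tokens are separated by dots. The following is a rough sketch under that reading; `SUBJECT_LEAVES` and `build_subject` are hypothetical names, the leaf sets are copied from the updated tree, and the cluster name `alex` is only an illustration.

```python
from typing import Optional

# Branches of the subject tree and their allowed leaves (None = no leaf).
# Copied from the updated tree; names here are illustrative assumptions.
SUBJECT_LEAVES = {
    "metrics": None,
    "events": {"job", "slurm"},
    "log": {"ccb", "ccms", "ccmc", "ccem", "ccnc"},
    "control": {"get", "put"},
}


def build_subject(cluster: str, branch: str, leaf: Optional[str] = None) -> str:
    """Build a dot-separated subject such as `alex.events.job`."""
    if branch not in SUBJECT_LEAVES:
        raise ValueError(f"unknown branch: {branch!r}")
    allowed = SUBJECT_LEAVES[branch]
    if allowed is None:
        if leaf is not None:
            raise ValueError(f"branch {branch!r} takes no leaf")
        return f"{cluster}.{branch}"
    if leaf not in allowed:
        raise ValueError(f"leaf {leaf!r} not allowed below {branch!r}")
    return f"{cluster}.{branch}.{leaf}"
```

With these assumptions, `build_subject("alex", "log", "ccms")` gives `alex.log.ccms`, and `build_subject("alex", "metrics")` gives `alex.metrics`.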
-## Points generic for all message categories
+## Rules valid for all message categories
-In ClusterCockpit we limit the flexibility of the InfluxData line-protocol
-slightly. The idea is to keep the format usable by different components.
-Each message is identifiable by the `measurement` (= metric name), the
+Each message is identifiable by the `measurement`, and the tags
 `hostname`, the `type` and, if required, a `type-id`.
 ### Mandatory tags per message
@@ -76,7 +81,7 @@ Each message is identifiable by the `measurement` (= metric name), the
 Although no `type-id` is required if `type=node`, it is recommended to send `type=node,type-id=0`.
-#### Optional tags depending on the message
+#### Optional tags depending on the message type
 In some cases, optional tags are required like `filesystem`, `device` or
 `version`. While you are free to do that, the ClusterCockpit components in the
@@ -109,25 +114,47 @@ ClusterCockpit ecosystem
 - `clock`: CPU clock in `MHz`
 - ...
FIXME: What about the unit??
 For the whole list, see [job-data schema](../../datastructures/job-data.schema.json)
+Example:
+```txt
+flops_any,hostname=e1208,type=core,type-id=23 value=1203.3 1740027951
+```
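To make the layout of the metric example concrete, here is a small sketch that assembles such a line from its parts: measurement, the tags `hostname`, `type` and `type-id`, one `value` field, and a timestamp in UNIX epoch seconds. The function `format_metric` is a hypothetical helper, not part of any ClusterCockpit component, and it assumes that names and tag values need no escaping.

```python
import time
from typing import Optional


def format_metric(name: str, hostname: str, type_: str, type_id: int,
                  value: float, timestamp: Optional[int] = None) -> str:
    """Format one metric message; assumes no characters that need escaping."""
    if timestamp is None:
        # The timestamp is UNIX epoch time in seconds.
        timestamp = int(time.time())
    tags = f"hostname={hostname},type={type_},type-id={type_id}"
    return f"{name},{tags} value={value} {timestamp}"
```

Calling `format_metric("flops_any", "e1208", "core", 23, 1203.3, 1740027951)` reproduces the example line shown above.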
 #### Events
-**Identification:** `event="X"` field with `"X"` being a string
+**Identification:** Field `event="X"` with `"X"` being the payload string.
-The name (measurement) of the event message can further specialize the purpose
-(similar to REST endpoints), e.g. `start_job`, and `stop_job` for events of type
-job.
+The name (measurement) of the event message indicates the event
+class. The `function` tag specifies the purpose (similar to REST endpoints), e.g.
+`start_job` and `stop_job` for events of class job.
-Example start job event:
-TBD
+Example:
+```txt
+job,hostname=mngmt02,type=node,type-id=0,function=stop_job event={"jobId": 69, "cluster": "ccfront", "stopTime": 1738842306, "jobState": "completed"} 1740027951
+```
 #### Controls
-**Identification:**
+**Identification:** Field `control="X"` with `"X"` being the control request. `measurement` is
+set to a control class, the tag `method` is either `GET` or `PUT`.
-- `control="X"` field with `"X"` being a string
-- `method` tag is either `GET` or `PUT`
+Example:
+```txt
+rapl,hostname=e1208,type=socket,type-id=2,method=GET control=intel.pkg.energy_status 1740027951
+```
 #### Logs
-**Identification:** `log="X"` field with `"X"` being a string
+**Identification:** `log="X"` field with `"X"` being the log message. The `measurement` is
+set to the source component id, the tag `loglevel` is one of debug, info, warn,
+error.
+Example:
+```txt
+ccb,hostname=server01,type=node,type-id=1,loglevel=info log="component: archiver cluster: alex jobId: 232383 - archiving finished" 1740027951
+```
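All four examples share one shape: measurement and tags up to the first space, a single field, and the timestamp after the last space. A simplified reader for that shape might look like the sketch below. It is not a full InfluxData line-protocol parser: it assumes that tag sets contain no escaped spaces or commas and that there is exactly one field. The names `Message` and `parse_line` are hypothetical.

```python
from typing import Dict, NamedTuple


class Message(NamedTuple):
    measurement: str
    tags: Dict[str, str]
    field_key: str      # value, event, log or control
    field_value: str    # raw payload, quotes are kept as sent
    timestamp: int      # UNIX epoch time in seconds


def parse_line(line: str) -> Message:
    """Split one message into its parts (simplified, single field only)."""
    # Measurement and tags end at the first space.
    head, _, rest = line.strip().partition(" ")
    # The timestamp starts after the last space; the payload may contain spaces.
    field_set, _, timestamp = rest.rpartition(" ")
    measurement, *tag_items = head.split(",")
    tags = dict(item.split("=", 1) for item in tag_items)
    field_key, _, field_value = field_set.partition("=")
    return Message(measurement, tags, field_key, field_value, int(timestamp))
```

Applied to the log example, this yields the measurement `ccb`, the tag `loglevel=info`, the field key `log`, and the timestamp `1740027951`. The field key then identifies the message category.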