LikwidCollector: Filter out NaNs or set them to zero if 'nan_to_zero' option is set

2025-07-18 19:01:41 +02:00 · 2022-02-07 18:35:08 +01:00
parent 7182b339b9
commit a6bec61b1e
2 changed files with 17 additions and 5 deletions
--- a/collectors/likwidMetric.go
+++ b/collectors/likwidMetric.go
@@ -86,9 +86,9 @@ type LikwidCollectorEventsetConfig struct {
 type LikwidCollectorConfig struct {
 	Eventsets      []LikwidCollectorEventsetConfig `json:"eventsets"`
-	Metrics        []LikwidCollectorMetricConfig   `json:"globalmetrics"`
+	Metrics        []LikwidCollectorMetricConfig   `json:"globalmetrics,omitempty"`
-	ExcludeMetrics []string                        `json:"exclude_metrics"`
+	ForceOverwrite bool                            `json:"force_overwrite,omitempty"`
-	ForceOverwrite bool                            `json:"force_overwrite"`
+	NanToZero      bool                            `json:"nan_to_zero,omitempty"`
 }
 type LikwidCollector struct {
@@ -434,8 +434,11 @@ func (m *LikwidCollector) calcEventsetMetrics(group int, interval time.Duration,
 					continue
 				}
 				m.mresults[group][tid][metric.Name] = value
+				if m.config.NanToZero && math.IsNaN(value) {
+					value = 0.0
+				}
 				// Now we have the result, send it with the proper tags
-				if metric.Publish {
+				if metric.Publish && !math.IsNaN(value) {
 					tags := map[string]string{"type": metric.Scope.String()}
 					if metric.Scope != "node" {
 						tags["type-id"] = fmt.Sprintf("%d", domain)
@@ -473,8 +476,11 @@ func (m *LikwidCollector) calcGlobalMetrics(interval time.Duration, output chan
 					continue
 				}
 				m.gmresults[tid][metric.Name] = value
+				if m.config.NanToZero && math.IsNaN(value) {
+					value = 0.0
+				}
 				// Now we have the result, send it with the proper tags
-				if metric.Publish {
+				if metric.Publish && !math.IsNaN(value) {
 					tags := map[string]string{"type": metric.Scope.String()}
 					if metric.Scope != "node" {
 						tags["type-id"] = fmt.Sprintf("%d", domain)
--- a/collectors/likwidMetric.md
+++ b/collectors/likwidMetric.md
@@ -7,6 +7,10 @@ The `likwid` configuration consists of two parts, the "eventsets" and "globalmet
 - An event set list itself has two parts, the "events" and a set of derivable "metrics". Each of the "events" is a counter:event pair in LIKWID's syntax. The "metrics" are a list of formulas to derive the metric value from the measurements of the "events". Each metric has a name, the formula, a scope and a publish flag. Counter names can be used like variables in the formulas, so `PMC0+PMC1` sums the measurements for both events configured in the counters `PMC0` and `PMC1`. The scope tells the collector whether it is a metric for each hardware thread (`hwthread`) or each CPU socket (`socket`). The last one is the publishing flag. It tells the collector whether a metric should be sent to the router.
 - The global metrics are metrics which require data from all event set measurements to be derived. The inputs are the metrics in the event sets. Similar to the metrics in the event sets, the global metrics are defined by a name, a formula, a scope and a publish flag. See event set metrics for details. The only difference is that there is no access to the raw event measurements anymore but only to the metrics. So, the idea is to derive a metric in the "eventsets" section and reuse it in the "globalmetrics" part. If you need a metric only for deriving the global metrics, disable forwarding of the event set metrics. **Be aware** that the combination might be misleading because the "behavior" of a metric changes over time and the multiple measurements might count different computing phases.
+Additional options:
+- `force_overwrite`: Same as setting `LIKWID_FORCE=1`. In case counters are already in use, LIKWID overwrites their configuration to do its measurements.
+- `nan_to_zero`: In some cases, the calculations result in `NaN`. With this option, all `NaN` values are replaced with `0.0`.
 ### Available metric scopes
 Hardware performance counters are scattered all over the system nowadays. A counter covers a specific part of the system. While there are hardware-thread-specific counters for CPU cycles, instructions and so on, some others are specific to a whole CPU socket/package. To address that, the collector provides the specification of a 'scope' for each metric.
@@ -28,6 +32,8 @@ As a guideline:
 ```json
  "likwid": {
+    "force_overwrite" : false,
+    "nan_to_zero" : false,
    "eventsets": [
      {
        "events": {