mirror of
https://github.com/ClusterCockpit/cc-metric-collector.git
synced 2025-08-01 00:56:26 +02:00
Add likwid collector
collectors/likwid/groups/zen/BRANCH.txt (new file)
@@ -0,0 +1,32 @@
SHORT Branch prediction miss rate/ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_BRANCH_INSTR
PMC3  RETIRED_MISP_BRANCH_INSTR

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Branch rate PMC2/PMC0
Branch misprediction rate PMC3/PMC0
Branch misprediction ratio PMC3/PMC2
Instructions per branch PMC0/PMC2

LONG
Formulas:
Branch rate = RETIRED_BRANCH_INSTR/RETIRED_INSTRUCTIONS
Branch misprediction rate = RETIRED_MISP_BRANCH_INSTR/RETIRED_INSTRUCTIONS
Branch misprediction ratio = RETIRED_MISP_BRANCH_INSTR/RETIRED_BRANCH_INSTR
Instructions per branch = RETIRED_INSTRUCTIONS/RETIRED_BRANCH_INSTR
-
The rates state how often, on average, a branch or a mispredicted branch
occurred per retired instruction. The branch misprediction ratio states
what fraction of all branch instructions were mispredicted.
Instructions per branch is 1/branch rate.
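The metric formulas above are plain ratios of raw event counts; a minimal Python sketch (with invented counter values) shows the arithmetic:

```python
# Derive the BRANCH group's metrics from raw event counts.
# The counter values used below are invented for illustration.
def branch_metrics(retired_instr, retired_branch, retired_misp_branch):
    return {
        "Branch rate": retired_branch / retired_instr,                        # PMC2/PMC0
        "Branch misprediction rate": retired_misp_branch / retired_instr,     # PMC3/PMC0
        "Branch misprediction ratio": retired_misp_branch / retired_branch,   # PMC3/PMC2
        "Instructions per branch": retired_instr / retired_branch,            # PMC0/PMC2
    }

m = branch_metrics(1_000_000, 200_000, 10_000)
```

Note that "Instructions per branch" is simply the reciprocal of "Branch rate".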
collectors/likwid/groups/zen/CACHE.txt (new file)
@@ -0,0 +1,39 @@
SHORT Data cache miss rate/ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  DATA_CACHE_ACCESSES
PMC3  DATA_CACHE_REFILLS_ALL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
data cache requests PMC2
data cache request rate PMC2/PMC0
data cache misses PMC3
data cache miss rate PMC3/PMC0
data cache miss ratio PMC3/PMC2

LONG
Formulas:
data cache requests = DATA_CACHE_ACCESSES
data cache request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS
data cache misses = DATA_CACHE_REFILLS_ALL
data cache miss rate = DATA_CACHE_REFILLS_ALL / RETIRED_INSTRUCTIONS
data cache miss ratio = DATA_CACHE_REFILLS_ALL / DATA_CACHE_ACCESSES
-
This group measures the locality of your data accesses with regard to the
L1 cache. The data cache request rate tells you how data intensive your
code is, i.e. how many data accesses you have on average per instruction.
The data cache miss rate gives a measure of how often it was necessary to
fetch cache lines from higher levels of the memory hierarchy, and the data
cache miss ratio tells you how many of your memory references required a
cache line to be loaded from a higher level. While the data cache miss rate
might be dictated by your algorithm, you should try to get the data cache
miss ratio as low as possible by increasing your cache reuse.
collectors/likwid/groups/zen/CPI.txt (new file)
@@ -0,0 +1,30 @@
SHORT Cycles per instruction

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_UOPS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
CPI (based on uops) PMC1/PMC2
IPC PMC0/PMC1

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
CPI (based on uops) = CPU_CLOCKS_UNHALTED/RETIRED_UOPS
IPC = RETIRED_INSTRUCTIONS/CPU_CLOCKS_UNHALTED
-
This group measures how efficiently the processor works with regard to
instruction throughput. RETIRED_INSTRUCTIONS is also important as a
standalone metric, as it tells you how many instructions you need to
execute for a task. An optimization might show very low CPI values but
execute many more instructions for it.
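The derived CPI/IPC metrics of this group can be sketched in a few lines of Python; the counter values below are hypothetical:

```python
# Sketch of the CPI group's derived metrics from raw counts.
def cpi_metrics(retired_instr, clocks_unhalted, retired_uops):
    cpi = clocks_unhalted / retired_instr       # CPI = PMC1/PMC0
    cpi_uops = clocks_unhalted / retired_uops   # CPI (based on uops) = PMC1/PMC2
    ipc = retired_instr / clocks_unhalted       # IPC = PMC0/PMC1
    return cpi, cpi_uops, ipc

# Example: 2M retired instructions and 2.5M retired uops in 1M unhalted cycles.
cpi, cpi_uops, ipc = cpi_metrics(2_000_000, 1_000_000, 2_500_000)
```

Since an instruction can decode into several uops, the uop-based CPI is usually lower than the instruction-based one, as in this example.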
collectors/likwid/groups/zen/DATA.txt (new file)
@@ -0,0 +1,23 @@
SHORT Load to store ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  LS_DISPATCH_LOADS
PMC3  LS_DISPATCH_STORES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Load to store ratio PMC2/PMC3

LONG
Formulas:
Load to store ratio = LS_DISPATCH_LOADS/LS_DISPATCH_STORES
-
This is a simple metric to determine your load to store ratio.
collectors/likwid/groups/zen/DIVIDE.txt (new file)
@@ -0,0 +1,26 @@
SHORT Divide unit information

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  DIV_OP_COUNT
PMC3  DIV_BUSY_CYCLES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Number of divide ops PMC2
Avg. divide unit usage duration PMC3/PMC2

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
Number of divide ops = DIV_OP_COUNT
Avg. divide unit usage duration = DIV_BUSY_CYCLES/DIV_OP_COUNT
-
This performance group measures the average latency of divide operations.
collectors/likwid/groups/zen/ENERGY.txt (new file)
@@ -0,0 +1,32 @@
SHORT Power and Energy consumption

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PWR0  RAPL_CORE_ENERGY
PWR1  RAPL_PKG_ENERGY

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Energy Core [J] PWR0
Power Core [W] PWR0/time
Energy PKG [J] PWR1
Power PKG [W] PWR1/time

LONG
Formulas:
Power Core [W] = RAPL_CORE_ENERGY/time
Power PKG [W] = RAPL_PKG_ENERGY/time
-
Ryzen implements the RAPL interface previously introduced by Intel.
This interface makes it possible to monitor the energy consumed in the
core and package domains.
AMD does not document which parts of the CPU belong to which domain.
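The power metrics are simply the RAPL energy counts in Joules divided by the measurement time; a short sketch with invented readings:

```python
# Power [W] = energy [J] / runtime [s], as in the Power Core/PKG formulas above.
# The energy and runtime values are invented for illustration.
def power_watts(energy_joules, runtime_seconds):
    return energy_joules / runtime_seconds

core_power = power_watts(45.0, 10.0)  # e.g. Energy Core [J] over a 10 s run
pkg_power = power_watts(60.0, 10.0)   # e.g. Energy PKG [J] over the same run
```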
collectors/likwid/groups/zen/FLOPS_DP.txt (new file)
@@ -0,0 +1,26 @@
SHORT Double Precision MFLOP/s

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_SSE_AVX_FLOPS_DOUBLE_ALL
PMC3  MERGE

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
DP [MFLOP/s] 1.0E-06*(PMC2)/time

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
DP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_DOUBLE_ALL)/time
-
Profiling group to measure the double precision FLOP rate. The event can
have a per-cycle increment higher than 15, so the MERGE event is required.
collectors/likwid/groups/zen/FLOPS_SP.txt (new file)
@@ -0,0 +1,26 @@
SHORT Single Precision MFLOP/s

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_SSE_AVX_FLOPS_SINGLE_ALL
PMC3  MERGE

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
SP [MFLOP/s] 1.0E-06*(PMC2)/time

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
SP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_SINGLE_ALL)/time
-
Profiling group to measure the single precision FLOP rate. The event can
have a per-cycle increment higher than 15, so the MERGE event is required.
collectors/likwid/groups/zen/ICACHE.txt (new file)
@@ -0,0 +1,28 @@
SHORT Instruction cache miss rate/ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  ICACHE_FETCHES
PMC2  ICACHE_L2_REFILLS
PMC3  ICACHE_SYSTEM_REFILLS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/PMC0
L1I request rate PMC1/PMC0
L1I miss rate (PMC2+PMC3)/PMC0
L1I miss ratio (PMC2+PMC3)/PMC1

LONG
Formulas:
L1I request rate = ICACHE_FETCHES / RETIRED_INSTRUCTIONS
L1I miss rate = (ICACHE_L2_REFILLS + ICACHE_SYSTEM_REFILLS)/RETIRED_INSTRUCTIONS
L1I miss ratio = (ICACHE_L2_REFILLS + ICACHE_SYSTEM_REFILLS)/ICACHE_FETCHES
-
This group measures the locality of your instruction code with regard to the
L1 I-Cache.
collectors/likwid/groups/zen/L2.txt (new file)
@@ -0,0 +1,28 @@
SHORT L2 cache bandwidth in MBytes/s (experimental)

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC3  REQUESTS_TO_L2_GRP1_ALL_NO_PF

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
L2D load bandwidth [MBytes/s] 1.0E-06*PMC3*64.0/time
L2D load data volume [GBytes] 1.0E-09*PMC3*64.0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC3)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC3)*64.0

LONG
Formulas:
L2D load bandwidth [MBytes/s] = 1.0E-06*REQUESTS_TO_L2_GRP1_ALL_NO_PF*64.0/time
L2D load data volume [GBytes] = 1.0E-09*REQUESTS_TO_L2_GRP1_ALL_NO_PF*64.0
L2 bandwidth [MBytes/s] = 1.0E-06*(REQUESTS_TO_L2_GRP1_ALL_NO_PF)*64.0/time
L2 data volume [GBytes] = 1.0E-09*(REQUESTS_TO_L2_GRP1_ALL_NO_PF)*64.0
-
Profiling group to measure L2 cache bandwidth. There is no way to measure
the store traffic between L1 and L2.
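All the bandwidth groups in this set use the same pattern: each counted request moves one 64-byte cache line, so bandwidth and data volume follow directly from the event count. A small sketch with made-up numbers:

```python
# Bandwidth/volume derivation as used in the L2 (and other) groups.
# The request count and runtime below are invented for illustration.
CACHE_LINE_BYTES = 64.0

def bandwidth_mbytes_s(requests, runtime_s):
    # e.g. L2 bandwidth [MBytes/s] = 1.0E-06 * PMC3 * 64.0 / time
    return 1.0e-06 * requests * CACHE_LINE_BYTES / runtime_s

def data_volume_gbytes(requests):
    # e.g. L2 data volume [GBytes] = 1.0E-09 * PMC3 * 64.0
    return 1.0e-09 * requests * CACHE_LINE_BYTES

bw = bandwidth_mbytes_s(50_000_000, 2.0)  # 50M cache lines in 2 s
vol = data_volume_gbytes(50_000_000)
```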
collectors/likwid/groups/zen/L3.txt (new file)
@@ -0,0 +1,32 @@
SHORT L3 cache bandwidth in MBytes/s

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
CPMC0 L3_ACCESS
CPMC1 L3_MISS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
L3 access bandwidth [MBytes/s] 1.0E-06*CPMC0*64.0/time
L3 access data volume [GBytes] 1.0E-09*CPMC0*64.0
L3 access rate [%] (CPMC0/PMC0)*100.0
L3 miss rate [%] (CPMC1/PMC0)*100.0
L3 miss ratio [%] (CPMC1/CPMC0)*100.0

LONG
Formulas:
L3 access bandwidth [MBytes/s] = 1.0E-06*L3_ACCESS*64.0/time
L3 access data volume [GBytes] = 1.0E-09*L3_ACCESS*64.0
L3 access rate [%] = (L3_ACCESS/RETIRED_INSTRUCTIONS)*100
L3 miss rate [%] = (L3_MISS/RETIRED_INSTRUCTIONS)*100
L3 miss ratio [%] = (L3_MISS/L3_ACCESS)*100
-
Profiling group to measure L3 cache bandwidth. There is no way to measure
the store traffic between L2 and L3. The only two published L3 events are
L3_ACCESS and L3_MISS.
collectors/likwid/groups/zen/MEM.txt (new file)
@@ -0,0 +1,32 @@
SHORT Main memory bandwidth in MBytes/s (experimental)

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
DFC0  DATA_FROM_LOCAL_DRAM_CHANNEL
DFC1  DATA_TO_LOCAL_DRAM_CHANNEL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Memory bandwidth [MBytes/s] 1.0E-06*(DFC0+DFC1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(DFC0+DFC1)*64.0

LONG
Formulas:
Memory bandwidth [MBytes/s] = 1.0E-06*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0
-
Profiling group to measure the memory bandwidth drawn by all cores of a
socket. Since this group is based on Uncore events, it can only be measured
on a per-socket basis.
The group provides fairly accurate results for the total bandwidth and data
volume. AMD describes this metric as "approximate" in the documentation for
AMD Rome.

Be aware that although the events imply a traffic direction (FROM and TO),
they cannot be used to differentiate between read and write traffic. The
events will be renamed in the future to avoid this confusion.
collectors/likwid/groups/zen/MEM_DP.txt (new file)
@@ -0,0 +1,39 @@
SHORT Overview of arithmetic and main memory performance

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_SSE_AVX_FLOPS_DOUBLE_ALL
PMC3  MERGE
DFC0  DATA_FROM_LOCAL_DRAM_CHANNEL
DFC1  DATA_TO_LOCAL_DRAM_CHANNEL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
DP [MFLOP/s] 1.0E-06*(PMC2)/time
Memory bandwidth [MBytes/s] 1.0E-06*(DFC0+DFC1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(DFC0+DFC1)*64.0
Operational intensity PMC2/((DFC0+DFC1)*64.0)

LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_DOUBLE_ALL)/time
Memory bandwidth [MBytes/s] = 1.0E-06*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0
Operational intensity = RETIRED_SSE_AVX_FLOPS_DOUBLE_ALL/((DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0)
-
Profiling group to measure the memory bandwidth drawn by all cores of a
socket. Since this group is based on Uncore events, it can only be measured
on a per-socket basis.
The group provides fairly accurate results for the total bandwidth and data
volume. AMD describes this metric as "approximate" in the documentation for
AMD Rome.

Be aware that although the events imply a traffic direction (FROM and TO),
they cannot be used to differentiate between read and write traffic. The
events will be renamed in the future to avoid this confusion.
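The operational intensity metric relates FLOPs to bytes moved over the DRAM channels, where each channel event counts one 64-byte transfer. A minimal sketch with hypothetical counter values:

```python
# Operational intensity = FLOPs / DRAM bytes, as in the formula above.
# Each DRAM-channel event corresponds to one 64-byte transfer; the counter
# values used here are invented for illustration.
def operational_intensity(flops, dram_from, dram_to):
    return flops / ((dram_from + dram_to) * 64.0)

# 3.2 GFLOP against 25M cache-line transfers (1.6 GB of DRAM traffic).
oi = operational_intensity(3_200_000_000, 20_000_000, 5_000_000)
```

This is the x-axis quantity of a roofline plot: low values indicate memory-bound code, high values compute-bound code.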
collectors/likwid/groups/zen/MEM_SP.txt (new file)
@@ -0,0 +1,39 @@
SHORT Overview of arithmetic and main memory performance

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_SSE_AVX_FLOPS_SINGLE_ALL
PMC3  MERGE
DFC0  DATA_FROM_LOCAL_DRAM_CHANNEL
DFC1  DATA_TO_LOCAL_DRAM_CHANNEL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
SP [MFLOP/s] 1.0E-06*(PMC2)/time
Memory bandwidth [MBytes/s] 1.0E-06*(DFC0+DFC1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(DFC0+DFC1)*64.0
Operational intensity PMC2/((DFC0+DFC1)*64.0)

LONG
Formulas:
SP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_SINGLE_ALL)/time
Memory bandwidth [MBytes/s] = 1.0E-06*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0
Operational intensity = RETIRED_SSE_AVX_FLOPS_SINGLE_ALL/((DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0)
-
Profiling group to measure the memory bandwidth drawn by all cores of a
socket. Since this group is based on Uncore events, it can only be measured
on a per-socket basis.
The group provides fairly accurate results for the total bandwidth and data
volume. AMD describes this metric as "approximate" in the documentation for
AMD Rome.

Be aware that although the events imply a traffic direction (FROM and TO),
they cannot be used to differentiate between read and write traffic. The
events will be renamed in the future to avoid this confusion.
collectors/likwid/groups/zen/NUMA.txt (new file)
@@ -0,0 +1,35 @@
SHORT Local and remote memory traffic in MBytes/s (experimental)

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  DATA_CACHE_REFILLS_LOCAL_ALL
PMC1  DATA_CACHE_REFILLS_REMOTE_ALL
PMC2  HWPREF_DATA_CACHE_FILLS_LOCAL_ALL
PMC3  HWPREF_DATA_CACHE_FILLS_REMOTE_ALL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
Local bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC2)*64.0/time
Local data volume [GBytes] 1.0E-09*(PMC0+PMC2)*64.0
Remote bandwidth [MBytes/s] 1.0E-06*(PMC1+PMC3)*64.0/time
Remote data volume [GBytes] 1.0E-09*(PMC1+PMC3)*64.0
Total bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC2+PMC1+PMC3)*64.0/time
Total data volume [GBytes] 1.0E-09*(PMC0+PMC2+PMC1+PMC3)*64.0

LONG
Formulas:
Local bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL)*64.0/time
Local data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL)*64.0
Remote bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0/time
Remote data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0
Total bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL+DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0/time
Total data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL+DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0
-
Profiling group to measure NUMA traffic. The data sources for the local
metrics are the local L2, CCX, and memory; for the remote metrics they are
the remote CCX and memory. There are also events for software prefetches
from the local and remote domains, but AMD Zen provides only four counters.
collectors/likwid/groups/zen/TLB.txt (new file)
@@ -0,0 +1,39 @@
SHORT TLB miss rate/ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  DATA_CACHE_ACCESSES
PMC2  L1_DTLB_MISS_ANY_L2_HIT
PMC3  L1_DTLB_MISS_ANY_L2_MISS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/PMC0
L1 DTLB request rate PMC1/PMC0
L1 DTLB miss rate (PMC2+PMC3)/PMC0
L1 DTLB miss ratio (PMC2+PMC3)/PMC1
L2 DTLB request rate (PMC2+PMC3)/PMC0
L2 DTLB miss rate PMC3/PMC0
L2 DTLB miss ratio PMC3/(PMC2+PMC3)

LONG
Formulas:
L1 DTLB request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS
L1 DTLB miss rate = (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)/RETIRED_INSTRUCTIONS
L1 DTLB miss ratio = (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)/DATA_CACHE_ACCESSES
L2 DTLB request rate = (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)/RETIRED_INSTRUCTIONS
L2 DTLB miss rate = L1_DTLB_MISS_ANY_L2_MISS / RETIRED_INSTRUCTIONS
L2 DTLB miss ratio = L1_DTLB_MISS_ANY_L2_MISS / (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)
-
The L1 DTLB request rate tells you how data intensive your code is, i.e.
how many data accesses you have on average per instruction. The DTLB miss
rate gives a measure of how often a TLB miss occurred per instruction, and
the L1 DTLB miss ratio tells you how many of your memory references caused
a TLB miss on average.
NOTE: The L2 metrics are only relevant if the L2 DTLB request rate is
equal to the L1 DTLB miss rate!