Add likwid collector

2025-12-18 21:26:18 +01:00 · 2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions
--- a/collectors/likwid/groups/core2/BRANCH.txt
+++ b/collectors/likwid/groups/core2/BRANCH.txt
@@ -0,0 +1,30 @@
+SHORT Branch prediction miss rate/ratio
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  BR_INST_RETIRED_ANY
+PMC1  BR_INST_RETIRED_MISPRED
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Branch rate   PMC0/FIXC0
+Branch misprediction rate  PMC1/FIXC0
+Branch misprediction ratio  PMC1/PMC0
+Instructions per branch  FIXC0/PMC0
+
+LONG
+Formulas:
+Branch rate = BR_INST_RETIRED_ANY/INSTR_RETIRED_ANY
+Branch misprediction rate = BR_INST_RETIRED_MISPRED/INSTR_RETIRED_ANY
+Branch misprediction ratio = BR_INST_RETIRED_MISPRED/BR_INST_RETIRED_ANY
+Instructions per branch = INSTR_RETIRED_ANY/BR_INST_RETIRED_ANY
+-
+The rates state how often, on average, a branch or a mispredicted branch
+occurred per retired instruction. The branch misprediction ratio states
+directly what fraction of all branch instructions were mispredicted.
+Instructions per branch is 1/branch rate.
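As a rough illustration of the BRANCH formulas above, the following Python sketch derives the four metrics from raw counter values. The function name `branch_metrics` and the counter readings are hypothetical and chosen only for illustration; this is not part of LIKWID.

```python
def branch_metrics(instr_retired, br_retired, br_mispred):
    """Derive the BRANCH group metrics from raw counter values.

    instr_retired -> INSTR_RETIRED_ANY (FIXC0)
    br_retired    -> BR_INST_RETIRED_ANY (PMC0)
    br_mispred    -> BR_INST_RETIRED_MISPRED (PMC1)
    """
    if instr_retired <= 0 or br_retired <= 0:
        raise ValueError("instruction and branch counts must be positive")
    return {
        "branch_rate": br_retired / instr_retired,
        "branch_misprediction_rate": br_mispred / instr_retired,
        "branch_misprediction_ratio": br_mispred / br_retired,
        "instructions_per_branch": instr_retired / br_retired,
    }


# Hypothetical counter readings, not measured data.
metrics = branch_metrics(instr_retired=1_000_000, br_retired=200_000, br_mispred=10_000)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that the two "rate" metrics are normalized by all retired instructions, while the "ratio" is normalized by retired branches only.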
--- a/collectors/likwid/groups/core2/CACHE.txt
+++ b/collectors/likwid/groups/core2/CACHE.txt
@@ -0,0 +1,34 @@
+SHORT Data cache miss rate/ratio
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  L1D_REPL
+PMC1  L1D_ALL_REF
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+CPI  FIXC1/FIXC0
+data cache misses PMC0
+data cache request rate PMC1/FIXC0
+data cache miss rate PMC0/FIXC0
+data cache miss ratio PMC0/PMC1
+
+LONG
+Formulas:
+data cache request rate =  L1D_ALL_REF / INSTR_RETIRED_ANY
+data cache miss rate = L1D_REPL / INSTR_RETIRED_ANY
+data cache miss ratio =  L1D_REPL / L1D_ALL_REF
+-
+This group measures the locality of your data accesses with regard to the
+L1 cache. The data cache request rate tells you how data intensive your code
+is, i.e. how many data accesses you have on average per instruction.
+The data cache miss rate measures how often it was necessary to get
+cache lines from higher levels of the memory hierarchy. Finally, the
+data cache miss ratio tells you how many of your memory references required
+a cache line to be loaded from a higher level. While the data cache miss rate
+might be given by your algorithm, you should try to get the data cache miss
+ratio as low as possible by increasing your cache reuse.
+
--- a/collectors/likwid/groups/core2/CLOCK.txt
+++ b/collectors/likwid/groups/core2/CLOCK.txt
@@ -0,0 +1,19 @@
+SHORT CPU clock information
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+
+LONG
+Formulas:
+CPI = CPU_CLK_UNHALTED_CORE / INSTR_RETIRED_ANY
+-
+Most basic performance group measuring the clock frequency of the machine.
+
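To make the CLOCK metrics concrete, here is a minimal Python sketch of the three formulas. It assumes that `inverseClock` is the reciprocal of the nominal (base) clock in Hz, which is how the formulas read but is not stated in this diff; the function name and all counter values are hypothetical.

```python
def clock_metrics(fixc0, fixc1, fixc2, base_clock_hz):
    """Evaluate the CLOCK group formulas.

    fixc0 -> INSTR_RETIRED_ANY
    fixc1 -> CPU_CLK_UNHALTED_CORE
    fixc2 -> CPU_CLK_UNHALTED_REF
    """
    # Assumption: inverseClock = 1 / nominal clock frequency in Hz.
    inverse_clock = 1.0 / base_clock_hz
    return {
        "runtime_unhalted_s": fixc1 * inverse_clock,
        "clock_mhz": 1.0e-6 * (fixc1 / fixc2) / inverse_clock,
        "cpi": fixc1 / fixc0,
    }


# Hypothetical readings: the core ran 25% faster than its 2.4 GHz reference clock.
metrics = clock_metrics(fixc0=6.0e9, fixc1=3.0e9, fixc2=2.4e9, base_clock_hz=2.4e9)
print(metrics)
```

The ratio FIXC1/FIXC2 scales the nominal clock, so a value above 1 would indicate that the core ran above its reference frequency during the measurement.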
--- a/collectors/likwid/groups/core2/DATA.txt
+++ b/collectors/likwid/groups/core2/DATA.txt
@@ -0,0 +1,22 @@
+SHORT Load to store ratio
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  INST_RETIRED_LOADS
+PMC1  INST_RETIRED_STORES
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Load to store ratio PMC0/PMC1
+
+LONG
+Formulas:
+Load to store ratio = INST_RETIRED_LOADS/INST_RETIRED_STORES
+-
+This is a simple metric to determine your load to store ratio.
+
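All group files in this commit share the same layout: a SHORT line, then EVENTSET, METRICS, and LONG sections. The Python sketch below parses that layout into a dictionary. It is not LIKWID's own parser; it assumes that each EVENTSET line is a counter followed by an event name, and that the formula is the last whitespace-separated token of each METRICS line. `parse_group` is a hypothetical helper name.

```python
def parse_group(text):
    """Parse a performance group file into its four parts."""
    group = {"short": "", "eventset": {}, "metrics": {}, "long": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if section is None and line.startswith("SHORT"):
            group["short"] = line[len("SHORT"):].strip()
        elif line in ("EVENTSET", "METRICS", "LONG"):
            section = line
        elif section == "EVENTSET":
            # Assumption: "<counter> <event>", separated by whitespace.
            counter, event = line.split()
            group["eventset"][counter] = event
        elif section == "METRICS":
            # Assumption: the formula contains no spaces and comes last.
            name, formula = line.rsplit(None, 1)
            group["metrics"][name] = formula
        elif section == "LONG":
            group["long"].append(line)
    return group


# Contents of the DATA group from this diff, without the leading '+'.
DATA_GROUP = """SHORT Load to store ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0  INST_RETIRED_LOADS
PMC1  INST_RETIRED_STORES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
CPI  FIXC1/FIXC0
Load to store ratio PMC0/PMC1

LONG
Formulas:
Load to store ratio = INST_RETIRED_LOADS/INST_RETIRED_STORES
-
This is a simple metric to determine your load to store ratio.
"""

group = parse_group(DATA_GROUP)
print(group["metrics"])
```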
--- a/collectors/likwid/groups/core2/DIVIDE.txt
+++ b/collectors/likwid/groups/core2/DIVIDE.txt
@@ -0,0 +1,24 @@
+SHORT Divide unit information
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0 CYCLES_DIV_BUSY
+PMC1 DIV
+
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Number of divide ops PMC1
+Avg. divide unit usage duration PMC0/PMC1
+
+LONG
+Formulas:
+Number of divide ops = DIV
+Avg. divide unit usage duration = CYCLES_DIV_BUSY/DIV
+-
+This performance group measures the average latency of divide operations.
--- a/collectors/likwid/groups/core2/FLOPS_DP.txt
+++ b/collectors/likwid/groups/core2/FLOPS_DP.txt
@@ -0,0 +1,29 @@
+SHORT Double Precision MFLOP/s
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  SIMD_COMP_INST_RETIRED_PACKED_DOUBLE
+PMC1  SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+CPI  FIXC1/FIXC0
+DP [MFLOP/s]    1.0E-06*(PMC0*2.0+PMC1)/time
+Packed [MUOPS/s] 1.0E-06*PMC0/time
+Scalar [MUOPS/s] 1.0E-06*PMC1/time
+Vectorization ratio 100*PMC0/PMC1
+
+LONG
+Formulas:
+DP [MFLOP/s] = 1.0E-06*(SIMD_COMP_INST_RETIRED_PACKED_DOUBLE*2+SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE)/time
+Packed [MUOPS/s] = 1.0E-06*SIMD_COMP_INST_RETIRED_PACKED_DOUBLE/time
+Scalar [MUOPS/s] = 1.0E-06*SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE/time
+Vectorization ratio = 100*SIMD_COMP_INST_RETIRED_PACKED_DOUBLE/SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE
+-
+Profiling group to measure double precision SSE FLOPs. Don't forget that your code might also execute X87 FLOPs.
+The number of SIMD_COMP_INST_RETIRED_PACKED_DOUBLE events shows how well your code was vectorized.
+
+
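The DP [MFLOP/s] formula weights each packed instruction by 2, since a 128-bit SSE register holds two doubles, and each scalar instruction by 1. A minimal Python sketch of that arithmetic follows; the function name and the counter values are hypothetical.

```python
def dp_flops_metrics(packed, scalar, time_s):
    """Evaluate the FLOPS_DP throughput formulas.

    packed -> SIMD_COMP_INST_RETIRED_PACKED_DOUBLE (PMC0)
    scalar -> SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE (PMC1)
    time_s -> measured runtime in seconds
    """
    if time_s <= 0:
        raise ValueError("runtime must be positive")
    return {
        # Each packed SSE instruction performs two double precision operations.
        "dp_mflops": 1.0e-6 * (packed * 2.0 + scalar) / time_s,
        "packed_muops": 1.0e-6 * packed / time_s,
        "scalar_muops": 1.0e-6 * scalar / time_s,
    }


# Hypothetical counter readings over a 2 second run.
metrics = dp_flops_metrics(packed=4.0e8, scalar=2.0e8, time_s=2.0)
print(metrics)
```

The same structure would apply to the single precision group, with a weight of 4 for packed instructions.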
--- a/collectors/likwid/groups/core2/FLOPS_SP.txt
+++ b/collectors/likwid/groups/core2/FLOPS_SP.txt
@@ -0,0 +1,29 @@
+SHORT Single Precision MFLOP/s
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  SIMD_COMP_INST_RETIRED_PACKED_SINGLE
+PMC1  SIMD_COMP_INST_RETIRED_SCALAR_SINGLE
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+CPI  FIXC1/FIXC0
+SP [MFLOP/s] 1.0E-06*(PMC0*4.0+PMC1)/time
+Packed [MUOPS/s]   1.0E-06*PMC0/time
+Scalar [MUOPS/s] 1.0E-06*PMC1/time
+Vectorization ratio 100*PMC0/PMC1
+
+LONG
+Formulas:
+SP [MFLOP/s] = 1.0E-06*(SIMD_COMP_INST_RETIRED_PACKED_SINGLE*4+SIMD_COMP_INST_RETIRED_SCALAR_SINGLE)/time
+Packed [MUOPS/s] = 1.0E-06*SIMD_COMP_INST_RETIRED_PACKED_SINGLE/time
+Scalar [MUOPS/s] = 1.0E-06*SIMD_COMP_INST_RETIRED_SCALAR_SINGLE/time
+Vectorization ratio [%] = 100*SIMD_COMP_INST_RETIRED_PACKED_SINGLE/SIMD_COMP_INST_RETIRED_SCALAR_SINGLE
+-
+Profiling group to measure single precision SSE FLOPs. Don't forget that your code might also execute X87 FLOPs.
+The number of SIMD_COMP_INST_RETIRED_PACKED_SINGLE events shows how well your code was vectorized.
+
+
--- a/collectors/likwid/groups/core2/FLOPS_X87.txt
+++ b/collectors/likwid/groups/core2/FLOPS_X87.txt
@@ -0,0 +1,21 @@
+SHORT X87 MFLOP/s
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  X87_OPS_RETIRED_ANY
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+CPI  FIXC1/FIXC0
+X87 [MFLOP/s]  1.0E-06*PMC0/time
+
+LONG
+Formulas:
+X87 [MFLOP/s] = 1.0E-06*X87_OPS_RETIRED_ANY/time
+-
+Profiling group to measure X87 FLOPs. Note that this event also counts
+non-computational operations.
+
--- a/collectors/likwid/groups/core2/L2.txt
+++ b/collectors/likwid/groups/core2/L2.txt
@@ -0,0 +1,35 @@
+SHORT L2 cache bandwidth in MBytes/s
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  L1D_REPL
+PMC1  L1D_M_EVICT
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+CPI  FIXC1/FIXC0
+L2D load bandwidth [MBytes/s]  1.0E-06*PMC0*64.0/time
+L2D load data volume [GBytes]  1.0E-09*PMC0*64.0
+L2D evict bandwidth [MBytes/s]  1.0E-06*PMC1*64.0/time
+L2D evict data volume [GBytes]  1.0E-09*PMC1*64.0
+L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
+L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
+
+LONG
+Formulas:
+L2D load bandwidth [MBytes/s] = 1.0E-06*L1D_REPL*64.0/time
+L2D load data volume [GBytes] = 1.0E-09*L1D_REPL*64.0
+L2D evict bandwidth [MBytes/s] = 1.0E-06*L1D_M_EVICT*64.0/time
+L2D evict data volume [GBytes] = 1.0E-09*L1D_M_EVICT*64.0
+L2 bandwidth [MBytes/s] = 1.0E-06*(L1D_REPL+L1D_M_EVICT)*64/time
+L2 data volume [GBytes] = 1.0E-09*(L1D_REPL+L1D_M_EVICT)*64.0
+-
+Profiling group to measure L2 cache bandwidth. The bandwidth is
+computed from the number of cache lines allocated in the L1 and the
+number of modified cache lines evicted from the L1.
+Note that this bandwidth also includes data transfers due to a
+write allocate load on a store miss in L1.
+
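The L2 formulas convert cache line counts into bytes using the 64-byte line size that appears in the metrics. The Python sketch below works through that conversion; the function name and the counter values are hypothetical.

```python
CACHE_LINE_BYTES = 64.0  # line size used by the L2 group formulas


def l2_metrics(l1d_repl, l1d_m_evict, time_s):
    """Evaluate the L2 group bandwidth and data volume formulas.

    l1d_repl    -> L1D_REPL (PMC0), cache lines allocated in the L1
    l1d_m_evict -> L1D_M_EVICT (PMC1), modified cache lines evicted from the L1
    """
    if time_s <= 0:
        raise ValueError("runtime must be positive")
    total_lines = l1d_repl + l1d_m_evict
    return {
        "load_bandwidth_mbytes_s": 1.0e-6 * l1d_repl * CACHE_LINE_BYTES / time_s,
        "load_volume_gbytes": 1.0e-9 * l1d_repl * CACHE_LINE_BYTES,
        "evict_bandwidth_mbytes_s": 1.0e-6 * l1d_m_evict * CACHE_LINE_BYTES / time_s,
        "evict_volume_gbytes": 1.0e-9 * l1d_m_evict * CACHE_LINE_BYTES,
        "l2_bandwidth_mbytes_s": 1.0e-6 * total_lines * CACHE_LINE_BYTES / time_s,
        "l2_volume_gbytes": 1.0e-9 * total_lines * CACHE_LINE_BYTES,
    }


# Hypothetical counter readings over a 1 second run.
metrics = l2_metrics(l1d_repl=1.0e7, l1d_m_evict=5.0e6, time_s=1.0)
print(metrics)
```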
--- a/collectors/likwid/groups/core2/L2CACHE.txt
+++ b/collectors/likwid/groups/core2/L2CACHE.txt
@@ -0,0 +1,34 @@
+SHORT L2 cache miss rate/ratio
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  L2_RQSTS_THIS_CORE_ALL_MESI
+PMC1  L2_RQSTS_SELF_I_STATE
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+L2 request rate PMC0/FIXC0
+L2 miss rate PMC1/FIXC0
+L2 miss ratio PMC1/PMC0
+
+LONG
+Formulas:
+L2 request rate =  L2_RQSTS_THIS_CORE_ALL_MESI / INSTR_RETIRED_ANY
+L2 miss rate  = L2_RQSTS_SELF_I_STATE / INSTR_RETIRED_ANY
+L2 miss ratio = L2_RQSTS_SELF_I_STATE / L2_RQSTS_THIS_CORE_ALL_MESI
+-
+This group measures the locality of your data accesses with regard to the
+L2 cache. The L2 request rate tells you how data intensive your code is,
+i.e. how many L2 requests you have on average per instruction.
+The L2 miss rate measures how often it was necessary to get
+cache lines from memory. Finally, the L2 miss ratio tells you how many of your
+L2 requests required a cache line to be loaded from memory.
+While the L2 miss rate might be given by your algorithm, you should
+try to get the L2 miss ratio as low as possible by increasing your cache reuse.
+
+
--- a/collectors/likwid/groups/core2/MEM.txt
+++ b/collectors/likwid/groups/core2/MEM.txt
@@ -0,0 +1,23 @@
+SHORT Main memory bandwidth in MBytes/s
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  BUS_TRANS_MEM_THIS_CORE_THIS_A
+PMC1  BUS_TRANS_WB_THIS_CORE_ALL_A
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Memory bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
+Memory data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
+
+LONG
+Formulas:
+Memory bandwidth [MBytes/s] = 1.0E-06*(BUS_TRANS_MEM_THIS_CORE_THIS_A+BUS_TRANS_WB_THIS_CORE_ALL_A)*64.0/time
+Memory data volume [GBytes] = 1.0E-09*(BUS_TRANS_MEM_THIS_CORE_THIS_A+BUS_TRANS_WB_THIS_CORE_ALL_A)*64.0
+-
+Profiling group to measure memory bandwidth drawn by this core.
--- a/collectors/likwid/groups/core2/TLB.txt
+++ b/collectors/likwid/groups/core2/TLB.txt
@@ -0,0 +1,29 @@
+SHORT TLB miss rate/ratio
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  DTLB_MISSES_ANY
+PMC1  L1D_ALL_REF
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+CPI  FIXC1/FIXC0
+L1 DTLB request rate    PMC1/FIXC0
+DTLB miss rate    PMC0/FIXC0
+L1 DTLB miss ratio   PMC0/PMC1
+
+LONG
+Formulas:
+L1 DTLB request rate =  L1D_ALL_REF / INSTR_RETIRED_ANY
+DTLB miss rate  = DTLB_MISSES_ANY / INSTR_RETIRED_ANY
+L1 DTLB miss ratio  =  DTLB_MISSES_ANY / L1D_ALL_REF
+-
+The L1 DTLB request rate tells you how data intensive your code is,
+i.e. how many data accesses you have on average per instruction.
+The DTLB miss rate measures how often a TLB miss occurred
+per instruction. Finally, the L1 DTLB miss ratio tells you how many
+of your memory references caused a TLB miss on average.
+
--- a/collectors/likwid/groups/core2/UOPS.txt
+++ b/collectors/likwid/groups/core2/UOPS.txt
@@ -0,0 +1,26 @@
+SHORT UOPs execution info
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  RS_UOPS_DISPATCHED_ALL
+PMC1  UOPS_RETIRED_ANY
+
+
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Executed UOPs PMC0
+Retired UOPs PMC1
+
+LONG
+Formulas:
+Executed UOPs = RS_UOPS_DISPATCHED_ALL
+Retired UOPs = UOPS_RETIRED_ANY
+-
+This performance group measures the executed and retired micro-ops. The difference
+between executed and retired uOPs is the number of speculatively executed uOPs.
--- a/collectors/likwid/groups/core2/UOPS_RETIRE.txt
+++ b/collectors/likwid/groups/core2/UOPS_RETIRE.txt
@@ -0,0 +1,25 @@
+SHORT UOPs retirement
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+PMC0  UOPS_RETIRED_USED_CYCLES
+PMC1  UOPS_RETIRED_STALL_CYCLES
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Used cycles ratio PMC0/FIXC1
+Unused cycles ratio PMC1/FIXC1
+
+
+LONG
+Formulas:
+Used cycles ratio = UOPS_RETIRED_USED_CYCLES/CPU_CLK_UNHALTED_CORE
+Unused cycles ratio = UOPS_RETIRED_STALL_CYCLES/CPU_CLK_UNHALTED_CORE
+-
+This performance group returns the ratios of used and unused CPU cycles. Here
+unused cycles are cycles where no operation is performed due to some stall.