Add likwid collector

Thomas Roehl
2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions

@@ -0,0 +1,30 @@
SHORT Branch prediction miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 BR_INST_RETIRED_ANY
PMC1 BR_INST_RETIRED_MISPRED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Branch rate PMC0/FIXC0
Branch misprediction rate PMC1/FIXC0
Branch misprediction ratio PMC1/PMC0
Instructions per branch FIXC0/PMC0
LONG
Formulas:
Branch rate = BR_INST_RETIRED_ANY/INSTR_RETIRED_ANY
Branch misprediction rate = BR_INST_RETIRED_MISPRED/INSTR_RETIRED_ANY
Branch misprediction ratio = BR_INST_RETIRED_MISPRED/BR_INST_RETIRED_ANY
Instructions per branch = INSTR_RETIRED_ANY/BR_INST_RETIRED_ANY
-
The rates state how often, on average, a branch or a mispredicted branch occurred
per retired instruction. The branch misprediction ratio relates the number of
mispredicted branches directly to the number of all branch instructions.
Instructions per branch is 1/branch rate.
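
As an illustration of the formulas above, the following Python sketch derives the four branch metrics from raw counter values. The function name, the dictionary keys, and the sample counts are made up for this example and are not part of LIKWID.

```python
def branch_metrics(instr_retired, br_retired, br_mispred):
    """Derive the branch metrics from raw counts.

    instr_retired -> INSTR_RETIRED_ANY (FIXC0)
    br_retired    -> BR_INST_RETIRED_ANY (PMC0)
    br_mispred    -> BR_INST_RETIRED_MISPRED (PMC1)
    """
    return {
        "branch_rate": br_retired / instr_retired,
        "mispred_rate": br_mispred / instr_retired,
        "mispred_ratio": br_mispred / br_retired,
        "instr_per_branch": instr_retired / br_retired,
    }


# Hypothetical counts: one branch every five instructions,
# and one in twenty branches mispredicted.
m = branch_metrics(instr_retired=1_000_000, br_retired=200_000, br_mispred=10_000)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these sample counts, the misprediction rate (per instruction) is much smaller than the misprediction ratio (per branch), which is why both are reported.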

@@ -0,0 +1,34 @@
SHORT Data cache miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L1D_REPL
PMC1 L1D_ALL_REF
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
CPI FIXC1/FIXC0
data cache misses PMC0
data cache request rate PMC1/FIXC0
data cache miss rate PMC0/FIXC0
data cache miss ratio PMC0/PMC1
LONG
Formulas:
data cache request rate = L1D_ALL_REF / INSTR_RETIRED_ANY
data cache miss rate = L1D_REPL / INSTR_RETIRED_ANY
data cache miss ratio = L1D_REPL / L1D_ALL_REF
-
This group measures the locality of your data accesses with regard to the
L1 cache. The data cache request rate tells you how data intensive your code is,
i.e. how many data accesses you have on average per instruction.
The data cache miss rate measures how often it was necessary to get
cache lines from higher levels of the memory hierarchy. Finally, the
data cache miss ratio tells you how many of your memory references required
a cache line to be loaded from a higher level. While the data cache miss rate
might be given by your algorithm, you should try to get the data cache miss ratio
as low as possible by increasing your cache reuse.
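
To make the difference between miss rate and miss ratio concrete, here is a small Python sketch comparing two invented runs. The function name and all counter values are assumptions chosen for illustration only.

```python
def l1d_metrics(instr_retired, l1d_repl, l1d_all_ref):
    """Compute the L1 data cache metrics from raw counts.

    instr_retired -> INSTR_RETIRED_ANY (FIXC0)
    l1d_repl      -> L1D_REPL (PMC0)
    l1d_all_ref   -> L1D_ALL_REF (PMC1)
    """
    return {
        "request_rate": l1d_all_ref / instr_retired,
        "miss_rate": l1d_repl / instr_retired,
        "miss_ratio": l1d_repl / l1d_all_ref,
    }


# Two made-up runs with the same number of misses per instruction,
# but run_b issues far fewer data references overall.
run_a = l1d_metrics(instr_retired=1_000_000, l1d_repl=20_000, l1d_all_ref=400_000)
run_b = l1d_metrics(instr_retired=1_000_000, l1d_repl=20_000, l1d_all_ref=100_000)

print("run_a:", run_a)
print("run_b:", run_b)
```

Both runs have the same miss rate, yet in run_b a larger share of the references miss the cache, which only the miss ratio shows.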

@@ -0,0 +1,19 @@
SHORT CPU clock information
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
LONG
Formulas:
CPI = CPU_CLK_UNHALTED_CORE / INSTR_RETIRED_ANY
-
Most basic performance group measuring the clock frequency of the machine.
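
The clock metrics in this group can be reproduced by hand. The sketch below assumes that inverseClock is the reciprocal of the nominal (reference) clock in Hz; that interpretation, the function name, and the sample values are assumptions for this example.

```python
def clock_metrics(instr_retired, core_cycles, ref_cycles, inverse_clock):
    """Evaluate the METRICS formulas of the clock group.

    instr_retired -> INSTR_RETIRED_ANY (FIXC0)
    core_cycles   -> CPU_CLK_UNHALTED_CORE (FIXC1)
    ref_cycles    -> CPU_CLK_UNHALTED_REF (FIXC2)
    inverse_clock -> assumed to be 1 / nominal clock in Hz
    """
    return {
        "runtime_unhalted_s": core_cycles * inverse_clock,
        "clock_mhz": 1.0e-06 * (core_cycles / ref_cycles) / inverse_clock,
        "cpi": core_cycles / instr_retired,
    }


# Hypothetical run on a core with a 2.4 GHz nominal clock that
# executed 25% more core cycles than reference cycles.
c = clock_metrics(
    instr_retired=2.0e9,
    core_cycles=3.0e9,
    ref_cycles=2.4e9,
    inverse_clock=1.0 / 2.4e9,
)
print(c)
```

Under these assumptions the ratio FIXC1/FIXC2 scales the nominal clock, so the example reports an effective clock above the nominal one.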

@@ -0,0 +1,22 @@
SHORT Load to store ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 INST_RETIRED_LOADS
PMC1 INST_RETIRED_STORES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Load to store ratio PMC0/PMC1
LONG
Formulas:
Load to store ratio = INST_RETIRED_LOADS/INST_RETIRED_STORES
-
This is a simple metric to determine your load to store ratio.

@@ -0,0 +1,24 @@
SHORT Divide unit information
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 CYCLES_DIV_BUSY
PMC1 DIV
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Number of divide ops PMC1
Avg. divide unit usage duration PMC0/PMC1
LONG
Formulas:
Number of divide ops = DIV
Avg. divide unit usage duration = CYCLES_DIV_BUSY/DIV
-
This performance group measures the average latency of divide operations.
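
A minimal Python sketch of the average divide duration, with a guard for measurements that contain no divide operations. The function name and the sample counts are hypothetical.

```python
def avg_divide_duration(cycles_div_busy, div_ops):
    """Average cycles the divide unit was busy per divide operation.

    cycles_div_busy -> CYCLES_DIV_BUSY (PMC0)
    div_ops         -> DIV (PMC1)
    Returns None when no divide was counted, to avoid dividing by zero.
    """
    if div_ops == 0:
        return None
    return cycles_div_busy / div_ops


# Hypothetical counts: 100 divides keeping the unit busy for 4000 cycles.
print(avg_divide_duration(4000, 100))
print(avg_divide_duration(0, 0))
```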

@@ -0,0 +1,29 @@
SHORT Double Precision MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 SIMD_COMP_INST_RETIRED_PACKED_DOUBLE
PMC1 SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
CPI FIXC1/FIXC0
DP [MFLOP/s] 1.0E-06*(PMC0*2.0+PMC1)/time
Packed [MUOPS/s] 1.0E-06*PMC0/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Vectorization ratio 100*PMC0/PMC1
LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(SIMD_COMP_INST_RETIRED_PACKED_DOUBLE*2+SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE)/time
Packed [MUOPS/s] = 1.0E-06*SIMD_COMP_INST_RETIRED_PACKED_DOUBLE/time
Scalar [MUOPS/s] = 1.0E-06*SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE/time
Vectorization ratio = 100*SIMD_COMP_INST_RETIRED_PACKED_DOUBLE/SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE
-
Profiling group to measure double precision SSE FLOPs. Don't forget that your code might also execute X87 FLOPs.
The number of SIMD_COMP_INST_RETIRED_PACKED_DOUBLE events shows how well your code was vectorized.
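
The following Python sketch evaluates the double precision formulas for one invented measurement. A packed SSE instruction operates on two doubles, hence the factor 2. The function name and sample counts are assumptions for illustration.

```python
def dp_flops_metrics(packed, scalar, time_s):
    """Evaluate the double precision metrics of this group.

    packed -> SIMD_COMP_INST_RETIRED_PACKED_DOUBLE (PMC0)
    scalar -> SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE (PMC1)
    time_s -> measured runtime in seconds
    """
    return {
        "dp_mflops": 1.0e-06 * (packed * 2.0 + scalar) / time_s,
        "packed_muops": 1.0e-06 * packed / time_s,
        "scalar_muops": 1.0e-06 * scalar / time_s,
        # As defined in this group: packed count relative to scalar count.
        "vectorization_ratio": 100 * packed / scalar,
    }


# Hypothetical run: 4e8 packed and 2e8 scalar instructions in 2 seconds.
f = dp_flops_metrics(packed=4.0e8, scalar=2.0e8, time_s=2.0)
print(f)
```

Note that the vectorization ratio as written here compares packed to scalar instructions rather than packed to all SIMD instructions, so it can exceed 100 and is undefined for fully vectorized code with no scalar instructions.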

@@ -0,0 +1,29 @@
SHORT Single Precision MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 SIMD_COMP_INST_RETIRED_PACKED_SINGLE
PMC1 SIMD_COMP_INST_RETIRED_SCALAR_SINGLE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
CPI FIXC1/FIXC0
SP [MFLOP/s] 1.0E-06*(PMC0*4.0+PMC1)/time
Packed [MUOPS/s] 1.0E-06*PMC0/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Vectorization ratio 100*PMC0/PMC1
LONG
Formulas:
SP [MFLOP/s] = 1.0E-06*(SIMD_COMP_INST_RETIRED_PACKED_SINGLE*4+SIMD_COMP_INST_RETIRED_SCALAR_SINGLE)/time
Packed [MUOPS/s] = 1.0E-06*SIMD_COMP_INST_RETIRED_PACKED_SINGLE/time
Scalar [MUOPS/s] = 1.0E-06*SIMD_COMP_INST_RETIRED_SCALAR_SINGLE/time
Vectorization ratio [%] = 100*SIMD_COMP_INST_RETIRED_PACKED_SINGLE/SIMD_COMP_INST_RETIRED_SCALAR_SINGLE
-
Profiling group to measure single precision SSE FLOPs. Don't forget that your code might also execute X87 FLOPs.
The number of SIMD_COMP_INST_RETIRED_PACKED_SINGLE events shows how well your code was vectorized.

@@ -0,0 +1,21 @@
SHORT X87 MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 X87_OPS_RETIRED_ANY
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
CPI FIXC1/FIXC0
X87 [MFLOP/s] 1.0E-06*PMC0/time
LONG
Formulas:
X87 [MFLOP/s] = 1.0E-06*X87_OPS_RETIRED_ANY/time
-
Profiling group to measure X87 FLOPs. Note that this event also counts
non-computational operations.

@@ -0,0 +1,35 @@
SHORT L2 cache bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L1D_REPL
PMC1 L1D_M_EVICT
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
CPI FIXC1/FIXC0
L2D load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2D load data volume [GBytes] 1.0E-09*PMC0*64.0
L2D evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L2D evict data volume [GBytes] 1.0E-09*PMC1*64.0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
L2D load bandwidth [MBytes/s] = 1.0E-06*L1D_REPL*64.0/time
L2D load data volume [GBytes] = 1.0E-09*L1D_REPL*64.0
L2D evict bandwidth [MBytes/s] = 1.0E-06*L1D_M_EVICT*64.0/time
L2D evict data volume [GBytes] = 1.0E-09*L1D_M_EVICT*64.0
L2 bandwidth [MBytes/s] = 1.0E-06*(L1D_REPL+L1D_M_EVICT)*64/time
L2 data volume [GBytes] = 1.0E-09*(L1D_REPL+L1D_M_EVICT)*64.0
-
Profiling group to measure L2 cache bandwidth. The bandwidth is
computed from the number of cache lines allocated in the L1 and the
number of modified cache lines evicted from the L1.
Note that this bandwidth also includes data transfers due to a
write-allocate load on a store miss in L1.
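
As a worked example of the bandwidth formulas, this Python sketch converts cache line counts into bandwidth and data volume using the 64-byte line size from the formulas above. The function name and the sample counts are invented for illustration.

```python
CACHE_LINE_BYTES = 64.0  # line size used in the group's formulas


def l2_traffic(l1d_repl, l1d_m_evict, time_s):
    """Evaluate the L2 bandwidth and data volume metrics.

    l1d_repl    -> L1D_REPL (PMC0), lines allocated in the L1
    l1d_m_evict -> L1D_M_EVICT (PMC1), modified lines evicted from the L1
    time_s      -> measured runtime in seconds
    """
    total_lines = l1d_repl + l1d_m_evict
    return {
        "load_bw_mbytes_s": 1.0e-06 * l1d_repl * CACHE_LINE_BYTES / time_s,
        "evict_bw_mbytes_s": 1.0e-06 * l1d_m_evict * CACHE_LINE_BYTES / time_s,
        "total_bw_mbytes_s": 1.0e-06 * total_lines * CACHE_LINE_BYTES / time_s,
        "total_volume_gbytes": 1.0e-09 * total_lines * CACHE_LINE_BYTES,
    }


# Hypothetical run: 1e7 lines loaded and 5e6 lines evicted in 0.5 seconds.
t = l2_traffic(l1d_repl=1.0e7, l1d_m_evict=5.0e6, time_s=0.5)
print(t)
```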

@@ -0,0 +1,34 @@
SHORT L2 cache miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L2_RQSTS_THIS_CORE_ALL_MESI
PMC1 L2_RQSTS_SELF_I_STATE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L2 request rate PMC0/FIXC0
L2 miss rate PMC1/FIXC0
L2 miss ratio PMC1/PMC0
LONG
Formulas:
L2 request rate = L2_RQSTS_THIS_CORE_ALL_MESI / INSTR_RETIRED_ANY
L2 miss rate = L2_RQSTS_SELF_I_STATE / INSTR_RETIRED_ANY
L2 miss ratio = L2_RQSTS_SELF_I_STATE / L2_RQSTS_THIS_CORE_ALL_MESI
-
This group measures the locality of your data accesses with regard to the
L2 cache. The L2 request rate tells you how data intensive your code is,
i.e. how many data accesses you have on average per instruction.
The L2 miss rate measures how often it was necessary to get
cache lines from memory. Finally, the L2 miss ratio tells you how many of your
L2 requests required a cache line to be loaded from a higher level.
While the L2 miss rate might be given by your algorithm, you should
try to get the L2 miss ratio as low as possible by increasing your cache reuse.

@@ -0,0 +1,23 @@
SHORT Main memory bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 BUS_TRANS_MEM_THIS_CORE_THIS_A
PMC1 BUS_TRANS_WB_THIS_CORE_ALL_A
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Memory bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
Memory bandwidth [MBytes/s] = 1.0E-06*(BUS_TRANS_MEM_THIS_CORE_THIS_A+BUS_TRANS_WB_THIS_CORE_ALL_A)*64.0/time
Memory data volume [GBytes] = 1.0E-09*(BUS_TRANS_MEM_THIS_CORE_THIS_A+BUS_TRANS_WB_THIS_CORE_ALL_A)*64.0
-
Profiling group to measure memory bandwidth drawn by this core.

@@ -0,0 +1,29 @@
SHORT TLB miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 DTLB_MISSES_ANY
PMC1 L1D_ALL_REF
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
CPI FIXC1/FIXC0
L1 DTLB request rate PMC1/FIXC0
DTLB miss rate PMC0/FIXC0
L1 DTLB miss ratio PMC0/PMC1
LONG
Formulas:
L1 DTLB request rate = L1D_ALL_REF / INSTR_RETIRED_ANY
DTLB miss rate = DTLB_MISSES_ANY / INSTR_RETIRED_ANY
L1 DTLB miss ratio = DTLB_MISSES_ANY / L1D_ALL_REF
-
The L1 DTLB request rate tells you how data intensive your code is,
i.e. how many data accesses you have on average per instruction.
The DTLB miss rate measures how often a TLB miss occurred
per instruction. Finally, the L1 DTLB miss ratio tells you how many
of your memory references caused a TLB miss on average.

@@ -0,0 +1,26 @@
SHORT UOPs execution info
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 RS_UOPS_DISPATCHED_ALL
PMC1 UOPS_RETIRED_ANY
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Executed UOPs PMC0
Retired UOPs PMC1
LONG
Formulas:
Executed UOPs = RS_UOPS_DISPATCHED_ALL
Retired UOPs = UOPS_RETIRED_ANY
-
This performance group measures the executed and retired micro-ops. The difference
between executed and retired uOPs is the number of speculatively executed uOPs.
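
A short Python sketch of how the two counts can be combined, following the description above. The function name, the derived share, and the sample counts are assumptions added for illustration and are not metrics defined by this group.

```python
def uop_summary(executed, retired):
    """Summarize executed versus retired micro-ops.

    executed -> RS_UOPS_DISPATCHED_ALL (PMC0)
    retired  -> UOPS_RETIRED_ANY (PMC1)
    """
    not_retired = executed - retired
    # Share of executed uOPs that never retired; 0.0 if nothing executed.
    share = not_retired / executed if executed else 0.0
    return {"not_retired": not_retired, "not_retired_share": share}


# Hypothetical counts: 1.2e6 uOPs executed, 1.0e6 of them retired.
s = uop_summary(executed=1_200_000, retired=1_000_000)
print(s)
```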

@@ -0,0 +1,25 @@
SHORT UOPs retirement
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_RETIRED_USED_CYCLES
PMC1 UOPS_RETIRED_STALL_CYCLES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Used cycles ratio PMC0/FIXC1
Unused cycles ratio PMC1/FIXC1
LONG
Formulas:
Used cycles ratio = UOPS_RETIRED_USED_CYCLES/CPU_CLK_UNHALTED_CORE
Unused cycles ratio = UOPS_RETIRED_STALL_CYCLES/CPU_CLK_UNHALTED_CORE
-
This performance group returns the ratios of used and unused CPU cycles. Here
unused cycles are cycles where no operation is performed due to some stall.