Add likwid collector

This commit is contained in:
Thomas Roehl
2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions

View File

@@ -0,0 +1,32 @@
SHORT Branch prediction miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 BR_INST_RETIRED_ALL_BRANCHES
PMC1 BR_MISP_RETIRED_ALL_BRANCHES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Branch rate PMC0/FIXC0
Branch misprediction rate PMC1/FIXC0
Branch misprediction ratio PMC1/PMC0
Instructions per branch FIXC0/PMC0
LONG
Formulas:
Branch rate = BR_INST_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction rate = BR_MISP_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction ratio = BR_MISP_RETIRED_ALL_BRANCHES/BR_INST_RETIRED_ALL_BRANCHES
Instructions per branch = INSTR_RETIRED_ANY/BR_INST_RETIRED_ALL_BRANCHES
-
The rates state how often on average a branch or a mispredicted branch occurred
per instruction retired in total. The branch misprediction ratio sets directly
into relation what ratio of all branch instruction where mispredicted.
Instructions per branch is 1/branch rate.

View File

@@ -0,0 +1,23 @@
SHORT Load to store ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 MEM_INST_RETIRED_ALL_LOADS
PMC1 MEM_INST_RETIRED_ALL_STORES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Load to store ratio PMC0/PMC1
LONG
Formulas:
Load to store ratio = MEM_INST_RETIRED_ALL_LOADS/MEM_INST_RETIRED_ALL_STORES
-
This is a metric to determine your load to store ratio.

View File

@@ -0,0 +1,25 @@
SHORT Divide unit information
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 ARITH_DIVIDER_COUNT
PMC1 ARITH_DIVIDER_ACTIVE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Number of divide ops PMC0
Avg. divide unit usage duration PMC1/PMC0
LONG
Formulas:
Number of divide ops = ARITH_DIVIDER_COUNT
Avg. divide unit usage duration = ARITH_DIVIDER_ACTIVE/ARITH_DIVIDER_COUNT
-
This performance group measures the average latency of divide operations

View File

@@ -0,0 +1,26 @@
SHORT Packed AVX MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE
PMC1 FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE
PMC2 FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE
PMC3 FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Packed SP [MFLOP/s] 1.0E-06*(PMC0*8.0+PMC2*16.0)/time
Packed DP [MFLOP/s] 1.0E-06*(PMC1*4.0+PMC3*8.0)/time
LONG
Formulas:
Packed SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime
Packed DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
-
Packed 32b AVX FLOPs rates.

View File

@@ -0,0 +1,35 @@
SHORT Double Precision MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE
PMC1 FP_ARITH_INST_RETIRED_SCALAR_DOUBLE
PMC2 FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE
PMC3 FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
DP [MFLOP/s] 1.0E-06*(PMC0*2.0+PMC1+PMC2*4.0+PMC3*8.0)/time
AVX DP [MFLOP/s] 1.0E-06*(PMC2*4.0+PMC3*8.0)/time
AVX512 DP [MFLOP/s] 1.0E-06*(PMC3*8.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2+PMC3)/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Vectorization ratio 100*(PMC0+PMC2+PMC3)/(PMC0+PMC1+PMC2+PMC3)
LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE*2+FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
AVX DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
AVX512 DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_DOUBLE/runtime
Vectorization ratio = 100*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE)/(FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE)
-
SSE scalar and packed double precision FLOP rates.

View File

@@ -0,0 +1,35 @@
SHORT Single Precision MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE
PMC1 FP_ARITH_INST_RETIRED_SCALAR_SINGLE
PMC2 FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE
PMC3 FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
SP [MFLOP/s] 1.0E-06*(PMC0*4.0+PMC1+PMC2*8.0+PMC3*16.0)/time
AVX SP [MFLOP/s] 1.0E-06*(PMC2*8.0+PMC3*16.0)/time
AVX512 SP [MFLOP/s] 1.0E-06*(PMC3*16.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2+PMC3)/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Vectorization ratio 100*(PMC0+PMC2+PMC3)/(PMC0+PMC1+PMC2+PMC3)
LONG
Formulas:
SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE*4+FP_ARITH_INST_RETIRED_SCALAR_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime
AVX SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime
AVX512 SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_SINGLE/runtime
Vectorization ratio [%] = 100*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE)/(FP_ARITH_INST_RETIRED_SCALAR_SINGLE+FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE)
-
SSE scalar and packed single precision FLOP rates.

View File

@@ -0,0 +1,39 @@
SHORT L2 cache bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 L1D_REPLACEMENT
PMC1 L2_TRANS_L1D_WB
PMC2 ICACHE_64B_IFTAG_MISS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L2D load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2D load data volume [GBytes] 1.0E-09*PMC0*64.0
L2D evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L2D evict data volume [GBytes] 1.0E-09*PMC1*64.0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1+PMC2)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1+PMC2)*64.0
LONG
Formulas:
L2D load bandwidth [MBytes/s] = 1.0E-06*L1D_REPLACEMENT*64.0/time
L2D load data volume [GBytes] = 1.0E-09*L1D_REPLACEMENT*64.0
L2D evict bandwidth [MBytes/s] = 1.0E-06*L2_TRANS_L1D_WB*64.0/time
L2D evict data volume [GBytes] = 1.0E-09*L2_TRANS_L1D_WB*64.0
L2 bandwidth [MBytes/s] = 1.0E-06*(L1D_REPLACEMENT+L2_TRANS_L1D_WB+ICACHE_64B_IFTAG_MISS)*64/time
L2 data volume [GBytes] = 1.0E-09*(L1D_REPLACEMENT+L2_TRANS_L1D_WB+ICACHE_64B_IFTAG_MISS)*64
-
Profiling group to measure L2 cache bandwidth. The bandwidth is computed by the
number of cache line allocated in the L1 and the number of modified cache lines
evicted from the L1. The group also output total data volume transferred between
L2 and L1. Note that this bandwidth also includes data transfers due to a write
allocate load on a store miss in L1 and traffic caused by misses in the
L1 instruction cache.