Add likwid collector

Thomas Roehl
2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions

@@ -0,0 +1,30 @@
SHORT Branch prediction miss rate/ratio
EVENTSET
PMC1 PM_BR_PRED
PMC2 PM_IOPS_CMPL
PMC3 PM_BR_MPRED_CMPL
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
Branch rate (PMC1)/PMC4
Branch misprediction rate PMC3/PMC4
Branch misprediction ratio PMC3/(PMC1)
Instructions per branch PMC4/(PMC1)
Operations per branch PMC2/PMC1
LONG
Formulas:
Branch rate = PM_BR_PRED/PM_RUN_INST_CMPL
Branch misprediction rate = PM_BR_MPRED_CMPL/PM_RUN_INST_CMPL
Branch misprediction ratio = PM_BR_MPRED_CMPL/PM_BR_PRED
Instructions per branch = PM_RUN_INST_CMPL/PM_BR_PRED
Operations per branch = PM_IOPS_CMPL/PM_BR_PRED
-
The rates state how often, on average, a branch or a mispredicted branch occurred
per completed instruction. The branch misprediction ratio relates the number of
mispredicted branches directly to the number of all branch instructions.
Instructions per branch is 1/Branch rate.
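
For illustration only (not part of the group file), the metric definitions above can be sketched in Python. The function name and argument names are hypothetical; the arguments stand for the raw counts of PM_BR_PRED, PM_BR_MPRED_CMPL, PM_RUN_INST_CMPL, PM_RUN_CYC and PM_IOPS_CMPL.

```python
def branch_metrics(br_pred, br_mpred, inst_cmpl, run_cyc, iops_cmpl):
    """Derive the branch metrics from raw counter values.

    All argument names are illustrative placeholders for the raw
    event counts; they are not LIKWID identifiers.
    """
    return {
        "CPI": run_cyc / inst_cmpl,
        "Branch rate": br_pred / inst_cmpl,
        "Branch misprediction rate": br_mpred / inst_cmpl,
        "Branch misprediction ratio": br_mpred / br_pred,
        "Instructions per branch": inst_cmpl / br_pred,
        "Operations per branch": iops_cmpl / br_pred,
    }


if __name__ == "__main__":
    # Made-up counter values, only to show how the metrics relate.
    for name, value in branch_metrics(200, 10, 1000, 1500, 1200).items():
        print(f"{name}: {value}")
```

The rates are normalized to completed instructions, while the ratio is normalized to branches, which is why a code with few branches can show a low misprediction rate and still a high misprediction ratio.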

@@ -0,0 +1,23 @@
SHORT Load to store ratio
EVENTSET
PMC3 PM_LD_CMPL
PMC1 PM_ST_CMPL
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
Load to store ratio PMC3/PMC1
Load rate PMC3/PMC4
Store rate PMC1/PMC4
LONG
Formulas:
Load to store ratio = PM_LD_CMPL/PM_ST_CMPL
Load rate = PM_LD_CMPL/PM_RUN_INST_CMPL
Store rate = PM_ST_CMPL/PM_RUN_INST_CMPL
-
This group measures the ratio of completed loads to completed stores, as well
as the load and store rates per completed instruction.

@@ -0,0 +1,25 @@
SHORT SP/DP scalar/vector MFlops/s
EVENTSET
PMC3 PM_FLOP_CMPL
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
SP/DP [MFLOP/s] (scalar assumed) 1.0E-06*PMC3*2.0/time
SP [MFLOP/s] (vector assumed) 1.0E-06*PMC3*8.0/time
DP [MFLOP/s] (vector assumed) 1.0E-06*PMC3*4.0/time
LONG
Formulas:
CPI = PM_RUN_CYC/PM_RUN_INST_CMPL
SP/DP [MFLOP/s] (scalar assumed) = 1.0E-06*PM_FLOP_CMPL*2.0/runtime
SP [MFLOP/s] (vector assumed) = 1.0E-06*PM_FLOP_CMPL*8.0/runtime
DP [MFLOP/s] (vector assumed) = 1.0E-06*PM_FLOP_CMPL*4.0/runtime
--
This group counts floating-point operations. All metrics are derived from a
single event, PM_FLOP_CMPL, so if your code mixes SP and DP or scalar and
vector operations, the results won't be exact. For pure codes the results
are quite accurate (e.g. when using likwid-bench).
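
For illustration only (not part of the group file), the following Python sketch shows how one PM_FLOP_CMPL count yields three different estimates, depending on which instruction mix is assumed. The function and key names are hypothetical; the factors 2.0, 8.0 and 4.0 are taken from the metric definitions above.

```python
# Flops assumed per counted event, as used in the metric definitions.
ASSUMED_FACTORS = {
    "SP/DP scalar": 2.0,
    "SP vector": 8.0,
    "DP vector": 4.0,
}


def flops_estimates(flop_cmpl, runtime_s):
    """Return MFLOP/s estimates for one PM_FLOP_CMPL count.

    Each entry is only valid if the whole code matches the
    corresponding assumption (hypothetical helper, not a LIKWID API).
    """
    return {
        name: 1.0e-6 * flop_cmpl * factor / runtime_s
        for name, factor in ASSUMED_FACTORS.items()
    }


if __name__ == "__main__":
    # Made-up values: one million counted events in two seconds.
    for name, mflops in flops_estimates(1_000_000, 2.0).items():
        print(f"{name}: {mflops:.2f} MFLOP/s")
```

Only the estimate whose assumption matches the measured code is meaningful; for mixed codes none of the three is exact.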

@@ -0,0 +1,21 @@
SHORT Floating-point operations with scalar FMA instructions
EVENTSET
PMC3 PM_FMA_CMPL
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
Scalar FMAs PMC3
Scalar FMA [MFLOP/s] 1E-6*(PMC3)*2.0/time
LONG
Formulas:
Scalar FMAs = PM_FMA_CMPL
Scalar FMA [MFLOP/s] = 1E-6*(PM_FMA_CMPL)*2.0/runtime
--
This group counts scalar FMA operations.
PM_FMA_CMPL: Two-flops instruction completed (fmadd, fnmadd, fmsub,
fnmsub). Scalar instructions only.

@@ -0,0 +1,23 @@
SHORT Vectorized MFlops/s
EVENTSET
PMC1 PM_VSU_FIN
PMC3 PM_VECTOR_FLOP_CMPL
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
SP [MFLOP/s] (assumed) 1.0E-06*(PMC3*8.0)/time
DP [MFLOP/s] (assumed) 1.0E-06*(PMC3*4.0)/time
Vector MIOPS/s 1.0E-06*(PMC1)/time
LONG
Formulas:
CPI = PM_RUN_CYC/PM_RUN_INST_CMPL
SP [MFLOP/s] (assumed) = 1.0E-06*(PM_VECTOR_FLOP_CMPL*8)/runtime
DP [MFLOP/s] (assumed) = 1.0E-06*(PM_VECTOR_FLOP_CMPL*4)/runtime
Vector MIOPS/s = 1.0E-06*(PM_VSU_FIN)/runtime
--
This group measures vector operations. The event does not distinguish between
SP and DP, so each MFLOP/s value assumes that all operations have that precision.

@@ -0,0 +1,22 @@
SHORT Instruction cache miss rate/ratio
EVENTSET
PMC0 PM_INST_FROM_L1
PMC1 PM_L1_ICACHE_MISS
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
L1I request rate PMC0/PMC4
L1I miss rate PMC1/PMC4
L1I miss ratio PMC1/PMC0
LONG
Formulas:
L1I request rate = PM_INST_FROM_L1 / PM_RUN_INST_CMPL
L1I miss rate = PM_L1_ICACHE_MISS / PM_RUN_INST_CMPL
L1I miss ratio = PM_L1_ICACHE_MISS / PM_INST_FROM_L1
-
This group measures some L1 instruction cache metrics.

@@ -0,0 +1,33 @@
SHORT L2 cache miss rate/ratio
EVENTSET
PMC1 PM_L2_LD_MISS
PMC2 PM_L2_LD_DISP
PMC3 PM_L2_ST_DISP
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
L2 request rate (PMC2+PMC3)/PMC4
L2 load miss rate PMC1/PMC4
L2 load miss ratio PMC1/(PMC2+PMC3)
LONG
Formulas:
L2 request rate = (PM_L2_LD_DISP+PM_L2_ST_DISP)/PM_RUN_INST_CMPL
L2 load miss rate = (PM_L2_LD_MISS)/PM_RUN_INST_CMPL
L2 load miss ratio = (PM_L2_LD_MISS)/(PM_L2_LD_DISP+PM_L2_ST_DISP)
-
This group measures the locality of your data accesses with regard to the
L2 cache. The L2 request rate tells you how data intensive your code is,
i.e. how many L2 requests you have on average per instruction.
The L2 load miss rate tells you how often, per instruction, a cache line
had to be fetched from beyond the L2 cache. Finally, the L2 load miss ratio
tells you what fraction of your L2 requests required a cache line to be
loaded from a higher level.
While the L2 load miss rate may be dictated by your algorithm, you should
try to get the L2 load miss ratio as low as possible by increasing your cache reuse.

@@ -0,0 +1,23 @@
SHORT L2 cache bandwidth in MBytes/s
EVENTSET
PMC0 PM_L2_LD
PMC2 PM_L2_INST
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
L2 load bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC2)*128.0/time
L2 load data volume [GBytes] 1.0E-09*(PMC0+PMC2)*128.0
LONG
Formulas:
CPI = PM_RUN_CYC/PM_RUN_INST_CMPL
L2 load bandwidth [MBytes/s] = 1.0E-06*(PM_L2_LD+PM_L2_INST)*128.0/time
L2 load data volume [GBytes] = 1.0E-09*(PM_L2_LD+PM_L2_INST)*128.0
-
Profiling group to measure the L2 load bandwidth. The bandwidth is computed from the
number of cache lines loaded from the L2 cache into L1.
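
For illustration only (not part of the group file), the bandwidth and data volume computation can be sketched in Python. The function and argument names are hypothetical; the 128-byte factor is the per-event cache line size used in the metric definitions above.

```python
CACHELINE_BYTES = 128.0  # factor used in the metric definitions above


def l2_load_traffic(l2_ld, l2_inst, runtime_s):
    """Return (bandwidth in MBytes/s, data volume in GBytes).

    l2_ld and l2_inst stand for the raw PM_L2_LD and PM_L2_INST
    counts (hypothetical helper, not a LIKWID API).
    """
    cache_lines = l2_ld + l2_inst
    volume_gbytes = 1.0e-9 * cache_lines * CACHELINE_BYTES
    bandwidth_mbytes_s = 1.0e-6 * cache_lines * CACHELINE_BYTES / runtime_s
    return bandwidth_mbytes_s, volume_gbytes


if __name__ == "__main__":
    # Made-up values: ten million cache lines in half a second.
    bw, vol = l2_load_traffic(9_000_000, 1_000_000, 0.5)
    print(f"L2 load bandwidth: {bw:.1f} MBytes/s, volume: {vol:.3f} GBytes")
```

Every counted event is treated as a full cache line transfer, so the result is an upper bound on the useful data actually consumed by the code.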

@@ -0,0 +1,22 @@
SHORT L2 cache bandwidth in MBytes/s
EVENTSET
PMC0 PM_L2_ST
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
L2 store bandwidth [MBytes/s] 1.0E-06*(PMC0)*128.0/time
L2 store data volume [GBytes] 1.0E-09*(PMC0)*128.0
LONG
Formulas:
CPI = PM_RUN_CYC/PM_RUN_INST_CMPL
L2 store bandwidth [MBytes/s] = 1.0E-06*(PM_L2_ST)*128.0/time
L2 store data volume [GBytes] = 1.0E-09*(PM_L2_ST)*128.0
-
Profiling group to measure the L2 store bandwidth. The bandwidth is computed from the
number of cache lines stored from the L1 cache to L2.

@@ -0,0 +1,29 @@
SHORT L3 cache bandwidth in MBytes/s
EVENTSET
PMC0 PM_L3_LD_PREF
PMC3 PM_DATA_FROM_L3
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
L3D load bandwidth [MBytes/s] 1.0E-06*(PMC3+PMC0)*128.0/time
L3D load data volume [GBytes] 1.0E-09*(PMC3+PMC0)*128.0
Loads from local L3 per cycle 100.0*(PMC3+PMC0)/PMC5
LONG
Formulas:
CPI = PM_RUN_CYC/PM_RUN_INST_CMPL
L3D load bandwidth [MBytes/s] = 1.0E-06*(PM_DATA_FROM_L3+PM_L3_LD_PREF)*128.0/time
L3D load data volume [GBytes] = 1.0E-09*(PM_DATA_FROM_L3+PM_L3_LD_PREF)*128.0
Loads from local L3 per cycle = 100.0*(PM_DATA_FROM_L3+PM_L3_LD_PREF)/PM_RUN_CYC
-
Profiling group to measure the L3 cache load bandwidth. The bandwidth is computed from
the number of cache lines loaded from the L3 into the L2 cache. There is currently no
event to get the evicted data volume.

@@ -0,0 +1,47 @@
SHORT Main memory bandwidth in MBytes/s
EVENTSET
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
MBOX0C0 PM_MBA0_READ_BYTES
MBOX0C1 PM_MBA0_WRITE_BYTES
MBOX1C0 PM_MBA1_READ_BYTES
MBOX1C1 PM_MBA1_WRITE_BYTES
MBOX2C0 PM_MBA2_READ_BYTES
MBOX2C1 PM_MBA2_WRITE_BYTES
MBOX3C0 PM_MBA3_READ_BYTES
MBOX3C1 PM_MBA3_WRITE_BYTES
MBOX4C0 PM_MBA4_READ_BYTES
MBOX4C1 PM_MBA4_WRITE_BYTES
MBOX5C0 PM_MBA5_READ_BYTES
MBOX5C1 PM_MBA5_WRITE_BYTES
MBOX6C0 PM_MBA6_READ_BYTES
MBOX6C1 PM_MBA6_WRITE_BYTES
MBOX7C0 PM_MBA7_READ_BYTES
MBOX7C1 PM_MBA7_WRITE_BYTES
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(PM_MBAx_READ_BYTES))*64.0/runtime
Memory read data volume [GBytes] = 1.0E-09*(SUM(PM_MBAx_READ_BYTES))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(PM_MBAx_WRITE_BYTES))*64.0/runtime
Memory write data volume [GBytes] = 1.0E-09*(SUM(PM_MBAx_WRITE_BYTES))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(PM_MBAx_READ_BYTES)+SUM(PM_MBAx_WRITE_BYTES))*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(SUM(PM_MBAx_READ_BYTES)+SUM(PM_MBAx_WRITE_BYTES))*64.0
-
Profiling group to measure the memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events, it can only be measured on a
per-socket basis. Some of the counters may not be available on your system.
The group also outputs the total data volume transferred from and to main memory.
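
For illustration only (not part of the group file), the per-socket aggregation over the eight MBA units can be sketched in Python. The function and argument names are hypothetical; the factor 64.0 is taken from the metric definitions above. Counters that are unavailable on a system can simply be left out of the input lists.

```python
SCALE = 64.0  # factor applied to the raw counts in the metric definitions


def memory_traffic(read_counts, write_counts, runtime_s):
    """Aggregate per-MBA read/write counts into socket-wide metrics.

    read_counts and write_counts are sequences with one raw value per
    available MBA unit (hypothetical helper, not a LIKWID API).
    """
    reads = sum(read_counts)
    writes = sum(write_counts)
    return {
        "Memory read bandwidth [MBytes/s]": 1.0e-6 * reads * SCALE / runtime_s,
        "Memory read data volume [GBytes]": 1.0e-9 * reads * SCALE,
        "Memory write bandwidth [MBytes/s]": 1.0e-6 * writes * SCALE / runtime_s,
        "Memory write data volume [GBytes]": 1.0e-9 * writes * SCALE,
        "Memory bandwidth [MBytes/s]": 1.0e-6 * (reads + writes) * SCALE / runtime_s,
        "Memory data volume [GBytes]": 1.0e-9 * (reads + writes) * SCALE,
    }


if __name__ == "__main__":
    # Made-up counts for eight MBA units over one second.
    result = memory_traffic([1_000_000] * 8, [500_000] * 8, 1.0)
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```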

@@ -0,0 +1,42 @@
SHORT L1 Data TLB miss rate/ratio
EVENTSET
PMC0 PM_LSU_DTLB_MISS_16G_1G
PMC1 PM_LSU_DTLB_MISS_4K
PMC2 PM_LSU_DTLB_MISS_64K
PMC3 PM_LSU_DTLB_MISS_16M_2M
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
L1 DTLB 4K misses PMC1
L1 DTLB 4K miss rate PMC1/PMC4
L1 DTLB 4K miss ratio [%] (PMC1/(PMC0+PMC1+PMC2+PMC3))*100.0
L1 DTLB 64K misses PMC2
L1 DTLB 64K miss rate PMC2/PMC4
L1 DTLB 64K miss ratio [%] (PMC2/(PMC0+PMC1+PMC2+PMC3))*100.0
L1 DTLB 16M/2M misses PMC3
L1 DTLB 16M/2M miss rate PMC3/PMC4
L1 DTLB 16M/2M miss ratio [%] (PMC3/(PMC0+PMC1+PMC2+PMC3))*100.0
L1 DTLB 16G/1G misses PMC0
L1 DTLB 16G/1G miss rate PMC0/PMC4
L1 DTLB 16G/1G miss ratio [%] (PMC0/(PMC0+PMC1+PMC2+PMC3))*100.0
LONG
Formulas:
L1 DTLB 4K misses = PM_LSU_DTLB_MISS_4K
L1 DTLB 4K miss rate = PM_LSU_DTLB_MISS_4K/PM_RUN_INST_CMPL
L1 DTLB 4K miss ratio [%] = (PM_LSU_DTLB_MISS_4K/(PM_LSU_DTLB_MISS_4K+PM_LSU_DTLB_MISS_64K+PM_LSU_DTLB_MISS_16M_2M+PM_LSU_DTLB_MISS_16G_1G))*100
L1 DTLB 64K misses = PM_LSU_DTLB_MISS_64K
L1 DTLB 64K miss rate = PM_LSU_DTLB_MISS_64K/PM_RUN_INST_CMPL
L1 DTLB 64K miss ratio [%] = (PM_LSU_DTLB_MISS_64K/(PM_LSU_DTLB_MISS_4K+PM_LSU_DTLB_MISS_64K+PM_LSU_DTLB_MISS_16M_2M+PM_LSU_DTLB_MISS_16G_1G))*100
L1 DTLB 16M/2M misses = PM_LSU_DTLB_MISS_16M_2M
L1 DTLB 16M/2M miss rate = PM_LSU_DTLB_MISS_16M_2M/PM_RUN_INST_CMPL
L1 DTLB 16M/2M miss ratio [%] = (PM_LSU_DTLB_MISS_16M_2M/(PM_LSU_DTLB_MISS_4K+PM_LSU_DTLB_MISS_64K+PM_LSU_DTLB_MISS_16M_2M+PM_LSU_DTLB_MISS_16G_1G))*100
L1 DTLB 16G/1G misses = PM_LSU_DTLB_MISS_16G_1G
L1 DTLB 16G/1G miss rate = PM_LSU_DTLB_MISS_16G_1G/PM_RUN_INST_CMPL
L1 DTLB 16G/1G miss ratio [%] = (PM_LSU_DTLB_MISS_16G_1G/(PM_LSU_DTLB_MISS_4K+PM_LSU_DTLB_MISS_64K+PM_LSU_DTLB_MISS_16M_2M+PM_LSU_DTLB_MISS_16G_1G))*100
-
This group measures the data TLB misses for different page sizes.
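
For illustration only (not part of the group file), the miss ratio computation can be sketched in Python. Note that, as defined in the metrics above, each ratio is the share of one page size in the total number of DTLB misses, not the number of misses per TLB access. The function name and dictionary keys are hypothetical.

```python
def dtlb_miss_shares(misses):
    """Return the percentage of all DTLB misses caused by each page size.

    `misses` maps a page-size label to its raw miss count
    (hypothetical helper, not a LIKWID API).
    """
    total = sum(misses.values())
    if total == 0:
        # No misses at all: avoid dividing by zero.
        return {size: 0.0 for size in misses}
    return {size: 100.0 * count / total for size, count in misses.items()}


if __name__ == "__main__":
    # Made-up miss counts per page size.
    counts = {"4K": 50, "64K": 30, "16M/2M": 15, "16G/1G": 5}
    for size, share in dtlb_miss_shares(counts).items():
        print(f"L1 DTLB {size} miss ratio: {share:.1f} %")
```

Because the shares always add up to 100 %, they show which page size dominates the misses; the miss rates per instruction are needed to judge whether the misses matter at all.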

@@ -0,0 +1,21 @@
SHORT L1 Instruction TLB miss rate/ratio
EVENTSET
PMC3 PM_ITLB_MISS
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
Runtime (RDTSC) [s] time
CPI PMC5/PMC4
L1 ITLB misses PMC3
L1 ITLB miss rate PMC3/PMC4
LONG
Formulas:
L1 ITLB misses = PM_ITLB_MISS
L1 ITLB miss rate = PM_ITLB_MISS/PM_RUN_INST_CMPL
-
This group measures the reloads of the instruction TLB.
Misses to the HPT are counted once while misses in the Radix
tree count the number of tree levels traversed.

@@ -0,0 +1,22 @@
SHORT Rate of useful instructions
EVENTSET
PMC0 PM_RUN_SPURR
PMC1 PM_INST_DISP
PMC3 PM_RUN_PURR
PMC4 PM_RUN_INST_CMPL
PMC5 PM_RUN_CYC
METRICS
CPI PMC5/PMC4
Useful instr. rate [%] (PMC4/PMC1)*100.0
Processor Utilization [%] (PMC0/PMC3)*100.0
LONG
Formulas:
Useful instr. rate [%] = (PM_RUN_INST_CMPL/PM_INST_DISP)*100
Processor Utilization [%] = (PM_RUN_SPURR/PM_RUN_PURR)*100
--
This performance group shows the overhead of speculative
execution of instructions and the processor utilization.