Add likwid collector

2025-12-15 03:56:15 +01:00 · 2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions
--- a/collectors/likwid/groups/power9/BRANCH.txt
+++ b/collectors/likwid/groups/power9/BRANCH.txt
@@ -0,0 +1,30 @@
+SHORT Branch prediction miss rate/ratio
+
+EVENTSET
+PMC1  PM_BR_PRED
+PMC2 PM_IOPS_CMPL
+PMC3  PM_BR_MPRED_CMPL
+PMC4 PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+Branch rate   (PMC1)/PMC4
+Branch misprediction rate  PMC3/PMC4
+Branch misprediction ratio  PMC3/(PMC1)
+Instructions per branch  PMC4/(PMC1)
+Operations per branch PMC2/PMC1
+
+LONG
+Formulas:
+Branch rate = PM_BR_PRED/PM_RUN_INST_CMPL
+Branch misprediction rate =  PM_BR_MPRED_CMPL/PM_RUN_INST_CMPL
+Branch misprediction ratio = PM_BR_MPRED_CMPL/PM_BR_PRED
+Instructions per branch = PM_RUN_INST_CMPL/PM_BR_PRED
+-
+The rates state how often, on average, a branch or a mispredicted branch occurred
+per retired instruction. The branch misprediction ratio states directly
+what fraction of all branch instructions were mispredicted.
+Instructions per branch is 1/Branch rate.
+
--- a/collectors/likwid/groups/power9/DATA.txt
+++ b/collectors/likwid/groups/power9/DATA.txt
@@ -0,0 +1,23 @@
+SHORT Load to store ratio
+
+EVENTSET
+PMC3  PM_LD_CMPL
+PMC1  PM_ST_CMPL
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+Load to store ratio PMC3/PMC1
+Load rate PMC3/PMC4
+Store rate PMC1/PMC4
+
+LONG
+Formulas:
+Load to store ratio = PM_LD_CMPL/PM_ST_CMPL
+Load rate = PM_LD_CMPL/PM_RUN_INST_CMPL
+Store rate = PM_ST_CMPL/PM_RUN_INST_CMPL
+-
+This is a metric to determine your load to store ratio.
+
--- a/collectors/likwid/groups/power9/FLOPS.txt
+++ b/collectors/likwid/groups/power9/FLOPS.txt
@@ -0,0 +1,25 @@
+SHORT SP/DP scalar/vector MFlops/s
+
+EVENTSET
+PMC3  PM_FLOP_CMPL
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+SP/DP [MFLOP/s] (scalar assumed) 1.0E-06*PMC3*2.0/time
+SP [MFLOP/s] (vector assumed) 1.0E-06*PMC3*8.0/time
+DP [MFLOP/s] (vector assumed) 1.0E-06*PMC3*4.0/time
+
+LONG
+Formulas:
+CPI = PM_RUN_CYC/PM_RUN_INST_CMPL
+SP/DP [MFLOP/s] (scalar assumed) = 1.0E-06*PM_FLOP_CMPL*2.0/runtime
+SP [MFLOP/s] (vector assumed) = 1.0E-06*PM_FLOP_CMPL*8.0/runtime
+DP [MFLOP/s] (vector assumed) = 1.0E-06*PM_FLOP_CMPL*4.0/runtime
+--
+This group counts floating-point operations. All metrics are derived from a
+single event, PM_FLOP_CMPL, so if your code mixes SP and DP or
+scalar and vector operations, the counts won't be exact. For pure codes
+the counts are fairly accurate (e.g. when using likwid-bench).
--- a/collectors/likwid/groups/power9/FLOPS_FMA.txt
+++ b/collectors/likwid/groups/power9/FLOPS_FMA.txt
@@ -0,0 +1,21 @@
+SHORT Floating-point operations with scalar FMA instructions
+
+EVENTSET
+PMC3  PM_FMA_CMPL
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI PMC5/PMC4
+Scalar FMAs PMC3
+Scalar FMA [MFLOP/s] 1E-6*(PMC3)*2.0/time
+
+LONG
+Formulas:
+Scalar FMAs = PM_FMA_CMPL
+Scalar FMA [MFLOP/s] = 1E-6*(PM_FMA_CMPL)*2.0/runtime
+--
+This group counts scalar FMA operations.
+PM_FMA_CMPL: Two-flops instruction completed (fmadd, fnmadd, fmsub,
+fnmsub). Scalar instructions only.
--- a/collectors/likwid/groups/power9/FLOPS_VSX.txt
+++ b/collectors/likwid/groups/power9/FLOPS_VSX.txt
@@ -0,0 +1,23 @@
+SHORT Vectorized MFlops/s
+
+EVENTSET
+PMC1  PM_VSU_FIN
+PMC3  PM_VECTOR_FLOP_CMPL
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+SP [MFLOP/s] (assumed)  1.0E-06*(PMC3*8.0)/time
+DP [MFLOP/s] (assumed)  1.0E-06*(PMC3*4.0)/time
+Vector MIOPS/s   1.0E-06*(PMC1)/time
+
+LONG
+Formulas:
+CPI = PM_RUN_CYC/PM_RUN_INST_CMPL
+SP [MFLOP/s] (assumed) = 1.0E-06*(PM_VECTOR_FLOP_CMPL*8)/runtime
+DP [MFLOP/s] (assumed) = 1.0E-06*(PM_VECTOR_FLOP_CMPL*4)/runtime
+Vector MIOPS/s = 1.0E-06*(PM_VSU_FIN)/runtime
+--
+This group measures vector operations. It is not possible to differentiate between SP and DP.
--- a/collectors/likwid/groups/power9/ICACHE.txt
+++ b/collectors/likwid/groups/power9/ICACHE.txt
@@ -0,0 +1,22 @@
+SHORT  Instruction cache miss rate/ratio
+
+EVENTSET
+PMC0  PM_INST_FROM_L1
+PMC1  PM_L1_ICACHE_MISS
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+L1I request rate PMC0/PMC4
+L1I miss rate PMC1/PMC4
+L1I miss ratio PMC1/PMC0
+
+LONG
+Formulas:
+L1I request rate = PM_INST_FROM_L1/PM_RUN_INST_CMPL
+L1I miss rate = PM_L1_ICACHE_MISS/PM_RUN_INST_CMPL
+L1I miss ratio = PM_L1_ICACHE_MISS/PM_INST_FROM_L1
+-
+This group measures some L1 instruction cache metrics.
--- a/collectors/likwid/groups/power9/L2CACHE.txt
+++ b/collectors/likwid/groups/power9/L2CACHE.txt
@@ -0,0 +1,33 @@
+SHORT L2 cache miss rate/ratio
+
+EVENTSET
+PMC1  PM_L2_LD_MISS
+PMC2  PM_L2_LD_DISP
+PMC3  PM_L2_ST_DISP
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+L2 request rate (PMC2+PMC3)/PMC4
+L2 load miss rate PMC1/PMC4
+L2 load miss ratio PMC1/(PMC2+PMC3)
+
+LONG
+Formulas:
+L2 request rate = (PM_L2_LD_DISP+PM_L2_ST_DISP)/PM_RUN_INST_CMPL
+L2 load miss rate = (PM_L2_LD_MISS)/PM_RUN_INST_CMPL
+L2 load miss ratio = (PM_L2_LD_MISS)/(PM_L2_LD_DISP+PM_L2_ST_DISP)
+-
+This group measures the locality of your data accesses with regard to the
+L2 cache. The L2 request rate tells you how data intensive your code is,
+i.e. how many data accesses you have on average per instruction.
+The L2 load miss rate measures how often it was necessary to get
+cache lines from memory. Finally, the L2 load miss ratio tells you how many of your
+memory references required a cache line to be loaded from a higher level.
+While the data cache miss rate might be given by your algorithm, you should
+try to get the data cache miss ratio as low as possible by increasing your cache reuse.
+
+
--- a/collectors/likwid/groups/power9/L2LOAD.txt
+++ b/collectors/likwid/groups/power9/L2LOAD.txt
@@ -0,0 +1,23 @@
+SHORT  L2 cache bandwidth in MBytes/s
+
+EVENTSET
+PMC0  PM_L2_LD
+PMC2  PM_L2_INST
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+L2 load bandwidth [MBytes/s]  1.0E-06*(PMC0+PMC2)*128.0/time
+L2 load data volume [GBytes]  1.0E-09*(PMC0+PMC2)*128.0
+
+LONG
+Formulas:
+CPI = PM_RUN_CYC/PM_RUN_INST_CMPL
+L2 load bandwidth [MBytes/s] = 1.0E-06*(PM_L2_LD+PM_L2_INST)*128.0/time
+L2 load data volume [GBytes] = 1.0E-09*(PM_L2_LD+PM_L2_INST)*128.0
+-
+Profiling group to measure L2 load cache bandwidth. The bandwidth is computed from the
+number of cache lines loaded from the L2 cache to L1.
--- a/collectors/likwid/groups/power9/L2STORE.txt
+++ b/collectors/likwid/groups/power9/L2STORE.txt
@@ -0,0 +1,22 @@
+SHORT  L2 cache bandwidth in MBytes/s
+
+EVENTSET
+PMC0  PM_L2_ST
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+L2 store bandwidth [MBytes/s]  1.0E-06*(PMC0)*128.0/time
+L2 store data volume [GBytes]  1.0E-09*(PMC0)*128.0
+
+LONG
+Formulas:
+CPI = PM_RUN_CYC/PM_RUN_INST_CMPL
+L2 store bandwidth [MBytes/s] = 1.0E-06*(PM_L2_ST)*128.0/time
+L2 store data volume [GBytes] = 1.0E-09*(PM_L2_ST)*128.0
+-
+Profiling group to measure L2 store cache bandwidth. The bandwidth is computed from the
+number of cache lines stored from the L1 cache to L2.
--- a/collectors/likwid/groups/power9/L3.txt
+++ b/collectors/likwid/groups/power9/L3.txt
@@ -0,0 +1,29 @@
+SHORT  L3 cache bandwidth in MBytes/s
+
+EVENTSET
+PMC0  PM_L3_LD_PREF
+PMC3  PM_DATA_FROM_L3
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+L3D load bandwidth [MBytes/s]  1.0E-06*(PMC3+PMC0)*128.0/time
+L3D load data volume [GBytes]  1.0E-09*(PMC3+PMC0)*128.0
+Loads from local L3 per cycle 100.0*(PMC3+PMC0)/PMC5
+
+LONG
+Formulas:
+CPI = PM_RUN_CYC/PM_RUN_INST_CMPL
+L3D load bandwidth [MBytes/s] = 1.0E-06*(PM_DATA_FROM_L3+PM_L3_LD_PREF)*128.0/time
+L3D load data volume [GBytes] = 1.0E-09*(PM_DATA_FROM_L3+PM_L3_LD_PREF)*128.0
+L3D evict bandwidth [MBytes/s] = 1.0E-06*(PM_L2_CASTOUT_MOD)*128.0/time
+L3D evict data volume [GBytes] = 1.0E-09*(PM_L2_CASTOUT_MOD)*128.0
+L3 bandwidth [MBytes/s] = 1.0E-06*(PM_DATA_FROM_L3+PM_L2_CASTOUT_MOD)*128.0/time
+L3 data volume [GBytes] = 1.0E-09*(PM_DATA_FROM_L3+PM_L2_CASTOUT_MOD)*128.0
+-
+Profiling group to measure L3 cache bandwidth. The bandwidth is computed from the
+number of cache lines loaded from the L3 to the L2 data cache. There is currently no
+event to get the evicted data volume.
--- a/collectors/likwid/groups/power9/MEM.txt
+++ b/collectors/likwid/groups/power9/MEM.txt
@@ -0,0 +1,47 @@
+SHORT Main memory bandwidth in MBytes/s
+
+EVENTSET
+PMC4 PM_RUN_INST_CMPL
+PMC5 PM_RUN_CYC
+MBOX0C0 PM_MBA0_READ_BYTES
+MBOX0C1 PM_MBA0_WRITE_BYTES
+MBOX1C0 PM_MBA1_READ_BYTES
+MBOX1C1 PM_MBA1_WRITE_BYTES
+MBOX2C0 PM_MBA2_READ_BYTES
+MBOX2C1 PM_MBA2_WRITE_BYTES
+MBOX3C0 PM_MBA3_READ_BYTES
+MBOX3C1 PM_MBA3_WRITE_BYTES
+MBOX4C0 PM_MBA4_READ_BYTES
+MBOX4C1 PM_MBA4_WRITE_BYTES
+MBOX5C0 PM_MBA5_READ_BYTES
+MBOX5C1 PM_MBA5_WRITE_BYTES
+MBOX6C0 PM_MBA6_READ_BYTES
+MBOX6C1 PM_MBA6_WRITE_BYTES
+MBOX7C0 PM_MBA7_READ_BYTES
+MBOX7C1 PM_MBA7_WRITE_BYTES
+
+
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0/time
+Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0
+Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
+Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
+Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
+Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
+
+LONG
+Formulas:
+Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(PM_MBAx_READ_BYTES))*64.0/runtime
+Memory read data volume [GBytes] = 1.0E-09*(SUM(PM_MBAx_READ_BYTES))*64.0
+Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(PM_MBAx_WRITE_BYTES))*64.0/runtime
+Memory write data volume [GBytes] = 1.0E-09*(SUM(PM_MBAx_WRITE_BYTES))*64.0
+Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(PM_MBAx_READ_BYTES)+SUM(PM_MBAx_WRITE_BYTES))*64.0/runtime
+Memory data volume [GBytes] = 1.0E-09*(SUM(PM_MBAx_READ_BYTES)+SUM(PM_MBAx_WRITE_BYTES))*64.0
+-
+Profiling group to measure memory bandwidth drawn by all cores of a socket.
+Since this group is based on Uncore events, it is only possible to measure on a
+per-socket basis. Some of the counters may not be available on your system.
+The group also outputs the total data volume transferred from main memory.
--- a/collectors/likwid/groups/power9/TLB_DATA.txt
+++ b/collectors/likwid/groups/power9/TLB_DATA.txt
@@ -0,0 +1,42 @@
+SHORT  L1 Data TLB miss rate/ratio
+
+EVENTSET
+PMC0  PM_LSU_DTLB_MISS_16G_1G
+PMC1  PM_LSU_DTLB_MISS_4K
+PMC2  PM_LSU_DTLB_MISS_64K
+PMC3  PM_LSU_DTLB_MISS_16M_2M
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+L1 DTLB 4K misses     PMC1
+L1 DTLB 4K miss rate  PMC1/PMC4
+L1 DTLB 4K miss ratio [%] (PMC1/(PMC0+PMC1+PMC2+PMC3))*100.0
+L1 DTLB 64K misses     PMC2
+L1 DTLB 64K miss rate  PMC2/PMC4
+L1 DTLB 64K miss ratio [%] (PMC2/(PMC0+PMC1+PMC2+PMC3))*100.0
+L1 DTLB 16M/2M misses     PMC3
+L1 DTLB 16M/2M miss rate  PMC3/PMC4
+L1 DTLB 16M/2M miss ratio [%] (PMC3/(PMC0+PMC1+PMC2+PMC3))*100.0
+L1 DTLB 16G/1G misses     PMC0
+L1 DTLB 16G/1G miss rate  PMC0/PMC4
+L1 DTLB 16G/1G miss ratio [%] (PMC0/(PMC0+PMC1+PMC2+PMC3))*100.0
+
+LONG
+Formulas:
+L1 DTLB 4K misses = PM_LSU_DTLB_MISS_4K
+L1 DTLB 4K miss rate = PM_LSU_DTLB_MISS_4K/PM_RUN_INST_CMPL
+L1 DTLB 4K miss ratio [%] = (PM_LSU_DTLB_MISS_4K/(PM_LSU_DTLB_MISS_4K+PM_LSU_DTLB_MISS_64K+PM_LSU_DTLB_MISS_16M_2M+PM_LSU_DTLB_MISS_16G_1G))*100
+L1 DTLB 64K misses = PM_LSU_DTLB_MISS_64K
+L1 DTLB 64K miss rate = PM_LSU_DTLB_MISS_64K/PM_RUN_INST_CMPL
+L1 DTLB 64K miss ratio [%] = (PM_LSU_DTLB_MISS_64K/(PM_LSU_DTLB_MISS_4K+PM_LSU_DTLB_MISS_64K+PM_LSU_DTLB_MISS_16M_2M+PM_LSU_DTLB_MISS_16G_1G))*100
+L1 DTLB 16M/2M misses = PM_LSU_DTLB_MISS_16M_2M
+L1 DTLB 16M/2M miss rate = PM_LSU_DTLB_MISS_16M_2M/PM_RUN_INST_CMPL
+L1 DTLB 16M/2M miss ratio [%] = (PM_LSU_DTLB_MISS_16M_2M/(PM_LSU_DTLB_MISS_4K+PM_LSU_DTLB_MISS_64K+PM_LSU_DTLB_MISS_16M_2M+PM_LSU_DTLB_MISS_16G_1G))*100
+L1 DTLB 16G/1G misses = PM_LSU_DTLB_MISS_16G_1G
+L1 DTLB 16G/1G miss rate = PM_LSU_DTLB_MISS_16G_1G/PM_RUN_INST_CMPL
+L1 DTLB 16G/1G miss ratio [%] = (PM_LSU_DTLB_MISS_16G_1G/(PM_LSU_DTLB_MISS_4K+PM_LSU_DTLB_MISS_64K+PM_LSU_DTLB_MISS_16M_2M+PM_LSU_DTLB_MISS_16G_1G))*100
+-
+This group measures the data TLB misses for different page sizes.
--- a/collectors/likwid/groups/power9/TLB_INSTR.txt
+++ b/collectors/likwid/groups/power9/TLB_INSTR.txt
@@ -0,0 +1,21 @@
+SHORT  L1 Instruction TLB miss rate/ratio
+
+EVENTSET
+PMC3  PM_ITLB_MISS
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+METRICS
+Runtime (RDTSC) [s] time
+CPI  PMC5/PMC4
+L1 ITLB misses     PMC3
+L1 ITLB miss rate  PMC3/PMC4
+
+LONG
+Formulas:
+L1 ITLB misses = PM_ITLB_MISS
+L1 ITLB miss rate = PM_ITLB_MISS/PM_RUN_INST_CMPL
+-
+This group measures the reloads of the instruction TLB.
+Misses to the HPT are counted once while misses in the Radix
+tree count the number of tree levels traversed.
--- a/collectors/likwid/groups/power9/USEFUL.txt
+++ b/collectors/likwid/groups/power9/USEFUL.txt
@@ -0,0 +1,22 @@
+SHORT Rate of useful instructions
+
+EVENTSET
+PMC0  PM_RUN_SPURR
+PMC1  PM_INST_DISP
+PMC3  PM_RUN_PURR
+PMC4  PM_RUN_INST_CMPL
+PMC5  PM_RUN_CYC
+
+METRICS
+CPI  PMC5/PMC4
+Useful instr. rate [%] (PMC4/PMC1)*100.0
+Processor Utilization [%] (PMC0/PMC3)*100.0
+
+
+LONG
+Formulas:
+Useful instr. rate [%] = (PM_RUN_INST_CMPL/PM_INST_DISP)*100
+Processor Utilization [%] = (PM_RUN_SPURR/PM_RUN_PURR)*100
+--
+This performance group shows the overhead of speculative
+execution of instructions and the processor utilization.
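
Reviewer note: each METRICS line in the group files above pairs a metric name with a formula over counter names (PMC0..PMC5, MBOXxCy) and `time`. The following is a minimal Python sketch of how such lines could be evaluated from raw counter values. It is not LIKWID's own evaluator: the function names `parse_metrics` and `evaluate_metrics` and the sample counter values are hypothetical, and it assumes each formula is the last whitespace-separated token on its line, as is the case in the files of this commit.

```python
def parse_metrics(group_text):
    """Return (name, formula) pairs from the METRICS section of a group file."""
    metrics = []
    in_metrics = False
    for line in group_text.splitlines():
        stripped = line.strip()
        if stripped == "METRICS":
            in_metrics = True
            continue
        if stripped == "LONG":
            break  # the LONG section is documentation only
        if in_metrics and stripped:
            # Assumption: the formula contains no spaces and is the last token.
            name, formula = stripped.rsplit(None, 1)
            metrics.append((name, formula))
    return metrics


def evaluate_metrics(group_text, counters):
    """Evaluate every metric formula with counter names bound to values."""
    results = {}
    for name, formula in parse_metrics(group_text):
        # eval() is used only because the formulas come from trusted group
        # files; builtins are disabled so only counter names can be resolved.
        results[name] = eval(formula, {"__builtins__": {}}, dict(counters))
    return results


# METRICS section copied from BRANCH.txt; the counter values are made up.
BRANCH_METRICS = """METRICS
Runtime (RDTSC) [s] time
CPI  PMC5/PMC4
Branch rate   (PMC1)/PMC4
Branch misprediction rate  PMC3/PMC4
Branch misprediction ratio  PMC3/(PMC1)
Instructions per branch  PMC4/(PMC1)
Operations per branch PMC2/PMC1

LONG
"""

sample_counters = {
    "PMC1": 200, "PMC2": 1200, "PMC3": 10,
    "PMC4": 1000, "PMC5": 2000, "time": 0.5,
}

for metric, value in evaluate_metrics(BRANCH_METRICS, sample_counters).items():
    print(f"{metric}: {value}")
```

Running the formulas this way against the names in the EVENTSET section is also a quick check that the Formulas text in LONG matches what METRICS actually computes.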