Add likwid collector

Thomas Roehl
2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions


@@ -0,0 +1,26 @@
SHORT Branch prediction miss rate/ratio
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 BRANCH_RETIRED
PMC2 BRANCH_MISPREDICT_RETIRED
METRICS
Runtime (RDTSC) [s] time
Branch rate PMC1/PMC0
Branch misprediction rate PMC2/PMC0
Branch misprediction ratio PMC2/PMC1
Instructions per branch PMC0/PMC1
LONG
Formulas:
Branch rate = BRANCH_RETIRED/INSTRUCTIONS_RETIRED
Branch misprediction rate = BRANCH_MISPREDICT_RETIRED/INSTRUCTIONS_RETIRED
Branch misprediction ratio = BRANCH_MISPREDICT_RETIRED/BRANCH_RETIRED
Instructions per branch = INSTRUCTIONS_RETIRED/BRANCH_RETIRED
-
The rates state how often, on average, a branch or a mispredicted branch occurred
per retired instruction. The branch misprediction ratio directly relates the
mispredicted branches to all branch instructions, i.e. which fraction of all
branches was mispredicted. Instructions per branch is 1/branch rate.
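To illustrate how the METRICS expressions are meant to be read, here is a minimal, self-contained Go sketch (invented counter values, not the collector's actual code) that evaluates the branch metrics exactly as the formulas above prescribe, with PMC0..PMC2 replaced by raw counter readings:

package main

import "fmt"

func main() {
	// Hypothetical raw counter readings (invented for illustration only).
	instructionsRetired := 1.0e9 // PMC0 INSTRUCTIONS_RETIRED
	branchRetired := 2.0e8       // PMC1 BRANCH_RETIRED
	branchMispredicted := 4.0e6  // PMC2 BRANCH_MISPREDICT_RETIRED

	// Direct transcription of the METRICS formulas of this group.
	fmt.Printf("Branch rate:                %.4f\n", branchRetired/instructionsRetired)
	fmt.Printf("Branch misprediction rate:  %.6f\n", branchMispredicted/instructionsRetired)
	fmt.Printf("Branch misprediction ratio: %.4f\n", branchMispredicted/branchRetired)
	fmt.Printf("Instructions per branch:    %.2f\n", instructionsRetired/branchRetired)
}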


@@ -0,0 +1,34 @@
SHORT Data cache miss rate/ratio
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 DATA_CACHE_ACCESSES
PMC2 DATA_CACHE_REFILLS_L2_ALL
PMC3 DATA_CACHE_REFILLS_NORTHBRIDGE_ALL
METRICS
Runtime (RDTSC) [s] time
data cache misses PMC2+PMC3
data cache request rate PMC1/PMC0
data cache miss rate (PMC2+PMC3)/PMC0
data cache miss ratio (PMC2+PMC3)/PMC1
LONG
Formulas:
data cache misses = DATA_CACHE_REFILLS_L2_ALL + DATA_CACHE_REFILLS_NORTHBRIDGE_ALL
data cache request rate = DATA_CACHE_ACCESSES / INSTRUCTIONS_RETIRED
data cache miss rate = (DATA_CACHE_REFILLS_L2_ALL + DATA_CACHE_REFILLS_NORTHBRIDGE_ALL)/INSTRUCTIONS_RETIRED
data cache miss ratio = (DATA_CACHE_REFILLS_L2_ALL + DATA_CACHE_REFILLS_NORTHBRIDGE_ALL)/DATA_CACHE_ACCESSES
-
This group measures the locality of your data accesses with regard to the
L1 cache. The data cache request rate tells you how data-intensive your code is,
i.e. how many data accesses you have on average per instruction.
The data cache miss rate gives a measure of how often it was necessary to get
cache lines from higher levels of the memory hierarchy. Finally, the
data cache miss ratio tells you how many of your memory references required
a cache line to be loaded from a higher level. While the data cache miss rate
might be given by your algorithm, you should try to get the data cache miss ratio
as low as possible by increasing your cache reuse.
This group was taken from the whitepaper "Basic Performance Measurements for AMD Athlon 64,
AMD Opteron and AMD Phenom Processors" by Paul J. Drongowski.
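A worked example with invented numbers: for 10^9 retired instructions, 4*10^8 data cache accesses and a total of 2*10^7 refills from L2 and Northbridge, the request rate is 0.4 (accesses per instruction), the miss rate is 0.02 (misses per instruction) and the miss ratio is 0.05, i.e. 5% of all L1 data accesses had to be refilled from a higher level.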


@@ -0,0 +1,26 @@
SHORT Cycles per instruction
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 CPU_CLOCKS_UNHALTED
PMC2 UOPS_RETIRED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
CPI PMC1/PMC0
CPI (based on uops) PMC1/PMC2
IPC PMC0/PMC1
LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/INSTRUCTIONS_RETIRED
CPI (based on uops) = CPU_CLOCKS_UNHALTED/UOPS_RETIRED
IPC = INSTRUCTIONS_RETIRED/CPU_CLOCKS_UNHALTED
-
This group measures how efficiently the processor works with
regard to instruction throughput. Also important as a standalone
metric is INSTRUCTIONS_RETIRED, as it tells you how many instructions
you need to execute for a task. An optimization might show very
low CPI values but execute many more instructions for it.
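As a numerical illustration with invented values (assuming inverseClock is the inverse of the nominal core clock): 1.2*10^9 unhalted cycles at 2.4 GHz give 1.2e9 * (1/2.4e9) = 0.5 s of unhalted runtime; with 2.0*10^9 retired instructions in that interval, the CPI would be 1.2e9/2.0e9 = 0.6 and the IPC about 1.67.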


@@ -0,0 +1,24 @@
SHORT Double Precision MFLOP/s
EVENTSET
PMC0 SSE_RETIRED_ADD_DOUBLE_FLOPS
PMC1 SSE_RETIRED_MULT_DOUBLE_FLOPS
PMC2 CPU_CLOCKS_UNHALTED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC2*inverseClock
DP [MFLOP/s] 1.0E-06*(PMC0+PMC1)/time
DP Add [MFLOP/s] 1.0E-06*PMC0/time
DP Mult [MFLOP/s] 1.0E-06*PMC1/time
LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_DOUBLE_FLOPS+SSE_RETIRED_MULT_DOUBLE_FLOPS)/time
DP Add [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_DOUBLE_FLOPS)/time
DP Mult [MFLOP/s] = 1.0E-06*(SSE_RETIRED_MULT_DOUBLE_FLOPS)/time
-
Profiling group to measure double precision SSE FLOPs.
Don't forget that your code might also execute X87 FLOPs.


@@ -0,0 +1,24 @@
SHORT Single Precision MFLOP/s
EVENTSET
PMC0 SSE_RETIRED_ADD_SINGLE_FLOPS
PMC1 SSE_RETIRED_MULT_SINGLE_FLOPS
PMC2 CPU_CLOCKS_UNHALTED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC2*inverseClock
SP [MFLOP/s] 1.0E-06*(PMC0+PMC1)/time
SP Add [MFLOP/s] 1.0E-06*PMC0/time
SP Mult [MFLOP/s] 1.0E-06*PMC1/time
LONG
Formulas:
SP [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_SINGLE_FLOPS+SSE_RETIRED_MULT_SINGLE_FLOPS)/time
SP Add [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_SINGLE_FLOPS)/time
SP Mult [MFLOP/s] = 1.0E-06*(SSE_RETIRED_MULT_SINGLE_FLOPS)/time
-
Profiling group to measure single precision SSE FLOPs.
Don't forget that your code might also execute X87 FLOPs.


@@ -0,0 +1,25 @@
SHORT X87 MFLOP/s
EVENTSET
PMC0 X87_FLOPS_RETIRED_ADD
PMC1 X87_FLOPS_RETIRED_MULT
PMC2 X87_FLOPS_RETIRED_DIV
PMC3 CPU_CLOCKS_UNHALTED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC3*inverseClock
X87 [MFLOP/s] 1.0E-06*(PMC0+PMC1+PMC2)/time
X87 Add [MFLOP/s] 1.0E-06*PMC0/time
X87 Mult [MFLOP/s] 1.0E-06*PMC1/time
X87 Div [MFLOP/s] 1.0E-06*PMC2/time
LONG
Formulas:
X87 [MFLOP/s] = 1.0E-06*(X87_FLOPS_RETIRED_ADD+X87_FLOPS_RETIRED_MULT+X87_FLOPS_RETIRED_DIV)/time
X87 Add [MFLOP/s] = 1.0E-06*X87_FLOPS_RETIRED_ADD/time
X87 Mult [MFLOP/s] = 1.0E-06*X87_FLOPS_RETIRED_MULT/time
X87 Div [MFLOP/s] = 1.0E-06*X87_FLOPS_RETIRED_DIV/time
-
Profiling group to measure X87 FLOP rates.


@@ -0,0 +1,21 @@
SHORT Floating point exceptions
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 FP_INSTRUCTIONS_RETIRED_ALL
PMC2 FPU_EXCEPTIONS_ALL
METRICS
Runtime (RDTSC) [s] time
Overall FP exception rate PMC2/PMC0
FP exception rate PMC2/PMC1
LONG
Formulas:
Overall FP exception rate = FPU_EXCEPTIONS_ALL / INSTRUCTIONS_RETIRED
FP exception rate = FPU_EXCEPTIONS_ALL / FP_INSTRUCTIONS_RETIRED_ALL
-
Floating point exceptions occur, e.g., when denormal numbers are processed.
There might be a large penalty if there are too many floating point
exceptions.


@@ -0,0 +1,23 @@
SHORT Instruction cache miss rate/ratio
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 ICACHE_FETCHES
PMC2 ICACHE_REFILLS_L2
PMC3 ICACHE_REFILLS_MEM
METRICS
Runtime (RDTSC) [s] time
L1I request rate PMC1/PMC0
L1I miss rate (PMC2+PMC3)/PMC0
L1I miss ratio (PMC2+PMC3)/PMC1
LONG
Formulas:
L1I request rate = ICACHE_FETCHES / INSTRUCTIONS_RETIRED
L1I miss rate = (ICACHE_REFILLS_L2+ICACHE_REFILLS_MEM)/INSTRUCTIONS_RETIRED
L1I miss ratio = (ICACHE_REFILLS_L2+ICACHE_REFILLS_MEM)/ICACHE_FETCHES
-
This group measures the locality of your instruction code with regard to the
L1 I-Cache.


@@ -0,0 +1,33 @@
SHORT L2 cache bandwidth in MBytes/s
EVENTSET
PMC0 DATA_CACHE_REFILLS_L2_ALL
PMC1 DATA_CACHE_EVICTED_ALL
PMC2 CPU_CLOCKS_UNHALTED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC2*inverseClock
L2D load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2D load data volume [GBytes] 1.0E-09*PMC0*64.0
L2D evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L2D evict data volume [GBytes] 1.0E-09*PMC1*64.0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
L2D load bandwidth [MBytes/s] = 1.0E-06*DATA_CACHE_REFILLS_L2_ALL*64.0/time
L2D load data volume [GBytes] = 1.0E-09*DATA_CACHE_REFILLS_L2_ALL*64.0
L2D evict bandwidth [MBytes/s] = 1.0E-06*DATA_CACHE_EVICTED_ALL*64.0/time
L2D evict data volume [GBytes] = 1.0E-09*DATA_CACHE_EVICTED_ALL*64.0
L2 bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_L2_ALL+DATA_CACHE_EVICTED_ALL)*64/time
L2 data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_L2_ALL+DATA_CACHE_EVICTED_ALL)*64
-
Profiling group to measure L2 cache bandwidth. The bandwidth is
computed from the number of cache lines loaded from L2 into L1 and the
number of modified cache lines evicted from L1.
Note that this bandwidth also includes data transfers due to a
write-allocate load on a store miss in L1 and copy-back transfers
originating from L2.
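A minimal Go sketch of the bandwidth arithmetic used above, assuming 64-byte cache lines and invented counter values (the 1.0E-06 and 1.0E-09 factors convert bytes to MBytes and GBytes, respectively):

package main

import "fmt"

func main() {
	// Invented counter readings and runtime, for illustration only.
	refills := 5.0e8 // PMC0 DATA_CACHE_REFILLS_L2_ALL (lines loaded from L2 into L1)
	evicts := 2.0e8  // PMC1 DATA_CACHE_EVICTED_ALL (modified lines evicted from L1)
	runtime := 10.0  // measured runtime in seconds
	lineSize := 64.0 // bytes per cache line

	loadBW := 1.0e-6 * refills * lineSize / runtime     // L2D load bandwidth [MBytes/s]
	evictBW := 1.0e-6 * evicts * lineSize / runtime     // L2D evict bandwidth [MBytes/s]
	l2Volume := 1.0e-9 * (refills + evicts) * lineSize  // L2 data volume [GBytes]

	fmt.Printf("L2D load bandwidth:  %.1f MBytes/s\n", loadBW)
	fmt.Printf("L2D evict bandwidth: %.1f MBytes/s\n", evictBW)
	fmt.Printf("L2 data volume:      %.2f GBytes\n", l2Volume)
}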


@@ -0,0 +1,32 @@
SHORT L2 cache miss rate/ratio
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 L2_REQUESTS_ALL
PMC2 L2_MISSES_ALL
PMC3 L2_FILL_ALL
METRICS
Runtime (RDTSC) [s] time
L2 request rate (PMC1+PMC3)/PMC0
L2 miss rate PMC2/PMC0
L2 miss ratio PMC2/(PMC1+PMC3)
LONG
Formulas:
L2 request rate = (L2_REQUESTS_ALL+L2_FILL_ALL)/INSTRUCTIONS_RETIRED
L2 miss rate = L2_MISSES_ALL/INSTRUCTIONS_RETIRED
L2 miss ratio = L2_MISSES_ALL/(L2_REQUESTS_ALL+L2_FILL_ALL)
-
This group measures the locality of your data accesses with regard to the
L2 cache. The L2 request rate tells you how data-intensive your code is,
i.e. how many data accesses you have on average per instruction.
The L2 miss rate gives a measure of how often it was necessary to get
cache lines from memory. Finally, the L2 miss ratio tells you how many of your
memory references required a cache line to be loaded from a higher level.
While the L2 miss rate might be given by your algorithm, you should
try to get the L2 miss ratio as low as possible by increasing your cache reuse.
This group was taken from the whitepaper "Basic Performance Measurements for AMD Athlon 64,
AMD Opteron and AMD Phenom Processors" by Paul J. Drongowski.


@@ -0,0 +1,35 @@
SHORT Main memory bandwidth in MBytes/s
EVENTSET
PMC0 NORTHBRIDGE_READ_RESPONSE_ALL
PMC1 OCTWORDS_WRITE_TRANSFERS
PMC2 DRAM_ACCESSES_DCTO_ALL
PMC3 DRAM_ACCESSES_DCT1_ALL
METRICS
Runtime (RDTSC) [s] time
Memory read bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
Memory read data volume [GBytes] 1.0E-09*PMC0*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*PMC1*8.0/time
Memory write data volume [GBytes] 1.0E-09*PMC1*8.0
Memory bandwidth [MBytes/s] 1.0E-06*(PMC2+PMC3)*64.0/time
Memory data volume [GBytes] 1.0E-09*(PMC2+PMC3)*64.0
LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*NORTHBRIDGE_READ_RESPONSE_ALL*64/time
Memory read data volume [GBytes] = 1.0E-09*NORTHBRIDGE_READ_RESPONSE_ALL*64
Memory write bandwidth [MBytes/s] = 1.0E-06*OCTWORDS_WRITE_TRANSFERS*8/time
Memory write data volume [GBytes] = 1.0E-09*OCTWORDS_WRITE_TRANSFERS*8
Memory bandwidth [MBytes/s] = 1.0E-06*(DRAM_ACCESSES_DCTO_ALL+DRAM_ACCESSES_DCT1_ALL)*64/time
Memory data volume [GBytes] = 1.0E-09*(DRAM_ACCESSES_DCTO_ALL+DRAM_ACCESSES_DCT1_ALL)*64
-
Profiling group to measure the memory bandwidth drawn by all cores of a socket.
Note: As this group measures the accesses from all cores, it only makes sense
to measure with one core per socket, similar to the Intel Nehalem Uncore events.
The memory read bandwidth contains all data from DRAM, L3, or another cache,
including another core on the same node. The event OCTWORDS_WRITE_TRANSFERS counts
16-byte transfers, not 64-byte transfers.


@@ -0,0 +1,27 @@
SHORT Bandwidth on the Hypertransport links
EVENTSET
PMC0 CPU_TO_DRAM_LOCAL_TO_0
PMC1 CPU_TO_DRAM_LOCAL_TO_1
PMC2 CPU_TO_DRAM_LOCAL_TO_2
PMC3 CPU_TO_DRAM_LOCAL_TO_3
METRICS
Runtime (RDTSC) [s] time
Hyper Transport link0 bandwidth [MBytes/s] 1.0E-06*PMC0*4.0/time
Hyper Transport link1 bandwidth [MBytes/s] 1.0E-06*PMC1*4.0/time
Hyper Transport link2 bandwidth [MBytes/s] 1.0E-06*PMC2*4.0/time
Hyper Transport link3 bandwidth [MBytes/s] 1.0E-06*PMC3*4.0/time
LONG
Formulas:
Hyper Transport link0 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_0*4.0/time
Hyper Transport link1 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_1*4.0/time
Hyper Transport link2 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_2*4.0/time
Hyper Transport link3 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_3*4.0/time
-
Profiling group to measure the bandwidth over the HyperTransport links. It can be used
to detect NUMA problems. Usually there should be only limited traffic over the
HyperTransport links for optimal performance.


@@ -0,0 +1,27 @@
SHORT Bandwidth on the Hypertransport links
EVENTSET
PMC0 CPU_TO_DRAM_LOCAL_TO_4
PMC1 CPU_TO_DRAM_LOCAL_TO_5
PMC2 CPU_TO_DRAM_LOCAL_TO_6
PMC3 CPU_TO_DRAM_LOCAL_TO_7
METRICS
Runtime (RDTSC) [s] time
Hyper Transport link4 bandwidth [MBytes/s] 1.0E-06*PMC0*4.0/time
Hyper Transport link5 bandwidth [MBytes/s] 1.0E-06*PMC1*4.0/time
Hyper Transport link6 bandwidth [MBytes/s] 1.0E-06*PMC2*4.0/time
Hyper Transport link7 bandwidth [MBytes/s] 1.0E-06*PMC3*4.0/time
LONG
Formulas:
Hyper Transport link4 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_4*4.0/time
Hyper Transport link5 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_5*4.0/time
Hyper Transport link6 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_6*4.0/time
Hyper Transport link7 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_7*4.0/time
-
Profiling group to measure the bandwidth over the HyperTransport links. It can be used
to detect NUMA problems. Usually there should be only limited traffic over the
HyperTransport links for optimal performance.


@@ -0,0 +1,35 @@
SHORT TLB miss rate/ratio
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 DATA_CACHE_ACCESSES
PMC2 DTLB_L2_HIT_ALL
PMC3 DTLB_L2_MISS_ALL
METRICS
Runtime (RDTSC) [s] time
L1 DTLB request rate PMC1/PMC0
L1 DTLB miss rate (PMC2+PMC3)/PMC0
L1 DTLB miss ratio (PMC2+PMC3)/PMC1
L2 DTLB request rate (PMC2+PMC3)/PMC0
L2 DTLB miss rate PMC3/PMC0
L2 DTLB miss ratio PMC3/(PMC2+PMC3)
LONG
Formulas:
L1 DTLB request rate = DATA_CACHE_ACCESSES / INSTRUCTIONS_RETIRED
L1 DTLB miss rate = (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)/INSTRUCTIONS_RETIRED
L1 DTLB miss ratio = (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)/DATA_CACHE_ACCESSES
L2 DTLB request rate = (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)/INSTRUCTIONS_RETIRED
L2 DTLB miss rate = DTLB_L2_MISS_ALL / INSTRUCTIONS_RETIRED
L2 DTLB miss ratio = DTLB_L2_MISS_ALL / (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)
-
The L1 DTLB request rate tells you how data-intensive your code is,
i.e. how many data accesses you have on average per instruction.
The DTLB miss rate gives a measure of how often a TLB miss occurred
per instruction. Finally, the L1 DTLB miss ratio tells you how many
of your memory references caused a TLB miss on average.
NOTE: The L2 metrics are only relevant if the L2 DTLB request rate is equal to the L1 DTLB miss rate!
This group was taken from the whitepaper "Basic Performance Measurements for AMD Athlon 64,
AMD Opteron and AMD Phenom Processors" by Paul J. Drongowski.