Add likwid collector

Thomas Roehl
2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions


@@ -0,0 +1,26 @@
SHORT Branch prediction miss rate/ratio
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 BRANCH_RETIRED
PMC2 BRANCH_MISPREDICT_RETIRED
METRICS
Runtime (RDTSC) [s] time
Branch rate PMC1/PMC0
Branch misprediction rate PMC2/PMC0
Branch misprediction ratio PMC2/PMC1
Instructions per branch PMC0/PMC1
LONG
Formulas:
Branch rate = BRANCH_RETIRED/INSTRUCTIONS_RETIRED
Branch misprediction rate = BRANCH_MISPREDICT_RETIRED/INSTRUCTIONS_RETIRED
Branch misprediction ratio = BRANCH_MISPREDICT_RETIRED/BRANCH_RETIRED
Instructions per branch = INSTRUCTIONS_RETIRED/BRANCH_RETIRED
-
The rates state how often, on average, a branch or a mispredicted branch occurred
per retired instruction. The branch misprediction ratio directly relates the
mispredicted branches to all branch instructions, i.e. which fraction of all
branches was mispredicted. Instructions per branch is 1/branch rate.
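To illustrate how the METRICS expressions are meant to be read, here is a minimal, self-contained Go sketch (invented counter values, not the collector's actual code) that evaluates the branch metrics exactly as the formulas above prescribe, with PMC0..PMC2 replaced by raw counter readings:

package main

import "fmt"

func main() {
	// Hypothetical raw counter readings (invented for illustration only).
	instructionsRetired := 1.0e9 // PMC0 INSTRUCTIONS_RETIRED
	branchRetired := 2.0e8       // PMC1 BRANCH_RETIRED
	branchMispredicted := 4.0e6  // PMC2 BRANCH_MISPREDICT_RETIRED

	// Direct transcription of the METRICS formulas of this group.
	fmt.Printf("Branch rate:                %.4f\n", branchRetired/instructionsRetired)
	fmt.Printf("Branch misprediction rate:  %.6f\n", branchMispredicted/instructionsRetired)
	fmt.Printf("Branch misprediction ratio: %.4f\n", branchMispredicted/branchRetired)
	fmt.Printf("Instructions per branch:    %.2f\n", instructionsRetired/branchRetired)
}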


@@ -0,0 +1,34 @@
SHORT Data cache miss rate/ratio
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 DATA_CACHE_ACCESSES
PMC2 DATA_CACHE_REFILLS_L2_ALL
PMC3 DATA_CACHE_REFILLS_NORTHBRIDGE_ALL
METRICS
Runtime (RDTSC) [s] time
data cache misses PMC2+PMC3
data cache request rate PMC1/PMC0
data cache miss rate (PMC2+PMC3)/PMC0
data cache miss ratio (PMC2+PMC3)/PMC1
LONG
Formulas:
data cache misses = DATA_CACHE_REFILLS_L2_ALL + DATA_CACHE_REFILLS_NORTHBRIDGE_ALL
data cache request rate = DATA_CACHE_ACCESSES / INSTRUCTIONS_RETIRED
data cache miss rate = (DATA_CACHE_REFILLS_L2_ALL + DATA_CACHE_REFILLS_NORTHBRIDGE_ALL)/INSTRUCTIONS_RETIRED
data cache miss ratio = (DATA_CACHE_REFILLS_L2_ALL + DATA_CACHE_REFILLS_NORTHBRIDGE_ALL)/DATA_CACHE_ACCESSES
-
This group measures the locality of your data accesses with regard to the
L1 cache. The data cache request rate tells you how data-intensive your code is,
i.e. how many data accesses you have on average per instruction.
The data cache miss rate gives a measure of how often it was necessary to get
cache lines from higher levels of the memory hierarchy. Finally, the
data cache miss ratio tells you how many of your memory references required
a cache line to be loaded from a higher level. While the data cache miss rate
might be given by your algorithm, you should try to get the data cache miss ratio
as low as possible by increasing your cache reuse.
This group was taken from the whitepaper "Basic Performance Measurements for AMD Athlon 64,
AMD Opteron and AMD Phenom Processors" by Paul J. Drongowski.
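A worked example with invented numbers: for 10^9 retired instructions, 4*10^8 data cache accesses and a total of 2*10^7 refills from L2 and Northbridge, the request rate is 0.4 (accesses per instruction), the miss rate is 0.02 (misses per instruction) and the miss ratio is 0.05, i.e. 5% of all L1 data accesses had to be refilled from a higher level.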


@@ -0,0 +1,26 @@
SHORT Cycles per instruction
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 CPU_CLOCKS_UNHALTED
PMC2 UOPS_RETIRED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
CPI PMC1/PMC0
CPI (based on uops) PMC1/PMC2
IPC PMC0/PMC1
LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/INSTRUCTIONS_RETIRED
CPI (based on uops) = CPU_CLOCKS_UNHALTED/UOPS_RETIRED
IPC = INSTRUCTIONS_RETIRED/CPU_CLOCKS_UNHALTED
-
This group measures how efficiently the processor works with
regard to instruction throughput. Also important as a standalone
metric is INSTRUCTIONS_RETIRED, as it tells you how many instructions
you need to execute for a task. An optimization might show very
low CPI values but execute many more instructions for it.
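As a numerical illustration with invented values (assuming inverseClock is the inverse of the nominal core clock): 1.2*10^9 unhalted cycles at 2.4 GHz give 1.2e9 * (1/2.4e9) = 0.5 s of unhalted runtime; with 2.0*10^9 retired instructions in that interval, the CPI would be 1.2e9/2.0e9 = 0.6 and the IPC about 1.67.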


@@ -0,0 +1,24 @@
SHORT Double Precision MFLOP/s
EVENTSET
PMC0 SSE_RETIRED_ADD_DOUBLE_FLOPS
PMC1 SSE_RETIRED_MULT_DOUBLE_FLOPS
PMC2 CPU_CLOCKS_UNHALTED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC2*inverseClock
DP [MFLOP/s] 1.0E-06*(PMC0+PMC1)/time
DP Add [MFLOP/s] 1.0E-06*PMC0/time
DP Mult [MFLOP/s] 1.0E-06*PMC1/time
LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_DOUBLE_FLOPS+SSE_RETIRED_MULT_DOUBLE_FLOPS)/time
DP Add [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_DOUBLE_FLOPS)/time
DP Mult [MFLOP/s] = 1.0E-06*(SSE_RETIRED_MULT_DOUBLE_FLOPS)/time
-
Profiling group to measure double precision SSE FLOPs.
Don't forget that your code might also execute X87 FLOPs.


@@ -0,0 +1,24 @@
SHORT Single Precision MFLOP/s
EVENTSET
PMC0 SSE_RETIRED_ADD_SINGLE_FLOPS
PMC1 SSE_RETIRED_MULT_SINGLE_FLOPS
PMC2 CPU_CLOCKS_UNHALTED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC2*inverseClock
SP [MFLOP/s] 1.0E-06*(PMC0+PMC1)/time
SP Add [MFLOP/s] 1.0E-06*PMC0/time
SP Mult [MFLOP/s] 1.0E-06*PMC1/time
LONG
Formulas:
SP [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_SINGLE_FLOPS+SSE_RETIRED_MULT_SINGLE_FLOPS)/time
SP Add [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_SINGLE_FLOPS)/time
SP Mult [MFLOP/s] = 1.0E-06*(SSE_RETIRED_MULT_SINGLE_FLOPS)/time
-
Profiling group to measure single precision SSE FLOPs.
Don't forget that your code might also execute X87 FLOPs.


@@ -0,0 +1,25 @@
SHORT X87 MFLOP/s
EVENTSET
PMC0 X87_FLOPS_RETIRED_ADD
PMC1 X87_FLOPS_RETIRED_MULT
PMC2 X87_FLOPS_RETIRED_DIV
PMC3 CPU_CLOCKS_UNHALTED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC3*inverseClock
X87 [MFLOP/s] 1.0E-06*(PMC0+PMC1+PMC2)/time
X87 Add [MFLOP/s] 1.0E-06*PMC0/time
X87 Mult [MFLOP/s] 1.0E-06*PMC1/time
X87 Div [MFLOP/s] 1.0E-06*PMC2/time
LONG
Formulas:
X87 [MFLOP/s] = 1.0E-06*(X87_FLOPS_RETIRED_ADD+X87_FLOPS_RETIRED_MULT+X87_FLOPS_RETIRED_DIV)/time
X87 Add [MFLOP/s] = 1.0E-06*X87_FLOPS_RETIRED_ADD/time
X87 Mult [MFLOP/s] = 1.0E-06*X87_FLOPS_RETIRED_MULT/time
X87 Div [MFLOP/s] = 1.0E-06*X87_FLOPS_RETIRED_DIV/time
-
Profiling group to measure X87 FLOP rates.


@@ -0,0 +1,21 @@
SHORT Floating point exceptions
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 FP_INSTRUCTIONS_RETIRED_ALL
PMC2 FPU_EXCEPTIONS_ALL
METRICS
Runtime (RDTSC) [s] time
Overall FP exception rate PMC2/PMC0
FP exception rate PMC2/PMC1
LONG
Formulas:
Overall FP exception rate = FPU_EXCEPTIONS_ALL / INSTRUCTIONS_RETIRED
FP exception rate = FPU_EXCEPTIONS_ALL / FP_INSTRUCTIONS_RETIRED_ALL
-
Floating point exceptions occur, e.g., when denormal numbers are processed.
There might be a large penalty if there are too many floating point
exceptions.


@@ -0,0 +1,23 @@
SHORT Instruction cache miss rate/ratio
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 ICACHE_FETCHES
PMC2 ICACHE_REFILLS_L2
PMC3 ICACHE_REFILLS_MEM
METRICS
Runtime (RDTSC) [s] time
L1I request rate PMC1/PMC0
L1I miss rate (PMC2+PMC3)/PMC0
L1I miss ratio (PMC2+PMC3)/PMC1
LONG
Formulas:
L1I request rate = ICACHE_FETCHES / INSTRUCTIONS_RETIRED
L1I miss rate = (ICACHE_REFILLS_L2+ICACHE_REFILLS_MEM)/INSTRUCTIONS_RETIRED
L1I miss ratio = (ICACHE_REFILLS_L2+ICACHE_REFILLS_MEM)/ICACHE_FETCHES
-
This group measures the locality of your instruction code with regard to the
L1 I-Cache.


@@ -0,0 +1,33 @@
SHORT L2 cache bandwidth in MBytes/s
EVENTSET
PMC0 DATA_CACHE_REFILLS_L2_ALL
PMC1 DATA_CACHE_EVICTED_ALL
PMC2 CPU_CLOCKS_UNHALTED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC2*inverseClock
L2D load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2D load data volume [GBytes] 1.0E-09*PMC0*64.0
L2D evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L2D evict data volume [GBytes] 1.0E-09*PMC1*64.0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
L2D load bandwidth [MBytes/s] = 1.0E-06*DATA_CACHE_REFILLS_L2_ALL*64.0/time
L2D load data volume [GBytes] = 1.0E-09*DATA_CACHE_REFILLS_L2_ALL*64.0
L2D evict bandwidth [MBytes/s] = 1.0E-06*DATA_CACHE_EVICTED_ALL*64.0/time
L2D evict data volume [GBytes] = 1.0E-09*DATA_CACHE_EVICTED_ALL*64.0
L2 bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_L2_ALL+DATA_CACHE_EVICTED_ALL)*64/time
L2 data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_L2_ALL+DATA_CACHE_EVICTED_ALL)*64
-
Profiling group to measure L2 cache bandwidth. The bandwidth is
computed from the number of cache lines loaded from L2 into L1 and the
number of modified cache lines evicted from L1.
Note that this bandwidth also includes data transfers due to a
write-allocate load on a store miss in L1 and copy-back transfers
originating from L2.
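A minimal Go sketch of the bandwidth arithmetic used above, assuming 64-byte cache lines and invented counter values (the 1.0E-06 and 1.0E-09 factors convert bytes to MBytes and GBytes, respectively):

package main

import "fmt"

func main() {
	// Invented counter readings and runtime, for illustration only.
	refills := 5.0e8 // PMC0 DATA_CACHE_REFILLS_L2_ALL (lines loaded from L2 into L1)
	evicts := 2.0e8  // PMC1 DATA_CACHE_EVICTED_ALL (modified lines evicted from L1)
	runtime := 10.0  // measured runtime in seconds
	lineSize := 64.0 // bytes per cache line

	loadBW := 1.0e-6 * refills * lineSize / runtime     // L2D load bandwidth [MBytes/s]
	evictBW := 1.0e-6 * evicts * lineSize / runtime     // L2D evict bandwidth [MBytes/s]
	l2Volume := 1.0e-9 * (refills + evicts) * lineSize  // L2 data volume [GBytes]

	fmt.Printf("L2D load bandwidth:  %.1f MBytes/s\n", loadBW)
	fmt.Printf("L2D evict bandwidth: %.1f MBytes/s\n", evictBW)
	fmt.Printf("L2 data volume:      %.2f GBytes\n", l2Volume)
}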


@@ -0,0 +1,32 @@
SHORT L2 cache miss rate/ratio
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 L2_REQUESTS_ALL
PMC2 L2_MISSES_ALL
PMC3 L2_FILL_ALL
METRICS
Runtime (RDTSC) [s] time
L2 request rate (PMC1+PMC3)/PMC0
L2 miss rate PMC2/PMC0
L2 miss ratio PMC2/(PMC1+PMC3)
LONG
Formulas:
L2 request rate = (L2_REQUESTS_ALL+L2_FILL_ALL)/INSTRUCTIONS_RETIRED
L2 miss rate = L2_MISSES_ALL/INSTRUCTIONS_RETIRED
L2 miss ratio = L2_MISSES_ALL/(L2_REQUESTS_ALL+L2_FILL_ALL)
-
This group measures the locality of your data accesses with regard to the
L2 cache. The L2 request rate tells you how data-intensive your code is,
i.e. how many data accesses you have on average per instruction.
The L2 miss rate gives a measure of how often it was necessary to get
cache lines from memory. Finally, the L2 miss ratio tells you how many of your
memory references required a cache line to be loaded from a higher level.
While the L2 miss rate might be given by your algorithm, you should
try to get the L2 miss ratio as low as possible by increasing your cache reuse.
This group was taken from the whitepaper "Basic Performance Measurements for AMD Athlon 64,
AMD Opteron and AMD Phenom Processors" by Paul J. Drongowski.


@@ -0,0 +1,35 @@
SHORT Main memory bandwidth in MBytes/s
EVENTSET
PMC0 NORTHBRIDGE_READ_RESPONSE_ALL
PMC1 OCTWORDS_WRITE_TRANSFERS
PMC2 DRAM_ACCESSES_DCTO_ALL
PMC3 DRAM_ACCESSES_DCT1_ALL
METRICS
Runtime (RDTSC) [s] time
Memory read bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
Memory read data volume [GBytes] 1.0E-09*PMC0*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*PMC1*8.0/time
Memory write data volume [GBytes] 1.0E-09*PMC1*8.0
Memory bandwidth [MBytes/s] 1.0E-06*(PMC2+PMC3)*64.0/time
Memory data volume [GBytes] 1.0E-09*(PMC2+PMC3)*64.0
LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*NORTHBRIDGE_READ_RESPONSE_ALL*64/time
Memory read data volume [GBytes] = 1.0E-09*NORTHBRIDGE_READ_RESPONSE_ALL*64
Memory write bandwidth [MBytes/s] = 1.0E-06*OCTWORDS_WRITE_TRANSFERS*8/time
Memory write data volume [GBytes] = 1.0E-09*OCTWORDS_WRITE_TRANSFERS*8
Memory bandwidth [MBytes/s] = 1.0E-06*(DRAM_ACCESSES_DCTO_ALL+DRAM_ACCESSES_DCT1_ALL)*64/time
Memory data volume [GBytes] = 1.0E-09*(DRAM_ACCESSES_DCTO_ALL+DRAM_ACCESSES_DCT1_ALL)*64
-
Profiling group to measure the memory bandwidth drawn by all cores of a socket.
Note: As this group measures the accesses from all cores, it only makes sense
to measure with one core per socket, similar to the Intel Nehalem Uncore events.
The memory read bandwidth contains all data from DRAM, L3, or another cache,
including another core on the same node. The event OCTWORDS_WRITE_TRANSFERS counts
16-byte transfers, not 64-byte transfers.


@@ -0,0 +1,27 @@
SHORT Bandwidth on the Hypertransport links
EVENTSET
PMC0 CPU_TO_DRAM_LOCAL_TO_0
PMC1 CPU_TO_DRAM_LOCAL_TO_1
PMC2 CPU_TO_DRAM_LOCAL_TO_2
PMC3 CPU_TO_DRAM_LOCAL_TO_3
METRICS
Runtime (RDTSC) [s] time
Hyper Transport link0 bandwidth [MBytes/s] 1.0E-06*PMC0*4.0/time
Hyper Transport link1 bandwidth [MBytes/s] 1.0E-06*PMC1*4.0/time
Hyper Transport link2 bandwidth [MBytes/s] 1.0E-06*PMC2*4.0/time
Hyper Transport link3 bandwidth [MBytes/s] 1.0E-06*PMC3*4.0/time
LONG
Formulas:
Hyper Transport link0 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_0*4.0/time
Hyper Transport link1 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_1*4.0/time
Hyper Transport link2 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_2*4.0/time
Hyper Transport link3 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_3*4.0/time
-
Profiling group to measure the bandwidth over the HyperTransport links. It can be used
to detect NUMA problems. Usually there should be only limited traffic over the
HyperTransport links for optimal performance.


@@ -0,0 +1,27 @@
SHORT Bandwidth on the Hypertransport links
EVENTSET
PMC0 CPU_TO_DRAM_LOCAL_TO_4
PMC1 CPU_TO_DRAM_LOCAL_TO_5
PMC2 CPU_TO_DRAM_LOCAL_TO_6
PMC3 CPU_TO_DRAM_LOCAL_TO_7
METRICS
Runtime (RDTSC) [s] time
Hyper Transport link4 bandwidth [MBytes/s] 1.0E-06*PMC0*4.0/time
Hyper Transport link5 bandwidth [MBytes/s] 1.0E-06*PMC1*4.0/time
Hyper Transport link6 bandwidth [MBytes/s] 1.0E-06*PMC2*4.0/time
Hyper Transport link7 bandwidth [MBytes/s] 1.0E-06*PMC3*4.0/time
LONG
Formulas:
Hyper Transport link4 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_4*4.0/time
Hyper Transport link5 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_5*4.0/time
Hyper Transport link6 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_6*4.0/time
Hyper Transport link7 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_7*4.0/time
-
Profiling group to measure the bandwidth over the HyperTransport links. It can be used
to detect NUMA problems. Usually there should be only limited traffic over the
HyperTransport links for optimal performance.


@@ -0,0 +1,35 @@
SHORT TLB miss rate/ratio
EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 DATA_CACHE_ACCESSES
PMC2 DTLB_L2_HIT_ALL
PMC3 DTLB_L2_MISS_ALL
METRICS
Runtime (RDTSC) [s] time
L1 DTLB request rate PMC1/PMC0
L1 DTLB miss rate (PMC2+PMC3)/PMC0
L1 DTLB miss ratio (PMC2+PMC3)/PMC1
L2 DTLB request rate (PMC2+PMC3)/PMC0
L2 DTLB miss rate PMC3/PMC0
L2 DTLB miss ratio PMC3/(PMC2+PMC3)
LONG
Formulas:
L1 DTLB request rate = DATA_CACHE_ACCESSES / INSTRUCTIONS_RETIRED
L1 DTLB miss rate = (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)/INSTRUCTIONS_RETIRED
L1 DTLB miss ratio = (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)/DATA_CACHE_ACCESSES
L2 DTLB request rate = (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)/INSTRUCTIONS_RETIRED
L2 DTLB miss rate = DTLB_L2_MISS_ALL / INSTRUCTIONS_RETIRED
L2 DTLB miss ratio = DTLB_L2_MISS_ALL / (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)
-
The L1 DTLB request rate tells you how data-intensive your code is,
i.e. how many data accesses you have on average per instruction.
The DTLB miss rate gives a measure of how often a TLB miss occurred
per instruction. Finally, the L1 DTLB miss ratio tells you how many
of your memory references caused a TLB miss on average.
NOTE: The L2 metrics are only relevant if the L2 DTLB request rate is equal to the L1 DTLB miss rate!
This group was taken from the whitepaper "Basic Performance Measurements for AMD Athlon 64,
AMD Opteron and AMD Phenom Processors" by Paul J. Drongowski.