Add likwid collector

Thomas Roehl
2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions


@@ -0,0 +1,26 @@
SHORT Branch prediction miss rate/ratio
EVENTSET
PMC0 RETIRED_INSTRUCTIONS
PMC1 RETIRED_BRANCH_INSTR
PMC2 RETIRED_MISPREDICTED_BRANCH_INSTR
METRICS
Runtime (RDTSC) [s] time
Branch rate PMC1/PMC0
Branch misprediction rate PMC2/PMC0
Branch misprediction ratio PMC2/PMC1
Instructions per branch PMC0/PMC1
LONG
Formulas:
Branch rate = RETIRED_BRANCH_INSTR/RETIRED_INSTRUCTIONS
Branch misprediction rate = RETIRED_MISPREDICTED_BRANCH_INSTR/RETIRED_INSTRUCTIONS
Branch misprediction ratio = RETIRED_MISPREDICTED_BRANCH_INSTR/RETIRED_BRANCH_INSTR
Instructions per branch = RETIRED_INSTRUCTIONS/RETIRED_BRANCH_INSTR
-
The rates state how often, on average, a branch or a mispredicted branch occurred
per retired instruction. The branch misprediction ratio directly relates the two
branch counts: it gives the fraction of all branch instructions that were
mispredicted. Instructions per branch is the inverse of the branch rate.
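A minimal sketch of how these derived metrics follow from the raw counts (the counter values below are hypothetical, not taken from any real measurement):

```python
# Hypothetical raw counter readings, named after the EVENTSET above.
retired_instr = 1_000_000       # PMC0 RETIRED_INSTRUCTIONS
retired_branches = 200_000      # PMC1 RETIRED_BRANCH_INSTR
mispred_branches = 10_000       # PMC2 RETIRED_MISPREDICTED_BRANCH_INSTR

branch_rate = retired_branches / retired_instr          # PMC1/PMC0
mispred_rate = mispred_branches / retired_instr         # PMC2/PMC0
mispred_ratio = mispred_branches / retired_branches     # PMC2/PMC1
instr_per_branch = retired_instr / retired_branches     # PMC0/PMC1 = 1/branch_rate
```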


@@ -0,0 +1,32 @@
SHORT Data cache miss rate/ratio
EVENTSET
PMC0 RETIRED_INSTRUCTIONS
PMC1 DATA_CACHE_ACCESSES
PMC2 DATA_CACHE_REFILLS_VALID
PMC3 DATA_CACHE_MISSES_ALL
METRICS
Runtime (RDTSC) [s] time
data cache misses PMC3
data cache request rate PMC1/PMC0
data cache miss rate (PMC2)/PMC0
data cache miss ratio (PMC2)/PMC1
LONG
Formulas:
data cache misses = DATA_CACHE_MISSES_ALL
data cache request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS
data cache miss rate = (DATA_CACHE_REFILLS_VALID) / RETIRED_INSTRUCTIONS
data cache miss ratio = (DATA_CACHE_REFILLS_VALID)/DATA_CACHE_ACCESSES
-
This group measures the locality of your data accesses with regard to the
L1 cache. The data cache request rate tells you how data intensive your code is,
i.e. how many data accesses you have on average per instruction.
The data cache miss rate measures how often it was necessary to get
cache lines from higher levels of the memory hierarchy, and the
data cache miss ratio tells you how many of your memory references required
a cache line to be loaded from a higher level. While the data cache miss rate
might be dictated by your algorithm, you should try to get the data cache miss
ratio as low as possible by increasing your cache reuse.
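The distinction between miss rate (per instruction) and miss ratio (per access) can be sketched with hypothetical counter values:

```python
# Hypothetical raw counter readings, named after the EVENTSET above.
retired_instr = 1_000_000   # PMC0 RETIRED_INSTRUCTIONS
dc_accesses = 400_000       # PMC1 DATA_CACHE_ACCESSES
dc_refills = 20_000         # PMC2 DATA_CACHE_REFILLS_VALID

request_rate = dc_accesses / retired_instr   # PMC1/PMC0: accesses per instruction
miss_rate = dc_refills / retired_instr       # PMC2/PMC0: misses per instruction
miss_ratio = dc_refills / dc_accesses        # PMC2/PMC1: fraction of accesses that miss
```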


@@ -0,0 +1,26 @@
SHORT Cycles per instruction
EVENTSET
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 RETIRED_UOPS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
CPI PMC1/PMC0
CPI (based on uops) PMC1/PMC2
IPC PMC0/PMC1
LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
CPI (based on uops) = CPU_CLOCKS_UNHALTED/RETIRED_UOPS
IPC = RETIRED_INSTRUCTIONS/CPU_CLOCKS_UNHALTED
-
This group measures how efficiently the processor works with
regard to instruction throughput. RETIRED_INSTRUCTIONS is also
important as a standalone metric, as it tells you how many instructions
you need to execute for a task. An optimization might show a very
low CPI value but execute many more instructions overall.
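The caveat above can be made concrete with two hypothetical runs of the same task: the "optimized" run has better CPI but retires more instructions, so it is not automatically faster.

```python
# Two hypothetical measurements of the same task (values are illustrative only).
# Each tuple: (RETIRED_INSTRUCTIONS as PMC0, CPU_CLOCKS_UNHALTED as PMC1)
baseline = (2_000_000, 1_600_000)
optimized = (3_000_000, 1_500_000)

cpi_base = baseline[1] / baseline[0]        # PMC1/PMC0 = 0.8
cpi_opt = optimized[1] / optimized[0]       # PMC1/PMC0 = 0.5, "better" CPI
# Total cycles are what actually determine runtime at fixed frequency:
faster = optimized[1] < baseline[1]
```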


@@ -0,0 +1,16 @@
SHORT Load to store ratio
EVENTSET
PMC0 LS_DISPATCH_LOADS
PMC1 LS_DISPATCH_STORES
METRICS
Runtime (RDTSC) [s] time
Load to store ratio PMC0/PMC1
LONG
Formulas:
Load to store ratio = LS_DISPATCH_LOADS/LS_DISPATCH_STORES
-
This is a simple metric to determine your load to store ratio.


@@ -0,0 +1,23 @@
SHORT Double Precision MFLOP/s
EVENTSET
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 RETIRED_UOPS
PMC3 RETIRED_FLOPS_DOUBLE_ALL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
DP [MFLOP/s] 1.0E-06*(PMC3)/time
CPI PMC1/PMC0
CPI (based on uops) PMC1/PMC2
IPC PMC0/PMC1
LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(RETIRED_FLOPS_DOUBLE_ALL)/time
-
Profiling group to measure the double precision FLOP rate.
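The MFLOP/s metric is a straightforward scaling of the FLOP count by the wall-clock time; a sketch with hypothetical values (the same arithmetic applies to the single precision group below):

```python
# Hypothetical values: FLOP counter reading and measured wall-clock time.
flops_double = 4_000_000_000   # PMC3 RETIRED_FLOPS_DOUBLE_ALL
runtime_s = 2.0                # time from the RDTSC-based runtime metric

mflops = 1.0e-6 * flops_double / runtime_s   # DP [MFLOP/s]
```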


@@ -0,0 +1,23 @@
SHORT Single Precision MFLOP/s
EVENTSET
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 RETIRED_UOPS
PMC3 RETIRED_FLOPS_SINGLE_ALL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
SP [MFLOP/s] 1.0E-06*(PMC3)/time
CPI PMC1/PMC0
CPI (based on uops) PMC1/PMC2
IPC PMC0/PMC1
LONG
Formulas:
SP [MFLOP/s] = 1.0E-06*(RETIRED_FLOPS_SINGLE_ALL)/time
-
Profiling group to measure single precision FLOP rate.


@@ -0,0 +1,21 @@
SHORT Floating point exceptions
EVENTSET
PMC0 RETIRED_INSTRUCTIONS
PMC1 RETIRED_FP_INSTRUCTIONS_ALL
PMC2 FPU_EXCEPTION_ALL
METRICS
Runtime (RDTSC) [s] time
Overall FP exception rate PMC2/PMC0
FP exception rate PMC2/PMC1
LONG
Formulas:
Overall FP exception rate = FPU_EXCEPTION_ALL/RETIRED_INSTRUCTIONS
FP exception rate = FPU_EXCEPTION_ALL/RETIRED_FP_INSTRUCTIONS_ALL
-
Floating point exceptions occur, for example, when denormal numbers are
processed. There can be a large penalty if too many floating point
exceptions occur.


@@ -0,0 +1,23 @@
SHORT Instruction cache miss rate/ratio
EVENTSET
PMC0 INSTRUCTION_CACHE_FETCHES
PMC1 INSTRUCTION_CACHE_L2_REFILLS
PMC2 INSTRUCTION_CACHE_SYSTEM_REFILLS
PMC3 RETIRED_INSTRUCTIONS
METRICS
Runtime (RDTSC) [s] time
L1I request rate PMC0/PMC3
L1I miss rate (PMC1+PMC2)/PMC3
L1I miss ratio (PMC1+PMC2)/PMC0
LONG
Formulas:
L1I request rate = INSTRUCTION_CACHE_FETCHES / RETIRED_INSTRUCTIONS
L1I miss rate = (INSTRUCTION_CACHE_L2_REFILLS + INSTRUCTION_CACHE_SYSTEM_REFILLS)/RETIRED_INSTRUCTIONS
L1I miss ratio = (INSTRUCTION_CACHE_L2_REFILLS + INSTRUCTION_CACHE_SYSTEM_REFILLS)/INSTRUCTION_CACHE_FETCHES
-
This group measures the locality of your instruction code with regard to the
L1 I-Cache.


@@ -0,0 +1,29 @@
SHORT L2 cache bandwidth in MBytes/s
EVENTSET
PMC0 DATA_CACHE_REFILLS_ALL
PMC1 DATA_CACHE_REFILLS_SYSTEM
PMC2 CPU_CLOCKS_UNHALTED
METRICS
Runtime (RDTSC) [s] time
L2 bandwidth [MBytes/s] 1.0E-06*(PMC0-PMC1)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC0-PMC1)*64.0
Cache refill bandwidth System/L2 [MBytes/s] 1.0E-06*PMC0*64.0/time
Cache refill bandwidth System [MBytes/s] 1.0E-06*PMC1*64.0/time
LONG
Formulas:
L2 bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_ALL-DATA_CACHE_REFILLS_SYSTEM)*64/time
L2 data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_ALL-DATA_CACHE_REFILLS_SYSTEM)*64
Cache refill bandwidth system/L2 [MBytes/s] = 1.0E-06*DATA_CACHE_REFILLS_ALL*64/time
Cache refill bandwidth system [MBytes/s] = 1.0E-06*DATA_CACHE_REFILLS_SYSTEM*64/time
-
Profiling group to measure L2 cache bandwidth. The bandwidth is
computed from the number of cache lines loaded from L2 to L1 and the
number of modified cache lines evicted from L1.
Note that this bandwidth also includes data transfers due to a
write-allocate load on a store miss in L1 and copy-back transfers
originating from L2. The L2-L1 data volume is the total data volume
transferred between L2 and L1.
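The bandwidth formulas all follow the same pattern: cache-line events times 64 bytes per line, scaled by time. A sketch with hypothetical counter values:

```python
# Hypothetical raw counter readings, named after the EVENTSET above.
CACHE_LINE_BYTES = 64.0
refills_all = 50_000_000     # PMC0 DATA_CACHE_REFILLS_ALL (L2 + system)
refills_system = 10_000_000  # PMC1 DATA_CACHE_REFILLS_SYSTEM
runtime_s = 1.0              # measured wall-clock time

# Refills served by L2 are the difference between all refills and system refills.
l2_bw_mbytes = 1.0e-6 * (refills_all - refills_system) * CACHE_LINE_BYTES / runtime_s
l2_volume_gbytes = 1.0e-9 * (refills_all - refills_system) * CACHE_LINE_BYTES
```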


@@ -0,0 +1,31 @@
SHORT L2 cache miss rate/ratio
EVENTSET
PMC0 RETIRED_INSTRUCTIONS
PMC1 REQUESTS_TO_L2_DC_FILL
PMC2 L2_CACHE_MISS_DC_FILL
METRICS
Runtime (RDTSC) [s] time
L2 request rate PMC1/PMC0
L2 miss rate PMC2/PMC0
L2 miss ratio PMC2/PMC1
LONG
Formulas:
L2 request rate = REQUESTS_TO_L2_DC_FILL/RETIRED_INSTRUCTIONS
L2 miss rate = L2_CACHE_MISS_DC_FILL/RETIRED_INSTRUCTIONS
L2 miss ratio = L2_CACHE_MISS_DC_FILL/REQUESTS_TO_L2_DC_FILL
-
This group measures the locality of your data accesses with regard to the L2
cache. The L2 request rate tells you how data intensive your code is, i.e. how
many data accesses you have on average per instruction. The L2 miss rate
measures how often it was necessary to get cache lines from memory, and the
L2 miss ratio tells you how many of your memory references required a cache
line to be loaded from a higher level. While the L2 miss rate might be dictated
by your algorithm, you should try to get the L2 miss ratio as low as possible
by increasing your cache reuse. This group is inspired by the whitepaper
"Basic Performance Measurements for AMD Athlon 64, AMD Opteron and
AMD Phenom Processors" by Paul J. Drongowski.


@@ -0,0 +1,29 @@
SHORT L3 cache bandwidth in MBytes/s
EVENTSET
PMC0 L2_FILL_WB_FILL
PMC1 L2_FILL_WB_WB
PMC2 CPU_CLOCKS_UNHALTED
METRICS
Runtime (RDTSC) [s] time
L3 load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L3 load data volume [GBytes] 1.0E-09*PMC0*64.0
L3 evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L3 evict data volume [GBytes] 1.0E-09*PMC1*64.0
L3 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
L3 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
L3 load bandwidth [MBytes/s] = 1.0E-06*L2_FILL_WB_FILL*64.0/time
L3 load data volume [GBytes] = 1.0E-09*L2_FILL_WB_FILL*64.0
L3 evict bandwidth [MBytes/s] = 1.0E-06*L2_FILL_WB_WB*64.0/time
L3 evict data volume [GBytes] = 1.0E-09*L2_FILL_WB_WB*64.0
L3 bandwidth [MBytes/s] = 1.0E-06*(L2_FILL_WB_FILL+L2_FILL_WB_WB)*64/time
L3 data volume [GBytes] = 1.0E-09*(L2_FILL_WB_FILL+L2_FILL_WB_WB)*64
-
Profiling group to measure L3 cache bandwidth. The bandwidth is
computed by the number of cache line loaded from L3 to L2 and the
number of modified cache lines evicted from the L2.


@@ -0,0 +1,35 @@
SHORT L3 cache miss rate/ratio
EVENTSET
PMC0 RETIRED_INSTRUCTIONS
UPMC0 UNC_READ_REQ_TO_L3_ALL
UPMC1 UNC_L3_CACHE_MISS_ALL
UPMC2 UNC_L3_LATENCY_CYCLE_COUNT
UPMC3 UNC_L3_LATENCY_REQUEST_COUNT
METRICS
Runtime (RDTSC) [s] time
L3 request rate UPMC0/PMC0
L3 miss rate UPMC1/PMC0
L3 miss ratio UPMC1/UPMC0
L3 average access latency [cycles] UPMC2/UPMC3
LONG
Formulas:
L3 request rate = UNC_READ_REQ_TO_L3_ALL/RETIRED_INSTRUCTIONS
L3 miss rate = UNC_L3_CACHE_MISS_ALL/RETIRED_INSTRUCTIONS
L3 miss ratio = UNC_L3_CACHE_MISS_ALL/UNC_READ_REQ_TO_L3_ALL
L3 average access latency = UNC_L3_LATENCY_CYCLE_COUNT/UNC_L3_LATENCY_REQUEST_COUNT
-
This group measures the locality of your data accesses with regard to the L3
cache. The L3 request rate tells you how data intensive your code is, i.e. how
many data accesses you have on average per instruction. The L3 miss rate
measures how often it was necessary to get cache lines from memory, and the
L3 miss ratio tells you how many of your memory references required a cache
line to be loaded from a higher level. While the L3 miss rate might be dictated
by your algorithm, you should try to get the L3 miss ratio as low as possible
by increasing your cache reuse. This group was inspired by the whitepaper
"Basic Performance Measurements for AMD Athlon 64, AMD Opteron and
AMD Phenom Processors" by Paul J. Drongowski.


@@ -0,0 +1,26 @@
SHORT Bandwidth on the Hypertransport links
EVENTSET
UPMC0 UNC_LINK_TRANSMIT_BW_L0_USE
UPMC1 UNC_LINK_TRANSMIT_BW_L1_USE
UPMC2 UNC_LINK_TRANSMIT_BW_L2_USE
UPMC3 UNC_LINK_TRANSMIT_BW_L3_USE
METRICS
Runtime (RDTSC) [s] time
Link bandwidth L0 [MBytes/s] 1.0E-06*UPMC0*4.0/time
Link bandwidth L1 [MBytes/s] 1.0E-06*UPMC1*4.0/time
Link bandwidth L2 [MBytes/s] 1.0E-06*UPMC2*4.0/time
Link bandwidth L3 [MBytes/s] 1.0E-06*UPMC3*4.0/time
LONG
Formulas:
Link bandwidth L0 [MBytes/s] = 1.0E-06*UNC_LINK_TRANSMIT_BW_L0_USE*4.0/time
Link bandwidth L1 [MBytes/s] = 1.0E-06*UNC_LINK_TRANSMIT_BW_L1_USE*4.0/time
Link bandwidth L2 [MBytes/s] = 1.0E-06*UNC_LINK_TRANSMIT_BW_L2_USE*4.0/time
Link bandwidth L3 [MBytes/s] = 1.0E-06*UNC_LINK_TRANSMIT_BW_L3_USE*4.0/time
-
Profiling group to measure the HyperTransport link bandwidth for the four links
of a local node. This indicates the data flow between different ccNUMA nodes.


@@ -0,0 +1,20 @@
SHORT Main memory bandwidth in MBytes/s
EVENTSET
UPMC0 UNC_DRAM_ACCESSES_DCT0_ALL
UPMC1 UNC_DRAM_ACCESSES_DCT1_ALL
METRICS
Runtime (RDTSC) [s] time
Memory bandwidth [MBytes/s] 1.0E-06*(UPMC0+UPMC1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(UPMC0+UPMC1)*64.0
LONG
Formulas:
Memory bandwidth [MBytes/s] = 1.0E-06*(UNC_DRAM_ACCESSES_DCT0_ALL+UNC_DRAM_ACCESSES_DCT1_ALL)*64/time
Memory data volume [GBytes] = 1.0E-09*(UNC_DRAM_ACCESSES_DCT0_ALL+UNC_DRAM_ACCESSES_DCT1_ALL)*64
-
Profiling group to measure the memory bandwidth drawn by all cores of a socket.
Note: As this group measures the accesses from all cores, it only makes sense
to measure with one core per socket, similar to the Intel Nehalem Uncore events.


@@ -0,0 +1,28 @@
SHORT Read/Write Events between the ccNUMA nodes
EVENTSET
UPMC0 UNC_CPU_TO_DRAM_LOCAL_TO_0
UPMC1 UNC_CPU_TO_DRAM_LOCAL_TO_1
UPMC2 UNC_CPU_TO_DRAM_LOCAL_TO_2
UPMC3 UNC_CPU_TO_DRAM_LOCAL_TO_3
METRICS
Runtime (RDTSC) [s] time
DRAM read/write local to 0 [MegaEvents/s] 1.0E-06*UPMC0/time
DRAM read/write local to 1 [MegaEvents/s] 1.0E-06*UPMC1/time
DRAM read/write local to 2 [MegaEvents/s] 1.0E-06*UPMC2/time
DRAM read/write local to 3 [MegaEvents/s] 1.0E-06*UPMC3/time
LONG
Formulas:
DRAM read/write local to 0 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_0/time
DRAM read/write local to 1 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_1/time
DRAM read/write local to 2 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_2/time
DRAM read/write local to 3 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_3/time
-
Profiling group to measure the traffic from the local CPU to the different
DRAM NUMA nodes. This group allows you to detect NUMA problems in threaded
code. You must first determine on which memory domains your code is running.
A code should only have significant traffic to its own memory domain.
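One way to use these per-node rates, sketched with hypothetical counter values: identify the dominant (local) node and check that the share of traffic going elsewhere stays small.

```python
# Hypothetical per-node counter readings (UPMC0..UPMC3 above); values illustrative.
runtime_s = 1.0
dram_events = {0: 90_000_000, 1: 2_000_000, 2: 1_500_000, 3: 1_000_000}

# MegaEvents/s per node, as in the METRICS section.
rates = {node: 1.0e-6 * count / runtime_s for node, count in dram_events.items()}
local_node = max(rates, key=rates.get)           # node with the dominant traffic
remote_share = 1.0 - rates[local_node] / sum(rates.values())
# A healthy NUMA placement should keep remote_share close to zero.
```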


@@ -0,0 +1,28 @@
SHORT Read/Write Events between the ccNUMA nodes
EVENTSET
UPMC0 UNC_CPU_TO_DRAM_LOCAL_TO_0
UPMC1 UNC_CPU_TO_DRAM_LOCAL_TO_1
UPMC2 UNC_CPU_TO_DRAM_LOCAL_TO_2
UPMC3 UNC_CPU_TO_DRAM_LOCAL_TO_3
METRICS
Runtime (RDTSC) [s] time
DRAM read/write local to 0 [MegaEvents/s] 1.0E-06*UPMC0/time
DRAM read/write local to 1 [MegaEvents/s] 1.0E-06*UPMC1/time
DRAM read/write local to 2 [MegaEvents/s] 1.0E-06*UPMC2/time
DRAM read/write local to 3 [MegaEvents/s] 1.0E-06*UPMC3/time
LONG
Formulas:
DRAM read/write local to 0 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_0/time
DRAM read/write local to 1 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_1/time
DRAM read/write local to 2 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_2/time
DRAM read/write local to 3 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_3/time
-
Profiling group to measure the traffic from the local CPU to the different
DRAM NUMA nodes. This group allows you to detect NUMA problems in threaded
code. You must first determine on which memory domains your code is running.
A code should only have significant traffic to its own memory domain.


@@ -0,0 +1,28 @@
SHORT Read/Write Events between the ccNUMA nodes
EVENTSET
UPMC0 UNC_CPU_TO_DRAM_LOCAL_TO_4
UPMC1 UNC_CPU_TO_DRAM_LOCAL_TO_5
UPMC2 UNC_CPU_TO_DRAM_LOCAL_TO_6
UPMC3 UNC_CPU_TO_DRAM_LOCAL_TO_7
METRICS
Runtime (RDTSC) [s] time
DRAM read/write local to 4 [MegaEvents/s] 1.0E-06*UPMC0/time
DRAM read/write local to 5 [MegaEvents/s] 1.0E-06*UPMC1/time
DRAM read/write local to 6 [MegaEvents/s] 1.0E-06*UPMC2/time
DRAM read/write local to 7 [MegaEvents/s] 1.0E-06*UPMC3/time
LONG
Formulas:
DRAM read/write local to 4 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_4/time
DRAM read/write local to 5 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_5/time
DRAM read/write local to 6 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_6/time
DRAM read/write local to 7 [MegaEvents/s] = 1.0E-06*UNC_CPU_TO_DRAM_LOCAL_TO_7/time
-
Profiling group to measure the traffic from the local CPU to the different
DRAM NUMA nodes. This group allows you to detect NUMA problems in threaded
code. You must first determine on which memory domains your code is running.
A code should only have significant traffic to its own memory domain.