Add likwid collector

Thomas Roehl
2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions


@@ -0,0 +1,32 @@
SHORT Branch prediction miss rate/ratio
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 RETIRED_BRANCH_INSTR
PMC3 RETIRED_MISP_BRANCH_INSTR
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Branch rate PMC2/PMC0
Branch misprediction rate PMC3/PMC0
Branch misprediction ratio PMC3/PMC2
Instructions per branch PMC0/PMC2
LONG
Formulas:
Branch rate = RETIRED_BRANCH_INSTR/RETIRED_INSTRUCTIONS
Branch misprediction rate = RETIRED_MISP_BRANCH_INSTR/RETIRED_INSTRUCTIONS
Branch misprediction ratio = RETIRED_MISP_BRANCH_INSTR/RETIRED_BRANCH_INSTR
Instructions per branch = RETIRED_INSTRUCTIONS/RETIRED_BRANCH_INSTR
-
The rates state how often, on average, a branch or a mispredicted branch occurred
per retired instruction. The branch misprediction ratio directly relates the
mispredicted branches to all branch instructions, i.e. it states what fraction of
all branch instructions were mispredicted.
Instructions per branch is 1/branch rate.
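
These group files share a fixed plain-text layout (SHORT, EVENTSET, METRICS, LONG). As a rough, illustrative sketch of how such a file could be read into an event set and a list of metric formulas — not the actual collector code, which may instead load the groups through the LIKWID library, and with a hypothetical file name BRANCH.txt — a small Go program might look like this:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// Metric pairs a human-readable name with the counter formula that LIKWID
// evaluates, e.g. "Branch misprediction ratio" and "PMC3/PMC2".
type Metric struct {
    Name    string
    Formula string
}

func main() {
    f, err := os.Open("BRANCH.txt") // hypothetical file name for the group above
    if err != nil {
        panic(err)
    }
    defer f.Close()

    events := map[string]string{} // counter register -> event name
    var metrics []Metric
    section := ""

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        switch {
        case line == "" || line == "-":
            continue // skip blanks and the formula/description separator
        case line == "EVENTSET" || line == "METRICS" || line == "LONG":
            section = line
            continue
        case strings.HasPrefix(line, "SHORT"):
            continue // one-line group description, not needed here
        }
        fields := strings.Fields(line)
        switch section {
        case "EVENTSET": // e.g. "PMC2 RETIRED_BRANCH_INSTR"
            if len(fields) == 2 {
                events[fields[0]] = fields[1]
            }
        case "METRICS": // metric name, then the formula as the last field
            if len(fields) >= 2 {
                name := strings.Join(fields[:len(fields)-1], " ")
                metrics = append(metrics, Metric{Name: name, Formula: fields[len(fields)-1]})
            }
        }
    }
    if err := scanner.Err(); err != nil {
        panic(err)
    }
    fmt.Println("events:", events)
    fmt.Println("metrics:", metrics)
}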


@@ -0,0 +1,39 @@
SHORT Data cache miss rate/ratio
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 DATA_CACHE_ACCESSES
PMC3 DATA_CACHE_REFILLS_ALL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
data cache requests PMC2
data cache request rate PMC2/PMC0
data cache misses PMC3
data cache miss rate PMC3/PMC0
data cache miss ratio PMC3/PMC2
LONG
Formulas:
data cache requests = DATA_CACHE_ACCESSES
data cache request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS
data cache misses = DATA_CACHE_REFILLS_ALL
data cache miss rate = DATA_CACHE_REFILLS_ALL / RETIRED_INSTRUCTIONS
data cache miss ratio = DATA_CACHE_REFILLS_ALL / DATA_CACHE_ACCESSES
-
This group measures the locality of your data accesses with regard to the
L1 cache. The data cache request rate tells you how data-intensive your code is,
i.e. how many data accesses you have on average per instruction.
The data cache miss rate gives a measure of how often it was necessary to get
cache lines from higher levels of the memory hierarchy. And finally, the
data cache miss ratio tells you how many of your memory references required
a cache line to be loaded from a higher level. While the data cache miss rate
might be given by your algorithm, you should try to get the data cache miss ratio
as low as possible by increasing your cache reuse.
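
To make the rate/ratio distinction concrete, here is a small Go example with made-up counter values (the numbers are purely illustrative): the rate normalizes by retired instructions, the ratio by cache accesses.

package main

import "fmt"

func main() {
    // Hypothetical raw counter values for one measurement interval.
    retiredInstr := 1000.0 // PMC0 RETIRED_INSTRUCTIONS
    cacheAccesses := 400.0 // PMC2 DATA_CACHE_ACCESSES
    cacheRefills := 40.0   // PMC3 DATA_CACHE_REFILLS_ALL

    requestRate := cacheAccesses / retiredInstr // 0.40 accesses per instruction
    missRate := cacheRefills / retiredInstr     // 0.04 misses per instruction
    missRatio := cacheRefills / cacheAccesses   // 0.10 -> 10% of accesses miss L1

    fmt.Printf("request rate %.2f, miss rate %.2f, miss ratio %.2f\n",
        requestRate, missRate, missRatio)
}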


@@ -0,0 +1,30 @@
SHORT Cycles per instruction
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 RETIRED_UOPS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
CPI (based on uops) PMC1/PMC2
IPC PMC0/PMC1
LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
CPI (based on uops) = CPU_CLOCKS_UNHALTED/RETIRED_UOPS
IPC = RETIRED_INSTRUCTIONS/CPU_CLOCKS_UNHALTED
-
This group measures how efficiently the processor works with
regard to instruction throughput. Also important as a standalone
metric is RETIRED_INSTRUCTIONS, as it tells you how many instructions
you need to execute for a task. An optimization might show a very
low CPI value but execute many more instructions for it.
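
As an illustration of the three metrics (values are made up): because an instruction can decode into one or more micro-ops, RETIRED_UOPS generally differs from RETIRED_INSTRUCTIONS, so the uop-based CPI differs from the plain CPI.

package main

import "fmt"

func main() {
    // Hypothetical counter values.
    cycles := 2000.0 // PMC1 CPU_CLOCKS_UNHALTED
    instr := 1600.0  // PMC0 RETIRED_INSTRUCTIONS
    uops := 2400.0   // PMC2 RETIRED_UOPS

    fmt.Printf("CPI %.2f, CPI (uops) %.2f, IPC %.2f\n",
        cycles/instr, cycles/uops, instr/cycles) // 1.25, 0.83, 0.80
}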


@@ -0,0 +1,23 @@
SHORT Load to store ratio
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 LS_DISPATCH_LOADS
PMC3 LS_DISPATCH_STORES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Load to store ratio PMC2/PMC3
LONG
Formulas:
Load to store ratio = LS_DISPATCH_LOADS/LS_DISPATCH_STORES
-
This is a simple metric to determine your load to store ratio.


@@ -0,0 +1,25 @@
SHORT Divide unit information
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 DIV_OP_COUNT
PMC3 DIV_BUSY_CYCLES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Number of divide ops PMC2
Avg. divide unit usage duration PMC3/PMC2
LONG
Formulas:
Number of divide ops = DIV_OP_COUNT
Avg. divide unit usage duration = DIV_BUSY_CYCLES/DIV_OP_COUNT
--
This performance group counts the divide operations and measures the average
latency of a divide operation.


@@ -0,0 +1,32 @@
SHORT Power and Energy consumption
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PWR0 RAPL_CORE_ENERGY
PWR1 RAPL_PKG_ENERGY
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Energy Core [J] PWR0
Power Core [W] PWR0/time
Energy PKG [J] PWR1
Power PKG [W] PWR1/time
LONG
Formulas:
Power Core [W] = RAPL_CORE_ENERGY/time
Power PKG [W] = RAPL_PKG_ENERGY/time
-
Ryzen implements the RAPL interface previously introduced by Intel.
This interface allows monitoring of the energy consumed in the core and package
domains.
AMD does not document which parts of the CPU belong to which domain.
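
A minimal sketch (made-up numbers) of the power metrics above: the PWR counters report the energy in Joules accumulated over the measurement interval, so the average power is that energy divided by the interval length.

package main

import "fmt"

func main() {
    coreEnergy := 150.0 // PWR0 RAPL_CORE_ENERGY [J] over the interval
    pkgEnergy := 210.0  // PWR1 RAPL_PKG_ENERGY [J] over the interval
    interval := 10.0    // measurement time [s]

    fmt.Printf("Power Core %.1f W, Power PKG %.1f W\n",
        coreEnergy/interval, pkgEnergy/interval) // 15.0 W, 21.0 W
}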


@@ -0,0 +1,28 @@
SHORT Double Precision MFLOP/s
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 RETIRED_SSE_AVX_FLOPS_ALL
PMC3 MERGE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
DP [MFLOP/s] 1.0E-06*(PMC2)/time
LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
DP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_ALL)/time
-
Profiling group to measure the (double-precision) FLOP rate. The event might
have a higher per-cycle increment than 15, so the MERGE event is required. In
contrast to AMD Zen, the Zen2 microarchitecture does not provide events to
differentiate between single- and double-precision.
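
A short, illustrative Go calculation of the DP [MFLOP/s] formula (counter value and runtime are made up): RETIRED_SSE_AVX_FLOPS_ALL counts floating-point operations, so the rate is the count divided by the runtime, scaled to millions.

package main

import "fmt"

func main() {
    flops := 4.8e9 // PMC2 RETIRED_SSE_AVX_FLOPS_ALL over the interval
    runtime := 2.0 // seconds

    mflops := 1.0e-6 * flops / runtime
    fmt.Printf("DP rate: %.1f MFLOP/s\n", mflops) // 2400.0 MFLOP/s
}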


@@ -0,0 +1,28 @@
SHORT Single Precision MFLOP/s
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 RETIRED_SSE_AVX_FLOPS_ALL
PMC3 MERGE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
SP [MFLOP/s] 1.0E-06*(PMC2)/time
LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
SP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_ALL)/time
-
Profiling group to measure the (single-precision) FLOP rate. The event might
have a higher per-cycle increment than 15, so the MERGE event is required. In
contrast to AMD Zen, the Zen2 microarchitecture does not provide events to
differentiate between single- and double-precision.


@@ -0,0 +1,28 @@
SHORT Instruction cache miss rate/ratio
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 ICACHE_FETCHES
PMC2 ICACHE_L2_REFILLS
PMC3 ICACHE_SYSTEM_REFILLS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/PMC0
L1I request rate PMC1/PMC0
L1I miss rate (PMC2+PMC3)/PMC0
L1I miss ratio (PMC2+PMC3)/PMC1
LONG
Formulas:
L1I request rate = ICACHE_FETCHES / RETIRED_INSTRUCTIONS
L1I miss rate = (ICACHE_L2_REFILLS + ICACHE_SYSTEM_REFILLS)/RETIRED_INSTRUCTIONS
L1I miss ratio = (ICACHE_L2_REFILLS + ICACHE_SYSTEM_REFILLS)/ICACHE_FETCHES
-
This group measures the locality of your instruction code with regard to the
L1 I-Cache.


@@ -0,0 +1,28 @@
SHORT L2 cache bandwidth in MBytes/s (experimental)
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC3 REQUESTS_TO_L2_GRP1_ALL_NO_PF
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
L2D load bandwidth [MBytes/s] 1.0E-06*PMC3*64.0/time
L2D load data volume [GBytes] 1.0E-09*PMC3*64.0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC3)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC3)*64.0
LONG
Formulas:
L2D load bandwidth [MBytes/s] = 1.0E-06*REQUESTS_TO_L2_GRP1_ALL_NO_PF*64.0/time
L2D load data volume [GBytes] = 1.0E-09*REQUESTS_TO_L2_GRP1_ALL_NO_PF*64.0
L2 bandwidth [MBytes/s] = 1.0E-06*(REQUESTS_TO_L2_GRP1_ALL_NO_PF)*64.0/time
L2 data volume [GBytes] = 1.0E-09*(REQUESTS_TO_L2_GRP1_ALL_NO_PF)*64.0
-
Profiling group to measure L2 cache bandwidth. There is no way to measure
the store traffic between L1 and L2.
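
Illustrative Go arithmetic for the bandwidth formulas above (hypothetical request count): each counted L2 request corresponds to one 64-byte cache line, so bandwidth and data volume follow directly from the counter.

package main

import "fmt"

func main() {
    requests := 5.0e8 // PMC3 REQUESTS_TO_L2_GRP1_ALL_NO_PF
    runtime := 1.5    // seconds
    const lineSize = 64.0

    bandwidthMB := 1.0e-6 * requests * lineSize / runtime
    volumeGB := 1.0e-9 * requests * lineSize
    fmt.Printf("L2 bandwidth %.1f MBytes/s, data volume %.2f GBytes\n",
        bandwidthMB, volumeGB) // 21333.3 MBytes/s, 32.00 GBytes
}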


@@ -0,0 +1,32 @@
SHORT L3 cache bandwidth in MBytes/s
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
CPMC0 L3_ACCESS
CPMC1 L3_MISS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
L3 access bandwidth [MBytes/s] 1.0E-06*CPMC0*64.0/time
L3 access data volume [GBytes] 1.0E-09*CPMC0*64.0
L3 access rate [%] (CPMC0/PMC0)*100.0
L3 miss rate [%] (CPMC1/PMC0)*100.0
L3 miss ratio [%] (CPMC1/CPMC0)*100.0
LONG
Formulas:
L3 access bandwidth [MBytes/s] = 1.0E-06*L3_ACCESS*64.0/time
L3 access data volume [GBytes] = 1.0E-09*L3_ACCESS*64.0
L3 access rate [%] = (L3_ACCESS/RETIRED_INSTRUCTIONS)*100.0
L3 miss rate [%] = (L3_MISS/RETIRED_INSTRUCTIONS)*100.0
L3 miss ratio [%] = (L3_MISS/L3_ACCESS)*100.0
-
Profiling group to measure L3 cache bandwidth. There is no way to measure
the store traffic between L2 and L3. The only two published L3 events are
L3_ACCESS and L3_MISS.


@@ -0,0 +1,35 @@
SHORT Main memory bandwidth in MBytes/s (experimental)
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
DFC0 DATA_FROM_LOCAL_DRAM_CHANNEL
DFC1 DATA_TO_LOCAL_DRAM_CHANNEL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Memory bandwidth [MBytes/s] 1.0E-06*(DFC0+DFC1)*(4.0/num_numadomains)*64.0/time
Memory data volume [GBytes] 1.0E-09*(DFC0+DFC1)*(4.0/num_numadomains)*64.0
LONG
Formulas:
Memory bandwidth [MBytes/s] = 1.0E-06*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*(4.0/num_numadomains)*64.0/time
Memory data volume [GBytes] = 1.0E-09*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*(4.0/num_numadomains)*64.0
-
Profiling group to measure the memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events, it can only be measured on a
per-socket basis.
The group provides nearly accurate results for the total bandwidth
and data volume.
The metric formulas contain a correction factor of (4.0/num_numadomains) to
return the value for all 4 memory controllers in NPS1 mode. This is probably
a work-around; information was requested from AMD but no answer has been received.
Be aware that although the events imply a traffic direction (FROM and TO), they
cannot be used to differentiate between read and write traffic. The events will be
renamed in the future to avoid this confusion.
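
To make the correction factor explicit, here is an illustrative Go version of the bandwidth formula with made-up counter values; num_numadomains is the number of NUMA domains per socket (1 in NPS1, 2 in NPS2, 4 in NPS4), so the factor 4.0/num_numadomains scales the measured channels up to all four memory controllers as described above.

package main

import "fmt"

func main() {
    fromDRAM := 3.0e8     // DFC0 DATA_FROM_LOCAL_DRAM_CHANNEL
    toDRAM := 1.0e8       // DFC1 DATA_TO_LOCAL_DRAM_CHANNEL
    numNUMADomains := 1.0 // NPS1: one NUMA domain per socket
    runtime := 1.0        // seconds

    scale := 4.0 / numNUMADomains // correction factor from the formula above
    bandwidthMB := 1.0e-6 * (fromDRAM + toDRAM) * scale * 64.0 / runtime
    fmt.Printf("Memory bandwidth: %.1f MBytes/s\n", bandwidthMB) // 102400.0 MBytes/s
}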


@@ -0,0 +1,35 @@
SHORT Local and remote memory accesses (experimental)
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 DATA_CACHE_REFILLS_LOCAL_ALL
PMC1 DATA_CACHE_REFILLS_REMOTE_ALL
PMC2 HWPREF_DATA_CACHE_FILLS_LOCAL_ALL
PMC3 HWPREF_DATA_CACHE_FILLS_REMOTE_ALL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
Local bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC2)*64.0/time
Local data volume [GBytes] 1.0E-09*(PMC0+PMC2)*64.0
Remote bandwidth [MBytes/s] 1.0E-06*(PMC1+PMC3)*64.0/time
Remote data volume [GBytes] 1.0E-09*(PMC1+PMC3)*64.0
Total bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC2+PMC1+PMC3)*64.0/time
Total data volume [GBytes] 1.0E-09*(PMC0+PMC2+PMC1+PMC3)*64.0
LONG
Formulas:
Local bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL)*64.0/time
Local data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL)*64.0
Remote bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0/time
Remote data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0
Total bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL+DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0/time
Total data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL+DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0
-
Profiling group to measure NUMA traffic. The data sources for the local
metrics are the local L2, the local CCX and local memory; the remote metrics
cover the remote CCX and remote memory. There are also events that measure
software prefetches from the local and remote domains, but AMD Zen provides
only four counters.
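
An illustrative Go sketch (hypothetical counter values) of the local/remote split: demand refills and hardware-prefetch fills are summed per origin, and the final local fraction is an extra, purely illustrative number that is not part of the group above.

package main

import "fmt"

func main() {
    localDemand := 8.0e7  // PMC0 DATA_CACHE_REFILLS_LOCAL_ALL
    remoteDemand := 1.0e7 // PMC1 DATA_CACHE_REFILLS_REMOTE_ALL
    localPref := 2.0e7    // PMC2 HWPREF_DATA_CACHE_FILLS_LOCAL_ALL
    remotePref := 0.5e7   // PMC3 HWPREF_DATA_CACHE_FILLS_REMOTE_ALL
    runtime := 1.0        // seconds

    localBytes := (localDemand + localPref) * 64.0
    remoteBytes := (remoteDemand + remotePref) * 64.0
    fmt.Printf("local %.1f MBytes/s, remote %.1f MBytes/s, local fraction %.2f\n",
        1.0e-6*localBytes/runtime, 1.0e-6*remoteBytes/runtime,
        localBytes/(localBytes+remoteBytes))
}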


@@ -0,0 +1,39 @@
SHORT TLB miss rate/ratio
EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 DATA_CACHE_ACCESSES
PMC2 L1_DTLB_MISS_ANY_L2_HIT
PMC3 L1_DTLB_MISS_ANY_L2_MISS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/PMC0
L1 DTLB request rate PMC1/PMC0
L1 DTLB miss rate (PMC2+PMC3)/PMC0
L1 DTLB miss ratio (PMC2+PMC3)/PMC1
L2 DTLB request rate (PMC2+PMC3)/PMC0
L2 DTLB miss rate PMC3/PMC0
L2 DTLB miss ratio PMC3/(PMC2+PMC3)
LONG
Formulas:
L1 DTLB request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS
L1 DTLB miss rate = (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)/RETIRED_INSTRUCTIONS
L1 DTLB miss ratio = (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)/DATA_CACHE_ACCESSES
L2 DTLB request rate = (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)/RETIRED_INSTRUCTIONS
L2 DTLB miss rate = L1_DTLB_MISS_ANY_L2_MISS / RETIRED_INSTRUCTIONS
L2 DTLB miss ratio = L1_DTLB_MISS_ANY_L2_MISS / (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)
-
The L1 DTLB request rate tells you how data intensive your code is
or how many data accesses you have on average per instruction.
The DTLB miss rate gives a measure of how often a TLB miss occurred
per instruction. And finally, the L1 DTLB miss ratio tells you how many
of your memory references caused a TLB miss on average.
NOTE: The L2 metrics are only relevant if the L2 DTLB request rate is
equal to the L1 DTLB miss rate!