mirror of
https://github.com/ClusterCockpit/cc-metric-collector.git
synced 2025-08-01 00:56:26 +02:00
Add likwid collector
collectors/likwid/groups/zen/BRANCH.txt (new file)
@@ -0,0 +1,32 @@
SHORT Branch prediction miss rate/ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_BRANCH_INSTR
PMC3  RETIRED_MISP_BRANCH_INSTR

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Branch rate PMC2/PMC0
Branch misprediction rate PMC3/PMC0
Branch misprediction ratio PMC3/PMC2
Instructions per branch PMC0/PMC2

LONG
Formulas:
Branch rate = RETIRED_BRANCH_INSTR/RETIRED_INSTRUCTIONS
Branch misprediction rate = RETIRED_MISP_BRANCH_INSTR/RETIRED_INSTRUCTIONS
Branch misprediction ratio = RETIRED_MISP_BRANCH_INSTR/RETIRED_BRANCH_INSTR
Instructions per branch = RETIRED_INSTRUCTIONS/RETIRED_BRANCH_INSTR
-
The rates state how often, on average, a branch or a mispredicted branch
occurred per retired instruction. The branch misprediction ratio states
what fraction of all branch instructions were mispredicted.
Instructions per branch is 1/branch rate.
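The metric formulas above are plain ratios of raw event counts; a minimal Python sketch (with invented counter values) shows the arithmetic:

```python
# Derive the BRANCH group's metrics from raw event counts.
# The counter values used below are invented for illustration.
def branch_metrics(retired_instr, retired_branch, retired_misp_branch):
    return {
        "Branch rate": retired_branch / retired_instr,                        # PMC2/PMC0
        "Branch misprediction rate": retired_misp_branch / retired_instr,     # PMC3/PMC0
        "Branch misprediction ratio": retired_misp_branch / retired_branch,   # PMC3/PMC2
        "Instructions per branch": retired_instr / retired_branch,            # PMC0/PMC2
    }

m = branch_metrics(1_000_000, 200_000, 10_000)
```

Note that "Instructions per branch" is simply the reciprocal of "Branch rate".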
collectors/likwid/groups/zen/CACHE.txt (new file)
@@ -0,0 +1,39 @@
SHORT Data cache miss rate/ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  DATA_CACHE_ACCESSES
PMC3  DATA_CACHE_REFILLS_ALL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
data cache requests PMC2
data cache request rate PMC2/PMC0
data cache misses PMC3
data cache miss rate PMC3/PMC0
data cache miss ratio PMC3/PMC2

LONG
Formulas:
data cache requests = DATA_CACHE_ACCESSES
data cache request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS
data cache misses = DATA_CACHE_REFILLS_ALL
data cache miss rate = DATA_CACHE_REFILLS_ALL / RETIRED_INSTRUCTIONS
data cache miss ratio = DATA_CACHE_REFILLS_ALL / DATA_CACHE_ACCESSES
-
This group measures the locality of your data accesses with regard to the
L1 cache. The data cache request rate tells you how data intensive your
code is, i.e. how many data accesses you have on average per instruction.
The data cache miss rate gives a measure of how often it was necessary to
fetch cache lines from higher levels of the memory hierarchy, and the data
cache miss ratio tells you how many of your memory references required a
cache line to be loaded from a higher level. While the data cache miss rate
might be dictated by your algorithm, you should try to get the data cache
miss ratio as low as possible by increasing your cache reuse.
collectors/likwid/groups/zen/CPI.txt (new file)
@@ -0,0 +1,30 @@
SHORT Cycles per instruction

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_UOPS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
CPI (based on uops) PMC1/PMC2
IPC PMC0/PMC1

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
CPI (based on uops) = CPU_CLOCKS_UNHALTED/RETIRED_UOPS
IPC = RETIRED_INSTRUCTIONS/CPU_CLOCKS_UNHALTED
-
This group measures how efficiently the processor works with regard to
instruction throughput. RETIRED_INSTRUCTIONS is also important as a
standalone metric, as it tells you how many instructions you need to
execute for a task. An optimization might show very low CPI values but
execute many more instructions for it.
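The derived CPI/IPC metrics of this group can be sketched in a few lines of Python; the counter values below are hypothetical:

```python
# Sketch of the CPI group's derived metrics from raw counts.
def cpi_metrics(retired_instr, clocks_unhalted, retired_uops):
    cpi = clocks_unhalted / retired_instr       # CPI = PMC1/PMC0
    cpi_uops = clocks_unhalted / retired_uops   # CPI (based on uops) = PMC1/PMC2
    ipc = retired_instr / clocks_unhalted       # IPC = PMC0/PMC1
    return cpi, cpi_uops, ipc

# Example: 2M retired instructions and 2.5M retired uops in 1M unhalted cycles.
cpi, cpi_uops, ipc = cpi_metrics(2_000_000, 1_000_000, 2_500_000)
```

Since an instruction can decode into several uops, the uop-based CPI is usually lower than the instruction-based one, as in this example.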
collectors/likwid/groups/zen/DATA.txt (new file)
@@ -0,0 +1,23 @@
SHORT Load to store ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  LS_DISPATCH_LOADS
PMC3  LS_DISPATCH_STORES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Load to store ratio PMC2/PMC3

LONG
Formulas:
Load to store ratio = LS_DISPATCH_LOADS/LS_DISPATCH_STORES
-
This is a simple metric to determine your load to store ratio.
collectors/likwid/groups/zen/DIVIDE.txt (new file)
@@ -0,0 +1,26 @@
SHORT Divide unit information

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  DIV_OP_COUNT
PMC3  DIV_BUSY_CYCLES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Number of divide ops PMC2
Avg. divide unit usage duration PMC3/PMC2

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
Number of divide ops = DIV_OP_COUNT
Avg. divide unit usage duration = DIV_BUSY_CYCLES/DIV_OP_COUNT
-
This performance group measures the average latency of divide operations.
collectors/likwid/groups/zen/ENERGY.txt (new file)
@@ -0,0 +1,32 @@
SHORT Power and Energy consumption

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PWR0  RAPL_CORE_ENERGY
PWR1  RAPL_PKG_ENERGY

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Energy Core [J] PWR0
Power Core [W] PWR0/time
Energy PKG [J] PWR1
Power PKG [W] PWR1/time

LONG
Formulas:
Power Core [W] = RAPL_CORE_ENERGY/time
Power PKG [W] = RAPL_PKG_ENERGY/time
-
Ryzen implements the RAPL interface previously introduced by Intel.
This interface makes it possible to monitor the energy consumed in the
core and package domains.
AMD does not document which parts of the CPU belong to which domain.
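The power metrics are simply the RAPL energy counts in Joules divided by the measurement time; a short sketch with invented readings:

```python
# Power [W] = energy [J] / runtime [s], as in the Power Core/PKG formulas above.
# The energy and runtime values are invented for illustration.
def power_watts(energy_joules, runtime_seconds):
    return energy_joules / runtime_seconds

core_power = power_watts(45.0, 10.0)  # e.g. Energy Core [J] over a 10 s run
pkg_power = power_watts(60.0, 10.0)   # e.g. Energy PKG [J] over the same run
```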
collectors/likwid/groups/zen/FLOPS_DP.txt (new file)
@@ -0,0 +1,26 @@
SHORT Double Precision MFLOP/s

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_SSE_AVX_FLOPS_DOUBLE_ALL
PMC3  MERGE

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
DP [MFLOP/s] 1.0E-06*(PMC2)/time

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
DP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_DOUBLE_ALL)/time
-
Profiling group to measure the double precision FLOP rate. The event can
have a per-cycle increment higher than 15, so the MERGE event is required.
collectors/likwid/groups/zen/FLOPS_SP.txt (new file)
@@ -0,0 +1,26 @@
SHORT Single Precision MFLOP/s

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_SSE_AVX_FLOPS_SINGLE_ALL
PMC3  MERGE

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
SP [MFLOP/s] 1.0E-06*(PMC2)/time

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
SP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_SINGLE_ALL)/time
-
Profiling group to measure the single precision FLOP rate. The event can
have a per-cycle increment higher than 15, so the MERGE event is required.
collectors/likwid/groups/zen/ICACHE.txt (new file)
@@ -0,0 +1,28 @@
SHORT Instruction cache miss rate/ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  ICACHE_FETCHES
PMC2  ICACHE_L2_REFILLS
PMC3  ICACHE_SYSTEM_REFILLS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/PMC0
L1I request rate PMC1/PMC0
L1I miss rate (PMC2+PMC3)/PMC0
L1I miss ratio (PMC2+PMC3)/PMC1

LONG
Formulas:
L1I request rate = ICACHE_FETCHES / RETIRED_INSTRUCTIONS
L1I miss rate = (ICACHE_L2_REFILLS + ICACHE_SYSTEM_REFILLS)/RETIRED_INSTRUCTIONS
L1I miss ratio = (ICACHE_L2_REFILLS + ICACHE_SYSTEM_REFILLS)/ICACHE_FETCHES
-
This group measures the locality of your instruction code with regard to the
L1 I-Cache.
collectors/likwid/groups/zen/L2.txt (new file)
@@ -0,0 +1,28 @@
SHORT L2 cache bandwidth in MBytes/s (experimental)

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC3  REQUESTS_TO_L2_GRP1_ALL_NO_PF

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
L2D load bandwidth [MBytes/s] 1.0E-06*PMC3*64.0/time
L2D load data volume [GBytes] 1.0E-09*PMC3*64.0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC3)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC3)*64.0

LONG
Formulas:
L2D load bandwidth [MBytes/s] = 1.0E-06*REQUESTS_TO_L2_GRP1_ALL_NO_PF*64.0/time
L2D load data volume [GBytes] = 1.0E-09*REQUESTS_TO_L2_GRP1_ALL_NO_PF*64.0
L2 bandwidth [MBytes/s] = 1.0E-06*(REQUESTS_TO_L2_GRP1_ALL_NO_PF)*64.0/time
L2 data volume [GBytes] = 1.0E-09*(REQUESTS_TO_L2_GRP1_ALL_NO_PF)*64.0
-
Profiling group to measure L2 cache bandwidth. There is no way to measure
the store traffic between L1 and L2.
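All the bandwidth groups in this set use the same pattern: each counted request moves one 64-byte cache line, so bandwidth and data volume follow directly from the event count. A small sketch with made-up numbers:

```python
# Bandwidth/volume derivation as used in the L2 (and other) groups.
# The request count and runtime below are invented for illustration.
CACHE_LINE_BYTES = 64.0

def bandwidth_mbytes_s(requests, runtime_s):
    # e.g. L2 bandwidth [MBytes/s] = 1.0E-06 * PMC3 * 64.0 / time
    return 1.0e-06 * requests * CACHE_LINE_BYTES / runtime_s

def data_volume_gbytes(requests):
    # e.g. L2 data volume [GBytes] = 1.0E-09 * PMC3 * 64.0
    return 1.0e-09 * requests * CACHE_LINE_BYTES

bw = bandwidth_mbytes_s(50_000_000, 2.0)  # 50M cache lines in 2 s
vol = data_volume_gbytes(50_000_000)
```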
collectors/likwid/groups/zen/L3.txt (new file)
@@ -0,0 +1,32 @@
SHORT L3 cache bandwidth in MBytes/s

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
CPMC0 L3_ACCESS
CPMC1 L3_MISS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
L3 access bandwidth [MBytes/s] 1.0E-06*CPMC0*64.0/time
L3 access data volume [GBytes] 1.0E-09*CPMC0*64.0
L3 access rate [%] (CPMC0/PMC0)*100.0
L3 miss rate [%] (CPMC1/PMC0)*100.0
L3 miss ratio [%] (CPMC1/CPMC0)*100.0

LONG
Formulas:
L3 access bandwidth [MBytes/s] = 1.0E-06*L3_ACCESS*64.0/time
L3 access data volume [GBytes] = 1.0E-09*L3_ACCESS*64.0
L3 access rate [%] = (L3_ACCESS/RETIRED_INSTRUCTIONS)*100
L3 miss rate [%] = (L3_MISS/RETIRED_INSTRUCTIONS)*100
L3 miss ratio [%] = (L3_MISS/L3_ACCESS)*100
-
Profiling group to measure L3 cache bandwidth. There is no way to measure
the store traffic between L2 and L3. The only two published L3 events are
L3_ACCESS and L3_MISS.
collectors/likwid/groups/zen/MEM.txt (new file)
@@ -0,0 +1,32 @@
SHORT Main memory bandwidth in MBytes/s (experimental)

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
DFC0  DATA_FROM_LOCAL_DRAM_CHANNEL
DFC1  DATA_TO_LOCAL_DRAM_CHANNEL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Memory bandwidth [MBytes/s] 1.0E-06*(DFC0+DFC1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(DFC0+DFC1)*64.0

LONG
Formulas:
Memory bandwidth [MBytes/s] = 1.0E-06*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0
-
Profiling group to measure the memory bandwidth drawn by all cores of a
socket. Since this group is based on Uncore events, it can only be measured
on a per-socket basis.
The group provides fairly accurate results for the total bandwidth and data
volume. AMD describes this metric as "approximate" in the documentation for
AMD Rome.

Be aware that although the events imply a traffic direction (FROM and TO),
they cannot be used to differentiate between read and write traffic. The
events will be renamed in the future to avoid this confusion.
collectors/likwid/groups/zen/MEM_DP.txt (new file)
@@ -0,0 +1,39 @@
SHORT Overview of arithmetic and main memory performance

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_SSE_AVX_FLOPS_DOUBLE_ALL
PMC3  MERGE
DFC0  DATA_FROM_LOCAL_DRAM_CHANNEL
DFC1  DATA_TO_LOCAL_DRAM_CHANNEL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
DP [MFLOP/s] 1.0E-06*(PMC2)/time
Memory bandwidth [MBytes/s] 1.0E-06*(DFC0+DFC1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(DFC0+DFC1)*64.0
Operational intensity PMC2/((DFC0+DFC1)*64.0)

LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_DOUBLE_ALL)/time
Memory bandwidth [MBytes/s] = 1.0E-06*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0
Operational intensity = RETIRED_SSE_AVX_FLOPS_DOUBLE_ALL/((DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0)
-
Profiling group to measure the memory bandwidth drawn by all cores of a
socket. Since this group is based on Uncore events, it can only be measured
on a per-socket basis.
The group provides fairly accurate results for the total bandwidth and data
volume. AMD describes this metric as "approximate" in the documentation for
AMD Rome.

Be aware that although the events imply a traffic direction (FROM and TO),
they cannot be used to differentiate between read and write traffic. The
events will be renamed in the future to avoid this confusion.
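The operational intensity metric relates FLOPs to bytes moved over the DRAM channels, where each channel event counts one 64-byte transfer. A minimal sketch with hypothetical counter values:

```python
# Operational intensity = FLOPs / DRAM bytes, as in the formula above.
# Each DRAM-channel event corresponds to one 64-byte transfer; the counter
# values used here are invented for illustration.
def operational_intensity(flops, dram_from, dram_to):
    return flops / ((dram_from + dram_to) * 64.0)

# 3.2 GFLOP against 25M cache-line transfers (1.6 GB of DRAM traffic).
oi = operational_intensity(3_200_000_000, 20_000_000, 5_000_000)
```

This is the x-axis quantity of a roofline plot: low values indicate memory-bound code, high values compute-bound code.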
collectors/likwid/groups/zen/MEM_SP.txt (new file)
@@ -0,0 +1,39 @@
SHORT Overview of arithmetic and main memory performance

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  CPU_CLOCKS_UNHALTED
PMC2  RETIRED_SSE_AVX_FLOPS_SINGLE_ALL
PMC3  MERGE
DFC0  DATA_FROM_LOCAL_DRAM_CHANNEL
DFC1  DATA_TO_LOCAL_DRAM_CHANNEL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
SP [MFLOP/s] 1.0E-06*(PMC2)/time
Memory bandwidth [MBytes/s] 1.0E-06*(DFC0+DFC1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(DFC0+DFC1)*64.0
Operational intensity PMC2/((DFC0+DFC1)*64.0)

LONG
Formulas:
SP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_SINGLE_ALL)/time
Memory bandwidth [MBytes/s] = 1.0E-06*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0
Operational intensity = RETIRED_SSE_AVX_FLOPS_SINGLE_ALL/((DATA_FROM_LOCAL_DRAM_CHANNEL+DATA_TO_LOCAL_DRAM_CHANNEL)*64.0)
-
Profiling group to measure the memory bandwidth drawn by all cores of a
socket. Since this group is based on Uncore events, it can only be measured
on a per-socket basis.
The group provides fairly accurate results for the total bandwidth and data
volume. AMD describes this metric as "approximate" in the documentation for
AMD Rome.

Be aware that although the events imply a traffic direction (FROM and TO),
they cannot be used to differentiate between read and write traffic. The
events will be renamed in the future to avoid this confusion.
collectors/likwid/groups/zen/NUMA.txt (new file)
@@ -0,0 +1,35 @@
SHORT Local and remote memory traffic in MBytes/s (experimental)

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  DATA_CACHE_REFILLS_LOCAL_ALL
PMC1  DATA_CACHE_REFILLS_REMOTE_ALL
PMC2  HWPREF_DATA_CACHE_FILLS_LOCAL_ALL
PMC3  HWPREF_DATA_CACHE_FILLS_REMOTE_ALL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
Local bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC2)*64.0/time
Local data volume [GBytes] 1.0E-09*(PMC0+PMC2)*64.0
Remote bandwidth [MBytes/s] 1.0E-06*(PMC1+PMC3)*64.0/time
Remote data volume [GBytes] 1.0E-09*(PMC1+PMC3)*64.0
Total bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC2+PMC1+PMC3)*64.0/time
Total data volume [GBytes] 1.0E-09*(PMC0+PMC2+PMC1+PMC3)*64.0

LONG
Formulas:
Local bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL)*64.0/time
Local data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL)*64.0
Remote bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0/time
Remote data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0
Total bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL+DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0/time
Total data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_LOCAL_ALL+HWPREF_DATA_CACHE_FILLS_LOCAL_ALL+DATA_CACHE_REFILLS_REMOTE_ALL+HWPREF_DATA_CACHE_FILLS_REMOTE_ALL)*64.0
-
Profiling group to measure NUMA traffic. The data sources for the local
metrics are the local L2, CCX, and memory; for the remote metrics they are
the remote CCX and memory. There are also events for software prefetches
from the local and remote domains, but AMD Zen provides only four counters.
collectors/likwid/groups/zen/TLB.txt (new file)
@@ -0,0 +1,39 @@
SHORT TLB miss rate/ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0  RETIRED_INSTRUCTIONS
PMC1  DATA_CACHE_ACCESSES
PMC2  L1_DTLB_MISS_ANY_L2_HIT
PMC3  L1_DTLB_MISS_ANY_L2_MISS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/PMC0
L1 DTLB request rate PMC1/PMC0
L1 DTLB miss rate (PMC2+PMC3)/PMC0
L1 DTLB miss ratio (PMC2+PMC3)/PMC1
L2 DTLB request rate (PMC2+PMC3)/PMC0
L2 DTLB miss rate PMC3/PMC0
L2 DTLB miss ratio PMC3/(PMC2+PMC3)

LONG
Formulas:
L1 DTLB request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS
L1 DTLB miss rate = (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)/RETIRED_INSTRUCTIONS
L1 DTLB miss ratio = (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)/DATA_CACHE_ACCESSES
L2 DTLB request rate = (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)/RETIRED_INSTRUCTIONS
L2 DTLB miss rate = L1_DTLB_MISS_ANY_L2_MISS / RETIRED_INSTRUCTIONS
L2 DTLB miss ratio = L1_DTLB_MISS_ANY_L2_MISS / (L1_DTLB_MISS_ANY_L2_HIT+L1_DTLB_MISS_ANY_L2_MISS)
-
The L1 DTLB request rate tells you how data intensive your code is, i.e.
how many data accesses you have on average per instruction. The DTLB miss
rate gives a measure of how often a TLB miss occurred per instruction, and
the L1 DTLB miss ratio tells you how many of your memory references caused
a TLB miss on average.
NOTE: The L2 metrics are only relevant if the L2 DTLB request rate is
equal to the L1 DTLB miss rate!