Mirror of https://github.com/ClusterCockpit/cc-metric-collector.git (synced 2025-07-31 08:56:06 +02:00)
Add likwid collector
26
collectors/likwid/groups/k10/BRANCH.txt
Normal file
@@ -0,0 +1,26 @@
SHORT Branch prediction miss rate/ratio

EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 BRANCH_RETIRED
PMC2 BRANCH_MISPREDICT_RETIRED

METRICS
Runtime (RDTSC) [s] time
Branch rate PMC1/PMC0
Branch misprediction rate PMC2/PMC0
Branch misprediction ratio PMC2/PMC1
Instructions per branch PMC0/PMC1

LONG
Formulas:
Branch rate = BRANCH_RETIRED/INSTRUCTIONS_RETIRED
Branch misprediction rate = BRANCH_MISPREDICT_RETIRED/INSTRUCTIONS_RETIRED
Branch misprediction ratio = BRANCH_MISPREDICT_RETIRED/BRANCH_RETIRED
Instructions per branch = INSTRUCTIONS_RETIRED/BRANCH_RETIRED
-
The rates state how often, on average, a branch or a mispredicted branch occurred
per retired instruction. The branch misprediction ratio directly states what
fraction of all branch instructions was mispredicted.
Instructions per branch is 1/branch rate.
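The metrics above are plain ratios of the raw counter values. A minimal sketch of how such a group could be evaluated in Python (the counter readings are made up for illustration):

```python
# Hypothetical counter readings for PMC0..PMC2 (illustrative values only).
INSTRUCTIONS_RETIRED = 1_000_000       # PMC0
BRANCH_RETIRED = 200_000               # PMC1
BRANCH_MISPREDICT_RETIRED = 10_000     # PMC2

branch_rate = BRANCH_RETIRED / INSTRUCTIONS_RETIRED
mispredict_rate = BRANCH_MISPREDICT_RETIRED / INSTRUCTIONS_RETIRED
mispredict_ratio = BRANCH_MISPREDICT_RETIRED / BRANCH_RETIRED
instr_per_branch = INSTRUCTIONS_RETIRED / BRANCH_RETIRED

# Instructions per branch is the reciprocal of the branch rate.
assert instr_per_branch == 1.0 / branch_rate
```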
34
collectors/likwid/groups/k10/CACHE.txt
Normal file
@@ -0,0 +1,34 @@
SHORT Data cache miss rate/ratio

EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 DATA_CACHE_ACCESSES
PMC2 DATA_CACHE_REFILLS_L2_ALL
PMC3 DATA_CACHE_REFILLS_NORTHBRIDGE_ALL

METRICS
Runtime (RDTSC) [s] time
data cache misses PMC2+PMC3
data cache request rate PMC1/PMC0
data cache miss rate (PMC2+PMC3)/PMC0
data cache miss ratio (PMC2+PMC3)/PMC1

LONG
Formulas:
data cache misses = DATA_CACHE_REFILLS_L2_ALL + DATA_CACHE_REFILLS_NORTHBRIDGE_ALL
data cache request rate = DATA_CACHE_ACCESSES / INSTRUCTIONS_RETIRED
data cache miss rate = (DATA_CACHE_REFILLS_L2_ALL + DATA_CACHE_REFILLS_NORTHBRIDGE_ALL)/INSTRUCTIONS_RETIRED
data cache miss ratio = (DATA_CACHE_REFILLS_L2_ALL + DATA_CACHE_REFILLS_NORTHBRIDGE_ALL)/DATA_CACHE_ACCESSES
-
This group measures the locality of your data accesses with regard to the
L1 cache. The data cache request rate tells you how data intensive your code is,
i.e. how many data accesses you have on average per instruction.
The data cache miss rate gives a measure of how often it was necessary to get
cache lines from higher levels of the memory hierarchy, and the
data cache miss ratio tells you how many of your memory references required
a cache line to be loaded from a higher level. While the data cache miss rate
might be dictated by your algorithm, you should try to get the data cache miss
ratio as low as possible by increasing your cache reuse.
This group was taken from the whitepaper "Basic Performance Measurements for AMD Athlon 64,
AMD Opteron and AMD Phenom Processors" by Paul J. Drongowski.
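The rate/ratio distinction above can be sketched by evaluating the group's formulas from hypothetical counter readings (values are illustrative only):

```python
def cache_metrics(instr, accesses, refills_l2, refills_nb):
    """Derived K10 data-cache metrics, following the group's formulas."""
    misses = refills_l2 + refills_nb
    return {
        "request_rate": accesses / instr,  # accesses per retired instruction
        "miss_rate": misses / instr,       # misses per retired instruction
        "miss_ratio": misses / accesses,   # fraction of accesses that missed L1
    }

m = cache_metrics(instr=1_000_000, accesses=400_000,
                  refills_l2=20_000, refills_nb=5_000)
# The miss rate is the request rate times the miss ratio by construction.
assert abs(m["miss_rate"] - m["request_rate"] * m["miss_ratio"]) < 1e-12
```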
26
collectors/likwid/groups/k10/CPI.txt
Normal file
@@ -0,0 +1,26 @@
SHORT Cycles per instruction

EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 CPU_CLOCKS_UNHALTED
PMC2 UOPS_RETIRED

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
CPI PMC1/PMC0
CPI (based on uops) PMC1/PMC2
IPC PMC0/PMC1

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/INSTRUCTIONS_RETIRED
CPI (based on uops) = CPU_CLOCKS_UNHALTED/UOPS_RETIRED
IPC = INSTRUCTIONS_RETIRED/CPU_CLOCKS_UNHALTED
-
This group measures how efficiently the processor works with
regard to instruction throughput. Also important as a standalone
metric is INSTRUCTIONS_RETIRED, as it tells you how many instructions
you need to execute for a task. An optimization might show very
low CPI values but execute many more instructions overall.
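A small illustration of the CPI/IPC formulas, with made-up counter values and an assumed 2 GHz nominal clock (inverseClock = 1/frequency):

```python
def cpi_metrics(clocks_unhalted, instr_retired, uops_retired, inverse_clock):
    """CPI/IPC as defined by the group; inverse_clock = 1/nominal frequency [s]."""
    return {
        "runtime_unhalted_s": clocks_unhalted * inverse_clock,
        "cpi": clocks_unhalted / instr_retired,
        "cpi_uops": clocks_unhalted / uops_retired,
        "ipc": instr_retired / clocks_unhalted,
    }

m = cpi_metrics(clocks_unhalted=2_000_000, instr_retired=1_000_000,
                uops_retired=1_250_000, inverse_clock=1.0 / 2.0e9)  # 2 GHz assumed
assert m["cpi"] == 2.0 and m["ipc"] == 0.5  # IPC is 1/CPI
```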
24
collectors/likwid/groups/k10/FLOPS_DP.txt
Normal file
@@ -0,0 +1,24 @@
SHORT Double Precision MFLOP/s

EVENTSET
PMC0 SSE_RETIRED_ADD_DOUBLE_FLOPS
PMC1 SSE_RETIRED_MULT_DOUBLE_FLOPS
PMC2 CPU_CLOCKS_UNHALTED

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC2*inverseClock
DP [MFLOP/s] 1.0E-06*(PMC0+PMC1)/time
DP Add [MFLOP/s] 1.0E-06*PMC0/time
DP Mult [MFLOP/s] 1.0E-06*PMC1/time

LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_DOUBLE_FLOPS+SSE_RETIRED_MULT_DOUBLE_FLOPS)/time
DP Add [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_DOUBLE_FLOPS)/time
DP Mult [MFLOP/s] = 1.0E-06*(SSE_RETIRED_MULT_DOUBLE_FLOPS)/time
-
Profiling group to measure double precision SSE FLOPs.
Don't forget that your code might also execute X87 FLOPs.
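The MFLOP/s metrics just scale the retired-FLOP counts by 1.0E-06 and divide by the runtime; a minimal sketch with illustrative numbers:

```python
def mflops(add_flops, mult_flops, runtime_s):
    """MFLOP/s from retired SSE FLOP counts, using the group's 1.0E-06 scaling."""
    return 1.0e-06 * (add_flops + mult_flops) / runtime_s

# e.g. 3e9 adds + 1e9 mults over 2 s -> about 2000 MFLOP/s
rate = mflops(3_000_000_000, 1_000_000_000, 2.0)
```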
24
collectors/likwid/groups/k10/FLOPS_SP.txt
Normal file
@@ -0,0 +1,24 @@
SHORT Single Precision MFLOP/s

EVENTSET
PMC0 SSE_RETIRED_ADD_SINGLE_FLOPS
PMC1 SSE_RETIRED_MULT_SINGLE_FLOPS
PMC2 CPU_CLOCKS_UNHALTED

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC2*inverseClock
SP [MFLOP/s] 1.0E-06*(PMC0+PMC1)/time
SP Add [MFLOP/s] 1.0E-06*PMC0/time
SP Mult [MFLOP/s] 1.0E-06*PMC1/time

LONG
Formulas:
SP [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_SINGLE_FLOPS+SSE_RETIRED_MULT_SINGLE_FLOPS)/time
SP Add [MFLOP/s] = 1.0E-06*(SSE_RETIRED_ADD_SINGLE_FLOPS)/time
SP Mult [MFLOP/s] = 1.0E-06*(SSE_RETIRED_MULT_SINGLE_FLOPS)/time
-
Profiling group to measure single precision SSE FLOPs.
Don't forget that your code might also execute X87 FLOPs.
25
collectors/likwid/groups/k10/FLOPS_X87.txt
Normal file
@@ -0,0 +1,25 @@
SHORT X87 MFLOP/s

EVENTSET
PMC0 X87_FLOPS_RETIRED_ADD
PMC1 X87_FLOPS_RETIRED_MULT
PMC2 X87_FLOPS_RETIRED_DIV
PMC3 CPU_CLOCKS_UNHALTED

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC3*inverseClock
X87 [MFLOP/s] 1.0E-06*(PMC0+PMC1+PMC2)/time
X87 Add [MFLOP/s] 1.0E-06*PMC0/time
X87 Mult [MFLOP/s] 1.0E-06*PMC1/time
X87 Div [MFLOP/s] 1.0E-06*PMC2/time

LONG
Formulas:
X87 [MFLOP/s] = 1.0E-06*(X87_FLOPS_RETIRED_ADD+X87_FLOPS_RETIRED_MULT+X87_FLOPS_RETIRED_DIV)/time
X87 Add [MFLOP/s] = 1.0E-06*X87_FLOPS_RETIRED_ADD/time
X87 Mult [MFLOP/s] = 1.0E-06*X87_FLOPS_RETIRED_MULT/time
X87 Div [MFLOP/s] = 1.0E-06*X87_FLOPS_RETIRED_DIV/time
-
Profiling group to measure X87 FLOP rates.
21
collectors/likwid/groups/k10/FPU_EXCEPTION.txt
Normal file
@@ -0,0 +1,21 @@
SHORT Floating point exceptions

EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 FP_INSTRUCTIONS_RETIRED_ALL
PMC2 FPU_EXCEPTIONS_ALL

METRICS
Runtime (RDTSC) [s] time
Overall FP exception rate PMC2/PMC0
FP exception rate PMC2/PMC1

LONG
Formulas:
Overall FP exception rate = FPU_EXCEPTIONS_ALL / INSTRUCTIONS_RETIRED
FP exception rate = FPU_EXCEPTIONS_ALL / FP_INSTRUCTIONS_RETIRED_ALL
-
Floating point exceptions occur, for example, when handling denormal numbers.
There can be a large performance penalty if too many floating point
exceptions occur.
23
collectors/likwid/groups/k10/ICACHE.txt
Normal file
@@ -0,0 +1,23 @@
SHORT Instruction cache miss rate/ratio

EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 ICACHE_FETCHES
PMC2 ICACHE_REFILLS_L2
PMC3 ICACHE_REFILLS_MEM

METRICS
Runtime (RDTSC) [s] time
L1I request rate PMC1/PMC0
L1I miss rate (PMC2+PMC3)/PMC0
L1I miss ratio (PMC2+PMC3)/PMC1

LONG
Formulas:
L1I request rate = ICACHE_FETCHES / INSTRUCTIONS_RETIRED
L1I miss rate = (ICACHE_REFILLS_L2+ICACHE_REFILLS_MEM)/INSTRUCTIONS_RETIRED
L1I miss ratio = (ICACHE_REFILLS_L2+ICACHE_REFILLS_MEM)/ICACHE_FETCHES
-
This group measures the locality of your instruction code with regard to the
L1 I-Cache.
33
collectors/likwid/groups/k10/L2.txt
Normal file
@@ -0,0 +1,33 @@
SHORT L2 cache bandwidth in MBytes/s

EVENTSET
PMC0 DATA_CACHE_REFILLS_L2_ALL
PMC1 DATA_CACHE_EVICTED_ALL
PMC2 CPU_CLOCKS_UNHALTED

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC2*inverseClock
L2D load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2D load data volume [GBytes] 1.0E-09*PMC0*64.0
L2D evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L2D evict data volume [GBytes] 1.0E-09*PMC1*64.0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0

LONG
Formulas:
L2D load bandwidth [MBytes/s] = 1.0E-06*DATA_CACHE_REFILLS_L2_ALL*64.0/time
L2D load data volume [GBytes] = 1.0E-09*DATA_CACHE_REFILLS_L2_ALL*64.0
L2D evict bandwidth [MBytes/s] = 1.0E-06*DATA_CACHE_EVICTED_ALL*64.0/time
L2D evict data volume [GBytes] = 1.0E-09*DATA_CACHE_EVICTED_ALL*64.0
L2 bandwidth [MBytes/s] = 1.0E-06*(DATA_CACHE_REFILLS_L2_ALL+DATA_CACHE_EVICTED_ALL)*64/time
L2 data volume [GBytes] = 1.0E-09*(DATA_CACHE_REFILLS_L2_ALL+DATA_CACHE_EVICTED_ALL)*64
-
Profiling group to measure L2 cache bandwidth. The bandwidth is
computed from the number of cache lines loaded from L2 to L1 and the
number of modified cache lines evicted from L1.
Note that this bandwidth also includes data transfers due to a
write-allocate load on a store miss in L1 and copy-back transfers
originating from L2.
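Each refill or evict moves one 64-byte cache line, so the bandwidth formulas reduce to a count-times-64 scaling; a sketch with made-up counter values:

```python
CACHE_LINE_BYTES = 64.0

def l2_bandwidth(refills, evicts, runtime_s):
    """L2<->L1 traffic in MBytes/s: each refill/evict moves one 64-byte line."""
    load_bw = 1.0e-06 * refills * CACHE_LINE_BYTES / runtime_s
    evict_bw = 1.0e-06 * evicts * CACHE_LINE_BYTES / runtime_s
    return load_bw, evict_bw, load_bw + evict_bw

load, evict, total = l2_bandwidth(refills=10_000_000, evicts=5_000_000,
                                  runtime_s=1.0)
```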
32
collectors/likwid/groups/k10/L2CACHE.txt
Normal file
@@ -0,0 +1,32 @@
SHORT L2 cache miss rate/ratio

EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 L2_REQUESTS_ALL
PMC2 L2_MISSES_ALL
PMC3 L2_FILL_ALL

METRICS
Runtime (RDTSC) [s] time
L2 request rate (PMC1+PMC3)/PMC0
L2 miss rate PMC2/PMC0
L2 miss ratio PMC2/(PMC1+PMC3)

LONG
Formulas:
L2 request rate = (L2_REQUESTS_ALL+L2_FILL_ALL)/INSTRUCTIONS_RETIRED
L2 miss rate = L2_MISSES_ALL/INSTRUCTIONS_RETIRED
L2 miss ratio = L2_MISSES_ALL/(L2_REQUESTS_ALL+L2_FILL_ALL)
-
This group measures the locality of your data accesses with regard to the
L2 cache. The L2 request rate tells you how data intensive your code is,
i.e. how many data accesses you have on average per instruction.
The L2 miss rate gives a measure of how often it was necessary to get
cache lines from memory, and the L2 miss ratio tells you how many of your
memory references required a cache line to be loaded from a higher level.
While the L2 miss rate might be dictated by your algorithm, you should
try to get the L2 miss ratio as low as possible by increasing your cache reuse.
This group was taken from the whitepaper "Basic Performance Measurements for AMD Athlon 64,
AMD Opteron and AMD Phenom Processors" by Paul J. Drongowski.
35
collectors/likwid/groups/k10/MEM.txt
Normal file
@@ -0,0 +1,35 @@
SHORT Main memory bandwidth in MBytes/s

EVENTSET
PMC0 NORTHBRIDGE_READ_RESPONSE_ALL
PMC1 OCTWORDS_WRITE_TRANSFERS
PMC2 DRAM_ACCESSES_DCT0_ALL
PMC3 DRAM_ACCESSES_DCT1_ALL

METRICS
Runtime (RDTSC) [s] time
Memory read bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
Memory read data volume [GBytes] 1.0E-09*PMC0*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*PMC1*8.0/time
Memory write data volume [GBytes] 1.0E-09*PMC1*8.0
Memory bandwidth [MBytes/s] 1.0E-06*(PMC2+PMC3)*64.0/time
Memory data volume [GBytes] 1.0E-09*(PMC2+PMC3)*64.0

LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*NORTHBRIDGE_READ_RESPONSE_ALL*64/time
Memory read data volume [GBytes] = 1.0E-09*NORTHBRIDGE_READ_RESPONSE_ALL*64
Memory write bandwidth [MBytes/s] = 1.0E-06*OCTWORDS_WRITE_TRANSFERS*8/time
Memory write data volume [GBytes] = 1.0E-09*OCTWORDS_WRITE_TRANSFERS*8
Memory bandwidth [MBytes/s] = 1.0E-06*(DRAM_ACCESSES_DCT0_ALL+DRAM_ACCESSES_DCT1_ALL)*64/time
Memory data volume [GBytes] = 1.0E-09*(DRAM_ACCESSES_DCT0_ALL+DRAM_ACCESSES_DCT1_ALL)*64
-
Profiling group to measure the memory bandwidth drawn by all cores of a socket.
Note: As this group measures the accesses from all cores, it only makes sense
to measure with one core per socket, similar to the Intel Nehalem Uncore events.
The memory read bandwidth contains all data from DRAM, L3, or another cache,
including another core on the same node. The event OCTWORDS_WRITE_TRANSFERS counts
16 Byte transfers, not 64 Byte.
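The read and total bandwidths count 64-byte cache lines, while the write formula uses the group's 8-byte scaling for OCTWORDS_WRITE_TRANSFERS; a sketch with illustrative counter values:

```python
def mem_bandwidth(read_responses, octword_writes, dct0, dct1, runtime_s):
    """Memory bandwidths in MBytes/s, following the group's per-event scalings."""
    read_bw = 1.0e-06 * read_responses * 64.0 / runtime_s
    write_bw = 1.0e-06 * octword_writes * 8.0 / runtime_s
    total_bw = 1.0e-06 * (dct0 + dct1) * 64.0 / runtime_s
    return read_bw, write_bw, total_bw

r, w, t = mem_bandwidth(read_responses=100_000_000, octword_writes=400_000_000,
                        dct0=80_000_000, dct1=70_000_000, runtime_s=1.0)
```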
27
collectors/likwid/groups/k10/NUMA_0_3.txt
Normal file
@@ -0,0 +1,27 @@
SHORT Bandwidth on the HyperTransport links

EVENTSET
PMC0 CPU_TO_DRAM_LOCAL_TO_0
PMC1 CPU_TO_DRAM_LOCAL_TO_1
PMC2 CPU_TO_DRAM_LOCAL_TO_2
PMC3 CPU_TO_DRAM_LOCAL_TO_3

METRICS
Runtime (RDTSC) [s] time
Hyper Transport link0 bandwidth [MBytes/s] 1.0E-06*PMC0*4.0/time
Hyper Transport link1 bandwidth [MBytes/s] 1.0E-06*PMC1*4.0/time
Hyper Transport link2 bandwidth [MBytes/s] 1.0E-06*PMC2*4.0/time
Hyper Transport link3 bandwidth [MBytes/s] 1.0E-06*PMC3*4.0/time

LONG
Formulas:
Hyper Transport link0 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_0*4.0/time
Hyper Transport link1 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_1*4.0/time
Hyper Transport link2 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_2*4.0/time
Hyper Transport link3 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_3*4.0/time
-
Profiling group to measure the bandwidth over the HyperTransport links. Can be used
to detect NUMA problems. Usually there should be only limited traffic over the
HyperTransport links for optimal performance.
27
collectors/likwid/groups/k10/NUMA_4_7.txt
Normal file
@@ -0,0 +1,27 @@
SHORT Bandwidth on the HyperTransport links

EVENTSET
PMC0 CPU_TO_DRAM_LOCAL_TO_4
PMC1 CPU_TO_DRAM_LOCAL_TO_5
PMC2 CPU_TO_DRAM_LOCAL_TO_6
PMC3 CPU_TO_DRAM_LOCAL_TO_7

METRICS
Runtime (RDTSC) [s] time
Hyper Transport link4 bandwidth [MBytes/s] 1.0E-06*PMC0*4.0/time
Hyper Transport link5 bandwidth [MBytes/s] 1.0E-06*PMC1*4.0/time
Hyper Transport link6 bandwidth [MBytes/s] 1.0E-06*PMC2*4.0/time
Hyper Transport link7 bandwidth [MBytes/s] 1.0E-06*PMC3*4.0/time

LONG
Formulas:
Hyper Transport link4 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_4*4.0/time
Hyper Transport link5 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_5*4.0/time
Hyper Transport link6 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_6*4.0/time
Hyper Transport link7 bandwidth [MBytes/s] = 1.0E-06*CPU_TO_DRAM_LOCAL_TO_7*4.0/time
-
Profiling group to measure the bandwidth over the HyperTransport links. Can be used
to detect NUMA problems. Usually there should be only limited traffic over the
HyperTransport links for optimal performance.
35
collectors/likwid/groups/k10/TLB.txt
Normal file
@@ -0,0 +1,35 @@
SHORT TLB miss rate/ratio

EVENTSET
PMC0 INSTRUCTIONS_RETIRED
PMC1 DATA_CACHE_ACCESSES
PMC2 DTLB_L2_HIT_ALL
PMC3 DTLB_L2_MISS_ALL

METRICS
Runtime (RDTSC) [s] time
L1 DTLB request rate PMC1/PMC0
L1 DTLB miss rate (PMC2+PMC3)/PMC0
L1 DTLB miss ratio (PMC2+PMC3)/PMC1
L2 DTLB request rate (PMC2+PMC3)/PMC0
L2 DTLB miss rate PMC3/PMC0
L2 DTLB miss ratio PMC3/(PMC2+PMC3)

LONG
Formulas:
L1 DTLB request rate = DATA_CACHE_ACCESSES / INSTRUCTIONS_RETIRED
L1 DTLB miss rate = (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)/INSTRUCTIONS_RETIRED
L1 DTLB miss ratio = (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)/DATA_CACHE_ACCESSES
L2 DTLB request rate = (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)/INSTRUCTIONS_RETIRED
L2 DTLB miss rate = DTLB_L2_MISS_ALL / INSTRUCTIONS_RETIRED
L2 DTLB miss ratio = DTLB_L2_MISS_ALL / (DTLB_L2_HIT_ALL+DTLB_L2_MISS_ALL)
-
The L1 DTLB request rate tells you how data intensive your code is,
i.e. how many data accesses you have on average per instruction.
The DTLB miss rate gives a measure of how often a TLB miss occurred
per instruction, and the L1 DTLB miss ratio tells you how many
of your memory references, on average, caused a TLB miss.
NOTE: The L2 metrics are only relevant if the L2 DTLB request rate is equal to the L1 DTLB miss rate!
This group was taken from the whitepaper "Basic Performance Measurements for AMD Athlon 64,
AMD Opteron and AMD Phenom Processors" by Paul J. Drongowski.