Mirror of https://github.com/ClusterCockpit/cc-metric-collector.git, synced 2025-08-01 09:00:35 +02:00
Add likwid collector
31
collectors/likwid/groups/westmereEX/BRANCH.txt
Normal file
@@ -0,0 +1,31 @@
SHORT Branch prediction miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 BR_INST_RETIRED_ALL_BRANCHES
PMC1 BR_MISP_RETIRED_ALL_BRANCHES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Branch rate PMC0/FIXC0
Branch misprediction rate PMC1/FIXC0
Branch misprediction ratio PMC1/PMC0
Instructions per branch FIXC0/PMC0

LONG
Formulas:
Branch rate = BR_INST_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction rate = BR_MISP_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction ratio = BR_MISP_RETIRED_ALL_BRANCHES/BR_INST_RETIRED_ALL_BRANCHES
Instructions per branch = INSTR_RETIRED_ANY/BR_INST_RETIRED_ALL_BRANCHES
-
The rates state how often, on average, a branch or a mispredicted branch occurred
per instruction retired in total. The branch misprediction ratio states directly
what fraction of all branch instructions were mispredicted.
Instructions per branch is 1/branch rate.
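The four BRANCH formulas can be checked by hand with a few lines of Python. This is only a sketch: `branch_metrics` is a hypothetical helper, and the counter values are made up for illustration rather than measured.

```python
def branch_metrics(instr_retired, br_retired, br_mispredicted):
    """Derive the BRANCH group metrics from raw counter values."""
    return {
        "Branch rate": br_retired / instr_retired,
        "Branch misprediction rate": br_mispredicted / instr_retired,
        "Branch misprediction ratio": br_mispredicted / br_retired,
        "Instructions per branch": instr_retired / br_retired,
    }


# Made-up counter values, for illustration only.
branch_result = branch_metrics(
    instr_retired=1_000_000, br_retired=200_000, br_mispredicted=10_000
)
for name, value in branch_result.items():
    print(f"{name}: {value}")
```

Note that the misprediction rate is normalised by all retired instructions, while the ratio is normalised by branches only, so the ratio is always the larger of the two.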
25
collectors/likwid/groups/westmereEX/CACHE.txt
Normal file
@@ -0,0 +1,25 @@
SHORT Data cache miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L1D_REPL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
data cache misses PMC0
data cache miss rate PMC0/FIXC0

LONG
Formulas:
data cache misses = L1D_REPL
data cache miss rate = L1D_REPL / INSTR_RETIRED_ANY
-
This group measures the locality of your data accesses with regard to the L1
cache. The data cache miss rate measures how often it was necessary to
get cache lines from higher levels of the memory hierarchy.
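The Runtime unhalted, Clock and CPI rows recur unchanged in every group of this commit. The sketch below works through them in Python. It assumes that `inverseClock` is the reciprocal of the nominal CPU frequency in Hz; `common_metrics` and the 2.0 GHz nominal frequency are hypothetical, and the counter values are made up.

```python
def common_metrics(fixc0, fixc1, fixc2, nominal_hz):
    """Metrics shared by all groups, from the three fixed counters.

    fixc0: INSTR_RETIRED_ANY
    fixc1: CPU_CLK_UNHALTED_CORE
    fixc2: CPU_CLK_UNHALTED_REF
    """
    # Assumption: inverseClock = 1 / nominal clock frequency in Hz.
    inverse_clock = 1.0 / nominal_hz
    return {
        "Runtime unhalted [s]": fixc1 * inverse_clock,
        "Clock [MHz]": 1.0e-06 * (fixc1 / fixc2) / inverse_clock,
        "CPI": fixc1 / fixc0,
    }


# Made-up values: core cycles run 1.5x faster than reference cycles.
common_result = common_metrics(
    fixc0=4_000_000_000,
    fixc1=3_000_000_000,
    fixc2=2_000_000_000,
    nominal_hz=2.0e9,
)
print(common_result)
```

Under that assumption, the ratio FIXC1/FIXC2 scales the nominal frequency to the average frequency the core actually ran at while unhalted.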
22
collectors/likwid/groups/westmereEX/DATA.txt
Normal file
@@ -0,0 +1,22 @@
SHORT Load to store ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 MEM_INST_RETIRED_LOADS
PMC1 MEM_INST_RETIRED_STORES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Load to store ratio PMC0/PMC1

LONG
Formulas:
Load to store ratio = MEM_INST_RETIRED_LOADS/MEM_INST_RETIRED_STORES
-
This is a simple metric to determine your load to store ratio.
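Every file in this commit follows the same layout: a SHORT line, then EVENTSET, METRICS and LONG sections. The sketch below reads that layout in Python, with the DATA.txt content as input. `parse_group` is a hypothetical helper, not part of LIKWID or cc-metric-collector, and it assumes that formulas contain no spaces, which holds for the files shown here.

```python
DATA_GROUP = """\
SHORT Load to store ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 MEM_INST_RETIRED_LOADS
PMC1 MEM_INST_RETIRED_STORES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Load to store ratio PMC0/PMC1

LONG
Formulas:
Load to store ratio = MEM_INST_RETIRED_LOADS/MEM_INST_RETIRED_STORES
-
This is a simple metric to determine your load to store ratio.
"""


def parse_group(text):
    """Split a group file into its SHORT, EVENTSET, METRICS and LONG parts."""
    group = {"short": "", "eventset": {}, "metrics": [], "long": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if section is None and line.startswith("SHORT"):
            group["short"] = line[len("SHORT"):].strip()
        elif line in ("EVENTSET", "METRICS", "LONG"):
            section = line
        elif section == "EVENTSET":
            # "<counter> <event>", e.g. "PMC0 MEM_INST_RETIRED_LOADS"
            counter, event = line.split()
            group["eventset"][counter] = event
        elif section == "METRICS":
            # The formula is the last whitespace-separated token;
            # everything before it is the metric name.
            name, formula = line.rsplit(None, 1)
            group["metrics"].append((name, formula))
        elif section == "LONG":
            group["long"].append(line)
    return group


parsed = parse_group(DATA_GROUP)
print(parsed["short"])
print(parsed["eventset"])
print(parsed["metrics"])
```

Splitting each METRICS line from the right keeps metric names such as `Runtime (RDTSC) [s]` intact even though they contain spaces.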
24
collectors/likwid/groups/westmereEX/DIVIDE.txt
Normal file
@@ -0,0 +1,24 @@
SHORT Divide unit information

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 ARITH_NUM_DIV
PMC1 ARITH_CYCLES_DIV_BUSY


METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Number of divide ops PMC0
Avg. divide unit usage duration PMC1/PMC0

LONG
Formulas:
Number of divide ops = ARITH_NUM_DIV
Avg. divide unit usage duration = ARITH_CYCLES_DIV_BUSY/ARITH_NUM_DIV
-
This performance group measures the average latency of divide operations.
35
collectors/likwid/groups/westmereEX/FLOPS_DP.txt
Normal file
@@ -0,0 +1,35 @@
SHORT Double Precision MFLOP/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 FP_COMP_OPS_EXE_SSE_FP_PACKED
PMC1 FP_COMP_OPS_EXE_SSE_FP_SCALAR
PMC2 FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION
PMC3 FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
DP [MFLOP/s] 1.0E-06*(PMC0*2.0+PMC1)/time
Packed [MUOPS/s] 1.0E-06*PMC0/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
SP [MUOPS/s] 1.0E-06*PMC2/time
DP [MUOPS/s] 1.0E-06*PMC3/time

LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED*2+FP_COMP_OPS_EXE_SSE_FP_SCALAR)/runtime
Packed [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_PACKED/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_SCALAR/runtime
SP [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION/runtime
DP [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION/runtime
-
The Westmere EX cannot measure MFLOP/s if mixed-precision calculations are done.
Therefore both single and double precision are measured to ensure the correctness
of the measurements. You can check whether your code was vectorized by comparing the
count of FP_COMP_OPS_EXE_SSE_FP_PACKED with that of FP_COMP_OPS_EXE_SSE_FP_SCALAR.
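The DP [MFLOP/s] formula weights each packed SSE uOP as two double-precision operations and each scalar uOP as one. The Python sketch below works through it and adds a packed share as one possible way to do the vectorization check the description mentions. Both function names are hypothetical and the counter values are made up.

```python
def dp_mflops(packed_uops, scalar_uops, runtime_s):
    """DP [MFLOP/s]: a packed SSE uOP counts as 2 DP operations, a scalar as 1."""
    return 1.0e-06 * (packed_uops * 2.0 + scalar_uops) / runtime_s


def packed_fraction(packed_uops, scalar_uops):
    """Share of FP uOPs that were packed; a rough vectorization indicator."""
    total = packed_uops + scalar_uops
    return packed_uops / total if total else 0.0


# Made-up counter values, for illustration only.
dp_rate = dp_mflops(packed_uops=4_000_000, scalar_uops=2_000_000, runtime_s=2.0)
dp_packed_share = packed_fraction(4_000_000, 2_000_000)
print(dp_rate, dp_packed_share)
```

A packed share close to 1.0 suggests that most floating-point work ran through packed SSE instructions; a share close to 0.0 suggests mostly scalar code.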
35
collectors/likwid/groups/westmereEX/FLOPS_SP.txt
Normal file
@@ -0,0 +1,35 @@
SHORT Single Precision MFLOP/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 FP_COMP_OPS_EXE_SSE_FP_PACKED
PMC1 FP_COMP_OPS_EXE_SSE_FP_SCALAR
PMC2 FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION
PMC3 FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
SP [MFLOP/s] 1.0E-06*(PMC0*4.0+PMC1)/time
Packed [MUOPS/s] 1.0E-06*PMC0/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
SP [MUOPS/s] 1.0E-06*PMC2/time
DP [MUOPS/s] 1.0E-06*PMC3/time

LONG
Formulas:
SP [MFLOP/s] = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED*4+FP_COMP_OPS_EXE_SSE_FP_SCALAR)/runtime
Packed [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_PACKED/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_SCALAR/runtime
SP [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION/runtime
DP [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION/runtime
-
The Westmere EX cannot measure MFLOP/s if mixed-precision calculations are done.
Therefore both single and double precision are measured to ensure the correctness
of the measurements. You can check whether your code was vectorized by comparing the
count of FP_COMP_OPS_EXE_SSE_FP_PACKED with that of FP_COMP_OPS_EXE_SSE_FP_SCALAR.
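Because the packed and scalar events do not distinguish precision, the SP [MFLOP/s] value is only meaningful when the code does not mix precisions. The Python sketch below shows one way to use the two precision counters as a sanity check. `sp_report` is a hypothetical helper, and treating any non-zero double-precision count as "mixed" is an assumption made here, not a rule stated in the group file.

```python
def sp_report(packed_uops, scalar_uops, sp_uops, dp_uops, runtime_s):
    """Return SP MFLOP/s plus a flag telling whether the value can be trusted."""
    # A packed SSE uOP is counted as 4 single-precision operations.
    sp_mflops = 1.0e-06 * (packed_uops * 4.0 + scalar_uops) / runtime_s
    # Assumption: any DP activity alongside SP activity counts as mixed precision.
    mixed_precision = sp_uops > 0 and dp_uops > 0
    return sp_mflops, mixed_precision


# Made-up counter values: a pure single-precision run and a mixed one.
pure_rate, pure_mixed = sp_report(1_000_000, 500_000, 1_500_000, 0, 1.0)
mixed_rate, mixed_flag = sp_report(1_000_000, 500_000, 1_000_000, 500_000, 1.0)
print(pure_rate, pure_mixed)
print(mixed_rate, mixed_flag)
```

Both calls return the same MFLOP/s figure, which is exactly the problem: only the flag reveals that the second figure overcounts, since some packed uOPs carried two doubles rather than four floats.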
21
collectors/likwid/groups/westmereEX/FLOPS_X87.txt
Normal file
@@ -0,0 +1,21 @@
SHORT X87 MFLOP/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 INST_RETIRED_X87

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
X87 [MFLOP/s] 1.0E-06*PMC0/time

LONG
Formulas:
X87 [MFLOP/s] = 1.0E-06*INST_RETIRED_X87/runtime
-
Profiling group to measure X87 FLOP rate.
25
collectors/likwid/groups/westmereEX/ICACHE.txt
Normal file
@@ -0,0 +1,25 @@
SHORT Instruction cache miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L1I_READS
PMC1 L1I_MISSES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1I request rate PMC0/FIXC0
L1I miss rate PMC1/FIXC0
L1I miss ratio PMC1/PMC0

LONG
Formulas:
L1I request rate = L1I_READS / INSTR_RETIRED_ANY
L1I miss rate = L1I_MISSES / INSTR_RETIRED_ANY
L1I miss ratio = L1I_MISSES / L1I_READS
-
This group measures some L1 instruction cache metrics.
38
collectors/likwid/groups/westmereEX/L2.txt
Normal file
@@ -0,0 +1,38 @@
SHORT L2 cache bandwidth in MBytes/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L1D_REPL
PMC1 L1D_M_EVICT
PMC2 L1I_MISSES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L2D load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2D load data volume [GBytes] 1.0E-09*PMC0*64.0
L2D evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L2D evict data volume [GBytes] 1.0E-09*PMC1*64.0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1+PMC2)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1+PMC2)*64.0

LONG
Formulas:
L2D load bandwidth [MBytes/s] = 1.0E-06*L1D_REPL*64.0/time
L2D load data volume [GBytes] = 1.0E-09*L1D_REPL*64.0
L2D evict bandwidth [MBytes/s] = 1.0E-06*L1D_M_EVICT*64.0/time
L2D evict data volume [GBytes] = 1.0E-09*L1D_M_EVICT*64.0
L2 bandwidth [MBytes/s] = 1.0E-06*(L1D_REPL+L1D_M_EVICT+L1I_MISSES)*64/time
L2 data volume [GBytes] = 1.0E-09*(L1D_REPL+L1D_M_EVICT+L1I_MISSES)*64
-
Profiling group to measure L2 cache bandwidth. The bandwidth is computed from the
number of cache lines allocated in the L1 and the number of modified cache lines
evicted from the L1. It also reports the total data volume transferred between L2
and L1 cache. Note that this bandwidth also includes data transfers due to a
write allocate load on a store miss in L1 and traffic caused by misses in the
instruction cache.
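All L2 metrics follow one pattern: a count of cache lines times 64 bytes per line, scaled to MBytes/s or GBytes. The Python sketch below spells that pattern out. `l2_metrics` is a hypothetical helper and the counter values are made up.

```python
CACHE_LINE_BYTES = 64.0  # cache line size used by the formulas in this group


def l2_metrics(l1d_repl, l1d_m_evict, l1i_misses, runtime_s):
    """Bandwidth [MBytes/s] and data volume [GBytes] between L2 and L1."""

    def bandwidth(lines):
        return 1.0e-06 * lines * CACHE_LINE_BYTES / runtime_s

    def volume(lines):
        return 1.0e-09 * lines * CACHE_LINE_BYTES

    total_lines = l1d_repl + l1d_m_evict + l1i_misses
    return {
        "L2D load bandwidth [MBytes/s]": bandwidth(l1d_repl),
        "L2D load data volume [GBytes]": volume(l1d_repl),
        "L2D evict bandwidth [MBytes/s]": bandwidth(l1d_m_evict),
        "L2D evict data volume [GBytes]": volume(l1d_m_evict),
        "L2 bandwidth [MBytes/s]": bandwidth(total_lines),
        "L2 data volume [GBytes]": volume(total_lines),
    }


# Made-up counter values, for illustration only.
l2_result = l2_metrics(
    l1d_repl=10_000_000, l1d_m_evict=5_000_000, l1i_misses=1_000_000, runtime_s=1.0
)
for name, value in l2_result.items():
    print(f"{name}: {value:.3f}")
```

The total includes instruction cache misses, so it is larger than the sum of the two L2D rows whenever L1I_MISSES is non-zero.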
34
collectors/likwid/groups/westmereEX/L2CACHE.txt
Normal file
@@ -0,0 +1,34 @@
SHORT L2 cache miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L2_RQSTS_REFERENCES
PMC1 L2_RQSTS_MISS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L2 request rate PMC0/FIXC0
L2 miss rate PMC1/FIXC0
L2 miss ratio PMC1/PMC0

LONG
Formulas:
L2 request rate = L2_RQSTS_REFERENCES/INSTR_RETIRED_ANY
L2 miss rate = L2_RQSTS_MISS/INSTR_RETIRED_ANY
L2 miss ratio = L2_RQSTS_MISS/L2_RQSTS_REFERENCES
-
This group measures the locality of your data accesses with regard to the
L2 cache. The L2 request rate tells you how data intensive your code is,
that is, how many data accesses you have on average per instruction.
The L2 miss rate measures how often it was necessary to get
cache lines from memory. Finally, the L2 miss ratio tells you how many of your
memory references required a cache line to be loaded from a higher level.
While the data cache miss rate might be dictated by your algorithm, you should
try to get the data cache miss ratio as low as possible by increasing your cache reuse.
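The difference between a miss rate (per instruction) and a miss ratio (per request) is easy to mix up. The Python sketch below computes the three L2CACHE metrics and shows how they are linked: the miss rate is the request rate times the miss ratio. `l2_cache_metrics` is a hypothetical helper and the counter values are made up.

```python
def l2_cache_metrics(references, misses, instr_retired):
    """Request rate, miss rate and miss ratio as defined in the L2CACHE group."""
    return {
        "L2 request rate": references / instr_retired,
        "L2 miss rate": misses / instr_retired,
        "L2 miss ratio": misses / references,
    }


# Made-up counter values, for illustration only.
l2c = l2_cache_metrics(references=50_000, misses=5_000, instr_retired=1_000_000)
print(l2c)

# The three metrics are linked: miss rate = request rate * miss ratio.
l2c_product = l2c["L2 request rate"] * l2c["L2 miss ratio"]
print(l2c_product)
```

This relation explains the advice in the description: the request rate is largely fixed by the algorithm, so lowering the miss ratio through better cache reuse is the practical way to lower the miss rate.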
36
collectors/likwid/groups/westmereEX/L3.txt
Normal file
@@ -0,0 +1,36 @@
SHORT L3 cache bandwidth in MBytes/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L2_LINES_IN_ANY
PMC1 L2_LINES_OUT_DEMAND_DIRTY

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L3 load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L3 load data volume [GBytes] 1.0E-09*PMC0*64.0
L3 evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L3 evict data volume [GBytes] 1.0E-09*PMC1*64.0
L3 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
L3 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0

LONG
Formulas:
L3 load bandwidth [MBytes/s] = 1.0E-06*L2_LINES_IN_ANY*64.0/time
L3 load data volume [GBytes] = 1.0E-09*L2_LINES_IN_ANY*64.0
L3 evict bandwidth [MBytes/s] = 1.0E-06*L2_LINES_OUT_DEMAND_DIRTY*64.0/time
L3 evict data volume [GBytes] = 1.0E-09*L2_LINES_OUT_DEMAND_DIRTY*64.0
L3 bandwidth [MBytes/s] = 1.0E-06*(L2_LINES_IN_ANY+L2_LINES_OUT_DEMAND_DIRTY)*64/time
L3 data volume [GBytes] = 1.0E-09*(L2_LINES_IN_ANY+L2_LINES_OUT_DEMAND_DIRTY)*64
-
Profiling group to measure L3 cache bandwidth. The bandwidth is
computed from the number of cache lines allocated in the L2 and the number of
modified cache lines evicted from the L2. It also reports the data volume transferred
between L3 and L2 caches. Note that this bandwidth also includes data transfers
due to a write allocate load on a store miss in L2.
52
collectors/likwid/groups/westmereEX/L3CACHE.txt
Normal file
@@ -0,0 +1,52 @@
SHORT L3 cache miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
CBOX0C0 LLC_HITS_ALL
CBOX0C1 LLC_MISSES_ALL
CBOX1C0 LLC_HITS_ALL
CBOX1C1 LLC_MISSES_ALL
CBOX2C0 LLC_HITS_ALL
CBOX2C1 LLC_MISSES_ALL
CBOX3C0 LLC_HITS_ALL
CBOX3C1 LLC_MISSES_ALL
CBOX4C0 LLC_HITS_ALL
CBOX4C1 LLC_MISSES_ALL
CBOX5C0 LLC_HITS_ALL
CBOX5C1 LLC_MISSES_ALL
CBOX6C0 LLC_HITS_ALL
CBOX6C1 LLC_MISSES_ALL
CBOX7C0 LLC_HITS_ALL
CBOX7C1 LLC_MISSES_ALL
CBOX8C0 LLC_HITS_ALL
CBOX8C1 LLC_MISSES_ALL
CBOX9C0 LLC_HITS_ALL
CBOX9C1 LLC_MISSES_ALL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L3 request rate (CBOX0C0+CBOX0C1+CBOX1C0+CBOX1C1+CBOX2C0+CBOX2C1+CBOX3C0+CBOX3C1+CBOX4C0+CBOX4C1+CBOX5C0+CBOX5C1+CBOX6C0+CBOX6C1+CBOX7C0+CBOX7C1+CBOX8C0+CBOX8C1+CBOX9C0+CBOX9C1)/FIXC0
L3 miss rate (CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1)/FIXC0
L3 miss ratio (CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1)/(CBOX0C0+CBOX0C1+CBOX1C0+CBOX1C1+CBOX2C0+CBOX2C1+CBOX3C0+CBOX3C1+CBOX4C0+CBOX4C1+CBOX5C0+CBOX5C1+CBOX6C0+CBOX6C1+CBOX7C0+CBOX7C1+CBOX8C0+CBOX8C1+CBOX9C0+CBOX9C1)

LONG
Formulas:
L3 request rate = (SUM(LLC_HITS_ALL)+SUM(LLC_MISSES_ALL))/INSTR_RETIRED_ANY
L3 miss rate = SUM(LLC_MISSES_ALL)/INSTR_RETIRED_ANY
L3 miss ratio = SUM(LLC_MISSES_ALL)/(SUM(LLC_HITS_ALL)+SUM(LLC_MISSES_ALL))
-
This group measures the locality of your data accesses with regard to the
L3 cache. The L3 request rate tells you how data intensive your code is,
that is, how many data accesses you have on average per instruction.
The L3 miss rate measures how often it was necessary to get
cache lines from memory. Finally, the L3 miss ratio tells you how many of your
memory references required a cache line to be loaded from a higher level.
While the data cache miss rate might be dictated by your algorithm, you should
try to get the data cache miss ratio as low as possible by increasing your cache reuse.
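The long METRICS expressions in L3CACHE.txt are the SUM(...) formulas written out over the ten CBOX units, with counter C0 holding hits and C1 holding misses. The Python sketch below performs the same summation. `l3_cache_metrics` is a hypothetical helper, the counter values are made up, and the zero-request guard is an addition that the group file itself does not have.

```python
def l3_cache_metrics(counters, instr_retired, n_cbox=10):
    """Sum LLC hits (CBOXnC0) and misses (CBOXnC1) over all CBOX units."""
    hits = sum(counters[f"CBOX{i}C0"] for i in range(n_cbox))
    misses = sum(counters[f"CBOX{i}C1"] for i in range(n_cbox))
    requests = hits + misses
    return {
        "L3 request rate": requests / instr_retired,
        "L3 miss rate": misses / instr_retired,
        # Guard against an idle uncore; not part of the original formula.
        "L3 miss ratio": misses / requests if requests else 0.0,
    }


# Made-up values: every CBOX saw 900 hits and 100 misses.
cbox_counters = {}
for box in range(10):
    cbox_counters[f"CBOX{box}C0"] = 900
    cbox_counters[f"CBOX{box}C1"] = 100

l3_result = l3_cache_metrics(cbox_counters, instr_retired=100_000)
print(l3_result)
```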
38
collectors/likwid/groups/westmereEX/MEM.txt
Normal file
@@ -0,0 +1,38 @@
SHORT Main memory bandwidth in MBytes/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
MBOX0C0 FVC_EV0_BBOX_CMDS_READS
MBOX0C1 DRAM_CMD_CAS_WR_OPN
MBOX0C2 DRAM_MISC_CAS_WR_CLS
MBOX1C0 FVC_EV0_BBOX_CMDS_READS
MBOX1C1 DRAM_CMD_CAS_WR_OPN
MBOX1C2 DRAM_MISC_CAS_WR_CLS


METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX0C2+MBOX1C1+MBOX1C2)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX0C2+MBOX1C1+MBOX1C2)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX0C1+MBOX1C1+MBOX0C2+MBOX1C2)*64/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX0C1+MBOX1C1+MBOX0C2+MBOX1C2)*64

LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0))*64.0/time
Memory read data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC1)+SUM(MBOXxC2))*64.0/time
Memory write data volume [GBytes] = 1.0E-09*(SUM(MBOXxC1)+SUM(MBOXxC2))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0)+SUM(MBOXxC1)+SUM(MBOXxC2))*64.0/time
Memory data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0)+SUM(MBOXxC1)+SUM(MBOXxC2))*64.0
-
Profiling group to measure memory bandwidth drawn by all cores of a socket.
In addition to the bandwidth, it also outputs the data volume.
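The METRICS formulas are plain arithmetic over counter names plus `time`, so they can be evaluated directly once each name is bound to a number. The Python sketch below does this for two of the MEM formulas. `evaluate_metric` is a hypothetical helper and the counter values are made up. Python's `eval` is used only to keep the example short; it is not necessarily how the collector evaluates formulas, and it is not a safe choice for untrusted input.

```python
def evaluate_metric(formula, values):
    """Evaluate one METRICS formula with counter names bound to numbers."""
    # Builtins are disabled so that only the supplied names resolve.
    # This narrows, but does not remove, the risks of eval().
    return eval(formula, {"__builtins__": {}}, dict(values))


# Made-up counter values for the two MBOX units, plus the runtime in seconds.
mem_values = {
    "MBOX0C0": 1_000_000, "MBOX1C0": 1_000_000,
    "MBOX0C1": 250_000, "MBOX1C1": 250_000,
    "MBOX0C2": 250_000, "MBOX1C2": 250_000,
    "time": 2.0,
}

read_bw = evaluate_metric("1.0E-06*(MBOX0C0+MBOX1C0)*64.0/time", mem_values)
write_bw = evaluate_metric(
    "1.0E-06*(MBOX0C1+MBOX0C2+MBOX1C1+MBOX1C2)*64.0/time", mem_values
)
print(read_bw, write_bw)
```

Literals such as `1.0E-06` and `1.E-06` are valid Python floats, so the formula strings can be passed through unchanged.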
33
collectors/likwid/groups/westmereEX/NUMA.txt
Normal file
@@ -0,0 +1,33 @@
SHORT Local and remote memory accesses

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 OFFCORE_RESPONSE_0_LOCAL_DRAM
PMC1 OFFCORE_RESPONSE_1_REMOTE_DRAM

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Local DRAM data volume [GByte] 1.E-09*PMC0*64
Local DRAM bandwidth [MByte/s] 1.E-06*(PMC0*64)/time
Remote DRAM data volume [GByte] 1.E-09*PMC1*64
Remote DRAM bandwidth [MByte/s] 1.E-06*(PMC1*64)/time
Memory data volume [GByte] 1.E-09*(PMC0+PMC1)*64
Memory bandwidth [MByte/s] 1.E-06*((PMC0+PMC1)*64)/time

LONG
Formulas:
CPI = CPU_CLK_UNHALTED_CORE/INSTR_RETIRED_ANY
Local DRAM data volume [GByte] = 1.E-09*OFFCORE_RESPONSE_0_LOCAL_DRAM*64
Local DRAM bandwidth [MByte/s] = 1.E-06*(OFFCORE_RESPONSE_0_LOCAL_DRAM*64)/time
Remote DRAM data volume [GByte] = 1.E-09*OFFCORE_RESPONSE_1_REMOTE_DRAM*64
Remote DRAM bandwidth [MByte/s] = 1.E-06*(OFFCORE_RESPONSE_1_REMOTE_DRAM*64)/time
Memory data volume [GByte] = 1.E-09*(OFFCORE_RESPONSE_0_LOCAL_DRAM+OFFCORE_RESPONSE_1_REMOTE_DRAM)*64
Memory bandwidth [MByte/s] = 1.E-06*((OFFCORE_RESPONSE_0_LOCAL_DRAM+OFFCORE_RESPONSE_1_REMOTE_DRAM)*64)/time
--
This performance group measures the data traffic of CPU cores to local and remote
memory.
35
collectors/likwid/groups/westmereEX/TLB_DATA.txt
Normal file
@@ -0,0 +1,35 @@
SHORT L2 data TLB miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 DTLB_LOAD_MISSES_ANY
PMC1 DTLB_MISSES_ANY
PMC2 DTLB_LOAD_MISSES_WALK_CYCLES
PMC3 DTLB_MISSES_WALK_CYCLES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1 DTLB load misses PMC0
L1 DTLB load miss rate PMC0/FIXC0
L1 DTLB load miss duration [Cyc] PMC2/PMC0
L1 DTLB store misses (PMC1-PMC0)
L1 DTLB store miss rate (PMC1-PMC0)/FIXC0
L1 DTLB store miss duration [Cyc] (PMC3-PMC2)/(PMC1-PMC0)

LONG
Formulas:
L1 DTLB load misses = DTLB_LOAD_MISSES_ANY
L1 DTLB load miss rate = DTLB_LOAD_MISSES_ANY / INSTR_RETIRED_ANY
L1 DTLB load miss duration [Cyc] = DTLB_LOAD_MISSES_WALK_CYCLES / DTLB_LOAD_MISSES_ANY
L1 DTLB store misses = DTLB_MISSES_ANY-DTLB_LOAD_MISSES_ANY
L1 DTLB store miss rate = (DTLB_MISSES_ANY-DTLB_LOAD_MISSES_ANY) / INSTR_RETIRED_ANY
L1 DTLB store miss duration [Cyc] = (DTLB_MISSES_WALK_CYCLES-DTLB_LOAD_MISSES_WALK_CYCLES) / (DTLB_MISSES_ANY-DTLB_LOAD_MISSES_ANY)
-
The DTLB miss rate measures how often a TLB miss occurred
per instruction. The store miss metrics are derived by subtracting the load
counts from the counts for all misses (ALL - LOADS).
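No event in this group counts store misses directly. The store metrics come from subtracting the load counters from the counters for all misses. The Python sketch below walks through that subtraction. `dtlb_metrics` is a hypothetical helper, the counter values are made up, and the division guard is an addition that is not in the group file.

```python
def dtlb_metrics(load_misses, all_misses, load_walk_cycles, all_walk_cycles,
                 instr_retired):
    """Load metrics are measured directly; store metrics are ALL minus LOADS."""

    def safe_div(numerator, denominator):
        # Addition for the sketch: avoid dividing by zero when no misses occurred.
        return numerator / denominator if denominator else 0.0

    store_misses = all_misses - load_misses
    store_walk_cycles = all_walk_cycles - load_walk_cycles
    return {
        "L1 DTLB load misses": load_misses,
        "L1 DTLB load miss rate": load_misses / instr_retired,
        "L1 DTLB load miss duration [Cyc]": safe_div(load_walk_cycles, load_misses),
        "L1 DTLB store misses": store_misses,
        "L1 DTLB store miss rate": store_misses / instr_retired,
        "L1 DTLB store miss duration [Cyc]": safe_div(store_walk_cycles, store_misses),
    }


# Made-up counter values, for illustration only.
tlb_result = dtlb_metrics(
    load_misses=8_000, all_misses=10_000,
    load_walk_cycles=240_000, all_walk_cycles=320_000,
    instr_retired=1_000_000,
)
for name, value in tlb_result.items():
    print(f"{name}: {value}")
```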
27
collectors/likwid/groups/westmereEX/TLB_INSTR.txt
Normal file
@@ -0,0 +1,27 @@
SHORT L1 Instruction TLB miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 ITLB_MISSES_ANY
PMC1 ITLB_MISSES_WALK_CYCLES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1 ITLB misses PMC0
L1 ITLB miss rate PMC0/FIXC0
L1 ITLB miss duration [Cyc] PMC1/PMC0

LONG
Formulas:
L1 ITLB misses = ITLB_MISSES_ANY
L1 ITLB miss rate = ITLB_MISSES_ANY / INSTR_RETIRED_ANY
L1 ITLB miss duration [Cyc] = ITLB_MISSES_WALK_CYCLES / ITLB_MISSES_ANY
-
The ITLB miss rate measures how often a TLB miss occurred
per instruction. The duration is the average number of cycles a page walk took.
32
collectors/likwid/groups/westmereEX/UOPS.txt
Normal file
@@ -0,0 +1,32 @@
SHORT UOPs execution info

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_ISSUED_ANY
PMC2 UOPS_RETIRED_ANY
PMC3 UOPS_ISSUED_FUSED



METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Issued UOPs PMC0
Merged UOPs PMC3
Retired UOPs PMC2

LONG
Formulas:
Issued UOPs = UOPS_ISSUED_ANY
Merged UOPs = UOPS_ISSUED_FUSED
Retired UOPs = UOPS_RETIRED_ANY
-
This group returns information about the instruction pipeline. It reports the
number of issued, merged (fused) and retired uOPs. uOPs that are issued but
never retired commonly come from speculatively executed branches.