Add likwid collector

This commit is contained in:
Thomas Roehl
2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions

View File

@@ -0,0 +1,31 @@
SHORT Branch prediction miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 BR_INST_RETIRED_ALL_BRANCHES
PMC1 BR_MISP_RETIRED_ALL_BRANCHES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Branch rate PMC0/FIXC0
Branch misprediction rate PMC1/FIXC0
Branch misprediction ratio PMC1/PMC0
Instructions per branch FIXC0/PMC0
LONG
Formulas:
Branch rate = BR_INST_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction rate = BR_MISP_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction ratio = BR_MISP_RETIRED_ALL_BRANCHES/BR_INST_RETIRED_ALL_BRANCHES
Instructions per branch = INSTR_RETIRED_ANY/BR_INST_RETIRED_ALL_BRANCHES
-
The rates state how often on average a branch or a mispredicted branch occurred
per instruction retired in total. The branch misprediction ratio sets directly
into relation what ratio of all branch instruction where mispredicted.
Instructions per branch is 1/branch rate.

View File

@@ -0,0 +1,25 @@
SHORT Data cache miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L1D_REPL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
data cache misses PMC0
data cache miss rate PMC0/FIXC0
LONG
Formulas:
data cache misses = L1D_REPL
data cache miss rate = L1D_REPL / INSTR_RETIRED_ANY
-
This group measures the locality of your data accesses with regard to the L1
cache. The data cache miss rate gives a measure how often it was necessary to
get cache lines from higher levels of the memory hierarchy.

View File

@@ -0,0 +1,22 @@
SHORT Load to store ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 MEM_INST_RETIRED_LOADS
PMC1 MEM_INST_RETIRED_STORES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Load to store ratio PMC0/PMC1
LONG
Formulas:
Load to store ratio = MEM_INST_RETIRED_LOADS/MEM_INST_RETIRED_STORES
-
This is a simple metric to determine your load to store ratio.

View File

@@ -0,0 +1,24 @@
SHORT Divide unit information
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 ARITH_NUM_DIV
PMC1 ARITH_CYCLES_DIV_BUSY
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Number of divide ops PMC0
Avg. divide unit usage duration PMC1/PMC0
LONG
Formulas:
Number of divide ops = ARITH_NUM_DIV
Avg. divide unit usage duration = ARITH_CYCLES_DIV_BUSY/ARITH_NUM_DIV
-
This performance group measures the average latency of divide operations

View File

@@ -0,0 +1,35 @@
SHORT Double Precision MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 FP_COMP_OPS_EXE_SSE_FP_PACKED
PMC1 FP_COMP_OPS_EXE_SSE_FP_SCALAR
PMC2 FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION
PMC3 FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
DP [MFLOP/s] 1.0E-06*(PMC0*2.0+PMC1)/time
Packed [MUOPS/s] 1.0E-06*PMC0/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
SP [MUOPS/s] 1.0E-06*PMC2/time
DP [MUOPS/s] 1.0E-06*PMC3/time
LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED*2+FP_COMP_OPS_EXE_SSE_FP_SCALAR)/runtime
Packed [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_PACKED/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_SCALAR/runtime
SP [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION/runtime
DP [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION/runtime
-
The Nehalem has no possibility to measure MFLOPs if mixed precision calculations are done.
Therefore both single as well as double precision are measured to ensure the correctness
of the measurements. You can check if your code was vectorized on the number of
FP_COMP_OPS_EXE_SSE_FP_PACKED versus the FP_COMP_OPS_EXE_SSE_FP_SCALAR.

View File

@@ -0,0 +1,35 @@
SHORT Single Precision MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 FP_COMP_OPS_EXE_SSE_FP_PACKED
PMC1 FP_COMP_OPS_EXE_SSE_FP_SCALAR
PMC2 FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION
PMC3 FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
SP [MFLOP/s] 1.0E-06*(PMC0*4.0+PMC1)/time
Packed [MUOPS/s] 1.0E-06*PMC0/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
SP [MUOPS/s] 1.0E-06*PMC2/time
DP [MUOPS/s] 1.0E-06*PMC3/time
LONG
Formulas:
SP [MFLOP/s] = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED*4+FP_COMP_OPS_EXE_SSE_FP_SCALAR)/runtime
Packed [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_PACKED/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_SCALAR/runtime
SP [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION/runtime
DP [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION/runtime
-
The Westmere EX has no possibility to measure MFLOPs if mixed precision calculations are done.
Therefore both single as well as double precision are measured to ensure the correctness
of the measurements. You can check if your code was vectorized on the number of
FP_COMP_OPS_EXE_SSE_FP_PACKED versus the FP_COMP_OPS_EXE_SSE_FP_SCALAR.

View File

@@ -0,0 +1,21 @@
SHORT X87 MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 INST_RETIRED_X87
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
X87 [MFLOP/s] 1.0E-06*PMC0/time
LONG
Formulas:
X87 [MFLOP/s] = 1.0E-06*INST_RETIRED_X87/runtime
-
Profiling group to measure X87 FLOP rate.

View File

@@ -0,0 +1,25 @@
SHORT Instruction cache miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L1I_READS
PMC1 L1I_MISSES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1I request rate PMC0/FIXC0
L1I miss rate PMC1/FIXC0
L1I miss ratio PMC1/PMC0
LONG
Formulas:
L1I request rate = L1I_READS / INSTR_RETIRED_ANY
L1I miss rate = ICACHE_MISSES / INSTR_RETIRED_ANY
L1I miss ratio = ICACHE_MISSES / L1I_READS
-
This group measures some L1 instruction cache metrics.

View File

@@ -0,0 +1,38 @@
SHORT L2 cache bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L1D_REPL
PMC1 L1D_M_EVICT
PMC2 L1I_MISSES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L2D load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2D load data volume [GBytes] 1.0E-09*PMC0*64.0
L2D evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L2D evict data volume [GBytes] 1.0E-09*PMC1*64.0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1+PMC2)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1+PMC2)*64.0
LONG
Formulas:
L2D load bandwidth [MBytes/s] = 1.0E-06*L1D_REPL*64.0/time
L2D load data volume [GBytes] = 1.0E-09*L1D_REPL*64.0
L2D evict bandwidth [MBytes/s] = 1.0E-06*L1D_M_EVICT*64.0/time
L2D evict data volume [GBytes] = 1.0E-09*L1D_M_EVICT*64.0
L2 bandwidth [MBytes/s] = 1.0E-06*(L1D_REPL+L1D_M_EVICT+L1I_MISSES)*64/time
L2 data volume [GBytes] = 1.0E-09*(L1D_REPL+L1D_M_EVICT+L1I_MISSES)*64
-
Profiling group to measure L2 cache bandwidth. The bandwidth is computed by the
number of cache line allocated in the L1 and the number of modified cache lines
evicted from the L1. Also reports on total data volume transferred between L2
and L1 cache. Note that this bandwidth also includes data transfers due to a
write allocate load on a store miss in L1 and traffic caused by misses in the
instruction cache.

View File

@@ -0,0 +1,34 @@
SHORT L2 cache miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L2_RQSTS_REFERENCES
PMC1 L2_RQSTS_MISS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L2 request rate PMC0/FIXC0
L2 miss rate PMC1/FIXC0
L2 miss ratio PMC1/PMC0
LONG
Formulas:
L2 request rate = L2_RQSTS_REFERENCES/INSTR_RETIRED_ANY
L2 miss rate = L2_RQSTS_MISS/INSTR_RETIRED_ANY
L2 miss ratio = L2_RQSTS_MISS/L2_RQSTS_REFERENCES
-
This group measures the locality of your data accesses with regard to the
L2 cache. L2 request rate tells you how data intensive your code is
or how many data accesses you have on average per instruction.
The L2 miss rate gives a measure how often it was necessary to get
cache lines from memory. And finally L2 miss ratio tells you how many of your
memory references required a cache line to be loaded from a higher level.
While the data cache miss rate might be given by your algorithm you should
try to get data cache miss ratio as low as possible by increasing your cache reuse.

View File

@@ -0,0 +1,36 @@
SHORT L3 cache bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L2_LINES_IN_ANY
PMC1 L2_LINES_OUT_DEMAND_DIRTY
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L3 load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L3 load data volume [GBytes] 1.0E-09*PMC0*64.0
L3 evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L3 evict data volume [GBytes] 1.0E-09*PMC1*64.0
L3 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
L3 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
L3 load bandwidth [MBytes/s] = 1.0E-06*L2_LINES_IN_ANY*64.0/time
L3 load data volume [GBytes] = 1.0E-09*L2_LINES_IN_ANY*64.0
L3 evict bandwidth [MBytes/s] = 1.0E-06*L2_LINES_OUT_DEMAND_DIRTY*64.0/time
L3 evict data volume [GBytes] = 1.0E-09*L2_LINES_OUT_DEMAND_DIRTY*64.0
L3 bandwidth [MBytes/s] = 1.0E-06*(L2_LINES_IN_ANY+L2_LINES_OUT_DEMAND_DIRTY)*64/time
L3 data volume [GBytes] = 1.0E-09*(L2_LINES_IN_ANY+L2_LINES_OUT_DEMAND_DIRTY)*64
-
Profiling group to measure L3 cache bandwidth. The bandwidth is
computed by the number of cache line allocated in the L2 and the number of
modified cache lines evicted from the L2. Also reporto data volume transferred
between L3 and L2 caches. Note that this bandwidth also includes data transfers
due to a write allocate load on a store miss in L2.

View File

@@ -0,0 +1,52 @@
SHORT L3 cache miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
CBOX0C0 LLC_HITS_ALL
CBOX0C1 LLC_MISSES_ALL
CBOX1C0 LLC_HITS_ALL
CBOX1C1 LLC_MISSES_ALL
CBOX2C0 LLC_HITS_ALL
CBOX2C1 LLC_MISSES_ALL
CBOX3C0 LLC_HITS_ALL
CBOX3C1 LLC_MISSES_ALL
CBOX4C0 LLC_HITS_ALL
CBOX4C1 LLC_MISSES_ALL
CBOX5C0 LLC_HITS_ALL
CBOX5C1 LLC_MISSES_ALL
CBOX6C0 LLC_HITS_ALL
CBOX6C1 LLC_MISSES_ALL
CBOX7C0 LLC_HITS_ALL
CBOX7C1 LLC_MISSES_ALL
CBOX8C0 LLC_HITS_ALL
CBOX8C1 LLC_MISSES_ALL
CBOX9C0 LLC_HITS_ALL
CBOX9C1 LLC_MISSES_ALL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L3 request rate (CBOX0C0+CBOX0C1+CBOX1C0+CBOX1C1+CBOX2C0+CBOX2C1+CBOX3C0+CBOX3C1+CBOX4C0+CBOX4C1+CBOX5C0+CBOX5C1+CBOX6C0+CBOX6C1+CBOX7C0+CBOX7C1+CBOX8C0+CBOX8C1+CBOX9C0+CBOX9C1)/FIXC0
L3 miss rate (CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1)/FIXC0
L3 miss ratio (CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1)/(CBOX0C0+CBOX0C1+CBOX1C0+CBOX1C1+CBOX2C0+CBOX2C1+CBOX3C0+CBOX3C1+CBOX4C0+CBOX4C1+CBOX5C0+CBOX5C1+CBOX6C0+CBOX6C1+CBOX7C0+CBOX7C1+CBOX8C0+CBOX8C1+CBOX9C0+CBOX9C1)
LONG
Formulas:
L3 request rate = (SUM(LLC_HITS_ALL)+SUM(LLC_MISSES_ALL))/INSTR_RETIRED_ANY
L3 miss rate = SUM(LLC_MISSES_ALL)/INSTR_RETIRED_ANY
L3 miss ratio = SUM(LLC_MISSES_ALL)/(SUM(LLC_HITS_ALL)+SUM(LLC_MISSES_ALL))
-
This group measures the locality of your data accesses with regard to the
L3 cache. L3 request rate tells you how data intensive your code is
or how many data accesses you have on average per instruction.
The L3 miss rate gives a measure how often it was necessary to get
cache lines from memory. And finally L3 miss ratio tells you how many of your
memory references required a cache line to be loaded from a higher level.
While the data cache miss rate might be given by your algorithm you should
try to get data cache miss ratio as low as possible by increasing your cache reuse.

View File

@@ -0,0 +1,38 @@
SHORT Main memory bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
MBOX0C0 FVC_EV0_BBOX_CMDS_READS
MBOX0C1 DRAM_CMD_CAS_WR_OPN
MBOX0C2 DRAM_MISC_CAS_WR_CLS
MBOX1C0 FVC_EV0_BBOX_CMDS_READS
MBOX1C1 DRAM_CMD_CAS_WR_OPN
MBOX1C2 DRAM_MISC_CAS_WR_CLS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX0C2+MBOX1C1+MBOX1C2)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX0C2+MBOX1C1+MBOX1C2)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX0C1+MBOX1C1+MBOX0C2+MBOX1C2)*64/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX0C1+MBOX1C1+MBOX0C2+MBOX1C2)*64
LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0))*64.0/time
Memory read data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC1)+SUM(MBOXxC2))*64.0/time
Memory write data volume [GBytes] = 1.0E-09*(SUM(MBOXxC1)+SUM(MBOXxC2))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0)+SUM(MBOXxC1)+SUM(MBOXxC2))*64.0/time
Memory data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0)+SUM(MBOXxC1)+SUM(MBOXxC2))*64.0
-
Profiling group to measure memory bandwidth drawn by all cores of a socket.
Addional to the bandwidth it also outputs the data volume.

View File

@@ -0,0 +1,33 @@
SHORT Local and remote memory accesses
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 OFFCORE_RESPONSE_0_LOCAL_DRAM
PMC1 OFFCORE_RESPONSE_1_REMOTE_DRAM
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Local DRAM data volume [GByte] 1.E-09*PMC0*64
Local DRAM bandwidth [MByte/s] 1.E-06*(PMC0*64)/time
Remote DRAM data volume [GByte] 1.E-09*PMC1*64
Remote DRAM bandwidth [MByte/s] 1.E-06*(PMC1*64)/time
Memory data volume [GByte] 1.E-09*(PMC0+PMC1)*64
Memory bandwidth [MByte/s] 1.E-06*((PMC0+PMC1)*64)/time
LONG
Formulas:
CPI = CPU_CLK_UNHALTED_CORE/INSTR_RETIRED_ANY
Local DRAM data volume [GByte] = 1.E-09*OFFCORE_RESPONSE_0_LOCAL_DRAM*64
Local DRAM bandwidth [MByte/s] = 1.E-06*(OFFCORE_RESPONSE_0_LOCAL_DRAM*64)/time
Remote DRAM data volume [GByte] = 1.E-09*OFFCORE_RESPONSE_1_REMOTE_DRAM*64
Remote DRAM bandwidth [MByte/s] = 1.E-06*(OFFCORE_RESPONSE_1_REMOTE_DRAM*64)/time
Memory data volume [GByte] = 1.E-09*(OFFCORE_RESPONSE_0_LOCAL_DRAM+OFFCORE_RESPONSE_1_REMOTE_DRAM)*64
Memory bandwidth [MByte/s] = 1.E-06*((OFFCORE_RESPONSE_0_LOCAL_DRAM+OFFCORE_RESPONSE_1_REMOTE_DRAM)*64)/time
--
This performance group measures the data traffic of CPU cores to local and remote
memory.

View File

@@ -0,0 +1,35 @@
SHORT L2 data TLB miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 DTLB_LOAD_MISSES_ANY
PMC1 DTLB_MISSES_ANY
PMC2 DTLB_LOAD_MISSES_WALK_CYCLES
PMC3 DTLB_MISSES_WALK_CYCLES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1 DTLB load misses PMC0
L1 DTLB load miss rate PMC0/FIXC0
L1 DTLB load miss duration [Cyc] PMC2/PMC0
L1 DTLB store misses (PMC1-PMC0)
L1 DTLB store miss rate (PMC1-PMC0)/FIXC0
L1 DTLB store miss duration [Cyc] (PMC3-PMC2)/(PMC1-PMC0)
LONG
Formulas:
L1 DTLB load misses = DTLB_LOAD_MISSES_ANY
L1 DTLB load miss rate = DTLB_LOAD_MISSES_ANY / INSTR_RETIRED_ANY
L1 DTLB load miss duration [Cyc] = DTLB_LOAD_MISSES_WALK_CYCLES / DTLB_LOAD_MISSES_ANY
L1 DTLB store misses = DTLB_MISSES_ANY-DTLB_LOAD_MISSES_ANY
L1 DTLB store miss rate = (DTLB_MISSES_ANY-DTLB_LOAD_MISSES_ANY) / INSTR_RETIRED_ANY
L1 DTLB store miss duration [Cyc] = (DTLB_MISSES_WALK_CYCLES-DTLB_LOAD_MISSES_WALK_CYCLES) / (DTLB_MISSES_ANY-DTLB_LOAD_MISSES_ANY)
-
The DTLB miss rate gives a measure how often a TLB miss occurred
per instruction. The store miss calculations are done using ALL-LOADS TLB walks.

View File

@@ -0,0 +1,27 @@
SHORT L1 Instruction TLB miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 ITLB_MISSES_ANY
PMC1 ITLB_MISSES_WALK_CYCLES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1 ITLB misses PMC0
L1 ITLB miss rate PMC0/FIXC0
L1 ITLB miss duration [Cyc] PMC1/PMC0
LONG
Formulas:
L1 ITLB misses = ITLB_MISSES_ANY
L1 ITLB miss rate = ITLB_MISSES_ANY / INSTR_RETIRED_ANY
L1 ITLB miss duration [Cyc] = ITLB_MISSES_WALK_CYCLES / ITLB_MISSES_ANY
-
The ITLB miss rates gives a measure how often a TLB miss occurred
per instruction. The duration measures the time in cycles how long a walk did take.

View File

@@ -0,0 +1,32 @@
SHORT UOPs execution info
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_ISSUED_ANY
PMC2 UOPS_RETIRED_ANY
PMC3 UOPS_ISSUED_FUSED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Issued UOPs PMC0
Merged UOPs PMC3
Retired UOPs PMC2
LONG
Formulas:
Issued UOPs = UOPS_ISSUED_ANY
Merged UOPs = UOPS_ISSUED_FUSED
Retired UOPs = UOPS_RETIRED_ANY
-
This group returns information about the instruction pipeline. It measures the
issued, executed and retired uOPs and returns the number of uOPs which were issued
but not executed as well as the number of uOPs which were executed but never retired.
The executed but not retired uOPs commonly come from speculatively executed branches.