Add likwid collector

Thomas Roehl
2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions

@@ -0,0 +1,31 @@
SHORT Branch prediction miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 BR_INST_RETIRED_ALL_BRANCHES
PMC1 BR_MISP_RETIRED_ALL_BRANCHES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Branch rate PMC0/FIXC0
Branch misprediction rate PMC1/FIXC0
Branch misprediction ratio PMC1/PMC0
Instructions per branch FIXC0/PMC0
LONG
Formulas:
Branch rate = BR_INST_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction rate = BR_MISP_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction ratio = BR_MISP_RETIRED_ALL_BRANCHES/BR_INST_RETIRED_ALL_BRANCHES
Instructions per branch = INSTR_RETIRED_ANY/BR_INST_RETIRED_ALL_BRANCHES
-
The rates state how often on average a branch or a mispredicted branch occurred
per instruction retired in total. The branch misprediction ratio states directly
what fraction of all branch instructions were mispredicted.
Instructions per branch is 1/branch rate.
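The four formulas above can be sketched in a few lines of Python. This is only an illustration: the function name `branch_metrics` and the sample counts are hypothetical and not part of LIKWID.

```python
def branch_metrics(instr_retired, branches, mispredicted):
    """Derive the branch metrics of this group from raw counter values.

    instr_retired -> INSTR_RETIRED_ANY
    branches      -> BR_INST_RETIRED_ALL_BRANCHES
    mispredicted  -> BR_MISP_RETIRED_ALL_BRANCHES
    """
    return {
        "branch_rate": branches / instr_retired,
        "misprediction_rate": mispredicted / instr_retired,
        "misprediction_ratio": mispredicted / branches,
        "instructions_per_branch": instr_retired / branches,
    }


# Hypothetical counts, chosen only to make the relations easy to see.
metrics = branch_metrics(instr_retired=1000, branches=200, mispredicted=10)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Note that the misprediction rate is normalized by all retired instructions, while the misprediction ratio is normalized by branches only, so the ratio is always the larger of the two.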

@@ -0,0 +1,23 @@
SHORT Power and Energy consumption
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PWR0 PWR_PKG_ENERGY
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Energy [J] PWR0
Power [W] PWR0/time
LONG
Formulas:
Power = PWR_PKG_ENERGY / time
-
The Xeon Phi (Knights Landing) implements the new RAPL interface. This interface makes it
possible to monitor the consumed energy on the package (socket) level.
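A minimal Python sketch of the power and clock formulas used in this group follows. It assumes that `inverseClock` is the reciprocal of the nominal CPU clock in Hz; that reading, the function names, and the sample values are assumptions made for illustration only.

```python
def average_power_watts(energy_joules, runtime_s):
    """Power [W] = PWR_PKG_ENERGY / time."""
    return energy_joules / runtime_s


def clock_mhz(core_cycles, ref_cycles, nominal_hz):
    """Clock [MHz] = 1.E-06*(FIXC1/FIXC2)/inverseClock.

    Assumption: inverseClock = 1 / nominal clock frequency in Hz.
    """
    inverse_clock = 1.0 / nominal_hz
    return 1.0e-6 * (core_cycles / ref_cycles) / inverse_clock


# Hypothetical measurement: 150 J consumed over 10 s.
print(average_power_watts(150.0, 10.0))
# Hypothetical counters: core cycles equal to reference cycles at 1.3 GHz nominal.
print(clock_mhz(1_000_000, 1_000_000, 1.3e9))
```

The energy counter reports an accumulated value over the measurement interval, so dividing by the runtime yields an average power, not an instantaneous one.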

@@ -0,0 +1,22 @@
SHORT Load to store ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 MEM_UOPS_RETIRED_ALL_LOADS
PMC1 MEM_UOPS_RETIRED_ALL_STORES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Load to store ratio PMC0/PMC1
LONG
Formulas:
Load to store ratio = MEM_UOPS_RETIRED_ALL_LOADS/MEM_UOPS_RETIRED_ALL_STORES
-
This is a metric to determine your load to store ratio.

@@ -0,0 +1,24 @@
SHORT Divide unit information
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 CYCLES_DIV_BUSY_COUNT
PMC1 CYCLES_DIV_BUSY
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Number of divide ops PMC0
Avg. divide unit usage duration PMC1/PMC0
LONG
Formulas:
Number of divide ops = CYCLES_DIV_BUSY_COUNT
Avg. divide unit usage duration = CYCLES_DIV_BUSY/CYCLES_DIV_BUSY_COUNT
-
This performance group measures the average latency of divide operations.

@@ -0,0 +1,33 @@
SHORT Power and Energy consumption
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
TMP0 TEMP_CORE
PWR0 PWR_PKG_ENERGY
PWR1 PWR_PP0_ENERGY
PWR3 PWR_DRAM_ENERGY
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Temperature [C] TMP0
Energy [J] PWR0
Power [W] PWR0/time
Energy PP0 [J] PWR1
Power PP0 [W] PWR1/time
Energy DRAM [J] PWR3
Power DRAM [W] PWR3/time
LONG
Formulas:
Power = PWR_PKG_ENERGY / time
Power PP0 = PWR_PP0_ENERGY / time
Power DRAM = PWR_DRAM_ENERGY / time
-
Knights Landing implements the new RAPL interface. This interface makes it possible to
monitor the consumed energy on the package (socket), PP0 and DRAM level.

@@ -0,0 +1,34 @@
SHORT Double Precision MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_RETIRED_SCALAR_SIMD
PMC1 UOPS_RETIRED_PACKED_SIMD
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
DP [MFLOP/s] (SSE assumed) 1.0E-06*((PMC1*2.0)+PMC0)/time
DP [MFLOP/s] (AVX assumed) 1.0E-06*((PMC1*4.0)+PMC0)/time
DP [MFLOP/s] (AVX512 assumed) 1.0E-06*((PMC1*8.0)+PMC0)/time
Packed [MUOPS/s] 1.0E-06*(PMC1)/time
Scalar [MUOPS/s] 1.0E-06*PMC0/time
LONG
Formulas:
DP [MFLOP/s] (SSE assumed) = 1.0E-06*(UOPS_RETIRED_PACKED_SIMD*2+UOPS_RETIRED_SCALAR_SIMD)/runtime
DP [MFLOP/s] (AVX assumed) = 1.0E-06*(UOPS_RETIRED_PACKED_SIMD*4+UOPS_RETIRED_SCALAR_SIMD)/runtime
DP [MFLOP/s] (AVX512 assumed) = 1.0E-06*(UOPS_RETIRED_PACKED_SIMD*8+UOPS_RETIRED_SCALAR_SIMD)/runtime
Packed [MUOPS/s] = 1.0E-06*(UOPS_RETIRED_PACKED_SIMD)/runtime
Scalar [MUOPS/s] = 1.0E-06*UOPS_RETIRED_SCALAR_SIMD/runtime
-
AVX/SSE scalar and packed double precision FLOP rates. The Xeon Phi (Knights Landing) provides
no way to differentiate between double and single precision FLOP/s. Therefore, the printed
[MFLOP/s] values are only valid under the assumption of double-precision code. Moreover, there
is no way to distinguish between SSE, AVX or AVX512 packed SIMD operations. Therefore, this
group prints out the [MFLOP/s] for each of these SIMD widths.
WARNING: The events also count integer arithmetic operations.
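To see how much the assumed SIMD width changes the reported value, the three DP formulas can be sketched in Python. The helper name `dp_mflops`, the `DP_LANES` table, and the sample counts are hypothetical and not part of LIKWID.

```python
# Double-precision elements per packed operation for each assumed SIMD width,
# matching the factors 2, 4 and 8 in the formulas above.
DP_LANES = {"SSE": 2, "AVX": 4, "AVX512": 8}


def dp_mflops(packed_uops, scalar_uops, runtime_s, lanes):
    """1.0E-06*(UOPS_RETIRED_PACKED_SIMD*lanes+UOPS_RETIRED_SCALAR_SIMD)/runtime."""
    return 1.0e-6 * (packed_uops * lanes + scalar_uops) / runtime_s


# Hypothetical counts: 1e6 packed uops and 5e5 scalar uops in one second.
for name, lanes in DP_LANES.items():
    print(name, dp_mflops(1_000_000, 500_000, 1.0, lanes))
```

Only one of the printed values can be right for a given code; which one depends on the instruction set the compiler actually emitted.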

@@ -0,0 +1,34 @@
SHORT Single Precision MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_RETIRED_SCALAR_SIMD
PMC1 UOPS_RETIRED_PACKED_SIMD
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
SP [MFLOP/s] (SSE assumed) 1.0E-06*(PMC1*4.0+PMC0)/time
SP [MFLOP/s] (AVX assumed) 1.0E-06*(PMC1*8.0+PMC0)/time
SP [MFLOP/s] (AVX512 assumed) 1.0E-06*(PMC1*16.0+PMC0)/time
Packed [MUOPS/s] 1.0E-06*(PMC1)/time
Scalar [MUOPS/s] 1.0E-06*PMC0/time
LONG
Formulas:
SP [MFLOP/s] (SSE assumed) = 1.0E-06*(UOPS_RETIRED_PACKED_SIMD*4+UOPS_RETIRED_SCALAR_SIMD)/runtime
SP [MFLOP/s] (AVX assumed) = 1.0E-06*(UOPS_RETIRED_PACKED_SIMD*8+UOPS_RETIRED_SCALAR_SIMD)/runtime
SP [MFLOP/s] (AVX512 assumed) = 1.0E-06*(UOPS_RETIRED_PACKED_SIMD*16+UOPS_RETIRED_SCALAR_SIMD)/runtime
Packed [MUOPS/s] = 1.0E-06*(UOPS_RETIRED_PACKED_SIMD)/runtime
Scalar [MUOPS/s] = 1.0E-06*UOPS_RETIRED_SCALAR_SIMD/runtime
-
AVX/SSE scalar and packed single precision FLOP rates. The Xeon Phi (Knights Landing) provides
no way to differentiate between double and single precision FLOP/s. Therefore, the printed
MFLOP/s values are only valid under the assumption of single-precision code. Moreover, there
is no way to distinguish between SSE, AVX or AVX512 packed SIMD operations. Therefore, this
group prints out the MFLOP/s for each of these SIMD widths.
WARNING: The events also count integer arithmetic operations.

@@ -0,0 +1,25 @@
SHORT Frontend stalls
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 NO_ALLOC_CYCLES_ALL
PMC1 NO_ALLOC_CYCLES_ALL_COUNT
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Frontend stalls PMC1
Avg. frontend stall duration [cyc] PMC0/PMC1
Frontend stall ratio PMC0/FIXC1
LONG
Formulas:
Frontend stalls = NO_ALLOC_CYCLES_ALL_COUNT
Avg. frontend stall duration [cyc] = NO_ALLOC_CYCLES_ALL/NO_ALLOC_CYCLES_ALL_COUNT
Frontend stall ratio = NO_ALLOC_CYCLES_ALL/CPU_CLK_UNHALTED_CORE
-
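This group measures frontend stalls: the number of stalls, their average
duration in cycles and the fraction of unhalted core cycles spent in them.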

@@ -0,0 +1,46 @@
SHORT Memory bandwidth in MBytes/s for High Bandwidth Memory (HBM)
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
EDBOX0C0 EDC_RPQ_INSERTS
EDBOX1C0 EDC_RPQ_INSERTS
EDBOX2C0 EDC_RPQ_INSERTS
EDBOX3C0 EDC_RPQ_INSERTS
EDBOX4C0 EDC_RPQ_INSERTS
EDBOX5C0 EDC_RPQ_INSERTS
EDBOX6C0 EDC_RPQ_INSERTS
EDBOX7C0 EDC_RPQ_INSERTS
EDBOX0C1 EDC_WPQ_INSERTS
EDBOX1C1 EDC_WPQ_INSERTS
EDBOX2C1 EDC_WPQ_INSERTS
EDBOX3C1 EDC_WPQ_INSERTS
EDBOX4C1 EDC_WPQ_INSERTS
EDBOX5C1 EDC_WPQ_INSERTS
EDBOX6C1 EDC_WPQ_INSERTS
EDBOX7C1 EDC_WPQ_INSERTS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Memory read bandwidth [MBytes/s] 1.0E-06*(EDBOX0C0+EDBOX1C0+EDBOX2C0+EDBOX3C0+EDBOX4C0+EDBOX5C0+EDBOX6C0+EDBOX7C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(EDBOX0C0+EDBOX1C0+EDBOX2C0+EDBOX3C0+EDBOX4C0+EDBOX5C0+EDBOX6C0+EDBOX7C0)*64.0
Memory writeback bandwidth [MBytes/s] 1.0E-06*(EDBOX0C1+EDBOX1C1+EDBOX2C1+EDBOX3C1+EDBOX4C1+EDBOX5C1+EDBOX6C1+EDBOX7C1)*64.0/time
Memory writeback data volume [GBytes] 1.0E-09*(EDBOX0C1+EDBOX1C1+EDBOX2C1+EDBOX3C1+EDBOX4C1+EDBOX5C1+EDBOX6C1+EDBOX7C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(EDBOX0C0+EDBOX1C0+EDBOX2C0+EDBOX3C0+EDBOX4C0+EDBOX5C0+EDBOX6C0+EDBOX7C0+EDBOX0C1+EDBOX1C1+EDBOX2C1+EDBOX3C1+EDBOX4C1+EDBOX5C1+EDBOX6C1+EDBOX7C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(EDBOX0C0+EDBOX1C0+EDBOX2C0+EDBOX3C0+EDBOX4C0+EDBOX5C0+EDBOX6C0+EDBOX7C0+EDBOX0C1+EDBOX1C1+EDBOX2C1+EDBOX3C1+EDBOX4C1+EDBOX5C1+EDBOX6C1+EDBOX7C1)*64.0
LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*(sum(EDC_RPQ_INSERTS))*64/time
Memory read data volume [GBytes] = 1.0E-09*(sum(EDC_RPQ_INSERTS))*64
Memory writeback bandwidth [MBytes/s] = 1.0E-06*(sum(EDC_WPQ_INSERTS))*64/time
Memory writeback data volume [GBytes] = 1.0E-09*(sum(EDC_WPQ_INSERTS))*64
Memory bandwidth [MBytes/s] = 1.0E-06*(sum(EDC_RPQ_INSERTS)+sum(EDC_WPQ_INSERTS))*64/time
Memory data volume [GBytes] = 1.0E-09*(sum(EDC_RPQ_INSERTS)+sum(EDC_WPQ_INSERTS))*64
-
Profiling group to measure data transfers from and to the high bandwidth memory (HBM).
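The formulas above sum the per-EDC counters and scale by the 64-byte cache line size. A minimal Python sketch, with a hypothetical function name and made-up counter values:

```python
CACHE_LINE_BYTES = 64  # the factor 64.0 in the formulas above


def hbm_bandwidth_mbytes(rpq_inserts, wpq_inserts, runtime_s):
    """Return (read, writeback, total) bandwidth in MBytes/s.

    rpq_inserts / wpq_inserts hold one EDC_RPQ_INSERTS / EDC_WPQ_INSERTS
    count per EDC box (eight boxes in this group).
    """
    reads = sum(rpq_inserts)
    writes = sum(wpq_inserts)
    read_bw = 1.0e-6 * reads * CACHE_LINE_BYTES / runtime_s
    write_bw = 1.0e-6 * writes * CACHE_LINE_BYTES / runtime_s
    total_bw = 1.0e-6 * (reads + writes) * CACHE_LINE_BYTES / runtime_s
    return read_bw, write_bw, total_bw


# Hypothetical counts for eight EDC boxes over a 2 s measurement.
print(hbm_bandwidth_mbytes([1_000_000] * 8, [500_000] * 8, 2.0))
```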

@@ -0,0 +1,87 @@
SHORT Memory bandwidth in MBytes/s for High Bandwidth Memory (HBM)
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
EDBOX0C0 EDC_RPQ_INSERTS
EDBOX1C0 EDC_RPQ_INSERTS
EDBOX2C0 EDC_RPQ_INSERTS
EDBOX3C0 EDC_RPQ_INSERTS
EDBOX4C0 EDC_RPQ_INSERTS
EDBOX5C0 EDC_RPQ_INSERTS
EDBOX6C0 EDC_RPQ_INSERTS
EDBOX7C0 EDC_RPQ_INSERTS
EDBOX0C1 EDC_WPQ_INSERTS
EDBOX1C1 EDC_WPQ_INSERTS
EDBOX2C1 EDC_WPQ_INSERTS
EDBOX3C1 EDC_WPQ_INSERTS
EDBOX4C1 EDC_WPQ_INSERTS
EDBOX5C1 EDC_WPQ_INSERTS
EDBOX6C1 EDC_WPQ_INSERTS
EDBOX7C1 EDC_WPQ_INSERTS
EUBOX0C0 EDC_MISS_CLEAN
EUBOX1C0 EDC_MISS_CLEAN
EUBOX2C0 EDC_MISS_CLEAN
EUBOX3C0 EDC_MISS_CLEAN
EUBOX4C0 EDC_MISS_CLEAN
EUBOX5C0 EDC_MISS_CLEAN
EUBOX6C0 EDC_MISS_CLEAN
EUBOX7C0 EDC_MISS_CLEAN
EUBOX0C1 EDC_MISS_DIRTY
EUBOX1C1 EDC_MISS_DIRTY
EUBOX2C1 EDC_MISS_DIRTY
EUBOX3C1 EDC_MISS_DIRTY
EUBOX4C1 EDC_MISS_DIRTY
EUBOX5C1 EDC_MISS_DIRTY
EUBOX6C1 EDC_MISS_DIRTY
EUBOX7C1 EDC_MISS_DIRTY
MBOX0C0 MC_CAS_READS
MBOX0C1 MC_CAS_WRITES
MBOX1C0 MC_CAS_READS
MBOX1C1 MC_CAS_WRITES
MBOX2C0 MC_CAS_READS
MBOX2C1 MC_CAS_WRITES
MBOX4C0 MC_CAS_READS
MBOX4C1 MC_CAS_WRITES
MBOX5C0 MC_CAS_READS
MBOX5C1 MC_CAS_WRITES
MBOX6C0 MC_CAS_READS
MBOX6C1 MC_CAS_WRITES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
MCDRAM Memory read bandwidth [MBytes/s] 1.0E-06*((EDBOX0C0+EDBOX1C0+EDBOX2C0+EDBOX3C0+EDBOX4C0+EDBOX5C0+EDBOX6C0+EDBOX7C0)-(EUBOX0C0+EUBOX1C0+EUBOX2C0+EUBOX3C0+EUBOX4C0+EUBOX5C0+EUBOX6C0+EUBOX7C0)-(EUBOX0C1+EUBOX1C1+EUBOX2C1+EUBOX3C1+EUBOX4C1+EUBOX5C1+EUBOX6C1+EUBOX7C1))*64.0/time
MCDRAM Memory read data volume [GBytes] 1.0E-09*((EDBOX0C0+EDBOX1C0+EDBOX2C0+EDBOX3C0+EDBOX4C0+EDBOX5C0+EDBOX6C0+EDBOX7C0)-(EUBOX0C0+EUBOX1C0+EUBOX2C0+EUBOX3C0+EUBOX4C0+EUBOX5C0+EUBOX6C0+EUBOX7C0)-(EUBOX0C1+EUBOX1C1+EUBOX2C1+EUBOX3C1+EUBOX4C1+EUBOX5C1+EUBOX6C1+EUBOX7C1))*64.0
MCDRAM Memory writeback bandwidth [MBytes/s] 1.0E-06*((EDBOX0C1+EDBOX1C1+EDBOX2C1+EDBOX3C1+EDBOX4C1+EDBOX5C1+EDBOX6C1+EDBOX7C1)-(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0))*64.0/time
MCDRAM Memory writeback data volume [GBytes] 1.0E-09*((EDBOX0C1+EDBOX1C1+EDBOX2C1+EDBOX3C1+EDBOX4C1+EDBOX5C1+EDBOX6C1+EDBOX7C1)-(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0))*64.0
MCDRAM Memory bandwidth [MBytes/s] 1.0E-06*(((EDBOX0C0+EDBOX1C0+EDBOX2C0+EDBOX3C0+EDBOX4C0+EDBOX5C0+EDBOX6C0+EDBOX7C0)-(EUBOX0C0+EUBOX1C0+EUBOX2C0+EUBOX3C0+EUBOX4C0+EUBOX5C0+EUBOX6C0+EUBOX7C0)-(EUBOX0C1+EUBOX1C1+EUBOX2C1+EUBOX3C1+EUBOX4C1+EUBOX5C1+EUBOX6C1+EUBOX7C1))+((EDBOX0C1+EDBOX1C1+EDBOX2C1+EDBOX3C1+EDBOX4C1+EDBOX5C1+EDBOX6C1+EDBOX7C1)-(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0)))*64.0/time
MCDRAM Memory data volume [GBytes] 1.0E-09*(((EDBOX0C0+EDBOX1C0+EDBOX2C0+EDBOX3C0+EDBOX4C0+EDBOX5C0+EDBOX6C0+EDBOX7C0)-(EUBOX0C0+EUBOX1C0+EUBOX2C0+EUBOX3C0+EUBOX4C0+EUBOX5C0+EUBOX6C0+EUBOX7C0)-(EUBOX0C1+EUBOX1C1+EUBOX2C1+EUBOX3C1+EUBOX4C1+EUBOX5C1+EUBOX6C1+EUBOX7C1))+((EDBOX0C1+EDBOX1C1+EDBOX2C1+EDBOX3C1+EDBOX4C1+EDBOX5C1+EDBOX6C1+EDBOX7C1)-(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0)))*64.0
DDR Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0)*64.0/time
DDR Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0)*64.0
DDR Memory writeback bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX4C1+MBOX5C1+MBOX6C1)*64.0/time
DDR Memory writeback data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX4C1+MBOX5C1+MBOX6C1)*64.0
DDR Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX4C1+MBOX5C1+MBOX6C1)*64.0/time
DDR Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX4C1+MBOX5C1+MBOX6C1)*64.0
LONG
Formulas:
MCDRAM Memory read bandwidth [MBytes/s] = 1.0E-06*(sum(EDC_RPQ_INSERTS)-sum(EDC_MISS_CLEAN)-sum(EDC_MISS_DIRTY))*64/time
MCDRAM Memory read data volume [GBytes] = 1.0E-09*(sum(EDC_RPQ_INSERTS)-sum(EDC_MISS_CLEAN)-sum(EDC_MISS_DIRTY))*64
MCDRAM Memory writeback bandwidth [MBytes/s] = 1.0E-06*(sum(EDC_WPQ_INSERTS)-sum(MC_CAS_READS))*64/time
MCDRAM Memory writeback data volume [GBytes] = 1.0E-09*(sum(EDC_WPQ_INSERTS)-sum(MC_CAS_READS))*64
MCDRAM Memory bandwidth [MBytes/s] = 1.0E-06*((sum(EDC_RPQ_INSERTS)-sum(EDC_MISS_CLEAN)-sum(EDC_MISS_DIRTY))+(sum(EDC_WPQ_INSERTS)-sum(MC_CAS_READS)))*64/time
MCDRAM Memory data volume [GBytes] = 1.0E-09*((sum(EDC_RPQ_INSERTS)-sum(EDC_MISS_CLEAN)-sum(EDC_MISS_DIRTY))+(sum(EDC_WPQ_INSERTS)-sum(MC_CAS_READS)))*64
DDR Memory read bandwidth [MBytes/s] = 1.0E-06*(sum(MC_CAS_READS))*64/time
DDR Memory read data volume [GBytes] = 1.0E-09*(sum(MC_CAS_READS))*64
DDR Memory writeback bandwidth [MBytes/s] = 1.0E-06*(sum(MC_CAS_WRITES))*64/time
DDR Memory writeback data volume [GBytes] = 1.0E-09*(sum(MC_CAS_WRITES))*64
DDR Memory bandwidth [MBytes/s] = 1.0E-06*(sum(MC_CAS_READS)+sum(MC_CAS_WRITES))*64/time
DDR Memory data volume [GBytes] = 1.0E-09*(sum(MC_CAS_READS)+sum(MC_CAS_WRITES))*64
-
Profiling group to measure data transfers from and to the high bandwidth memory (HBM, MCDRAM)
and the DDR memory.
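The METRICS expressions of this group do not use the raw EDC queue inserts directly: they subtract the EDC miss counts from the reads and the DDR CAS reads from the writes. The following Python sketch mirrors that arithmetic for the data volumes; the function name and the sample counts are hypothetical.

```python
CACHE_LINE_BYTES = 64


def mcdram_volumes_gbytes(rpq, wpq, miss_clean, miss_dirty, ddr_reads):
    """Return (read, writeback) MCDRAM data volume in GBytes.

    rpq, wpq               -> per-box EDC_RPQ_INSERTS / EDC_WPQ_INSERTS
    miss_clean, miss_dirty -> per-box EDC_MISS_CLEAN / EDC_MISS_DIRTY
    ddr_reads              -> per-channel MC_CAS_READS
    """
    read_lines = sum(rpq) - sum(miss_clean) - sum(miss_dirty)
    write_lines = sum(wpq) - sum(ddr_reads)
    read_gb = 1.0e-9 * read_lines * CACHE_LINE_BYTES
    write_gb = 1.0e-9 * write_lines * CACHE_LINE_BYTES
    return read_gb, write_gb


# Hypothetical counts: eight EDC boxes and six DDR channels.
print(mcdram_volumes_gbytes(
    rpq=[1000] * 8,
    wpq=[500] * 8,
    miss_clean=[100] * 8,
    miss_dirty=[50] * 8,
    ddr_reads=[200] * 6,
))
```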

@@ -0,0 +1,32 @@
SHORT Memory bandwidth in MBytes/s for High Bandwidth Memory (HBM)
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0:MATCH0=0x4908:MATCH1=0x3F8060 OFFCORE_RESPONSE_0_OPTIONS
PMC1:MATCH0=0x32F7:MATCH1=0x3F8060 OFFCORE_RESPONSE_1_OPTIONS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Memory read bandwidth [MBytes/s] 1.0E-06*(PMC1)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(PMC1)*64.0
Memory writeback bandwidth [MBytes/s] 1.0E-06*(PMC0)*64.0/time
Memory writeback data volume [GBytes] 1.0E-09*(PMC0)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*(sum(OFFCORE_RESPONSE_1_OPTIONS:MATCH0=0x32F7:MATCH1=0x3F8060))*64/time
Memory read data volume [GBytes] = 1.0E-09*(sum(OFFCORE_RESPONSE_1_OPTIONS:MATCH0=0x32F7:MATCH1=0x3F8060))*64
Memory writeback bandwidth [MBytes/s] = 1.0E-06*(sum(OFFCORE_RESPONSE_0_OPTIONS:MATCH0=0x4908:MATCH1=0x3F8060))*64/time
Memory writeback data volume [GBytes] = 1.0E-09*(sum(OFFCORE_RESPONSE_0_OPTIONS:MATCH0=0x4908:MATCH1=0x3F8060))*64
Memory bandwidth [MBytes/s] = 1.0E-06*(sum(OFFCORE_RESPONSE_1_OPTIONS:MATCH0=0x32F7:MATCH1=0x3F8060)+sum(OFFCORE_RESPONSE_0_OPTIONS:MATCH0=0x4908:MATCH1=0x3F8060))*64/time
Memory data volume [GBytes] = 1.0E-09*(sum(OFFCORE_RESPONSE_1_OPTIONS:MATCH0=0x32F7:MATCH1=0x3F8060)+sum(OFFCORE_RESPONSE_0_OPTIONS:MATCH0=0x4908:MATCH1=0x3F8060))*64
-
Profiling group to measure data transfers from and to the high bandwidth memory (HBM).
If possible, use the HBM or HBM_CACHE group because they provide more accurate counts.

@@ -0,0 +1,25 @@
SHORT Instruction cache miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 ICACHE_ACCESSES
PMC1 ICACHE_MISSES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1I request rate PMC0/FIXC0
L1I miss rate PMC1/FIXC0
L1I miss ratio PMC1/PMC0
LONG
Formulas:
L1I request rate = ICACHE_ACCESSES / INSTR_RETIRED_ANY
L1I miss rate = ICACHE_MISSES / INSTR_RETIRED_ANY
L1I miss ratio = ICACHE_MISSES / ICACHE_ACCESSES
-
This group measures some L1 instruction cache metrics.

@@ -0,0 +1,36 @@
SHORT L2 cache bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L2_REQUESTS_REFERENCE
PMC1:MATCH0=0x0002:MATCH1=0x1 OFFCORE_RESPONSE_0_OPTIONS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L2 non-RFO bandwidth [MBytes/s] 1.E-06*(PMC0)*64.0/time
L2 non-RFO data volume [GByte] 1.E-09*PMC0*64.0
L2 RFO bandwidth [MBytes/s] 1.E-06*(PMC1)*64.0/time
L2 RFO data volume [GByte] 1.E-09*(PMC1)*64.0
L2 bandwidth [MBytes/s] 1.E-06*(PMC0+PMC1)*64.0/time
L2 data volume [GByte] 1.E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
L2 non-RFO bandwidth [MBytes/s] = 1.E-06*L2_REQUESTS_REFERENCE*64.0/time
L2 non-RFO data volume [GByte] = 1.E-09*L2_REQUESTS_REFERENCE*64.0
L2 RFO bandwidth [MBytes/s] = 1.E-06*(OFFCORE_RESPONSE_0_OPTIONS:MATCH0=0x0002:MATCH1=0x1)*64.0/time
L2 RFO data volume [GByte] = 1.E-09*(OFFCORE_RESPONSE_0_OPTIONS:MATCH0=0x0002:MATCH1=0x1)*64.0
L2 bandwidth [MBytes/s] = 1.E-06*(L2_REQUESTS_REFERENCE+OFFCORE_RESPONSE_0_OPTIONS:MATCH0=0x0002:MATCH1=0x1)*64.0/time
L2 data volume [GByte] = 1.E-09*(L2_REQUESTS_REFERENCE+OFFCORE_RESPONSE_0_OPTIONS:MATCH0=0x0002:MATCH1=0x1)*64.0
--
The non-RFO bandwidth and data volume do not contain RFOs (also called
write-allocates). The RFO bandwidth and data volume are only accurate when all
used data fits in the L2 cache. As soon as the data exceeds the L2 cache size,
the RFO metrics are too high.
Moreover, as the number of measured cores increases, the non-RFO metrics overcount
but commonly stay within 10% error.

@@ -0,0 +1,34 @@
SHORT L2 cache miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 MEM_UOPS_RETIRED_L2_HIT_LOADS
PMC1 MEM_UOPS_RETIRED_L2_MISS_LOADS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L2 request rate (PMC0+PMC1)/FIXC0
L2 miss rate PMC1/FIXC0
L2 miss ratio PMC1/(PMC0+PMC1)
LONG
Formulas:
L2 request rate = (MEM_UOPS_RETIRED_L2_HIT_LOADS+MEM_UOPS_RETIRED_L2_MISS_LOADS)/INSTR_RETIRED_ANY
L2 miss rate = MEM_UOPS_RETIRED_L2_MISS_LOADS/INSTR_RETIRED_ANY
L2 miss ratio = MEM_UOPS_RETIRED_L2_MISS_LOADS/(MEM_UOPS_RETIRED_L2_HIT_LOADS+MEM_UOPS_RETIRED_L2_MISS_LOADS)
-
This group measures the locality of your data accesses with regard to the
L2 cache. The L2 request rate tells you how data intensive your code is,
i.e. how many data accesses you have on average per instruction.
The L2 miss rate measures how often it was necessary to get
cache lines from memory. Finally, the L2 miss ratio tells you how many of your
memory references required a cache line to be loaded from a higher level.
While the data cache miss rate might be given by your algorithm, you should
try to get the data cache miss ratio as low as possible by increasing your cache
reuse.

@@ -0,0 +1,47 @@
SHORT Memory bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
MBOX0C0 MC_CAS_READS
MBOX0C1 MC_CAS_WRITES
MBOX1C0 MC_CAS_READS
MBOX1C1 MC_CAS_WRITES
MBOX2C0 MC_CAS_READS
MBOX2C1 MC_CAS_WRITES
MBOX4C0 MC_CAS_READS
MBOX4C1 MC_CAS_WRITES
MBOX5C0 MC_CAS_READS
MBOX5C1 MC_CAS_WRITES
MBOX6C0 MC_CAS_READS
MBOX6C1 MC_CAS_WRITES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0)*64.0
Memory writeback bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX4C1+MBOX5C1+MBOX6C1)*64.0/time
Memory writeback data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX4C1+MBOX5C1+MBOX6C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX4C1+MBOX5C1+MBOX6C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX4C1+MBOX5C1+MBOX6C1)*64.0
LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*(sum(MC_CAS_READS))*64/time
Memory read data volume [GBytes] = 1.0E-09*(sum(MC_CAS_READS))*64
Memory writeback bandwidth [MBytes/s] = 1.0E-06*(sum(MC_CAS_WRITES))*64/time
Memory writeback data volume [GBytes] = 1.0E-09*(sum(MC_CAS_WRITES))*64
Memory bandwidth [MBytes/s] = 1.0E-06*(sum(MC_CAS_READS)+sum(MC_CAS_WRITES))*64/time
Memory data volume [GBytes] = 1.0E-09*(sum(MC_CAS_READS)+sum(MC_CAS_WRITES))*64
-
Profiling group to measure L2 to MEM load cache bandwidth. The bandwidth is computed from the
number of cache lines allocated in the L2 cache. Since there is no possibility to retrieve
the evicted cache lines, this group measures only the load cache bandwidth. The
writeback metrics count only modified cache lines that are written back to go to
exclusive state.
The group also outputs the total load and writeback data volume transferred between memory and L2.

@@ -0,0 +1,27 @@
SHORT L2 data TLB miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 PAGE_WALKS_DTLB_COUNT
PMC1 PAGE_WALKS_DTLB_CYCLES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1 DTLB misses PMC0
L1 DTLB miss rate PMC0/FIXC0
L1 DTLB miss duration [Cyc] PMC1/PMC0
LONG
Formulas:
L1 DTLB misses = PAGE_WALKS_DTLB_COUNT
L1 DTLB miss rate = PAGE_WALKS_DTLB_COUNT / INSTR_RETIRED_ANY
L1 DTLB miss duration [Cyc] = PAGE_WALKS_DTLB_CYCLES / PAGE_WALKS_DTLB_COUNT
-
The DTLB miss rate measures how often a TLB miss occurred per instruction.
The duration is the average number of cycles a page walk took.

@@ -0,0 +1,27 @@
SHORT L1 Instruction TLB miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 PAGE_WALKS_ITLB_COUNT
PMC1 PAGE_WALKS_ITLB_CYCLES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1 ITLB misses PMC0
L1 ITLB miss rate PMC0/FIXC0
L1 ITLB miss duration [Cyc] PMC1/PMC0
LONG
Formulas:
L1 ITLB misses = PAGE_WALKS_ITLB_COUNT
L1 ITLB miss rate = PAGE_WALKS_ITLB_COUNT / INSTR_RETIRED_ANY
L1 ITLB miss duration [Cyc] = PAGE_WALKS_ITLB_CYCLES / PAGE_WALKS_ITLB_COUNT
-
The ITLB miss rate measures how often a TLB miss occurred per instruction.
The duration is the average number of cycles a page walk took.

@@ -0,0 +1,25 @@
SHORT UOP retirement stalls
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_RETIRED_STALLED_CYCLES
PMC1 UOPS_RETIRED_STALLS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Number of stalls PMC1
Avg. stall duration [cyc] PMC0/PMC1
Stall ratio PMC0/FIXC1
LONG
Formulas:
Number of stalls = UOPS_RETIRED_STALLS
Avg. stall duration [cyc] = UOPS_RETIRED_STALLED_CYCLES/UOPS_RETIRED_STALLS
Stall ratio = UOPS_RETIRED_STALLED_CYCLES/CPU_CLK_UNHALTED_CORE
-
This group measures stalls in the UOP retirement.