Add likwid collector

Thomas Roehl
2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions


@@ -0,0 +1,31 @@
SHORT Branch prediction miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 BR_INST_RETIRED_ALL_BRANCHES
PMC1 BR_MISP_RETIRED_ALL_BRANCHES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Branch rate PMC0/FIXC0
Branch misprediction rate PMC1/FIXC0
Branch misprediction ratio PMC1/PMC0
Instructions per branch FIXC0/PMC0
LONG
Formulas:
Branch rate = BR_INST_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction rate = BR_MISP_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction ratio = BR_MISP_RETIRED_ALL_BRANCHES/BR_INST_RETIRED_ALL_BRANCHES
Instructions per branch = INSTR_RETIRED_ANY/BR_INST_RETIRED_ALL_BRANCHES
-
The rates state how often, on average, a branch or a mispredicted branch occurred
per retired instruction. The branch misprediction ratio directly expresses the
fraction of all branch instructions that were mispredicted.
Instructions per branch is 1/branch rate.
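The METRICS column pairs each derived metric with an arithmetic expression over the raw counters. As a hedged illustration of how these formulas are evaluated, here is a minimal Python sketch; the counter values are invented for the example:

# Hypothetical raw counter values for one measurement interval;
# the names mirror the events in the EVENTSET above.
counts = {
    "INSTR_RETIRED_ANY": 1_000_000_000,           # FIXC0
    "BR_INST_RETIRED_ALL_BRANCHES": 150_000_000,  # PMC0
    "BR_MISP_RETIRED_ALL_BRANCHES": 3_000_000,    # PMC1
}

branch_rate = counts["BR_INST_RETIRED_ALL_BRANCHES"] / counts["INSTR_RETIRED_ANY"]
misp_rate = counts["BR_MISP_RETIRED_ALL_BRANCHES"] / counts["INSTR_RETIRED_ANY"]
misp_ratio = counts["BR_MISP_RETIRED_ALL_BRANCHES"] / counts["BR_INST_RETIRED_ALL_BRANCHES"]
instr_per_branch = counts["INSTR_RETIRED_ANY"] / counts["BR_INST_RETIRED_ALL_BRANCHES"]

print(f"Branch rate:                {branch_rate:.4f}")
print(f"Branch misprediction rate:  {misp_rate:.6f}")
print(f"Branch misprediction ratio: {misp_ratio:.4f}")
print(f"Instructions per branch:    {instr_per_branch:.1f}")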


@@ -0,0 +1,121 @@
SHORT Cache bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L1D_REPLACEMENT
PMC1 L1D_M_EVICT
PMC2 L2_LINES_IN_ALL
PMC3 L2_LINES_OUT_DIRTY_ALL
CBOX0C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX1C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX2C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX3C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX4C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX5C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX6C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX7C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX8C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX9C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX10C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX11C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX12C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX13C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX14C0:STATE=0x3F LLC_LOOKUP_DATA_READ
CBOX0C1 LLC_VICTIMS_M_STATE
CBOX1C1 LLC_VICTIMS_M_STATE
CBOX2C1 LLC_VICTIMS_M_STATE
CBOX3C1 LLC_VICTIMS_M_STATE
CBOX4C1 LLC_VICTIMS_M_STATE
CBOX5C1 LLC_VICTIMS_M_STATE
CBOX6C1 LLC_VICTIMS_M_STATE
CBOX7C1 LLC_VICTIMS_M_STATE
CBOX8C1 LLC_VICTIMS_M_STATE
CBOX9C1 LLC_VICTIMS_M_STATE
CBOX10C1 LLC_VICTIMS_M_STATE
CBOX11C1 LLC_VICTIMS_M_STATE
CBOX12C1 LLC_VICTIMS_M_STATE
CBOX13C1 LLC_VICTIMS_M_STATE
CBOX14C1 LLC_VICTIMS_M_STATE
MBOX0C0 CAS_COUNT_RD
MBOX0C1 CAS_COUNT_WR
MBOX1C0 CAS_COUNT_RD
MBOX1C1 CAS_COUNT_WR
MBOX2C0 CAS_COUNT_RD
MBOX2C1 CAS_COUNT_WR
MBOX3C0 CAS_COUNT_RD
MBOX3C1 CAS_COUNT_WR
MBOX4C0 CAS_COUNT_RD
MBOX4C1 CAS_COUNT_WR
MBOX5C0 CAS_COUNT_RD
MBOX5C1 CAS_COUNT_WR
MBOX6C0 CAS_COUNT_RD
MBOX6C1 CAS_COUNT_WR
MBOX7C0 CAS_COUNT_RD
MBOX7C1 CAS_COUNT_WR
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L2 to L1 load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2 to L1 load data volume [GBytes] 1.0E-09*PMC0*64.0
L1 to L2 evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L1 to L2 evict data volume [GBytes] 1.0E-09*PMC1*64.0
L1 to/from L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
L1 to/from L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
L3 to L2 load bandwidth [MBytes/s] 1.0E-06*PMC2*64.0/time
L3 to L2 load data volume [GBytes] 1.0E-09*PMC2*64.0
L2 to L3 evict bandwidth [MBytes/s] 1.0E-06*PMC3*64.0/time
L2 to L3 evict data volume [GBytes] 1.0E-09*PMC3*64.0
L2 to/from L3 bandwidth [MBytes/s] 1.0E-06*(PMC2+PMC3)*64.0/time
L2 to/from L3 data volume [GBytes] 1.0E-09*(PMC2+PMC3)*64.0
System to L3 bandwidth [MBytes/s] 1.0E-06*(CBOX0C0:STATE=0x3F+CBOX1C0:STATE=0x3F+CBOX2C0:STATE=0x3F+CBOX3C0:STATE=0x3F+CBOX4C0:STATE=0x3F+CBOX5C0:STATE=0x3F+CBOX6C0:STATE=0x3F+CBOX7C0:STATE=0x3F+CBOX8C0:STATE=0x3F+CBOX9C0:STATE=0x3F+CBOX10C0:STATE=0x3F+CBOX11C0:STATE=0x3F+CBOX12C0:STATE=0x3F+CBOX13C0:STATE=0x3F+CBOX14C0:STATE=0x3F)*64.0/time
System to L3 data volume [GBytes] 1.0E-09*(CBOX0C0:STATE=0x3F+CBOX1C0:STATE=0x3F+CBOX2C0:STATE=0x3F+CBOX3C0:STATE=0x3F+CBOX4C0:STATE=0x3F+CBOX5C0:STATE=0x3F+CBOX6C0:STATE=0x3F+CBOX7C0:STATE=0x3F+CBOX8C0:STATE=0x3F+CBOX9C0:STATE=0x3F+CBOX10C0:STATE=0x3F+CBOX11C0:STATE=0x3F+CBOX12C0:STATE=0x3F+CBOX13C0:STATE=0x3F+CBOX14C0:STATE=0x3F)*64.0
L3 to memory bandwidth [MBytes/s] 1.0E-06*(CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1+CBOX10C1+CBOX11C1+CBOX12C1+CBOX13C1+CBOX14C1)*64.0/time
L3 to memory data volume [GBytes] 1.0E-09*(CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1+CBOX10C1+CBOX11C1+CBOX12C1+CBOX13C1+CBOX14C1)*64.0
L3 to/from system bandwidth [MBytes/s] 1.0E-06*(CBOX0C0:STATE=0x3F+CBOX1C0:STATE=0x3F+CBOX2C0:STATE=0x3F+CBOX3C0:STATE=0x3F+CBOX4C0:STATE=0x3F+CBOX5C0:STATE=0x3F+CBOX6C0:STATE=0x3F+CBOX7C0:STATE=0x3F+CBOX8C0:STATE=0x3F+CBOX9C0:STATE=0x3F+CBOX10C0:STATE=0x3F+CBOX11C0:STATE=0x3F+CBOX12C0:STATE=0x3F+CBOX13C0:STATE=0x3F+CBOX14C0:STATE=0x3F+CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1+CBOX10C1+CBOX11C1+CBOX12C1+CBOX13C1+CBOX14C1)*64.0/time
L3 to/from system data volume [GBytes] 1.0E-09*(CBOX0C0:STATE=0x3F+CBOX1C0:STATE=0x3F+CBOX2C0:STATE=0x3F+CBOX3C0:STATE=0x3F+CBOX4C0:STATE=0x3F+CBOX5C0:STATE=0x3F+CBOX6C0:STATE=0x3F+CBOX7C0:STATE=0x3F+CBOX8C0:STATE=0x3F+CBOX9C0:STATE=0x3F+CBOX10C0:STATE=0x3F+CBOX11C0:STATE=0x3F+CBOX12C0:STATE=0x3F+CBOX13C0:STATE=0x3F+CBOX14C0:STATE=0x3F+CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1+CBOX10C1+CBOX11C1+CBOX12C1+CBOX13C1+CBOX14C1)*64.0
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
LONG
Formulas:
L2 to L1 load bandwidth [MBytes/s] = 1.0E-06*L1D_REPLACEMENT*64/time
L2 to L1 load data volume [GBytes] = 1.0E-09*L1D_REPLACEMENT*64
L1 to L2 evict bandwidth [MBytes/s] = 1.0E-06*L1D_M_EVICT*64/time
L1 to L2 evict data volume [GBytes] = 1.0E-09*L1D_M_EVICT*64
L1 to/from L2 bandwidth [MBytes/s] = 1.0E-06*(L1D_REPLACEMENT+L1D_M_EVICT)*64/time
L1 to/from L2 data volume [GBytes] = 1.0E-09*(L1D_REPLACEMENT+L1D_M_EVICT)*64
L3 to L2 load bandwidth [MBytes/s] = 1.0E-06*L2_LINES_IN_ALL*64/time
L3 to L2 load data volume [GBytes] = 1.0E-09*L2_LINES_IN_ALL*64
L2 to L3 evict bandwidth [MBytes/s] = 1.0E-06*L2_LINES_OUT_DIRTY_ALL*64/time
L2 to L3 evict data volume [GBytes] = 1.0E-09*L2_LINES_OUT_DIRTY_ALL*64
L2 to/from L3 bandwidth [MBytes/s] = 1.0E-06*(L2_LINES_IN_ALL+L2_LINES_OUT_DIRTY_ALL)*64/time
L2 to/from L3 data volume [GBytes] = 1.0E-09*(L2_LINES_IN_ALL+L2_LINES_OUT_DIRTY_ALL)*64
System to L3 bandwidth [MBytes/s] = 1.0E-06*(SUM(LLC_LOOKUP_DATA_READ:STATE=0x3F))*64/time
System to L3 data volume [GBytes] = 1.0E-09*(SUM(LLC_LOOKUP_DATA_READ:STATE=0x3F))*64
L3 to system bandwidth [MBytes/s] = 1.0E-06*(SUM(LLC_VICTIMS_M_STATE))*64/time
L3 to system data volume [GBytes] = 1.0E-09*(SUM(LLC_VICTIMS_M_STATE))*64
L3 to/from system bandwidth [MBytes/s] = 1.0E-06*(SUM(LLC_LOOKUP_DATA_READ:STATE=0x3F)+SUM(LLC_VICTIMS_M_STATE))*64/time
L3 to/from system data volume [GBytes] = 1.0E-09*(SUM(LLC_LOOKUP_DATA_READ:STATE=0x3F)+SUM(LLC_VICTIMS_M_STATE))*64
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD))*64.0/time
Memory read data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_WR))*64.0/time
Memory write data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_WR))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0/time
Memory data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0
-
Group to measure cache transfers between L1 and memory. Please note that the
L3 to/from system metrics contain any traffic to the system (memory,
Intel QPI, etc.) but do not seem to capture all of it, because the memory read
bandwidth and the L3 to L2 bandwidth are commonly higher than the system to L3 bandwidth.
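All bandwidth and volume formulas in this group follow one pattern: number of transferred cache lines times 64 bytes, scaled to MBytes/s (over the runtime) or GBytes. A small Python sketch of that pattern, with invented counter values and assuming the 64-byte cache lines the formulas above use:

CACHE_LINE = 64  # bytes per cache line, as assumed by the formulas above

def bandwidth_mbytes(lines_transferred, seconds):
    # MBytes/s = 1.0E-06 * lines * 64 / time
    return 1.0e-06 * lines_transferred * CACHE_LINE / seconds

def volume_gbytes(lines_transferred):
    # GBytes = 1.0E-09 * lines * 64
    return 1.0e-09 * lines_transferred * CACHE_LINE

# Hypothetical values: L1D_REPLACEMENT count (PMC0) and runtime
l1d_replacement = 500_000_000
runtime = 2.0  # seconds
print(f"L2 to L1 load bandwidth: {bandwidth_mbytes(l1d_replacement, runtime):.1f} MBytes/s")
print(f"L2 to L1 load volume:    {volume_gbytes(l1d_replacement):.2f} GBytes")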


@@ -0,0 +1,55 @@
SHORT CBOX related data and metrics
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
CBOX0C0 LLC_VICTIMS_M_STATE
CBOX1C0 LLC_VICTIMS_M_STATE
CBOX2C0 LLC_VICTIMS_M_STATE
CBOX3C0 LLC_VICTIMS_M_STATE
CBOX4C0 LLC_VICTIMS_M_STATE
CBOX5C0 LLC_VICTIMS_M_STATE
CBOX6C0 LLC_VICTIMS_M_STATE
CBOX7C0 LLC_VICTIMS_M_STATE
CBOX8C0 LLC_VICTIMS_M_STATE
CBOX9C0 LLC_VICTIMS_M_STATE
CBOX10C0 LLC_VICTIMS_M_STATE
CBOX11C0 LLC_VICTIMS_M_STATE
CBOX12C0 LLC_VICTIMS_M_STATE
CBOX13C0 LLC_VICTIMS_M_STATE
CBOX14C0 LLC_VICTIMS_M_STATE
CBOX0C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX1C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX2C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX3C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX4C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX5C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX6C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX7C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX8C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX9C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX10C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX11C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX12C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX13C1:STATE=0x1 LLC_LOOKUP_ANY
CBOX14C1:STATE=0x1 LLC_LOOKUP_ANY
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
LLC misses per instruction (CBOX0C0+CBOX1C0+CBOX2C0+CBOX3C0+CBOX4C0+CBOX5C0+CBOX6C0+CBOX7C0+CBOX8C0+CBOX9C0+CBOX10C0+CBOX11C0+CBOX12C0+CBOX13C0+CBOX14C0)/FIXC0
LLC data written to MEM [MBytes] 1E-6*(CBOX0C1:STATE=0x1+CBOX1C1:STATE=0x1+CBOX2C1:STATE=0x1+CBOX3C1:STATE=0x1+CBOX4C1:STATE=0x1+CBOX5C1:STATE=0x1+CBOX6C1:STATE=0x1+CBOX7C1:STATE=0x1+CBOX8C1:STATE=0x1+CBOX9C1:STATE=0x1+CBOX10C1:STATE=0x1+CBOX11C1:STATE=0x1+CBOX12C1:STATE=0x1+CBOX13C1:STATE=0x1+CBOX14C1:STATE=0x1)*64
LONG
Formulas:
LLC misses per instruction = sum(LLC_VICTIMS_M_STATE)/INSTR_RETIRED_ANY
LLC data written to MEM [MBytes] = sum(LLC_LOOKUP_ANY:STATE=0x1)*64*1E-6
--
The CBOXes mediate the traffic from the L2 cache to the segmented L3 cache. Each
CBOX is responsible for one segment (2.5 MByte). The boxes maintain coherence among
all CPU cores of the socket. Depending on the CPU core count, some CBOXes are not
attached to a 2.5 MByte slice but are still active and track the traffic.


@@ -0,0 +1,26 @@
SHORT Power and Energy consumption
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PWR0 PWR_PKG_ENERGY
UBOXFIX UNCORE_CLOCK
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
Uncore Clock [MHz] 1.E-06*UBOXFIX/time
CPI FIXC1/FIXC0
Energy [J] PWR0
Power [W] PWR0/time
LONG
Formulas:
Power = PWR_PKG_ENERGY / time
Uncore Clock [MHz] = 1.E-06 * UNCORE_CLOCK / time
-
IvyBridge implements the new RAPL interface, which makes it possible to
monitor the energy consumed at the package (socket) level.
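Since PWR_PKG_ENERGY is reported as an energy in Joules, the power metric is simply energy over runtime, and the uncore clock is ticks over runtime. A minimal Python sketch with invented readings:

# Hypothetical RAPL and uncore readings over one measurement interval.
pkg_energy_j = 120.0   # PWR0 PWR_PKG_ENERGY in Joules (made-up value)
uncore_ticks = 5.4e9   # UBOXFIX UNCORE_CLOCK (made-up value)
runtime = 2.0          # seconds

power_w = pkg_energy_j / runtime               # Power [W] = PWR_PKG_ENERGY / time
uncore_mhz = 1.0e-06 * uncore_ticks / runtime  # Uncore Clock [MHz]
print(f"Power: {power_w:.1f} W, Uncore clock: {uncore_mhz:.0f} MHz")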


@@ -0,0 +1,38 @@
SHORT Cycle Activities
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 CYCLE_ACTIVITY_CYCLES_L2_PENDING
PMC1 CYCLE_ACTIVITY_CYCLES_LDM_PENDING
PMC2 CYCLE_ACTIVITY_CYCLES_L1D_PENDING
PMC3 CYCLE_ACTIVITY_CYCLES_NO_EXECUTE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Cycles without execution [%] (PMC3/FIXC1)*100
Cycles without execution due to L1D [%] (PMC2/FIXC1)*100
Cycles without execution due to L2 [%] (PMC0/FIXC1)*100
Cycles without execution due to memory loads [%] (PMC1/FIXC1)*100
LONG
Formulas:
Cycles without execution [%] = CYCLE_ACTIVITY_CYCLES_NO_EXECUTE/CPU_CLK_UNHALTED_CORE*100
Cycles with stalls due to L1D [%] = CYCLE_ACTIVITY_CYCLES_L1D_PENDING/CPU_CLK_UNHALTED_CORE*100
Cycles with stalls due to L2 [%] = CYCLE_ACTIVITY_CYCLES_L2_PENDING/CPU_CLK_UNHALTED_CORE*100
Cycles without execution due to memory loads [%] = CYCLE_ACTIVITY_CYCLES_LDM_PENDING/CPU_CLK_UNHALTED_CORE*100
--
This performance group measures the cycles while waiting for data from the cache
and memory hierarchy.
CYCLE_ACTIVITY_CYCLES_NO_EXECUTE: Counts number of cycles nothing is executed on
any execution port.
CYCLE_ACTIVITY_CYCLES_L1D_PENDING: Cycles while L1 cache miss demand load is
outstanding.
CYCLE_ACTIVITY_CYCLES_L2_PENDING: Cycles while L2 cache miss demand load is
outstanding.
CYCLE_ACTIVITY_CYCLES_LDM_PENDING: Cycles while memory subsystem has an
outstanding load.


@@ -0,0 +1,45 @@
SHORT Cycle Activities (Stalls)
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 CYCLE_ACTIVITY_STALLS_L2_PENDING
PMC1 CYCLE_ACTIVITY_STALLS_LDM_PENDING
PMC2 CYCLE_ACTIVITY_STALLS_L1D_PENDING
PMC3 CYCLE_ACTIVITY_STALLS_TOTAL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Total execution stalls PMC3
Stalls caused by L1D misses [%] (PMC2/PMC3)*100
Stalls caused by L2 misses [%] (PMC0/PMC3)*100
Stalls caused by memory loads [%] (PMC1/PMC3)*100
Execution stall rate [%] (PMC3/FIXC1)*100
Stalls caused by L1D misses rate [%] (PMC2/FIXC1)*100
Stalls caused by L2 misses rate [%] (PMC0/FIXC1)*100
Stalls caused by memory loads rate [%] (PMC1/FIXC1)*100
LONG
Formulas:
Total execution stalls = CYCLE_ACTIVITY_STALLS_TOTAL
Stalls caused by L1D misses [%] = (CYCLE_ACTIVITY_STALLS_L1D_PENDING/CYCLE_ACTIVITY_STALLS_TOTAL)*100
Stalls caused by L2 misses [%] = (CYCLE_ACTIVITY_STALLS_L2_PENDING/CYCLE_ACTIVITY_STALLS_TOTAL)*100
Stalls caused by memory loads [%] = (CYCLE_ACTIVITY_STALLS_LDM_PENDING/CYCLE_ACTIVITY_STALLS_TOTAL)*100
Execution stall rate [%] = (CYCLE_ACTIVITY_STALLS_TOTAL/CPU_CLK_UNHALTED_CORE)*100
Stalls caused by L1D misses rate [%] = (CYCLE_ACTIVITY_STALLS_L1D_PENDING/CPU_CLK_UNHALTED_CORE)*100
Stalls caused by L2 misses rate [%] = (CYCLE_ACTIVITY_STALLS_L2_PENDING/CPU_CLK_UNHALTED_CORE)*100
Stalls caused by memory loads rate [%] = (CYCLE_ACTIVITY_STALLS_LDM_PENDING/CPU_CLK_UNHALTED_CORE)*100
--
This performance group measures the stalls caused by data traffic in the cache
hierarchy.
CYCLE_ACTIVITY_STALLS_TOTAL: Total execution stalls.
CYCLE_ACTIVITY_STALLS_L1D_PENDING: Execution stalls while L1 cache miss demand
load is outstanding.
CYCLE_ACTIVITY_STALLS_L2_PENDING: Execution stalls while L2 cache miss demand
load is outstanding.
CYCLE_ACTIVITY_STALLS_LDM_PENDING: Execution stalls while memory subsystem has
an outstanding load.
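The stall metrics come in two flavors: percentages of total execution stalls and percentages of all core cycles. A short Python sketch evaluating both, with invented counts:

# Hypothetical counts, mirroring PMC0..PMC3 and FIXC1 above.
stalls_total = 4.0e8   # CYCLE_ACTIVITY_STALLS_TOTAL (PMC3)
stalls_l1d   = 1.5e8   # CYCLE_ACTIVITY_STALLS_L1D_PENDING (PMC2)
stalls_l2    = 0.9e8   # CYCLE_ACTIVITY_STALLS_L2_PENDING (PMC0)
stalls_ldm   = 2.2e8   # CYCLE_ACTIVITY_STALLS_LDM_PENDING (PMC1)
core_cycles  = 2.0e9   # CPU_CLK_UNHALTED_CORE (FIXC1)

print(f"Execution stall rate:        {100*stalls_total/core_cycles:.1f} %")
print(f"Stalls caused by L1D misses: {100*stalls_l1d/stalls_total:.1f} %")
print(f"Stalls caused by L2 misses:  {100*stalls_l2/stalls_total:.1f} %")
print(f"Stalls caused by mem loads:  {100*stalls_ldm/stalls_total:.1f} %")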


@@ -0,0 +1,22 @@
SHORT Load to store ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 MEM_UOPS_RETIRED_LOADS
PMC1 MEM_UOPS_RETIRED_STORES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Load to store ratio PMC0/PMC1
LONG
Formulas:
Load to store ratio = MEM_UOPS_RETIRED_LOADS/MEM_UOPS_RETIRED_STORES
-
This group determines the ratio of load operations to store operations.


@@ -0,0 +1,24 @@
SHORT Divide unit information
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 ARITH_NUM_DIV
PMC1 ARITH_FPU_DIV_ACTIVE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Number of divide ops PMC0
Avg. divide unit usage duration PMC1/PMC0
LONG
Formulas:
Number of divide ops = ARITH_NUM_DIV
Avg. divide unit usage duration = ARITH_FPU_DIV_ACTIVE/ARITH_NUM_DIV
-
This performance group measures the average latency of divide operations.


@@ -0,0 +1,33 @@
SHORT Power and Energy consumption
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
TMP0 TEMP_CORE
PWR0 PWR_PKG_ENERGY
PWR1 PWR_PP0_ENERGY
PWR3 PWR_DRAM_ENERGY
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Temperature [C] TMP0
Energy [J] PWR0
Power [W] PWR0/time
Energy PP0 [J] PWR1
Power PP0 [W] PWR1/time
Energy DRAM [J] PWR3
Power DRAM [W] PWR3/time
LONG
Formulas:
Power = PWR_PKG_ENERGY / time
Power PP0 = PWR_PP0_ENERGY / time
Power DRAM = PWR_DRAM_ENERGY / time
-
IvyBridge implements the new RAPL interface, which makes it possible to
monitor the consumed energy at the package (socket), PP0 domain, and DRAM
level. The PP0 domain often refers to only the CPU cores.


@@ -0,0 +1,32 @@
SHORT False sharing
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 MEM_LOAD_UOPS_LLC_HIT_RETIRED_XSNP_HITM
PMC1 MEM_LOAD_UOPS_LLC_MISS_RETIRED_REMOTE_HITM
PMC2 MEM_LOAD_UOPS_RETIRED_ALL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Local LLC false sharing [MByte] 1.E-06*PMC0*64
Local LLC false sharing rate PMC0/PMC2
Remote LLC false sharing [MByte] 1.E-06*PMC1*64
Remote LLC false sharing rate PMC1/PMC2
LONG
Formulas:
Local LLC false sharing [MByte] = 1.E-06*MEM_LOAD_UOPS_LLC_HIT_RETIRED_XSNP_HITM*64
Local LLC false sharing rate = MEM_LOAD_UOPS_LLC_HIT_RETIRED_XSNP_HITM/MEM_LOAD_UOPS_RETIRED_ALL
Remote LLC false sharing [MByte] = 1.E-06*MEM_LOAD_UOPS_LLC_MISS_RETIRED_REMOTE_HITM*64
Remote LLC false sharing rate = MEM_LOAD_UOPS_LLC_MISS_RETIRED_REMOTE_HITM/MEM_LOAD_UOPS_RETIRED_ALL
-
False-sharing of cache lines can dramatically reduce the performance of an
application. This performance group measures the L3 traffic induced by false-sharing.
The false-sharing rate uses all memory loads as reference.
For systems with multiple CPU sockets, this performance group also measures the
false-sharing of cache lines over socket boundaries.
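Each false-sharing metric either scales a HITM load count by the 64-byte cache line size (volume) or relates it to all retired load uops (rate). A Python sketch with invented counts:

# Hypothetical counts for the false-sharing metrics above.
local_hitm  = 2_000_000    # MEM_LOAD_UOPS_LLC_HIT_RETIRED_XSNP_HITM (PMC0)
remote_hitm =   300_000    # MEM_LOAD_UOPS_LLC_MISS_RETIRED_REMOTE_HITM (PMC1)
all_loads   = 800_000_000  # MEM_LOAD_UOPS_RETIRED_ALL (PMC2)

print(f"Local LLC false sharing:  {1.0e-06*local_hitm*64:.2f} MByte, "
      f"rate {local_hitm/all_loads:.2e}")
print(f"Remote LLC false sharing: {1.0e-06*remote_hitm*64:.2f} MByte, "
      f"rate {remote_hitm/all_loads:.2e}")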


@@ -0,0 +1,26 @@
SHORT Packed AVX MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 SIMD_FP_256_PACKED_SINGLE
PMC1 SIMD_FP_256_PACKED_DOUBLE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Packed SP [MFLOP/s] 1.0E-06*(PMC0*8.0)/time
Packed DP [MFLOP/s] 1.0E-06*(PMC1*4.0)/time
LONG
Formulas:
Packed SP [MFLOP/s] = 1.0E-06*(SIMD_FP_256_PACKED_SINGLE*8)/runtime
Packed DP [MFLOP/s] = 1.0E-06*(SIMD_FP_256_PACKED_DOUBLE*4)/runtime
-
Packed 256-bit AVX FLOP rates. Please note that the current FLOP measurements on
IvyBridge are potentially wrong, so you cannot trust these counters at the moment!


@@ -0,0 +1,33 @@
SHORT Double Precision MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE
PMC1 FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE
PMC2 SIMD_FP_256_PACKED_DOUBLE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
DP [MFLOP/s] 1.0E-06*(PMC0*2.0+PMC1+PMC2*4.0)/time
AVX DP [MFLOP/s] 1.0E-06*(PMC2*4.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2)/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Vectorization ratio 100*(PMC0+PMC2)/(PMC0+PMC1+PMC2)
LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE*2+FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE+SIMD_FP_256_PACKED_DOUBLE*4)/runtime
AVX DP [MFLOP/s] = 1.0E-06*(SIMD_FP_256_PACKED_DOUBLE*4)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE+SIMD_FP_256_PACKED_DOUBLE)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE/runtime
Vectorization ratio = 100*(FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE+SIMD_FP_256_PACKED_DOUBLE)/(FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE+FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE+SIMD_FP_256_PACKED_DOUBLE)
-
SSE scalar and packed double precision FLOP rates. Please note that the current
FLOP measurements on IvyBridge are potentially wrong, so you cannot trust these
counters at the moment!
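The MFLOP/s formula weights each event by the FLOPs per operation: 2 for 128-bit packed DP, 1 for scalar DP, 4 for 256-bit packed DP. A Python sketch with invented counts:

# Hypothetical counts mirroring PMC0..PMC2 above, runtime in seconds.
packed_128 = 1.0e9   # FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE (2 FLOPs each)
scalar     = 2.0e8   # FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE (1 FLOP each)
packed_256 = 3.0e9   # SIMD_FP_256_PACKED_DOUBLE            (4 FLOPs each)
runtime    = 2.0

mflops = 1.0e-06 * (packed_128*2.0 + scalar + packed_256*4.0) / runtime
vec_ratio = 100.0 * (packed_128 + packed_256) / (packed_128 + scalar + packed_256)
print(f"DP: {mflops:.1f} MFLOP/s, vectorization ratio: {vec_ratio:.1f} %")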


@@ -0,0 +1,33 @@
SHORT Single Precision MFLOP/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE
PMC1 FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE
PMC2 SIMD_FP_256_PACKED_SINGLE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
SP [MFLOP/s] 1.0E-06*(PMC0*4.0+PMC1+PMC2*8.0)/time
AVX SP [MFLOP/s] 1.0E-06*(PMC2*8.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2)/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Vectorization ratio 100*(PMC0+PMC2)/(PMC0+PMC1+PMC2)
LONG
Formulas:
SP [MFLOP/s] = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE*4+FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE+SIMD_FP_256_PACKED_SINGLE*8)/runtime
AVX SP [MFLOP/s] = 1.0E-06*(SIMD_FP_256_PACKED_SINGLE*8)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE+SIMD_FP_256_PACKED_SINGLE)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE/runtime
Vectorization ratio = 100*(FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE+SIMD_FP_256_PACKED_SINGLE)/(FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE+FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE+SIMD_FP_256_PACKED_SINGLE)
-
SSE scalar and packed single precision FLOP rates. Please note that the current
FLOP measurements on IvyBridge are potentially wrong, so you cannot trust these
counters at the moment!


@@ -0,0 +1,33 @@
SHORT Instruction cache miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 ICACHE_ACCESSES
PMC1 ICACHE_MISSES
PMC2 ICACHE_IFETCH_STALL
PMC3 ILD_STALL_IQ_FULL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1I request rate PMC0/FIXC0
L1I miss rate PMC1/FIXC0
L1I miss ratio PMC1/PMC0
L1I stalls PMC2
L1I stall rate PMC2/FIXC0
L1I queue full stalls PMC3
L1I queue full stall rate PMC3/FIXC0
LONG
Formulas:
L1I request rate = ICACHE_ACCESSES / INSTR_RETIRED_ANY
L1I miss rate = ICACHE_MISSES / INSTR_RETIRED_ANY
L1I miss ratio = ICACHE_MISSES / ICACHE_ACCESSES
L1I stalls = ICACHE_IFETCH_STALL
L1I stall rate = ICACHE_IFETCH_STALL / INSTR_RETIRED_ANY
L1I queue full stalls = ILD_STALL_IQ_FULL
L1I queue full stall rate = ILD_STALL_IQ_FULL / INSTR_RETIRED_ANY
-
This group measures some L1 instruction cache metrics.


@@ -0,0 +1,38 @@
SHORT L2 cache bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L1D_REPLACEMENT
PMC1 L1D_M_EVICT
PMC2 ICACHE_MISSES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L2D load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2D load data volume [GBytes] 1.0E-09*PMC0*64.0
L2D evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L2D evict data volume [GBytes] 1.0E-09*PMC1*64.0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1+PMC2)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1+PMC2)*64.0
LONG
Formulas:
L2D load bandwidth [MBytes/s] = 1.0E-06*L1D_REPLACEMENT*64.0/time
L2D load data volume [GBytes] = 1.0E-09*L1D_REPLACEMENT*64.0
L2D evict bandwidth [MBytes/s] = 1.0E-06*L1D_M_EVICT*64.0/time
L2D evict data volume [GBytes] = 1.0E-09*L1D_M_EVICT*64.0
L2 bandwidth [MBytes/s] = 1.0E-06*(L1D_REPLACEMENT+L1D_M_EVICT+ICACHE_MISSES)*64/time
L2 data volume [GBytes] = 1.0E-09*(L1D_REPLACEMENT+L1D_M_EVICT+ICACHE_MISSES)*64
-
Profiling group to measure L2 cache bandwidth. The bandwidth is computed from the
number of cache lines allocated in the L1 and the number of modified cache lines
evicted from the L1. The group also outputs the total data volume transferred between
L2 and L1. Note that this bandwidth also includes data transfers due to a write-allocate
load on a store miss in L1, and cache lines transferred into the instruction
cache.


@@ -0,0 +1,34 @@
SHORT L2 cache miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L2_TRANS_ALL_REQUESTS
PMC1 L2_RQSTS_MISS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L2 request rate PMC0/FIXC0
L2 miss rate PMC1/FIXC0
L2 miss ratio PMC1/PMC0
LONG
Formulas:
L2 request rate = L2_TRANS_ALL_REQUESTS/INSTR_RETIRED_ANY
L2 miss rate = L2_RQSTS_MISS/INSTR_RETIRED_ANY
L2 miss ratio = L2_RQSTS_MISS/L2_TRANS_ALL_REQUESTS
-
This group measures the locality of your data accesses with regard to the
L2 cache. The L2 request rate tells you how data-intensive your code is,
i.e. how many data accesses you have on average per instruction.
The L2 miss rate gives a measure of how often it was necessary to get
cache lines from memory. Finally, the L2 miss ratio tells you how many of your
memory references required a cache line to be loaded from a higher level.
While the data cache miss rate might be given by your algorithm, you should
try to get the data cache miss ratio as low as possible by increasing your cache reuse.


@@ -0,0 +1,36 @@
SHORT L3 cache bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L2_LINES_IN_ALL
PMC1 L2_LINES_OUT_DIRTY_ALL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L3 load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L3 load data volume [GBytes] 1.0E-09*PMC0*64.0
L3 evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L3 evict data volume [GBytes] 1.0E-09*PMC1*64.0
L3 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
L3 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
L3 load bandwidth [MBytes/s] = 1.0E-06*L2_LINES_IN_ALL*64.0/time
L3 load data volume [GBytes] = 1.0E-09*L2_LINES_IN_ALL*64.0
L3 evict bandwidth [MBytes/s] = 1.0E-06*L2_LINES_OUT_DIRTY_ALL*64.0/time
L3 evict data volume [GBytes] = 1.0E-09*L2_LINES_OUT_DIRTY_ALL*64.0
L3 bandwidth [MBytes/s] = 1.0E-06*(L2_LINES_IN_ALL+L2_LINES_OUT_DIRTY_ALL)*64/time
L3 data volume [GBytes] = 1.0E-09*(L2_LINES_IN_ALL+L2_LINES_OUT_DIRTY_ALL)*64
-
Profiling group to measure L3 cache bandwidth. The bandwidth is computed from the
number of cache lines allocated in the L2 and the number of modified cache lines
evicted from the L2. This group also outputs the data volume transferred between the
L3 and the measured cores' L2 caches. Note that this bandwidth also includes data
transfers due to a write-allocate load on a store miss in L2.


@@ -0,0 +1,36 @@
SHORT L3 cache miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 MEM_LOAD_UOPS_RETIRED_L3_ALL
PMC1 MEM_LOAD_UOPS_RETIRED_L3_MISS
PMC2 UOPS_RETIRED_ALL
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L3 request rate PMC0/PMC2
L3 miss rate PMC1/PMC2
L3 miss ratio PMC1/PMC0
LONG
Formulas:
L3 request rate = MEM_LOAD_UOPS_RETIRED_L3_ALL/UOPS_RETIRED_ALL
L3 miss rate = MEM_LOAD_UOPS_RETIRED_L3_MISS/UOPS_RETIRED_ALL
L3 miss ratio = MEM_LOAD_UOPS_RETIRED_L3_MISS/MEM_LOAD_UOPS_RETIRED_L3_ALL
-
This group measures the locality of your data accesses with regard to the
L3 cache. The L3 request rate tells you how data-intensive your code is,
i.e. how many data accesses you have on average per instruction.
The L3 miss rate gives a measure of how often it was necessary to get
cache lines from memory. Finally, the L3 miss ratio tells you how many of your
memory references required a cache line to be loaded from a higher level.
While the data cache miss rate might be given by your algorithm, you should
try to get the data cache miss ratio as low as possible by increasing your cache reuse.


@@ -0,0 +1,49 @@
SHORT Main memory bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
MBOX0C0 CAS_COUNT_RD
MBOX0C1 CAS_COUNT_WR
MBOX1C0 CAS_COUNT_RD
MBOX1C1 CAS_COUNT_WR
MBOX2C0 CAS_COUNT_RD
MBOX2C1 CAS_COUNT_WR
MBOX3C0 CAS_COUNT_RD
MBOX3C1 CAS_COUNT_WR
MBOX4C0 CAS_COUNT_RD
MBOX4C1 CAS_COUNT_WR
MBOX5C0 CAS_COUNT_RD
MBOX5C1 CAS_COUNT_WR
MBOX6C0 CAS_COUNT_RD
MBOX6C1 CAS_COUNT_WR
MBOX7C0 CAS_COUNT_RD
MBOX7C1 CAS_COUNT_WR
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0))*64.0/time
Memory read data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC1))*64.0/time
Memory write data volume [GBytes] = 1.0E-09*(SUM(MBOXxC1))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0/time
Memory data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0
-
Profiling group to measure the memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events, it is only possible to measure on a
per-socket basis. Some of the counters may not be available on your system.
The group also outputs the total data volume transferred from main memory.
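The memory metrics sum the CAS counts of all eight memory channels (MBOX0..MBOX7) and scale by 64 bytes per CAS transfer. A Python sketch with invented per-channel counts:

# Hypothetical per-channel CAS counts; each CAS transfers one 64-byte line.
cas_rd = [1.2e8, 1.1e8, 1.3e8, 1.2e8, 1.0e8, 1.1e8, 1.2e8, 1.1e8]  # MBOXxC0
cas_wr = [4.0e7, 3.8e7, 4.1e7, 3.9e7, 3.7e7, 4.0e7, 3.8e7, 3.9e7]  # MBOXxC1
runtime = 2.0  # seconds

read_bw  = 1.0e-06 * sum(cas_rd) * 64.0 / runtime
write_bw = 1.0e-06 * sum(cas_wr) * 64.0 / runtime
print(f"Memory read bandwidth:  {read_bw:.1f} MBytes/s")
print(f"Memory write bandwidth: {write_bw:.1f} MBytes/s")
print(f"Memory bandwidth:       {read_bw + write_bw:.1f} MBytes/s")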


@@ -0,0 +1,75 @@
SHORT Power and Energy consumption
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PWR0 PWR_PKG_ENERGY
PWR3 PWR_DRAM_ENERGY
PMC0 FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE
PMC1 FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE
PMC2 SIMD_FP_256_PACKED_DOUBLE
MBOX0C0 CAS_COUNT_RD
MBOX0C1 CAS_COUNT_WR
MBOX1C0 CAS_COUNT_RD
MBOX1C1 CAS_COUNT_WR
MBOX2C0 CAS_COUNT_RD
MBOX2C1 CAS_COUNT_WR
MBOX3C0 CAS_COUNT_RD
MBOX3C1 CAS_COUNT_WR
MBOX4C0 CAS_COUNT_RD
MBOX4C1 CAS_COUNT_WR
MBOX5C0 CAS_COUNT_RD
MBOX5C1 CAS_COUNT_WR
MBOX6C0 CAS_COUNT_RD
MBOX6C1 CAS_COUNT_WR
MBOX7C0 CAS_COUNT_RD
MBOX7C1 CAS_COUNT_WR
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Energy [J] PWR0
Power [W] PWR0/time
Energy DRAM [J] PWR3
Power DRAM [W] PWR3/time
MFLOP/s 1.0E-06*(PMC0*2.0+PMC1+PMC2*4.0)/time
AVX [MFLOP/s] 1.0E-06*(PMC2*4.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2)/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Operational intensity (PMC0*2.0+PMC1+PMC2*4.0)/((MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0)
LONG
Formulas:
Power [W] = PWR_PKG_ENERGY/runtime
Power DRAM [W] = PWR_DRAM_ENERGY/runtime
MFLOP/s = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE*2+FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE+SIMD_FP_256_PACKED_DOUBLE*4)/runtime
AVX [MFLOP/s] = 1.0E-06*(SIMD_FP_256_PACKED_DOUBLE*4)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE+SIMD_FP_256_PACKED_DOUBLE)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE/runtime
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0))*64.0/time
Memory read data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC1))*64.0/time
Memory write data volume [GBytes] = 1.0E-09*(SUM(MBOXxC1))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0/time
Memory data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0
Operational intensity = (FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE*2+FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE+SIMD_FP_256_PACKED_DOUBLE*4)/((SUM(MBOXxC0)+SUM(MBOXxC1))*64.0)
--
Profiling group to measure the memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events, it is only possible to measure on
a per-socket basis. The group also outputs the total data volume transferred from
main memory. SSE scalar and packed double precision FLOP rates are reported, as
well as packed 256-bit AVX operations. Please note that the current FLOP
measurements on IvyBridge are potentially wrong, so you cannot trust these
counters at the moment!
The operational intensity is calculated using the FP values of the cores and the
memory data volume of the whole socket. The actual operational intensity for
multiple CPUs can be found in the statistics table in the Sum column.
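The operational intensity divides the FLOP count by the memory data volume of the socket; it is the x-axis position of the code on a roofline plot. A Python sketch with invented totals:

# Hypothetical totals for one socket over a measurement interval.
flops_total = 4.0e10                  # unscaled FLOPs, per the MFLOP/s numerator above
bytes_total = (1.6e9 + 0.5e9) * 64.0  # (sum of CAS reads + writes) * 64 bytes

# Operational intensity [FLOP/Byte] relates compute to memory traffic.
oi = flops_total / bytes_total
print(f"Operational intensity: {oi:.2f} FLOP/Byte")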


@@ -0,0 +1,74 @@
SHORT Power and Energy consumption
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PWR0 PWR_PKG_ENERGY
PWR3 PWR_DRAM_ENERGY
PMC0 FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE
PMC1 FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE
PMC2 SIMD_FP_256_PACKED_SINGLE
MBOX0C0 CAS_COUNT_RD
MBOX0C1 CAS_COUNT_WR
MBOX1C0 CAS_COUNT_RD
MBOX1C1 CAS_COUNT_WR
MBOX2C0 CAS_COUNT_RD
MBOX2C1 CAS_COUNT_WR
MBOX3C0 CAS_COUNT_RD
MBOX3C1 CAS_COUNT_WR
MBOX4C0 CAS_COUNT_RD
MBOX4C1 CAS_COUNT_WR
MBOX5C0 CAS_COUNT_RD
MBOX5C1 CAS_COUNT_WR
MBOX6C0 CAS_COUNT_RD
MBOX6C1 CAS_COUNT_WR
MBOX7C0 CAS_COUNT_RD
MBOX7C1 CAS_COUNT_WR
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Energy [J] PWR0
Power [W] PWR0/time
Energy DRAM [J] PWR3
Power DRAM [W] PWR3/time
MFLOP/s 1.0E-06*(PMC0*4.0+PMC1+PMC2*8.0)/time
AVX [MFLOP/s] 1.0E-06*(PMC2*8.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2)/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Operational intensity (PMC0*4.0+PMC1+PMC2*8.0)/((MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0)
LONG
Formulas:
Power [W] = PWR_PKG_ENERGY/runtime
Power DRAM [W] = PWR_DRAM_ENERGY/runtime
MFLOP/s = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE*4+FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE+SIMD_FP_256_PACKED_SINGLE*8)/runtime
AVX [MFLOP/s] = 1.0E-06*(SIMD_FP_256_PACKED_SINGLE*8)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE+SIMD_FP_256_PACKED_SINGLE)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE/runtime
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0))*64.0/time
Memory read data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC1))*64.0/time
Memory write data volume [GBytes] = 1.0E-09*(SUM(MBOXxC1))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0/time
Memory data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0
Operational intensity = (FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE*4+FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE+SIMD_FP_256_PACKED_SINGLE*8)/((SUM(MBOXxC0)+SUM(MBOXxC1))*64.0)
--
Profiling group to measure the memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events, it is only possible to measure on
a per-socket basis. The group also outputs the total data volume transferred from
main memory. SSE scalar and packed single precision FLOP rates are reported, as
well as packed 256-bit AVX operations. Please note that the current FLOP
measurements on IvyBridge are potentially wrong, so you cannot trust these
counters at the moment!
The operational intensity is calculated using the FP values of the cores and the
memory data volume of the whole socket. The actual operational intensity for
multiple CPUs can be found in the statistics table in the Sum column.


@@ -0,0 +1,33 @@
SHORT Local and remote memory accesses
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 OFFCORE_RESPONSE_0_LOCAL_DRAM
PMC1 OFFCORE_RESPONSE_1_REMOTE_DRAM
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Local DRAM data volume [GByte] 1.E-09*PMC0*64
Local DRAM bandwidth [MByte/s] 1.E-06*(PMC0*64)/time
Remote DRAM data volume [GByte] 1.E-09*PMC1*64
Remote DRAM bandwidth [MByte/s] 1.E-06*(PMC1*64)/time
Memory data volume [GByte] 1.E-09*(PMC0+PMC1)*64
Memory bandwidth [MByte/s] 1.E-06*((PMC0+PMC1)*64)/time
LONG
Formulas:
CPI = CPU_CLK_UNHALTED_CORE/INSTR_RETIRED_ANY
Local DRAM data volume [GByte] = 1.E-09*OFFCORE_RESPONSE_0_LOCAL_DRAM*64
Local DRAM bandwidth [MByte/s] = 1.E-06*(OFFCORE_RESPONSE_0_LOCAL_DRAM*64)/time
Remote DRAM data volume [GByte] = 1.E-09*OFFCORE_RESPONSE_1_REMOTE_DRAM*64
Remote DRAM bandwidth [MByte/s] = 1.E-06*(OFFCORE_RESPONSE_1_REMOTE_DRAM*64)/time
Memory data volume [GByte] = 1.E-09*(OFFCORE_RESPONSE_0_LOCAL_DRAM+OFFCORE_RESPONSE_1_REMOTE_DRAM)*64
Memory bandwidth [MByte/s] = 1.E-06*((OFFCORE_RESPONSE_0_LOCAL_DRAM+OFFCORE_RESPONSE_1_REMOTE_DRAM)*64)/time
--
This performance group measures the data traffic of CPU cores to local and remote
memory.


@@ -0,0 +1,40 @@
SHORT Execution port utilization
REQUIRE_NOHT
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_DISPATCHED_PORT_PORT_0
PMC1 UOPS_DISPATCHED_PORT_PORT_1
PMC2 UOPS_DISPATCHED_PORT_PORT_2
PMC3 UOPS_DISPATCHED_PORT_PORT_3
PMC4 UOPS_DISPATCHED_PORT_PORT_4
PMC5 UOPS_DISPATCHED_PORT_PORT_5
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Port0 usage ratio PMC0/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5)
Port1 usage ratio PMC1/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5)
Port2 usage ratio PMC2/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5)
Port3 usage ratio PMC3/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5)
Port4 usage ratio PMC4/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5)
Port5 usage ratio PMC5/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5)
LONG
Formulas:
Port0 usage ratio = UOPS_DISPATCHED_PORT_PORT_0/SUM(UOPS_DISPATCHED_PORT_PORT_*)
Port1 usage ratio = UOPS_DISPATCHED_PORT_PORT_1/SUM(UOPS_DISPATCHED_PORT_PORT_*)
Port2 usage ratio = UOPS_DISPATCHED_PORT_PORT_2/SUM(UOPS_DISPATCHED_PORT_PORT_*)
Port3 usage ratio = UOPS_DISPATCHED_PORT_PORT_3/SUM(UOPS_DISPATCHED_PORT_PORT_*)
Port4 usage ratio = UOPS_DISPATCHED_PORT_PORT_4/SUM(UOPS_DISPATCHED_PORT_PORT_*)
Port5 usage ratio = UOPS_DISPATCHED_PORT_PORT_5/SUM(UOPS_DISPATCHED_PORT_PORT_*)
-
This group measures the execution port utilization in a CPU core. The group can
only be measured when HyperThreading is disabled, because only then can each CPU
core program eight counters.
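Each port usage ratio divides the uOPs dispatched to one port by the uOPs dispatched to all six ports. A Python sketch with invented counts:

# Hypothetical per-port uop counts (PMC0..PMC5 above).
ports = [3.1e9, 2.9e9, 1.8e9, 1.7e9, 1.5e9, 2.2e9]
total = sum(ports)
for i, p in enumerate(ports):
    print(f"Port{i} usage ratio: {p/total:.3f}")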


@@ -0,0 +1,52 @@
SHORT QPI Link Layer data
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
SBOX0C0 DIRECT2CORE_SUCCESS_RBT_HIT
SBOX1C0 DIRECT2CORE_SUCCESS_RBT_HIT
SBOX2C0 DIRECT2CORE_SUCCESS_RBT_HIT
SBOX0C1 TXL_FLITS_G0_DATA
SBOX1C1 TXL_FLITS_G0_DATA
SBOX2C1 TXL_FLITS_G0_DATA
SBOX0C2 TXL_FLITS_G0_NON_DATA
SBOX1C2 TXL_FLITS_G0_NON_DATA
SBOX2C2 TXL_FLITS_G0_NON_DATA
SBOX0C3 SBOX_CLOCKTICKS
SBOX1C3 SBOX_CLOCKTICKS
SBOX2C3 SBOX_CLOCKTICKS
SBOX0FIX QPI_RATE
SBOX1FIX QPI_RATE
SBOX2FIX QPI_RATE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
QPI Speed Link 0 [GT/s] ((SBOX0C3)/time)*inverseClock*(8/1000)
QPI Speed Link 1 [GT/s] ((SBOX1C3)/time)*inverseClock*(8/1000)
QPI Speed Link 2 [GT/s] ((SBOX2C3)/time)*inverseClock*(8/1000)
QPI Rate Link 0 [GT/s] 1.E-09*SBOX0FIX
QPI Rate Link 1 [GT/s] 1.E-09*SBOX1FIX
QPI Rate Link 2 [GT/s] 1.E-09*SBOX2FIX
data from QPI to LLC [MByte] 1.E-06*(SBOX0C0+SBOX1C0+SBOX2C0)*64
QPI data volume [MByte] 1.E-06*(SBOX0C1+SBOX1C1+SBOX2C1)*8
QPI data bandwidth [MByte/s] 1.E-06*(SBOX0C1+SBOX1C1+SBOX2C1)*8/time
QPI link volume [MByte] 1.E-06*(SBOX0C1+SBOX1C1+SBOX2C1+SBOX0C2+SBOX1C2+SBOX2C2)*8
QPI link bandwidth [MByte/s] 1.E-06*(SBOX0C1+SBOX1C1+SBOX2C1+SBOX0C2+SBOX1C2+SBOX2C2)*8/time
LONG
Formulas:
QPI Speed Link 0/1/2 [GT/s] = ((SBOX_CLOCKTICKS)/time)*inverseClock*(8/1000)
QPI Rate Link 0/1/2 [GT/s] = 1.E-09*(QPI_RATE)
data from QPI to LLC [MByte] = 1.E-06*(sum(DIRECT2CORE_SUCCESS_RBT_HIT)*64)
QPI data volume [MByte] = 1.E-06*(sum(TXL_FLITS_G0_DATA)*8)
QPI data bandwidth [MByte/s] = 1.E-06*(sum(TXL_FLITS_G0_DATA)*8)/runtime
QPI link volume [MByte] = 1.E-06*((sum(TXL_FLITS_G0_DATA)+sum(TXL_FLITS_G0_NON_DATA))*8)
QPI link bandwidth [MByte/s] = 1.E-06*((sum(TXL_FLITS_G0_DATA)+sum(TXL_FLITS_G0_NON_DATA))*8)/runtime
--
The Intel QPI Link Layer is responsible for packetizing requests from the caching agent (CBOXes)
on the way out to the system interface.


@@ -0,0 +1,22 @@
SHORT Recovery duration
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 INT_MISC_RECOVERY_CYCLES
PMC1 INT_MISC_RECOVERY_COUNT
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Avg. recovery duration PMC0/PMC1
LONG
Formulas:
Avg. recovery duration = INT_MISC_RECOVERY_CYCLES/INT_MISC_RECOVERY_COUNT
-
This group measures the duration of recoveries after an SSE exception, memory
disambiguation, etc.


@@ -0,0 +1,35 @@
SHORT L2 data TLB miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 DTLB_LOAD_MISSES_CAUSES_A_WALK
PMC1 DTLB_STORE_MISSES_CAUSES_A_WALK
PMC2 DTLB_LOAD_MISSES_WALK_DURATION
PMC3 DTLB_STORE_MISSES_WALK_DURATION
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1 DTLB load misses PMC0
L1 DTLB load miss rate PMC0/FIXC0
L1 DTLB load miss duration [Cyc] PMC2/PMC0
L1 DTLB store misses PMC1
L1 DTLB store miss rate PMC1/FIXC0
L1 DTLB store miss duration [Cyc] PMC3/PMC1
LONG
Formulas:
L1 DTLB load misses = DTLB_LOAD_MISSES_CAUSES_A_WALK
L1 DTLB load miss rate = DTLB_LOAD_MISSES_CAUSES_A_WALK / INSTR_RETIRED_ANY
L1 DTLB load miss duration [Cyc] = DTLB_LOAD_MISSES_WALK_DURATION / DTLB_LOAD_MISSES_CAUSES_A_WALK
L1 DTLB store misses = DTLB_STORE_MISSES_CAUSES_A_WALK
L1 DTLB store miss rate = DTLB_STORE_MISSES_CAUSES_A_WALK / INSTR_RETIRED_ANY
L1 DTLB store miss duration [Cyc] = DTLB_STORE_MISSES_WALK_DURATION / DTLB_STORE_MISSES_CAUSES_A_WALK
-
The DTLB load and store miss rates give a measure of how often a TLB miss occurred
per instruction. The duration measures how many cycles a page table walk took.


@@ -0,0 +1,28 @@
SHORT L1 Instruction TLB miss rate/ratio
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 ITLB_MISSES_CAUSES_A_WALK
PMC1 ITLB_MISSES_WALK_DURATION
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1 ITLB misses PMC0
L1 ITLB miss rate PMC0/FIXC0
L1 ITLB miss duration [Cyc] PMC1/PMC0
LONG
Formulas:
L1 ITLB misses = ITLB_MISSES_CAUSES_A_WALK
L1 ITLB miss rate = ITLB_MISSES_CAUSES_A_WALK / INSTR_RETIRED_ANY
L1 ITLB miss duration [Cyc] = ITLB_MISSES_WALK_DURATION / ITLB_MISSES_CAUSES_A_WALK
-
The ITLB miss rate gives a measure of how often a TLB miss occurred
per instruction. The duration measures how many cycles a page table walk took.


@@ -0,0 +1,48 @@
SHORT Top down cycle allocation
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_ISSUED_ANY
PMC1 UOPS_RETIRED_RETIRE_SLOTS
PMC2 IDQ_UOPS_NOT_DELIVERED_CORE
PMC3 INT_MISC_RECOVERY_CYCLES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
IPC FIXC0/FIXC1
Total Slots 4*FIXC1
Slots Retired PMC1
Fetch Bubbles PMC2
Recovery Bubbles 4*PMC3
Front End [%] PMC2/(4*FIXC1)*100
Speculation [%] (PMC0-PMC1+(4*PMC3))/(4*FIXC1)*100
Retiring [%] PMC1/(4*FIXC1)*100
Back End [%] (1-((PMC2+PMC0+(4*PMC3))/(4*FIXC1)))*100
LONG
Formulas:
Total Slots = 4*CPU_CLK_UNHALTED_CORE
Slots Retired = UOPS_RETIRED_RETIRE_SLOTS
Fetch Bubbles = IDQ_UOPS_NOT_DELIVERED_CORE
Recovery Bubbles = 4*INT_MISC_RECOVERY_CYCLES
Front End [%] = IDQ_UOPS_NOT_DELIVERED_CORE/(4*CPU_CLK_UNHALTED_CORE)*100
Speculation [%] = (UOPS_ISSUED_ANY-UOPS_RETIRED_RETIRE_SLOTS+(4*INT_MISC_RECOVERY_CYCLES))/(4*CPU_CLK_UNHALTED_CORE)*100
Retiring [%] = UOPS_RETIRED_RETIRE_SLOTS/(4*CPU_CLK_UNHALTED_CORE)*100
Back End [%] = (1-((IDQ_UOPS_NOT_DELIVERED_CORE+UOPS_ISSUED_ANY+(4*INT_MISC_RECOVERY_CYCLES))/(4*CPU_CLK_UNHALTED_CORE)))*100
--
This performance group measures cycles to determine the percentage of time spent
in the front end, back end, retiring, and speculation. These metrics are published
and verified by Intel. Further information:
Webpage describing the Top-Down Method and its usage in Intel VTune:
https://software.intel.com/en-us/vtune-amplifier-help-tuning-applications-using-a-top-down-microarchitecture-analysis-method
Paper by Ahmad Yasin:
https://sites.google.com/site/analysismethods/yasin-pubs/TopDown-Yasin-ISPASS14.pdf?attredirects=0
Slides by Ahmad Yasin:
http://www.cs.technion.ac.il/~erangi/TMA_using_Linux_perf__Ahmad_Yasin.pdf
The performance group was originally published here:
http://perf.mvermeulen.com/2018/04/14/top-down-performance-counter-analysis-part-1-likwid/
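With four issue slots per cycle, the four top-down categories partition the total slots. A Python sketch with invented counts; note that Back End is the remainder after the other three categories, which matches the formula above:

# Hypothetical raw counts mirroring the EVENTSET above (4 issue slots/cycle).
core_cycles   = 2.0e9   # FIXC1 CPU_CLK_UNHALTED_CORE
uops_issued   = 5.6e9   # PMC0 UOPS_ISSUED_ANY
slots_retired = 5.2e9   # PMC1 UOPS_RETIRED_RETIRE_SLOTS
fetch_bubbles = 1.2e9   # PMC2 IDQ_UOPS_NOT_DELIVERED_CORE
recovery_cyc  = 5.0e7   # PMC3 INT_MISC_RECOVERY_CYCLES

slots = 4 * core_cycles
front_end   = fetch_bubbles / slots * 100
speculation = (uops_issued - slots_retired + 4*recovery_cyc) / slots * 100
retiring    = slots_retired / slots * 100
back_end    = 100 - front_end - speculation - retiring  # = (1 - (PMC2+PMC0+4*PMC3)/slots)*100
print(f"Front End {front_end:.1f}%  Speculation {speculation:.1f}%  "
      f"Retiring {retiring:.1f}%  Back End {back_end:.1f}%")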


@@ -0,0 +1,96 @@
SHORT All Clocks
EVENTSET
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
CBOX0C0 CBOX_CLOCKTICKS
CBOX1C0 CBOX_CLOCKTICKS
CBOX2C0 CBOX_CLOCKTICKS
CBOX3C0 CBOX_CLOCKTICKS
CBOX4C0 CBOX_CLOCKTICKS
CBOX5C0 CBOX_CLOCKTICKS
CBOX6C0 CBOX_CLOCKTICKS
CBOX7C0 CBOX_CLOCKTICKS
CBOX8C0 CBOX_CLOCKTICKS
CBOX9C0 CBOX_CLOCKTICKS
CBOX10C0 CBOX_CLOCKTICKS
CBOX11C0 CBOX_CLOCKTICKS
CBOX12C0 CBOX_CLOCKTICKS
CBOX13C0 CBOX_CLOCKTICKS
CBOX14C0 CBOX_CLOCKTICKS
MBOX0C0 DRAM_CLOCKTICKS
MBOX1C0 DRAM_CLOCKTICKS
MBOX2C0 DRAM_CLOCKTICKS
MBOX3C0 DRAM_CLOCKTICKS
MBOX0FIX DRAM_CLOCKTICKS
MBOX1FIX DRAM_CLOCKTICKS
MBOX2FIX DRAM_CLOCKTICKS
MBOX3FIX DRAM_CLOCKTICKS
SBOX0C0 SBOX_CLOCKTICKS
SBOX1C0 SBOX_CLOCKTICKS
SBOX2C0 SBOX_CLOCKTICKS
UBOXFIX UNCORE_CLOCK
BBOX0C0 BBOX_CLOCKTICKS
BBOX1C0 BBOX_CLOCKTICKS
WBOX0 WBOX_CLOCKTICKS
PBOX0 PBOX_CLOCKTICKS
RBOX0C0 RBOX_CLOCKTICKS
RBOX1C0 RBOX_CLOCKTICKS
RBOX2C0 RBOX_CLOCKTICKS
IBOX0C0 IBOX_CLOCKTICKS
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
UBOX Frequency [GHz] 1.E-09*UBOXFIX/(FIXC1*inverseClock)
CBOX0 Frequency [GHz] 1.E-09*CBOX0C0/(FIXC1*inverseClock)
CBOX1 Frequency [GHz] 1.E-09*CBOX1C0/(FIXC1*inverseClock)
CBOX2 Frequency [GHz] 1.E-09*CBOX2C0/(FIXC1*inverseClock)
CBOX3 Frequency [GHz] 1.E-09*CBOX3C0/(FIXC1*inverseClock)
CBOX4 Frequency [GHz] 1.E-09*CBOX4C0/(FIXC1*inverseClock)
CBOX5 Frequency [GHz] 1.E-09*CBOX5C0/(FIXC1*inverseClock)
CBOX6 Frequency [GHz] 1.E-09*CBOX6C0/(FIXC1*inverseClock)
CBOX7 Frequency [GHz] 1.E-09*CBOX7C0/(FIXC1*inverseClock)
CBOX8 Frequency [GHz] 1.E-09*CBOX8C0/(FIXC1*inverseClock)
CBOX9 Frequency [GHz] 1.E-09*CBOX9C0/(FIXC1*inverseClock)
CBOX10 Frequency [GHz] 1.E-09*CBOX10C0/(FIXC1*inverseClock)
CBOX11 Frequency [GHz] 1.E-09*CBOX11C0/(FIXC1*inverseClock)
CBOX12 Frequency [GHz] 1.E-09*CBOX12C0/(FIXC1*inverseClock)
CBOX13 Frequency [GHz] 1.E-09*CBOX13C0/(FIXC1*inverseClock)
CBOX14 Frequency [GHz] 1.E-09*CBOX14C0/(FIXC1*inverseClock)
MBOX0 Frequency [GHz] 1.E-09*MBOX0C0/(FIXC1*inverseClock)
MBOX0FIX Frequency [GHz] 1.E-09*MBOX0FIX/(FIXC1*inverseClock)
MBOX1 Frequency [GHz] 1.E-09*MBOX1C0/(FIXC1*inverseClock)
MBOX1FIX Frequency [GHz] 1.E-09*MBOX1FIX/(FIXC1*inverseClock)
MBOX2 Frequency [GHz] 1.E-09*MBOX2C0/(FIXC1*inverseClock)
MBOX2FIX Frequency [GHz] 1.E-09*MBOX2FIX/(FIXC1*inverseClock)
MBOX3 Frequency [GHz] 1.E-09*MBOX3C0/(FIXC1*inverseClock)
MBOX3FIX Frequency [GHz] 1.E-09*MBOX3FIX/(FIXC1*inverseClock)
SBOX0 Frequency [GHz] 1.E-09*SBOX0C0/(FIXC1*inverseClock)
SBOX1 Frequency [GHz] 1.E-09*SBOX1C0/(FIXC1*inverseClock)
SBOX2 Frequency [GHz] 1.E-09*SBOX2C0/(FIXC1*inverseClock)
BBOX0 Frequency [GHz] 1.E-09*BBOX0C0/(FIXC1*inverseClock)
BBOX1 Frequency [GHz] 1.E-09*BBOX1C0/(FIXC1*inverseClock)
WBOX Frequency [GHz] 1.E-09*WBOX0/(FIXC1*inverseClock)
PBOX Frequency [GHz] 1.E-09*PBOX0/(FIXC1*inverseClock)
RBOX0 Frequency [GHz] 1.E-09*RBOX0C0/(FIXC1*inverseClock)
RBOX1 Frequency [GHz] 1.E-09*RBOX1C0/(FIXC1*inverseClock)
RBOX2 Frequency [GHz] 1.E-09*RBOX2C0/(FIXC1*inverseClock)
IBOX Frequency [GHz] 1.E-09*IBOX0C0/(FIXC1*inverseClock)
LONG
Formulas:
UBOX Frequency [GHz] = 1.E-09*UNCORE_CLOCK/(CPU_CLK_UNHALTED_CORE*inverseClock)
CBOX[0-14] Frequency [GHz] = 1.E-09*CBOX_CLOCKTICKS/(CPU_CLK_UNHALTED_CORE*inverseClock)
MBOX[0-3] Frequency [GHz] = 1.E-09*DRAM_CLOCKTICKS/(CPU_CLK_UNHALTED_CORE*inverseClock)
MBOX[0-3]FIX Frequency [GHz] = 1.E-09*DRAM_CLOCKTICKS/(CPU_CLK_UNHALTED_CORE*inverseClock)
SBOX[0-2] Frequency [GHz] = 1.E-09*SBOX_CLOCKTICKS/(CPU_CLK_UNHALTED_CORE*inverseClock)
BBOX[0-1] Frequency [GHz] = 1.E-09*BBOX_CLOCKTICKS/(CPU_CLK_UNHALTED_CORE*inverseClock)
RBOX[0-2] Frequency [GHz] = 1.E-09*RBOX_CLOCKTICKS/(CPU_CLK_UNHALTED_CORE*inverseClock)
WBOX Frequency [GHz] = 1.E-09*WBOX_CLOCKTICKS/(CPU_CLK_UNHALTED_CORE*inverseClock)
PBOX Frequency [GHz] = 1.E-09*PBOX_CLOCKTICKS/(CPU_CLK_UNHALTED_CORE*inverseClock)
IBOX Frequency [GHz] = 1.E-09*IBOX_CLOCKTICKS/(CPU_CLK_UNHALTED_CORE*inverseClock)
--
An overview of the frequencies of all Uncore units.


@@ -0,0 +1,35 @@
SHORT UOPs execution info
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_ISSUED_ANY
PMC1 UOPS_EXECUTED_THREAD
PMC2 UOPS_RETIRED_ALL
PMC3 UOPS_ISSUED_FLAGS_MERGE
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Issued UOPs PMC0
Merged UOPs PMC3
Executed UOPs PMC1
Retired UOPs PMC2
LONG
Formulas:
Issued UOPs = UOPS_ISSUED_ANY
Merged UOPs = UOPS_ISSUED_FLAGS_MERGE
Executed UOPs = UOPS_EXECUTED_THREAD
Retired UOPs = UOPS_RETIRED_ALL
-
This group returns information about the instruction pipeline. It measures the
issued, executed, and retired uOPs, and returns the number of uOPs that were
issued but not executed, as well as the number of uOPs that were executed but
never retired. The executed but not retired uOPs commonly come from speculatively
executed branches.


@@ -0,0 +1,31 @@
SHORT UOPs execution
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_EXECUTED_USED_CYCLES
PMC1 UOPS_EXECUTED_STALL_CYCLES
PMC2 CPU_CLOCK_UNHALTED_TOTAL_CYCLES
PMC3:EDGEDETECT UOPS_EXECUTED_STALL_CYCLES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Used cycles ratio [%] 100*PMC0/PMC2
Unused cycles ratio [%] 100*PMC1/PMC2
Avg stall duration [cycles] PMC1/PMC3:EDGEDETECT
LONG
Formulas:
Used cycles ratio [%] = 100*UOPS_EXECUTED_USED_CYCLES/CPU_CLK_UNHALTED_CORE
Unused cycles ratio [%] = 100*UOPS_EXECUTED_STALL_CYCLES/CPU_CLK_UNHALTED_CORE
Avg stall duration [cycles] = UOPS_EXECUTED_STALL_CYCLES/UOPS_EXECUTED_STALL_CYCLES:EDGEDETECT
-
This performance group returns the ratios of used and unused cycles regarding
the execution stage in the pipeline. Used cycles are all cycles where uOPs are
executed while unused cycles refer to pipeline stalls. Moreover, the group
calculates the average stall duration in cycles.


@@ -0,0 +1,31 @@
SHORT UOPs issuing
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_ISSUED_USED_CYCLES
PMC1 UOPS_ISSUED_STALL_CYCLES
PMC2 CPU_CLOCK_UNHALTED_TOTAL_CYCLES
PMC3:EDGEDETECT UOPS_ISSUED_STALL_CYCLES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Used cycles ratio [%] 100*PMC0/PMC2
Unused cycles ratio [%] 100*PMC1/PMC2
Avg stall duration [cycles] PMC1/PMC3:EDGEDETECT
LONG
Formulas:
Used cycles ratio [%] = 100*UOPS_ISSUED_USED_CYCLES/CPU_CLK_UNHALTED_CORE
Unused cycles ratio [%] = 100*UOPS_ISSUED_STALL_CYCLES/CPU_CLK_UNHALTED_CORE
Avg stall duration [cycles] = UOPS_ISSUED_STALL_CYCLES/UOPS_ISSUED_STALL_CYCLES:EDGEDETECT
-
This performance group returns the ratios of used and unused cycles regarding
the issue stage in the pipeline. Used cycles are all cycles where uOPs are
issued while unused cycles refer to pipeline stalls. Moreover, the group
calculates the average stall duration in cycles.


@@ -0,0 +1,31 @@
SHORT UOPs retirement
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_RETIRED_USED_CYCLES
PMC1 UOPS_RETIRED_STALL_CYCLES
PMC2 CPU_CLOCK_UNHALTED_TOTAL_CYCLES
PMC3:EDGEDETECT UOPS_RETIRED_STALL_CYCLES
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Used cycles ratio [%] 100*PMC0/PMC2
Unused cycles ratio [%] 100*PMC1/PMC2
Avg stall duration [cycles] PMC1/PMC3:EDGEDETECT
LONG
Formulas:
Used cycles ratio [%] = 100*UOPS_RETIRED_USED_CYCLES/CPU_CLK_UNHALTED_CORE
Unused cycles ratio [%] = 100*UOPS_RETIRED_STALL_CYCLES/CPU_CLK_UNHALTED_CORE
Avg stall duration [cycles] = UOPS_RETIRED_STALL_CYCLES/UOPS_RETIRED_STALL_CYCLES:EDGEDETECT
-
This performance group returns the ratios of used and unused cycles regarding
the retirement stage in the pipeline (re-order buffer). Used cycles are all
cycles where uOPs are retired while unused cycles refer to pipeline stalls.
Moreover, the group calculates the average stall duration in cycles.