mirror of
https://github.com/ClusterCockpit/cc-metric-collector.git
synced 2025-07-31 08:56:06 +02:00
Add likwid collector
31
collectors/likwid/groups/broadwellEP/BRANCH.txt
Normal file
@@ -0,0 +1,31 @@
|
||||
SHORT Branch prediction miss rate/ratio
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 BR_INST_RETIRED_ALL_BRANCHES
|
||||
PMC1 BR_MISP_RETIRED_ALL_BRANCHES
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
Branch rate PMC0/FIXC0
|
||||
Branch misprediction rate PMC1/FIXC0
|
||||
Branch misprediction ratio PMC1/PMC0
|
||||
Instructions per branch FIXC0/PMC0
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
Branch rate = BR_INST_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
|
||||
Branch misprediction rate = BR_MISP_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
|
||||
Branch misprediction ratio = BR_MISP_RETIRED_ALL_BRANCHES/BR_INST_RETIRED_ALL_BRANCHES
|
||||
Instructions per branch = INSTR_RETIRED_ANY/BR_INST_RETIRED_ALL_BRANCHES
|
||||
-
|
||||
The rates state how often on average a branch or a mispredicted branch occurred
|
||||
per instruction retired in total. The branch misprediction ratio states directly
|
||||
what fraction of all branch instructions were mispredicted.
|
||||
Instructions per branch is 1/branch rate.
|
||||
|
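The BRANCH metrics can be reproduced from three raw counter values. The following is a minimal Python sketch and not part of LIKWID or cc-metric-collector; the function name `branch_metrics` and the sample counts are hypothetical and only illustrate the formulas listed in the LONG section.

```python
def branch_metrics(instr_retired, branches, mispredicted):
    """Derive the BRANCH group metrics from three raw counter values."""
    if instr_retired <= 0 or branches <= 0:
        raise ValueError("counters must be positive to form the ratios")
    return {
        # Rates are normalized to all retired instructions.
        "branch_rate": branches / instr_retired,
        "misprediction_rate": mispredicted / instr_retired,
        # The ratio is normalized to branch instructions only.
        "misprediction_ratio": mispredicted / branches,
        # Reciprocal of the branch rate.
        "instructions_per_branch": instr_retired / branches,
    }


# Hypothetical sample: 1000 instructions, 200 branches, 10 mispredicted.
sample = branch_metrics(1000, 200, 10)
```

The rate and the ratio differ only in the denominator, which is why a code with few branches can show a low misprediction rate and still a high misprediction ratio.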
135
collectors/likwid/groups/broadwellEP/CACHES.txt
Normal file
@@ -0,0 +1,135 @@
|
||||
SHORT Cache bandwidth in MBytes/s
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 L1D_REPLACEMENT
|
||||
PMC1 L2_TRANS_L1D_WB
|
||||
PMC2 L2_LINES_IN_ALL
|
||||
PMC3 L2_TRANS_L2_WB
|
||||
CBOX0C1 LLC_VICTIMS_M
|
||||
CBOX1C1 LLC_VICTIMS_M
|
||||
CBOX2C1 LLC_VICTIMS_M
|
||||
CBOX3C1 LLC_VICTIMS_M
|
||||
CBOX4C1 LLC_VICTIMS_M
|
||||
CBOX5C1 LLC_VICTIMS_M
|
||||
CBOX6C1 LLC_VICTIMS_M
|
||||
CBOX7C1 LLC_VICTIMS_M
|
||||
CBOX8C1 LLC_VICTIMS_M
|
||||
CBOX9C1 LLC_VICTIMS_M
|
||||
CBOX10C1 LLC_VICTIMS_M
|
||||
CBOX11C1 LLC_VICTIMS_M
|
||||
CBOX12C1 LLC_VICTIMS_M
|
||||
CBOX13C1 LLC_VICTIMS_M
|
||||
CBOX14C1 LLC_VICTIMS_M
|
||||
CBOX15C1 LLC_VICTIMS_M
|
||||
CBOX16C1 LLC_VICTIMS_M
|
||||
CBOX17C1 LLC_VICTIMS_M
|
||||
CBOX18C1 LLC_VICTIMS_M
|
||||
CBOX19C1 LLC_VICTIMS_M
|
||||
CBOX20C1 LLC_VICTIMS_M
|
||||
CBOX21C1 LLC_VICTIMS_M
|
||||
CBOX0C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX1C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX2C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX3C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX4C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX5C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX6C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX7C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX8C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX9C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX10C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX11C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX12C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX13C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX14C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX15C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX16C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX17C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX18C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX19C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX20C0 LLC_LOOKUP_DATA_READ
|
||||
CBOX21C0 LLC_LOOKUP_DATA_READ
|
||||
MBOX0C0 CAS_COUNT_RD
|
||||
MBOX0C1 CAS_COUNT_WR
|
||||
MBOX1C0 CAS_COUNT_RD
|
||||
MBOX1C1 CAS_COUNT_WR
|
||||
MBOX2C0 CAS_COUNT_RD
|
||||
MBOX2C1 CAS_COUNT_WR
|
||||
MBOX3C0 CAS_COUNT_RD
|
||||
MBOX3C1 CAS_COUNT_WR
|
||||
MBOX4C0 CAS_COUNT_RD
|
||||
MBOX4C1 CAS_COUNT_WR
|
||||
MBOX5C0 CAS_COUNT_RD
|
||||
MBOX5C1 CAS_COUNT_WR
|
||||
MBOX6C0 CAS_COUNT_RD
|
||||
MBOX6C1 CAS_COUNT_WR
|
||||
MBOX7C0 CAS_COUNT_RD
|
||||
MBOX7C1 CAS_COUNT_WR
|
||||
|
||||
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
L2 to L1 load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
|
||||
L2 to L1 load data volume [GBytes] 1.0E-09*PMC0*64.0
|
||||
L1 to L2 evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
|
||||
L1 to L2 evict data volume [GBytes] 1.0E-09*PMC1*64.0
|
||||
L1 to/from L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
|
||||
L1 to/from L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
|
||||
L3 to L2 load bandwidth [MBytes/s] 1.0E-06*PMC2*64.0/time
|
||||
L3 to L2 load data volume [GBytes] 1.0E-09*PMC2*64.0
|
||||
L2 to L3 evict bandwidth [MBytes/s] 1.0E-06*PMC3*64.0/time
|
||||
L2 to L3 evict data volume [GBytes] 1.0E-09*PMC3*64.0
|
||||
L2 to/from L3 bandwidth [MBytes/s] 1.0E-06*(PMC2+PMC3)*64.0/time
|
||||
L2 to/from L3 data volume [GBytes] 1.0E-09*(PMC2+PMC3)*64.0
|
||||
System to L3 bandwidth [MBytes/s] 1.0E-06*(CBOX0C0+CBOX1C0+CBOX2C0+CBOX3C0+CBOX4C0+CBOX5C0+CBOX6C0+CBOX7C0+CBOX8C0+CBOX9C0+CBOX10C0+CBOX11C0+CBOX12C0+CBOX13C0+CBOX14C0+CBOX15C0+CBOX16C0+CBOX17C0+CBOX18C0+CBOX19C0+CBOX20C0+CBOX21C0)*64.0/time
|
||||
System to L3 data volume [GBytes] 1.0E-09*(CBOX0C0+CBOX1C0+CBOX2C0+CBOX3C0+CBOX4C0+CBOX5C0+CBOX6C0+CBOX7C0+CBOX8C0+CBOX9C0+CBOX10C0+CBOX11C0+CBOX12C0+CBOX13C0+CBOX14C0+CBOX15C0+CBOX16C0+CBOX17C0+CBOX18C0+CBOX19C0+CBOX20C0+CBOX21C0)*64.0
|
||||
L3 to system bandwidth [MBytes/s] 1.0E-06*(CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1+CBOX10C1+CBOX11C1+CBOX12C1+CBOX13C1+CBOX14C1+CBOX15C1+CBOX16C1+CBOX17C1+CBOX18C1+CBOX19C1+CBOX20C1+CBOX21C1)*64/time
|
||||
L3 to system data volume [GBytes] 1.0E-09*(CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1+CBOX10C1+CBOX11C1+CBOX12C1+CBOX13C1+CBOX14C1+CBOX15C1+CBOX16C1+CBOX17C1+CBOX18C1+CBOX19C1+CBOX20C1+CBOX21C1)*64
|
||||
L3 to/from system bandwidth [MBytes/s] 1.0E-06*(CBOX0C0+CBOX1C0+CBOX2C0+CBOX3C0+CBOX4C0+CBOX5C0+CBOX6C0+CBOX7C0+CBOX8C0+CBOX9C0+CBOX10C0+CBOX11C0+CBOX12C0+CBOX13C0+CBOX14C0+CBOX15C0+CBOX16C0+CBOX17C0+CBOX18C0+CBOX19C0+CBOX20C0+CBOX21C0+CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1+CBOX10C1+CBOX11C1+CBOX12C1+CBOX13C1+CBOX14C1+CBOX15C1+CBOX16C1+CBOX17C1+CBOX18C1+CBOX19C1+CBOX20C1+CBOX21C1)*64.0/time
|
||||
L3 to/from system data volume [GBytes] 1.0E-09*(CBOX0C0+CBOX1C0+CBOX2C0+CBOX3C0+CBOX4C0+CBOX5C0+CBOX6C0+CBOX7C0+CBOX8C0+CBOX9C0+CBOX10C0+CBOX11C0+CBOX12C0+CBOX13C0+CBOX14C0+CBOX15C0+CBOX16C0+CBOX17C0+CBOX18C0+CBOX19C0+CBOX20C0+CBOX21C0+CBOX0C1+CBOX1C1+CBOX2C1+CBOX3C1+CBOX4C1+CBOX5C1+CBOX6C1+CBOX7C1+CBOX8C1+CBOX9C1+CBOX10C1+CBOX11C1+CBOX12C1+CBOX13C1+CBOX14C1+CBOX15C1+CBOX16C1+CBOX17C1+CBOX18C1+CBOX19C1+CBOX20C1+CBOX21C1)*64.0
|
||||
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0/time
|
||||
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0
|
||||
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
|
||||
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
|
||||
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
|
||||
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
L2 to L1 load bandwidth [MBytes/s] = 1.0E-06*L1D_REPLACEMENT*64/time
|
||||
L2 to L1 load data volume [GBytes] = 1.0E-09*L1D_REPLACEMENT*64
|
||||
L1 to L2 evict bandwidth [MBytes/s] = 1.0E-06*L2_TRANS_L1D_WB*64/time
|
||||
L1 to L2 evict data volume [GBytes] = 1.0E-09*L2_TRANS_L1D_WB*64
|
||||
L1 to/from L2 bandwidth [MBytes/s] = 1.0E-06*(L1D_REPLACEMENT+L2_TRANS_L1D_WB)*64/time
|
||||
L1 to/from L2 data volume [GBytes] = 1.0E-09*(L1D_REPLACEMENT+L2_TRANS_L1D_WB)*64
|
||||
L3 to L2 load bandwidth [MBytes/s] = 1.0E-06*L2_LINES_IN_ALL*64/time
|
||||
L3 to L2 load data volume [GBytes] = 1.0E-09*L2_LINES_IN_ALL*64
|
||||
L2 to L3 evict bandwidth [MBytes/s] = 1.0E-06*L2_TRANS_L2_WB*64/time
|
||||
L2 to L3 evict data volume [GBytes] = 1.0E-09*L2_TRANS_L2_WB*64
|
||||
L2 to/from L3 bandwidth [MBytes/s] = 1.0E-06*(L2_LINES_IN_ALL+L2_TRANS_L2_WB)*64/time
|
||||
L2 to/from L3 data volume [GBytes] = 1.0E-09*(L2_LINES_IN_ALL+L2_TRANS_L2_WB)*64
|
||||
System to L3 bandwidth [MBytes/s] = 1.0E-06*(SUM(LLC_LOOKUP_DATA_READ))*64/time
|
||||
System to L3 data volume [GBytes] = 1.0E-09*(SUM(LLC_LOOKUP_DATA_READ))*64
|
||||
L3 to system bandwidth [MBytes/s] = 1.0E-06*(SUM(LLC_VICTIMS_M))*64/time
|
||||
L3 to system data volume [GBytes] = 1.0E-09*(SUM(LLC_VICTIMS_M))*64
|
||||
L3 to/from system bandwidth [MBytes/s] = 1.0E-06*(SUM(LLC_LOOKUP_DATA_READ)+SUM(LLC_VICTIMS_M))*64/time
|
||||
L3 to/from system data volume [GBytes] = 1.0E-09*(SUM(LLC_LOOKUP_DATA_READ)+SUM(LLC_VICTIMS_M))*64
|
||||
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD))*64.0/time
|
||||
Memory read data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD))*64.0
|
||||
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_WR))*64.0/time
|
||||
Memory write data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_WR))*64.0
|
||||
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0/time
|
||||
Memory data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0
|
||||
-
|
||||
Group to measure cache transfers between L1 and memory. Please note that the
|
||||
L3 to/from system metrics contain any traffic to the system (memory,
|
||||
Intel QPI, etc.) but do not seem to capture all of it, because commonly the memory read
|
||||
bandwidth and the L3 to L2 bandwidth are higher than the memory to L3 bandwidth.
|
||||
|
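The uncore metrics in this group all follow one pattern: sum one event over every box, multiply by the 64-byte cache line size, and scale to MBytes/s or GBytes. A minimal Python sketch of that pattern follows; the helper names and the sample counter values are hypothetical and not taken from LIKWID.

```python
CACHE_LINE_BYTES = 64  # factor used by every formula in this group


def sum_boxes(counters, prefix, suffix):
    """Sum all counters named <prefix><n><suffix>, e.g. MBOX0C0..MBOX7C0."""
    total = 0
    for name, value in counters.items():
        middle = name[len(prefix):len(name) - len(suffix)]
        if name.startswith(prefix) and name.endswith(suffix) and middle.isdigit():
            total += value
    return total


def bandwidth_mbytes_per_s(lines, seconds):
    """Cache lines transferred in a time window, as MBytes/s."""
    return 1.0e-06 * lines * CACHE_LINE_BYTES / seconds


def volume_gbytes(lines):
    """Cache lines transferred, as GBytes."""
    return 1.0e-09 * lines * CACHE_LINE_BYTES


# Hypothetical readings from two memory channels.
counters = {"MBOX0C0": 1_000_000, "MBOX1C0": 3_000_000, "MBOX0C1": 500_000}
read_lines = sum_boxes(counters, "MBOX", "C0")   # CAS_COUNT_RD on all boxes
write_lines = sum_boxes(counters, "MBOX", "C1")  # CAS_COUNT_WR on all boxes
read_bw = bandwidth_mbytes_per_s(read_lines, 2.0)
```

Summing by name pattern rather than listing every box keeps the sketch independent of how many CBOX or MBOX units a given chip exposes.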
26
collectors/likwid/groups/broadwellEP/CLOCK.txt
Normal file
@@ -0,0 +1,26 @@
|
||||
SHORT Power and Energy consumption
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PWR0 PWR_PKG_ENERGY
|
||||
UBOXFIX UNCORE_CLOCK
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
Uncore Clock [MHz] 1.E-06*UBOXFIX/time
|
||||
CPI FIXC1/FIXC0
|
||||
Energy [J] PWR0
|
||||
Power [W] PWR0/time
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
Power = PWR_PKG_ENERGY / time
|
||||
Uncore Clock [MHz] = 1.E-06 * UNCORE_CLOCK / time
|
||||
-
|
||||
Broadwell implements the new RAPL interface. This interface makes it possible to
|
||||
monitor the consumed energy on the package (socket) level.
|
||||
|
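The clock formulas can be checked with a few lines of Python. This is a sketch under one stated assumption: `inverseClock` is taken to be the reciprocal of the nominal CPU frequency in Hz. The function names and sample values are hypothetical.

```python
def core_clock_mhz(unhalted_core, unhalted_ref, nominal_hz):
    """Clock [MHz] = 1.E-06*(FIXC1/FIXC2)/inverseClock."""
    inverse_clock = 1.0 / nominal_hz  # assumption: 1 / nominal frequency
    return 1.0e-06 * (unhalted_core / unhalted_ref) / inverse_clock


def uncore_clock_mhz(uncore_cycles, seconds):
    """Uncore Clock [MHz] = 1.E-06*UBOXFIX/time."""
    return 1.0e-06 * uncore_cycles / seconds


def power_watts(energy_joules, seconds):
    """Power [W] = PWR0/time."""
    return energy_joules / seconds


# Hypothetical sample: the core ran 1.5x faster than the reference clock
# on a part with a nominal frequency of 2.0 GHz.
clock = core_clock_mhz(3.0e9, 2.0e9, 2.0e9)
uncore = uncore_clock_mhz(5.0e9, 2.0)
power = power_watts(150.0, 2.0)
```

The ratio FIXC1/FIXC2 scales the nominal frequency, so the result reflects the average clock while the core was not halted.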
38
collectors/likwid/groups/broadwellEP/CYCLE_ACTIVITY.txt
Normal file
@@ -0,0 +1,38 @@
|
||||
SHORT Cycle Activities
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 CYCLE_ACTIVITY_CYCLES_L2_PENDING
|
||||
PMC1 CYCLE_ACTIVITY_CYCLES_LDM_PENDING
|
||||
PMC2 CYCLE_ACTIVITY_CYCLES_L1D_PENDING
|
||||
PMC3 CYCLE_ACTIVITY_CYCLES_NO_EXECUTE
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
Cycles without execution [%] (PMC3/FIXC1)*100
|
||||
Cycles without execution due to L1D [%] (PMC2/FIXC1)*100
|
||||
Cycles without execution due to L2 [%] (PMC0/FIXC1)*100
|
||||
Cycles without execution due to memory loads [%] (PMC1/FIXC1)*100
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
Cycles without execution [%] = CYCLE_ACTIVITY_CYCLES_NO_EXECUTE/CPU_CLK_UNHALTED_CORE*100
|
||||
Cycles without execution due to L1D [%] = CYCLE_ACTIVITY_CYCLES_L1D_PENDING/CPU_CLK_UNHALTED_CORE*100
|
||||
Cycles without execution due to L2 [%] = CYCLE_ACTIVITY_CYCLES_L2_PENDING/CPU_CLK_UNHALTED_CORE*100
|
||||
Cycles without execution due to memory loads [%] = CYCLE_ACTIVITY_CYCLES_LDM_PENDING/CPU_CLK_UNHALTED_CORE*100
|
||||
--
|
||||
This performance group measures the cycles while waiting for data from the cache
|
||||
and memory hierarchy.
|
||||
CYCLE_ACTIVITY_CYCLES_NO_EXECUTE: Counts the number of cycles in which nothing is executed on
|
||||
any execution port.
|
||||
CYCLE_ACTIVITY_CYCLES_L1D_PENDING: Cycles while L1 cache miss demand load is
|
||||
outstanding.
|
||||
CYCLE_ACTIVITY_CYCLES_L2_PENDING: Cycles while L2 cache miss demand load is
|
||||
outstanding.
|
||||
CYCLE_ACTIVITY_CYCLES_LDM_PENDING: Cycles while memory subsystem has an
|
||||
outstanding load.
|
45
collectors/likwid/groups/broadwellEP/CYCLE_STALLS.txt
Normal file
@@ -0,0 +1,45 @@
|
||||
SHORT Cycle Activities (Stalls)
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 CYCLE_ACTIVITY_STALLS_L2_PENDING
|
||||
PMC1 CYCLE_ACTIVITY_STALLS_LDM_PENDING
|
||||
PMC2 CYCLE_ACTIVITY_STALLS_L1D_PENDING
|
||||
PMC3 CYCLE_ACTIVITY_STALLS_TOTAL
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
Total execution stalls PMC3
|
||||
Stalls caused by L1D misses [%] (PMC2/PMC3)*100
|
||||
Stalls caused by L2 misses [%] (PMC0/PMC3)*100
|
||||
Stalls caused by memory loads [%] (PMC1/PMC3)*100
|
||||
Execution stall rate [%] (PMC3/FIXC1)*100
|
||||
Stalls caused by L1D misses rate [%] (PMC2/FIXC1)*100
|
||||
Stalls caused by L2 misses rate [%] (PMC0/FIXC1)*100
|
||||
Stalls caused by memory loads rate [%] (PMC1/FIXC1)*100
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
Total execution stalls = CYCLE_ACTIVITY_STALLS_TOTAL
|
||||
Stalls caused by L1D misses [%] = (CYCLE_ACTIVITY_STALLS_L1D_PENDING/CYCLE_ACTIVITY_STALLS_TOTAL)*100
|
||||
Stalls caused by L2 misses [%] = (CYCLE_ACTIVITY_STALLS_L2_PENDING/CYCLE_ACTIVITY_STALLS_TOTAL)*100
|
||||
Stalls caused by memory loads [%] = (CYCLE_ACTIVITY_STALLS_LDM_PENDING/CYCLE_ACTIVITY_STALLS_TOTAL)*100
|
||||
Execution stall rate [%] = (CYCLE_ACTIVITY_STALLS_TOTAL/CPU_CLK_UNHALTED_CORE)*100
|
||||
Stalls caused by L1D misses rate [%] = (CYCLE_ACTIVITY_STALLS_L1D_PENDING/CPU_CLK_UNHALTED_CORE)*100
|
||||
Stalls caused by L2 misses rate [%] = (CYCLE_ACTIVITY_STALLS_L2_PENDING/CPU_CLK_UNHALTED_CORE)*100
|
||||
Stalls caused by memory loads rate [%] = (CYCLE_ACTIVITY_STALLS_LDM_PENDING/CPU_CLK_UNHALTED_CORE)*100
|
||||
--
|
||||
This performance group measures the stalls caused by data traffic in the cache
|
||||
hierarchy.
|
||||
CYCLE_ACTIVITY_STALLS_TOTAL: Total execution stalls.
|
||||
CYCLE_ACTIVITY_STALLS_L1D_PENDING: Execution stalls while L1 cache miss demand
|
||||
load is outstanding.
|
||||
CYCLE_ACTIVITY_STALLS_L2_PENDING: Execution stalls while L2 cache miss demand
|
||||
load is outstanding.
|
||||
CYCLE_ACTIVITY_STALLS_LDM_PENDING: Execution stalls while memory subsystem has
|
||||
an outstanding load.
|
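The group reports each stall cause twice: once as a share of all execution stalls and once as a rate per unhalted core cycle. The Python sketch below shows the two normalizations side by side; the function name and the sample counts are hypothetical, and the sketch does not assume that the three causes are disjoint.

```python
def stall_breakdown(stalls_total, stalls_l1d, stalls_l2, stalls_ldm, core_cycles):
    """Return per-cause percentages relative to stalls and to core cycles."""

    def pct(part, whole):
        # Guard against an empty measurement window.
        return 100.0 * part / whole if whole else 0.0

    causes = {"L1D misses": stalls_l1d, "L2 misses": stalls_l2, "memory loads": stalls_ldm}
    breakdown = {
        name: {
            "share_of_stalls": pct(value, stalls_total),  # e.g. (PMC2/PMC3)*100
            "rate_per_cycle": pct(value, core_cycles),    # e.g. (PMC2/FIXC1)*100
        }
        for name, value in causes.items()
    }
    execution_stall_rate = pct(stalls_total, core_cycles)  # (PMC3/FIXC1)*100
    return breakdown, execution_stall_rate


# Hypothetical sample: 400 stall cycles out of 1000 unhalted core cycles.
breakdown, stall_rate = stall_breakdown(400, 200, 100, 300, 1000)
```

A high share combined with a low rate means the cause dominates the stalls but stalls themselves are rare, so both views are needed to judge its impact.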
22
collectors/likwid/groups/broadwellEP/DATA.txt
Normal file
@@ -0,0 +1,22 @@
|
||||
SHORT Load to store ratio
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 MEM_UOPS_RETIRED_LOADS_ALL
|
||||
PMC1 MEM_UOPS_RETIRED_STORES_ALL
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
Load to store ratio PMC0/PMC1
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
Load to store ratio = MEM_UOPS_RETIRED_LOADS_ALL/MEM_UOPS_RETIRED_STORES_ALL
|
||||
-
|
||||
This is a metric to determine your load to store ratio.
|
||||
|
24
collectors/likwid/groups/broadwellEP/DIVIDE.txt
Normal file
@@ -0,0 +1,24 @@
|
||||
SHORT Divide unit information
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0:EDGEDETECT ARITH_FPU_DIV_ACTIVE
|
||||
PMC1 ARITH_FPU_DIV_ACTIVE
|
||||
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
Number of divide ops PMC0:EDGEDETECT
|
||||
Avg. divide unit usage duration PMC1/PMC0:EDGEDETECT
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
Number of divide ops = ARITH_FPU_DIV_ACTIVE:EDGEDETECT
|
||||
Avg. divide unit usage duration = ARITH_FPU_DIV_ACTIVE/ARITH_FPU_DIV_ACTIVE:EDGEDETECT
|
||||
-
|
||||
This performance group measures the average latency of divide operations.
|
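The DIVIDE group counts the same event twice, once plainly and once with the EDGEDETECT option. The Python sketch below illustrates why dividing the two gives an average duration. It assumes, for illustration only, that ARITH_FPU_DIV_ACTIVE behaves like a per-cycle busy signal and that EDGEDETECT counts transitions from idle to busy; the trace and the function name are hypothetical.

```python
def divide_stats(busy_per_cycle):
    """Count busy cycles and idle-to-busy transitions in a per-cycle trace."""
    active_cycles = 0   # what the plain event would count
    rising_edges = 0    # what the edge-detected event would count
    previous = False
    for busy in busy_per_cycle:
        busy = bool(busy)
        if busy:
            active_cycles += 1
            if not previous:
                rising_edges += 1
        previous = busy
    average = active_cycles / rising_edges if rising_edges else 0.0
    return rising_edges, active_cycles, average


# Hypothetical trace: two divide operations lasting 3 and 5 cycles.
trace = [0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0]
ops, cycles, avg_duration = divide_stats(trace)
```

In this trace the divider is busy for 8 cycles across 2 operations, so the average usage duration is 4 cycles per operation.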
35
collectors/likwid/groups/broadwellEP/ENERGY.txt
Normal file
@@ -0,0 +1,35 @@
|
||||
SHORT Power and Energy consumption
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
TMP0 TEMP_CORE
|
||||
PWR0 PWR_PKG_ENERGY
|
||||
PWR1 PWR_PP0_ENERGY
|
||||
PWR3 PWR_DRAM_ENERGY
|
||||
|
||||
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
Temperature [C] TMP0
|
||||
Energy [J] PWR0
|
||||
Power [W] PWR0/time
|
||||
Energy PP0 [J] PWR1
|
||||
Power PP0 [W] PWR1/time
|
||||
Energy DRAM [J] PWR3
|
||||
Power DRAM [W] PWR3/time
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
Power = PWR_PKG_ENERGY / time
|
||||
Power PP0 = PWR_PP0_ENERGY / time
|
||||
Power DRAM = PWR_DRAM_ENERGY / time
|
||||
-
|
||||
Broadwell implements the new RAPL interface. This interface makes it possible to
|
||||
monitor the consumed energy on the package (socket) and DRAM level.
|
||||
|
30
collectors/likwid/groups/broadwellEP/FALSE_SHARE.txt
Normal file
@@ -0,0 +1,30 @@
|
||||
SHORT False sharing
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 MEM_LOAD_UOPS_L3_HIT_RETIRED_XSNP_HITM
|
||||
PMC1 MEM_LOAD_UOPS_L3_MISS_RETIRED_REMOTE_HITM
|
||||
PMC2 MEM_UOPS_RETIRED_LOADS_ALL
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
Local LLC false sharing [MByte] 1.E-06*PMC0*64
|
||||
Local LLC false sharing rate PMC0/PMC2
|
||||
Remote LLC false sharing [MByte] 1.E-06*PMC1*64
|
||||
Remote LLC false sharing rate PMC1/PMC2
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
Local LLC false sharing [MByte] = 1.E-06*MEM_LOAD_UOPS_L3_HIT_RETIRED_XSNP_HITM*64
|
||||
Local LLC false sharing rate = MEM_LOAD_UOPS_L3_HIT_RETIRED_XSNP_HITM/MEM_UOPS_RETIRED_LOADS_ALL
|
||||
Remote LLC false sharing [MByte] = 1.E-06*MEM_LOAD_UOPS_L3_MISS_RETIRED_REMOTE_HITM*64
|
||||
Remote LLC false sharing rate = MEM_LOAD_UOPS_L3_MISS_RETIRED_REMOTE_HITM/MEM_UOPS_RETIRED_LOADS_ALL
|
||||
-
|
||||
False-sharing of cache lines can dramatically reduce the performance of an
|
||||
application. This performance group measures the L3 traffic induced by false-sharing.
|
||||
The false-sharing rate uses all memory load UOPs as reference.
|
24
collectors/likwid/groups/broadwellEP/FLOPS_AVX.txt
Normal file
@@ -0,0 +1,24 @@
|
||||
SHORT Packed AVX MFLOP/s
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE
|
||||
PMC1 FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
Packed SP [MFLOP/s] 1.0E-06*(PMC0*8.0)/time
|
||||
Packed DP [MFLOP/s] 1.0E-06*(PMC1*4.0)/time
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
Packed SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8)/runtime
|
||||
Packed DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4)/runtime
|
||||
-
|
||||
FLOP rates of 256-bit packed floating-point instructions.
|
||||
|
31
collectors/likwid/groups/broadwellEP/FLOPS_DP.txt
Normal file
@@ -0,0 +1,31 @@
|
||||
SHORT Double Precision MFLOP/s
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE
|
||||
PMC1 FP_ARITH_INST_RETIRED_SCALAR_DOUBLE
|
||||
PMC2 FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
DP [MFLOP/s] 1.0E-06*(PMC0*2.0+PMC1+PMC2*4.0)/time
|
||||
AVX DP [MFLOP/s] 1.0E-06*(PMC2*4.0)/time
|
||||
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2)/time
|
||||
Scalar [MUOPS/s] 1.0E-06*PMC1/time
|
||||
Vectorization ratio 100*(PMC0+PMC2)/(PMC0+PMC1+PMC2)
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE*2+FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4)/runtime
|
||||
AVX DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4)/runtime
|
||||
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE)/runtime
|
||||
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_DOUBLE/runtime
|
||||
Vectorization ratio = 100*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE)/(FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE)
|
||||
-
|
||||
AVX/SSE scalar and packed double precision FLOP rates.
|
||||
|
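The double precision formulas weight each instruction class by the number of doubles it operates on: 1 for scalar, 2 for 128-bit packed and 4 for 256-bit packed. A minimal Python sketch of the FLOPS_DP metrics follows; the function name and the sample counts are hypothetical.

```python
def dp_flops(packed_128, scalar, packed_256, seconds):
    """Derive the FLOPS_DP metrics from the three instruction counters."""
    # Weight each class by the doubles per instruction (2, 1 and 4).
    flops = packed_128 * 2.0 + scalar + packed_256 * 4.0
    vector_instr = packed_128 + packed_256
    all_instr = vector_instr + scalar
    return {
        "dp_mflops": 1.0e-06 * flops / seconds,
        "avx_dp_mflops": 1.0e-06 * (packed_256 * 4.0) / seconds,
        "packed_muops": 1.0e-06 * vector_instr / seconds,
        "scalar_muops": 1.0e-06 * scalar / seconds,
        # Share of packed instructions among all counted FP instructions.
        "vectorization_ratio": 100.0 * vector_instr / all_instr if all_instr else 0.0,
    }


# Hypothetical sample measured over one second.
flops_sample = dp_flops(packed_128=1_000_000, scalar=2_000_000,
                        packed_256=1_000_000, seconds=1.0)
```

Note that the vectorization ratio counts instructions, not floating-point operations, so a code can reach 50 % here while most of its FLOPs come from packed instructions.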
31
collectors/likwid/groups/broadwellEP/FLOPS_SP.txt
Normal file
@@ -0,0 +1,31 @@
|
||||
SHORT Single Precision MFLOP/s
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE
|
||||
PMC1 FP_ARITH_INST_RETIRED_SCALAR_SINGLE
|
||||
PMC2 FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
SP [MFLOP/s] 1.0E-06*(PMC0*4.0+PMC1+PMC2*8.0)/time
|
||||
AVX SP [MFLOP/s] 1.0E-06*(PMC2*8.0)/time
|
||||
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2)/time
|
||||
Scalar [MUOPS/s] 1.0E-06*PMC1/time
|
||||
Vectorization ratio 100*(PMC0+PMC2)/(PMC0+PMC1+PMC2)
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE*4+FP_ARITH_INST_RETIRED_SCALAR_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8)/runtime
|
||||
AVX SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8)/runtime
|
||||
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE)/runtime
|
||||
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_SINGLE/runtime
|
||||
Vectorization ratio [%] = 100*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE)/(FP_ARITH_INST_RETIRED_SCALAR_SINGLE+FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE)
|
||||
-
|
||||
AVX/SSE scalar and packed single precision FLOP rates.
|
||||
|
40
collectors/likwid/groups/broadwellEP/HA.txt
Normal file
@@ -0,0 +1,40 @@
|
||||
SHORT Main memory bandwidth in MBytes/s seen from Home agent
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
BBOX0C0 IMC_READS_NORMAL
|
||||
BBOX0C1 BYPASS_IMC_TAKEN
|
||||
BBOX0C2 IMC_WRITES_ALL
|
||||
BBOX1C0 IMC_READS_NORMAL
|
||||
BBOX1C1 BYPASS_IMC_TAKEN
|
||||
BBOX1C2 IMC_WRITES_ALL
|
||||
|
||||
|
||||
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
Memory read bandwidth [MBytes/s] 1.0E-06*(BBOX0C0+BBOX1C0+BBOX0C1+BBOX1C1)*64.0/time
|
||||
Memory read data volume [GBytes] 1.0E-09*(BBOX0C0+BBOX1C0+BBOX0C1+BBOX1C1)*64.0
|
||||
Memory write bandwidth [MBytes/s] 1.0E-06*(BBOX0C2+BBOX1C2)*64.0/time
|
||||
Memory write data volume [GBytes] 1.0E-09*(BBOX0C2+BBOX1C2)*64.0
|
||||
Memory bandwidth [MBytes/s] 1.0E-06*(BBOX0C0+BBOX1C0+BBOX0C1+BBOX1C1+BBOX0C2+BBOX1C2)*64.0/time
|
||||
Memory data volume [GBytes] 1.0E-09*(BBOX0C0+BBOX1C0+BBOX0C1+BBOX1C1+BBOX0C2+BBOX1C2)*64.0
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(IMC_READS_NORMAL)+SUM(BYPASS_IMC_TAKEN))*64.0/time
|
||||
Memory read data volume [GBytes] = 1.0E-09*(SUM(IMC_READS_NORMAL)+SUM(BYPASS_IMC_TAKEN))*64.0
|
||||
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(IMC_WRITES_ALL))*64.0/time
|
||||
Memory write data volume [GBytes] = 1.0E-09*(SUM(IMC_WRITES_ALL))*64.0
|
||||
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(IMC_READS_NORMAL) + SUM(BYPASS_IMC_TAKEN) + SUM(IMC_WRITES_ALL))*64.0/time
|
||||
Memory data volume [GBytes] = 1.0E-09*(SUM(IMC_READS_NORMAL) + SUM(BYPASS_IMC_TAKEN) + SUM(IMC_WRITES_ALL))*64.0
|
||||
-
|
||||
This group derives the same metrics as the MEM group but uses the events of the
|
||||
Home Agent, a central unit that is responsible for the protocol side of memory
|
||||
interactions.
|
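Each METRICS line pairs a name with an arithmetic expression over counter names and `time`. The Python sketch below evaluates such an expression against a dictionary of measured values. It is an illustration only and not how LIKWID or cc-metric-collector evaluate formulas; `eval_formula` and the sample values are hypothetical, and only `+`, `-`, `*`, `/`, numbers and plain names are supported.

```python
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_formula(formula, values):
    """Evaluate a metric formula using only arithmetic and known names."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return values[node.id]  # KeyError if a counter was not measured
        raise ValueError("unsupported element in formula: " + ast.dump(node))

    return walk(ast.parse(formula, mode="eval"))


# Hypothetical readings for the two Home Agent boxes over two seconds.
values = {"BBOX0C2": 1_000_000, "BBOX1C2": 1_000_000, "time": 2.0}
write_bw = eval_formula("1.0E-06*(BBOX0C2+BBOX1C2)*64.0/time", values)
```

Walking the syntax tree instead of calling `eval` keeps the sketch from executing anything other than arithmetic on the supplied values.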
25
collectors/likwid/groups/broadwellEP/ICACHE.txt
Normal file
@@ -0,0 +1,25 @@
|
||||
SHORT Instruction cache miss rate/ratio
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 ICACHE_ACCESSES
|
||||
PMC1 ICACHE_MISSES
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
L1I request rate PMC0/FIXC0
|
||||
L1I miss rate PMC1/FIXC0
|
||||
L1I miss ratio PMC1/PMC0
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
L1I request rate = ICACHE_ACCESSES / INSTR_RETIRED_ANY
|
||||
L1I miss rate = ICACHE_MISSES / INSTR_RETIRED_ANY
|
||||
L1I miss ratio = ICACHE_MISSES / ICACHE_ACCESSES
|
||||
-
|
||||
This group measures some L1 instruction cache metrics.
|
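All group files in this commit share one layout: a SHORT line, an EVENTSET block of counter and event pairs, a METRICS block of name and formula pairs, and a free-text LONG block. The Python sketch below parses that layout. It is not the parser used by LIKWID or cc-metric-collector; `parse_group` is a hypothetical name, and the sketch assumes that a metric formula contains no whitespace, so the formula is the last token on its line.

```python
def parse_group(text):
    """Split a performance group file into its four sections."""
    group = {"short": "", "eventset": {}, "metrics": {}, "long": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if section is None and line.startswith("SHORT"):
            group["short"] = line[len("SHORT"):].strip()
            continue
        if line in ("EVENTSET", "METRICS", "LONG"):
            section = line
            continue
        if section == "LONG":
            group["long"].append(line)  # keep the description verbatim
            continue
        if not line:
            continue
        if section == "EVENTSET":
            counter, event = line.split(None, 1)
            group["eventset"][counter] = event
        elif section == "METRICS":
            # Assumption: the formula is the last whitespace-free token.
            name, formula = line.rsplit(None, 1)
            group["metrics"][name] = formula
    return group


ICACHE_TEXT = """SHORT Instruction cache miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
PMC0 ICACHE_ACCESSES
PMC1 ICACHE_MISSES

METRICS
Runtime (RDTSC) [s] time
L1I miss ratio PMC1/PMC0

LONG
This group measures some L1 instruction cache metrics.
"""

icache = parse_group(ICACHE_TEXT)
```

Splitting metric lines from the right keeps names such as `Runtime (RDTSC) [s]` intact even though they contain spaces.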
37
collectors/likwid/groups/broadwellEP/L2.txt
Normal file
@@ -0,0 +1,37 @@
|
||||
SHORT L2 cache bandwidth in MBytes/s
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 L1D_REPLACEMENT
|
||||
PMC1 L2_TRANS_L1D_WB
|
||||
PMC2 ICACHE_MISSES
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
L2D load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
|
||||
L2D load data volume [GBytes] 1.0E-09*PMC0*64.0
|
||||
L2D evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
|
||||
L2D evict data volume [GBytes] 1.0E-09*PMC1*64.0
|
||||
L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1+PMC2)*64.0/time
|
||||
L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1+PMC2)*64.0
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
L2D load bandwidth [MBytes/s] = 1.0E-06*L1D_REPLACEMENT*64.0/time
|
||||
L2D load data volume [GBytes] = 1.0E-09*L1D_REPLACEMENT*64.0
|
||||
L2D evict bandwidth [MBytes/s] = 1.0E-06*L2_TRANS_L1D_WB*64.0/time
|
||||
L2D evict data volume [GBytes] = 1.0E-09*L2_TRANS_L1D_WB*64.0
|
||||
L2 bandwidth [MBytes/s] = 1.0E-06*(L1D_REPLACEMENT+L2_TRANS_L1D_WB+ICACHE_MISSES)*64.0/time
|
||||
L2 data volume [GBytes] = 1.0E-09*(L1D_REPLACEMENT+L2_TRANS_L1D_WB+ICACHE_MISSES)*64.0
|
||||
-
|
||||
Profiling group to measure L2 cache bandwidth. The bandwidth is computed from the
|
||||
number of cache lines loaded from the L2 to the L1 data cache and the writebacks from
|
||||
the L1 data cache to the L2 cache. The group also outputs total data volume transferred between
|
||||
L2 and L1. Note that this bandwidth also includes data transfers due to a write
|
||||
allocate load on a store miss in L1 and cache lines transferred to the instruction
|
||||
cache.
|
34
collectors/likwid/groups/broadwellEP/L2CACHE.txt
Normal file
@@ -0,0 +1,34 @@
|
||||
SHORT L2 cache miss rate/ratio
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 L2_TRANS_ALL_REQUESTS
|
||||
PMC1 L2_RQSTS_MISS
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
L2 request rate PMC0/FIXC0
|
||||
L2 miss rate PMC1/FIXC0
|
||||
L2 miss ratio PMC1/PMC0
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
L2 request rate = L2_TRANS_ALL_REQUESTS/INSTR_RETIRED_ANY
|
||||
L2 miss rate = L2_RQSTS_MISS/INSTR_RETIRED_ANY
|
||||
L2 miss ratio = L2_RQSTS_MISS/L2_TRANS_ALL_REQUESTS
|
||||
-
|
||||
This group measures the locality of your data accesses with regard to the
|
||||
L2 cache. L2 request rate tells you how data intensive your code is
|
||||
or how many data accesses you have on average per instruction.
|
||||
The L2 miss rate gives a measure of how often it was necessary to get
|
||||
cache lines from memory. And finally L2 miss ratio tells you how many of your
|
||||
memory references required a cache line to be loaded from a higher level.
|
||||
While the data cache miss rate might be given by your algorithm, you should
|
||||
try to get data cache miss ratio as low as possible by increasing your cache reuse.
|
||||
|
||||
|
36
collectors/likwid/groups/broadwellEP/L3.txt
Normal file
@@ -0,0 +1,36 @@
|
||||
SHORT L3 cache bandwidth in MBytes/s
|
||||
|
||||
EVENTSET
|
||||
FIXC0 INSTR_RETIRED_ANY
|
||||
FIXC1 CPU_CLK_UNHALTED_CORE
|
||||
FIXC2 CPU_CLK_UNHALTED_REF
|
||||
PMC0 L2_LINES_IN_ALL
|
||||
PMC1 L2_TRANS_L2_WB
|
||||
|
||||
METRICS
|
||||
Runtime (RDTSC) [s] time
|
||||
Runtime unhalted [s] FIXC1*inverseClock
|
||||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
|
||||
CPI FIXC1/FIXC0
|
||||
L3 load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
|
||||
L3 load data volume [GBytes] 1.0E-09*PMC0*64.0
|
||||
L3 evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
|
||||
L3 evict data volume [GBytes] 1.0E-09*PMC1*64.0
|
||||
L3 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
|
||||
L3 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
|
||||
|
||||
LONG
|
||||
Formulas:
|
||||
L3 load bandwidth [MBytes/s] = 1.0E-06*L2_LINES_IN_ALL*64.0/time
|
||||
L3 load data volume [GBytes] = 1.0E-09*L2_LINES_IN_ALL*64.0
|
||||
L3 evict bandwidth [MBytes/s] = 1.0E-06*L2_TRANS_L2_WB*64.0/time
|
||||
L3 evict data volume [GBytes] = 1.0E-09*L2_TRANS_L2_WB*64.0
|
||||
L3 bandwidth [MBytes/s] = 1.0E-06*(L2_LINES_IN_ALL+L2_TRANS_L2_WB)*64/time
|
||||
L3 data volume [GBytes] = 1.0E-09*(L2_LINES_IN_ALL+L2_TRANS_L2_WB)*64
|
||||
-
|
||||
Profiling group to measure L3 cache bandwidth. The bandwidth is computed by the
|
||||
number of cache lines allocated in the L2 and the number of modified cache lines
|
||||
evicted from the L2. This group also outputs the data volume transferred between the
|
||||
L3 and the measured cores' L2 caches. Note that this bandwidth also includes data
|
||||
transfers due to a write allocate load on a store miss in L2.
|
||||
|
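The L3.txt formulas above can be sketched in Python as follows. This is only an
illustration under stated assumptions: the function name l3_metrics and the sample
counter values are hypothetical, while the 64-byte cache line factor and the
1.0E-06/1.0E-09 scaling come from the group file.

```python
CACHE_LINE_BYTES = 64.0  # factor used by the group file


def l3_metrics(l2_lines_in_all, l2_trans_l2_wb, time_s):
    """Evaluate the L3 group formulas for one measurement interval."""
    load_bytes = l2_lines_in_all * CACHE_LINE_BYTES
    evict_bytes = l2_trans_l2_wb * CACHE_LINE_BYTES
    total_bytes = load_bytes + evict_bytes
    return {
        "L3 load bandwidth [MBytes/s]": 1.0e-06 * load_bytes / time_s,
        "L3 load data volume [GBytes]": 1.0e-09 * load_bytes,
        "L3 evict bandwidth [MBytes/s]": 1.0e-06 * evict_bytes / time_s,
        "L3 evict data volume [GBytes]": 1.0e-09 * evict_bytes,
        "L3 bandwidth [MBytes/s]": 1.0e-06 * total_bytes / time_s,
        "L3 data volume [GBytes]": 1.0e-09 * total_bytes,
    }


# Made-up sample: 1e9 lines loaded and 5e8 lines evicted within 2 seconds.
metrics = l3_metrics(1_000_000_000, 500_000_000, 2.0)
```

Both bandwidth values share the same runtime, so the total bandwidth is simply the
sum of the load and evict bandwidths.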
35
collectors/likwid/groups/broadwellEP/L3CACHE.txt
Normal file
@@ -0,0 +1,35 @@
SHORT L3 cache miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 MEM_LOAD_UOPS_RETIRED_L3_ALL
PMC1 MEM_LOAD_UOPS_RETIRED_L3_MISS
PMC2 UOPS_RETIRED_ALL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L3 request rate PMC0/PMC2
L3 miss rate PMC1/PMC2
L3 miss ratio PMC1/PMC0

LONG
Formulas:
L3 request rate = MEM_LOAD_UOPS_RETIRED_L3_ALL/UOPS_RETIRED_ALL
L3 miss rate = MEM_LOAD_UOPS_RETIRED_L3_MISS/UOPS_RETIRED_ALL
L3 miss ratio = MEM_LOAD_UOPS_RETIRED_L3_MISS/MEM_LOAD_UOPS_RETIRED_L3_ALL
-
This group measures the locality of your data accesses with regard to the
L3 cache. The L3 request rate tells you how data intensive your code is,
or how many data accesses you have on average per retired uOP.
The L3 miss rate gives a measure of how often it was necessary to get
cache lines from memory. Finally, the L3 miss ratio tells you how many of your
memory references required a cache line to be loaded from a higher level.
While the data cache miss rate might be given by your algorithm, you should
try to get the data cache miss ratio as low as possible by increasing your cache reuse.
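To make the difference between the rates and the ratio in L3CACHE.txt concrete,
here is a minimal Python sketch. The function name l3_cache_metrics and the sample
counts are hypothetical; the three divisions mirror the formulas of the group file.

```python
def l3_cache_metrics(l3_all, l3_miss, uops_retired_all):
    """Return (request rate, miss rate, miss ratio) as defined in the group."""
    request_rate = l3_all / uops_retired_all  # L3 load uOPs per retired uOP
    miss_rate = l3_miss / uops_retired_all    # L3 misses per retired uOP
    miss_ratio = l3_miss / l3_all             # fraction of L3 requests that missed
    return request_rate, miss_rate, miss_ratio


# Made-up sample: 2000 L3 requests, 500 of them missing, in 1e6 retired uOPs.
request_rate, miss_rate, miss_ratio = l3_cache_metrics(2_000, 500, 1_000_000)
```

The miss rate equals the request rate multiplied by the miss ratio, which is why a
code with few L3 requests can show a low miss rate despite a high miss ratio.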
52
collectors/likwid/groups/broadwellEP/MEM.txt
Normal file
@@ -0,0 +1,52 @@
SHORT Main memory bandwidth in MBytes/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
MBOX0C0 CAS_COUNT_RD
MBOX0C1 CAS_COUNT_WR
MBOX1C0 CAS_COUNT_RD
MBOX1C1 CAS_COUNT_WR
MBOX2C0 CAS_COUNT_RD
MBOX2C1 CAS_COUNT_WR
MBOX3C0 CAS_COUNT_RD
MBOX3C1 CAS_COUNT_WR
MBOX4C0 CAS_COUNT_RD
MBOX4C1 CAS_COUNT_WR
MBOX5C0 CAS_COUNT_RD
MBOX5C1 CAS_COUNT_WR
MBOX6C0 CAS_COUNT_RD
MBOX6C1 CAS_COUNT_WR
MBOX7C0 CAS_COUNT_RD
MBOX7C1 CAS_COUNT_WR



METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0

LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0))*64.0/runtime
Memory read data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC1))*64.0/runtime
Memory write data volume [GBytes] = 1.0E-09*(SUM(MBOXxC1))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0
-
Profiling group to measure memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events, it is only possible to measure on a
per-socket basis. Some of the counters may not be available on your system.
Also outputs total data volume transferred to and from main memory.
The same metrics are provided by the HA group.
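The SUM(MBOXxC0) and SUM(MBOXxC1) terms of MEM.txt can be sketched in Python as
follows. The function name memory_metrics, the dictionary layout and the sample
counts are hypothetical. Treating a missing MBOX counter as zero is an assumption
of this sketch, motivated by the note that some counters may not be available.

```python
CACHE_LINE_BYTES = 64.0  # bytes per CAS transfer, as in the group file


def memory_metrics(counters, time_s, channels=8):
    """Sum CAS counts over all MBOX channels and apply the group formulas.

    `counters` maps names such as "MBOX0C0" to raw counts. A missing
    channel is counted as zero (assumption of this sketch).
    """
    reads = sum(counters.get(f"MBOX{i}C0", 0) for i in range(channels))
    writes = sum(counters.get(f"MBOX{i}C1", 0) for i in range(channels))
    read_bytes = reads * CACHE_LINE_BYTES
    write_bytes = writes * CACHE_LINE_BYTES
    total_bytes = read_bytes + write_bytes
    return {
        "Memory read bandwidth [MBytes/s]": 1.0e-06 * read_bytes / time_s,
        "Memory read data volume [GBytes]": 1.0e-09 * read_bytes,
        "Memory write bandwidth [MBytes/s]": 1.0e-06 * write_bytes / time_s,
        "Memory write data volume [GBytes]": 1.0e-09 * write_bytes,
        "Memory bandwidth [MBytes/s]": 1.0e-06 * total_bytes / time_s,
        "Memory data volume [GBytes]": 1.0e-09 * total_bytes,
    }


# Made-up sample with only two populated memory channels.
sample = {
    "MBOX0C0": 1_000_000, "MBOX0C1": 500_000,
    "MBOX1C0": 1_000_000, "MBOX1C1": 500_000,
}
mem = memory_metrics(sample, time_s=1.0)
```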
73
collectors/likwid/groups/broadwellEP/MEM_DP.txt
Normal file
@@ -0,0 +1,73 @@
SHORT Overview of arithmetic and main memory performance

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PWR0 PWR_PKG_ENERGY
PWR3 PWR_DRAM_ENERGY
PMC0 FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE
PMC1 FP_ARITH_INST_RETIRED_SCALAR_DOUBLE
PMC2 FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE
MBOX0C0 CAS_COUNT_RD
MBOX0C1 CAS_COUNT_WR
MBOX1C0 CAS_COUNT_RD
MBOX1C1 CAS_COUNT_WR
MBOX2C0 CAS_COUNT_RD
MBOX2C1 CAS_COUNT_WR
MBOX3C0 CAS_COUNT_RD
MBOX3C1 CAS_COUNT_WR
MBOX4C0 CAS_COUNT_RD
MBOX4C1 CAS_COUNT_WR
MBOX5C0 CAS_COUNT_RD
MBOX5C1 CAS_COUNT_WR
MBOX6C0 CAS_COUNT_RD
MBOX6C1 CAS_COUNT_WR
MBOX7C0 CAS_COUNT_RD
MBOX7C1 CAS_COUNT_WR

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Energy [J] PWR0
Power [W] PWR0/time
Energy DRAM [J] PWR3
Power DRAM [W] PWR3/time
MFLOP/s 1.0E-06*(PMC0*2.0+PMC1+PMC2*4.0)/time
AVX [MFLOP/s] 1.0E-06*(PMC2*4.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2)/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Operational intensity (PMC0*2.0+PMC1+PMC2*4.0)/((MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0)

LONG
Formulas:
Power [W] = PWR_PKG_ENERGY/runtime
Power DRAM [W] = PWR_DRAM_ENERGY/runtime
MFLOP/s = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE*2+FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4)/runtime
AVX [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_DOUBLE/runtime
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0))*64.0/runtime
Memory read data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC1))*64.0/runtime
Memory write data volume [GBytes] = 1.0E-09*(SUM(MBOXxC1))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0
Operational intensity = (FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE*2+FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4)/((SUM(MBOXxC0)+SUM(MBOXxC1))*64.0)
--
Profiling group to measure memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events, it is only possible to measure on
a per-socket basis. Also outputs total data volume transferred to and from main memory.
Reports SSE scalar and packed double precision FLOP rates as well as packed AVX
(256-bit) double precision instructions.
The operational intensity is calculated using the FP values of the cores and the
memory data volume of the whole socket. The actual operational intensity for
multiple CPUs can be found in the statistics table in the Sum column.
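The operational intensity of MEM_DP.txt combines per-core FLOP counts with the
per-socket memory volume. The Python sketch below illustrates this under stated
assumptions: the function names, the tuple layout and the sample counts are
hypothetical. Summing the FLOPs of all measured cores before dividing is meant to
mirror the Sum column mentioned in the description.

```python
def flops_dp(packed_128, scalar, packed_256):
    """Double precision FLOPs: 2 per 128-bit, 1 per scalar, 4 per 256-bit op."""
    return packed_128 * 2.0 + scalar + packed_256 * 4.0


def socket_operational_intensity(per_core_counts, cas_reads, cas_writes):
    """FLOPs of all measured cores divided by the bytes moved by the socket.

    `per_core_counts` is a list of (packed_128, scalar, packed_256) tuples,
    `cas_reads` and `cas_writes` are the summed MBOX CAS counts.
    """
    total_flops = sum(flops_dp(*counts) for counts in per_core_counts)
    bytes_moved = (cas_reads + cas_writes) * 64.0
    return total_flops / bytes_moved


# Made-up sample: two cores with identical counts, 9e6 bytes of memory traffic.
cores = [(0, 1_000_000, 2_000_000), (0, 1_000_000, 2_000_000)]
intensity = socket_operational_intensity(cores, cas_reads=100_000, cas_writes=40_625)
```

For MEM_SP.txt the same structure applies with the single precision weights 4, 1
and 8 taken from its formulas.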
73
collectors/likwid/groups/broadwellEP/MEM_SP.txt
Normal file
@@ -0,0 +1,73 @@
SHORT Overview of arithmetic and main memory performance

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PWR0 PWR_PKG_ENERGY
PWR3 PWR_DRAM_ENERGY
PMC0 FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE
PMC1 FP_ARITH_INST_RETIRED_SCALAR_SINGLE
PMC2 FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE
MBOX0C0 CAS_COUNT_RD
MBOX0C1 CAS_COUNT_WR
MBOX1C0 CAS_COUNT_RD
MBOX1C1 CAS_COUNT_WR
MBOX2C0 CAS_COUNT_RD
MBOX2C1 CAS_COUNT_WR
MBOX3C0 CAS_COUNT_RD
MBOX3C1 CAS_COUNT_WR
MBOX4C0 CAS_COUNT_RD
MBOX4C1 CAS_COUNT_WR
MBOX5C0 CAS_COUNT_RD
MBOX5C1 CAS_COUNT_WR
MBOX6C0 CAS_COUNT_RD
MBOX6C1 CAS_COUNT_WR
MBOX7C0 CAS_COUNT_RD
MBOX7C1 CAS_COUNT_WR

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Energy [J] PWR0
Power [W] PWR0/time
Energy DRAM [J] PWR3
Power DRAM [W] PWR3/time
MFLOP/s 1.0E-06*(PMC0*4.0+PMC1+PMC2*8.0)/time
AVX [MFLOP/s] 1.0E-06*(PMC2*8.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2)/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0
Operational intensity (PMC0*4.0+PMC1+PMC2*8.0)/((MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0)

LONG
Formulas:
Power [W] = PWR_PKG_ENERGY/runtime
Power DRAM [W] = PWR_DRAM_ENERGY/runtime
MFLOP/s = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE*4+FP_ARITH_INST_RETIRED_SCALAR_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8)/runtime
AVX [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_SINGLE/runtime
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0))*64.0/runtime
Memory read data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC1))*64.0/runtime
Memory write data volume [GBytes] = 1.0E-09*(SUM(MBOXxC1))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0
Operational intensity = (FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE*4+FP_ARITH_INST_RETIRED_SCALAR_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8)/((SUM(MBOXxC0)+SUM(MBOXxC1))*64.0)
--
Profiling group to measure memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events, it is only possible to measure on
a per-socket basis. Also outputs total data volume transferred to and from main memory.
Reports SSE scalar and packed single precision FLOP rates as well as packed AVX
(256-bit) single precision instructions.
The operational intensity is calculated using the FP values of the cores and the
memory data volume of the whole socket. The actual operational intensity for
multiple CPUs can be found in the statistics table in the Sum column.
41
collectors/likwid/groups/broadwellEP/NUMA.txt
Normal file
@@ -0,0 +1,41 @@
SHORT Local and remote data transfers

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
BBOX0C0 REQUESTS_READS_LOCAL
BBOX1C0 REQUESTS_READS_LOCAL
BBOX0C1 REQUESTS_READS_REMOTE
BBOX1C1 REQUESTS_READS_REMOTE
BBOX0C2 REQUESTS_WRITES_LOCAL
BBOX1C2 REQUESTS_WRITES_LOCAL
BBOX0C3 REQUESTS_WRITES_REMOTE
BBOX1C3 REQUESTS_WRITES_REMOTE

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Local bandwidth [MByte/s] 1.E-06*((BBOX0C0+BBOX1C0+BBOX0C2+BBOX1C2)*64)/time
Local data volume [GByte] 1.E-09*(BBOX0C0+BBOX1C0+BBOX0C2+BBOX1C2)*64
Remote bandwidth [MByte/s] 1.E-06*((BBOX0C1+BBOX1C1+BBOX0C3+BBOX1C3)*64)/time
Remote data volume [GByte] 1.E-09*(BBOX0C1+BBOX1C1+BBOX0C3+BBOX1C3)*64
Total bandwidth [MByte/s] 1.E-06*((BBOX0C0+BBOX1C0+BBOX0C2+BBOX1C2+BBOX0C1+BBOX1C1+BBOX0C3+BBOX1C3)*64)/time
Total data volume [GByte] 1.E-09*(BBOX0C0+BBOX1C0+BBOX0C2+BBOX1C2+BBOX0C1+BBOX1C1+BBOX0C3+BBOX1C3)*64


LONG
Formulas:
CPI = CPU_CLK_UNHALTED_CORE/INSTR_RETIRED_ANY
Local bandwidth [MByte/s] = 1.E-06*((SUM(REQUESTS_READS_LOCAL)+SUM(REQUESTS_WRITES_LOCAL))*64)/time
Local data volume [GByte] = 1.E-09*(SUM(REQUESTS_READS_LOCAL)+SUM(REQUESTS_WRITES_LOCAL))*64
Remote bandwidth [MByte/s] = 1.E-06*((SUM(REQUESTS_READS_REMOTE)+SUM(REQUESTS_WRITES_REMOTE))*64)/time
Remote data volume [GByte] = 1.E-09*(SUM(REQUESTS_READS_REMOTE)+SUM(REQUESTS_WRITES_REMOTE))*64
Total bandwidth [MByte/s] = 1.E-06*((SUM(REQUESTS_READS_LOCAL)+SUM(REQUESTS_WRITES_LOCAL)+SUM(REQUESTS_READS_REMOTE)+SUM(REQUESTS_WRITES_REMOTE))*64)/time
Total data volume [GByte] = 1.E-09*(SUM(REQUESTS_READS_LOCAL)+SUM(REQUESTS_WRITES_LOCAL)+SUM(REQUESTS_READS_REMOTE)+SUM(REQUESTS_WRITES_REMOTE))*64
--
This performance group measures the data traffic of CPU sockets to local and remote
CPU sockets. It uses the Home Agent for calculation. This may also include data from
sources other than the memory controllers.
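As a small illustration of the NUMA.txt formulas, the following Python sketch
splits the Home Agent request counts into local and remote traffic. The function
name numa_metrics, the list-per-event argument layout and the sample counts are
hypothetical; the factor 64 and the scaling come from the group file.

```python
def numa_metrics(reads_local, reads_remote, writes_local, writes_remote, time_s):
    """Each list holds one count per Home Agent box (BBOX0, BBOX1)."""
    local_bytes = (sum(reads_local) + sum(writes_local)) * 64
    remote_bytes = (sum(reads_remote) + sum(writes_remote)) * 64
    total_bytes = local_bytes + remote_bytes
    return {
        "Local bandwidth [MByte/s]": 1.0e-06 * local_bytes / time_s,
        "Local data volume [GByte]": 1.0e-09 * local_bytes,
        "Remote bandwidth [MByte/s]": 1.0e-06 * remote_bytes / time_s,
        "Remote data volume [GByte]": 1.0e-09 * remote_bytes,
        "Total bandwidth [MByte/s]": 1.0e-06 * total_bytes / time_s,
        "Total data volume [GByte]": 1.0e-09 * total_bytes,
    }


# Made-up sample counts for two Home Agent boxes over one second.
numa = numa_metrics(
    reads_local=[1_000_000, 1_000_000],
    reads_remote=[250_000, 250_000],
    writes_local=[500_000, 500_000],
    writes_remote=[0, 0],
    time_s=1.0,
)
```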
50
collectors/likwid/groups/broadwellEP/PORT_USAGE.txt
Normal file
@@ -0,0 +1,50 @@
SHORT Execution port utilization

REQUIRE_NOHT

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_EXECUTED_PORT_PORT_0
PMC1 UOPS_EXECUTED_PORT_PORT_1
PMC2 UOPS_EXECUTED_PORT_PORT_2
PMC3 UOPS_EXECUTED_PORT_PORT_3
PMC4 UOPS_EXECUTED_PORT_PORT_4
PMC5 UOPS_EXECUTED_PORT_PORT_5
PMC6 UOPS_EXECUTED_PORT_PORT_6
PMC7 UOPS_EXECUTED_PORT_PORT_7


METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Port0 usage ratio PMC0/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5+PMC6+PMC7)
Port1 usage ratio PMC1/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5+PMC6+PMC7)
Port2 usage ratio PMC2/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5+PMC6+PMC7)
Port3 usage ratio PMC3/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5+PMC6+PMC7)
Port4 usage ratio PMC4/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5+PMC6+PMC7)
Port5 usage ratio PMC5/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5+PMC6+PMC7)
Port6 usage ratio PMC6/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5+PMC6+PMC7)
Port7 usage ratio PMC7/(PMC0+PMC1+PMC2+PMC3+PMC4+PMC5+PMC6+PMC7)

LONG
Formulas:
Port0 usage ratio = UOPS_EXECUTED_PORT_PORT_0/SUM(UOPS_EXECUTED_PORT_PORT_*)
Port1 usage ratio = UOPS_EXECUTED_PORT_PORT_1/SUM(UOPS_EXECUTED_PORT_PORT_*)
Port2 usage ratio = UOPS_EXECUTED_PORT_PORT_2/SUM(UOPS_EXECUTED_PORT_PORT_*)
Port3 usage ratio = UOPS_EXECUTED_PORT_PORT_3/SUM(UOPS_EXECUTED_PORT_PORT_*)
Port4 usage ratio = UOPS_EXECUTED_PORT_PORT_4/SUM(UOPS_EXECUTED_PORT_PORT_*)
Port5 usage ratio = UOPS_EXECUTED_PORT_PORT_5/SUM(UOPS_EXECUTED_PORT_PORT_*)
Port6 usage ratio = UOPS_EXECUTED_PORT_PORT_6/SUM(UOPS_EXECUTED_PORT_PORT_*)
Port7 usage ratio = UOPS_EXECUTED_PORT_PORT_7/SUM(UOPS_EXECUTED_PORT_PORT_*)
-
This group measures the execution port utilization in a CPU core. The group can
only be measured when HyperThreading is disabled because only then can each CPU core
program eight counters.
Please be aware that the counters PMC4-7 are broken on Intel Broadwell. They
don't increment if either user- or kernel-level filtering is applied. User-level
filtering is default in LIKWID, hence kernel-level filtering is added
automatically for PMC4-7. The returned counts can therefore be much higher.
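The port usage ratios of PORT_USAGE.txt are each port's share of the uOPs executed
on all eight ports. A minimal Python sketch, assuming a hypothetical function name
port_usage_ratios and made-up counts; returning zeros for an all-zero measurement
is an assumption of this sketch and not part of the group file.

```python
def port_usage_ratios(port_uops):
    """Return each port's share of all executed uOPs, in port order."""
    total = sum(port_uops)
    if total == 0:
        # Assumption of this sketch: avoid a division by zero on empty samples.
        return [0.0] * len(port_uops)
    return [count / total for count in port_uops]


# Made-up counts for ports 0 to 7 (100 uOPs in total).
ratios = port_usage_ratios([40, 30, 10, 10, 5, 3, 1, 1])
```

Because PMC4-7 may also include kernel-level counts, as described above, the
ratios of ports 4 to 7 should be compared with those of ports 0 to 3 with care.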
49
collectors/likwid/groups/broadwellEP/QPI.txt
Normal file
@@ -0,0 +1,49 @@
SHORT QPI Link Layer data

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
QBOX0C0 RXL_FLITS_G0_DATA
QBOX1C0 RXL_FLITS_G0_DATA
QBOX0C1 RXL_FLITS_G0_NON_DATA
QBOX1C1 RXL_FLITS_G0_NON_DATA
QBOX0C2 TXL_FLITS_G0_DATA
QBOX1C2 TXL_FLITS_G0_DATA
QBOX0C3 TXL_FLITS_G0_NON_DATA
QBOX1C3 TXL_FLITS_G0_NON_DATA


METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
QPI send data volume [GByte] 1.E-09*(QBOX0C2+QBOX1C2)*8
QPI send data bandwidth [MByte/s] 1.E-06*(QBOX0C2+QBOX1C2)*8/time
QPI send link volume [GByte] 1.E-09*(QBOX0C2+QBOX1C2+QBOX0C3+QBOX1C3)*8
QPI send link bandwidth [MByte/s] 1.E-06*(QBOX0C2+QBOX1C2+QBOX0C3+QBOX1C3)*8/time
QPI receive data volume [GByte] 1.E-09*(QBOX0C0+QBOX1C0)*8
QPI receive data bandwidth [MByte/s] 1.E-06*(QBOX0C0+QBOX1C0)*8/time
QPI receive link volume [GByte] 1.E-09*(QBOX0C0+QBOX1C0+QBOX0C1+QBOX1C1)*8
QPI receive link bandwidth [MByte/s] 1.E-06*(QBOX0C0+QBOX1C0+QBOX0C1+QBOX1C1)*8/time
QPI total transfer volume [GByte] 1.E-09*(QBOX0C0+QBOX1C0+QBOX0C2+QBOX1C2+QBOX0C1+QBOX1C1+QBOX0C3+QBOX1C3)*8
QPI total bandwidth [MByte/s] 1.E-06*(QBOX0C0+QBOX1C0+QBOX0C2+QBOX1C2+QBOX0C1+QBOX1C1+QBOX0C3+QBOX1C3)*8/time

LONG
Formulas:
QPI send data volume [GByte] = 1.E-09*(sum(TXL_FLITS_G0_DATA)*8)
QPI send data bandwidth [MByte/s] = 1.E-06*(sum(TXL_FLITS_G0_DATA)*8)/runtime
QPI send link volume [GByte] = 1.E-09*((sum(TXL_FLITS_G0_DATA)+sum(TXL_FLITS_G0_NON_DATA))*8)
QPI send link bandwidth [MByte/s] = 1.E-06*((sum(TXL_FLITS_G0_DATA)+sum(TXL_FLITS_G0_NON_DATA))*8)/runtime
QPI receive data volume [GByte] = 1.E-09*(sum(RXL_FLITS_G0_DATA)*8)
QPI receive data bandwidth [MByte/s] = 1.E-06*(sum(RXL_FLITS_G0_DATA)*8)/runtime
QPI receive link volume [GByte] = 1.E-09*((sum(RXL_FLITS_G0_DATA)+sum(RXL_FLITS_G0_NON_DATA))*8)
QPI receive link bandwidth [MByte/s] = 1.E-06*((sum(RXL_FLITS_G0_DATA)+sum(RXL_FLITS_G0_NON_DATA))*8)/runtime
QPI total transfer volume [GByte] = 1.E-09*(sum(TXL_FLITS_G0_DATA)+sum(TXL_FLITS_G0_NON_DATA)+sum(RXL_FLITS_G0_DATA)+sum(RXL_FLITS_G0_NON_DATA))*8
QPI total bandwidth [MByte/s] = 1.E-06*(sum(TXL_FLITS_G0_DATA)+sum(TXL_FLITS_G0_NON_DATA)+sum(RXL_FLITS_G0_DATA)+sum(RXL_FLITS_G0_NON_DATA))*8/runtime
--
The Intel QPI Link Layer is responsible for packetizing requests from the caching agent (CBOXes)
on the way out to the system interface. For Broadwell EP systems, the Link Layer and the
Ring interface are separated. The QPI link volume contains header, data and trailer while the
QPI data volume counts only the data flits.
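The distinction between QPI data and link traffic in QPI.txt can be sketched in
Python as follows. The function name qpi_bandwidths, the list-per-event argument
layout and the sample flit counts are hypothetical; the factor of 8 bytes per flit
is taken from the formulas of the group file.

```python
FLIT_BYTES = 8  # factor applied to every flit count in the group file


def qpi_bandwidths(rx_data, rx_non_data, tx_data, tx_non_data, time_s):
    """Each list holds one flit count per QPI box (QBOX0, QBOX1)."""
    send_data = sum(tx_data) * FLIT_BYTES
    send_link = (sum(tx_data) + sum(tx_non_data)) * FLIT_BYTES
    recv_data = sum(rx_data) * FLIT_BYTES
    recv_link = (sum(rx_data) + sum(rx_non_data)) * FLIT_BYTES
    return {
        "QPI send data bandwidth [MByte/s]": 1.0e-06 * send_data / time_s,
        "QPI send link bandwidth [MByte/s]": 1.0e-06 * send_link / time_s,
        "QPI receive data bandwidth [MByte/s]": 1.0e-06 * recv_data / time_s,
        "QPI receive link bandwidth [MByte/s]": 1.0e-06 * recv_link / time_s,
        "QPI total bandwidth [MByte/s]": 1.0e-06 * (send_link + recv_link) / time_s,
    }


# Made-up flit counts for two QPI boxes over one second.
qpi = qpi_bandwidths(
    rx_data=[1_000_000, 0],
    rx_non_data=[0, 0],
    tx_data=[1_000_000, 1_000_000],
    tx_non_data=[500_000, 500_000],
    time_s=1.0,
)
```

The link bandwidth is never smaller than the data bandwidth, since it adds the
non-data flits on top of the data flits.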
35
collectors/likwid/groups/broadwellEP/TLB_DATA.txt
Normal file
@@ -0,0 +1,35 @@
SHORT L2 data TLB miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 DTLB_LOAD_MISSES_CAUSES_A_WALK
PMC1 DTLB_STORE_MISSES_CAUSES_A_WALK
PMC2 DTLB_LOAD_MISSES_WALK_DURATION
PMC3 DTLB_STORE_MISSES_WALK_DURATION

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1 DTLB load misses PMC0
L1 DTLB load miss rate PMC0/FIXC0
L1 DTLB load miss duration PMC2
L1 DTLB store misses PMC1
L1 DTLB store miss rate PMC1/FIXC0
L1 DTLB store miss duration PMC3

LONG
Formulas:
L1 DTLB load misses = DTLB_LOAD_MISSES_CAUSES_A_WALK
L1 DTLB load miss rate = DTLB_LOAD_MISSES_CAUSES_A_WALK / INSTR_RETIRED_ANY
L1 DTLB load miss duration = DTLB_LOAD_MISSES_WALK_DURATION
L1 DTLB store misses = DTLB_STORE_MISSES_CAUSES_A_WALK
L1 DTLB store miss rate = DTLB_STORE_MISSES_CAUSES_A_WALK / INSTR_RETIRED_ANY
L1 DTLB store miss duration = DTLB_STORE_MISSES_WALK_DURATION
-
The DTLB load and store miss rates give a measure of how often a TLB miss occurred
per instruction. The duration measures how long, in cycles, the page walks took.
28
collectors/likwid/groups/broadwellEP/TLB_INSTR.txt
Normal file
@@ -0,0 +1,28 @@
SHORT L1 Instruction TLB miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 ITLB_MISSES_CAUSES_A_WALK
PMC1 ITLB_MISSES_WALK_DURATION

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
L1 ITLB misses PMC0
L1 ITLB miss rate PMC0/FIXC0
L1 ITLB miss duration PMC1


LONG
Formulas:
L1 ITLB misses = ITLB_MISSES_CAUSES_A_WALK
L1 ITLB miss rate = ITLB_MISSES_CAUSES_A_WALK / INSTR_RETIRED_ANY
L1 ITLB miss duration = ITLB_MISSES_WALK_DURATION
-
The ITLB miss rate gives a measure of how often a TLB miss occurred
per instruction. The duration measures how long, in cycles, the page walks took.
48
collectors/likwid/groups/broadwellEP/TMA.txt
Normal file
@@ -0,0 +1,48 @@
SHORT Top down cycle allocation

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_ISSUED_ANY
PMC1 UOPS_RETIRED_RETIRE_SLOTS
PMC2 IDQ_UOPS_NOT_DELIVERED_CORE
PMC3 INT_MISC_RECOVERY_CYCLES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
IPC FIXC0/FIXC1
Total Slots 4*FIXC1
Slots Retired PMC1
Fetch Bubbles PMC2
Recovery Bubbles 4*PMC3
Front End [%] PMC2/(4*FIXC1)*100
Speculation [%] (PMC0-PMC1+(4*PMC3))/(4*FIXC1)*100
Retiring [%] PMC1/(4*FIXC1)*100
Back End [%] (1-((PMC2+PMC0+(4*PMC3))/(4*FIXC1)))*100

LONG
Formulas:
Total Slots = 4*CPU_CLK_UNHALTED_CORE
Slots Retired = UOPS_RETIRED_RETIRE_SLOTS
Fetch Bubbles = IDQ_UOPS_NOT_DELIVERED_CORE
Recovery Bubbles = 4*INT_MISC_RECOVERY_CYCLES
Front End [%] = IDQ_UOPS_NOT_DELIVERED_CORE/(4*CPU_CLK_UNHALTED_CORE)*100
Speculation [%] = (UOPS_ISSUED_ANY-UOPS_RETIRED_RETIRE_SLOTS+(4*INT_MISC_RECOVERY_CYCLES))/(4*CPU_CLK_UNHALTED_CORE)*100
Retiring [%] = UOPS_RETIRED_RETIRE_SLOTS/(4*CPU_CLK_UNHALTED_CORE)*100
Back End [%] = (1-((IDQ_UOPS_NOT_DELIVERED_CORE+UOPS_ISSUED_ANY+(4*INT_MISC_RECOVERY_CYCLES))/(4*CPU_CLK_UNHALTED_CORE)))*100
--
This performance group measures cycles to determine the percentage of time spent in
the front end, back end, retiring and speculation. These metrics are published and
verified by Intel. Further information:
Webpage describing the Top-Down Method and its usage in Intel VTune:
https://software.intel.com/en-us/vtune-amplifier-help-tuning-applications-using-a-top-down-microarchitecture-analysis-method
Paper by Ahmad Yasin:
https://sites.google.com/site/analysismethods/yasin-pubs/TopDown-Yasin-ISPASS14.pdf?attredirects=0
Slides by Ahmad Yasin:
http://www.cs.technion.ac.il/~erangi/TMA_using_Linux_perf__Ahmad_Yasin.pdf
The performance group was originally published here:
http://perf.mvermeulen.com/2018/04/14/top-down-performance-counter-analysis-part-1-likwid/
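The four top-level categories of TMA.txt can be sketched in Python as follows. The
function name top_down and the sample counter values are hypothetical; the
arithmetic follows the formulas of the group file, including the factor of 4 issue
slots per cycle.

```python
def top_down(cycles, uops_issued, retire_slots, fetch_bubbles, recovery_cycles):
    """Evaluate the four top-level TMA percentages from raw counter values."""
    total_slots = 4 * cycles
    recovery_bubbles = 4 * recovery_cycles
    front_end = fetch_bubbles / total_slots * 100
    speculation = (uops_issued - retire_slots + recovery_bubbles) / total_slots * 100
    retiring = retire_slots / total_slots * 100
    back_end = (1 - (fetch_bubbles + uops_issued + recovery_bubbles) / total_slots) * 100
    return {
        "Front End [%]": front_end,
        "Speculation [%]": speculation,
        "Retiring [%]": retiring,
        "Back End [%]": back_end,
    }


# Made-up sample: 1000 unhalted core cycles, i.e. 4000 issue slots.
tma = top_down(cycles=1000, uops_issued=2200, retire_slots=2000,
               fetch_bubbles=400, recovery_cycles=50)
```

By construction the four percentages add up to 100: the back end share is defined
as whatever remains after the front end, speculation and retiring slots are
subtracted from the total slots.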
35
collectors/likwid/groups/broadwellEP/UOPS.txt
Normal file
@@ -0,0 +1,35 @@
SHORT UOPs execution info

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 UOPS_ISSUED_ANY
PMC1 UOPS_EXECUTED_THREAD
PMC2 UOPS_RETIRED_ALL
PMC3 UOPS_ISSUED_FLAGS_MERGE



METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Issued UOPs PMC0
Merged UOPs PMC3
Executed UOPs PMC1
Retired UOPs PMC2

LONG
Formulas:
Issued UOPs = UOPS_ISSUED_ANY
Merged UOPs = UOPS_ISSUED_FLAGS_MERGE
Executed UOPs = UOPS_EXECUTED_THREAD
Retired UOPs = UOPS_RETIRED_ALL
-
This group returns information about the instruction pipeline. It measures the
issued, executed and retired uOPs. From these counts one can derive the number of
uOPs which were issued but not executed as well as the number of uOPs which were
executed but never retired.
The executed but not retired uOPs commonly come from speculatively executed branches.
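The two differences mentioned in the UOPS.txt description are not part of its
METRICS section, but they can be derived from the reported counts. A minimal
Python sketch, with a hypothetical function name and made-up counts. Whether the
raw counts are directly comparable depends on how the hardware counts each event,
which the group file does not specify, so the results should be read as
indicative only.

```python
def uop_differences(issued, executed, retired):
    """Derive the two differences mentioned in the group description."""
    return {
        "issued but not executed": issued - executed,
        "executed but not retired": executed - retired,
    }


# Made-up sample counts.
diff = uop_differences(issued=1_200, executed=1_100, retired=1_000)
```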