Add likwid collector

Thomas Roehl
2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions

@@ -0,0 +1,22 @@
SHORT L1 compute to data access ratio
EVENTSET
PMC0 VPU_ELEMENTS_ACTIVE
PMC1 DATA_READ_OR_WRITE
METRICS
Runtime (RDTSC) [s] time
L1 compute intensity PMC0/PMC1
LONG
Formulas:
L1 compute intensity = VPU_ELEMENTS_ACTIVE/DATA_READ_OR_WRITE
-
This metric is a way to measure the computational density of an
application, or how many computations it is performing on average for each
piece of data loaded. L1 compute to data access ratio should be
used to judge suitability of an application for running on the Intel MIC
architecture. Applications that will perform well on the Intel MIC
architecture should be vectorized, and ideally be able to perform multiple
operations on the same pieces of data (or same cache lines).

@@ -0,0 +1,22 @@
SHORT L2 compute to data access ratio
EVENTSET
PMC0 VPU_ELEMENTS_ACTIVE
PMC1 DATA_READ_MISS_OR_WRITE_MISS
METRICS
Runtime (RDTSC) [s] time
L2 compute intensity PMC0/PMC1
LONG
Formulas:
L2 compute intensity = VPU_ELEMENTS_ACTIVE/DATA_READ_MISS_OR_WRITE_MISS
-
This metric is a way to measure the computational density of an
application, or how many computations it is performing on average for each
piece of data loaded. L2 compute to data access ratio should be
used to judge suitability of an application for running on the Intel MIC
architecture. Applications that will perform well on the Intel MIC
architecture should be vectorized, and ideally be able to perform multiple
operations on the same pieces of data (or same cache lines).

@@ -0,0 +1,23 @@
SHORT Cycles per instruction
EVENTSET
PMC0 INSTRUCTIONS_EXECUTED
PMC1 CPU_CLK_UNHALTED
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
CPI PMC1/PMC0
IPC PMC0/PMC1
LONG
Formulas:
CPI = CPU_CLK_UNHALTED/INSTRUCTIONS_EXECUTED
IPC = INSTRUCTIONS_EXECUTED/CPU_CLK_UNHALTED
-
This group measures how efficiently the processor works with
regard to instruction throughput. Also important as a standalone
metric is INSTRUCTIONS_EXECUTED, as it tells you how many instructions
you need to execute for a task. An optimization might show very
low CPI values but execute many more instructions for it.
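
The caveat about low CPI values can be illustrated with a minimal Python
sketch. The function name and all counter values are hypothetical and chosen
only for illustration; they are not measurements.

```python
def cpi_metrics(instructions_executed, cpu_clk_unhalted):
    """Return (CPI, IPC) using the formulas of this group."""
    cpi = cpu_clk_unhalted / instructions_executed
    ipc = instructions_executed / cpu_clk_unhalted
    return cpi, ipc

# Hypothetical counts: the "optimized" variant has the lower CPI,
# but it executes more instructions and needs more cycles overall.
baseline_instr, baseline_cycles = 1_000_000, 2_000_000
variant_instr, variant_cycles = 4_000_000, 4_000_000

baseline_cpi, baseline_ipc = cpi_metrics(baseline_instr, baseline_cycles)
variant_cpi, variant_ipc = cpi_metrics(variant_instr, variant_cycles)

print(f"baseline: CPI={baseline_cpi:.2f} cycles={baseline_cycles}")
print(f"variant:  CPI={variant_cpi:.2f} cycles={variant_cycles}")
```

In this made-up case the variant halves the CPI while doubling the cycle
count, which is why the instruction count should be read alongside CPI.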

@@ -0,0 +1,18 @@
SHORT Memory bandwidth
EVENTSET
PMC0 DATA_READ_MISS_OR_WRITE_MISS
PMC1 DATA_CACHE_LINES_WRITTEN_BACK
METRICS
Runtime (RDTSC) [s] time
Memory bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
Memory bandwidth [MBytes/s] = 1.0E-06*(DATA_READ_MISS_OR_WRITE_MISS+DATA_CACHE_LINES_WRITTEN_BACK)*64.0/time
Memory data volume [GBytes] = 1.0E-09*(DATA_READ_MISS_OR_WRITE_MISS+DATA_CACHE_LINES_WRITTEN_BACK)*64.0
-
Total memory bandwidth and data volume.
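
As a worked example, the two formulas above can be evaluated by hand or with
a short Python sketch. The function name and the counter readings are
hypothetical; the factor 64.0 is the one used in the formulas.

```python
CACHE_LINE_BYTES = 64.0  # the 64.0 factor from the group formulas

def memory_metrics(read_write_misses, lines_written_back, runtime_s):
    """Return (bandwidth in MBytes/s, data volume in GBytes)."""
    lines = read_write_misses + lines_written_back
    bandwidth = 1.0e-06 * lines * CACHE_LINE_BYTES / runtime_s
    volume = 1.0e-09 * lines * CACHE_LINE_BYTES
    return bandwidth, volume

# Hypothetical counter readings, not measured values.
bw, vol = memory_metrics(read_write_misses=30_000_000,
                         lines_written_back=10_000_000,
                         runtime_s=2.0)
print(f"{bw:.1f} MBytes/s, {vol:.2f} GBytes")
```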

@@ -0,0 +1,18 @@
SHORT L2 write misses
EVENTSET
PMC0 L2_DATA_WRITE_MISS_MEM_FILL
METRICS
Runtime (RDTSC) [s] time
L2 RFO bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2 RFO data volume [GBytes] 1.0E-09*PMC0*64.0
LONG
Formulas:
L2 RFO bandwidth [MBytes/s] = 1.0E-06*L2_DATA_WRITE_MISS_MEM_FILL*64.0/time
L2 RFO data volume [GBytes] = 1.0E-09*L2_DATA_WRITE_MISS_MEM_FILL*64.0
-
Bandwidth and data volume fetched from memory due to an L2 data write miss. These
fetches are commonly called write-allocate loads or read-for-ownership (RFO).

@@ -0,0 +1,17 @@
SHORT L2 read misses
EVENTSET
PMC0 L2_DATA_READ_MISS_MEM_FILL
METRICS
Runtime (RDTSC) [s] time
L2 read bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2 read data volume [GBytes] 1.0E-09*PMC0*64.0
LONG
Formulas:
L2 read bandwidth [MBytes/s] = 1.0E-06*L2_DATA_READ_MISS_MEM_FILL*64.0/time
L2 read data volume [GBytes] = 1.0E-09*L2_DATA_READ_MISS_MEM_FILL*64.0
-
The data volume and bandwidth caused by read misses in the L2 cache.

@@ -0,0 +1,17 @@
SHORT HW prefetch transfers
EVENTSET
PMC0 HWP_L2MISS
METRICS
Runtime (RDTSC) [s] time
Prefetch bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
Prefetch data volume [GBytes] 1.0E-09*PMC0*64.0
LONG
Formulas:
Prefetch bandwidth [MBytes/s] = 1.0E-06*HWP_L2MISS*64.0/time
Prefetch data volume [GBytes] = 1.0E-09*HWP_L2MISS*64.0
-
The bandwidth and data volume caused by L2 misses from the hardware prefetcher.

@@ -0,0 +1,17 @@
SHORT L2 victim requests
EVENTSET
PMC0 L2_VICTIM_REQ_WITH_DATA
METRICS
Runtime (RDTSC) [s] time
Victim bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
Victim data volume [GBytes] 1.0E-09*PMC0*64.0
LONG
Formulas:
Victim bandwidth [MBytes/s] = 1.0E-06*L2_VICTIM_REQ_WITH_DATA*64.0/time
Victim data volume [GBytes] = 1.0E-09*L2_VICTIM_REQ_WITH_DATA*64.0
-
Data volume and bandwidth caused by cache line victims.

@@ -0,0 +1,19 @@
SHORT L2 snoop hits
EVENTSET
PMC0 SNP_HITM_L2
METRICS
Runtime (RDTSC) [s] time
Snoop bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
Snoop data volume [GBytes] 1.0E-09*PMC0*64.0
LONG
Formulas:
Snoop bandwidth [MBytes/s] = 1.0E-06*SNP_HITM_L2*64.0/time
Snoop data volume [GBytes] = 1.0E-09*SNP_HITM_L2*64.0
-
Snoop traffic caused by HITM requests. HITM requests are L2 requests that
are served by another core's L2 cache but the remote cache line is in modified
state.

@@ -0,0 +1,17 @@
SHORT L2 read misses
EVENTSET
PMC0 L2_READ_MISS
METRICS
Runtime (RDTSC) [s] time
L2 read bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
L2 read data volume [GBytes] 1.0E-09*PMC0*64.0
LONG
Formulas:
L2 read bandwidth [MBytes/s] = 1.0E-06*L2_READ_MISS*64.0/time
L2 read data volume [GBytes] = 1.0E-09*L2_READ_MISS*64.0
-
Data volume and bandwidth caused by read misses in the L2 cache.

@@ -0,0 +1,20 @@
SHORT Memory read bandwidth
EVENTSET
PMC0 DATA_READ_MISS
PMC1 HWP_L2MISS
METRICS
Runtime (RDTSC) [s] time
Memory read bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*(DATA_READ_MISS+HWP_L2MISS)*64.0/time
Memory read data volume [GBytes] = 1.0E-09*(DATA_READ_MISS+HWP_L2MISS)*64.0
-
Bandwidth and data volume of read operations from the memory to L2 cache. The
metric is introduced in the book 'Intel Xeon Phi Coprocessor High-Performance
Programming' by James Jeffers and James Reinders.

@@ -0,0 +1,20 @@
SHORT Memory write bandwidth
EVENTSET
PMC0 L2_VICTIM_REQ_WITH_DATA
PMC1 SNP_HITM_L2
METRICS
Runtime (RDTSC) [s] time
Memory write bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
LONG
Formulas:
Memory write bandwidth [MBytes/s] = 1.0E-06*(L2_VICTIM_REQ_WITH_DATA+SNP_HITM_L2)*64.0/time
Memory write data volume [GBytes] = 1.0E-09*(L2_VICTIM_REQ_WITH_DATA+SNP_HITM_L2)*64.0
-
Bandwidth and data volume of write operations from the L2 cache to memory. The
metric is introduced in the book 'Intel Xeon Phi Coprocessor High-Performance
Programming' by James Jeffers and James Reinders.

@@ -0,0 +1,21 @@
SHORT Pairing ratio
EVENTSET
PMC0 INSTRUCTIONS_EXECUTED
PMC1 INSTRUCTIONS_EXECUTED_V_PIPE
METRICS
Runtime (RDTSC) [s] time
V-pipe ratio PMC1/PMC0
Pairing ratio PMC1/(PMC0-PMC1)
LONG
Formulas:
V-pipe ratio = INSTRUCTIONS_EXECUTED_V_PIPE/INSTRUCTIONS_EXECUTED
Pairing ratio = INSTRUCTIONS_EXECUTED_V_PIPE/(INSTRUCTIONS_EXECUTED-INSTRUCTIONS_EXECUTED_V_PIPE)
-
Each hardware thread on the Xeon Phi can execute two instructions simultaneously,
one in the U-pipe and one in the V-pipe. But this is only possible if the
instructions can be paired. The instructions executed in paired fashion are counted
by the event INSTRUCTIONS_EXECUTED_V_PIPE. The event INSTRUCTIONS_EXECUTED increments
for each instruction, hence the maximal increase per cycle can be 2.
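
The two ratios can be sketched in Python as follows. The function name and
the counts are hypothetical; the second call assumes the idealized case in
which every instruction is paired, so that half of all instructions go
through the V-pipe.

```python
def pairing_metrics(instructions_executed, v_pipe_instructions):
    """Return (V-pipe ratio, pairing ratio) as defined in this group."""
    # Instructions that did not go through the V-pipe.
    u_pipe_instructions = instructions_executed - v_pipe_instructions
    if u_pipe_instructions <= 0:
        raise ValueError("V-pipe count must be below the total count")
    v_pipe_ratio = v_pipe_instructions / instructions_executed
    pairing_ratio = v_pipe_instructions / u_pipe_instructions
    return v_pipe_ratio, pairing_ratio

# Hypothetical counts: a quarter of the instructions used the V-pipe.
partial = pairing_metrics(1_000_000, 250_000)
# Idealized case: every instruction is paired.
ideal = pairing_metrics(1_000_000, 500_000)

print(f"partial: V-pipe ratio={partial[0]:.2f} pairing ratio={partial[1]:.2f}")
print(f"ideal:   V-pipe ratio={ideal[0]:.2f} pairing ratio={ideal[1]:.2f}")
```

Under that assumption the V-pipe ratio tops out at 0.5 and the pairing ratio
at 1.0.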

@@ -0,0 +1,15 @@
SHORT Miss ratio for data reads
EVENTSET
PMC0 DATA_READ
PMC1 DATA_READ_MISS
METRICS
Runtime (RDTSC) [s] time
Read miss ratio PMC1/PMC0
LONG
Formulas:
Read miss ratio = DATA_READ_MISS/DATA_READ
--
Miss ratio for data reads.

@@ -0,0 +1,23 @@
SHORT TLB Misses
EVENTSET
PMC0 LONG_DATA_PAGE_WALK
PMC1 DATA_PAGE_WALK
METRICS
Runtime (RDTSC) [s] time
L1 TLB misses [misses/s] PMC1/time
L2 TLB misses [misses/s] PMC0/time
L1 TLB misses per L2 TLB miss PMC1/PMC0
LONG
Formulas:
L1 TLB misses [misses/s] = DATA_PAGE_WALK/time
L2 TLB misses [misses/s] = LONG_DATA_PAGE_WALK/time
L1 TLB misses per L2 TLB miss = DATA_PAGE_WALK/LONG_DATA_PAGE_WALK
-
Analysis of the layered TLB of the Intel Xeon Phi. According to the book
'Intel Xeon Phi Coprocessor High-Performance Programming' by James Jeffers and
James Reinders, a high L1 TLB misses per L2 TLB miss ratio suggests that your
working set fits into the L2 TLB but not in L1 TLB. Using large pages may be
beneficial.
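
A minimal Python sketch of the three metrics of this group is given below.
The function name and the counter values are hypothetical, and the sketch
deliberately does not fix a threshold for what counts as a "high" ratio.

```python
def tlb_metrics(data_page_walk, long_data_page_walk, runtime_s):
    """Return (L1 TLB misses/s, L2 TLB misses/s, L1 misses per L2 miss)."""
    l1_miss_rate = data_page_walk / runtime_s
    l2_miss_rate = long_data_page_walk / runtime_s
    l1_per_l2 = data_page_walk / long_data_page_walk
    return l1_miss_rate, l2_miss_rate, l1_per_l2

# Hypothetical counter readings, not measured values.
l1_rate, l2_rate, l1_per_l2 = tlb_metrics(data_page_walk=5_000_000,
                                          long_data_page_walk=50_000,
                                          runtime_s=1.0)
print(f"L1 TLB misses/s: {l1_rate:.0f}")
print(f"L2 TLB misses/s: {l2_rate:.0f}")
print(f"L1 TLB misses per L2 TLB miss: {l1_per_l2:.1f}")
```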

@@ -0,0 +1,23 @@
SHORT L1 TLB misses
EVENTSET
PMC0 DATA_PAGE_WALK
PMC1 DATA_READ_OR_WRITE
METRICS
Runtime (RDTSC) [s] time
L1 TLB misses [misses/s] PMC0/time
L1 TLB miss ratio PMC0/PMC1
LONG
Formulas:
L1 TLB misses [misses/s] = DATA_PAGE_WALK/time
L1 TLB miss ratio = DATA_PAGE_WALK/DATA_READ_OR_WRITE
-
This performance group measures the L1 TLB misses. An L1 TLB miss that hits the
L2 TLB has a penalty of about 25 cycles for 4kB pages. For 2MB pages, the penalty
for an L1 TLB miss that hits the L2 TLB is about 8 cycles. The minimal L1 TLB miss
ratio is about 1/64, so a high ratio indicates bad spatial locality: the data of
a page is only partly accessed. It can also indicate thrashing: when multiple
pages are accessed in a loop iteration, the size and associativity of the TLB are
not sufficient to hold all pages.

@@ -0,0 +1,21 @@
SHORT L2 TLB misses
EVENTSET
PMC0 LONG_DATA_PAGE_WALK
PMC1 DATA_READ_OR_WRITE
METRICS
Runtime (RDTSC) [s] time
L2 TLB misses [misses/s] PMC0/time
L2 TLB miss ratio PMC0/PMC1
LONG
Formulas:
L2 TLB misses [misses/s] = LONG_DATA_PAGE_WALK/time
L2 TLB miss ratio = LONG_DATA_PAGE_WALK/DATA_READ_OR_WRITE
-
This performance group measures the L2 TLB misses. An L2 TLB miss has a penalty
of at least 100 cycles, hence it is important to avoid them. A high ratio can
indicate thrashing: when multiple pages are accessed in a loop iteration,
the size and associativity of the TLB are not sufficient to hold all pages. This
would also result in a bad ratio for the L1 TLB.

@@ -0,0 +1,21 @@
SHORT Vectorization intensity
EVENTSET
PMC0 VPU_INSTRUCTIONS_EXECUTED
PMC1 VPU_ELEMENTS_ACTIVE
METRICS
Runtime (RDTSC) [s] time
Vectorization intensity PMC1/PMC0
LONG
Formulas:
Vectorization intensity = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED
-
Vector instructions include instructions that perform floating-point
operations, instructions that load vector registers from memory and store them
to memory, instructions to manipulate vector mask registers, and other special
purpose instructions such as vector shuffle.
According to the book 'Intel Xeon Phi Coprocessor High-Performance Programming'
by James Jeffers and James Reinders, the vectorization intensity should be >=8
for double precision and >=16 for single precision.
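
The guideline above can be checked with a small Python sketch. The function
names, the dictionary and the counter values are hypothetical; only the
thresholds 8 and 16 come from the text above.

```python
# Thresholds quoted above for double and single precision.
GUIDELINE = {"double": 8.0, "single": 16.0}

def vectorization_intensity(vpu_elements_active, vpu_instructions_executed):
    """Average number of active vector elements per VPU instruction."""
    return vpu_elements_active / vpu_instructions_executed

def meets_guideline(intensity, precision):
    """True if the intensity reaches the quoted threshold."""
    return intensity >= GUIDELINE[precision]

# Hypothetical counter readings, not measured values.
intensity = vectorization_intensity(vpu_elements_active=6_000_000,
                                    vpu_instructions_executed=1_000_000)
print(f"vectorization intensity: {intensity:.1f}")
print(f"meets double precision guideline: {meets_guideline(intensity, 'double')}")
```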

@@ -0,0 +1,20 @@
SHORT Vector unit usage
EVENTSET
PMC0 VPU_INSTRUCTIONS_EXECUTED
PMC1 VPU_STALL_REG
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
VPU stall ratio [%] 100*(PMC1/PMC0)
LONG
Formulas:
VPU stall ratio [%] = 100*(VPU_STALL_REG/VPU_INSTRUCTIONS_EXECUTED)
--
This group measures how efficiently the processor works with
regard to vector instruction throughput. The event VPU_STALL_REG counts
the VPU stalls due to data dependencies. Dependencies are read-after-write,
write-after-write and write-after-read.

@@ -0,0 +1,18 @@
SHORT VPU filling for double precision data
EVENTSET
PMC0 VPU_INSTRUCTIONS_EXECUTED
PMC1 VPU_ELEMENTS_ACTIVE
METRICS
Runtime (RDTSC) [s] time
VPU fill ratio PMC0*8/PMC1
LONG
Formulas:
VPU fill ratio = VPU_INSTRUCTIONS_EXECUTED*8/VPU_ELEMENTS_ACTIVE
--
This performance group measures the number of vector instructions that are
performed on each vector loaded to the VPU. It is important to increase the
ratio to get a high throughput because memory accesses (loading data to the VPU)
are expensive.

@@ -0,0 +1,20 @@
SHORT VPU pairing ratio
EVENTSET
PMC0 VPU_INSTRUCTIONS_EXECUTED
PMC1 VPU_INSTRUCTIONS_EXECUTED_V_PIPE
METRICS
Runtime (RDTSC) [s] time
V-pipe ratio PMC1/PMC0
Pairing ratio PMC1/(PMC0-PMC1)
LONG
Formulas:
V-pipe ratio = VPU_INSTRUCTIONS_EXECUTED_V_PIPE/VPU_INSTRUCTIONS_EXECUTED
Pairing ratio = VPU_INSTRUCTIONS_EXECUTED_V_PIPE/(VPU_INSTRUCTIONS_EXECUTED-VPU_INSTRUCTIONS_EXECUTED_V_PIPE)
--
This performance group measures the pairing ratio of vector instructions. The
V-pipe can only execute a subset of all instructions; the main workload is done
by the U-pipe. A higher throughput can be achieved if the pairing ratio is
increased.

@@ -0,0 +1,16 @@
SHORT Miss ratio for VPU data reads
EVENTSET
PMC0 VPU_DATA_READ
PMC1 VPU_DATA_READ_MISS
METRICS
Runtime (RDTSC) [s] time
VPU read miss ratio PMC1/PMC0
LONG
Formulas:
VPU read miss ratio = VPU_DATA_READ_MISS/VPU_DATA_READ
--
This performance group determines the fraction of data reads issued by the VPU
that miss the cache.

@@ -0,0 +1,16 @@
SHORT Miss ratio for VPU data writes
EVENTSET
PMC0 VPU_DATA_WRITE
PMC1 VPU_DATA_WRITE_MISS
METRICS
Runtime (RDTSC) [s] time
VPU write miss ratio PMC1/PMC0
LONG
Formulas:
VPU write miss ratio = VPU_DATA_WRITE_MISS/VPU_DATA_WRITE
--
This performance group determines the fraction of data writes issued by the VPU
that miss the cache.

@@ -0,0 +1,15 @@
SHORT Miss ratio for data writes
EVENTSET
PMC0 DATA_WRITE
PMC1 DATA_WRITE_MISS
METRICS
Runtime (RDTSC) [s] time
Write miss ratio PMC1/PMC0
LONG
Formulas:
Write miss ratio = DATA_WRITE_MISS/DATA_WRITE
--
Miss ratio for data writes.