Add likwid collector

2025-12-18 21:26:18 +01:00 · 2021-03-25 14:47:10 +01:00
parent 4fddcb9741
commit a6ac0c5373
670 changed files with 24926 additions and 0 deletions
--- a/collectors/likwid/groups/phi/CACHE.txt
+++ b/collectors/likwid/groups/phi/CACHE.txt
@@ -0,0 +1,22 @@
+SHORT L1 compute to data access ratio
+
+EVENTSET
+PMC0  VPU_ELEMENTS_ACTIVE
+PMC1  DATA_READ_OR_WRITE
+
+METRICS
+Runtime (RDTSC) [s] time
+L1 compute intensity   PMC0/PMC1
+
+LONG
+Formulas:
+L1 compute intensity = VPU_ELEMENTS_ACTIVE/DATA_READ_OR_WRITE
+-
+This metric is a way to measure the computational density of an
+application, or how many computations it is performing on average for each
+piece of data loaded. L1 compute to data access ratio should be
+used to judge suitability of an application for running on the Intel MIC
+architecture. Applications that will perform well on the Intel MIC
+architecture should be vectorized, and ideally be able to perform multiple
+operations on the same pieces of data (or same cache lines).
+
--- a/collectors/likwid/groups/phi/COMPUTE_TO_DATA_RATIO.txt
+++ b/collectors/likwid/groups/phi/COMPUTE_TO_DATA_RATIO.txt
@@ -0,0 +1,22 @@
+SHORT L2 compute to data access ratio
+
+EVENTSET
+PMC0  VPU_ELEMENTS_ACTIVE
+PMC1  DATA_READ_MISS_OR_WRITE_MISS
+
+METRICS
+Runtime (RDTSC) [s] time
+L2 compute intensity   PMC0/PMC1
+
+LONG
+Formulas:
+L2 compute intensity = VPU_ELEMENTS_ACTIVE/DATA_READ_MISS_OR_WRITE_MISS
+-
+This metric is a way to measure the computational density of an
+application, or how many computations it is performing on average for each
+piece of data loaded. L2 compute to data access ratio should be
+used to judge suitability of an application for running on the Intel MIC
+architecture. Applications that will perform well on the Intel MIC
+architecture should be vectorized, and ideally be able to perform multiple
+operations on the same pieces of data (or same cache lines).
+
--- a/collectors/likwid/groups/phi/CPI.txt
+++ b/collectors/likwid/groups/phi/CPI.txt
@@ -0,0 +1,23 @@
+SHORT  Cycles per instruction
+
+EVENTSET
+PMC0  INSTRUCTIONS_EXECUTED
+PMC1  CPU_CLK_UNHALTED
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]  PMC1*inverseClock
+CPI   PMC1/PMC0
+IPC   PMC0/PMC1
+
+LONG
+Formulas:
+CPI = CPU_CLK_UNHALTED/INSTRUCTIONS_EXECUTED
+IPC = INSTRUCTIONS_EXECUTED/CPU_CLK_UNHALTED
+-
+This group measures how efficiently the processor works with
+regard to instruction throughput. Also important as a standalone
+metric is INSTRUCTIONS_EXECUTED as it tells you how many instructions
+you need to execute for a task. An optimization might show very
+low CPI values but execute many more instructions for it.
+
--- a/collectors/likwid/groups/phi/MEM.txt
+++ b/collectors/likwid/groups/phi/MEM.txt
@@ -0,0 +1,18 @@
+SHORT Memory bandwidth
+
+EVENTSET
+PMC0  DATA_READ_MISS_OR_WRITE_MISS
+PMC1  DATA_CACHE_LINES_WRITTEN_BACK
+
+
+METRICS
+Runtime (RDTSC) [s] time
+Memory bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
+Memory data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
+
+LONG
+Formulas:
+Memory bandwidth [MBytes/s] = 1.0E-06*(DATA_READ_MISS_OR_WRITE_MISS+DATA_CACHE_LINES_WRITTEN_BACK)*64.0/time
+Memory data volume [GBytes] = 1.0E-09*(DATA_READ_MISS_OR_WRITE_MISS+DATA_CACHE_LINES_WRITTEN_BACK)*64.0
+-
+Total memory bandwidth and data volume.
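Review note: the two MEM formulas above can be checked by hand with a minimal Python sketch. It assumes the raw counter values and the runtime have already been read out by some other means; the function name `memory_metrics` and the sample numbers are hypothetical and not part of LIKWID.

```python
CACHE_LINE_BYTES = 64.0  # cache line size used by the formulas in this group


def memory_metrics(read_write_misses, lines_written_back, runtime_s):
    """Return (bandwidth in MBytes/s, data volume in GBytes).

    read_write_misses  -> count of DATA_READ_MISS_OR_WRITE_MISS (PMC0)
    lines_written_back -> count of DATA_CACHE_LINES_WRITTEN_BACK (PMC1)
    runtime_s          -> measured runtime in seconds ("time" in the group file)
    """
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    total_bytes = (read_write_misses + lines_written_back) * CACHE_LINE_BYTES
    bandwidth_mbytes_s = 1.0e-06 * total_bytes / runtime_s
    volume_gbytes = 1.0e-09 * total_bytes
    return bandwidth_mbytes_s, volume_gbytes


# Made-up counter values, for illustration only.
bandwidth, volume = memory_metrics(1_000_000, 500_000, 2.0)
print(f"{bandwidth:.1f} MBytes/s, {volume:.3f} GBytes")
```

The same shape applies to the MEM1 through MEM6 groups, which use a single counter in place of the sum.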
--- a/collectors/likwid/groups/phi/MEM1.txt
+++ b/collectors/likwid/groups/phi/MEM1.txt
@@ -0,0 +1,18 @@
+SHORT L2 write misses
+
+EVENTSET
+PMC0  L2_DATA_WRITE_MISS_MEM_FILL
+
+METRICS
+Runtime (RDTSC) [s] time
+L2 RFO bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
+L2 RFO data volume [GBytes] 1.0E-09*PMC0*64.0
+
+LONG
+Formulas:
+L2 RFO bandwidth [MBytes/s] = 1.0E-06*L2_DATA_WRITE_MISS_MEM_FILL*64.0/time
+L2 RFO data volume [GBytes] = 1.0E-09*L2_DATA_WRITE_MISS_MEM_FILL*64.0
+-
+Bandwidth and data volume fetched from memory due to an L2 data write miss. These
+fetches are commonly called write-allocate loads or read-for-ownership (RFO).
+
--- a/collectors/likwid/groups/phi/MEM2.txt
+++ b/collectors/likwid/groups/phi/MEM2.txt
@@ -0,0 +1,17 @@
+SHORT L2 read misses
+
+EVENTSET
+PMC0  L2_DATA_READ_MISS_MEM_FILL
+
+METRICS
+Runtime (RDTSC) [s] time
+L2 read bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
+L2 read data volume [GBytes] 1.0E-09*PMC0*64.0
+
+LONG
+Formulas:
+L2 read bandwidth [MBytes/s] = 1.0E-06*L2_DATA_READ_MISS_MEM_FILL*64.0/time
+L2 read data volume [GBytes] = 1.0E-09*L2_DATA_READ_MISS_MEM_FILL*64.0
+-
+The data volume and bandwidth caused by read misses in the L2 cache.
+
--- a/collectors/likwid/groups/phi/MEM3.txt
+++ b/collectors/likwid/groups/phi/MEM3.txt
@@ -0,0 +1,17 @@
+SHORT HW prefetch transfers
+
+EVENTSET
+PMC0  HWP_L2MISS
+
+METRICS
+Runtime (RDTSC) [s] time
+Prefetch bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
+Prefetch data volume [GBytes] 1.0E-09*PMC0*64.0
+
+LONG
+Formulas:
+Prefetch bandwidth [MBytes/s] = 1.0E-06*HWP_L2MISS*64.0/time
+Prefetch data volume [GBytes] = 1.0E-09*HWP_L2MISS*64.0
+-
+The bandwidth and data volume caused by L2 misses from the hardware prefetcher.
+
--- a/collectors/likwid/groups/phi/MEM4.txt
+++ b/collectors/likwid/groups/phi/MEM4.txt
@@ -0,0 +1,17 @@
+SHORT L2 victim requests
+
+EVENTSET
+PMC0  L2_VICTIM_REQ_WITH_DATA
+
+METRICS
+Runtime (RDTSC) [s] time
+Victim bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
+Victim data volume [GBytes] 1.0E-09*PMC0*64.0
+
+LONG
+Formulas:
+Victim bandwidth [MBytes/s] = 1.0E-06*L2_VICTIM_REQ_WITH_DATA*64.0/time
+Victim data volume [GBytes] = 1.0E-09*L2_VICTIM_REQ_WITH_DATA*64.0
+-
+Data volume and bandwidth caused by cache line victims.
+
--- a/collectors/likwid/groups/phi/MEM5.txt
+++ b/collectors/likwid/groups/phi/MEM5.txt
@@ -0,0 +1,19 @@
+SHORT L2 snoop hits
+
+EVENTSET
+PMC0  SNP_HITM_L2
+
+METRICS
+Runtime (RDTSC) [s] time
+Snoop bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
+Snoop data volume [GBytes] 1.0E-09*PMC0*64.0
+
+LONG
+Formulas:
+Snoop bandwidth [MBytes/s] = 1.0E-06*SNP_HITM_L2*64.0/time
+Snoop data volume [GBytes] = 1.0E-09*SNP_HITM_L2*64.0
+-
+Snoop traffic caused by HITM requests. HITM requests are L2 requests that
+are served by another core's L2 cache but the remote cache line is in modified
+state.
+
--- a/collectors/likwid/groups/phi/MEM6.txt
+++ b/collectors/likwid/groups/phi/MEM6.txt
@@ -0,0 +1,17 @@
+SHORT L2 read misses
+
+EVENTSET
+PMC0  L2_READ_MISS
+
+METRICS
+Runtime (RDTSC) [s] time
+L2 read bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time
+L2 read data volume [GBytes] 1.0E-09*PMC0*64.0
+
+LONG
+Formulas:
+L2 read bandwidth [MBytes/s] = 1.0E-06*L2_READ_MISS*64.0/time
+L2 read data volume [GBytes] = 1.0E-09*L2_READ_MISS*64.0
+-
+Data volume and bandwidth caused by read misses in the L2 cache.
+
--- a/collectors/likwid/groups/phi/MEM_READ.txt
+++ b/collectors/likwid/groups/phi/MEM_READ.txt
@@ -0,0 +1,20 @@
+SHORT Memory read bandwidth
+
+EVENTSET
+PMC0  DATA_READ_MISS
+PMC1  HWP_L2MISS
+
+
+METRICS
+Runtime (RDTSC) [s] time
+Memory read bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
+Memory read data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
+
+LONG
+Formulas:
+Memory read bandwidth [MBytes/s] = 1.0E-06*(DATA_READ_MISS+HWP_L2MISS)*64.0/time
+Memory read data volume [GBytes] = 1.0E-09*(DATA_READ_MISS+HWP_L2MISS)*64.0
+-
+Bandwidth and data volume of read operations from the memory to L2 cache. The
+metric is introduced in the book 'Intel Xeon Phi Coprocessor High-Performance
+Programming' by James Jeffers and James Reinders.
--- a/collectors/likwid/groups/phi/MEM_WRITE.txt
+++ b/collectors/likwid/groups/phi/MEM_WRITE.txt
@@ -0,0 +1,20 @@
+SHORT Memory write bandwidth
+
+EVENTSET
+PMC0  L2_VICTIM_REQ_WITH_DATA
+PMC1  SNP_HITM_L2
+
+
+METRICS
+Runtime (RDTSC) [s] time
+Memory write bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
+Memory write data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
+
+LONG
+Formulas:
+Memory write bandwidth [MBytes/s] = 1.0E-06*(L2_VICTIM_REQ_WITH_DATA+SNP_HITM_L2)*64.0/time
+Memory write data volume [GBytes] = 1.0E-09*(L2_VICTIM_REQ_WITH_DATA+SNP_HITM_L2)*64.0
+-
+Bandwidth and data volume of write operations from the L2 cache to memory. The
+metric is introduced in the book 'Intel Xeon Phi Coprocessor High-Performance
+Programming' by James Jeffers and James Reinders.
--- a/collectors/likwid/groups/phi/PAIRING.txt
+++ b/collectors/likwid/groups/phi/PAIRING.txt
@@ -0,0 +1,21 @@
+SHORT Pairing ratio
+
+EVENTSET
+PMC0  INSTRUCTIONS_EXECUTED
+PMC1  INSTRUCTIONS_EXECUTED_V_PIPE
+
+METRICS
+Runtime (RDTSC) [s] time
+V-pipe ratio   PMC1/PMC0
+Pairing ratio PMC1/(PMC0-PMC1)
+
+LONG
+Formulas:
+V-pipe ratio = INSTRUCTIONS_EXECUTED_V_PIPE/INSTRUCTIONS_EXECUTED
+Pairing ratio = INSTRUCTIONS_EXECUTED_V_PIPE/(INSTRUCTIONS_EXECUTED-INSTRUCTIONS_EXECUTED_V_PIPE)
+-
+Each hardware thread on the Xeon Phi can execute two instructions simultaneously,
+one in the U-pipe and one in the V-pipe. But this is only possible if the
+instructions can be paired. The instructions executed in paired fashion are counted
+by the event INSTRUCTIONS_EXECUTED_V_PIPE. The event INSTRUCTIONS_EXECUTED increments
+for each instruction, hence the maximal increase per cycle can be 2.
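Review note: a small Python sketch of the two PAIRING metrics may help when reading the numbers. It assumes, as the description above states, that INSTRUCTIONS_EXECUTED counts every instruction and INSTRUCTIONS_EXECUTED_V_PIPE counts the paired ones; the function name `pairing_metrics` and the sample values are hypothetical.

```python
def pairing_metrics(instructions_executed, instructions_executed_v_pipe):
    """Return (V-pipe ratio, pairing ratio) as defined in the group file.

    V-pipe ratio  = V_PIPE / EXECUTED
    Pairing ratio = V_PIPE / (EXECUTED - V_PIPE)
    """
    # Instructions that did not go through the V-pipe (the U-pipe share).
    u_pipe = instructions_executed - instructions_executed_v_pipe
    if instructions_executed <= 0 or u_pipe <= 0:
        raise ValueError("counter values do not allow a ratio to be computed")
    v_pipe_ratio = instructions_executed_v_pipe / instructions_executed
    pairing_ratio = instructions_executed_v_pipe / u_pipe
    return v_pipe_ratio, pairing_ratio


# Made-up counter values, for illustration only.
v_ratio, p_ratio = pairing_metrics(1000, 250)
print(f"V-pipe ratio {v_ratio:.2f}, pairing ratio {p_ratio:.2f}")
```

With these definitions, a pairing ratio of 1 would mean every U-pipe instruction had a V-pipe partner, which corresponds to a V-pipe ratio of 0.5.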
--- a/collectors/likwid/groups/phi/READ_MISS_RATIO.txt
+++ b/collectors/likwid/groups/phi/READ_MISS_RATIO.txt
@@ -0,0 +1,15 @@
+SHORT Miss ratio for data reads
+
+EVENTSET
+PMC0  DATA_READ
+PMC1  DATA_READ_MISS
+
+METRICS
+Runtime (RDTSC) [s] time
+Read miss ratio PMC1/PMC0
+
+LONG
+Formulas:
+Read miss ratio = DATA_READ_MISS/DATA_READ
+--
+Miss ratio for data reads.
--- a/collectors/likwid/groups/phi/TLB.txt
+++ b/collectors/likwid/groups/phi/TLB.txt
@@ -0,0 +1,23 @@
+SHORT TLB Misses
+
+EVENTSET
+PMC0 LONG_DATA_PAGE_WALK
+PMC1 DATA_PAGE_WALK
+
+METRICS
+Runtime (RDTSC) [s] time
+L1 TLB misses [misses/s] PMC1/time
+L2 TLB misses [misses/s] PMC0/time
+L1 TLB misses per L2 TLB miss PMC1/PMC0
+
+LONG
+Formulas:
+L1 TLB misses [misses/s] = DATA_PAGE_WALK/time
+L2 TLB misses [misses/s] = LONG_DATA_PAGE_WALK/time
+L1 TLB misses per L2 TLB miss = DATA_PAGE_WALK/LONG_DATA_PAGE_WALK
+-
+Analysis of the layered TLB of the Intel Xeon Phi. According to the book
+'Intel Xeon Phi Coprocessor High-Performance Programming' by James Jeffers and
+James Reinders, a high L1 TLB misses per L2 TLB miss ratio suggests that your
+working set fits into the L2 TLB but not in L1 TLB. Using large pages may be
+beneficial.
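Review note: the three TLB metrics can be reproduced with the Python sketch below. The function name `tlb_metrics`, the choice to return `None` when no L2 TLB miss was counted, and the sample values are assumptions made for illustration; LIKWID itself may treat a zero denominator differently.

```python
def tlb_metrics(data_page_walk, long_data_page_walk, runtime_s):
    """Return (L1 TLB misses/s, L2 TLB misses/s, L1 misses per L2 miss).

    data_page_walk      -> count of DATA_PAGE_WALK (PMC1, L1 TLB misses)
    long_data_page_walk -> count of LONG_DATA_PAGE_WALK (PMC0, L2 TLB misses)
    The third value is None when no L2 TLB miss was counted.
    """
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    l1_rate = data_page_walk / runtime_s
    l2_rate = long_data_page_walk / runtime_s
    if long_data_page_walk > 0:
        l1_per_l2 = data_page_walk / long_data_page_walk
    else:
        l1_per_l2 = None  # assumption: leave the ratio undefined
    return l1_rate, l2_rate, l1_per_l2


# Made-up counter values, for illustration only.
print(tlb_metrics(4000, 100, 2.0))
```

A large third value is the case the description refers to: many L1 TLB misses that are mostly served by the L2 TLB.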
--- a/collectors/likwid/groups/phi/TLB_L1.txt
+++ b/collectors/likwid/groups/phi/TLB_L1.txt
@@ -0,0 +1,23 @@
+SHORT L1 TLB misses
+
+EVENTSET
+PMC0 DATA_PAGE_WALK
+PMC1 DATA_READ_OR_WRITE
+
+METRICS
+Runtime (RDTSC) [s] time
+L1 TLB misses [misses/s] PMC0/time
+L1 TLB miss ratio PMC0/PMC1
+
+LONG
+Formulas:
+L1 TLB misses [misses/s] = DATA_PAGE_WALK/time
+L1 TLB miss ratio = DATA_PAGE_WALK/DATA_READ_OR_WRITE
+-
+This performance group measures the L1 TLB misses. An L1 TLB miss that hits the
+L2 TLB has a penalty of about 25 cycles for 4kB pages. For 2MB pages, the penalty
+for an L1 TLB miss that hits the L2 TLB is about 8 cycles. The minimal L1 TLB miss ratio
+is about 1/64, so a high ratio indicates bad spatial locality: the data of a page
+is only partly accessed. It can also indicate thrashing, because when multiple pages
+are accessed in a loop iteration, the size and associativity are not sufficient to
+hold all pages.
--- a/collectors/likwid/groups/phi/TLB_L2.txt
+++ b/collectors/likwid/groups/phi/TLB_L2.txt
@@ -0,0 +1,21 @@
+SHORT L2 TLB misses
+
+EVENTSET
+PMC0 LONG_DATA_PAGE_WALK
+PMC1 DATA_READ_OR_WRITE
+
+METRICS
+Runtime (RDTSC) [s] time
+L2 TLB misses [misses/s] PMC0/time
+L2 TLB miss ratio PMC0/PMC1
+
+LONG
+Formulas:
+L2 TLB misses [misses/s] = LONG_DATA_PAGE_WALK/time
+L2 TLB miss ratio = LONG_DATA_PAGE_WALK/DATA_READ_OR_WRITE
+-
+This performance group measures the L2 TLB misses. An L2 TLB miss has a penalty
+of at least 100 cycles, hence it is important to avoid them. A high ratio can
+indicate thrashing, because when multiple pages are accessed in a loop iteration,
+the size and associativity are not sufficient to hold all pages. This would also
+result in a bad ratio for the L1 TLB.
--- a/collectors/likwid/groups/phi/VECTOR.txt
+++ b/collectors/likwid/groups/phi/VECTOR.txt
@@ -0,0 +1,21 @@
+SHORT  Vectorization intensity
+
+EVENTSET
+PMC0  VPU_INSTRUCTIONS_EXECUTED
+PMC1  VPU_ELEMENTS_ACTIVE
+
+METRICS
+Runtime (RDTSC) [s] time
+Vectorization intensity PMC1/PMC0
+
+LONG
+Formulas:
+Vectorization intensity = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED
+-
+Vector instructions include instructions that perform floating-point
+operations, instructions that load vector registers from memory and store them
+to memory, instructions to manipulate vector mask registers, and other special
+purpose instructions such as vector shuffle.
+According to the book 'Intel Xeon Phi Coprocessor High-Performance Programming'
+by James Jeffers and James Reinders, the vectorization intensity should be >=8
+for double precision and >=16 for single precision.
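Review note: the guideline quoted above can be turned into a quick check. This Python sketch only encodes the thresholds stated in the description (8 for double precision, 16 for single precision); the function names and sample counts are hypothetical.

```python
# Thresholds taken from the description above.
GUIDELINE = {"double": 8.0, "single": 16.0}


def vectorization_intensity(vpu_elements_active, vpu_instructions_executed):
    """Vectorization intensity = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED."""
    if vpu_instructions_executed <= 0:
        raise ValueError("no VPU instructions were counted")
    return vpu_elements_active / vpu_instructions_executed


def meets_guideline(intensity, precision):
    """Return True if the intensity reaches the quoted threshold."""
    return intensity >= GUIDELINE[precision]


# Made-up counter values, for illustration only.
intensity = vectorization_intensity(8000, 1000)
print(intensity, meets_guideline(intensity, "double"))
```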
--- a/collectors/likwid/groups/phi/VECTOR2.txt
+++ b/collectors/likwid/groups/phi/VECTOR2.txt
@@ -0,0 +1,20 @@
+SHORT  Vector unit usage
+
+EVENTSET
+PMC0  VPU_INSTRUCTIONS_EXECUTED
+PMC1  VPU_STALL_REG
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]  PMC1*inverseClock
+VPU stall ratio [%] 100*(PMC1/PMC0)
+
+LONG
+Formulas:
+VPU stall ratio [%] = 100*(VPU_STALL_REG/VPU_INSTRUCTIONS_EXECUTED)
+--
+This group measures how efficiently the processor works with
+regard to vectorization instruction throughput. The event VPU_STALL_REG counts
+the VPU stalls due to data dependencies. Dependencies are read-after-write,
+write-after-write and write-after-read.
+
--- a/collectors/likwid/groups/phi/VPU_FILL_RATIO_DBL.txt
+++ b/collectors/likwid/groups/phi/VPU_FILL_RATIO_DBL.txt
@@ -0,0 +1,18 @@
+SHORT VPU filling for double precision data
+
+EVENTSET
+PMC0  VPU_INSTRUCTIONS_EXECUTED
+PMC1  VPU_ELEMENTS_ACTIVE
+
+METRICS
+Runtime (RDTSC) [s] time
+VPU fill ratio PMC0*8/PMC1
+
+LONG
+Formulas:
+VPU fill ratio = VPU_INSTRUCTIONS_EXECUTED*8/VPU_ELEMENTS_ACTIVE
+--
+This performance group measures the number of vector instructions that are
+performed on each vector loaded to the VPU. It is important to increase the
+ratio to get a high throughput because memory accesses (loading data to the VPU)
+are expensive.
--- a/collectors/likwid/groups/phi/VPU_PAIRING.txt
+++ b/collectors/likwid/groups/phi/VPU_PAIRING.txt
@@ -0,0 +1,20 @@
+SHORT VPU pairing ratio
+
+EVENTSET
+PMC0  VPU_INSTRUCTIONS_EXECUTED
+PMC1  VPU_INSTRUCTIONS_EXECUTED_V_PIPE
+
+METRICS
+Runtime (RDTSC) [s] time
+V-pipe ratio   PMC1/PMC0
+Pairing ratio PMC1/(PMC0-PMC1)
+
+LONG
+Formulas:
+V-pipe ratio = VPU_INSTRUCTIONS_EXECUTED_V_PIPE/VPU_INSTRUCTIONS_EXECUTED
+Pairing ratio = VPU_INSTRUCTIONS_EXECUTED_V_PIPE/(VPU_INSTRUCTIONS_EXECUTED-VPU_INSTRUCTIONS_EXECUTED_V_PIPE)
+--
+This performance group measures the pairing ratio of vector instructions. The
+V-pipe can only execute a subset of all instructions; the main workload is done
+by the U-pipe. A higher throughput can be achieved if the pairing ratio is
+increased.
--- a/collectors/likwid/groups/phi/VPU_READ_MISS_RATIO.txt
+++ b/collectors/likwid/groups/phi/VPU_READ_MISS_RATIO.txt
@@ -0,0 +1,16 @@
+SHORT Miss ratio for VPU data reads
+
+EVENTSET
+PMC0  VPU_DATA_READ
+PMC1  VPU_DATA_READ_MISS
+
+METRICS
+Runtime (RDTSC) [s] time
+VPU read miss ratio PMC1/PMC0
+
+LONG
+Formulas:
+VPU read miss ratio = VPU_DATA_READ_MISS/VPU_DATA_READ
+--
+This performance group determines the fraction of data reads issued by the
+VPU that miss the cache.
--- a/collectors/likwid/groups/phi/VPU_WRITE_MISS_RATIO.txt
+++ b/collectors/likwid/groups/phi/VPU_WRITE_MISS_RATIO.txt
@@ -0,0 +1,16 @@
+SHORT Miss ratio for VPU data writes
+
+EVENTSET
+PMC0  VPU_DATA_WRITE
+PMC1  VPU_DATA_WRITE_MISS
+
+METRICS
+Runtime (RDTSC) [s] time
+VPU write miss ratio PMC1/PMC0
+
+LONG
+Formulas:
+VPU write miss ratio = VPU_DATA_WRITE_MISS/VPU_DATA_WRITE
+--
+This performance group determines the fraction of data writes issued by the
+VPU that miss the cache.
--- a/collectors/likwid/groups/phi/WRITE_MISS_RATIO.txt
+++ b/collectors/likwid/groups/phi/WRITE_MISS_RATIO.txt
@@ -0,0 +1,15 @@
+SHORT Miss ratio for data writes
+
+EVENTSET
+PMC0  DATA_WRITE
+PMC1  DATA_WRITE_MISS
+
+METRICS
+Runtime (RDTSC) [s] time
+Write miss ratio PMC1/PMC0
+
+LONG
+Formulas:
+Write miss ratio = DATA_WRITE_MISS/DATA_WRITE
+--
+Miss ratio for data writes.