SHORT Utilization of FP pipelines

EVENTSET
PMC0  INST_RETIRED
PMC1  CPU_CYCLES
PMC2  FLA_VAL
PMC3  FLA_VAL_PRD_CNT
PMC4  FLB_VAL
PMC5  FLB_VAL_PRD_CNT

METRICS
Runtime (RDTSC) [s] time
CPI  PMC1/PMC0
FP operation pipeline A busy rate [%] (PMC2/PMC1)*100.0
FP pipeline A active element rate [%] (PMC3/(PMC2*16))*100.0
FP operation pipeline B busy rate [%] (PMC4/PMC1)*100.0
FP pipeline B active element rate [%] (PMC5/(PMC4*16))*100.0


LONG
Formulas:
CPI = CPU_CYCLES/INST_SPEC
FP operation pipeline A busy rate [%] = (FLA_VAL/CPU_CYCLES)*100.0
FP pipeline A active element rate [%] = (FLA_VAL_PRD_CNT/(FLA_VAL*16))*100.0
FP operation pipeline B busy rate [%] = (FLB_VAL/CPU_CYCLES)*100.0
FP pipeline B active element rate [%] = (FLB_VAL_PRD_CNT/(FLB_VAL*16))*100.0
-
FLx_VAL: This event counts valid cycles of FLx pipeline.
FLx_VAL_PRD_CNT: This event counts the number of 1's in the predicate bits of
                 request in FLA pipeline, where it is corrected so that it
                 becomes 16 when all bits are 1.
So each predicate mask has 16 slots, so there are 16 slots per cycle in FLA and
FLB. FLA is partly used by other instructions like SVE stores.