Add half precision FP groups for SPR for completeness
TomTheBear committed Nov 5, 2023
1 parent 31854d9 commit b0f76f1
Showing 2 changed files with 185 additions and 0 deletions.
92 changes: 92 additions & 0 deletions groups/SPR/HBM_HP.txt
@@ -0,0 +1,92 @@
SHORT Overview of HP arithmetic and HBM performance

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PWR0 PWR_PKG_ENERGY
PWR3 PWR_DRAM_ENERGY
PMC0 FP_ARITH_INST_RETIRED2_128B_PACKED_HALF
PMC1 FP_ARITH_INST_RETIRED2_SCALAR_HALF
PMC2 FP_ARITH_INST_RETIRED2_256B_PACKED_HALF
PMC3 FP_ARITH_INST_RETIRED2_512B_PACKED_HALF
HBM0C0 CAS_COUNT_RD
HBM0C1 CAS_COUNT_WR
HBM1C0 CAS_COUNT_RD
HBM1C1 CAS_COUNT_WR
HBM2C0 CAS_COUNT_RD
HBM2C1 CAS_COUNT_WR
HBM3C0 CAS_COUNT_RD
HBM3C1 CAS_COUNT_WR
HBM4C0 CAS_COUNT_RD
HBM4C1 CAS_COUNT_WR
HBM5C0 CAS_COUNT_RD
HBM5C1 CAS_COUNT_WR
HBM6C0 CAS_COUNT_RD
HBM6C1 CAS_COUNT_WR
HBM7C0 CAS_COUNT_RD
HBM7C1 CAS_COUNT_WR
HBM8C0 CAS_COUNT_RD
HBM8C1 CAS_COUNT_WR
HBM9C0 CAS_COUNT_RD
HBM9C1 CAS_COUNT_WR
HBM10C0 CAS_COUNT_RD
HBM10C1 CAS_COUNT_WR
HBM11C0 CAS_COUNT_RD
HBM11C1 CAS_COUNT_WR
HBM12C0 CAS_COUNT_RD
HBM12C1 CAS_COUNT_WR
HBM13C0 CAS_COUNT_RD
HBM13C1 CAS_COUNT_WR
HBM14C0 CAS_COUNT_RD
HBM14C1 CAS_COUNT_WR
HBM15C0 CAS_COUNT_RD
HBM15C1 CAS_COUNT_WR


METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Energy [J] PWR0
Power [W] PWR0/time
Energy DRAM [J] PWR3
Power DRAM [W] PWR3/time
HP [MFLOP/s] 1.0E-06*(PMC0*8.0+PMC1+PMC2*16.0+PMC3*32.0)/time
AVX HP [MFLOP/s] 1.0E-06*(PMC2*16.0+PMC3*32.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2+PMC3)/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
HBM read bandwidth [MBytes/s] 1.0E-06*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0)*64.0/time
HBM read data volume [GBytes] 1.0E-09*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0)*64.0
HBM write bandwidth [MBytes/s] 1.0E-06*(HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0/time
HBM write data volume [GBytes] 1.0E-09*(HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0
HBM bandwidth [MBytes/s] 1.0E-06*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0+HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0/time
HBM data volume [GBytes] 1.0E-09*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0+HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0
Operational intensity [FLOP/Byte] (PMC0*8.0+PMC1+PMC2*16.0+PMC3*32.0)/((HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0+HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0)

LONG
Formulas:
Power [W] = PWR_PKG_ENERGY/runtime
Power DRAM [W] = PWR_DRAM_ENERGY/runtime
HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_128B_PACKED_HALF*8+FP_ARITH_INST_RETIRED2_SCALAR_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/runtime
AVX HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_128B_PACKED_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED2_SCALAR_HALF/runtime
HBM read bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD))*64.0/runtime
HBM read data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD))*64.0
HBM write bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_WR))*64.0/runtime
HBM write data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_WR))*64.0
HBM bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0/runtime
HBM data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0
Operational intensity [FLOP/Byte] = (FP_ARITH_INST_RETIRED2_128B_PACKED_HALF*8+FP_ARITH_INST_RETIRED2_SCALAR_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/((SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0)
--
Profiling group to measure the HBM bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events, it can only be measured on a
per-socket basis. It also outputs the total data volume transferred to and from HBM.
The group further reports scalar and packed half-precision (16-bit) FLOP rates;
the AVX HP metric covers only the packed 256-bit and 512-bit instructions.
The operational intensity is calculated using the FP values of the cores and the
HBM data volume of the whole socket. The actual operational intensity for
multiple CPUs can be found in the statistics table in the Sum column.
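
As a rough illustration of the METRICS formulas in this group, the following Python sketch evaluates the HP FLOP rate and the total HBM bandwidth from raw counter values. The function names and sample counts are hypothetical and not part of LIKWID; the lane weights (8, 16 and 32 half-precision elements per 128-, 256- and 512-bit vector) and the 64 bytes per CAS command are taken from the formulas in the group file.

```python
# Hypothetical helpers for illustration only; not part of LIKWID.

# Half-precision elements processed per retired instruction, by vector width.
HP_LANES = {"scalar": 1, "128b": 8, "256b": 16, "512b": 32}

# Bytes moved per CAS command, as assumed by the group formulas.
CACHE_LINE_BYTES = 64.0


def hp_mflops(counts, runtime_s):
    """HP [MFLOP/s]: weighted sum of FP_ARITH_INST_RETIRED2_* counts."""
    flops = sum(HP_LANES[width] * counts.get(width, 0) for width in HP_LANES)
    return 1.0e-6 * flops / runtime_s


def hbm_bandwidth_mbytes(cas_rd, cas_wr, runtime_s):
    """HBM bandwidth [MBytes/s]: all read and write CAS counts times 64 bytes."""
    return 1.0e-6 * (sum(cas_rd) + sum(cas_wr)) * CACHE_LINE_BYTES / runtime_s


if __name__ == "__main__":
    # Made-up sample values: one entry per HBM channel (HBM0..HBM15).
    counts = {"scalar": 1_000_000, "128b": 0, "256b": 0, "512b": 2_000_000}
    cas_rd = [1_000_000] * 16
    cas_wr = [500_000] * 16
    print("HP [MFLOP/s]:", hp_mflops(counts, 2.0))
    print("HBM bandwidth [MBytes/s]:", hbm_bandwidth_mbytes(cas_rd, cas_wr, 2.0))
```

The same structure applies to the read-only and write-only metrics by passing an empty list for the other direction.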
93 changes: 93 additions & 0 deletions groups/SPR/MEM_HP.txt
@@ -0,0 +1,93 @@
SHORT Overview of HP arithmetic and main memory performance

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PWR0 PWR_PKG_ENERGY
PWR3 PWR_DRAM_ENERGY
PMC0 FP_ARITH_INST_RETIRED2_128B_PACKED_HALF
PMC1 FP_ARITH_INST_RETIRED2_SCALAR_HALF
PMC2 FP_ARITH_INST_RETIRED2_256B_PACKED_HALF
PMC3 FP_ARITH_INST_RETIRED2_512B_PACKED_HALF
MBOX0C0 CAS_COUNT_RD
MBOX0C1 CAS_COUNT_WR
MBOX1C0 CAS_COUNT_RD
MBOX1C1 CAS_COUNT_WR
MBOX2C0 CAS_COUNT_RD
MBOX2C1 CAS_COUNT_WR
MBOX3C0 CAS_COUNT_RD
MBOX3C1 CAS_COUNT_WR
MBOX4C0 CAS_COUNT_RD
MBOX4C1 CAS_COUNT_WR
MBOX5C0 CAS_COUNT_RD
MBOX5C1 CAS_COUNT_WR
MBOX6C0 CAS_COUNT_RD
MBOX6C1 CAS_COUNT_WR
MBOX7C0 CAS_COUNT_RD
MBOX7C1 CAS_COUNT_WR
MBOX8C0 CAS_COUNT_RD
MBOX8C1 CAS_COUNT_WR
MBOX9C0 CAS_COUNT_RD
MBOX9C1 CAS_COUNT_WR
MBOX10C0 CAS_COUNT_RD
MBOX10C1 CAS_COUNT_WR
MBOX11C0 CAS_COUNT_RD
MBOX11C1 CAS_COUNT_WR
MBOX12C0 CAS_COUNT_RD
MBOX12C1 CAS_COUNT_WR
MBOX13C0 CAS_COUNT_RD
MBOX13C1 CAS_COUNT_WR
MBOX14C0 CAS_COUNT_RD
MBOX14C1 CAS_COUNT_WR
MBOX15C0 CAS_COUNT_RD
MBOX15C1 CAS_COUNT_WR



METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Energy [J] PWR0
Power [W] PWR0/time
Energy DRAM [J] PWR3
Power DRAM [W] PWR3/time
HP [MFLOP/s] 1.0E-06*(PMC0*8.0+PMC1+PMC2*16.0+PMC3*32.0)/time
AVX HP [MFLOP/s] 1.0E-06*(PMC2*16.0+PMC3*32.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2+PMC3)/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0
Operational intensity [FLOP/Byte] (PMC0*8.0+PMC1+PMC2*16.0+PMC3*32.0)/((MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0)

LONG
Formulas:
Power [W] = PWR_PKG_ENERGY/runtime
Power DRAM [W] = PWR_DRAM_ENERGY/runtime
HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_128B_PACKED_HALF*8+FP_ARITH_INST_RETIRED2_SCALAR_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/runtime
AVX HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_128B_PACKED_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED2_SCALAR_HALF/runtime
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD))*64.0/runtime
Memory read data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_WR))*64.0/runtime
Memory write data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_WR))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0
Operational intensity [FLOP/Byte] = (FP_ARITH_INST_RETIRED2_128B_PACKED_HALF*8+FP_ARITH_INST_RETIRED2_SCALAR_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/((SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0)
--
Profiling group to measure the memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events, it can only be measured on a
per-socket basis. It also outputs the total data volume transferred to and from main memory.
The group further reports scalar and packed half-precision (16-bit) FLOP rates;
the AVX HP metric covers only the packed 256-bit and 512-bit instructions.
The operational intensity is calculated using the FP values of the cores and the
memory data volume of the whole socket. The actual operational intensity for
multiple CPUs can be found in the statistics table in the Sum column.
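
The note about the Sum column can be made concrete with a small Python sketch. It assumes, as the description states, that each core's operational intensity divides that core's FLOP count by the data volume of the whole socket, so adding the per-core values gives the socket-wide intensity. All names and numbers here are hypothetical and chosen only for illustration.

```python
# Hypothetical illustration; names and values are not taken from LIKWID.

CACHE_LINE_BYTES = 64.0  # bytes per CAS command, as in the group formulas


def operational_intensity(flops, cas_rd_total, cas_wr_total):
    """FLOP/Byte: FLOP count divided by the socket-wide data volume in bytes."""
    volume_bytes = (cas_rd_total + cas_wr_total) * CACHE_LINE_BYTES
    return flops / volume_bytes


# Made-up per-core half-precision FLOP counts for four cores on one socket.
per_core_flops = [4.0e9, 4.0e9, 2.0e9, 2.0e9]

# Made-up socket-wide CAS totals (already summed over all memory channels).
cas_rd_total = 100_000_000
cas_wr_total = 50_000_000

# Each core's value uses only its own FLOPs but the whole socket's volume.
per_core_oi = [
    operational_intensity(f, cas_rd_total, cas_wr_total) for f in per_core_flops
]

# Summing the per-core values matches the intensity computed from all FLOPs.
socket_oi = sum(per_core_oi)

if __name__ == "__main__":
    print("per-core:", per_core_oi)
    print("socket (Sum column):", socket_oi)
```

A single core's value therefore underestimates the intensity of a multi-core run, which is why the summed figure is the one to compare against a roofline.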
