Add half precision FP groups for SPR for completeness
1 parent 31854d9, commit b0f76f1
Showing 2 changed files with 185 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,92 @@ | ||
SHORT Overview of HP arithmetic and HBM performance | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PWR0 PWR_PKG_ENERGY | ||
PWR3 PWR_DRAM_ENERGY | ||
PMC0 FP_ARITH_INST_RETIRED2_128B_PACKED_HALF | ||
PMC1 FP_ARITH_INST_RETIRED2_SCALAR_HALF | ||
PMC2 FP_ARITH_INST_RETIRED2_256B_PACKED_HALF | ||
PMC3 FP_ARITH_INST_RETIRED2_512B_PACKED_HALF | ||
HBM0C0 CAS_COUNT_RD | ||
HBM0C1 CAS_COUNT_WR | ||
HBM1C0 CAS_COUNT_RD | ||
HBM1C1 CAS_COUNT_WR | ||
HBM2C0 CAS_COUNT_RD | ||
HBM2C1 CAS_COUNT_WR | ||
HBM3C0 CAS_COUNT_RD | ||
HBM3C1 CAS_COUNT_WR | ||
HBM4C0 CAS_COUNT_RD | ||
HBM4C1 CAS_COUNT_WR | ||
HBM5C0 CAS_COUNT_RD | ||
HBM5C1 CAS_COUNT_WR | ||
HBM6C0 CAS_COUNT_RD | ||
HBM6C1 CAS_COUNT_WR | ||
HBM7C0 CAS_COUNT_RD | ||
HBM7C1 CAS_COUNT_WR | ||
HBM8C0 CAS_COUNT_RD | ||
HBM8C1 CAS_COUNT_WR | ||
HBM9C0 CAS_COUNT_RD | ||
HBM9C1 CAS_COUNT_WR | ||
HBM10C0 CAS_COUNT_RD | ||
HBM10C1 CAS_COUNT_WR | ||
HBM11C0 CAS_COUNT_RD | ||
HBM11C1 CAS_COUNT_WR | ||
HBM12C0 CAS_COUNT_RD | ||
HBM12C1 CAS_COUNT_WR | ||
HBM13C0 CAS_COUNT_RD | ||
HBM13C1 CAS_COUNT_WR | ||
HBM14C0 CAS_COUNT_RD | ||
HBM14C1 CAS_COUNT_WR | ||
HBM15C0 CAS_COUNT_RD | ||
HBM15C1 CAS_COUNT_WR | ||
|
||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
CPI FIXC1/FIXC0 | ||
Energy [J] PWR0 | ||
Power [W] PWR0/time | ||
Energy DRAM [J] PWR3 | ||
Power DRAM [W] PWR3/time | ||
HP [MFLOP/s] 1.0E-06*(PMC0*8.0+PMC1+PMC2*16.0+PMC3*32.0)/time | ||
AVX HP [MFLOP/s] 1.0E-06*(PMC2*16.0+PMC3*32.0)/time | ||
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2+PMC3)/time | ||
Scalar [MUOPS/s] 1.0E-06*PMC1/time | ||
HBM read bandwidth [MBytes/s] 1.0E-06*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0)*64.0/time | ||
HBM read data volume [GBytes] 1.0E-09*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0)*64.0 | ||
HBM write bandwidth [MBytes/s] 1.0E-06*(HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0/time | ||
HBM write data volume [GBytes] 1.0E-09*(HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0 | ||
HBM bandwidth [MBytes/s] 1.0E-06*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0+HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0/time | ||
HBM data volume [GBytes] 1.0E-09*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0+HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0 | ||
Operational intensity [FLOP/Byte] (PMC0*8.0+PMC1+PMC2*16.0+PMC3*32.0)/((HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0+HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0) | ||
|
||
LONG | ||
Formulas: | ||
Power [W] = PWR_PKG_ENERGY/runtime | ||
Power DRAM [W] = PWR_DRAM_ENERGY/runtime | ||
HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_128B_PACKED_HALF*8+FP_ARITH_INST_RETIRED2_SCALAR_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/runtime | ||
AVX HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/runtime | ||
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_128B_PACKED_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF)/runtime | ||
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED2_SCALAR_HALF/runtime | ||
HBM read bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD))*64.0/runtime | ||
HBM read data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD))*64.0 | ||
HBM write bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_WR))*64.0/runtime | ||
HBM write data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_WR))*64.0 | ||
HBM bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0/runtime | ||
HBM data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0 | ||
Operational intensity [FLOP/Byte] = (FP_ARITH_INST_RETIRED2_128B_PACKED_HALF*8+FP_ARITH_INST_RETIRED2_SCALAR_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/((SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0) | ||
-- | ||
Profiling group to measure the HBM bandwidth drawn by all cores of a socket. | ||
Since this group is based on Uncore events, it can only be measured on | ||
a per-socket basis. Also outputs the total data volume transferred to and from HBM. | ||
Also reports scalar and packed half-precision FLOP rates, including the rate of | ||
packed 256-bit and 512-bit (AVX) instructions. | ||
The operational intensity is calculated using the FP values of the cores and the | ||
HBM data volume of the whole socket. The actual operational intensity for | ||
multiple CPUs can be found in the statistics table in the Sum column. |
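
The weights 8, 16 and 32 in the HP formulas above appear to be the number of 16-bit half-precision elements in a 128-bit, 256-bit and 512-bit vector. The following Python sketch recomputes the `HP [MFLOP/s]` and `AVX HP [MFLOP/s]` metrics from raw counter values under that assumption. The function and variable names are hypothetical and not part of LIKWID.

```python
# Illustrative sketch only: helper names are hypothetical, not LIKWID API.
HALF_BITS = 16  # width of one half-precision element

# Assumed weight per retired instruction: number of 16-bit lanes per width.
LANES = {
    "scalar": 1,
    "128b": 128 // HALF_BITS,
    "256b": 256 // HALF_BITS,
    "512b": 512 // HALF_BITS,
}


def hp_mflops(pmc0_128b, pmc1_scalar, pmc2_256b, pmc3_512b, runtime_s):
    """Mirror of: 1.0E-06*(PMC0*8.0+PMC1+PMC2*16.0+PMC3*32.0)/time"""
    flops = (
        pmc0_128b * LANES["128b"]
        + pmc1_scalar * LANES["scalar"]
        + pmc2_256b * LANES["256b"]
        + pmc3_512b * LANES["512b"]
    )
    return 1.0e-06 * flops / runtime_s


def avx_hp_mflops(pmc2_256b, pmc3_512b, runtime_s):
    """Mirror of: 1.0E-06*(PMC2*16.0+PMC3*32.0)/time"""
    flops = pmc2_256b * LANES["256b"] + pmc3_512b * LANES["512b"]
    return 1.0e-06 * flops / runtime_s


if __name__ == "__main__":
    # Made-up counter values, purely for demonstration.
    print(hp_mflops(1_000_000, 500_000, 2_000_000, 4_000_000, runtime_s=2.0))
    print(avx_hp_mflops(2_000_000, 4_000_000, runtime_s=2.0))
```

Keeping the lane counts in one table makes it easy to see that the scalar event contributes one FLOP per count while each packed event contributes one FLOP per lane.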
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,93 @@ | ||
SHORT Overview of HP arithmetic and main memory performance | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PWR0 PWR_PKG_ENERGY | ||
PWR3 PWR_DRAM_ENERGY | ||
PMC0 FP_ARITH_INST_RETIRED2_128B_PACKED_HALF | ||
PMC1 FP_ARITH_INST_RETIRED2_SCALAR_HALF | ||
PMC2 FP_ARITH_INST_RETIRED2_256B_PACKED_HALF | ||
PMC3 FP_ARITH_INST_RETIRED2_512B_PACKED_HALF | ||
MBOX0C0 CAS_COUNT_RD | ||
MBOX0C1 CAS_COUNT_WR | ||
MBOX1C0 CAS_COUNT_RD | ||
MBOX1C1 CAS_COUNT_WR | ||
MBOX2C0 CAS_COUNT_RD | ||
MBOX2C1 CAS_COUNT_WR | ||
MBOX3C0 CAS_COUNT_RD | ||
MBOX3C1 CAS_COUNT_WR | ||
MBOX4C0 CAS_COUNT_RD | ||
MBOX4C1 CAS_COUNT_WR | ||
MBOX5C0 CAS_COUNT_RD | ||
MBOX5C1 CAS_COUNT_WR | ||
MBOX6C0 CAS_COUNT_RD | ||
MBOX6C1 CAS_COUNT_WR | ||
MBOX7C0 CAS_COUNT_RD | ||
MBOX7C1 CAS_COUNT_WR | ||
MBOX8C0 CAS_COUNT_RD | ||
MBOX8C1 CAS_COUNT_WR | ||
MBOX9C0 CAS_COUNT_RD | ||
MBOX9C1 CAS_COUNT_WR | ||
MBOX10C0 CAS_COUNT_RD | ||
MBOX10C1 CAS_COUNT_WR | ||
MBOX11C0 CAS_COUNT_RD | ||
MBOX11C1 CAS_COUNT_WR | ||
MBOX12C0 CAS_COUNT_RD | ||
MBOX12C1 CAS_COUNT_WR | ||
MBOX13C0 CAS_COUNT_RD | ||
MBOX13C1 CAS_COUNT_WR | ||
MBOX14C0 CAS_COUNT_RD | ||
MBOX14C1 CAS_COUNT_WR | ||
MBOX15C0 CAS_COUNT_RD | ||
MBOX15C1 CAS_COUNT_WR | ||
|
||
|
||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
CPI FIXC1/FIXC0 | ||
Energy [J] PWR0 | ||
Power [W] PWR0/time | ||
Energy DRAM [J] PWR3 | ||
Power DRAM [W] PWR3/time | ||
HP [MFLOP/s] 1.0E-06*(PMC0*8.0+PMC1+PMC2*16.0+PMC3*32.0)/time | ||
AVX HP [MFLOP/s] 1.0E-06*(PMC2*16.0+PMC3*32.0)/time | ||
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2+PMC3)/time | ||
Scalar [MUOPS/s] 1.0E-06*PMC1/time | ||
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0)*64.0/time | ||
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0)*64.0 | ||
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0/time | ||
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0 | ||
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0/time | ||
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0 | ||
Operational intensity [FLOP/Byte] (PMC0*8.0+PMC1+PMC2*16.0+PMC3*32.0)/((MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0) | ||
|
||
LONG | ||
Formulas: | ||
Power [W] = PWR_PKG_ENERGY/runtime | ||
Power DRAM [W] = PWR_DRAM_ENERGY/runtime | ||
HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_128B_PACKED_HALF*8+FP_ARITH_INST_RETIRED2_SCALAR_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/runtime | ||
AVX HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/runtime | ||
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_128B_PACKED_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF)/runtime | ||
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED2_SCALAR_HALF/runtime | ||
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD))*64.0/runtime | ||
Memory read data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD))*64.0 | ||
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_WR))*64.0/runtime | ||
Memory write data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_WR))*64.0 | ||
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0/runtime | ||
Memory data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0 | ||
Operational intensity [FLOP/Byte] = (FP_ARITH_INST_RETIRED2_128B_PACKED_HALF*8+FP_ARITH_INST_RETIRED2_SCALAR_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/((SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0) | ||
-- | ||
Profiling group to measure the memory bandwidth drawn by all cores of a socket. | ||
Since this group is based on Uncore events, it can only be measured on | ||
a per-socket basis. Also outputs the total data volume transferred to and from main memory. | ||
Also reports scalar and packed half-precision FLOP rates, including the rate of | ||
packed 256-bit and 512-bit (AVX) instructions. | ||
The operational intensity is calculated using the FP values of the cores and the | ||
memory data volume of the whole socket. The actual operational intensity for | ||
multiple CPUs can be found in the statistics table in the Sum column. |
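
The bandwidth, data volume and operational intensity formulas above all follow one pattern: sum `CAS_COUNT_RD` and `CAS_COUNT_WR` over the 16 channel counters, multiply by 64.0, then scale. The Python sketch below reproduces that arithmetic. It assumes the factor 64.0 is the number of bytes moved per CAS command (one cache line); the function name and dictionary keys are hypothetical.

```python
# Illustrative sketch only: names are hypothetical, not LIKWID API.
BYTES_PER_CAS = 64.0  # assumption: one CAS command moves one 64-byte cache line


def memory_metrics(cas_rd, cas_wr, flops, runtime_s):
    """Recompute the memory metrics from per-channel CAS counts.

    cas_rd / cas_wr: iterables with one count per channel (MBOX0..MBOX15).
    flops: weighted half-precision FLOP count from the core counters.
    """
    rd = float(sum(cas_rd))
    wr = float(sum(cas_wr))
    total_bytes = (rd + wr) * BYTES_PER_CAS
    return {
        "read_bw_mbytes_s": 1.0e-06 * rd * BYTES_PER_CAS / runtime_s,
        "write_bw_mbytes_s": 1.0e-06 * wr * BYTES_PER_CAS / runtime_s,
        "bw_mbytes_s": 1.0e-06 * total_bytes / runtime_s,
        "data_volume_gbytes": 1.0e-09 * total_bytes,
        # FLOPs divided by ALL bytes moved; guard against zero traffic.
        "operational_intensity": flops / total_bytes if total_bytes else float("nan"),
    }


if __name__ == "__main__":
    # Made-up counts for 16 channels, purely for demonstration.
    reads = [1_000_000] * 16
    writes = [250_000] * 16
    for name, value in memory_metrics(reads, writes, flops=5.0e9, runtime_s=1.5).items():
        print(f"{name}: {value:.4f}")
```

Note that the whole byte count sits in the denominator of the operational intensity, which is why the balanced parentheses in that formula matter.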