forked from pytorch/executorch
Commit
Add 3D Texture Bandwidth metric (pytorch#4336)
Summary:
Pull Request resolved: pytorch#4336

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from a 3D texture along each of its dimensions, using the following shader, where A is a 3D texture and B is a writeonly buffer.

The calculation of the texel position depends on the dimension being benchmarked:

    x : pos = ivec3(offset, 0, 0)
    y : pos = ivec3(0, offset, 0)
    z : pos = ivec3(0, 0, offset)

    void main() {
      vec4 sum = vec4(0);
      const uint workgroup_width = local_group_size * niter * ${NUNROLL};
      uint offset = (gl_WorkGroupID[0] * workgroup_width + gl_LocalInvocationID[0]) & addr_mask;

      int i = 0;
      for (; i < niter; ++i) {
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= texelFetch(A, pos, 0);
        offset = (offset + local_group_size) & addr_mask;
      }

      vec4 zero = vec4(i>>31);
      B[gl_LocalInvocationID[0]] = sum + zero;
    }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 4 (the mask assumes a power of two), the offsets stay within four unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the workgroup size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained from the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this for each of the dimensions.

{F1767497386}

Comparing the bandwidth for the X dimension to OpenCL, which was obtained through [ArchProbe](https://github.com/microsoft/ArchProbe), we can observe that, although the behavior is the same, Vulkan achieves higher bandwidth for most access sizes.

{F1767497972}

Compared with buffers, the texture bandwidth is similar to that of regular buffers, but still much lower than that of UBOs at small access sizes.

{F1767497707}

Reviewed By: jorgep31415

Differential Revision: D59980139
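The offset arithmetic described in the summary can be checked on the CPU. The following is a minimal Python sketch, not part of the commit: it mirrors the shader's `offset` updates for one invocation and collects the addresses touched by a whole dispatch. The function names and the small parameter values are illustrative assumptions. It also shows why the mask needs a power-of-two `nvec`: with `nvec = 3` the mask is `2`, and `x & 2` can only be `0` or `2`.

```python
def thread_offsets(workgroup_id, local_id, nvec, local_group_size, niter, nunroll):
    """Return the texel offsets one invocation reads, mirroring the shader's arithmetic."""
    # The mask acts as a modulo only when nvec is a power of two.
    if nvec <= 0 or nvec & (nvec - 1):
        raise ValueError("nvec must be a power of two")
    addr_mask = nvec - 1
    workgroup_width = local_group_size * niter * nunroll
    offset = (workgroup_id * workgroup_width + local_id) & addr_mask
    visited = []
    # niter loop iterations, each unrolled nunroll times.
    for _ in range(niter * nunroll):
        visited.append(offset)
        offset = (offset + local_group_size) & addr_mask
    return visited


def unique_addresses(nvec, local_group_size, niter, nunroll, num_workgroups=1):
    """Return the set of offsets touched by every invocation of every workgroup."""
    seen = set()
    for workgroup_id in range(num_workgroups):
        for local_id in range(local_group_size):
            seen.update(
                thread_offsets(
                    workgroup_id, local_id, nvec, local_group_size, niter, nunroll
                )
            )
    return seen


if __name__ == "__main__":
    # Small access size: every read stays inside the same four addresses.
    print(sorted(unique_addresses(4, 2, 2, 2)))
    # Larger access size: the second workgroup starts in its own block.
    print(thread_offsets(1, 0, 16, 2, 2, 2))
```

With a small `nvec`, all invocations keep hitting the same few addresses; once `nvec` exceeds the workgroup width, the initial offset places each workgroup in a different block.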
1 parent b7c8378, commit 2ddafed
Showing 3 changed files with 186 additions and 0 deletions.
@@ -0,0 +1,59 @@

    /*
     * Copyright (c) Meta Platforms, Inc. and affiliates.
     * All rights reserved.
     *
     * This source code is licensed under the BSD-style license found in the
     * LICENSE file in the root directory of this source tree.
     */

    #version 450 core

    #define PRECISION ${PRECISION}
    #define VEC4_T ${texel_type(DTYPE)}

    layout(std430) buffer;

    ${layout_declare_sampler(0, "r", "A", DTYPE)}
    ${layout_declare_buffer(1, "w", "B", DTYPE, "PRECISION", False)}

    layout(local_size_x_id = 0, local_size_y_id = 1, local_size_z_id = 2) in;

    layout(constant_id = 3) const int niter = 1;
    layout(constant_id = 4) const int nvec = 1;
    layout(constant_id = 5) const int local_group_size = 1;

    void main() {
      // The address mask works as a modulo because x % 2^n == x & (2^n - 1).
      // This will help us limit address accessing to a specific set of unique
      // addresses depending on the access size we want to measure.
      const int addr_mask = nvec - 1;
      vec4 sum = vec4(0);

      // This is to distribute the accesses to unique addresses across the workgroups, once the
      // size of the access exceeds the workgroup width.
      const uint workgroup_width = local_group_size * niter * ${NUNROLL};
      uint offset = (gl_WorkGroupID[0] * workgroup_width + gl_LocalInvocationID[0]) & addr_mask;

      int i = 0;
      for (; i < niter; ++i){
        VEC4_T in_texel;
        $for j in range(int(NUNROLL)):
          $if DIM == 0:
            in_texel = texelFetch(A, ivec3(offset, 0, 0), 0);
          $elif DIM == 1:
            in_texel = texelFetch(A, ivec3(0, offset, 0), 0);
          $elif DIM == 2:
            in_texel = texelFetch(A, ivec3(0, 0, offset), 0);

          sum *= in_texel;

          // On each unroll, a new unique address will be accessed through the offset,
          // limited by the address mask to a specific set of unique addresses
          offset = (offset + local_group_size) & addr_mask;
      }

      // This is to ensure no compiler optimizations occur
      vec4 zero = vec4(i>>31);

      B[gl_LocalInvocationID[0]] = sum + zero;
    }
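The `$for` and `$if` directives in the shader are expanded before compilation, so the loop body ends up as `NUNROLL` copies of the fetch, multiply, and offset update for one chosen dimension. The Python sketch below only illustrates what that expansion produces. It is not the ExecuTorch code generator, and the helper names `texel_pos` and `unrolled_body` are hypothetical.

```python
def texel_pos(dim):
    """Build the ivec3 expression that varies only along the benchmarked dimension."""
    if dim not in (0, 1, 2):
        raise ValueError("dim must be 0 (x), 1 (y) or 2 (z)")
    axes = ["0", "0", "0"]
    axes[dim] = "offset"
    return "ivec3({})".format(", ".join(axes))


def unrolled_body(dim, nunroll):
    """Return the GLSL statements one loop iteration contains after unrolling."""
    lines = []
    for _ in range(nunroll):
        # Each unrolled step: fetch a texel, fold it into sum, advance the offset.
        lines.append(f"in_texel = texelFetch(A, {texel_pos(dim)}, 0);")
        lines.append("sum *= in_texel;")
        lines.append("offset = (offset + local_group_size) & addr_mask;")
    return lines


if __name__ == "__main__":
    # Show the body for the y dimension with a small unroll factor.
    for statement in unrolled_body(dim=1, nunroll=2):
        print(statement)
```

Unrolling in the template rather than relying on the driver's compiler keeps the number of fetches per iteration explicit, which is what `workgroup_width` assumes when it multiplies by `${NUNROLL}`.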
@@ -0,0 +1,15 @@

    # Copyright (c) Meta Platforms, Inc. and affiliates.
    # All rights reserved.
    #
    # This source code is licensed under the BSD-style license found in the
    # LICENSE file in the root directory of this source tree.

    tex_bandwidth:
      parameter_names_with_default_values:
        DTYPE: float
        NUNROLL: "16"
      generate_variant_forall:
        DIM:
          - RANGE: [0, 2]
      shader_variants:
        - NAME: tex_bandwidth
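The `generate_variant_forall` entry asks for one shader variant per value of `DIM`. Since the shader handles `DIM == 0`, `1` and `2`, the range `[0, 2]` is taken to be inclusive here. The sketch below expands the configuration into per-variant parameter sets using a plain dictionary that mirrors the YAML. It is an illustration under that assumption, not the actual ExecuTorch generator, and it does not model how generated variants are named.

```python
from itertools import product

# Hand-written mirror of the YAML above (no YAML parser needed for the sketch).
CONFIG = {
    "parameter_names_with_default_values": {"DTYPE": "float", "NUNROLL": "16"},
    "generate_variant_forall": {"DIM": [{"RANGE": [0, 2]}]},
}


def expand_values(entries):
    """Flatten a list of RANGE entries into concrete values (bounds assumed inclusive)."""
    values = []
    for entry in entries:
        if "RANGE" not in entry:
            raise ValueError("only RANGE entries are handled in this sketch")
        low, high = entry["RANGE"]
        values.extend(range(low, high + 1))
    return values


def expand_variants(config):
    """Return one parameter dictionary per combination of forall values."""
    defaults = dict(config["parameter_names_with_default_values"])
    forall = config.get("generate_variant_forall", {})
    names = list(forall)
    value_lists = [expand_values(forall[name]) for name in names]
    variants = []
    for combo in product(*value_lists):
        params = dict(defaults)
        params.update(zip(names, combo))
        variants.append(params)
    return variants


if __name__ == "__main__":
    for params in expand_variants(CONFIG):
        print(params)
```

Each resulting dictionary carries the defaults (`DTYPE`, `NUNROLL`) plus one `DIM` value, which is the information the shader template needs to select a single `texelFetch` branch.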