
Enable grid outer persistent scheduling #2435

Merged: 9 commits merged into devel on Feb 15, 2023
Conversation

naoyam (Collaborator) commented Feb 9, 2023:

(Extracted from #1772)
(Stacked on top of #2416)

Extends the normalization scheduler with grid persistence. The scheduling approach is mostly the same as the one I manually did here: https://github.com/csarofeen/pytorch/blob/devel/third_party/nvfuser/test/test_gpu_outer_reduction.cpp#L379.
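
For readers who don't want to dig into the linked test, the rough shape of the targeted loop structure is sketched below. This is a minimal, hypothetical sketch only: the split factors, the axis ordering, and the helper name are assumptions for illustration, not the exact schedule the heuristic produces.

```cpp
// Hypothetical sketch of an outer grid persistent schedule for a
// [reduction (N*H*W), iteration (C)] outer reduction, assuming nvFuser's
// TensorView scheduling API. All factors below are illustrative only.
void sketchOuterGridPersistentSchedule(TensorView* reduction_tv) {
  constexpr int64_t kVectFactor = 4;        // assumed vectorization width
  constexpr int64_t kBdimx = 128;           // assumed threads in x (iteration)
  constexpr int64_t kBdimy = 8;             // assumed threads in y (reduction)
  constexpr int64_t kPersistentBatch = 32;  // assumed serial persistent size

  // Iteration ("channel") domain: [BIDx, TIDx, vect]
  reduction_tv->split(1, kVectFactor);
  reduction_tv->split(1, kBdimx);
  // Reduction domain: [BIDy, persistent serial loop, TIDy]
  reduction_tv->split(0, kBdimy);
  reduction_tv->split(0, kPersistentBatch);
  // Resulting domain:
  //   [r/BIDy, r/Serial (persistent), r/TIDy, i/BIDx, i/TIDx, i/vect]
  reduction_tv->axis(0)->parallelize(ParallelType::BIDy);
  // axis(1) stays serial: each block iterates over its persistent batch.
  reduction_tv->axis(2)->parallelize(ParallelType::TIDy);
  reduction_tv->axis(3)->parallelize(ParallelType::BIDx);
  reduction_tv->axis(4)->parallelize(ParallelType::TIDx);
  // The innermost axis corresponds to the vectorized accesses; in practice
  // the Vectorize parallel type is applied to the cached inputs/outputs.
}
```

Roughly speaking, the persistence comes from each block serially iterating over its chunk of the reduction domain while partial results are combined across BIDy with a grouped grid (Welford) reduction, instead of segmenting the fusion into multiple kernels.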

There are still many performance limitations, and in this PR I tried to minimize regressions compared to the current TOT, even when that means missing perf gains in some other cases. I will address those limitations later.

Some major limitations:

  • The non-welford grouped grid reduction kernel is slow. It's even slower than the welford kernel, since the welford kernel is specifically optimized for outer reductions, whereas the non-welford kernel is a generic template implementation.
  • In fusions like half-precision batchnorm with float weights, the vectorization factor is limited to 4 due to the type of the weight inputs (see the sketch after this list).
  • Iteration domains whose sizes are not powers of two need better tuning. For example, the TIMM batchnorm benchmarks have iteration domains of size 200, whereas the ResNet batchnorms are easier to work with as their sizes are, e.g., 256, 512, etc.
  • Using shared memory, in addition to L2, for extra cache space is not yet implemented.
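
To make the vectorization limit in the second bullet concrete, here is a hedged back-of-the-envelope sketch. It assumes the vectorization width is chosen so that the widest per-element access stays within a 16-byte transaction; the helper name is made up for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Hypothetical helper: the vectorization factor is bounded by the widest
// (in bytes) input element type, assuming 16-byte vectorized accesses.
int64_t maxVectorizationFactor(std::initializer_list<int64_t> dtype_sizes) {
  int64_t factor = 16;
  for (int64_t size : dtype_sizes) {
    factor = std::min(factor, int64_t{16} / size);
  }
  return factor;
}

int main() {
  // fp16 tensors alone would allow 16 B / 2 B = 8 elements per access,
  // but fp32 weights cap it at 16 B / 4 B = 4, matching the limitation above.
  std::printf("fp16 only: %lld\n", (long long)maxVectorizationFactor({2}));
  std::printf("fp16 + fp32 weights: %lld\n",
              (long long)maxVectorizationFactor({2, 4}));
  return 0;
}
```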

Benchmark performance results

A100

Bandwidth and speedup curves of all the benchmarks. The left axis shows the bandwidth of each benchmark with the TOT devel and this branch (red and blue dots). The right axis shows the speed-up factor of each benchmark. The benchmark plots are sorted by the bandwidth of the devel branch. Note that the number of benchmarks that are scheduled with grid persistence is still small, so most of the dots are scattered around 1.0.

As I mentioned above, I tried to minimize performance regressions, and these graphs show there's indeed no significant regression. I saw even more benchmarks showing positive speedups with different heuristics configurations, but the current heuristics are the best I found so far without causing any major regression, at least on A100. (There are some regressions on Titan RTX as shown below.)

[figure: A100 bandwidth and speedup curves]

Speedup histogram. Note the Y axis is log scale.
[figure: A100 speedup histogram]

The largest speedup is with this benchmark:

| name | bw_devel | bw_new | bw_new/bw_devel |
| --- | --- | --- | --- |
| NvFuserScheduler_BatchNorm_nhwc_fp32___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp32/512/64/8/manual_time | 2.836540e+11 | 6.599990e+11 | 2.326775 |

It's still 660 GB/s, so likely there's still a lot of room to improve.

This one looks quite good now.

| name | bw_devel | bw_new | bw_new/bw_devel |
| --- | --- | --- | --- |
| NvFuserScheduler_ResNet_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_ResNet_BatchNorm_nhwc_fp16/256/2048/7/manual_time | 6.192360e+11 | 9.104440e+11 | 1.47027 |

Titan RTX

The results on Titan RTX look more mixed compared to those on A100. There are some benchmarks exhibiting more than 20% regression, although they are quite small problems. It turned out to be a little more difficult to avoid regressions on Titan RTX than on A100, and I even needed to add an extra check for the 7.5 compute architecture. Quick tests with manual scheduling seem to indicate that the lack of full vectorization is one of the reasons. Will revisit again.

[figures: Titan RTX benchmark results]

V100

There's one case that got around a 3.5x speedup, though it could be just random noise given that its raw bandwidth is only a few GB/s. Otherwise, the speedup factors look similar to those on A100.

[figures: V100 benchmark results]


// In the case of outer grid persistence, make sure the vectorized
// domain placed at the innermost position.
// TODO: Why isn't this the case by default?
naoyam (Collaborator, Author):
Why doesn't sortAndRFactor move vectorized iteration domains to the innermost position?

csarofeen (Owner):
sortAndRFactor order was primarily just empirical. If all the C++ tests and benchmarks run with vectorized dimensions fully innermost, then it's fine to change the order. I believe it could be easier to deal with now, as I think it was mostly fighting with computeAt.

naoyam (Collaborator, Author):
I'll work on this later as a separate PR.
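
As a hedged sketch of what that follow-up might look like (not the actual fix, and assuming nvFuser's TensorView/IterDomain API), moving an already-vectorized domain to the innermost position could be a small reorder pass:

```cpp
// Hypothetical sketch: move the Vectorize-parallelized axis of a tensor
// to the innermost position, leaving the relative order of the other axes
// unchanged.
void moveVectorizedAxisInnermost(TensorView* tv) {
  const int ndims = static_cast<int>(tv->nDims());
  for (int i = 0; i < ndims; ++i) {
    if (tv->axis(i)->getParallelType() == ParallelType::Vectorize) {
      if (i != ndims - 1) {
        tv->reorder({{i, ndims - 1}});
      }
      return;
    }
  }
}
```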

Comment on lines +2005 to +2020
  // The runtime kernel for grouped normal grid reductions is not
  // well tuned, and it turned out to be quite difficult to get
  // consistently better performances than non-persistent
  // schedules. Disabled for now.
  // TODO: Enable non-welford persistent reductions
  if (is_cross_grid &&
      std::any_of(
          reduction_tvs.begin(),
          reduction_tvs.end(),
          [](TensorView* reduction_tv) {
            return !reduction_tv->definition()->isA<WelfordOp>();
          })) {
    scheduler_debug_utils::canScheduleRejectReason(
        ScheduleHeuristic::Persistent, "non-Welford not enabled yet");
    return false;
  }
naoyam (Collaborator, Author):

I'll remove this constraint once I optimize the non-welford grouped reduction kernel as I did for the welford kernel.

Comment on lines 2022 to 2040
  // Had a hard time tuning on Titan RTX when the iteration
  // space is not evenly divided by threads and thread blocks. It
  // doesn't seem to be noticeably bad on A100, though. For now,
  // disable the schedule if not evenly divisible just on Titan RTX
  // only.
  // TODO: Revisit
  if (is_cross_grid &&
      (properties.total_iteration_numel %
           (vectorization_factor * cross_grid_params->launch_params.bdimx() *
            cross_grid_params->launch_params.gdimx()) !=
       0) &&
      device_prop->major == 7 && device_prop->minor == 5) {
    scheduler_debug_utils::canScheduleRejectReason(
        ScheduleHeuristic::Persistent, "iteration not evenly divided");
    return false;
  }

  return true;
}
naoyam (Collaborator, Author):

Saw some regressions with the TIMM batchnorm benchmarks on Titan RTX. They have iteration domains of sizes like 200 and 368, which result in an uneven distribution of work among threads and blocks. These cases are also slow on A100, but the persistent version is still better than the current segmented kernels.

Will revisit later.
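
For illustration, a hedged numeric example of the divisibility condition above; the vectorization factor and thread/block counts are made-up values, not the heuristic's actual picks.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper mirroring the check above: the schedule is accepted
// only if (vectorization * bdimx * gdimx) evenly divides the iteration size.
bool evenlyDivided(int64_t iter_numel, int64_t vect, int64_t bdimx, int64_t gdimx) {
  return iter_numel % (vect * bdimx * gdimx) == 0;
}

int main() {
  // Assumed config: vectorization 4, bdimx 64, gdimx 1 (i.e., chunks of 256).
  std::printf("512: %d\n", evenlyDivided(512, 4, 64, 1)); // 1 (even)
  std::printf("200: %d\n", evenlyDivided(200, 4, 64, 1)); // 0 (uneven)
  std::printf("368: %d\n", evenlyDivided(368, 4, 64, 1)); // 0 (uneven)
  return 0;
}
```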

naoyam changed the title from "[WIP] Enable grid outer persistent scheduling" to "Enable grid outer persistent scheduling" on Feb 9, 2023
naoyam marked this pull request as ready for review on February 9, 2023 05:22
naoyam requested a review from csarofeen on February 9, 2023 05:23
naoyam (Collaborator, Author) commented Feb 9, 2023:

The V100 results actually showed non-negligible regressions. I didn't notice at first due to the outlier masking them.

Added the same bail-out condition as Titan RTX: #26d1588

Updated performance results. The outlier result is excluded.

[figures: updated V100 benchmark results]

There's still a small number of regressions of about 20%. Part of the reason seems to be that this register usage estimate is too conservative: `register_count -= persistent_buffer_factor;`. Will address later.

Base automatically changed from segmenter_fix to devel February 9, 2023 23:22
csarofeen (Owner) left a comment:

LGTM nice work.

//! for bdimx in all valid bdimx in an decreasing order
//! for gdimy in valid gdimy values in an increasing order
//!
//! Each of bdimx and gdimy determines gdimy and gdimx, respecitively,
csarofeen (Owner):

Should this be Each of bdimx and *bdimy ?

naoyam (Collaborator, Author):

Fixed to "Each of bdimx and gdimy determines bdym and gdimx, respecitively". bdimx and gdimy are determined as shown in the above loop nest, and bdimy and gdimx are picked accordingly.

reduction_axis,
rparams.block_dim_inner_reduction,
rparams.lparams.bdimy());
reduction_tv->split(
csarofeen (Owner):

Isn't this the point of inner_parallel_static?

naoyam (Collaborator, Author):

Not exactly. The inner dimension of the split is not parallelized.

third_party/nvfuser/csrc/scheduler/reduction_utils.cpp (outdated review thread, resolved)

third_party/nvfuser/csrc/scheduler/reduction_utils.cpp (outdated review thread, resolved)
third_party/nvfuser/csrc/scheduler/registry.cpp (outdated review thread, resolved)
const auto device_prop = at::cuda::getCurrentDeviceProperties();

const int64_t sm_register_file_size =
static_cast<int64_t>(device_prop->regsPerBlock * sizeof(int));
csarofeen (Owner):

Can't you preemptively subtract off the same values as in getAvailableRegisterCount?

naoyam (Collaborator, Author):

I think that should also work. Will consider it in a future PR.
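
A hedged sketch of the suggestion, for the record; the reserve constant and function name are assumptions, and the real numbers would come from (or mirror) getAvailableRegisterCount:

```cpp
#include <cstdint>

// Hypothetical: registers assumed to be reserved per SM (e.g., runtime and
// spilling headroom); the real value would be shared with
// getAvailableRegisterCount rather than duplicated here.
constexpr int64_t kAssumedReservedRegisters = 2048;

int64_t availableRegisterFileBytes(int64_t regs_reported_by_device) {
  // Subtract the reserve up front, then convert the 32-bit register count
  // to bytes, instead of discounting the reserve later in the heuristic.
  const int64_t usable = regs_reported_by_device - kAssumedReservedRegisters;
  return usable * static_cast<int64_t>(sizeof(int));
}
```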

DataType dtype,
bool use_weights = false,
DataType weights_dtype = DataType::Float) {
const bool benchmark_mode = isBenchmarkMode();
csarofeen (Owner):

Should there still be a benchmark mode in the tests? I assume this was for development, but I'm guessing it should probably be removed.

naoyam (Collaborator, Author):

Let me keep it for now. There are still performance issues, and having the benchmarks here is very handy since both performance measurement and correctness validation can be done.
