
Enable grid outer persistent scheduling #2435

Merged: 9 commits merged into devel on Feb 15, 2023
Conversation

naoyam (Collaborator) commented Feb 9, 2023:

(Extracted from #1772)
(Stacked on top of #2416)

Extends the normalization scheduler with grid persistence. The scheduling approach is mostly the same as the one I manually did here: https://github.com/csarofeen/pytorch/blob/devel/third_party/nvfuser/test/test_gpu_outer_reduction.cpp#L379.
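
For readers who don't want to dig into the linked test, the rough shape of the targeted loop structure is sketched below. This is a minimal, hypothetical sketch only: the split factors, the axis ordering, and the helper name are assumptions for illustration, not the exact schedule the heuristic produces.

```cpp
// Hypothetical sketch of an outer grid persistent schedule for a
// [reduction (N*H*W), iteration (C)] outer reduction, assuming nvFuser's
// TensorView scheduling API. All factors below are illustrative only.
void sketchOuterGridPersistentSchedule(TensorView* reduction_tv) {
  constexpr int64_t kVectFactor = 4;        // assumed vectorization width
  constexpr int64_t kBdimx = 128;           // assumed threads in x (iteration)
  constexpr int64_t kBdimy = 8;             // assumed threads in y (reduction)
  constexpr int64_t kPersistentBatch = 32;  // assumed serial persistent size

  // Iteration ("channel") domain: [BIDx, TIDx, vect]
  reduction_tv->split(1, kVectFactor);
  reduction_tv->split(1, kBdimx);
  // Reduction domain: [BIDy, persistent serial loop, TIDy]
  reduction_tv->split(0, kBdimy);
  reduction_tv->split(0, kPersistentBatch);
  // Resulting domain:
  //   [r/BIDy, r/Serial (persistent), r/TIDy, i/BIDx, i/TIDx, i/vect]
  reduction_tv->axis(0)->parallelize(ParallelType::BIDy);
  // axis(1) stays serial: each block iterates over its persistent batch.
  reduction_tv->axis(2)->parallelize(ParallelType::TIDy);
  reduction_tv->axis(3)->parallelize(ParallelType::BIDx);
  reduction_tv->axis(4)->parallelize(ParallelType::TIDx);
  // The innermost axis corresponds to the vectorized accesses; in practice
  // the Vectorize parallel type is applied to the cached inputs/outputs.
}
```

Roughly speaking, the persistence comes from each block serially iterating over its chunk of the reduction domain while partial results are combined across BIDy with a grouped grid (Welford) reduction, instead of segmenting the fusion into multiple kernels.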

There are still many performance limitations, and in this PR I tried to minimize regressions compared to the current TOT, even when that means missing perf gains in some other cases. I will address those limitations later.

Some major limitations:

  • The non-welford grouped grid reduction kernel is slow. It's even slower than the welford kernel, since the welford kernel is specifically optimized for outer reductions, whereas the non-welford kernel is a generic template implementation.
  • In fusions like half-precision batchnorm with float weights, the vectorization factor is limited to 4 due to the type of the weight inputs (see the sketch after this list).
  • Iteration domains whose sizes are not powers of two need better tuning. For example, the TIMM batchnorm benchmarks have iteration domains of size 200, whereas the ResNet batchnorms are easier to work with as their sizes are, e.g., 256, 512, etc.
  • Using shared memory, in addition to L2, for extra cache space is not yet implemented.
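
To make the vectorization limit in the second bullet concrete, here is a hedged back-of-the-envelope sketch. It assumes the vectorization width is chosen so that the widest per-element access stays within a 16-byte transaction; the helper name is made up for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Hypothetical helper: the vectorization factor is bounded by the widest
// (in bytes) input element type, assuming 16-byte vectorized accesses.
int64_t maxVectorizationFactor(std::initializer_list<int64_t> dtype_sizes) {
  int64_t factor = 16;
  for (int64_t size : dtype_sizes) {
    factor = std::min(factor, int64_t{16} / size);
  }
  return factor;
}

int main() {
  // fp16 tensors alone would allow 16 B / 2 B = 8 elements per access,
  // but fp32 weights cap it at 16 B / 4 B = 4, matching the limitation above.
  std::printf("fp16 only: %lld\n", (long long)maxVectorizationFactor({2}));
  std::printf("fp16 + fp32 weights: %lld\n",
              (long long)maxVectorizationFactor({2, 4}));
  return 0;
}
```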

Benchmark performance results

A100

Bandwidth and speedup curves of all the benchmarks. The left axis shows the bandwidth of each benchmark with the TOT devel and this branch (red and blue dots). The right axis shows the speed-up factor of each benchmark. The benchmark plots are sorted by the bandwidth of the devel branch. Note that the number of benchmarks that are scheduled with grid persistence is still small, so most of the dots are scattered around 1.0.

As I mentioned above, I tried to minimize performance regressions, and these graphs show there's indeed no significant regression. I saw even more benchmarks showing positive speedups with different heuristics configurations, but the current heuristics are the best I found so far without causing any major regression, at least on A100. (There are some regressions on Titan RTX as shown below.)

[figure: A100 bandwidth and speedup curves]

Speedup histogram. Note the Y axis is log scale.
[figure: A100 speedup histogram]

The largest speedup is with this benchmark:

| name | bw_devel | bw_new | bw_new/bw_devel |
| --- | --- | --- | --- |
| NvFuserScheduler_BatchNorm_nhwc_fp32___GRAPH/NvFuserScheduler_BatchNorm_nhwc_fp32/512/64/8/manual_time | 2.836540e+11 | 6.599990e+11 | 2.326775 |

It's still 660 GB/s, so likely there's still a lot of room to improve.

This one looks quite good now.

| name | bw_devel | bw_new | bw_new/bw_devel |
| --- | --- | --- | --- |
| NvFuserScheduler_ResNet_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_ResNet_BatchNorm_nhwc_fp16/256/2048/7/manual_time | 6.192360e+11 | 9.104440e+11 | 1.47027 |

Titan RTX

The results on Titan RTX look more mixed compared to those on A100. There are some benchmarks exhibiting more than 20% regression, although they are quite small problems. It turned out to be a little more difficult to avoid regressions on Titan RTX than on A100, and I even needed to add an extra check for the 7.5 compute architecture. Quick tests with manual scheduling seem to indicate that the lack of full vectorization is one of the reasons. Will revisit again.

[figures: Titan RTX benchmark results]

V100

There's one case that got around a 3.5x speedup, though it could be just random noise given that its raw bandwidth is only a few GB/s. Otherwise, the speedup factors look similar to those on A100.

[figures: V100 benchmark results]


// In the case of outer grid persistence, make sure the vectorized
// domain placed at the innermost position.
// TODO: Why isn't this the case by default?
naoyam (Collaborator, Author):
Why doesn't sortAndRFactor move vectorized iteration domains to the innermost position?

csarofeen (Owner):
sortAndRFactor order was primarily just empirical. If all the C++ tests and benchmarks run with vectorized dimensions fully innermost, then it's fine to change the order. I believe it could be easier to deal with now, as I think it was mostly fighting with computeAt.

naoyam (Collaborator, Author):
I'll work on this later as a separate PR.
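
As a hedged sketch of what that follow-up might look like (not the actual fix, and assuming nvFuser's TensorView/IterDomain API), moving an already-vectorized domain to the innermost position could be a small reorder pass:

```cpp
// Hypothetical sketch: move the Vectorize-parallelized axis of a tensor
// to the innermost position, leaving the relative order of the other axes
// unchanged.
void moveVectorizedAxisInnermost(TensorView* tv) {
  const int ndims = static_cast<int>(tv->nDims());
  for (int i = 0; i < ndims; ++i) {
    if (tv->axis(i)->getParallelType() == ParallelType::Vectorize) {
      if (i != ndims - 1) {
        tv->reorder({{i, ndims - 1}});
      }
      return;
    }
  }
}
```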

Comment on lines +2005 to +2020
  // The runtime kernel for grouped normal grid reductions is not
  // well tuned, and it turned out to be quite difficult to get
  // consistently better performances than non-persistent
  // schedules. Disabled for now.
  // TODO: Enable non-welford persistent reductions
  if (is_cross_grid &&
      std::any_of(
          reduction_tvs.begin(),
          reduction_tvs.end(),
          [](TensorView* reduction_tv) {
            return !reduction_tv->definition()->isA<WelfordOp>();
          })) {
    scheduler_debug_utils::canScheduleRejectReason(
        ScheduleHeuristic::Persistent, "non-Welford not enabled yet");
    return false;
  }
naoyam (Collaborator, Author):

I'll remove this constraint once I optimize the non-welford grouped reduction kernel as I did for the welford kernel.

Comment on lines 2022 to 2040
  // Had a hard time tuning on Titan RTX when the iteration
  // space is not evenly divided by threads and thread blocks. It
  // doesn't seem to be noticeably bad on A100, though. For now,
  // disable the schedule if not evenly divisible just on Titan RTX
  // only.
  // TODO: Revisit
  if (is_cross_grid &&
      (properties.total_iteration_numel %
           (vectorization_factor * cross_grid_params->launch_params.bdimx() *
            cross_grid_params->launch_params.gdimx()) !=
       0) &&
      device_prop->major == 7 && device_prop->minor == 5) {
    scheduler_debug_utils::canScheduleRejectReason(
        ScheduleHeuristic::Persistent, "iteration not evenly divided");
    return false;
  }

  return true;
}
naoyam (Collaborator, Author):

Saw some regressions with the TIMM batchnorm benchmarks on Titan RTX. They have iteration domains of sizes like 200 and 368, which result in an uneven distribution of work among threads and blocks. These cases are also slow on A100, but the persistent version is still better than the current segmented kernels.

Will revisit later.
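
For illustration, a hedged numeric example of the divisibility condition above; the vectorization factor and thread/block counts are made-up values, not the heuristic's actual picks.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper mirroring the check above: the schedule is accepted
// only if (vectorization * bdimx * gdimx) evenly divides the iteration size.
bool evenlyDivided(int64_t iter_numel, int64_t vect, int64_t bdimx, int64_t gdimx) {
  return iter_numel % (vect * bdimx * gdimx) == 0;
}

int main() {
  // Assumed config: vectorization 4, bdimx 64, gdimx 1 (i.e., chunks of 256).
  std::printf("512: %d\n", evenlyDivided(512, 4, 64, 1)); // 1 (even)
  std::printf("200: %d\n", evenlyDivided(200, 4, 64, 1)); // 0 (uneven)
  std::printf("368: %d\n", evenlyDivided(368, 4, 64, 1)); // 0 (uneven)
  return 0;
}
```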

naoyam changed the title from "[WIP] Enable grid outer persistent scheduling" to "Enable grid outer persistent scheduling" on Feb 9, 2023
naoyam marked this pull request as ready for review on February 9, 2023 05:22
naoyam requested a review from csarofeen on February 9, 2023 05:23
naoyam (Collaborator, Author) commented Feb 9, 2023:

The V100 results actually showed non-negligible regressions. I didn't notice at first due to the outlier masking them.

Added the same bail-out condition as Titan RTX: #26d1588

Updated performance results. The outlier result is excluded.

[figures: updated V100 benchmark results]

There's still a small number of regressions of about 20%. Part of the reason seems to be that this register usage estimate is too conservative: `register_count -= persistent_buffer_factor;`. Will address later.

Base automatically changed from segmenter_fix to devel February 9, 2023 23:22
csarofeen (Owner) left a comment:

LGTM nice work.

//! for bdimx in all valid bdimx in an decreasing order
//! for gdimy in valid gdimy values in an increasing order
//!
//! Each of bdimx and gdimy determines gdimy and gdimx, respecitively,
csarofeen (Owner):

Should this be Each of bdimx and *bdimy ?

naoyam (Collaborator, Author):

Fixed to "Each of bdimx and gdimy determines bdym and gdimx, respecitively". bdimx and gdimy are determined as shown in the above loop nest, and bdimy and gdimx are picked accordingly.

reduction_axis,
rparams.block_dim_inner_reduction,
rparams.lparams.bdimy());
reduction_tv->split(
csarofeen (Owner):

Isn't this the point of inner_parallel_static?

naoyam (Collaborator, Author):

Not exactly. The inner dimension of the split is not parallelized.

third_party/nvfuser/csrc/scheduler/reduction_utils.cpp (outdated review thread, resolved)

third_party/nvfuser/csrc/scheduler/reduction_utils.cpp (outdated review thread, resolved)
third_party/nvfuser/csrc/scheduler/registry.cpp (outdated review thread, resolved)
const auto device_prop = at::cuda::getCurrentDeviceProperties();

const int64_t sm_register_file_size =
static_cast<int64_t>(device_prop->regsPerBlock * sizeof(int));
csarofeen (Owner):

Can't you preemptively subtract off the same values as in getAvailableRegisterCount?

naoyam (Collaborator, Author):

I think that should also work. Will consider it in a future PR.
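
A hedged sketch of the suggestion, for the record; the reserve constant and function name are assumptions, and the real numbers would come from (or mirror) getAvailableRegisterCount:

```cpp
#include <cstdint>

// Hypothetical: registers assumed to be reserved per SM (e.g., runtime and
// spilling headroom); the real value would be shared with
// getAvailableRegisterCount rather than duplicated here.
constexpr int64_t kAssumedReservedRegisters = 2048;

int64_t availableRegisterFileBytes(int64_t regs_reported_by_device) {
  // Subtract the reserve up front, then convert the 32-bit register count
  // to bytes, instead of discounting the reserve later in the heuristic.
  const int64_t usable = regs_reported_by_device - kAssumedReservedRegisters;
  return usable * static_cast<int64_t>(sizeof(int));
}
```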

DataType dtype,
bool use_weights = false,
DataType weights_dtype = DataType::Float) {
const bool benchmark_mode = isBenchmarkMode();
csarofeen (Owner):

Should there still be a benchmark mode in the tests? I assume this was for development, but I'm guessing it should probably be removed.

naoyam (Collaborator, Author):

Let me keep it for now. There are still performance issues, and having the benchmarks here is very handy since both performance measurement and correctness validation can be done.
