Enable grid outer persistent scheduling #2435
Conversation
// In the case of outer grid persistence, make sure the vectorized
// domain is placed at the innermost position.
// TODO: Why isn't this the case by default?
Why doesn't sortAndRFactor move vectorized iteration domains to the innermost position?
The sortAndRFactor order was primarily just empirical. If all the C++ tests and benchmarks run with vectorized dimensions fully innermost, then it's fine to change the order. I believe it could be easier to deal with now, as I think it was mostly fighting with compute-at.
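As a concrete illustration of what "move the vectorized domain to the innermost position" means, here is a minimal sketch, not the PR's implementation, assuming the nvfuser TensorView scheduling API (nDims, axis, getParallelType, reorder) is in scope; the function name is made up for illustration.

// Move the vectorized iteration domain, if any, to the innermost position.
void moveVectorizedDomainToInnermost(TensorView* tv) {
  const int ndims = static_cast<int>(tv->nDims());
  for (int i = 0; i < ndims; ++i) {
    if (tv->axis(i)->getParallelType() == ParallelType::Vectorize) {
      // Reorder so that the vectorized iteration domain ends up last;
      // the remaining domains keep their relative order.
      tv->reorder({{i, ndims - 1}});
      return;
    }
  }
}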
I'll work on this later as a separate PR.
  // The runtime kernel for grouped normal grid reductions is not
  // well tuned, and it turned out to be quite difficult to get
  // consistently better performance than non-persistent
  // schedules. Disabled for now.
  // TODO: Enable non-welford persistent reductions
  if (is_cross_grid &&
      std::any_of(
          reduction_tvs.begin(),
          reduction_tvs.end(),
          [](TensorView* reduction_tv) {
            return !reduction_tv->definition()->isA<WelfordOp>();
          })) {
    scheduler_debug_utils::canScheduleRejectReason(
        ScheduleHeuristic::Persistent, "non-Welford not enabled yet");
    return false;
  }
I'll remove this constraint once I optimize the non-welford grouped reduction kernel as I did the welford kernel.
  // Had a hard time tuning on Titan RTX when the iteration
  // space is not evenly divided by threads and thread blocks. It
  // doesn't seem to be noticeably bad on A100, though. For now,
  // disable the schedule on Titan RTX only when not evenly divisible.
  // TODO: Revisit
  if (is_cross_grid &&
      (properties.total_iteration_numel %
           (vectorization_factor * cross_grid_params->launch_params.bdimx() *
            cross_grid_params->launch_params.gdimx()) !=
       0) &&
      device_prop->major == 7 && device_prop->minor == 5) {
    scheduler_debug_utils::canScheduleRejectReason(
        ScheduleHeuristic::Persistent, "iteration not evenly divided");
    return false;
  }

  return true;
}
Saw some regressions with the TIMM batchnorm benchmarks on Titan RTX. They have iteration domains of size, e.g., 200 and 368, which result in an uneven distribution of work among threads and blocks. These cases are also slow on A100, but the persistent version is still better than the current segmented kernels.
Will revisit later.
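To make the uneven-division case above concrete, here is a tiny standalone C++ sketch; the vectorization factor, bdimx, and gdimx values are assumptions for illustration, not the heuristic's actual choices.

#include <cstdint>
#include <iostream>

int main() {
  const int64_t total_iteration_numel = 368; // e.g. a TIMM batchnorm case
  const int64_t vectorization_factor = 2;    // assumed for illustration
  const int64_t bdimx = 16;                  // assumed for illustration
  const int64_t gdimx = 8;                   // assumed for illustration
  const int64_t work_per_step = vectorization_factor * bdimx * gdimx; // 256
  // 368 % 256 == 112, so the final pass leaves part of the grid idle.
  std::cout << "evenly divided: "
            << (total_iteration_numel % work_per_step == 0) << std::endl;
  return 0;
}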
The V100 results actually showed non-negligible regressions. I didn't notice due to the outlier masking. Added the same bail-out condition as Titan RTX: #26d1588. Updated performance results; the outlier result is excluded. There's still a small number of regressions of about 20%. Part of the reason seems to be that the register usage estimate is too conservative.
LGTM nice work.
//! for bdimx in all valid bdimx in a decreasing order
//!   for gdimy in valid gdimy values in an increasing order
//!
//! Each of bdimx and gdimy determines gdimy and gdimx, respectively,
Should this be "Each of bdimx and *bdimy"?
Fixed to "Each of bdimx and gdimy determines bdimy and gdimx, respectively". bdimx and gdimy are determined as shown in the above loop nest, and bdimy and gdimx are picked accordingly.
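For readers following along, here is a rough sketch of the search order that doc comment describes; the struct, helper names, and the way bdimy/gdimx are derived are simplified assumptions for illustration, not the scheduler's actual code.

#include <cstdint>
#include <optional>
#include <vector>

struct LaunchConfig {
  int64_t bdimx, bdimy, gdimx, gdimy;
};

// Iterate bdimx in decreasing order and gdimy in increasing order; each
// candidate (bdimx, gdimy) pair then fixes bdimy (rest of the thread block)
// and gdimx (rest of the grid). isValid() stands in for the real heuristic
// constraints (register usage, persistent buffer size, etc.).
std::optional<LaunchConfig> searchLaunchConfig(
    const std::vector<int64_t>& bdimx_candidates, // sorted in decreasing order
    const std::vector<int64_t>& gdimy_candidates, // sorted in increasing order
    int64_t max_threads_per_block,
    int64_t target_blocks,
    bool (*isValid)(const LaunchConfig&)) {
  for (int64_t bdimx : bdimx_candidates) {
    for (int64_t gdimy : gdimy_candidates) {
      LaunchConfig config{
          bdimx,
          max_threads_per_block / bdimx, // bdimy: remaining block threads
          target_blocks / gdimy,         // gdimx: remaining grid blocks
          gdimy};
      if (config.bdimy > 0 && config.gdimx > 0 && isValid(config)) {
        return config;
      }
    }
  }
  return std::nullopt;
}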
        reduction_axis,
        rparams.block_dim_inner_reduction,
        rparams.lparams.bdimy());
    reduction_tv->split(
Isn't this the point of inner_parallel_static?
Not exactly. The inner dimension of the split is not parallelized.
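To illustrate the distinction, here is a hedged sketch with made-up helper names, assuming the nvfuser TensorView/IterDomain/ParallelType API is in scope; it is not the scheduler's actual code. An inner-parallel style split binds the inner output of the split to a parallel dimension, whereas the split discussed here leaves the inner domain serial.

// Split and parallelize the inner result (roughly what an "inner parallel"
// helper would do).
void innerParallelSplit(TensorView* tv, int axis, ParallelType pt, int64_t factor) {
  tv->split(axis, factor);
  // After split(axis, factor), the outer result is at `axis` and the inner
  // result is at `axis + 1`.
  tv->axis(axis + 1)->parallelize(pt);
}

// Split only; the inner domain created at `axis + 1` stays unparallelized,
// which is the case the reply above is describing.
void serialInnerSplit(TensorView* tv, int axis, int64_t factor) {
  tv->split(axis, factor);
}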
  const auto device_prop = at::cuda::getCurrentDeviceProperties();

  const int64_t sm_register_file_size =
      static_cast<int64_t>(device_prop->regsPerBlock * sizeof(int));
Can't you preemptively subtract off the same values as in getAvailableRegisterCount?
I think that should also work. Will consider it in follow-up PRs.
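For context, here is a small standalone sketch (CUDA runtime API only) of what preemptively subtracting a register reserve before computing the register file budget could look like; the reserve size and threads-per-SM values are illustrative assumptions, not what getAvailableRegisterCount actually does.

#include <cuda_runtime.h>
#include <cstdint>
#include <iostream>

// Estimate the register file budget in bytes, subtracting an assumed
// per-thread register reserve up front (in the spirit of the suggestion
// above). Each register is sizeof(int) == 4 bytes.
int64_t estimateRegisterFileBytes(
    int device,
    int64_t reserved_regs_per_thread,
    int64_t threads_per_sm) {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, device);
  const int64_t total_bytes =
      static_cast<int64_t>(prop.regsPerBlock) * sizeof(int);
  const int64_t reserved_bytes =
      reserved_regs_per_thread * threads_per_sm * static_cast<int64_t>(sizeof(int));
  return total_bytes - reserved_bytes;
}

int main() {
  // Assumed numbers purely for illustration.
  std::cout << estimateRegisterFileBytes(0, /*reserved_regs_per_thread=*/40,
                                         /*threads_per_sm=*/1024)
            << " bytes available for persistent buffers (estimate)" << std::endl;
  return 0;
}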
    DataType dtype,
    bool use_weights = false,
    DataType weights_dtype = DataType::Float) {
  const bool benchmark_mode = isBenchmarkMode();
Should there still be a benchmark mode in the tests? I assume this was for development, but I'm guessing it should probably be removed.
Let me keep it for now, as there are still performance issues, and having benchmarks here is very handy since both performance measurement and correctness validation can be done.
(Extracted from #1772)
(Stacked on top of #2416)
Extends the normalization scheduler with grid persistence. The scheduling approach is mostly the same as the one I manually did here: https://github.com/csarofeen/pytorch/blob/devel/third_party/nvfuser/test/test_gpu_outer_reduction.cpp#L379.
There are still many performance limitations, and in this PR I tried to minimize regressions compared to the current TOT, even when that means missing perf gains in some other cases. Will address those limitations later.
Some major limitations remain.
Benchmark performance results
A100
Bandwidth and speedup curves of all the benchmarks. The left axis shows the bandwidth of each benchmark with the TOT devel and this branch (red and blue dots). The right axis shows the speedup factor of each benchmark. The benchmark plots are sorted by the bandwidth of the devel branch. Note that the number of benchmarks that are scheduled with grid persistence is still small, so most of the dots are scattered around 1.0.
As I mentioned above, I tried to minimize performance regressions, and these graphs show there's indeed no significant regression. I saw even more benchmarks showing positive speedups with different heuristics configurations, but the current heuristics are the best I found so far without causing any major regression, at least on A100. (There are some regressions on Titan RTX as shown below.)
Speedup histogram. Note the Y axis is log scale.
The benchmark with the largest speedup still only reaches 660 GB/s, so there's likely still a lot of room to improve.
This one looks quite good now.
Titan RTX
The results on Titan RTX look more mixed compared to those on A100. There are some benchmarks exhibiting more than 20% regression, although they are quite small problems. It turned out it's a little more difficult to avoid regressions on Titan RTX than on A100, and I even needed to add an extra check for the 7.5 compute architecture. Quick tests with manual scheduling seem to indicate that the lack of full vectorization is one of the reasons. Will revisit again.
V100
There's one case that got around a 3.5x speedup, though it could be just random noise given its raw bandwidth is just a few GB/s. Otherwise, the speedup factors look similar to those on A100.