In #387, we have to do a
in order to make our swizzling algorithm work for the epilogue. This looks more like an empirical hack whose only goal is to create a square block. Although it worked empirically, I struggled to find a first-principles explanation for this approach. So I read through my original PR #155 multiple times and thought things through carefully. But the more I read and think, the more I feel that the original implementation in #155 does not make sense. The problem is that #155 tries to interleave the entire `ldmatrix_rows / repeated_pattern_size` with an equal-size split on the tile y dimension. This is overkill: we only need to distribute rows evenly across the megabanks, and as long as we do so, the number of rows can be arbitrarily large and we are still bank-conflict free. So we should be swizzling on a `(g, g)` block instead of a (potentially much larger) `(ldmatrix_rows / repeated_pattern_size, ldmatrix_rows / repeated_pattern_size)` block.
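
To make the claim about row count concrete, here is a minimal Python sketch of a toy shared-memory model. It is not the code from this PR or from #155. It assumes 8 megabanks, a row-major tile whose row width is 8 megabank-sized units, an XOR swizzle, and `g = 8`; the names `megabank` and `max_conflict` are hypothetical and exist only for this illustration. Under those assumptions, swizzling within a `(g, g)` block already spreads any 8 consecutive rows over 8 distinct megabanks, no matter how many rows the tile has.

```python
NUM_MEGABANKS = 8  # assumption: shared memory modeled as 8 megabank-sized units per line


def megabank(row, col, row_width, g=None):
    """Megabank hit by unit (row, col) of a row-major tile.

    If g is given (assumed to be a power of two), the column is
    XOR-swizzled with the row index inside its (g, g) block.
    """
    if g is not None:
        block_col, in_col = divmod(col, g)
        col = block_col * g + (in_col ^ (row % g))
    return (row * row_width + col) % NUM_MEGABANKS


def max_conflict(rows, col, row_width, g=None):
    """Worst conflict degree over every window of 8 consecutive rows
    that all read the same logical column."""
    worst = 0
    for start in range(rows - 7):
        hits = [megabank(r, col, row_width, g) for r in range(start, start + 8)]
        worst = max(worst, max(hits.count(b) for b in set(hits)))
    return worst


if __name__ == "__main__":
    for rows in (8, 64, 1024):
        print(rows, max_conflict(rows, 3, 8), max_conflict(rows, 3, 8, g=8))
```

In this toy model the un-swizzled layout puts all 8 rows of a column read on the same megabank, while the `(g, g)` swizzle keeps the conflict degree at 1 for 8, 64, or 1024 rows, which is the sense in which a larger interleaving block is not needed.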