add epilogue to store MMA results in shared memory before write to #387
I think you are checking if `smem_size <= device_smem_limit` and enabling smem unswizzle when this condition is satisfied. I don't think this is the correct behavior. Each kernel has its designed occupancy; if the designed occupancy is N, then we should have `N * smem_size <= device_smem_limit`. cc: @drzejan2 Any opinion on this?
You are right @zasdfgbnm. The first thought that comes to my mind is to calculate the current occupancy where only `smem_a` and `smem_b` are in use, and then check that the new occupancy, with `smem_a`, `smem_b`, and `smem_c`, does not change. Something along these lines:
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. But without memory reuse, `new_smem_occupancy < curr_smem_occupancy` is almost always true; e.g. I checked `NVFuserTest.FusionAmpereMatmulTileCheck4warp_CUDA` and the results are:

This function was designed to ensure we can launch the kernel with the shared memory epilogue without considering occupancy. By doing this we can increase the number of cases that can be tested with the smem epilogue. For performance purposes we definitely need `N * smem_size <= device_smem_limit`; in this revision N is derived from the register limitation. For the case `NVFuserTest.FusionAmpereMatmulTileCheck4warp_CUDA`, N is 1 or 2.
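The register-derived bound described above can be sketched as follows (a hypothetical illustration: the register-file size, smem limit, and function names are assumptions, not nvFuser's implementation):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical illustration (not nvFuser's implementation): the occupancy
// target N is derived from the register budget, and the smem epilogue is
// accepted only if N CTAs still fit within the device's shared memory.
constexpr int64_t regs_per_sm = 64 * 1024;        // e.g. Ampere register file per SM
constexpr int64_t device_smem_limit = 163 * 1024; // e.g. A100 opt-in smem per SM

int64_t blocks_per_sm_from_regs(int64_t regs_per_thread, int64_t threads_per_block) {
  return regs_per_sm / (regs_per_thread * threads_per_block);
}

bool smem_epilogue_fits(int64_t smem_size, int64_t regs_per_thread,
                        int64_t threads_per_block) {
  // N is clamped to at least 1 so the kernel remains launchable at all.
  int64_t n = std::max<int64_t>(
      int64_t{1}, blocks_per_sm_from_regs(regs_per_thread, threads_per_block));
  return n * smem_size <= device_smem_limit;
}
```

For example, at 255 registers per thread and 128 threads per block, the register budget yields N = 2, so a smem footprint is accepted only if twice its size fits under the device limit.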
Thank you @liqiangxl for running the verification of smem-based and register-based occupancy. The updated implementation looks good.