Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add epilogue to store MMA results in shared memory before write to #387
add epilogue to store MMA results in shared memory before write to #387
Changes from 29 commits
379eab5
b500d36
acf1167
be885b0
7138cbe
fad4ad8
84e3e98
a94e5df
9f4bcc4
1760c15
f0ff6f9
da5dc3a
537b855
ab86a1f
f2a75cd
7a4d5b5
86b8911
5586b3a
6a8f139
d6212cb
95ea553
925e04d
1f30a36
80d7588
d3019f0
212258c
a2045cd
32f43d8
67ecdb0
864a918
79c452c
26d970d
189cef4
889ce23
94aae9b
27870d0
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
Large diffs are not rendered by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still true with the new
hasEnoughSharedMemoryForEpilogue
that takes register occupancy into consideration?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case
FusionAmpereMatmulSmemEpilogue_CUDA
, The settings of cta_tile (64, 128, 32), warp_tile(32, 32, 32), and smem_double_buffer_stage(2) are purposefully changed to produce a constant occupancy of 25% no matterhas_smem_epilogue = true or false
. This allows me to evaluate the influence of the has_smem_epilogue. There is no noticable change of performance for TT, NT, TN (kernel1-3). But for NN (kernel4), there is a slightly decrease. I checked the global memory write index, the access is coalesced forhas_smem_epilogue = true
and strided forhas_smem_epilogue = false
Global Memory write index for
MatmulLayout::NN with has_smem_epilogue = false
:Global Memory write index for
MatmulLayout::NN with has_smem_epilogue = true
:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so changing
has_smem_epilogue
to true makes the memory access contiguous, but the perf drop? That is strange.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rename to
blocks_per_sm_by_register
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|| blocks_per_sm_with_smem_epilogue > 0;
is a tmp condition to allow the test of as many cases as possbile without considering performance. Will be deleted before merge.