Currently we are trying to support Embedding & CrossEntropyLoss without fusion segmentation. This feature request is an umbrella item that I'm using to host follow-up issues & PRs:
target support for primitive operations in the form of numpy.take and numpy.take_along_axis (we may need to clean up & add nvfuser API); @jjsjann123
finalize target problem sizes and a reference implementation (torch.compile?!) for benchmarking; @kevinstephano. I'm marking this as done, since Kevin mentioned "A better size to think about would be [8192, 32768] where you should have lots of waves." Though we might want more sizes for perf tuning?! Validating CrossEntropyLoss Performance #278
start figuring out backward for Embedding and CrossEntropyLoss; @jjsjann123
manual hint for segmentation; @jjsjann123 (segmenter hint #262). Note: it turns out the hint is not needed, since Naoya updated the heuristics for scheduling. But we still went through with the PR, since a manual segmenter hint might be useful for debugging. Marking this as done.
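As a rough sketch of what the first item above is asking for: the Embedding lookup corresponds to numpy.take along the table's row axis, and the per-sample logit gather inside CrossEntropyLoss corresponds to numpy.take_along_axis. The snippet below is plain NumPy (not nvfuser API), with made-up sizes, just to pin down the semantics we want the primitives to cover.

```python
import numpy as np

vocab, hidden, batch = 10, 4, 3
rng = np.random.default_rng(0)

weight = rng.standard_normal((vocab, hidden))   # embedding table
ids = np.array([2, 7, 5])                       # token indices

# Embedding lookup is numpy.take along axis 0 of the table
emb = np.take(weight, ids, axis=0)              # [batch, hidden]

# CrossEntropyLoss gathers the logit at each sample's target class
logits = rng.standard_normal((batch, vocab))
targets = np.array([[1], [0], [9]])             # [batch, 1]
picked = np.take_along_axis(logits, targets, axis=1)  # [batch, 1]

# log-sum-exp minus the picked logit is the per-sample NLL
lse = np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = (lse - picked).mean()
```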
Note that now that we have the cpp examples and a rough idea of how the forward pass should look, I'm tagging @naoyam on this issue as well to track codegen progress there.
Just a quick update: I realized we don't support fusing a normalization with a reduction, so the final reduction would be segmented out. The take_along_axis op should still be in the same fusion as the softmax, so at least we should only write the 1D take_along_axis output to global memory.
Fusing the normalization with the reduction should be possible; it's just not supported right now. Hopefully it shouldn't be a big perf overhead.
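To make the segmentation point above concrete, here is a NumPy-level sketch (not nvfuser code) of where the cut would land: the log-softmax normalization and the take_along_axis gather live in one "fusion", so only a 1D per-sample loss tensor reaches global memory, and the final mean reduction runs as a separate segment.

```python
import numpy as np

def fused_softmax_gather(logits, targets):
    # Numerically stable log-sum-exp plus target-logit gather;
    # in the fused kernel only the 1D result would hit global memory.
    m = logits.max(axis=1, keepdims=True)
    lse = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    picked = np.take_along_axis(logits, targets[:, None], axis=1)
    return (lse - picked).squeeze(1)     # per-sample loss, shape [batch]

def segmented_reduction(per_sample):
    # The final mean would currently run as a second, segmented kernel.
    return per_sample.mean()

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 16))
targets = rng.integers(0, 16, size=8)
loss = segmented_reduction(fused_softmax_gather(logits, targets))
```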
Even though we still have issues with our take/take_along_axis support, I'm going to close this mega issue since the goals of the sprint are mostly cleared.
The original ask comes from csarofeen/pytorch#2556