
[Feature Request] Expanding support on gather-like operations to avoid Kernel segmentation #156

Closed
4 of 5 tasks
jjsjann123 opened this issue Apr 10, 2023 · 4 comments
jjsjann123 commented Apr 10, 2023

The original ask comes from csarofeen/pytorch#2556

Currently we are trying to support Embedding & CrossEntropyLoss without fusion segmentation. This feature request is an umbrella item that I'm using to host follow-up issues & PRs:

  • target support for primitive operations in the form of numpy.take and numpy.take_along_axis (we may need to clean up & extend the nvfuser API); @jjsjann123
  • draft a cpp example with CrossEntropyLoss forward [WIP] Cross entropy loss cpp example #201; @jjsjann123
  • finalize target problem sizes and a reference implementation (torch.compile?!) for benchmarking; @kevinstephano. I'm marking this as done, since Kevin mentioned "A better size to think about would be [8192, 32768], where you should have lots of waves." Though we might want more sizes for perf tuning?! Validating CrossEntropyLoss Performance #278
  • start figuring out backward for embedding and CrossEntropyLoss; @jjsjann123
  • manual hint for segmentation @jjsjann123 segmenter hint #262. Note: it turns out the hint is not needed, since Naoya updated the scheduling heuristics. But we still went through with the PR, since a manual segmenter hint might be useful for debugging. Marking this as done.
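For reference, the two gather flavors the first item targets differ in how the index array is interpreted. A minimal NumPy sketch (illustrative only, not nvFuser code):

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# numpy.take: index along one axis with an arbitrary index array; the
# output shape is the index shape joined with the remaining axes.
cols = np.array([0, 2])
taken = np.take(x, cols, axis=1)            # shape (3, 2)

# numpy.take_along_axis: the index array has the same rank as the input
# and is matched element-wise along the chosen axis -- this is the gather
# used to pick one class score per row in a cross-entropy forward pass.
idx = np.array([[1], [3], [0]])             # one column index per row
along = np.take_along_axis(x, idx, axis=1)  # shape (3, 1)
```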
jjsjann123 commented:

Now that we have the cpp examples and a rough idea of what the forward pass should look like, I'm tagging @naoyam on this issue as well to track codegen progress there.


naoyam commented Apr 24, 2023

Just a quick update. I just realized we don't support fusing a normalization with a reduction, so the final reduction would be segmented out. The take_along_axis op should be in the same fusion as the softmax, so at least we should only write the 1D take_along_axis output to global memory.

Fusing the normalization with the reduction should be possible but it's just not supported right now. Hopefully, it shouldn't be a big perf overhead.
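To make the fusion boundary concrete, here is a hedged NumPy sketch (assumptions mine, not nvFuser code) of the cross-entropy forward under discussion: a softmax-style normalization, a take_along_axis gather producing a 1D output, and a final reduction over the batch. The segmentation described above falls between the gather and that last reduction.

```python
import numpy as np

def cross_entropy_forward(logits, targets):
    # numerically stable log-softmax over the class axis (the normalization)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # gather each sample's target-class log-probability (1D output);
    # per the comment above, only this 1D result need hit global memory
    picked = np.take_along_axis(log_probs, targets[:, None], axis=1)[:, 0]
    # final mean reduction over the batch -- the part that may be segmented out
    return -picked.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))
targets = rng.integers(0, 5, size=8)
loss = cross_entropy_forward(logits, targets)
```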

jjsjann123 commented:

Kevin posted some perf expectations and code snippets here: #278

jjsjann123 commented:

Even though we still have issues with our take/take_along_dim support, I'm going to close this mega-issue since the goals of the sprint are mostly cleared.
