
[Feature Request] Expanding support on gather-like operations to avoid Kernel segmentation #156

Closed
4 of 5 tasks
jjsjann123 opened this issue Apr 10, 2023 · 4 comments
jjsjann123 commented Apr 10, 2023

The original ask comes from csarofeen/pytorch#2556

Currently we are trying to support Embedding & CrossEntropyLoss without fusion segmentation. This feature request is an umbrella item that I'm using to host follow-up issues & PRs:

  • target support for primitive operations in the form of numpy.take and numpy.take_along_axis (we may need to clean up & extend the nvfuser API); @jjsjann123
  • draft a cpp example with CrossEntropyLoss forward [WIP] Cross entropy loss cpp example #201; @jjsjann123
  • finalize target problem sizes and a reference implementation (torch.compile?!) for benchmarking; @kevinstephano. I'm marking this as done, since Kevin mentioned "A better size to think about would be [8192, 32768], where you should have lots of waves." Though we might want more sizes for perf tuning?! Validating CrossEntropyLoss Performance #278
  • start figuring out backward for embedding and CrossEntropyLoss; @jjsjann123
  • manual hint for segmentation @jjsjann123 segmenter hint #262. Note: it turns out the hint is not needed, since Naoya updated the scheduling heuristics. But we still went through with the PR, since a manual segmenter hint might be useful for debugging. Marking this as done.
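For reference, the two gather flavors the first item targets differ in how the index array is interpreted. A minimal NumPy sketch (illustrative only, not nvFuser code):

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# numpy.take: index along one axis with an arbitrary index array; the
# output shape is the index shape joined with the remaining axes.
cols = np.array([0, 2])
taken = np.take(x, cols, axis=1)            # shape (3, 2)

# numpy.take_along_axis: the index array has the same rank as the input
# and is matched element-wise along the chosen axis -- this is the gather
# used to pick one class score per row in a cross-entropy forward pass.
idx = np.array([[1], [3], [0]])             # one column index per row
along = np.take_along_axis(x, idx, axis=1)  # shape (3, 1)
```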
jjsjann123 commented:

Now that we have the cpp examples and a rough idea of what the forward pass should look like, I'm tagging @naoyam on this issue as well to track codegen progress there.


naoyam commented Apr 24, 2023

Just a quick update. I just realized we don't support fusing a normalization with a reduction, so the final reduction would be segmented out. The take_along_axis op should be in the same fusion as the softmax, so at least we should only write the 1D take_along_axis output to global memory.

Fusing the normalization with the reduction should be possible but it's just not supported right now. Hopefully, it shouldn't be a big perf overhead.
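To make the fusion boundary concrete, here is a hedged NumPy sketch (assumptions mine, not nvFuser code) of the cross-entropy forward under discussion: a softmax-style normalization, a take_along_axis gather producing a 1D output, and a final reduction over the batch. The segmentation described above falls between the gather and that last reduction.

```python
import numpy as np

def cross_entropy_forward(logits, targets):
    # numerically stable log-softmax over the class axis (the normalization)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # gather each sample's target-class log-probability (1D output);
    # per the comment above, only this 1D result need hit global memory
    picked = np.take_along_axis(log_probs, targets[:, None], axis=1)[:, 0]
    # final mean reduction over the batch -- the part that may be segmented out
    return -picked.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))
targets = rng.integers(0, 5, size=8)
loss = cross_entropy_forward(logits, targets)
```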

jjsjann123 commented:

Kevin posted some perf expectations and code snippets here: #278

jjsjann123 commented:

Even though we still have issues with our take/take_along_dim support, I'm going to close this mega-issue since the goals of the sprint are mostly cleared.
