ggml : add WebGPU backend #7773
Hi! I'm interested in bringing this backend to GGML and was wondering if there are any startup materials available for newcomers to quickly ramp up and start working on this backend?
So I've been playing with an implementation of WebGPU for a few days. I got a very minimal version with working buffer management and support for some simple ops. My version is based on ggerganov/ggml#585, but with a few notable changes:
However, I'm not very familiar with the ggml backend interface, so I have a question. I made a test cgraph to test my implementation: https://github.com/ngxson/ggml_webgpu_dev/blob/a5fcc25c359b997869b8683ab485d1d3f96b37f9/main.cpp#L70 When calling
Here is my @ggerganov Could you help me understand this part? Thank you.
If every tensor used in the graph needed to be allocated separately, the compute buffer would be several gigabytes even for the simplest models. The point of ggml-alloc is to minimize the size of the compute buffer by allocating tensors in the same memory locations when possible, based on the order of evaluation of the graph. So this behavior is completely expected. I don't understand what you are trying to do with
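To make the reuse idea concrete, here is a hypothetical toy allocator (not ggml-alloc's actual code, and all names are invented) showing how a tensor's region can be handed to a later tensor once its last consumer in the evaluation order has run, keeping the peak buffer size small:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy linear allocator illustrating the idea behind ggml-alloc:
// a tensor's memory can be reused once its last consumer in the
// evaluation order has executed. Hypothetical sketch only.
struct ToyAlloc {
    size_t high_water = 0;                            // peak buffer size needed
    std::vector<std::pair<size_t, size_t>> free_list; // (offset, size) of released regions

    size_t alloc(size_t size) {
        // first-fit: reuse a released region if one is big enough
        for (auto it = free_list.begin(); it != free_list.end(); ++it) {
            if (it->second >= size) {
                size_t off = it->first;
                free_list.erase(it);
                return off;
            }
        }
        // otherwise grow the buffer
        size_t off = high_water;
        high_water += size;
        return off;
    }

    void release(size_t offset, size_t size) {
        // called once the tensor's last use in the graph has passed
        free_list.push_back({offset, size});
    }
};
```

For three sequential 1 KiB tensors a → b → c, releasing a before allocating c lets c land at a's offset, so the peak stays at 2 KiB instead of 3 KiB; this is also why two different tensors in the graph can legitimately report the same data pointer.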
@slaren Thanks for the explanation. So apparently I'm now running into another issue: both src and dest of
I'm not sure how other backends handle this (and also the
ggml-alloc can make some operations automatically inplace when it determines that it is safe to do so, to save memory. Other backends do not need to do anything special in this case; they just pass the same pointer for both the destination and the source. I am not sure why this is a problem for WebGPU; in the worst case it might require a different version of the kernels for inplace operations, but there is probably some workaround possible.
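The reason inplace is safe for elementwise ops can be shown with a small CPU simulation of a WGSL-style kernel (a hypothetical sketch, not backend code): each "invocation" reads and writes only its own element, so src and dst aliasing the same memory creates no hazard.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for an elementwise GPU kernel: each loop iteration
// plays the role of one shader invocation. Because invocation i only
// touches element i, calling this with dst == src is well-defined.
static void scale_kernel(const float * src, float * dst, float v, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * v; // no cross-element dependency
    }
}
```

On the WebGPU side the wrinkle is validation rather than correctness: binding the same buffer as both a read-only and a writable storage binding in one bind group runs into aliasing rules, so one plausible workaround (matching the "different version of the kernels" suggestion above) is an inplace kernel variant that takes a single `read_write` storage binding.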
Could gpu.cpp (developed by Answers.AI) help your progress? @ggerganov
I think we can implement the kernels from scratch. The backend setup that @ngxson showed earlier seems like a good starting point.
gpu.cpp can be useful to reduce the amount of boilerplate code needed to set up the WebGPU device and buffers. I'll give it a try when I have more time. But keep in mind that the more complicated part is re-implementing all the kernels in WGSL.
Indeed, their kernel implementation seems unfinished too (shaders inlined as strings inside a header). Writing shaders manually will also make the implementation very hard in C/C++, unlike the Rust counterpart, where the implementation can be "oxidized". None of these paths feels quite right, CMIIW. Maybe the following repos could help anyone who will write the shaders:
A backend can be useful even if it only implements matrix multiplication; there is no need to implement every kernel at the same time. Start with a (somewhat) fast matrix multiplication kernel, and add other operations progressively.
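As a correctness baseline for that first kernel, a CPU reference of the naive scheme (one output element per invocation, which is also the natural shape of a first WGSL compute shader) is useful to test the GPU version against. This is a generic sketch, not code from any of the repos mentioned:

```cpp
#include <cassert>

// Naive row-major f32 matmul: C[m][n] = sum_k A[m][k] * B[k][n].
// In a WGSL port, the two outer loops become the global invocation
// id (one thread per output element) and only the k-loop remains.
static void matmul_naive(const float * A, const float * B, float * C,
                         int M, int K, int N) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[m*K + k] * B[k*N + n];
            }
            C[m*N + n] = acc;
        }
    }
}
```

From there the usual progression (as in the sboehm-style series referenced below) is workgroup tiling into shared memory, then per-thread register tiling, then vectorized loads.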
gpu.cpp author here, happy to collaborate with others on WebGPU kernel implementations. We're still in a somewhat exploratory phase regarding the best approach; one basic starting point is a WGSL variant of sboehm's matmul series: https://github.com/AnswerDotAI/gpu.cpp/blob/main/examples/matmul/run.cpp We're working on a small set of transformer kernels, but it will take work to make them performant and mature. +1 to not needing to do everything. Another possibility is to leverage compiler toolchains like ONNX or tinygrad; there's probably a way to pull WGSL out of their output, though I haven't tried it myself yet.
I had a quick look at the file; it seems like this is exactly what I need (inline shader as a string, less boilerplate). I just have a quick question: how do you handle "inplace" operations? For example, if I want to scale a vector, let's say For adapting the kernels, I planned to base them on the ggml Kompute shaders, which have more or less the same syntax.
Hey guys, author of the original WebGPU PR here. Feel free to contact me; I might have some insights on wgpu quirks, although it was all like a year ago.
Here are some matmul kernels BTW (not mine)
One more thought: the main purpose of the Naga shader compiler from the wgpu project is to take WGSL shaders and compile them to MSL/SPIR-V/etc. However, it also seems to have decent support for the opposite direction, i.e. taking GLSL or SPIR-V shaders and compiling them to WGSL. It might be a good starting point for rewriting a lot of shaders.
I hope that this would be relatively easy to do since, AFAIK, WebGPU allows us to write kernels in a shader language, so we have experience with how to create such backends.
There has been some initial work in ggerganov/ggml#585 - could be useful as a starting point