ggml : add WebGPU backend #7773

Open
ggerganov opened this issue Jun 5, 2024 · 15 comments
Labels: help wanted (Extra attention is needed), research 🔬


@ggerganov (Owner)

I hope this will be relatively easy to do since, AFAIK, WebGPU allows us to write kernels in a shader language, and we already have experience creating such backends.

There has been some initial work in ggerganov/ggml#585 - it could be useful as a starting point.

WenheLI commented Jun 15, 2024

Hi! I'm interested in bringing this backend to ggml and was wondering if there are any starter materials that would help newcomers ramp up quickly and start working on it.

ngxson (Collaborator) commented Jun 27, 2024

So I've been playing with a WebGPU implementation for a few days. I have a very minimal version with working buffer management and support for some simple ops.

My version is based on ggerganov/ggml#585, but with some notable changes:

  1. Up-to-date ggml backend API
  2. Use webgpu_cpp instead of plain C (requires C++17)
  3. emscripten-only for now

However, I'm not very familiar with the ggml backend interface, so I have a question:

I made a test cgraph to test my implementation: https://github.com/ngxson/ggml_webgpu_dev/blob/a5fcc25c359b997869b8683ab485d1d3f96b37f9/main.cpp#L70

When calling ggml_gallocr_alloc_graph, I expected it to call buffer_type_alloc_buffer with enough memory for all nodes, but it turns out it only allocates memory for one node and then calls init_tensor for all nodes:

```
ggml_backend_wgpu_buffer_type_alloc_buffer: 256  ==> only enough memory for one node
storage_buffer_1: create with size=256
ggml_backend_wgpu_buffer_reset
ggml_backend_wgpu_buffer_init_tensor: node_0
storage_buffer_1: node_0, init to offset 0
ggml_backend_wgpu_buffer_init_tensor: node_1
storage_buffer_1: realloc to size=512           ==> not enough memory, we need to realloc
storage_buffer_1: node_1, init to offset 256
ggml_backend_wgpu_buffer_init_tensor: node_2
```

Here is my tensor_init function: https://github.com/ngxson/ggml_webgpu_dev/blob/a5fcc25c359b997869b8683ab485d1d3f96b37f9/ggml-wgpu.cpp#L195

@ggerganov Could you help me understand this part? Thank you.

slaren (Collaborator) commented Jun 27, 2024

If every tensor used in the graph needed to be allocated separately, the compute buffer would be several gigabytes even for the simplest models. The point of ggml-alloc is to minimize the size of the compute buffer by allocating tensors in the same memory locations when possible based on the order of evaluation of the graph. So this behavior is completely expected.

I don't understand what you are trying to do with offset_table. You can calculate the offset of the tensor within the buffer by subtracting the base address returned by ggml_backend_wgpu_buffer_get_base from ggml_tensor::data.
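
For illustration, the offset computation described here could look like the following minimal sketch. The helper name wgpu_tensor_offset is hypothetical; ggml_backend_buffer_get_base is the public accessor for a buffer's base address:

```cpp
// Minimal sketch, not the actual backend code: recover a tensor's offset
// within its backend buffer, as described above.
#include "ggml.h"
#include "ggml-backend.h"
#include <cstdint>

// hypothetical helper: ggml-alloc places each tensor so that
// tensor->data == base + offset, where base comes from get_base
static size_t wgpu_tensor_offset(ggml_backend_buffer_t buffer,
                                 const struct ggml_tensor * tensor) {
    const uint8_t * base = (const uint8_t *) ggml_backend_buffer_get_base(buffer);
    return (size_t) ((const uint8_t *) tensor->data - base);
}
```

With this, the (wgpu::Buffer, offset) pair is everything needed to bind a tensor in a compute pass, so no separate offset table has to be maintained.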

ngxson (Collaborator) commented Jun 27, 2024

@slaren Thanks for the explanation.

So apparently offset_table was only there because I didn't know the offset can be calculated as tensor->data - base. With that in mind, I removed offset_table, and also removed the std::set<ggml_wgpu_buffer_context *> buffers that was used for tracking all the allocated buffers.

I'm now running into another issue: both the src and dst of result = ggml_div(ctx0, result, model.b) point to the same tensor:

```
Writable storage buffer binding aliasing found between [BindGroup "bind_group"] set at bind group index 0, binding index 0, and [BindGroup "bind_group"] set at bind group index 0, binding index 2, with overlapping ranges (offset: 0, size: 256) and (offset: 0, size: 256) in [Buffer "storage_buffer_1"].
 - While encoding [ComputePassEncoder (unlabeled)].DispatchWorkgroups(8, 1, 1).
```

I'm not sure how other backends handle this (and also the _inplace version). Do you have any clue?

slaren (Collaborator) commented Jun 27, 2024

ggml-alloc can make some operations automatically in-place when it determines that it is safe to do so, to save memory. Other backends do not need to do anything special in this case; they just pass the same pointer for both the destination and the source. I am not sure why this is a problem for WebGPU; in the worst case it might require making a different version of the kernels for in-place operations, but there is probably some workaround possible.
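
As a sketch of that workaround direction (an assumption, not how any existing backend actually does it): the backend can detect when ggml-alloc made an op in-place by comparing data pointers, and route it to an in-place pipeline variant:

```cpp
// Hedged sketch: detect ops that ggml-alloc made automatically in-place
// (dst shares its buffer region with a src), so a WebGPU backend could
// dispatch a dedicated in-place kernel instead of aliasing writable bindings.
#include "ggml.h"

static bool wgpu_op_is_inplace(const struct ggml_tensor * dst) {
    // ggml-alloc may have placed dst at the same address as one of its sources
    for (int i = 0; i < GGML_MAX_SRC && dst->src[i] != nullptr; ++i) {
        if (dst->src[i]->data == dst->data) {
            return true; // aliased: pick the in-place kernel variant
        }
    }
    return false;
}
```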

refinism commented Nov 19, 2024

Could gpu.cpp (developed by Answers.AI) help your progress, @ggerganov?
It uses a stripped-down version of Google's Dawn implementation.
The problem is that it seems to depend explicitly on clang and C++17(?).

@ggerganov (Owner, Author)

I think we can implement the kernels from scratch. The backend setup that @ngxson showed earlier seems like a good starting point.

ngxson (Collaborator) commented Nov 19, 2024

gpu.cpp can be useful for reducing the amount of boilerplate code needed to set up the WebGPU device and buffers. I'll give it a try when I have more time. But keep in mind that the more complicated part is re-implementing all the kernels in WGSL.

refinism commented Nov 19, 2024

Indeed, their kernel implementation doesn't seem finished either (shaders inlined as strings inside a header). Writing shaders manually will also make the implementation very hard in C/C++, unlike the Rust counterpart, where the implementation can be "oxidized". None of the paths feels right. CMIIW.

Maybe the following repos could help anyone who will write the shaders:

  • web-rwkv
  • crabml
  • wonnx

slaren (Collaborator) commented Nov 19, 2024

A backend can be useful even if it only implements matrix multiplication; there is no need to implement every kernel at the same time. Start with a (somewhat) fast matrix multiplication kernel, and add other operations progressively.
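
To make that concrete, a first kernel could be as simple as a naive (unoptimized, f32-only) WGSL matmul stored as an inline string, the same style the existing ggml backends use for their shaders. This is only a sketch; every name here is illustrative:

```cpp
// Illustrative naive WGSL matmul: C (MxN) = A (MxK) * B (KxN), row-major f32.
static const char * k_wgsl_mul_mat_f32 = R"(
struct Params { M : u32, N : u32, K : u32 }

@group(0) @binding(0) var<storage, read>       A : array<f32>; // M x K
@group(0) @binding(1) var<storage, read>       B : array<f32>; // K x N
@group(0) @binding(2) var<storage, read_write> C : array<f32>; // M x N
@group(0) @binding(3) var<uniform>             p : Params;

@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let row = gid.y;
    let col = gid.x;
    if (row >= p.M || col >= p.N) { return; }
    var acc = 0.0;
    for (var k = 0u; k < p.K; k = k + 1u) {
        acc = acc + A[row * p.K + k] * B[k * p.N + col];
    }
    C[row * p.N + col] = acc;
}
)";
```

A tiled/shared-memory version can replace the inner loop later without changing the binding layout.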

@austinvhuang

gpu.cpp author here, happy to collaborate with others on WebGPU kernel implementations. We're still in the somewhat exploratory phase of finding the best approach; one basic starting point is a WGSL variant of sboehm's matmul series: https://github.com/AnswerDotAI/gpu.cpp/blob/main/examples/matmul/run.cpp

We're working on a small set of transformer kernels but it will take work to get them performant + mature. +1 to not needing to do everything.

Another possibility is to leverage compiler toolchains like ONNX or tinygrad. There's probably a way to pull WGSL out of their output, though I haven't tried it myself yet.

ngxson (Collaborator) commented Nov 19, 2024

> gpu.cpp author here, happy to collaborate with others on WebGPU kernel implementations. We're still in the somewhat exploratory phase of finding the best approach; one basic starting point is a WGSL variant of sboehm's matmul series: https://github.com/AnswerDotAI/gpu.cpp/blob/main/examples/matmul/run.cpp

I had a quick look at the file; it seems like this is exactly what I need (inline shader as a string, less boilerplate).

Just a quick question: how do you handle "inplace" operations? For example, if I want to scale a vector, say v * 0.5f, I can just modify v directly without creating a new vector for the result. But this is currently quite messy to do because bind group entries can't overlap in WebGPU. One solution could be a dedicated "inplace" kernel that uses the read_write storage class, but I'm looking for a solution that can reuse the same kernel as the non-inplace version.
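
For illustration only (a sketch, not a settled design), such a dedicated in-place variant could sidestep the aliasing rule entirely by having a single read_write binding serve as both input and output:

```cpp
// Sketch of a dedicated in-place scale kernel: one read_write binding is
// both input and output, so no aliased bind group entries are needed.
static const char * k_wgsl_scale_inplace = R"(
struct Params { n : u32, s : f32 }

@group(0) @binding(0) var<storage, read_write> v : array<f32>;
@group(0) @binding(1) var<uniform>             p : Params;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    if (gid.x < p.n) {
        v[gid.x] = v[gid.x] * p.s;
    }
}
)";
```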

For adapting the kernels, I plan to base them on the ggml Kompute shaders, which have more or less the same syntax.

@audiovention
Hey guys, author of the original WebGPU PR here. Feel free to contact me; I might have some insights on wgpu quirks, although it was all about a year ago.
Some notes I have on the discussion:

  • I don't think there's much benefit in using extra wrappers like gpu.cpp or webgpu.cpp. WebGPU is already quite low on boilerplate (compare it to Vulkan, for example), and furthermore the real API is still quite unstable, let alone third-party projects without serious backing, so I think it's not ideal to depend on them. In particular, you'd want to be able to switch easily to the latest releases of the base engines and not have to wait for the third-party libs to add support.
  • There are still some serious differences between Dawn and wgpu (the Rust one). wgpu doesn't yet support f16 math; if you want to support both, test on wgpu.
  • Neither implementation supports cooperative matmul yet, which will heavily limit performance (5x at least, in my experience). As far as I understand, it's already being worked on in Dawn and might have a beta soon. Then again, the current Vulkan backend doesn't use it yet and still gets decent performance.
  • A few big things I fought with in our implementation were the buffer mapping rules: input and output buffers can't be the same, so you have to implement different shaders for in-place ops. Another is that buffer mapping has rather strict alignment rules, so you end up always having to pass a pointer and an offset to your kernels (see the sketch after this list).
  • We did the WebGPU implementation for easy cross-platform and cross-vendor support. However, in the end we released our product with Metal for macOS, as performance was much better there (due to coop/simdgroup matmul). If the Vulkan backend had existed back then, we would have used that for Windows. But so far the WebGPU backend seems to work great on users' machines.
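
A sketch of the pointer-plus-offset pattern from the list above (an illustration under assumed names, not the original PR's code): rather than relying on binding offsets, which have strict alignment requirements in WebGPU, the whole buffer is bound and per-tensor element offsets are passed in a small uniform:

```cpp
// Illustrative element-wise add with per-tensor element offsets in a uniform,
// so arbitrary (unaligned) tensor placements within one buffer still work.
static const char * k_wgsl_add_with_offsets = R"(
struct Params { n : u32, src0_offs : u32, src1_offs : u32, dst_offs : u32 }

@group(0) @binding(0) var<storage, read>       src0 : array<f32>;
@group(0) @binding(1) var<storage, read>       src1 : array<f32>;
@group(0) @binding(2) var<storage, read_write> dst  : array<f32>;
@group(0) @binding(3) var<uniform>             p    : Params;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    if (gid.x < p.n) {
        dst[p.dst_offs + gid.x] = src0[p.src0_offs + gid.x] + src1[p.src1_offs + gid.x];
    }
}
)";
```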

@audiovention
Here are some matmul kernels, BTW (not mine):
https://github.com/FL33TW00D/wgpu-mm
Probably a good starting point, and some of the best performance you'll get without coopmat.

@audiovention
One more thought: the main purpose of the Naga shader compiler from the wgpu project is to take WGSL shaders and compile them to MSL/SPIR-V/etc. However, it seems there's decent support for the opposite direction as well, i.e. taking GLSL or SPIR-V shaders and compiling them to WGSL. It might be a good starting point for rewriting a lot of the shaders.
https://github.com/gfx-rs/wgpu/tree/trunk/naga
