[Feature Request] Dawn C++ WebGPU backend #837
Comments
Here is an implementation of Stable Diffusion using WebGPU in Chrome, not Dawn. I found it interesting, but I don't know if it's useful for llama.cpp: https://github.com/mlc-ai/web-stable-diffusion
Agreed that Chrome makes more sense. If you want to run on a GPU locally, you should just run PyTorch; the whole point of llama.cpp is that it has no dependencies. I think running in the user's browser is a very interesting idea, but in practice it may be slow. Btw, the WebGPU API is constrained compared with CUDA, so I wonder whether you would get good performance.
Hello, I did a test with their implementation https://github.com/mlc-ai/web-llm, and my feeling is that the speed is maybe a little slower than with llama.cpp, but I only have an Iris Xe. What is interesting is that it recognizes my Intel card, which I don't think is easily possible with stock PyTorch. Could it be possible to use the GPU and CPU together in llama.cpp? Even as an option, it would be nice if it gained a few tokens.
I have tested it locally as well. It works pretty fast with a 4GB 4-bit quantized Vicuna 7B model. Web-llm is built on Apache TVM Unity (IRModule), compiled with Emscripten to WASM for the SentencePiece tokenizer. This natively supports WebGPU on different devices, but it's technologically challenging; keep in mind that web-llm comes from devs involved in TVM Unity development. The stack involves many components and is far from the simplicity of the GGML idea.
It's worth noting this naive GPT implementation in vanilla JavaScript that supports WebGPU.
Llama.cpp specifically targets the CPU, so it's unlikely such a dependency will be added, but see the discussion in #915. |
I've done a small first step towards that: |
Would WebGPU solve the 32-bit memory issue, since most of the layers/computations would move to GPU memory? #97
This issue was closed because it has been inactive for 14 days since being marked as stale. |
@ggerganov Hello, thanks to your GGUF release of Llama-3.2-1B-Instruct-Q4_K_M-GGUF, which is just 800MB and can easily be sharded into a few chunks, there is no need for WASM64. Could it be worth attempting to load it with WASM SIMD in the browser? Most browsers now in fact support SIMD and threads.
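For context, ggml already has a `__wasm_simd128__` code path, and the kind of kernel WASM SIMD gives you in the browser looks roughly like the sketch below, using Emscripten's `wasm_simd128.h` intrinsics and compiled with `emcc -msimd128` (the function name `dot_f32` is just illustrative, not something from llama.cpp):

```cpp
// Minimal WASM SIMD sketch: a float dot product with 128-bit lanes.
// Assumes compilation with `emcc -msimd128`; dot_f32 is a hypothetical name.
#include <wasm_simd128.h>
#include <stddef.h>

float dot_f32(const float *a, const float *b, size_t n) {
    v128_t acc = wasm_f32x4_splat(0.0f);           // 4 partial sums
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v128_t va = wasm_v128_load(a + i);         // load 4 floats from each input
        v128_t vb = wasm_v128_load(b + i);
        acc = wasm_f32x4_add(acc, wasm_f32x4_mul(va, vb));
    }
    float sum = wasm_f32x4_extract_lane(acc, 0)
              + wasm_f32x4_extract_lane(acc, 1)
              + wasm_f32x4_extract_lane(acc, 2)
              + wasm_f32x4_extract_lane(acc, 3);
    for (; i < n; ++i) sum += a[i] * b[i];         // scalar tail
    return sum;
}
```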
Today Chrome released WebGPU support in Chrome Beta.
Google's Dawn project is a standalone C++ implementation of WebGPU. It enables WebGPU support in other libraries; for example, this WIP provides Node.js bindings to Dawn, which would, in theory, enable WebGPU in Node.
So it should be possible to add Dawn as a GPU backend for Llama/GGML C++ math operations.
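As a rough illustration of what a first step might look like (not an actual patch), bringing up a Dawn device through the standard webgpu/webgpu.h header is roughly the sketch below. The callback signatures have changed across webgpu.h revisions; this assumes the older (status, handle, message, userdata) style, and relies on Dawn's native backends typically firing these request callbacks synchronously:

```cpp
// Minimal sketch: acquire a WebGPU adapter and device via Dawn's
// webgpu.h C header (valid C++). Assumes an older webgpu.h revision;
// newer headers use WGPURequestAdapterCallbackInfo structs instead.
#include <webgpu/webgpu.h>
#include <cassert>
#include <cstdio>

int main() {
    WGPUInstanceDescriptor inst_desc = {};
    WGPUInstance instance = wgpuCreateInstance(&inst_desc);

    // Request an adapter. With Dawn's native (non-browser) backends the
    // callback typically fires before RequestAdapter returns, so plain
    // userdata capture is enough for a sketch.
    WGPUAdapter adapter = nullptr;
    WGPURequestAdapterOptions opts = {};
    wgpuInstanceRequestAdapter(instance, &opts,
        [](WGPURequestAdapterStatus status, WGPUAdapter a,
           const char * /*msg*/, void *userdata) {
            if (status == WGPURequestAdapterStatus_Success)
                *static_cast<WGPUAdapter *>(userdata) = a;
        }, &adapter);
    assert(adapter && "no WebGPU adapter found");

    // Request a device from the adapter.
    WGPUDevice device = nullptr;
    WGPUDeviceDescriptor dev_desc = {};
    wgpuAdapterRequestDevice(adapter, &dev_desc,
        [](WGPURequestDeviceStatus status, WGPUDevice d,
           const char * /*msg*/, void *userdata) {
            if (status == WGPURequestDeviceStatus_Success)
                *static_cast<WGPUDevice *>(userdata) = d;
        }, &device);
    assert(device && "device creation failed");

    printf("Dawn device ready\n");
    return 0;
}
```

A real GGML backend would then map tensors to WGPUBuffer objects, express each op as a WGSL compute shader, and submit command buffers to the device queue.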