Problem
DeepSeek dropped a Sonnet-level model today, beating it on aider's latest leaderboard.
The catch is its 685B size. Luckily it's an MoE with 256 experts, and only 8 of them (roughly 30B parameters' worth) are used for each token. Running 8 queries in parallel is where exllama shines in my tests, at around 130 tok/s on a single 3090, so ~16 tok/s each.
However, that would require a very expensive amount of VRAM, in the ~400 GB range at 4.0 bpw. I have 128 GB of RAM + 72 GB of VRAM, so 2.0-2.1 bpw could fit if split efficiently (RIP quality; why am I even writing this?).
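For reference, the rough math behind those numbers (the 685B figure is from above; the 5% overhead for tensors kept at higher precision is just my guess):

```python
# Back-of-envelope for the weight footprint at different bit rates.
# 685B total parameters comes from the post; the small overhead factor
# for embeddings/norms kept at higher precision is an assumption, not
# a measured number.
TOTAL_PARAMS = 685e9

def weight_footprint_gb(bpw: float, overhead: float = 1.05) -> float:
    """Approximate weight storage in GB at a given bits-per-weight."""
    return TOTAL_PARAMS * bpw / 8 / 1e9 * overhead

for bpw in (4.0, 2.1, 2.0):
    print(f"{bpw:.1f} bpw -> ~{weight_footprint_gb(bpw):.0f} GB")

# 4.0 bpw -> ~360 GB of weights alone (KV cache and activations come
# on top, hence the ~400 GB ballpark); 2.0-2.1 bpw lands around
# 180-190 GB, just about squeezable into 128 GB RAM + 72 GB VRAM.
```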
Being able to offload experts to RAM would enable a ktransformers-like approach, where the shared expert and a fraction of the routed experts sit in VRAM and the rest stay in RAM, allowing for very efficient MoE inference.
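Roughly what I mean by that split, as a toy sketch (the sizes, module names and routing here are made-up stand-ins, not how exllama or ktransformers actually implement it; the RAM-resident experts are simply run on the CPU, ktransformers-style):

```python
# Toy MoE layer with a ktransformers-style placement: the shared
# expert and router live on the GPU, the routed experts live in system
# RAM and are executed on the CPU, and only their (small) outputs are
# moved back to the GPU. Dimensions are tiny stand-ins for the real
# 7168-dim / 256-expert / top-8 configuration.
import torch

D, N_EXPERTS, TOP_K = 1024, 16, 2
gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cpu = torch.device("cpu")

shared_expert = torch.nn.Linear(D, D).to(gpu)
router = torch.nn.Linear(D, N_EXPERTS).to(gpu)
routed_experts = [torch.nn.Linear(D, D).to(cpu) for _ in range(N_EXPERTS)]

def moe_forward(x_gpu: torch.Tensor) -> torch.Tensor:
    out = shared_expert(x_gpu)                    # always resident in VRAM
    scores = router(x_gpu).softmax(dim=-1)        # routing decision on GPU
    topk = scores.topk(TOP_K, dim=-1)
    x_cpu = x_gpu.to(cpu)                         # activations hop to RAM
    for rank in range(TOP_K):
        idx = topk.indices[0, rank].item()
        weight = topk.values[0, rank]
        # the expert runs where its weights live (CPU); only the output
        # travels back over PCIe, not the expert weights themselves
        out = out + weight * routed_experts[idx](x_cpu).to(gpu)
    return out

y = moe_forward(torch.randn(1, D, device=gpu))
print(y.shape)  # torch.Size([1, 1024])
```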
As long as the 8 active experts can be swapped from RAM to the GPUs, it would be theoretically possible to run this whale locally. Slowly, because of PCIe speeds for the swaps on every token, but I doubt you have plans to add CPU inference, which is bad at parallelism anyway.
At PCIe 4.0/5.0 x16 speeds, each card could pull the correct experts from RAM in maybe a second or two per token, for a wonderful ~1 tok/s if lucky.
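Back-of-envelope on that swap time (every number here is an assumption: ~30B parameters touched per token, 2.1 bpw, ~25 GB/s effective over PCIe 4.0 x16, and the worst case where none of the needed experts are already resident):

```python
# Per-token swap cost if the active experts have to be streamed from
# system RAM into VRAM on every token. All inputs are assumptions.
ACTIVE_PARAMS = 30e9   # parameters touched per token (routed + shared)
BPW = 2.1              # bits per weight after quantization
PCIE_GBPS = 25         # effective PCIe 4.0 x16 throughput, GB/s

bytes_per_token = ACTIVE_PARAMS * BPW / 8
swap_seconds = bytes_per_token / (PCIE_GBPS * 1e9)
print(f"~{bytes_per_token / 1e9:.1f} GB per token, "
      f"~{swap_seconds:.2f} s of swapping -> ~{1 / swap_seconds:.1f} tok/s ceiling")

# In practice the ceiling is lower: the data also has to be read out
# of system RAM first, and expert reuse between tokens decides how
# much of it actually needs to move each step.
```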
The more I dive into this, the more absurd it sounds. Anyway, what do you think? Any chance of getting this model to run in any way? ktransformers seems to be on hold.
Solution
Alternatives
No response
Explanation
Examples
No response
Additional context
No response
Acknowledgements
I have looked for similar requests before submitting this one.
I understand that the developers have lives and my issue will be answered when possible.
I understand the developers of this program are human, and I will make my requests politely.
The issue here is that you can't predict which experts you need until the routing layer, which is right before the expert layer. So you will constantly be loading weights from system RAM into VRAM, and the PCIe bandwidth becomes your bottleneck. At the end of the day you wouldn't be better off than if you were just keeping those weights in system RAM and doing CPU inference.
As for that, though, this model won't be as compute-heavy as its size implies, since it's super sparse. So you might be able to run it at the most aggressive quantization level in llama.cpp and get a token per second or so out of it. Which sucks, but a weight-swapping mechanism in ExLlama would be even slower, since it would have the same bandwidth limitation (system RAM) plus extra latency for PCIe and then CUDA.
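To put rough numbers on that comparison (the bandwidth figures are ballpark assumptions, and these are bandwidth-only best cases; real throughput lands well below them once compute, latency and overhead are included):

```python
# Either way the active expert weights have to be streamed out of
# system RAM on every token; the swap-to-GPU path just adds a slower
# PCIe hop on top of that. All figures below are assumptions.
bytes_per_token = 30e9 * 2.1 / 8   # ~8 GB of active weights per token
ram_gbps = 60                      # dual-channel DDR5-ish read bandwidth
pcie_gbps = 25                     # effective PCIe 4.0 x16 throughput

cpu_seconds = bytes_per_token / (ram_gbps * 1e9)
gpu_seconds = bytes_per_token / (min(ram_gbps, pcie_gbps) * 1e9)
print(f"CPU inference, bandwidth-only: ~{1 / cpu_seconds:.1f} tok/s best case")
print(f"Swap to GPU, bandwidth-only:   ~{1 / gpu_seconds:.1f} tok/s best case, "
      f"plus PCIe transfer setup and CUDA launch latency on top")
```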
I guess your best bet for running this locally would be investing in a second-hand CPU server with 512 GB of RAM and a pair of EPYC CPUs. They can be surprisingly cheap, actually.
Right, makes sense. I guess I can only hope for the ktransformers devs to upstream their changes into llama.cpp or update their repo. I'll try IQ2_XS once I'm done mounting my 3rd card. Curious to see how much a dual-CPU server like that would cost.