[REQUEST] Offloading a customizable number of experts into RAM for DeepSeek V3 685B? #706

Closed · 3 tasks done
TyraVex opened this issue Dec 26, 2024 · 2 comments

TyraVex commented Dec 26, 2024

Problem

DeepSeek dropped a Sonnet-level model today, beating it on Aider's latest leaderboard.

The catch is its 685B size. Luckily, it is a MoE with 256 experts, and only 8 of them (roughly 30B of parameters) are active for each token. In my tests, running 8 queries in parallel is where ExLlama shines, at around 130 tok/s total on a single 3090, i.e. ~16 tok/s each.

However, that would require a prohibitive amount of VRAM, in the ~400 GB range at 4.0 bpw. I have 128 GB of RAM plus 72 GB of VRAM, so 2.0-2.1 bpw could fit if split efficiently (RIP quality, but why am I even writing this).
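As a rough sanity check on those numbers (the ~20 GB overhead figure for higher-precision layers, KV cache and buffers is an assumption, not a measurement):

```python
# Back-of-the-envelope footprint for a 685B-parameter model at different
# quantization levels. The overhead constant is a guess.
PARAMS = 685e9

def footprint_gb(bpw, overhead_gb=20.0):
    """Approximate total memory in GB for a given bits-per-weight."""
    return PARAMS * bpw / 8 / 1e9 + overhead_gb

for bpw in (4.0, 2.1, 2.0):
    print(f"{bpw:.1f} bpw -> ~{footprint_gb(bpw):.0f} GB")
# 4.0 bpw -> ~362 GB (the ~400 GB range once context grows)
# 2.1 bpw -> ~200 GB, 2.0 bpw -> ~191 GB, close to 128 GB RAM + 72 GB VRAM
```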

Being able to offload experts to RAM would allow a ktransformers-like approach, where the shared expert and a fraction of the routed experts sit in VRAM and the rest stay in RAM, allowing for very efficient MoE inference.
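A minimal sketch of what such a split placement could look like, with made-up per-expert sizes and a hypothetical VRAM budget (the device map is the point, not the numbers):

```python
# Hypothetical placement plan: pin the shared expert plus as many routed
# experts as fit into a VRAM budget; everything else stays in system RAM.
def plan_placement(n_routed, expert_gb, shared_gb, vram_budget_gb):
    placement = {"shared_expert": "cuda"}      # shared expert always on GPU
    remaining = vram_budget_gb - shared_gb
    for i in range(n_routed):
        if remaining >= expert_gb:
            placement[f"expert_{i}"] = "cuda"
            remaining -= expert_gb
        else:
            placement[f"expert_{i}"] = "cpu"   # kept in RAM, fetched on demand
    return placement

# Made-up sizes: 256 routed experts of ~0.6 GB each at ~2 bpw, a 2 GB shared
# expert, and 40 GB of VRAM left over after attention and embedding weights.
plan = plan_placement(256, 0.6, 2.0, 40.0)
print(sum(dev == "cuda" for dev in plan.values()), "modules pinned to VRAM")
```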

As long as the 8 active experts can be swapped from RAM to the GPUs, it would be theoretically possible to run this whale locally. Slowly, because of PCIe speeds for the swaps on each token, but I doubt you have plans to do CPU inference, which is bad at parallelism anyway.

At PCIe 4.0/5.0 x16 speeds, each card could swap in the correct experts from RAM in a second or two, for a wonderful ~1 tok/s if lucky.
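A back-of-the-envelope version of that estimate, assuming ~30B active parameters at 2.0 bpw and rough practical PCIe bandwidth figures; latency, fragmented copies and compute are ignored, so these are upper bounds rather than predictions, and the gap down to ~1 tok/s would come from all that overhead:

```python
# Optimistic per-token swap cost if the active expert weights had to
# cross PCIe on every token. All figures below are rough assumptions.
ACTIVE_GB = 30e9 * 2.0 / 8 / 1e9     # ~30B active params at 2.0 bpw -> 7.5 GB

for name, bw_gbps in (("PCIe 4.0 x16", 25.0), ("PCIe 5.0 x16", 50.0)):
    seconds = ACTIVE_GB / bw_gbps    # transfer time per token, nothing else
    print(f"{name}: ~{seconds:.2f} s/token -> at most ~{1/seconds:.1f} tok/s")
```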

The more I dive into this, the more absurd it sounds. Anyway, what do you think? Any chance of getting this model to run at all? ktransformers seems to be on hold.

Solution

Alternatives

No response

Explanation

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
turboderp (Owner) commented

The issue here is that you can't predict which experts you need until the routing layer, which is right before the expert layer. So you will constantly be loading weights from system RAM into VRAM, and the PCIe bandwidth becomes your bottleneck. At the end of the day you wouldn't be better off than if you were just keeping those weights in system RAM and doing CPU inference.
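The ordering problem is easy to see in a generic top-k MoE forward pass (a sketch, not ExLlama's actual code): the expert indices only exist after the gate runs, so the expert weights cannot be prefetched ahead of the layer that needs them.

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=8):
    """Generic top-k MoE forward for one token (illustrative sketch only)."""
    logits = x @ gate_w                       # routing happens here...
    top = np.argsort(logits)[-top_k:]         # ...and only here do we learn
    weights = np.exp(logits[top])             # which experts this token needs.
    weights /= weights.sum()
    # Any RAM -> VRAM copy of experts[i] can only start now,
    # squarely on the critical path of the forward pass.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
gate_w = rng.standard_normal((64, 256))       # 256 routed experts
experts = [rng.standard_normal((64, 64)) for _ in range(256)]
y = moe_layer(x, gate_w, experts)
```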

As for CPU inference, though, this model won't be as compute-heavy as the size implies, since it is super sparse. So you might be able to run it at the most aggressive quantization level in llama.cpp and get a token per second or something out of it. Which sucks, but a weight-swapping mechanism in ExLlama would be even slower, since it would have the same bandwidth limitation (system RAM) plus extra latency for PCIe and then CUDA.

I guess your best bet for running this locally would be investing in a second-hand CPU server with 512 GB of RAM and a pair of EPYC CPUs. They can be surprisingly cheap, actually.

TyraVex (Author) commented Dec 27, 2024

Right, makes sense. I guess I can only hope for the ktransformers devs to upstream their changes into llama.cpp or update their repo. I'll try IQ2_XS once I'm done mounting my 3rd card. Curious to see how much a dual-CPU server like that would cost.

Edit: $1k for 512 GB, non-EPYC, wow.

TyraVex closed this as completed Dec 27, 2024