[REQUEST] Offloading a customizable number of experts into RAM for DeepSeek V3 685B? #706

Closed · 3 tasks done
TyraVex opened this issue Dec 26, 2024 · 2 comments

TyraVex commented Dec 26, 2024

Problem

DeepSeek dropped a Sonnet-level model today, beating it on Aider's latest leaderboard.

The catch is its 685B size. Luckily, it is a MoE with 256 experts, and only 8 of them (roughly 30B of parameters) are active for each token. In my tests, running 8 queries in parallel is where ExLlama shines, at around 130 tok/s total on a single 3090, i.e. ~16 tok/s each.

However, that would require a prohibitive amount of VRAM, in the ~400 GB range at 4.0 bpw. I have 128 GB of RAM plus 72 GB of VRAM, so 2.0-2.1 bpw could fit if split efficiently (RIP quality, but why am I even writing this).
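As a rough sanity check on those numbers (the ~20 GB overhead figure for higher-precision layers, KV cache and buffers is an assumption, not a measurement):

```python
# Back-of-the-envelope footprint for a 685B-parameter model at different
# quantization levels. The overhead constant is a guess.
PARAMS = 685e9

def footprint_gb(bpw, overhead_gb=20.0):
    """Approximate total memory in GB for a given bits-per-weight."""
    return PARAMS * bpw / 8 / 1e9 + overhead_gb

for bpw in (4.0, 2.1, 2.0):
    print(f"{bpw:.1f} bpw -> ~{footprint_gb(bpw):.0f} GB")
# 4.0 bpw -> ~362 GB (the ~400 GB range once context grows)
# 2.1 bpw -> ~200 GB, 2.0 bpw -> ~191 GB, close to 128 GB RAM + 72 GB VRAM
```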

Being able to offload experts to RAM would allow a ktransformers-like approach, where the shared expert and a fraction of the routed experts sit in VRAM and the rest stay in RAM, allowing for very efficient MoE inference.
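A minimal sketch of what such a split placement could look like, with made-up per-expert sizes and a hypothetical VRAM budget (the device map is the point, not the numbers):

```python
# Hypothetical placement plan: pin the shared expert plus as many routed
# experts as fit into a VRAM budget; everything else stays in system RAM.
def plan_placement(n_routed, expert_gb, shared_gb, vram_budget_gb):
    placement = {"shared_expert": "cuda"}      # shared expert always on GPU
    remaining = vram_budget_gb - shared_gb
    for i in range(n_routed):
        if remaining >= expert_gb:
            placement[f"expert_{i}"] = "cuda"
            remaining -= expert_gb
        else:
            placement[f"expert_{i}"] = "cpu"   # kept in RAM, fetched on demand
    return placement

# Made-up sizes: 256 routed experts of ~0.6 GB each at ~2 bpw, a 2 GB shared
# expert, and 40 GB of VRAM left over after attention and embedding weights.
plan = plan_placement(256, 0.6, 2.0, 40.0)
print(sum(dev == "cuda" for dev in plan.values()), "modules pinned to VRAM")
```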

As long as the 8 active experts can be swapped from RAM to the GPUs, it would be theoretically possible to run this whale locally. Slowly, because of PCIe speeds for the swaps on each token, but I doubt you have plans to do CPU inference, which is bad at parallelism anyway.

At PCIe 4.0/5.0 x16 speeds, each card could swap in the correct experts from RAM in a second or two, for a wonderful ~1 tok/s if lucky.
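A back-of-the-envelope version of that estimate, assuming ~30B active parameters at 2.0 bpw and rough practical PCIe bandwidth figures; latency, fragmented copies and compute are ignored, so these are upper bounds rather than predictions, and the gap down to ~1 tok/s would come from all that overhead:

```python
# Optimistic per-token swap cost if the active expert weights had to
# cross PCIe on every token. All figures below are rough assumptions.
ACTIVE_GB = 30e9 * 2.0 / 8 / 1e9     # ~30B active params at 2.0 bpw -> 7.5 GB

for name, bw_gbps in (("PCIe 4.0 x16", 25.0), ("PCIe 5.0 x16", 50.0)):
    seconds = ACTIVE_GB / bw_gbps    # transfer time per token, nothing else
    print(f"{name}: ~{seconds:.2f} s/token -> at most ~{1/seconds:.1f} tok/s")
```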

The more I dive into this, the more absurd it sounds. Anyway, what do you think? Any chance of getting this model to run at all? ktransformers seems to be on hold.

Solution

Alternatives

No response

Explanation

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
turboderp (Owner) commented

The issue here is that you can't predict which experts you need until the routing layer, which is right before the expert layer. So you will constantly be loading weights from system RAM into VRAM, and the PCIe bandwidth becomes your bottleneck. At the end of the day you wouldn't be better off than if you were just keeping those weights in system RAM and doing CPU inference.
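The ordering problem is easy to see in a generic top-k MoE forward pass (a sketch, not ExLlama's actual code): the expert indices only exist after the gate runs, so the expert weights cannot be prefetched ahead of the layer that needs them.

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=8):
    """Generic top-k MoE forward for one token (illustrative sketch only)."""
    logits = x @ gate_w                       # routing happens here...
    top = np.argsort(logits)[-top_k:]         # ...and only here do we learn
    weights = np.exp(logits[top])             # which experts this token needs.
    weights /= weights.sum()
    # Any RAM -> VRAM copy of experts[i] can only start now,
    # squarely on the critical path of the forward pass.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
gate_w = rng.standard_normal((64, 256))       # 256 routed experts
experts = [rng.standard_normal((64, 64)) for _ in range(256)]
y = moe_layer(x, gate_w, experts)
```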

As for CPU inference, though, this model won't be as compute-heavy as the size implies, since it is super sparse. So you might be able to run it at the most aggressive quantization level in llama.cpp and get a token per second or something out of it. Which sucks, but a weight-swapping mechanism in ExLlama would be even slower, since it would have the same bandwidth limitation (system RAM) plus extra latency for PCIe and then CUDA.

I guess your best bet for running this locally would be investing in a second-hand CPU server with 512 GB of RAM and a pair of EPYC CPUs. They can be surprisingly cheap, actually.

TyraVex (Author) commented Dec 27, 2024

Right, makes sense. I guess I can only hope for the ktransformers devs to upstream their changes into llama.cpp or update their repo. I'll try IQ2_XS once I'm done mounting my 3rd card. Curious to see how much a dual-CPU server like that would cost.

Edit: $1k for 512 GB, non-EPYC, wow.

TyraVex closed this as completed Dec 27, 2024