[BUG] Qwen2.5-72B-2.xxbpw / Llama-70B-2.4bpw (maybe related to KV caching code): garbage output on some specific prompts #697
Comments
Less than 3 bpw is a very aggressive quantization. Results are always going to be a bit mixed, especially if you add cache quantization into the mix as well. If this is consistent for certain prompts, I could try to look into those. If it's reproducible, possibly it relates to an overflow condition that could be addressed, but it's most likely just a case of running up against the limits of the EXL2 quantization scheme for a particular model.
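As a rough illustration of the quantization limits mentioned above, here is a minimal sketch that simulates a 4-bit grouped quantizer in PyTorch. It is not EXL2's actual scheme (the group size and symmetric rounding are assumptions); it only shows the scale of rounding error involved even before dropping below 3 bpw.

```python
# Illustrative only: simulate the rounding error of a simple 4-bit grouped
# quantizer on a random tensor. NOT the real EXL2 or Q4-cache scheme.
import torch

def fake_quant_4bit(x: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Round each group of values to 16 levels using a per-group scale."""
    orig_shape = x.shape
    g = x.reshape(-1, group_size)
    scale = g.abs().amax(dim=1, keepdim=True) / 7.0   # int4 range: -8..7
    q = torch.clamp(torch.round(g / scale), -8, 7)
    return (q * scale).reshape(orig_shape)

w = torch.randn(4096, 128)
w_q = fake_quant_4bit(w)
rel_err = (w - w_q).abs().mean() / w.abs().mean()
print(f"mean relative error at ~4 bits: {rel_err:.3f}")
```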
It also happens in FP16 cache mode. And it goes away after some of the cache space has been used up, which is THE head-scratching part.
Here's the full log. (It doesn't include the case where the same prompt stops triggering the bug after the cache is replaced, mentioned below.) Note: I changed every malloc call to calloc and added a memset to zero after each cudaMalloc; it fixes nothing though 🥹
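For reference, the Python-level analog of that experiment would be allocating the cache tensors zero-initialized instead of uninitialized. A minimal sketch with hypothetical shapes and names (this is not exllamav2's actual cache code):

```python
# Sketch: rule out "uninitialized memory being read" at the tensor level.
# torch.empty() returns uninitialized storage, torch.zeros() does not;
# swapping one for the other is the tensor-side analog of the
# malloc -> calloc / memset-after-cudaMalloc change described above.
import torch

def alloc_cache(batch, heads, seq_len, head_dim, zero_init=True,
                dtype=torch.float16, device="cuda"):
    alloc = torch.zeros if zero_init else torch.empty
    k = alloc(batch, heads, seq_len, head_dim, dtype=dtype, device=device)
    v = alloc(batch, heads, seq_len, head_dim, dtype=dtype, device=device)
    return k, v
```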
I recommend the model Athene-V2-Chat_exl2_2.25bpw: the reply quality is high and it doesn't mess up the replies. The quality of the 2.25bpw replies is amazing!
It is, until some prompt breaks it. I tested this model and it shows the same behaviour as the original 72B (not in the log though).
You can try both the Cohere and ChatML prompt_format options in exui. With tabbyAPI I found it only replies properly when prompt_template is left empty; with ChatML it repeats the response.
That is the problem I'm trying to get my head around. It works somewhat, but why?
It's because there's a problem with tabbyAPI's prompt template; I don't know the exact problem.
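For reference, a minimal sketch of what a ChatML-formatted prompt looks like, which makes template mismatches easier to spot; the authoritative template should be taken from the model's tokenizer_config.json / chat_template rather than this sketch:

```python
# Sketch of the ChatML turn layout (assumed; verify against the model's
# own chat_template). A server template that deviates from what the model
# was trained on can degrade output badly.
def format_chatml(system: str, user: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(format_chatml("You are a helpful assistant.", "Hello!"))
```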
I'm planning to add a GPU. Can I mix an RTX 2000-series with an RTX 3000-series card with exllama? If so, what overhead should I expect (per-token latency, VRAM size/VRAM speed)? @turboderp
20x0 is generally not recommended because anything pre-Ampere won't support Flash Attention. And performance overall isn't going to be great regardless, so having one slower CPU can become a severe bottleneck.
@turboderp slower "CPU"? BTW, this issue is still unclear on whether the model's weights are broken under 2.xxbpw or the code handling the cache is broken.
Sorry, I clicked the wrong button, I think. Not that I can add much, since 2.4bpw is expected to be very hit-and-miss. Flash Attention does make a big difference, yes, especially on longer prompts, and it matters a lot for memory overhead as well. The main thing you miss out on without Flash Attention is the paged cache, which means the generator can only run in a limited fallback mode. Compute capability 7.5 is also potentially going to be a problem just for software support going forward. That said, the 2080 Ti is a more powerful GPU overall than the 3060, and if you can get a hacked 22 GB version it may be worth considering. It will work, just in fallback mode without continuous batching, deduplication and what have you.
I'll try it then. Will a draft model work? Compute capability 7.5 is not THAT old; people are still banging on the 5.x~6.x cards 😄. I can always sell it and get a 3000-series one if it turns out to work badly, so wish me luck. Update: it works. Avg 12~14 without specdec and 20~30 with; super long contexts vary.
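For anyone mixing cards, a quick way to check which GPUs in the box can use Flash Attention (assuming the usual SM 8.0+ / Ampere requirement) and whether the flash_attn package is even installed:

```python
# List each visible GPU's compute capability and whether flash_attn imports,
# to predict which cards will fall back to the non-paged path.
import importlib.util
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i}: {name}, SM {major}.{minor}, "
          f"flash-attn capable: {major >= 8}")

print("flash_attn installed:",
      importlib.util.find_spec("flash_attn") is not None)
```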
Update: Athene-V2-Chat (a Qwen2.5-72B finetune) at 4.25bpw (made with 0.2.6) also outputs garbage on some prompts with a few specific system prompts (set in tokenizer_config.json or via the API system role), with a fresh cache. And it's not 100% reproducible: if one try doesn't trigger it, the following ones won't either, so reproducibility is a mess. It breaks with YaRN + 40k Q4 cache, while 32k Q4 cache without YaRN seems fine. It also seems very sensitive to the exact system prompt: adding a single dot "." decides whether it breaks or not. Only the specific combos above work correctly. What could go wrong? A model file broken during quantization? But how could it be: how large/broken can a number get when it's mapped from ~4 bits? Or is it NaN appearing during inference compute? Where is the compute code, is it all in this repo's CUDA kernels or is part of it done through PyTorch? BTW, no config combo works for Mistral-Large-2411-2.75bpw; maybe the same issue, maybe not.
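One way to narrow down the "broken weights vs. NaN during inference" question is to assert finiteness on the logits (or any intermediate tensor) at each generation step. A minimal, generic sketch, not tied to exllamav2's internals:

```python
# Hedged debugging helper: call on the per-step logits (or an intermediate
# activation). The first step where it fires tells you whether garbage
# appears immediately (suggesting weights) or only once the cache fills.
import torch

def check_finite(name: str, t: torch.Tensor) -> None:
    n_nan = torch.isnan(t).sum().item()
    n_inf = torch.isinf(t).sum().item()
    if n_nan or n_inf:
        raise RuntimeError(
            f"{name}: {n_nan} NaN / {n_inf} Inf values, "
            f"max |x| = {t.abs().nan_to_num().max().item():.3e}"
        )
```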
OS
Linux
GPU Library
CUDA 12.x
Python version
3.11
Pytorch version
2.4.0
Describe the bug
See log
Reproduction steps
Use tabbyAPI + the default template of these models, then prompt with the one in the log
Expected behavior
No garbage output
Logs
See below