
[BUG] Qwen2.5-72B-2.xxbpw/Llama-70B-2.4bpw (maybe related to KV caching code) garbage output on some specific prompts. #697

Open
Originalimoc opened this issue Dec 14, 2024 · 15 comments
Labels
bug Something isn't working

Comments

@Originalimoc

Originalimoc commented Dec 14, 2024

OS

Linux

GPU Library

CUDA 12.x

Python version

3.11

Pytorch version

2.4.0

Describe the bug

See log

Reproduction steps

Use tabbyAPI with the default template of these models, then prompt with the one in the log.
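
For context, a reproduction request against tabbyAPI's OpenAI-compatible chat endpoint might look roughly like the sketch below; the URL/port, auth header, model name and prompt text are placeholders and assumptions, not the exact values from the attached log.

```python
# Hypothetical reproduction sketch against tabbyAPI's OpenAI-compatible endpoint.
# URL, API key, model name and prompt are placeholders, not values from the log.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    headers={"Authorization": "Bearer <your-tabbyAPI-key>"},
    json={
        "model": "Qwen2.5-72B-Instruct-2.35bpw-exl2",
        "messages": [{"role": "user", "content": "<prompt from the attached log>"}],
        "max_tokens": 256,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```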

Expected behavior

No garbage output.

Logs

See below

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
@Originalimoc Originalimoc added the bug Something isn't working label Dec 14, 2024
@Originalimoc Originalimoc changed the title [BUG] Qwen2.5-72B-2.35bpw (maybe related to KV caching code) garbage output on some specific prompts. [BUG] Qwen2.5-72B-2.xxbpw (maybe related to KV caching code) garbage output on some specific prompts. Dec 14, 2024
@turboderp
Owner

Less than 3 bpw is a very aggressive quantization. Results are always going to be a bit mixed, especially if you add cache quantization into the mix as well. If this is consistent for certain prompts I could try to look into those. If it's reproducible, possibly it relates to an overflow condition that could be addressed, but it's most likely just a case of running up against the limits of the EXL2 quantization scheme for a particular model.

@Originalimoc
Author

This also happens in FP16 cache mode. And it goes away after some of the cache space has been used up, which is THE head-scratching part.

@Originalimoc
Author

Originalimoc commented Dec 15, 2024

tabby_api.log.txt

Here is the full log. (It doesn't include the run where the same prompt stops producing the bug after the cache is replaced, mentioned below.) Note: I changed every malloc call to calloc and added a memset to 0 after each cudaMalloc; that fixed nothing though 🥹

@Originalimoc Originalimoc reopened this Dec 15, 2024
@Originalimoc
Author

I also asked Gemini 2.0 Flash for a full code review, but it found nothing; 300k+ tokens of context is too hard, lol 😅:
[screenshot of Gemini's response attached]

@Originalimoc Originalimoc changed the title [BUG] Qwen2.5-72B-2.xxbpw (maybe related to KV caching code) garbage output on some specific prompts. [BUG] Qwen2.5-72B-2.xxbpw/Llama-70B-2.4bpw (maybe related to KV caching code) garbage output on some specific prompts. Dec 15, 2024
@xldistance

I recommend the model Athene-V2-Chat_exl2_2.25bpw; the quality of the replies is high and it doesn't mess up the replies. The quality of the 2.25bpw replies is amazing!

@Originalimoc
Author

Originalimoc commented Dec 17, 2024

I recommend the model Athene-V2-Chat_exl2_2.25bpw; the quality of the replies is high and it doesn't mess up the replies. The quality of the 2.25bpw replies is amazing!

It is, until some prompt breaks it. I tested this model and it shows the same behaviour as the original 72B, though that run isn't in the log.

@xldistance

xldistance commented Dec 17, 2024

You can try both the Cohere and ChatML prompt_format settings in exui. Using tabbyAPI, I found it only replies properly when prompt_template is left empty; with ChatML it repeats the response.
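
For reference, ChatML (the format Qwen-family models are trained on) wraps each turn in <|im_start|>/<|im_end|> tokens. A minimal sketch of what the rendered prompt looks like; build_chatml_prompt is a purely illustrative helper, and the exact special tokens and default system prompt should come from the model's tokenizer_config.json:

```python
# Minimal illustration of a ChatML-rendered prompt; the special tokens and default
# system prompt should be taken from the model's tokenizer_config.json.
def build_chatml_prompt(system: str, user: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_chatml_prompt("You are a helpful assistant.", "Hello!"))
```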

@Originalimoc
Author

That is the problem I'm trying to get my head around. It works somewhat, but why?

@xldistance

That is the problem I'm trying to get my head around. It works somewhat, but why?

It's because there's a problem with tabbyAPI's prompt template; I don't know the exact cause.

@Originalimoc
Author

Originalimoc commented Dec 18, 2024

I'm planning to add a GPU. Can I mix an RTX 2000-series card with an RTX 3000-series card with exllama? If so, what overhead should I expect (per-token latency, roughly VRAM size / VRAM bandwidth)? @turboderp

@turboderp
Owner

The 20x0 series is generally not recommended because anything pre-Ampere won't support Flash Attention. And performance overall isn't going to be great regardless, so having one slower CPU can become a severe bottleneck.

@Originalimoc
Author

Originalimoc commented Dec 18, 2024

@turboderp slower "CPU"?
I don't use batching, so does Flash Attention matter that much? The 22 GB pricing is really good, almost half of a 3090. Will a 3060 at 360 GB/s be faster than a 2080 Ti at 616 GB/s because of Flash Attention? I'm expecting ~10 tok/s, given that 1/(22/616 + 22/616) ≈ 14 full VRAM reads per second, while 1/(22/616 + 24/1000) is only ≈16.7. And with a draft model added to the equation, can it be even faster?
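
Spelling that estimate out as a sketch, assuming each generated token requires one full read of the weights resident on each card and ignoring compute, PCIe transfers and cache reads; the figures are the rough VRAM sizes and bandwidths from this thread:

```python
# Back-of-the-envelope tokens/s ceiling for a model split across GPUs:
# time per token ≈ sum over GPUs of (GB of weights on that GPU) / (GB/s bandwidth).
def max_tokens_per_second(split_gb, bandwidth_gbps):
    return 1.0 / sum(size / bw for size, bw in zip(split_gb, bandwidth_gbps))

print(max_tokens_per_second([22, 22], [616, 616]))   # two 2080 Ti 22 GB    -> ~14.0
print(max_tokens_per_second([22, 24], [616, 1000]))  # 2080 Ti + 3090 24 GB -> ~16.7
```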

BTW, the topic of this issue is still open: it's unclear whether the model's weights are broken at 2.xxbpw or the code handling the cache is broken.

@turboderp turboderp reopened this Dec 18, 2024
@turboderp
Owner

Sorry, I clicked the wrong button I think. Not that I can add much, since 2.4bpw is expected to be very hit and miss.

Flash Attention does make a big difference, yes. Especially on longer prompts, and it matters a lot for memory overhead as well. The main thing you miss out on without Flash Attention is the paged cache, which means the generator can only run in a limited fallback mode. Compute capability 7.5 is also potentially going to be a problem just for software support going forward.

That said the 2080Ti is a more powerful GPU overall than the 3060, and if you can get a hacked 22 GB version it may be worth considering. It will work, just in fallback mode without continuous batching, deduplication and what have you.
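
To check which cards end up in that fallback path, a minimal sketch using PyTorch's device query; flash-attn 2 requires compute capability 8.0+ (Ampere or newer), and supports_flash_attention is just an illustrative name:

```python
import torch

def supports_flash_attention(device_index: int) -> bool:
    # flash-attn 2 needs an Ampere-or-newer GPU (compute capability >= 8.0);
    # Turing cards such as the 2080 Ti report 7.5 and force the fallback path.
    major, minor = torch.cuda.get_device_capability(device_index)
    return (major, minor) >= (8, 0)

for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i), "-> flash-attn capable:",
          supports_flash_attention(i))
```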

@Originalimoc
Author

Originalimoc commented Dec 18, 2024

I'll try it then. Will a draft model work? Compute capability 7.5 is not THAT old; people are still banging on 5.x~6.x cards 😄. I can always sell it and get a 3000-series one if it turns out to work badly, so wish me luck. Update: it works. Average 12~14 tok/s without speculative decoding and 20~30 with it; super long context varies.

@Originalimoc
Author

Originalimoc commented Dec 23, 2024

Update: Athene-V2-Chat (a Qwen2.5-72B finetune) at 4.25bpw (made with 0.2.6) also outputs garbage on some prompts when combined with a few specific system prompts (set in tokenizer_config.json or via the API system role).

This is with a fresh cache, and it's not 100% reproducible: if one try doesn't break, the following ones won't either, so reproducibility is a mess... It happens with YaRN + a 40k Q4 cache; 32k Q4 without YaRN seems fine. It also seems very sensitive to the system prompt settings.

Adding a dot (.) can decide whether it breaks or not. Only the specific combinations above work correctly.

What could go wrong? A broken model file from quantization? But how could that be; how large/broken can a number mapped from ~4 bits even get? Or is it NaN appearing only during inference compute? Where is the compute code: is it all in this repo's CUDA kernels, or is part of it done through PyTorch?
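
One way to narrow that down without touching the CUDA kernels would be to check whether garbage output coincides with non-finite logits right before sampling. A generic sketch, not tied to a specific exllamav2 API; logits stands for whatever tensor the sampler receives:

```python
import torch

def report_non_finite(logits: torch.Tensor, step: int) -> None:
    # If garbage output lines up with NaN/Inf here, the problem is in the compute
    # path (e.g. an overflow); if the logits stay finite, broken weights or the
    # prompt template are the more likely culprits.
    bad = ~torch.isfinite(logits)
    if bad.any():
        finite = logits[~bad]
        max_abs = finite.abs().max().item() if finite.numel() else float("nan")
        print(f"step {step}: {bad.sum().item()} non-finite logits, "
              f"max |finite logit| = {max_abs:.3e}")
```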

BTW, for Mistral-Large-2411 at 2.75bpw no config combination works; maybe it's the same issue, maybe not.
