[BUG] Qwen2.5-72B-2.xxbpw / Llama-70B-2.4bpw (maybe related to KV caching code): garbage output on some specific prompts #697
Comments
Less than 3 bpw is a very aggressive quantization. Results are always going to be a bit mixed, especially if you add cache quantization into the mix as well. If this is consistent for certain prompts, I could try to look into those. If it's reproducible, possibly it relates to an overflow condition that could be addressed, but it's most likely just a case of running up against the limits of the EXL2 quantization scheme for a particular model.
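As a rough illustration of the quantization limits mentioned above, here is a minimal sketch that simulates a 4-bit grouped quantizer in PyTorch. It is not EXL2's actual scheme (the group size and symmetric rounding are assumptions); it only shows the scale of rounding error involved even before dropping below 3 bpw.

```python
# Illustrative only: simulate the rounding error of a simple 4-bit grouped
# quantizer on a random tensor. NOT the real EXL2 or Q4-cache scheme.
import torch

def fake_quant_4bit(x: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Round each group of values to 16 levels using a per-group scale."""
    orig_shape = x.shape
    g = x.reshape(-1, group_size)
    scale = g.abs().amax(dim=1, keepdim=True) / 7.0   # int4 range: -8..7
    q = torch.clamp(torch.round(g / scale), -8, 7)
    return (q * scale).reshape(orig_shape)

w = torch.randn(4096, 128)
w_q = fake_quant_4bit(w)
rel_err = (w - w_q).abs().mean() / w.abs().mean()
print(f"mean relative error at ~4 bits: {rel_err:.3f}")
```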
It also happens in FP16 cache mode. And it goes away after some of the cache space has been used up, which is THE head-scratching part.
Here's the full log. (It doesn't include the case where the same prompt stops triggering the bug after the cache is replaced, mentioned below.) Note: I changed every malloc call to calloc and added a memset to zero after each cudaMalloc; it fixes nothing though 🥹
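For reference, the Python-level analog of that experiment would be allocating the cache tensors zero-initialized instead of uninitialized. A minimal sketch with hypothetical shapes and names (this is not exllamav2's actual cache code):

```python
# Sketch: rule out "uninitialized memory being read" at the tensor level.
# torch.empty() returns uninitialized storage, torch.zeros() does not;
# swapping one for the other is the tensor-side analog of the
# malloc -> calloc / memset-after-cudaMalloc change described above.
import torch

def alloc_cache(batch, heads, seq_len, head_dim, zero_init=True,
                dtype=torch.float16, device="cuda"):
    alloc = torch.zeros if zero_init else torch.empty
    k = alloc(batch, heads, seq_len, head_dim, dtype=dtype, device=device)
    v = alloc(batch, heads, seq_len, head_dim, dtype=dtype, device=device)
    return k, v
```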
I recommend the model Athene-V2-Chat_exl2_2.25bpw: the reply quality is high and it doesn't mess up the replies. The quality of the 2.25bpw replies is amazing!
It is, until some prompt breaks it. I tested this model and it shows the same behaviour as the original 72B (not in the log though).
You can try both the Cohere and ChatML prompt_format options in exui. With tabbyAPI I found it only replies properly when prompt_template is left empty; with ChatML it repeats the response.
That is the problem I'm trying to get my head around. It works somewhat, but why?
It's because there's a problem with tabbyAPI's prompt template; I don't know the exact problem.
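For reference, a minimal sketch of what a ChatML-formatted prompt looks like, which makes template mismatches easier to spot; the authoritative template should be taken from the model's tokenizer_config.json / chat_template rather than this sketch:

```python
# Sketch of the ChatML turn layout (assumed; verify against the model's
# own chat_template). A server template that deviates from what the model
# was trained on can degrade output badly.
def format_chatml(system: str, user: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(format_chatml("You are a helpful assistant.", "Hello!"))
```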
I'm planning to add a GPU. Can I mix an RTX 2000-series with an RTX 3000-series card with exllama? If so, what overhead should I expect (per-token latency, VRAM size/VRAM speed)? @turboderp
20x0 is generally not recommended because anything pre-Ampere won't support Flash Attention. And performance overall isn't going to be great regardless, so having one slower CPU can become a severe bottleneck.
@turboderp slower "CPU"? BTW, this issue is still unclear on whether the model's weights are broken under 2.xxbpw or the code handling the cache is broken.
Sorry, I clicked the wrong button, I think. Not that I can add much, since 2.4bpw is expected to be very hit-and-miss. Flash Attention does make a big difference, yes, especially on longer prompts, and it matters a lot for memory overhead as well. The main thing you miss out on without Flash Attention is the paged cache, which means the generator can only run in a limited fallback mode. Compute capability 7.5 is also potentially going to be a problem just for software support going forward. That said, the 2080 Ti is a more powerful GPU overall than the 3060, and if you can get a hacked 22 GB version it may be worth considering. It will work, just in fallback mode without continuous batching, deduplication and what have you.
I'll try it then. Will a draft model work? Compute capability 7.5 is not THAT old; people are still banging on the 5.x~6.x cards 😄. I can always sell it and get a 3000-series one if it turns out to work badly, so wish me luck. Update: it works. Avg 12~14 without specdec and 20~30 with; super long contexts vary.
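For anyone mixing cards, a quick way to check which GPUs in the box can use Flash Attention (assuming the usual SM 8.0+ / Ampere requirement) and whether the flash_attn package is even installed:

```python
# List each visible GPU's compute capability and whether flash_attn imports,
# to predict which cards will fall back to the non-paged path.
import importlib.util
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i}: {name}, SM {major}.{minor}, "
          f"flash-attn capable: {major >= 8}")

print("flash_attn installed:",
      importlib.util.find_spec("flash_attn") is not None)
```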
Update: Athene-V2-Chat (a Qwen2.5-72B finetune) at 4.25bpw (made with 0.2.6) also outputs garbage on some prompts with a few specific system prompts (set in tokenizer_config.json or via the API system role), with a fresh cache. And it's not 100% reproducible: if one try doesn't trigger it, the following ones won't either, so reproducibility is a mess. It breaks with YaRN + 40k Q4 cache, while 32k Q4 cache without YaRN seems fine. It also seems very sensitive to the exact system prompt: adding a single dot "." decides whether it breaks or not. Only the specific combos above work correctly. What could go wrong? A model file broken during quantization? But how could it be: how large/broken can a number get when it's mapped from ~4 bits? Or is it NaN appearing during inference compute? Where is the compute code, is it all in this repo's CUDA kernels or is part of it done through PyTorch? BTW, no config combo works for Mistral-Large-2411-2.75bpw; maybe the same issue, maybe not.
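One way to narrow down the "broken weights vs. NaN during inference" question is to assert finiteness on the logits (or any intermediate tensor) at each generation step. A minimal, generic sketch, not tied to exllamav2's internals:

```python
# Hedged debugging helper: call on the per-step logits (or an intermediate
# activation). The first step where it fires tells you whether garbage
# appears immediately (suggesting weights) or only once the cache fills.
import torch

def check_finite(name: str, t: torch.Tensor) -> None:
    n_nan = torch.isnan(t).sum().item()
    n_inf = torch.isinf(t).sum().item()
    if n_nan or n_inf:
        raise RuntimeError(
            f"{name}: {n_nan} NaN / {n_inf} Inf values, "
            f"max |x| = {t.abs().nan_to_num().max().item():.3e}"
        )
```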
OS
Linux
GPU Library
CUDA 12.x
Python version
3.11
Pytorch version
2.4.0
Describe the bug
See log
Reproduction steps
Use tabbyAPI + the default template of these models, then prompt with the one in the log
Expected behavior
No garbage output
Logs
See below