How to dequantize a model with 4 groups and centroids greater than 4096? #128

Open
ShawnzzWu opened this issue Nov 29, 2024 · 6 comments
@ShawnzzWu

I've been trying to quantize and run the Meta-Llama-3.1-8B-Instruct-2.3bit model with the group number set to 4, and I can run the model successfully when k1 (centroids) is 4096, as in the paper. However, any k1 setting above that (8192, 16384, 65536) leads to a successful quantization but a failure when running the model. The error logs suggest the cause is an illegal memory access in the dequant function.

So here's what I want to ask: does the code support running a model with the group number option enabled and centroids set to 8k or greater? Or do I need to make some adjustments to get it to work?

Looking forward to your reply.

@YangWang92
Contributor

Sorry about that; the multi-group CUDA kernel does seem to have some problems. If you want to evaluate accuracy for now, you can try using the Torch version and performing layer-by-layer evaluation, although it might be a bit slow. We'll work on fixing this part soon. I'm a bit busy these days, but I'll look into it as soon as possible. Thank you for your patience!
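For reference, an accuracy check with the slow Torch (dequantized) path can be approximated by a plain perplexity loop once the checkpoint loads through transformers. This is only a minimal sketch under assumptions: the checkpoint path, the wikitext-2 dataset, and the 2048-token window are placeholders, and it is not VPTQ's own evaluation script.

```python
# Hedged sketch, not from the thread: a generic perplexity check that can be run
# with the Torch dequantization path. Assumes the quantized checkpoint loads as a
# regular Hugging Face causal LM; the path and dataset below are placeholders.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "outputs/Meta-Llama-3.1-8B"  # placeholder: your quantized checkpoint
seq_len = 2048                            # evaluation window, adjust as needed

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# wikitext-2 test split is a common perplexity benchmark
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

nlls = []
for start in range(0, ids.size(1) - seq_len, seq_len):
    chunk = ids[:, start : start + seq_len].to(model.device)
    with torch.no_grad():
        # labels=chunk makes the model return the mean cross-entropy over the window
        out = model(chunk, labels=chunk)
    nlls.append(out.loss.float())

print(f"perplexity: {torch.exp(torch.stack(nlls).mean()).item():.3f}")
```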

YangWang92 added the bug and question labels on Nov 29, 2024
@ShawnzzWu
Author

Okay, thanks for the heads-up. Looking forward to the update.

@wejoncy
Contributor

wejoncy commented Nov 29, 2024

Hi @ShawnzzWu

Would you mind sharing your quantized model so I can debug it?

@ShawnzzWu
Author

Sorry, for information security reasons I'm not allowed to share the file with you directly, but basically I only changed the --group_num and --num_centroids options during quantization.

@YangWang92
Contributor

Can you share the configuration or args used for quantization? Thanks!

@ShawnzzWu
Author


```bash
python -u run_vptq.py \
    --model_name Meta-Llama-3.1-8B-Instruct \
    --output_dir outputs/Meta-Llama-3.1-8B \
    --vector_lens -1 12 \
    --group_num 4 \
    --num_centroids -1 8192 \
    --num_res_centroids -1 4096 \
    --npercent 0 \
    --blocksize 128 \
    --new_eval \
    --seq_len 8192 \
    --kmeans_mode hessian \
    --num_gpus 8 \
    --enable_perm \
    --enable_norm \
    --save_model \
    --save_packed_model \
    --hessian_path /Hessians-Llama-31-8B-Instruct-6144-8k \
    --inv_hessian_path /InvHessians-Llama-31-8B-Instruct-6144-8k \
    --ktol 1e-5 --kiter 100
```

These are the quantization args I can provide. The quantization process itself looked fine to me, but the model fails with an error when run via "python -m vptq --model=VPTQ-community/Meta-Llama-3.1-8B --chat".
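One possible way to narrow this down, offered only as a guess: 8192 centroids need 13 index bits while 4096 fit in 12, so a mismatch between the packed index width and what the dequant kernel expects could produce exactly this kind of illegal memory access. The sketch below scans a packed safetensors checkpoint and reports any integer "indices"-like tensors whose values reach num_centroids; the directory path and the name filter are assumptions, not VPTQ's documented layout.

```python
# Hedged diagnostic sketch (assumptions: the packed model is stored as safetensors,
# and centroid indices are saved as integer tensors; the path and the "indice"
# name filter below are guesses, not VPTQ's documented layout).
from pathlib import Path

import torch
from safetensors import safe_open

model_dir = Path("outputs/Meta-Llama-3.1-8B")  # placeholder: packed model directory
num_centroids = 8192                           # the value passed to --num_centroids

for shard in sorted(model_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        for key in f.keys():
            t = f.get_tensor(key)
            if not torch.is_floating_point(t) and "indice" in key.lower():
                # An index >= num_centroids (or a dtype too narrow to hold 8192)
                # would explain an illegal memory access inside the dequant kernel.
                print(f"{key}: dtype={t.dtype}, max={t.max().item()}, "
                      f"out_of_range={bool((t.max() >= num_centroids).item())}")
```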
