How to Generate a 2-bit Quantized Meta-Llama-3.1-8B-Instruct Model? #126
Comments
Try this one; you can download the Hessian matrices from https://huggingface.co/collections/VPTQ-community/hessian-and-invhessian-checkpoints-66fd249a104850d17b23fd8b.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run_vptq.py \
--model_name Qwen/Qwen2.5-7B-Instruct \
--output_dir outputs/Qwen2.5-7B-Instruct/ \
--vector_lens -1 12 \
--group_num 1 \
--num_centroids -1 65536 \
--num_res_centroids -1 4096 \
--npercent 0 \
--blocksize 128 \
--new_eval \
--seq_len 8192 \
--kmeans_mode hessian \
--num_gpus 8 \
--enable_perm \
--enable_norm \
--save_model \
--save_packed_model \
--hessian_path Hessians-Qwen2.5-7B-Instruct-6144-8k \
--inv_hessian_path InvHessians-Qwen2.5-7B-Instruct-6144-8k \
--ktol 1e-5 --kiter 100
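If you prefer to script the download, huggingface_hub's snapshot_download should work; note that the repo ids below are guesses based on the collection's naming, so confirm the exact names on the collection page linked above:

from huggingface_hub import snapshot_download

# Assumed repo ids; check the VPTQ-community collection for the exact names.
snapshot_download(repo_id="VPTQ-community/Hessians-Qwen2.5-7B-Instruct-6144-8k",
                  local_dir="Hessians-Qwen2.5-7B-Instruct-6144-8k")
snapshot_download(repo_id="VPTQ-community/InvHessians-Qwen2.5-7B-Instruct-6144-8k",
                  local_dir="InvHessians-Qwen2.5-7B-Instruct-6144-8k")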
Thanks for your response. I noticed that the command you provided is set up for quantizing Qwen2.5-7B-Instruct, so I adapted it for Meta-Llama-3.1-8B-Instruct as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python run_vptq.py \
--model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
--output_dir outputs/Meta-Llama-3.1-8B-Instruct-2.3bit/ \
--vector_lens -1 12 \
--group_num 1 \
--num_centroids -1 65536 \
--num_res_centroids -1 4096 \
--npercent 0 \
--blocksize 128 \
--new_eval \
--seq_len 8192 \
--kmeans_mode hessian \
--num_gpus 6 \
--enable_perm \
--enable_norm \
--save_model \
--save_packed_model \
--hessian_path Hessians-Llama-31-8B-Instruct-6144-8k \
--inv_hessian_path InvHessians-Llama-31-8B-Instruct-6144-8k \
--ktol 1e-5 --kiter 100

Additionally, I would like to better understand the relationship between these quantization parameters and the resulting bit-width. I also noticed in Table 10 of the VPTQ paper's appendix that fine-tuning was applied to the 2.3-bit quantized Meta-Llama-3.1-8B-Instruct model; will that fine-tuning setup be released as well? Thank you for your excellent work! I am looking forward to the release of the detailed quantization tutorial.
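To make my bit-width question more concrete, here is my current rough understanding of how these settings map to an effective bit-width (my own approximation; it ignores overheads such as the --enable_norm scales, permutation indices, and padding):

import math

def estimated_bits_per_weight(vector_len, num_centroids, num_res_centroids):
    # Each length-`vector_len` vector stores one main index and one residual index.
    index_bits = math.log2(num_centroids)    # 65536 -> 16 bits
    res_bits = math.log2(num_res_centroids)  # 4096  -> 12 bits
    return (index_bits + res_bits) / vector_len

print(estimated_bits_per_weight(12, 65536, 4096))  # ~2.33 bits per weight

If that reading is right, it matches the 2.3bit tag in my output directory, but I would appreciate confirmation.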
Yes, you can directly replace the model name and Hessian matrix to quantize different models. Additionally, here is a quick guide on setting the quantization parameters: https://github.com/microsoft/VPTQ?tab=readme-ov-file#models-from-open-source-community. The fine-tuning code has not been open-sourced yet. It is based on a simple modification of LlamaFactory, and I will release this part soon. Please stay tuned.
I have successfully obtained a 2-bit quantized version of the Meta-Llama-3.1-8B-Instruct model. The command I used is as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -u run_vptq.py \
--model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
--output_dir outputs/Meta-Llama-3.1-8B-Instruct-2.3bit-custom_1pOutlier-woQuant/ \
--vector_lens -1 12 \
--group_num 1 \
--num_centroids -1 65536 \
--num_res_centroids -1 4096 \
--npercent 1 \
--blocksize 128 \
--new_eval \
--seq_len 8192 \
--kmeans_mode hessian \
--num_gpus 8 \
--enable_perm \
--enable_norm \
--save_packed_model \
--hessian_path Hessians-Llama-31-8B-Instruct-6144-8k \
--inv_hessian_path InvHessians-Llama-31-8B-Instruct-6144-8k \
--ktol 1e-5 --kiter 100
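One thing I noticed about this configuration: with --npercent 1 and the outlier entry of --num_centroids left at -1, my understanding (see the code branch I quote below) is that roughly 1% of the weights stay in their original 16-bit precision, which pushes the average bit-width above the ~2.33-bit estimate. A quick back-of-the-envelope check under that assumption:

# Assumed: 1% of weights kept in fp16, the rest at ~2.33 bits (16-bit main index
# plus 12-bit residual index over length-12 vectors); other overheads ignored.
quantized_bits = (16 + 12) / 12          # ~2.33 bits per quantized weight
outlier_fraction = 0.01                  # --npercent 1
avg_bits = outlier_fraction * 16 + (1 - outlier_fraction) * quantized_bits
print(round(avg_bits, 2))                # ~2.47 bits per weight on average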
This configuration looks a bit odd. When we set ...
Thank you for your quick response. I set --npercent 1 because I found this branch in the code:

if num_centroids == -1:  # Do not quantize, keep original data

I assume this allows the original outlier weights to be kept. Additionally, if the ...
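To illustrate my assumption, here is a toy sketch of how I read the -1 sentinel (this is not the actual run_vptq.py code; the interpretation of the paired values such as --num_centroids -1 65536, with the first entry for the npercent outlier group and the second for the remaining weights, is my guess):

def toy_quantize_group(weights, num_centroids):
    # Toy stand-in for per-group quantization, mirroring the quoted branch.
    if num_centroids == -1:
        # Do not quantize, keep original data (e.g., the npercent outlier columns).
        return weights
    # Otherwise a codebook of `num_centroids` centroid vectors would be fitted
    # (k-means over the sub-vectors); omitted in this sketch.
    raise NotImplementedError("codebook fitting omitted in this toy example")

# With --npercent 1 and --num_centroids -1 65536, the ~1% outlier group would hit
# the -1 branch and stay unquantized, while the remaining weights would use a
# 65536-entry codebook (plus a 4096-entry residual codebook).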
Hmm, thanks for the reminder! I haven't checked this part of the code in a long time. Let me take a look.
I found a similar closed issue related to this topic. Following your reply in that issue, I successfully configured the vptq-algo environment based on the tutorial in the algorithm branch. The "Quantization on Meta-Llama-3.1-8B-Instruct" section provides an example of using VPTQ to generate a 3-bit quantized Meta-Llama-3.1-8B-Instruct model. However, if I want to generate the 2.3-bit quantized Meta-Llama-3.1-8B-Instruct model provided in VPTQ-community, how should I configure the parameters for run_vptq.py? Specifically, which arguments should I adjust to achieve 2.3-bit quantization? Looking forward to your reply.