How to Generate a 2-bit Quantized Meta-Llama-3.1-8B-Instruct Model? #126

Open
ForAxel opened this issue Nov 21, 2024 · 7 comments
Assignees: YangWang92
Labels: question (Further information is requested)

ForAxel commented Nov 21, 2024

I found a similar closed issue related to this topic. Following your reply in that issue, I successfully configured the vptq-algo environment based on the tutorial in the algorithm branch.
The "Quantization on Meta-Llama-3.1-8B-Instruct" section provides an example of using VPTQ to generate a 3-bit quantized Meta-Llama-3.1-8B-Instruct model. However, if I want to generate the 2.3-bit quantized Meta-Llama-3.1-8B-Instruct model provided in the VPTQ-community collection, how should I configure the parameters for run_vptq.py? Specifically, which arguments should I adjust to achieve 2.3-bit quantization?
Looking forward to your reply.

YangWang92 (Contributor) commented Nov 21, 2024

Try this one, and you can download the Hessian matrices from here: https://huggingface.co/collections/VPTQ-community/hessian-and-invhessian-checkpoints-66fd249a104850d17b23fd8b

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run_vptq.py \
        --model_name Qwen/Qwen2.5-7B-Instruct \
        --output_dir outputs/Qwen2.5-7B-Instruct/ \
        --vector_lens -1 12 \
        --group_num 1 \
        --num_centroids -1 65536 \
        --num_res_centroids -1 4096 \
        --npercent 0 \
        --blocksize 128 \
        --new_eval \
        --seq_len 8192 \
        --kmeans_mode hessian \
        --num_gpus 8 \
        --enable_perm \
        --enable_norm \
        --save_model \
        --save_packed_model \
        --hessian_path Hessians-Qwen2.5-7B-Instruct-6144-8k \
        --inv_hessian_path InvHessians-Qwen2.5-7B-Instruct-6144-8k \
        --ktol 1e-5 --kiter 100
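
As a rough back-of-the-envelope check (not an exact accounting; it ignores the small codebook and normalization overhead), the bit-width of this configuration comes from amortizing the centroid-index bits over the vector length: each 12-dimensional vector costs log2(65536) + log2(4096) = 16 + 12 = 28 index bits, i.e. roughly 2.33 bits per weight, which is where the ~2.3-bit community models come from. A minimal Python sketch of that estimate:

import math

# Rough estimate only: index bits amortized over the vector length,
# ignoring codebook and normalization overhead.
def estimate_bits_per_weight(vector_len, num_centroids, num_res_centroids=None):
    bits = math.log2(num_centroids) / vector_len
    if num_res_centroids:
        bits += math.log2(num_res_centroids) / vector_len
    return bits

print(estimate_bits_per_weight(12, 65536, 4096))  # ~2.33 bits per weight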

@YangWang92 YangWang92 added the question Further information is requested label Nov 21, 2024
@YangWang92 YangWang92 self-assigned this Nov 21, 2024

ForAxel commented Nov 22, 2024

Thanks for your response. I noticed that the command you provided is designed for quantizing the Qwen2.5-7B model. Is it possible to directly apply the parameter settings in this command to the 2.3-bit quantization of the LLaMA3.1-8B model? The command I am using is as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python run_vptq.py \
        --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
        --output_dir outputs/Meta-Llama-3.1-8B-Instruct-2.3bit/ \
        --vector_lens -1 12 \
        --group_num 1 \
        --num_centroids -1 65536 \
        --num_res_centroids -1 4096 \
        --npercent 0 \
        --blocksize 128 \
        --new_eval \
        --seq_len 8192 \
        --kmeans_mode hessian \
        --num_gpus 6 \
        --enable_perm \
        --enable_norm \
        --save_model \
        --save_packed_model \
        --hessian_path Hessians-Llama-31-8B-Instruct-6144-8k \
        --inv_hessian_path InvHessians-Llama-31-8B-Instruct-6144-8k \
        --ktol 1e-5 --kiter 100

Additionally, I would like to better understand the relationship between run_vptq.py parameters (such as vector_lens, num_centroids, num_res_centroids, etc.) and the resulting quantized models, particularly in terms of the bit-width of the quantized models in the VPTQ-community. Will this information be released in the future?

I also noticed in Table 10 of the VPTQ paper's appendix that fine-tuning was applied to the 2.3-bit quantized version of the LLaMA2-13B model. Could you provide more details about this fine-tuning operation? Is it available in the source code?

Thank you for your excellent work! I am looking forward to the release of the detailed quantization tutorial.

YangWang92 (Contributor) commented:

Yes, you can directly replace the model name and Hessian matrix paths to quantize different models. Additionally, here is a quick guide on setting the quantization parameters: https://github.com/microsoft/VPTQ?tab=readme-ov-file#models-from-open-source-community.

The fine-tuning code has not been open-sourced yet. It is based on a simple modification of LlamaFactory, and I will release this part soon. Please stay tuned.

ForAxel commented Nov 29, 2024

I have successfully obtained a 2-bit quantized version of the Meta-Llama-3.1-8B-Instruct model based on the parameters you provided, and the QA accuracy is quite good! However, when I changed the --npercent parameter to 1, I noticed a significant drop in QA accuracy. This result seems inconsistent with the findings presented in the ablation study (Table 10). Could there be a bug in the algorithm branch of the code?

The command I used is as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -u run_vptq.py \
        --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
        --output_dir outputs/Meta-Llama-3.1-8B-Instruct-2.3bit-custom_1pOutlier-woQuant/ \
        --vector_lens -1 12 \
        --group_num 1 \
        --num_centroids -1 65536 \
        --num_res_centroids -1 4096 \
        --npercent 1 \
        --blocksize 128 \
        --new_eval \
        --seq_len 8192 \
        --kmeans_mode hessian \
        --num_gpus 8 \
        --enable_perm \
        --enable_norm \
        --save_packed_model \
        --hessian_path Hessians-Llama-31-8B-Instruct-6144-8k \
        --inv_hessian_path InvHessians-Llama-31-8B-Instruct-6144-8k \
        --ktol 1e-5 --kiter 100 

YangWang92 commented Nov 29, 2024

This configuration looks a bit odd. When we set --npercent 1, it extracts 1% of the weights as outliers and builds a separate lookup table for them. However, with --vector_lens -1 12 the vector length for the outliers is set to -1, and --num_centroids -1 65536 sets the outlier table size to -1, which would likely cause an error in the k-means clustering process. Could it be that the configuration is incorrect? You can set --vector_lens 4 12 and --num_centroids 4096 65536 to quantize that 1% of outliers at 3 bits per weight: log2(4096) / 4 = 12 / 4 = 3.
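
For illustration, a possible adjustment of the Llama command above along these lines; note that the outlier entry for --num_res_centroids is an assumption here (-1 meaning no residual codebook for the 1% outliers), not something confirmed in this thread:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -u run_vptq.py \
        --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
        --output_dir outputs/Meta-Llama-3.1-8B-Instruct-2.3bit-1pOutlier/ \
        --vector_lens 4 12 \
        --group_num 1 \
        --num_centroids 4096 65536 \
        --num_res_centroids -1 4096 \
        --npercent 1 \
        --blocksize 128 \
        --new_eval \
        --seq_len 8192 \
        --kmeans_mode hessian \
        --num_gpus 8 \
        --enable_perm \
        --enable_norm \
        --save_packed_model \
        --hessian_path Hessians-Llama-31-8B-Instruct-6144-8k \
        --inv_hessian_path InvHessians-Llama-31-8B-Instruct-6144-8k \
        --ktol 1e-5 --kiter 100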

ForAxel commented Nov 29, 2024

Thank you for your quick response. I set --vector_lens -1 12 because, in line 226 of ./vptq/quantizer.py, it notes:

if num_centroids == -1:  # Do not quantize, keep original data

I assume this allows for keeping the original outlier weights.
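
(As an aside, the sentinel pattern being described would look roughly like the sketch below; this is an illustrative restatement only, not the actual vptq/quantizer.py code.)

import numpy as np

# Illustrative only -- not the actual vptq implementation.
def maybe_quantize(weight_vectors, num_centroids, centroids=None):
    if num_centroids == -1:
        # Sentinel: do not quantize, keep the original (outlier) weights.
        return weight_vectors
    # Otherwise map each vector to its nearest k-means centroid.
    dists = np.linalg.norm(weight_vectors[:, None, :] - centroids[None, :, :], axis=-1)
    return centroids[dists.argmin(axis=1)]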

Additionally, if the --vector_lens and --num_centroids parameters are modified, should the --num_res_centroids also be reset? Is it possible to set --num_res_centroids 2048 4096?
I would greatly appreciate it if you could provide the detailed parameter configurations for ablation experiments #8, #9, and #10, as outlined in Table 10.

YangWang92 (Contributor) commented:

hmmmmmm, thanks for the reminder! I haven’t checked this part of the code in a long time. Let me take a look.
