How to Generate a 2-bit Quantized Meta-Llama-3.1-8B-Instruct Model? #126

Open
ForAxel opened this issue Nov 21, 2024 · 7 comments
Assignees: YangWang92
Labels: question (Further information is requested)

ForAxel commented Nov 21, 2024

I found a similar closed issue related to this topic. Following your reply in that issue, I successfully configured the vptq-algo environment based on the tutorial in the algorithm branch.
The "Quantization on Meta-Llama-3.1-8B-Instruct" section provides an example of using VPTQ to generate a 3-bit quantized Meta-Llama-3.1-8B-Instruct model. However, if I want to generate the 2.3-bit quantized Meta-Llama-3.1-8B-Instruct model provided in the VPTQ-community collection, how should I configure the parameters for run_vptq.py? Specifically, which arguments should I adjust to achieve 2.3-bit quantization?
Looking forward to your reply.

YangWang92 (Contributor) commented Nov 21, 2024

Try this one, and you can download the Hessian matrices from here: https://huggingface.co/collections/VPTQ-community/hessian-and-invhessian-checkpoints-66fd249a104850d17b23fd8b

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run_vptq.py \
        --model_name Qwen/Qwen2.5-7B-Instruct \
        --output_dir outputs/Qwen2.5-7B-Instruct/ \
        --vector_lens -1 12 \
        --group_num 1 \
        --num_centroids -1 65536 \
        --num_res_centroids -1 4096 \
        --npercent 0 \
        --blocksize 128 \
        --new_eval \
        --seq_len 8192 \
        --kmeans_mode hessian \
        --num_gpus 8 \
        --enable_perm \
        --enable_norm \
        --save_model \
        --save_packed_model \
        --hessian_path Hessians-Qwen2.5-7B-Instruct-6144-8k \
        --inv_hessian_path InvHessians-Qwen2.5-7B-Instruct-6144-8k \
        --ktol 1e-5 --kiter 100
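
As a rough back-of-the-envelope check (not an exact accounting; it ignores the small codebook and normalization overhead), the bit-width of this configuration comes from amortizing the centroid-index bits over the vector length: each 12-dimensional vector costs log2(65536) + log2(4096) = 16 + 12 = 28 index bits, i.e. roughly 2.33 bits per weight, which is where the ~2.3-bit community models come from. A minimal Python sketch of that estimate:

import math

# Rough estimate only: index bits amortized over the vector length,
# ignoring codebook and normalization overhead.
def estimate_bits_per_weight(vector_len, num_centroids, num_res_centroids=None):
    bits = math.log2(num_centroids) / vector_len
    if num_res_centroids:
        bits += math.log2(num_res_centroids) / vector_len
    return bits

print(estimate_bits_per_weight(12, 65536, 4096))  # ~2.33 bits per weight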

@YangWang92 YangWang92 added the question Further information is requested label Nov 21, 2024
@YangWang92 YangWang92 self-assigned this Nov 21, 2024

ForAxel commented Nov 22, 2024

Thanks for your response. I noticed that the command you provided is designed for quantizing the Qwen2.5-7B model. Is it possible to directly apply the parameter settings in this command to the 2.3-bit quantization of the LLaMA3.1-8B model? The command I am using is as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python run_vptq.py \
        --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
        --output_dir outputs/Meta-Llama-3.1-8B-Instruct-2.3bit/ \
        --vector_lens -1 12 \
        --group_num 1 \
        --num_centroids -1 65536 \
        --num_res_centroids -1 4096 \
        --npercent 0 \
        --blocksize 128 \
        --new_eval \
        --seq_len 8192 \
        --kmeans_mode hessian \
        --num_gpus 6 \
        --enable_perm \
        --enable_norm \
        --save_model \
        --save_packed_model \
        --hessian_path Hessians-Llama-31-8B-Instruct-6144-8k \
        --inv_hessian_path InvHessians-Llama-31-8B-Instruct-6144-8k \
        --ktol 1e-5 --kiter 100

Additionally, I would like to better understand the relationship between run_vptq.py parameters (such as vector_lens, num_centroids, num_res_centroids, etc.) and the resulting quantized models, particularly in terms of the bit-width of the quantized models in the VPTQ-community. Will this information be released in the future?

I also noticed in Table 10 of the VPTQ paper's appendix that fine-tuning was applied to the 2.3-bit quantized version of the LLaMA2-13B model. Could you provide more details about this fine-tuning operation? Is it available in the source code?

Thank you for your excellent work! I am looking forward to the release of the detailed quantization tutorial.

YangWang92 (Contributor) commented:

Yes, you can directly replace the model name and Hessian matrix paths to quantize different models. Additionally, here is a quick guide on setting the quantization parameters: https://github.com/microsoft/VPTQ?tab=readme-ov-file#models-from-open-source-community.

The fine-tuning code has not been open-sourced yet. It is based on a simple modification of LlamaFactory, and I will release this part soon. Please stay tuned.

ForAxel commented Nov 29, 2024

I have successfully obtained a 2-bit quantized version of the Meta-Llama-3.1-8B-Instruct model based on the parameters you provided, and the QA accuracy is quite good! However, when I changed the --npercent parameter to 1, I noticed a significant drop in QA accuracy. This result seems inconsistent with the findings presented in the ablation study (Table 10). Could there be a bug in the algorithm branch of the code?

The command I used is as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -u run_vptq.py \
        --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
        --output_dir outputs/Meta-Llama-3.1-8B-Instruct-2.3bit-custom_1pOutlier-woQuant/ \
        --vector_lens -1 12 \
        --group_num 1 \
        --num_centroids -1 65536 \
        --num_res_centroids -1 4096 \
        --npercent 1 \
        --blocksize 128 \
        --new_eval \
        --seq_len 8192 \
        --kmeans_mode hessian \
        --num_gpus 8 \
        --enable_perm \
        --enable_norm \
        --save_packed_model \
        --hessian_path Hessians-Llama-31-8B-Instruct-6144-8k \
        --inv_hessian_path InvHessians-Llama-31-8B-Instruct-6144-8k \
        --ktol 1e-5 --kiter 100 

YangWang92 commented Nov 29, 2024

This configuration looks a bit odd. When we set --npercent 1, it extracts 1% of the weights as outliers and builds a separate lookup table for them. However, with --vector_lens -1 12 the vector length for the outliers is set to -1, and --num_centroids -1 65536 sets the outlier table size to -1, which would likely cause an error in the k-means clustering process. Could it be that the configuration is incorrect? You can set --vector_lens 4 12 and --num_centroids 4096 65536 to quantize that 1% of outliers at 3 bits per weight: log2(4096) / 4 = 12 / 4 = 3.
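
For illustration, a possible adjustment of the Llama command above along these lines; note that the outlier entry for --num_res_centroids is an assumption here (-1 meaning no residual codebook for the 1% outliers), not something confirmed in this thread:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -u run_vptq.py \
        --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
        --output_dir outputs/Meta-Llama-3.1-8B-Instruct-2.3bit-1pOutlier/ \
        --vector_lens 4 12 \
        --group_num 1 \
        --num_centroids 4096 65536 \
        --num_res_centroids -1 4096 \
        --npercent 1 \
        --blocksize 128 \
        --new_eval \
        --seq_len 8192 \
        --kmeans_mode hessian \
        --num_gpus 8 \
        --enable_perm \
        --enable_norm \
        --save_packed_model \
        --hessian_path Hessians-Llama-31-8B-Instruct-6144-8k \
        --inv_hessian_path InvHessians-Llama-31-8B-Instruct-6144-8k \
        --ktol 1e-5 --kiter 100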

ForAxel commented Nov 29, 2024

Thank you for your quick response. I set --vector_lens -1 12 because, in line 226 of ./vptq/quantizer.py, it notes:

if num_centroids == -1:  # Do not quantize, keep original data

I assume this allows for keeping the original outlier weights.
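
(As an aside, the sentinel pattern being described would look roughly like the sketch below; this is an illustrative restatement only, not the actual vptq/quantizer.py code.)

import numpy as np

# Illustrative only -- not the actual vptq implementation.
def maybe_quantize(weight_vectors, num_centroids, centroids=None):
    if num_centroids == -1:
        # Sentinel: do not quantize, keep the original (outlier) weights.
        return weight_vectors
    # Otherwise map each vector to its nearest k-means centroid.
    dists = np.linalg.norm(weight_vectors[:, None, :] - centroids[None, :, :], axis=-1)
    return centroids[dists.argmin(axis=1)]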

Additionally, if the --vector_lens and --num_centroids parameters are modified, should the --num_res_centroids also be reset? Is it possible to set --num_res_centroids 2048 4096?
I would greatly appreciate it if you could provide the detailed parameter configurations for ablation experiments #8, #9, and #10, as outlined in Table 10.

YangWang92 (Contributor) commented:

hmmmmmm, thanks for the reminder! I haven’t checked this part of the code in a long time. Let me take a look.
