INT8 Quantization of dinov2 TensorRT Model is Not Faster than FP16 Quantization #4273

mr-lz · 2024-12-06T10:34:38Z

Hello,

I used pytorch-quantization for post-training INT8 quantization of the dinov2-base model and then converted it to a TensorRT engine. However, the INT8 engine is slightly slower than the FP16 engine (I saw the same result on A100, V100, and A10). Is this behavior normal?

Thank you.

lix19937 · 2024-12-11T00:44:34Z

Maybe many ops are not actually running in INT8, and the mix of precisions adds extra reformat layers.
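One way to check this hypothesis is to count how many layers in the built engine are reformat layers and which datatypes the layer outputs use. The sketch below is only an illustration: it works on a small hand-written dictionary that loosely imitates the per-layer JSON TensorRT can export with detailed profiling verbosity. The field names (`Layers`, `LayerType`, `Outputs`, `Format/Datatype`), the layer names, and the layer-type strings are assumptions here and should be checked against the output of your TensorRT version.

```python
from collections import Counter


def summarize_layers(layer_info):
    """Count layers by type and layer outputs by datatype string.

    `layer_info` is assumed to be a dict with a "Layers" list, where each
    layer has a "LayerType" and a list of "Outputs", each output carrying
    a "Format/Datatype" string. This layout is an assumption.
    """
    type_counts = Counter()
    dtype_counts = Counter()
    for layer in layer_info.get("Layers", []):
        type_counts[layer.get("LayerType", "unknown")] += 1
        for out in layer.get("Outputs", []):
            dtype_counts[out.get("Format/Datatype", "unknown")] += 1
    return type_counts, dtype_counts


# Hypothetical, hand-written sample; not taken from a real dinov2 engine.
sample = {
    "Layers": [
        {"Name": "qkv_proj", "LayerType": "Gemm",
         "Outputs": [{"Format/Datatype": "Int8"}]},
        {"Name": "reformat_0", "LayerType": "Reformat",
         "Outputs": [{"Format/Datatype": "FP16"}]},
        {"Name": "softmax", "LayerType": "Softmax",
         "Outputs": [{"Format/Datatype": "FP16"}]},
        {"Name": "reformat_1", "LayerType": "Reformat",
         "Outputs": [{"Format/Datatype": "Int8"}]},
        {"Name": "mlp_fc1", "LayerType": "Gemm",
         "Outputs": [{"Format/Datatype": "Int8"}]},
    ]
}

type_counts, dtype_counts = summarize_layers(sample)
reformat_count = type_counts["Reformat"]
total_layers = sum(type_counts.values())

print(f"reformat layers: {reformat_count} of {total_layers}")
print(f"output datatypes: {dict(dtype_counts)}")
```

If the INT8 engine shows noticeably more reformat layers than the FP16 engine, or many outputs are still FP16/FP32, that would be consistent with the explanation above, though it does not prove it on its own.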

asfiyab-nvidia · 2024-12-16T22:41:04Z

cc @akhilg-nv for additional comment

asfiyab-nvidia added the "quantization" label (Issues related to Quantization) · Dec 16, 2024

asfiyab-nvidia assigned akhilg-nv · Dec 16, 2024

asfiyab-nvidia added the "triaged" label (Issue has been triaged by maintainers) · Dec 16, 2024
