Description
I am performing QAT on a complex model. I insert Q/DQ nodes into the ResNet portion that I want to quantize, following the documented placement rules, and after building, TensorRT runs that part in INT8. How can I ensure that the parts without Q/DQ nodes run at optimal performance in non-INT8 precision (FP16 + FP32)? I have noticed that after inserting Q/DQ nodes into one part of the network, the performance of the unquantized parts degrades compared to a pure FP16 build.
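For context, this is roughly how I add the Q/DQ nodes. It is a minimal sketch using NVIDIA's pytorch-quantization toolkit; the torchvision model and the `model.layer1` sub-module are stand-ins for my real network:

```python
import torch
import torchvision
from pytorch_quantization import nn as quant_nn

# Make QuantConv2d export real QuantizeLinear/DequantizeLinear nodes in ONNX.
quant_nn.TensorQuantizer.use_fb_fake_quant = True

def quantize_convs(module):
    # Swap nn.Conv2d for QuantConv2d only inside the sub-module passed in,
    # so Q/DQ nodes appear in the chosen portion and nowhere else.
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Conv2d):
            qconv = quant_nn.QuantConv2d(
                child.in_channels, child.out_channels, child.kernel_size,
                stride=child.stride, padding=child.padding,
                bias=child.bias is not None)
            qconv.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                qconv.bias.data.copy_(child.bias.data)
            setattr(module, name, qconv)
        else:
            quantize_convs(child)

model = torchvision.models.resnet18()  # stand-in for my real network
quantize_convs(model.layer1)           # quantize only one portion of it
```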
As an experiment, I inserted a Q/DQ pair before only a single convolution layer and captured the resulting build.
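Concretely, the experiment inserts one Q/DQ pair directly into the ONNX graph, roughly like this (a sketch with onnx-graphsurgeon; the node name "Conv_42" and the scale value are placeholders for my model):

```python
import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))

# Find the one convolution to quantize; "Conv_42" is a placeholder name.
conv = next(n for n in graph.nodes if n.name == "Conv_42")
inp = conv.inputs[0]

scale = gs.Constant("qdq_scale", np.array(0.05, dtype=np.float32))
zero = gs.Constant("qdq_zero", np.array(0, dtype=np.int8))

q_out = gs.Variable("q_out", dtype=np.int8)
dq_out = gs.Variable("dq_out", dtype=np.float32)
q = gs.Node("QuantizeLinear", inputs=[inp, scale, zero], outputs=[q_out])
dq = gs.Node("DequantizeLinear", inputs=[q_out, scale, zero], outputs=[dq_out])

conv.inputs[0] = dq_out  # reroute the conv input through Q/DQ
graph.nodes.extend([q, dq])

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_qdq.onnx")
```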
For comparison, here is the result of building the same network in pure FP16 mode.
Why does the part within the green box perform differently between the two builds?
Another question: even though the inputs and outputs of the Myelin region are exactly the same in the two exported engines, its execution time differs significantly.
fp16 mode: (timing screenshot)
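For reference, I compared the per-layer timings of the two engines with trtexec profiling (engine file names here are placeholders):

```
trtexec --loadEngine=model_qdq.engine --dumpProfile --separateProfileRun
trtexec --loadEngine=model_fp16.engine --dumpProfile --separateProfileRun
```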
In short, I am unsure how to guarantee that the unquantized parts of my model run optimally in FP16 or FP32.
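For the build itself I enable both INT8 and FP16 so that, as I understand the docs, layers without Q/DQ nodes can fall back to FP16 instead of FP32. A minimal sketch of my build script (file names are placeholders):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model_qdq.onnx", "rb") as f:
    assert parser.parse(f.read())

config = builder.create_builder_config()
# INT8 applies only where Q/DQ nodes dictate it (explicit quantization);
# FP16 is enabled so everything else can pick FP16 kernels instead of FP32.
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)

engine = builder.build_serialized_network(network, config)
with open("model_qdq.engine", "wb") as f:
    f.write(engine)
```

The trtexec equivalent would be `trtexec --onnx=model_qdq.onnx --int8 --fp16`.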
Environment
TensorRT Version: 8.5.2
NVIDIA GPU: Orin / RTX 3090
NVIDIA Driver Version:
CUDA Version: 11.4
CUDNN Version:
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link:
Steps To Reproduce
Commands or scripts:
Have you tried the latest release?:
Can this model run on other frameworks? For example, run the ONNX model with ONNXRuntime (`polygraphy run <model.onnx> --onnxrt`):