
refit_cuda_engine method is too slow #3332

Open
davidli313 opened this issue Sep 19, 2023 · 5 comments
Labels
triaged Issue has been triaged by maintainers

Comments


davidli313 commented Sep 19, 2023

Description

I used the python tensorrt refitter class to load the LoRA weights of stable diffusion unet, but the refitter.refit_cuda_engine method is so slow, usually taking 4~5 seconds. Is there any way to improve the performance of refit_cuda_engine?
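
The basic refit flow being described looks roughly like the sketch below. This is a minimal illustration, not the poster's actual code: `engine`, `updated_weights`, and `logger` are placeholders, and `get_all_weights`/`set_named_weights` belong to the named-weights refit API available in recent TensorRT releases.

```python
"""Minimal sketch of refitting a TensorRT engine with updated (e.g. LoRA-merged)
weights. `engine`, `updated_weights`, and `logger` are placeholders, not code
from this issue."""
import numpy as np

try:
    import tensorrt as trt
except ImportError:  # lets the pure-Python helper below run without TensorRT
    trt = None


def select_refit_targets(all_weight_names, updated_weights):
    """Keep only the weight names the refitter actually needs to touch."""
    return [n for n in all_weight_names if n in updated_weights]


def refit_engine(engine, updated_weights, logger):
    """Update the named weights in place, then refit the engine."""
    refitter = trt.Refitter(engine, logger)
    for name in select_refit_targets(refitter.get_all_weights(), updated_weights):
        refitter.set_named_weights(name, trt.Weights(updated_weights[name]))
    # refit_cuda_engine() is the call reported to take 4~5 seconds here.
    assert refitter.refit_cuda_engine(), "refit failed"
```

Skipping unchanged weights via `select_refit_targets` keeps the number of `set_named_weights` calls small, but the bulk of the reported time is inside `refit_cuda_engine()` itself.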

Environment

TensorRT Version: 8.6.1
NVIDIA GPU: GeForce RTX 4090
NVIDIA Driver Version: 525.89.02
CUDA Version: 12.0
CUDNN Version: 8.9.2
Operating System: Ubuntu 20.04.1
Python Version (if applicable): 3.9.16
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.12.1
Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

@davidli313 davidli313 changed the title XXX failure of TensorRT X.Y when running XXX on GPU XXX refit_cuda_engine method is too slow Sep 19, 2023

BowenFu commented Sep 19, 2023

@davidli313 could you try TensorRT 9.0? Refitting performance has been improved by 1.8–15x in TensorRT 9.0.

Also can you give more details on your use case? What is the inference time? (Where are the weights from? Is there a training stage?) What is the percentage of refitting time in the entire process?

@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Sep 20, 2023
@zhangvia

@davidli313 could you try TensorRT 9.0? Refitting performance has been improved by 1.8–15x in TensorRT 9.0.

Also can you give more details on your use case? What is the inference time? (Where are the weights from? Is there a training stage?) What is the percentage of refitting time in the entire process?

Actually, an engine built with the refit feature has poor inference performance, especially with the dynamic shape feature, and the inference time is not stable. The refittable UNet engine takes 500 ms to 1000 ms per inference, which is slower than PyTorch.


sunhs commented Oct 8, 2023

@davidli313 @zhangvia Hi, can you point me to some samples of loading LoRA with refit?

@FuyuanChen

I met the same problem. With Nsight Systems I found some long-running cudaMalloc and cudaFree calls. Can anyone give me sample C++ code to refit a UNet with LoRA?


BowenFu commented Apr 16, 2024

I met the same problem. With Nsight Systems I found some long-running cudaMalloc and cudaFree calls. Can anyone give me sample C++ code to refit a UNet with LoRA?

Please refer to "https://github.com/NVIDIA/TensorRT/blob/release/9.2/demo/Diffusion/utilities.py" and pass GPU weights to the refitter instead of CPU weights to avoid internal memory allocation.
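
A rough sketch of that suggestion is below. The pointer-based `trt.Weights(dtype, ptr, count)` constructor is an assumption based on newer TensorRT releases; the linked utilities.py shows the exact calls for the 9.2 demo. `refitter` and `cuda_tensor` are placeholders.

```python
"""Sketch of passing GPU-resident weights to the refitter so it does not
allocate and copy on the host internally. The pointer-based trt.Weights
constructor is an assumption from recent TensorRT releases; verify against
the utilities.py linked above for your version."""
try:
    import tensorrt as trt
except ImportError:  # lets the pure-Python helper below run without TensorRT
    trt = None


def weight_count(shape):
    """Number of scalar elements TensorRT expects for a weight tensor."""
    n = 1
    for d in shape:
        n *= d
    return n


def set_gpu_weight(refitter, name, cuda_tensor):
    """Wrap a CUDA torch tensor as trt.Weights without copying it to the host.

    `cuda_tensor` is assumed to be a contiguous float32 torch tensor already
    on the GPU; keeping it alive until refit_cuda_engine() returns is the
    caller's responsibility.
    """
    weights = trt.Weights(trt.DataType.FLOAT,
                          cuda_tensor.data_ptr(),
                          weight_count(cuda_tensor.shape))
    return refitter.set_named_weights(name, weights)
```

Because the refitter only receives a device pointer here, no cudaMalloc/cudaMemcpy is needed to stage the weights, which is exactly the cost the Nsight Systems trace above showed.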

6 participants