
refit_cuda_engine method is too slow #3332

Open
davidli313 opened this issue Sep 19, 2023 · 5 comments
Labels
triaged Issue has been triaged by maintainers

Comments


davidli313 commented Sep 19, 2023

Description

I used the python tensorrt refitter class to load the LoRA weights of stable diffusion unet, but the refitter.refit_cuda_engine method is so slow, usually taking 4~5 seconds. Is there any way to improve the performance of refit_cuda_engine?
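
The basic refit flow being described looks roughly like the sketch below. This is a minimal illustration, not the poster's actual code: `engine`, `updated_weights`, and `logger` are placeholders, and `get_all_weights`/`set_named_weights` belong to the named-weights refit API available in recent TensorRT releases.

```python
"""Minimal sketch of refitting a TensorRT engine with updated (e.g. LoRA-merged)
weights. `engine`, `updated_weights`, and `logger` are placeholders, not code
from this issue."""
import numpy as np

try:
    import tensorrt as trt
except ImportError:  # lets the pure-Python helper below run without TensorRT
    trt = None


def select_refit_targets(all_weight_names, updated_weights):
    """Keep only the weight names the refitter actually needs to touch."""
    return [n for n in all_weight_names if n in updated_weights]


def refit_engine(engine, updated_weights, logger):
    """Update the named weights in place, then refit the engine."""
    refitter = trt.Refitter(engine, logger)
    for name in select_refit_targets(refitter.get_all_weights(), updated_weights):
        refitter.set_named_weights(name, trt.Weights(updated_weights[name]))
    # refit_cuda_engine() is the call reported to take 4~5 seconds here.
    assert refitter.refit_cuda_engine(), "refit failed"
```

Skipping unchanged weights via `select_refit_targets` keeps the number of `set_named_weights` calls small, but the bulk of the reported time is inside `refit_cuda_engine()` itself.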

Environment

TensorRT Version: 8.6.1
NVIDIA GPU: GeForce RTX 4090
NVIDIA Driver Version: 525.89.02
CUDA Version: 12.0
CUDNN Version: 8.9.2
Operating System: Ubuntu 20.04.1
Python Version (if applicable): 3.9.16
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.12.1
Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

@davidli313 davidli313 changed the title XXX failure of TensorRT X.Y when running XXX on GPU XXX refit_cuda_engine method is too slow Sep 19, 2023

BowenFu commented Sep 19, 2023

@davidli313 could you try TensorRT 9.0? Refitting performance has been improved by 1.8–15x in TensorRT 9.0.

Also can you give more details on your use case? What is the inference time? (Where are the weights from? Is there a training stage?) What is the percentage of refitting time in the entire process?

@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Sep 20, 2023
@zhangvia

@davidli313 could you try TensorRT 9.0? Refitting performance has been improved by 1.8–15x in TensorRT 9.0.

Also can you give more details on your use case? What is the inference time? (Where are the weights from? Is there a training stage?) What is the percentage of refitting time in the entire process?

Actually, an engine built with the refit feature has poor inference performance, especially with the dynamic shape feature, and the inference time is not stable. The refittable UNet engine takes 500 ms to 1000 ms per inference, which is slower than PyTorch.


sunhs commented Oct 8, 2023

@davidli313 @zhangvia Hi, can you point me to some samples of loading LoRA with refit?

@FuyuanChen

I met the same problem. With Nsight Systems I found some long-running cudaMalloc and cudaFree calls. Can anyone give me sample C++ code to refit a UNet with LoRA?


BowenFu commented Apr 16, 2024

I met the same problem. With Nsight Systems I found some long-running cudaMalloc and cudaFree calls. Can anyone give me sample C++ code to refit a UNet with LoRA?

Please refer to "https://github.com/NVIDIA/TensorRT/blob/release/9.2/demo/Diffusion/utilities.py" and pass GPU weights to the refitter instead of CPU weights to avoid internal memory allocation.
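
A rough sketch of that suggestion is below. The pointer-based `trt.Weights(dtype, ptr, count)` constructor is an assumption based on newer TensorRT releases; the linked utilities.py shows the exact calls for the 9.2 demo. `refitter` and `cuda_tensor` are placeholders.

```python
"""Sketch of passing GPU-resident weights to the refitter so it does not
allocate and copy on the host internally. The pointer-based trt.Weights
constructor is an assumption from recent TensorRT releases; verify against
the utilities.py linked above for your version."""
try:
    import tensorrt as trt
except ImportError:  # lets the pure-Python helper below run without TensorRT
    trt = None


def weight_count(shape):
    """Number of scalar elements TensorRT expects for a weight tensor."""
    n = 1
    for d in shape:
        n *= d
    return n


def set_gpu_weight(refitter, name, cuda_tensor):
    """Wrap a CUDA torch tensor as trt.Weights without copying it to the host.

    `cuda_tensor` is assumed to be a contiguous float32 torch tensor already
    on the GPU; keeping it alive until refit_cuda_engine() returns is the
    caller's responsibility.
    """
    weights = trt.Weights(trt.DataType.FLOAT,
                          cuda_tensor.data_ptr(),
                          weight_count(cuda_tensor.shape))
    return refitter.set_named_weights(name, weights)
```

Because the refitter only receives a device pointer here, no cudaMalloc/cudaMemcpy is needed to stage the weights, which is exactly the cost the Nsight Systems trace above showed.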

6 participants