Description
Model: swin_base_patch4_window7_224
With the batch size set to 40, inference takes ~70 ms in TensorRT using the executeV2 API, while the same batch takes only ~30 ms in PyTorch.
Environment
TensorRT Version: 8.6
NVIDIA GPU: 3080, 3090
NVIDIA Driver Version: 535.183.01
CUDA Version: 12.2
Operating System: Ubuntu 20.04
Steps To Reproduce
C++ timing code as follows:

    auto begin = std::chrono::high_resolution_clock::now();
    context->executeV2(buffers);
    std::cout << "dataSize: " << dataSize
              << " infer time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(
                     std::chrono::high_resolution_clock::now() - begin)
                     .count()
              << "ms" << std::endl;

and the trtexec log:
trtexec --onnx=/home/xiaoying/code/Pegasus/red/weights/swintransformer/swin_base_patch4_window7_224_opset_17.onnx --minShapes=input:1x3x224x224 --optShapes=input:40x3x224x224 --maxShapes=input:40x3x224x224 --saveEngine=swin_transformer.engine --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=/home/xiaoying/code/Pegasus/red/weights/swintransformer/swin_base_patch4_window7_224_opset_17.onnx --minShapes=input:1x3x224x224 --optShapes=input:40x3x224x224 --maxShapes=input:40x3x224x224 --saveEngine=swin_transformer.engine --fp16
[12/20/2024-17:49:18] [I] === Model Options ===
[12/20/2024-17:49:18] [I] Format: ONNX
[12/20/2024-17:49:18] [I] Model: /home/xiaoying/code/Pegasus/red/weights/swintransformer/swin_base_patch4_window7_224_opset_17.onnx
[12/20/2024-17:49:18] [I] Output:
[12/20/2024-17:49:18] [I] === Build Options ===
[12/20/2024-17:49:18] [I] Max batch: explicit batch
[12/20/2024-17:49:18] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[12/20/2024-17:49:18] [I] minTiming: 1
[12/20/2024-17:49:18] [I] avgTiming: 8
[12/20/2024-17:49:18] [I] Precision: FP32+FP16
[12/20/2024-17:49:18] [I] LayerPrecisions:
[12/20/2024-17:49:18] [I] Layer Device Types:
[12/20/2024-17:49:18] [I] Calibration:
[12/20/2024-17:49:18] [I] Refit: Disabled
[12/20/2024-17:49:18] [I] Version Compatible: Disabled
[12/20/2024-17:49:18] [I] TensorRT runtime: full
[12/20/2024-17:49:18] [I] Lean DLL Path:
[12/20/2024-17:49:18] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[12/20/2024-17:49:18] [I] Exclude Lean Runtime: Disabled
[12/20/2024-17:49:18] [I] Sparsity: Disabled
[12/20/2024-17:49:18] [I] Safe mode: Disabled
[12/20/2024-17:49:18] [I] Build DLA standalone loadable: Disabled
[12/20/2024-17:49:18] [I] Allow GPU fallback for DLA: Disabled
[12/20/2024-17:49:18] [I] DirectIO mode: Disabled
[12/20/2024-17:49:18] [I] Restricted mode: Disabled
[12/20/2024-17:49:18] [I] Skip inference: Disabled
[12/20/2024-17:49:18] [I] Save engine: swin_transformer.engine
[12/20/2024-17:49:18] [I] Load engine:
[12/20/2024-17:49:18] [I] Profiling verbosity: 0
[12/20/2024-17:49:18] [I] Tactic sources: Using default tactic sources
[12/20/2024-17:49:18] [I] timingCacheMode: local
[12/20/2024-17:49:18] [I] timingCacheFile:
[12/20/2024-17:49:18] [I] Heuristic: Disabled
[12/20/2024-17:49:18] [I] Preview Features: Use default preview flags.
[12/20/2024-17:49:18] [I] MaxAuxStreams: -1
[12/20/2024-17:49:18] [I] BuilderOptimizationLevel: -1
[12/20/2024-17:49:18] [I] Input(s)s format: fp32:CHW
[12/20/2024-17:49:18] [I] Output(s)s format: fp32:CHW
[12/20/2024-17:49:18] [I] Input build shape: input=1x3x224x224+40x3x224x224+40x3x224x224
[12/20/2024-17:49:18] [I] Input calibration shapes: model
[12/20/2024-17:49:18] [I] === System Options ===
[12/20/2024-17:49:18] [I] Device: 0
[12/20/2024-17:49:18] [I] DLACore:
[12/20/2024-17:49:18] [I] Plugins:
[12/20/2024-17:49:18] [I] setPluginsToSerialize:
[12/20/2024-17:49:18] [I] dynamicPlugins:
[12/20/2024-17:49:18] [I] ignoreParsedPluginLibs: 0
[12/20/2024-17:49:18] [I]
[12/20/2024-17:49:18] [I] === Inference Options ===
[12/20/2024-17:49:18] [I] Batch: Explicit
[12/20/2024-17:49:18] [I] Input inference shape: input=40x3x224x224
[12/20/2024-17:49:18] [I] Iterations: 10
[12/20/2024-17:49:18] [I] Duration: 3s (+ 200ms warm up)
[12/20/2024-17:49:18] [I] Sleep time: 0ms
[12/20/2024-17:49:18] [I] Idle time: 0ms
[12/20/2024-17:49:18] [I] Inference Streams: 1
[12/20/2024-17:49:18] [I] ExposeDMA: Disabled
[12/20/2024-17:49:18] [I] Data transfers: Enabled
[12/20/2024-17:49:18] [I] Spin-wait: Disabled
[12/20/2024-17:49:18] [I] Multithreading: Disabled
[12/20/2024-17:49:18] [I] CUDA Graph: Disabled
[12/20/2024-17:49:18] [I] Separate profiling: Disabled
[12/20/2024-17:49:18] [I] Time Deserialize: Disabled
[12/20/2024-17:49:18] [I] Time Refit: Disabled
[12/20/2024-17:49:18] [I] NVTX verbosity: 0
[12/20/2024-17:49:18] [I] Persistent Cache Ratio: 0
[12/20/2024-17:49:18] [I] Inputs:
[12/20/2024-17:49:18] [I] === Reporting Options ===
[12/20/2024-17:49:18] [I] Verbose: Disabled
[12/20/2024-17:49:18] [I] Averages: 10 inferences
[12/20/2024-17:49:18] [I] Percentiles: 90,95,99
[12/20/2024-17:49:18] [I] Dump refittable layers:Disabled
[12/20/2024-17:49:18] [I] Dump output: Disabled
[12/20/2024-17:49:18] [I] Profile: Disabled
[12/20/2024-17:49:18] [I] Export timing to JSON file:
[12/20/2024-17:49:18] [I] Export output to JSON file:
[12/20/2024-17:49:18] [I] Export profile to JSON file:
[12/20/2024-17:49:18] [I]
[12/20/2024-17:49:18] [I] === Device Information ===
[12/20/2024-17:49:18] [I] Selected Device: NVIDIA GeForce RTX 3080
[12/20/2024-17:49:18] [I] Compute Capability: 8.6
[12/20/2024-17:49:18] [I] SMs: 68
[12/20/2024-17:49:18] [I] Device Global Memory: 9987 MiB
[12/20/2024-17:49:18] [I] Shared Memory per SM: 100 KiB
[12/20/2024-17:49:18] [I] Memory Bus Width: 320 bits (ECC disabled)
[12/20/2024-17:49:18] [I] Application Compute Clock Rate: 1.71 GHz
[12/20/2024-17:49:18] [I] Application Memory Clock Rate: 9.501 GHz
[12/20/2024-17:49:18] [I]
[12/20/2024-17:49:18] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[12/20/2024-17:49:18] [I]
[12/20/2024-17:49:18] [I] TensorRT version: 8.6.1
[12/20/2024-17:49:18] [I] Loading standard plugins
[12/20/2024-17:49:18] [I] [TRT] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 18, GPU 978 (MiB)
[12/20/2024-17:49:26] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1449, GPU +266, now: CPU 1544, GPU 1244 (MiB)
[12/20/2024-17:49:26] [I] Start parsing network model.
[12/20/2024-17:49:26] [I] [TRT] ----------------------------------------------------------------
[12/20/2024-17:49:26] [I] [TRT] Input filename: /home/xiaoying/code/Pegasus/red/weights/swintransformer/swin_base_patch4_window7_224_opset_17.onnx
[12/20/2024-17:49:26] [I] [TRT] ONNX IR version: 0.0.8
[12/20/2024-17:49:26] [I] [TRT] Opset version: 17
[12/20/2024-17:49:26] [I] [TRT] Producer name: pytorch
[12/20/2024-17:49:26] [I] [TRT] Producer version: 2.4.0
[12/20/2024-17:49:26] [I] [TRT] Domain:
[12/20/2024-17:49:26] [I] [TRT] Model version: 0
[12/20/2024-17:49:26] [I] [TRT] Doc string:
[12/20/2024-17:49:26] [I] [TRT] ----------------------------------------------------------------
[12/20/2024-17:49:27] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[12/20/2024-17:49:27] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[12/20/2024-17:49:27] [I] Finished parsing network model. Parse time: 0.844845
[12/20/2024-17:49:27] [I] [TRT] Graph optimization time: 0.115686 seconds.
[12/20/2024-17:49:27] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
mha_fusion.cpp:355: DCHECK(always(sym_eql(b1, calculateBatchesSym(dims_bmm1_output)))) failed.
mha_fusion.cpp:355: DCHECK(always(sym_eql(b1, calculateBatchesSym(dims_bmm1_output)))) failed.
[12/20/2024-17:53:48] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[12/20/2024-17:53:49] [I] [TRT] Total Host Persistent Memory: 3536
[12/20/2024-17:53:49] [I] [TRT] Total Device Persistent Memory: 18944
[12/20/2024-17:53:49] [I] [TRT] Total Scratch Memory: 963880960
[12/20/2024-17:53:49] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 167 MiB, GPU 1358 MiB
[12/20/2024-17:53:49] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 2 steps to complete.
[12/20/2024-17:53:49] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.008904ms to assign 2 blocks to 2 nodes requiring 1028106240 bytes.
[12/20/2024-17:53:49] [I] [TRT] Total Activation Memory: 1028106240
[12/20/2024-17:53:49] [W] [TRT] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[12/20/2024-17:53:49] [W] [TRT] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[12/20/2024-17:53:49] [W] [TRT] Check verbose logs for the list of affected weights.
[12/20/2024-17:53:49] [W] [TRT] - 140 weights are affected by this issue: Detected subnormal FP16 values.
[12/20/2024-17:53:49] [W] [TRT] - 53 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[12/20/2024-17:53:49] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +335, now: CPU 0, GPU 335 (MiB)
[12/20/2024-17:53:50] [I] Engine built in 271.633 sec.
[12/20/2024-17:53:50] [I] [TRT] Loaded engine size: 339 MiB
[12/20/2024-17:53:51] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +334, now: CPU 0, GPU 334 (MiB)
[12/20/2024-17:53:51] [I] Engine deserialized in 0.314119 sec.
[12/20/2024-17:53:51] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +981, now: CPU 0, GPU 1315 (MiB)
[12/20/2024-17:53:51] [I] Setting persistentCacheLimit to 0 bytes.
[12/20/2024-17:53:51] [I] Using random values for input input
[12/20/2024-17:53:51] [I] Input binding for input with dimensions 40x3x224x224 is created.
[12/20/2024-17:53:51] [I] Output binding for output with dimensions 40x12 is created.
[12/20/2024-17:53:51] [I] Starting inference
[12/20/2024-17:53:54] [I] Warmup completed 3 queries over 200 ms
[12/20/2024-17:53:54] [I] Timing trace has 41 queries over 3.26256 s
[12/20/2024-17:53:54] [I]
[12/20/2024-17:53:54] [I] === Trace details ===
[12/20/2024-17:53:54] [I] Trace averages of 10 runs:
[12/20/2024-17:53:54] [I] Average on 10 runs - GPU latency: 76.6097 ms - Host latency: 78.6104 ms (enqueue 2.95773 ms)
[12/20/2024-17:53:54] [I] Average on 10 runs - GPU latency: 76.2822 ms - Host latency: 78.2786 ms (enqueue 2.71705 ms)
[12/20/2024-17:53:54] [I] Average on 10 runs - GPU latency: 77.2628 ms - Host latency: 79.2623 ms (enqueue 2.68743 ms)
[12/20/2024-17:53:54] [I] Average on 10 runs - GPU latency: 80.7294 ms - Host latency: 82.7305 ms (enqueue 2.8291 ms)
[12/20/2024-17:53:54] [I]
[12/20/2024-17:53:54] [I] === Performance summary ===
[12/20/2024-17:53:54] [I] Throughput: 12.5668 qps
[12/20/2024-17:53:54] [I] Latency: min = 77.4142 ms, max = 87.4338 ms, mean = 79.7159 ms, median = 77.8967 ms, percentile(90%) = 83.7148 ms, percentile(95%) = 84.0322 ms, percentile(99%) = 87.4338 ms
[12/20/2024-17:53:54] [I] Enqueue Time: min = 2.57581 ms, max = 4.6181 ms, mean = 2.79792 ms, median = 2.69324 ms, percentile(90%) = 3.04767 ms, percentile(95%) = 3.07214 ms, percentile(99%) = 4.6181 ms
[12/20/2024-17:53:54] [I] H2D Latency: min = 1.98132 ms, max = 2.03809 ms, mean = 1.99573 ms, median = 1.99512 ms, percentile(90%) = 2.00317 ms, percentile(95%) = 2.00336 ms, percentile(99%) = 2.03809 ms
[12/20/2024-17:53:54] [I] GPU Compute Time: min = 75.4247 ms, max = 85.4312 ms, mean = 77.7155 ms, median = 75.8978 ms, percentile(90%) = 81.708 ms, percentile(95%) = 82.0315 ms, percentile(99%) = 85.4312 ms
[12/20/2024-17:53:54] [I] D2H Latency: min = 0.00292969 ms, max = 0.0126953 ms, mean = 0.00470715 ms, median = 0.00427246 ms, percentile(90%) = 0.00585938 ms, percentile(95%) = 0.0065918 ms, percentile(99%) = 0.0126953 ms
[12/20/2024-17:53:54] [I] Total Host Walltime: 3.26256 s
[12/20/2024-17:53:54] [I] Total GPU Compute Time: 3.18633 s
[12/20/2024-17:53:54] [W] * GPU compute time is unstable, with coefficient of variance = 3.46826%.
[12/20/2024-17:53:54] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/20/2024-17:53:54] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/20/2024-17:53:54] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=/home/xiaoying/code/Pegasus/red/weights/swintransformer/swin_base_patch4_window7_224_opset_17.onnx --minShapes=input:1x3x224x224 --optShapes=input:40x3x224x224 --maxShapes=input:40x3x224x224 --saveEngine=swin_transformer.engine --fp16
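Note also the two mha_fusion.cpp DCHECK failures during the build above, which suggest the multi-head-attention fusion did not apply; an unfused attention path could account for much of the gap versus PyTorch. To see where the time actually goes, trtexec can report per-layer timings against the saved engine. A possible follow-up run (all flags exist in trtexec 8.6; paths as in the original command):

```shell
# Per-layer profiling of the already-built engine.
# --useSpinWait stabilizes timing, as the log itself suggests.
trtexec --loadEngine=swin_transformer.engine \
        --shapes=input:40x3x224x224 \
        --useSpinWait \
        --separateProfileRun \
        --dumpProfile
```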