Description
Model: swin_base_patch4_window7_224
With the batch size set to 40, inference takes ~70 ms in TensorRT using the executeV2 API, while the same batch takes only ~30 ms in PyTorch.
Environment
TensorRT Version: 8.6
NVIDIA GPU: 3080, 3090
NVIDIA Driver Version: 535.183.01
CUDA Version: 12.2
Operating System: Ubuntu 20.04
Steps To Reproduce
C++ timing code as follows:

    auto begin = std::chrono::high_resolution_clock::now();
    context->executeV2(buffers);
    std::cout << "dataSize: " << dataSize
              << " infer time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(
                     std::chrono::high_resolution_clock::now() - begin)
                     .count()
              << "ms" << std::endl;

and the trtexec log:
trtexec --onnx=/home/xiaoying/code/Pegasus/red/weights/swintransformer/swin_base_patch4_window7_224_opset_17.onnx --minShapes=input:1x3x224x224 --optShapes=input:40x3x224x224 --maxShapes=input:40x3x224x224 --saveEngine=swin_transformer.engine --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=/home/xiaoying/code/Pegasus/red/weights/swintransformer/swin_base_patch4_window7_224_opset_17.onnx --minShapes=input:1x3x224x224 --optShapes=input:40x3x224x224 --maxShapes=input:40x3x224x224 --saveEngine=swin_transformer.engine --fp16
[12/20/2024-17:49:18] [I] === Model Options ===
[12/20/2024-17:49:18] [I] Format: ONNX
[12/20/2024-17:49:18] [I] Model: /home/xiaoying/code/Pegasus/red/weights/swintransformer/swin_base_patch4_window7_224_opset_17.onnx
[12/20/2024-17:49:18] [I] Output:
[12/20/2024-17:49:18] [I] === Build Options ===
[12/20/2024-17:49:18] [I] Max batch: explicit batch
[12/20/2024-17:49:18] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[12/20/2024-17:49:18] [I] minTiming: 1
[12/20/2024-17:49:18] [I] avgTiming: 8
[12/20/2024-17:49:18] [I] Precision: FP32+FP16
[12/20/2024-17:49:18] [I] LayerPrecisions:
[12/20/2024-17:49:18] [I] Layer Device Types:
[12/20/2024-17:49:18] [I] Calibration:
[12/20/2024-17:49:18] [I] Refit: Disabled
[12/20/2024-17:49:18] [I] Version Compatible: Disabled
[12/20/2024-17:49:18] [I] TensorRT runtime: full
[12/20/2024-17:49:18] [I] Lean DLL Path:
[12/20/2024-17:49:18] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[12/20/2024-17:49:18] [I] Exclude Lean Runtime: Disabled
[12/20/2024-17:49:18] [I] Sparsity: Disabled
[12/20/2024-17:49:18] [I] Safe mode: Disabled
[12/20/2024-17:49:18] [I] Build DLA standalone loadable: Disabled
[12/20/2024-17:49:18] [I] Allow GPU fallback for DLA: Disabled
[12/20/2024-17:49:18] [I] DirectIO mode: Disabled
[12/20/2024-17:49:18] [I] Restricted mode: Disabled
[12/20/2024-17:49:18] [I] Skip inference: Disabled
[12/20/2024-17:49:18] [I] Save engine: swin_transformer.engine
[12/20/2024-17:49:18] [I] Load engine:
[12/20/2024-17:49:18] [I] Profiling verbosity: 0
[12/20/2024-17:49:18] [I] Tactic sources: Using default tactic sources
[12/20/2024-17:49:18] [I] timingCacheMode: local
[12/20/2024-17:49:18] [I] timingCacheFile:
[12/20/2024-17:49:18] [I] Heuristic: Disabled
[12/20/2024-17:49:18] [I] Preview Features: Use default preview flags.
[12/20/2024-17:49:18] [I] MaxAuxStreams: -1
[12/20/2024-17:49:18] [I] BuilderOptimizationLevel: -1
[12/20/2024-17:49:18] [I] Input(s)s format: fp32:CHW
[12/20/2024-17:49:18] [I] Output(s)s format: fp32:CHW
[12/20/2024-17:49:18] [I] Input build shape: input=1x3x224x224+40x3x224x224+40x3x224x224
[12/20/2024-17:49:18] [I] Input calibration shapes: model
[12/20/2024-17:49:18] [I] === System Options ===
[12/20/2024-17:49:18] [I] Device: 0
[12/20/2024-17:49:18] [I] DLACore:
[12/20/2024-17:49:18] [I] Plugins:
[12/20/2024-17:49:18] [I] setPluginsToSerialize:
[12/20/2024-17:49:18] [I] dynamicPlugins:
[12/20/2024-17:49:18] [I] ignoreParsedPluginLibs: 0
[12/20/2024-17:49:18] [I]
[12/20/2024-17:49:18] [I] === Inference Options ===
[12/20/2024-17:49:18] [I] Batch: Explicit
[12/20/2024-17:49:18] [I] Input inference shape: input=40x3x224x224
[12/20/2024-17:49:18] [I] Iterations: 10
[12/20/2024-17:49:18] [I] Duration: 3s (+ 200ms warm up)
[12/20/2024-17:49:18] [I] Sleep time: 0ms
[12/20/2024-17:49:18] [I] Idle time: 0ms
[12/20/2024-17:49:18] [I] Inference Streams: 1
[12/20/2024-17:49:18] [I] ExposeDMA: Disabled
[12/20/2024-17:49:18] [I] Data transfers: Enabled
[12/20/2024-17:49:18] [I] Spin-wait: Disabled
[12/20/2024-17:49:18] [I] Multithreading: Disabled
[12/20/2024-17:49:18] [I] CUDA Graph: Disabled
[12/20/2024-17:49:18] [I] Separate profiling: Disabled
[12/20/2024-17:49:18] [I] Time Deserialize: Disabled
[12/20/2024-17:49:18] [I] Time Refit: Disabled
[12/20/2024-17:49:18] [I] NVTX verbosity: 0
[12/20/2024-17:49:18] [I] Persistent Cache Ratio: 0
[12/20/2024-17:49:18] [I] Inputs:
[12/20/2024-17:49:18] [I] === Reporting Options ===
[12/20/2024-17:49:18] [I] Verbose: Disabled
[12/20/2024-17:49:18] [I] Averages: 10 inferences
[12/20/2024-17:49:18] [I] Percentiles: 90,95,99
[12/20/2024-17:49:18] [I] Dump refittable layers:Disabled
[12/20/2024-17:49:18] [I] Dump output: Disabled
[12/20/2024-17:49:18] [I] Profile: Disabled
[12/20/2024-17:49:18] [I] Export timing to JSON file:
[12/20/2024-17:49:18] [I] Export output to JSON file:
[12/20/2024-17:49:18] [I] Export profile to JSON file:
[12/20/2024-17:49:18] [I]
[12/20/2024-17:49:18] [I] === Device Information ===
[12/20/2024-17:49:18] [I] Selected Device: NVIDIA GeForce RTX 3080
[12/20/2024-17:49:18] [I] Compute Capability: 8.6
[12/20/2024-17:49:18] [I] SMs: 68
[12/20/2024-17:49:18] [I] Device Global Memory: 9987 MiB
[12/20/2024-17:49:18] [I] Shared Memory per SM: 100 KiB
[12/20/2024-17:49:18] [I] Memory Bus Width: 320 bits (ECC disabled)
[12/20/2024-17:49:18] [I] Application Compute Clock Rate: 1.71 GHz
[12/20/2024-17:49:18] [I] Application Memory Clock Rate: 9.501 GHz
[12/20/2024-17:49:18] [I]
[12/20/2024-17:49:18] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[12/20/2024-17:49:18] [I]
[12/20/2024-17:49:18] [I] TensorRT version: 8.6.1
[12/20/2024-17:49:18] [I] Loading standard plugins
[12/20/2024-17:49:18] [I] [TRT] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 18, GPU 978 (MiB)
[12/20/2024-17:49:26] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1449, GPU +266, now: CPU 1544, GPU 1244 (MiB)
[12/20/2024-17:49:26] [I] Start parsing network model.
[12/20/2024-17:49:26] [I] [TRT] ----------------------------------------------------------------
[12/20/2024-17:49:26] [I] [TRT] Input filename: /home/xiaoying/code/Pegasus/red/weights/swintransformer/swin_base_patch4_window7_224_opset_17.onnx
[12/20/2024-17:49:26] [I] [TRT] ONNX IR version: 0.0.8
[12/20/2024-17:49:26] [I] [TRT] Opset version: 17
[12/20/2024-17:49:26] [I] [TRT] Producer name: pytorch
[12/20/2024-17:49:26] [I] [TRT] Producer version: 2.4.0
[12/20/2024-17:49:26] [I] [TRT] Domain:
[12/20/2024-17:49:26] [I] [TRT] Model version: 0
[12/20/2024-17:49:26] [I] [TRT] Doc string:
[12/20/2024-17:49:26] [I] [TRT] ----------------------------------------------------------------
[12/20/2024-17:49:27] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[12/20/2024-17:49:27] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[12/20/2024-17:49:27] [I] Finished parsing network model. Parse time: 0.844845
[12/20/2024-17:49:27] [I] [TRT] Graph optimization time: 0.115686 seconds.
[12/20/2024-17:49:27] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
mha_fusion.cpp:355: DCHECK(always(sym_eql(b1, calculateBatchesSym(dims_bmm1_output)))) failed.
mha_fusion.cpp:355: DCHECK(always(sym_eql(b1, calculateBatchesSym(dims_bmm1_output)))) failed.
[12/20/2024-17:53:48] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[12/20/2024-17:53:49] [I] [TRT] Total Host Persistent Memory: 3536
[12/20/2024-17:53:49] [I] [TRT] Total Device Persistent Memory: 18944
[12/20/2024-17:53:49] [I] [TRT] Total Scratch Memory: 963880960
[12/20/2024-17:53:49] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 167 MiB, GPU 1358 MiB
[12/20/2024-17:53:49] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 2 steps to complete.
[12/20/2024-17:53:49] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.008904ms to assign 2 blocks to 2 nodes requiring 1028106240 bytes.
[12/20/2024-17:53:49] [I] [TRT] Total Activation Memory: 1028106240
[12/20/2024-17:53:49] [W] [TRT] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[12/20/2024-17:53:49] [W] [TRT] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[12/20/2024-17:53:49] [W] [TRT] Check verbose logs for the list of affected weights.
[12/20/2024-17:53:49] [W] [TRT] - 140 weights are affected by this issue: Detected subnormal FP16 values.
[12/20/2024-17:53:49] [W] [TRT] - 53 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[12/20/2024-17:53:49] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +335, now: CPU 0, GPU 335 (MiB)
[12/20/2024-17:53:50] [I] Engine built in 271.633 sec.
[12/20/2024-17:53:50] [I] [TRT] Loaded engine size: 339 MiB
[12/20/2024-17:53:51] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +334, now: CPU 0, GPU 334 (MiB)
[12/20/2024-17:53:51] [I] Engine deserialized in 0.314119 sec.
[12/20/2024-17:53:51] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +981, now: CPU 0, GPU 1315 (MiB)
[12/20/2024-17:53:51] [I] Setting persistentCacheLimit to 0 bytes.
[12/20/2024-17:53:51] [I] Using random values for input input
[12/20/2024-17:53:51] [I] Input binding for input with dimensions 40x3x224x224 is created.
[12/20/2024-17:53:51] [I] Output binding for output with dimensions 40x12 is created.
[12/20/2024-17:53:51] [I] Starting inference
[12/20/2024-17:53:54] [I] Warmup completed 3 queries over 200 ms
[12/20/2024-17:53:54] [I] Timing trace has 41 queries over 3.26256 s
[12/20/2024-17:53:54] [I]
[12/20/2024-17:53:54] [I] === Trace details ===
[12/20/2024-17:53:54] [I] Trace averages of 10 runs:
[12/20/2024-17:53:54] [I] Average on 10 runs - GPU latency: 76.6097 ms - Host latency: 78.6104 ms (enqueue 2.95773 ms)
[12/20/2024-17:53:54] [I] Average on 10 runs - GPU latency: 76.2822 ms - Host latency: 78.2786 ms (enqueue 2.71705 ms)
[12/20/2024-17:53:54] [I] Average on 10 runs - GPU latency: 77.2628 ms - Host latency: 79.2623 ms (enqueue 2.68743 ms)
[12/20/2024-17:53:54] [I] Average on 10 runs - GPU latency: 80.7294 ms - Host latency: 82.7305 ms (enqueue 2.8291 ms)
[12/20/2024-17:53:54] [I]
[12/20/2024-17:53:54] [I] === Performance summary ===
[12/20/2024-17:53:54] [I] Throughput: 12.5668 qps
[12/20/2024-17:53:54] [I] Latency: min = 77.4142 ms, max = 87.4338 ms, mean = 79.7159 ms, median = 77.8967 ms, percentile(90%) = 83.7148 ms, percentile(95%) = 84.0322 ms, percentile(99%) = 87.4338 ms
[12/20/2024-17:53:54] [I] Enqueue Time: min = 2.57581 ms, max = 4.6181 ms, mean = 2.79792 ms, median = 2.69324 ms, percentile(90%) = 3.04767 ms, percentile(95%) = 3.07214 ms, percentile(99%) = 4.6181 ms
[12/20/2024-17:53:54] [I] H2D Latency: min = 1.98132 ms, max = 2.03809 ms, mean = 1.99573 ms, median = 1.99512 ms, percentile(90%) = 2.00317 ms, percentile(95%) = 2.00336 ms, percentile(99%) = 2.03809 ms
[12/20/2024-17:53:54] [I] GPU Compute Time: min = 75.4247 ms, max = 85.4312 ms, mean = 77.7155 ms, median = 75.8978 ms, percentile(90%) = 81.708 ms, percentile(95%) = 82.0315 ms, percentile(99%) = 85.4312 ms
[12/20/2024-17:53:54] [I] D2H Latency: min = 0.00292969 ms, max = 0.0126953 ms, mean = 0.00470715 ms, median = 0.00427246 ms, percentile(90%) = 0.00585938 ms, percentile(95%) = 0.0065918 ms, percentile(99%) = 0.0126953 ms
[12/20/2024-17:53:54] [I] Total Host Walltime: 3.26256 s
[12/20/2024-17:53:54] [I] Total GPU Compute Time: 3.18633 s
[12/20/2024-17:53:54] [W] * GPU compute time is unstable, with coefficient of variance = 3.46826%.
[12/20/2024-17:53:54] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/20/2024-17:53:54] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/20/2024-17:53:54] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=/home/xiaoying/code/Pegasus/red/weights/swintransformer/swin_base_patch4_window7_224_opset_17.onnx --minShapes=input:1x3x224x224 --optShapes=input:40x3x224x224 --maxShapes=input:40x3x224x224 --saveEngine=swin_transformer.engine --fp16
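Note also the two mha_fusion.cpp DCHECK failures during the build above, which suggest the multi-head-attention fusion did not apply; an unfused attention path could account for much of the gap versus PyTorch. To see where the time actually goes, trtexec can report per-layer timings against the saved engine. A possible follow-up run (all flags exist in trtexec 8.6; paths as in the original command):

```shell
# Per-layer profiling of the already-built engine.
# --useSpinWait stabilizes timing, as the log itself suggests.
trtexec --loadEngine=swin_transformer.engine \
        --shapes=input:40x3x224x224 \
        --useSpinWait \
        --separateProfileRun \
        --dumpProfile
```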