I am wondering if the PyTorch baseline is actually optimized enough? Specifically, could you:

- Remove autocast, since the model is already in FP16? Autocast would actually run some non-GEMM FP16 kernels in FP32 (or TF32 on Ampere GPUs).
- Run some warm-up iterations before measuring the inference latency (averaged across a few), like you did with TensorRT?
- Set `torch.backends.cudnn.benchmark = True` before running the GPU kernels?
On my local machine, just these optimizations (for lack of a better word, as they are not really optimizations) would make the PyTorch baseline at least 2X faster.
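
For reference, here is a minimal sketch of the kind of measurement loop I mean. The model and input below are just placeholders, not the actual ones from this repo; substitute the real FP16 model and batch:

```python
import time

import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # let cuDNN autotune conv kernels

# Placeholder FP16 model and input batch (substitute the real ones from the benchmark).
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
).half().cuda().eval()
inputs = torch.randn(1, 3, 224, 224, dtype=torch.float16, device="cuda")

with torch.no_grad():
    # Warm-up iterations (not timed); no autocast, since the model is already FP16.
    for _ in range(10):
        model(inputs)
    torch.cuda.synchronize()

    # Timed iterations, averaged.
    n_iters = 100
    start = time.time()
    for _ in range(n_iters):
        model(inputs)
    torch.cuda.synchronize()
    print(f"avg latency: {(time.time() - start) / n_iters * 1e3:.2f} ms")
```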