
Bottleneck in mapreducedim for convolutional layers #558

Open
KristofferC opened this issue Jan 16, 2019 · 8 comments · May be fixed by #1302 or #1832

Comments

@KristofferC (Contributor) commented Jan 16, 2019

Running the conv network for MNIST from the model-zoo, I obtained the following profile:

[profile screenshot]

The time spent in the mapreduce kernel (https://github.com/JuliaGPU/CuArrays.jl/blob/a3d2650db3eb62f25dcbe18a64ea0a0036caced4/src/mapreduce.jl#L27-L54) seems disproportionately large.
It appears to come from a call to sum following a call to unbroadcast. I'm guessing this is from the activation function?

The specific call to the mapreduce kernel is Base._mapreducedim!(f::typeof(identity), op::typeof(Base.add_sum), R::CuArray{Float32}, A::CuArray{Float32})
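
For context, that signature is what a dims-restricted sum (as produced by unbroadcasting) lowers to. A minimal way to reproduce the call, with made-up shapes:

using CuArrays

# Illustrative shapes only: a conv output laid out as W×H×C×N and a
# per-channel reduction target.
A = cu(rand(Float32, 28, 28, 16, 128))
R = cu(zeros(Float32, 1, 1, 16, 1))

# sum! lowers to Base._mapreducedim!(identity, Base.add_sum, R, A),
# i.e. the CuArrays kernel linked above.
sum!(R, A)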

@MikeInnes (Member) commented Jan 16, 2019

This is probably coming from the .+ b here. During the forward pass, b gets broadcast out, which means its gradient needs to be collapsed back down again (by summing across the broadcast dimensions).

Ideally our mapreducedim kernel would just be fast, but it's easier said than done to optimise these kinds of GPU kernels. I believe there was also some work on wrapping CUDNN's gradient function, which would do that reduction for us, but that's not hooked up yet.
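
To illustrate the shape bookkeeping, here is a minimal sketch (the W×H×C×N layout and the helper name are illustrative, not Flux's actual implementation):

# Forward: y = conv(x, w) .+ reshape(b, 1, 1, :, 1) broadcasts the per-channel
# bias b over the spatial and batch dimensions.
# Backward: the gradient Δy has to be summed back over exactly those broadcast
# dimensions to recover a gradient with the shape of b; that sum is the
# mapreducedim hotspot in the profile above.
function bias_gradient(Δy::AbstractArray{T,4}) where T
    dropdims(sum(Δy; dims=(1, 2, 4)); dims=(1, 2, 4))
end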

@KristofferC changed the title from "Bottleneck in mapreducedim" to "Bottleneck in mapreducedim for convolutional layers" on Jan 16, 2019
@MikeInnes (Member) commented

Yeah – the CUDNN wrappers were set up here, so it just needs someone to set up the right dispatch on the Flux side.

@jekbradbury (Contributor) commented

The slow mapreducedim kernel is my fault, and I've since learned that there's a more optimized kernel in Knet here that might help us understand what we're missing. Maybe a KnetArrays vs CuArrays benchmark can shed light on how big a difference it would make.
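
A rough sketch of such a benchmark (sizes are made up, it assumes both packages and a working GPU, and the timings would need proper device synchronization to be trustworthy):

using BenchmarkTools, CuArrays, Knet

A  = rand(Float32, 28, 28, 16, 128)
ca = cu(A)
ka = Knet.KnetArray(A)

# channel-wise reduction, i.e. the same kind of sum the bias gradient needs
@btime CuArrays.@sync sum($ca; dims=(1, 2, 4))
@btime sum($ka; dims=(1, 2, 4))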

@avik-pal (Member) commented

The integration on Flux's side is in #335, though it needs a few fixes.

@KristofferC (Contributor, Author) commented

Heh, I didn't know there was already an implementation, so I wrote one myself (although it's worse than the one in the PR).

Getting:

[profile screenshot]

so it seems that even with CUDNN the bias term is dominating.

@avik-pal (Member) commented

I ran the MNIST model with the PR I mentioned.

==7862== Profiling application: julia
==7862== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   20.80%  208.80us         4  52.199us  14.560us  103.33us  void cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>(int, int, int, float const *, int, cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>*, float const , kernel_grad_params, int, float, int, int, int, int)
                   17.70%  177.63us         2  88.815us  31.264us  146.37us  void cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
                   11.12%  111.65us         3  37.215us  25.280us  59.968us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                   10.17%  102.08us         4  25.520us  11.040us  43.232us  void calc_bias_diff<int=2, float, float, int=128, int=0>(cudnnTensorStruct, float const *, cudnnTensorStruct, float*, float, float, int)
                    6.41%  64.319us         1  64.319us  64.319us  64.319us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                    5.79%  58.079us         4  14.519us  3.6160us  26.623us  ptxcall_anonymous23_9
                    4.48%  44.960us         4  11.240us  2.2080us  21.664us  ptxcall_anonymous23_4
                    3.57%  35.872us         1  35.872us  35.872us  35.872us  void cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
                    2.41%  24.159us         1  24.159us  24.159us  24.159us  volta_sgemm_64x64_nn

The overhead of the bias term is quite low when the batch size is small (around 100), but increasing the batch size makes it worse: it becomes around 28% of the total time for a batch size of 1000.
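
That matches the shape of the reduction: the bias gradient collapses W·H·N elements down to a single value per channel, so the work grows linearly with the batch size N. A back-of-the-envelope sketch (the feature-map size is made up):

# elements reduced per output channel for a W×H feature map and batch size N
elements_per_channel(W, H, N) = W * H * N

elements_per_channel(24, 24, 100)    #  57_600
elements_per_channel(24, 24, 1000)   # 576_000, i.e. ~10x more reduction work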

@KristofferC (Contributor, Author) commented

Looking only at the forward pass, we currently have:

GPU activities:   59.96%  184.91ms      2350  78.683us  29.184us  117.03us  ptxcall_anonymous23_3
                  31.96%  98.559ms      2350  41.940us  16.960us  67.488us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)

while enabling cudnnConvolutionBiasActivationForward we have:

 GPU activities:   81.38%  108.64ms      2350  46.229us  18.368us  69.505us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)

by avoiding the anonymous broadcast kernel that applies the bias and activation function. I'll try to make a PR for it.
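
For reference, the operation that cudnnConvolutionBiasActivationForward fuses into a single kernel launch is, written out generically (a sketch of the semantics only, not the actual Flux/CuArrays code path; the fused CUDNN call only supports certain activations such as relu):

using NNlib: conv, relu

# The three steps the fused call collapses: the convolution, the per-channel
# bias add (the broadcast that produced the anonymous kernel above), and the
# activation function.
fused_reference(x, w, b; stride=1, pad=0) =
    relu.(conv(x, w; stride=stride, pad=pad) .+ reshape(b, 1, 1, :, 1))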

@darsnack linked a pull request on Jan 12, 2022 that will close this issue