
Bottleneck in mapreducedim for convolutional layers #558

Open
KristofferC opened this issue Jan 16, 2019 · 8 comments · May be fixed by #1302 or #1832

Comments

@KristofferC (Contributor) commented Jan 16, 2019

Running the conv network for MNIST from the model-zoo, I obtained the following profile:

[profile screenshot]

The time spent in the mapreduce kernel (https://github.com/JuliaGPU/CuArrays.jl/blob/a3d2650db3eb62f25dcbe18a64ea0a0036caced4/src/mapreduce.jl#L27-L54) seems disproportionately large.
It appears to come from a call to sum following a call to unbroadcast. I'm guessing this is from the activation function?

The specific call to the mapreduce kernel is Base._mapreducedim!(f::typeof(identity), op::typeof(Base.add_sum), R::CuArray{Float32}, A::CuArray{Float32})
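
For context, that signature is what a dims-restricted sum (as produced by unbroadcasting) lowers to. A minimal way to reproduce the call, with made-up shapes:

using CuArrays

# Illustrative shapes only: a conv output laid out as W×H×C×N and a
# per-channel reduction target.
A = cu(rand(Float32, 28, 28, 16, 128))
R = cu(zeros(Float32, 1, 1, 16, 1))

# sum! lowers to Base._mapreducedim!(identity, Base.add_sum, R, A),
# i.e. the CuArrays kernel linked above.
sum!(R, A)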

@MikeInnes (Member) commented Jan 16, 2019

This is probably coming from the .+ b here. During the forward pass, b gets broadcast out, which means its gradient needs to be collapsed back down again (by summing across the broadcast dimensions).

Ideally our mapreducedim kernel would just be fast, but it's easier said than done to optimise these kinds of GPU kernels. I believe there was also some work on wrapping CUDNN's gradient function, which would do that reduction for us, but that's not hooked up yet.
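
To illustrate the shape bookkeeping, here is a minimal sketch (the W×H×C×N layout and the helper name are illustrative, not Flux's actual implementation):

# Forward: y = conv(x, w) .+ reshape(b, 1, 1, :, 1) broadcasts the per-channel
# bias b over the spatial and batch dimensions.
# Backward: the gradient Δy has to be summed back over exactly those broadcast
# dimensions to recover a gradient with the shape of b; that sum is the
# mapreducedim hotspot in the profile above.
function bias_gradient(Δy::AbstractArray{T,4}) where T
    dropdims(sum(Δy; dims=(1, 2, 4)); dims=(1, 2, 4))
end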

@KristofferC changed the title from "Bottleneck in mapreducedim" to "Bottleneck in mapreducedim for convolutional layers" on Jan 16, 2019
@MikeInnes (Member) commented

Yeah – the CUDNN wrappers were set up here, so it just needs someone to set up the right dispatch on the Flux side.

@jekbradbury (Contributor) commented

The slow mapreducedim kernel is my fault, and I've since learned that there's a more optimized kernel in Knet here that might help us understand what we're missing. Maybe a KnetArrays vs CuArrays benchmark can shed light on how big a difference it would make.
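
A rough sketch of such a benchmark (sizes are made up, it assumes both packages and a working GPU, and the timings would need proper device synchronization to be trustworthy):

using BenchmarkTools, CuArrays, Knet

A  = rand(Float32, 28, 28, 16, 128)
ca = cu(A)
ka = Knet.KnetArray(A)

# channel-wise reduction, i.e. the same kind of sum the bias gradient needs
@btime CuArrays.@sync sum($ca; dims=(1, 2, 4))
@btime sum($ka; dims=(1, 2, 4))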

@avik-pal (Member) commented

The integration on Flux's side is in #335, though it needs a few fixes.

@KristofferC (Contributor, Author) commented

Heh, I didn't know there was already an implementation, so I wrote one myself (although it's worse than the one in the PR).

Getting:

[profile screenshot]

so it seems that even with CUDNN the bias term is dominating.

@avik-pal (Member) commented

I ran the MNIST model with the PR I mentioned.

==7862== Profiling application: julia
==7862== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   20.80%  208.80us         4  52.199us  14.560us  103.33us  void cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>(int, int, int, float const *, int, cudnn::detail::wgrad_alg0_engine<float, int=512, int=6, int=5, int=3, int=3, int=3, bool=1, int=512>*, float const , kernel_grad_params, int, float, int, int, int, int)
                   17.70%  177.63us         2  88.815us  31.264us  146.37us  void cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=7, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
                   11.12%  111.65us         3  37.215us  25.280us  59.968us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                   10.17%  102.08us         4  25.520us  11.040us  43.232us  void calc_bias_diff<int=2, float, float, int=128, int=0>(cudnnTensorStruct, float const *, cudnnTensorStruct, float*, float, float, int)
                    6.41%  64.319us         1  64.319us  64.319us  64.319us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                    5.79%  58.079us         4  14.519us  3.6160us  26.623us  ptxcall_anonymous23_9
                    4.48%  44.960us         4  11.240us  2.2080us  21.664us  ptxcall_anonymous23_4
                    3.57%  35.872us         1  35.872us  35.872us  35.872us  void cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::dgrad_engine<float, int=128, int=6, int=8, int=3, int=3, int=5, bool=1>*, kernel_grad_params, int, int, float, int, int, int)
                    2.41%  24.159us         1  24.159us  24.159us  24.159us  volta_sgemm_64x64_nn

The overhead of the bias term is quite low when the batch size is small (around 100), but increasing the batch size makes it worse: it becomes around 28% of the total time for a batch size of 1000.
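
That matches the shape of the reduction: the bias gradient collapses W·H·N elements down to a single value per channel, so the work grows linearly with the batch size N. A back-of-the-envelope sketch (the feature-map size is made up):

# elements reduced per output channel for a W×H feature map and batch size N
elements_per_channel(W, H, N) = W * H * N

elements_per_channel(24, 24, 100)    #  57_600
elements_per_channel(24, 24, 1000)   # 576_000, i.e. ~10x more reduction work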

@KristofferC (Contributor, Author) commented

Looking only at the forward pass, we currently have:

GPU activities:   59.96%  184.91ms      2350  78.683us  29.184us  117.03us  ptxcall_anonymous23_3
                  31.96%  98.559ms      2350  41.940us  16.960us  67.488us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)

while enabling cudnnConvolutionBiasActivationForward we have:

 GPU activities:   81.38%  108.64ms      2350  46.229us  18.368us  69.505us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)

by avoiding the anonymous broadcast kernel that applies the bias and activation function. I'll try to make a PR for it.
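
For reference, the operation that cudnnConvolutionBiasActivationForward fuses into a single kernel launch is, written out generically (a sketch of the semantics only, not the actual Flux/CuArrays code path; the fused CUDNN call only supports certain activations such as relu):

using NNlib: conv, relu

# The three steps the fused call collapses: the convolution, the per-channel
# bias add (the broadcast that produced the anonymous kernel above), and the
# activation function.
fused_reference(x, w, b; stride=1, pad=0) =
    relu.(conv(x, w; stride=stride, pad=pad) .+ reshape(b, 1, 1, :, 1))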

@darsnack linked a pull request on Jan 12, 2022 that will close this issue