Optimization pass #58

ljubobratovicrelja · 2016-09-29T17:53:19Z

This should be first optimization pass advised in #34 and #27, mostly by refactoring loops into mir.ndslice.algorithm iterations.

@9il I'll try making a commit per function (or related group of functions) when optimized. I'm making an early wip PR, so I can apply corrections from potential reviews to further optimizations. Hope you're OK with this plan.

Progress:

ljubobratovicrelja · 2016-09-29T17:57:03Z

After d870aa6, running benchmark on my laptop yields:

$ ./benchmark.sh master -i 1000 -t harris

-----------------------------------------------------------------------------------
Profiling...
-----------------------------------------------------------------------------------

master:
dcv.features.corner.harris.harrisCorners.3:775 ms and 185 μs
dcv.features.corner.harris.harrisCorners.5:1 sec, 527 ms, and 822 μs
dcv.features.corner.harris.shiTomasiCorners.3:724 ms and 623 μs
dcv.features.corner.harris.shiTomasiCorners.5:1 sec, 529 ms, and 490 μs
master profiling done, results written in /home/relja/git/dcv/tests/performance-tests/.cache/master/tests/performance-tests/profile.csv

this:
dcv.features.corner.harris.harrisCorners.3:499 ms and 515 μs
dcv.features.corner.harris.harrisCorners.5:586 ms and 269 μs
dcv.features.corner.harris.shiTomasiCorners.3:431 ms and 233 μs
dcv.features.corner.harris.shiTomasiCorners.5:578 ms and 913 μs
 profiling done, results written in /home/relja/git/dcv/tests/performance-tests/profile.csv

-----------------------------------------------------------------------------------
Comparing and writing comparison results...
-----------------------------------------------------------------------------------

Comparison done, results written in /home/relja/git/dcv/tests/performance-tests/benchmark.csv

-----------------------------------------------------------------------------------
Results...
-----------------------------------------------------------------------------------

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.features.corner.harris.harrisCorners.3,775185,499515,55
dcv.features.corner.harris.harrisCorners.5,1527822,586269,160
dcv.features.corner.harris.shiTomasiCorners.3,724623,431233,68
dcv.features.corner.harris.shiTomasiCorners.5,1529490,578913,164

9il · 2016-09-29T18:25:35Z

Are tests is single thread?

9il · 2016-09-29T18:29:48Z

Do you use mcpu=native flag for tests?
You may want to add fastmath

ljubobratovicrelja · 2016-09-29T18:34:34Z

Are tests is single thread?

No - do you think they should be? I wanted to test raw performance, seemed like there's no need to run them in single thread. Anyhow this is easily done because of optional parallelization, so it might as well be defined by optional flag when running tests.

Do you use mcpu=native flag for tests?

Oops, I was sure I added that... Not at the moment, will add it.

9il · 2016-09-29T18:46:22Z

Good paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.1608&rep=rep1&type=pdf
Looks like something we must have

ljubobratovicrelja · 2016-09-29T18:54:24Z

Good paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.1608&rep=rep1&type=pdf

Will take a look, thanks.

I originally wanted to implement constant time median filtering described here:
https://nomis80.org/ctmf.pdf

AFAIK it is commonly best used with bigger kernels. It's also vectorizable like the one described in the paper you've posted, which is great. We'll definitely talk about this in future..

9il · 2016-09-30T16:18:16Z

I originally wanted to implement constant time median filtering described here:
https://nomis80.org/ctmf.pdf

Looks very good too, probably better for median filter

ljubobratovicrelja · 2016-10-01T20:39:50Z

Finished dcv.imgproc.color module. I was mainly extracting conversion algorithm code from the main function, to be easily replaced afterwards with std.experimental.color conversion implementation. And by that replacing loops and lockstep with ndEach, ndMap and assumeSameStructure. With rgb2gray and gray2rgb it'd be really nice to have support for packed slices in assumeSafeStructure. Overall improvement in speed is nice:

-----------------------------------------------------------------------------------
Results...
-----------------------------------------------------------------------------------

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.imgproc.color.gray2rgb,284986,259867,9
dcv.imgproc.color.hsv2rgb,8043678,1001873,702
dcv.imgproc.color.rgb2gray,3845346,489591,685
dcv.imgproc.color.rgb2hsv,6374634,1177531,441
dcv.imgproc.color.rgb2yuv,4144262,1067521,288
dcv.imgproc.color.yuv2rgb,3945694,950825,314

9il · 2016-10-01T21:28:53Z

First, the speed up numbers are awesome! I would be very happy to see your blog post about this optimizaiton! For example, at http://blog.mir.dlang.io/

With rgb2gray and gray2rgb it'd be really nice to have support for packed slices in assumeSafeStructure.

The packed slices are not required here:
For example, for rgb2gray kernel algorithm will looks something like:

...
{
    auto prealloc = uninitilizedSlice!(V[3])(slice.shape);
    assumeSameStructure!("a", "b")(prealloc, slice).ndEach!rgb2grayImpl();
    ...
    return cast(typeof(return)) prealloc;
}
void rgb2grayImpl(P)(P pointerTuple) // no need ref
{
    auto v = pointerTuple.b;
    pointerTuple.a[0] = v;
    pointerTuple.a[1] = v;
    pointerTuple.a[2] = v;
}

9il · 2016-10-01T21:29:23Z

Also, probably std.experimental.color can be used for some kernels

9il · 2016-10-01T21:42:55Z

source/dcv/imgproc/color.d

-            prealloc = uninitializedArray!(V[])(range.shape[0] * range.shape[1]).sliced(range.shape[0], range.shape[1]);
+        auto r = cast(float)(pack[0].rgb) / 255.0f;
+        auto g = cast(float)(pack[1].rgb) / 255.0f;
+        auto b = cast(float)(pack[2].rgb) / 255.0f;


Fastmath may replace it with: cast(float)(pack[0].rgb) * (1.0f / 255.0f);
In the same time, this should be OK if (1.0f / 255.0f) * 255.0f == 1.0f.
The same true for 65535.0f.

Yes I think so. I changed it just to be sure. There are also num / 100.0f in few places that should be replaced with num * 0.01f.

9il · 2016-10-01T21:47:00Z

source/dcv/imgproc/color.d

-    m[] /= m.sum;
-    m[0].swap(m[2]);
+    auto m = rgb2GrayMltp[conv];
+    auto gray = range.pack!1.ndMap!(rgb => cast(V)(rgb[2] * m[0] + rgb[1] * m[1] + rgb[0] * m[2]));


It is better to define all inner lambdas as separate functions.
This will allow to apply @fastmath for them.
In addition, if the stide!2 == 1 (normal situation), then is much better/faster to use Slice!(2, XXX*), where XXX is a color or static array.

It is better to define all inner lambdas as separate functions.

OK, thanks. There's just one thing - this way function has to be public. If private I get this (example from gray2rbg):

../../.dub/packages/mir-0.18.2/mir/source/mir/ndslice/algorithm.d-mixin-414-mixin-504(504): Error: function dcv.imgproc.color.gray2rgbImpl!(Index).gray2rgbImpl is not accessible from module algorithm

This is not a huge problem, but those functions may clutter the public API. Is there a workaround for this? Seems like a language limitation, so I suppose there's nothing else to be done from Mir side?

9il · 2016-10-01T21:47:43Z

source/dcv/features/corner/harris.d


-    auto winSqr = winSize ^^ 2;
-    auto winHalf = winSize / 2;
+    window.ndEach!((pack) { auto gx = pack.fx; auto gy = pack.fy; r1 += gx * gx; r2 += gx * gy; r3 += gy * gy;  });


outer fastmath kernel

9il · 2016-10-01T21:48:40Z

source/dcv/imgproc/color.d

-
-        if (hsv[0] < 0)
-            hsv[0] += 360;
+    assumeSameStructure!("rgb", "hsv")(range, prealloc).pack!1.ndEach!((p) { rgb2hsvImpl!(V, R)(p); });


Just, ndEach!(rgb2hsvImpl!(...), otherwise @fastmath may not work properly

Again, Slice!(2, XXX*) should be used here instead of Slice!(3, Type*).
The reason is very simple. The color length can be known at compile time and the last stride equals to 1. This is very good optimization. So, std.experimental.color can help significantly.
Note, that you can convert, for example Slice!(2, RGB*) <--> Slice!(3, ubyte*).

ljubobratovicrelja · 2016-10-02T05:33:37Z

First, the speed up numbers are awesome! I would be very happy to see your blog post about this optimizaiton! For example, at http://blog.mir.dlang.io/

Yes - definitely. I think its very important for people to see how nd iterations are faster with ndslice.algorithm then with loops. This is important for Mir, but also very important for D - this example could mean a lot to newcomers searching for fast numerical code. I'll write it when we're finished with this PR, OK?

For example, for rgb2gray kernel algorithm will looks something like:

Awesome, thanks!

Also, probably std.experimental.color can be used for some kernels

Yes, for sure - I'll change this in following PRs.

9il · 2016-10-02T05:36:45Z

OK?

Looking forward!

9il · 2016-10-02T05:43:44Z

see also #59

ljubobratovicrelja · 2016-11-11T22:47:44Z

@henrygouk could you take a look at the merge with master diff (after bilateral sigma separation)? Selected line and those few below are the only edits, else is more-less taken from master. I believe I got the equation right, but I'd appreciate a double check. :)

ljubobratovicrelja · 2016-11-11T23:01:53Z

@9il I'd like to merge this one. PR has grown far beyond my expectations, and I believe that detailed revision at this point would be absurd. Except from few template parameter changes to allow lazy slices as inputs, API hasn't change, so I believe there would be no braking changes on merge. Please let me know if you think something else should be done before merge. And I'll tend to brake changes like this into smaller PRs in future, really sorry about this one...

Here is the benchmark result on the current state of the branch (take a special look at threshold results - thanks for the tip on lockstep):

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.features.corner.fast.FASTDetector,34834,35779,-2
dcv.features.corner.harris.harrisCorners.3,1624687,278579,483
dcv.features.corner.harris.harrisCorners.5,4406410,328159,1242
dcv.features.corner.harris.shiTomasiCorners.3,1583896,223794,607
dcv.features.corner.harris.shiTomasiCorners.5,4422529,297106,1388
dcv.features.rht.RhtCircles,1064020,987154,7
dcv.features.rht.RhtLines,1096606,1013977,8
dcv.features.utils.extractCorners,3164128,355564,789
dcv.imgproc.color.gray2rgb,441354,8918,4849
dcv.imgproc.color.hsv2rgb,433122,51392,742
dcv.imgproc.color.rgb2gray,262186,31813,724
dcv.imgproc.color.rgb2hsv,365969,65572,458
dcv.imgproc.color.rgb2yuv,272942,221719,23
dcv.imgproc.color.yuv2rgb,277880,219906,26
dcv.imgproc.convolution.conv.1D.3,124888,67486,85
dcv.imgproc.convolution.conv.1D.5,159795,68881,131
dcv.imgproc.convolution.conv.1D.7,206059,75361,173
dcv.imgproc.convolution.conv.2D.3x3,767058,120216,538
dcv.imgproc.convolution.conv.2D.5x5,1941055,360809,437
dcv.imgproc.convolution.conv.2D.7x7,3719552,865524,329
dcv.imgproc.convolution.conv.3D.3x3,2091025,374006,459
dcv.imgproc.convolution.conv.3D.5x5,5547364,1074208,416
dcv.imgproc.filter.bilateralFilter.3,6118754,1778482,244
dcv.imgproc.filter.bilateralFilter.5,16718651,4597027,263
dcv.imgproc.filter.calcGradients,2244798,506101,343
dcv.imgproc.filter.calcPartialDerivatives,428318,141520,202
dcv.imgproc.filter.canny,4108987,758240,441
dcv.imgproc.filter.close,705101,657362,7
dcv.imgproc.filter.dilate,638126,583649,9
dcv.imgproc.filter.erode,78763,79369,0
dcv.imgproc.filter.filterNonMaximum,477543,38968,1125
dcv.imgproc.filter.histEqual,103582,105822,-2
dcv.imgproc.filter.medianFilter.3,1029246,1019109,0
dcv.imgproc.filter.medianFilter.5,2954655,2881709,2
dcv.imgproc.filter.nonMaximaSupression,588455,84436,596
dcv.imgproc.filter.open,710023,651529,8
dcv.imgproc.imgmanip.remap,225780,62089,263
dcv.imgproc.imgmanip.resize.downsize,60381,65549,-7
dcv.imgproc.imgmanip.resize.upsize,466977,471281,0
dcv.imgproc.imgmanip.scale.downsize,62576,61688,1
dcv.imgproc.imgmanip.scale.upsize,470195,470125,0
dcv.imgproc.imgmanip.transformAffine,958879,939532,2
dcv.imgproc.imgmanip.transformPerspective,1094929,1090702,0
dcv.imgproc.imgmanip.warp,235169,63821,268
dcv.imgproc.threshold.threshold,124755,656,18917
dcv.multiview.stereo.matching.semiGlobalMatchingPipeline,56359182,56938945,-1
dcv.tracking.opticalflow.hornschunck.HornSchunckFlow,20398746,17306457,17
dcv.tracking.opticalflow.lucaskanade.LucasKanadeFlow,267640,288798,-7
dcv.tracking.opticalflow.pyramidflow.DensePyramidFlow.HornSchunckFlow,24545229,20135950,21
dcv.tracking.opticalflow.pyramidflow.SparsePyramidFlow.LucasKanadeFlow,368629,383785,-3

9il · 2016-11-12T08:38:11Z

Looks awesome! Great work!
Does new threshold works properly?

assumeSameStructure requires the same strides and shapes. The easist way is add a note in the docs, that all input images must be densi in the memory. And add assert check. Also this will allow to reduced code size and co.pilation time two times for all be vectorized functions.

Could you please write a.blog post in the Mir blog a out this awesome work?

9il · 2016-11-12T08:39:35Z

Sorry for the English . I am from phone

ljubobratovicrelja · 2016-11-12T08:49:01Z

Does new threshold works properly?

Of course, I've checked functionality after each change.

Ah tnx for reminding me about assumeSameStructure - yes I'll add note and assert check about this. I thought about performing a deep copy if slice memory is non contiguous, but I believe it's better to let user handle this. Do you agree?

Could you please write a.blog post in the Mir blog a out this awesome work?

Yes of course - we have already agreed on this! :-) I'll email you so we discuss details about the format of the post, as soon as I'm ready to start working on it! (which will be after this merge)

9il · 2016-11-12T09:01:50Z

. I thought about performing a deep copy if slice memory is non contiguous, but I believe it's better to let user handle this. Do you agree?

Yes

Parallelized and optimized corner response calculation.

d870aa6

ljubobratovicrelja added optimization wip labels Sep 29, 2016

ljubobratovicrelja added 6 commits October 1, 2016 17:38

Refactoring of extractCorners.

daf4e92

Refactoring rgb(bgr)2gray.

958f5a9

Refactoring rgb2hsv.

b6ac069

Refactoring hsv2rgb.

7da0159

uninitializedArray --> uninitializedSlice.

5658eb9

Refactoring yuv2rgb, overall module cleanup.

0fc5b01

ljubobratovicrelja added this to the Patch-up v0.1 milestone Oct 1, 2016

9il reviewed Oct 1, 2016

View reviewed changes

ljubobratovicrelja added 4 commits October 2, 2016 12:49

Added static slice packing helpers.

5970cb2

Extracted ndIter kernel lambdas to functions.

720fd50

div by 100.0 replaced with mul by 0.01.

21f6118

Extracted kernels to functions, added fastmath.

c3cb076

ljubobratovicrelja added 20 commits October 11, 2016 22:30

Fixed multi-channel bilateralFilter.

549cbab

Reorganized convolution templates to support lazy tensors.

539aced

Adopted convolution changes in benchmark functions.

13fc0b6

Refactoring of calcGradient, calcPartialDerivatives.

ecc4b0e

Refactoring of nonMaximaSuppression.

4569ffb

Refactored scaled and ranged to accept lazy tensors.

7f2f5b7

Parallelized calcPartialDerivatives, calcGradients and canny.

2cd6343

Fixed docs typo.

49a6963

Merge branch 'master' into optimization-patch-2

ba7b6ac

Merge branch 'master' into optimization-patch-2

1d2b8e4

Refactoring of filterNonMaximum.

b21c660

Docs fixup.

664a58f

filterNonMaximum optimization.

fcd1ede

Added type check for filterNonMaximum input.

b83ae5a

Refactoring of warp/remap.

c42e8fd

Added fastmath on linear interpolation.

0a83b39

Updated function template arguments.

24e8fbb

Refactoring of threshold.

0c744a7

Fixed fastmath floor.

7774f26

Merge with master.

824a4a8

Docs and formatting, asserts and notes on contiguous slice request.

d88c786

ljubobratovicrelja merged commit 753a1d3 into master Nov 12, 2016

ljubobratovicrelja deleted the optimization-patch-2 branch November 12, 2016 13:38

ljubobratovicrelja mentioned this pull request Nov 12, 2016

Iteration performance #27

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization pass #58

Optimization pass #58

ljubobratovicrelja commented Sep 29, 2016 •

edited

Loading

ljubobratovicrelja commented Sep 29, 2016

9il commented Sep 29, 2016 •

edited

Loading

9il commented Sep 29, 2016

ljubobratovicrelja commented Sep 29, 2016

9il commented Sep 29, 2016

ljubobratovicrelja commented Sep 29, 2016

9il commented Sep 30, 2016

ljubobratovicrelja commented Oct 1, 2016

9il commented Oct 1, 2016 •

edited

Loading

9il commented Oct 1, 2016

9il Oct 1, 2016

ljubobratovicrelja Oct 2, 2016

9il Oct 1, 2016

ljubobratovicrelja Oct 2, 2016

9il Oct 1, 2016

9il Oct 1, 2016 •

edited

Loading

9il Oct 1, 2016

ljubobratovicrelja commented Oct 2, 2016

9il commented Oct 2, 2016

9il commented Oct 2, 2016

ljubobratovicrelja commented Nov 11, 2016

ljubobratovicrelja commented Nov 11, 2016

9il commented Nov 12, 2016

9il commented Nov 12, 2016

ljubobratovicrelja commented Nov 12, 2016

9il commented Nov 12, 2016

Optimization pass #58

Optimization pass #58

Conversation

ljubobratovicrelja commented Sep 29, 2016 • edited Loading

ljubobratovicrelja commented Sep 29, 2016

9il commented Sep 29, 2016 • edited Loading

9il commented Sep 29, 2016

ljubobratovicrelja commented Sep 29, 2016

9il commented Sep 29, 2016

ljubobratovicrelja commented Sep 29, 2016

9il commented Sep 30, 2016

ljubobratovicrelja commented Oct 1, 2016

9il commented Oct 1, 2016 • edited Loading

9il commented Oct 1, 2016

9il Oct 1, 2016

Choose a reason for hiding this comment

ljubobratovicrelja Oct 2, 2016

Choose a reason for hiding this comment

9il Oct 1, 2016

Choose a reason for hiding this comment

ljubobratovicrelja Oct 2, 2016

Choose a reason for hiding this comment

9il Oct 1, 2016

Choose a reason for hiding this comment

9il Oct 1, 2016 • edited Loading

Choose a reason for hiding this comment

9il Oct 1, 2016

Choose a reason for hiding this comment

ljubobratovicrelja commented Oct 2, 2016

9il commented Oct 2, 2016

9il commented Oct 2, 2016

ljubobratovicrelja commented Nov 11, 2016

ljubobratovicrelja commented Nov 11, 2016

9il commented Nov 12, 2016

9il commented Nov 12, 2016

ljubobratovicrelja commented Nov 12, 2016

9il commented Nov 12, 2016

ljubobratovicrelja commented Sep 29, 2016 •

edited

Loading

9il commented Sep 29, 2016 •

edited

Loading

9il commented Oct 1, 2016 •

edited

Loading

9il Oct 1, 2016 •

edited

Loading