Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization pass #58

Merged
merged 48 commits into from
Nov 12, 2016
Merged

Optimization pass #58

merged 48 commits into from
Nov 12, 2016

Conversation

ljubobratovicrelja
Copy link
Member

@ljubobratovicrelja ljubobratovicrelja commented Sep 29, 2016

This should be first optimization pass advised in #34 and #27, mostly by refactoring loops into mir.ndslice.algorithm iterations.

@9il I'll try making a commit per function (or related group of functions) when optimized. I'm making an early wip PR, so I can apply corrections from potential reviews to further optimizations. Hope you're OK with this plan.

Progress:

  • dcv.features.corner.harris
  • dcv.features.utils.extractCorners
  • dcv.imgproc.color
  • dcv.imgproc.convolution
  • dcv.imgproc.filter.bilateralFilter
  • dcv.imgproc.filter.calcGradients
  • dcv.imgproc.filter.calcPartialDerivatives
  • dcv.imgproc.filter.canny
  • dcv.imgproc.filter.filterNonMaximum
  • dcv.imgproc.filter.nonMaximaSupression
  • dcv.imgproc.imgmanip.remap/warp
  • dcv.imgproc.threshold.threshold

@ljubobratovicrelja
Copy link
Member Author

After d870aa6, running benchmark on my laptop yields:

$ ./benchmark.sh master -i 1000 -t harris

-----------------------------------------------------------------------------------
Profiling...
-----------------------------------------------------------------------------------

master:
dcv.features.corner.harris.harrisCorners.3:775 ms and 185 μs
dcv.features.corner.harris.harrisCorners.5:1 sec, 527 ms, and 822 μs
dcv.features.corner.harris.shiTomasiCorners.3:724 ms and 623 μs
dcv.features.corner.harris.shiTomasiCorners.5:1 sec, 529 ms, and 490 μs
master profiling done, results written in /home/relja/git/dcv/tests/performance-tests/.cache/master/tests/performance-tests/profile.csv

this:
dcv.features.corner.harris.harrisCorners.3:499 ms and 515 μs
dcv.features.corner.harris.harrisCorners.5:586 ms and 269 μs
dcv.features.corner.harris.shiTomasiCorners.3:431 ms and 233 μs
dcv.features.corner.harris.shiTomasiCorners.5:578 ms and 913 μs
 profiling done, results written in /home/relja/git/dcv/tests/performance-tests/profile.csv

-----------------------------------------------------------------------------------
Comparing and writing comparison results...
-----------------------------------------------------------------------------------

Comparison done, results written in /home/relja/git/dcv/tests/performance-tests/benchmark.csv

-----------------------------------------------------------------------------------
Results...
-----------------------------------------------------------------------------------

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.features.corner.harris.harrisCorners.3,775185,499515,55
dcv.features.corner.harris.harrisCorners.5,1527822,586269,160
dcv.features.corner.harris.shiTomasiCorners.3,724623,431233,68
dcv.features.corner.harris.shiTomasiCorners.5,1529490,578913,164

@9il
Copy link
Member

9il commented Sep 29, 2016

Are tests is single thread?

@9il
Copy link
Member

9il commented Sep 29, 2016

Do you use mcpu=native flag for tests?
You may want to add fastmath

@ljubobratovicrelja
Copy link
Member Author

Are tests is single thread?

No - do you think they should be? I wanted to test raw performance, seemed like there's no need to run them in single thread. Anyhow this is easily done because of optional parallelization, so it might as well be defined by optional flag when running tests.

Do you use mcpu=native flag for tests?

Oops, I was sure I added that... Not at the moment, will add it.

@9il
Copy link
Member

9il commented Sep 29, 2016

Good paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.1608&rep=rep1&type=pdf
Looks like something we must have

@ljubobratovicrelja
Copy link
Member Author

Good paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.1608&rep=rep1&type=pdf

Will take a look, thanks.

I originally wanted to implement constant time median filtering described here:
https://nomis80.org/ctmf.pdf

AFAIK it is commonly best used with bigger kernels. It's also vectorizable like the one described in the paper you've posted, which is great. We'll definitely talk about this in future..

@9il
Copy link
Member

9il commented Sep 30, 2016

I originally wanted to implement constant time median filtering described here:
https://nomis80.org/ctmf.pdf

Looks very good too, probably better for median filter

@ljubobratovicrelja
Copy link
Member Author

Finished dcv.imgproc.color module. I was mainly extracting conversion algorithm code from the main function, to be easily replaced afterwards with std.experimental.color conversion implementation. And by that replacing loops and lockstep with ndEach, ndMap and assumeSameStructure. With rgb2gray and gray2rgb it'd be really nice to have support for packed slices in assumeSafeStructure. Overall improvement in speed is nice:

-----------------------------------------------------------------------------------
Results...
-----------------------------------------------------------------------------------

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.imgproc.color.gray2rgb,284986,259867,9
dcv.imgproc.color.hsv2rgb,8043678,1001873,702
dcv.imgproc.color.rgb2gray,3845346,489591,685
dcv.imgproc.color.rgb2hsv,6374634,1177531,441
dcv.imgproc.color.rgb2yuv,4144262,1067521,288
dcv.imgproc.color.yuv2rgb,3945694,950825,314

@ljubobratovicrelja ljubobratovicrelja added this to the Patch-up v0.1 milestone Oct 1, 2016
@9il
Copy link
Member

9il commented Oct 1, 2016

First, the speed up numbers are awesome! I would be very happy to see your blog post about this optimizaiton! For example, at http://blog.mir.dlang.io/

With rgb2gray and gray2rgb it'd be really nice to have support for packed slices in assumeSafeStructure.

The packed slices are not required here:
For example, for rgb2gray kernel algorithm will looks something like:

...
{
    auto prealloc = uninitilizedSlice!(V[3])(slice.shape);
    assumeSameStructure!("a", "b")(prealloc, slice).ndEach!rgb2grayImpl();
    ...
    return cast(typeof(return)) prealloc;
}
void rgb2grayImpl(P)(P pointerTuple) // no need ref
{
    auto v = pointerTuple.b;
    pointerTuple.a[0] = v;
    pointerTuple.a[1] = v;
    pointerTuple.a[2] = v;
}

@9il
Copy link
Member

9il commented Oct 1, 2016

Also, probably std.experimental.color can be used for some kernels

prealloc = uninitializedArray!(V[])(range.shape[0] * range.shape[1]).sliced(range.shape[0], range.shape[1]);
auto r = cast(float)(pack[0].rgb) / 255.0f;
auto g = cast(float)(pack[1].rgb) / 255.0f;
auto b = cast(float)(pack[2].rgb) / 255.0f;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fastmath may replace it with: cast(float)(pack[0].rgb) * (1.0f / 255.0f);
In the same time, this should be OK if (1.0f / 255.0f) * 255.0f == 1.0f.
The same true for 65535.0f.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think so. I changed it just to be sure. There are also num / 100.0f in few places that should be replaced with num * 0.01f.

m[] /= m.sum;
m[0].swap(m[2]);
auto m = rgb2GrayMltp[conv];
auto gray = range.pack!1.ndMap!(rgb => cast(V)(rgb[2] * m[0] + rgb[1] * m[1] + rgb[0] * m[2]));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to define all inner lambdas as separate functions.
This will allow to apply @fastmath for them.
In addition, if the stide!2 == 1 (normal situation), then is much better/faster to use Slice!(2, XXX*), where XXX is a color or static array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to define all inner lambdas as separate functions.

OK, thanks. There's just one thing - this way function has to be public. If private I get this (example from gray2rbg):

../../.dub/packages/mir-0.18.2/mir/source/mir/ndslice/algorithm.d-mixin-414-mixin-504(504): Error: function dcv.imgproc.color.gray2rgbImpl!(Index).gray2rgbImpl is not accessible from module algorithm

This is not a huge problem, but those functions may clutter the public API. Is there a workaround for this? Seems like a language limitation, so I suppose there's nothing else to be done from Mir side?


auto winSqr = winSize ^^ 2;
auto winHalf = winSize / 2;
window.ndEach!((pack) { auto gx = pack.fx; auto gy = pack.fy; r1 += gx * gx; r2 += gx * gy; r3 += gy * gy; });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outer fastmath kernel


if (hsv[0] < 0)
hsv[0] += 360;
assumeSameStructure!("rgb", "hsv")(range, prealloc).pack!1.ndEach!((p) { rgb2hsvImpl!(V, R)(p); });
Copy link
Member

@9il 9il Oct 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just, ndEach!(rgb2hsvImpl!(...), otherwise @fastmath may not work properly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, Slice!(2, XXX*) should be used here instead of Slice!(3, Type*).
The reason is very simple. The color length can be known at compile time and the last stride equals to 1. This is very good optimization. So, std.experimental.color can help significantly.
Note, that you can convert, for example Slice!(2, RGB*) <--> Slice!(3, ubyte*).

@ljubobratovicrelja
Copy link
Member Author

First, the speed up numbers are awesome! I would be very happy to see your blog post about this optimizaiton! For example, at http://blog.mir.dlang.io/

Yes - definitely. I think its very important for people to see how nd iterations are faster with ndslice.algorithm then with loops. This is important for Mir, but also very important for D - this example could mean a lot to newcomers searching for fast numerical code. I'll write it when we're finished with this PR, OK?

For example, for rgb2gray kernel algorithm will looks something like:

Awesome, thanks!

Also, probably std.experimental.color can be used for some kernels

Yes, for sure - I'll change this in following PRs.

@9il
Copy link
Member

9il commented Oct 2, 2016

OK?

Looking forward!

@9il
Copy link
Member

9il commented Oct 2, 2016

see also #59

@ljubobratovicrelja
Copy link
Member Author

@henrygouk could you take a look at the merge with master diff (after bilateral sigma separation)? Selected line and those few below are the only edits, else is more-less taken from master. I believe I got the equation right, but I'd appreciate a double check. :)

@ljubobratovicrelja
Copy link
Member Author

@9il I'd like to merge this one. PR has grown far beyond my expectations, and I believe that detailed revision at this point would be absurd. Except from few template parameter changes to allow lazy slices as inputs, API hasn't change, so I believe there would be no braking changes on merge. Please let me know if you think something else should be done before merge. And I'll tend to brake changes like this into smaller PRs in future, really sorry about this one...

Here is the benchmark result on the current state of the branch (take a special look at threshold results - thanks for the tip on lockstep):

Function Name,Previous Runtime[μs],Current Runtime[μs],Speedup[%]
dcv.features.corner.fast.FASTDetector,34834,35779,-2
dcv.features.corner.harris.harrisCorners.3,1624687,278579,483
dcv.features.corner.harris.harrisCorners.5,4406410,328159,1242
dcv.features.corner.harris.shiTomasiCorners.3,1583896,223794,607
dcv.features.corner.harris.shiTomasiCorners.5,4422529,297106,1388
dcv.features.rht.RhtCircles,1064020,987154,7
dcv.features.rht.RhtLines,1096606,1013977,8
dcv.features.utils.extractCorners,3164128,355564,789
dcv.imgproc.color.gray2rgb,441354,8918,4849
dcv.imgproc.color.hsv2rgb,433122,51392,742
dcv.imgproc.color.rgb2gray,262186,31813,724
dcv.imgproc.color.rgb2hsv,365969,65572,458
dcv.imgproc.color.rgb2yuv,272942,221719,23
dcv.imgproc.color.yuv2rgb,277880,219906,26
dcv.imgproc.convolution.conv.1D.3,124888,67486,85
dcv.imgproc.convolution.conv.1D.5,159795,68881,131
dcv.imgproc.convolution.conv.1D.7,206059,75361,173
dcv.imgproc.convolution.conv.2D.3x3,767058,120216,538
dcv.imgproc.convolution.conv.2D.5x5,1941055,360809,437
dcv.imgproc.convolution.conv.2D.7x7,3719552,865524,329
dcv.imgproc.convolution.conv.3D.3x3,2091025,374006,459
dcv.imgproc.convolution.conv.3D.5x5,5547364,1074208,416
dcv.imgproc.filter.bilateralFilter.3,6118754,1778482,244
dcv.imgproc.filter.bilateralFilter.5,16718651,4597027,263
dcv.imgproc.filter.calcGradients,2244798,506101,343
dcv.imgproc.filter.calcPartialDerivatives,428318,141520,202
dcv.imgproc.filter.canny,4108987,758240,441
dcv.imgproc.filter.close,705101,657362,7
dcv.imgproc.filter.dilate,638126,583649,9
dcv.imgproc.filter.erode,78763,79369,0
dcv.imgproc.filter.filterNonMaximum,477543,38968,1125
dcv.imgproc.filter.histEqual,103582,105822,-2
dcv.imgproc.filter.medianFilter.3,1029246,1019109,0
dcv.imgproc.filter.medianFilter.5,2954655,2881709,2
dcv.imgproc.filter.nonMaximaSupression,588455,84436,596
dcv.imgproc.filter.open,710023,651529,8
dcv.imgproc.imgmanip.remap,225780,62089,263
dcv.imgproc.imgmanip.resize.downsize,60381,65549,-7
dcv.imgproc.imgmanip.resize.upsize,466977,471281,0
dcv.imgproc.imgmanip.scale.downsize,62576,61688,1
dcv.imgproc.imgmanip.scale.upsize,470195,470125,0
dcv.imgproc.imgmanip.transformAffine,958879,939532,2
dcv.imgproc.imgmanip.transformPerspective,1094929,1090702,0
dcv.imgproc.imgmanip.warp,235169,63821,268
dcv.imgproc.threshold.threshold,124755,656,18917
dcv.multiview.stereo.matching.semiGlobalMatchingPipeline,56359182,56938945,-1
dcv.tracking.opticalflow.hornschunck.HornSchunckFlow,20398746,17306457,17
dcv.tracking.opticalflow.lucaskanade.LucasKanadeFlow,267640,288798,-7
dcv.tracking.opticalflow.pyramidflow.DensePyramidFlow.HornSchunckFlow,24545229,20135950,21
dcv.tracking.opticalflow.pyramidflow.SparsePyramidFlow.LucasKanadeFlow,368629,383785,-3

@9il
Copy link
Member

9il commented Nov 12, 2016

Looks awesome! Great work!
Does new threshold works properly?

assumeSameStructure requires the same strides and shapes. The easist way is add a note in the docs, that all input images must be densi in the memory. And add assert check. Also this will allow to reduced code size and co.pilation time two times for all be vectorized functions.

Could you please write a.blog post in the Mir blog a out this awesome work?

@9il
Copy link
Member

9il commented Nov 12, 2016

Sorry for the English . I am from phone

@ljubobratovicrelja
Copy link
Member Author

Does new threshold works properly?

Of course, I've checked functionality after each change.

Ah tnx for reminding me about assumeSameStructure - yes I'll add note and assert check about this. I thought about performing a deep copy if slice memory is non contiguous, but I believe it's better to let user handle this. Do you agree?

Could you please write a.blog post in the Mir blog a out this awesome work?

Yes of course - we have already agreed on this! :-) I'll email you so we discuss details about the format of the post, as soon as I'm ready to start working on it! (which will be after this merge)

@9il
Copy link
Member

9il commented Nov 12, 2016

. I thought about performing a deep copy if slice memory is non contiguous, but I believe it's better to let user handle this. Do you agree?

Yes

@ljubobratovicrelja ljubobratovicrelja merged commit 753a1d3 into master Nov 12, 2016
@ljubobratovicrelja ljubobratovicrelja deleted the optimization-patch-2 branch November 12, 2016 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants