Optimization pass #58
Conversation
After d870aa6, running the benchmark on my laptop yields:
Are the tests run in a single thread?

Do you use the `mcpu=native` flag for the tests?
No - do you think they should be? I wanted to test raw performance, and there seemed to be no need to run them in a single thread. Anyhow, this is easily done because of the optional parallelization, so it might as well be controlled by an optional flag when running tests.

Oops, I was sure I added that... Not at the moment, I will add it.
Good paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.1608&rep=rep1&type=pdf

Will take a look, thanks. I originally wanted to implement the constant-time median filtering described here: AFAIK it is commonly best used with bigger kernels. It's also vectorizable, like the one described in the paper you've posted, which is great. We'll definitely talk about this in the future.

Looks very good too, probably better for the median filter.
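For context, the core idea behind constant-time median filtering is a running histogram that is updated incrementally as the window slides, so the per-pixel cost does not grow with the kernel size. A plain-Python sketch (illustrative only, not DCV or D code; shown in 1-D for brevity, the papers extend it to 2-D):

```python
# Sliding-window median for 8-bit data using a 256-bin histogram.
# Each step does O(1) histogram updates plus a bounded bin scan,
# independent of the window size -- the "constant time" trick.
def sliding_median(data, win):
    hist = [0] * 256
    for v in data[:win]:          # histogram of the first window
        hist[v] += 1
    target = (win + 1) // 2       # rank of the median element
    out = []
    for i in range(len(data) - win + 1):
        if i > 0:
            hist[data[i - 1]] -= 1        # value leaving the window
            hist[data[i + win - 1]] += 1  # value entering the window
        count = 0
        for value, c in enumerate(hist):  # walk bins to the median rank
            count += c
            if count >= target:
                out.append(value)
                break
    return out
```

For example, `sliding_median([1, 2, 3, 4, 5], 3)` yields `[2, 3, 4]`.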
Finished
First, the speed-up numbers are awesome! I would be very happy to see your blog post about this optimization! For example, at http://blog.mir.dlang.io/
The packed slices are not required here:

```d
{
    auto prealloc = uninitializedSlice!(V[3])(slice.shape);
    assumeSameStructure!("a", "b")(prealloc, slice).ndEach!rgb2grayImpl();
    ...
    return cast(typeof(return)) prealloc;
}

void rgb2grayImpl(P)(P pointerTuple) // no need for ref
{
    auto v = pointerTuple.b;
    pointerTuple.a[0] = v;
    pointerTuple.a[1] = v;
    pointerTuple.a[2] = v;
}
```
Also, std.experimental.color can probably be used for some kernels.
```d
prealloc = uninitializedArray!(V[])(range.shape[0] * range.shape[1]).sliced(range.shape[0], range.shape[1]);
auto r = cast(float)(pack[0].rgb) / 255.0f;
auto g = cast(float)(pack[1].rgb) / 255.0f;
auto b = cast(float)(pack[2].rgb) / 255.0f;
```
Fastmath may replace it with `cast(float)(pack[0].rgb) * (1.0f / 255.0f)`. At the same time, this should be OK if `(1.0f / 255.0f) * 255.0f == 1.0f`. The same is true for `65535.0f`.
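The identity can be checked directly. A quick check in Python (illustrative only; the actual code is D compiled with LDC) that emulates single-precision arithmetic via a float32 round-trip, confirming that the rounded reciprocal still maps 255 and 65535 exactly back to 1.0:

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest IEEE-754 single."""
    return struct.unpack('f', struct.pack('f', x))[0]

inv255 = f32(1.0 / 255.0)      # what fastmath folds the division into
inv65535 = f32(1.0 / 65535.0)

# The multiply-by-reciprocal rewrite is exact at the endpoints:
assert f32(inv255 * 255.0) == 1.0
assert f32(inv65535 * 65535.0) == 1.0
```

In-between values can still differ from the true division by one ULP, which is the usual fastmath trade-off.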
Yes, I think so. I changed it just to be sure. There are also `num / 100.0f` in a few places that should be replaced with `num * 0.01f`.
```d
m[] /= m.sum;
m[0].swap(m[2]);
auto m = rgb2GrayMltp[conv];
auto gray = range.pack!1.ndMap!(rgb => cast(V)(rgb[2] * m[0] + rgb[1] * m[1] + rgb[0] * m[2]));
```
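For reference, the same computation in plain Python (illustrative only; the weights below are the common ITU-R BT.601 luma coefficients, assumed here, while the actual values live in the `rgb2GrayMltp` table). The swap exists because the kernel reads channels in reversed order, so the swapped indexing yields the same weighted sum:

```python
# Normalize the weights and swap the outer pair, mirroring
# `m[] /= m.sum` and `m[0].swap(m[2])` above.
m = [0.299, 0.587, 0.114]          # assumed BT.601 R, G, B weights
s = sum(m)
m = [w / s for w in m]
m[0], m[2] = m[2], m[0]            # now ordered for the reversed indexing

rgb = (10, 200, 50)                # one sample (R, G, B) pixel
gray = rgb[2] * m[0] + rgb[1] * m[1] + rgb[0] * m[2]

# Same result as the direct R*wr + G*wg + B*wb sum:
assert abs(gray - (0.299 * 10 + 0.587 * 200 + 0.114 * 50)) < 1e-6
```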
It is better to define all inner lambdas as separate functions. This allows applying `@fastmath` to them. In addition, if `stride!2 == 1` (the normal situation), it is much better/faster to use `Slice!(2, XXX*)`, where `XXX` is a color or static array type.
> It is better to define all inner lambdas as separate functions.

OK, thanks. There's just one thing: this way the function has to be public. If it is private, I get this (example from gray2rgb):

```
../../.dub/packages/mir-0.18.2/mir/source/mir/ndslice/algorithm.d-mixin-414-mixin-504(504): Error: function dcv.imgproc.color.gray2rgbImpl!(Index).gray2rgbImpl is not accessible from module algorithm
```

This is not a huge problem, but those functions may clutter the public API. Is there a workaround for this? It seems like a language limitation, so I suppose there's nothing to be done from the Mir side?
```d
auto winSqr = winSize ^^ 2;
auto winHalf = winSize / 2;
window.ndEach!((pack) { auto gx = pack.fx; auto gy = pack.fy; r1 += gx * gx; r2 += gx * gy; r3 += gy * gy; });
```
Outer `@fastmath` kernel.
```d
if (hsv[0] < 0)
    hsv[0] += 360;
assumeSameStructure!("rgb", "hsv")(range, prealloc).pack!1.ndEach!((p) { rgb2hsvImpl!(V, R)(p); });
```
Just `ndEach!(rgb2hsvImpl!(...))`, otherwise `@fastmath` may not work properly.
Again, `Slice!(2, XXX*)` should be used here instead of `Slice!(3, Type*)`. The reason is very simple: the color length can be known at compile time, and the last stride equals 1. This is a very good optimization, so std.experimental.color can help significantly. Note that you can convert, for example, `Slice!(2, RGB*) <--> Slice!(3, ubyte*)`.
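The `Slice!(2, RGB*) <--> Slice!(3, ubyte*)` equivalence is just two index arithmetics over the same dense buffer. A plain-Python illustration (not D code):

```python
# A dense H x W x 3 image can be viewed either as a 3-D array of
# channel bytes or as a 2-D array of RGB triples; both are different
# index arithmetic over the same flat memory.
H, W = 2, 2
flat = list(range(H * W * 3))      # 12 "bytes" of a tiny 2x2 RGB image

def at3(r, c, ch):                 # Slice!(3, ubyte*)-style indexing
    return flat[(r * W + c) * 3 + ch]

def at2(r, c):                     # Slice!(2, RGB*)-style indexing
    base = (r * W + c) * 3
    return tuple(flat[base:base + 3])

assert at2(1, 0) == (at3(1, 0, 0), at3(1, 0, 1), at3(1, 0, 2))
```

Because the triple length is a compile-time constant and the innermost stride is 1, the 2-D view gives the compiler everything it needs to vectorize.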
Yes, definitely. I think it's very important for people to see how nd iterations are faster with ndslice.algorithm than with loops. This is important for Mir, but also very important for D: this example could mean a lot to newcomers searching for fast numerical code. I'll write it when we're finished with this PR, OK?
Awesome, thanks!
Yes, for sure - I'll change this in the following PRs.

Looking forward to it!

See also #59.
@henrygouk could you take a look at the merge-with-master diff (after the bilateral sigma separation)? The selected line and the few below it are the only edits; the rest is more or less taken from master. I believe I got the equation right, but I'd appreciate a double check. :)
@9il I'd like to merge this one. The PR has grown far beyond my expectations, and I believe that a detailed revision at this point would be absurd. Except for a few template parameter changes to allow lazy slices as inputs, the API hasn't changed, so I believe there would be no breaking changes on merge. Please let me know if you think something else should be done before the merge. And I'll tend to break changes like this into smaller PRs in the future, really sorry about this one... Here is the benchmark result on the current state of the branch (take a special look at the threshold results - thanks for the tip on
Looks awesome! Great work! `assumeSameStructure` requires the same strides and shapes. The easiest way is to add a note in the docs that all input images must be dense in memory, and to add an assert check. Also, this will allow reducing code size and compilation time by two times for all vectorized functions. Could you please write a blog post on the Mir blog about this awesome work?

Sorry for the English, I am on my phone.
Of course, I've checked functionality after each change. Ah, thanks for reminding me about `assumeSameStructure` - yes, I'll add a note and an assert check about this. I thought about performing a deep copy if the slice memory is non-contiguous, but I believe it's better to let the user handle this. Do you agree?
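The planned assert amounts to a contiguity check. A plain-Python sketch (`is_dense` is a hypothetical helper for illustration, not DCV API): a slice is dense exactly when its strides are the row-major strides implied by its shape.

```python
# A buffer is "dense" (contiguous, row-major) when each stride equals
# the product of all faster-varying extents.
def is_dense(shape, strides):
    expected = 1
    for extent, stride in zip(reversed(shape), reversed(strides)):
        if stride != expected:
            return False
        expected *= extent
    return True

# 480x640 RGB image, packed: strides are (640*3, 3, 1)
assert is_dense((480, 640, 3), (640 * 3, 3, 1))
# Every-other-row view of the same data is NOT dense:
assert not is_dense((480, 640, 3), (640 * 3 * 2, 3, 1))
```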
Yes, of course - we have already agreed on this! :-) I'll email you so we can discuss the details about the format of the post as soon as I'm ready to start working on it (which will be after this merge).

Yes.
This should be the first optimization pass advised in #34 and #27, done mostly by refactoring loops into mir.ndslice.algorithm iterations.

@9il I'll try making a commit per function (or per related group of functions) as each is optimized. I'm making an early WIP PR so I can apply corrections from potential reviews to further optimizations. Hope you're OK with this plan.
Progress: