Decoding animated WebP is 4x slower than libwebp-sys + webp-animation #119
Comments
I looked at the profile and the associated code a bit. The two low-hanging optimization opportunities are:
|
Paper on fast alpha blending without divisions: https://arxiv.org/pdf/2202.02864 |
Here is a ready-made method if needed: https://github.com/awxkee/pic-scale-safe/blob/7f1066e006aac5eea4c1474f5bbe81cae837b2b7/src/alpha.rs#L43 Division by alpha, if required, is much less straightforward. |
I've attempted to optimize alpha blending by performing it in u16 instead of f64. I got the primitives working (rounding integer division both by 255 and by an arbitrary u8) but when I assemble the entire thing with them it falls apart. Pretty early, too - calculating the resulting alpha is already wrong, we don't even get to the interesting part of blending the RGB values. This is what I've got so far: https://github.com/image-rs/image-webp/compare/main...Shnatsel:image-webp:faster-alpha-blending?expand=1 By the way, @awxkee in your |
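For comparison, here is a minimal sketch of integer-only source-over blending for non-premultiplied RGBA that postpones all rounding until the final division, so no intermediate rounding error can accumulate. It is not the code from the linked branch, from image-webp, or from libwebp; the function name `blend_over` and the `[u8; 4]` pixel layout (R, G, B, A) are assumptions made for illustration.

```rust
/// Source-over blend of one non-premultiplied RGBA pixel (`src`) onto
/// another (`dst`) using only integer arithmetic.
/// Hypothetical helper for illustration, not part of image-webp.
fn blend_over(src: [u8; 4], dst: [u8; 4]) -> [u8; 4] {
    let sa = u32::from(src[3]);
    let da = u32::from(dst[3]);

    // Blend weights, both scaled by 255 * 255 relative to the 0.0..=1.0 range.
    let src_w = sa * 255;
    let dst_w = da * (255 - sa);
    // Unrounded output alpha, scaled by 255. Never exceeds 255 * 255.
    let total_w = src_w + dst_w;

    // Fully transparent result: the color is undefined, so return zeros.
    if total_w == 0 {
        return [0; 4];
    }

    let mut out = [0u8; 4];
    for i in 0..3 {
        // Fits in u32: the sum is at most 255 * total_w <= 255^3.
        let num = u32::from(src[i]) * src_w + u32::from(dst[i]) * dst_w;
        // Single rounding division per channel.
        out[i] = ((num + total_w / 2) / total_w) as u8;
    }
    // Rounding division by 255 for the alpha channel.
    out[3] = ((total_w + 127) / 255) as u8;
    out
}
```

This still performs one real division per channel, so it is more of a correctness baseline to test faster variants against than a speed-up in itself.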
I think your |
It is also possible to add a special branch for divisors that are powers of two, for even faster division:

    fn is_power_of_two(n: u32) -> bool {
        n != 0 && (n & (n - 1)) == 0
    }

    fn power_of_two(n: u32) -> u32 {
        n.trailing_zeros()
    }

And then perform: value >> power_of_two(n) |
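A minimal sketch of how such a branch could be wired into a division helper, using the standard library's `u32::is_power_of_two` and `u32::trailing_zeros`. The name `div_with_pow2_fast_path` is made up for this example, and the shift truncates exactly like `/` does, so any rounding would have to be added separately.

```rust
/// Truncating division with a shift-based fast path for power-of-two divisors.
/// Hypothetical helper for illustration; `divisor` must be nonzero.
fn div_with_pow2_fast_path(value: u32, divisor: u32) -> u32 {
    debug_assert!(divisor != 0, "division by zero");
    if divisor.is_power_of_two() {
        // divisor == 1 << k, so dividing is the same as shifting right by k.
        value >> divisor.trailing_zeros()
    } else {
        // General case: fall back to a regular integer division.
        value / divisor
    }
}
```

Whether the extra branch pays off in a blending loop, where the divisor changes from pixel to pixel, is something only a benchmark can answer.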
Here is a table to try out: https://gist.github.com/awxkee/b8df92e4f97346fb56ae91dc4dcca779 |
Thank you! I'll benchmark that and dig deeper into the performance of these things once we actually have a working alpha blending routine. Right now I'm not even sure if |
Okay, I checked how libwebp does it, and they actually do it in We should probably just port that. |
I've ported the libwebp algorithm. It is really inaccurate at low alpha levels but nobody is going to notice that anyway. It gives an 8% end-to-end performance boost on this sample. My conversion of the Realistically though, the libwebp blending is still going to be faster, if we're willing to take the accuracy hit at very low alpha values. |
You could perhaps replace their 'approximate division by 255' with exact division? I think this is the main source of the inaccuracy. |
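For reference, an exactly rounded division by 255 does not need a division instruction for inputs up to 255 * 255, which covers any product of two u8 values. The sketch below uses the well-known add-and-shift identity; it is not taken from libwebp or image-webp, and the names `div255_round` and `mul_div255` are made up for this example.

```rust
/// Rounding division by 255, exact for 0 <= x <= 255 * 255.
/// Hypothetical helper for illustration.
fn div255_round(x: u16) -> u8 {
    debug_assert!(x <= 255 * 255, "input out of the supported range");
    // Adding 128 gives round-to-nearest; folding the high byte back in
    // corrects for dividing by 256 (the shift) instead of 255.
    let t = x + 128;
    ((t + (t >> 8)) >> 8) as u8
}

/// Multiplies two 8-bit values interpreted as fractions of 255,
/// for example a color channel and an alpha value.
fn mul_div255(a: u8, b: u8) -> u8 {
    div255_round(u16::from(a) * u16::from(b))
}
```

All intermediate values stay below 65536, so the whole computation fits in u16.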
I turned an I also re-added support for big endian. C doesn't have a |
Actually if you're using |
That method results in a less precise approximation of the floating-point division, and I'm seeing a greater divergence from the floating-point reference. I believe the trick with the other |
In image v0.25.4 and image-webp v0.2.0, decoding the attached animated WebP is 4x slower than using libwebp-sys + webp-animation: sample.zip

image
hyperfine results:

libwebp-sys + webp-animation
hyperfine results:

Analysis

webp-animation shows a bit of multi-threading happening on the profile, with user time being longer than the total execution time, but even accounting for that image-webp is 3x slower.

Breakdown of where the time is spent in image, recorded by samply: https://share.firefox.dev/4fc3utg

The greatest contributors seem to be image_webp::vp8::Vp8Decoder::decode_frame (48%), image_webp::extended::do_alpha_blending (20%), image_webp::vp8::Frame::fill_rgba (16%).

Within decode_frame the biggest contributor is image_webp::vp8::Vp8Decoder::read_coefficients (12% self time, 32% total time), and the code of that function looks like it could be optimized further to reduce bounds checks, etc. #71 is also relevant, but only accounts for 20% of the total time.