One last improvement for thomaswue #702
Conversation
@gunnarmorling Added one last code layout change that improves performance by 1-2%, but that is really it now ;-).
@thomaswue challenge accepted!
Wait, now everybody switches to processing 16 bytes? I was experimenting with that more than a week ago haha. 🤦🏻‍♂️
@royvanrijn :)))) I have no idea how you managed to do so well while benchmarking on M2! 🙇 🪄 I found performance results on ARM not to be representative at all. I included the 16-byte processing in this PR. @mtopolnik had been playing with a similar thing even before. It performed quite well from the start, but two major things were holding it back:
Based on the various experiments around the 16-byte processing, I also thought it would not help that much. But it seems that for my solution it was the last missing piece. It is amazing how, once other bottlenecks are solved, a certain optimization can suddenly become super critical. It completely changes the profile reported by perf stat: from a lot of bad speculation to almost no bad speculation, much higher instructions per cycle, and almost no branch misses anymore.
Obviously, missed branches are bad. But what I did not expect is just how bad they are. About 500 million branch misses are removed (which makes sense, because basically every second row is a branch miss in the version with separate 8-byte and 16-byte cases). The instruction count for getting rid of them goes up by about 17 billion (!), and the new version still takes more than 20% fewer cycles.
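The branchless delimiter search behind this discussion is usually done with SWAR ("SIMD within a register"): eight bytes are compared against `;` in one go, with no per-byte branch. Below is a minimal sketch of that classic trick — the class name, the `pack` helper, and the packing order are illustrative, not code from this PR:

```java
import java.nio.charset.StandardCharsets;

public class SwarSearch {
    private static final long SEMI_PATTERN = 0x3B3B3B3B3B3B3B3BL; // ';' broadcast to all 8 lanes

    // Returns a long whose high bit is set in every byte lane equal to ';'.
    // This is the classic SWAR zero-byte test applied to (word XOR pattern).
    static long semicolonMatchBits(long word) {
        long x = word ^ SEMI_PATTERN; // matching lanes become 0x00
        return (x - 0x0101010101010101L) & ~x & 0x8080808080808080L;
    }

    // Byte index (0-7) of the first ';' in the word, or 8 if absent.
    static int firstSemicolon(long word) {
        return Long.numberOfTrailingZeros(semicolonMatchBits(word)) >>> 3;
    }

    // Packs up to the first 8 bytes of s into a little-endian long (demo only).
    static long pack(String s) {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        long word = 0;
        for (int i = Math.min(8, b.length) - 1; i >= 0; i--) {
            word = (word << 8) | (b[i] & 0xFF);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(firstSemicolon(pack("Abuja;21"))); // prints 5
    }
}
```

The point relevant to the branch-miss numbers above: `firstSemicolon` produces the delimiter position with pure arithmetic, so the CPU has nothing to mispredict per byte.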
Nice! Amazing to see this level of improvement at this point. Thanks a lot for participating in 1BRC, congrats on implementing one of the very top entries, and for keeping pushing updates over the whole month, @thomaswue!
@thomaswue congratulations!
Wow, cool, thank you, that is even faster on the evaluation server than expected! Nice to see how the solution can very much compete while keeping the core Result data structure in the style of a Java class. As referenced in the credit and co-author section, I got a lot of inspiration and borrowed ideas from @merykitty, @mukel, @artsiomkorzun, @jerrinot, and @abeobk. Thank you guys, it was a blast, I enjoyed it a lot (as you could see)! And I also learned a lot. Some of those learnings will go into future Graal compiler optimizations and tunings for sure!
which translates into: know the pattern of your data and your domain, and you can get an immensely faster program — i.e., the right stride to parse, branchless parsing, and decent locality (without Unsafe)
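The "branchless parsing" mentioned here is well illustrated by the temperature parse that @merykitty popularized during the challenge. The sketch below is reconstructed for illustration (class and method names are mine, and `pack` exists only for the demo); it assumes the number starts at the first byte of a little-endian word and matches `-?\d{1,2}\.\d`, exactly the data pattern the comment is talking about:

```java
public class BranchlessParse {
    // Parses a temperature like "-12.3" or "7.5", packed little-endian into a
    // long (first character in the lowest byte), returning tenths of a degree.
    static int parseTenths(long word) {
        // All-ones if the first byte is '-', else zero: digits have ASCII bit 4
        // set, '-' does not.
        long signed = (~word << 59) >> 63;
        // Mask that zeroes out the '-' byte when present.
        long designMask = ~(signed & 0xFF);
        // Bit position of the '.', the only byte in lanes 1-3 with bit 4 clear.
        int dotPos = Long.numberOfTrailingZeros(~word & 0x10101000);
        // Align the digits so tens/ones/tenths land in fixed lanes; keep low nibbles.
        long digits = ((word & designMask) << (28 - dotPos)) & 0x0F000F0F00L;
        // A single multiply computes tens*100 + ones*10 + tenths into bits 32..41.
        long absValue = ((digits * 0x640a0001L) >>> 32) & 0x3FF;
        return (int) ((absValue ^ signed) - signed); // branch-free conditional negate
    }

    // Packs up to the first 8 bytes of s into a little-endian long (demo only).
    static long pack(String s) {
        byte[] b = s.getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        long word = 0;
        for (int i = Math.min(8, b.length) - 1; i >= 0; i--) {
            word = (word << 8) | (b[i] & 0xFF);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(parseTenths(pack("-12.3"))); // prints -123
        System.out.println(parseTenths(pack("7.5")));   // prints 75
    }
}
```

Every step is straight-line arithmetic — no branch depends on the sign, the number of integer digits, or the dot position, which is exactly what keeps the branch predictor out of the picture.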
It should be possible to go without Unsafe and without performance loss, but it will require some really careful crafting to make sure all bounds checks are combined/moved to mitigate their effect. The memory API also has a few more checks than just the bounds (thread owner, scope, temporal check). I tried to use it and looked at the compiler graph. We will see if the Graal compiler can optimize those specially, or if some API additions might be necessary. The other aspect we will add to Graal for sure is auto-vectorization of the search loop. That would avoid all that SWAR/mask complexity and produce decent vector code for the search automatically. We do that for the String indexOf methods already, but not for general user code.
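The "combining/moving bounds checks" idea can be sketched even without any new compiler support: do one range comparison per chunk so the per-element checks inside the chunk become provably redundant. This is a toy illustration of the shape, not code from the PR or from Graal:

```java
public class ChunkedSum {
    // Sums bytes 8 at a time. The single `i < limit` comparison per iteration
    // proves that indices i..i+7 are all in bounds, so a compiler that combines
    // bounds checks has one check to eliminate instead of eight independent ones.
    static int sum(byte[] data) {
        int total = 0;
        int i = 0;
        int limit = data.length - 7;
        while (i < limit) {
            for (int j = 0; j < 8; j++) {
                total += data[i + j];
            }
            i += 8;
        }
        for (; i < data.length; i++) { // tail of fewer than 8 bytes
            total += data[i];
        }
        return total;
    }
}
```

The same structure is why the hot loops in these 1BRC entries advance in fixed strides: the window bound is checked once, and everything inside the window is branch- and check-free.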
that is indeed a very nice thing which C2 can achieve, sometimes, under certain conditions (which is an interesting combination of words :P ), but I still believe that having an intrinsic for a declarative
Yes, I think the best approach is to implement it as a general compiler transform and offer optional utility methods that if used are guaranteed to trigger the optimization. |
That's very interesting! Also, OS configuration might play a role, because of the patches against the Meltdown and Spectre vulnerabilities.
Hey @thomaswue, @merykitty, and @mukel! Congrats again on being in the Top 3 of the One Billion Row Challenge! To celebrate this amazing achievement, I would like to send you a 1BRC t-shirt and coffee mug. To claim your prize, fill out this form by Feb 18. After submitting the form, please provide a comment with the random value you've specified in the form, so that I know it is you who submitted it. All data entered will solely be used in relation to processing this shipment. Shipments can be sent to any country listed here. A big thank you to Decodable for sponsoring these prizes! Thanks a lot for participating in 1BRC, --Gunnar
Hey @mtopolnik, thanks for replying. I've tagged you on your own PR with the right link to the Top 20 form. Nice try to get that Top 3 t-shirt though ;)
Oops... I got notifications for a bunch of them and had all of them open in the browser. |
Thank you so much @gunnarmorling! Looking forward to proudly wearing that #1brc shirt in some of my next talks ;-). Number is 9597 9145
Hi @gunnarmorling, my random tokens: 3728 28670
Check List:
- [ ] `./mvnw verify` runs and the project builds successfully
- [ ] Tests pass (`./test.sh <username>` shows no differences between expected and actual outputs)
- [ ] Your launch script is named `calculate_average_<username>.sh` (make sure to match casing of your GH user name) and is executable
- [ ] Output matches that of `calculate_average_baseline.sh`
OK, so can't let @jerrinot have all the fun alone ;-). I adopted the approach that combines the <8 and 1-16 cases to avoid branch misprediction, which was the biggest improvement; and then also the masking instead of bit shifting, which added another few %. Added @jerrinot and @abeobk to the credits section. This should now get into the ballpark of @jerrinot's submission, if the results from my machine don't lie. Major differences from his solution are that I am using Result objects for the hash table instead of Unsafe access, and that I unroll three times. Also, it avoids the name length check by having the delimiter in the word mask.
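The "delimiter in the word mask" idea from the description can be sketched as follows. Everything after the first `;` in a word is zeroed, but the `;` itself is kept; since the delimiter stays inside the masked word, two names of different lengths can never produce equal masked words, so no separate length comparison is needed. The class name and `pack` helper are illustrative, not the PR's actual code:

```java
public class MaskedName {
    private static final long SEMI_PATTERN = 0x3B3B3B3B3B3B3B3BL; // ';' in all 8 lanes

    // High bit set in each byte lane equal to ';' (SWAR zero-byte test).
    static long matchBits(long word) {
        long x = word ^ SEMI_PATTERN;
        return (x - 0x0101010101010101L) & ~x & 0x8080808080808080L;
    }

    // Zeroes every byte after the first ';' while keeping the ';' itself.
    static long maskedWord(long word) {
        long m = matchBits(word);
        if (m == 0) return word;               // delimiter lies beyond this word
        long lowest = m & -m;                  // high bit of the first ';' lane
        return word & (lowest | (lowest - 1)); // keep bytes up to and including ';'
    }

    // Packs up to the first 8 bytes of s into a little-endian long (demo only).
    static long pack(String s) {
        byte[] b = s.getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        long word = 0;
        for (int i = Math.min(8, b.length) - 1; i >= 0; i--) {
            word = (word << 8) | (b[i] & 0xFF);
        }
        return word;
    }
}
```

With this masking, the masked word can be fed straight into the hash and into the equality comparison of the hash-table probe, which is how the separate name-length check disappears.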