-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates to CalculateAverage_spullara #56
Conversation
Instead of re-opening, could you please create a new PR, based on current
upstream/main? Thx!
… Message ID: ***@***.***>
|
I have SIMD changes in https://github.com/spullara/1brc/tree/simd but they are much slower. |
Let me know when you're ready for another run. With 21.0.1-graalce again, I suppose? |
I don't think enough has changed honestly and I'm not even sure the inlining helps that much. I'll also update to "sdk use" in my script. |
Should be about 10% faster than before. |
I've been running it with 21.0.1-graal from sdkman, added it to the script. |
Ready for testing. 1780ms on my machine :) |
Hey @spullara, could you rebase this one to latest main and then run this:
This is a simple test suite I've added to make sure implementations are compliant and there's one failure (I have clarified some value ranges in the course of the day, this may be related). And once it's good, remove the "test failure" label again. Thanks! |
Updated and passes. BTW, having test.sh delete my 1B row file that I now need to regenerate is quite the troll :) |
Hum, hum, interesting point. I'd consider it volatile data, so 🤷 . The root issue is that we should have made the data set name an argument to the programs under test. Next time. That said, something seems wrong with this PR now, way too many unrelated changes. Can you rebase/squash so that it only contains your changes over the latest main? Thx! |
Yeah, mine also takes a file as an argument so I could test locally with a smaller file when it was slower. |
Argh, again?! Seems I really need to add a test with 10K distinct keys. It's so easy to miss otherwise. Gnarf. |
You would have to construct a test with names that purposefully have the same hash code to discover these bugs with a test. You have to do it by inspection. |
Actually, I'm not sure. Seems the key is derived straight from the characters and it is used as a key in a map, not index in an array. I.e. I don't think there could be collisions? |
If they aren't comparing the actual string value of the keys you can construct keys that collide with their formula. You allow city names up to 100 bytes of UTF-8 and this compresses that to just 4 bytes. It is impossible to construct a 4 byte (int) hash that represents 100 bytes of UTF-8 uniquely. |
True. It would just overflow at some point. Gnarf, that stuff is really tough to spot, as I can't generically fabricate colliding values without knowing what's the hash function. @filiphr, I'll remove your entry from the leaderboard for now, until this issue has been fixed. @spullara, thanks for paying attention that closely! |
I was reading the discussions around the hashing today. Sorry about it, I didn't have time to adapt it today. I have an idea similar to also use |
Making another PR to push progress in the open. Inlined a couple of methods so far. Attempts at SIMD acceleration are still failing. Kind of wondering if maybe it is already doing it in the VM for Array.equals for sure.