Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to CalculateAverage_spullara #56

Merged
merged 1 commit into from
Jan 4, 2024

Conversation

spullara
Copy link
Contributor

@spullara spullara commented Jan 3, 2024

Making another PR to push progress in the open. Inlined a couple of methods so far. Attempts at SIMD acceleration are still failing. Kind of wondering if maybe it is already doing it in the VM for Array.equals for sure.

@gunnarmorling
Copy link
Owner

gunnarmorling commented Jan 3, 2024 via email

@spullara
Copy link
Contributor Author

spullara commented Jan 3, 2024

I have SIMD changes in https://github.com/spullara/1brc/tree/simd but they are much slower.

@gunnarmorling
Copy link
Owner

Let me know when you're ready for another run. With 21.0.1-graalce again, I suppose?

@spullara
Copy link
Contributor Author

spullara commented Jan 3, 2024

I don't think enough has changed honestly and I'm not even sure the inlining helps that much. I'll also update to "sdk use" in my script.

@spullara
Copy link
Contributor Author

spullara commented Jan 3, 2024

Should be about 10% faster than before.

@spullara
Copy link
Contributor Author

spullara commented Jan 3, 2024

I've been running it with 21.0.1-graal from sdkman, added it to the script.

@spullara
Copy link
Contributor Author

spullara commented Jan 3, 2024

Ready for testing. 1780ms on my machine :)

@yemreinci
Copy link

Hi @spullara, I was able to make the original version perform 25% better on my machine with a few changes. Feel free to use the ideas: #86

@gunnarmorling
Copy link
Owner

gunnarmorling commented Jan 4, 2024

Hey @spullara, could you rebase this one to latest main and then run this:

./test.sh spullara

This is a simple test suite I've added to make sure implementations are compliant and there's one failure (I have clarified some value ranges in the course of the day, this may be related). And once it's good, remove the "test failure" label again. Thanks!

@spullara
Copy link
Contributor Author

spullara commented Jan 4, 2024

Updated and passes. BTW, having test.sh delete my 1B row file that I now need to regenerate is quite the troll :)

@gunnarmorling
Copy link
Owner

BTW, having test.sh delete my 1B row file that I now need to regenerate is quite the troll :)

Hum, hum, interesting point. I'd consider it volatile data, so 🤷 . The root issue is that we should have made the data set name an argument to the programs under test. Next time.

That said, something seems wrong with this PR now, way too many unrelated changes. Can you rebase/squash so that it only contains your changes over the latest main? Thx!

@spullara
Copy link
Contributor Author

spullara commented Jan 4, 2024

Hum, hum, interesting point. I'd consider it volatile data, so 🤷 . The root issue is that we should have made the data set name an argument to the programs under test. Next time.

Yeah, mine also takes a file as an argument so I could test locally with a smaller file when it was slower.

@gunnarmorling
Copy link
Owner

Nice, looking good now. 00:12.063, super-clsoe #2 behind @filiphr with 00:12.027!

@gunnarmorling gunnarmorling merged commit 4af3253 into gunnarmorling:main Jan 4, 2024
1 check passed
@spullara
Copy link
Contributor Author

spullara commented Jan 4, 2024

Nice, looking good now. 00:12.063, super-clsoe #2 behind @filiphr with 00:12.027!

His is using the hash code of the city as a key which is against the rules? Otherwise mine would be even faster.

https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_filiphr.java#L173

@gunnarmorling
Copy link
Owner

Argh, again?! Seems I really need to add a test with 10K distinct keys. It's so easy to miss otherwise. Gnarf.

@spullara
Copy link
Contributor Author

spullara commented Jan 4, 2024

Argh, again?! Seems I really need to add a test with 10K distinct keys. It's so easy to miss otherwise. Gnarf.

You would have to construct a test with names that purposefully have the same hash code to discover these bugs with a test. You have to do it by inspection.

@gunnarmorling
Copy link
Owner

Actually, I'm not sure. Seems the key is derived straight from the characters and it is used as a key in a map, not index in an array. I.e. I don't think there could be collisions?

@spullara
Copy link
Contributor Author

spullara commented Jan 4, 2024

If they aren't comparing the actual string value of the keys you can construct keys that collide with their formula. You allow city names up to 100 bytes of UTF-8 and this compresses that to just 4 bytes. It is impossible to construct a 4 byte (int) hash that represents 100 bytes of UTF-8 uniquely.

@gunnarmorling
Copy link
Owner

True. It would just overflow at some point. Gnarf, that stuff is really tough to spot, as I can't generically fabricate colliding values without knowing what's the hash function. @filiphr, I'll remove your entry from the leaderboard for now, until this issue has been fixed. @spullara, thanks for paying attention that closely!

@spullara
Copy link
Contributor Author

spullara commented Jan 4, 2024

You should see how fast mine is if I use the same "optimization" :) The single largest chunk of time is spent comparing keys to ensure that they match.

Screenshot 2024-01-04 at 10 51 00 AM

@filiphr
Copy link
Contributor

filiphr commented Jan 4, 2024

I was reading the discussions around the hashing today. Sorry about it, I didn't have time to adapt it today.

I have an idea similar to also use Arrays.equals for this, let's see what that would lead to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants