Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fatroom's initial attempt #15

Merged
merged 6 commits into from
Jan 5, 2024

Conversation

fatroom
Copy link
Contributor

@fatroom fatroom commented Jan 2, 2024

Changes:

  • use mmap buffers
  • get some memory
  • Parallel stream
  • Long arythmetic

The name of your implementation class:
dev.morling.onebrc.CalculateAverage_fatroom

The JDK build to use (of not specified, the latest OpenJDK 21 upstream build will be used).
OpenJDK 64-Bit Server VM Temurin-21.0.1+12 (build 21.0.1+12-LTS, mixed mode)

The execution time of the program (Apple M1 Max, 32Gb):
real 0m15.303s
user 1m9.326s
sys 0m23.102s

@fatroom fatroom changed the title Initial attempt fatroom's initial attempt Jan 2, 2024
@lobaorn
Copy link

lobaorn commented Jan 2, 2024

Hey @fatroom I would offer the same suggestions that were made here: https://twitter.com/lobaorn/status/1742256906993799554

If using ZGC why not the generational one? Did you compare?

@fatroom
Copy link
Contributor Author

fatroom commented Jan 2, 2024

@lobaorn generational gives slower results on my config (approx 7 to 10 sec overhead). I haven't deeply tuned GC, just run with different GC to check if something better than default one. Actually parallel one gives a slightly better (approx 1 second faster) results, but fluctuated between runs, so picked ZGC as more stable.

@lobaorn
Copy link

lobaorn commented Jan 2, 2024

Hey @fisk if you would have the time, would you mind providing insights on how to better use the Generational variation of ZGC for this challenge? Or also providing the insight that given the way the values are being read and computed, the Generational variation would not be of use.

@gunnarmorling
Copy link
Owner

Hey @fatroom, thanks a lot for this entry. It looks like the output differs from the one of the reference implementation (some rounding issue, see the discussion at #5 for more context). Could you make sure that the output matches that one of the RI? Thanks!

@fisk
Copy link

fisk commented Jan 3, 2024

@lobaorn do you really think you can nerd snipe me like that with benchmarking games?

Having said that... I implemented some experimental leyden support for generational ZGC, with some object streaming shenanigans allowing loading of archived objects, and support for archived code from the JIT cache so we can run with compiled code immediately and dodge most of the warmup costs.

On my machine with 256 cores...

real 0m2.318s
user 1m24.229s
sys 0m2.009s

...beat that!

Oh and here are the JVM flags: -XX:+UseLargePages -XX:+UseZGC -XX:+ZGenerational -Xms8G -Xmx8G -XX:+AlwaysPreTouch -XX:ConcGCThreads=4 -XX:+UnlockDiagnosticVMOptions -XX:-ZProactive -XX:ZTenuringThreshold=1 -Xlog:gc*:file=fatroom_gc.log -XX:SharedArchiveFile=Fatroom-dynamic.jsa -XX:+ReplayTraining -XX:+ArchiveInvokeDynamic -XX:+LoadCachedCode -XX:CachedCodeFile=Fatroom-dynamic.jsa-sc

Here is my experimental Generational ZGC leyden branch: https://github.com/fisk/jdk/tree/1brc_genz_leyden

Takes about 1/3 of the time compared to normal ZGC on my machine.

@lobaorn
Copy link

lobaorn commented Jan 3, 2024

I apologize for being so shameless in my sniping, but I am glad it worked! :P

So now I will shamelessly share the link of your comment in the other PRs, so contestants can evaluate the importance of some tuning. I am not sure if anyone will be able/capable to use your experimental branch (and the Leyden flags), but at least a combination of the tuning with perhaps @shipilev build of Lilliput, who knows what difference could it make on the final implementations?

@fisk
Copy link

fisk commented Jan 3, 2024

To be clear about the GC tuning... I reduced the heap size to be rather small, mostly because on my large machine it's going so fast that the cost of materializing a larger heap starts to become noticeable. I also reduced the number of max concurrent GC threads as throwing more threads at the particular object graph created by this program doesn't seem to scale very well.

Usually, ZGC is geared more towards low latency than it is towards maximum throughput. However, as you throw more cores at the JVM, you get to a point where concurrent GC starts to win in throughput when the object graph being traced simply has a topology that doesn't allow fully utilizing all cores on the machine. That becomes a throughput bottleneck for STW tracing GCs, which doesn't really matter for a concurrently tracing GC.

@lobaorn
Copy link

lobaorn commented Jan 3, 2024

Thanks for reaching out again, and I got it from the context of your original comment, and in another PR I did that contextualization as well, it is not a one-size-fits-all approach or configuration. Rather, it was just my perception that most submissions did not do any argument's tweaking, and I was trying to thought-provoke the submitters to check if any or some different configurations could be a game changer.

So AppCDS, Heap, GC, Native... there are many flavors of optimizations that could happen after the code is finished, so it would also share some light on these new (even not new as AppCDS) or unfamiliar possibilities, since the challenge is gaining attention all over the place and in other stacks and etc.

@fatroom
Copy link
Contributor Author

fatroom commented Jan 3, 2024

@gunnarmorling thanks for bringing this to my attention. Corrected parsing, so output matches the RI now

@fatroom
Copy link
Contributor Author

fatroom commented Jan 4, 2024

@gunnarmorling I saw that on some PR you've asking to remove label once test suit pass. Seems as external contributor I don't have authority to manipulate labels here, so pinging you via comments

@gunnarmorling
Copy link
Owner

00:26.576. Thanks for participating!

@gunnarmorling
Copy link
Owner

(sorry for the noise on test failures, mistake on my end)

@gunnarmorling gunnarmorling merged commit 15cceae into gunnarmorling:main Jan 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants