memory mapped files, branchless parsing, bitwiddle magic #5
Conversation
Hey @royvanrijn, since you are already very engaged: would you like to take a look at whether Lilliput builds using Generational ZGC could yield benefits? https://twitter.com/gunnarmorling/status/1742227887745376300 If I try it myself it would probably be by the end of the week, and only if there is another feasible approach to compare with the already opened PRs. Other than that it should be JVM and GC tweaking...
@royvanrijn Nice solution, I had something similar in mind. Maybe using FFI for mmap and pthreads would improve the performance (fewer JVM byte arrays for buffering etc., but somehow bending the JNI rule). Maybe I'll find the time :)
What about using integer arithmetic instead of floats? |
Hey @royvanrijn, are you planning to do further changes to this one? If so, wanna put it into "Draft" state until it's good to go from your side?
@royvanrijn, could you rebase this one to current main and squash everything into a single commit?
Preliminarily updated the leaderboard with the result from the latest version here: 00:23.366 on the evaluation environment. You've taken over again, @royvanrijn :)
```java
* Branchless parser, goes from String to int (10x):
* "-1.2" to -12
* "40.1" to 401
```
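As a rough sketch of that 10x-integer parsing (a simplified, deliberately branchy illustration of the idea, not the PR's actual branchless code; the class and method names are mine):

```java
public class TenthsParser {
    // Sketch: parse a temperature in the fixed "-?\d{1,2}\.\d" format
    // straight into tenths of a degree, so "-1.2" becomes -12 and
    // "40.1" becomes 401. No floating point is involved at all.
    static int parseTenths(byte[] line, int from, int to) {
        int sign = 1;
        if (line[from] == '-') { // the real code removes even this branch
            sign = -1;
            from++;
        }
        int value = 0;
        for (int i = from; i < to; i++) {
            if (line[i] != '.') {
                value = value * 10 + (line[i] - '0'); // accumulate decimal digits
            }
        }
        return sign * value;
    }
}
```

The payoff is that min/max/sum accumulation then happens entirely in integer registers.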
@gunnarmorling Is this assumption acceptable? E.g. `1`, `0`, `2.00` are all valid doubles.
Also, there should be an acceptance test suite for the implementations; I am pretty sure this implementation does not produce the same output as the baseline.
I'm pretty sure it produces the exact same output; I check regularly with each change.
The input and output have one decimal place precision (as stated by the website "rounded to one fractional digit").
I'm internally storing the doubles as 10x integers because the precision is just a single digit, and I'm making sure the rounding is correct afterwards for the average.
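That bookkeeping can be shown with a minimal sketch (assumed names, not the PR's code): keep the sum in integer tenths and perform a single rounded division only when producing the one-decimal mean.

```java
public class TenthsStats {
    // Illustration of the 10x-integer bookkeeping described above
    // (class and method names are assumptions, not the PR's code).
    static double mean(long sumTenths, long count) {
        // one rounded division back to a single decimal place
        return Math.round((double) sumTenths / count) / 10.0;
    }

    public static void main(String[] args) {
        // readings -1.2, 40.1, 23.8 stored as -12, 401, 238
        System.out.println(mean(-12 + 401 + 238, 3)); // prints 20.9
    }
}
```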
The README only specifies the output format, not the input:
Lines 27 to 28 in e7e7deb
The task is to write a Java program which reads the file, calculates the min, mean, and max temperature value per weather station, and emits the results on stdout like this (i.e. sorted alphabetically by station name, and the result values per station in the format `<min>/<mean>/<max>`, rounded to one fractional digit):
so it's worth clarifying. There are many rounding modes, and this is not specified either; e.g. Go's https://pkg.go.dev/math#Round is not the same as Java's.
Also, the reference implementation does ad-hoc rounding, `Math.round(value * 10.0) / 10.0`, but the actual output depends on the string concatenation, which performs another rounding; see https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Double.html#toString(double)
Sorry for not being explicit enough here. Can the behavior of the reference implementation be described using any of the existing values of RoundingMode?
I think the exact mode does not really matter; what matters is that the baseline is correct.
I propose changing the baseline to use BigDecimal for results accumulation, with scale 1 (instead of round(x*10)/10) and the HALF_UP rounding mode (the most common at school) at the final step:
```java
BigDecimal value = new BigDecimal("12.34");
BigDecimal rounded = value.setScale(1, RoundingMode.HALF_UP);
System.out.println("==" + rounded.toString() + "=="); // prints ==12.3==
```
But that's the thing: I don't think we can change the behavior of the reference implementation at this point, as it would render existing submissions invalid if they implement a different behavior. So I'd rather make the behavior of the RI explicit, even if it's not the most natural one (agreed that HALF_UP behavior would have been better).
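To see why the two behaviors can disagree (a small self-contained illustration, not code from the repository): Java's `Math.round` rounds ties toward positive infinity, while HALF_UP rounds them away from zero, which differs for negative midpoints.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class RoundingDiff {
    public static void main(String[] args) {
        double value = -0.05; // a midpoint, e.g. the average of -0.1 and 0.0
        // reference-implementation-style rounding: Math.round == floor(x + 0.5)
        double ri = Math.round(value * 10.0) / 10.0;
        // proposed HALF_UP rounding at scale 1
        BigDecimal halfUp = new BigDecimal("-0.05").setScale(1, RoundingMode.HALF_UP);
        System.out.println(ri);     // prints 0.0  (Math.round(-0.5) == 0)
        System.out.println(halfUp); // prints -0.1
    }
}
```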
I don't think it's possible to fix the RI, because it uses double division and rounds twice.
Since there are no acceptance tests, I bet a lot of implementations (those that do not parse and calculate values the same way) will not match the RI anyway.
I think the RI should favor correctness over performance; then it can be used to build an acceptance test suite.
Ok, I've logged #49 for getting this one sorted out separately and get this PR merged. Let's continue the rounding topic over there. Thx!
```java
    return toAdd.measurement;
}

private static int hashCode(byte[] a, int length) {
```
The hash code here has a data dependency: you either manually unroll this or just relax the hash code by using a var handle and `getLong`, amortizing the data dependency in batches and handling only the last 7 (or fewer) bytes separately, using the array.
In this way most of the computation would likely resolve in far fewer loop iterations too, similar to https://github.com/apache/activemq-artemis/blob/25fc0342275b29cd73123523a46e6e94582597cd/artemis-commons/src/main/java/org/apache/activemq/artemis/utils/ByteUtil.java#L299
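A sketch of that batching idea (assumed class and method names, not the PR's code): read the name eight bytes at a time through a byte-array view VarHandle and fold each long into the hash, so the multiply-add dependency chain advances once per word rather than once per byte; the trailing bytes go through the plain loop. Note this is a *relaxed* hash, not bit-identical to the original byte-wise one.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class BatchedHash {
    private static final VarHandle LONGS =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    static int hashCode(byte[] a, int length) {
        int h = 1;
        int i = 0;
        // consume 8 bytes per iteration; plain get() tolerates misaligned offsets
        for (; i + 8 <= length; i += 8) {
            long word = (long) LONGS.get(a, i);
            h = 31 * h + (int) (word ^ (word >>> 32)); // fold the long into the hash
        }
        // handle the trailing < 8 bytes one by one
        for (; i < length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }
}
```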
@royvanrijn, so what should we do with this one, and all the pending discussions? Wanna submit it as is and keep honing in follow-up PRs? I think it would be nice to be able to update the leaderboard with the current status (fastest right now is @spullara). For that, could you rebase it to resolve the merge conflict?
Added SWAR (SIMD Within A Register) code to increase ByteBuffer processing throughput. Delayed the creation of the String by comparing hashes, segmenting like spullara, and improved EOL finding.
Squashing for merge.
Also fixing millisecond separator Co-authored-by: Gunnar Morling <[email protected]>
Shamelessly sharing this idea for JVM/GC tuning in another PR/discussion? #15 (comment)
@gunnarmorling Git is failing me so hard haha, such a mess; let's get this merged and start working on a v2.
@royvanrijn, could you add a line like this to your launch script for making sure the evaluation is done with GraalVM? I'll squash everything then and evaluate. Thx!
@royvanrijn dang, so I've done what I shouldn't have done and merged it before running. But it seems to take much longer / hang now actually. Any idea what's wrong?
```java
long mask = match - 0x0101010101010101L;
mask &= ~match;
mask &= 0x8080808080808080L;
return Long.numberOfTrailingZeros(mask) >>> 3;
```
Shouldn't it be the number of leading ones here?
@royvanrijn @gunnarmorling
Weird, I thought that explicitly setting LE on line 105 would make the compatibility issues disappear, so running it on my machine would automatically mean it works on the target, although perhaps with a performance hit.
afk atm, I'll check tomorrow; if somebody wants to fix it and tell me, be my guest 😂
I am not sure actually; for these things I need old-school paper and a pencil :)
I have some local classes that test; the problem is that I believe it works on my machine, just not on the target machine. I'll check soon.
Yeah, annoying, it runs fine locally (the code that was pushed), sigh. Kind of debugging in the dark haha... a challenge!
Thinking about it twice, I'm wrong at #5.
Let's say we have a `byte[] data = { 0x01, 0x03 }` and assume a short-based version of SWAR.
Reading the content of `data` as a little-endian short gives `0x0301`, which has the less significant part at the lower address. Hence the SWAR result obtained (I'm using the Netty algorithm, but here it should be the same) will be `0x8000`, and in order to find the `0x03` we have to use the trailing zeros (here 8 + 7 = 15 -> 15/8 = 1).
Which means it is fine as it is!
@gunnarmorling added a comment to help
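For reference, the full find-a-byte-in-a-little-endian-long trick can be exercised standalone. This is a hedged sketch (the class, method name, and the broadcast step producing `match` are my additions; the four mask lines are the ones quoted above):

```java
public class Swar {
    // Return the index (0..7) of the first occurrence of `needle` within
    // the little-endian long `word`, or 8 if it does not occur.
    static int firstIndexOf(long word, byte needle) {
        long pattern = (needle & 0xFFL) * 0x0101010101010101L; // broadcast byte
        long match = word ^ pattern;            // matching bytes become 0x00
        long mask = match - 0x0101010101010101L;
        mask &= ~match;
        mask &= 0x8080808080808080L;            // high bit set for each zero byte
        return Long.numberOfTrailingZeros(mask) >>> 3;
    }
}
```

With little-endian reads, byte i of the line sits in bits 8i..8i+7 of the long, which is why *trailing* zeros (not leading ones) recover the index.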
```java
// System.out.println("Took: " + (System.currentTimeMillis() - before));

// Simple is faster for '\n' (just three options)
int endPointer;
if (bb.get(separatorPointer + 4) == '\n') {
```
For my input I get IOOBE here:
Exception in thread "main" java.lang.IndexOutOfBoundsException
at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:542)
at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:567)
at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:670)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:927)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
at dev.morling.onebrc.CalculateAverage_royvanrijn.run(CalculateAverage_royvanrijn.java:144)
at dev.morling.onebrc.CalculateAverage_royvanrijn.main(CalculateAverage_royvanrijn.java:92)
Caused by: java.lang.IndexOutOfBoundsException
at java.base/java.nio.Buffer$1.apply(Buffer.java:757)
at java.base/java.nio.Buffer$1.apply(Buffer.java:754)
at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
at java.base/java.nio.Buffer.checkIndex(Buffer.java:768)
at java.base/java.nio.DirectByteBuffer.get(DirectByteBuffer.java:358)
at dev.morling.onebrc.CalculateAverage_royvanrijn.lambda$run$0(CalculateAverage_royvanrijn.java:118)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:960)
at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:934)
at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
Let me know if you need the file.
Ah yes please, perhaps there is some other bug that's platform dependent; do share.
Can you specify what platform you're running it on, and could you please also check this (improved) version:
https://github.com/royvanrijn/1brc/blob/8db31e6a36fbc305765a2393efb06ba6bff23f42/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java
Running on Windows. Will check the new version later (it is quite late here now, and starting from tomorrow I won't have access to a PC for the weekend).
Let me know how to send you the data file: I started bzipping it and it takes forever, but even part way through, the archive is 2 GB (you can mail me upload coordinates at dimitar.dimitrov at gmail dot com).
If possible, can you narrow it down? Perhaps run a very small test? Do they all crash, just this one?
@ddimtirov for future reference, the default compression level on zstd will be a lot faster and offer reasonable compression:
On a MacBook Pro 2020 - 2 GHz Quad-Core Intel Core i5:
```
# time zstd -z measurements.txt
measurements.txt     : 28.24%   ( 12.8 GiB =>  3.63 GiB, measurements.txt.zst)
zstd -z measurements.txt  92.10s user 8.05s system 107% cpu 1:33.54 total
```
Added two test cases from the discussion gunnarmorling#5 (comment). Not all implementations match the baseline, nor do they match the precise value of `33.6+31.7+21.9+14.6=25.5`:

```
$ ./test_all.sh src/test/resources/samples/measurements-rounding-baseline.txt 2>/dev/null | tee /tmp/rounding-baseline.log
FAIL armandino
FAIL artsiomkorzun
PASS baseline
PASS bjhara
PASS criccomini
FAIL ddimtirov
FAIL ebarlas
FAIL filiphr
FAIL itaske
PASS jgrateron
FAIL khmarbaise
FAIL kuduwa-keshavram
PASS lawrey
PASS moysesb
FAIL nstng
PASS padreati
FAIL palmr
PASS richardstartin
FAIL royvanrijn
PASS seijikun
PASS spullara
PASS truelive

$ ./test_all.sh src/test/resources/samples/measurements-rounding-precise.txt 2>/dev/null | tee /tmp/rounding-precise.log
PASS armandino
PASS artsiomkorzun
FAIL baseline
FAIL bjhara
FAIL criccomini
FAIL ddimtirov
PASS ebarlas
PASS filiphr
FAIL itaske
FAIL jgrateron
PASS khmarbaise
FAIL kuduwa-keshavram
FAIL lawrey
FAIL moysesb
PASS nstng
FAIL padreati
PASS palmr
FAIL richardstartin
PASS royvanrijn
FAIL seijikun
FAIL spullara
FAIL truelive

$ diff -y /tmp/rounding-baseline.log /tmp/rounding-precise.log
FAIL armandino          | PASS armandino
FAIL artsiomkorzun      | PASS artsiomkorzun
PASS baseline           | FAIL baseline
PASS bjhara             | FAIL bjhara
PASS criccomini         | FAIL criccomini
FAIL ddimtirov            FAIL ddimtirov
FAIL ebarlas            | PASS ebarlas
FAIL filiphr            | PASS filiphr
FAIL itaske               FAIL itaske
PASS jgrateron          | FAIL jgrateron
FAIL khmarbaise         | PASS khmarbaise
FAIL kuduwa-keshavram     FAIL kuduwa-keshavram
PASS lawrey             | FAIL lawrey
PASS moysesb            | FAIL moysesb
FAIL nstng              | PASS nstng
PASS padreati           | FAIL padreati
FAIL palmr              | PASS palmr
PASS richardstartin     | FAIL richardstartin
FAIL royvanrijn         | PASS royvanrijn
PASS seijikun           | FAIL seijikun
PASS spullara           | FAIL spullara
PASS truelive           | FAIL truelive
```

It's also interesting that e.g. `itaske` produces different results between runs and thus may pass or fail sporadically.

For gunnarmorling#49
…board in local testing using evaluate2.sh] (#209) * Linear probe for city indexing. Beats current leader spullara 2.2 vs 3.8 elapsed time. * Straightforward impl using bytebuffers. Turns out memorysegments were slower than used mappedbytebuffers. * A initial submit-worthy entry Comparison to select entries (averaged over 3 runs) * spullara 1.66s [5th on leaderboard currently] * vemana (this submission) 1.65s * artsiomkorzun 1.64s [4th on leaderboard currently] Tests: PASS Impl Class: dev.morling.onebrc.CalculateAverage_vemana Machine specs * 16 core Ryzen 7950X * 128GB RAM Description * Decompose the full file into Shards of memory mapped files and process each independently, outputting a TreeMap: City -> Statistics * Compose the final answer by merging the individual TreeMap outputs * Select 1 Thread per available processor as reported by the JVM * Size to fit all datastructure in 0.5x L3 cache (4MB/core on the evaluation machines) * Use linear probing hash table, with identity of city name = byte[] and hash code computed inline * Avoid all allocation in the hot path and instead use method parameters. So, instead of passing a single Object param called Point(x, y, z), pass 3 parameters for each of its components. It is ugly, but this challenge is so far from Java's idioms anyway * G1GC seems to want to interfere; use ParallelGC instead (just a quick and dirty hack) Things tried that did not work * MemorySegments are actually slower than MappedByteBuffers * Trying to inline everything: not needed; the JIT compiler is pretty good * Playing with JIT compiler flags didn't yield clear wins. In particular, was surprised that using a max level of 3 and reducing compilation threshold did nothing.. when the jit logs print that none of the methods reach level 4 and stay there for long * Hand-coded implementation of Array.equals(..) using readLong(..) 
& bitmask_based_on_length from a bytebuffer instead of byte by byte * Further tuning to compile loop methods: timings are now consistenctly ahead of artsiomkorzun in 4th place. There are methods on the data path that were being interpreted for far too long. For example, the method that takes a byte range and simply calls one method per line was taking a disproportionate amount of time. Using `-XX:+AlwaysCompileLoopMethods` option improved completion time by 4%. ============= vemana =============== [20:55:22] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh vemana; done; Using java version 21.0.1-graal in this shell. real 0m1.581s user 0m34.166s sys 0m1.435s Using java version 21.0.1-graal in this shell. real 0m1.593s user 0m34.629s sys 0m1.470s Using java version 21.0.1-graal in this shell. real 0m1.632s user 0m35.893s sys 0m1.340s Using java version 21.0.1-graal in this shell. real 0m1.596s user 0m33.074s sys 0m1.386s Using java version 21.0.1-graal in this shell. real 0m1.611s user 0m35.516s sys 0m1.438s ============= artsiomkorzun =============== [20:56:12] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh artsiomkorzun; done; Using java version 21.0.1-graal in this shell. real 0m1.669s user 0m38.043s sys 0m1.287s Using java version 21.0.1-graal in this shell. real 0m1.679s user 0m37.840s sys 0m1.400s Using java version 21.0.1-graal in this shell. real 0m1.657s user 0m37.607s sys 0m1.298s Using java version 21.0.1-graal in this shell. real 0m1.643s user 0m36.852s sys 0m1.392s Using java version 21.0.1-graal in this shell. real 0m1.644s user 0m36.951s sys 0m1.279s ============= spullara =============== [20:57:55] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh spullara; done; Using java version 21.0.1-graal in this shell. real 0m1.676s user 0m37.404s sys 0m1.386s Using java version 21.0.1-graal in this shell. real 0m1.652s user 0m36.509s sys 0m1.486s Using java version 21.0.1-graal in this shell. 
real 0m1.665s user 0m36.451s sys 0m1.506s Using java version 21.0.1-graal in this shell. real 0m1.671s user 0m36.917s sys 0m1.371s Using java version 21.0.1-graal in this shell. real 0m1.634s user 0m35.624s sys 0m1.573s ========================== Running Tests ====================== [21:17:57] [lsv@vemana]$ ./runTests.sh vemana Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10000-unique-keys.txt Using java version 21.0.1-graal in this shell. real 0m0.150s user 0m1.035s sys 0m0.117s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10.txt Using java version 21.0.1-graal in this shell. real 0m0.114s user 0m0.789s sys 0m0.116s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-1.txt Using java version 21.0.1-graal in this shell. real 0m0.115s user 0m0.948s sys 0m0.075s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-20.txt Using java version 21.0.1-graal in this shell. real 0m0.113s user 0m0.926s sys 0m0.066s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-2.txt Using java version 21.0.1-graal in this shell. real 0m0.110s user 0m0.734s sys 0m0.078s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-3.txt Using java version 21.0.1-graal in this shell. real 0m0.114s user 0m0.870s sys 0m0.095s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-boundaries.txt Using java version 21.0.1-graal in this shell. real 0m0.113s user 0m0.843s sys 0m0.084s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-complex-utf8.txt Using java version 21.0.1-graal in this shell. real 0m0.121s user 0m0.852s sys 0m0.171s * Improve by a few % more; now, convincingly faster than 6th place submission. So far, only algorithms and tuning; no bitwise tricks yet. 
Improve chunking implementation to avoid allocation and allow finegrained chunking for the last X% of work. Work now proceeds in two stages: big chunk stage and small chunk stage. This is to avoid straggler threads holding up result merging. Tests pass [07:14:49] [lsv@vemana]$ ./test.sh vemana Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10000-unique-keys.txt Using java version 21.0.1-graal in this shell. real 0m0.152s user 0m0.973s sys 0m0.107s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10.txt Using java version 21.0.1-graal in this shell. real 0m0.113s user 0m0.840s sys 0m0.060s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-1.txt Using java version 21.0.1-graal in this shell. real 0m0.107s user 0m0.681s sys 0m0.085s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-20.txt Using java version 21.0.1-graal in this shell. real 0m0.105s user 0m0.894s sys 0m0.068s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-2.txt Using java version 21.0.1-graal in this shell. real 0m0.099s user 0m0.895s sys 0m0.068s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-3.txt Using java version 21.0.1-graal in this shell. real 0m0.098s user 0m0.813s sys 0m0.050s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-boundaries.txt Using java version 21.0.1-graal in this shell. real 0m0.095s user 0m0.777s sys 0m0.087s Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-complex-utf8.txt Using java version 21.0.1-graal in this shell. real 0m0.112s user 0m0.904s sys 0m0.069s * Merge results from finished threads instead of waiting for all threads to finish. Not a huge difference overall but no reason to wait. Also experiment with a few other compiler flags and attempt to use jitwatch to understand what the jit is doing. 
* Move to the prepare_*.sh format and run evaluate2.sh locally. Shows 7th place on the leaderboard:

| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|----------------|-----|-----------|-------|
| 1 | 00:01.588 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java) | 21.0.1-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary |
| 2 | 00:01.866 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykitty.java) | 21.0.1-open | [Quan Anh Mai](https://github.com/merykitty) | |
| 3 | 00:01.904 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java) | 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | |
|   | 00:02.398 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java) | 21.0.1-graal | [Elliot Barlas](https://github.com/ebarlas) | |
|   | 00:02.724 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_obourgain.java) | 21.0.1-open | [Olivier Bourgain](https://github.com/obourgain) | |
|   | 00:02.771 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_algirdasrascius.java) | 21.0.1-open | [Algirdas Raščius](https://github.com/algirdasrascius) | |
|   | 00:02.842 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java) | 21.0.1-graal | [Vemana](https://github.com/vemana) | |
|   | 00:02.902 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_spullara.java) | 21.0.1-graal | [Sam Pullara](https://github.com/spullara) | |
|   | 00:02.906 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_artsiomkorzun.java) | 21.0.1-graal | [artsiomkorzun](https://github.com/artsiomkorzun) | |
|   | 00:02.970 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_isolgpus.java) | 21.0.1-open | [Jamie Stansfield](https://github.com/isolgpus) | |

* Tune the chunk size to get another 2% improvement for the 8 processors used by the evaluation script.
* Read an int at a time for city-name length detection; speeds things up by 2% in local testing.
* Improve reading the temperature by exiting the loop sooner; no major tricks (like reading an int) yet, but good for 5th place on the leaderboard in local testing. This small change caused a surprising performance gain of about 4%. I didn't expect such a big change, but perhaps, combined with the earlier change to read the city name an int at a time, temperature reading now dominates that part of the time. Also, the quicker exit (as soon as you see '.' instead of reading until '\n') may mean you simply skip reading the '\n' on each line; since lines average around 15 characters, avoiding that read could be a meaningful saving. Or maybe the JIT found a clever optimization for reading the temperature. Or maybe it is simply that the number of multiplications is now down to 2 from 3.
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|----------------|-----|-----------|-------|
| 1 | 00:01.531 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java) | 21.0.1-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary |
| 2 | 00:01.794 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java) | 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | |
| 3 | 00:01.956 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykitty.java) | 21.0.1-open | [Quan Anh Mai](https://github.com/merykitty) | |
|   | 00:02.346 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java) | 21.0.1-graal | [Elliot Barlas](https://github.com/ebarlas) | |
|   | 00:02.673 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java) | 21.0.1-graal | [Subrahmanyam](https://github.com/vemana) | |
|   | 00:02.689 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_obourgain.java) | 21.0.1-open | [Olivier Bourgain](https://github.com/obourgain) | |
|   | 00:02.785 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_algirdasrascius.java) | 21.0.1-open | [Algirdas Raščius](https://github.com/algirdasrascius) | |
|   | 00:02.926 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_isolgpus.java) | 21.0.1-open | [Jamie Stansfield](https://github.com/isolgpus) | |
|   | 00:02.928 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_artsiomkorzun.java) | 21.0.1-graal | [Artsiom Korzun](https://github.com/artsiomkorzun) | |
|   | 00:02.932 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_spullara.java) | 21.0.1-graal | [Sam Pullara](https://github.com/spullara) | |

* Reduce one multiplication when the temperature is positive.
* Linear probing for city indexing. Beats the current leader spullara: 2.2s vs 3.8s elapsed time.
* Straightforward implementation using ByteBuffers. It turns out MemorySegments were slower than MappedByteBuffers.
* An initial submit-worthy entry.

Comparison to select entries (averaged over 3 runs):

* spullara 1.66s [5th on leaderboard currently]
* vemana (this submission) 1.65s
* artsiomkorzun 1.64s [4th on leaderboard currently]

Tests: PASS
Impl class: dev.morling.onebrc.CalculateAverage_vemana

Machine specs:

* 16-core Ryzen 7950X
* 128 GB RAM

Description:

* Decompose the full file into shards of memory-mapped files and process each independently, outputting a TreeMap: City -> Statistics.
* Compose the final answer by merging the individual TreeMap outputs.
* Use one thread per available processor as reported by the JVM.
* Size all data structures to fit in 0.5x the L3 cache (4 MB/core on the evaluation machine).
* Use a linear-probing hash table, where the identity of a city name is its byte[] and the hash code is computed inline.
* Avoid all allocation in the hot path and use method parameters instead. So, instead of passing a single Object param Point(x, y, z), pass 3 parameters, one per component. It is ugly, but this challenge is far from Java's idioms anyway.
* G1GC seems to want to interfere; use ParallelGC instead (just a quick and dirty hack).

Things tried that did not work:

* MemorySegments are actually slower than MappedByteBuffers.
* Trying to inline everything: not needed; the JIT compiler is pretty good.
* Playing with JIT compiler flags didn't yield clear wins. In particular, I was surprised that using a max tier of 3 and reducing the compilation threshold did nothing, even when the JIT logs print that none of the methods reach level 4 and stay there for long.
* A hand-coded implementation of Arrays.equals(..) using readLong(..) & bitmask_based_on_length from a ByteBuffer instead of comparing byte by byte.

* Further tuning to compile loop methods: timings are now consistently ahead of artsiomkorzun in 4th place. There were methods on the data path being interpreted for far too long. For example, the method that takes a byte range and simply calls one method per line was taking a disproportionate amount of time. Using the `-XX:+AlwaysCompileLoopMethods` option improved completion time by 4%.

```
============= vemana ===============
[20:55:22] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh vemana; done;
Using java version 21.0.1-graal in this shell.
real 0m1.581s  user 0m34.166s  sys 0m1.435s
Using java version 21.0.1-graal in this shell.
real 0m1.593s  user 0m34.629s  sys 0m1.470s
Using java version 21.0.1-graal in this shell.
real 0m1.632s  user 0m35.893s  sys 0m1.340s
Using java version 21.0.1-graal in this shell.
real 0m1.596s  user 0m33.074s  sys 0m1.386s
Using java version 21.0.1-graal in this shell.
real 0m1.611s  user 0m35.516s  sys 0m1.438s
============= artsiomkorzun ===============
[20:56:12] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh artsiomkorzun; done;
Using java version 21.0.1-graal in this shell.
real 0m1.669s  user 0m38.043s  sys 0m1.287s
Using java version 21.0.1-graal in this shell.
real 0m1.679s  user 0m37.840s  sys 0m1.400s
Using java version 21.0.1-graal in this shell.
real 0m1.657s  user 0m37.607s  sys 0m1.298s
Using java version 21.0.1-graal in this shell.
real 0m1.643s  user 0m36.852s  sys 0m1.392s
Using java version 21.0.1-graal in this shell.
real 0m1.644s  user 0m36.951s  sys 0m1.279s
============= spullara ===============
[20:57:55] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh spullara; done;
Using java version 21.0.1-graal in this shell.
real 0m1.676s  user 0m37.404s  sys 0m1.386s
Using java version 21.0.1-graal in this shell.
real 0m1.652s  user 0m36.509s  sys 0m1.486s
Using java version 21.0.1-graal in this shell.
real 0m1.665s  user 0m36.451s  sys 0m1.506s
Using java version 21.0.1-graal in this shell.
real 0m1.671s  user 0m36.917s  sys 0m1.371s
Using java version 21.0.1-graal in this shell.
real 0m1.634s  user 0m35.624s  sys 0m1.573s
========================== Running Tests ======================
[21:17:57] [lsv@vemana]$ ./runTests.sh vemana
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10000-unique-keys.txt
Using java version 21.0.1-graal in this shell.
real 0m0.150s  user 0m1.035s  sys 0m0.117s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-10.txt
Using java version 21.0.1-graal in this shell.
real 0m0.114s  user 0m0.789s  sys 0m0.116s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-1.txt
Using java version 21.0.1-graal in this shell.
real 0m0.115s  user 0m0.948s  sys 0m0.075s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-20.txt
Using java version 21.0.1-graal in this shell.
real 0m0.113s  user 0m0.926s  sys 0m0.066s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-2.txt
Using java version 21.0.1-graal in this shell.
real 0m0.110s  user 0m0.734s  sys 0m0.078s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-3.txt
Using java version 21.0.1-graal in this shell.
real 0m0.114s  user 0m0.870s  sys 0m0.095s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-boundaries.txt
Using java version 21.0.1-graal in this shell.
real 0m0.113s  user 0m0.843s  sys 0m0.084s
Validating calculate_average_vemana.sh -- src/test/resources/samples/measurements-complex-utf8.txt
Using java version 21.0.1-graal in this shell.
real 0m0.121s  user 0m0.852s  sys 0m0.171s
```

* Improve by a few % more; now convincingly faster than the 6th-place submission. So far, only algorithms and tuning; no bitwise tricks yet.
* Added some documentation on the approach.

---------

Co-authored-by: vemana <[email protected]>
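The two-stage chunking described above comes down to cutting the input into ranges that end on newline boundaries, so each thread parses only complete lines. A minimal sketch of that boundary alignment (names are mine; a byte[] stands in for the memory-mapped region):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a buffer into roughly equal chunks, extending each chunk
// forward so it ends just past a '\n'. Real entries do this on a mapped
// file region; a byte[] keeps the idea self-contained.
public class ChunkerSketch {

    // Returns [start, end) offsets; every end sits one past a newline
    // (or at the buffer end), so no line is split across chunks.
    public static List<long[]> chunks(byte[] data, int chunkCount) {
        List<long[]> result = new ArrayList<>();
        long approx = Math.max(1, data.length / chunkCount);
        long start = 0;
        while (start < data.length) {
            long end = Math.min(start + approx, data.length);
            while (end < data.length && data[(int) (end - 1)] != '\n') {
                end++; // extend to the next line boundary
            }
            result.add(new long[]{ start, end });
            start = end;
        }
        return result;
    }
}
```

The same routine serves both stages: big chunks first, then a second pass with a smaller `approx` over the remaining tail.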
* Latest snapshot (#1): preparing initial version
* Improved performance to 20 seconds (-9 seconds from the previous version) (#2)
* Improved performance to 14 seconds (-6 seconds) (#3)
* Sync branches (#4):
  * initial commit
  * some refactoring of methods
  * some fixes for partitioning
  * fixed hacky getcode for utf8 bytes
  * simplified getcode for partitioning
  * temp solution with syncing
  * new stream processing
  * some improvements; cleaned stuff
  * run configuration
  * round buffer for the stream to pages
  * not using compute since it's slower than straightforward get/put; using own byte-array equals
  * using parallel GC
  * avoid copying bytes when creating a station object
  * formatting
* Copy fewer arrays. Improved performance to 12.7 seconds (-2 seconds) (#5):
  * some tuning to increase performance
  * avoid copying data; fast hashCode with slightly more collisions
* Cleanup (#6): tidy up
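Several of the commits above concern aggregating per-partition results and merging them into one sorted map for output. That aggregate-and-merge shape is roughly (class and field names are illustrative, not this repo's actual code):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: each thread builds its own map of per-station statistics;
// finished threads merge into a shared TreeMap so output is sorted.
public class MergeSketch {

    public static final class Stats {
        public double min = Double.POSITIVE_INFINITY;
        public double max = Double.NEGATIVE_INFINITY;
        public double sum = 0;
        public long count = 0;

        public void add(double v) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
            count++;
        }

        public Stats merge(Stats other) {
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
            sum += other.sum;
            count += other.count;
            return this;
        }

        public double mean() { return sum / count; }
    }

    // Merge one thread's partial results into the global sorted map.
    public static void mergeInto(Map<String, Stats> global, Map<String, Stats> partial) {
        partial.forEach((city, stats) -> global.merge(city, stats, Stats::merge));
    }
}
```

`Map.merge` inserts the partial `Stats` when the city is new and calls `Stats::merge` on collision, which is why merging finished threads early (as one commit above does) needs no extra bookkeeping.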
As the number conversion for the original problem was somewhat weird + arguably somewhat Java-specific (see discussions at gunnarmorling/1brc#5), I've decided to move away from using that and use the sample results + run checks against my own base implementation instead.
I've run it using:
sdk use java 21.0.1-graal
Latest graalvm seems to give the best performance.
I've added AppCDS, very simple: dump the class archive on the first run, use it on the following runs.
I've also implemented memory mapped files, based on spullara's/bjhara's submissions; those are awesome and gave me even more ideas for improvement.
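In the JDK, memory mapping goes through `FileChannel.map`. A minimal self-contained sketch of the starting point the mmap-based entries build on (it writes its own temp file so it can run anywhere; the mmap entries map large per-thread chunks instead of whole small files):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: map a file read-only and scan the mapped region byte by byte.
public class MmapSketch {

    // Count lines by scanning the mapping for '\n'.
    public static long countLines(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long lines = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') {
                    lines++;
                }
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-contained demo: write a small measurements file, map it, count lines.
    public static long demoCount() {
        try {
            Path tmp = Files.createTempFile("measurements", ".txt");
            Files.writeString(tmp, "Hamburg;12.0\nBulawayo;8.9\nPalembang;38.8\n");
            long lines = countLines(tmp);
            Files.delete(tmp);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that a single `MappedByteBuffer` is limited to 2 GB, which is one reason entries split the input into multiple mapped chunks.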
Most improvements come from optimizing the inner-most loop (of course) by writing my own branchless parser for the numbers, which probably looks like magic.
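The branchless parsing idea is a SWAR trick: load the bytes of the temperature little-endian into a long, then derive the sign, the decimal-point position, and the digits with bit operations instead of per-character branches. A sketch of the widely shared variant of this trick (helper names are mine; shown as an illustration of the idea, not necessarily this PR's exact code):

```java
// Sketch of branchless temperature parsing. Input looks like "12.3\n" or
// "-12.3\n"; the bytes are packed little-endian into a long. Bit 4 of an
// ASCII byte distinguishes digits (set) from '-' and '.' (clear).
public class BranchlessParseSketch {

    // Pack the first (up to) 8 bytes of s into a little-endian long.
    static long wordOf(String s) {
        byte[] b = s.getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        long w = 0;
        for (int i = 0; i < Math.min(8, b.length); i++) {
            w |= (b[i] & 0xFFL) << (8 * i);
        }
        return w;
    }

    // Returns the temperature in tenths of a degree, e.g. "-12.3\n" -> -123.
    static int parse(long word) {
        // Probe bytes 1..3 for a clear bit 4 to locate the '.'.
        int dotPos = Long.numberOfTrailingZeros(~word & 0x10101000L);
        long signed = ((~word) << 59) >> 63;     // all ones iff leading '-'
        long removeSignMask = ~(signed & 0xFF);  // drop the '-' byte if present
        // Align digits into fixed byte lanes and keep only the low nibbles.
        long digits = ((word & removeSignMask) << (28 - dotPos)) & 0x0F000F0F00L;
        // One multiply gathers tens*100 + units*10 + tenths into bits 32+.
        long abs = ((digits * 0x640a0001L) >>> 32) & 0x3FF;
        return (int) ((abs ^ signed) - signed);  // branchless conditional negate
    }
}
```

There is not a single data-dependent branch in `parse`, which is the point: the CPU never mispredicts on the varying formats (`X.X`, `XX.X`, `-X.X`, `-XX.X`).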
Finally I've added a custom HashTable implementation, just for this specific usecase.... speeeed.
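A custom hash table for this use case typically keys directly on the raw name bytes, with the hash computed inline and collisions resolved by linear probing. Along these lines (a simplified, counts-only sketch, not the actual submission's class):

```java
import java.util.Arrays;

// Sketch: open-addressing hash table keyed on raw name bytes.
// A power-of-two capacity lets 'hash & MASK' replace a modulo.
public class NameTableSketch {
    private static final int CAPACITY = 1 << 14; // must exceed station count
    private static final int MASK = CAPACITY - 1;

    private final byte[][] keys = new byte[CAPACITY][];
    private final int[] counts = new int[CAPACITY];

    // Hash over the name bytes; real entries fold this into the parse loop.
    private static int hash(byte[] name, int len) {
        int h = 1;
        for (int i = 0; i < len; i++) {
            h = 31 * h + name[i];
        }
        return h;
    }

    // Find (or create) the slot for this name and bump its count.
    public void increment(byte[] name, int len) {
        int slot = hash(name, len) & MASK;
        while (keys[slot] != null
                && !(keys[slot].length == len
                        && Arrays.equals(keys[slot], 0, len, name, 0, len))) {
            slot = (slot + 1) & MASK; // linear probe on collision
        }
        if (keys[slot] == null) {
            keys[slot] = Arrays.copyOf(name, len); // copy only on first sight
        }
        counts[slot]++;
    }

    public int countOf(byte[] name) {
        int slot = hash(name, name.length) & MASK;
        while (keys[slot] != null && !Arrays.equals(keys[slot], name)) {
            slot = (slot + 1) & MASK;
        }
        return keys[slot] == null ? 0 : counts[slot];
    }
}
```

Linear probing keeps lookups on a single cache line for most probes, which is what makes this faster here than a general-purpose `HashMap` with boxed keys.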
On the MacBook Pro M2 (32GB) it runs in:
real 0m2.873s