memory mapped files, branchless parsing, bitwiddle magic #5

Merged
merged 10 commits into from
Jan 3, 2024

royvanrijn
Contributor

@royvanrijn royvanrijn commented Jan 1, 2024

I ran it using:
sdk use java 21.0.1-graal

Latest graalvm seems to give the best performance.

I've added AppCDS in a very simple way: the archive is dumped on the first run and reused on the following runs.

I've also implemented memory mapped files, based on spullara/bjhara's submission, that's awesome and gave even more improvement ideas.
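
A minimal sketch of the memory-mapping idea, not the code from this PR: the file is cut into roughly equal byte ranges, and each cut is moved forward to the next line start so that no record is split between workers. The class name `MappedSegments`, the `Segment` record, and the 128-byte scan window are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

class MappedSegments {
    /** Half-open byte range [start, end) that begins and ends on a line boundary. */
    record Segment(long start, long end) {}

    private static final int WINDOW = 128; // hypothetical scan window size

    static List<Segment> split(Path file, int parts) throws IOException {
        List<Segment> segments = new ArrayList<>();
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            long start = 0;
            for (int i = 1; i <= parts && start < size; i++) {
                long cut = Math.max(start, size * i / parts);
                long end = (i == parts) ? size : nextLineStart(channel, cut, size);
                segments.add(new Segment(start, end));
                start = end;
            }
        }
        return segments;
    }

    /** Returns the offset just past the next '\n' at or after pos, or size if there is none. */
    static long nextLineStart(FileChannel channel, long pos, long size) throws IOException {
        while (pos < size) {
            long len = Math.min(WINDOW, size - pos);
            MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
            for (int i = 0; i < len; i++) {
                if (window.get(i) == '\n') {
                    return pos + i + 1;
                }
            }
            pos += len;
        }
        return size;
    }
}
```

Each segment can then be mapped with `channel.map(FileChannel.MapMode.READ_ONLY, start, end - start)` and processed on its own thread; a single mapping is limited to `Integer.MAX_VALUE` bytes, so a large file needs enough parts to stay under that.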

Most improvements come from optimizing the inner-most loop (of course) by writing my own branchless parser for the numbers, which probably looks like magic.

Finally I've added a custom HashTable implementation, just for this specific usecase.... speeeed.

On the MacBook Pro M2 (32GB) it runs in:
real 0m2.873s

@royvanrijn royvanrijn changed the title from "Caved in and created a version that partitions the file" to "memory mapped files, branchless parsing, bitwiddle magic" Jan 2, 2024
@lobaorn

lobaorn commented Jan 2, 2024

Hey @royvanrijn, since you are already very engaged, would you like to check whether Lilliput builds using Generational ZGC could yield benefits? https://twitter.com/gunnarmorling/status/1742227887745376300

If I give it a try myself, it would probably be by the end of the week, and only if there is another feasible approach to compare with the already opened PRs. Other than that, it would be JVM and GC tweaking...

@swaechter

swaechter commented Jan 2, 2024

@royvanrijn Nice solution, I had something similar in mind. Maybe using FFI for mmap and pthreads would improve the performance (fewer JVM byte arrays for buffering etc., though that somewhat bends the JNI rule). Maybe I'll find the time :)

@suchwerk

suchwerk commented Jan 2, 2024

What about using integer arithmetic instead of floats?

@gunnarmorling
Owner

Hey @royvanrijn, are you planning to do further changes to this one? If so, wanna put it into "Draft" state until it's good to go from your side?

@gunnarmorling
Owner

@royvanrijn, could you rebase this one to current main and squash everything into a single commit?

@gunnarmorling
Owner

Preliminarily updated the leaderboard with the result from the latest version here: 00:23.366 on the evaluation environment. You've taken over again, @royvanrijn :)

@royvanrijn royvanrijn closed this Jan 2, 2024
@royvanrijn royvanrijn reopened this Jan 2, 2024
Comment on lines +244 to +223
* Branchless parser, goes from String to int (10x):
* "-1.2" to -12
* "40.1" to 401
Contributor

@gunnarmorling Is this assumption acceptable? E.g. 1, 0, 2.00 are all valid doubles.

Contributor

Also, there should be an acceptance test suite for the implementations; I am pretty sure this implementation does not produce the same output as the baseline.

Contributor Author

@royvanrijn royvanrijn Jan 3, 2024

I'm pretty sure it produces the exact same output, I check regularly with each change.

The input and output have one decimal place precision (as stated by the website "rounded to one fractional digit").

I'm internally storing the doubles as 10x integers because the precision is just a single digit, and I'm making sure the rounding is correct afterwards for the average.
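
To make the "10x integer" representation concrete, here is a minimal sketch. It is a plain branching parser, not the branchless one from this PR, and it assumes every value has exactly one fractional digit. The class and method names are hypothetical.

```java
class TenthsParser {
    /** Parses text such as "-1.2" or "40.1" into tenths of a degree: -12, 401. */
    static int parseTenths(byte[] buf, int start, int end) {
        int i = start;
        boolean negative = buf[i] == '-';
        if (negative) {
            i++;
        }
        int value = 0;
        for (; i < end; i++) {
            byte b = buf[i];
            if (b != '.') {              // skip the decimal point, keep the digits
                value = value * 10 + (b - '0');
            }
        }
        return negative ? -value : value;
    }

    /** Mean of a sum held in tenths, rounded to one fractional digit with Math.round. */
    static double meanRounded(long sumTenths, long count) {
        return Math.round((double) sumTenths / count) / 10.0;
    }

    public static void main(String[] args) {
        byte[] text = "-1.2".getBytes();
        System.out.println(parseTenths(text, 0, text.length)); // prints -12
    }
}
```

Sums, minimums and maximums can then stay in `int`/`long` arithmetic, and only the final mean is converted back to a decimal.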

Contributor

@AlexanderYastrebov AlexanderYastrebov Jan 3, 2024

The README only specifies the output format, not the input format:

1brc/README.md

Lines 27 to 28 in e7e7deb

The task is to write a Java program which reads the file, calculates the min, mean, and max temperature value per weather station, and emits the results on stdout like this
(i.e. sorted alphabetically by station name, and the result values per station in the format `<min>/<mean>/<max>`, rounded to one fractional digit):

so it's worth clarifying.

There are many rounding modes, and the one to use is not specified either; e.g. in Go, https://pkg.go.dev/math#Round is not the same as rounding in Java.

Contributor

Also, the reference implementation does ad-hoc rounding with Math.round(value * 10.0) / 10.0, but the actual output depends on the string concatenation, which performs another rounding; see https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Double.html#toString(double)
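
A small illustration of why the rounding mode matters, using the expression quoted above next to `RoundingMode.HALF_UP`. `Math.round` resolves ties toward positive infinity, while `HALF_UP` resolves them away from zero, so the two disagree on negative ties. The class and method names are made up for this sketch.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class RoundingModes {
    /** The ad-hoc expression quoted from the reference implementation. */
    static double roundLikeReference(double value) {
        return Math.round(value * 10.0) / 10.0;
    }

    /** Decimal rounding to one fractional digit, ties away from zero. */
    static BigDecimal roundHalfUp(String value) {
        return new BigDecimal(value).setScale(1, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(roundLikeReference(-2.25)); // prints -2.2
        System.out.println(roundHalfUp("-2.25"));      // prints -2.3
    }
}
```

For positive ties such as 2.25 both approaches give 2.3, which is why the difference is easy to miss without a dedicated test case.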

Contributor

Due to unfortunate output format selection (see #14) one has to use word diff

Created #36 to address this

Contributor

Sorry for not being explicit enough here. Can the behavior of the reference implementation be described using any of the existing values of RoundingMode?

I think the exact mode does not really matter; what matters is that the baseline is correct.

I propose changing the baseline to use BigDecimal for accumulating results, with scale 1 (instead of round(x*10)/10) and the HALF_UP rounding mode (the one most commonly taught at school) at the final step:

BigDecimal value = new BigDecimal("12.34");
BigDecimal rounded = value.setScale(1, RoundingMode.HALF_UP);
  
System.out.println("=="+rounded.toString()+"=="); // prints ==12.3==

Owner

But that's the thing, I don't think we can change the behavior of the reference implementation at this point, as it would render existing submissions invalid if they implement a different behavior. So I'd rather make the behavior of the RI explicit, also if it's not the most natural one (agreed that HALF_UP behavior would have been better).

Contributor

I don't think it's possible to fix the RI because it uses double division and rounds twice.
Since there are no acceptance tests, I bet a lot of implementations (those that do not parse and calculate values the same way) will not match the RI anyway.

I think the RI should favor correctness over performance; then it can be used to build an acceptance test suite.

Owner

Ok, I've logged #49 for getting this one sorted out separately and get this PR merged. Let's continue the rounding topic over there. Thx!

return toAdd.measurement;
}

private static int hashCode(byte[] a, int length) {
@franz1981 franz1981 Jan 3, 2024

The hash code here has a data dependency: you can either unroll it manually or relax the hash code by using a VarHandle and getLong, amortizing the data dependency in batches and handling only the last 7 (or fewer) bytes separately, using the array.
That way most of the computation would likely resolve in far fewer loop iterations too, similar to https://github.com/apache/activemq-artemis/blob/25fc0342275b29cd73123523a46e6e94582597cd/artemis-commons/src/main/java/org/apache/activemq/artemis/utils/ByteUtil.java#L299
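
One way to read this suggestion, as a hedged sketch rather than the code that ended up in the PR: consume eight bytes per step through a byte-array view `VarHandle`, then finish the remaining bytes one at a time. The multiplier 31 and the final fold to `int` are arbitrary choices for illustration, and the resulting values differ from a byte-by-byte hash.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

class WideHash {
    // Reads a long from a byte[] at an arbitrary byte offset.
    private static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    static int hash(byte[] a, int length) {
        long h = 1;
        int i = 0;
        // Main loop: one dependent multiply-add per 8 bytes instead of per byte.
        for (; i + Long.BYTES <= length; i += Long.BYTES) {
            h = 31 * h + (long) LONG_VIEW.get(a, i);
        }
        // Tail: at most 7 bytes are left, read from the array directly.
        for (; i < length; i++) {
            h = 31 * h + a[i];
        }
        return (int) (h ^ (h >>> 32));
    }
}
```

Only the first `length` bytes contribute, so the same station name hashes identically regardless of what follows it in the buffer.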

@gunnarmorling
Owner

@royvanrijn, so what should we do with this one, and all the pending discussions? Wanna submit it as is and keep honing in follow-up PRs? I think it would be nice to be able to update the leaderboard with the current status (fastest right now is @spullara). For that, could you rebase it to resolve the merge conflict?

@royvanrijn royvanrijn closed this Jan 3, 2024
royvanrijn and others added 5 commits January 3, 2024 16:30
Added SWAR (SIMD Within A Register) code to increase bytebuffer processing/throughput

Delaying the creation of the String by comparing hash, segmenting like spullara, improved EOL finding
Squashing for merge.
@royvanrijn royvanrijn reopened this Jan 3, 2024
Also fixing millisecond separator

Co-authored-by: Gunnar Morling <[email protected]>
@lobaorn

lobaorn commented Jan 3, 2024

Shamelessly sharing this idea for JVM/GC tuning in another PR/discussion? #15 (comment)

@royvanrijn
Contributor Author

@gunnarmorling Git is failing me so hard haha, such a mess; let's get this merged and start working on a v2.

@gunnarmorling
Owner

@royvanrijn, could you add a line like this to your launch script for making sure the evaluation is done with GraalVM? I'll squash everything then and evaluate. Thx!

@gunnarmorling gunnarmorling merged commit 5570f1b into gunnarmorling:main Jan 3, 2024
1 check passed
@gunnarmorling
Owner

@royvanrijn dang, so I've done what I shouldn't have done and merged it before running. But it seems to take much longer / hang now actually. Any idea what's wrong?

long mask = match - 0x0101010101010101L;
mask &= ~match;
mask &= 0x8080808080808080L;
return Long.numberOfTrailingZeros(mask) >>> 3;
@franz1981 franz1981 Jan 3, 2024

Shouldn't this be the number of leading ones here?
@royvanrijn @gunnarmorling

Contributor Author

@royvanrijn royvanrijn Jan 3, 2024

Weird, I thought that explicitly setting the byte order to LE on line 105 would make the compatibility issues disappear, so that running it on my machine would automatically mean it works on the target, although perhaps with a performance hit.

afk atm, I’ll check tomorrow, if somebody wants to fix it and tell me, be my guest 😂

@franz1981 franz1981 Jan 3, 2024

I am not sure actually, for these things I need an old school paper and a pencil :)

Contributor Author

I have some local classes that test; problem is that I believe it works on my machine, just not on the target machine, I’ll check soon.

Contributor Author

Yeah, annoying, it runs fine locally (the code that was pushed), sigh. Kind of debugging in the dark haha... a challenge!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it twice, I was wrong at #5

Let's say we have byte[] data = { 0x01, 0x03 } and assume a short-based version of SWAR.

Reading the content of data as a little-endian short gives:

0x0301

The least significant byte sits at the lower address, so the SWAR result when searching for 0x03 (I'm using the Netty algorithm, but it should be the same here) will be

0x8000

and, in order to find the position of 0x03, we have to use the trailing zeros (here 8 + 7 = 15 -> 15/8 = 1).

Which means that it is fine as it is!
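
The same reasoning can be checked with a 64-bit word. This is a sketch around the mask expression quoted in the review; the class name, the helper that reads the word, and the padded test data are assumptions. The borrow in the subtraction only travels from lower to higher bytes, so the lowest flagged byte is always a true match, which is why counting trailing zeros on a little-endian word gives the first match.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class SwarSearch {
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGH_BITS = 0x8080808080808080L;

    /** Index (0..7) of the first byte equal to target in a little-endian word, or 8 if absent. */
    static int firstIndexOf(long word, byte target) {
        long match = word ^ ((target & 0xFFL) * ONES);   // matching bytes become 0x00
        long mask = (match - ONES) & ~match & HIGH_BITS; // lowest set 0x80 marks the first zero byte
        return Long.numberOfTrailingZeros(mask) >>> 3;   // bit position to byte index
    }

    static long readLittleEndian(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getLong(offset);
    }

    public static void main(String[] args) {
        byte[] data = { 0x01, 0x03, 0, 0, 0, 0, 0, 0 };
        System.out.println(firstIndexOf(readLittleEndian(data, 0), (byte) 0x03)); // prints 1
    }
}
```

With a big-endian read the first byte in memory would land in the most significant position, and the leading-zero count would be the one to use instead.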

@franz1981

@gunnarmorling added a comment to help

// System.out.println("Took: " + (System.currentTimeMillis() - before));
// Simple is faster for '\n' (just three options)
int endPointer;
if (bb.get(separatorPointer + 4) == '\n') {
Contributor

For my input I get IOOBE here:

Exception in thread "main" java.lang.IndexOutOfBoundsException
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:542)
	at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:567)
	at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:670)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:927)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
	at dev.morling.onebrc.CalculateAverage_royvanrijn.run(CalculateAverage_royvanrijn.java:144)
	at dev.morling.onebrc.CalculateAverage_royvanrijn.main(CalculateAverage_royvanrijn.java:92)
Caused by: java.lang.IndexOutOfBoundsException
	at java.base/java.nio.Buffer$1.apply(Buffer.java:757)
	at java.base/java.nio.Buffer$1.apply(Buffer.java:754)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
	at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
	at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
	at java.base/java.nio.Buffer.checkIndex(Buffer.java:768)
	at java.base/java.nio.DirectByteBuffer.get(DirectByteBuffer.java:358)
	at dev.morling.onebrc.CalculateAverage_royvanrijn.lambda$run$0(CalculateAverage_royvanrijn.java:118)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1708)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:960)
	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:934)
	at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
	at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)

Let me know if you need the file.
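
The trace shows an absolute `get` on the buffer failing its index check; the root cause is not established in this thread. Purely as an illustration of a defensive variant of the three-option newline lookahead, the sketch below checks the buffer limit before each read. It assumes `\n` line endings and values of the form `d.d`, `dd.d`, `-d.d` or `-dd.d`; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

class LineEnd {
    /**
     * Returns the index of the '\n' ending the record whose ';' is at separator,
     * or bb.limit() when the last record has no trailing newline.
     */
    static int findLineEnd(ByteBuffer bb, int separator) {
        int limit = bb.limit();
        // The value is 3 to 5 bytes long, so the newline is 4, 5 or 6 bytes after ';'.
        for (int offset = 4; offset <= 6; offset++) {
            int candidate = separator + offset;
            if (candidate >= limit) {
                return limit; // never read past the end of the mapped region
            }
            if (bb.get(candidate) == '\n') {
                return candidate;
            }
        }
        throw new IllegalStateException("no line end near separator at " + separator);
    }
}
```

The extra comparison costs a little in the hot loop, so a real implementation might only take this guarded path for the last few records of each segment.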

Contributor Author

@royvanrijn royvanrijn Jan 4, 2024

Ah yes please, perhaps there is some other bug that's platform dependent; do share.

Can you specify what platform you're running it on, and could you please also check this (improved) version:
https://github.com/royvanrijn/1brc/blob/8db31e6a36fbc305765a2393efb06ba6bff23f42/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java

Contributor

Running on Windows. Will check the new version later (it is quite late here now and starting from tomorrow I won't have access to PC for the weekend).

Let me know if you can suggest how to send you the data file - I started bzipping it and it takes forever, but even part way through, the archive is 2Gb (you can mail me upload coordinates at dimitar.dimitrov at gmail dot com)

Contributor Author

If possible, can you narrow it down? Perhaps run a very small test? Do they all crash, or just this one?

@DamienOReilly DamienOReilly Jan 4, 2024

@ddimtirov for future reference, the default compression level on zstd will be a lot faster and offer reasonable compression:

On a MacBook Pro 2020 - 2 GHz Quad-Core Intel Core i5:

# time zstd -z measurements.txt
measurements.txt     : 28.24%   (  12.8 GiB =>   3.63 GiB, measurements.txt.zst)
zstd -z measurements.txt  92.10s user 8.05s system 107% cpu 1:33.54 total

Contributor

See related #61

We've also added some basic samples within #82

AlexanderYastrebov added a commit to AlexanderYastrebov/1brc that referenced this pull request Jan 4, 2024
Added a test case from the discussion gunnarmorling#5 (comment)

Not all implementations match the baseline:
```sh
$ ./test_all.sh src/test/resources/samples/measurements-rounding-baseline.txt 2>/dev/null
FAIL armandino
FAIL artsiomkorzun
PASS baseline
PASS bjhara
PASS criccomini
FAIL ddimtirov
FAIL ebarlas
FAIL filiphr
FAIL itaske
PASS jgrateron
FAIL khmarbaise
FAIL kuduwa-keshavram
PASS lawrey
PASS moysesb
FAIL nstng
PASS padreati
FAIL palmr
PASS richardstartin
FAIL royvanrijn
PASS seijikun
PASS spullara
PASS truelive
```

nor do they all match the precise mean `(33.6+31.7+21.9+14.6)/4=25.45`, i.e. 25.5 after rounding half up:
```
$ ./test_all.sh src/test/resources/samples/measurements-rounding-precise.txt 2>/dev/null
PASS armandino
PASS artsiomkorzun
FAIL baseline
FAIL bjhara
FAIL criccomini
FAIL ddimtirov
PASS ebarlas
PASS filiphr
PASS itaske
FAIL jgrateron
PASS khmarbaise
FAIL kuduwa-keshavram
FAIL lawrey
FAIL moysesb
PASS nstng
FAIL padreati
PASS palmr
FAIL richardstartin
PASS royvanrijn
FAIL seijikun
FAIL spullara
FAIL truelive
```

For gunnarmorling#49
AlexanderYastrebov added a commit to AlexanderYastrebov/1brc that referenced this pull request Jan 4, 2024
Added a test case from the discussion gunnarmorling#5 (comment)

Not all implementations match the baseline,
nor do they all match the precise mean `(33.6+31.7+21.9+14.6)/4=25.45`, i.e. 25.5 after rounding half up:
```
$ ./test_all.sh src/test/resources/samples/measurements-rounding-baseline.txt 2>/dev/null | tee /tmp/rounding-baseline.log
FAIL armandino
FAIL artsiomkorzun
PASS baseline
PASS bjhara
PASS criccomini
FAIL ddimtirov
FAIL ebarlas
FAIL filiphr
FAIL itaske
PASS jgrateron
FAIL khmarbaise
FAIL kuduwa-keshavram
PASS lawrey
PASS moysesb
FAIL nstng
PASS padreati
FAIL palmr
PASS richardstartin
FAIL royvanrijn
PASS seijikun
PASS spullara
PASS truelive

$ ./test_all.sh src/test/resources/samples/measurements-rounding-precise.txt 2>/dev/null | tee /tmp/rounding-precise.log
PASS armandino
PASS artsiomkorzun
FAIL baseline
FAIL bjhara
FAIL criccomini
FAIL ddimtirov
PASS ebarlas
PASS filiphr
FAIL itaske
FAIL jgrateron
PASS khmarbaise
FAIL kuduwa-keshavram
FAIL lawrey
FAIL moysesb
PASS nstng
FAIL padreati
PASS palmr
FAIL richardstartin
PASS royvanrijn
FAIL seijikun
FAIL spullara
FAIL truelive

$ git --no-pager diff --word-diff /tmp/rounding-baseline.log /tmp/rounding-precise.log
diff --git a/tmp/rounding-baseline.log b/tmp/rounding-precise.log
index 76d5b4e..495fb00 100644
--- a/tmp/rounding-baseline.log
+++ b/tmp/rounding-precise.log
@@ -1,22 +1,22 @@
[-FAIL-]{+PASS+} armandino[-FAIL artsiomkorzun-]
PASS {+artsiomkorzun+}
{+FAIL+} baseline
[-PASS-]{+FAIL+} bjhara
[-PASS-]{+FAIL+} criccomini
FAIL ddimtirov
[-FAIL-]{+PASS+} ebarlas
[-FAIL-]{+PASS+} filiphr
FAIL itaske
[-PASS jgrateron-]FAIL {+jgrateron+}
{+PASS+} khmarbaise
FAIL kuduwa-keshavram
[-PASS-]{+FAIL+} lawrey[-PASS moysesb-]
FAIL [-nstng-]{+moysesb+}
PASS [-padreati-]{+nstng+}
FAIL [-palmr-]{+padreati+}
PASS [-richardstartin-]{+palmr+}
FAIL [-royvanrijn-]{+richardstartin+}
PASS {+royvanrijn+}
{+FAIL+} seijikun
[-PASS-]{+FAIL+} spullara
[-PASS-]{+FAIL+} truelive
```

For gunnarmorling#49
AlexanderYastrebov added a commit to AlexanderYastrebov/1brc that referenced this pull request Jan 4, 2024
Added two test cases from the discussion gunnarmorling#5 (comment)

Not all implementations match the baseline, nor do they all match the precise mean `(33.6+31.7+21.9+14.6)/4=25.45`, i.e. 25.5 after rounding half up:
```
$ ./test_all.sh src/test/resources/samples/measurements-rounding-baseline.txt 2>/dev/null | tee /tmp/rounding-baseline.log
FAIL armandino
FAIL artsiomkorzun
PASS baseline
PASS bjhara
PASS criccomini
FAIL ddimtirov
FAIL ebarlas
FAIL filiphr
FAIL itaske
PASS jgrateron
FAIL khmarbaise
FAIL kuduwa-keshavram
PASS lawrey
PASS moysesb
FAIL nstng
PASS padreati
FAIL palmr
PASS richardstartin
FAIL royvanrijn
PASS seijikun
PASS spullara
PASS truelive

$ ./test_all.sh src/test/resources/samples/measurements-rounding-precise.txt 2>/dev/null | tee /tmp/rounding-precise.log
PASS armandino
PASS artsiomkorzun
FAIL baseline
FAIL bjhara
FAIL criccomini
FAIL ddimtirov
PASS ebarlas
PASS filiphr
FAIL itaske
FAIL jgrateron
PASS khmarbaise
FAIL kuduwa-keshavram
FAIL lawrey
FAIL moysesb
PASS nstng
FAIL padreati
PASS palmr
FAIL richardstartin
PASS royvanrijn
FAIL seijikun
FAIL spullara
FAIL truelive

$ git --no-pager diff --word-diff /tmp/rounding-baseline.log /tmp/rounding-precise.log
diff --git a/tmp/rounding-baseline.log b/tmp/rounding-precise.log
index 76d5b4e..495fb00 100644
--- a/tmp/rounding-baseline.log
+++ b/tmp/rounding-precise.log
@@ -1,22 +1,22 @@
[-FAIL-]{+PASS+} armandino[-FAIL artsiomkorzun-]
PASS {+artsiomkorzun+}
{+FAIL+} baseline
[-PASS-]{+FAIL+} bjhara
[-PASS-]{+FAIL+} criccomini
FAIL ddimtirov
[-FAIL-]{+PASS+} ebarlas
[-FAIL-]{+PASS+} filiphr
FAIL itaske
[-PASS jgrateron-]FAIL {+jgrateron+}
{+PASS+} khmarbaise
FAIL kuduwa-keshavram
[-PASS-]{+FAIL+} lawrey[-PASS moysesb-]
FAIL [-nstng-]{+moysesb+}
PASS [-padreati-]{+nstng+}
FAIL [-palmr-]{+padreati+}
PASS [-richardstartin-]{+palmr+}
FAIL [-royvanrijn-]{+richardstartin+}
PASS {+royvanrijn+}
{+FAIL+} seijikun
[-PASS-]{+FAIL+} spullara
[-PASS-]{+FAIL+} truelive
```

For gunnarmorling#49
AlexanderYastrebov added a commit to AlexanderYastrebov/1brc that referenced this pull request Jan 4, 2024
Added two test cases from the discussion gunnarmorling#5 (comment)

Not all implementations match the baseline, nor do they all match the precise mean `(33.6+31.7+21.9+14.6)/4=25.45`, i.e. 25.5 after rounding half up:
```
$ ./test_all.sh src/test/resources/samples/measurements-rounding-baseline.txt 2>/dev/null | tee /tmp/rounding-baseline.log
FAIL armandino
FAIL artsiomkorzun
PASS baseline
PASS bjhara
PASS criccomini
FAIL ddimtirov
FAIL ebarlas
FAIL filiphr
FAIL itaske
PASS jgrateron
FAIL khmarbaise
FAIL kuduwa-keshavram
PASS lawrey
PASS moysesb
FAIL nstng
PASS padreati
FAIL palmr
PASS richardstartin
FAIL royvanrijn
PASS seijikun
PASS spullara
PASS truelive

$ ./test_all.sh src/test/resources/samples/measurements-rounding-precise.txt 2>/dev/null | tee /tmp/rounding-precise.log
PASS armandino
PASS artsiomkorzun
FAIL baseline
FAIL bjhara
FAIL criccomini
FAIL ddimtirov
PASS ebarlas
PASS filiphr
FAIL itaske
FAIL jgrateron
PASS khmarbaise
FAIL kuduwa-keshavram
FAIL lawrey
FAIL moysesb
PASS nstng
FAIL padreati
PASS palmr
FAIL richardstartin
PASS royvanrijn
FAIL seijikun
FAIL spullara
FAIL truelive

$ diff -y /tmp/rounding-baseline.log /tmp/rounding-precise.log
FAIL armandino                                                | PASS armandino
FAIL artsiomkorzun                                            | PASS artsiomkorzun
PASS baseline                                                 | FAIL baseline
PASS bjhara                                                   | FAIL bjhara
PASS criccomini                                               | FAIL criccomini
FAIL ddimtirov                                                  FAIL ddimtirov
FAIL ebarlas                                                  | PASS ebarlas
FAIL filiphr                                                  | PASS filiphr
FAIL itaske                                                     FAIL itaske
PASS jgrateron                                                | FAIL jgrateron
FAIL khmarbaise                                               | PASS khmarbaise
FAIL kuduwa-keshavram                                           FAIL kuduwa-keshavram
PASS lawrey                                                   | FAIL lawrey
PASS moysesb                                                  | FAIL moysesb
FAIL nstng                                                    | PASS nstng
PASS padreati                                                 | FAIL padreati
FAIL palmr                                                    | PASS palmr
PASS richardstartin                                           | FAIL richardstartin
FAIL royvanrijn                                               | PASS royvanrijn
PASS seijikun                                                 | FAIL seijikun
PASS spullara                                                 | FAIL spullara
PASS truelive                                                 | FAIL truelive
```

It's also interesting that e.g. `itaske` produces different results
between runs and thus may pass or fail sporadically.

For gunnarmorling#49
gunnarmorling pushed a commit that referenced this pull request Jan 14, 2024
…board in local testing using evaluate2.sh] (#209)

* Linear probe for city indexing. Beats current leader spullara 2.2 vs 3.8 elapsed time.

* Straightforward impl using ByteBuffers. Turns out MemorySegments were slower than the MappedByteBuffers used.

* An initial submit-worthy entry

Comparison to select entries (averaged over 3 runs)
* spullara 1.66s [5th on leaderboard currently]
* vemana (this submission) 1.65s
* artsiomkorzun 1.64s [4th on leaderboard currently]

Tests: PASS
Impl Class: dev.morling.onebrc.CalculateAverage_vemana

Machine specs
* 16 core Ryzen 7950X
* 128GB RAM

Description
* Decompose the full file into Shards of memory mapped files and process
  each independently, outputting a TreeMap: City -> Statistics
* Compose the final answer by merging the individual TreeMap outputs
* Select 1 Thread per available processor as reported by the JVM
* Size all data structures to fit in 0.5x L3 cache (4MB/core on the
  evaluation machines)
* Use linear probing hash table, with identity of city name = byte[] and
  hash code computed inline
* Avoid all allocation in the hot path and instead use method
  parameters. So, instead of passing a single Object param called Point(x, y, z),
  pass 3 parameters for each of its components. It is ugly, but this
  challenge is so far from Java's idioms anyway
* G1GC seems to want to interfere; use ParallelGC instead (just a quick
  and dirty hack)
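
As a rough illustration of the linear-probing and no-allocation points in the list above (a sketch, not the code of this submission): statistics live in parallel primitive arrays, the key is compared in place against the source buffer, and the only allocation is the key copy made the first time a city is seen. The class name, field layout, and the use of tenths for values are assumptions; the table never resizes, so the capacity has to exceed the number of distinct keys.

```java
import java.util.Arrays;

class ProbeTable {
    final byte[][] keys;
    final int[] min;
    final int[] max;
    final int[] count;
    final long[] sum;
    private final int mask;

    ProbeTable(int capacityPowerOfTwo) {
        keys = new byte[capacityPowerOfTwo][];
        min = new int[capacityPowerOfTwo];
        max = new int[capacityPowerOfTwo];
        count = new int[capacityPowerOfTwo];
        sum = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    /** Finds the slot holding this key, or the empty slot where it belongs. */
    int slotFor(byte[] src, int offset, int length, int hash) {
        int slot = hash & mask;
        while (true) {
            byte[] key = keys[slot];
            if (key == null
                    || Arrays.equals(key, 0, key.length, src, offset, offset + length)) {
                return slot;
            }
            slot = (slot + 1) & mask; // collision: try the next slot, wrapping around
        }
    }

    /** Records one measurement; the key is passed as (array, offset, length), not as an object. */
    void add(byte[] src, int offset, int length, int hash, int valueTenths) {
        int slot = slotFor(src, offset, length, hash);
        if (keys[slot] == null) {
            keys[slot] = Arrays.copyOfRange(src, offset, offset + length); // first sighting only
            min[slot] = valueTenths;
            max[slot] = valueTenths;
        } else {
            min[slot] = Math.min(min[slot], valueTenths);
            max[slot] = Math.max(max[slot], valueTenths);
        }
        sum[slot] += valueTenths;
        count[slot]++;
    }
}
```

Per-thread tables like this can be merged into a sorted map at the end, which matches the shard-then-merge structure described above.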

Things tried that did not work
* MemorySegments are actually slower than MappedByteBuffers
* Trying to inline everything: not needed; the JIT compiler is pretty
  good
* Playing with JIT compiler flags didn't yield clear wins. In
  particular, was surprised that using a max level of 3 and reducing
  compilation threshold did nothing.. when the jit logs print that none
  of the methods reach level 4 and stay there for long
* Hand-coded implementation of Array.equals(..) using
  readLong(..) & bitmask_based_on_length from a bytebuffer instead of byte by byte

* Further tuning to compile loop methods: timings are now consistently ahead of artsiomkorzun in 4th place.

There are methods on the data path that were being interpreted for far
too long. For example, the method that takes a byte range and simply
calls one method per line was taking a disproportionate amount of time.
Using `-XX:+AlwaysCompileLoopMethods` option improved completion time by 4%.

============= vemana ===============
[20:55:22] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh vemana;
done;

Using java version 21.0.1-graal in this shell.

real    0m1.581s
user    0m34.166s
sys     0m1.435s

Using java version 21.0.1-graal in this shell.

real    0m1.593s
user    0m34.629s
sys     0m1.470s

Using java version 21.0.1-graal in this shell.

real    0m1.632s
user    0m35.893s
sys     0m1.340s

Using java version 21.0.1-graal in this shell.

real    0m1.596s
user    0m33.074s
sys     0m1.386s

Using java version 21.0.1-graal in this shell.

real    0m1.611s
user    0m35.516s
sys     0m1.438s

============= artsiomkorzun ===============
[20:56:12] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh
artsiomkorzun; done;

Using java version 21.0.1-graal in this shell.

real    0m1.669s
user    0m38.043s
sys     0m1.287s

Using java version 21.0.1-graal in this shell.

real    0m1.679s
user    0m37.840s
sys     0m1.400s

Using java version 21.0.1-graal in this shell.

real    0m1.657s
user    0m37.607s
sys     0m1.298s

Using java version 21.0.1-graal in this shell.

real    0m1.643s
user    0m36.852s
sys     0m1.392s

Using java version 21.0.1-graal in this shell.

real    0m1.644s
user    0m36.951s
sys     0m1.279s

============= spullara ===============
[20:57:55] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh spullara;
done;

Using java version 21.0.1-graal in this shell.

real    0m1.676s
user    0m37.404s
sys     0m1.386s

Using java version 21.0.1-graal in this shell.

real    0m1.652s
user    0m36.509s
sys     0m1.486s

Using java version 21.0.1-graal in this shell.

real    0m1.665s
user    0m36.451s
sys     0m1.506s

Using java version 21.0.1-graal in this shell.

real    0m1.671s
user    0m36.917s
sys     0m1.371s

Using java version 21.0.1-graal in this shell.

real    0m1.634s
user    0m35.624s
sys     0m1.573s

========================== Running Tests ======================

[21:17:57] [lsv@vemana]$ ./runTests.sh vemana
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-10000-unique-keys.txt

Using java version 21.0.1-graal in this shell.

real    0m0.150s
user    0m1.035s
sys     0m0.117s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-10.txt

Using java version 21.0.1-graal in this shell.

real    0m0.114s
user    0m0.789s
sys     0m0.116s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-1.txt

Using java version 21.0.1-graal in this shell.

real    0m0.115s
user    0m0.948s
sys     0m0.075s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-20.txt

Using java version 21.0.1-graal in this shell.

real    0m0.113s
user    0m0.926s
sys     0m0.066s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-2.txt

Using java version 21.0.1-graal in this shell.

real    0m0.110s
user    0m0.734s
sys     0m0.078s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-3.txt

Using java version 21.0.1-graal in this shell.

real    0m0.114s
user    0m0.870s
sys     0m0.095s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-boundaries.txt

Using java version 21.0.1-graal in this shell.

real    0m0.113s
user    0m0.843s
sys     0m0.084s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-complex-utf8.txt

Using java version 21.0.1-graal in this shell.

real    0m0.121s
user    0m0.852s
sys     0m0.171s

* Improve by a few % more; now, convincingly faster than 6th place
submission. So far, only algorithms and tuning; no bitwise tricks yet.

Improve chunking implementation to avoid allocation and allow
fine-grained chunking for the last X% of work. Work now proceeds in two
stages: big chunk stage and small chunk stage. This is to avoid
straggler threads holding up result merging.
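
A minimal sketch of the two-stage idea described above, not the code from the submission. The class name `TwoStageChunker`, its parameters, and the use of a single `AtomicLong` cursor are assumptions made for illustration; aligning chunk boundaries to line breaks is left out.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hands out byte ranges of a file: big chunks first, small chunks for the tail. */
final class TwoStageChunker {
    private final long fileSize;
    private final long bigChunk;
    private final long smallChunk;
    private final long bigStageEnd; // offsets below this are served as big chunks
    private final AtomicLong next = new AtomicLong();

    /** Assumption: tailFraction is the "last X%" of the file, e.g. 0.2 for 20%. */
    TwoStageChunker(long fileSize, long bigChunk, long smallChunk, double tailFraction) {
        this.fileSize = fileSize;
        this.bigChunk = bigChunk;
        this.smallChunk = smallChunk;
        this.bigStageEnd = (long) (fileSize * (1.0 - tailFraction));
    }

    /** Claims the next chunk and returns its start offset, or -1 when no work is left. */
    long claimStart() {
        while (true) {
            long start = next.get();
            if (start >= fileSize) {
                return -1;
            }
            // The CAS keeps claiming allocation-free and safe across worker threads.
            if (next.compareAndSet(start, endOf(start))) {
                return start;
            }
        }
    }

    /** Exclusive end offset of the chunk that starts at 'start'. */
    long endOf(long start) {
        long size = start < bigStageEnd ? bigChunk : smallChunk;
        return Math.min(fileSize, start + size);
    }
}
```

Each worker would loop on `claimStart()` until it returns -1. Because the tail is served in small pieces, a thread that starts late can only hold up the final merge by one small chunk.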

Tests pass

[07:14:49] [lsv@vemana]$ ./test.sh vemana
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-10000-unique-keys.txt

Using java version 21.0.1-graal in this shell.

real    0m0.152s
user    0m0.973s
sys     0m0.107s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-10.txt

Using java version 21.0.1-graal in this shell.

real    0m0.113s
user    0m0.840s
sys     0m0.060s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-1.txt

Using java version 21.0.1-graal in this shell.

real    0m0.107s
user    0m0.681s
sys     0m0.085s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-20.txt

Using java version 21.0.1-graal in this shell.

real    0m0.105s
user    0m0.894s
sys     0m0.068s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-2.txt

Using java version 21.0.1-graal in this shell.

real    0m0.099s
user    0m0.895s
sys     0m0.068s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-3.txt

Using java version 21.0.1-graal in this shell.

real    0m0.098s
user    0m0.813s
sys     0m0.050s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-boundaries.txt

Using java version 21.0.1-graal in this shell.

real    0m0.095s
user    0m0.777s
sys     0m0.087s
Validating calculate_average_vemana.sh --
src/test/resources/samples/measurements-complex-utf8.txt

Using java version 21.0.1-graal in this shell.

real    0m0.112s
user    0m0.904s
sys     0m0.069s

* Merge results from finished threads instead of waiting for all threads
to finish.

Not a huge difference overall but no reason to wait.

Also experiment with a few other compiler flags and attempt to use
jitwatch to understand what the jit is doing.
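
One way to merge results as workers finish, sketched with `ExecutorCompletionService` from the JDK. This is an illustration under assumptions, not the submission's code: the class `EagerMerge`, the `long[]` stats layout `{min, max, sum, count}`, and the `String` keys are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class EagerMerge {
    /** Runs all tasks and folds each partial result into the total as soon as it is ready. */
    static Map<String, long[]> run(List<Callable<Map<String, long[]>>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            ExecutorCompletionService<Map<String, long[]>> done =
                    new ExecutorCompletionService<>(pool);
            for (Callable<Map<String, long[]>> task : tasks) {
                done.submit(task);
            }
            Map<String, long[]> total = new TreeMap<>();
            for (int i = 0; i < tasks.size(); i++) {
                // take() returns whichever task finished next, not the first one submitted.
                Map<String, long[]> part = done.take().get();
                part.forEach((city, stats) -> total.merge(city, stats, EagerMerge::combine));
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Assumed stats layout: {min, max, sum, count}, temperatures in tenths of a degree. */
    static long[] combine(long[] a, long[] b) {
        return new long[] {
            Math.min(a[0], b[0]), Math.max(a[1], b[1]), a[2] + b[2], a[3] + b[3]
        };
    }
}
```

The merge work overlaps with the slower workers instead of starting only after the last one completes, which matches the "no reason to wait" observation above.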

* Move to prepare_*.sh format and run evaluate2.sh locally.

Shows 7th place in leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| 1 | 00:01.588 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.1-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary |
| 2 | 00:01.866 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykitty.java)| 21.0.1-open | [Quan Anh Mai](https://github.com/merykitty) |  |
| 3 | 00:01.904 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) |  |
|   | 00:02.398 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java)| 21.0.1-graal | [Elliot Barlas](https://github.com/ebarlas) |  |
|   | 00:02.724 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_obourgain.java)| 21.0.1-open | [Olivier Bourgain](https://github.com/obourgain) |  |
|   | 00:02.771 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_algirdasrascius.java)| 21.0.1-open | [Algirdas Raščius](https://github.com/algirdasrascius) |  |
|   | 00:02.842 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java)| 21.0.1-graal | [Vemana](https://github.com/vemana) |  |
|   | 00:02.902 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_spullara.java)| 21.0.1-graal | [Sam Pullara](https://github.com/spullara) |  |
|   | 00:02.906 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_artsiomkorzun.java)| 21.0.1-graal | [artsiomkorzun](https://github.com/artsiomkorzun) |  |
|   | 00:02.970 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_isolgpus.java)| 21.0.1-open | [Jamie Stansfield](https://github.com/isolgpus) |  |

* Tune chunksize to get another 2% improvement for 8 processors as used by
the evaluation script.

* Read int at a time for city name length detection; speeds up by 2% in local testing.
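
For illustration, reading four bytes at once to locate the `;` that ends the city name could look roughly like the sketch below. It uses the well-known "has zero byte" bit trick on an `int`; the class `SemicolonScan` and its method are hypothetical names, and the submission's actual loop may differ.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SemicolonScan {
    private static final int SEMICOLONS = 0x3B3B3B3B; // ';' repeated in all four bytes

    /** Returns the index of the first ';' at or after 'from', or -1 if there is none. */
    static int indexOfSemicolon(ByteBuffer buffer, int from) {
        // Little-endian puts the first byte in memory into the lowest bits of the int.
        ByteBuffer buf = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int limit = buf.limit();
        int i = from;
        while (i + 4 <= limit) {
            int x = buf.getInt(i) ^ SEMICOLONS;              // bytes equal to ';' become 0x00
            int hit = (x - 0x01010101) & ~x & 0x80808080;    // lowest set bit marks the first zero byte
            if (hit != 0) {
                return i + (Integer.numberOfTrailingZeros(hit) >>> 3);
            }
            i += 4;
        }
        for (; i < limit; i++) {                              // fewer than 4 bytes left
            if (buf.get(i) == ';') {
                return i;
            }
        }
        return -1;
    }
}
```

The name length is then the returned index minus the start of the line. Bytes of multi-byte UTF-8 characters have their high bit set and therefore never produce a false match.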

* Improve reading temperature double by exiting loop quicker; no major
tricks (like reading an int) yet, but good for 5th place on leaderboard in local testing.

This small change produced a surprising performance gain of about 4%.
I didn't expect such a big change, but perhaps, combined with the
earlier change to read the city name int by int, temperature reading
now dominates that part of the time. Also, the quicker exit (as soon as
you see '.' instead of reading until '\n') means the '\n' never has to
be read on any line. Since lines average about 15 characters, skipping
the '\n' may be a meaningful saving. Or maybe the JIT found a clever
optimization for reading the temperature.

Or maybe the gain simply comes from the number of multiplications
dropping from 3 to 2?
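
A hedged sketch of the early-exit idea: stop the digit loop at '.', then read the single fractional digit directly, so the '\n' is never inspected. It assumes the challenge's value format (optional '-', one or two integer digits, '.', exactly one fractional digit); `TemperatureParser` is a hypothetical name and this is not claimed to be the submission's exact loop or multiplication count.

```java
final class TemperatureParser {
    /**
     * Parses a value such as "-12.3" starting at 'pos' and returns it in tenths (-123).
     * Assumption: optional '-', one or two integer digits, '.', one fractional digit.
     */
    static int parseTenths(byte[] data, int pos) {
        boolean negative = data[pos] == '-';
        if (negative) {
            pos++;
        }
        int value = 0;
        byte b;
        while ((b = data[pos++]) != '.') {      // exit at '.', not at '\n'
            value = value * 10 + (b - '0');
        }
        value = value * 10 + (data[pos] - '0'); // the one digit after the '.'
        return negative ? -value : value;
    }
}
```

Since every value has exactly one fractional digit, the next line is known to start two bytes after that digit, so the caller can advance without looking at the newline.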

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| 1 | 00:01.531 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.1-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary |
| 2 | 00:01.794 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) |  |
| 3 | 00:01.956 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykitty.java)| 21.0.1-open | [Quan Anh Mai](https://github.com/merykitty) |  |
|   | 00:02.346 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java)| 21.0.1-graal | [Elliot Barlas](https://github.com/ebarlas) |  |
|   | 00:02.673 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_vemana.java)| 21.0.1-graal | [Subrahmanyam](https://github.com/vemana) |  |
|   | 00:02.689 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_obourgain.java)| 21.0.1-open | [Olivier Bourgain](https://github.com/obourgain) |  |
|   | 00:02.785 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_algirdasrascius.java)| 21.0.1-open | [Algirdas Raščius](https://github.com/algirdasrascius) |  |
|   | 00:02.926 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_isolgpus.java)| 21.0.1-open | [Jamie Stansfield](https://github.com/isolgpus) |  |
|   | 00:02.928 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_artsiomkorzun.java)| 21.0.1-graal | [Artsiom Korzun](https://github.com/artsiomkorzun) |  |
|   | 00:02.932 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_spullara.java)| 21.0.1-graal | [Sam Pullara](https://github.com/spullara) |  |

* Reduce one multiplication when temperature is +ve.

* Linear probe for city indexing. Beats current leader spullara 2.2 vs 3.8 elapsed time.

* Straightforward impl using ByteBuffers. Turns out MemorySegments were slower than the MappedByteBuffers used here.

* An initial submit-worthy entry

Comparison to select entries (averaged over 3 runs)
* spullara 1.66s [5th on leaderboard currently]
* vemana (this submission) 1.65s
* artsiomkorzun 1.64s [4th on leaderboard currently]

Tests: PASS
Impl Class: dev.morling.onebrc.CalculateAverage_vemana

Machine specs
* 16 core Ryzen 7950X
* 128GB RAM

Description
* Decompose the full file into Shards of memory mapped files and process
  each independently, outputting a TreeMap: City -> Statistics
* Compose the final answer by merging the individual TreeMap outputs
* Select 1 Thread per available processor as reported by the JVM
* Size to fit all datastructure in 0.5x L3 cache (4MB/core on the
  evaluation machines)
* Use linear probing hash table, with identity of city name = byte[] and
  hash code computed inline
* Avoid all allocation in the hot path and instead use method
  parameters. So, instead of passing a single Object param called Point(x, y, z),
  pass 3 parameters for each of its components. It is ugly, but this
  challenge is so far from Java's idioms anyway
* G1GC seems to want to interfere; use ParallelGC instead (just a quick
  and dirty hack)
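
To make the linear-probing bullet above concrete, here is a small sketch of such a table keyed by the city name's bytes. It is an illustration under assumptions rather than the submission's data structure: the class `CityTable`, the parallel-array layout, and the 31-based hash are hypothetical choices, and the hash is computed in a separate loop here rather than while scanning the line.

```java
import java.util.Arrays;

/** Open-addressing table keyed by city-name bytes; a sketch, not the submission's code. */
final class CityTable {
    final byte[][] keys;
    final int[] min;
    final int[] max;
    final int[] count;
    final long[] sum;
    private final int mask;

    /** Assumption: capacity is a power of two and larger than the number of distinct cities. */
    CityTable(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        keys = new byte[capacity][];
        min = new int[capacity];
        max = new int[capacity];
        count = new int[capacity];
        sum = new long[capacity];
        mask = capacity - 1;
    }

    /** Records one measurement (in tenths of a degree) for the name in data[from, to). */
    void add(byte[] data, int from, int to, int tenths) {
        int slot = slotFor(data, from, to);
        if (keys[slot] == null) {
            keys[slot] = Arrays.copyOfRange(data, from, to); // allocate once per new city
            min[slot] = tenths;
            max[slot] = tenths;
        } else {
            min[slot] = Math.min(min[slot], tenths);
            max[slot] = Math.max(max[slot], tenths);
        }
        sum[slot] += tenths;
        count[slot]++;
    }

    /** Returns the slot holding 'name', or -1 if it has not been added. */
    int find(byte[] name) {
        int slot = slotFor(name, 0, name.length);
        return keys[slot] == null ? -1 : slot;
    }

    /** Hashes the name, then probes linearly until the key or an empty slot is found. */
    private int slotFor(byte[] data, int from, int to) {
        int hash = 0;
        for (int i = from; i < to; i++) {
            hash = 31 * hash + data[i];
        }
        int slot = (hash ^ (hash >>> 16)) & mask;
        while (keys[slot] != null
                && !Arrays.equals(keys[slot], 0, keys[slot].length, data, from, to)) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }
}
```

Each worker would own one such table, so no synchronization is needed on the hot path; the per-worker tables are converted and merged only at the end. Passing `data, from, to` instead of a key object follows the "method parameters instead of allocation" point in the list.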

Things tried that did not work
* MemorySegments are actually slower than MappedByteBuffers
* Trying to inline everything: not needed; the JIT compiler is pretty
  good
* Playing with JIT compiler flags didn't yield clear wins. In
  particular, I was surprised that capping the compilation level at 3
  and reducing the compilation threshold did nothing, even though the
  JIT logs show that none of the methods reach level 4 and stay there
  for long
* Hand-coded implementation of Arrays.equals(..) using
  readLong(..) & bitmask_based_on_length from a ByteBuffer instead of
  comparing byte by byte
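
For reference, the long-plus-bitmask comparison mentioned in the last bullet can be sketched as follows. The author reports it did not help, so this only illustrates the idea; `MaskedEquals` is a hypothetical name, and the sketch assumes little-endian buffers with enough padding that an 8-byte read at the last word stays in bounds.

```java
import java.nio.ByteBuffer;

final class MaskedEquals {
    /**
     * Compares 'len' bytes of x at offset a with 'len' bytes of y at offset b.
     * Assumptions: both buffers are LITTLE_ENDIAN and every 8-byte word touched is readable.
     */
    static boolean equal(ByteBuffer x, int a, ByteBuffer y, int b, int len) {
        int i = 0;
        for (; i + 8 <= len; i += 8) {                 // whole 8-byte words
            if (x.getLong(a + i) != y.getLong(b + i)) {
                return false;
            }
        }
        int rest = len - i;                            // 0..7 trailing bytes
        if (rest == 0) {
            return true;
        }
        long mask = -1L >>> (64 - 8 * rest);           // keep only the low 'rest' bytes
        return ((x.getLong(a + i) ^ y.getLong(b + i)) & mask) == 0;
    }
}
```

With little-endian order the first bytes in memory land in the low bits of the long, which is why the mask keeps the low end; a big-endian buffer would need the mask on the high end instead.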

* Further tuning to compile loop methods: timings are now consistently ahead of artsiomkorzun in 4th place.

There are methods on the data path that were being interpreted for far
too long. For example, the method that takes a byte range and simply
calls one method per line was taking a disproportionate amount of time.
Using `-XX:+AlwaysCompileLoopMethods` option improved completion time by 4%.

============= vemana ===============
[20:55:22] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh vemana;
done;

Using java version 21.0.1-graal in this shell.

real    0m1.581s
user    0m34.166s
sys     0m1.435s

Using java version 21.0.1-graal in this shell.

real    0m1.593s
user    0m34.629s
sys     0m1.470s

Using java version 21.0.1-graal in this shell.

real    0m1.632s
user    0m35.893s
sys     0m1.340s

Using java version 21.0.1-graal in this shell.

real    0m1.596s
user    0m33.074s
sys     0m1.386s

Using java version 21.0.1-graal in this shell.

real    0m1.611s
user    0m35.516s
sys     0m1.438s

============= artsiomkorzun ===============
[20:56:12] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh
artsiomkorzun; done;

Using java version 21.0.1-graal in this shell.

real    0m1.669s
user    0m38.043s
sys     0m1.287s

Using java version 21.0.1-graal in this shell.

real    0m1.679s
user    0m37.840s
sys     0m1.400s

Using java version 21.0.1-graal in this shell.

real    0m1.657s
user    0m37.607s
sys     0m1.298s

Using java version 21.0.1-graal in this shell.

real    0m1.643s
user    0m36.852s
sys     0m1.392s

Using java version 21.0.1-graal in this shell.

real    0m1.644s
user    0m36.951s
sys     0m1.279s

============= spullara ===============
[20:57:55] [lsv@vemana]$ for i in 1 2 3 4 5; do ./runTheir.sh spullara;
done;

Using java version 21.0.1-graal in this shell.

real    0m1.676s
user    0m37.404s
sys     0m1.386s

Using java version 21.0.1-graal in this shell.

real    0m1.652s
user    0m36.509s
sys     0m1.486s

Using java version 21.0.1-graal in this shell.

real    0m1.665s
user    0m36.451s
sys     0m1.506s

Using java version 21.0.1-graal in this shell.

real    0m1.671s
user    0m36.917s
sys     0m1.371s

Using java version 21.0.1-graal in this shell.

real    0m1.634s
user    0m35.624s
sys     0m1.573s


* Added some documentation on the approach.

---------

Co-authored-by: vemana <[email protected]>
gunnarmorling pushed a commit that referenced this pull request Jan 28, 2024
* Latest snapshot (#1)

preparing initial version

* Improved performance to 20seconds  (-9seconds from the previous version) (#2)

improved performance a bit

* Improved performance to 14 seconds (-6 seconds) (#3)

improved performance to 14 seconds

* sync branches (#4)

* initial commit

* some refactoring of methods

* some fixes for partitioning

* some fixes for partitioning

* fixed hacky getcode for utf8 bytes

* simplified getcode for partitioning

* temp solution with syncing

* temp solution with syncing

* new stream processing

* new stream processing

* some improvements

* cleaned stuff

* run configuration

* round buffer for the stream to pages

* not using compute since it's slower than straightforward get/put. using own byte array equals.

* using parallel gc

* avoid copying bytes when creating a station object

* formatting

* Copy less arrays. Improved performance to 12.7 seconds (-2 seconds) (#5)

* initial commit

* some refactoring of methods

* some fixes for partitioning

* some fixes for partitioning

* fixed hacky getcode for utf8 bytes

* simplified getcode for partitioning

* temp solution with syncing

* temp solution with syncing

* new stream processing

* new stream processing

* some improvements

* cleaned stuff

* run configuration

* round buffer for the stream to pages

* not using compute since it's slower than straightforward get/put. using own byte array equals.

* using parallel gc

* avoid copying bytes when creating a station object

* formatting

* some tuning to increase performance

* some tuning to increase performance

* avoid copying data; fast hashCode with slightly more collisions

* avoid copying data; fast hashCode with slightly more collisions

* cleanup (#6)

* tidy up
kahlstrm added a commit to kahlstrm/brc-rs that referenced this pull request Apr 20, 2024
As the number conversion for the original problem was somewhat weird + arguably somewhat Java-specific (see discussions at gunnarmorling/1brc#5), I've decided to move away from using that and use the sample results + run checks against my own base implementation instead.