Compare outputs with tolerance #375
Conversation
(force-pushed from e81922b to 1bd0483)
Dang, all this is a bit of a mess. Turns out, I even had the correct behavior initially (see commented code in baseline), but then I got "smart" and refactored it into that
I feel that may be the best balance to ensure correctness while not moving goal posts too much at this point. In particular, it would not mess up any existing results so far, which feels important to me. @mtopolnik, @royvanrijn, @hundredwatt, @thomaswue, @merykitty, any thoughts?
Thanks! A few comments inline.
```java
import java.util.regex.Pattern;
import java.util.stream.IntStream;

public class CompareWithTolerance {
```
Can we make this an implicit class in the root dir, with a top-level main method, like process_output.java (which itself, btw, is obsolete by now; I'm going to remove it)? That way, it doesn't have to be re-packaged after changes and can simply be run with `java lenient_compare.java`.
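
(For reference, a minimal sketch of what such an implicitly declared class could look like; this assumes a JDK where the feature is available as a preview, so `--enable-preview` may be required, and the argument handling below is purely illustrative, not the actual comparator code.)

```java
// Hypothetical sketch of an implicitly declared class: no explicit class
// declaration, just a top-level main method, runnable without packaging, e.g.:
//   java --enable-preview --source 21 lenient_compare.java expected.out actual.out
void main(String[] args) {
    if (args.length != 2) {
        System.err.println("Usage: lenient_compare.java <expected-file> <actual-file>");
        System.exit(2);
    }
    // Placeholder for the actual tolerant comparison logic.
    System.out.println("Comparing " + args[0] + " against " + args[1]);
}
```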
```java
var nameAndStats2 = parseEntry(entries2[i], i, fname2);
if (IntStream.range(0, 3)
        .mapToObj(n -> new double[] { nameAndStats1.stats[n], nameAndStats2.stats[n] })
        .anyMatch(pair -> Math.abs((pair[0] - pair[1])) > 1)) {
```
I think the baseline would only ever get the value too high by 0.1 (or 1 as int here), i.e. no need for `abs()`?
test.sh (outdated)
```diff
@@ -45,7 +45,7 @@ for sample in $(ls $INPUT); do
 rm -f measurements.txt
 ln -s $sample measurements.txt

-diff <("./calculate_average_$FORK.sh" | ./tocsv.sh) <(./tocsv.sh < ${sample%.txt}.out)
+./compare_output.sh <("./calculate_average_$FORK.sh") ${sample%.txt}.out
```
Can we make this parameterized instead, i.e. add a `--lenient` option to the script which controls which mode to use?
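
(Purely to illustrate the suggested mode switch, whichever layer ends up owning the flag: a hedged Java sketch of a comparator entry point toggling between strict and tolerant comparison. The flag and method names below are hypothetical, not the PR's actual API.)

```java
// Hypothetical sketch: select strict vs. tolerant comparison via a --lenient flag.
public class CompareMain {

    public static void main(String[] args) {
        boolean lenient = args.length > 0 && args[0].equals("--lenient");
        String expected = args[lenient ? 1 : 0];
        String actual = args[lenient ? 2 : 1];
        boolean ok = lenient
                ? compareWithTolerance(expected, actual) // allow off-by-one-decimal deviations
                : compareStrict(expected, actual);       // exact textual equality
        System.exit(ok ? 0 : 1);
    }

    // Placeholder implementations; the real logic would read and compare the files.
    static boolean compareStrict(String expectedFile, String actualFile) { return true; }

    static boolean compareWithTolerance(String expectedFile, String actualFile) { return true; }
}
```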
Maybe we can first assess the damage? How many solutions are failing? For example, with my dataset, among the first 20 I found only a single solution to have this off-by-one-decimal problem.
Sounds like a good plan. If there are entries failing, let's mark that separately on the leaderboard and they'll hopefully be fixed.
I can make the script record such minor mistakes but continue testing for larger ones. Then we know for sure the solution is correct modulo the off-by-one-decimal issue.
Maybe an easier approach to not have to change testing too much:
For the test case brought up by @tivrfoa here, about 1/3 of all implementations fail when comparing against the output of the fixed baseline. But out of the current Top 20, i.e. the set I'd potentially run for the 10K key set, only 2 fail. So not too bad indeed.
(force-pushed from f3701fb to 83ea793)
(force-pushed from 83ea793 to c7da487)
Yeah, coming to the same conclusion actually. So I think just adding the new test case should be all we need here, and I'll follow up with updating the baseline implementation, keeping the old one around too for cases where the old output needs to be reproduced.
Linking related: #49
```java
            System.exit(1);
        }
        content = content.substring(1, content.length() - 1);
        return content.split(",");
```
As an exercise I've re-implemented gunnarmorling#375's idea to compare result numbers with a tolerance:

```
$ ./test.sh baseline 0 src/test/resources/samples/measurements-rounding-precise.txt
Validating calculate_average_baseline.sh -- src/test/resources/samples/measurements-rounding-precise.txt
Rounding=14.6/25.5/33.6 != Rounding=14.6/25.4/33.6 (avg)
```

The hassle of parsing highlights the importance of using a machine-readable format, as suggested in gunnarmorling#14. **But** I think this approach is wrong: the rules should define the rounding mode, and the baseline should be fixed to produce the correct result. Submissions should be fixed accordingly or qualified as non-passing. Could have been done earlier, but better late than never. Updates gunnarmorling#49.
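
(For context, a minimal parsing sketch of the baseline's `{name=min/mean/max, ...}` output format, along the lines of the snippet quoted above. It is illustrative only, not the actual comparator code, and it assumes station names contain neither `", "` nor `"="`.)

```java
// Illustrative sketch: parse "{Name=min/mean/max, Name=min/mean/max, ...}" output.
import java.nio.file.Files;
import java.nio.file.Path;

class ParseOutputSketch {
    public static void main(String[] args) throws Exception {
        String content = Files.readString(Path.of(args[0])).trim();
        if (!content.startsWith("{") || !content.endsWith("}")) {
            System.err.println("Unexpected format in " + args[0]);
            System.exit(1);
        }
        // Strip the surrounding braces, then split into per-station entries.
        String[] entries = content.substring(1, content.length() - 1).split(", ");
        for (String entry : entries) {
            // Each entry looks like "StationName=min/mean/max".
            String[] nameAndStats = entry.split("=");
            String[] stats = nameAndStats[1].split("/");
            System.out.printf("%s -> min=%s mean=%s max=%s%n",
                    nameAndStats[0], stats[0], stats[1], stats[2]);
        }
    }
}
```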
Ok, so as discussed, I'm moving forward with this by only adding the new test case and adjusting the baseline implementation accordingly. I'm going to close this PR in favour of #392, which contains this work. New or updated entries need to be compliant with the fixed behavior, i.e. pass the new additional test, whereas existing entries remain as-is, i.e. compliance at the time of their original evaluation is what counts. Thanks a lot @mtopolnik for pushing on this, and to all for providing feedback.
Here's the summary of what I understood. Many solutions accumulate an integer sum that is 10x the actual value and then derive the average from that. The baseline instead accumulates the sum as a double and then rounds the average. The proposed fix changes the baseline's calculation so that its result aligns with the first approach, which accumulates the integers.
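
(To make the difference concrete, here is a small illustrative sketch of the two accumulation strategies described above. It is not the baseline's or any entry's actual code, and the rounding semantics ultimately adopted are the ones specified in the follow-up PR.)

```java
// Illustrative comparison of the two averaging strategies discussed above.
public class AveragingSketch {
    public static void main(String[] args) {
        double[] readings = { 14.6, 25.5, 33.6, 25.4, 25.5 }; // arbitrary example values

        // Strategy used by many entries: accumulate an integer sum of tenths.
        long sumTenths = 0;
        for (double r : readings) {
            sumTenths += Math.round(r * 10.0); // e.g. 25.5 -> 255
        }
        double meanFromInts = Math.round((double) sumTenths / readings.length) / 10.0;

        // Old baseline style: accumulate a double sum, then round the average.
        double sumDouble = 0;
        for (double r : readings) {
            sumDouble += r;
        }
        double meanFromDoubles = Math.round(sumDouble / readings.length * 10.0) / 10.0;

        // Due to floating-point accumulation error, the two results can differ
        // in the last decimal for some inputs (the off-by-one-decimal issue).
        System.out.println("integer accumulation: " + meanFromInts);
        System.out.println("double accumulation:  " + meanFromDoubles);
    }
}
```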
Yes, pretty much that. See the linked new PR for how it looks from a "specification" point of view.
This adds a Java program that compares two output files with tolerance for accumulated `double` errors. It also produces nice diagnostic output, indicating the whole entry where there's a problem.
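
(Below is a minimal sketch of the kind of tolerant, per-entry comparison with diagnostics described here. The record and helper names are hypothetical, and the tolerance of one tenth mirrors the off-by-one-decimal discussion above rather than the program's exact logic.)

```java
// Hypothetical sketch of a tolerant per-entry comparison with diagnostic output.
import java.util.Arrays;
import java.util.stream.IntStream;

public class ToleranceCheckSketch {

    record Entry(String name, double[] stats) {} // stats = { min, mean, max }

    // True if all three stats differ by at most one tenth (small epsilon for FP noise).
    static boolean withinTolerance(Entry a, Entry b) {
        return IntStream.range(0, 3)
                .allMatch(i -> Math.abs(a.stats()[i] - b.stats()[i]) <= 0.1 + 1e-9);
    }

    static void report(Entry expected, Entry actual) {
        if (!withinTolerance(expected, actual)) {
            // Print the whole entry so the mismatch is easy to locate.
            System.err.printf("Mismatch for %s: expected %s, got %s%n",
                    expected.name(), Arrays.toString(expected.stats()), Arrays.toString(actual.stats()));
        }
    }

    public static void main(String[] args) {
        // Values taken from the example diff shown earlier in this thread.
        Entry expected = new Entry("Rounding", new double[] { 14.6, 25.5, 33.6 });
        Entry actual = new Entry("Rounding", new double[] { 14.6, 25.4, 33.6 });
        report(expected, actual); // differs by 0.1 -> within tolerance, no output
    }
}
```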