
Compare outputs with tolerance #375

Closed

Conversation

mtopolnik (Contributor)

This adds a Java program that compares two output files with a tolerance for accumulated double rounding errors.

It also produces nice diagnostic output, showing the whole entry where there's a problem.
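For illustration, the core idea could be sketched roughly like this (class and method names, the tolerance value, and the parsing are illustrative, not the actual code from this PR):

```
import java.nio.file.Files;
import java.nio.file.Path;

public class ToleranceCompareSketch {
    // Illustrative tolerance: one decimal place of slack per stat.
    static final double TOLERANCE = 0.1;

    public static void main(String[] args) throws Exception {
        String[] expected = entries(Path.of(args[0]));
        String[] actual = entries(Path.of(args[1]));
        for (int i = 0; i < Math.min(expected.length, actual.length); i++) {
            double[] e = stats(expected[i]);
            double[] a = stats(actual[i]);
            for (int n = 0; n < 3; n++) {
                if (Math.abs(e[n] - a[n]) > TOLERANCE) {
                    // Report the whole entry, not just the single differing number.
                    System.out.printf("Mismatch: %s != %s%n", expected[i], actual[i]);
                    break;
                }
            }
        }
    }

    // Splits "{Name=min/mean/max, Name=min/mean/max, ...}" into entries.
    // (Naive: station names containing ", " would need extra care.)
    static String[] entries(Path file) throws Exception {
        String content = Files.readString(file).trim();
        return content.substring(1, content.length() - 1).split(", ");
    }

    static double[] stats(String entry) {
        String[] parts = entry.substring(entry.indexOf('=') + 1).split("/");
        return new double[]{ Double.parseDouble(parts[0]),
                Double.parseDouble(parts[1]), Double.parseDouble(parts[2]) };
    }
}
```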

@mtopolnik mtopolnik marked this pull request as draft January 13, 2024 15:16
@mtopolnik mtopolnik force-pushed the compare-with-tolerance branch from e81922b to 1bd0483 Compare January 13, 2024 15:19
@mtopolnik mtopolnik marked this pull request as ready for review January 13, 2024 15:19
@gunnarmorling (Owner)

Dang, all this is a bit of a mess. Turns out, I even had the correct behavior initially (see the commented-out code in the baseline), but then I got "smart" and refactored it into that round() method. Someone brought up the concern that this approach could create false positives (i.e. mask actual errors). Seems kinda impossible to not make it suck in one way or another. Maybe the least bad way could be this:

  • Fix the baseline
  • Allow tests to be run with two modes: "strict" (i.e. current diff-based impl, contenders must exactly match the result of the fixed baseline) and "lenient" (i.e. what's proposed here, contenders can under-report mean by 0.1 compared to the fixed baseline)
  • Leave existing entries as is (i.e. implicitly using "lenient" validation mode)
  • Use "strict" mode for new or updated entries going forward
  • When publishing the 10K key set test results for the 10 or so top entries (i.e. similar to your earlier experiment), add a column which says whether they pass in "lenient" or "strict" mode, encouraging entries which only pass in "lenient" mode to adjust their implementation

I feel that may be the best balance to ensure correctness while not moving goal posts too much at this point. In particular, it would not mess up any existing results so far, which feels important to me.

@mtopolnik, @royvanrijn, @hundredwatt, @thomaswue, @merykitty, any thoughts?

@gunnarmorling (Owner) left a comment

Thanks! A few comments inline.

import java.util.regex.Pattern;
import java.util.stream.IntStream;

public class CompareWithTolerance {
gunnarmorling (Owner):

Can we make this an implicit class in the root dir, with a top-level main method, like process_output.java (which itself btw. is obsolete by now, I'm gonna remove it)? That way it doesn't have to be re-packaged after changes; it can just be run with java lenient_compare.java.
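For reference, the single-file shape being suggested might look roughly like this — shown as a hypothetical lenient_compare.java in the repo root with a conventional class and static main, which the single-file source launcher can run directly (the implicit-class form with a top-level main would need a JDK where that feature is available):

```
// lenient_compare.java -- run directly from the repo root, no packaging step:
//   java lenient_compare.java expected.out actual.out
public class lenient_compare {
    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: java lenient_compare.java <expected> <actual>");
            System.exit(2);
        }
        // ... the comparison logic from CompareWithTolerance would live here ...
        System.out.println("Comparing " + args[0] + " against " + args[1]);
    }
}
```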

var nameAndStats2 = parseEntry(entries2[i], i, fname2);
if (IntStream.range(0, 3)
.mapToObj(n -> new double[]{ nameAndStats1.stats[n], nameAndStats2.stats[n] })
.anyMatch(pair -> Math.abs((pair[0] - pair[1])) > 1)) {
gunnarmorling (Owner):

I think the baseline would only ever get the value too high by 0.1 (or 1 as int here), i.e. no need for abs()?

test.sh Outdated
@@ -45,7 +45,7 @@ for sample in $(ls $INPUT); do
rm -f measurements.txt
ln -s $sample measurements.txt

-diff <("./calculate_average_$FORK.sh" | ./tocsv.sh) <(./tocsv.sh < ${sample%.txt}.out)
+./compare_output.sh <("./calculate_average_$FORK.sh") ${sample%.txt}.out
gunnarmorling (Owner):

Can we make this parameterized instead, i.e. add a --lenient option to the script which controls which mode to use?

@mtopolnik (Contributor, Author)

mtopolnik commented Jan 13, 2024

@mtopolnik, @royvanrijn, @hundredwatt, @thomaswue, @merykitty, any thoughts?

Maybe we can first assess the damage? How many solutions are failing? For example, with my dataset, among the first 20 I found only a single solution to have this off-by-one-decimal problem.

@royvanrijn (Contributor)

Sounds like a good plan. If there are entries failing, let’s mark that separately on the leaderboard and they’ll hopefully be fixed.

@mtopolnik (Contributor, Author)

mtopolnik commented Jan 13, 2024

I can make the script record such minor mistakes but continue testing for larger ones. Then we'd know for sure the solution is correct modulo the off-by-one-decimal issue.

@tivrfoa (Contributor)

tivrfoa commented Jan 13, 2024

Maybe an easier approach to not have to change testing too much:

  • Copy CalculateAverage_baseline to CalculateAverage_baseline_lenient
  • Copy CalculateAverage_baseline to CalculateAverage_baseline_strict
  • Fix strict version
  • Use diff first on strict output, then on lenient output if needed.

@gunnarmorling (Owner)

Maybe we can first assess the damage? How many solutions are failing? For example, with my dataset, among the first 20 I found only a single solution to have this off-by-one-decimal problem.

For the test case brought up by @tivrfoa here, about 1/3 of all implementations fail when comparing to the output of the fixed baseline. But out of the current Top 20, i.e. the set I'd run for the 10K keyset potentially, only 2 fail. So not too bad indeed.

@mtopolnik mtopolnik force-pushed the compare-with-tolerance branch 2 times, most recently from f3701fb to 83ea793 Compare January 13, 2024 21:27
@mtopolnik mtopolnik force-pushed the compare-with-tolerance branch from 83ea793 to c7da487 Compare January 13, 2024 21:29
@gunnarmorling (Owner)

Maybe an easier approach to not have to change testing too much...

Yeah, I'm coming to the same conclusion actually. So I think just adding the new test case should be all we need here, and I'll follow up with updating the baseline implementation, keeping the old one around too, for cases where the old output needs to be reproduced.

@AlexanderYastrebov (Contributor)

AlexanderYastrebov commented Jan 14, 2024

Linking related #49

System.exit(1);
}
content = content.substring(1, content.length() - 1);
return content.split(",");
AlexanderYastrebov (Contributor):

The whole thing kind of shows the importance of using a proper format #14

See also

1brc/tocsv.sh

Lines 29 to 31 in a0f826c

# 3. id may contain comma, e.g. "Washington, D.C.;-15.1;14.8;44.8, Wau;-2.1;27.4;53.4"
# so replace ", " with a newline only if it is preceded by a digit
s/\([0-9]\), /\1\n/g
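In Java, the same digit-anchored split could be written with a lookbehind, which is roughly what any parser of the raw {Name=min/mean/max, ...} output ends up doing (a hypothetical sketch, not code from this PR):

```
public class SplitSketch {
    public static void main(String[] args) {
        String content = "Washington, D.C.=-15.1/14.8/44.8, Wau=-2.1/27.4/53.4";
        // Split on ", " only where it follows a digit, mirroring the sed rule above,
        // so that station names containing ", " stay in one piece.
        String[] entries = content.split("(?<=\\d), ");
        for (String entry : entries) {
            System.out.println(entry);
        }
        // Prints:
        // Washington, D.C.=-15.1/14.8/44.8
        // Wau=-2.1/27.4/53.4
    }
}
```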

AlexanderYastrebov added a commit to AlexanderYastrebov/1brc that referenced this pull request Jan 14, 2024
As an exercise I've re-implemented gunnarmorling#375 idea to compare result numbers
with a tolerance:
```
$ ./test.sh baseline 0 src/test/resources/samples/measurements-rounding-precise.txt
Validating calculate_average_baseline.sh -- src/test/resources/samples/measurements-rounding-precise.txt
Rounding=14.6/25.5/33.6 != Rounding=14.6/25.4/33.6 (avg)
```
The hassle of parsing highlights the importance of using
machine-readable format as suggested in gunnarmorling#14

**But** I think this approach is wrong, the rules should define rounding
mode and baseline should be fixed to produce the correct result.

Submissions should be fixed accordingly or qualified as non-passing.

Could have been done earlier but better late than never.

Updates gunnarmorling#49
@gunnarmorling (Owner)

Ok, so as discussed I'm moving forward with this by only adding the new test case and adjusting the baseline implementation accordingly. I'm going to close this PR in favour of #392, which contains this work.

New or updated entries need to be compliant with this fixed behavior, i.e. pass the new additional test, whereas existing entries remain as-is: compliance at the point of their original evaluation is what counts. Thanks a lot @mtopolnik for pushing on this, and to all for providing feedback.

@mtopolnik (Contributor, Author)

Here's the summary of what I understood. Many solutions do this: accumulate an integer sum that is 10x the actual value, and then, to get the average:

  1. Convert both sum and count to double;
  2. Divide sum by count;
  3. Round the result;
  4. Divide by ten.

The baseline accumulates the sum as a double, and then:

  1. Divide sum by count;
  2. Multiply by 10.0;
  3. Round;
  4. Divide by ten.

The proposed fix is to instead:

  1. Multiply the sum by 10.0;
  2. Round;
  3. Divide by 10.0;
  4. Divide by count;
  5. Round again.

This should align the result with the first approach, which accumulates the integers.
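A hedged sketch of the three recipes side by side (method names and the example input are mine, not from the PR):

```
public class AveragingSketch {
    // Round to one decimal place (Math.round semantics, halves toward positive infinity).
    static double round1(double value) {
        return Math.round(value * 10.0) / 10.0;
    }

    // 1) Many entries: integer sum kept in tenths, converted to double only for the division.
    static double integerEntries(long sumInTenths, long count) {
        return Math.round((double) sumInTenths / count) / 10.0;
    }

    // 2) Old baseline: double sum, divide first, then round to one decimal.
    static double oldBaseline(double sum, long count) {
        return Math.round(sum / count * 10.0) / 10.0;
    }

    // 3) Proposed fix: snap the accumulated double sum back to one decimal first,
    //    then divide by the count and round to one decimal again.
    static double fixedBaseline(double sum, long count) {
        return round1(round1(sum) / count);
    }

    public static void main(String[] args) {
        // Illustrative readings only: 25.3, 25.5, 25.1 (sum 75.9, i.e. 759 tenths).
        // On most inputs all three agree; they can diverge when accumulated double
        // error nudges the quotient across a rounding boundary.
        System.out.println(integerEntries(759, 3));
        System.out.println(oldBaseline(25.3 + 25.5 + 25.1, 3));
        System.out.println(fixedBaseline(25.3 + 25.5 + 25.1, 3));
    }
}
```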

@gunnarmorling (Owner)

Yes, pretty much that. See the linked new PR for what it looks like from a "specification" point of view.

@mtopolnik mtopolnik deleted the compare-with-tolerance branch January 17, 2024 07:12