Compare outputs with tolerance #375
Conversation
(force-pushed from e81922b to 1bd0483)
Dang, all this is a bit of a mess. Turns out, I even had the correct behavior initially (see commented code in baseline), but then I got "smart" and refactored it into that
I feel that may be the best balance to ensure correctness while not moving goal posts too much at this point. In particular, it would not mess up any existing results so far, which feels important to me. @mtopolnik, @royvanrijn, @hundredwatt, @thomaswue, @merykitty, any thoughts?
Thanks! A few comments inline.
```java
import java.util.regex.Pattern;
import java.util.stream.IntStream;

public class CompareWithTolerance {
```
Can we make this an implicit class in the root dir, with a top-level main method, like process_output.java (which itself, btw, is obsolete by now; I'm going to remove it)? That way, it doesn't have to be re-packaged after changes and can simply be run with `java lenient_compare.java`.
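
(For reference, a minimal sketch of what such an implicitly declared class could look like; this assumes a JDK where the feature is available as a preview, so `--enable-preview` may be required, and the argument handling below is purely illustrative, not the actual comparator code.)

```java
// Hypothetical sketch of an implicitly declared class: no explicit class
// declaration, just a top-level main method, runnable without packaging, e.g.:
//   java --enable-preview --source 21 lenient_compare.java expected.out actual.out
void main(String[] args) {
    if (args.length != 2) {
        System.err.println("Usage: lenient_compare.java <expected-file> <actual-file>");
        System.exit(2);
    }
    // Placeholder for the actual tolerant comparison logic.
    System.out.println("Comparing " + args[0] + " against " + args[1]);
}
```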
```java
var nameAndStats2 = parseEntry(entries2[i], i, fname2);
if (IntStream.range(0, 3)
        .mapToObj(n -> new double[] { nameAndStats1.stats[n], nameAndStats2.stats[n] })
        .anyMatch(pair -> Math.abs((pair[0] - pair[1])) > 1)) {
```
I think the baseline would only ever get the value too high by 0.1 (or 1 as int here), i.e. no need for `abs()`?
test.sh (outdated)
```diff
@@ -45,7 +45,7 @@ for sample in $(ls $INPUT); do
 rm -f measurements.txt
 ln -s $sample measurements.txt

-diff <("./calculate_average_$FORK.sh" | ./tocsv.sh) <(./tocsv.sh < ${sample%.txt}.out)
+./compare_output.sh <("./calculate_average_$FORK.sh") ${sample%.txt}.out
```
Can we make this parameterized instead, i.e. add a `--lenient` option to the script which controls which mode to use?
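
(Purely to illustrate the suggested mode switch, whichever layer ends up owning the flag: a hedged Java sketch of a comparator entry point toggling between strict and tolerant comparison. The flag and method names below are hypothetical, not the PR's actual API.)

```java
// Hypothetical sketch: select strict vs. tolerant comparison via a --lenient flag.
public class CompareMain {

    public static void main(String[] args) {
        boolean lenient = args.length > 0 && args[0].equals("--lenient");
        String expected = args[lenient ? 1 : 0];
        String actual = args[lenient ? 2 : 1];
        boolean ok = lenient
                ? compareWithTolerance(expected, actual) // allow off-by-one-decimal deviations
                : compareStrict(expected, actual);       // exact textual equality
        System.exit(ok ? 0 : 1);
    }

    // Placeholder implementations; the real logic would read and compare the files.
    static boolean compareStrict(String expectedFile, String actualFile) { return true; }

    static boolean compareWithTolerance(String expectedFile, String actualFile) { return true; }
}
```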
Maybe we can first assess the damage? How many solutions are failing? For example, with my dataset, among the first 20 I found only a single solution to have this off-by-one-decimal problem.
Sounds like a good plan. If there are entries failing, let's mark that separately on the leaderboard and they'll hopefully be fixed.
I can make the script record such minor mistakes but continue testing for larger ones. Then we know for sure the solution is correct modulo the off-by-one-decimal issue.
Maybe an easier approach to not have to change testing too much:
For the test case brought up by @tivrfoa here, about 1/3 of all implementations fail when comparing against the output of the fixed baseline. But out of the current Top 20, i.e. the set I'd potentially run for the 10K key set, only 2 fail. So not too bad indeed.
(force-pushed from f3701fb to 83ea793)
(force-pushed from 83ea793 to c7da487)
Yeah, coming to the same conclusion actually. So I think just adding the new test case should be all we need here, and I'll follow up with updating the baseline implementation, keeping the old one around too for cases where the old output needs to be reproduced.
Linking related: #49
```java
            System.exit(1);
        }
        content = content.substring(1, content.length() - 1);
        return content.split(",");
```
As an exercise I've re-implemented gunnarmorling#375's idea to compare result numbers with a tolerance:

```
$ ./test.sh baseline 0 src/test/resources/samples/measurements-rounding-precise.txt
Validating calculate_average_baseline.sh -- src/test/resources/samples/measurements-rounding-precise.txt
Rounding=14.6/25.5/33.6 != Rounding=14.6/25.4/33.6 (avg)
```

The hassle of parsing highlights the importance of using a machine-readable format, as suggested in gunnarmorling#14. **But** I think this approach is wrong: the rules should define the rounding mode, and the baseline should be fixed to produce the correct result. Submissions should be fixed accordingly or qualified as non-passing. Could have been done earlier, but better late than never. Updates gunnarmorling#49.
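
(For context, a minimal parsing sketch of the baseline's `{name=min/mean/max, ...}` output format, along the lines of the snippet quoted above. It is illustrative only, not the actual comparator code, and it assumes station names contain neither `", "` nor `"="`.)

```java
// Illustrative sketch: parse "{Name=min/mean/max, Name=min/mean/max, ...}" output.
import java.nio.file.Files;
import java.nio.file.Path;

class ParseOutputSketch {
    public static void main(String[] args) throws Exception {
        String content = Files.readString(Path.of(args[0])).trim();
        if (!content.startsWith("{") || !content.endsWith("}")) {
            System.err.println("Unexpected format in " + args[0]);
            System.exit(1);
        }
        // Strip the surrounding braces, then split into per-station entries.
        String[] entries = content.substring(1, content.length() - 1).split(", ");
        for (String entry : entries) {
            // Each entry looks like "StationName=min/mean/max".
            String[] nameAndStats = entry.split("=");
            String[] stats = nameAndStats[1].split("/");
            System.out.printf("%s -> min=%s mean=%s max=%s%n",
                    nameAndStats[0], stats[0], stats[1], stats[2]);
        }
    }
}
```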
Ok, so as discussed, I'm moving forward with this by only adding the new test case and adjusting the baseline implementation accordingly. I'm going to close this PR in favour of #392, which contains this work. New or updated entries need to be compliant with the fixed behavior, i.e. pass the new additional test, whereas existing entries remain as-is, i.e. compliance at the time of their original evaluation is what counts. Thanks a lot @mtopolnik for pushing on this, and to all for providing feedback.
Here's the summary of what I understood. Many solutions accumulate an integer sum that is 10x the actual value and then derive the average from that. The baseline instead accumulates the sum as a double and then rounds the average. The proposed fix changes the baseline's calculation so that its result aligns with the first approach, which accumulates the integers.
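
(To make the difference concrete, here is a small illustrative sketch of the two accumulation strategies described above. It is not the baseline's or any entry's actual code, and the rounding semantics ultimately adopted are the ones specified in the follow-up PR.)

```java
// Illustrative comparison of the two averaging strategies discussed above.
public class AveragingSketch {
    public static void main(String[] args) {
        double[] readings = { 14.6, 25.5, 33.6, 25.4, 25.5 }; // arbitrary example values

        // Strategy used by many entries: accumulate an integer sum of tenths.
        long sumTenths = 0;
        for (double r : readings) {
            sumTenths += Math.round(r * 10.0); // e.g. 25.5 -> 255
        }
        double meanFromInts = Math.round((double) sumTenths / readings.length) / 10.0;

        // Old baseline style: accumulate a double sum, then round the average.
        double sumDouble = 0;
        for (double r : readings) {
            sumDouble += r;
        }
        double meanFromDoubles = Math.round(sumDouble / readings.length * 10.0) / 10.0;

        // Due to floating-point accumulation error, the two results can differ
        // in the last decimal for some inputs (the off-by-one-decimal issue).
        System.out.println("integer accumulation: " + meanFromInts);
        System.out.println("double accumulation:  " + meanFromDoubles);
    }
}
```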
Yes, pretty much that. See the linked new PR for how it looks from a "specification" point of view.
This adds a Java program that compares two output files with tolerance for accumulated `double` errors. It also produces nice diagnostic output, indicating the whole entry where there's a problem.
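
(Below is a minimal sketch of the kind of tolerant, per-entry comparison with diagnostics described here. The record and helper names are hypothetical, and the tolerance of one tenth mirrors the off-by-one-decimal discussion above rather than the program's exact logic.)

```java
// Hypothetical sketch of a tolerant per-entry comparison with diagnostic output.
import java.util.Arrays;
import java.util.stream.IntStream;

public class ToleranceCheckSketch {

    record Entry(String name, double[] stats) {} // stats = { min, mean, max }

    // True if all three stats differ by at most one tenth (small epsilon for FP noise).
    static boolean withinTolerance(Entry a, Entry b) {
        return IntStream.range(0, 3)
                .allMatch(i -> Math.abs(a.stats()[i] - b.stats()[i]) <= 0.1 + 1e-9);
    }

    static void report(Entry expected, Entry actual) {
        if (!withinTolerance(expected, actual)) {
            // Print the whole entry so the mismatch is easy to locate.
            System.err.printf("Mismatch for %s: expected %s, got %s%n",
                    expected.name(), Arrays.toString(expected.stats()), Arrays.toString(actual.stats()));
        }
    }

    public static void main(String[] args) {
        // Values taken from the example diff shown earlier in this thread.
        Entry expected = new Entry("Rounding", new double[] { 14.6, 25.5, 33.6 });
        Entry actual = new Entry("Rounding", new double[] { 14.6, 25.4, 33.6 });
        report(expected, actual); // differs by 0.1 -> within tolerance, no output
    }
}
```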