-
-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stability of document order when tied #508
Comments
I'd expect that modifying the following [[nodiscard]] constexpr static auto
min_heap_order(entry_type const& lhs, entry_type const& rhs) noexcept -> bool
{
return lhs.first > rhs.first || lhs.second < rhs.second;
} |
Whoops, actually more like: return lhs.first == rhs.first ? lhs.second < rhs.second : lhs.first > rhs.first; |
This is a tricky issue because of efficiency considerations. But if the problem is the stability of of the final sort, how about we add this additional sort condition only at the very end? This, I believe, would still be reproducible, at least if we use the same algorithm, and add documents in the same order. What do you think about this? I would like to avoid adding another condition in the heap insert itself, because that might slow thinigs down unnecessarily. |
That sounds fair to me; I agree that every statement matters during the query processing itself. The one caveat is that if there happens to be a tied score at the end of the ranked list (e.g., rank 1000 and rank 1001 have the same scores), the document returned at rank 1000 is non-deterministic. This feels like a fair concession, though. |
BTW, I have this PR pending: #507 where I implement a custom pop-push operation instead of using We could also implement our own With those, at least the logic could be well-defined and documented, and it would be deterministic in the sense that if you run the same algorithm on the same data but different environments (say, me compiling and running with gcc/libstdc++ and you with clang/libc++, for example), the results should be the same. But we wouldn't be able to guarantee that any algorithm would return the exact same results in that edge case of the same score around Does that make sense or am I missing something important? |
Sounds good to me! Thanks @elshize! |
The final sorting order is now by score (descending) and docid (ascending). Furthermore, `std::push_heap` is replaced with our own implementation to maintain consistency across standard libraries.
@seanmacavaney FYI, I opned a PR for this, if you'd like to give it a look. |
The final sorting order is now by score (descending) and docid (ascending). Furthermore, `std::push_heap` is replaced with our own implementation to maintain consistency across standard libraries.
The final sorting order is now by score (descending) and docid (ascending). Furthermore, `std::push_heap` is replaced with our own implementation to maintain consistency across standard libraries.
Describe the bug
Since
sort_heap
is not stable andtopk_queue
only sorts on score, the order of results that have the same scores can differ between runs. This is harmful for the repeatbility of results.To Reproduce
Steps to reproduce the behavior:
Error message
No error message.
Expected behavior
I expect results to be ordered ascending by DocId as a secondary sort when scores result in ties.
Environment info
Operating System: Ubuntu 22.04.1
Compiler name: gcc
Compiler version: 9
The text was updated successfully, but these errors were encountered: