Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stability of document order when tied #508

Open
seanmacavaney opened this issue Dec 31, 2022 · 7 comments · May be fixed by #515
Open

stability of document order when tied #508

seanmacavaney opened this issue Dec 31, 2022 · 7 comments · May be fixed by #515
Assignees
Labels
bug Something isn't working

Comments

@seanmacavaney
Copy link

Describe the bug
Since sort_heap is not stable and topk_queue only sorts on score, the order of results that have the same scores can differ between runs. This is harmful for the repeatbility of results.

To Reproduce
Steps to reproduce the behavior:

  1. Follow the steps to index and execute queries. Be sure to include a query that will result in a score tie for the ranking model. For instance, the query "chemical reactions" over the vaswani dataset.
  2. Observe the order of documents with tied scores.
  3. Repeat and re-observe several times.

Error message
No error message.

Expected behavior
I expect results to be ordered ascending by DocId as a secondary sort when scores result in ties.

Environment info
Operating System: Ubuntu 22.04.1
Compiler name: gcc
Compiler version: 9

@seanmacavaney seanmacavaney added the bug Something isn't working label Dec 31, 2022
@seanmacavaney
Copy link
Author

seanmacavaney commented Dec 31, 2022

I'd expect that modifying the following topk_queue method to consider DocId would solve this issue, but I am not sure about the efficiency implications, or if other places need to be changed as well.

    [[nodiscard]] constexpr static auto
    min_heap_order(entry_type const& lhs, entry_type const& rhs) noexcept -> bool
    {
        return lhs.first > rhs.first || lhs.second < rhs.second;
    }

@seanmacavaney
Copy link
Author

Whoops, actually more like:

return lhs.first == rhs.first ? lhs.second < rhs.second : lhs.first > rhs.first;

@elshize
Copy link
Member

elshize commented Jan 6, 2023

This is a tricky issue because of efficiency considerations.

But if the problem is the stability of of the final sort, how about we add this additional sort condition only at the very end? This, I believe, would still be reproducible, at least if we use the same algorithm, and add documents in the same order.

What do you think about this?

I would like to avoid adding another condition in the heap insert itself, because that might slow thinigs down unnecessarily.

@seanmacavaney
Copy link
Author

That sounds fair to me; I agree that every statement matters during the query processing itself.

The one caveat is that if there happens to be a tied score at the end of the ranked list (e.g., rank 1000 and rank 1001 have the same scores), the document returned at rank 1000 is non-deterministic. This feels like a fair concession, though.

@elshize
Copy link
Member

elshize commented Jan 6, 2023

BTW, I have this PR pending: #507 where I implement a custom pop-push operation instead of using std::pop_heap.

We could also implement our own pop_push (used only before the heap fills up), so that we have a fixed implementation across different compilation environments.

With those, at least the logic could be well-defined and documented, and it would be deterministic in the sense that if you run the same algorithm on the same data but different environments (say, me compiling and running with gcc/libstdc++ and you with clang/libc++, for example), the results should be the same.

But we wouldn't be able to guarantee that any algorithm would return the exact same results in that edge case of the same score around kth position, because they don't have the same exact sequence of heap inserts.

Does that make sense or am I missing something important?

@seanmacavaney
Copy link
Author

Sounds good to me! Thanks @elshize!

@elshize elshize self-assigned this Jan 24, 2023
elshize added a commit that referenced this issue Jan 25, 2023
The final sorting order is now by score (descending) and docid
(ascending). Furthermore, `std::push_heap` is replaced with our own
implementation to maintain consistency across standard libraries.
@elshize
Copy link
Member

elshize commented Jan 25, 2023

@seanmacavaney FYI, I opned a PR for this, if you'd like to give it a look.

elshize added a commit that referenced this issue Jan 27, 2023
The final sorting order is now by score (descending) and docid
(ascending). Furthermore, `std::push_heap` is replaced with our own
implementation to maintain consistency across standard libraries.
elshize added a commit that referenced this issue Feb 7, 2023
The final sorting order is now by score (descending) and docid
(ascending). Furthermore, `std::push_heap` is replaced with our own
implementation to maintain consistency across standard libraries.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants