Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

topk_queue::finalize sorts by both score and ID (#508) #515

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
61 changes: 45 additions & 16 deletions include/pisa/topk_queue.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,13 @@ namespace pisa {
/// min element. Because it is a binary heap, the elements are not sorted;
/// use `finalize()` member function to sort it before accessing it with
/// `topk()`.
///
/// Note that `finalize()` breaks ties between entries with equal scores by
/// sorting them by document ID. However, this is not true for insertion, and
/// thus it may happen that a document with identical score as the k-th
/// document in the heap but with a lower document ID is _not_ in the top-k
/// results. In such case, which entry makes it to top k is determined by the
/// order in which elements are pushed to the heap.
struct topk_queue {
using entry_type = std::pair<Score, DocId>;

Expand Down Expand Up @@ -50,7 +57,7 @@ struct topk_queue {
}
m_q.emplace_back(score, docid);
if (PISA_UNLIKELY(m_q.size() <= m_k)) {
std::push_heap(m_q.begin(), m_q.end(), min_heap_order);
push_heap(m_q.begin(), m_q.end());
if (PISA_UNLIKELY(m_q.size() == m_k)) {
m_effective_threshold = m_q.front().first;
}
Expand All @@ -69,17 +76,25 @@ struct topk_queue {

/// Sorts the results in the heap container in the descending score order.
///
/// After calling this function, the heap should be no longer modified, as
/// the heap order will not be preserved.
/// If multiple entries have equal score, they are sorted by document ID. Notice that this only
/// happens in the finalization step; due to performance considerations, inserting is done with
/// score-only order. After calling this function, the heap should be no longer modified, as the
/// heap order will not be preserved.
void finalize()
{
std::sort_heap(m_q.begin(), m_q.end(), min_heap_order);
size_t size = std::lower_bound(
m_q.begin(),
m_q.end(),
0,
[](std::pair<Score, DocId> l, Score r) { return l.first > r; })
- m_q.begin();
auto sort_order = [](auto const& lhs, auto const& rhs) {
if (lhs.first == rhs.first) {
return lhs.second < rhs.second;
}
return lhs.first > rhs.first;
};
// We have to do a full sort because it is not the exact same ordering as when pushing
// elements to the heap.
std::sort(m_q.begin(), m_q.end(), sort_order);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are switching to sort then why don't we just use stable_sort on the scores such that we maintain as second sorting condition the insertion time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure stable_sort will preserve that. I actually haven't analyzed this in detail, but are you saying that given our implementation, we can be sure that the more recently inserted entry is always above in the heap or at the same level but on the left side?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that heap_sort creates some issues with reproducibility since two documents with equal score have no guarantees on the order they will appear in the list of results.
Stable sort will keep the relative order of elements with equivalent values, fixing the reproducibility issue.

I prefer this solution because:
(1) it is easier and less error prone
(2) it resolves the issue with docID ordering not maintained while inserting and only done at the end, which could cause the problem described in the code comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what you're suggesting is to change the semantic: instead of ensuring that within the top-k results ties are broken by the docID order, you're proposing to not give any such guarantee, but rather say that the order of elements of with the same score is undefined, but deterministic, i.e., reproducible across multiple runs.

Do I understand correctly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it resolves the issue with docID ordering not maintained while inserting and only done at the end, which could cause the problem described in the code comment

You'll have to be more explicit here, not sure which issue we're resolving. From my understanding, you're solution simply changes the semantic of what we do, but maybe I'm missing something, so if you could be a bit more clear about what you mean and how it solves it, that would help.


auto search_order = [](auto entry, auto score) { return entry.first > score; };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could add a comment here to explain what this is doing? I know this is unmodified behaviour, but since you are on it would be better to document the code.

auto first_zero_score = std::lower_bound(m_q.begin(), m_q.end(), 0, search_order);
std::size_t size = std::distance(m_q.begin(), first_zero_score);
m_q.resize(size);
}

Expand Down Expand Up @@ -127,12 +142,6 @@ struct topk_queue {
[[nodiscard]] auto size() const noexcept -> std::size_t { return m_q.size(); }

private:
[[nodiscard]] constexpr static auto
min_heap_order(entry_type const& lhs, entry_type const& rhs) noexcept -> bool
{
return lhs.first > rhs.first;
}

using entry_iterator_type = typename std::vector<entry_type>::iterator;

/// Sifts down the top element of the heap in `[first, last)`.
Expand Down Expand Up @@ -172,6 +181,26 @@ struct topk_queue {
}
}

// We use our own function (as opposed to `std::heap_push`), to ensure that
// heap implementation is consistent across standard libraries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does to ensure that heap implementation is consistent across standard libraries mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Means that if I compile on libstdc++ 12 and you compile on libc++ 10, then both of us have the same logic, and thus if you run the same exact algorithm on the same exact collection, etc., you will get the same results.

With std::push, the implementation may change from version to version and implementation to implementation (LLVM/GNU/MS/Apple) so it's not necessarily reproducible across environments.

Using pair ordering (score, doc) woulds of course solve all of that, but that's not efficient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understand! Makes sense. Right now we are not aware of differences in the implementation of libc++ 10 and libstdc++ 12, right?

My major objection to this approach is that if we stick to this paradigm, we would need to reimplement the entire STL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My major objection to this approach is that if we stick to this paradigm, we would need to reimplement the entire STL

Normally, I would agree, but this is somewhat special circumstances. Technically speaking, we probably would want to strictly define the order (score + docid) but we're not because of efficiency. But also we are already implementing the more complex pop operation, so we might as well finish the job. It's very little effort for possibly a significant improvement for someone down the line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put it this way: you're doing research, and have some results, and then you recompile after system upgrade or something, and it doesn't match; it's not life changing, but that's very annoying imo, and I personally see no reason not to fix that with one little function. It could save someone a headache down the line.

void push_heap(entry_iterator_type first, entry_iterator_type last)
{
std::size_t hole_idx = std::distance(first, last) - 1;
std::size_t top_idx = 0;
auto cmp = [](entry_iterator_type const& lhs, entry_type const& rhs) {
return lhs->first > rhs.first;
};
auto value = *std::next(first, hole_idx);

auto parent = (hole_idx - 1) / 2;
while (hole_idx > top_idx && cmp(first + parent, value)) {
*(first + hole_idx) = *(first + parent);
hole_idx = parent;
parent = (hole_idx - 1) / 2;
}
*(first + hole_idx) = value;
}

std::size_t m_k;
float m_initial_threshold;
std::vector<entry_type> m_q;
Expand Down
22 changes: 22 additions & 0 deletions test/test_topk_queue.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,28 @@ auto kth(std::vector<float> scores, int k) -> float
return scores.at(k - 1);
}

auto sort_order = [](auto const& lhs, auto const& rhs) {
if (lhs.first == rhs.first) {
return lhs.second < rhs.second;
}
return lhs.first > rhs.first;
};

TEST_CASE("Top-k ordering", "[topk_queue][prop]")
{
SECTION("Final elements are always sorted")
{
check([] {
auto [scores, docids] = *gen_postings(10, 1000);

pisa::topk_queue topk(10);
accumulate(topk, scores, docids);
topk.finalize();
REQUIRE(std::is_sorted(topk.topk().begin(), topk.topk().end(), sort_order));
});
}
}

TEST_CASE("Threshold", "[topk_queue][prop]")
{
SECTION("When initial = 0.0, the final threshold is the k-th score")
Expand Down