-
-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
topk_queue::finalize sorts by both score and ID (#508) #515
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,13 @@ namespace pisa { | |
/// min element. Because it is a binary heap, the elements are not sorted; | ||
/// use `finalize()` member function to sort it before accessing it with | ||
/// `topk()`. | ||
/// | ||
/// Note that `finalize()` breaks ties between entries with equal scores by | ||
/// sorting them by document ID. However, this is not true for insertion, and | ||
/// thus it may happen that a document with identical score as the k-th | ||
/// document in the heap but with a lower document ID is _not_ in the top-k | ||
/// results. In such case, which entry makes it to top k is determined by the | ||
/// order in which elements are pushed to the heap. | ||
struct topk_queue { | ||
using entry_type = std::pair<Score, DocId>; | ||
|
||
|
@@ -50,7 +57,7 @@ struct topk_queue { | |
} | ||
m_q.emplace_back(score, docid); | ||
if (PISA_UNLIKELY(m_q.size() <= m_k)) { | ||
std::push_heap(m_q.begin(), m_q.end(), min_heap_order); | ||
push_heap(m_q.begin(), m_q.end()); | ||
if (PISA_UNLIKELY(m_q.size() == m_k)) { | ||
m_effective_threshold = m_q.front().first; | ||
} | ||
|
@@ -69,17 +76,25 @@ struct topk_queue { | |
|
||
/// Sorts the results in the heap container in the descending score order. | ||
/// | ||
/// After calling this function, the heap should be no longer modified, as | ||
/// the heap order will not be preserved. | ||
/// If multiple entries have equal score, they are sorted by document ID. Notice that this only | ||
/// happens in the finalization step; due to performance considerations, inserting is done with | ||
/// score-only order. After calling this function, the heap should be no longer modified, as the | ||
/// heap order will not be preserved. | ||
void finalize() | ||
{ | ||
std::sort_heap(m_q.begin(), m_q.end(), min_heap_order); | ||
size_t size = std::lower_bound( | ||
m_q.begin(), | ||
m_q.end(), | ||
0, | ||
[](std::pair<Score, DocId> l, Score r) { return l.first > r; }) | ||
- m_q.begin(); | ||
auto sort_order = [](auto const& lhs, auto const& rhs) { | ||
if (lhs.first == rhs.first) { | ||
return lhs.second < rhs.second; | ||
} | ||
return lhs.first > rhs.first; | ||
}; | ||
// We have to do a full sort because it is not the exact same ordering as when pushing | ||
// elements to the heap. | ||
std::sort(m_q.begin(), m_q.end(), sort_order); | ||
|
||
auto search_order = [](auto entry, auto score) { return entry.first > score; }; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could add a comment here to explain what this is doing? I know this is unmodified behaviour, but since you are on it would be better to document the code. |
||
auto first_zero_score = std::lower_bound(m_q.begin(), m_q.end(), 0, search_order); | ||
std::size_t size = std::distance(m_q.begin(), first_zero_score); | ||
m_q.resize(size); | ||
} | ||
|
||
|
@@ -127,12 +142,6 @@ struct topk_queue { | |
[[nodiscard]] auto size() const noexcept -> std::size_t { return m_q.size(); } | ||
|
||
private: | ||
[[nodiscard]] constexpr static auto | ||
min_heap_order(entry_type const& lhs, entry_type const& rhs) noexcept -> bool | ||
{ | ||
return lhs.first > rhs.first; | ||
} | ||
|
||
using entry_iterator_type = typename std::vector<entry_type>::iterator; | ||
|
||
/// Sifts down the top element of the heap in `[first, last)`. | ||
|
@@ -172,6 +181,26 @@ struct topk_queue { | |
} | ||
} | ||
|
||
// We use our own function (as opposed to `std::heap_push`), to ensure that | ||
// heap implementation is consistent across standard libraries. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What does There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Means that if I compile on libstdc++ 12 and you compile on libc++ 10, then both of us have the same logic, and thus if you run the same exact algorithm on the same exact collection, etc., you will get the same results. With Using pair ordering (score, doc) woulds of course solve all of that, but that's not efficient. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Understand! Makes sense. Right now we are not aware of differences in the implementation of libc++ 10 and libstdc++ 12, right? My major objection to this approach is that if we stick to this paradigm, we would need to reimplement the entire STL. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Normally, I would agree, but this is somewhat special circumstances. Technically speaking, we probably would want to strictly define the order (score + docid) but we're not because of efficiency. But also we are already implementing the more complex pop operation, so we might as well finish the job. It's very little effort for possibly a significant improvement for someone down the line. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's put it this way: you're doing research, and have some results, and then you recompile after system upgrade or something, and it doesn't match; it's not life changing, but that's very annoying imo, and I personally see no reason not to fix that with one little function. It could save someone a headache down the line. |
||
void push_heap(entry_iterator_type first, entry_iterator_type last) | ||
{ | ||
std::size_t hole_idx = std::distance(first, last) - 1; | ||
std::size_t top_idx = 0; | ||
auto cmp = [](entry_iterator_type const& lhs, entry_type const& rhs) { | ||
return lhs->first > rhs.first; | ||
}; | ||
auto value = *std::next(first, hole_idx); | ||
|
||
auto parent = (hole_idx - 1) / 2; | ||
while (hole_idx > top_idx && cmp(first + parent, value)) { | ||
*(first + hole_idx) = *(first + parent); | ||
hole_idx = parent; | ||
parent = (hole_idx - 1) / 2; | ||
} | ||
*(first + hole_idx) = value; | ||
} | ||
|
||
std::size_t m_k; | ||
float m_initial_threshold; | ||
std::vector<entry_type> m_q; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are switching to
sort
then why don't we just usestable_sort
on the scores such that we maintain as second sorting condition the insertion time?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure
stable_sort
will preserve that. I actually haven't analyzed this in detail, but are you saying that given our implementation, we can be sure that the more recently inserted entry is always above in the heap or at the same level but on the left side?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that
heap_sort
creates some issues with reproducibility since two documents with equal score have no guarantees on the order they will appear in the list of results.Stable sort will keep the relative order of elements with equivalent values, fixing the reproducibility issue.
I prefer this solution because:
(1) it is easier and less error prone
(2) it resolves the issue with docID ordering not maintained while inserting and only done at the end, which could cause the problem described in the code comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what you're suggesting is to change the semantic: instead of ensuring that within the top-k results ties are broken by the docID order, you're proposing to not give any such guarantee, but rather say that the order of elements of with the same score is undefined, but deterministic, i.e., reproducible across multiple runs.
Do I understand correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll have to be more explicit here, not sure which issue we're resolving. From my understanding, you're solution simply changes the semantic of what we do, but maybe I'm missing something, so if you could be a bit more clear about what you mean and how it solves it, that would help.