[Snippets][CPU] Added external repacking via BrgemmCopyB #28179
base: master
Conversation
Force-pushed: c6c62b5 → 9e85ea1, 9e85ea1 → 2b536b7, f2523ef → b665659
The first part
```cpp
@@ -172,10 +173,48 @@ class Subgraph::SubgraphExecutor {
    inline void segfault_detector();
#endif

private:
    std::vector<MemoryPtr> reorder_inputs(const dnnl::stream& strm, const std::vector<MemoryPtr>& inMemPtrs);
#ifdef OPENVINO_ARCH_X86_64
```
Long `#ifdef` blocks significantly reduce readability. Do you think we can create different executors for x86-64 and ARM? Note that we can use variadic templates + perfect forwarding to avoid repeating these long argument lists in the constructors.
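A minimal sketch of the variadic-template suggestion, with hypothetical class and member names (the real `SubgraphExecutor` constructor arguments are not shown in this hunk): each arch-specific executor forwards its whole argument pack to the shared base, so the long parameter list is written only once.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical base holding the long, shared argument list.
struct SubgraphBaseExecutor {
    SubgraphBaseExecutor(std::string name, size_t tensor_rank)
        : m_name(std::move(name)), m_tensor_rank(tensor_rank) {}
    virtual ~SubgraphBaseExecutor() = default;
    std::string m_name;
    size_t m_tensor_rank;
};

// Arch-specific executors forward the pack to the base via perfect
// forwarding, so the argument list is not repeated per architecture.
struct X64Executor : SubgraphBaseExecutor {
    template <typename... Args>
    explicit X64Executor(Args&&... args)
        : SubgraphBaseExecutor(std::forward<Args>(args)...) {}
};

struct ArmExecutor : SubgraphBaseExecutor {
    template <typename... Args>
    explicit ArmExecutor(Args&&... args)
        : SubgraphBaseExecutor(std::forward<Args>(args)...) {}
};
```

If the base constructor later grows more parameters, only the base changes; the derived executors keep forwarding unchanged.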
```cpp
if (should_repacking_be_in_parallel())
    in_parallel_repack_inputs(inMemPtrs, indexes, ithr, call_args);
```
As far as I understand, `RepackingImplType` can't change after the executor is created, so there is no need to check it inside the caller. To minimize runtime overheads, we should create different callers for different `RepackingImplType` values.

Moreover, since we already have a builder, I would suggest (at least considering) exposing `RepackingImplType` as the executor's template parameter. We can then use `if constexpr (...)` to take advantage of compile-time resolved branches ==> enjoy zero runtime overheads.
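A sketch of the `if constexpr` idea, under the assumption that `RepackingImplType` is an enum fixed at build time (the enum values and the `Caller` type here are illustrative, not the PR's actual names):

```cpp
#include <cassert>

// Hypothetical enum mirroring the RepackingImplType discussed above.
enum class RepackingImplType { NONE, IN_PARALLEL };

template <RepackingImplType Impl>
struct Caller {
    // The branch is resolved at compile time: for NONE the repacking
    // path is not even compiled into this instantiation, so the hot
    // loop carries no runtime check at all.
    int exec(int work) const {
        if constexpr (Impl == RepackingImplType::IN_PARALLEL) {
            return work + 1;  // stands in for in_parallel_repack_inputs(...)
        } else {
            return work;      // plain path, no branch emitted
        }
    }
};
```

The builder would pick the instantiation once, e.g. returning a `Caller<RepackingImplType::IN_PARALLEL>` when repacking is needed.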
```cpp
uint8_t* repacked_ptr = get_external_scratchpad_ptr(ithr, in_idx) + dst_offset;

auto& offsets = m_repacked_offsets_by_threads.at(ithr)[in_idx];
```
Since we use `ithr` for indexing, why do we need an unordered map here? It looks like a `vector` could serve just as well, but faster.
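A sketch of the vector-based alternative: because `ithr` is a dense `0..nthreads-1` index, a vector sized once at executor construction gives the same lookup without hashing. The wrapper type and field names here are illustrative, not the PR's actual members.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical replacement for m_repacked_offsets_by_threads:
// outer index is the dense thread id, inner index is the input id.
struct RepackedOffsets {
    RepackedOffsets(size_t nthreads, size_t ninputs)
        : m_offsets_by_threads(nthreads, std::vector<size_t>(ninputs, 0)) {}

    size_t& at(size_t ithr, size_t in_idx) {
        return m_offsets_by_threads[ithr][in_idx];  // plain O(1) indexing
    }

    std::vector<std::vector<size_t>> m_offsets_by_threads;
};
```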
```cpp
for (size_t j = 0; j < indexes.size(); j++) {
    src_offset += src_offsets[j] * indexes[j];
    dst_offset += dst_offsets[j] * indexes[j];
}
```
This loop can't be vectorized because `indexes.size()` is not known at compile time. That's a pity, because we know that `indexes.size() == 6` in 96% of cases. Of course it's hardly a bottleneck, but it's quite easy to enable this optimization, again with templates 🙂: the indexes size must be a template parameter. It also makes sense because input ranks can't change at runtime, so there will likely be only one instantiation of this template.
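A sketch of what this could look like (the function name is made up; the real code accumulates two offsets, shown here as one for brevity): with the rank as a template parameter the trip count is a compile-time constant, so the compiler can fully unroll and vectorize the loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Rank is a compile-time constant, so this loop unrolls/vectorizes.
template <size_t Rank>
size_t dot_offsets(const std::array<size_t, Rank>& strides,
                   const std::array<size_t, Rank>& indexes) {
    size_t offset = 0;
    for (size_t j = 0; j < Rank; ++j)  // bound known at compile time
        offset += strides[j] * indexes[j];
    return offset;
}
```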
```cpp
uint8_t* repacked_ptr = get_external_scratchpad_ptr(ithr, in_idx) + dst_offset;

auto& offsets = m_repacked_offsets_by_threads.at(ithr)[in_idx];
if (offsets.count(src_offset) == 0) {
```
A similar concern: although an `unordered_map` access is O(1), the constant is rather large. It would be much faster to create a `vector<bool> is_repacked` (which is implemented as a bitset => takes very little space). The only question here is whether we can come up with a sufficiently dense hash function to avoid wasting too much space; it looks like `indexes[0] * indexes[1] * ... * indexes.back()` is good enough, especially considering that this type of executor is used only when the parallel work amount is small.
What do you think?
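A sketch of the `vector<bool>` idea, assuming the linearized index stays within a small bound known at construction (the type and method names are hypothetical): `std::vector<bool>` is bit-packed, so one flag per possible slot costs roughly one bit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical seen-set replacing the per-thread unordered_map:
// `total` would be the product of the iteration dimensions, which the
// review argues is small for this executor type.
struct RepackedFlags {
    explicit RepackedFlags(size_t total) : m_is_repacked(total, false) {}

    // Returns true the first time a slot is seen, marking it repacked;
    // false if this slot was already repacked by the thread.
    bool mark_if_new(size_t linear_idx) {
        if (m_is_repacked[linear_idx])
            return false;
        m_is_repacked[linear_idx] = true;
        return true;
    }

    std::vector<bool> m_is_repacked;  // bit-packed, ~total/8 bytes
};
```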
```cpp
uint8_t* repacked_ptr = get_external_scratchpad_ptr(ithr, in_idx) + dst_offset;

auto& offsets = m_repacked_offsets_by_threads.at(ithr)[in_idx];
```
Another question here: why do we need to check every repacked input via `in_idx`? I thought that either no inputs for this thread are repacked (it's the first time this thread processes this batch), or all inputs are repacked (this thread has already processed some data from this batch).
Can it be that the thread processed only a part of `m_repacked_inputs`?
Details:
Tickets:
TODO: