ResNet backbone vs. mask pooling #1634
Comments
Sorry for the late response.
Thanks. Will also look into this myself.
@hbredin @mengjie-du Hello, I recently realized that this could be avoided, and while searching through the issues I found that you had already started a discussion. Basically, I've slightly changed the for-loop in
I have two questions:
Thanks.
How do you skip it and reduce the latency? Could you please share your code so I can study it?
Hey @nikosanto13, thanks for your message. The main reason is a lack of resources (understand: time) on my side. Note that the solution suggested by @mengjie-du (splitting the model in two parts) has been partially implemented here already. I am not sure I'll be able to prioritize reviewing PRs on this particular aspect in the near future, though...
@hbredin I see, thanks for the update. I'll create a fork where I'll finish the partial implementation for
Anyways, I'll be glad to contribute if you find time in the future to welcome PRs on that front. Btw, props for the project, it helped me a ton. @foreverhell stay tuned for the fork. I'll push the changes there.
It has been noticed that the efficiency of the 3.1 pipeline suffers from speaker embedding inference. With the default config, every 10s chunk goes through the embedding model 3 times. Splitting the embedding model into the ResNet backbone and the mask pooling proves effective: each chunk then passes through the backbone only once, which brought an almost 3x speedup in my experiment. Furthermore, a cached inference strategy helps a lot as well, given the default overlap ratio of 90% between chunks.
Originally posted by @mengjie-du in #1621 (comment)
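To make the idea above concrete, here is a minimal sketch of the backbone / mask pooling split. It is not pyannote.audio code: `backbone`, `mask_pooling`, `embed_naive`, and `embed_split` are hypothetical names, the "backbone" is a toy NumPy projection standing in for the ResNet, and the three masks per chunk are an assumption matching the "3 times" mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))  # toy weights standing in for the ResNet backbone
calls = {"backbone": 0}           # counts how often the expensive part runs


def backbone(chunk):
    """Hypothetical expensive step: (frames, 16) -> frame-level features (frames, 8)."""
    calls["backbone"] += 1
    return np.tanh(chunk @ W)


def mask_pooling(features, mask):
    """Cheap step: mask-weighted mean and std over frames -> (16,) embedding."""
    weights = mask / max(mask.sum(), 1e-8)
    mean = weights @ features
    var = weights @ (features - mean) ** 2
    return np.concatenate([mean, np.sqrt(var + 1e-8)])


def embed_naive(chunk, masks):
    # Runs the full model (backbone + pooling) once per speaker mask.
    return np.stack([mask_pooling(backbone(chunk), m) for m in masks])


def embed_split(chunk, masks):
    # Runs the backbone once, then pools the same features with every mask.
    features = backbone(chunk)
    return np.stack([mask_pooling(features, m) for m in masks])


chunk = rng.standard_normal((100, 16))                # one chunk of toy input frames
masks = (rng.random((3, 100)) > 0.5).astype(float)    # assumed: 3 speaker masks per chunk

calls["backbone"] = 0
naive = embed_naive(chunk, masks)
naive_calls = calls["backbone"]

calls["backbone"] = 0
split = embed_split(chunk, masks)
split_calls = calls["backbone"]

print(naive_calls, split_calls, np.allclose(naive, split))
```

Both variants return the same embeddings here because the toy backbone does not depend on the mask; only the number of backbone passes changes. Whether that holds for a real model depends on the mask being applied only at the pooling stage.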
Hey @mengjie-du, that's a nice idea. Would you contribute this to the pyannote.audio codebase? I tried to send you an email at the address mentioned in this paper but received an error message in return -- so I am taking my chance here.