-
Notifications
You must be signed in to change notification settings - Fork 345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Word Timestamps #38
Conversation
bfaecc9
to
e5b3cc5
Compare
It seems like |
This comment was marked as resolved.
This comment was marked as resolved.
Co-authored-by: Finn Voorhees <[email protected]>
Co-authored-by: Finn Voorhees <[email protected]>
@finnvoor Thanks for giving this an early look, these comments are super helpful. I'm really impressed you were able to get that video output working, thanks for sharing - fascinating to see it with your overlay.
Open question - I agree that skipSpecialTokens should remove them, but also wondering if they should just be removed by default? I.e. perhaps someone wants special tokens in the text responses, but not in the word timings.
This appears to be correct, will investigate. Also mentioned in another comment the punctuations for contractions are a little off too because it's being combined with the next word instead of the current word. Will continue to refine this but I suspect these are related. Will report back soon. |
This is amazing @finnvoor, thanks for the review! |
@finnvoor Just pushed a fix for some of the issues you reported. Here's a short clip of the properly aligned word subtitles, as suspected they were 1 off previously. This latest commit should also handle contractions much better. dry_ice_trimmed.mp4Based on your feedback (and anyone else's) I will also adjust how the special tokens are handled in the word timestamps too. |
looks much better now! Detail_202403010956142.mp4I can't imagine there's much use for having word level timings for special tokens, since they aren't really associated with time in audio. I think every use of Whisper I've seen has filtered out special tokens anyway |
@finnvoor Thanks for the feedback, these will no longer include special tokens. Also added some of the heuristics from the openai reference repo. I did notice that large-v3 was giving some pretty off results (lots of 0s length words) but v2 was fine, so something to keep in mind. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work and ready to merge!
- All model versions in https://huggingface.co/datasets/argmaxinc/whisperkit-coreml support
--word-timestamps
now - I reran quality evaluations and the output quality is unaffected by these changes.
- Performance regresses slightly but is expected and we will address that in the next release
This is intended as an initial "functional" PR. Example code and usage guidelines will be coming as a fast-follow. Will also update the default models on huggingface to have the appropriate outputs. In the meantime, you can use this CLI script to test out the flow:
--word-timestamps
flagOutputs the following json:
https://gist.github.com/ZachNagengast/f36a751bc68a3b5f2c41ada8bcc33746
Resolves #2