
Support Word Timestamps #38

Merged
merged 8 commits into from
Mar 2, 2024

Conversation

ZachNagengast
Contributor

@ZachNagengast ZachNagengast commented Feb 29, 2024

This is intended as an initial "functional" PR. Example code and usage guidelines will follow shortly, and the default models on Hugging Face will be updated to include the appropriate outputs. In the meantime, you can use this CLI script to test out the flow:

  1. Download tiny.en (currently the only model with alignment weights):

make download-model MODEL=tiny.en

  2. Transcribe with the --word-timestamps flag:

swift run transcribe --word-timestamps \
  --model-path "Models/whisperkit-coreml/openai_whisper-tiny.en" \
  --audio-path ~/Downloads/ted_60.wav \
  --report \
  --report-path ~/Downloads \
  --verbose

This outputs the following JSON:

https://gist.github.com/ZachNagengast/f36a751bc68a3b5f2c41ada8bcc33746
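As a quick illustration of how such output can be consumed, here is a minimal Python sketch that turns word-timing entries into simple subtitle cues. The field names (`word`, `start`, `end`, `probability`) are assumptions for illustration; the actual schema is in the linked gist.

```python
import json

# Hypothetical word-timing entries; the real schema is in the gist above.
sample = """
[
  {"word": " Hello", "start": 0.0, "end": 0.32, "probability": 0.98},
  {"word": " world", "start": 0.32, "end": 0.71, "probability": 0.95}
]
"""

def to_cues(words):
    """Convert (start, end) seconds into simple subtitle cue strings."""
    return [
        f"{w['start']:.2f} --> {w['end']:.2f}: {w['word'].strip()}"
        for w in words
    ]

for cue in to_cues(json.loads(sample)):
    print(cue)
```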

Resolves #2

@finnvoor
Contributor

It seems like skipSpecialTokens isn't respected when using word-level timestamps; I still get a bunch of <|22.42|> timing tokens. Not sure if this is intentional or not.


ZachNagengast and others added 2 commits February 29, 2024 09:11
@ZachNagengast
Contributor Author

@finnvoor Thanks for giving this an early look, these comments are super helpful. I'm really impressed you were able to get that video output working; thanks for sharing, it's fascinating to see it with your overlay.

It seems like skipSpecialTokens isn't respected when using word-level timestamps; I still get a bunch of <|22.42|> timing tokens. Not sure if this is intentional or not.

Open question: I agree that skipSpecialTokens should remove them, but I'm also wondering whether they should just be removed by default. I.e., perhaps someone wants special tokens in the text responses but not in the word timings.
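For illustration, a skipSpecialTokens-style option could be honored by dropping any word-timing entry whose text consists only of Whisper-style special tokens (which are wrapped in <|...|>, e.g. <|22.42|> or <|endoftext|>). This is a hedged sketch in Python, not WhisperKit's actual implementation:

```python
import re

# Whisper-style special tokens look like <|22.42|> or <|endoftext|>.
SPECIAL = re.compile(r"<\|[^|>]*\|>")

def filter_special(word_timings, skip_special_tokens=True):
    """Drop timing entries whose text is entirely special tokens.

    A sketch of the idea discussed above; `word_timings` is assumed to be
    a list of dicts with a "word" key.
    """
    if not skip_special_tokens:
        return word_timings
    return [w for w in word_timings if SPECIAL.sub("", w["word"]).strip()]
```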

The word timings seem to be one word behind where they should be. I could be doing something wrong, but just matching the WordTiming start and end to video + audio, the timing seems to be super precise but exactly one word behind. You can see this in the video if you unmute and watch the highlighted words.

This observation appears to be correct; I'll investigate. As mentioned in another comment, the punctuation for contractions is also a little off because it's being combined with the next word instead of the current one. I'll continue to refine this, but I suspect these issues are related. Will report back soon.

@atiorh
Contributor

atiorh commented Feb 29, 2024

This is amazing @finnvoor, thanks for the review!

@ZachNagengast
Contributor Author

ZachNagengast commented Mar 1, 2024

@finnvoor Just pushed a fix for some of the issues you reported.

Here's a short clip of the properly aligned word subtitles; as suspected, they were off by one previously.

This latest commit should also handle contractions much better.

(video attachment: dry_ice_trimmed.mp4)

Based on your feedback (and anyone else's) I will also adjust how the special tokens are handled in the word timestamps too.

@finnvoor
Contributor

finnvoor commented Mar 1, 2024

Looks much better now!

(video attachment: Detail_202403010956142.mp4)

I can't imagine there's much use for having word-level timings for special tokens, since they aren't really associated with time in the audio. I think every use of Whisper I've seen has filtered out special tokens anyway.

@ZachNagengast ZachNagengast requested a review from atiorh March 2, 2024 02:49
@ZachNagengast
Contributor Author

@finnvoor Thanks for the feedback; word timestamps will no longer include special tokens. I also added some of the heuristics from the OpenAI reference repo. I did notice that large-v3 was giving some fairly off results (lots of zero-length words) while v2 was fine, so that's something to keep in mind.
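One common heuristic for anomalies like zero-length words is to clamp each word's duration relative to the median duration. The sketch below is a rough illustration of that kind of post-processing (the floor and ratio values are invented for the example), not the exact heuristic from the OpenAI reference repo or WhisperKit:

```python
from statistics import median

def clamp_word_durations(words, floor=0.02, max_ratio=2.0):
    """Clamp each word's duration into [floor, max_ratio * median duration].

    A rough sketch of median-based duration sanitizing; it adjusts only the
    "end" time and may slightly overlap the next word's start.
    """
    durations = [w["end"] - w["start"] for w in words if w["end"] > w["start"]]
    if not durations:
        return words
    cap = max_ratio * median(durations)
    out = []
    for w in words:
        w = dict(w)
        dur = w["end"] - w["start"]
        if dur < floor:
            w["end"] = w["start"] + floor   # stretch zero-length words
        elif dur > cap:
            w["end"] = w["start"] + cap     # truncate outliers
        out.append(w)
    return out
```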

Contributor

@atiorh atiorh left a comment


Great work and ready to merge!
