
Support Word Timestamps #38

Merged
merged 8 commits into from
Mar 2, 2024

Conversation

ZachNagengast
Contributor

@ZachNagengast ZachNagengast commented Feb 29, 2024

This is intended as an initial "functional" PR. Example code and usage guidelines will follow shortly, and the default models on Hugging Face will be updated to include the appropriate outputs. In the meantime, you can use this CLI script to test out the flow:

  1. Download tiny.en (currently the only model with alignment weights):

make download-model MODEL=tiny.en

  2. Transcribe with the --word-timestamps flag:

swift run transcribe --word-timestamps \
  --model-path "Models/whisperkit-coreml/openai_whisper-tiny.en" \
  --audio-path ~/Downloads/ted_60.wav \
  --report \
  --report-path ~/Downloads \
  --verbose

This outputs the following JSON:

https://gist.github.com/ZachNagengast/f36a751bc68a3b5f2c41ada8bcc33746
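As a quick illustration of how such output can be consumed, here is a minimal Python sketch that turns word-timing entries into simple subtitle cues. The field names (`word`, `start`, `end`, `probability`) are assumptions for illustration; the actual schema is in the linked gist.

```python
import json

# Hypothetical word-timing entries; the real schema is in the gist above.
sample = """
[
  {"word": " Hello", "start": 0.0, "end": 0.32, "probability": 0.98},
  {"word": " world", "start": 0.32, "end": 0.71, "probability": 0.95}
]
"""

def to_cues(words):
    """Convert (start, end) seconds into simple subtitle cue strings."""
    return [
        f"{w['start']:.2f} --> {w['end']:.2f}: {w['word'].strip()}"
        for w in words
    ]

for cue in to_cues(json.loads(sample)):
    print(cue)
```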

Resolves #2

@finnvoor
Contributor

It seems like skipSpecialTokens isn't respected when using word-level timestamps; I still get a bunch of <|22.42|> timing tokens. Not sure if this is intentional or not.


ZachNagengast and others added 2 commits February 29, 2024 09:11
@ZachNagengast
Contributor Author

@finnvoor Thanks for giving this an early look, these comments are super helpful. I'm really impressed you were able to get that video output working; thanks for sharing, it's fascinating to see it with your overlay.

It seems like skipSpecialTokens isn't respected when using word-level timestamps; I still get a bunch of <|22.42|> timing tokens. Not sure if this is intentional or not.

Open question: I agree that skipSpecialTokens should remove them, but I'm also wondering whether they should just be removed by default. I.e., perhaps someone wants special tokens in the text responses but not in the word timings.
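For illustration, a skipSpecialTokens-style option could be honored by dropping any word-timing entry whose text consists only of Whisper-style special tokens (which are wrapped in <|...|>, e.g. <|22.42|> or <|endoftext|>). This is a hedged sketch in Python, not WhisperKit's actual implementation:

```python
import re

# Whisper-style special tokens look like <|22.42|> or <|endoftext|>.
SPECIAL = re.compile(r"<\|[^|>]*\|>")

def filter_special(word_timings, skip_special_tokens=True):
    """Drop timing entries whose text is entirely special tokens.

    A sketch of the idea discussed above; `word_timings` is assumed to be
    a list of dicts with a "word" key.
    """
    if not skip_special_tokens:
        return word_timings
    return [w for w in word_timings if SPECIAL.sub("", w["word"]).strip()]
```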

The word timings seem to be one word behind where they should be. I could be doing something wrong, but just matching the WordTiming start and end to video + audio, the timing seems to be super precise but exactly one word behind. You can see this in the video if you unmute and watch the highlighted words.

This observation appears to be correct; I'll investigate. As mentioned in another comment, the punctuation for contractions is also a little off because it's being combined with the next word instead of the current one. I'll continue to refine this, but I suspect these issues are related. Will report back soon.

@atiorh
Contributor

atiorh commented Feb 29, 2024

This is amazing @finnvoor, thanks for the review!

@ZachNagengast
Contributor Author

ZachNagengast commented Mar 1, 2024

@finnvoor Just pushed a fix for some of the issues you reported.

Here's a short clip of the properly aligned word subtitles; as suspected, they were off by one previously.

This latest commit should also handle contractions much better.

(video attachment: dry_ice_trimmed.mp4)

Based on your feedback (and anyone else's) I will also adjust how the special tokens are handled in the word timestamps too.

@finnvoor
Contributor

finnvoor commented Mar 1, 2024

Looks much better now!

(video attachment: Detail_202403010956142.mp4)

I can't imagine there's much use for having word-level timings for special tokens, since they aren't really associated with time in the audio. I think every use of Whisper I've seen has filtered out special tokens anyway.

@ZachNagengast ZachNagengast requested a review from atiorh March 2, 2024 02:49
@ZachNagengast
Contributor Author

@finnvoor Thanks for the feedback; word timestamps will no longer include special tokens. I also added some of the heuristics from the OpenAI reference repo. I did notice that large-v3 was giving some fairly off results (lots of zero-length words) while v2 was fine, so that's something to keep in mind.
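One common heuristic for anomalies like zero-length words is to clamp each word's duration relative to the median duration. The sketch below is a rough illustration of that kind of post-processing (the floor and ratio values are invented for the example), not the exact heuristic from the OpenAI reference repo or WhisperKit:

```python
from statistics import median

def clamp_word_durations(words, floor=0.02, max_ratio=2.0):
    """Clamp each word's duration into [floor, max_ratio * median duration].

    A rough sketch of median-based duration sanitizing; it adjusts only the
    "end" time and may slightly overlap the next word's start.
    """
    durations = [w["end"] - w["start"] for w in words if w["end"] > w["start"]]
    if not durations:
        return words
    cap = max_ratio * median(durations)
    out = []
    for w in words:
        w = dict(w)
        dur = w["end"] - w["start"]
        if dur < floor:
            w["end"] = w["start"] + floor   # stretch zero-length words
        elif dur > cap:
            w["end"] = w["start"] + cap     # truncate outliers
        out.append(w)
    return out
```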

Contributor

@atiorh atiorh left a comment


Great work and ready to merge!
