f0 estimation in silent period #60
Comments
Hi, this is expected because the CREPE model is trained only for pitch accuracy, not voice activity detection. Although you could devise a heuristic based on the confidence metric, note that the model has never seen silence during training and may not classify all silence as "low confidence". Detecting exactly when a pitched sound is present is nontrivial and is studied under keywords such as voice activity detection (related: #47). Regarding the pitch inaccuracy: while it could be an issue with the sample rate (the model expects 16 kHz audio), or the model may simply be making a wrong prediction, I believe the 220-260 Hz pitch range you see for a female voice is normal, as it can depend on the individual and on the prosody of the utterance you're using. I, for example, am male and can produce anywhere between 70 Hz and 400 Hz. To make sure, you can cross-validate using other pitch tracking methods such as pYIN, SWIPE, or SPICE.
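As a rough illustration of such a heuristic, one could mask out frames whose CREPE confidence falls below a threshold. This is only a sketch: the 0.5 cutoff is an arbitrary assumption, not a calibrated value, and as noted above silence is not guaranteed to produce low confidence.

```python
import numpy as np

def gate_by_confidence(frequency, confidence, threshold=0.5):
    """Mask f0 estimates whose confidence falls below a threshold.

    frequency, confidence: 1-D arrays of per-frame CREPE outputs.
    Returns a copy of `frequency` with low-confidence frames set to NaN.
    """
    frequency = np.asarray(frequency, dtype=float).copy()
    confidence = np.asarray(confidence, dtype=float)
    frequency[confidence < threshold] = np.nan
    return frequency
```

The masked frames can then be ignored when plotting or computing statistics.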
During silence the model will still do its best to detect whatever is in the silent segment, which might just be static noise from the microphone, so the output may not always be the same during silence. Again, the model wasn't trained on silent audio and will just try to extrapolate what it knows about pitched signals, so its output during silence is not reliable.
If you have the voicing labels, or a good heuristic to get them, you can post-process the outputs to suppress whatever was predicted during silence. This works because the model is fully convolutional, so slicing out the silent audio won't make a big difference.
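One simple voicing heuristic of this kind is gating by frame energy. The sketch below zeroes out f0 frames whose audio is near-silent; the -40 dBFS floor is an assumed threshold (not a tuned value), and frames are aligned to CREPE's default 10 ms hop.

```python
import numpy as np

def suppress_silent_frames(audio, f0, sr=16000, step_ms=10, db_floor=-40.0):
    """Zero out f0 frames whose corresponding audio frame is near-silent.

    audio: 1-D waveform at `sr` Hz; f0: per-frame pitch estimates spaced
    `step_ms` apart. `db_floor` is the assumed silence threshold (dBFS).
    """
    hop = int(sr * step_ms / 1000)
    f0 = np.asarray(f0, dtype=float).copy()
    for i in range(len(f0)):
        frame = audio[i * hop:(i + 1) * hop]
        rms = np.sqrt(np.mean(frame ** 2)) if len(frame) else 0.0
        db = 20 * np.log10(rms) if rms > 0 else -np.inf
        if db < db_floor:
            f0[i] = 0.0  # treat this frame as unvoiced
    return f0
```

This is only a fallback when no voicing labels are available; labels, if you have them, are the more reliable option.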
Depending on how the signal was sliced, the 1024-sample segments might come from different locations, and the model's output can be sensitive to how they're sliced, especially on the silent portions. It will help diagnose the issue if you zero out the predictions during silence. I also think it will help to plot all graphs with respect to time (in seconds) rather than samples; that will make any misalignments between the audio and the annotations easier to spot.
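For the time-axis suggestion, a minimal sketch (assuming CREPE's default 10 ms step size) converts frame indices to seconds so audio, f0, and annotations can share one axis:

```python
import numpy as np

def frame_times(n_frames, step_ms=10.0):
    """Time axis in seconds for per-frame outputs spaced `step_ms` apart."""
    return np.arange(n_frames) * step_ms / 1000.0
```

Plotting f0 against `frame_times(len(f0))` and the waveform against `np.arange(len(audio)) / sr` puts both on the same seconds axis.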
Hi, I have a question about using CREPE to estimate f0 on a human-voice dataset. I found that the f0 fluctuates even during the audio's silent periods. Is this normal? I am not sure whether f0 should be constant during silence. Is it because CREPE is not suitable for estimating human voice?
I also found in the literature that female f0 should be around 186 Hz, but I got around 220 Hz here, and sometimes even higher (300 Hz).