Thank you very much for your excellent work.
One thing I am confused about is the definition of the cross-modal loss and the co-separation loss. In `train.py`, why are a random number and `opt.gt_percentage` used to select which audio embedding (`audio_embedding_A1_pred` or `audio_embedding_A1_gt`) goes into the loss? According to the method in the paper, shouldn't the predicted features be used?
Because the separated sounds can be of low quality, especially in the early stages of training, their embeddings may be unreliable. The ground-truth audio and its embeddings, however, are always reliable. This step simply ensures that the cross-modal loss and the co-separation loss learn from meaningful feature embeddings, which in turn yield meaningful distance metrics in the embedding space. The part of the loss computed on the predicted embeddings is then what actually drives the separation learning.
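For concreteness, here is a minimal sketch of what that selection step might look like. Only the names `opt.gt_percentage`, `audio_embedding_A1_pred`, and `audio_embedding_A1_gt` come from the question above; the helper function, tensor shapes, and surrounding code are hypothetical stand-ins, not the repository's actual implementation.

```python
import random
import torch

def select_embedding(pred_embedding, gt_embedding, gt_percentage):
    """With probability gt_percentage, fall back to the reliable
    ground-truth embedding; otherwise use the predicted one."""
    if random.random() < gt_percentage:
        return gt_embedding
    return pred_embedding

# Dummy tensors standing in for the real embeddings:
audio_embedding_A1_pred = torch.randn(8, 128)  # from the separated (predicted) audio
audio_embedding_A1_gt = torch.randn(8, 128)    # from the ground-truth audio

# e.g. opt.gt_percentage = 0.5: early-training losses see reliable
# ground-truth embeddings about half the time.
embedding = select_embedding(audio_embedding_A1_pred,
                             audio_embedding_A1_gt,
                             gt_percentage=0.5)
```

The design intuition is the same as in the answer above: mixing in ground-truth embeddings keeps the distance metric meaningful while the separator is still weak, and the samples that use the predicted embeddings are the ones that backpropagate useful gradients into the separation network.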