
Wavesplit 2021 #454

Open
popcornell wants to merge 2 commits into master
Conversation

@popcornell (Collaborator) commented Feb 24, 2021

Should work now with oracle embeddings. I made a separate pull request because it is faster.
See also the previous pull request from last year: #70.
Many thanks to Neil (@lienz) again.
Help from anyone is very welcome, as I am currently very GPU-constrained (and also time-constrained).

mixtures, oracle_s, oracle_ids = batch
b, n_spk, frames = oracle_s.size()

# spk_vectors = self.model.get_speaker_vectors(mixtures)
@popcornell (Collaborator Author)

Here and in the validation step, I use oracle embeddings for now and no speaker stack.
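For readers following along, a minimal self-contained sketch (not the PR's actual code) of what "oracle embeddings" could look like here, assuming a learned nn.Embedding table keyed by the oracle speaker IDs from the batch:

import torch
import torch.nn as nn

# Hypothetical illustration: look up speaker vectors by oracle ID instead of
# predicting them with the speaker stack. Shapes follow the snippet above.
n_speakers, spk_vec_dim = 101, 512
spk_embeddings = nn.Embedding(n_speakers, spk_vec_dim)

mixtures = torch.randn(4, 8000)                    # (batch, samples)
oracle_ids = torch.randint(0, n_speakers, (4, 2))  # (batch, n_spk)

spk_vectors = spk_embeddings(oracle_ids)           # (batch, n_spk, spk_vec_dim)
# These vectors would then be fed to the separation stack in place of
# self.model.get_speaker_vectors(mixtures).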

model = Wavesplit(
    conf["masknet"]["n_src"],
    {"embed_dim": 512},
    {"embed_dim": 512, "spk_vec_dim": 512, "n_repeats": 4, "return_all_layers": False},
@popcornell (Collaborator Author)

If anyone wants to experiment with this, here is where you can change the hyperparameters.

nondefault_nsrc:
sample_rate: 8000
mode: min
segment: 1.0
@popcornell (Collaborator Author)

1.0 seconds, or 0.75 as in the paper, is enough.

@popcornell requested a review from mpariente February 24, 2021 15:33
@popcornell added the enhancement (New feature or request) label Feb 24, 2021
@mpariente (Collaborator)

I'll review after @JorisCos

@mpariente requested a review from JorisCos February 25, 2021 08:59
@popcornell (Collaborator Author)

It would be cool if someone could try to run the training with the full system rather than with oracle embeddings. The review can wait until the full system has been trained and performance is decent.

@JorisCos (Collaborator) left a comment

It was a very nice surprise to see a new Wavesplit PR for Asteroid, thanks @popcornell.
I made my review with general comments and questions. Aren't we missing the eval script and the tests?

Comment on lines +147 to +153
if __name__ == "__main__":
    a = WHAMID(
        "/media/sam/bx500/wavesplit/asteroid/egs/wham/wavesplit/data/wav8k/min/tt", "sep_clean"
    )

    for i in a:
        print(i[-1])
@JorisCos (Collaborator)

To be removed

Comment on lines +83 to +93
if __name__ == "__main__":
    parser = argparse.ArgumentParser("WHAM data preprocessing")
    parser.add_argument(
        "--in_dir", type=str, default=None, help="Directory path of wham including tr, cv and tt"
    )
    parser.add_argument(
        "--out_dir", type=str, default=None, help="Directory path to put output files"
    )
    args = parser.parse_args()
    print(args)
    preprocess(args)
@JorisCos (Collaborator)

I think we should create a def main(args) at the beginning of the file, put the parser arguments there as well, and call preprocess inside main(args). It's more user-friendly: we can see the arguments and the function that is called directly, without scrolling.
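A minimal sketch of the suggested layout (illustrative only; it assumes the existing preprocess function defined in this file):

import argparse

# Parser at the top of the file so the expected arguments are visible immediately.
parser = argparse.ArgumentParser("WHAM data preprocessing")
parser.add_argument(
    "--in_dir", type=str, default=None, help="Directory path of wham including tr, cv and tt"
)
parser.add_argument(
    "--out_dir", type=str, default=None, help="Directory path to put output files"
)


def main(args):
    # Entry point that simply forwards to the existing preprocess function.
    preprocess(args)


if __name__ == "__main__":
    main(parser.parse_args())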

Comment on lines +102 to +107
# exp normalize trick
# with torch.no_grad():
# b = torch.max(distances, dim=1, keepdim=True)[0]
# out = -distance_utt + b.squeeze(1) - torch.log(torch.exp(-distances + b).sum(1))
# return out.sum(1)

@JorisCos (Collaborator)

Remove ?
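For reference, the commented-out lines above use the standard exp-normalize (log-sum-exp) trick. A minimal standalone illustration of the stable form, with shapes matching the snippet (values are illustrative, not from the PR):

import torch

# Large positive distances make the naive computation underflow in float32.
distances = 120.0 + torch.rand(1, 101, 4000)          # (batch, n_speakers, frames)

naive = -torch.log(torch.exp(-distances).sum(1))      # exp(-120) underflows to 0 -> log(0) = -inf
b = torch.max(-distances, dim=1, keepdim=True)[0]     # factor out the largest exponent
stable = -(b.squeeze(1) + torch.log(torch.exp(-distances - b).sum(1)))

print(naive.isinf().any().item())      # True: the naive version underflows
print(stable.isfinite().all().item())  # True: the factored form stays finite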

Comment on lines +224 to +229
# testing exp normalize average
# distances = torch.ones((1, 101, 4000))
# with torch.no_grad():
# b = torch.max(distances, dim=1, keepdim=True)[0]
# out = b.squeeze(1) - torch.log(torch.exp(-distances + b).sum(1))
# out2 = - torch.log(torch.exp(-distances).sum(1))
@JorisCos (Collaborator)

Remove ?

from kmeans_pytorch import kmeans, kmeans_predict


class Conv1DBlock(nn.Module):
@JorisCos (Collaborator)

Don't you think we should make Wavesplit part of Asteroid itself, not just of the WHAM recipes?

@JorisCos (Collaborator) commented Oct 7, 2021

Just letting you know that I am currently working on the recipe to run some experiments.
Hopefully, the results will be as expected and we will finally merge this 🚀

@lminer commented Oct 15, 2021

@JorisCos Does that mean there's a more current version of this branch somewhere? Would be nice to be able to take a look if possible.

@wangshuo182

It seems to work well with oracle embeddings (the score improved to 18.5 dB on the WSJ0-2mix validation set after 50 epochs). But when the two stacks are jointly trained, the separation stack yields almost the same signals as the mixture, and the SI-SDR metric tends to zero. Could you please share your results if anyone has tried the complete pipeline? @popcornell @JorisCos

@popcornell (Collaborator Author)

That's very interesting to know!
Unfortunately, all I have is here on GitHub. Maybe Joris has more up-to-date code.

Do you think the degradation is due to overfitting on the training speaker IDs?
It may be. In the paper they use techniques like speaker dropout to mitigate that.
WSJ2Mix is small in terms of speaker diversity after all, and for reasonable speaker-ID extraction you usually need a lot of diversity, e.g. VoxCeleb.

@lminer commented Jun 5, 2022 via email

@TCord commented Jun 8, 2022

Quoting @wangshuo182's comment above:
It seems to work well with oracle embeddings (the score improved to 18.5 dB on the WSJ0-2mix validation set after 50 epochs). But when the two stacks are jointly trained, the separation stack yields almost the same signals as the mixture, and the SI-SDR metric tends to zero. Could you please share your results if anyone has tried the complete pipeline? @popcornell @JorisCos

Hi, I also tried to run some experiments with Wavesplit (albeit in our own framework) in the past. I think a stagnating training of the speaker stack might result from two things:

  1. Missing shuffling of speaker IDs during training:
    The target speaker IDs need to be shuffled every time, so that the model really needs the embeddings of the speaker stack to solve the permutation problem at the output. I haven't checked all of the code, but at least at first glance I have not seen this shuffling for the target IDs (a sketch of such shuffling is given after this comment).
  2. The latent dimension of the speaker stack is too high:
    If you don't use dynamic mixing (DM), the WSJ2mix dataset has 101 speakers, while the latent dimension is set to 256 in the corresponding paper. So, without DM (which with WSJ0+WSJ1 would give 285 speakers), there is no information bottleneck that would enable the model to generalize to unseen speakers. Even then, the generalization tricks that @popcornell mentioned (centroid dropout / mixup) should still be necessary for the model to really generalize to unseen speakers.
    Reducing the latent dimension to e.g. 64 and seeing whether it learns something then might work.

But even then, all of this should not prevent the model from at least overfitting to the training set.
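For point 1, a minimal sketch of per-example shuffling of the target order and IDs (not from the PR; names and shapes are illustrative):

import torch

# Shuffle sources and speaker IDs with the same random permutation per example,
# so the output channel order carries no fixed speaker information.
oracle_s = torch.randn(4, 2, 8000)           # (batch, n_spk, samples)
oracle_ids = torch.randint(0, 101, (4, 2))   # (batch, n_spk)

perm = torch.stack([torch.randperm(oracle_s.size(1)) for _ in range(oracle_s.size(0))])
oracle_s = torch.gather(oracle_s, 1, perm.unsqueeze(-1).expand_as(oracle_s))
oracle_ids = torch.gather(oracle_ids, 1, perm)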

@popcornell (Collaborator Author) commented Jun 8, 2022

Do they use shuffling in the paper? It sounds like a very smart thing to do, but they don't seem to use it. There is no shuffling here, and it would be great to add it, because it certainly prevents the model from being lazy and memorizing the speakers.
Your point on the dimension of the embedding is valid. I don't think the dynamic mixing they describe in the paper also uses WSJ1. If it does, the results in the paper are not comparable with previous works, since they use additional data (then better to throw in LibriMix, which has more speaker diversity!). If it only uses WSJ0, then you always have 101 speakers, even with DM. This exacerbates the problem for sure.

Most of the code here is also from the first version of the paper, where there were not many augmentations on the speaker stack (no speaker dropout, for example; maybe only Gaussian noise?). I did not implement these augmentations.
Maybe Neil Zeghidour still has some other hidden tricks to make the model generalize better. @TCord were you able to successfully replicate it to some decent degree?

@popcornell (Collaborator Author)

@lminer did you use VoxCeleb?

@TCord commented Jun 8, 2022

If I remember correctly, they also used label shuffling in the paper. In my experiments, I did not use the architecture as proposed in the paper, but a Conv-TasNet as the separation stack (i.e. I added an additional encoder/decoder layer) and reduced the total number of layers. Here, I was able to train the model, but it did not improve upon the performance of a Conv-TasNet.
My conjecture was that a sample-wise resolution, as in the paper, is necessary to obtain good results and provides the most significant improvements. I think the DPRNN paper also showed that choosing a very small window size and frame advance in the encoder further improved the separation for anechoic data.
By employing an additional speaker stack jointly with a Conv-TasNet, the permutation problem could be solved through the speaker stack, but it did not improve separation performance over a plain Conv-TasNet. As training the full-size Wavesplit model without any additional encoder/decoder layer takes a massive amount of GPU memory, I dropped these experiments afterwards.

@popcornell (Collaborator Author)

I have observed the same, actually. Also, according to https://arxiv.org/abs/2202.00733, the use of speaker ID information does in fact not really help.

@lminer commented Jun 8, 2022

@popcornell I used my own private dataset.
