Pull request johan BUT sre #326

Merged: 36 commits, Aug 29, 2024

Conversation

gulamungon
Contributor

SRE recipe using CTS superset + voxceleb as embedding extractor training data (See README). There are very few changes outside the recipe. Let me know if this is not appropriate:

  • In tools/make_shard_list.py: if VAD info does not exist for a recording, the recording
    is now used as is, i.e., no VAD is applied. This is not ideal, since VAD info is
    currently also absent when VAD was estimated but no speech was found.
    However, there may be situations where we don't want to apply VAD to some sets, e.g.,
    we may want to apply VAD to CTS but not to voxceleb. Then we need it this way. This means
    that utterances for which VAD was run but no speech was detected should be filtered out
    before the shards are created. This will be the case if the sets are filtered with
    local/filter_utt_accd_dur.py, since this script discards recordings with no speech
    according to VAD.
    Ideally this should be improved so that recordings for which no speech was
    detected are removed, while files that we don't want to apply VAD to are kept.
    Possibly by
    • Changing the VAD info format so that it also contains files with no speech, then discarding them in tools/make_shard_list.py
      while keeping the ones with no VAD info, i.e., those for which we do not want to apply VAD.
    • Keeping the VAD format as is, i.e., no info for recordings with no speech, but instead adding fake VAD info for the files
      we don't want to apply VAD to. This info would simply mark the whole segment as speech.
  • In get_data_for_plda(...) in plda_utils.py: a warning is now given instead of a crash
    whenever an entry is in the scp but not in utt2spk. Such an embedding is skipped.
    (This can be the case if files have been added to the original CTS data folder,
    since the scp is created by finding all wav files in this directory. In the case of
    BUT, we have some extra files there for sanity checks.) If this solution is not appropriate,
    we can change the data preparation script so that the mismatch between the created scp
    and utt2spk is fixed already there.
  • In local/make_system_sad.py (this change is only in the local recipe): VAD is now processed for a limited number of files at a time (hardwired to 10000), after which the VAD result is saved, instead of processing all files at once. Otherwise it took a very long time to start when extracting VAD. This could also be helpful in case of a crash, since output is saved after each part instead of after the whole set.
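
For illustration only, the two VAD options discussed in the first item could look roughly like the sketch below. It is not the code in tools/make_shard_list.py: the function names, the in-memory dict of (start, end) segments, and the durations mapping are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds; hypothetical format


def segments_for_utt(
    utt: str,
    duration: float,
    vad: Dict[str, List[Segment]],
) -> Optional[List[Segment]]:
    """First option: the VAD info also lists recordings with no speech.

    Returns the segments to keep, or None if the recording should be dropped.
    """
    if utt not in vad:
        # No VAD entry at all: VAD is not wanted for this recording,
        # so the whole recording is used as is.
        return [(0.0, duration)]
    if not vad[utt]:
        # VAD was run but found no speech: discard the recording.
        return None
    return vad[utt]


def add_full_length_vad(
    vad: Dict[str, List[Segment]],
    durations: Dict[str, float],
) -> Dict[str, List[Segment]]:
    """Second option: add fake VAD info for recordings that should skip VAD.

    `durations` holds only the recordings we do not want to apply VAD to.
    Each gets a single segment that marks the whole recording as speech.
    """
    out = dict(vad)
    for utt, dur in durations.items():
        out.setdefault(utt, [(0.0, dur)])
    return out
```

With the first option, a missing entry means "do not apply VAD" and an empty entry means "no speech found". With the second, a missing entry can keep meaning "no speech found", because every recording that should bypass VAD carries an explicit full-length segment.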

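The warn-and-skip behaviour described for get_data_for_plda can be sketched as follows. This is a minimal stand-alone illustration, not the function in plda_utils.py: the helper name, its arguments, and the tuple layout are hypothetical.

```python
import logging
from typing import Dict, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def pair_embeddings_with_speakers(
    scp_entries: Iterable[Tuple[str, str]],
    utt2spk: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (utt, spk, path) for every scp entry that has a speaker label.

    Entries that appear in the scp but not in utt2spk are logged and skipped
    instead of raising a KeyError.
    """
    kept = []
    for utt, path in scp_entries:
        spk = utt2spk.get(utt)
        if spk is None:
            logger.warning("%s is in the scp but not in utt2spk, skipping", utt)
            continue
        kept.append((utt, spk, path))
    return kept
```

The trade-off is that a genuine data preparation bug now shows up only as warnings in the log, which is why fixing the mismatch in the data preparation script is mentioned as an alternative.
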
@wsstriving wsstriving requested review from czy97 and Hunterhuan June 4, 2024 15:35
@JiJiJiang
Collaborator

Please fix Lint errors.

@czy97
Collaborator

czy97 commented Jun 6, 2024

Hello Johan @gulamungon, thanks for the contribution. Could you first fix the lint errors? Locally, you can use the flake8 command to find the problematic files and yapf -i xxx.py to reformat them automatically.

@gulamungon
Contributor Author

> Hello Johan @gulamungon, thanks for the contribution. Can you first fix the Lint errors. Locally, you can use the flake8 command to check for problematic files and use yapf -i xxx.py to automatically format the problematic files.

Sure, I'll try to fix it asap.

@gulamungon
Contributor Author

I fixed it hopefully.

@gulamungon
Contributor Author

Changed tabs to spaces.

@wsstriving
Collaborator

@gulamungon Hi Johan, thanks for the contribution. It seems there are still some lint errors (trailing whitespace).
@czy97 @JiJiJiang Maybe you can start reviewing first, and we do the lint fix afterwards.

@gulamungon
Contributor Author

gulamungon commented Jun 17, 2024 via email

export PYTHONIOENCODING=UTF-8
export PYTHONPATH=../../../:$PYTHONPATH

export PATH=$PATH:/mnt/matylda6/rohdin/software/kaldi_20200214/tools/sph2pipe_v2.5/
Collaborator

@JiJiJiang JiJiJiang Jun 25, 2024

Hi Johan @gulamungon, thanks for your code.

Since installing Kaldi is a little complex, we only adopt some useful shell/perl/python scripts in WeSpeaker rather than installing the whole of Kaldi.
You may consider two methods here:

  1. Download sph2pipe_v2.5.tar.gz and decompress it into an external_tools dir;
  2. Use some other tool to convert sph into wav, e.g., ffmpeg.

Contributor Author

Sure, I'll fix it. I'm on a trip this week so probably next week.

Collaborator

@gulamungon Hi Johan, can you fix the lint errors etc.? Then we can proceed with the merge.

@gulamungon
Contributor Author

I changed sph2pipe to ffmpeg. Previously, we simply used the sre16 data prepared elsewhere by Kaldi (as in ../v2), which uses sph2pipe. Here I have instead copied in the sre16 data preparation scripts and modified them to use ffmpeg, along with some other minor changes to fit better here.
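
As a rough illustration of the sph2pipe-to-ffmpeg switch, a Kaldi-style piped wav.scp entry could be generated as in the sketch below. This is not taken from the recipe's data preparation scripts: the helper name, the 8000 Hz default, and the exact set of ffmpeg options are assumptions, and it presumes the local ffmpeg build can decode the SPHERE files in question.

```python
import shlex


def ffmpeg_wav_scp_line(utt_id: str, sph_path: str, sample_rate: int = 8000) -> str:
    """Build one wav.scp line that decodes `sph_path` to WAV on stdout."""
    cmd = [
        "ffmpeg",
        "-nostdin",              # do not read from the terminal
        "-hide_banner",
        "-loglevel", "error",    # keep log noise off stderr
        "-i", sph_path,
        "-ar", str(sample_rate),  # output sample rate (assumed value)
        "-f", "wav",
        "-",                     # write the WAV stream to stdout
    ]
    # The trailing "|" marks the entry as a command whose output is the audio.
    return f"{utt_id} {shlex.join(cmd)} |"


print(ffmpeg_wav_scp_line("utt1", "/data/a.sph"))
```

shlex.join quotes paths that contain spaces or other shell-special characters, so the resulting line stays valid when the command is run through a shell.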

I fixed the trailing spaces.

Since quite a lot has changed now, I'm rerunning the recipe to check that nothing is broken. I think you can review it, but it is perhaps better to wait with the merge until the run has finished.

@JiJiJiang
Collaborator

JiJiJiang commented Aug 22, 2024

@gulamungon Hi Johan, thank you for your update! But there still seem to be some lint errors.
You can run pip install pre-commit and pre-commit install; the lint errors will then be reported when you run git commit, before you push the latest code to the GitHub repo.

@gulamungon
Contributor Author

Ok trying again. Actually, I did those checks manually but I was doing it from examples/sre/v3 and it seems errors in files that are links were not detected properly. pre-commit is convenient. Thanks for the tip.

@gulamungon
Contributor Author

It is not clear to me what the flake8 issue is. I didn't see it when I ran it locally.

@JiJiJiang
Collaborator

> It is not clear to me what the flake8 issue is. I didn't see it when I ran it locally.

This flake8 error was fixed in the recent updates. Maybe you need to merge the master branch first and fix the conflicts (if any).

@gulamungon
Contributor Author

Ok. I pulled the recent changes to master, then merged it into this branch. Hopefully it works now.

@JiJiJiang JiJiJiang merged commit 03ceb00 into wenet-e2e:master Aug 29, 2024
4 checks passed
@gulamungon gulamungon deleted the pull_request_Johan_BUT_sre branch September 17, 2024 12:42

4 participants