wenet-e2e · JiJiJiang · Aug 29, 2024 · Feb 7, 2024 · Feb 14, 2024 · Mar 22, 2024
diff --git a/examples/sre/v3/README b/examples/sre/v3/README
@@ -0,0 +1,29 @@
+Changed make_system_sad.py slightly so that a large data set is split into parts
+when extracting VAD. Otherwise it took ages to start. This is also helpful
+in case of a crash, since the output is saved after each part instead
+of after the whole set.
+
+# We use some scripts from Kaldi (combine_data.sh and fix_data_dir.sh)
+
+# This should not be needed anymore.
+# ln -s $KALDI_ROOT/egs/wsj/s5/utils
+# export PATH=$PATH:$(pwd)/utils/ # This is necessary since some Kaldi scripts assume other Kaldi scripts exist in the path.
+#export PATH=$PATH:$KALDI_ROOT/
+
+
+CTS
+                              spk /     utt
+Org. data                    6867 /  605760
+After VAD                    6867 /  605704
+After removing T < 5s        6867 /  604774
+After removing utt/spk < 3   6867 /  604774
+
+VOX
+                              spk /     utt
+Org. data                    7245 / 1245525
+After VAD                    7245 / 1245469
+After removing T < 5s        7245 /  816385
+After removing utt/spk < 3   7245 /  816385
+
+Total
+After removing utt/spk < 3  14112 / 1421159
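The counts above follow from two pruning steps applied in sequence: dropping short utterances, then dropping speakers with too few utterances. A minimal Python sketch of that logic is given here, assuming in-memory dictionaries rather than Kaldi-style files. The function name `prune` and the toy data are hypothetical and are not taken from `local/prepare_data.sh`; the "less than or equal" thresholds follow the wording in README.md.

```python
from collections import defaultdict


def prune(utt2spk, utt2dur, min_dur=5.0, min_utts_per_spk=2):
    """Drop short utterances, then speakers with too few utterances.

    utt2dur may hold voiced or total durations. Utterances missing from
    utt2dur are dropped, which mirrors what happens to recordings that
    have no entry in utt2voice_dur. Both thresholds are inclusive:
    duration <= min_dur and count <= min_utts_per_spk are discarded.
    """
    kept = {u: s for u, s in utt2spk.items()
            if u in utt2dur and utt2dur[u] > min_dur}

    # Count the surviving utterances per speaker.
    spk2utts = defaultdict(list)
    for utt, spk in kept.items():
        spk2utts[spk].append(utt)

    return {u: s for u, s in kept.items()
            if len(spk2utts[s]) > min_utts_per_spk}


# Toy data (hypothetical): speaker A has four utterances, speaker B three.
utt2spk = {"a1": "A", "a2": "A", "a3": "A", "a4": "A",
           "b1": "B", "b2": "B", "b3": "B"}
# "b3" has no duration entry, e.g. because VAD found no speech.
utt2dur = {"a1": 7.0, "a2": 6.5, "a3": 9.1, "a4": 3.0, "b1": 8.0, "b2": 5.0}

pruned = prune(utt2spk, utt2dur)
print(sorted(pruned))
```

In this toy case speaker B is removed entirely, because only one of its utterances survives the duration filter.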
diff --git a/examples/sre/v3/README.md b/examples/sre/v3/README.md
@@ -0,0 +1,99 @@
+### Main differences from ../v2
+* The training data is the CTS superset plus VoxCeleb with GSM codec
+* The test data is SRE16, SRE18, and SRE21
+* Preprocessing of embeddings before backend/scoring is supported
+
+### Important
+Similarly to ../v2, this recipe uses silero vad https://github.com/snakers4/silero-vad
+downloaded from here https://github.com/snakers4/silero-vad/archive/refs/tags/v4.0.zip
+If you intend to use this recipe for an evaluation/competition, make sure to check that
+you are allowed to use the data that was used to train Silero.
+
+### Instructions
+* Set the paths in stage 1. The variable ```sre_data_dir``` is assumed to be prepared by
+  Kaldi (https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v2).
+  Only the eval and unlabeled (major) data of sre16 is taken from there.
+  ```voxceleb_dir``` is the path to voxceleb prepared by wespeaker (```../../voxceleb/v2```).
+  If you set it to "" (empty string), the preparation will be run here. For the other datasets,
+  the path to the folder provided by LDC should be provided. The relevant LDC numbers and
+  file names of the data can be seen in the script. If you don't have
+  one or two of the "eval/dev" sets of "sre16", "sre18" or "sre21" and don't specify them, you may
+  have to comment them out in some more places in order to avoid crashes. (Eventually
+  the script will hopefully be made more robust to this.)
+  If you don't have the CTS superset data, you can skip stage 5 in ```local/prepare_data.sh```
+  and instead replace the CTS data with some other data, e.g., the training data prepared in ```../v2```.
+  If so, it is probably easiest to name this data "CTS" since this name is assumed later
+  in the recipe.
+* Select which torchrun command to use in stage 3. The first line
+  (currently commented) is for "single-node, multi-worker" (one
+  pytorch job per machine). The second line is for "Stacked
+  single-node multi-worker" (more than one pytorch job may be
+  submitted to the same node in your cluster.) See
+  https://pytorch.org/docs/stable/elastic/run.html for explanations.
+* Stage 3 (training) and stage 4 (embedding extraction) need GPU. You may have
+  to arrange how to run these parts based on your environment.
+
+
+### Explanation of embedding processing
+
+The code supports flexible combinations of embedding processing steps, such as length-norm and LDA.
+A processing chain is specified, e.g., as follows:
+```
+mean-subtract --scp $mean1_scp | length-norm | lda --scp $lda_scp --utt2spk $utt2spk --dim $lda_dim | length-norm
+```
+The script ```wespeaker/bin/prep_embd_proc.py``` takes such a processing chain as input, loops through the processing steps (separated by ```|```), calculates
+the necessary processing parameters (means, lda transforms etc.) and stores the whole processing chain with parameters in
+pickle format. The parameters for each step are calculated sequentially, and the data specified for the parameter estimation of a step is first
+processed by the earlier steps. Therefore, the data for the different steps can be different. For example, when estimating LDA in the above chain, the data given by ```$lda_scp``` will first be processed by ```mean-subtract```, whose parameters were estimated from ```$mean1_scp```, which could be a different dataset.
+In scenarios where unlabeled domain adaptation data is available, we want to use this data for the first mean subtraction while still using the out-of-domain data for LDA estimation. This CANNOT be achieved by specifying the processing chain
+```
+mean-subtract --scp $indomain_scp | length-norm | lda --scp $lda_scp --utt2spk $utt2spk --dim $lda_dim | length-norm
+```
+since this would have the consequence that in LDA estimation, the data (```$lda_scp```) would be subjected to mean subtraction
+using the mean of the in-domain data (```$indomain_scp```). To solve this, we have an additional script, ```wespeaker/bin/update_embd_proc.py```, used as follows
+```
+new_link="mean-subtract --scp $indomain_scp"
+python wespeaker/bin/update_embd_proc.py --in_path $preprocessing_path_cts_aug --out_path $preprocessing_path_sre18_unlab --link_no_to_remove 0 --new_link "$new_link"
+```
+where ```$preprocessing_path_cts_aug``` is the path to the pickled original processing chain and ```$preprocessing_path_sre18_unlab``` is the path to the new pickled processing chain.
+The script will remove link 0, i.e., ```mean-subtract --scp $mean1_scp```, and replace it with ```mean-subtract --scp $indomain_scp```.
+
+
+### Regarding extractor training data pruning
+
+Similarly to ```../v2``` and Kaldi's sre16 recipe, we discard some of the training utterances based on duration as well as training speakers based on their number of utterances.
+This is controlled in stage 9 of ```local/prepare_data.sh```. It is quite flexible but currently a bit messy and some consequences of the settings are not obvious. Therefore some explanation is provided here.
+There are three "blocks" in stage 9:
+* The first block discards all utterances shorter than or equal to some specified duration (currently set to 5s) according to VOICED DURATION.
+* The second block discards all utterances shorter than or equal to some specified duration (currently set to 5s) according to TOTAL DURATION, i.e., ignoring VAD info.
+* The third block discards all speakers that have less than or equal to a specified number of utterances. (Currently set to 2, i.e., speakers with 3 or more utterances are kept.)
+It is possible to set the thresholds differently for the different sets. IMPORTANT: The pruning in block 1 is based on ```data/data_set_name/utt2voice_dur```, which is calculated
+from the VAD info, so if a recording does not have any speech, it will not be present in utt2voice_dur and is therefore discarded in this block even if the duration threshold is
+set to e.g. -1. If we want such utterances to be kept for one set, we should not run this block for the set (as is currently the case for VoxCeleb). The current setup is as follows:
+    1. Apply block one to CTS but not VoxCeleb.
+    2. Apply block two to VoxCeleb but not CTS. (Applying this block to CTS would not have an effect if the thresholds are the same since the total duration is always larger than or equal to the voiced duration.)
+    3. Apply block three to both CTS and VoxCeleb.
+
+    This means VoxCeleb recordings are kept even if they have no speech according to VAD. The later shard creation stage applies VAD if available, otherwise it keeps the file as it is. So VoxCeleb recordings with no speech according to VAD will NOT be discarded (but there are only around 70 of them, which is unlikely to have any effect on the trained system). Also, there is a risk that pruning according to total duration while applying VAD in shard creation could result in recordings shorter than "num_frms". These will be zero padded at training time so there will be no crash, but this is probably also suboptimal.
+These settings are arguably somewhat weird. Applying block one also to VoxCeleb (and not using block two at all) would be more reasonable, but it seems to degrade the performance due to discarding too many files. A better solution than the current one would be to try smaller thresholds than 5s, but we have not had time to explore this yet. Also, it would be reasonable to discard recordings with no speech according to VAD in the shard creation stage. However, when no VAD is available for a file, the shard creation code does not know whether this is because no speech was detected for this file according to VAD, or because VAD was not run for this file. Since we want to have the possibility to keep recordings for which the latter is the case, we have it this way (it could, for example, be considered not to use VAD for VoxCeleb at all, in which case we need to avoid discarding these files at the shard creation stage). A more flexible and clear solution is needed and we will work on this for future updates.
+
+
+### Some data statistics
+|                                                    | CTS #utt | CTS #spk | VoxCeleb #utt | VoxCeleb #spk | comment |
+| ---                                                |  ---    | ---  |    ---  |  --- |  --- |
+|Original data                                       |  605760 | 6867 | 1245525 | 7245 |      |
+|Excluding recordings with no speech according to VAD |  605704 | 6867 | 1245455 | 7245 | VAD is a bit random so these numbers could vary slightly, especially for VoxCeleb. |
+|After filtering according to voiced duration         |  604774 | 6867 |  816411 | 7245 | These numbers could vary slightly too. We don't use this for VoxCeleb in the current settings. |
+|After filtering according to total duration          |  -      | -    |  868326 | 7245 | Haven't checked this for CTS. |
+
+No speakers are discarded in block three with the current settings.
+
+
+### Things to explore
+Very few things have been tuned. For example, the following could be low-hanging fruit:
+* The above-mentioned pruning rules.
+* Utterance durations of the training segments.
+* Shall VoxCeleb be included? Is applying the GSM codec a good idea? (Note that the GSM codec is applied in the data preparation stage while augmentation is applied at training time, i.e., the GSM codec comes before augmentations. This is not so realistic, since in reality noise and reverberation come before the data is recorded and encoded. However, it is consistent with CTS, where we also apply augmentations to the already encoded audio since it was encoded at recording time.)
+* The other architectures.
+
+We will tune this further in the future. We are also happy to hear about any such results obtained by others.
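The sequential parameter estimation described under "Explanation of embedding processing" in README.md can be sketched in Python as follows. This is a minimal illustration with NumPy, not the code in `wespeaker/bin/prep_embd_proc.py` or `wespeaker/bin/update_embd_proc.py`. All names (`estimate_chain`, `apply_chain`, `replace_link`, `fit_lda`), the dictionary-based chain specification, and the synthetic data are assumptions made for the example.

```python
import numpy as np


def length_norm(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def fit_lda(x, labels, dim):
    """Basic Fisher LDA; returns a (feat_dim x dim) projection matrix."""
    d = x.shape[1]
    global_mean = x.mean(axis=0)
    sw = np.zeros((d, d))  # within-speaker scatter
    sb = np.zeros((d, d))  # between-speaker scatter
    for spk in np.unique(labels):
        xs = x[labels == spk]
        mu = xs.mean(axis=0)
        sw += (xs - mu).T @ (xs - mu)
        sb += len(xs) * np.outer(mu - global_mean, mu - global_mean)
    # Whiten with the Cholesky factor of Sw, then diagonalize Sb.
    chol = np.linalg.cholesky(sw + 1e-6 * np.eye(d))
    chol_inv = np.linalg.inv(chol)
    evals, evecs = np.linalg.eigh(chol_inv @ sb @ chol_inv.T)
    order = np.argsort(evals)[::-1][:dim]
    return chol_inv.T @ evecs[:, order]


def apply_chain(links, x):
    for link in links:
        x = link(x)
    return x


def make_mean_subtract(mean):
    return lambda x: x - mean


def estimate_chain(spec):
    """Fit the links in order; each step's data first passes through
    the links that have already been fitted."""
    links = []
    for step in spec:
        if step["type"] == "mean-subtract":
            mean = apply_chain(links, step["data"]).mean(axis=0)
            links.append(make_mean_subtract(mean))
        elif step["type"] == "length-norm":
            links.append(length_norm)
        elif step["type"] == "lda":
            proj = fit_lda(apply_chain(links, step["data"]),
                           step["labels"], step["dim"])
            links.append(lambda x, proj=proj: x @ proj)
    return links


def replace_link(links, index, new_link):
    """Swap one link and keep all other fitted links untouched."""
    updated = list(links)
    updated[index] = new_link
    return updated


# Synthetic embeddings (hypothetical): 10 speakers x 20 utterances, 8 dims.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 20)
out_domain = rng.normal(size=(10, 8))[labels] + 0.3 * rng.normal(size=(200, 8))
in_domain = rng.normal(loc=3.0, size=(50, 8))

chain = estimate_chain([
    {"type": "mean-subtract", "data": out_domain},
    {"type": "length-norm"},
    {"type": "lda", "data": out_domain, "labels": labels, "dim": 3},
    {"type": "length-norm"},
])

# Domain adaptation: only link 0 changes; the LDA fitted on the
# out-of-domain data (with its own mean subtraction) is reused as is.
adapted = replace_link(chain, 0, make_mean_subtract(in_domain.mean(axis=0)))
processed = apply_chain(adapted, in_domain)
print(processed.shape)
```

The point of replacing the link after estimation, rather than re-estimating the whole chain, is that the LDA transform stays the one estimated with the out-of-domain mean.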
diff --git a/examples/sre/v3/conf/resnet.yaml b/examples/sre/v3/conf/resnet.yaml
@@ -0,0 +1,81 @@
+### train configuration
+
+exp_dir: exp/ResNet34-TSTP-emb256-fbank40-num_frms200-aug0.6-spFalse-saFalse-Softmax-SGD-epoch150
+gpus: "[0,1]"
+num_avg: 10
+enable_amp: False # whether to enable automatic mixed precision training
+
+seed: 42
+num_epochs: 150
+save_epoch_interval: 5 # save model every 5 epochs
+log_batch_interval: 100 # log every 100 batches
+
+dataloader_args:
+  batch_size: 256
+  num_workers: 7    # Total number of cores will be (this +1)*num_gpus
+  pin_memory: False
+  prefetch_factor: 8
+  drop_last: True
+
+dataset_args:
+  # The number of samples traversed within one epoch. If the value equals 0,
+  # the number of utterances in the dataset is used as sample_num_per_epoch.
+  sample_num_per_epoch: 780000
+  shuffle: True
+  shuffle_args:
+    shuffle_size: 1500
+  filter: True
+  filter_args:
+    min_num_frames: 100
+    max_num_frames: 300
+  resample_rate: 8000
+  speed_perturb: False
+  num_frms: 200
+  aug_prob: 0.6 # prob to add reverb & noise aug per sample
+  fbank_args:
+    num_mel_bins: 64
+    frame_shift: 10
+    frame_length: 25
+    dither: 1.0
+  spec_aug: False
+  spec_aug_args:
+    num_t_mask: 1
+    num_f_mask: 1
+    max_t: 10
+    max_f: 8
+    prob: 0.6
+
+model: ResNet34 # ResNet18, ResNet34, ResNet50, ResNet101, ResNet152
+model_init: null
+model_args:
+  feat_dim: 64
+  embed_dim: 256
+  pooling_func: "TSTP" # TSTP, ASTP, MQMHASTP
+  two_emb_layer: False
+projection_args:
+  project_type: "softmax" # add_margin, arc_margin, sphere, softmax, arc_margin_intertopk_subcenter
+
+margin_scheduler: MarginScheduler
+margin_update:
+  initial_margin: 0.0
+  final_margin: 0.2
+  increase_start_epoch: 20
+  fix_start_epoch: 40
+  update_margin: True
+  increase_type: "exp" # exp, linear
+
+loss: CrossEntropyLoss
+loss_args: {}
+
+optimizer: SGD
+optimizer_args:
+  momentum: 0.9
+  nesterov: True
+  weight_decay: 0.0001
+
+scheduler: ExponentialDecrease
+scheduler_args:
+  initial_lr: 0.1
+  final_lr: 0.00005
+  warm_up_epoch: 6
+  warm_from_zero: True
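To get a feel for the `scheduler_args` above, the following Python sketch evaluates a generic exponential decay from `initial_lr` to `final_lr` with a linear warm-up from zero. It is an assumption about the general shape of such a schedule, expressed per epoch for simplicity; the actual `ExponentialDecrease` class in wespeaker may differ in details such as per-iteration updates or scaling by the number of GPUs. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(epoch, num_epochs=150, initial_lr=0.1, final_lr=5e-5,
          warm_up_epoch=6, warm_from_zero=True):
    """Learning rate at a (possibly fractional) epoch.

    Exponential interpolation between initial_lr and final_lr,
    multiplied by a linear ramp during the warm-up epochs.
    """
    progress = epoch / num_epochs
    lr = initial_lr * math.exp(progress * math.log(final_lr / initial_lr))
    if warm_from_zero and epoch < warm_up_epoch:
        lr *= epoch / warm_up_epoch
    return lr


for epoch in (0, 3, 6, 50, 100, 150):
    print(epoch, f"{lr_at(epoch):.6f}")
```

Under these assumptions the rate starts at zero, peaks at the end of the warm-up, and reaches `final_lr` at the last epoch.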
diff --git a/examples/sre/v3/local/create_preproc_embd_lists.sh b/examples/sre/v3/local/create_preproc_embd_lists.sh
@@ -0,0 +1,119 @@
+#!/bin/bash
+
+# Copyright (c) 2024 Johan Rohdin ([email protected])
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# The preprocessed embeddings are already stored but we need to create the lists
+# as score.sh wants them.
+
+exp_dir=$1
+data=data
+
+# We have three different preprocessors for which we need to prepare the lists
+# embd_proc_cts_aug.pkl             # LDA and cts_aug mean subtraction
+# embd_proc_sre16_major.pkl         # LDA and sre16_major mean subtraction (Only used for SRE16)
+# embd_proc_sre18_dev_unlabeled.pkl # LDA and sre18_dev_unlabeled mean subtraction (Only used for SRE18)
+
+
+### !!!
+# Note that xvector2 is only a hack for BUT
+
+##################################################################
+# CTS AUG for all sets
+echo "mean vector of enroll"
+python tools/vector_mean.py \
+  --spk2utt ${data}/sre16/eval/enrollment/spk2utt \
+  --xvector_scp $exp_dir/embeddings/sre16/eval/enrollment/xvector_proc_embd_proc_cts_aug.scp \
+  --spk_xvector_ark $exp_dir/embeddings/sre16/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.ark
+
+python tools/vector_mean.py \
+  --spk2utt ${data}/sre18/dev/enrollment/mdl_id2utt \
+  --xvector_scp $exp_dir/embeddings/sre18/dev/enrollment/xvector_proc_embd_proc_cts_aug.scp \
+  --spk_xvector_ark $exp_dir/embeddings/sre18/dev/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.ark
+
+python tools/vector_mean.py \
+  --spk2utt ${data}/sre18/eval/enrollment/mdl_id2utt \
+  --xvector_scp $exp_dir/embeddings/sre18/eval/enrollment/xvector_proc_embd_proc_cts_aug.scp \
+  --spk_xvector_ark $exp_dir/embeddings/sre18/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.ark
+
+python tools/vector_mean.py \
+  --spk2utt ${data}/sre21/dev/enrollment/mdl_id2utt \
+  --xvector_scp $exp_dir/embeddings/sre21/dev/enrollment/xvector_proc_embd_proc_cts_aug.scp \
+  --spk_xvector_ark $exp_dir/embeddings/sre21/dev/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.ark
+
+python tools/vector_mean.py \
+  --spk2utt ${data}/sre21/eval/enrollment/mdl_id2utt \
+  --xvector_scp $exp_dir/embeddings/sre21/eval/enrollment/xvector_proc_embd_proc_cts_aug.scp \
+  --spk_xvector_ark $exp_dir/embeddings/sre21/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.ark
+
+
+# Create one scp with both enroll and test since this is expected by some scripts
+cat ${exp_dir}/embeddings/sre16/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.scp \
+    ${exp_dir}/embeddings/sre16/eval/test/xvector_proc_embd_proc_cts_aug.scp \
+    > ${exp_dir}/embeddings/sre16/eval/xvector_proc_embd_proc_cts_aug.scp
+
+cat ${exp_dir}/embeddings/sre18/dev/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.scp \
+    ${exp_dir}/embeddings/sre18/dev/test/xvector_proc_embd_proc_cts_aug.scp \
+    > ${exp_dir}/embeddings/sre18/dev/xvector_proc_embd_proc_cts_aug.scp
+
+cat ${exp_dir}/embeddings/sre18/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.scp \
+    ${exp_dir}/embeddings/sre18/eval/test/xvector_proc_embd_proc_cts_aug.scp \
+    > ${exp_dir}/embeddings/sre18/eval/xvector_proc_embd_proc_cts_aug.scp
+
+cat ${exp_dir}/embeddings/sre21/dev/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.scp \
+    ${exp_dir}/embeddings/sre21/dev/test/xvector_proc_embd_proc_cts_aug.scp \
+    > ${exp_dir}/embeddings/sre21/dev/xvector_proc_embd_proc_cts_aug.scp
+
+cat ${exp_dir}/embeddings/sre21/eval/enrollment/enroll_spk_xvector_proc_embd_proc_cts_aug.scp \
+    ${exp_dir}/embeddings/sre21/eval/test/xvector_proc_embd_proc_cts_aug.scp \
+    > ${exp_dir}/embeddings/sre21/eval/xvector_proc_embd_proc_cts_aug.scp
+
+
+##################################################################
+# sre16_major for sre16 eval
+echo "mean vector of enroll"
+python tools/vector_mean.py \
+  --spk2utt ${data}/sre16/eval/enrollment/spk2utt \
+  --xvector_scp $exp_dir/embeddings/sre16/eval/enrollment/xvector_proc_embd_proc_sre16_major.scp \
+  --spk_xvector_ark $exp_dir/embeddings/sre16/eval/enrollment/enroll_spk_xvector_proc_embd_proc_sre16_major.ark
+
+# Create one scp with both enroll and test since this is expected by some scripts
+cat ${exp_dir}/embeddings/sre16/eval/enrollment/enroll_spk_xvector_proc_embd_proc_sre16_major.scp \
+    ${exp_dir}/embeddings/sre16/eval/test/xvector_proc_embd_proc_sre16_major.scp \
+    > ${exp_dir}/embeddings/sre16/eval/xvector_proc_embd_proc_sre16_major.scp
+
+
+##################################################################
+# sre18_dev_unlabeled for sre18 dev/eval
+echo "mean vector of enroll"
+python tools/vector_mean.py \
+  --spk2utt ${data}/sre18/dev/enrollment/mdl_id2utt \
+  --xvector_scp $exp_dir/embeddings/sre18/dev/enrollment/xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
+  --spk_xvector_ark $exp_dir/embeddings/sre18/dev/enrollment/enroll_spk_xvector_proc_embd_proc_sre18_dev_unlabeled.ark
+
+python tools/vector_mean.py \
+  --spk2utt ${data}/sre18/eval/enrollment/mdl_id2utt \
+  --xvector_scp $exp_dir/embeddings/sre18/eval/enrollment/xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
+  --spk_xvector_ark $exp_dir/embeddings/sre18/eval/enrollment/enroll_spk_xvector_proc_embd_proc_sre18_dev_unlabeled.ark
+
+# Create one scp with both enroll and test since this is expected by some scripts
+cat ${exp_dir}/embeddings/sre18/dev/enrollment/enroll_spk_xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
+    ${exp_dir}/embeddings/sre18/dev/test/xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
+    > ${exp_dir}/embeddings/sre18/dev/xvector_proc_embd_proc_sre18_dev_unlabeled.scp
+
+cat ${exp_dir}/embeddings/sre18/eval/enrollment/enroll_spk_xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
+    ${exp_dir}/embeddings/sre18/eval/test/xvector_proc_embd_proc_sre18_dev_unlabeled.scp \
+    > ${exp_dir}/embeddings/sre18/eval/xvector_proc_embd_proc_sre18_dev_unlabeled.scp
+
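Each `tools/vector_mean.py` call in the script above takes a `spk2utt` (or `mdl_id2utt`) mapping and an utterance-level embedding list, and writes one vector per enrollment model. The Python sketch below shows the presumed core operation, a per-model average, on in-memory data. It is not the code of `tools/vector_mean.py`; the function names and the toy vectors are hypothetical, and reading and writing of scp/ark files is left out.

```python
import numpy as np


def parse_spk2utt(lines):
    """Parse Kaldi-style lines of the form '<model_id> <utt1> <utt2> ...'."""
    spk2utt = {}
    for line in lines:
        parts = line.split()
        if parts:
            spk2utt[parts[0]] = parts[1:]
    return spk2utt


def speaker_means(spk2utt, utt2vec):
    """Average the utterance embeddings that belong to each model."""
    return {spk: np.mean([utt2vec[utt] for utt in utts], axis=0)
            for spk, utts in spk2utt.items()}


# Toy input (hypothetical): two enrollment models, 2-dimensional embeddings.
spk2utt = parse_spk2utt(["mdl1 utt_a utt_b", "mdl2 utt_c"])
utt2vec = {
    "utt_a": np.array([1.0, 3.0]),
    "utt_b": np.array([3.0, 5.0]),
    "utt_c": np.array([0.5, -0.5]),
}

enroll = speaker_means(spk2utt, utt2vec)
print(enroll["mdl1"])
```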
diff --git a/examples/sre/v3/local/download_data.sh b/examples/sre/v3/local/download_data.sh
@@ -0,0 +1 @@
+../../../voxceleb/v2/local/download_data.sh