[diar] merge voxconverse v3 into v2 and update results in README.md #352

Merged
merged 6 commits into from
Aug 26, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -60,7 +60,7 @@ pre-commit install # for clean and tidy code
```

## 🔥 News
-* 2024.08.20: Update diarization recipe for VoxConverse dataset by leveraging umap dimensionality reduction and hdbscan clustering, see [#347](https://github.com/wenet-e2e/wespeaker/pull/347).
+* 2024.08.20: Update diarization recipe for VoxConverse dataset by leveraging umap dimensionality reduction and hdbscan clustering, see [#347](https://github.com/wenet-e2e/wespeaker/pull/347) and [#352](https://github.com/wenet-e2e/wespeaker/pull/352).
* 2024.08.18: Support using ssl pre-trained models as the frontend. The [WavLM recipe](https://github.com/wenet-e2e/wespeaker/blob/master/examples/voxceleb/v2/run_wavlm.sh) is also provided, see [#344](https://github.com/wenet-e2e/wespeaker/pull/344).
* 2024.05.15: Add support for [quality-aware score calibration](https://arxiv.org/pdf/2211.00815), see [#320](https://github.com/wenet-e2e/wespeaker/pull/320).
* 2024.04.25: Add support for the gemini-dfresnet model, see [#291](https://github.com/wenet-e2e/wespeaker/pull/291).
4 changes: 4 additions & 0 deletions examples/voxconverse/README.md
@@ -1,3 +1,7 @@
This is a **WeSpeaker** speaker diarization recipe on the VoxConverse 2020 dataset. The dataset targets an ``in the wild`` scenario: it was collected from YouTube videos with a semi-automatic pipeline and released for the diarization track of the VoxSRC 2020 Challenge. See https://www.robots.ox.ac.uk/~vgg/data/voxconverse/ for more detailed information.

Two recipes are provided: **v1** and **v2**. Their only difference is that **v2** splits the Fbank extraction, embedding extraction and clustering modules into different stages. We recommend that newcomers follow the **v2** recipe and run it stage by stage.

+🔥 UPDATE 2024.08.20:
+* silero-vad v5.1 is used in place of v3.1
+* umap dimensionality reduction + hdbscan clustering is also supported in v2
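Running the recipe "stage by stage" relies on the `stage` / `stop_stage` guard that appears throughout `run.sh` (`if [ ${stage} -le N ] && [ ${stop_stage} -ge N ]`). The sketch below only illustrates that guard. `should_run` and `selected_stages` are hypothetical helpers that are not part of WeSpeaker, and the stage range 1 to 8 is an assumption based on the stages visible in this diff.

```shell
#!/bin/bash
# Hypothetical helper (not part of WeSpeaker): true when stage n falls
# inside the inclusive range [stage, stop_stage], as in the recipe's guards.
should_run() {
  local stage=$1 stop_stage=$2 n=$3
  [ "${stage}" -le "${n}" ] && [ "${stop_stage}" -ge "${n}" ]
}

# Hypothetical helper: list which of the stages 1..8 a given
# stage/stop_stage pair would execute (range 1..8 is an assumption).
selected_stages() {
  local stage=$1 stop_stage=$2 n out=""
  for n in 1 2 3 4 5 6 7 8; do
    if should_run "${stage}" "${stop_stage}" "${n}"; then
      out="${out}${out:+ }${n}"
    fi
  done
  echo "${out}"
}

# Clustering, RTTM conversion and scoring only
selected_stages 6 8
```

With the defaults shown in `run.sh` (`stage=-1`, `stop_stage=-1`), no guard is satisfied, so both values have to be set explicitly before anything runs.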
12 changes: 7 additions & 5 deletions examples/voxconverse/v1/README.md
@@ -1,11 +1,13 @@
## Overview

* We suggest running this recipe on a machine with a GPU and onnxruntime-gpu installed.
-* Dataset: voxconverse_dev that consists of 216 utterances
-* Speaker model: ResNet34 model pretrained by wespeaker
+* Dataset: Voxconverse2020 (dev: 216 utts)
+* Speaker model: ResNet34 model pretrained by WeSpeaker
     * Refer to [voxceleb sv recipe](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb/v2)
     * [pretrained model path](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.onnx)
-* Speaker activity detection model: oracle SAD (from ground truth annotation) or system SAD (VAD model pretrained by silero, https://github.com/snakers4/silero-vad)
+* Speaker activity detection model:
+    * oracle SAD (from ground truth annotation)
+    * system SAD (VAD model pretrained by [silero-vad](https://github.com/snakers4/silero-vad), v3.1 is deprecated now)
* Clustering method: spectral clustering
* Metric: DER = MISS + FALSE ALARM + SPEAKER CONFUSION (%)

@@ -15,8 +17,8 @@

| system | MISS | FA | SC | DER |
|:---|:---:|:---:|:---:|:---:|
-| This repo (with oracle SAD) | 2.3 | 0.0 | 1.9 | 4.2 |
-| This repo (with system SAD) | 3.7 | 0.8 | 2.0 | 6.5 |
+| Ours (oracle SAD + spectral clustering) | 2.3 | 0.0 | 1.9 | 4.2 |
+| Ours (silero-vad v3.1 + spectral clustering) | 3.7 | 0.8 | 2.0 | 6.5 |
| DIHARD 2019 baseline [^1] | 11.1 | 1.4 | 11.3 | 23.8 |
| DIHARD 2019 baseline w/ SE [^1] | 9.3 | 1.3 | 9.7 | 20.2 |
| (SyncNet ASD only) [^1] | 2.2 | 4.1 | 4.0 | 10.4 |
5 changes: 2 additions & 3 deletions examples/voxconverse/v1/run.sh
@@ -29,8 +29,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
unzip -o external_tools/SCTK-v2.4.12.zip -d external_tools

# [2] Download voice activity detection model pretrained by Silero Team
-wget -c https://github.com/snakers4/silero-vad/archive/refs/tags/v3.1.zip -O external_tools/silero-vad-v3.1.zip
-unzip -o external_tools/silero-vad-v3.1.zip -d external_tools
+#wget -c https://github.com/snakers4/silero-vad/archive/refs/tags/v3.1.zip -O external_tools/silero-vad-v3.1.zip
+#unzip -o external_tools/silero-vad-v3.1.zip -d external_tools

# [3] Download ResNet34 speaker model pretrained by WeSpeaker Team
mkdir -p pretrained_models
@@ -79,7 +79,6 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
if [[ "x${sad_type}" == "xsystem" ]]; then
# System SAD: applying 'silero' VAD
python3 wespeaker/diar/make_system_sad.py \
-    --repo-path external_tools/silero-vad-3.1 \
--scp data/dev/wav.scp \
--min-duration $min_duration > data/dev/system_sad
fi
25 changes: 18 additions & 7 deletions examples/voxconverse/v2/README.md
@@ -1,12 +1,16 @@
## Overview

* We suggest running this recipe on a machine with a GPU and onnxruntime-gpu installed.
-* Dataset: voxconverse_dev that consists of 216 utterances
-* Speaker model: ResNet34 model pretrained by wespeaker
+* Dataset: Voxconverse2020 (dev: 216 utts, test: 232 utts)
+* Speaker model: ResNet34 model pretrained by WeSpeaker
     * Refer to [voxceleb sv recipe](https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb/v2)
     * [pretrained model path](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.onnx)
-* Speaker activity detection model: oracle SAD (from ground truth annotation) or system SAD (VAD model pretrained by silero, https://github.com/snakers4/silero-vad)
-* Clustering method: spectral clustering
+* Speaker activity detection model:
+    * oracle SAD (from ground truth annotation)
+    * system SAD (VAD model pretrained by [silero-vad](https://github.com/snakers4/silero-vad), v3.1 => v5.1)
+* Clustering method:
+    * spectral clustering
+    * umap dimensionality reduction + hdbscan clustering
* Metric: DER = MISS + FALSE ALARM + SPEAKER CONFUSION (%)

## Results
@@ -15,8 +19,11 @@

| system | MISS | FA | SC | DER |
|:---|:---:|:---:|:---:|:---:|
-| This repo (with oracle SAD) | 2.3 | 0.0 | 2.1 | 4.4 |
-| This repo (with system SAD) | 3.7 | 0.8 | 2.2 | 6.8 |
+| Ours (oracle SAD + spectral clustering) | 2.3 | 0.0 | 2.1 | 4.4 |
+| Ours (oracle SAD + umap clustering) | 2.3 | 0.0 | 1.3 | 3.6 |
+| Ours (silero-vad v3.1 + spectral clustering) | 3.7 | 0.8 | 2.2 | 6.7 |
+| Ours (silero-vad v5.1 + spectral clustering) | 3.4 | 0.6 | 2.3 | 6.3 |
+| Ours (silero-vad v5.1 + umap clustering) | 3.4 | 0.6 | 1.4 | 5.4 |
| DIHARD 2019 baseline [^1] | 11.1 | 1.4 | 11.3 | 23.8 |
| DIHARD 2019 baseline w/ SE [^1] | 9.3 | 1.3 | 9.7 | 20.2 |
| (SyncNet ASD only) [^1] | 2.2 | 4.1 | 4.0 | 10.4 |
@@ -27,7 +34,11 @@

| system | MISS | FA | SC | DER |
|:---|:---:|:---:|:---:|:---:|
-| This repo (with system SAD) | 4.0 | 2.4 | 3.4 | 9.8 |
+| Ours (oracle SAD + spectral clustering) | 1.6 | 0.0 | 3.3 | 4.9 |
+| Ours (oracle SAD + umap clustering) | 1.6 | 0.0 | 1.9 | 3.5 |
+| Ours (silero-vad v3.1 + spectral clustering) | 4.0 | 2.4 | 3.4 | 9.8 |
+| Ours (silero-vad v5.1 + spectral clustering) | 3.8 | 1.7 | 3.3 | 8.8 |
+| Ours (silero-vad v5.1 + umap clustering) | 3.8 | 1.7 | 1.8 | 7.3 |


[^1]: Spot the conversation: speaker diarisation in the wild, https://arxiv.org/pdf/2007.01216.pdf
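Since the README defines DER = MISS + FALSE ALARM + SPEAKER CONFUSION, the rows of the result tables can be sanity-checked with a few lines of shell. This is a minimal sketch: `der_sum` and `rel_reduction` are hypothetical helper names introduced only for illustration, and the numbers passed in are copied from the test-set table above.

```shell
#!/bin/bash
# Hypothetical helper: DER as the sum of its three components,
# printed with one decimal like the README tables.
der_sum() {
  awk -v miss="$1" -v fa="$2" -v sc="$3" \
    'BEGIN { printf "%.1f\n", miss + fa + sc }'
}

# Hypothetical helper: relative DER reduction (%) from a baseline
# system to a new system.
rel_reduction() {
  awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%.1f\n", 100 * (base - new) / base }'
}

# Test set, silero-vad v5.1 + umap clustering: MISS, FA, SC
der_sum 3.8 1.7 1.8

# Test set, silero-vad v5.1: spectral (8.8) versus umap (7.3)
rel_reduction 8.8 7.3
```

The same check explains the corrected dev-set entry for silero-vad v3.1 + spectral clustering, where 3.7 + 0.8 + 2.2 gives 6.7 rather than the previous 6.8. Rows quoted from [^1] can differ from the column sum by 0.1, presumably because each column was rounded independently.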
35 changes: 16 additions & 19 deletions examples/voxconverse/v2/run.sh
@@ -1,6 +1,7 @@
#!/bin/bash
# Copyright (c) 2022-2023 Xu Xiang
# 2022 Zhengyang Chen ([email protected])
+# 2024 Hongji Wang ([email protected])
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -18,8 +19,9 @@

stage=-1
stop_stage=-1
-sad_type="oracle"
-partition="dev"
+sad_type="oracle" # oracle/system
+partition="dev" # dev/test
+cluster_type="spectral" # spectral/umap

# do cmn on the sub-segment or on the vad segment
subseg_cmn=true
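Later stages of this script interpolate `cluster_type` directly into the clusterer script name and the output directory, so a mistyped value would only surface as a missing-file error at stage 6. The following is a minimal sketch of an explicit check, not something this PR adds. `resolve_clusterer` and `labels_path` are hypothetical helpers, and the paths follow the patterns visible in this diff.

```shell
#!/bin/bash
# Hypothetical helper (not in the recipe): map cluster_type to the
# clusterer script that stage 6 calls, rejecting unknown values early.
resolve_clusterer() {
  local cluster_type=$1
  case "${cluster_type}" in
    spectral|umap)
      echo "wespeaker/diar/${cluster_type}_clusterer.py"
      ;;
    *)
      echo "Unsupported cluster_type: ${cluster_type} (expected spectral or umap)" >&2
      return 1
      ;;
  esac
}

# Hypothetical helper: build the label file path using the same
# pattern as the --output argument in stage 6.
labels_path() {
  local cluster_type=$1 partition=$2 sad_type=$3
  echo "exp/${cluster_type}_cluster/${partition}_${sad_type}_sad_labels"
}

resolve_clusterer umap
labels_path umap test system
```

Because results for each combination land in a separate `exp/${cluster_type}_cluster/` directory, spectral and umap runs on the same partition do not overwrite each other.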
@@ -36,11 +38,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
wget -c https://github.com/usnistgov/SCTK/archive/refs/tags/v2.4.12.zip -O external_tools/SCTK-v2.4.12.zip
unzip -o external_tools/SCTK-v2.4.12.zip -d external_tools

-# [2] Download voice activity detection model pretrained by Silero Team
-wget -c https://github.com/snakers4/silero-vad/archive/refs/tags/v3.1.zip -O external_tools/silero-vad-v3.1.zip
-unzip -o external_tools/silero-vad-v3.1.zip -d external_tools
-
-# [3] Download ResNet34 speaker model pretrained by WeSpeaker Team
+# [2] Download ResNet34 speaker model pretrained by WeSpeaker Team
mkdir -p pretrained_models

wget -c https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.onnx -O pretrained_models/voxceleb_resnet34_LM.onnx
@@ -101,7 +99,6 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
if [[ "x${sad_type}" == "xsystem" ]]; then
# System SAD: applying 'silero' VAD
python3 wespeaker/diar/make_system_sad.py \
-    --repo-path external_tools/silero-vad-3.1 \
--scp data/${partition}/wav.scp \
--min-duration $min_duration > data/${partition}/system_sad
fi
@@ -144,24 +141,24 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
fi


-# Applying spectral clustering algorithm
+# Applying spectral or umap+hdbscan clustering algorithm
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then

-[ -f "exp/spectral_cluster/${partition}_${sad_type}_sad_labels" ] && rm exp/spectral_cluster/${partition}_${sad_type}_sad_labels
+[ -f "exp/${cluster_type}_cluster/${partition}_${sad_type}_sad_labels" ] && rm exp/${cluster_type}_cluster/${partition}_${sad_type}_sad_labels

-echo "Doing spectral clustering and store the result in exp/spectral_cluster/${partition}_${sad_type}_sad_labels"
+echo "Doing ${cluster_type} clustering and store the result in exp/${cluster_type}_cluster/${partition}_${sad_type}_sad_labels"
echo "..."
-python3 wespeaker/diar/spectral_clusterer.py \
+python3 wespeaker/diar/${cluster_type}_clusterer.py \
--scp exp/${partition}_${sad_type}_sad_embedding/emb.scp \
-    --output exp/spectral_cluster/${partition}_${sad_type}_sad_labels
+    --output exp/${cluster_type}_cluster/${partition}_${sad_type}_sad_labels
fi


# Convert labels to RTTMs
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
python3 wespeaker/diar/make_rttm.py \
-    --labels exp/spectral_cluster/${partition}_${sad_type}_sad_labels \
-    --channel 1 > exp/spectral_cluster/${partition}_${sad_type}_sad_rttm
+    --labels exp/${cluster_type}_cluster/${partition}_${sad_type}_sad_labels \
+    --channel 1 > exp/${cluster_type}_cluster/${partition}_${sad_type}_sad_rttm
fi


@@ -173,18 +170,18 @@ if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
perl external_tools/SCTK-2.4.12/src/md-eval/md-eval.pl \
-c 0.25 \
-r <(cat ${ref_dir}/${partition}/*.rttm) \
-    -s exp/spectral_cluster/${partition}_${sad_type}_sad_rttm 2>&1 | tee exp/spectral_cluster/${partition}_${sad_type}_sad_res
+    -s exp/${cluster_type}_cluster/${partition}_${sad_type}_sad_rttm 2>&1 | tee exp/${cluster_type}_cluster/${partition}_${sad_type}_sad_res

if [ ${get_each_file_res} -eq 1 ];then
-single_file_res_dir=exp/spectral_cluster/${partition}_${sad_type}_single_file_res
+single_file_res_dir=exp/${cluster_type}_cluster/${partition}_${sad_type}_single_file_res
mkdir -p $single_file_res_dir
echo -e "\nGet the DER results for each file and the results will be stored under ${single_file_res_dir}\n..."

-awk '{print $2}' exp/spectral_cluster/${partition}_${sad_type}_sad_rttm | sort -u | while read file_name; do
+awk '{print $2}' exp/${cluster_type}_cluster/${partition}_${sad_type}_sad_rttm | sort -u | while read file_name; do
perl external_tools/SCTK-2.4.12/src/md-eval/md-eval.pl \
-c 0.25 \
-r <(cat ${ref_dir}/${partition}/${file_name}.rttm) \
-    -s <(grep "${file_name}" exp/spectral_cluster/${partition}_${sad_type}_sad_rttm) > ${single_file_res_dir}/${partition}_${file_name}_res
+    -s <(grep "${file_name}" exp/${cluster_type}_cluster/${partition}_${sad_type}_sad_rttm) > ${single_file_res_dir}/${partition}_${file_name}_res
done
echo "Done!"
fi
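The per-file scoring loop above takes the recording ID from field 2 of each RTTM line and then selects that recording's lines with `grep "${file_name}"`, which matches the ID anywhere in the line. If one ID were a substring of another, lines from both recordings would be selected. The sketch below shows an exact-match alternative on toy data. The RTTM lines and the helper names (`sample_rttm`, `list_recordings`, `select_recording`) are made up for illustration and are not part of the recipe.

```shell
#!/bin/bash
# Made-up RTTM lines for illustration; field 2 is the recording ID.
sample_rttm() {
  cat <<'EOF'
SPEAKER abc 1 0.00 1.50 <NA> <NA> 0 <NA> <NA>
SPEAKER abcd 1 0.20 2.00 <NA> <NA> 1 <NA> <NA>
SPEAKER abc 1 1.50 0.75 <NA> <NA> 1 <NA> <NA>
EOF
}

# Unique recording IDs, using the same awk + sort -u idiom as the recipe.
list_recordings() {
  awk '{print $2}' | sort -u
}

# Hypothetical alternative to grep: keep only lines whose field 2 equals
# the requested ID. Appending "" forces a string comparison in awk.
select_recording() {
  awk -v id="$1" '($2 "") == (id "")'
}

sample_rttm | list_recordings
sample_rttm | select_recording abc
```

On this toy input, `grep "abc"` would return all three lines, while `select_recording abc` returns only the two that belong to recording `abc`.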
34 changes: 0 additions & 34 deletions examples/voxconverse/v3/README.md

This file was deleted.

1 change: 0 additions & 1 deletion examples/voxconverse/v3/local

This file was deleted.

1 change: 0 additions & 1 deletion examples/voxconverse/v3/path.sh

This file was deleted.
