This demo shows Automatic Speech Recognition (ASR) with a pretrained Wav2Vec model.
The demo reads and normalizes the audio signal, runs a neural network to get character probabilities, applies CTC greedy decoding, and prints the decoded text.
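The normalization and CTC greedy decoding steps described above can be sketched in plain Python. This is only an illustration, not the demo's actual code: the toy vocabulary, the choice of token 0 as the CTC blank, the use of "|" as a word delimiter, and the zero-mean, unit-variance normalization are all assumptions made for the example.

```python
import math
from itertools import groupby

# Hypothetical toy vocabulary, for illustration only. The real mapping
# comes from the model's vocabulary (see the --vocab option).
VOCAB = {0: "<pad>", 1: "|", 2: "A", 3: "C", 4: "T"}
BLANK_ID = 0          # assumption: the padding token doubles as the CTC blank
WORD_DELIMITER = "|"  # assumption: "|" marks word boundaries


def normalize(samples):
    """Scale audio samples to zero mean and unit variance."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    std = math.sqrt(var) + 1e-15  # small epsilon guards against pure silence
    return [(s - mean) / std for s in samples]


def ctc_greedy_decode(probs):
    """Decode a list of per-frame score rows (one score per vocabulary entry)."""
    # 1. Pick the most likely token in every time frame.
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    # 2. Collapse runs of the same token into a single token.
    collapsed = [token for token, _ in groupby(best)]
    # 3. Drop blanks, map ids to characters, turn delimiters into spaces.
    chars = [VOCAB[t] for t in collapsed if t != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Frames predicting: C C <pad> A A T <pad> |
frame_ids = [3, 3, 0, 2, 2, 4, 0, 1]
scores = [[1.0 if j == i else 0.0 for j in range(len(VOCAB))] for i in frame_ids]
print(ctc_greedy_decode(scores))  # CAT
```

Because repeats are collapsed before blanks are removed, a genuinely doubled letter survives only when a blank frame separates the two occurrences.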
The list of models supported by the demo is in the <omz_dir>/demos/speech_recognition_wav2vec_demo/python/models.lst file.
This file can be used as a parameter for Model Downloader and Converter to download and, if necessary, convert models to OpenVINO IR format (*.xml + *.bin).
An example of using the Model Downloader:
omz_downloader --list models.lst
An example of using the Model Converter:
omz_converter --list models.lst
Supported models:
- wav2vec2-base
NOTE: Refer to the tables Intel's Pre-Trained Models Device Support and Public Pre-Trained Models Device Support for details on model inference support on different devices.
Run the application with the -h option to see the help message.
usage: speech_recognition_wav2vec_demo.py [-h] -m MODEL -i INPUT [-d DEVICE]
                                          [--vocab VOCAB] [--dynamic_shape]

optional arguments:
  -h, --help            Show this help message and exit.
  -m MODEL, --model MODEL
                        Required. Path to an .xml file with a trained model.
  -i INPUT, --input INPUT
                        Required. Path to an audio file in WAV PCM 16 kHz mono format.
  -d DEVICE, --device DEVICE
                        Optional. Specify the target device to infer on, for example: CPU, GPU or HETERO.
                        The demo will look for a suitable OpenVINO Runtime plugin for this device.
                        Default value is CPU.
  --vocab VOCAB         Optional. Path to an .json file with encoding vocabulary.
  --dynamic_shape       Optional. Using dynamic shapes for inputs and outputs of model.
The typical command line is:
python3 speech_recognition_wav2vec_demo.py -m wav2vec2-base.xml -i audio.wav
NOTE: Only 16-bit, 16 kHz, mono-channel WAVE audio files are supported.
An example audio file can be taken from https://storage.openvinotoolkit.org/models_contrib/speech/2021.2/librispeech_s5/how_are_you_doing_today.wav.
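Since only 16-bit, 16 kHz, mono WAVE files are accepted, it can help to check a file before passing it to the demo. The following is a minimal sketch using Python's standard `wave` module. The helper name `wav_format_problems` is hypothetical and is not part of the demo.

```python
import wave


def wav_format_problems(path):
    """Return a list of mismatches against the 16-bit, 16 kHz, mono format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected 1 channel, got {wav.getnchannels()}")
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(
                f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit"
            )
        if wav.getframerate() != 16000:
            problems.append(f"expected 16000 Hz, got {wav.getframerate()} Hz")
    return problems
```

An empty list means the file matches the expected format. Any entries describe what would need to be converted, for example by resampling or downmixing with an audio tool.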
The application prints the decoded text for the audio file. The demo reports:
- Latency: total processing time required to process input data (from reading the data to displaying the results).
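As an illustration of the latency metric above, end-to-end time can be measured by wrapping the whole pipeline in a monotonic timer. This is a generic sketch and not necessarily how the demo measures latency. The helper name `timed` is hypothetical.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = func(*args, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return result, latency_ms
```

To match the definition above, the wrapped function would need to cover every stage, from reading the audio file to producing the decoded text.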