# NeMo ASR+VAD Inference
This example provides an ASR+VAD inference pipeline, with the option to perform ASR or VAD alone.
## Input

There are two types of input:
- A manifest passed to `manifest_filepath`;
- A directory containing audio files passed to `audio_dir`, in which case `audio_type` (default `wav`) should also be specified; see the sketch after this list.
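For example, the directory input might be used as in the following sketch; the directory path is a placeholder, and the model and config choices simply mirror the default command shown in the Usage section below.

```bash
# Sketch: run the script on a directory of .wav files instead of a manifest.
python speech_to_text_with_vad.py \
    audio_dir=/PATH/TO/AUDIO_DIR \
    audio_type=wav \
    vad_model=vad_multilingual_marblenet \
    asr_model=stt_en_conformer_ctc_large \
    vad_config=../conf/vad/vad_inference_postprocess.yaml
```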
The input manifest must be a manifest JSON file, where each line is a JSON dictionary. The fields `["audio_filepath", "offset", "duration", "text"]` are required. An example of a manifest file is:
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "a b c d e"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "f g h i j"}
## Output
The output is a folder storing the VAD predictions and/or a manifest containing the audio transcriptions. Some temporary data will also be stored.
## Usage
To run the code with ASR+VAD default settings:
```bash
python speech_to_text_with_vad.py \
    manifest_filepath=/PATH/TO/MANIFEST.json \
    vad_model=vad_multilingual_marblenet \
    asr_model=stt_en_conformer_ctc_large \
    vad_config=../conf/vad/vad_inference_postprocess.yaml
```
To use only ASR and disable VAD, set `vad_model=None` and `use_rttm=False`.
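For instance, an ASR-only run might look like the following sketch; the manifest path is a placeholder and the model name is taken from the default command above.

```bash
# Sketch: ASR-only inference; VAD and RTTM-based feature masking are disabled.
python speech_to_text_with_vad.py \
    manifest_filepath=/PATH/TO/MANIFEST.json \
    asr_model=stt_en_conformer_ctc_large \
    vad_model=None \
    use_rttm=False
```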
To use only VAD, set `asr_model=None` and specify both `vad_model` and `vad_config`.
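A VAD-only run could then be sketched as follows, again with a placeholder manifest path:

```bash
# Sketch: VAD-only inference; no transcription is produced.
python speech_to_text_with_vad.py \
    manifest_filepath=/PATH/TO/MANIFEST.json \
    asr_model=None \
    vad_model=vad_multilingual_marblenet \
    vad_config=../conf/vad/vad_inference_postprocess.yaml
```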
To enable profiling, set `profiling=True`, but this will significantly slow down the program.
To enable or disable feature masking, set `use_rttm` to `True` or `False`.
To normalize features before masking, set `normalize=pre_norm`; to apply masking before normalization, set `normalize=post_norm`.
To use a specific value for feature masking, set `feat_mask_val` to the desired value.
The default is `feat_mask_val=None`, in which case -16.530 (the zero value of a log mel-spectrogram) is used for `post_norm` and 0 (the same as SpecAugment) is used for `pre_norm`.
See more options in the `InferenceConfig` class.