In [None]:
"""
Please run this notebook locally (if you have all the dependencies and a GPU).
Technically you can run this notebook on Google Colab, but you need to set up a microphone for Colab.
 
Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
5. Set up microphone for Colab
"""
# If you're using Google Colab and not running locally, run this cell.

## Install dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg portaudio19-dev
!pip install text-unidecode
!pip install pyaudio

# ## Install NeMo
BRANCH = 'r1.17.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]

## Install TorchAudio
# Quote the requirement so the shell does not treat ">=" as an output redirect
!pip install "torchaudio>=0.13.0" -f https://download.pytorch.org/whl/torch_stable.html

# Voice Activity Detection (VAD)


This notebook demonstrates how to perform
1. [offline streaming inference on audio files (offline VAD)](#Offline-streaming-inference);
2. [finetuning](#Finetune) and use [posterior](#Posterior);
3. [vad postprocessing and threshold tuning](#VAD-postprocessing-and-Tuning-threshold);
4. [online streaming inference](#Online-streaming-inference);
5. [online streaming inference from a microphone's stream](#Online-streaming-inference-through-microphone).

Note: incompatibilities between components can cause this notebook to fail when it is run locally inside a container. We may deprecate this notebook and provide a better tutorial in an upcoming release.

The notebook requires the PyAudio library to capture a signal from an audio device.
For Ubuntu, please run the following commands to install it:
```
sudo apt install python3-pyaudio
pip install pyaudio
```

This notebook requires the `torchaudio` library to be installed for MarbleNet. Please follow the instructions available at the [torchaudio installer](https://github.com/NVIDIA/NeMo/blob/main/scripts/installers/install_torchaudio_latest.sh) and [torchaudio Github page](https://github.com/pytorch/audio#installation) to install the appropriate version of torchaudio.


In [None]:
import numpy as np
import pyaudio as pa
import os, time
import librosa
import IPython.display as ipd
import matplotlib.pyplot as plt
%matplotlib inline

import nemo
import nemo.collections.asr as nemo_asr

In [None]:
# sample rate, Hz
SAMPLE_RATE = 16000

## Restore the model from NGC

In [None]:
vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained('vad_marblenet')

## Observing the config of the model

In [None]:
from omegaconf import OmegaConf
import copy

In [None]:
# Preserve a copy of the full config
cfg = copy.deepcopy(vad_model._cfg)
print(OmegaConf.to_yaml(cfg))

## Setup preprocessor with these settings

In [None]:
vad_model.preprocessor = vad_model.from_config_dict(cfg.preprocessor)

In [None]:
# Set model to inference mode
vad_model.eval();

In [None]:
vad_model = vad_model.to(vad_model.device)

We demonstrate two methods for streaming inference:
1. [offline streaming inference (script)](#Offline-streaming-inference)
2. [online streaming inference (step-by-step)](#Online-streaming-inference)

# Offline streaming inference

VAD relies on shorter fixed-length segments for prediction. 

You can find all necessary steps about inference in 
```python
 Script: /examples/asr/speech_classification/vad_infer.py 
 Config: /examples/asr/conf/vad/vad_inference_postprocessing.yaml
```
During inference, we generate frame-level predictions using two approaches:

1. Shift a window of length `window_length_in_sec` (e.g. 0.63s) by `shift_length_in_sec` (e.g. 10ms) to generate each frame, and use the prediction for the window as the label for that frame. Use 
```python
 /examples/asr/speech_classification/vad_infer.py
```

 This script automatically splits long audio files to avoid CUDA memory issues and performs **streaming** inside `AudioLabelDataset`.
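
As a rough illustration of approach 1, the sketch below enumerates the windows that a sliding-window pass would feed to the model. The helper name `sliding_windows` is hypothetical, and the sketch does no padding at the edges; the real segmentation happens inside `vad_infer.py` and `AudioLabelDataset` and may differ in such details.

```python
def sliding_windows(num_samples, sample_rate, window_length_in_sec, shift_length_in_sec):
    """Yield (start, end) sample indices of each analysis window.

    Hypothetical helper for illustration only; it does not pad the edges.
    """
    window = int(round(window_length_in_sec * sample_rate))
    shift = int(round(shift_length_in_sec * sample_rate))
    start = 0
    while start + window <= num_samples:
        yield start, start + window
        start += shift


# One second of 16 kHz audio, 0.63 s windows shifted by 10 ms.
windows = list(sliding_windows(16000, 16000, 0.63, 0.01))
print(len(windows), windows[0], windows[-1])
```

Each window produces one prediction, so a smaller shift yields a finer time resolution at the cost of more forward passes.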

### Posterior


2. Generate predictions with overlapping input segments, then apply a smoothing filter to decide the label for a frame spanned by multiple segments. Perform this step together with the step above by setting the flag **gen_overlap_seq=True**, or use
```python
/scripts/voice_activity_detection/vad_overlap_posterior.py
```
if you already have frame-level predictions. 

Have a look at the [MarbleNet paper](https://arxiv.org/pdf/2010.13886.pdf) for choices of segment length, smoothing filter, etc., and experiment with those parameters on your own data.
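
To make the idea of smoothing over overlapping segments concrete, here is a minimal sketch in plain Python. It assumes each window covers a fixed number of frames and that consecutive windows are shifted by exactly one frame; the function name `smooth_overlapping` and these conventions are assumptions for illustration, not the implementation in `vad_overlap_posterior.py`.

```python
from statistics import mean, median


def smooth_overlapping(window_probs, frames_per_window, method="median"):
    """Combine per-window speech probabilities into per-frame probabilities.

    Window i is assumed to cover frames i .. i + frames_per_window - 1
    (a shift of one frame). Illustrative only, not NeMo's implementation.
    """
    reducer = median if method == "median" else mean
    num_frames = len(window_probs) + frames_per_window - 1
    covering = [[] for _ in range(num_frames)]
    for i, prob in enumerate(window_probs):
        for frame in range(i, i + frames_per_window):
            covering[frame].append(prob)
    # Each frame takes the median (or mean) of every window that spans it.
    return [reducer(probs) for probs in covering]


probs = [0.1, 0.2, 0.9, 0.8, 0.1]
print(smooth_overlapping(probs, frames_per_window=3, method="median"))
print(smooth_overlapping(probs, frames_per_window=3, method="mean"))
```

A median filter tends to suppress isolated spikes, while a mean filter gives smoother transitions; which works better depends on your data.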

Postprocessing that converts frame-level predictions into speech/no-speech segments in start and end time format is also available in `vad_overlap_posterior.py`, or through the flag **gen_seg_table=True** together with `vad_infer.py`.

### Finetune
You might need to finetune on your data for better performance. For finetuning/transfer learning, please refer to the [**Transfer learning** part of the ASR tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb).

## VAD postprocessing and Tuning threshold

We can use a single **threshold** (achieved by onset=offset=0.5) to binarize predictions, or use typical VAD postprocessing, including:

### Binarization:
1. **onset** and **offset** thresholds for detecting the beginning and end of speech;
2. padding durations before (**pad_onset**) and after (**pad_offset**) each speech segment.

### Filtering:
1. threshold for short speech segment deletion (**min_duration_on**);
2. threshold for small silence deletion (**min_duration_off**);
3. Whether to perform short speech segment deletion first (**filter_speech_first**).
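
The binarization and filtering steps listed above can be sketched as follows. This is a simplified stand-in written for this tutorial, assuming frame-level speech probabilities at a fixed frame duration `frame_sec`; the function names are hypothetical and the edge-case handling may differ from NeMo's postprocessing.

```python
def binarize(probs, frame_sec, onset=0.5, offset=0.5, pad_onset=0.0, pad_offset=0.0):
    """Turn frame-level speech probabilities into (start, end) segments in seconds."""
    segments = []
    in_speech = False
    start = 0.0
    for i, p in enumerate(probs):
        t = i * frame_sec
        if not in_speech and p >= onset:
            in_speech, start = True, t          # speech begins
        elif in_speech and p < offset:
            in_speech = False                   # speech ends
            segments.append((max(0.0, start - pad_onset), t + pad_offset))
    if in_speech:                               # audio ended during speech
        segments.append((max(0.0, start - pad_onset), len(probs) * frame_sec + pad_offset))
    return segments


def remove_short_speech(segments, min_duration_on):
    """Delete speech segments shorter than min_duration_on seconds."""
    return [(s, e) for s, e in segments if e - s >= min_duration_on]


def fill_short_silence(segments, min_duration_off):
    """Merge segments separated by less than min_duration_off seconds of silence."""
    merged = []
    for s, e in segments:
        if merged and s - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def postprocess(probs, frame_sec, onset=0.5, offset=0.5, pad_onset=0.0, pad_offset=0.0,
                min_duration_on=0.0, min_duration_off=0.0, filter_speech_first=True):
    segments = binarize(probs, frame_sec, onset, offset, pad_onset, pad_offset)
    if filter_speech_first:
        segments = remove_short_speech(segments, min_duration_on)
        segments = fill_short_silence(segments, min_duration_off)
    else:
        segments = fill_short_silence(segments, min_duration_off)
        segments = remove_short_speech(segments, min_duration_on)
    return segments


probs = [0.1, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1]
for speech_first in (True, False):
    print(speech_first, postprocess(probs, 0.25, min_duration_on=0.5,
                                    min_duration_off=0.5, filter_speech_first=speech_first))
```

The order of the two filters matters: deleting short speech first drops the brief burst at the start, whereas filling short silences first merges it into the following segment.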


You can also tune thresholds on frame-level predictions. We provide a script 
```python
/scripts/voice_activity_detection/vad_tune_threshold.py
```

to help you find the best thresholds if you have ground-truth label files in RTTM format. 
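
For intuition, threshold tuning can be as simple as a grid search over candidate thresholds against reference labels. The sketch below uses frame-level F1 on toy data as the objective; this metric, the function names, and the data are assumptions for illustration and do not reproduce what `vad_tune_threshold.py` does with RTTM files.

```python
def f1_score(preds, labels):
    """Frame-level F1 for binary predictions (1 = speech)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(probs, labels, candidates):
    """Return the (threshold, score) pair with the highest frame-level F1."""
    best_threshold, best_score = None, -1.0
    for threshold in candidates:
        preds = [1 if p >= threshold else 0 for p in probs]
        score = f1_score(preds, labels)
        if score > best_score:          # keep the first best candidate on ties
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy speech probabilities and reference labels (made up for this example).
probs = [0.05, 0.35, 0.45, 0.7, 0.9, 0.2]
labels = [0, 0, 1, 1, 1, 0]
candidates = [i / 10 for i in range(1, 10)]
print(tune_threshold(probs, labels, candidates))
```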

# Online streaming inference

## Setting up data for Streaming Inference

In [None]:
from nemo.core.classes import IterableDataset
from nemo.core.neural_types import NeuralType, AudioSignal, LengthsType
import torch
from torch.utils.data import DataLoader

In [None]:
# simple data layer to pass audio signal
class AudioDataLayer(IterableDataset):
    @property
    def output_types(self):
        return {
            'audio_signal': NeuralType(('B', 'T'), AudioSignal(freq=self._sample_rate)),
            'a_sig_length': NeuralType(tuple('B'), LengthsType()),
        }

    def __init__(self, sample_rate):
        super().__init__()
        self._sample_rate = sample_rate
        self.output = True
        
    def __iter__(self):
        return self
    
    def __next__(self):
        if not self.output:
            raise StopIteration
        self.output = False
        return torch.as_tensor(self.signal, dtype=torch.float32), \
               torch.as_tensor(self.signal_shape, dtype=torch.int64)
        
    def set_signal(self, signal):
        # scale int16-range samples to [-1, 1)
        self.signal = signal.astype(np.float32)/32768.
        self.signal_shape = self.signal.size
        self.output = True

    def __len__(self):
        return 1

In [None]:
data_layer = AudioDataLayer(sample_rate=cfg.train_ds.sample_rate)
data_loader = DataLoader(data_layer, batch_size=1, collate_fn=data_layer.collate_fn)

In [None]:
# inference method for audio signal (single instance)
def infer_signal(model, signal):
    data_layer.set_signal(signal)
    batch = next(iter(data_loader))
    audio_signal, audio_signal_len = batch
    # move the batch to the device of the model that was passed in
    audio_signal, audio_signal_len = audio_signal.to(model.device), audio_signal_len.to(model.device)
    logits = model.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)
    return logits

In [None]:
# class for streaming frame-based VAD
# 1) use reset() method to reset FrameVAD's state
# 2) call transcribe(frame) to do VAD on
# contiguous signal's frames
# To simplify the flow, we use single threshold to binarize predictions.
class FrameVAD:
    
    def __init__(self, model_definition,
                 threshold=0.5,
                 frame_len=2, frame_overlap=2.5, 
                 offset=10):
        '''
        Args:
          threshold: If prob of speech is larger than threshold, classify the segment to be speech.
          frame_len: frame's duration, seconds
          frame_overlap: duration of overlaps before and after current frame, seconds
          offset: number of symbols to drop for smooth streaming
        '''
        self.vocab = list(model_definition['labels'])
        self.vocab.append('_')
        
        self.sr = model_definition['sample_rate']
        self.threshold = threshold
        self.frame_len = frame_len
        self.n_frame_len = int(frame_len * self.sr)
        self.frame_overlap = frame_overlap
        self.n_frame_overlap = int(frame_overlap * self.sr)
        timestep_duration = model_definition['AudioToMFCCPreprocessor']['window_stride']
        for block in model_definition['JasperEncoder']['jasper']:
            timestep_duration *= block['stride'][0] ** block['repeat']
        # buffer = overlap before + current frame + overlap after
        self.buffer = np.zeros(shape=2*self.n_frame_overlap + self.n_frame_len,
                               dtype=np.float32)
        self.offset = offset
        self.reset()
        
    def _decode(self, frame, offset=0):
        assert len(frame)==self.n_frame_len
        # shift the buffer left by one frame and append the new frame at the end
        self.buffer[:-self.n_frame_len] = self.buffer[self.n_frame_len:]
        self.buffer[-self.n_frame_len:] = frame
        logits = infer_signal(vad_model, self.buffer).cpu().numpy()[0]
        decoded = self._greedy_decoder(
            self.threshold,
            logits,
            self.vocab
        )
        return decoded  
    
    
    @torch.no_grad()
    def transcribe(self, frame=None):
        if frame is None:
            frame = np.zeros(shape=self.n_frame_len, dtype=np.float32)
        if len(frame) < self.n_frame_len:
            frame = np.pad(frame, [0, self.n_frame_len - len(frame)], 'constant')
        unmerged = self._decode(frame, self.offset)
        return unmerged
    
    def reset(self):
        '''
        Reset frame_history and decoder's state
        '''
        self.buffer=np.zeros(shape=self.buffer.shape, dtype=np.float32)
        self.prev_char = ''

    @staticmethod
    def _greedy_decoder(threshold, logits, vocab):
        s = []
        if logits.shape[0]:
            probs = torch.softmax(torch.as_tensor(logits), dim=-1)
            probas, _ = torch.max(probs, dim=-1)
            probas_s = probs[1].item()
            preds = 1 if probas_s >= threshold else 0
            # [label index, label name, prob of background, prob of speech, raw logits]
            s = [preds, str(vocab[preds]), probs[0].item(), probs[1].item(), str(logits)]
        return s
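
The buffer update in `_decode` may be easier to follow in isolation. The sketch below mimics it with a plain Python list instead of a NumPy array; `push_frame` is a hypothetical helper introduced only for this illustration.

```python
def push_frame(buffer, frame):
    """Shift `buffer` left by len(frame) samples and write `frame` at the end.

    Mirrors the two slice assignments in FrameVAD._decode, using a list
    instead of a NumPy array. Hypothetical helper for illustration.
    """
    n = len(frame)
    if n == 0 or n > len(buffer):
        raise ValueError("frame must be non-empty and no longer than the buffer")
    buffer[:-n] = buffer[n:]   # drop the oldest n samples
    buffer[-n:] = frame        # append the newest n samples
    return buffer


buffer = [0] * 6               # e.g. overlap (2) + frame (2) + overlap (2)
for frame in ([1, 2], [3, 4], [5, 6], [7, 8]):
    print(push_frame(buffer, frame))
```

After every call the model sees the newest frame together with the most recent history, so each prediction uses a full window of context even though only one frame of new audio arrives per step.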



Streaming inference depends on a few factors, such as the frame length (STEP) and buffer size (WINDOW SIZE). Experiment with a few values to see their effects in the below cells.
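
To see how STEP and WINDOW SIZE translate into sample counts, the following sketch repeats the arithmetic that `FrameVAD` performs when it is constructed with `frame_len=STEP` and `frame_overlap=(WINDOW_SIZE - STEP) / 2`. The helper `buffer_layout` is hypothetical, and a 16 kHz sample rate is assumed.

```python
def buffer_layout(step, window_size, sample_rate=16000):
    """Return the sample counts FrameVAD would derive from step and window size."""
    frame_overlap = (window_size - step) / 2           # context on each side, seconds
    n_frame_len = int(step * sample_rate)              # new samples per step
    n_frame_overlap = int(frame_overlap * sample_rate)
    return {
        "n_frame_len": n_frame_len,
        "n_frame_overlap": n_frame_overlap,
        "buffer_samples": 2 * n_frame_overlap + n_frame_len,
    }


for step, window_size in [(0.01, 0.31), (0.01, 0.15)]:
    print(step, window_size, buffer_layout(step, window_size))
```

A larger window gives the model more context per prediction, while a smaller step yields more frequent predictions and more forward passes per second of audio.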

In [None]:
STEP_LIST = [0.01,0.01]
WINDOW_SIZE_LIST = [0.31,0.15]

In [None]:
import wave

def offline_inference(wave_file, STEP = 0.025, WINDOW_SIZE = 0.5, threshold=0.5):
    
    FRAME_LEN = STEP # infer every STEP seconds 
    CHANNELS = 1 # number of audio channels (expect mono signal)
    RATE = 16000 # sample rate, Hz
    
    CHUNK_SIZE = int(FRAME_LEN*RATE)
    
    vad = FrameVAD(model_definition = {
                   'sample_rate': SAMPLE_RATE,
                   'AudioToMFCCPreprocessor': cfg.preprocessor,
                   'JasperEncoder': cfg.encoder,
                   'labels': cfg.labels
               },
               threshold=threshold,
               frame_len=FRAME_LEN, frame_overlap = (WINDOW_SIZE-FRAME_LEN)/2,
               offset=0)

    wf = wave.open(wave_file, 'rb')
    p = pa.PyAudio()

    empty_counter = 0

    preds = []
    proba_b = []
    proba_s = []
    
    stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                    channels=CHANNELS,
                    rate=RATE,
                    output = True)

    data = wf.readframes(CHUNK_SIZE)

    while len(data) > 0:

        signal = np.frombuffer(data, dtype=np.int16)
        result = vad.transcribe(signal)

        preds.append(result[0])
        proba_b.append(result[2])
        proba_s.append(result[3])
        
        if len(result):
            print(result,end='\n')
            empty_counter = 3
        elif empty_counter > 0:
            empty_counter -= 1
            if empty_counter == 0:
                print(' ',end='')

        # read the next chunk only after the current one has been processed,
        # so that the first chunk of the file is not skipped
        data = wf.readframes(CHUNK_SIZE)
                
    # release the audio stream and the file handle
    stream.close()
    wf.close()
    p.terminate()
    vad.reset()
    
    return preds, proba_b, proba_s

### Here we show an example of online streaming inference
You can use your file or download the provided demo audio file. 

In [None]:
demo_wave = 'VAD_demo.wav'
if not os.path.exists(demo_wave):
    !wget "https://dldata-public.s3.us-east-2.amazonaws.com/VAD_demo.wav" 

In [None]:
wave_file = demo_wave

CHANNELS = 1
RATE = 16000
audio, sample_rate = librosa.load(wave_file, sr=RATE)
# pass sr explicitly; otherwise librosa assumes its default rate of 22050 Hz
dur = librosa.get_duration(y=audio, sr=sample_rate)
print(dur)

In [None]:
ipd.Audio(audio, rate=sample_rate)

In [None]:
threshold=0.4

results = []
for STEP, WINDOW_SIZE in zip(STEP_LIST, WINDOW_SIZE_LIST):
    print(f'====== STEP is {STEP}s, WINDOW_SIZE is {WINDOW_SIZE}s ====== ')
    preds, proba_b, proba_s = offline_inference(wave_file, STEP, WINDOW_SIZE, threshold)
    results.append([STEP, WINDOW_SIZE, preds, proba_b, proba_s])

To simplify the flow, the predictions above use a single threshold, `threshold=0.4`.

You can try other [thresholds](#VAD-postprocessing-and-Tuning-threshold) or use postprocessing and see how they affect performance. 

**Note**: for better performance, [finetune](#Finetune) on your data and use posteriors such as [overlapped prediction](#Posterior). 

Let's plot the predictions and the mel spectrogram.

In [None]:
import librosa.display
plt.figure(figsize=[20,10])

num = len(results)
for i in range(num):
    len_pred = len(results[i][2]) 
    FRAME_LEN = results[i][0]
    ax1 = plt.subplot(num+1,1,i+1)

    ax1.plot(np.arange(audio.size) / sample_rate, audio, 'b')
    ax1.set_xlim([-0.01, int(dur)+1]) 
    ax1.tick_params(axis='y', labelcolor= 'b')
    ax1.set_ylabel('Signal')
    ax1.set_ylim([-1, 1])

    proba_s = results[i][4]
    # same decision rule as FrameVAD: speech if prob >= threshold
    pred = [1 if p >= threshold else 0 for p in proba_s]
    ax2 = ax1.twinx()
    # one prediction per step, so the time of prediction k is k * FRAME_LEN
    ax2.plot(np.arange(len_pred) * FRAME_LEN, np.array(pred) , 'r', label='pred')
    ax2.plot(np.arange(len_pred) * FRAME_LEN, np.array(proba_s) , 'g--', label='speech prob')
    ax2.tick_params(axis='y', labelcolor='r')
    legend = ax2.legend(loc='lower right', shadow=True)

    ax2.set_title(f'step {results[i][0]}s, buffer size {results[i][1]}s')
    ax2.set_ylabel('Preds and Probas')
    
# the spectrogram goes in the last row of the figure
ax = plt.subplot(num+1,1,num+1)
S = librosa.feature.melspectrogram(y=audio, sr=sample_rate, n_mels=64, fmax=8000)
S_dB = librosa.power_to_db(S, ref=np.max)
librosa.display.specshow(S_dB, x_axis='time', y_axis='mel', sr=sample_rate, fmax=8000)
ax.set_title('Mel-frequency spectrogram')
ax.grid()
plt.show()

## Online streaming inference through microphone

**Please note that the VAD model may not perform well on every microphone input; you might need to finetune on your own input and experiment with different parameters.**

In [None]:
STEP = 0.01 
WINDOW_SIZE = 0.31
CHANNELS = 1 
RATE = 16000
FRAME_LEN = STEP
THRESHOLD = 0.5

CHUNK_SIZE = int(STEP * RATE)
vad = FrameVAD(model_definition = {
 'sample_rate': SAMPLE_RATE,
 'AudioToMFCCPreprocessor': cfg.preprocessor,
 'JasperEncoder': cfg.encoder,
 'labels': cfg.labels
 },
 threshold=THRESHOLD,
 frame_len=FRAME_LEN, frame_overlap=(WINDOW_SIZE - FRAME_LEN) / 2, 
 offset=0)


In [None]:
vad.reset()

p = pa.PyAudio()
print('Available audio input devices:')
input_devices = []
for i in range(p.get_device_count()):
    dev = p.get_device_info_by_index(i)
    if dev.get('maxInputChannels'):
        input_devices.append(i)
        print(i, dev.get('name'))

if len(input_devices):
    dev_idx = -2
    while dev_idx not in input_devices:
        print('Please type input device ID:')
        dev_idx = int(input())

    empty_counter = 0

    def callback(in_data, frame_count, time_info, status):
        global empty_counter
        signal = np.frombuffer(in_data, dtype=np.int16)
        text = vad.transcribe(signal)
        if len(text):
            print(text,end='\n')
            empty_counter = vad.offset
        elif empty_counter > 0:
            empty_counter -= 1
            if empty_counter == 0:
                print(' ',end='\n')
        return (in_data, pa.paContinue)

    stream = p.open(format=pa.paInt16,
                    channels=CHANNELS,
                    rate=SAMPLE_RATE,
                    input=True,
                    input_device_index=dev_idx,
                    stream_callback=callback,
                    frames_per_buffer=CHUNK_SIZE)

    print('Listening...')

    stream.start_stream()
    
    # Interrupt the kernel and then speak a few more words to exit the pyaudio loop!
    try:
        while stream.is_active():
            time.sleep(0.1)
    finally:        
        stream.stop_stream()
        stream.close()
        p.terminate()

        print()
        print("PyAudio stopped")
    
else:
    print('ERROR: No audio input device found.')

## ONNX Deployment
You can also export the model to an ONNX file and deploy it to the TensorRT or MS ONNX Runtime inference engines. If you don't have ONNX Runtime installed yet, please run:

In [None]:
!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu
# !mkdir -p ort
# %cd ort
# !git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .
# !./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel
# !pip install ./build/Linux/Release/dist/onnxruntime*.whl
# %cd ..

Then replace the `infer_signal` implementation with this code:

In [None]:
import onnxruntime
vad_model.export('vad.onnx')
ort_session = onnxruntime.InferenceSession('vad.onnx')

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

# Keep the (model, signal) signature so FrameVAD._decode can call this unchanged
def infer_signal(model, signal):
    data_layer.set_signal(signal)
    batch = next(iter(data_loader))
    audio_signal, audio_signal_len = batch
    audio_signal, audio_signal_len = audio_signal.to(model.device), audio_signal_len.to(model.device)
    # the exported ONNX graph expects preprocessed features, not raw audio
    processed_signal, processed_signal_len = model.preprocessor(
        input_signal=audio_signal, length=audio_signal_len,
    )
    ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(processed_signal), }
    ologits = ort_session.run(None, ort_inputs)
    alogits = np.asarray(ologits)
    logits = torch.from_numpy(alogits[0])
    return logits