NeMo

File size: 47,583 Bytes

7934b29

{
    "cells": [
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "R12Yn6W1dt9t"
            },
            "outputs": [],
            "source": [
                "\"\"\"\n",
                "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
                "\n",
                "Instructions for setting up Colab are as follows:\n",
                "1. Open a new Python 3 notebook.\n",
                "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
                "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
                "4. Run this cell to set up dependencies.\n",
                "\"\"\"\n",
                "# If you're using Google Colab and not running locally, run this cell.\n",
                "\n",
                "## Install dependencies\n",
                "!pip install wget\n",
                "!apt-get install sox libsndfile1 ffmpeg\n",
                "!pip install text-unidecode\n",
                "\n",
                "# ## Install NeMo\n",
                "BRANCH = 'r1.17.0'\n",
                "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]\n",
                "\n",
                "## Install TorchAudio\n",
                "!pip install torchaudio>=0.13.0 -f https://download.pytorch.org/whl/torch_stable.html\n",
                "\n",
                "## Grab the config we'll use in this example\n",
                "!mkdir configs"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Introduction\n",
                "\n",
                "This VAD tutorial is based on the MarbleNet model from paper \"[MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection](https://arxiv.org/abs/2010.13886)\", which is an modification and extension of [MatchboxNet](https://arxiv.org/abs/2004.08531). \n",
                "\n",
                "The notebook will follow the steps below:\n",
                "\n",
                " - Dataset preparation: Instruction of downloading datasets. And how to convert it to a format suitable for use with nemo_asr\n",
                " - Audio preprocessing (feature extraction): signal normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC)\n",
                "\n",
                " - Data augmentation using SpecAugment \"[SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)\" to increase number of data samples.\n",
                " \n",
                " - Develop a small Neural classification model which can be trained efficiently.\n",
                " \n",
                " - Model training on the Google Speech Commands dataset and Freesound dataset in NeMo.\n",
                " \n",
                " - Evaluation of error cases of the model by audibly hearing the samples\n",
                " \n",
                " - Add more evaluation metrics and transfer learning/fine tune\n",
                " "
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "I62_LJzc-p2b"
            },
            "outputs": [],
            "source": [
                "# Some utility imports\n",
                "import os\n",
                "from omegaconf import OmegaConf"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab": {},
                "colab_type": "text",
                "id": "K_M8wpkwd7d7"
            },
            "source": [
                "# Data Preparation\n",
                "\n",
                "## Download the background data\n",
                "We suggest to use the background categories of [freesound](https://freesound.org/) dataset  as our non-speech/background data. \n",
                "We provide scripts for downloading and resampling it. Please have a look at Data Preparation part in NeMo docs. Note that downloading this dataset may takes hours. \n",
                "\n",
                "**NOTE:** Here, this tutorial serves as a demonstration on how to train and evaluate models for vad using NeMo. We avoid using freesound dataset, and use `_background_noise_` category in Google Speech Commands Dataset as non-speech/background data."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Download the speech data\n",
                "   \n",
                "We will use the open source Google Speech Commands Dataset (we will use V2 of the dataset for the tutorial, but require very minor changes to support V1 dataset) as our speech data. Google Speech Commands Dataset V2 will take roughly 6GB disk space. These scripts below will download the dataset and convert it to a format suitable for use with nemo_asr.\n",
                "\n",
                "\n",
                "**NOTE**: You may additionally pass `--test_size` or `--val_size` flag for splitting train val and test data.\n",
                "You may additionally pass `--window_length_in_sec` flag for indicating the segment/window length. Default is 0.63s.\n",
                "\n",
                "**NOTE**: You may additionally pass a `--rebalance_method='fixed|over|under'` at the end of the script to rebalance the class samples in the manifest. \n",
                "* 'fixed': Fixed number of samples for each class. For example, train 500, val 100, and test 200. (Change number in script if you want)\n",
                "* 'over': Oversampling rebalance method\n",
                "* 'under': Undersampling rebalance method\n",
                "\n",
                "**NOTE**: We only take a small subset of speech data for demonstration, if you want to use entire speech data. Don't forget to **delete `--demo`** and change rebalance method/number.  `_background_noise_` category only has **6** audio files. So we would like to generate more based on the audio files to enlarge our background training data. If you want to use your own background noise data, just change the `background_data_root` and **delete `--demo`**\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "tmp = 'src'\n",
                "data_folder = 'data'\n",
                "if not os.path.exists(tmp):\n",
                "    os.makedirs(tmp)\n",
                "if not os.path.exists(data_folder):\n",
                "    os.makedirs(data_folder)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "script = os.path.join(tmp, 'process_vad_data.py')\n",
                "if not os.path.exists(script):\n",
                "    !wget -P $tmp https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/dataset_processing/process_vad_data.py"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "speech_data_root = os.path.join(data_folder, 'google_dataset_v2')\n",
                "background_data_root = os.path.join(data_folder, 'google_dataset_v2/google_speech_recognition_v2/_background_noise_')# your <resampled freesound data directory>\n",
                "out_dir = os.path.join(data_folder, 'manifest')\n",
                "if not os.path.exists(speech_data_root):\n",
                "    os.mkdir(speech_data_root)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# This may take a few minutes\n",
                "!python $script \\\n",
                "    --out_dir={out_dir} \\\n",
                "    --speech_data_root={speech_data_root} \\\n",
                "    --background_data_root={background_data_root}\\\n",
                "    --log \\\n",
                "    --demo \\\n",
                "    --rebalance_method='fixed' "
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "TTsxp0nZ1zqo"
            },
            "source": [
                "## Preparing the manifest file\n",
                "\n",
                "Manifest files are the data structure used by NeMo to declare a few important details about the data :\n",
                "\n",
                "1) `audio_filepath`: Refers to the path to the raw audio file <br>\n",
                "2) `label`: The class label (speech or background) of this sample <br>\n",
                "3) `duration`: The length of the audio file, in seconds.<br>\n",
                "4) `offset`: The start of the segment, in seconds."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "ytTFGVe0g9wk"
            },
            "outputs": [],
            "source": [
                "# change below if you don't have or don't want to use rebalanced data\n",
                "train_dataset = 'data/manifest/balanced_background_training_manifest.json,data/manifest/balanced_speech_training_manifest.json' \n",
                "val_dataset = 'data/manifest/background_validation_manifest.json,data/manifest/speech_validation_manifest.json' \n",
                "test_dataset = 'data/manifest/balanced_background_testing_manifest.json,data/manifest/balanced_speech_testing_manifest.json' "
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "s0SZy9SEhOBf"
            },
            "source": [
                "## Read a few rows of the manifest file \n",
                "\n",
                "Manifest files are the data structure used by NeMo to declare a few important details about the data :\n",
                "\n",
                "1) `audio_filepath`: Refers to the path to the raw audio file <br>\n",
                "2) `command`: The class label (or speech command) of this sample <br>\n",
                "3) `duration`: The length of the audio file, in seconds."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "sample_test_dataset =  test_dataset.split(',')[0]"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "HYBidCMIhKQV",
                "scrolled": true
            },
            "outputs": [],
            "source": [
                "!head -n 5 {sample_test_dataset}"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Training - Preparation\n",
                "\n",
                "We will be training a MarbleNet model from paper \"[MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection](https://arxiv.org/abs/2010.13886)\", evolved from [QuartzNet](https://arxiv.org/pdf/1910.10261.pdf) and [MatchboxNet](https://arxiv.org/abs/2004.08531) model. The benefit of QuartzNet over JASPER models is that they use Separable Convolutions, which greatly reduce the number of parameters required to get good model accuracy.\n",
                "\n",
                "MarbleNet models generally follow the model definition pattern QuartzNet-[BxRXC], where B is the number of blocks, R is the number of convolutional sub-blocks, and C is the number of channels in these blocks. Each sub-block contains a 1-D masked convolution, batch normalization, ReLU, and dropout.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "ieAPOM9thTN2"
            },
            "outputs": [],
            "source": [
                "# NeMo's \"core\" package\n",
                "import nemo\n",
                "# NeMo's ASR collection - this collections contains complete ASR models and\n",
                "# building blocks (modules) for ASR\n",
                "import nemo.collections.asr as nemo_asr"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "ss9gLcDv30jI"
            },
            "source": [
                "## Model Configuration\n",
                "The MarbleNet Model is defined in a config file which declares multiple important sections.\n",
                "\n",
                "They are:\n",
                "\n",
                "1) `model`: All arguments that will relate to the Model - preprocessors, encoder, decoder, optimizer and schedulers, datasets and any other related information\n",
                "\n",
                "2) `trainer`: Any argument to be passed to PyTorch Lightning"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "MODEL_CONFIG = \"marblenet_3x2x64.yaml\"\n",
                "\n",
                "if not os.path.exists(f\"configs/{MODEL_CONFIG}\"):\n",
                "  !wget -P configs/ \"https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/marblenet/{MODEL_CONFIG}\""
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "yoVAs9h1lfci",
                "scrolled": true
            },
            "outputs": [],
            "source": [
                "# This line will print the entire config of the MarbleNet model\n",
                "config_path = f\"configs/{MODEL_CONFIG}\"\n",
                "config = OmegaConf.load(config_path)\n",
                "config = OmegaConf.to_container(config, resolve=True)\n",
                "config = OmegaConf.create(config)\n",
                "\n",
                "print(OmegaConf.to_yaml(config))"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "m2lJPR0a3qww"
            },
            "outputs": [],
            "source": [
                "# Preserve some useful parameters\n",
                "labels = config.model.labels\n",
                "sample_rate = config.model.sample_rate"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "8_pmjeed78rJ"
            },
            "source": [
                "### Setting up the datasets within the config\n",
                "\n",
                "If you'll notice, there are a few config dictionaries called `train_ds`, `validation_ds` and `test_ds`. These are configurations used to setup the Dataset and DataLoaders of the corresponding config.\n",
                "\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "DIe6Qfs18MiQ"
            },
            "outputs": [],
            "source": [
                "print(OmegaConf.to_yaml(config.model.train_ds))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "Fb01hl868Uc3"
            },
            "source": [
                "### `???` inside configs\n",
                "\n",
                "You will often notice that some configs have `???` in place of paths. This is used as a placeholder so that the user can change the value at a later time.\n",
                "\n",
                "Let's add the paths to the manifests to the config above."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "m181HXev8T97"
            },
            "outputs": [],
            "source": [
                "config.model.train_ds.manifest_filepath = train_dataset\n",
                "config.model.validation_ds.manifest_filepath = val_dataset\n",
                "config.model.test_ds.manifest_filepath = test_dataset"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "pbXngoCM5IRG"
            },
            "source": [
                "## Building the PyTorch Lightning Trainer\n",
                "\n",
                "NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem!\n",
                "\n",
                "Let's first instantiate a Trainer object!"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "bYtvdBlG5afU"
            },
            "outputs": [],
            "source": [
                "import torch\n",
                "import pytorch_lightning as pl"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "jRN18CdH51nN"
            },
            "outputs": [],
            "source": [
                "print(\"Trainer config - \\n\")\n",
                "print(OmegaConf.to_yaml(config.trainer))"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "gHf6cHvm6H9b"
            },
            "outputs": [],
            "source": [
                "# Let's modify some trainer configs for this demo\n",
                "# Checks if we have GPU available and uses it\n",
                "accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'\n",
                "config.trainer.devices = 1\n",
                "config.trainer.accelerator = accelerator\n",
                "\n",
                "# Reduces maximum number of epochs to 5 for quick demonstration\n",
                "config.trainer.max_epochs = 5\n",
                "\n",
                "# Remove distributed training flags\n",
                "config.trainer.strategy = None"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "UB9nr7G56G3L"
            },
            "outputs": [],
            "source": [
                "trainer = pl.Trainer(**config.trainer)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "2wt603Vq6sqX"
            },
            "source": [
                "## Setting up a NeMo Experiment\n",
                "\n",
                "NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it ! "
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "TfWJFg7p6Ezf"
            },
            "outputs": [],
            "source": [
                "from nemo.utils.exp_manager import exp_manager"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "SC-QPoW44-p2"
            },
            "outputs": [],
            "source": [
                "exp_dir = exp_manager(trainer, config.get(\"exp_manager\", None))"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "Yqi6rkNR7Dph"
            },
            "outputs": [],
            "source": [
                "# The exp_dir provides a path to the current experiment for easy access\n",
                "exp_dir = str(exp_dir)\n",
                "exp_dir"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "t0zz-vHH7Uuh"
            },
            "source": [
                "## Building the MarbleNet Model\n",
                "\n",
                "MarbleNet is an ASR model with a classification task - it generates one label for the entire provided audio stream. Therefore we encapsulate it inside the `EncDecClassificationModel` as follows."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "FRMrKhyf5vhy",
                "scrolled": true
            },
            "outputs": [],
            "source": [
                "vad_model = nemo_asr.models.EncDecClassificationModel(cfg=config.model, trainer=trainer)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "jA9UND-Q_oyw"
            },
            "source": [
                "# Training a MarbleNet Model\n",
                "\n",
                "As MarbleNet is inherently a PyTorch Lightning Model, it can easily be trained in a single line - `trainer.fit(model)` !\n",
                "\n",
                "\n",
                "# Training the model\n",
                "\n",
                "Even with such a small model (73k parameters), and just 5 epochs (should take just a few minutes to train), you should be able to get a test set accuracy score around 98.83% (this result is for the [freesound](https://freesound.org/) dataset) with enough training data. \n",
                "\n",
                "**NOTE:** If you follow our tutorial and user the generated background data, you may notice the below results are acceptable, but please remember, this tutorial is only for **demonstration** and the dataset is not good enough. Please change background dataset and train with enough data for improvement!\n",
                "\n",
                "Experiment with increasing the number of epochs or with batch size to see how much you can improve the score! \n",
                "\n",
                "**NOTE:** Noise robustness is quite important for VAD task. Below we list the augmentation we used in this demo. \n",
                "Please refer to [Online_Noise_Augmentation.ipynb](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Online_Noise_Augmentation.ipynb)  for understanding noise augmentation in NeMo.\n",
                "\n",
                "\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Noise augmentation\n",
                "print(OmegaConf.to_yaml(config.model.train_ds.augmentor)) # noise augmentation\n",
                "print(OmegaConf.to_yaml(config.model.spec_augment)) # SpecAug data augmentation"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "If you are interested in  **pretrained** model, please have a look at [Transfer Leaning & Fine-tuning on a new dataset](#Transfer-Leaning-&-Fine-tuning-on-a-new-dataset) and incoming tutorial 07 Offline_and_Online_VAD_Demo\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "3ngKcRFqBfIF"
            },
            "source": [
                "### Monitoring training progress\n",
                "\n",
                "Before we begin training, let's first create a Tensorboard visualization to monitor progress\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "Cyfec0PDBsXa"
            },
            "outputs": [],
            "source": [
                "try:\n",
                "    from google import colab\n",
                "    COLAB_ENV = True\n",
                "except (ImportError, ModuleNotFoundError):\n",
                "    COLAB_ENV = False\n",
                "\n",
                "# Load the TensorBoard notebook extension\n",
                "if COLAB_ENV:\n",
                "    %load_ext tensorboard\n",
                "    %tensorboard --logdir {exp_dir}\n",
                "else:\n",
                "    print(\"To use tensorboard, please use this notebook in a Google Colab environment.\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "ZApuELDIKQgC"
            },
            "source": [
                "### Training for 5 epochs\n",
                "We see below that the model begins to get modest scores on the validation set after just 5 epochs of training"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "9xiUUJlH5KdD"
            },
            "outputs": [],
            "source": [
                "trainer.fit(vad_model)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Fast Training\n",
                "\n",
                "We can dramatically improve the time taken to train this model by using Multi GPU training along with Mixed Precision.\n",
                "\n",
                "```python\n",
                "# Trainer with a distributed backend:\n",
                "trainer = Trainer(devices=2, num_nodes=2, accelerator='gpu', strategy='dp')\n",
                "\n",
                "# Mixed precision:\n",
                "trainer = Trainer(amp_level='O1', precision=16)\n",
                "\n",
                "# Of course, you can combine these flags as well.\n",
                "```"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "Dkds1jSvKgSc"
            },
            "source": [
                "# Evaluation\n",
                "\n",
                "## Evaluation on the Test set\n",
                "\n",
                "Let's compute the final score on the test set via `trainer.test(model)`"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "mULTrhEJ_6wV",
                "scrolled": true
            },
            "outputs": [],
            "source": [
                "trainer.test(vad_model, ckpt_path=None)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "ifDHkunjM8y6"
            },
            "source": [
                "## Evaluation of incorrectly predicted samples\n",
                "\n",
                "Given that we have a trained model, which performs reasonably well, let's try to listen to the samples where the model is least confident in its predictions."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "PcJrZ72sNCkM"
            },
            "source": [
                "### Extract the predictions from the model\n",
                "\n",
                "We want to possess the actual logits of the model instead of just the final evaluation score, so we can define a function to perform the forward step for us without computing the final loss. Instead, we extract the logits per batch of samples provided."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "rvxdviYtOFjK"
            },
            "source": [
                "### Accessing the data loaders\n",
                "\n",
                "We can utilize the `setup_test_data` method in order to instantiate a data loader for the dataset we want to analyze.\n",
                "\n",
                "For convenience, we can access these instantiated data loaders using the following accessors - `vad_model._train_dl`, `vad_model._validation_dl` and `vad_model._test_dl`."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "CB0QZCAmM656"
            },
            "outputs": [],
            "source": [
                "vad_model.setup_test_data(config.model.test_ds)\n",
                "test_dl = vad_model._test_dl"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "rA7gXawcPoip"
            },
            "source": [
                "### Partial Test Step\n",
                "\n",
                "Below we define a utility function to perform most of the test step. For reference, the test step is defined as follows:\n",
                "\n",
                "```python\n",
                "    def test_step(self, batch, batch_idx, dataloader_idx=0):\n",
                "        audio_signal, audio_signal_len, labels, labels_len = batch\n",
                "        logits = self.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)\n",
                "        loss_value = self.loss(logits=logits, labels=labels)\n",
                "        correct_counts, total_counts = self._accuracy(logits=logits, labels=labels)\n",
                "        return {'test_loss': loss_value, 'test_correct_counts': correct_counts, 'test_total_counts': total_counts}\n",
                "```"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "sBsDOm5ROpQI"
            },
            "outputs": [],
            "source": [
                "@torch.no_grad()\n",
                "def extract_logits(model, dataloader):\n",
                "    logits_buffer = []\n",
                "    label_buffer = []\n",
                "\n",
                "    # Follow the above definition of the test_step\n",
                "    for batch in dataloader:\n",
                "        audio_signal, audio_signal_len, labels, labels_len = batch\n",
                "        logits = model(input_signal=audio_signal, input_signal_length=audio_signal_len)\n",
                "\n",
                "        logits_buffer.append(logits)\n",
                "        label_buffer.append(labels)\n",
                "        print(\".\", end='')\n",
                "    print()\n",
                "\n",
                "    print(\"Finished extracting logits !\")\n",
                "    logits = torch.cat(logits_buffer, 0)\n",
                "    labels = torch.cat(label_buffer, 0)\n",
                "    return logits, labels\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "mZSdprUlOuoV"
            },
            "outputs": [],
            "source": [
                "cpu_model = vad_model.cpu()\n",
                "cpu_model.eval()\n",
                "logits, labels = extract_logits(cpu_model, test_dl)\n",
                "print(\"Logits:\", logits.shape, \"Labels :\", labels.shape)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "9Wd0ukgNXRBz",
                "scrolled": true
            },
            "outputs": [],
            "source": [
                "# Compute accuracy - `_accuracy` is a PyTorch Lightning Metric !\n",
                "acc = cpu_model._accuracy(logits=logits, labels=labels)\n",
                "print(f\"Accuracy : {float(acc[0]*100)} %\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "NwN9OSqCauSH"
            },
            "source": [
                "### Filtering out incorrect samples\n",
                "Let us now filter out the incorrectly labeled samples from the total set of samples in the test set"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "N1YJvsmcZ0uE"
            },
            "outputs": [],
            "source": [
                "import librosa\n",
                "import json\n",
                "import IPython.display as ipd"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "jZAT9yGAayvR"
            },
            "outputs": [],
            "source": [
                "# First let's create a utility class to remap the integer class labels to actual string label\n",
                "class ReverseMapLabel:\n",
                "    def __init__(self, data_loader):\n",
                "        self.label2id = dict(data_loader.dataset.label2id)\n",
                "        self.id2label = dict(data_loader.dataset.id2label)\n",
                "\n",
                "    def __call__(self, pred_idx, label_idx):\n",
                "        return self.id2label[pred_idx], self.id2label[label_idx]"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "X3GSXvYHa4KJ"
            },
            "outputs": [],
            "source": [
                "# Next, let's get the indices of all the incorrectly labeled samples\n",
                "sample_idx = 0\n",
                "incorrect_preds = []\n",
                "rev_map = ReverseMapLabel(test_dl)\n",
                "\n",
                "# Remember, evaluated_tensor = (loss, logits, labels)\n",
                "probs = torch.softmax(logits, dim=-1)\n",
                "probas, preds = torch.max(probs, dim=-1)\n",
                "\n",
                "total_count = cpu_model._accuracy.total_counts_k[0]\n",
                "incorrect_ids = (preds != labels).nonzero()\n",
                "for idx in incorrect_ids:\n",
                "    proba = float(probas[idx][0])\n",
                "    pred = int(preds[idx][0])\n",
                "    label = int(labels[idx][0])\n",
                "    idx = int(idx[0]) + sample_idx\n",
                "\n",
                "    incorrect_preds.append((idx, *rev_map(pred, label), proba))\n",
                "    \n",
                "\n",
                "print(f\"Num test samples : {total_count.item()}\")\n",
                "print(f\"Num errors : {len(incorrect_preds)}\")\n",
                "\n",
                "# First let's sort by confidence of prediction\n",
                "incorrect_preds = sorted(incorrect_preds, key=lambda x: x[-1], reverse=False)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "0JgGo71gcDtD"
            },
            "source": [
                "### Examine a subset of incorrect samples\n",
                "Let's print out the (test id, predicted label, ground truth label, confidence) tuple of first 20 incorrectly labeled samples"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "x37wNJsNbcw0"
            },
            "outputs": [],
            "source": [
                "for incorrect_sample in incorrect_preds[:20]:\n",
                "    print(str(incorrect_sample))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "tDnwYsDKcLv9"
            },
            "source": [
                "###  Define a threshold below which we designate a model's prediction as \"low confidence\""
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "dpvzeh4PcGJs"
            },
            "outputs": [],
            "source": [
                "# Filter out how many such samples exist\n",
                "low_confidence_threshold = 0.8\n",
                "count_low_confidence = len(list(filter(lambda x: x[-1] <= low_confidence_threshold, incorrect_preds)))\n",
                "print(f\"Number of low confidence predictions : {count_low_confidence}\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "ERXyXvCAcSKR"
            },
            "source": [
                "### Let's hear the samples which the model has least confidence in !"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "kxjNVjX8cPNP"
            },
            "outputs": [],
            "source": [
                "# First let's create a helper function to parse the manifest files\n",
                "def parse_manifest(manifest):\n",
                "    data = []\n",
                "    for line in manifest:\n",
                "        line = json.loads(line)\n",
                "        data.append(line)\n",
                "\n",
                "    return data"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "IWxqw5k-cUVd"
            },
            "outputs": [],
            "source": [
                "# Next, let's create a helper function to actually listen to certain samples\n",
                "def listen_to_file(sample_id, pred=None, label=None, proba=None):\n",
                "    # Load the audio waveform using librosa\n",
                "    filepath = test_samples[sample_id]['audio_filepath']\n",
                "    audio, sample_rate = librosa.load(filepath,\n",
                "                                      offset = test_samples[sample_id]['offset'],\n",
                "                                      duration = test_samples[sample_id]['duration'])\n",
                "\n",
                "\n",
                "    if pred is not None and label is not None and proba is not None:\n",
                "        print(f\"filepath: {filepath}, Sample : {sample_id} Prediction : {pred} Label : {label} Confidence = {proba: 0.4f}\")\n",
                "    else:\n",
                "        \n",
                "        print(f\"Sample : {sample_id}\")\n",
                "\n",
                "    return ipd.Audio(audio, rate=sample_rate)\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "HPj1tFNIcXaU"
            },
            "outputs": [],
            "source": [
                "import json\n",
                "# Now let's load the test manifest into memory\n",
                "all_test_samples = []\n",
                "for _ in test_dataset.split(','):\n",
                "    print(_)\n",
                "    with open(_, 'r') as test_f:\n",
                "        test_samples = test_f.readlines()\n",
                "        \n",
                "        all_test_samples.extend(test_samples)\n",
                "print(len(all_test_samples))\n",
                "test_samples = parse_manifest(all_test_samples)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "colab": {},
                "colab_type": "code",
                "id": "Nt7b_uiScZcC"
            },
            "outputs": [],
            "source": [
                "# Finally, let's listen to all the audio samples where the model made a mistake\n",
                "# Note: This list of incorrect samples may be quite large, so you may choose to subsample `incorrect_preds`\n",
                "count = min(count_low_confidence, 20)  # replace this line with just `count_low_confidence` to listen to all samples with low confidence\n",
                "\n",
                "for sample_id, pred, label, proba in incorrect_preds[:count]:\n",
                "    ipd.display(listen_to_file(sample_id, pred=pred, label=label, proba=proba))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Adding evaluation metrics\n",
                "\n",
                "Here is an example of how to use more metrics (e.g. from torchmetrics) to evaluate your result.\n",
                "\n",
                "**Note:** If you would like to add metrics for training and testing, have a look at \n",
                "```python\n",
                "NeMo/nemo/collections/common/metrics\n",
                "```\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "from torchmetrics import ConfusionMatrix"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "_, pred = logits.topk(1, dim=1, largest=True, sorted=True)\n",
                "pred = pred.squeeze()\n",
                "metric = ConfusionMatrix(num_classes=2, task='binary')\n",
                "metric(pred, labels)\n",
                "# confusion_matrix(preds=pred, target=labels)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Transfer Leaning & Fine-tuning on a new dataset\n",
                "For transfer learning, please refer to [**Transfer learning** part of ASR tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n",
                "\n",
                "More details on saving and restoring checkpoint, and exporting a model in its entirety, please refer to [**Fine-tuning on a new dataset** & **Advanced Usage parts** of Speech Command tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Speech_Commands.ipynb)\n",
                "\n",
                "\n",
                "\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "colab_type": "text",
                "id": "LyIegk2CPNsI"
            },
            "source": [
                "# Inference and more\n",
                "If you are interested in **pretrained** model and **streaming inference**, please have a look at our [VAD inference tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb) and script [vad_infer.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/speech_classification/vad_infer.py)\n"
            ]
        }
    ],
    "metadata": {
        "accelerator": "GPU",
        "colab": {
            "collapsed_sections": [],
            "name": "Voice_Activity_Detection.ipynb",
            "provenance": [],
            "toc_visible": true
        },
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.7.7"
        },
        "pycharm": {
            "stem_cell": {
                "cell_type": "raw",
                "metadata": {
                    "collapsed": false
                },
                "source": []
            }
        }
    },
    "nbformat": 4,
    "nbformat_minor": 1
}