{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "Speech_Commands.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"cells": [
{
"cell_type": "code",
"metadata": {
"id": "R12Yn6W1dt9t"
},
"source": [
"\"\"\"\n",
"You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
"\n",
"Instructions for setting up Colab are as follows:\n",
"1. Open a new Python 3 notebook.\n",
"2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
"3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
"4. Run this cell to set up dependencies.\n",
"\"\"\"\n",
"# If you're using Google Colab and not running locally, run this cell.\n",
"\n",
"## Install dependencies\n",
"!pip install wget\n",
"!apt-get install sox libsndfile1 ffmpeg\n",
"!pip install text-unidecode\n",
"\n",
"# ## Install NeMo\n",
"BRANCH = 'r1.17.0'\n",
"!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]\n",
"\n",
"## Install TorchAudio\n",
"!pip install torchaudio>=0.13.0 -f https://download.pytorch.org/whl/torch_stable.html\n",
"\n",
"## Grab the config we'll use in this example\n",
"!mkdir configs"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "J6ycGIaZfSLE"
},
"source": [
"# Introduction\n",
"\n",
"This Speech Command recognition tutorial is based on the MatchboxNet model from the paper [\"MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition\"](https://arxiv.org/abs/2004.08531). MatchboxNet is a modified form of the QuartzNet architecture from the paper \"[QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions](https://arxiv.org/pdf/1910.10261.pdf)\" with a modified decoder head to suit classification tasks.\n",
"\n",
"The notebook will follow the steps below:\n",
"\n",
" - Dataset preparation: Preparing Google Speech Commands dataset\n",
"\n",
" - Audio preprocessing (feature extraction): signal normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC)\n",
"\n",
" - Data augmentation using SpecAugment \"[SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)\" to increase the number of data samples.\n",
" \n",
" - Develop a small Neural classification model that can be trained efficiently.\n",
" \n",
" - Model training on the Google Speech Commands dataset in NeMo.\n",
" \n",
" - Evaluation of error cases of the model by audibly hearing the samples"
]
},
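{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the SpecAugment step above concrete, the next cell sketches its two core operations (frequency masking and time masking) with plain NumPy. This is a minimal illustration: the spectrogram shape and mask widths are arbitrary assumptions, not values from the paper, and during actual training NeMo applies an equivalent augmentation for us."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative sketch of SpecAugment-style masking. Shapes and mask widths\n",
"# are arbitrary demo values; NeMo applies an equivalent augmentation\n",
"# internally during training.\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"spec = rng.standard_normal((64, 128))  # (freq_bins, time_steps), dummy data\n",
"\n",
"def spec_augment(spec, freq_width=15, time_width=25, n_freq_masks=2, n_time_masks=2):\n",
"    spec = spec.copy()\n",
"    n_freq, n_time = spec.shape\n",
"    for _ in range(n_freq_masks):\n",
"        w = int(rng.integers(0, freq_width + 1))\n",
"        f0 = int(rng.integers(0, max(1, n_freq - w)))\n",
"        spec[f0:f0 + w, :] = 0.0  # zero out a band of frequency channels\n",
"    for _ in range(n_time_masks):\n",
"        w = int(rng.integers(0, time_width + 1))\n",
"        t0 = int(rng.integers(0, max(1, n_time - w)))\n",
"        spec[:, t0:t0 + w] = 0.0  # zero out a span of time steps\n",
"    return spec\n",
"\n",
"augmented = spec_augment(spec)\n",
"print('zeroed entries:', int((augmented == 0).sum()))"
],
"execution_count": null,
"outputs": []
},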
{
"cell_type": "code",
"metadata": {
"id": "I62_LJzc-p2b"
},
"source": [
"# Some utility imports\n",
"import os\n",
"from omegaconf import OmegaConf"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "K_M8wpkwd7d7"
},
"source": [
"# This is where the Google Speech Commands directory will be placed.\n",
"# Change this if you don't want the data to be extracted in the current directory.\n",
"# Select the version of the dataset required as well (can be 1 or 2)\n",
"DATASET_VER = 1\n",
"data_dir = './google_dataset_v{0}/'.format(DATASET_VER)\n",
"\n",
"if DATASET_VER == 1:\n",
" MODEL_CONFIG = \"matchboxnet_3x1x64_v1.yaml\"\n",
"else:\n",
" MODEL_CONFIG = \"matchboxnet_3x1x64_v2.yaml\"\n",
"\n",
"if not os.path.exists(f\"configs/{MODEL_CONFIG}\"):\n",
" !wget -P configs/ \"https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/matchboxnet/{MODEL_CONFIG}\""
],
"execution_count": null,
"outputs": []
},
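{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, we can load the downloaded config with `OmegaConf` (imported above) and print the first part of it. This is just a peek at the file we fetched, not a required step."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sanity check: load the downloaded YAML config and peek at its contents.\n",
"config_path = f\"configs/{MODEL_CONFIG}\"\n",
"config = OmegaConf.load(config_path)\n",
"print(OmegaConf.to_yaml(config)[:500])  # print only the first few hundred characters"
],
"execution_count": null,
"outputs": []
},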
{
"cell_type": "markdown",
"metadata": {
"id": "tvfwv9Hjf1Uv"
},
"source": [
"# Data Preparation\n",
"\n",
"We will be using the open-source Google Speech Commands Dataset (we will use V1 of the dataset for the tutorial but require minor changes to support the V2 dataset). These scripts below will download the dataset and convert it to a format suitable for use with NeMo."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6VL10OXTf8ts"
},
"source": [
"## Download the dataset\n",
"\n",
"The dataset must be prepared using the scripts provided under the `{NeMo root directory}/scripts` sub-directory. \n",
"\n",
"Run the following command below to download the data preparation script and execute it.\n",
"\n",
"**NOTE**: You should have at least 4GB of disk space available if you’ve used --data_version=1; and at least 6GB if you used --data_version=2. Also, it will take some time to download and process, so go grab a coffee.\n",
"\n",
"**NOTE**: You may additionally pass a `--rebalance` flag at the end of the `process_speech_commands_data.py` script to rebalance the class samples in the manifest."
]
},
{
"cell_type": "code",
"metadata": {
"id": "oqKe6_uLfzKU"
},
"source": [
"if not os.path.exists(\"process_speech_commands_data.py\"):\n",
" !wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/dataset_processing/process_speech_commands_data.py"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "TTsxp0nZ1zqo"
},
"source": [
"### Preparing the manifest file\n",
"\n",
"The manifest file is a simple file that has the full path to the audio file, the duration of the audio file, and the label that is assigned to that audio file. \n",
"\n",
"This notebook is only a demonstration, and therefore we will use the `--skip_duration` flag to speed up construction of the manifest file.\n",
"\n",
"**NOTE: When replicating the results of the paper, do not use this flag and prepare the manifest file with correct durations.**"
]
},
{
"cell_type": "code",
"metadata": {
"id": "cWUtDpzKgop9"
},
"source": [
"!mkdir {data_dir}\n",
"!python process_speech_commands_data.py --data_root={data_dir} --data_version={DATASET_VER} --skip_duration --log\n",
"print(\"Dataset ready !\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "eVsPFxJtg30p"
},
"source": [
"## Prepare the path to manifest files"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ytTFGVe0g9wk"
},
"source": [
"dataset_path = 'google_speech_recognition_v{0}'.format(DATASET_VER)\n",
"dataset_basedir = os.path.join(data_dir, dataset_path)\n",
"\n",
"train_dataset = os.path.join(dataset_basedir, 'train_manifest.json')\n",
"val_dataset = os.path.join(dataset_basedir, 'validation_manifest.json')\n",
"test_dataset = os.path.join(dataset_basedir, 'validation_manifest.json')"
],
"execution_count": null,
"outputs": []
},
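{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on, it is worth verifying that the manifest files actually exist at these paths. This is a minimal check; the file names assume the defaults written by `process_speech_commands_data.py`."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Minimal sanity check that the manifests were created where we expect them.\n",
"for name, path in [('train', train_dataset), ('val', val_dataset), ('test', test_dataset)]:\n",
"    status = 'found' if os.path.exists(path) else 'MISSING'\n",
"    print(f'{name:5s} manifest: {path} [{status}]')"
],
"execution_count": null,
"outputs": []
},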
{
"cell_type": "markdown",
"metadata": {
"id": "s0SZy9SEhOBf"
},
"source": [
"## Read a few rows of the manifest file \n",
"\n",
"Manifest files are the data structure used by NeMo to declare a few important details about the data :\n",
"\n",
"1) `audio_filepath`: Refers to the path to the raw audio file
\n",
"2) `command`: The class label (or speech command) of this sample
\n",
"3) `duration`: The length of the audio file, in seconds."
]
},
{
"cell_type": "code",
"metadata": {
"id": "HYBidCMIhKQV"
},
"source": [
"!head -n 5 {train_dataset}"
],
"execution_count": null,
"outputs": []
},
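{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since each manifest row is a standalone JSON object, the rows can also be parsed programmatically with the standard `json` module. A minimal sketch (the field names follow the description above):"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Parse the first few manifest rows; each line is an independent JSON object.\n",
"import json\n",
"\n",
"with open(train_dataset, 'r') as f:\n",
"    for i, line in enumerate(f):\n",
"        if i >= 3:\n",
"            break\n",
"        sample = json.loads(line)\n",
"        print(sample['audio_filepath'], '|', sample['command'], '|', sample['duration'])"
],
"execution_count": null,
"outputs": []
},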
{
"cell_type": "markdown",
"metadata": {
"id": "r-pyUBedh8f4"
},
"source": [
"# Training - Preparation\n",
"\n",
"We will be training a MatchboxNet model from the paper [\"MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition\"](https://arxiv.org/abs/2004.08531). The benefit of MatchboxNet over JASPER models is that they use 1D Time-Channel Separable Convolutions, which greatly reduce the number of parameters required to obtain good model accuracy.\n",
"\n",
"MatchboxNet models generally follow the model definition pattern QuartzNet-[BxRXC], where B is the number of blocks, R is the number of convolutional sub-blocks, and C is the number of channels in these blocks. Each sub-block contains a 1-D masked convolution, batch normalization, ReLU, and dropout.\n",
"\n",
"An image of QuartzNet, the base configuration of MatchboxNet models, is provided below.\n"
]
},
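{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the \"1D time-channel separable convolution\" concrete, here is a minimal PyTorch sketch of one such sub-block: a depthwise `Conv1d` over time followed by a pointwise (1x1) `Conv1d` across channels, then batch normalization, ReLU, and dropout. The channel counts and kernel size are illustrative choices, not values from the paper; NeMo builds the real blocks from the YAML config downloaded earlier."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative sketch of a 1D time-channel separable convolution sub-block.\n",
"# Channel counts and kernel size are arbitrary demo values.\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"class SeparableConvBlock(nn.Module):\n",
"    def __init__(self, in_ch, out_ch, kernel_size, dropout=0.1):\n",
"        super().__init__()\n",
"        # Depthwise: each input channel is convolved independently over time.\n",
"        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,\n",
"                                   padding=kernel_size // 2, groups=in_ch)\n",
"        # Pointwise: a 1x1 convolution mixes information across channels.\n",
"        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)\n",
"        self.bn = nn.BatchNorm1d(out_ch)\n",
"        self.act = nn.ReLU()\n",
"        self.drop = nn.Dropout(dropout)\n",
"\n",
"    def forward(self, x):  # x: (batch, channels, time)\n",
"        return self.drop(self.act(self.bn(self.pointwise(self.depthwise(x)))))\n",
"\n",
"block = SeparableConvBlock(in_ch=64, out_ch=64, kernel_size=13)\n",
"dummy = torch.randn(2, 64, 128)  # (batch, mel channels, time steps)\n",
"print(block(dummy).shape)  # -> torch.Size([2, 64, 128])"
],
"execution_count": null,
"outputs": []
},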
{
"cell_type": "markdown",
"metadata": {
"id": "T0sV4riijHJF"
},
"source": [
"
\n",
" \n",
"