{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "collapsed_sections": [ "dm-qqTdZDUlZ", "GGKgsW5gvAuf", "0CqpJGR6ecYW" ], "toc_visible": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "pEYsuj0J9pId" }, "outputs": [], "source": [ "\"\"\"\n", "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", "\n", "Instructions for setting up Colab are as follows:\n", "1. Open a new Python 3 notebook.\n", "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GitHub\" tab -> copy/paste GitHub URL)\n", "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", "4. Run this cell to set up dependencies.\n", "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", "\"\"\"\n", "# If you're using Google Colab and not running locally, run this cell.\n", "import os\n", "\n", "# Install dependencies\n", "!pip install wget\n", "!apt-get install sox libsndfile1 ffmpeg\n", "!pip install text-unidecode\n", "!pip install matplotlib>=3.3.2\n", "\n", "## Install NeMo\n", "BRANCH = 'r1.17.0'\n", "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "## Grab the config we'll use in this example\n", "# !mkdir configs\n", "# !wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/contextnet_rnnt/contextnet_rnnt.yaml" ] }, { "cell_type": "markdown", "source": [ "# ASR Domain Adaptation with Adapters\n", "\n", "Throughout various Automatic Speech Recognition tutorials, you may have noticed that ASR datasets are generally enormous - on the order of hundreds or even thousands of hours of speech. Such a large amount of data imposes significant restrictions on the development of ASR models due to the cost of collection of labeled data (or even unlabeled datasets) and the severe compute cost to train a model on that much data. \n", "\n", "We further see that when fine-tuning a pre-trained model, we need large datasets and compute to achieve superior results and avoid overfitting to the new dataset. Worse, by training the entire model (or even just the decoder modules), we severely degrade the model's performance on the original dataset.\n", "\n", "-----\n", "\n", "In this tutorial, we will showcase **Adapters** : A powerful method to efficiently adapt a pre-trained model to a new dataset (with minimal amounts of data, even just 30 minutes !) with minimal compute resources (on a single GPU, in around 10 minutes of training time).\n" ], "metadata": { "id": "cTV4WLrArmxS" } }, { "cell_type": "markdown", "source": [ "## What are Adapters?\n", "\n", "Adapters are a straightforward concept proposed in multiple papers across various domains. For the sake of brevity, we will choose a recent paper [Using Adapters to Overcome Catastrophic Forgetting in End-to-End Automatic Speech Recognition](https://arxiv.org/abs/2203.16082) as a reference for this discussion.\n", "\n", "Essentially, an **Adapter** is any **trainable module that is added * after * a model has been trained to convergence**. \n", "\n", "- These additional modules form a residual bridge over the output of each layer they adapt, such that the model's original performance is not lost. \n", "- The original parameters of the model are frozen in their entirety - so that we don't lose performance on the original domain.\n", "- We train only the new adapter parameters (an insignificant fraction of the total number of parameters). This allows fast experimentation.\n", "\n", "-----\n", "\n", "Adapters are a straightforward concept - as shown by the diagram below. At their simplest, they are residual Feedforward layers that compress the input dimension ($D$) to a small bottleneck dimension ($H$), such that $R^D \\text{->} R^H$, compute an activation (such as ReLU), finally mapping $R^H \\text{->} R^D$ with another Feedforward layer. This output is then added to the input via a simple residual connection.\n", "\n", "