# DefinedCrowd x NeMo - ASR Training

DefinedCrowd’s core business is providing **high quality AI training data** to our customers. Our workflows can serve as standalone or end-to-end data services to build any Speech-or-Text-enabled AI architecture from scratch, to improve solutions already developed, or to evaluate models in production, all with the DefinedCrowd Quality Guarantee.

NVIDIA NeMo is a toolkit built by NVIDIA for **creating conversational AI applications**. This toolkit includes collections of pre-trained modules for **Automatic Speech Recognition (ASR)**, Natural Language Processing (NLP), and Text-to-Speech (TTS), enabling researchers and data scientists to easily compose complex neural network architectures and focus on designing their applications.

In this tutorial, we want to demonstrate how to **connect DefinedCrowd Speech Workflows** to **train and improve an ASR model** using NVIDIA NeMo. The tutorial re-uses parts of a previous [ASR tutorial from NeMo](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb).

In [None]:
# First, let's install NeMo Toolkit and dependencies to run this notebook
!apt-get install -y libsndfile1 ffmpeg
!pip install Cython

## Install NeMo dependencies in the correct versions
!pip install torchtext==0.11.0 torch==1.10.0 pytorch-lightning==1.5.0

## Install NeMo
!python -m pip install nemo_toolkit[all]==1.0.0b3

## Obtaining data using DefinedCrowd API 

In this section, we demonstrate how to connect to the DefinedCrowd API to obtain collected speech data.


For more information, visit https://developers.definedcrowd.com/

In [None]:
# For the demo, we will be using a sandbox environment
auth_url = "https://sandbox-auth.definedcrowd.com"
api_url = "https://sandbox-api.definedcrowd.com"

In [None]:
# These variables should be obtained at the DefinedCrowd Enterprise Portal for your account.
client_id = "<INSERT HERE YOUR CLIENT ID>"
client_secret = "<INSERT HERE YOUR SECRET ID>"
project_id = "<INSERT HERE YOUR PROJECT ID>"

### Authentication

In [None]:
import requests, json

payload = {
    "client_id": client_id,
    "client_secret": client_secret,
    "grant_type": "client_credentials",
    "scope": "PublicAPIv2",
}
files = []
headers = {}

# Request the OAuth 2.0 access token
response = requests.request(
    "POST", f"{auth_url}/connect/token", headers=headers, data=payload, files=files
)
if response.status_code == 200:
    print("Authentication Success!")
    access_token = response.json()["access_token"]
else:
    # Stop here: the following cells need a valid access_token
    raise RuntimeError(f"Authentication failed (HTTP {response.status_code})")

Authentication Success!
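The request above follows the client-credentials pattern: post the credentials, read `access_token` from the JSON response, and send it as a bearer token afterwards. The sketch below isolates those two steps without touching the network. The helper names `build_token_payload` and `bearer_headers` are hypothetical and are not part of any DefinedCrowd SDK.

```python
def build_token_payload(client_id, client_secret, scope="PublicAPIv2"):
    """Form fields for the token request (same keys as the cell above)."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "client_credentials",
        "scope": scope,
    }


def bearer_headers(token_response):
    """Build the Authorization header from a parsed token response."""
    token = token_response.get("access_token")
    if not token:
        # Fail early instead of sending requests with a missing token
        raise ValueError("token response does not contain an access_token")
    return {"Authorization": f"Bearer {token}"}


# Illustration with placeholder values only
payload_example = build_token_payload("my-client-id", "my-client-secret")
headers_example = bearer_headers({"access_token": "abc123"})
print(headers_example)
```

Keeping the header construction in one place avoids repeating the `"Bearer " + access_token` concatenation in every request cell.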


### Get a list of deliverables

In [None]:
# GET /projects/{project-id}/deliverables
headers = {"Authorization": "Bearer " + access_token}
response = requests.request(
    "GET", f"{api_url}/projects/{project_id}/deliverables", headers=headers
)

if response.status_code == 200:
    # Pretty print the response
    print(json.dumps(response.json(), indent=4))

    # Get the first deliverable id
    deliverable_id = response.json()[0]["id"]

[
    {
        "projectId": "eb324e45-c4f9-41e7-b5cf-655aa693ae75",
        "id": "258f9e15-2937-4846-b9c3-3ae1164b7364",
        "type": "Flat",
        "fileName": "data_Flat_eb324e45-c4f9-41e7-b5cf-655aa693ae75_258f9e15-2937-4846-b9c3-3ae1164b7364_2021-03-22-14-34-37.zip",
        "createdTimestamp": "2021-03-22T14:34:37.8037259",
        "isPartial": false,
        "downloadCount": 2,
        "status": "Downloaded"
    }
]
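The cell above simply takes the first deliverable in the list. When a project has several deliverables, it may be preferable to pick the most recent complete one. The sketch below assumes that `isPartial` marks incomplete deliveries and that `createdTimestamp` values share one ISO-like format, so they sort correctly as strings; both are assumptions based on the sample response. The second entry in the sample list is made up for illustration, and `pick_latest_complete` is a hypothetical helper.

```python
def pick_latest_complete(deliverables):
    """Return the newest deliverable that is not flagged as partial."""
    complete = [d for d in deliverables if not d["isPartial"]]
    if not complete:
        raise ValueError("no complete deliverable available")
    # ISO-like timestamps in a single format sort chronologically as strings
    return max(complete, key=lambda d: d["createdTimestamp"])


sample_deliverables = [
    {
        "id": "258f9e15-2937-4846-b9c3-3ae1164b7364",
        "createdTimestamp": "2021-03-22T14:34:37.8037259",
        "isPartial": False,
    },
    {
        # Made-up entry: a later but partial delivery
        "id": "hypothetical-partial-deliverable",
        "createdTimestamp": "2021-03-23T09:00:00.0000000",
        "isPartial": True,
    },
]

chosen = pick_latest_complete(sample_deliverables)
print(chosen["id"])
```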


### Download the final deliverable for a speech data collection

In [None]:
# the name I want to give to my deliverable file
filename = "scripted_monologue_en_GB.zip"

# GET /projects/{project-id}/deliverables/{deliverable-id}/download
headers = {"Authorization": "Bearer " + access_token}
response = requests.request(
    "GET",
    f"{api_url}/projects/{project_id}/deliverables/{deliverable_id}/download/",
    headers=headers,
)

if response.status_code == 200:
    # save the deliverable file
    with open(filename, "wb") as fp:
        fp.write(response.content)
    print("Deliverable file saved with success!")

Deliverable file saved with success!


In [None]:
!unzip  scripted_monologue_en_GB.zip &> /dev/null
!rm -f en-gb_single-scripted_Dataset.zip

## Speech Dataset

In this section, we analyse the data obtained from DefinedCrowd. It consists of scripted speech collected through the DefinedCrowd Neevo platform from several speakers in the UK (DefinedCrowd crowd members).

Each row of the dataset contains information about the speech prompt, the crowd member, the device used, and the recording. This delivery contains the following fields:

**Recording**:
* RecordingId
* PromptId
* Prompt

**Audio File**:
* RelativeFileName
* Duration
* SampleRate
* BitDepth
* AudioCommunicationBand
* RecordingEnvironment

**Crowd Member**:
* SpeakerId
* Gender
* Age
* Accent
* LivingCountry

**Recording Device**:
* Manufacturer
* DeviceType
* Domain

This data can be used for multiple purposes, but in this tutorial we use it to improve an existing ASR model for British speakers.

In [None]:
import pandas as pd

# let's look into the metadata file
dataset = pd.read_csv("metadata.tsv", sep="\t", index_col=[0])
dataset.head(10)

Unnamed: 0,RecordingId,PromptId,RelativeFileName,Prompt,Duration,SpeakerId,Gender,Age,Manufacturer,DeviceType,Accent,Domain,SampleRate,BitDepth,AudioCommunicationBand,LivingCountry,Native,RecordingEnvironment
0,165559628,64977250,Audio/165559628.wav,The Avengers' extinction.,00:00:02.815,128209,Female,26,Apple,iPhone 6s,Suffolk,generic,16000,16,Broadband,United Kingdom,True,silent
1,165396529,64940978,Audio/165396529.wav,and smile in pictures to make everyone feel safe?,00:00:05.240,422843,Female,31,motorola,moto g(6),Hertfordshire,generic,16000,16,Broadband,United Kingdom,True,silent
2,165466090,64962327,Audio/165466090.wav,- (GUNSHOT) - (GROANS),00:00:03.560,458727,Male,53,Xiaomi,Mi MIX 3 5G,West Sussex,generic,16000,16,Broadband,United Kingdom,True,silent
3,165450603,64958468,Audio/165450603.wav,They had us dead to rights.,00:00:02.621,478075,Female,21,Apple,iPhone 6s,Worcestershire,generic,16000,16,Broadband,United Kingdom,True,silent
4,165454042,64959449,Audio/165454042.wav,The war is happening.,00:00:03.960,477240,Male,30,samsung,SM-G975F,Essex,generic,16000,16,Broadband,United Kingdom,True,silent
5,165493319,64967271,Audio/165493319.wav,Feel her heart beat.,00:00:03.200,480713,Male,31,HMD Global,TA-1012,Norfolk,generic,16000,16,Broadband,United Kingdom,True,silent
6,165845400,65000410,Audio/165845400.wav,Indian Ocean(Kerguelen Plateau),00:00:03.503,432925,Female,69,Apple,iPhone XR,"Scottish Borders, The",generic,16000,16,Broadband,United Kingdom,True,silent
7,165435025,64954084,Audio/165435025.wav,He's been forgotten.,00:00:01.968,478075,Female,21,Apple,iPhone 6s,Worcestershire,generic,16000,16,Broadband,United Kingdom,True,silent
8,165474374,64963765,Audio/165474374.wav,and travel hundreds of miles,00:00:03.711,434058,Female,32,Apple,iPhone 6s,Cumbria,generic,16000,16,Broadband,United Kingdom,True,silent
9,165770882,64995117,Audio/165770882.wav,summoned to prove their noble family lines.,00:00:03.839,480713,Male,31,HMD Global,TA-1012,Norfolk,generic,16000,16,Broadband,United Kingdom,True,silent


In [None]:
# Let's check the data for the first row
dataset.iloc[0]

RecordingId                               165559628
PromptId                                   64977250
RelativeFileName                Audio/165559628.wav
Prompt                    The Avengers' extinction.
Duration                               00:00:02.815
SpeakerId                                    128209
Gender                                       Female
Age                                              26
Manufacturer                                  Apple
DeviceType                                iPhone 6s
Accent                                      Suffolk
Domain                                      generic
SampleRate                                    16000
BitDepth                                         16
AudioCommunicationBand                    Broadband
LivingCountry                        United Kingdom
Native                                         True
RecordingEnvironment                         silent
Name: 0, dtype: object
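The `Duration` column is stored as an `HH:MM:SS.mmm` string, while the NeMo manifest built later expects a duration in seconds. The tutorial measures durations from the audio files directly; as a minimal sketch, the metadata value could also be converted as follows. The helper name `duration_to_seconds` is hypothetical, and the format is assumed from the rows shown above.

```python
def duration_to_seconds(duration):
    """Convert an 'HH:MM:SS.mmm' string to seconds as a float."""
    hours, minutes, seconds = duration.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Duration of the first row shown above
first_row_seconds = duration_to_seconds("00:00:02.815")
print(first_row_seconds)
```

With pandas, this could be applied to the whole column with `dataset["Duration"].map(duration_to_seconds)`.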

In [None]:
# How many rows do I have?
len(dataset)

50000

In [None]:
# Let's check some examples from our dataset
import librosa
import IPython.display as ipd

for index, row in dataset.sample(4, random_state=1).iterrows():

    # Use the sampled row directly instead of looking it up again by position
    print(f"Prompt: {row.Prompt}")
    audio_file = row.RelativeFileName

    # Load the audio at its native sample rate (sr=None) and listen to it
    audio, sample_rate = librosa.load(audio_file, sr=None)
    ipd.display(ipd.Audio(audio, rate=sample_rate))

Prompt: You got to be kidding me.


Prompt: waiting for you to finish knocking three times.


Prompt: And let me know if you get a hit on that malware.


Prompt: She had more reason than anyone in the Seven Kingdoms.


## Data Preparation

After downloading the speech data from the DefinedCrowd API, we need to adapt it to the format expected by NeMo for ASR training. For this, we need to create manifests for our training and evaluation data, including each audio file's metadata.

NeMo requires that we adapt our data to a [particular manifest format](https://github.com/NVIDIA/NeMo/blob/ebade85f6d10319ef59312cb2eefcba4fd298a3d/nemo/collections/asr/parts/manifest.py#L39). Each line corresponds to one audio sample, so the line count equals the number of samples represented by the manifest. A line must contain the path to an audio file, the corresponding transcript, and the audio sample duration. For example, here is what one line might look like in a NeMo-compatible manifest:
```
{"audio_filepath": "path/to/audio.wav", "duration": 3.45, "text": "this is a nemo tutorial"}
```
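Since each manifest line is an independent JSON object, a manifest can be checked with the standard library alone. The following sketch writes a two-line demo manifest to a temporary directory, reads it back, and verifies that the three keys described above are present. `read_manifest` and `REQUIRED_KEYS` are illustrative names, not part of NeMo, and the demo entries are placeholders.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"audio_filepath", "duration", "text"}


def read_manifest(path):
    """Parse a JSON-lines manifest and check the required keys."""
    entries = []
    with open(path, encoding="utf-8") as fin:
        for line_number, line in enumerate(fin, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            missing = REQUIRED_KEYS - entry.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing {sorted(missing)}")
            entries.append(entry)
    return entries


demo_lines = [
    {"audio_filepath": "path/to/audio.wav", "duration": 3.45, "text": "this is a nemo tutorial"},
    {"audio_filepath": "path/to/other.wav", "duration": 2.0, "text": "another sample"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_manifest.json")
    with open(demo_path, "w", encoding="utf-8") as fout:
        for item in demo_lines:
            fout.write(json.dumps(item) + "\n")
    demo_entries = read_manifest(demo_path)

total_seconds = sum(entry["duration"] for entry in demo_entries)
print(f"{len(demo_entries)} samples, {total_seconds:.2f} seconds of audio")
```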

For the creation of the manifest, we will also standardize the transcripts. 

In [None]:
import os

# Function to build a manifest
def build_manifest(dataframe, manifest_path):
    with open(manifest_path, "w") as fout:
        for index, row in dataframe.iterrows():
            transcript = row["Prompt"]

            # Our model will use lowercased data for training/testing
            transcript = transcript.lower()

            # Removing linguistic marks (they are not necessary for this demo)
            transcript = (
                transcript.replace("<s>", "")
                .replace("</s>", "")
                .replace("[b_s/]", "")
                .replace("[uni/]", "")
                .replace("[v_n/]", "")
                .replace("[filler/]", "")
                .replace('"', "")
                .replace("[n_s/]", "")
            )

            audio_path = row["RelativeFileName"]

            # Skip rows whose audio file is missing
            if not os.path.exists(audio_path):
                continue

            # Get the audio duration; skip the row if the file cannot be read,
            # so an undefined or stale duration never reaches the manifest
            try:
                duration = librosa.core.get_duration(filename=audio_path)
            except Exception as e:
                print(f"Skipping {audio_path}: {e}")
                continue

            # Write the metadata to the manifest
            metadata = {
                "audio_filepath": audio_path,
                "duration": duration,
                "text": transcript,
            }
            json.dump(metadata, fout)
            fout.write("\n")
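The chain of `.replace()` calls in `build_manifest` lists each linguistic mark explicitly. As a hedged alternative, the sketch below uses one regular expression. It assumes that every mark has the `[name/]` shape or is an `<s>`/`</s>` tag, as in the marks listed above; that is an assumption, not a documented guarantee of the delivery format. Unlike the original function, it also collapses the extra whitespace left behind by removed marks. `normalize_transcript` is a hypothetical helper name.

```python
import re

# Matches marks such as [filler/], [b_s/], <s> and </s> (assumed shapes)
TAG_PATTERN = re.compile(r"\[[a-z_]+/\]|</?s>")


def normalize_transcript(text):
    """Lowercase a prompt, drop linguistic marks and quotes, tidy spaces."""
    text = text.lower()
    text = TAG_PATTERN.sub(" ", text)
    text = text.replace('"', "")
    # Collapse runs of whitespace left behind by removed marks
    return " ".join(text.split())


cleaned = normalize_transcript('<s>Hello [filler/] "World"</s>')
print(cleaned)
```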

### Train and Test splits

In order to test the quality of our model, we need to reserve some data for model testing. We will be evaluating the model performance on this data.

In [None]:
import json
from sklearn.model_selection import train_test_split

# Hold out 10% of the rows for testing and use the remaining 90% for training
trainset, testset = train_test_split(dataset, test_size=0.1, random_state=1)

# Build the manifests
build_manifest(trainset, "train_manifest.json")
build_manifest(testset, "test_manifest.json")
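The random split above works on rows, so recordings from one speaker can land in both sets. If a speaker-independent evaluation is wanted, one option is to split by `SpeakerId` instead. The sketch below is an optional variant, not the method used in this tutorial. It operates on plain dictionaries (for a DataFrame, `dataset.to_dict("records")` yields such rows), and `speaker_disjoint_split` is a hypothetical helper with made-up demo data.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(rows, test_fraction=0.1, seed=1):
    """Split rows so that no SpeakerId appears in both train and test."""
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["SpeakerId"]].append(row)

    # Shuffle speakers reproducibly, then fill the test set first
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    target = test_fraction * len(rows)
    train, test = [], []
    for speaker in speakers:
        bucket = test if len(test) < target else train
        bucket.extend(by_speaker[speaker])
    return train, test


# Made-up demo data: 5 speakers with 4 recordings each
demo_rows = [
    {"SpeakerId": speaker, "RecordingId": speaker * 100 + n}
    for speaker in range(1, 6)
    for n in range(4)
]
demo_train, demo_test = speaker_disjoint_split(demo_rows)
print(len(demo_train), len(demo_test))
```

Because whole speakers are assigned at once, the test share only approximates `test_fraction`. scikit-learn's `GroupShuffleSplit` offers a similar grouped split.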

## Model Configuration


In this tutorial, we'll describe how to use the QuartzNet15x5 model as a base model for fine-tuning with our data. We want to improve recognition on our dataset, so we will first benchmark the base model and then the fine-tuned version.

Some of the following functions were taken from the NeMo tutorial on ASR, which is available at [https://github.com/NVIDIA/NeMo](https://github.com/NVIDIA/NeMo)

In [None]:
# Let's import Nemo and the functions for ASR
import torch
import nemo
import nemo.collections.asr as nemo_asr

import logging
from nemo.utils import _Logger

# Setup the log level by NeMo
logger = _Logger()
logger.set_verbosity(logging.ERROR)

### Training Parameters

For training, NeMo keeps all the parameters in a Python dictionary. More information about it is available in the [NeMo ASR Config User Guide](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/configs.html). 

For this tutorial, we will load an existing file with the standard ASR configuration and change only the necessary fields.

In [None]:
## Download the config we'll use in this example
!mkdir configs
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/stable/examples/asr/conf/config.yaml &> /dev/null

# --- Config Information ---#
from ruamel.yaml import YAML

config_path = "./configs/config.yaml"

yaml = YAML(typ="safe")
with open(config_path) as f:
    params = yaml.load(f)
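The loaded `params` object is a nested dictionary, and the later cells change a few of its fields in place. As a small sketch of the same idea, the hypothetical helper `with_overrides` below applies dotted-key overrides to a copy, so the original configuration stays untouched. The `base_config` values are illustrative stand-ins, not the contents of `config.yaml`.

```python
import copy


def with_overrides(config, overrides):
    """Return a deep copy of config with dotted-key overrides applied."""
    updated = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = updated
        for key in parents:
            node = node[key]  # raises KeyError if a section is missing
        node[leaf] = value
    return updated


# Illustrative stand-in for the loaded configuration
base_config = {
    "model": {
        "train_ds": {"manifest_filepath": None, "batch_size": 32},
        "optim": {"lr": 0.01},
    }
}

tuned_config = with_overrides(
    base_config,
    {
        "model.train_ds.manifest_filepath": "train_manifest.json",
        "model.train_ds.batch_size": 8,
        "model.optim.lr": 0.001,
    },
)
print(tuned_config["model"]["train_ds"])
```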

### The Base Model

For our ASR model, we will use a pre-trained QuartzNet15x5 model from NVIDIA's NGC cloud. ([List of pre-trained models from NeMo](https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels))

 *Description of the pre-trained model*: QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other.

In [None]:
# This line will download pre-trained QuartzNet15x5 model from NVIDIA's NGC cloud and instantiate it for you
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En", strict=False)

#### Base Model Performance 

The Word Error Rate (WER) is a valuable metric for comparing different ASR models and for evaluating improvements within one system. To obtain the final results, we measure how the model performs on the test set.
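For reference, WER is the word-level edit distance between the reference transcript and the model's hypothesis (substitutions, deletions, and insertions), divided by the number of reference words. The pure-Python sketch below illustrates this definition. It is not the implementation NeMo uses internally, and the function names are chosen for this example.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance computed over words rather than characters."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """WER = word edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One deleted word out of four reference words
example_wer = word_error_rate("the war is happening", "the war happening")
print(f"WER = {example_wer * 100:.2f}%")
```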

In [None]:
# Let's configure our model parameters for testing

# Parameters for training, validation, and testing are specified using the 
# train_ds, validation_ds, and test_ds sections of your configuration file

# Bigger batch-size = bigger throughput
params["model"]["validation_ds"]["batch_size"] = 8

# Setup the test data loader and make sure the model is on GPU
params["model"]["validation_ds"]["manifest_filepath"] = "test_manifest.json"
quartznet.setup_test_data(test_data_config=params["model"]["validation_ds"])

# Comment this line if you don't want to use GPU acceleration
_ = quartznet.cuda()

In [None]:
# We will be computing the Word Error Rate (WER) metric between our reference transcripts and the model's predictions.

wer_numerators = []
wer_denominators = []

# Loop over all test batches.
# Iterating over the model's `test_dataloader` will give us:
# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)
# See the AudioToCharDataset for more details.
with torch.no_grad():
    for test_batch in quartznet.test_dataloader():
        input_signal, input_signal_length, targets, targets_lengths = [x.cuda() for x in test_batch]
                
        log_probs, encoded_len, greedy_predictions = quartznet(
            input_signal=input_signal, 
            input_signal_length=input_signal_length
        )
        # Notice the model has a helper object to compute WER
        quartznet._wer.update(greedy_predictions, targets, targets_lengths)
        _, wer_numerator, wer_denominator = quartznet._wer.compute()
        wer_numerators.append(wer_numerator.detach().cpu().numpy())
        wer_denominators.append(wer_denominator.detach().cpu().numpy())

In [None]:
# We need to sum all numerators and denominators first. Then divide.
print(f"WER = {sum(wer_numerators)/sum(wer_denominators)*100:.2f}%")

WER = 39.70%
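The comment about summing all numerators and denominators before dividing matters because batches contain different numbers of words. Averaging per-batch WER values would give short batches too much weight. A small sketch with made-up counts shows the difference.

```python
# Made-up counts for two batches: word errors and reference words
batch_errors = [1, 1]
batch_words = [2, 8]

# Pooled WER: total errors over total words (the approach used above)
pooled_wer = sum(batch_errors) / sum(batch_words)

# Mean of per-batch ratios: overweights the short first batch
mean_of_ratios = sum(e / w for e, w in zip(batch_errors, batch_words)) / len(batch_errors)

print(f"pooled WER     = {pooled_wer:.4f}")
print(f"mean of ratios = {mean_of_ratios:.4f}")
```

Here the pooled value is 2 errors over 10 words, while the mean of the two per-batch ratios is noticeably higher.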


### Model Fine-tuning

The base model reaches a WER of 39.70%, which leaves considerable room for improvement. Let's see if providing some data from the same domain and language dialects can improve our ASR model.

To keep things simple, we are going to train for only 1 epoch using DefinedCrowd's data. 

In [None]:
import pytorch_lightning as pl
from omegaconf import DictConfig
import copy

# Before training, we need to configure the training data and the optimizer

# Provide the train manifest for training
params["model"]["train_ds"]["manifest_filepath"] = "train_manifest.json"

# Use a smaller learning rate for fine-tuning
new_opt = copy.deepcopy(params["model"]["optim"])
new_opt["lr"] = 0.001
quartznet.setup_optimization(optim_config=DictConfig(new_opt))

# Batch size will depend on the GPU memory available
params["model"]["train_ds"]["batch_size"] = 8

# Point to the data we'll use for fine-tuning as the training set
quartznet.setup_training_data(train_data_config=params["model"]["train_ds"])

# clean torch cache
torch.cuda.empty_cache()

# And now we can create a PyTorch Lightning trainer.
trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=1)

# And the fit function will start the training
trainer.fit(quartznet)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConvASREncoder                    | 18.9 M
2 | decoder           | ConvASRDecoder                    | 29.7 K
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | _wer              | WER                               | 0     


1

#### Fine-tuned Model Performance

Let's compare the base model's performance with that of the fine-tuned model we got from training with additional data.

In [None]:
# Let's configure our model parameters for testing
params["model"]["validation_ds"]["batch_size"] = 8

# Setup the test data loader and make sure the model is on GPU
params["model"]["validation_ds"]["manifest_filepath"] = "test_manifest.json"
quartznet.setup_test_data(test_data_config=params["model"]["validation_ds"])
_ = quartznet.cuda()

In [None]:
# We will be computing the Word Error Rate (WER) metric between our reference transcripts and the model's predictions.

wer_numerators = []
wer_denominators = []

# Loop over all test batches.
# Iterating over the model's `test_dataloader` will give us:
# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)
# See the AudioToCharDataset for more details.
with torch.no_grad():
    for test_batch in quartznet.test_dataloader():
        input_signal, input_signal_length, targets, targets_lengths = [x.cuda() for x in test_batch]
                
        log_probs, encoded_len, greedy_predictions = quartznet(
            input_signal=input_signal, 
            input_signal_length=input_signal_length
        )
        # Notice the model has a helper object to compute WER
        quartznet._wer.update(greedy_predictions, targets, targets_lengths)
        _, wer_numerator, wer_denominator = quartznet._wer.compute()
        wer_numerators.append(wer_numerator.detach().cpu().numpy())
        wer_denominators.append(wer_denominator.detach().cpu().numpy())

In [None]:
# We need to sum all numerators and denominators first. Then divide.
print(f"WER = {sum(wer_numerators)/sum(wer_denominators)*100:.2f}%")

WER = 24.36%


After fine-tuning for a single epoch, the model reaches a Word Error Rate (WER) of 24.36%, an improvement over the 39.70% of the base model. For better results, consider training for more epochs.
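The improvement can be expressed both as an absolute difference in percentage points and as a relative reduction with respect to the base model. The short calculation below uses only the two WER values reported in this tutorial.

```python
base_wer = 39.70        # WER of the base model, in percent
fine_tuned_wer = 24.36  # WER after one epoch of fine-tuning, in percent

# Absolute reduction in percentage points
absolute_reduction = base_wer - fine_tuned_wer

# Relative reduction with respect to the base model
relative_reduction = absolute_reduction / base_wer

print(f"Absolute reduction: {absolute_reduction:.2f} points")
print(f"Relative reduction: {relative_reduction * 100:.1f}%")
```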

## Conclusion

In this tutorial, we demonstrated how to load speech data collected by DefinedCrowd and how to use it to train and measure the performance of an automatic speech recognition (ASR) model.