

# How do I customize TTS pronunciations?

This tutorial walks you through the basics of NeMo TTS pronunciation customization. 

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.
Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""

BRANCH = 'r1.17.0'
# # If you're using Google Colab and not running locally, uncomment and run this cell.
# !apt-get install sox libsndfile1 ffmpeg
# !pip install wget text-unidecode 
# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]


## Grapheme-to-phoneme (G2P) Overview

Modern **text-to-speech** (TTS) models can learn pronunciations from raw text input and its corresponding audio data.
Sometimes, however, it is desirable to customize pronunciations, for example, for domain-specific terms. As a result, many TTS systems are trained on both grapheme and phonetic input, so that pronunciations can be accessed and corrected directly at inference time.


[The International Phonetic Alphabet (IPA)](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) and [ARPABET](https://en.wikipedia.org/wiki/ARPABET) are the most common phonetic alphabets. 

There are two ways to customize pronunciations:

1. Pass phonemes directly as input to the TTS model. Such request-time overrides are best suited for one-off adjustments.
2. Configure the TTS model with the desired domain-specific terms using a custom phonetic dictionary.

Both methods require users to convert graphemes into phonemes (G2P). 

#### For G2P purposes, all words can be divided into the following groups:
* *known* words - words that are present in the model's phonetic dictionary
* *out-of-vocabulary (OOV)* words - words that are missing from the model's phonetic dictionary. 
* *[heteronyms](https://en.wikipedia.org/wiki/Heteronym_(linguistics))* - words with the same spelling but different pronunciations and/or meanings, e.g., *bass* (the fish) and *bass* (the musical instrument).
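
The three groups above can be illustrated with a minimal, NeMo-independent sketch. The toy dictionary, the heteronym set, and the `classify_word` helper are hypothetical names introduced only for this illustration; they are not part of the NeMo API.

```python
from typing import Dict, List, Set

# Toy data for illustration only; not taken from any NeMo dictionary.
TOY_PHONEME_DICT: Dict[str, List[str]] = {
    "FEVER": ["ˈfivɚ"],
    "BASS": ["ˈbeɪs", "ˈbæs"],
}
TOY_HETERONYMS: Set[str] = {"BASS"}


def classify_word(word: str, phoneme_dict: Dict[str, List[str]], heteronyms: Set[str]) -> str:
    """Return 'heteronym', 'known', or 'oov' for a single word."""
    key = word.upper()
    # Heteronyms are checked first: they may also have dictionary entries.
    if key in heteronyms:
        return "heteronym"
    if key in phoneme_dict:
        return "known"
    return "oov"


groups = {
    word: classify_word(word, TOY_PHONEME_DICT, TOY_HETERONYMS)
    for word in ["fever", "bass", "paracetamol"]
}
print(groups)
```

Here *fever* is a known word, *bass* is a heteronym, and *paracetamol* is OOV.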

#### Important NeMo flags:
* `your_spec_generator_model.vocab.g2p.phoneme_dict` - phoneme dictionary that maps words to their phonetic transcriptions, e.g., [ARPABET-based CMU Dictionary](https://github.com/NVIDIA/NeMo/blob/r1.14.0/scripts/tts_dataset_files/cmudict-0.7b_nv22.10) or [IPA-based CMU Dictionary](https://github.com/NVIDIA/NeMo/blob/r1.14.0/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv22.10.txt)
* `your_spec_generator_model.vocab.g2p.heteronyms` - list of the model's heteronyms. The grapheme form of these words is used even if the word is present in the phoneme dictionary.
* `your_spec_generator_model.vocab.g2p.ignore_ambiguous_words` - if set to **True**, words with more than one phonetic representation in the pronunciation dictionary are ignored. This flag applies to words that have multiple valid phonetic transcriptions in the dictionary but are not in the `your_spec_generator_model.vocab.g2p.heteronyms` list.
* `your_spec_generator_model.vocab.phoneme_probability` - phoneme probability flag in the Tokenizer, with the same flag in the G2P module: `your_spec_generator_model.vocab.g2p.phoneme_probability` (a value in [0, 1]). Even if a word is present in the phoneme dictionary, we want the TTS model to see both graphemes and phonemes during training so that it can handle OOV words during inference. `phoneme_probability` is the probability that an unambiguous dictionary word appears in phonetic form during model training; `(1 - phoneme_probability)` is the probability that it appears as graphemes. This flag is set to `1` in the `parse()` method during inference.
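
To make the interaction of these flags concrete, here is a simplified sketch of the per-word decision they describe. It is not NeMo's implementation: `choose_representation` and the toy dictionary are hypothetical, and picking the first dictionary entry for an ambiguous word when `ignore_ambiguous_words` is off is an assumption of this sketch.

```python
import random
from typing import Dict, List, Optional, Set


def choose_representation(
    word: str,
    phoneme_dict: Dict[str, List[str]],
    heteronyms: Set[str],
    phoneme_probability: float = 1.0,
    ignore_ambiguous_words: bool = True,
    rng: Optional[random.Random] = None,
) -> str:
    """Return a phonetic form of `word`, or the word itself (graphemes)."""
    rng = rng or random.Random()
    key = word.upper()
    forms = phoneme_dict.get(key, [])
    # Heteronyms and OOV words keep their grapheme form.
    if key in heteronyms or not forms:
        return word
    # Words with several dictionary entries are skipped when the flag is set.
    if len(forms) > 1 and ignore_ambiguous_words:
        return word
    # With probability (1 - phoneme_probability) the graphemes are kept.
    if rng.random() >= phoneme_probability:
        return word
    # Assumption of this sketch: use the first listed pronunciation.
    return forms[0]


toy_dict = {"FEVER": ["ˈfivɚ"], "CAN": ["ˈkæn", "kən"]}
for token in ["fever", "can", "paracetamol"]:
    print(token, "->", choose_representation(token, toy_dict, heteronyms=set()))
```

With `phoneme_probability=1.0`, as at inference time, only the unambiguous dictionary word *fever* is converted to phonemes; *can* (two entries) and *paracetamol* (OOV) stay as graphemes.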

To ensure the desired pronunciation, we need to add a new entry to the model's phonetic dictionary. If the target word is already in the dictionary, we need to remove the default pronunciation so that only the target pronunciation is present. 
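
The "remove the default, then add the target" step can be sketched as a small file-level helper. This is an illustration under stated assumptions: `set_pronunciation` is a hypothetical helper, and the file is assumed to follow the CMUdict-style layout in which each line starts with the word, alternate pronunciations carry a `(1)`, `(2)`, ... suffix, and comment lines start with `;;;`.

```python
import os
import re
import tempfile
from pathlib import Path

# Matches the "(1)", "(2)" suffix that CMUdict-style files use for alternates.
_ALT_SUFFIX = re.compile(r"\(\d+\)$")


def set_pronunciation(dict_path: str, word: str, pronunciation: str) -> None:
    """Drop every existing entry for `word`, then append the new one."""
    target = word.upper()
    path = Path(dict_path)
    kept = []
    for line in path.read_text(encoding="utf-8").splitlines():
        parts = line.split(maxsplit=1)
        entry_word = _ALT_SUFFIX.sub("", parts[0]).upper() if parts else ""
        # Keep comments, blank lines, and entries for other words.
        if line.startswith(";;;") or entry_word != target:
            kept.append(line)
    kept.append(f"{target} {pronunciation}")
    path.write_text("\n".join(kept) + "\n", encoding="utf-8")


# Usage on a throwaway toy dictionary (not a real NeMo file).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "toy_dict.txt")
    Path(demo_path).write_text("CAN ˈkæn\nCAN(1) kən\nFEVER ˈfivɚ\n", encoding="utf-8")
    set_pronunciation(demo_path, "can", "ˈkæn")
    result = Path(demo_path).read_text(encoding="utf-8").splitlines()

print(result)
```

After the call, the toy file contains a single entry for *CAN*, so the word is no longer ambiguous.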

## Default G2P

Below we show how to analyze the default G2P output of NeMo models.

In [None]:
import os
import nemo.collections.tts as nemo_tts
from nemo.collections.tts.g2p.modules import IPAG2P
import soundfile as sf
import IPython.display as ipd
import torch

# Load mel spectrogram generator
spec_generator = nemo_tts.models.FastPitchModel.from_pretrained("tts_en_fastpitch_ipa").eval()
# Load vocoder
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan").eval()

In [None]:
# helper functions

def generate_audio(input_text):
    # parse() sets phoneme probability to 1, i.e., dictionary phoneme transcriptions are used for known words
    parsed = spec_generator.parse(input_text)
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    display(ipd.Audio(audio.detach().to('cpu').numpy(), rate=22050))


def display_postprocessed_text(text):
    # use dictionary entries for known words; not needed for generate_audio(), as parse() handles this
    spec_generator.vocab.phoneme_probability = 1
    spec_generator.vocab.g2p.phoneme_probability = 1
    print(f"Input before tokenization: {' '.join(spec_generator.vocab.g2p(text))}\n")

In [None]:
text = "paracetamol can help reduce fever."
generate_audio(text)



During preprocessing, unambiguous dictionary words are converted to phonemes, while OOV words and words with multiple dictionary entries are kept as graphemes. For example, **paracetamol** is missing from the phoneme dictionary, and **can** has two phonetic forms.

In [None]:
display_postprocessed_text(text)
for word in ["paracetamol", "can"]:
    word = word.upper()
    phoneme_forms = spec_generator.vocab.g2p.phoneme_dict[word]
    print(f"Number of phoneme forms for '{word}': {len(phoneme_forms)} -- {phoneme_forms}")

## Input customization

One can pass phonemes directly as input to the model to customize pronunciation.

Let's replace the word **paracetamol** with the desired phonemic transcription in our example by adding vertical bars around each phone, e.g., `ˌpæɹəˈsitəmɔl` -> `|ˌ||p||æ||ɹ||ə||ˈ||s||i||t||ə||m||ɔ||l|`.

In [None]:
print(f"Original text: {text}")

new_pronunciation = "ˌpæɹəˈsitəmɔl"
phoneme_form = "".join([f"|{s}|" for s in new_pronunciation])
text_with_phonemes = text.replace("paracetamol", phoneme_form)
print(f"Text with phonemes: {text_with_phonemes}")
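
The replacement in the cell above can be generalized to several words at once. The following is a sketch rather than a NeMo utility: `to_phoneme_input` and `apply_overrides` are hypothetical helpers, and the sketch assumes that each phoneme symbol is a single character, as in the cell above.

```python
import re
from typing import Dict


def to_phoneme_input(pronunciation: str) -> str:
    """Wrap every symbol in vertical bars, e.g. 'ab' -> '|a||b|'."""
    return "".join(f"|{symbol}|" for symbol in pronunciation)


def apply_overrides(text: str, overrides: Dict[str, str]) -> str:
    """Replace whole-word, case-insensitive matches with bar-delimited phonemes."""
    if not overrides:
        return text
    lookup = {word.lower(): to_phoneme_input(pron) for word, pron in overrides.items()}
    # Longer words first, so that a short word never shadows a longer one.
    alternatives = "|".join(re.escape(word) for word in sorted(lookup, key=len, reverse=True))
    pattern = re.compile(rf"\b({alternatives})\b", flags=re.IGNORECASE)
    # A single pass means already-inserted phonemes are never re-matched.
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


overridden = apply_overrides(
    "Paracetamol can help reduce fever.",
    {"paracetamol": "ˌpæɹəˈsitəmɔl"},
)
print(overridden)
```

The resulting string can be passed to `generate_audio()` in the same way as `text_with_phonemes`.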

In [None]:
generate_audio(text_with_phonemes)



## Dictionary customization

Below we show how to customize the phonetic dictionary for NeMo TTS models.

Let's add a new entry to the dictionary for the word **paracetamol**. 

In [None]:
# we download IPA-based CMU Dictionary and add a custom entry for the target word
ipa_cmu_dict = "ipa_cmudict-0.7b_nv22.10.txt"
if os.path.exists(ipa_cmu_dict):
    ! rm $ipa_cmu_dict

! wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/$ipa_cmu_dict

with open(ipa_cmu_dict, "a") as f:
    f.write(f"PARACETAMOL {new_pronunciation}\n")
 
! tail $ipa_cmu_dict

In [None]:
# let's now use our updated dictionary as the model's phonetic dictionary
spec_generator.vocab.g2p.replace_dict(ipa_cmu_dict)

**Paracetamol** is no longer an OOV word, and the model uses the phonetic form we provided:

In [None]:
display_postprocessed_text(text)

Finally, let's use the new phoneme dictionary for synthesis.

In [None]:
generate_audio(text)


# Resources
* [TTS pipeline customization](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-custom.html#tts-pipeline-configuration)
* [Overview of TTS in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/NeMo_TTS_Primer.ipynb)
* [G2P models in NeMo](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tts/g2p.html)
* [Riva TTS documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html)