# TTS Inference Prosody Control

This notebook is intended to teach users how to control duration and pitch with the FastPitch model.

# License

> Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
>
> Licensed under the Apache License, Version 2.0 (the "License");
> you may not use this file except in compliance with the License.
> You may obtain a copy of the License at
>
> http://www.apache.org/licenses/LICENSE-2.0
>
> Unless required by applicable law or agreed to in writing, software
> distributed under the License is distributed on an "AS IS" BASIS,
> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> See the License for the specific language governing permissions and
> limitations under the License.

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.
Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
BRANCH = 'r1.17.0'
# # If you're using Google Colab and not running locally, uncomment and run this cell.
# !apt-get install sox libsndfile1 ffmpeg
# !pip install wget text-unidecode
# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]


## Setup

Run the cell below to import the libraries used in this notebook. The cell also loads the FastPitch and HiFi-GAN models used throughout the notebook and defines two helper functions: one runs inference, and the other plots the inference results.

In [None]:
# Import all libraries
import IPython.display as ipd
import librosa
import librosa.display
import numpy as np
import torch
from matplotlib import pyplot as plt
%matplotlib inline

# Reduce logging messages for this notebook
from nemo.utils import logging
logging.setLevel(logging.ERROR)

from nemo.collections.tts.models import FastPitchModel
from nemo.collections.tts.models import HifiGanModel
from nemo.collections.tts.parts.utils.helpers import regulate_len

# Load the models from NGC
fastpitch = FastPitchModel.from_pretrained("tts_en_fastpitch").eval().cuda()
hifigan = HifiGanModel.from_pretrained("tts_en_hifigan").eval().cuda()
sr = 22050

# Define some helper functions
# Define a helper function to go from string to audio
def str_to_audio(inp, pace=1.0, durs=None, pitch=None):
    with torch.no_grad():
        tokens = fastpitch.parse(inp)
        spec, _, durs_pred, _, pitch_pred, *_ = fastpitch(text=tokens, durs=durs, pitch=pitch, speaker=None, pace=pace)
        audio = hifigan.convert_spectrogram_to_audio(spec=spec).to('cpu').numpy()
    return spec, audio, durs_pred, pitch_pred

# Define a helper function to plot spectrograms with pitch and display the audio
def display_pitch(audio, pitch, sr=22050, durs=None):
    fig, ax = plt.subplots(figsize=(12, 6))
    spec = np.abs(librosa.stft(audio[0], n_fft=1024))
    # Check to see if pitch has been unnormalized
    if torch.abs(torch.mean(pitch)) <= 1.0:
        # Unnormalize the pitch with LJSpeech's mean and std
        pitch = pitch * 65.72037058703644 + 214.72202032404294
    # Check to see if pitch has been expanded to the spec length yet
    if len(pitch) != spec.shape[0] and durs is not None:
        pitch = regulate_len(durs, pitch.unsqueeze(-1))[0].squeeze(-1)
    # Plot and display audio, spectrogram, and pitch
    ax.plot(pitch.cpu().numpy()[0], color='cyan', linewidth=1)
    librosa.display.specshow(np.log(spec+1e-12), y_axis='log')
    ipd.display(ipd.Audio(audio, rate=sr))
    plt.show()

## Duration Control

This section applies to models that use a duration predictor module. This module is part of the Length Regulator introduced in FastSpeech [1]. The following NeMo models include a duration predictor:

- [FastPitch](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastpitch)
- [FastPitch_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastpitchhifigan)
- [FastSpeech2](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastspeech_2)
- [FastSpeech2_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastspeech2hifigan)
- [TalkNet](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_talknet)
- [Glow-TTS](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_glowtts)

While each model has its own implementation of this duration predictor, all of them follow a simple convolutional architecture. The input is the encoded tokens, and the output of the module is a value that represents how many frames in the decoder correspond to each token. It is essentially a hard attention mechanism.
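
To make the "frames per token" idea concrete, here is a minimal sketch in plain Python. `expand_by_durations` is a hypothetical helper written for this illustration, not a NeMo function, and it uses lists of characters where a real model would use tensors of encoder outputs.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expand_by_durations(encoded: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each per-token item durations[i] times (hypothetical helper)."""
    if len(encoded) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames: List[T] = []
    for item, n_frames in zip(encoded, durations):
        # A duration of 0 drops the token from the output entirely.
        frames.extend([item] * n_frames)
    return frames


# Three tokens lasting 2, 1, and 3 decoder frames respectively.
frames = expand_by_durations(["h", "e", "y"], [2, 1, 3])
print(frames)  # ['h', 'h', 'e', 'y', 'y', 'y']
```

Each decoder frame is tied to exactly one token, which is why the mechanism behaves like a hard attention.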

Since each model outputs a duration value per token, we can slow down or speed up the speech by increasing or decreasing these values, respectively. Consider the following:

```python
def regulate_len(durations, pace=1.0):
    # The output from the duration predictor module is still a float.
    # To make the speech faster, increase the pace so that each token duration gets shorter.
    # To slow the speech down, decrease the pace parameter.
    durations = durations.float() / pace
    # Lastly, we need to make the durations integers for subsequent processes
    return durations.long()
```
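
As a worked example of the arithmetic, the sketch below applies the same divide-then-truncate rule to a made-up list of durations and converts the resulting frame count to seconds. The helper names and the durations are hypothetical, and `HOP_LENGTH = 256` is an assumption about how many audio samples each spectrogram frame covers; check your model's configuration for the actual value.

```python
from typing import List

SAMPLE_RATE = 22050  # matches `sr` in this notebook
HOP_LENGTH = 256     # assumption: audio samples per spectrogram frame


def scale_durations(durations: List[int], pace: float = 1.0) -> List[int]:
    """Divide by pace, then truncate toward zero like `.long()` does."""
    return [int(d / pace) for d in durations]


def frames_to_seconds(n_frames: int) -> float:
    """Convert a number of spectrogram frames to seconds of audio."""
    return n_frames * HOP_LENGTH / SAMPLE_RATE


durations = [3, 5, 2, 6]  # made-up frames per token, 16 frames in total
for pace in (1.0, 2.0, 0.5):
    scaled = scale_durations(durations, pace)
    print(pace, scaled, sum(scaled), round(frames_to_seconds(sum(scaled)), 3))
```

Note that a pace of 2.0 yields 7 frames rather than 8 here: truncating each token separately means the total length is only approximately proportional to `1 / pace`.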

Let's try this out with FastPitch.

In [None]:
#Define what we want the model to say
input_string = "Hey, I am speaking at different paces!" # Feel free to change it and experiment

# Let's run fastpitch normally
_, audio, *_ = str_to_audio(input_string)
print(f"This is fastpitch speaking at the regular pace of 1.0. This example is {len(audio[0])/sr:.3f} seconds long.")
ipd.display(ipd.Audio(audio, rate=sr))

# We can speed up the speech by adjusting the pace
_, audio, *_ = str_to_audio(input_string, pace=1.2)
print(f"This is fastpitch speaking at the faster pace of 1.2. This example is {len(audio[0])/sr:.3f} seconds long.")
ipd.display(ipd.Audio(audio, rate=sr))

# We can slow down the speech by adjusting the pace
_, audio, *_ = str_to_audio(input_string, pace=0.75)
print(f"This is fastpitch speaking at the slower pace of 0.75. This example is {len(audio[0])/sr:.3f} seconds long.")
ipd.display(ipd.Audio(audio, rate=sr))

## Pitch Control

Newer spectrogram generator models also predict pitch. Since these models predict pitch, we can adjust the predicted pitch in much the same way as we adjusted the predicted durations in the previous section. The following NeMo models support pitch control:

- [FastPitch](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastpitch)
- [FastPitch_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastpitchhifigan)
- [FastSpeech2](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastspeech_2)
- [FastSpeech2_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastspeech2hifigan)
- [TalkNet](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_talknet)

### FastPitch

As in the previous section, we will focus on FastPitch. FastPitch differs from some other models in that it predicts pitch normalized by the speaker's pitch statistics (mean 0, std 1), that is, as a difference from the speaker's mean pitch. Other models simply predict the unnormalized pitch. A simplified version of the FastPitch model looks like this:

```python
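# Predict pitch
pitch_predicted = self.pitch_predictor(enc_out, enc_mask) # Predicts a pitch that is normalized with speaker statistics
pitch = pitch_predicted if pitch is None else pitch # Simplified: use the prediction unless the caller supplies a pitch
pitch_emb = self.pitch_emb(pitch.unsqueeze(1)) # A simple 1D convolution to map the float pitch to an embedding pitch

enc_out = enc_out + pitch_emb.transpose(1, 2) # We add the pitch to the encoder output
len_regulated, dec_lens = regulate_len(durs_predicted, enc_out, pace) # Simplified: expand the per-token sum to one vector per frame
spec, *_ = self.decoder(input=len_regulated, seq_lens=dec_lens) # We send the sum to the decoder to get the spectrogram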
```
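
Because the model works in normalized units, it helps to be able to move between normalized pitch and Hz. The sketch below uses the LJSpeech mean and std that appear in this notebook's cells; the function names are hypothetical and the code works on plain floats rather than tensors.

```python
import math

# LJSpeech pitch statistics used throughout this notebook
PITCH_MEAN = 214.72202032404294  # Hz
PITCH_STD = 65.72037058703644    # Hz


def denormalize_pitch(pitch_norm: float) -> float:
    """Map a normalized pitch value (mean 0, std 1) to Hz."""
    return pitch_norm * PITCH_STD + PITCH_MEAN


def normalize_pitch(pitch_hz: float) -> float:
    """Map a pitch value in Hz back to normalized units."""
    return (pitch_hz - PITCH_MEAN) / PITCH_STD


# 0 is the speaker mean; +1 is one standard deviation above it.
for value in (-1.0, 0.0, 1.0):
    print(value, round(denormalize_pitch(value), 1))

# Converting to Hz and back recovers the original value (up to float error).
assert math.isclose(normalize_pitch(denormalize_pitch(0.3)), 0.3)
```

A normalized value of 0 therefore corresponds to roughly 214.7 Hz, and a value of 1 to roughly 280.4 Hz.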

Let's look at `pitch_predicted` for a sample text by running the cell below. For the input `Hey, what is my pitch?`, the cell plots the predicted pitch contour on top of the spectrogram.

Notice that the last word `pitch` has an increase in pitch to stress that it is a question.

In [None]:
import librosa
import librosa.display
from matplotlib import pyplot as plt
import numpy as np
from nemo.collections.tts.parts.utils.helpers import regulate_len
%matplotlib inline

#Define what we want the model to say
input_string = "Hey, what is my pitch?" # Feel free to change it and experiment

# Run inference to get spectrogram and pitch
with torch.no_grad():
    spec, audio, durs_pred, pitch_pred = str_to_audio(input_string)

    # FastPitch predicts one pitch value per token. To plot it, we have to expand the token length to the spectrogram length
    pitch_pred, _ = regulate_len(durs_pred, pitch_pred.unsqueeze(-1))
    pitch_pred = pitch_pred.squeeze(-1)
    # Note we have to unnormalize the pitch with LJSpeech's mean and std
    pitch_pred = pitch_pred * 65.72037058703644 + 214.72202032404294

# Let's plot the predicted pitch and how it affects the predicted audio
fig, ax = plt.subplots(figsize=(12, 6))
spec = np.abs(librosa.stft(audio[0], n_fft=1024))
ax.plot(pitch_pred.cpu().numpy()[0], color='cyan', linewidth=1)
librosa.display.specshow(np.log(spec+1e-12), y_axis='log')
ipd.display(ipd.Audio(audio, rate=sr))

## Pitch Manipulation

Now that we have seen how the pitch affects the predicted spectrogram, we can adjust it to add some effects. We will explore four manipulations:

1) Pitch shift

2) Pitch flatten

3) Pitch inversion

4) Pitch amplification

### Pitch Shift
First, let's handle pitch shifting. To shift the pitch up or down by a given number of Hz, we can just add or subtract as needed. Let's shift the pitch down by 50 Hz and compare it to the previous example.

In [None]:
#Define what we want the model to say
input_string = "Hey, what is my pitch?" # Feel free to change it and experiment

# Run inference to get spectrogram and pitch
with torch.no_grad():
    spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)

    # Note we have to unnormalize the pitch with LJSpeech's mean and std
    pitch_shift = pitch_norm_pred * 65.72037058703644 + 214.72202032404294
    # Now let's pitch shift down by 50Hz
    pitch_shift = pitch_shift - 50
    # Now we have to renormalize it with the same mean and std
    pitch_shift = (pitch_shift - 214.72202032404294) / 65.72037058703644

    # Now we can pass it to the model
    spec_shift, audio_shift, durs_shift_pred, _ = str_to_audio(input_string, pitch=pitch_shift)
    # NOTE: We do not plot the pitch returned from str_to_audio.
    # When we override the pitch, we want to plot the pitch that we override the model with.
    # In this case, it is `pitch_shift`

# Let's see both results
print("The first unshifted sample")
display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)
print("The second shifted sample. This sample is much deeper than the previous.")
display_pitch(audio_shift, pitch_shift, durs=durs_shift_pred)
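
The unnormalize, shift, renormalize sequence in the cell above collapses to a single offset in normalized units, because the mean cancels out. The sketch below checks this on plain floats; both function names are hypothetical, and the constants are the LJSpeech statistics used in this notebook.

```python
import math

PITCH_MEAN = 214.72202032404294  # Hz, LJSpeech mean used in this notebook
PITCH_STD = 65.72037058703644    # Hz, LJSpeech std used in this notebook


def shift_via_hz(pitch_norm: float, shift_hz: float) -> float:
    """Unnormalize, shift in Hz, then renormalize (the three steps in the cell)."""
    pitch_hz = pitch_norm * PITCH_STD + PITCH_MEAN
    pitch_hz = pitch_hz + shift_hz
    return (pitch_hz - PITCH_MEAN) / PITCH_STD


def shift_direct(pitch_norm: float, shift_hz: float) -> float:
    """Equivalent shortcut: add the shift expressed in standard deviations."""
    return pitch_norm + shift_hz / PITCH_STD


for value in (-1.2, 0.0, 0.8):
    assert math.isclose(shift_via_hz(value, -50.0), shift_direct(value, -50.0), abs_tol=1e-9)

print(round(-50.0 / PITCH_STD, 2))  # the 50 Hz shift in normalized units
```

So shifting down by 50 Hz is the same as subtracting about 0.76 from the normalized pitch, which is the form used later in this notebook when the shift is given as a fraction of a standard deviation.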

### Pitch Flattening
Second, let's look at pitch flattening. To flatten the pitch, we just set it to 0, which in normalized units corresponds to the speaker's mean pitch. Let's run it and compare the results.

In [None]:
#Define what we want the model to say
input_string = "Hey, what is my pitch?" # Feel free to change it and experiment

# Run inference to get spectrogram and pitch
with torch.no_grad():
    spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)

    # Now let's set the pitch to 0
    pitch_flat = pitch_norm_pred * 0
    # Now we can pass it to the model
    spec_flat, audio_flat, durs_flat_pred, _ = str_to_audio(input_string, pitch=pitch_flat)

# Let's see both results
print("The first unaltered sample")
display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)
print("The second flattened sample. This sample is more monotone than the previous.")
display_pitch(audio_flat, pitch_flat, durs=durs_flat_pred)

### Pitch Inversion
Third, let's look at pitch inversion. To invert the pitch, we just multiply it by -1. Let's run it and compare the results.

In [None]:
#Define what we want the model to say
input_string = "Hey, what is my pitch?" # Feel free to change it and experiment

# Run inference to get spectrogram and pitch
with torch.no_grad():
    spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)

    # Now let's invert the pitch
    pitch_inv = pitch_norm_pred * -1
    # Now we can pass it to the model
    spec_inv, audio_inv, durs_inv_pred, _ = str_to_audio(input_string, pitch=pitch_inv)

# Let's see both results
print("The first unaltered sample")
display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)
print("The second inverted sample. This sample sounds less like a question and more like a statement.")
display_pitch(audio_inv, pitch_inv, durs=durs_inv_pred)

### Pitch Amplification
Lastly, let's look at pitch amplification. To amplify the pitch, we just multiply it by a constant greater than 1. Let's run it and compare the results.

In [None]:
#Define what we want the model to say
input_string = "Hey, what is my pitch?" # Feel free to change it and experiment

# Run inference to get spectrogram and pitch
with torch.no_grad():
    spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)

    # Now let's amplify the pitch
    pitch_amp = pitch_norm_pred * 1.5
    # Now we can pass it to the model
    spec_amp, audio_amp, durs_amp_pred, _ = str_to_audio(input_string, pitch=pitch_amp)

# Let's see both results
print("The first unaltered sample")
display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)
print("The second amplified sample.")
display_pitch(audio_amp, pitch_amp, durs=durs_amp_pred)
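
Flattening, inversion, and amplification are all the same operation with a different factor: scaling the normalized pitch around the speaker mean. The sketch below applies the three factors used above to a made-up normalized contour and prints the result in Hz. `scale_pitch` and `to_hz` are hypothetical helpers on plain lists; in the notebook, the same scaling is a multiplication on the pitch tensor.

```python
from typing import List

PITCH_MEAN = 214.72202032404294  # Hz, LJSpeech mean used in this notebook
PITCH_STD = 65.72037058703644    # Hz, LJSpeech std used in this notebook


def scale_pitch(pitch_norm: List[float], factor: float) -> List[float]:
    """Scale a normalized contour; 0 in normalized units is the speaker mean."""
    return [p * factor for p in pitch_norm]


def to_hz(pitch_norm: List[float]) -> List[float]:
    """Convert a normalized contour to Hz."""
    return [p * PITCH_STD + PITCH_MEAN for p in pitch_norm]


contour = [-1.0, 0.0, 0.5, 2.0]  # made-up contour that rises at the end
for name, factor in (("original", 1.0), ("flatten", 0.0), ("invert", -1.0), ("amplify", 1.5)):
    hz = to_hz(scale_pitch(contour, factor))
    print(f"{name:>8}", [round(v, 1) for v in hz])
```

With a factor of 0 every value sits at the speaker mean (about 214.7 Hz), with -1 the final rise becomes a fall, and with 1.5 the distance of every value from the mean grows by half.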

## Putting it all together

Now that we understand how to control the duration and pitch of TTS models, we can show how to adjust the voice to sound more solemn (slower speed + lower pitch), or more excited (higher speed + higher pitch).

In [None]:
#Define what we want the model to say
input_string = "I want to pass on my condolences for your loss."

# Run inference to get spectrogram and pitch
with torch.no_grad():
    spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)

    # Let's try to make the speech more solemn
    # Let's deamplify the pitch and shift the pitch down by 75% of 1 standard deviation
    pitch_sol = (pitch_norm_pred)*0.75-0.75
    # Fastpitch tends to raise the pitch before "loss" which sounds inappropriate. Let's just remove that pitch raise
    pitch_sol[0][-5] += 0.2
    # Now let's pass our new pitch to fastpitch with a 90% pacing to slow it down
    spec_sol, audio_sol, durs_sol_pred, _ = str_to_audio(input_string, pitch=pitch_sol, pace=0.9)
 
# Let's see both results
print("The first unaltered sample")
display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)
print("The second solemn sample")
display_pitch(audio_sol, pitch_sol, durs=durs_sol_pred)

In [None]:
#Define what we want the model to say
input_string = "Congratulations on your promotion."

# Run inference to get spectrogram and pitch
with torch.no_grad():
    spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)

    # Let's amplify the pitch to make it sound more animated
    # We also pitch shift up by 50% of 1 standard deviation
    pitch_excite = (pitch_norm_pred)*1.7+0.5
    # Now let's pass our new pitch to fastpitch with a 110% pacing to speed it up
    spec_excite, audio_excite, durs_excite_pred, _ = str_to_audio(input_string, pitch=pitch_excite, pace=1.1)
 
# Let's see both results
print("The first unaltered sample")
display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)
print("The second excited sample")
display_pitch(audio_excite, pitch_excite, durs=durs_excite_pred)
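
Both examples above boil down to three numbers: a pitch scale, a pitch shift in standard deviations, and a pace. One way to organize this, sketched below, is a small table of presets. The `Style` type, the `STYLES` table, and `apply_style` are hypothetical names for this illustration; the solemn and excited values are the ones used in the two cells above.

```python
from typing import Dict, List, NamedTuple


class Style(NamedTuple):
    scale: float  # multiplies the normalized pitch
    shift: float  # added afterwards, in standard deviations
    pace: float   # passed on as the pace parameter


STYLES: Dict[str, Style] = {
    "neutral": Style(scale=1.0, shift=0.0, pace=1.0),
    "solemn": Style(scale=0.75, shift=-0.75, pace=0.9),
    "excited": Style(scale=1.7, shift=0.5, pace=1.1),
}


def apply_style(pitch_norm: List[float], style: Style) -> List[float]:
    """Scale, then shift, a normalized pitch contour."""
    return [p * style.scale + style.shift for p in pitch_norm]


contour = [0.0, 1.0, -0.5]  # made-up normalized contour
for name, style in STYLES.items():
    print(name, [round(p, 3) for p in apply_style(contour, style)], "pace:", style.pace)
```

With the notebook's helper, the same idea would be `str_to_audio(text, pitch=pitch_norm_pred * style.scale + style.shift, pace=style.pace)`, since the pitch tensor supports the same arithmetic.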

## Other Models

This notebook lists other models that allow for control of speech rate and pitch. However, please note that not all models accept a `pace` or `pitch` parameter in their forward/generate_spectrogram functions. Users who are interested in adding this functionality can use this notebook as a guide on how to do so.

### Duration Control

Adding duration control is the simpler of the two: one simply needs to add the `regulate_len` function to the appropriate model.

### Pitch Control

Pitch control is more complicated. There are numerous design decisions that differ between models: 1) whether to normalize the pitch, 2) whether to predict pitch per spectrogram frame or per token, and more. While the basic transformations presented here (shift, flatten, invert, and amplify) can be done with all pitch-predicting models, where to add the pitch transformation will differ depending on the model.
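
To illustrate the per-frame versus per-token decision, the sketch below converts between the two representations using token durations. Both helpers are hypothetical and operate on plain lists with made-up values; a real model would do this on tensors, and may handle edge cases such as zero-length tokens differently.

```python
from typing import List, Sequence


def expand_to_frames(token_pitch: Sequence[float], durations: Sequence[int]) -> List[float]:
    """Per token -> per frame: repeat each token's pitch for all of its frames."""
    frames: List[float] = []
    for value, n_frames in zip(token_pitch, durations):
        frames.extend([value] * n_frames)
    return frames


def average_per_token(frame_pitch: Sequence[float], durations: Sequence[int]) -> List[float]:
    """Per frame -> per token: average the frames that belong to each token."""
    averaged: List[float] = []
    start = 0
    for n_frames in durations:
        chunk = frame_pitch[start:start + n_frames]
        # A token with no frames has nothing to average; use 0.0 as a placeholder.
        averaged.append(sum(chunk) / len(chunk) if chunk else 0.0)
        start += n_frames
    return averaged


durations = [2, 1, 3]  # made-up frames per token
frame_pitch = [200.0, 210.0, 180.0, 220.0, 230.0, 240.0]  # made-up per-frame pitch in Hz

token_pitch = average_per_token(frame_pitch, durations)
print(token_pitch)                               # [205.0, 180.0, 230.0]
print(expand_to_frames(token_pitch, durations))  # [205.0, 205.0, 180.0, 230.0, 230.0, 230.0]
```

A transformation applied per token, as in this notebook, changes whole tokens at once, while a per-frame model allows finer edits but needs the contour to stay aligned with the frame count.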

## References

[1] Ren et al., "FastSpeech: Fast, Robust and Controllable Text to Speech", https://arxiv.org/abs/1905.09263