{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TTS Inference Prosody Control\n",
"\n",
"This notebook is intended to teach users how to control duration and pitch with the FastPitch model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# License\n",
"\n",
"> Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n",
">\n",
"> Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"> you may not use this file except in compliance with the License.\n",
"> You may obtain a copy of the License at\n",
">\n",
"> http://www.apache.org/licenses/LICENSE-2.0\n",
">\n",
"> Unless required by applicable law or agreed to in writing, software\n",
"> distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"> See the License for the specific language governing permissions and\n",
"> limitations under the License."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"\"\"\"\n",
"You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
"Instructions for setting up Colab are as follows:\n",
"1. Open a new Python 3 notebook.\n",
"2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
"3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
"4. Run this cell to set up dependencies.\n",
"\"\"\"\n",
"BRANCH = 'r1.17.0'\n",
"# # If you're using Google Colab and not running locally, uncomment and run this cell.\n",
"# !apt-get install sox libsndfile1 ffmpeg\n",
"# !pip install wget text-unidecode\n",
"# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Please run the below cell to import libraries used in this notebook. This cell will load the fastpitch model and hifigan models used in the rest of the notebook. Lastly, two helper functions are defined. One is used for inference while the other is used to plot the inference results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import all libraries\n",
"import IPython.display as ipd\n",
"import librosa\n",
"import librosa.display\n",
"import numpy as np\n",
"import torch\n",
"from matplotlib import pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"# Reduce logging messages for this notebook\n",
"from nemo.utils import logging\n",
"logging.setLevel(logging.ERROR)\n",
"\n",
"from nemo.collections.tts.models import FastPitchModel\n",
"from nemo.collections.tts.models import HifiGanModel\n",
"from nemo.collections.tts.parts.utils.helpers import regulate_len\n",
"\n",
"# Load the models from NGC\n",
"fastpitch = FastPitchModel.from_pretrained(\"tts_en_fastpitch\").eval().cuda()\n",
"hifigan = HifiGanModel.from_pretrained(\"tts_en_hifigan\").eval().cuda()\n",
"sr = 22050\n",
"\n",
"# Define some helper functions\n",
"# Define a helper function to go from string to audio\n",
"def str_to_audio(inp, pace=1.0, durs=None, pitch=None):\n",
" with torch.no_grad():\n",
" tokens = fastpitch.parse(inp)\n",
" spec, _, durs_pred, _, pitch_pred, *_ = fastpitch(text=tokens, durs=durs, pitch=pitch, speaker=None, pace=pace)\n",
" audio = hifigan.convert_spectrogram_to_audio(spec=spec).to('cpu').numpy()\n",
" return spec, audio, durs_pred, pitch_pred\n",
"\n",
"# Define a helper function to plot spectrograms with pitch and display the audio\n",
"def display_pitch(audio, pitch, sr=22050, durs=None):\n",
" fig, ax = plt.subplots(figsize=(12, 6))\n",
" spec = np.abs(librosa.stft(audio[0], n_fft=1024))\n",
" # Check to see if pitch has been unnormalized\n",
" if torch.abs(torch.mean(pitch)) <= 1.0:\n",
" # Unnormalize the pitch with LJSpeech's mean and std\n",
" pitch = pitch * 65.72037058703644 + 214.72202032404294\n",
" # Check to see if pitch has been expanded to the spec length yet\n",
" if len(pitch) != spec.shape[0] and durs is not None:\n",
" pitch = regulate_len(durs, pitch.unsqueeze(-1))[0].squeeze(-1)\n",
" # Plot and display audio, spectrogram, and pitch\n",
" ax.plot(pitch.cpu().numpy()[0], color='cyan', linewidth=1)\n",
" librosa.display.specshow(np.log(spec+1e-12), y_axis='log')\n",
" ipd.display(ipd.Audio(audio, rate=sr))\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Duration Control\n",
"\n",
"This section is applicable to models that use a duration predictor module. This module is called the Length Regulator and was introduced in FastSpeech [1]. A list of NeMo models that support duration predictors are as follows:\n",
"\n",
"- [FastPitch](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastpitch)\n",
"- [FastPitch_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastpitchhifigan)\n",
"- [FastSpeech2](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastspeech_2)\n",
"- [FastSpeech2_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastspeech2hifigan)\n",
"- [TalkNet](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_talknet)\n",
"- [Glow-TTS](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_glowtts)\n",
"\n",
"While each model has their own implementation of this duration predictor, all of them follow a simple convolutional architecture. The input is the encoded tokens, and the output of the module is a value that represents how many frames in the decoder correspond to each token. It is essentially a hard attention mechanism.\n",
"\n",
"Since each model outputs a duration value per token, it is simple to slow down or increase the speech rate by increasing or decreasing these values. Consider the following:\n",
"\n",
"```python\n",
"def regulate_len(durations, pace=1.0):\n",
" durations = durations.float() / pace\n",
" # The output from the duration predictor module is still a float\n",
" # If we want the speech to be faster, we can increase the pace and make each token duration shorter\n",
" # Alternatively we can slow down the pace by decreasing the pace parameter\n",
" return durations.long() # Lastly, we need to make the durations integers for subsequent processes\n",
"```\n",
"\n",
"Let's try this out with FastPitch"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Define what we want the model to say\n",
"input_string = \"Hey, I am speaking at different paces!\" # Feel free to change it and experiment\n",
"\n",
"# Let's run fastpitch normally\n",
"_, audio, *_ = str_to_audio(input_string)\n",
"print(f\"This is fastpitch speaking at the regular pace of 1.0. This example is {len(audio[0])/sr:.3f} seconds long.\")\n",
"ipd.display(ipd.Audio(audio, rate=sr))\n",
"\n",
"# We can speed up the speech by adjusting the pace\n",
"_, audio, *_ = str_to_audio(input_string, pace=1.2)\n",
"print(f\"This is fastpitch speaking at the faster pace of 1.2. This example is {len(audio[0])/sr:.3f} seconds long.\")\n",
"ipd.display(ipd.Audio(audio, rate=sr))\n",
"\n",
"# We can slow down the speech by adjusting the pace\n",
"_, audio, *_ = str_to_audio(input_string, pace=0.75)\n",
"print(f\"This is fastpitch speaking at the slower pace of 0.75. This example is {len(audio[0])/sr:.3f} seconds long.\")\n",
"ipd.display(ipd.Audio(audio, rate=sr))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pitch Control\n",
"\n",
"The newer spectrogram generator models predict the pitch for certain words. Since these models predict pitch, we can adjust the predicted pitch in a similar manner to the predicted durations like in the previous section. A list of NeMo models that support pitch control are as follows:\n",
"\n",
"- [FastPitch](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastpitch)\n",
"- [FastPitch_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastpitchhifigan)\n",
"- [FastSpeech2](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_fastspeech_2)\n",
"- [FastSpeech2_HifiGan_E2E](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_e2e_fastspeech2hifigan)\n",
"- [TalkNet](https://ngc.nvidia.com/catalog/models/nvidia:nemo:tts_en_talknet)\n",
"\n",
"### FastPitch\n",
"\n",
"As with the previous tutorial, we will focus on FastPitch. FastPitch differs from some other models as it predicts a pitch difference to a normalized (mean 0, std 1) speaker pitch. Other models will just predict the unnormalized pitch. Looking at a simplified version of the FastPitch model, we see\n",
"\n",
"```python\n",
"# Predict pitch\n",
"pitch_predicted = self.pitch_predictor(enc_out, enc_mask) # Predicts a pitch that is normalized with speaker statistics \n",
"pitch_emb = self.pitch_emb(pitch.unsqueeze(1)) # A simple 1D convolution to map the float pitch to a embedding pitch\n",
"\n",
"enc_out = enc_out + pitch_emb.transpose(1, 2) # We add the pitch to the encoder output\n",
"spec, *_ = self.decoder(input=len_regulated, seq_lens=dec_lens) # We send the sum to the decoder to get the spectrogram\n",
"```\n",
"\n",
"Let's see the `pitch_predicted` for a sample text. You can run the below cell. You should get an image that looks like the following for the input `Hey, what is my pitch?`:\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/tutorials/tts/images/fastpitch-pitch.png\">\n",
"\n",
"Notice that the last word `pitch` has an increase in pitch to stress that it is a question."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import librosa\n",
"import librosa.display\n",
"from matplotlib import pyplot as plt\n",
"import numpy as np\n",
"from nemo.collections.tts.parts.utils.helpers import regulate_len\n",
"%matplotlib inline\n",
"\n",
"#Define what we want the model to say\n",
"input_string = \"Hey, what is my pitch?\" # Feel free to change it and experiment\n",
"\n",
"# Run inference to get spectrogram and pitch\n",
"with torch.no_grad():\n",
" spec, audio, durs_pred, pitch_pred = str_to_audio(input_string)\n",
"\n",
" # FastPitch predicts one pitch value per token. To plot it, we have to expand the token length to the spectrogram length\n",
" pitch_pred, _ = regulate_len(durs_pred, pitch_pred.unsqueeze(-1))\n",
" pitch_pred = pitch_pred.squeeze(-1)\n",
" # Note we have to unnormalize the pitch with LJSpeech's mean and std\n",
" pitch_pred = pitch_pred * 65.72037058703644 + 214.72202032404294\n",
"\n",
"# Let's plot the predicted pitch and how it affects the predicted audio\n",
"fig, ax = plt.subplots(figsize=(12, 6))\n",
"spec = np.abs(librosa.stft(audio[0], n_fft=1024))\n",
"ax.plot(pitch_pred.cpu().numpy()[0], color='cyan', linewidth=1)\n",
"librosa.display.specshow(np.log(spec+1e-12), y_axis='log')\n",
"ipd.display(ipd.Audio(audio, rate=sr))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plot Control\n",
"\n",
"Now that we see how the pitch affects the predicted spectrogram, we can now adjust it to add some effects. We will explore 4 different manipulations:\n",
"\n",
"1) Pitch shift\n",
"\n",
"2) Pitch flatten\n",
"\n",
"3) Pitch inversion\n",
"\n",
"4) Pitch amplification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pitch Shift\n",
"First, let's handle pitch shifting. To shift the pitch up or down by some Hz, we can just add or subtract as needed. Let's shift the pitch down by 50 Hz and compare it to the previous example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Define what we want the model to say\n",
"input_string = \"Hey, what is my pitch?\" # Feel free to change it and experiment\n",
"\n",
"# Run inference to get spectrogram and pitch\n",
"with torch.no_grad():\n",
" spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)\n",
" \n",
" # Note we have to unnormalize the pitch with LJSpeech's mean and std\n",
" pitch_shift = pitch_norm_pred * 65.72037058703644 + 214.72202032404294\n",
" # Now let's pitch shift down by 50Hz\n",
" pitch_shift = pitch_shift - 50\n",
" # Now we have to renormalize it to be mean 0, std 1\n",
" pitch_shift = (pitch_shift - 214.72202032404294) / 65.72037058703644\n",
" \n",
" # Now we can pass it to the model\n",
" spec_shift, audio_shift, durs_shift_pred, _ = str_to_audio(input_string, pitch=pitch_shift)\n",
" # NOTE: We do not plot the pitch returned from str_to_audio.\n",
" # When we override the pitch, we want to plot the pitch that override the model with.\n",
" # In thise case, it is `pitch_shift`\n",
"\n",
"# Let's see both results\n",
"print(\"The first unshifted sample\")\n",
"display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
"print(\"The second shifted sample. This sample is much deeper than the previous.\")\n",
"display_pitch(audio_shift, pitch_shift, durs=durs_shift_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pitch Flattening\n",
"Second, let's look at pitch flattening. To flatten the pitch, we just set it to 0. Let's run it and compare the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Define what we want the model to say\n",
"input_string = \"Hey, what is my pitch?\" # Feel free to change it and experiment\n",
"\n",
"# Run inference to get spectrogram and pitch\n",
"with torch.no_grad():\n",
" spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)\n",
" \n",
" # Now let's set the pitch to 0\n",
" pitch_flat = pitch_norm_pred * 0\n",
" # Now we can pass it to the model\n",
" spec_flat, audio_flat, durs_flat_pred, _ = str_to_audio(input_string, pitch=pitch_flat)\n",
"\n",
"# Let's see both results\n",
"print(\"The first unaltered sample\")\n",
"display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
"print(\"The second flattened sample. This sample is more monotone than the previous.\")\n",
"display_pitch(audio_flat, pitch_flat, durs=durs_flat_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pitch Inversion\n",
"Third, let's look at pitch inversion. To invert the pitch, we just multiply it by -1. Let's run it and compare the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Define what we want the model to say\n",
"input_string = \"Hey, what is my pitch?\" # Feel free to change it and experiment\n",
"\n",
"# Run inference to get spectrogram and pitch\n",
"with torch.no_grad():\n",
" spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)\n",
" \n",
" # Now let's invert the pitch\n",
" pitch_inv = pitch_norm_pred * -1\n",
" # Now we can pass it to the model\n",
" spec_inv, audio_inv, durs_inv_pred, _ = str_to_audio(input_string, pitch=pitch_inv)\n",
"\n",
"# Let's see both results\n",
"print(\"The first unaltered sample\")\n",
"display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
"print(\"The second inverted sample. This sample sounds less like a question and more like a statement.\")\n",
"display_pitch(audio_inv, pitch_inv, durs=durs_inv_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pitch Amplify\n",
"Lastly, let's look at pitch amplifying. To amplify the pitch, we just multiply it by a positive constant. Let's run it and compare the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Define what we want the model to say\n",
"input_string = \"Hey, what is my pitch?\" # Feel free to change it and experiment\n",
"\n",
"# Run inference to get spectrogram and pitch\n",
"with torch.no_grad():\n",
" spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)\n",
" \n",
" # Now let's amplify the pitch\n",
" pitch_amp = pitch_norm_pred * 1.5\n",
" # Now we can pass it to the model\n",
" spec_amp, audio_amp, durs_amp_pred, _ = str_to_audio(input_string, pitch=pitch_amp)\n",
"\n",
"# Let's see both results\n",
"print(\"The first unaltered sample\")\n",
"display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
"print(\"The second amplified sample.\")\n",
"display_pitch(audio_amp, pitch_amp, durs=durs_amp_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Putting it all together\n",
"\n",
"Now that we understand how to control the duration and pitch of TTS models, we can show how to adjust the voice to sound more solemn (slower speed + lower pitch), or more excited (higher speed + higher pitch)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Define what we want the model to say\n",
"input_string = \"I want to pass on my condolences for your loss.\"\n",
"\n",
"# Run inference to get spectrogram and pitch\n",
"with torch.no_grad():\n",
" spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)\n",
" \n",
" # Let's try to make the speech more solemn\n",
" # Let's deamplify the pitch and shift the pitch down by 75% of 1 standard deviation\n",
" pitch_sol = (pitch_norm_pred)*0.75-0.75\n",
" # Fastpitch tends to raise the pitch before \"loss\" which sounds inappropriate. Let's just remove that pitch raise\n",
" pitch_sol[0][-5] += 0.2\n",
" # Now let's pass our new pitch to fastpitch with a 90% pacing to slow it down\n",
" spec_sol, audio_sol, durs_sol_pred, _ = str_to_audio(input_string, pitch=pitch_sol, pace=0.9)\n",
" \n",
"# Let's see both results\n",
"print(\"The first unaltered sample\")\n",
"display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
"print(\"The second solumn sample\")\n",
"display_pitch(audio_sol, pitch_sol, durs=durs_sol_pred)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Define what we want the model to say\n",
"input_string = \"Congratulations on your promotion.\"\n",
"\n",
"# Run inference to get spectrogram and pitch\n",
"with torch.no_grad():\n",
" spec_norm, audio_norm, durs_norm_pred, pitch_norm_pred = str_to_audio(input_string)\n",
" \n",
" # Let's amplify the pitch to make it sound more animated\n",
" # We also pitch shift up by 50% of 1 standard deviation\n",
" pitch_excite = (pitch_norm_pred)*1.7+0.5\n",
" # Now let's pass our new pitch to fastpitch with a 110% pacing to speed it up\n",
" spec_excite, audio_excite, durs_excite_pred, _ = str_to_audio(input_string, pitch=pitch_excite, pace=1.1)\n",
" \n",
"# Let's see both results\n",
"print(\"The first unaltered sample\")\n",
"display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
"print(\"The second excited sample\")\n",
"display_pitch(audio_excite, pitch_excite, durs=durs_excite_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Other Models\n",
"\n",
"This notebook lists other models that allow for control of speech rate and pitch. However, please note that not all models accept a `pace`, nor a `pitch` parameter as part of their forward/generate_spectrogram functions. Users who are interested in adding this functionality can use this notebook as a guide on how to do so.\n",
"\n",
"### Duration Control\n",
"\n",
"Adding duration control is the simpler of the two and one simply needs to add the `regulate_lens` function to the appropriate model for duration control.\n",
"\n",
"### Pitch Control\n",
"\n",
"Pitch control is more complicated. There are numerous design decisions that differ between models: 1) Whether to normalize the pitch, 2) Whether to predict pitch per spectrogram frame or per token, and more. While the basic transformations presented here (shift, flatten, invert, and amplify) can be done with all pitch predicting models, where to add this pitch transformation will differ depending on the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"[1] https://arxiv.org/abs/1905.09263"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}