Muyan-TTS is a trainable TTS model designed for podcast applications and built within a $50,000 budget. It is pre-trained on over 100,000 hours of podcast audio data, enabling high-quality zero-shot TTS synthesis. Furthermore, Muyan-TTS supports speaker adaptation from dozens of minutes of target speech, making it highly customizable for individual voices.

Install

Clone & Install

git clone https://github.com/MYZY-AI/Muyan-TTS.git
cd Muyan-TTS

conda create -n muyan-tts python=3.10 -y
conda activate muyan-tts
make build

You need to install FFmpeg. On Ubuntu, you can install it with the following commands:

sudo apt update
sudo apt install ffmpeg

Additionally, you need to download the weights of chinese-hubert-base, along with the Muyan-TTS and Muyan-TTS-SFT model weights.

Place all the downloaded models in the pretrained_models directory. Your directory structure should look similar to the following:

pretrained_models
β”œβ”€β”€ chinese-hubert-base
β”œβ”€β”€ Muyan-TTS
└── Muyan-TTS-SFT
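
The weights can also be fetched programmatically. The following is a minimal sketch using huggingface_hub; the repository IDs are assumptions and may differ from where the weights are actually hosted:

# Hypothetical download script; verify the repo IDs before use.
from huggingface_hub import snapshot_download

for repo_id, local_dir in [
    ("TencentGameMate/chinese-hubert-base", "pretrained_models/chinese-hubert-base"),
    ("MYZY-AI/Muyan-TTS", "pretrained_models/Muyan-TTS"),
    ("MYZY-AI/Muyan-TTS-SFT", "pretrained_models/Muyan-TTS-SFT"),
]:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)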

Quickstart

python tts.py

This will synthesize speech through inference. The core code is as follows:

import asyncio

# Inference is defined in this repository; adjust the import if the
# repo layout differs.
from inference.inference import Inference

async def main(model_type, model_path):
    tts = Inference(model_type, model_path, enable_vllm_acc=False)
    wavs = await tts.generate(
        ref_wav_path="assets/Claire.wav",
        prompt_text="Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
        text="Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
    )
    output_path = "logs/tts.wav"
    with open(output_path, "wb") as f:
        f.write(next(wavs))  # generate() yields audio bytes; write the first clip
    print(f"Speech generated in {output_path}")

You need to specify the prompt speech, i.e. the ref_wav_path and its prompt_text, as well as the text to be synthesized. By default, the synthesized speech is saved to logs/tts.wav.

Additionally, you need to specify model_type as either base or sft, with the default being base.

When you set model_type to base, you can change the prompt speech to an arbitrary speaker's voice for zero-shot TTS synthesis.

When you set model_type to sft, keep the prompt speech unchanged, because the SFT model is trained on Claire's voice.
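
For example, a minimal entry point for the snippet above might look like the following sketch (the model paths are assumptions based on the directory layout shown earlier):

import asyncio

if __name__ == "__main__":
    # Zero-shot synthesis with the base model:
    asyncio.run(main("base", "pretrained_models/Muyan-TTS"))
    # Single-speaker synthesis with the SFT model (keep the Claire prompt):
    # asyncio.run(main("sft", "pretrained_models/Muyan-TTS-SFT"))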

API Usage

python api.py

Using the API mode automatically enables vLLM acceleration, and the above command will start a service on the default port 8020. Additionally, LLM logs will be saved in logs/llm.log.

You can send a request to the API using the example below:

import time

import requests

TTS_PORT = 8020
payload = {
    "ref_wav_path": "assets/Claire.wav",
    "prompt_text": "Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
    "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
}
start = time.time()

url = f"http://localhost:{TTS_PORT}/get_tts"
response = requests.post(url, json=payload)
response.raise_for_status()  # avoid writing an error body into the .wav file

audio_file_path = "logs/tts.wav"
with open(audio_file_path, "wb") as f:
    f.write(response.content)

print(time.time() - start)

By default, the synthesized speech will be saved at logs/tts.wav.

As with the quickstart, you need to specify model_type as either base or sft; the default is base.

Training

We use LibriSpeech as an example. You can use your own dataset instead, but you need to organize the data into the format shown in data_process/examples.

If you haven't downloaded LibriSpeech yet, you can download the dev-clean set using:

wget https://www.openslr.org/resources/12/dev-clean.tar.gz -P path/to/save

After downloading, specify the librispeech_dir in prepare_sft_dataset.py to match the download location. Then run ./train.sh, which will automatically process the data and generate data/tts_sft_data.json. We will use the first speaker from the LibriSpeech subset for fine-tuning. You can also specify a different speaker as needed in data_process/text_format_conversion.py.
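
The change amounts to something like this sketch (the exact variable lives in prepare_sft_dataset.py; the path below is a placeholder):

# In prepare_sft_dataset.py: point librispeech_dir at the download location
# used with wget above (adjust if the script expects the extracted
# LibriSpeech/ folder itself).
librispeech_dir = "path/to/save"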

Note that if an error occurs during the process, resolve the error, delete the existing contents of the data folder, and then rerun train.sh.

After generating data/tts_sft_data.json, train.sh will automatically copy it to llama-factory/data and add the following field to dataset_info.json:

"tts_sft_data": {
    "file_name": "tts_sft_data.json"
}

Finally, it will automatically execute the llamafactory-cli train command to start training. You can adjust training settings using training/sft.yaml. By default, the trained weights will be saved to pretrained_models/Muyan-TTS-new-SFT.

You can directly deploy your trained model using the API tool above. During inference, set model_type to sft and replace ref_wav_path and prompt_text with a sample of the speaker's voice you trained on.
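
For instance, after starting the service with python api.py, a request for the fine-tuned voice might look like this sketch; the reference file path and transcript below are hypothetical placeholders:

import requests

# Hypothetical reference sample from the speaker used for fine-tuning.
payload = {
    "ref_wav_path": "path/to/speaker_sample.wav",
    "prompt_text": "Exact transcript of the reference sample.",
    "text": "Text to synthesize in the fine-tuned voice."
}
response = requests.post("http://localhost:8020/get_tts", json=payload)
with open("logs/tts.wav", "wb") as f:
    f.write(response.content)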
