Muyan-TTS is a trainable TTS model designed for podcast applications and built within a $50,000 budget. It is pre-trained on over 100,000 hours of podcast audio data, enabling high-quality zero-shot TTS synthesis. Furthermore, Muyan-TTS supports speaker adaptation from dozens of minutes of target speech, making it highly customizable for individual voices.

Install

Clone & Install

git clone https://github.com/MYZY-AI/Muyan-TTS.git
cd Muyan-TTS

conda create -n muyan-tts python=3.10 -y
conda activate muyan-tts
make build

You need to install FFmpeg. On Ubuntu, you can install it with the following commands:

sudo apt update
sudo apt install ffmpeg

Additionally, you need to download the weights of chinese-hubert-base, along with the Muyan-TTS and Muyan-TTS-SFT model weights.

Place all the downloaded models in the pretrained_models directory. Your directory structure should look similar to the following:

pretrained_models
β”œβ”€β”€ chinese-hubert-base
β”œβ”€β”€ Muyan-TTS
└── Muyan-TTS-SFT
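
The weights can also be fetched programmatically. The following is a minimal sketch using huggingface_hub; the repository IDs are assumptions and may differ from where the weights are actually hosted:

# Hypothetical download script; verify the repo IDs before use.
from huggingface_hub import snapshot_download

for repo_id, local_dir in [
    ("TencentGameMate/chinese-hubert-base", "pretrained_models/chinese-hubert-base"),
    ("MYZY-AI/Muyan-TTS", "pretrained_models/Muyan-TTS"),
    ("MYZY-AI/Muyan-TTS-SFT", "pretrained_models/Muyan-TTS-SFT"),
]:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)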

Quickstart

python tts.py

This will synthesize speech through inference. The core code is as follows:

import asyncio

# Inference is defined in this repository; adjust the import if the
# repo layout differs.
from inference.inference import Inference

async def main(model_type, model_path):
    tts = Inference(model_type, model_path, enable_vllm_acc=False)
    wavs = await tts.generate(
        ref_wav_path="assets/Claire.wav",
        prompt_text="Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
        text="Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
    )
    output_path = "logs/tts.wav"
    with open(output_path, "wb") as f:
        f.write(next(wavs))  # generate() yields audio bytes; write the first clip
    print(f"Speech generated in {output_path}")

You need to specify the prompt speech, i.e. the ref_wav_path and its prompt_text, as well as the text to be synthesized. By default, the synthesized speech is saved to logs/tts.wav.

Additionally, you need to specify model_type as either base or sft, with the default being base.

When you set model_type to base, you can change the prompt speech to an arbitrary speaker's voice for zero-shot TTS synthesis.

When you set model_type to sft, keep the prompt speech unchanged, because the SFT model is trained on Claire's voice.
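
For example, a minimal entry point for the snippet above might look like the following sketch (the model paths are assumptions based on the directory layout shown earlier):

import asyncio

if __name__ == "__main__":
    # Zero-shot synthesis with the base model:
    asyncio.run(main("base", "pretrained_models/Muyan-TTS"))
    # Single-speaker synthesis with the SFT model (keep the Claire prompt):
    # asyncio.run(main("sft", "pretrained_models/Muyan-TTS-SFT"))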

API Usage

python api.py

Using the API mode automatically enables vLLM acceleration, and the above command will start a service on the default port 8020. Additionally, LLM logs will be saved in logs/llm.log.

You can send a request to the API using the example below:

import time

import requests

TTS_PORT = 8020
payload = {
    "ref_wav_path": "assets/Claire.wav",
    "prompt_text": "Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
    "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
}
start = time.time()

url = f"http://localhost:{TTS_PORT}/get_tts"
response = requests.post(url, json=payload)
response.raise_for_status()  # avoid writing an error body into the .wav file

audio_file_path = "logs/tts.wav"
with open(audio_file_path, "wb") as f:
    f.write(response.content)

print(time.time() - start)

By default, the synthesized speech will be saved at logs/tts.wav.

As with the quickstart, you need to specify model_type as either base or sft; the default is base.

Training

We use LibriSpeech as an example. You can use your own dataset instead, but you need to organize the data into the format shown in data_process/examples.

If you haven't downloaded LibriSpeech yet, you can download the dev-clean set using:

wget https://www.openslr.org/resources/12/dev-clean.tar.gz -P path/to/save

After downloading, specify the librispeech_dir in prepare_sft_dataset.py to match the download location. Then run ./train.sh, which will automatically process the data and generate data/tts_sft_data.json. We will use the first speaker from the LibriSpeech subset for fine-tuning. You can also specify a different speaker as needed in data_process/text_format_conversion.py.
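
The change amounts to something like this sketch (the exact variable lives in prepare_sft_dataset.py; the path below is a placeholder):

# In prepare_sft_dataset.py: point librispeech_dir at the download location
# used with wget above (adjust if the script expects the extracted
# LibriSpeech/ folder itself).
librispeech_dir = "path/to/save"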

Note that if an error occurs during the process, resolve the error, delete the existing contents of the data folder, and then rerun train.sh.

After generating data/tts_sft_data.json, train.sh will automatically copy it to llama-factory/data and add the following field to dataset_info.json:

"tts_sft_data": {
    "file_name": "tts_sft_data.json"
}

Finally, it will automatically execute the llamafactory-cli train command to start training. You can adjust training settings using training/sft.yaml. By default, the trained weights will be saved to pretrained_models/Muyan-TTS-new-SFT.

You can directly deploy your trained model using the API tool above. During inference, set model_type to sft and replace ref_wav_path and prompt_text with a sample of the speaker's voice you trained on.
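
For instance, after starting the service with python api.py, a request for the fine-tuned voice might look like this sketch; the reference file path and transcript below are hypothetical placeholders:

import requests

# Hypothetical reference sample from the speaker used for fine-tuning.
payload = {
    "ref_wav_path": "path/to/speaker_sample.wav",
    "prompt_text": "Exact transcript of the reference sample.",
    "text": "Text to synthesize in the fine-tuned voice."
}
response = requests.post("http://localhost:8020/get_tts", json=payload)
with open("logs/tts.wav", "wb") as f:
    f.write(response.content)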
