Muyan-TTS is a trainable TTS model designed for podcast applications within a $50,000 budget. It is pre-trained on over 100,000 hours of podcast audio data, enabling zero-shot TTS synthesis with high-quality voice generation. Muyan-TTS also supports speaker adaptation with dozens of minutes of target speech, making it highly customizable for individual voices.
## Install

### Clone & Install
```sh
git clone https://github.com/MYZY-AI/Muyan-TTS.git
cd Muyan-TTS
conda create -n muyan-tts python=3.10 -y
conda activate muyan-tts
make build
```
You also need to install FFmpeg. If you're using Ubuntu, you can install it with the following command:

```sh
sudo apt update
sudo apt install ffmpeg
```
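To confirm FFmpeg is actually discoverable on your `PATH`, a quick standard-library check:

```python
import shutil

# Prints the ffmpeg binary's location, or a warning if it is not installed.
print(shutil.which("ffmpeg") or "ffmpeg not found on PATH")
```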
Additionally, you need to download the weights of chinese-hubert-base, along with the Muyan-TTS and Muyan-TTS-SFT models. Place all the downloaded models in the `pretrained_models` directory. Your directory structure should look similar to the following:
```
pretrained_models
├── chinese-hubert-base
├── Muyan-TTS
└── Muyan-TTS-SFT
```
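One convenient way to fetch the weights is `snapshot_download` from `huggingface_hub` (a sketch; the repo IDs below are assumptions inferred from the model names, so verify them on the Hub):

```python
from huggingface_hub import snapshot_download

# Assumed repo IDs -- double-check each one on the Hugging Face Hub.
repos = ["MYZY-AI/Muyan-TTS", "MYZY-AI/Muyan-TTS-SFT", "TencentGameMate/chinese-hubert-base"]
for repo_id in repos:
    # Mirror each repo into pretrained_models/<model-name>.
    snapshot_download(repo_id=repo_id, local_dir=f"pretrained_models/{repo_id.split('/')[-1]}")
```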
## Quickstart

```sh
python tts.py
```
This will synthesize speech through inference. The core code is as follows:
```python
async def main(model_type, model_path):
    tts = Inference(model_type, model_path, enable_vllm_acc=False)
    wavs = await tts.generate(
        ref_wav_path="assets/Claire.wav",
        prompt_text="Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
        text="Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
    )
    output_path = "logs/tts.wav"
    with open(output_path, "wb") as f:
        f.write(next(wavs))
    print(f"Speech generated in {output_path}")
```
You need to specify the prompt speech, including the `ref_wav_path` and its `prompt_text`, as well as the `text` to be synthesized. The synthesized speech is saved by default to `logs/tts.wav`. Additionally, you need to specify `model_type` as either `base` or `sft`, with the default being `base`.
When you set `model_type` to `base`, you can change the prompt speech to an arbitrary speaker for zero-shot TTS synthesis, as in the sketch below. When you set `model_type` to `sft`, you need to keep the prompt speech unchanged, because the `sft` model is trained on Claire's voice.
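For example, with `model_type="base"` you can point the same `generate` call at your own reference clip (the file path and transcript below are hypothetical placeholders):

```python
# Inside the async main() above; any speaker's clip works with the base model.
wavs = await tts.generate(
    ref_wav_path="assets/my_speaker.wav",  # hypothetical reference clip
    prompt_text="The exact transcript of my_speaker.wav.",  # hypothetical transcript
    text="Any sentence you want synthesized in that speaker's voice."
)
```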
## API Usage

```sh
python api.py
```
Using the API mode automatically enables vLLM acceleration, and the above command will start a service on the default port `8020`. Additionally, LLM logs will be saved in `logs/llm.log`.
You can send a request to the API using the example below:
```python
import time

import requests

TTS_PORT = 8020
payload = {
    "ref_wav_path": "assets/Claire.wav",
    "prompt_text": "Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
    "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
}
start = time.time()
url = f"http://localhost:{TTS_PORT}/get_tts"
response = requests.post(url, json=payload)
audio_file_path = "logs/tts.wav"
with open(audio_file_path, "wb") as f:
    f.write(response.content)
print(time.time() - start)
```
By default, the synthesized speech will be saved at `logs/tts.wav`.
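Since the endpoint returns raw audio bytes, a failed request would otherwise write an error body into `tts.wav`; a small optional guard using `requests`:

```python
# Raise immediately on HTTP errors instead of saving an error response as audio.
response.raise_for_status()
```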
As with the quickstart, you need to specify `model_type` as either `base` or `sft`, with the default being `base`.
## Training

We use LibriSpeech as an example. You can use your own dataset instead, but you need to organize the data into the format shown in `data_process/examples`.
If you haven't downloaded LibriSpeech yet, you can download the dev-clean set using:

```sh
wget https://www.openslr.org/resources/12/dev-clean.tar.gz -P path/to/save
```
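Once the download finishes, extract the archive in place (a standard-library sketch; LibriSpeech archives unpack into a `LibriSpeech/` subdirectory):

```python
import tarfile

# Unpack dev-clean.tar.gz into the same directory it was downloaded to.
with tarfile.open("path/to/save/dev-clean.tar.gz") as tar:
    tar.extractall("path/to/save")
```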
After downloading, set `librispeech_dir` in `prepare_sft_dataset.py` to match the download location. Then run `./train.sh`, which will automatically process the data and generate `data/tts_sft_data.json`. The first speaker of the LibriSpeech subset is used for fine-tuning; you can specify a different speaker as needed in `data_process/text_format_conversion.py`.
Note that if an error occurs during the process, resolve it, delete the existing contents of the `data` folder, and then rerun `train.sh`.
After generating `data/tts_sft_data.json`, `train.sh` will automatically copy it to `llama-factory/data` and add the following field to `dataset_info.json`:
"tts_sft_data": {
"file_name": "tts_sft_data.json"
}
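`train.sh` performs this registration automatically; purely for illustration, the equivalent manual edit could look like this (a sketch assuming `dataset_info.json` is a top-level JSON object, as in LLaMA-Factory's data directory):

```python
import json

# Register the generated dataset with LLaMA-Factory by hand
# (train.sh normally does this step for you).
path = "llama-factory/data/dataset_info.json"
with open(path) as f:
    info = json.load(f)
info["tts_sft_data"] = {"file_name": "tts_sft_data.json"}
with open(path, "w") as f:
    json.dump(info, f, indent=2)
```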
Finally, it will automatically execute the `llamafactory-cli train` command to start training. You can adjust the training settings in `training/sft.yaml`. By default, the trained weights will be saved to `pretrained_models/Muyan-TTS-new-SFT`.
You can directly deploy your trained model using the API tool above. During inference, specify `model_type` as `sft` and replace `ref_wav_path` and `prompt_text` with a sample of the voice of the speaker you trained on, as sketched below.
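For reference, a minimal sketch using the `Inference` class from the quickstart (the clip path and transcript are hypothetical placeholders for your own speaker's data; run inside an async function as before):

```python
# Load the fine-tuned weights and prompt with the training speaker's voice.
tts = Inference("sft", "pretrained_models/Muyan-TTS-new-SFT", enable_vllm_acc=False)
wavs = await tts.generate(
    ref_wav_path="path/to/trained_speaker.wav",  # hypothetical: a clip of the fine-tuned speaker
    prompt_text="The exact transcript of that clip.",  # hypothetical transcript
    text="Text to synthesize in the fine-tuned voice."
)
```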