metadata

title: Tortoise TTS API
emoji: 🦀
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.23.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Text-to-speech using Gradio, FastAPI, and TorToise TTS
tags:
  - tortoise-tts
  - text-to-speech
  - voice-cloning
  - gradio
  - fastapi

Tortoise TTS with Voice Cloning

A powerful text-to-speech application with voice cloning capabilities, powered by Tortoise-TTS.

Description

This application generates high-quality, natural-sounding speech from text. You can choose the voice in one of three ways:

Uploading your own voice sample for cloning
Recording your voice directly in the browser
Selecting from a variety of preset voices

The app uses Tortoise-TTS, a high-quality text-to-speech model, and runs on Hugging Face Spaces with ZeroGPU, which allocates GPU time on demand.

How to Use

Web Interface

1. Enter the text you want to convert to speech.
2. Choose one of the following voice options:
   - Upload a voice sample audio file (WAV format recommended)
   - Record your voice using your microphone
   - Select a preset voice from the dropdown menu
3. Click "Generate Speech".
4. Listen to or download the generated audio.

API Endpoints

The app also provides REST API endpoints for programmatic access:

Voice File TTS - /api/tts_with_voice_file/
- POST request with:
  - text: Text to convert to speech (required)
  - voice_file: Audio file for voice cloning (optional)
  - preset_voice: Name of preset voice (optional, defaults to "random")
Preset Voice TTS - /api/tts_with_preset/
- POST request with:
  - text: Text to convert to speech (required)
  - preset_voice: Name of preset voice (required)

Python Example

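import requests

# Using a preset voice; replace the placeholder host with your Space's URL
response = requests.post(
    "https://your-space-name.hf.space/api/tts_with_preset/",
    data={"text": "Hello, this is a test.", "preset_voice": "tom"},
    timeout=300,  # generation can take 30-60 seconds or longer
)
response.raise_for_status()  # fail early instead of saving an error body as audio

# Save the audio file
with open("output.wav", "wb") as f:
    f.write(response.content)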
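
The voice-file endpoint can be called in a similar way. The following is a minimal sketch, not the app's reference client. It assumes the endpoint accepts form fields named exactly as listed above (text, voice_file, preset_voice) and returns the audio bytes in the response body, as the preset example does. The helper names build_voice_file_request and synthesize_with_voice_file are hypothetical, and the host is the same placeholder used above.

```python
from pathlib import Path

# Placeholder host, as in the preset example; replace with your Space's URL.
BASE_URL = "https://your-space-name.hf.space"


def build_voice_file_request(text, preset_voice=None, base_url=BASE_URL):
    """Return (url, form_fields) for the /api/tts_with_voice_file/ endpoint.

    Hypothetical helper: field names follow the endpoint description above.
    """
    if not text.strip():
        raise ValueError("text is required")
    url = base_url.rstrip("/") + "/api/tts_with_voice_file/"
    data = {"text": text}
    if preset_voice is not None:
        # Optional; the endpoint description says it defaults to "random".
        data["preset_voice"] = preset_voice
    return url, data


def synthesize_with_voice_file(text, voice_path=None, out_path="output.wav", timeout=300):
    """Send the request and save the response body to out_path."""
    import requests  # third-party; imported here so the builder above has no dependencies

    url, data = build_voice_file_request(text)
    if voice_path is None:
        response = requests.post(url, data=data, timeout=timeout)
    else:
        with open(voice_path, "rb") as voice:
            # Assumption: the sample is uploaded as multipart form data.
            response = requests.post(
                url, data=data, files={"voice_file": voice}, timeout=timeout
            )
    response.raise_for_status()
    Path(out_path).write_bytes(response.content)
    return Path(out_path)
```

For example, synthesize_with_voice_file("Hello, this is a test.", voice_path="my_voice.wav") would upload my_voice.wav (a hypothetical local file) and write the result to output.wav.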

Technical Details

This app leverages:

Tortoise-TTS: State-of-the-art text-to-speech model
Gradio: For the intuitive user interface
FastAPI: For the API endpoints
ZeroGPU: For on-demand GPU allocation on Hugging Face Spaces

Limitations

Speech generation may take 30-60 seconds or longer, depending on text length
Voice cloning quality depends on the clarity and length of the provided sample
For best results, provide voice samples with clear speech and minimal background noise
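
Since sample length affects cloning quality, it can help to check a sample's duration before uploading it. The sketch below uses only Python's standard wave module, so it works for uncompressed PCM WAV files only, and it does not measure clarity or background noise. The function names and the 5 and 30 second thresholds are illustrative assumptions, not limits enforced by the app.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_voice_sample(path, min_seconds=5.0, max_seconds=30.0):
    """Return a list of warnings; an empty list means the duration looks fine.

    The default thresholds are assumptions chosen for illustration.
    """
    warnings = []
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        warnings.append(
            f"Sample is short ({duration:.1f} s); consider at least {min_seconds:.0f} s."
        )
    elif duration > max_seconds:
        warnings.append(
            f"Sample is long ({duration:.1f} s); consider trimming to {max_seconds:.0f} s."
        )
    return warnings
```

Calling check_voice_sample("my_voice.wav") on a hypothetical local file returns any warnings as plain strings, which can be printed before the upload.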

Credits

This project uses the Tortoise-TTS model. If you use this app in your work, please consider citing:

@misc{tortoise-tts,
  author = {James Betker},
  title = {Tortoise-TTS: A Multi-Voice TTS System},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/neonbjb/tortoise-tts}}
}

License

This project is available under the Apache-2.0 License.