---
title: Moroccan Darija TTS Demo
emoji: 🎙️
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
---
# Moroccan Darija Text-to-Speech Model
This project implements a Text-to-Speech (TTS) system for Moroccan Darija using the SpeechT5 architecture. The model is fine-tuned on the DODa audio dataset to generate natural-sounding Darija speech from text input.
## Table of Contents
- [Dataset Overview](#dataset-overview)
- [Installation](#installation)
- [Model Training](#model-training)
- [Inference](#inference)
- [Gradio Demo](#gradio-demo)
- [Project Features](#project-features)
- [Potential Applications](#potential-applications)
- [Limitations and Future Work](#limitations-and-future-work)
- [License](#license)
- [Acknowledgments](#acknowledgments)
## Dataset Overview
The **DODa audio dataset** contains 12,743 sentences recorded by 7 contributors (4 female, 3 male). Key characteristics:
- Audio recordings standardized at a 16 kHz sample rate
- Multiple text representations (Latin script, Arabic script, and English translations)
- High-quality recordings with manual corrections
### Dataset Structure
| Column Name | Description |
|-------------|-------------|
| **audio** | Speech recordings for Darija sentences |
| **darija_Ltn** | Darija sentences using Latin letters |
| **darija_Arab_new** | Corrected Darija sentences using Arabic script |
| **english** | English translation of Darija sentences |
| **darija_Arab_old** | Original (uncorrected) Darija sentences in Arabic script |
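
As a rough sketch of how these columns might be consumed, the snippet below loads the dataset from the Hub and reduces a row to a text/audio pair. The dataset ID comes from the link in the Acknowledgments; the `train` split name, the choice of `darija_Ltn` as the TTS input, and the helper names `load_doda` and `to_training_pair` are assumptions made for illustration, not taken from the training notebook.

```python
# Column names as listed in the table above.
COLUMNS = ["audio", "darija_Ltn", "darija_Arab_new", "english", "darija_Arab_old"]

DATASET_ID = "atlasia/DODa-audio-dataset"
TEXT_COLUMN = "darija_Ltn"  # assumption: Latin-script text is the TTS input


def load_doda(split="train"):
    """Load the dataset and make sure audio is decoded at 16 kHz.

    Assumption: the split is called "train". Requires the `datasets`
    package and network access, so the import is kept local.
    """
    from datasets import Audio, load_dataset

    ds = load_dataset(DATASET_ID, split=split)
    return ds.cast_column("audio", Audio(sampling_rate=16000))


def to_training_pair(row):
    """Keep only the fields a TTS fine-tuning step needs.

    Assumption: decoded audio is exposed as a mapping with "array" and
    "sampling_rate" keys.
    """
    audio = row["audio"]
    return {
        "text": row[TEXT_COLUMN],
        "audio": audio["array"],
        "sampling_rate": audio["sampling_rate"],
    }
```

Calling `to_training_pair(load_doda()[0])` would return the first sentence and its waveform. Switching `TEXT_COLUMN` to `darija_Arab_new` would select the corrected Arabic-script text instead.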
### Speaker Distribution
Sample indices map to the 7 speakers as follows. Ranges are inclusive, and a speaker can appear in more than one range:
```
Samples 0-999 -> Female 1
Samples 1000-1999 -> Male 3
Samples 2000-2730 -> Female 2
Samples 2731-2800 -> Male 1
Samples 2801-3999 -> Male 2
Samples 4000-4999 -> Male 1
Samples 5000-5999 -> Female 3
Samples 6000-6999 -> Male 1
Samples 7000-7999 -> Female 4
Samples 8000-8999 -> Female 1
Samples 9000-9999 -> Male 2
Samples 10000-11999 -> Male 1
Samples 12000-12350 -> Male 2
Samples 12351-12742 -> Male 1
```
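
The ranges above can be turned into a small lookup, which is handy when assigning a speaker label or embedding to each sample. This is a minimal sketch: the boundaries are copied from the listing, while the names `speaker_for_index` and `samples_per_speaker` are hypothetical helpers rather than part of the project code.

```python
from bisect import bisect_right
from collections import Counter

# (first sample index, speaker label) for each contiguous range listed above.
SPEAKER_RANGES = [
    (0, "Female 1"),
    (1000, "Male 3"),
    (2000, "Female 2"),
    (2731, "Male 1"),
    (2801, "Male 2"),
    (4000, "Male 1"),
    (5000, "Female 3"),
    (6000, "Male 1"),
    (7000, "Female 4"),
    (8000, "Female 1"),
    (9000, "Male 2"),
    (10000, "Male 1"),
    (12000, "Male 2"),
    (12351, "Male 1"),
]
NUM_SAMPLES = 12743  # the last listed index is 12742

_STARTS = [start for start, _ in SPEAKER_RANGES]


def speaker_for_index(index: int) -> str:
    """Return the speaker label for a sample index via binary search."""
    if not 0 <= index < NUM_SAMPLES:
        raise IndexError(f"sample index {index} is outside 0..{NUM_SAMPLES - 1}")
    return SPEAKER_RANGES[bisect_right(_STARTS, index) - 1][1]


def samples_per_speaker() -> Counter:
    """Count how many samples each speaker contributes across all ranges."""
    counts = Counter()
    ends = _STARTS[1:] + [NUM_SAMPLES]
    for (start, label), end in zip(SPEAKER_RANGES, ends):
        counts[label] += end - start
    return counts
```

Summing the ranges this way shows the distribution is uneven: Male 1 contributes 4,462 samples while Female 2 contributes 731, which is worth keeping in mind when choosing a reference speaker.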
## Installation
To set up the project environment:
```bash
# Clone the repository
git clone https://github.com/yourusername/darija-tts.git
cd darija-tts
# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
## Model Training
The model training process involves:
1. **Data Loading**: Loading the DODa audio dataset from Hugging Face
2. **Data Preprocessing**: Normalizing text and extracting speaker embeddings
3. **Model Setup**: Configuring the SpeechT5 model for Darija TTS
4. **Training**: Fine-tuning the model using the prepared dataset
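
The text-normalization part of step 2 is not spelled out here, so the following is only a sketch of what it could look like for Latin-script Darija. The rules (lowercasing, stripping accents, keeping digits, collapsing whitespace) and the function name `normalize_darija_latin` are assumptions; the notebook may normalize differently.

```python
import re
import unicodedata

# Digits are kept because Latin-script Darija commonly uses them as letters
# (for example "3", "7" and "9" stand in for Arabic consonants).
_DISALLOWED = re.compile(r"[^a-z0-9' .,?!-]")
_WHITESPACE = re.compile(r"\s+")


def normalize_darija_latin(text: str) -> str:
    """Lowercase, strip accents and unsupported symbols, collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Replace anything outside the allowed set with a space.
    text = _DISALLOWED.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()
```

Whatever rules are used, the same function should run at training time and at inference time so the tokenizer sees consistent input.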
To run the training:
```bash
# Open the Jupyter notebook
jupyter notebook notebooks/train_darija_tts.ipynb
```
Key training parameters:
- Learning rate: 1e-4
- Batch size: 4, with 8 gradient accumulation steps (effective batch size of 32 per device)
- Training steps: 1000
- Evaluation frequency: Every 100 steps
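
To make these numbers concrete, the sketch below works out what they imply. It assumes a single device, that "training steps" means optimizer steps, and that all 12,743 sentences are used for training; a held-out evaluation split would raise the epoch count slightly. The `TrainingBudget` class is illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingBudget:
    """Training parameters from the list above, plus derived quantities."""

    learning_rate: float = 1e-4
    per_device_batch_size: int = 4
    gradient_accumulation_steps: int = 8
    max_steps: int = 1000
    eval_every: int = 100

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step on one device.
        return self.per_device_batch_size * self.gradient_accumulation_steps

    @property
    def samples_seen(self) -> int:
        return self.effective_batch_size * self.max_steps

    @property
    def num_evaluations(self) -> int:
        return self.max_steps // self.eval_every

    def approx_epochs(self, train_set_size: int) -> float:
        return self.samples_seen / train_set_size


budget = TrainingBudget()
print(budget.effective_batch_size)             # 32
print(budget.samples_seen)                     # 32000
print(budget.num_evaluations)                  # 10
print(round(budget.approx_epochs(12743), 2))   # 2.51
```

Under these assumptions the run covers roughly two and a half passes over the data, with ten evaluation checkpoints along the way.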
## Inference
To generate speech from text after training:
```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf
# Load models
model_path = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(model_path)
model = SpeechT5ForTextToSpeech.from_pretrained(model_path)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
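# Load speaker embedding; generate_speech expects shape (1, speaker_embedding_dim),
# which is 512 by default. Call .unsqueeze(0) first if the tensor was saved as 1-D.
speaker_embedding = torch.load("./data/speaker_embeddings/female_embedding.pt")
# Tokenize the input text (apply the same normalization used during training first)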
text = "Salam, kifach nta lyoum?"
inputs = processor(text=text, return_tensors="pt")
# Generate speech
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
# Save audio file
sf.write("output.wav", speech.numpy(), 16000)
```
## Gradio Demo
The project includes a Gradio demo that provides a user-friendly interface for text-to-speech conversion:
```bash
# Run the demo locally
cd demo
python app.py
```
The demo features:
- Text input field for Darija text (Latin script)
- Voice selection (male/female)
- Speech speed adjustment
- Audio playback of generated speech
### Deploying to Hugging Face Spaces
To deploy the demo to Hugging Face Spaces:
1. Push your model to the Hugging Face Hub
2. Create a new Space with the Gradio SDK
3. Upload the demo files to the Space
See the notebook for detailed deployment instructions.
## Project Features
- **Multi-Speaker TTS**: Generate speech in both male and female voices
- **Voice Cloning**: Uses speaker embeddings to preserve each speaker's voice characteristics
- **Speed Control**: Adjust the speech rate as needed
- **Text Normalization**: Normalizes input text before tokenization to handle varied input formatting
## Potential Applications
- **Voice Assistants**: Build voice assistants that speak Moroccan Darija
- **Accessibility Tools**: Create tools for people with visual impairments
- **Language Learning**: Develop applications for learning Darija pronunciation
- **Content Creation**: Generate voiceovers for videos or audio content
- **Public Announcements**: Create automated announcement systems in Darija
## Limitations and Future Work
Current limitations:
- The model may struggle with code-switching between Darija and other languages
- Pronunciation of certain loanwords might be inconsistent
- Limited emotional range in the generated speech
Future improvements:
- Fine-tune with more diverse speech data
- Implement emotion control for expressive speech
- Add support for Arabic script input
- Develop a multilingual version supporting Darija, Arabic, and French
## License
This project is released under the MIT License. The DODa audio dataset is also available under the MIT License.
## Acknowledgments
- The [DODa audio dataset](https://huggingface.co/datasets/atlasia/DODa-audio-dataset) creators
- Hugging Face for the Transformers library and model hosting
- Microsoft Research for the SpeechT5 model architecture