---
title: Moroccan Darija TTS Demo
emoji: 🎙️
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
---

# Moroccan Darija Text-to-Speech Model

This project implements a Text-to-Speech (TTS) system for Moroccan Darija based on the SpeechT5 architecture, fine-tuned on the DODa audio dataset to generate natural-sounding Darija speech from text input.

## Table of Contents
- [Dataset Overview](#dataset-overview)
- [Installation](#installation)
- [Model Training](#model-training)
- [Inference](#inference)
- [Gradio Demo](#gradio-demo)
- [Project Features](#project-features)
- [Potential Applications](#potential-applications)
- [Limitations and Future Work](#limitations-and-future-work)
- [License](#license)

## Dataset Overview

The **DODa audio dataset** contains 12,743 sentences recorded by 7 contributors (4 females, 3 males). Key characteristics:

- Audio recordings standardized at 16kHz sample rate
- Multiple text representations (Latin script, Arabic script, and English translations)
- High-quality recordings with manual corrections

### Dataset Structure
| Column Name | Description |
|-------------|-------------|
| **audio** | Speech recordings for Darija sentences |
| **darija_Ltn** | Darija sentences using Latin letters |
| **darija_Arab_new** | Corrected Darija sentences using Arabic script |
| **english** | English translation of Darija sentences |
| **darija_Arab_old** | Original (uncorrected) Darija sentences in Arabic script |

### Speaker Distribution
The dataset includes recordings from 7 speakers (4 females, 3 males) with the following distribution:
```
Samples 0-999     -> Female 1
Samples 1000-1999 -> Male 3
Samples 2000-2730 -> Female 2
Samples 2731-2800 -> Male 1
Samples 2801-3999 -> Male 2
Samples 4000-4999 -> Male 1
Samples 5000-5999 -> Female 3
Samples 6000-6999 -> Male 1
Samples 7000-7999 -> Female 4
Samples 8000-8999 -> Female 1
Samples 9000-9999 -> Male 2
Samples 10000-11999 -> Male 1
Samples 12000-12350 -> Male 2
Samples 12351-12742 -> Male 1
```
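
If you need to recover which speaker produced a given sample (for example, to group recordings when computing per-speaker embeddings), the ranges above can be encoded in a small lookup helper. The sketch below is derived only from the table above; the label names are illustrative.

```python
# Hypothetical helper mapping a dataset index to its speaker label,
# following the sample ranges listed above.
SPEAKER_RANGES = [
    (0, 999, "female_1"),
    (1000, 1999, "male_3"),
    (2000, 2730, "female_2"),
    (2731, 2800, "male_1"),
    (2801, 3999, "male_2"),
    (4000, 4999, "male_1"),
    (5000, 5999, "female_3"),
    (6000, 6999, "male_1"),
    (7000, 7999, "female_4"),
    (8000, 8999, "female_1"),
    (9000, 9999, "male_2"),
    (10000, 11999, "male_1"),
    (12000, 12350, "male_2"),
    (12351, 12742, "male_1"),
]

def speaker_for_index(idx: int) -> str:
    """Return the speaker label for a dataset sample index."""
    for start, end, speaker in SPEAKER_RANGES:
        if start <= idx <= end:
            return speaker
    raise ValueError(f"Index {idx} is outside the dataset range (0-12742)")
```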

## Installation

To set up the project environment:

```bash
# Clone the repository
git clone https://github.com/yourusername/darija-tts.git
cd darija-tts

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Model Training

The model training process involves:

1. **Data Loading**: Loading the DODa audio dataset from Hugging Face
2. **Data Preprocessing**: Normalizing text and extracting speaker embeddings (a hedged embedding sketch follows this list)
3. **Model Setup**: Configuring the SpeechT5 model for Darija TTS
4. **Training**: Fine-tuning the model using the prepared dataset
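
For the speaker-embedding part of step 2, a common recipe for SpeechT5 fine-tuning is to derive 512-dimensional x-vectors with a pretrained SpeechBrain encoder. The snippet below is a hedged sketch of that approach, not the exact code in the project notebook; the cache path and helper name are illustrative.

```python
import torch
from speechbrain.pretrained import EncoderClassifier  # newer SpeechBrain: speechbrain.inference

# Pretrained x-vector speaker encoder (produces 512-dim embeddings).
speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="./pretrained/spkrec-xvect-voxceleb",  # illustrative cache path
)

def create_speaker_embedding(waveform):
    """Return an L2-normalized 512-dim speaker embedding for a 16 kHz mono waveform."""
    with torch.no_grad():
        embedding = speaker_model.encode_batch(torch.tensor(waveform))
        embedding = torch.nn.functional.normalize(embedding, dim=2)
    return embedding.squeeze().cpu().numpy()
```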

To run the training:

```bash
# Open the Jupyter notebook
jupyter notebook notebooks/train_darija_tts.ipynb
```

Key training parameters (see the sketch after this list):
- Learning rate: 1e-4
- Batch size: 4 (with gradient accumulation: 8)
- Training steps: 1000
- Evaluation frequency: Every 100 steps
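
As a rough illustration, these parameters would map onto Hugging Face `Seq2SeqTrainingArguments` along the following lines; the notebook is the source of truth, and the output directory and the flags marked as assumptions are not taken from it.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./models/speecht5_finetuned_Darija",  # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    max_steps=1000,
    eval_strategy="steps",        # older transformers releases call this `evaluation_strategy`
    eval_steps=100,
    save_steps=100,
    warmup_steps=100,             # assumption: not stated above
    fp16=True,                    # assumption: requires a CUDA GPU
    gradient_checkpointing=True,  # assumption: commonly enabled to save memory
    remove_unused_columns=False,  # assumption: keeps custom columns such as speaker embeddings
    label_names=["labels"],
    report_to=[],
)
```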

## Inference

To generate speech from text after training:

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf

# Load models
model_path = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(model_path)
model = SpeechT5ForTextToSpeech.from_pretrained(model_path)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load speaker embedding
speaker_embedding = torch.load("./data/speaker_embeddings/female_embedding.pt")

# Normalize and process input text
text = "Salam, kifach nta lyoum?"
inputs = processor(text=text, return_tensors="pt")

# Generate speech
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# Save audio file
sf.write("output.wav", speech.numpy(), 16000)
```

## Gradio Demo

The project includes a Gradio demo that provides a user-friendly interface for text-to-speech conversion:

```bash
# Run the demo locally
cd demo
python app.py
```

The demo features (a minimal interface sketch follows the list):
- Text input field for Darija text (Latin script)
- Voice selection (male/female)
- Speech speed adjustment
- Audio playback of generated speech
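
For orientation, a minimal `app.py` wiring these pieces together could look roughly like the sketch below. It reuses the loading code from the inference example above; the male embedding path is hypothetical and speed control is omitted, so treat `demo/app.py` as the actual implementation.

```python
import gradio as gr
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

# Model loading follows the inference example above.
MODEL_PATH = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(MODEL_PATH)
model = SpeechT5ForTextToSpeech.from_pretrained(MODEL_PATH)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Embedding file per voice; the male path is a hypothetical name.
EMBEDDINGS = {
    "Female": "./data/speaker_embeddings/female_embedding.pt",
    "Male": "./data/speaker_embeddings/male_embedding.pt",
}

def synthesize(text, voice):
    """Generate 16 kHz Darija speech for the given text and voice."""
    speaker_embedding = torch.load(EMBEDDINGS[voice])
    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
    return 16000, speech.numpy()

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Textbox(label="Darija text (Latin script)"),
        gr.Radio(["Female", "Male"], value="Female", label="Voice"),
    ],
    outputs=gr.Audio(label="Generated speech"),
    title="Moroccan Darija TTS Demo",
)

if __name__ == "__main__":
    demo.launch()
```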

### Deploying to Hugging Face Spaces

To deploy the demo to Hugging Face Spaces:

1. Push your model to the Hugging Face Hub (see the sketch below)
2. Create a new Space with the Gradio SDK
3. Upload the demo files to the Space

See the notebook for detailed deployment instructions.
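
Step 1 can also be done programmatically with the `push_to_hub` methods in `transformers` (after authenticating with `huggingface-cli login`). The repository id below is a placeholder.

```python
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor

repo_id = "your-username/speecht5-darija-tts"  # placeholder Hub repository id

model = SpeechT5ForTextToSpeech.from_pretrained("./models/speecht5_finetuned_Darija")
processor = SpeechT5Processor.from_pretrained("./models/speecht5_finetuned_Darija")

model.push_to_hub(repo_id)
processor.push_to_hub(repo_id)
```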

## Project Features

- **Multi-Speaker TTS**: Generate speech in both male and female voices
- **Voice Cloning**: Utilizes speaker embeddings for voice characteristics preservation
- **Speed Control**: Adjust the speech rate as needed
- **Text Normalization**: Handles various text inputs through proper normalization

## Potential Applications

- **Voice Assistants**: Build voice assistants that speak Moroccan Darija
- **Accessibility Tools**: Create tools for people with visual impairments
- **Language Learning**: Develop applications for learning Darija pronunciation
- **Content Creation**: Generate voiceovers for videos or audio content
- **Public Announcements**: Create automated announcement systems in Darija

## Limitations and Future Work

Current limitations:
- The model may struggle with code-switching between Darija and other languages
- Pronunciation of certain loanwords might be inconsistent
- Limited emotional range in the generated speech

Future improvements:
- Fine-tune with more diverse speech data
- Implement emotion control for expressive speech
- Add support for Arabic script input
- Develop a multilingual version supporting Darija, Arabic, and French

## License

This project is released under the MIT License. The DODa audio dataset is also available under the MIT License.

## Acknowledgments

- The [DODa audio dataset](https://huggingface.co/datasets/atlasia/DODa-audio-dataset) creators
- Hugging Face for the Transformers library and model hosting
- Microsoft Research for the SpeechT5 model architecture