---
title: Moroccan Darija TTS Demo
emoji: 🎙️
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
---

# Moroccan Darija Text-to-Speech Model

This project implements a Text-to-Speech (TTS) system for Moroccan Darija based on the SpeechT5 architecture, fine-tuned on the DODa audio dataset to generate natural-sounding Darija speech from text input.

## Table of Contents
- [Dataset Overview](#dataset-overview)
- [Installation](#installation)
- [Model Training](#model-training)
- [Inference](#inference)
- [Gradio Demo](#gradio-demo)
- [Project Features](#project-features)
- [Potential Applications](#potential-applications)
- [Limitations and Future Work](#limitations-and-future-work)
- [License](#license)

## Dataset Overview

The **DODa audio dataset** contains 12,743 sentences recorded by 7 contributors (4 females, 3 males). Key characteristics:

- Audio recordings standardized at 16kHz sample rate
- Multiple text representations (Latin script, Arabic script, and English translations)
- High-quality recordings with manual corrections

### Dataset Structure
| Column Name | Description |
|-------------|-------------|
| **audio** | Speech recordings for Darija sentences |
| **darija_Ltn** | Darija sentences using Latin letters |
| **darija_Arab_new** | Corrected Darija sentences using Arabic script |
| **english** | English translation of Darija sentences |
| **darija_Arab_old** | Original (uncorrected) Darija sentences in Arabic script |

### Speaker Distribution
The dataset includes recordings from 7 speakers (4 females, 3 males) with the following distribution:
```
Samples 0-999     -> Female 1
Samples 1000-1999 -> Male 3
Samples 2000-2730 -> Female 2
Samples 2731-2800 -> Male 1
Samples 2801-3999 -> Male 2
Samples 4000-4999 -> Male 1
Samples 5000-5999 -> Female 3
Samples 6000-6999 -> Male 1
Samples 7000-7999 -> Female 4
Samples 8000-8999 -> Female 1
Samples 9000-9999 -> Male 2
Samples 10000-11999 -> Male 1
Samples 12000-12350 -> Male 2
Samples 12351-12742 -> Male 1
```
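
If you need to recover which speaker produced a given sample (for example, to group recordings when computing per-speaker embeddings), the ranges above can be encoded in a small lookup helper. The sketch below is derived only from the table above; the label names are illustrative.

```python
# Hypothetical helper mapping a dataset index to its speaker label,
# following the sample ranges listed above.
SPEAKER_RANGES = [
    (0, 999, "female_1"),
    (1000, 1999, "male_3"),
    (2000, 2730, "female_2"),
    (2731, 2800, "male_1"),
    (2801, 3999, "male_2"),
    (4000, 4999, "male_1"),
    (5000, 5999, "female_3"),
    (6000, 6999, "male_1"),
    (7000, 7999, "female_4"),
    (8000, 8999, "female_1"),
    (9000, 9999, "male_2"),
    (10000, 11999, "male_1"),
    (12000, 12350, "male_2"),
    (12351, 12742, "male_1"),
]

def speaker_for_index(idx: int) -> str:
    """Return the speaker label for a dataset sample index."""
    for start, end, speaker in SPEAKER_RANGES:
        if start <= idx <= end:
            return speaker
    raise ValueError(f"Index {idx} is outside the dataset range (0-12742)")
```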

## Installation

To set up the project environment:

```bash
# Clone the repository
git clone https://github.com/yourusername/darija-tts.git
cd darija-tts

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Model Training

The model training process involves:

1. **Data Loading**: Loading the DODa audio dataset from Hugging Face
2. **Data Preprocessing**: Normalizing text and extracting speaker embeddings (a hedged embedding sketch follows this list)
3. **Model Setup**: Configuring the SpeechT5 model for Darija TTS
4. **Training**: Fine-tuning the model using the prepared dataset
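
For the speaker-embedding part of step 2, a common recipe for SpeechT5 fine-tuning is to derive 512-dimensional x-vectors with a pretrained SpeechBrain encoder. The snippet below is a hedged sketch of that approach, not the exact code in the project notebook; the cache path and helper name are illustrative.

```python
import torch
from speechbrain.pretrained import EncoderClassifier  # newer SpeechBrain: speechbrain.inference

# Pretrained x-vector speaker encoder (produces 512-dim embeddings).
speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="./pretrained/spkrec-xvect-voxceleb",  # illustrative cache path
)

def create_speaker_embedding(waveform):
    """Return an L2-normalized 512-dim speaker embedding for a 16 kHz mono waveform."""
    with torch.no_grad():
        embedding = speaker_model.encode_batch(torch.tensor(waveform))
        embedding = torch.nn.functional.normalize(embedding, dim=2)
    return embedding.squeeze().cpu().numpy()
```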

To run the training:

```bash
# Open the Jupyter notebook
jupyter notebook notebooks/train_darija_tts.ipynb
```

Key training parameters (see the sketch after this list):
- Learning rate: 1e-4
- Batch size: 4 (with gradient accumulation: 8)
- Training steps: 1000
- Evaluation frequency: Every 100 steps
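
As a rough illustration, these parameters would map onto Hugging Face `Seq2SeqTrainingArguments` along the following lines; the notebook is the source of truth, and the output directory and the flags marked as assumptions are not taken from it.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./models/speecht5_finetuned_Darija",  # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    max_steps=1000,
    eval_strategy="steps",        # older transformers releases call this `evaluation_strategy`
    eval_steps=100,
    save_steps=100,
    warmup_steps=100,             # assumption: not stated above
    fp16=True,                    # assumption: requires a CUDA GPU
    gradient_checkpointing=True,  # assumption: commonly enabled to save memory
    remove_unused_columns=False,  # assumption: keeps custom columns such as speaker embeddings
    label_names=["labels"],
    report_to=[],
)
```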

## Inference

To generate speech from text after training:

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf

# Load models
model_path = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(model_path)
model = SpeechT5ForTextToSpeech.from_pretrained(model_path)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load speaker embedding
speaker_embedding = torch.load("./data/speaker_embeddings/female_embedding.pt")

# Normalize and process input text
text = "Salam, kifach nta lyoum?"
inputs = processor(text=text, return_tensors="pt")

# Generate speech
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# Save audio file
sf.write("output.wav", speech.numpy(), 16000)
```

## Gradio Demo

The project includes a Gradio demo that provides a user-friendly interface for text-to-speech conversion:

```bash
# Run the demo locally
cd demo
python app.py
```

The demo features (a minimal interface sketch follows the list):
- Text input field for Darija text (Latin script)
- Voice selection (male/female)
- Speech speed adjustment
- Audio playback of generated speech
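
For orientation, a minimal `app.py` wiring these pieces together could look roughly like the sketch below. It reuses the loading code from the inference example above; the male embedding path is hypothetical and speed control is omitted, so treat `demo/app.py` as the actual implementation.

```python
import gradio as gr
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

# Model loading follows the inference example above.
MODEL_PATH = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(MODEL_PATH)
model = SpeechT5ForTextToSpeech.from_pretrained(MODEL_PATH)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Embedding file per voice; the male path is a hypothetical name.
EMBEDDINGS = {
    "Female": "./data/speaker_embeddings/female_embedding.pt",
    "Male": "./data/speaker_embeddings/male_embedding.pt",
}

def synthesize(text, voice):
    """Generate 16 kHz Darija speech for the given text and voice."""
    speaker_embedding = torch.load(EMBEDDINGS[voice])
    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
    return 16000, speech.numpy()

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Textbox(label="Darija text (Latin script)"),
        gr.Radio(["Female", "Male"], value="Female", label="Voice"),
    ],
    outputs=gr.Audio(label="Generated speech"),
    title="Moroccan Darija TTS Demo",
)

if __name__ == "__main__":
    demo.launch()
```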

### Deploying to Hugging Face Spaces

To deploy the demo to Hugging Face Spaces:

1. Push your model to the Hugging Face Hub (see the sketch below)
2. Create a new Space with the Gradio SDK
3. Upload the demo files to the Space

See the notebook for detailed deployment instructions.
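
Step 1 can also be done programmatically with the `push_to_hub` methods in `transformers` (after authenticating with `huggingface-cli login`). The repository id below is a placeholder.

```python
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor

repo_id = "your-username/speecht5-darija-tts"  # placeholder Hub repository id

model = SpeechT5ForTextToSpeech.from_pretrained("./models/speecht5_finetuned_Darija")
processor = SpeechT5Processor.from_pretrained("./models/speecht5_finetuned_Darija")

model.push_to_hub(repo_id)
processor.push_to_hub(repo_id)
```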

## Project Features

- **Multi-Speaker TTS**: Generate speech in both male and female voices
- **Voice Cloning**: Utilizes speaker embeddings for voice characteristics preservation
- **Speed Control**: Adjust the speech rate as needed
- **Text Normalization**: Handles various text inputs through proper normalization

## Potential Applications

- **Voice Assistants**: Build voice assistants that speak Moroccan Darija
- **Accessibility Tools**: Create tools for people with visual impairments
- **Language Learning**: Develop applications for learning Darija pronunciation
- **Content Creation**: Generate voiceovers for videos or audio content
- **Public Announcements**: Create automated announcement systems in Darija

## Limitations and Future Work

Current limitations:
- The model may struggle with code-switching between Darija and other languages
- Pronunciation of certain loanwords might be inconsistent
- Limited emotional range in the generated speech

Future improvements:
- Fine-tune with more diverse speech data
- Implement emotion control for expressive speech
- Add support for Arabic script input
- Develop a multilingual version supporting Darija, Arabic, and French

## License

This project is released under the MIT License. The DODa audio dataset is also available under the MIT License.

## Acknowledgments

- The [DODa audio dataset](https://huggingface.co/datasets/atlasia/DODa-audio-dataset) creators
- Hugging Face for the Transformers library and model hosting
- Microsoft Research for the SpeechT5 model architecture