---
title: Moroccan Darija TTS Demo
emoji: 🎙️
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
---

# Moroccan Darija Text-to-Speech Model

This project implements a Text-to-Speech (TTS) system for Moroccan Darija using the SpeechT5 architecture. The model is fine-tuned on the DODa audio dataset to generate natural-sounding Darija speech from text input.

## Table of Contents

- [Dataset Overview](#dataset-overview)
- [Installation](#installation)
- [Model Training](#model-training)
- [Inference](#inference)
- [Gradio Demo](#gradio-demo)
- [Project Features](#project-features)
- [Potential Applications](#potential-applications)
- [Limitations and Future Work](#limitations-and-future-work)
- [License](#license)
- [Acknowledgments](#acknowledgments)

## Dataset Overview

The **DODa audio dataset** contains 12,743 sentences recorded by 7 contributors (4 female, 3 male). Key characteristics:

- Audio recordings standardized at a 16 kHz sample rate
- Multiple text representations (Latin script, Arabic script, and English translations)
- High-quality recordings with manual corrections

### Dataset Structure

| Column Name | Description |
|-------------|-------------|
| **audio** | Speech recordings for Darija sentences |
| **darija_Ltn** | Darija sentences using Latin letters |
| **darija_Arab_new** | Corrected Darija sentences using Arabic script |
| **english** | English translation of Darija sentences |
| **darija_Arab_old** | Original (uncorrected) Darija sentences in Arabic script |

### Speaker Distribution

Samples are grouped by speaker in contiguous index ranges. The 7 speakers (4 female, 3 male) are distributed as follows:

```
Samples 0-999       -> Female 1
Samples 1000-1999   -> Male 3
Samples 2000-2730   -> Female 2
Samples 2731-2800   -> Male 1
Samples 2801-3999   -> Male 2
Samples 4000-4999   -> Male 1
Samples 5000-5999   -> Female 3
Samples 6000-6999   -> Male 1
Samples 7000-7999   -> Female 4
Samples 8000-8999   -> Female 1
Samples 9000-9999   -> Male 2
Samples 10000-11999 -> Male 1
Samples 12000-12350 -> Male 2
Samples 12351-12742 -> Male 1
```

## Installation

To set up the project environment:

```bash
# Clone the repository
git clone https://github.com/yourusername/darija-tts.git
cd darija-tts

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Model Training

The model training process involves:

1. **Data Loading**: Loading the DODa audio dataset from Hugging Face
2. **Data Preprocessing**: Normalizing text and extracting speaker embeddings
3. **Model Setup**: Configuring the SpeechT5 model for Darija TTS
4. **Training**: Fine-tuning the model using the prepared dataset

To run the training:

```bash
# Open the Jupyter notebook
jupyter notebook notebooks/train_darija_tts.ipynb
```

Key training parameters:

- Learning rate: 1e-4
- Batch size: 4 with 8 gradient accumulation steps (effective batch size of 32)
- Training steps: 1000
- Evaluation frequency: every 100 steps

## Inference

To generate speech from text after training:

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf

# Load the fine-tuned model, its processor, and the HiFi-GAN vocoder
model_path = "./models/speecht5_finetuned_Darija"
processor = SpeechT5Processor.from_pretrained(model_path)
model = SpeechT5ForTextToSpeech.from_pretrained(model_path)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load the speaker embedding; generate_speech expects a 2-D tensor of shape
# (1, speaker_embedding_dim), so call .unsqueeze(0) if the saved tensor is 1-D
speaker_embedding = torch.load("./data/speaker_embeddings/female_embedding.pt")

# Tokenize the input text (Latin-script Darija)
text = "Salam, kifach nta lyoum?"
inputs = processor(text=text, return_tensors="pt")

# Generate the waveform
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# Save the audio at the 16 kHz sample rate used by the dataset
sf.write("output.wav", speech.numpy(), 16000)
```

## Gradio Demo

The project includes a Gradio demo that provides a user-friendly interface for text-to-speech conversion:

```bash
# Run the demo locally
cd demo
python app.py
```

The demo features:

- Text input field for Darija text (Latin script)
- Voice selection (male/female)
- Speech speed adjustment
- Audio playback of generated speech

### Deploying to Hugging Face Spaces

To deploy the demo to Hugging Face Spaces:

1. Push your model to the Hugging Face Hub
2. Create a new Space with the Gradio SDK
3. Upload the demo files to the Space

See the notebook for detailed deployment instructions.

## Project Features

- **Multi-Speaker TTS**: Generate speech in both male and female voices
- **Voice Cloning**: Uses speaker embeddings to preserve voice characteristics
- **Speed Control**: Adjust the speech rate as needed
- **Text Normalization**: Normalizes input text before synthesis

## Potential Applications

- **Voice Assistants**: Build voice assistants that speak Moroccan Darija
- **Accessibility Tools**: Create tools for people with visual impairments
- **Language Learning**: Develop applications for learning Darija pronunciation
- **Content Creation**: Generate voiceovers for videos or audio content
- **Public Announcements**: Create automated announcement systems in Darija

## Limitations and Future Work

Current limitations:

- The model may struggle with code-switching between Darija and other languages
- Pronunciation of certain loanwords might be inconsistent
- The generated speech has a limited emotional range

Future improvements:

- Fine-tune with more diverse speech data
- Implement emotion control for expressive speech
- Add support for Arabic script input
- Develop a multilingual version supporting Darija, Arabic, and French

## License

This project is released under the MIT License. The DODa audio dataset is also available under the MIT License.

## Acknowledgments

- The [DODa audio dataset](https://huggingface.co/datasets/atlasia/DODa-audio-dataset) creators
- Hugging Face for the Transformers library and model hosting
- Microsoft Research for the SpeechT5 model architecture
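
## Appendix: Speaker Lookup Sketch

The index ranges listed under Speaker Distribution can be turned into a small lookup table, which may be handy when grouping samples by speaker before extracting speaker embeddings. The following is a minimal sketch, not code from this repository: the names `SPEAKER_RANGES`, `speaker_for_sample`, and `samples_per_speaker` are hypothetical, and it assumes that row indices in the loaded dataset follow the same order as the ranges in this README.

```python
from bisect import bisect_right
from collections import Counter

# (first_index, last_index, speaker) copied from the Speaker Distribution list.
# Assumption: dataset row order matches these ranges.
SPEAKER_RANGES = [
    (0, 999, "Female 1"),
    (1000, 1999, "Male 3"),
    (2000, 2730, "Female 2"),
    (2731, 2800, "Male 1"),
    (2801, 3999, "Male 2"),
    (4000, 4999, "Male 1"),
    (5000, 5999, "Female 3"),
    (6000, 6999, "Male 1"),
    (7000, 7999, "Female 4"),
    (8000, 8999, "Female 1"),
    (9000, 9999, "Male 2"),
    (10000, 11999, "Male 1"),
    (12000, 12350, "Male 2"),
    (12351, 12742, "Male 1"),
]

# Sorted start indices; the ranges are contiguous, so a binary search on the
# starts is enough to find the range that contains a given index.
_STARTS = [start for start, _, _ in SPEAKER_RANGES]


def speaker_for_sample(index: int) -> str:
    """Return the speaker label for a dataset row index (hypothetical helper)."""
    if not 0 <= index <= SPEAKER_RANGES[-1][1]:
        raise IndexError(f"sample index {index} is outside the dataset")
    position = bisect_right(_STARTS, index) - 1
    return SPEAKER_RANGES[position][2]


def samples_per_speaker() -> Counter:
    """Count how many samples each speaker contributes (ranges are inclusive)."""
    counts = Counter()
    for start, end, speaker in SPEAKER_RANGES:
        counts[speaker] += end - start + 1
    return counts


if __name__ == "__main__":
    print(speaker_for_sample(2731))
    for speaker, count in sorted(samples_per_speaker().items()):
        print(f"{speaker}: {count}")
```

With these ranges the per-speaker counts add up to 12,743, which matches the dataset size stated in the Dataset Overview. The counts also show that the speakers are unevenly represented, which is worth keeping in mind when choosing which recordings to average into a speaker embedding.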