Model Card for Arabic StyleTTS2
This is an Arabic text-to-speech model based on the StyleTTS2 architecture, adapted specifically for Arabic speech synthesis. The model achieves good-quality Arabic speech, though not yet state-of-the-art, and further experimentation is needed to optimize performance for Arabic specifically. All training objectives from the original StyleTTS2 were maintained, except for the WavLM objectives, which were removed because they were primarily designed for English speech.
Example
Here is an example output from the model:
Sample 1
Efficiency and Performance
A key strength of this model lies in its efficiency and performance characteristics:
- Compact Architecture: Achieves good quality with fewer than 100M parameters
- Limited Training Data: Trained on only 22 hours of single-speaker audio
- Transfer Learning: Successfully fine-tuned from the multi-speaker LibriTTS model to a single-speaker Arabic model
- Resource Efficient: Good quality achieved despite limited computational resources
Note: According to the StyleTTS2 authors, performance should improve further when training a single-speaker model from scratch rather than fine-tuning. This wasn't attempted in our case due to computational resource constraints, suggesting potential for even better results with more extensive training.
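As a rough sanity check on the parameter-count claim above, the released checkpoint can be loaded and its tensor sizes summed. This is a minimal sketch, assuming the typical StyleTTS2 checkpoint layout (a top-level "net" entry mapping sub-module names to state dicts); the actual layout of model.pth may differ.

```python
import torch

# Sketch only: the nested "net" layout is an assumption based on typical
# StyleTTS2 checkpoints and may not match this repository's model.pth exactly.
ckpt = torch.load("model.pth", map_location="cpu")
state = ckpt.get("net", ckpt)  # fall back to treating ckpt as a flat state dict

total = 0
for name, value in state.items():
    if isinstance(value, dict):  # per-module state dict
        total += sum(t.numel() for t in value.values() if torch.is_tensor(t))
    elif torch.is_tensor(value):  # flat state dict entry
        total += value.numel()

print(f"Approximate total parameters: {total / 1e6:.1f}M")
```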
Model Details
Model Description
This model is a modified version of StyleTTS2, specifically adapted for Arabic text-to-speech synthesis. It incorporates a custom-trained PL-BERT model for Arabic language understanding and removes the WavLM adversarial training component (which was primarily designed for English).
- Developed by: Fadi (GitHub: Fadi987)
- Model type: Text-to-Speech (StyleTTS2 architecture)
- Language(s): Arabic
- Finetuned from model: yl4579/StyleTTS2-LibriTTS
Model Sources
- Repository: Fadi987/StyleTTS2
- Paper: StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
- PL-BERT Model: fadi77/pl-bert
Uses
Direct Use
The model can be used for generating Arabic speech from text. To use the model:
- Clone the StyleTTS2 repository:
```bash
git clone https://github.com/Fadi987/StyleTTS2
cd StyleTTS2
```
- Install espeak-ng for the phonemization backend:

```bash
# For macOS
brew install espeak-ng

# For Ubuntu/Debian
sudo apt-get install espeak-ng

# For Windows
# Download and install espeak-ng from: https://github.com/espeak-ng/espeak-ng/releases
```
- Install Python dependencies:
```bash
pip install -r requirements.txt
```
- Download the `model.pth` and `config.yml` files from this repository
- Run inference using:

```bash
python inference.py --config config.yml --model model.pth --text "ุงูุฅูุชูููุงูู ููุญูุชูุงุฌู ุฅูููู ุงููุนูู
ููู ููุงููู
ูุซูุงุจูุฑูุฉ"
```
Make sure to use properly diacritized Arabic text for best results.
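Since espeak-ng is used as the phonemization backend, a quick way to confirm it is installed correctly is to phonemize a diacritized sentence with the phonemizer package (which the standard StyleTTS2 setup depends on). This is a standalone sketch for checking the backend, not part of inference.py:

```python
from phonemizer import phonemize

# Fully diacritized Arabic input, as recommended above.
text = "ุงูุฅูุชูููุงูู ููุญูุชูุงุฌู ุฅูููู ุงููุนูู
ููู ููุงููู
ูุซูุงุจูุฑูุฉ"

# "ar" selects the Arabic espeak-ng voice; if this prints IPA-like phonemes,
# the espeak-ng backend is wired up correctly.
phonemes = phonemize(
    text,
    language="ar",
    backend="espeak",
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)
```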
Out-of-Scope Use
The model is specifically designed for Arabic text-to-speech synthesis and may not perform well for:
- Other languages
- Arabic with heavy dialectal variation
- Non-diacritized Arabic text
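Because undiacritized input is out of scope, it can help to check that text actually carries diacritics before synthesis. Below is a small illustrative helper (not part of the repository) based on the Unicode range for Arabic diacritic marks; the 0.3 threshold is an arbitrary assumption.

```python
import re

# Arabic diacritic marks (fathatan through sukun), U+064B..U+0652.
_DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Basic Arabic letters, U+0621..U+064A.
_LETTERS = re.compile(r"[\u0621-\u064A]")

def diacritization_ratio(text: str) -> float:
    """Rough ratio of diacritic marks to Arabic letters in the text."""
    letters = len(_LETTERS.findall(text))
    if letters == 0:
        return 0.0
    return len(_DIACRITICS.findall(text)) / letters

sample = "ุงูุฅูุชูููุงูู ููุญูุชูุงุฌู ุฅูููู ุงููุนูู
ููู ููุงููู
ูุซูุงุจูุฑูุฉ"
if diacritization_ratio(sample) < 0.3:  # arbitrary threshold for illustration
    print("Warning: input appears to be largely undiacritized.")
```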
Training Details
Training Data
- Training was performed on approximately 22 hours of Arabic audiobook data
- Dataset: fadi77/arabic-audiobook-dataset-24khz
- The PL-BERT component was trained on fully diacritized Wikipedia Arabic text
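For inspection, the training corpus above is hosted on the Hugging Face Hub; a minimal sketch for loading it with the datasets library is shown below (the split name and column layout are assumptions, so check the dataset card):

```python
from datasets import load_dataset

# Split and column names are assumptions; see the dataset card for
# fadi77/arabic-audiobook-dataset-24khz for the authoritative layout.
ds = load_dataset("fadi77/arabic-audiobook-dataset-24khz", split="train")
print(ds)
print(ds[0].keys())
```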
Training Hyperparameters
- Number of epochs: 20
- Diffusion training: Started from epoch 5
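These two settings correspond to fields in the training configuration. Below is a small sketch for reading them out of config.yml; the field names ("epochs", "diff_epoch") follow the upstream StyleTTS2 fine-tuning configs and are assumptions for this fork:

```python
import yaml

# Field names follow the upstream StyleTTS2 configs ("epochs", "diff_epoch");
# they are assumptions here and may differ in this repository's config.yml.
with open("config.yml") as f:
    cfg = yaml.safe_load(f)

print("total epochs:", cfg.get("epochs"))
print("diffusion training starts at epoch:", cfg.get("diff_epoch"))
```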
Objectives
- Training objectives: All original StyleTTS2 objectives maintained, except WavLM adversarial training
- Validation objectives: Identical to original StyleTTS2 validation process
Compute Infrastructure
- Hardware Type: NVIDIA H100 GPU
Notable Modifications to the Original StyleTTS2 Architecture and Objectives
The architecture of the model follows that of StyleTTS2 with the following exceptions:
- Removed WavLM adversarial training component
- Custom PL-BERT trained for Arabic language
Citation
BibTeX:
@article{styletts2,
title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S. and Mischler, Gavin and Mesgarani, Nima},
journal={arXiv preprint arXiv:2306.07691},
year={2023}
}