Model Card for Arabic StyleTTS2
This is an Arabic text-to-speech model based on the StyleTTS2 architecture, adapted specifically for Arabic speech synthesis. The model achieves good-quality Arabic speech, though not yet state-of-the-art, and further experimentation is needed to optimize performance for Arabic specifically. All training objectives from the original StyleTTS2 were maintained, except for the WavLM objectives, which were removed because they were primarily designed for English speech.
Example
Here is an example output from the model:
Sample 1
Efficiency and Performance
A key strength of this model lies in its efficiency and performance characteristics:
- Compact Architecture: Achieves good quality with fewer than 100M parameters
- Limited Training Data: Trained on only 22 hours of single-speaker audio
- Transfer Learning: Successfully fine-tuned from the multi-speaker LibriTTS model to a single-speaker Arabic model
- Resource Efficient: Good quality achieved despite limited computational resources
Note: According to the StyleTTS2 authors, performance should improve further when training a single-speaker model from scratch rather than fine-tuning. This wasn't attempted in our case due to computational resource constraints, suggesting potential for even better results with more extensive training.
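As a rough sanity check on the parameter-count claim above, the released checkpoint can be loaded and its tensor sizes summed. This is a minimal sketch, assuming the typical StyleTTS2 checkpoint layout (a top-level "net" entry mapping sub-module names to state dicts); the actual layout of model.pth may differ.

```python
import torch

# Sketch only: the nested "net" layout is an assumption based on typical
# StyleTTS2 checkpoints and may not match this repository's model.pth exactly.
ckpt = torch.load("model.pth", map_location="cpu")
state = ckpt.get("net", ckpt)  # fall back to treating ckpt as a flat state dict

total = 0
for name, value in state.items():
    if isinstance(value, dict):  # per-module state dict
        total += sum(t.numel() for t in value.values() if torch.is_tensor(t))
    elif torch.is_tensor(value):  # flat state dict entry
        total += value.numel()

print(f"Approximate total parameters: {total / 1e6:.1f}M")
```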
Model Details
Model Description
This model is a modified version of StyleTTS2, specifically adapted for Arabic text-to-speech synthesis. It incorporates a custom-trained PL-BERT model for Arabic language understanding and removes the WavLM adversarial training component (which was primarily designed for English).
- Developed by: Fadi (GitHub: Fadi987)
- Model type: Text-to-Speech (StyleTTS2 architecture)
- Language(s): Arabic
- Finetuned from model: yl4579/StyleTTS2-LibriTTS
Model Sources
- Repository: Fadi987/StyleTTS2
- Paper: StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
- PL-BERT Model: fadi77/pl-bert
Uses
Direct Use
The model can be used for generating Arabic speech from text. To use the model:
- Clone the StyleTTS2 repository:
```bash
git clone https://github.com/Fadi987/StyleTTS2
cd StyleTTS2
```
- Install espeak-ng for the phonemization backend:

```bash
# For macOS
brew install espeak-ng

# For Ubuntu/Debian
sudo apt-get install espeak-ng

# For Windows
# Download and install espeak-ng from: https://github.com/espeak-ng/espeak-ng/releases
```
- Install Python dependencies:
```bash
pip install -r requirements.txt
```
- Download the `model.pth` and `config.yml` files from this repository
- Run inference using:

```bash
python inference.py --config config.yml --model model.pth --text "ุงูุฅูุชูููุงูู ููุญูุชูุงุฌู ุฅูููู ุงููุนูู
ููู ููุงููู
ูุซูุงุจูุฑูุฉ"
```
Make sure to use properly diacritized Arabic text for best results.
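Since espeak-ng is used as the phonemization backend, a quick way to confirm it is installed correctly is to phonemize a diacritized sentence with the phonemizer package (which the standard StyleTTS2 setup depends on). This is a standalone sketch for checking the backend, not part of inference.py:

```python
from phonemizer import phonemize

# Fully diacritized Arabic input, as recommended above.
text = "ุงูุฅูุชูููุงูู ููุญูุชูุงุฌู ุฅูููู ุงููุนูู
ููู ููุงููู
ูุซูุงุจูุฑูุฉ"

# "ar" selects the Arabic espeak-ng voice; if this prints IPA-like phonemes,
# the espeak-ng backend is wired up correctly.
phonemes = phonemize(
    text,
    language="ar",
    backend="espeak",
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)
```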
Out-of-Scope Use
The model is specifically designed for Arabic text-to-speech synthesis and may not perform well for:
- Other languages
- Arabic with heavy dialectal variation
- Non-diacritized Arabic text
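Because undiacritized input is out of scope, it can help to check that text actually carries diacritics before synthesis. Below is a small illustrative helper (not part of the repository) based on the Unicode range for Arabic diacritic marks; the 0.3 threshold is an arbitrary assumption.

```python
import re

# Arabic diacritic marks (fathatan through sukun), U+064B..U+0652.
_DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Basic Arabic letters, U+0621..U+064A.
_LETTERS = re.compile(r"[\u0621-\u064A]")

def diacritization_ratio(text: str) -> float:
    """Rough ratio of diacritic marks to Arabic letters in the text."""
    letters = len(_LETTERS.findall(text))
    if letters == 0:
        return 0.0
    return len(_DIACRITICS.findall(text)) / letters

sample = "ุงูุฅูุชูููุงูู ููุญูุชูุงุฌู ุฅูููู ุงููุนูู
ููู ููุงููู
ูุซูุงุจูุฑูุฉ"
if diacritization_ratio(sample) < 0.3:  # arbitrary threshold for illustration
    print("Warning: input appears to be largely undiacritized.")
```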
Training Details
Training Data
- Training was performed on approximately 22 hours of Arabic audiobook data
- Dataset: fadi77/arabic-audiobook-dataset-24khz
- The PL-BERT component was trained on fully diacritized Wikipedia Arabic text
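For inspection, the training corpus above is hosted on the Hugging Face Hub; a minimal sketch for loading it with the datasets library is shown below (the split name and column layout are assumptions, so check the dataset card):

```python
from datasets import load_dataset

# Split and column names are assumptions; see the dataset card for
# fadi77/arabic-audiobook-dataset-24khz for the authoritative layout.
ds = load_dataset("fadi77/arabic-audiobook-dataset-24khz", split="train")
print(ds)
print(ds[0].keys())
```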
Training Hyperparameters
- Number of epochs: 20
- Diffusion training: Started from epoch 5
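These two settings correspond to fields in the training configuration. Below is a small sketch for reading them out of config.yml; the field names ("epochs", "diff_epoch") follow the upstream StyleTTS2 fine-tuning configs and are assumptions for this fork:

```python
import yaml

# Field names follow the upstream StyleTTS2 configs ("epochs", "diff_epoch");
# they are assumptions here and may differ in this repository's config.yml.
with open("config.yml") as f:
    cfg = yaml.safe_load(f)

print("total epochs:", cfg.get("epochs"))
print("diffusion training starts at epoch:", cfg.get("diff_epoch"))
```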
Objectives
- Training objectives: All original StyleTTS2 objectives maintained, except WavLM adversarial training
- Validation objectives: Identical to original StyleTTS2 validation process
Compute Infrastructure
- Hardware Type: NVIDIA H100 GPU
Notable Modifications to the Original StyleTTS2 Architecture and Objectives
The architecture of the model follows that of StyleTTS2 with the following exceptions:
- Removed WavLM adversarial training component
- Custom PL-BERT trained for Arabic language
Citation
BibTeX:
@article{styletts2,
title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S. and Mischler, Gavin and Mesgarani, Nima},
journal={arXiv preprint arXiv:2306.07691},
year={2023}
}