Transformers documentation
VITS
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model that simplifies the traditional two-stage text-to-speech (TTS) pipeline. It directly synthesizes speech from text using variational inference, adversarial learning, and normalizing flows to produce natural, expressive speech with diverse rhythms and intonations.
You can find all the original VITS checkpoints under the AI at Meta organization.
Click on the VITS models in the right sidebar for more examples of how to apply VITS.
The example below demonstrates how to generate speech from text with the Pipeline or the AutoModel class.
import torch
from transformers import pipeline, set_seed
from scipy.io.wavfile import write
set_seed(555)
pipe = pipeline(
task="text-to-speech",
model="facebook/mms-tts-eng",
torch_dtype=torch.float16,
device=0
)
speech = pipe("Hello, my dog is cute")
# Extract audio data and sampling rate
audio_data = speech["audio"]
sampling_rate = speech["sampling_rate"]
# Save as WAV file
write("hello.wav", sampling_rate, audio_data.squeeze())
Notes
Set a seed for reproducibility because VITS synthesizes speech non-deterministically.
For languages with non-Roman alphabets (Korean, Arabic, etc.), install the uroman package to preprocess the text inputs to the Roman alphabet. You can check if the tokenizer requires uroman as shown below.
# pip install -U uroman
from transformers import VitsTokenizer

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
print(tokenizer.is_uroman)
If your language requires uroman, the tokenizer automatically applies it to the text inputs. Python >= 3.10 doesn't require any additional preprocessing steps; for Python < 3.10, follow the steps below.
git clone https://github.com/isi-nlp/uroman.git
cd uroman
export UROMAN=$(pwd)
Create a function to preprocess the inputs. You can either use the bash variable UROMAN or pass the directory path directly to the function.
import torch
from transformers import VitsTokenizer, VitsModel, set_seed
import os
import subprocess

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-kor")
model = VitsModel.from_pretrained("facebook/mms-tts-kor")

def uromanize(input_string, uroman_path):
    """Convert non-Roman strings to Roman using the `uroman` perl package."""
    script_path = os.path.join(uroman_path, "bin", "uroman.pl")

    command = ["perl", script_path]

    process = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Execute the perl command
    stdout, stderr = process.communicate(input=input_string.encode())

    if process.returncode != 0:
        raise ValueError(f"Error {process.returncode}: {stderr.decode()}")

    # Return the output as a string and skip the new-line character at the end
    return stdout.decode()[:-1]

text = "이봐 무슨 일이야"
uromanized_text = uromanize(text, uroman_path=os.environ["UROMAN"])

inputs = tokenizer(text=uromanized_text, return_tensors="pt")

set_seed(555)  # make deterministic
with torch.no_grad():
    outputs = model(inputs["input_ids"])

waveform = outputs.waveform[0]
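To listen to the generated Korean speech in a notebook, one option is IPython's audio widget; this is a minimal sketch assuming an IPython/Jupyter environment.
from IPython.display import Audio

# Play the 1-D waveform tensor at the model's sampling rate (16 kHz for MMS-TTS)
Audio(waveform.numpy(), rate=model.config.sampling_rate)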
VitsConfig
class transformers.VitsConfig
< source >( vocab_size = 38 hidden_size = 192 num_hidden_layers = 6 num_attention_heads = 2 window_size = 4 use_bias = True ffn_dim = 768 layerdrop = 0.1 ffn_kernel_size = 3 flow_size = 192 spectrogram_bins = 513 hidden_act = 'relu' hidden_dropout = 0.1 attention_dropout = 0.1 activation_dropout = 0.1 initializer_range = 0.02 layer_norm_eps = 1e-05 use_stochastic_duration_prediction = True num_speakers = 1 speaker_embedding_size = 0 upsample_initial_channel = 512 upsample_rates = [8, 8, 2, 2] upsample_kernel_sizes = [16, 16, 4, 4] resblock_kernel_sizes = [3, 7, 11] resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]] leaky_relu_slope = 0.1 depth_separable_channels = 2 depth_separable_num_layers = 3 duration_predictor_flow_bins = 10 duration_predictor_tail_bound = 5.0 duration_predictor_kernel_size = 3 duration_predictor_dropout = 0.5 duration_predictor_num_flows = 4 duration_predictor_filter_channels = 256 prior_encoder_num_flows = 4 prior_encoder_num_wavenet_layers = 4 posterior_encoder_num_wavenet_layers = 16 wavenet_kernel_size = 5 wavenet_dilation_rate = 1 wavenet_dropout = 0.0 speaking_rate = 1.0 noise_scale = 0.667 noise_scale_duration = 0.8 sampling_rate = 16000 **kwargs )
Parameters
- vocab_size (int, optional, defaults to 38) — Vocabulary size of the VITS model. Defines the number of different tokens that can be represented by the inputs_ids passed to the forward method of VitsModel.
- hidden_size (int, optional, defaults to 192) — Dimensionality of the text encoder layers.
- num_hidden_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer encoder.
- num_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer encoder.
- window_size (int, optional, defaults to 4) — Window size for the relative positional embeddings in the attention layers of the Transformer encoder.
- use_bias (bool, optional, defaults to True) — Whether to use bias in the key, query, value projection layers in the Transformer encoder.
- ffn_dim (int, optional, defaults to 768) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
- layerdrop (float, optional, defaults to 0.1) — The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556) for more details.
- ffn_kernel_size (int, optional, defaults to 3) — Kernel size of the 1D convolution layers used by the feed-forward network in the Transformer encoder.
- flow_size (int, optional, defaults to 192) — Dimensionality of the flow layers.
- spectrogram_bins (int, optional, defaults to 513) — Number of frequency bins in the target spectrogram.
- hidden_act (str or function, optional, defaults to "relu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
- hidden_dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings and encoder.
- attention_dropout (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
- activation_dropout (float, optional, defaults to 0.1) — The dropout ratio for activations inside the fully connected layer.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
- use_stochastic_duration_prediction (bool, optional, defaults to True) — Whether to use the stochastic duration prediction module or the regular duration predictor.
- num_speakers (int, optional, defaults to 1) — Number of speakers if this is a multi-speaker model.
- speaker_embedding_size (int, optional, defaults to 0) — Number of channels used by the speaker embeddings. Is zero for single-speaker models.
- upsample_initial_channel (int, optional, defaults to 512) — The number of input channels into the HiFi-GAN upsampling network.
- upsample_rates (Tuple[int] or List[int], optional, defaults to [8, 8, 2, 2]) — A tuple of integers defining the stride of each 1D convolutional layer in the HiFi-GAN upsampling network. The length of upsample_rates defines the number of convolutional layers and has to match the length of upsample_kernel_sizes.
- upsample_kernel_sizes (Tuple[int] or List[int], optional, defaults to [16, 16, 4, 4]) — A tuple of integers defining the kernel size of each 1D convolutional layer in the HiFi-GAN upsampling network. The length of upsample_kernel_sizes defines the number of convolutional layers and has to match the length of upsample_rates.
- resblock_kernel_sizes (Tuple[int] or List[int], optional, defaults to [3, 7, 11]) — A tuple of integers defining the kernel sizes of the 1D convolutional layers in the HiFi-GAN multi-receptive field fusion (MRF) module.
- resblock_dilation_sizes (Tuple[Tuple[int]] or List[List[int]], optional, defaults to [[1, 3, 5], [1, 3, 5], [1, 3, 5]]) — A nested tuple of integers defining the dilation rates of the dilated 1D convolutional layers in the HiFi-GAN multi-receptive field fusion (MRF) module.
- leaky_relu_slope (float, optional, defaults to 0.1) — The angle of the negative slope used by the leaky ReLU activation.
- depth_separable_channels (int, optional, defaults to 2) — Number of channels to use in each depth-separable block.
- depth_separable_num_layers (int, optional, defaults to 3) — Number of convolutional layers to use in each depth-separable block.
- duration_predictor_flow_bins (int, optional, defaults to 10) — Number of channels to map using the unconstrained rational spline in the duration predictor model.
- duration_predictor_tail_bound (float, optional, defaults to 5.0) — Value of the tail bin boundary when computing the unconstrained rational spline in the duration predictor model.
- duration_predictor_kernel_size (int, optional, defaults to 3) — Kernel size of the 1D convolution layers used in the duration predictor model.
- duration_predictor_dropout (float, optional, defaults to 0.5) — The dropout ratio for the duration predictor model.
- duration_predictor_num_flows (int, optional, defaults to 4) — Number of flow stages used by the duration predictor model.
- duration_predictor_filter_channels (int, optional, defaults to 256) — Number of channels for the convolution layers used in the duration predictor model.
- prior_encoder_num_flows (int, optional, defaults to 4) — Number of flow stages used by the prior encoder flow model.
- prior_encoder_num_wavenet_layers (int, optional, defaults to 4) — Number of WaveNet layers used by the prior encoder flow model.
- posterior_encoder_num_wavenet_layers (int, optional, defaults to 16) — Number of WaveNet layers used by the posterior encoder model.
- wavenet_kernel_size (int, optional, defaults to 5) — Kernel size of the 1D convolution layers used in the WaveNet model.
- wavenet_dilation_rate (int, optional, defaults to 1) — Dilation rates of the dilated 1D convolutional layers used in the WaveNet model.
- wavenet_dropout (float, optional, defaults to 0.0) — The dropout ratio for the WaveNet layers.
- speaking_rate (float, optional, defaults to 1.0) — Speaking rate. Larger values give faster synthesised speech.
- noise_scale (float, optional, defaults to 0.667) — How random the speech prediction is. Larger values create more variation in the predicted speech.
- noise_scale_duration (float, optional, defaults to 0.8) — How random the duration prediction is. Larger values create more variation in the predicted durations.
- sampling_rate (int, optional, defaults to 16000) — The sampling rate at which the output audio waveform is digitalized, expressed in hertz (Hz).
This is the configuration class to store the configuration of a VitsModel. It is used to instantiate a VITS model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the VITS facebook/mms-tts-eng architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import VitsModel, VitsConfig
>>> # Initializing a "facebook/mms-tts-eng" style configuration
>>> configuration = VitsConfig()
>>> # Initializing a model (with random weights) from the "facebook/mms-tts-eng" style configuration
>>> model = VitsModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
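Inference-time speech properties such as speed and variation are plain configuration fields, so they can be overridden when building the configuration. The values below are illustrative rather than recommended settings.
>>> # Illustrative values: speak slightly faster with less variation than the defaults
>>> configuration = VitsConfig(speaking_rate=1.2, noise_scale=0.5)
>>> model = VitsModel(configuration)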
VitsTokenizer
class transformers.VitsTokenizer
< source >( vocab_file pad_token = '<pad>' unk_token = '<unk>' language = None add_blank = True normalize = True phonemize = True is_uroman = False **kwargs )
Parameters
- vocab_file (str) — Path to the vocabulary file.
- language (str, optional) — Language identifier.
- add_blank (bool, optional, defaults to True) — Whether to insert token id 0 in between the other tokens.
- normalize (bool, optional, defaults to True) — Whether to normalize the input text by removing all casing and punctuation.
- phonemize (bool, optional, defaults to True) — Whether to convert the input text into phonemes.
- is_uroman (bool, optional, defaults to False) — Whether the uroman Romanizer needs to be applied to the input text prior to tokenizing.
Construct a VITS tokenizer. Also supports MMS-TTS.
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
Lowercases the input string while respecting any special tokens that may be partly or entirely upper-cased.
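A quick usage sketch with the English MMS-TTS checkpoint used elsewhere on this page:
from transformers import VitsTokenizer

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")

# With normalize=True (the default), casing and punctuation are stripped before tokenizing
inputs = tokenizer(text="Hello, my dog is cute", return_tensors="pt")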
prepare_for_tokenization
< source >( text: str is_split_into_words: bool = False normalize: typing.Optional[bool] = None **kwargs ) → Tuple[str, Dict[str, Any]]
Parameters
- text (str) — The text to prepare.
- is_split_into_words (bool, optional, defaults to False) — Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize.
- normalize (bool, optional, defaults to None) — Whether or not to apply punctuation and casing normalization to the text inputs. Typically, VITS is trained on lower-cased and un-punctuated text. Hence, normalization is used to ensure that the input text consists only of lower-case characters.
- kwargs (Dict[str, Any], optional) — Keyword arguments to use for the tokenization.
Returns
Tuple[str, Dict[str, Any]]
The prepared text and the unused kwargs.
Performs any necessary transformations before tokenization. This method should pop the arguments from kwargs and return the remaining kwargs as well. We test the kwargs at the end of the encoding process to be sure all the arguments have been used.
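Continuing the tokenizer sketch above, the preparation step can be previewed directly; the exact prepared string depends on the tokenizer's normalization rules, so treat the call as illustrative.
# Returns the normalized text plus any kwargs that were not consumed
prepared_text, unused_kwargs = tokenizer.prepare_for_tokenization("Hello, World!", normalize=True)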
- __call__
- save_vocabulary
VitsModel
class transformers.VitsModel
< source >( config: VitsConfig )
Parameters
- config (VitsConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The complete VITS model, for text-to-speech synthesis. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None speaker_id: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None labels: typing.Optional[torch.FloatTensor] = None ) → transformers.models.vits.modeling_vits.VitsModelOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing convolution and attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- speaker_id (int, optional) — Which speaker embedding to use. Only used for multispeaker models.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- labels (torch.FloatTensor of shape (batch_size, config.spectrogram_bins, sequence_length), optional) — Float values of target spectrogram. Timesteps set to -100.0 are ignored (masked) for the loss computation.
Returns
transformers.models.vits.modeling_vits.VitsModelOutput or tuple(torch.FloatTensor)
A transformers.models.vits.modeling_vits.VitsModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VitsConfig) and inputs.
- waveform (torch.FloatTensor of shape (batch_size, sequence_length)) — The final audio waveform predicted by the model.
- sequence_lengths (torch.FloatTensor of shape (batch_size,)) — The length in samples of each element in the waveform batch.
- spectrogram (torch.FloatTensor of shape (batch_size, sequence_length, num_bins)) — The log-mel spectrogram predicted at the output of the flow model. This spectrogram is passed to the Hi-Fi GAN decoder model to obtain the final audio waveform.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The VitsModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import VitsTokenizer, VitsModel, set_seed
>>> import torch
>>> tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
>>> model = VitsModel.from_pretrained("facebook/mms-tts-eng")
>>> inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt")
>>> set_seed(555) # make deterministic
>>> with torch.no_grad():
... outputs = model(inputs["input_ids"])
>>> outputs.waveform.shape
torch.Size([1, 45824])
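The MMS-TTS checkpoints initialize speech properties such as speaking_rate and noise_scale from the config as model attributes, so they can be adjusted before calling the model; the values below are illustrative.
>>> model.speaking_rate = 1.3  # illustrative: faster speech
>>> model.noise_scale = 0.8  # illustrative: more variation in the output
>>> with torch.no_grad():
...     outputs = model(inputs["input_ids"])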
- forward