Speeding Up Text-To-Speech Diffusion Models by Distillation

Introduction
A team of students from the University of Warsaw, Poland, working under the supervision of engineers from NVIDIA, has successfully reduced the latency of the TorToiSe text-to-speech (TTS) model. They achieved this by adapting classifier-free guidance distillation and progressive distillation, techniques typically used in computer vision, to speech synthesis. The result was a 5x reduction in diffusion latency without a loss in speech quality.
Background
Synthetic voices generated by TTS models are nearly indistinguishable from human speech in simple applications, such as AI-based voice assistants, and can be generated much faster than real time. Diffusion models used in TTS operate on frequency-domain spectrograms, which can be processed much like images.
Methods for speeding up diffusion
Existing techniques for reducing latency in diffusion-based TTS fall into two categories: training-free and training-based methods. Training-based methods optimize the network used in the diffusion process. In particular, knowledge distillation trains a student network to imitate a teacher network, reducing the number of diffusion steps needed at inference.
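To make the distillation idea concrete, here is a minimal, self-contained sketch of a knowledge-distillation training loop. A toy linear "teacher" stands in for the full diffusion network, and a "student" is trained to match its outputs on noisy inputs; all names, sizes, and the linear model are illustrative assumptions, not the TorToiSe implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "teacher" denoiser: a fixed linear map standing in for the
# full diffusion network (an assumption for illustration only).
W_teacher = rng.normal(size=(8, 8))

def teacher(x):
    return x @ W_teacher

# The student starts from random weights and is trained to
# reproduce the teacher's predictions -- the core of distillation.
W_student = rng.normal(size=(8, 8))
lr = 0.01
for step in range(2000):
    x = rng.normal(size=(32, 8))               # batch of noisy latents
    target = teacher(x)                        # teacher's prediction
    pred = x @ W_student                       # student's prediction
    grad = 2 * x.T @ (pred - target) / len(x)  # MSE gradient w.r.t. W_student
    W_student -= lr * grad

# After training, the student closely matches the teacher.
err = np.abs(W_student - W_teacher).max()
print(f"max weight error: {err:.4f}")
```

In a real diffusion setting the teacher and student are deep networks and the inputs are noised spectrogram latents, but the objective has the same shape: minimize the distance between student and teacher outputs.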
Distillation in diffusion-based TTS
The team decided to apply the distillation approach to the diffusion part of the TorToiSe model, based on its success in computer vision and its potential for significant latency reduction. The approach involved two knowledge distillation phases:
- Mimicking the guided diffusion model output: A student model is trained to replicate the output of the guided diffusion model at each diffusion step. Synthetic data, generated from text embeddings with the original teacher model, is used for distillation.
- Training a second student model: The second student is trained using progressive distillation, which runs for multiple iterations, each halving the number of inference steps.
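The two phases above can be sketched in a few lines. The first snippet shows the guided output the first student learns to mimic: classifier-free guidance combines a conditional and an unconditional prediction with a guidance weight. The second shows how progressive distillation halves the step count per iteration. Function names, the weight value, and the step counts are assumptions for illustration, not values from the TorToiSe work.

```python
import numpy as np

# Phase 1: classifier-free guidance target. The guided output blends
# the conditional and unconditional predictions with weight w; the
# first student learns this combined target in a single forward pass.
def guided_output(cond_pred, uncond_pred, w=2.0):
    return uncond_pred + w * (cond_pred - uncond_pred)

cond = np.array([1.0, 2.0])
uncond = np.array([0.5, 1.0])
target = guided_output(cond, uncond, w=2.0)  # two teacher calls -> one target

# Phase 2: progressive distillation. Each iteration halves the number
# of inference steps, so k iterations reduce the count by 2**k.
def step_schedule(initial_steps, iterations):
    steps = [initial_steps]
    for _ in range(iterations):
        steps.append(steps[-1] // 2)
    return steps

schedule = step_schedule(64, 4)
print(target)    # guided target the first student mimics
print(schedule)  # e.g. [64, 32, 16, 8, 4]
```

Note the efficiency win in phase 1: the guided teacher needs two network evaluations per step (conditional and unconditional), while the distilled student produces the combined output in one.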
By implementing this approach, the team was able to reduce the number of diffusion steps required for inference to only 31. The quality of speech produced by the distilled model matches that of the original TTS model based on guided diffusion.
Conclusion
The team of students from the University of Warsaw successfully reduced the latency of the TorToiSe TTS model by implementing progressive distillation, achieving a 5x reduction in diffusion latency without sacrificing speech quality. Their approach combined two knowledge distillation phases: mimicking the guided diffusion model's output and progressively distilling a second student model. The results show promise for improving the efficiency of diffusion-based TTS models.