Tsinghua University's LightGrad TTS, with streaming inference

2025/7/7 9:26:40

Paper:

https://arxiv.org/abs/2308.16569

Code:

https://github.com/thuhcsi/LightGrad

Supported datasets:

Training scripts are provided for BZNSYP and LJSpeech.

The paper identifies two problems with Grad-TTS:

DPMs are not lightweight enough for resource-constrained devices.
DPMs require many denoising steps in inference, which increases latency.

Proposed solutions:

To reduce model parameters, regular convolution networks in diffusion decoder are substituted with depthwise separable convolutions.
To accelerate the inference procedure, we adopt a training-free fast sampling technique for DPMs (DPM-solver).
Streaming inference is also implemented in LightGrad to reduce latency further.
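
To see why swapping regular convolutions for depthwise separable ones shrinks the decoder, the short sketch below counts the parameters of a single layer both ways. The channel counts and the 3x3 kernel are illustrative assumptions, not LightGrad's actual configuration, and the function names are hypothetical helpers rather than anything from the repository.

```python
def conv2d_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a regular 2D convolution."""
    kh, kw = kernel_size
    return c_in * c_out * kh * kw + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a depthwise conv followed by a 1x1 pointwise conv."""
    kh, kw = kernel_size
    # Depthwise: one kh x kw filter per input channel.
    depthwise = c_in * kh * kw + (c_in if bias else 0)
    # Pointwise: a 1x1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Illustrative layer sizes (assumed, not taken from the paper).
regular = conv2d_params(64, 128, (3, 3))
separable = depthwise_separable_params(64, 128, (3, 3))
reduction = 1 - separable / regular
print(f"regular={regular}, separable={separable}, reduction={reduction:.1%}")
```

The per-layer saving here is larger than the 62.2% reported for the whole model, which is expected: the figure in the paper covers the full network, including parts that are not replaced.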

Compared with Grad-TTS, LightGrad achieves a 62.2% reduction in parameters and a 65.7% reduction in latency, while preserving comparable speech quality on both Chinese Mandarin and English in 4 denoising steps.

LightGrad streaming scheme (based on a Samsung paper):

Paper:

https://arxiv.org/abs/2111.09052

Implementation details:

Decoder input is chopped into chunks at phoneme boundaries to cover several consecutive phonemes and the chunk lengths are limited to a predefined range.
To incorporate context information into decoder, last phoneme of the previous chunk and first phoneme of the following chunk are padded to the head and tail of the current chunk.
Then, the decoder generates mel-spectrogram for each padded chunk.
After this, mel-spectrogram frames corresponding to the padded phonemes are removed to reverse the changes to each chunk.
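
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from the LightGrad repository: `split_into_chunks`, `stream_decode`, and `max_frames` are hypothetical names, the chunking is a simple greedy rule that enforces only an upper bound on chunk length, and `decode` stands in for the diffusion decoder as any callable that returns one mel frame per input frame.

```python
def split_into_chunks(durations, max_frames):
    """Greedily group consecutive phonemes into chunks.

    durations[i] is the number of frames of phoneme i. Returns a list of
    (start, end) phoneme index ranges, end exclusive. A chunk is closed
    when adding the next phoneme would exceed max_frames.
    """
    chunks, start, total = [], 0, 0
    for i, d in enumerate(durations):
        if i > start and total + d > max_frames:
            chunks.append((start, i))
            start, total = i, 0
        total += d
    if start < len(durations):
        chunks.append((start, len(durations)))
    return chunks


def stream_decode(frames_in, durations, decode, max_frames):
    """Yield mel frames chunk by chunk, using one phoneme of context per side."""
    # offsets[i] is the index of the first frame of phoneme i.
    offsets = [0]
    for d in durations:
        offsets.append(offsets[-1] + d)

    for start, end in split_into_chunks(durations, max_frames):
        # Pad with the last phoneme of the previous chunk and the first
        # phoneme of the following chunk, when they exist.
        left = max(start - 1, 0)
        right = min(end + 1, len(durations))
        mel = decode(frames_in[offsets[left]:offsets[right]])
        # Drop the frames that belong to the padded phonemes.
        head = offsets[start] - offsets[left]
        tail = offsets[right] - offsets[end]
        yield mel[head:len(mel) - tail]


# Toy usage: an identity "decoder" makes the bookkeeping easy to check.
durations = [3, 5, 2, 4, 6, 1]
frames_in = list(range(sum(durations)))
pieces = list(stream_decode(frames_in, durations, lambda x: list(x), max_frames=8))
out = [frame for piece in pieces for frame in piece]
```

With the identity decoder, concatenating the yielded pieces reproduces the input frames in order, which confirms that the padding and trimming cancel out. In a real system the first piece can be sent to the vocoder as soon as it is ready, which is where the latency reduction comes from.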