语音合成经典模型结构介绍

news2025/1/6 20:30:29

(以下内容搬运自 PaddleSpeech)

Models introduction

TTS system mainly includes three modules: Text Frontend, Acoustic model and Vocoder. We introduce a rule-based Chinese text frontend in cn_text_frontend.md. Here, we will introduce acoustic models and vocoders, which are trainable.

The main processes of TTS include:

  1. Convert the original text into characters/phonemes, through the text frontend module.
  2. Convert characters/phonemes into acoustic features, such as linear spectrogram, mel spectrogram, LPC features, etc. through Acoustic models.
  3. Convert acoustic features into waveforms through Vocoders.

A simple text frontend module can be implemented by rules. Acoustic models and vocoders need to be trained. The models provided by PaddleSpeech TTS are acoustic models and vocoders.

Acoustic Models

Modeling Objectives of Acoustic Models

Modeling the mapping relationship between text sequences and speech features:

text X = {x1,...,xM}
specch Y = {y1,...yN}

Modeling Objectives:

Ω = argmax p(Y|X,Ω)

Modeling process of Acoustic Models

At present, there are two mainstream acoustic model structures.

  • Frame level acoustic model:
    • Duration model (M Tokens - > N Frames).
    • Acoustic decoder (N Frames - > N Frames).

  • Sequence to sequence acoustic model:
    • M Tokens - > N Frames.

Tacotron2

Tacotron is the first end-to-end acoustic model based on deep learning, and it is also the most widely used acoustic model.

Tacotron2 is the Improvement of Tacotron.

Tacotron

Features of Tacotron:

  • Encoder.
    • CBHG.
    • Input: character sequence.
  • Decoder.
    • Global soft attention.
    • unidirectional RNN.
    • Autoregressive teacher force training (input real speech feature).
    • Multi frame prediction.
    • CBHG postprocess.
    • Vocoder: Griffin-Lim.

Advantage of Tacotron:

  • No need for complex text frontend analysis modules.
  • No need for an additional duration model.
  • Greatly simplifies the acoustic model construction process and reduces the dependence of speech synthesis tasks on domain knowledge.

Disadvantages of Tacotron:

  • The CBHG is complex and the amount of parameters is relatively large.
  • Global soft attention.
  • Poor stability for speech synthesis tasks.
  • In training, the less the number of speech frames predicted at each moment, the more difficult it is to train.
  • Phase problem in Griffin-Lim causes speech distortion during wave reconstruction.
  • The autoregressive decoder cannot be stopped during the generation process.

Tacotron2

Features of Tacotron2:

  • Reduction of parameters.
    • CBHG -> PostNet (3 Conv layers + BLSTM or 5 Conv layers).
    • remove Attention RNN.
  • Speech distortion caused by Griffin-Lim.
    • WaveNet.
  • Improvements of PostNet.
    • CBHG -> 5 Conv layers.
    • The input and output of the PostNet calculate L2 loss with real Mel spectrogram.
    • Residual connection.
  • Bad stop in an autoregressive decoder.
    • Predict whether it should stop at each moment of decoding (stop token).
    • Set a threshold to determine whether to stop generating when decoding.
  • Stability of attention.
    • Location-aware attention.
    • The alignment matrix of the previous time is considered at step t of the decoder.

You can find PaddleSpeech TTS’s tacotron2 with LJSpeech dataset example at examples/ljspeech/tts0.

TransformerTTS

Disadvantages of the Tacotrons:

  • Encoder and decoder are relatively weak at global information modeling
    • Vanishing gradient of RNN.
    • Fixed-length context modeling problem in CNN kernel.
  • Training is relatively inefficient.
  • The attention is not robust enough and the stability is poor.

Transformer TTS is a combination of Tacotron2 and Transformer.

Transformer

Transformer is a seq2seq model based entirely on an attention mechanism.

Features of Transformer:

  • Encoder.
    • N blocks based on self-attention mechanism.
    • Positional Encoding.
  • Decoder.
    • N blocks based on self-attention mechanism.
    • Add Mask to the self-attention in blocks to cover up the information after the t step.
    • Attentions between encoder and decoder.
    • Positional Encoding.

Transformer TTS

Transformer TTS is a seq2seq acoustic model based on Transformer and Tacotron2.

Motivations:

  • RNNs in Tacotron2 make the inefficiency of training.
  • Vanishing gradient of RNN makes the model’s ability to model long-term contexts weak.
  • Self-attention doesn’t contain any recursive structure which can be trained in parallel.
  • Self-attention can model global context information well.

Features of Transformer TTS:

  • Add conv based PreNet in encoder and decoder.
  • Stop Token in decoder controls when to stop autoregressive generation.
  • Add PostNet after decoder to improve the quality of synthetic speech.
  • Scaled position encoding.
    • Uniform scale position encoding may have a negative impact on input or output sequences.

Disadvantages of Transformer TTS:

  • The ability of position encoding for timing information is still relatively weak.
  • The ability to perceive local information is weak, and local information is more related to pronunciation.
  • Stability is worse than Tacotron2.

You can find PaddleSpeech TTS’s Transformer TTS with LJSpeech dataset example at examples/ljspeech/tts1.

FastSpeech2

Disadvantage of seq2seq models:

  • In the seq2seq model based on attention, no matter how to improve the attention mechanism, it’s difficult to avoid generation errors in the decoding stage.

Frame-level acoustic models use duration models to determine the pronunciation duration of phonemes, and the frame-level mapping does not have the uncertainty of sequence generation.

In seq2saq models, the concept of duration models is used as the alignment module of two sequences to replace attention, which can avoid the uncertainty in attention, and significantly improve the stability of the seq2saq models.

FastSpeech

Instead of using the encoder-attention-decoder based architecture as adopted by most seq2seq based autoregressive and non-autoregressive generation, FastSpeech is a novel feed-forward structure, which can generate a target mel spectrogram sequence in parallel.

Features of FastSpeech:

  • Encoder: based on Transformer.
  • Change FFN to CNN in self-attention.
    • Model local dependency.
  • Length regulator.
    • Use real phoneme durations to expand the output frame of the encoder during training.
  • Non-autoregressive decode.
    • Improve generation efficiency.

Length predictor:

  • Pretrain a TransformerTTS model.
  • Get alignment matrix of train data.
  • Calculate the phoneme durations according to the probability of the alignment matrix.
  • Use the output of the encoder to predict the phoneme durations and calculate the MSE loss.
  • Use real phoneme durations to expand the output frame of the encoder during training.
  • Use phoneme durations predicted by the duration model to expand the frame during prediction.
    • Attentrion can not control phoneme durations. The explicit duration modeling can control durations through duration coefficient (duration coefficient is 1 during training).

Advantages of non-autoregressive decoder:

  • The built-in duration model of the seq2seq model has converted the input length M to the output length N.
  • The length of the output is known, stop token is no longer used, avoiding the problem of being unable to stop.
    • Can be generated in parallel (decoding time is less affected by sequence length)

FastPitch

FastPitch follows FastSpeech. A single pitch value is predicted for every temporal location, which improves the overall quality of synthesized speech.


FastSpeech2

Disadvantages of FastSpeech:

  • The teacher-student distillation pipeline is complicated and time-consuming.
  • The duration extracted from the teacher model is not accurate enough.
  • The target mel spectrograms distilled from the teacher model suffer from information loss due to data simplification.

FastSpeech2 addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS.

Features of FastSpeech2:

  • Directly train the model with the ground-truth target instead of the simplified output from the teacher.
  • Introducing more variation information of speech as conditional inputs, extract duration, pitch, and energy from speech waveform and directly take them as conditional inputs in training and use predicted values in inference.

FastSpeech2 is similar to FastPitch but introduces more variation information of the speech.


You can find PaddleSpeech TTS’s FastSpeech2/FastPitch with CSMSC dataset example at examples/csmsc/tts3, We use token-averaged pitch and energy values introduced in FastPitch rather than frame-level ones in FastSpeech2.

SpeedySpeech

SpeedySpeech simplify the teacher-student architecture of FastSpeech and provide a fast and stable training procedure.

Features of SpeedySpeech:

  • Use a simpler, smaller, and faster-to-train convolutional teacher model (Deepvoice3 and DCTTS) with a single attention layer instead of Transformer used in FastSpeech.
  • Show that self-attention layers in the student network are not needed for high-quality speech synthesis.
  • Describe a simple data augmentation technique that can be used early in the training to make the teacher network robust to sequential error propagation.

You can find PaddleSpeech TTS’s SpeedySpeech with CSMSC dataset example at examples/csmsc/tts2.

Vocoders

In speech synthesis, the main task of the vocoder is to convert the spectral parameters predicted by the acoustic model into the final speech waveform.

Taking into account the short-term change frequency of the waveform, the acoustic model usually avoids direct modeling of the speech waveform, but firstly models the spectral features extracted from the speech waveform, and then reconstructs the waveform by the decoding part of the vocoder.

A vocoder usually consists of a pair of encoders and decoders for speech analysis and synthesis. The encoder estimates the parameters, and then the decoder restores the speech.

Vocoders based on neural networks usually is speech synthesis, which learns the mapping relationship from spectral features to waveforms through training data.

Categories of neural vocodes

  • Autoregression

    • WaveNet
    • WaveRNN
    • LPCNet
  • Flow

    • WaveFlow
    • WaveGlow
    • FloWaveNet
    • Parallel WaveNet
  • GAN

    • WaveGAN
    • Parallel WaveGAN
    • MelGAN
    • Style MelGAN
    • Multi Band MelGAN
    • HiFi GAN
  • VAE

    • Wave-VAE
  • Diffusion

    • WaveGrad
    • DiffWave

Motivations of GAN-based vocoders:

  • Modeling speech signals by estimating probability distribution usually has high requirements for the expression ability of the model itself. In addition, specific assumptions need to be made about the distribution of waveforms.
  • Although autoregressive neural vocoders can obtain high-quality synthetic speech, such models usually have a slow generation speed.
  • The training of inverse autoregressive flow vocoders is complex, and they also require the modeling capability of long-term context information.
  • Vocoders based on Bipartite Transformation converge slowly and are complex.
  • GAN-based vocoders don’t need to make assumptions about the speech distribution and train through adversarial learning.

Here, we introduce a Flow-based vocoder WaveFlow and a GAN-based vocoder Parallel WaveGAN.

WaveFlow

WaveFlow is proposed by Baidu Research.

Features of WaveFlow:

  • It can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than WaveGlow and several orders of magnitude faster than WaveNet.
  • It is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M).
  • It is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.

You can find PaddleSpeech TTS’s WaveFlow with LJSpeech dataset example at examples/ljspeech/voc0.

Parallel WaveGAN

Parallel WaveGAN trains a non-autoregressive WaveNet variant as a generator in a GAN-based training method.

Features of Parallel WaveGAN:

  • Use non-causal convolution instead of causal convolution.
  • The input is random Gaussian white noise.
  • The model is non-autoregressive both in training and prediction, which is fast
  • Multi-resolution STFT loss.

You can find PaddleSpeech TTS’s Parallel WaveGAN with CSMSC example at examples/csmsc/voc1.

P.S. 欢迎关注我们的 github repo PaddleSpeech, 是基于飞桨 PaddlePaddle 的语音方向的开源模型库,用于语音和音频中的各种关键任务的开发,包含大量基于深度学习前沿和有影响力的模型。

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/796.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

XDataverse免费的统一数据库管理工具

XDataverse产品简介 XDataverse是一款通用的数据库管理工具,主要管理关系型数据库,同时也支持一些其余类型的数据库,比如Redis。其主要功能有 支持主流关系型数据库的常规操作,比如MySQL,SQLServer,SQlite,SQLCE,Postg…

机器学习 逻辑回归(2)softmax回归多类别分类-鸢尾花案例

机器学习 逻辑回归之softmax回归多类别分类-鸢尾花案例一、前言二、假设函数三、One-Hot 独热编码四、代价函数五、梯度下降六、原生代码实现6.1 加载并查看数据6.2 添加前置与数据分割6.3 迭代训练6.4 验证数据七、sklearn代码实现八、参考资料PS:softmax回归损失函…

[时间序列预测]基于BP、LSTM、CNN-LSTM神经网络算法的单特征用电负荷预测[保姆级手把手教学]

系列文章目录 深度学习原理-----线性回归梯度下降法 深度学习原理-----逻辑回归算法 深度学习原理-----全连接神经网络 深度学习原理-----卷积神经网络 深度学习原理-----循环神经网络(RNN、LSTM) 时间序列预测-----基于BP、LSTM、CNN-LSTM神经网络…

安卓开发Android studio学习笔记14:用户注册登录(案例演示)

Android studio学习笔记第一步:配置activity_information.xml第二步:配置activity_registration.xml第三步:配置strings.xml第四步:配置InformationActivity第五步:配置RegistrationActivity第六步:运行结果…

二叉搜索树

文章目录二叉搜索树1. 概念2. 模拟实现二叉搜索树2.1 准备工作 创建类2.2 查找方法2.3 插入方法2.4 删除方法3. 性能分析二叉搜索树 前言 : 1. 概念 二叉搜索树又称二叉排序树,它或者是一棵空树,或者是具有以下性质的二叉树: 若它的左子树不…

学点高端技术:基于密度的聚类算法——FDBSCAN算法

机器学习、人工智能各类KNN算法层出不穷,DBSCAN具有强代表性,它是一个基于密度的聚类算法,最大的优点是能够把高密度区域划分为簇,能够在高噪声的条件下实现对目标的精准识别,但该算法当前已远不能满足人们对于高效率、…

零基础自学javase黑马课程第二天

零基础自学javase黑马课程第二天 ✨欢迎关注🖱点赞🎀收藏⭐留言✒ 🔮本文由京与旧铺原创,csdn首发! 😘系列专栏:java学习 💻首发时间:🎞2022年10月16日&#…

【电子技术基础(精华版)】直流稳压电路

前期我们了解了一些关于直流稳压电源的基础知识,为了更好地完善职教高考电子技术专业的需求,接下来我会更新【电子技术基础(精华版)】,从中可以让更多的职教高考生有效地复习。 由于本人是山东省的一位博主&#xff0…

3、SySeVR测试(上)

一、准备 1、将测试代码放在/home/test目录下; 2、将测试数据导入joern 在/home/SySeVR/joern-0.3.1查看是否存在.joernIndex文件,有的话,需要删除。 删除之后,将测试数据导入joern: java -jar /home/SySeVR/joern-0.3.1/bin/jo…

程序员的中年危机:那些能工作到45、50、60的程序员们,究竟具备了哪些能力?

程序员行业新技术发展迅猛,可以说是日新月异。也正是这个原因,中年危机成为我们必须面对和攻克的问题。 思考一个问题:那些能工作到45、50、甚至60的程序员们,究竟具备了哪些过人的能力? 就我过去的经历和观察来说&a…

A comprehensive overview of knowledge graph completion

摘要 知识图(KG)以其代表和管理海量知识的独特优势,为各种下游知识感知任务(如推荐和智能问答)提供了高质量的结构化知识。KGs的质量和完整性在很大程度上决定了下游任务的有效性。但由于知识产权制度的不完备性,知识产权制度中仍有大量有价值的知识缺失…

【《机器人技术》复习】

【《机器人技术》复习】1. 要求:2. 机械手运动解算问题2.1 自由度考点2.2 运动学方程2.3 动力学方程2.4 传感器2.5 编程题1. 要求: 本次大作业上交截止时间 之前,超时,本门课程判定不及格。 作业上交的格式如下 一律以 WORD 文档…

2022年江西省职业院校技能大赛“网络空间安全”比赛任务书

2022年江西省职业院校技能大赛“网络空间安全” 比赛任务书 一、竞赛时间 总计:360分钟 竞赛阶段竞赛阶段 任务阶段 竞赛任务 竞赛时间 分值 A模块 A-1 登录安全加固 180分钟 200分 A-2 本地安全策略配置 A-3 流量完整性保护 A-4 事件监控 A-5 …

求交叉链表头结点-面试必备

这里分享一下一个交叉链表的关键题目,觉得不错的小伙伴别忘了点赞支持 交叉链表无环链表思路代码有环链表思路代码总结无环链表 已知有两个链表(无环)相交,求出相交的头结点 思路 因为链表相交,所以最后一部分一定重…

每天五分钟机器学习:常用的参数寻优方法——k折交叉验证

本文重点 本文我们介绍一种常用的参数寻优方法--k折交叉验证,现在的数据集一般分为三类,分别为训练集,验证集,测试集。训练集用于训练模型,验证集用于调参,测试集用于测试调参之后的模型效果。 但是很多时…

SpringBoot+Vue实现前后端分离社区疫苗接种管理系统

文末获取源码 开发语言:Java 使用框架:spring boot 前端技术:JavaScript、Vue 、css3 开发工具:IDEA/MyEclipse/Eclipse、Visual Studio Code 数据库:MySQL 5.7/8.0 数据库管理工具:phpstudy/Navicat JDK版…

xray和burp联动

目录 xray下载安装CT Stack 安全社区 Burp和xray联动 xray下载安装下载地址:CT Stack 安全社区 先通过PowerShell打开xray所在的目录,运行,生成yaml文件 genca在目录下生成证书 生产证书后将证书导入浏览器 导入后在本地安装一下 Burp和xray…

WebdriverIO – 完整的初学者课程2022

WebdriverIO – 完整的初学者课程2022 从零开始学习和使用 JavaScript 实现 Webdriver IO!构建功能齐全的 Web 测试自动化框架 课程英文名:WebdriverIO - Complete Beginner Course 2022 此视频教程共1.0小时,中英双语字幕,画质…

SD-WAN不断冲击传统WAN架构

随着全球化数字信息转型,网络结构也是在不断的发展和完善。随着云时代的到来,传统的网络布局的局限性开始凸显出来。在过去几年广域网最重要的变化是软件定义广域网技术 (SD-WAN) 的广泛部署,它改变了网络专业人员优化和保护广域网连接的方式…

python基于PHP+MySQL的大学生宿舍管理系统

大学宿舍管理系统是信息时代的产物,它是学校宿管部门的一个好帮手。有了它不再需要繁重的纸质登记,有了它宿管员不在需要繁重的工作,一些公寓信息和住宿等基本信息可以由管理人员及时的对信息进行查询、更新、修改和删除,方便简易,且时效性高 基于PHP大学生宿舍管理系统采用当前…