Introduction to RawNet 1-3

news2025/1/24 15:00:11

1. Overview

RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification (RawNet 1), published at INTERSPEECH 2019.

(Paper: https://arxiv.org/pdf/1904.08104.pdf

Code: https://github.com/Jungjee/RawNet)

Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms (RawNet 2), published at INTERSPEECH 2020.

(Paper: https://arxiv.org/pdf/2004.00526v2.pdf

Code: https://github.com/Jungjee/RawNet)

Pushing the limits of raw waveform speaker recognition (RawNet 3), published at INTERSPEECH 2022.

(Paper: https://arxiv.org/pdf/2203.08488.pdf

Code: https://github.com/Jungjee/RawNet and https://github.com/clovaai/voxceleb_trainer)

Affiliations: Naver Corporation (South Korea), Korea Advanced Institute of Science and Technology

2. RawNet 1

Direct modeling of raw waveforms using deep neural networks. Using less processed features as input allows data-driven approaches with DNNs to yield more discriminative representations than knowledge-based acoustic features.

Advantages:

1. Minimization of pre-processing removes the need for exploration of various hyper-parameters such as the type of acoustic feature to use, window size, shift length, and feature dimension.

2. With recent trends of DNN replacing more sub-processes in various tasks, a raw waveform DNN is well positioned to benefit from future advances in deep learning.

A spectrogram-based CNN can only see fixed frequency regions, depending on the internal pooling rule.

2.1 Front-end: RawNet

RawNet adopts a convolutional neural network-gated recurrent unit (CNN-GRU) architecture.

1. Input features are first processed by the residual blocks to extract frame-level embeddings.

2. A GRU is then employed to aggregate the frame-level features into a single utterance-level embedding.

3. Utterance-level embedding is then fed into one fully-connected layer. The output of the fully-connected layer is used as the speaker embedding and is connected to the output layer, where the number of nodes is identical to the number of speakers in the training set.

2.2 Objective functions

The main objective is to minimize intra-class covariance and maximize inter-class covariance of utterance-level features.

To consider both inter-class and intra-class covariance, we utilize center loss and speaker basis loss in addition to categorical cross-entropy loss for DNN training.

Center loss function was proposed as

$$ L_C = \frac{1}{2} \sum_{i=1}^{N} \lVert x_i - c_{y_i} \rVert_2^2 $$

where $x_i$ refers to the embedding of the $i$-th utterance, $c_{y_i}$ refers to the center of class $y_i$, and $N$ refers to the size of a mini-batch.

Speaker basis loss aims to further maximize inter-class covariance:

$$ L_{BS} = \sum_{i=1}^{M} \sum_{j=1, j \neq i}^{M} \cos(w_i, w_j) $$

where $w_i$ is the basis vector of speaker $i$ and $M$ is the number of speakers within the training set.

This loss function considers a weight vector between the last hidden layer and a node of the softmax output layer as a basis vector for the corresponding speaker.

The final objective function:

$$ L = L_{CE} + \lambda L_C + L_{BS} $$

where $L_{CE}$ refers to categorical cross-entropy loss and $\lambda$ refers to the weight of $L_C$.
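The three terms can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the standard center loss (half the summed squared distance between each embedding and its class center), a speaker basis loss that sums cosine similarities over all ordered pairs of distinct basis vectors, and a cross-entropy term averaged over the mini-batch. All function names are hypothetical.

```python
import math


def center_loss(embeddings, labels, centers):
    """Half the summed squared distance between each embedding and its class center."""
    total = 0.0
    for x, y in zip(embeddings, labels):
        total += sum((xk - ck) ** 2 for xk, ck in zip(x, centers[y]))
    return 0.5 * total


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def speaker_basis_loss(weights):
    """Sum of cosine similarities over all ordered pairs of distinct basis vectors."""
    m = len(weights)
    return sum(cosine(weights[i], weights[j])
               for i in range(m) for j in range(m) if i != j)


def cross_entropy(logits, label):
    """Categorical cross-entropy for one sample, computed from raw logits."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def total_loss(logits_batch, embeddings, labels, centers, weights, lam):
    """L = L_CE + lam * L_C + L_BS (batch-averaged L_CE is an assumption)."""
    l_ce = sum(cross_entropy(z, y) for z, y in zip(logits_batch, labels)) / len(labels)
    return l_ce + lam * center_loss(embeddings, labels, centers) + speaker_basis_loss(weights)
```

Orthogonal basis vectors give a speaker basis loss of zero, while identical basis vectors give the maximum, which is why minimizing this term pushes speakers apart.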

2.3 DNN-based back-end classification

In speaker verification, cosine similarity and PLDA are widely used for back-end classification to determine whether two speaker embeddings belong to the same speaker.

RawNet 1 proposes a DNN back-end whose input is the concatenation of the speaker embedding, the test utterance embedding, and their element-wise multiplication (an element-wise binary operation on the two embeddings that represents their relationship). This part is not clear to me.
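As a rough illustration of the two back-end options, the sketch below computes a cosine similarity score and builds the concatenated input vector described above. The function names are hypothetical, and the classifier that would consume the concatenated vector is not shown.

```python
import math


def cosine_similarity(enrol, test):
    """Cosine similarity between an enrolment embedding and a test embedding."""
    dot = sum(e * t for e, t in zip(enrol, test))
    norm_e = math.sqrt(sum(e * e for e in enrol))
    norm_t = math.sqrt(sum(t * t for t in test))
    return dot / (norm_e * norm_t)


def backend_input(enrol, test):
    """Concatenate [enrol, test, enrol * test], where * is the element-wise product."""
    product = [e * t for e, t in zip(enrol, test)]
    return list(enrol) + list(test) + product
```

For two D-dimensional embeddings the back-end input has 3D dimensions, so the classifier sees both embeddings and their interaction.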

2.4 Experiment and results

Dataset: VoxCeleb1

3. RawNet 2

RawNet 1 vs RawNet 2

1. Replaced RawNet's first convolutional layer with a sinc-convolution layer.

2. Explored replacing the gated recurrent unit (GRU) layer of RawNet with self-attentive pooling and self-multi-head-attentive pooling mechanisms. (Not adopted: the GRU better aggregates frame-level representations into an utterance-level representation.)

3. Proposed scaling the filter axis of feature maps using a sigmoid-based mechanism (FMS).

4. Simplified the loss function from categorical cross-entropy (CCE), center, and speaker basis loss to CCE loss only.

5. Changed the training dataset from VoxCeleb1 to VoxCeleb2.

6. Applied a test time augmentation (TTA) method in the evaluation phase (20% overlap).
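One way to picture test time augmentation with overlapping segments is sketched below. Only the 20% overlap comes from the text. The window length, the handling of the tail, and averaging the segment embeddings are assumptions, and the function names are hypothetical.

```python
def segment_starts(num_samples, window, overlap=0.2):
    """Start indices of fixed-length windows that overlap by the given fraction."""
    hop = max(1, int(round(window * (1.0 - overlap))))
    if num_samples <= window:
        return [0]  # utterance shorter than one window: use it as a single segment
    starts = list(range(0, num_samples - window + 1, hop))
    if starts[-1] + window < num_samples:
        starts.append(num_samples - window)  # extra window so the tail is covered
    return starts


def average_embeddings(embeddings):
    """Element-wise mean of the per-segment embeddings."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]
```

Each segment would be passed through the speaker model, and the averaged embedding would then be scored against the enrolment embedding.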

3.1 sinc-convolution layer

It is a type of band-pass filter whose cut-off frequencies are parameters that are optimized jointly with the other DNN parameters.

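SincNet architecture (2018)

In a standard CNN, the first convolutional layer performs a time-domain convolution between the input waveform and a set of FIR filters. The convolution is defined as:

$$ y[n] = x[n] * h[n] = \sum_{l=0}^{L-1} x[l] \cdot h[n-l] $$

where x[n] is the speech signal, h[n] is a filter of length L, and y[n] is the filtered output.

In a standard CNN, all L weights of each filter have to be learned from data. In contrast, SincNet convolves the input signal with a predefined function g that has only a few learnable parameters θ:

$$ y[n] = x[n] * g[n, \theta] $$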

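Inspired by standard filtering in digital signal processing, a reasonable choice is to define g as a filter bank of rectangular band-pass filters. In the frequency domain, a band-pass filter can be written as the difference between two low-pass filters:

$$ G[f, f_1, f_2] = \mathrm{rect}\left(\frac{f}{2 f_2}\right) - \mathrm{rect}\left(\frac{f}{2 f_1}\right) $$

where f1 and f2 are the learned low and high cut-off frequencies, and rect is the rectangular window in the frequency domain.

After the inverse Fourier transform (IFT) back to the time domain, g becomes:

$$ g[n, f_1, f_2] = 2 f_2 \, \mathrm{sinc}(2 \pi f_2 n) - 2 f_1 \, \mathrm{sinc}(2 \pi f_1 n) $$

where the sinc function is defined as sinc(x) = sin(x)/x.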

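The cut-off frequencies can be initialized randomly in the range [0, fs/2], where fs is the sampling rate of the input signal. Alternatively, they can be initialized with the cut-off frequencies of a mel-scale filter bank, which has the advantage of placing more filters in the lower part of the spectrum, where crucial information about speaker identity is located.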
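A minimal sketch of mel-scale initialization is shown below, using the common conversion mel = 2595 log10(1 + f/700). Treating adjacent mel-spaced edges as the (f1, f2) pair of each filter, and the lowest edge `f_min`, are assumptions made for illustration. The function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Convert a mel value back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_cutoffs(num_filters, sample_rate, f_min=30.0):
    """Return (f1, f2) pairs whose edges are evenly spaced on the mel scale."""
    mel_low = hz_to_mel(f_min)
    mel_high = hz_to_mel(sample_rate / 2.0)  # upper edge at fs / 2
    step = (mel_high - mel_low) / num_filters
    edges = [mel_to_hz(mel_low + i * step) for i in range(num_filters + 1)]
    return list(zip(edges[:-1], edges[1:]))
```

Because the mel scale compresses high frequencies, the resulting bands are narrow at low frequencies and wide at high frequencies, which is the "more filters in the lower part" effect described above.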

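To ensure f1 >= 0 and f2 >= f1, f1 and f2 in the formula above are in practice replaced by:

$$ f_1^{abs} = |f_1|, \qquad f_2^{abs} = f_1 + |f_2 - f_1| $$

Note that f2 is not actually forced to stay below the Nyquist frequency, because the authors observed that this constraint is naturally satisfied during training. In addition, the filters do not learn a gain at this stage. The gain is learned by the subsequent layers.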

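An ideal band-pass filter (perfectly flat passband and infinite attenuation in the stopband) requires an infinite number of filter weights L. Truncating g only yields an approximation of the ideal filter, with ripples in the passband and limited attenuation in the stopband. One way to mitigate this is windowing: the truncated g is multiplied by a window function w, which smooths the abrupt discontinuities at the ends of g. The paper uses the Hamming window:

$$ g_w[n, f_1, f_2] = g[n, f_1, f_2] \cdot w[n], \qquad w[n] = 0.54 - 0.46 \cos\left(\frac{2 \pi n}{L}\right) $$

The Hamming window gives high frequency selectivity. However, the results show no significant difference when other window functions are used.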
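The construction of a single sinc filter can be sketched in plain Python as follows. It evaluates g[n] = 2 f2 sinc(2π f2 n) − 2 f1 sinc(2π f1 n) on a window centred at n = 0, applies the |f1| and f1 + |f2 − f1| constraints, and multiplies by a Hamming window. Dividing the cut-offs by the sampling rate (so they are in cycles per sample), centring the taps, and using the symmetric Hamming form with L − 1 in the denominator are assumptions of this sketch. The function names are hypothetical, and `magnitude_at` is only a helper for inspecting the result.

```python
import math


def sinc(x):
    """sinc(x) = sin(x) / x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def sinc_bandpass(f1, f2, length, sample_rate):
    """Hamming-windowed band-pass filter taps for cut-offs f1 and f2 in Hz."""
    low = abs(f1)                # enforce f1 >= 0
    high = low + abs(f2 - f1)    # enforce f2 >= f1
    low_n = low / sample_rate    # cycles per sample (assumed normalization)
    high_n = high / sample_rate
    half = (length - 1) / 2.0
    taps = []
    for i in range(length):
        n = i - half             # centre the filter around n = 0
        g = (2.0 * high_n * sinc(2.0 * math.pi * high_n * n)
             - 2.0 * low_n * sinc(2.0 * math.pi * low_n * n))
        w = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
        taps.append(g * w)
    return taps


def magnitude_at(taps, freq_hz, sample_rate):
    """Magnitude of the filter's frequency response at one frequency."""
    omega = 2.0 * math.pi * freq_hz / sample_rate
    real = sum(t * math.cos(omega * i) for i, t in enumerate(taps))
    imag = sum(t * math.sin(omega * i) for i, t in enumerate(taps))
    return math.hypot(real, imag)
```

For example, `sinc_bandpass(1000.0, 2000.0, 101, 16000)` passes a 1500 Hz tone almost unchanged and strongly attenuates a 4000 Hz tone. Only the two cut-offs would be trainable in a sinc-convolution layer, however long the filter is.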

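All operations involved in SincNet are fully differentiable, so the filter cut-off frequencies can be optimized jointly with the other CNN parameters using SGD or other gradient-based optimization methods.

SincNet architecture: the first layer is a sinc convolution, followed by a standard CNN pipeline (pooling, normalization, activation, dropout). Multiple standard convolutional or fully-connected layers are then stacked, and a softmax classifier finally performs speaker classification.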

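The proposed SincNet has the following properties:

  • Fast convergence
  • Few parameters
  • Computational efficiency
  • Interpretability

SincNet outperforms the other models. The DNN-class approach performs better, but the DNN must be fine-tuned for every new speaker, so it is less flexible than d-vector.

For details, see Mirco Ravanelli and Yoshua Bengio, "Speaker Recognition from raw waveform with SincNet" (2018).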

3.2 filter-wise feature map scaling (FMS)

The FMS uses a scale vector whose dimension is identical to the number of filters with values between 0 and 1 derived using sigmoid activation.

We also propose various methods to utilize the FMS to scale given feature maps, i.e., multiplication, addition, and applying both.

Let c = [c1; c2; ... ; cF] be the feature map of a residual block, where each cf is a vector of length T, T is the sequence length in time, and F is the number of filters.

1. Derive a scale vector for FMS by first performing global average pooling on the time axis.

2. Feed the result through a fully-connected layer followed by sigmoid activation.

3. Apply different operations (addition, multiplication, etc.) to scale the given feature maps.
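The three steps above can be sketched as follows. This is a minimal illustration rather than the authors' code: the fully-connected layer is assumed to map F inputs to F outputs, the weights are passed in explicitly, and the combined mode is implemented as "multiply, then add", which is only one possible way of applying both operations. The names `fms`, `fc_weight`, `fc_bias` and `mode` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fms(feature_map, fc_weight, fc_bias, mode="mul"):
    """Scale a feature map of F filters by T time steps with one value per filter.

    fc_weight is an F x F matrix and fc_bias has length F (assumed layer shape).
    """
    # Step 1: global average pooling over the time axis gives one value per filter.
    pooled = [sum(row) / len(row) for row in feature_map]
    # Step 2: fully-connected layer followed by sigmoid gives scales in (0, 1).
    scale = [sigmoid(sum(w * p for w, p in zip(w_row, pooled)) + b)
             for w_row, b in zip(fc_weight, fc_bias)]
    # Step 3: apply the per-filter scale to every time step of that filter.
    scaled = []
    for row, s in zip(feature_map, scale):
        if mode == "mul":
            scaled.append([v * s for v in row])
        elif mode == "add":
            scaled.append([v + s for v in row])
        elif mode == "mul_add":
            scaled.append([v * s + s for v in row])
        else:
            raise ValueError("unknown mode: " + mode)
    return scaled
```

Multiplication rescales each filter's activations, while addition shifts them by a value between 0 and 1, so the two operations affect the feature map in different ways.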

3.3 Experiment and results

Among application of various related methods, System #6 (SE) demonstrated the best result.

4. RawNet 3

Even the latest raw waveform architecture demonstrates an equal error rate (EER) of 1.29%, whereas the widely adopted ECAPA-TDNN architecture and its variants have consistently reported EERs under 1%.

The authors therefore propose a new model architecture that combines several recent advances in deep learning with RawNet2 to overcome this challenge.

Contributions:

1. We propose a new raw waveform speaker recognition architecture, namely RawNet3, that demonstrates EER under 1% in the VoxCeleb1 evaluation protocol;
2. We explore raw waveform speaker verification model with a self-supervised learning framework for the first time and outperform contrastive-based existing works;
3. We demonstrate the effectiveness of self-supervised pretraining under semi-supervised learning scenario.

4.1 Framework

ECAPA-TDNN vs RawNet 3

SE-Res2Block vs AFMS-Res2MP-block

RawNet 2 vs RawNet 3

1. A parameterised analytic filterbank layer is utilised instead of the sinc-convolution layer.

2. Log and mean normalisation is applied to the analytic filterbank output.

3. The number of backbone blocks and their connections have been adapted, following an ECAPA-TDNN-like topology.

4. Channel- and context-dependent statistic pooling replaces the uni-directional gated recurrent unit layer.

4.2 Supervised Learning

Objective function: AAM-softmax
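For readers unfamiliar with AAM-softmax (additive angular margin softmax), the sketch below computes the loss for one sample in plain Python. It follows the usual formulation: the logits are scaled cosines between the normalized embedding and the normalized class weights, with a margin added to the angle of the target class. The margin and scale values are illustrative assumptions, not the settings used for RawNet3, and the special handling that real implementations apply when the angle plus margin exceeds π is omitted.

```python
import math


def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def aam_softmax_loss(embedding, class_weights, label, margin=0.3, scale=30.0):
    """AAM-softmax loss for a single embedding (illustrative margin and scale)."""
    e = unit(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, unit(w)))
        if j == label:
            # Add the angular margin to the target class only.
            theta = math.acos(max(-1.0, min(1.0, cos_theta)))
            cos_theta = math.cos(theta + margin)
        logits.append(scale * cos_theta)
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]
```

With `margin=0.0` this reduces to a softmax cross-entropy over scaled cosine similarities. A positive margin makes the target class harder to satisfy, which encourages tighter clusters of same-speaker embeddings.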

4.3 Self-Supervised Learning (DINO - Not Clear)

The DINO framework is one of the most competitive frameworks for self-supervised learning.

DINO involves a teacher and a student network with an identical architecture but different parameters.

The DINO loss is then defined as:
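As a rough guide, a DINO-style loss can be sketched as below. It follows the original DINO formulation rather than anything stated in this post: the teacher output is centered and sharpened with a low temperature, the student output uses a higher temperature, and cross-entropy is accumulated over every pair of a teacher global view and a different student view. The temperatures, the averaging over pairs, and all names are assumptions.

```python
import math


def softmax(values, temperature):
    """Temperature-scaled softmax."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(values, temperature):
    """Temperature-scaled log-softmax, computed stably."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_sum for s in scaled]


def dino_loss(teacher_outputs, student_outputs, center, t_temp=0.04, s_temp=0.1):
    """Average cross-entropy between teacher and student over differing views.

    teacher_outputs holds the outputs for the global views. student_outputs
    holds the outputs for all views, with the global views first and in the
    same order, so equal indices refer to the same view.
    """
    total, pairs = 0.0, 0
    for i, t_out in enumerate(teacher_outputs):
        p_teacher = softmax([v - c for v, c in zip(t_out, center)], t_temp)
        for j, s_out in enumerate(student_outputs):
            if i == j:
                continue  # skip the pair where both networks see the same view
            log_p_student = log_softmax(s_out, s_temp)
            total -= sum(p * lp for p, lp in zip(p_teacher, log_p_student))
            pairs += 1
    return total / pairs
```

In the original DINO setup only the student is trained by gradient descent on this loss, while the teacher parameters follow an exponential moving average of the student parameters, which is how the two networks share an architecture but hold different parameters.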

4.4 Experiment

Dataset: VoxCeleb1 and VoxCeleb2

Supervised Learning as below:

Self-supervised Learning as below:

Semi-supervised Learning (pre-train the model using the DINO self-supervised learning framework, then fine-tune the model using supervised learning with ground truth label-based classification) as below: 
