Speakers
Speaker bio: Yi Luo received his Ph.D. from Columbia University in 2021 and then joined Tencent AI Lab, Shenzhen, as a Senior Researcher. His research focuses on audio front-end processing, including but not limited to audio separation and single-/multi-channel speech enhancement.
Talk title: Advances in Audio and Speech Front-End Processing at Tencent AI Lab
Abstract: This talk presents the research progress of the audio and speech front-end processing team at Tencent AI Lab in audio separation, speech enhancement, and multi-channel speech processing, covering the team's explorations in data simulation, model design, and application scenarios.
Speaker bio: Wei Xue (雪巍) is currently an Assistant Professor at the Division of Emerging Interdisciplinary Areas (EMIA), Hong Kong University of Science and Technology (HKUST). He received his Bachelor's degree in automatic control from Huazhong University of Science and Technology in 2010, and his Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, in 2015. From August 2015 to September 2018 he was first a Marie Curie Experienced Researcher and then a Research Associate in the Speech and Audio Processing Group, Department of Electrical & Electronic Engineering, Imperial College London, UK. He was a Senior Research Scientist at JD AI Research, Beijing, from November 2018 to December 2021, where he led the R&D on front-end speech processing and acoustic modelling for robust speech recognition. From January 2022 to April 2023 he was an Assistant Professor at the Department of Computer Science, Hong Kong Baptist University. He was a visiting scholar at Université de Toulon and KU Leuven. Wei's research interests are in speech and music intelligence, including AI music generation, speech enhancement and separation, room acoustics, as well as speech and audio event recognition. He is a former Marie Curie Fellow and was selected into the Beijing Overseas Talent Aggregation Project. He currently leads the AI music research in the theme-based Art-Tech project, which received a total of HK$52.8 million from the Hong Kong RGC.
Talk title: Audio Content Generation: Building Digitalized Human and Humanized AI
Abstract: We are entering a new era in which the real and virtual worlds are indistinguishable; interactions between the real and virtual worlds remove the physical barriers between people and define new ways of entertainment, healthcare, and communication. Building a new generation of audio content generation and interaction among humans, machines, and the environment is therefore essential. In this talk we will introduce our progress over recent months. Specifically, we will introduce how to digitalize the voice of an arbitrary person to produce a virtual singer, which empowers the AI choir in the world's first human-machine collaborative symphony orchestra in Hong Kong; we will also introduce CoMoSpeech, which adopts the consistency model for speech synthesis and achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, making diffusion-sampling-based speech synthesis truly practical.
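The speed-up claimed for CoMoSpeech comes from the defining property of consistency models: instead of iterating a denoising network over many diffusion steps, a consistency function maps a noisy sample directly to the clean endpoint of its trajectory, so synthesis needs only one (or a few) network evaluations. Below is a minimal PyTorch sketch of this sampling idea; the model `f_theta`, the conditioning `text_cond`, the mel-spectrogram shapes, and `sigma_max` are hypothetical placeholders, not the actual CoMoSpeech implementation.

```python
import torch

@torch.no_grad()
def consistency_sample(f_theta, text_cond, n_mels=80, n_frames=400, sigma_max=80.0):
    """One-step generation: a consistency model f_theta is trained so that
    f_theta(x_t, t, cond) maps any noisy point on the diffusion trajectory
    directly to the clean endpoint, so a single forward pass suffices."""
    x_T = sigma_max * torch.randn(1, n_mels, n_frames)  # start from pure noise
    t = torch.full((1,), sigma_max)                     # the initial noise level
    mel = f_theta(x_T, t, text_cond)                    # one network call,
    return mel                                          # vs. 50-1000 for a DDPM sampler
```

A vocoder would then convert the sampled mel-spectrogram into a waveform; the roughly 150x real-time figure reflects collapsing tens to hundreds of sampler calls into a single evaluation.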
Speaker bio: My name is Yuancheng Wang. I am currently a senior undergraduate majoring in computer science at The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), supervised by Prof. Zhizheng Wu, and a research intern at Microsoft Research Asia, working closely with Xu Tan. Recently, I have been focusing on diffusion-based audio and speech generation.
Talk title: Audio Editing by Following Instructions with Latent Diffusion Models
Abstract: Audio editing is applicable for various purposes, such as adding background sound effects, replacing a musical instrument, and repairing damaged audio. Recently, some diffusion-based methods have achieved zero-shot audio editing by using a diffusion and denoising process conditioned on the text description of the output audio. However, these methods still have some problems: 1) they have not been trained on editing tasks and cannot ensure good editing effects; 2) they can erroneously modify audio segments that do not require editing; 3) they need a complete description of the output audio, which is not always available or necessary in practical scenarios. In this work, we propose AUDIT, an instruction-guided audio editing model based on latent diffusion models. Specifically, AUDIT has three main design features: 1) we construct triplet training data (instruction, input audio, output audio) for different audio editing tasks and train a diffusion model that takes the instruction and the input (to-be-edited) audio as conditions and generates the output (edited) audio; 2) it can automatically learn to modify only the segments that need to be edited by comparing the difference between the input and output audio; 3) it only needs edit instructions instead of full target audio descriptions as text input. AUDIT achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks (e.g., adding, dropping, replacement, inpainting, super-resolution).
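To make the triplet-based training concrete, here is a minimal PyTorch sketch of one conditional latent-diffusion training step consistent with the design described above: the noise predictor sees both the edit instruction and the latent of the input audio while learning to denoise the output-audio latent. The module names (`vae`, `text_encoder`, `unet`, `scheduler`) and the channel-concatenation conditioning are illustrative assumptions, not the released AUDIT code.

```python
import torch
import torch.nn.functional as F

def editing_training_step(vae, text_encoder, unet, scheduler,
                          instruction, input_audio, output_audio):
    """One training step on an (instruction, input audio, output audio) triplet:
    diffuse the output-audio latent and predict the added noise, conditioning
    the U-Net on the edit instruction and the clean input-audio latent."""
    z_in = vae.encode(input_audio)            # latent of the audio to be edited
    z_out = vae.encode(output_audio)          # latent of the desired edited audio
    cond_text = text_encoder(instruction)     # e.g. "add rain in the background"

    noise = torch.randn_like(z_out)
    t = torch.randint(0, scheduler.num_steps, (z_out.shape[0],))
    z_noisy = scheduler.add_noise(z_out, noise, t)

    # Concatenating the input-audio latent with the noisy target latent lets the
    # model compare the two and learn to change only the segments the
    # instruction refers to.
    eps_pred = unet(torch.cat([z_noisy, z_in], dim=1), t, cond_text)
    return F.mse_loss(eps_pred, noise)
```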
Speaker bio: Dr. Yannan Wang received his B.S. and Ph.D. degrees in Electrical Engineering and Information Science from the University of Science and Technology of China in 2011 and 2017, respectively. He is currently a Senior Researcher at Tencent Ethereal Lab. His research interests include speech enhancement, speech separation, voice conversion, voice activity detection, speech dereverberation, acoustic scene classification, and sound event localization and detection.
Talk title: Speech Signal Improvement in Real-Time Communication
Abstract: Real-time communication (RTC) systems, especially teleconferencing systems, have become a necessity in people's life and work. Speech quality is the key element of the communication experience. However, various problems degrade speech quality, including acoustic capture conditions, noise/reverberation corruption, poor device acquisition performance, and network congestion. In this talk I present our attempts to improve the speech signal, especially in far-field scenarios where the environment is complex and the SNR is low. In the future we plan to devote more effort to further types of speech quality improvement tasks.
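As background on what front-end "speech signal improvement" typically means computationally, the sketch below shows a generic mask-based enhancement pipeline in PyTorch: estimate a time-frequency mask from the noisy spectrogram and resynthesize the waveform. `mask_net` is a hypothetical denoising network; this is a textbook baseline under stated assumptions, not Tencent's system.

```python
import torch

def mask_based_enhancement(noisy_wav, mask_net, n_fft=512, hop=128):
    """Generic mask-based enhancement: estimate a [0, 1] time-frequency mask
    on the noisy STFT and resynthesize. mask_net is a hypothetical network
    mapping a magnitude spectrogram to a mask of the same shape."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy_wav, n_fft, hop, window=window, return_complex=True)
    mask = mask_net(spec.abs())            # attenuate noise-dominated TF bins
    enhanced = torch.istft(spec * mask, n_fft, hop, window=window)
    return enhanced
```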
Speaker bio: Wan Ding is currently an Expert Engineer in the Humanoid Robotics Business Unit at UBTech Robotics. He received his bachelor's degree from Wuhan University and his Ph.D. from Central China Normal University, and previously held postdoctoral and scientist (Scientist I) positions at the Institute for Infocomm Research, A*STAR, Singapore, where his main research directions were multimodal emotion recognition and multimodal speech synthesis. He joined UBTech in 2019, where he leads the R&D and productization of core algorithms for UBTech's online/offline speech synthesis technology. He contributed to the "Technical Specification for Virtual Digital Humans Supporting Voice and Visual Interaction". His honors include first place in the EmotioNet 2017 facial action unit recognition challenge, second place in the MEC 2017 multimodal emotion recognition competition, and the ACII Asia 2018 Outstanding Paper Award.
Talk title: UBTech's Multimodal Machine Learning Technology
Abstract: Humanoid robot products rely on multimodal information to achieve accurate perception and expression. Compared with traditional methods, deep-learning-based multimodal recognition and synthesis achieve better results, but deployment still requires attention to issues such as overfitting and real-time performance. This talk introduces some of UBTech's work on multimodal machine learning, including multimodal emotion recognition, multimodal depression detection, and 2D digital human synthesis.
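For readers unfamiliar with the term, a minimal late-fusion design for multimodal emotion recognition is sketched below in PyTorch: each modality is encoded separately and the embeddings are concatenated for classification. All dimensions and module choices are hypothetical; this is a generic baseline, not UBTech's model.

```python
import torch
import torch.nn as nn

class LateFusionEmotion(nn.Module):
    """Hypothetical late-fusion baseline: encode audio and visual feature
    sequences separately, then classify the concatenated embeddings."""
    def __init__(self, audio_dim=40, visual_dim=136, hidden=128, n_classes=7):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.visual_enc = nn.GRU(visual_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio_feats, visual_feats):
        _, h_a = self.audio_enc(audio_feats)    # final hidden state per stream
        _, h_v = self.visual_enc(visual_feats)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=-1)
        return self.classifier(fused)           # emotion logits
```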
How to Participate
This event will be held in a combined online and offline format.
{ 1. Attend Offline }
Click the link below and fill in the registration information
👇👇👇
Registration form
{ 2. Attend Online }
The live stream will be hosted on CSDN and can be watched on both mobile and PC
👇👇👇
https://live.csdn.net/room/weixin_48827824/YOuhJcTg
Prizes
Offline attendees will have the chance to win the following through a lucky draw:
A SpeechHome (语音之家) baseball cap