
Video link: RAD-NeRF

https://me.kiui.moe/radnerf/videos/obama_intro.mp4


From Audio to Talking Heads in Real-Time with AI! RAD-NeRF explained

Efficient NeRFs for Real-Time Portrait Synthesis (RAD-NeRF)

We’ve heard of deepfakes, we’ve heard of NeRFs, and we’ve seen applications like these that let you recreate someone’s face and make them say pretty much whatever you want.

What you might not know is how inefficient those methods are and how much computing power and time they require. Plus, we only see the best results. Keep in mind that what we see online are the results for the faces with the most available footage, basically internet personalities, and that the models producing those results are trained with a lot of compute, meaning expensive resources like many graphics cards. Still, the results are really impressive and only getting better.

Fortunately, some researchers like Jiaxiang Tang and colleagues are working on making those methods more accessible and efficient with a new model called RAD-NeRF.

From a single video, they can synthesize a person saying pretty much any word or sentence in real time, with better quality than previous work. In other words, you can animate a talking head to follow any audio track in real time. This is both so cool and so scary at the same time.

Just imagine what could be done if anyone could make you say anything. At least they still need access to a video of you speaking in front of the camera for five minutes, so it’s hard to pull off without you knowing. Still, as soon as enough footage of you appears online, anyone could use such a model to create endless videos of you talking about anything they want. They could even host live streams with this method, which is even more dangerous and makes it even harder to tell what’s true. Anyway, even though this is interesting, and I’d love to hear your thoughts in the comments to keep the discussion going, here I want to cover something that is only positive and exciting: the science. More precisely, how did they manage to animate talking heads in real time, from any audio, using only a video of the face…

As they state, their RAD-NeRF model can run 500 times faster than previous work, with better rendering quality and more control. You may ask how that is possible. We usually trade quality for efficiency, yet here they manage to improve both substantially.

These immense improvements are possible thanks to three main points.

The first two are related to the architecture of the model: more specifically, how they adapted the NeRF approach to make it more efficient and to improve the motion of the torso and head.

The first step is to make the NeRF itself more efficient (the yellow rectangle in the overview of the model). I won’t dive into how NeRFs work since we have covered them numerous times. Basically, a NeRF is a neural-network-based approach for reconstructing a 3D volumetric scene from a bunch of regular 2D images. This is why they take a video as input: it gives you many images of a person from many different angles.
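To make the idea concrete, here is a minimal sketch of the kind of MLP a vanilla NeRF uses: a 3D point and a viewing direction go in, a volume density and an RGB color come out. This is illustrative only, not the authors’ code, and the layer sizes and names are made up:

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal vanilla-NeRF-style MLP: a 3D point plus a viewing
    direction in, a volume density and an RGB color out."""
    def __init__(self, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid())

    def forward(self, xyz, view_dir):
        h = self.backbone(xyz)                               # per-point features
        sigma = torch.relu(self.density_head(h))             # density >= 0
        rgb = self.color_head(torch.cat([h, view_dir], -1))  # view-dependent color
        return sigma, rgb
```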

A NeRF typically uses a network to predict the color and density of every point along the camera rays from the viewpoint you are visualizing, and it repeats that for every viewpoint you want to show when rotating around the subject. This is extremely computation-hungry, since you are predicting multiple parameters for every coordinate in every frame. Plus, in their case, it isn’t just a NeRF producing a 3D scene: it also has to match an audio input and fit the lips, mouth, eyes, and movements to what the person is saying.
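Here is roughly what the rendering accumulation looks like for a single ray, as a sketch with made-up variable names. The point is the cost: with, say, 256 samples per ray on a 512×512 image, that is already about 67 million network queries for a single frame, which is exactly where a 500× speedup matters:

```python
import torch

def render_ray(sigmas, rgbs, deltas):
    """Standard volume-rendering accumulation for a single ray.
    sigmas: (S,) densities, rgbs: (S, 3) colors, deltas: (S,) sample spacings.
    Every pixel needs all S network queries, which is what makes
    naive NeRF rendering so expensive."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)        # opacity of each sample
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1] + 1e-10]), dim=0)
    weights = alphas * transmittance                  # contribution of each sample
    return (weights[:, None] * rgbs).sum(dim=0)       # final pixel color
```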

Instead of predicting all the densities and colors matching the audio for a specific frame in one big network, they work with two separate, condensed feature spaces, a grid-based NeRF. They map the 3D coordinates into a small 3D feature grid and the audio into a small 2D feature grid, and only then send both to render the head. This means they never merge the audio data into the high-dimensional spatial input, which would blow up the input space by attaching extra audio dimensions to every 3D coordinate. Reducing the size of the audio features while keeping the audio and spatial features separate is what makes the approach so much more efficient.
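Here is a sketch of what that decomposition could look like. The grid resolutions, feature sizes, and names below are my assumptions for illustration, not the paper’s exact design: the 3D coordinate and the 2D audio coordinate each look up their own small learnable feature grid, and the two features only meet in a tiny MLP head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedGridEncoder(nn.Module):
    """Sketch of the audio-spatial decomposition idea: one small 3D grid
    for spatial coordinates, one small 2D grid for audio features,
    queried independently and concatenated only at the tiny head."""
    def __init__(self, feat_dim=8, spatial_res=32, audio_res=16):
        super().__init__()
        self.spatial_grid = nn.Parameter(
            torch.randn(1, feat_dim, spatial_res, spatial_res, spatial_res) * 0.01)
        self.audio_grid = nn.Parameter(
            torch.randn(1, feat_dim, audio_res, audio_res) * 0.01)
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 4))  # (density, r, g, b)

    def forward(self, xyz, audio_uv):
        # xyz: (N, 3) in [-1, 1]; audio_uv: (N, 2) in [-1, 1]
        n = xyz.shape[0]
        # trilinear lookup in the 3D spatial grid
        spatial_feat = F.grid_sample(
            self.spatial_grid, xyz.view(1, n, 1, 1, 3),
            align_corners=True).view(-1, n).t()        # (N, feat_dim)
        # bilinear lookup in the 2D audio grid
        audio_feat = F.grid_sample(
            self.audio_grid, audio_uv.view(1, n, 1, 2),
            align_corners=True).view(-1, n).t()        # (N, feat_dim)
        out = self.head(torch.cat([spatial_feat, audio_feat], dim=-1))
        sigma, rgb = torch.relu(out[..., 0]), torch.sigmoid(out[..., 1:])
        return sigma, rgb
```

A grid lookup is just interpolation between a few stored feature vectors, so each query is far cheaper than pushing a high-dimensional coordinate-plus-audio vector through a large MLP.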

But how can the results be better if they use condensed spaces with less information? By adding a few controllable features, like an eye-blinking control, to the grid-based NeRF, the model learns more realistic eye behavior than previous approaches, something really important for realism.
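Concretely, such a control can be as simple as one extra conditioning scalar appended to the input of the small MLP head. This is a hypothetical illustration, not the paper’s exact interface:

```python
import torch
import torch.nn as nn

# hypothetical: extend the tiny head with one extra blink scalar,
# 0 = eyes open, 1 = eyes closed; the network learns to tie this
# value to eyelid geometry during training
feat_dim = 8
head = nn.Sequential(
    nn.Linear(2 * feat_dim + 1, 64), nn.ReLU(),
    nn.Linear(64, 4))  # (density, r, g, b)

spatial_feat = torch.randn(5, feat_dim)  # dummy grid features
audio_feat = torch.randn(5, feat_dim)
blink = torch.full((5, 1), 0.8)          # mostly closed eyes
out = head(torch.cat([spatial_feat, audio_feat, blink], dim=-1))
```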

The second improvement (the green rectangle in the overview of the model) is to model the torso with a separate module instead of trying to model it with the same NeRF used for the head. The torso needs far fewer parameters, since the goal here is to animate a moving head, not a whole body. Because the torso is pretty much static in these videos, they use a much simpler and more efficient NeRF-based module that works directly in 2D image space, instead of casting camera rays as NeRFs usually do to render many different angles, which aren’t needed for the torso. It is much more efficient simply because they tailored the approach to this specific case of a rigid torso under a moving head. They then recompose the head with the torso to produce the final video.
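Here is a hedged sketch of what such a 2D image-space torso module and the final head-over-torso compositing could look like; the input sizes and the pose conditioning are my assumptions:

```python
import torch
import torch.nn as nn

class Torso2D(nn.Module):
    """Sketch of a 2D image-space torso module: no camera rays, just a
    map from pixel coordinates (conditioned on head pose) to RGBA."""
    def __init__(self, pose_dim=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))  # RGB + alpha

    def forward(self, uv, pose):
        # uv: (N, 2) pixel coords in [-1, 1]; pose: (N, pose_dim)
        out = self.net(torch.cat([uv, pose], dim=-1))
        return torch.sigmoid(out[..., :3]), torch.sigmoid(out[..., 3:])

def composite(head_rgb, head_alpha, torso_rgb, torso_alpha):
    """Alpha-composite the rendered head over the torso layer."""
    alpha = head_alpha + torso_alpha * (1 - head_alpha)
    rgb = head_rgb * head_alpha + torso_rgb * torso_alpha * (1 - head_alpha)
    return rgb, alpha
```

Because the torso network only ever sees 2D pixel coordinates, it skips the per-ray sampling loop entirely, which is where the savings come from.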

And voilà! This is how you efficiently produce talking-head videos for any audio input!

Of course, this was just an overview of this exciting new research publication. They also make other modifications during training to make the model more efficient, which is the third point I mentioned at the beginning of the article. I invite you to read their paper for more information; the link is in the references below.

Before you leave, I just wanted to thank the people who recently supported this channel through Patreon. This is not necessary and is strictly to support the work I do here.

Huge thanks to Artem Vladykin, Leopoldo Altamirano, Jay Cole, Michael Carychao, Daniel Gimness, and a few anonymous generous donors. If you want to and can afford to support my work financially on Patreon, it would be greatly appreciated, but no worries if not. Sincere feedback on this article is all I need to be happier!

I hope you’ve enjoyed this article and I will see you next week with another amazing paper!

References

►Tang, J., Wang, K., Zhou, H., Chen, X., He, D., Hu, T., Liu, J., Zeng, G. and Wang, J., 2022. Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition. arXiv preprint arXiv:2211.12368.
►Results/project page: https://me.kiui.moe/radnerf/
