FlashSpeech、ID-Animator、TalkingGaussian、FlowMap、CutDiffusion

news2025/1/12 6:56:24

本文首发于公众号:机器感知

FlashSpeech、ID-Animator、TalkingGaussian、FlowMap、CutDiffusion

图片

Gradient Guidance for Diffusion Models: An Optimization Perspective

图片

Diffusion models have demonstrated empirical successes in various applications and can be adapted to task-specific needs via guidance. This paper introduces a form of gradient guidance for adapting or fine-tuning diffusion models towards user-specified optimization objectives. We study the theoretic aspects of a guided score-based sampling process, linking the gradient-guided diffusion model to first-order optimization. We show that adding gradient guidance to the sampling process of a pre-trained diffusion model is essentially equivalent to solving a regularized optimization problem, where the regularization term acts as a prior determined by the pre-training data. Diffusion models are able to learn data's latent subspace, however, explicitly adding the gradient of an external objective function to the sample process would jeopardize the structure in generated samples. To remedy this issue, we consider a modified form of gradient guidance based on a forward prediction loss, ......

FlashSpeech: Efficient Zero-Shot Speech Synthesis

图片

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5\% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio qualit......

SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose  Estimation

图片

Existing Transformers for monocular 3D human shape and pose estimation typically have a quadratic computation and memory complexity with respect to the feature length, which hinders the exploitation of fine-grained information in high-resolution features that is beneficial for accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which allow effective utilization of high-resolution features in the Transformer. In addition, based on these two designs, we also introduce several novel modules including a multi-scale attention and a joint-aware attention to further boost the reconstruction performance. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods both quantitatively and qualitatively. Notably, the proposed algorithm achieves an M......

ID-Animator: Zero-Shot Identity-Preserving Human Video Generation

图片

Generating high fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case finetuning or usually missing the identity details in video generation process. In this study, we present ID-Animator, a zero-shot human-video generation approach that can perform personalized video generation given single reference facial image without further training. ID-Animator inherits existing diffusion-based video generation backbones with a face adapter to encode the ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information in video generation, we introduce an ID-oriented dataset construction pipeline, which incorporates decoupled human attribute and action captioning technique from a constructed facial image pool. Based on this p......

From Parts to Whole: A Unified Reference Framework for Controllable  Human Image Generation

图片

Recent advancements in controllable human image generation have led to zero-shot generation using structural signals (e.g., pose, depth) or facial appearance. Yet, generating human images conditioned on multiple parts of human appearance remains challenging. Addressing this, we introduce Parts2Whole, a novel framework designed for generating customized portraits from multiple reference images, including pose images and various aspects of human appearance. To achieve this, we first develop a semantic-aware appearance encoder to retain details of different human parts, which processes each image based on its textual label to a series of multi-scale feature maps rather than one image token, preserving the image dimension. Second, our framework supports multi-image conditioned generation through a shared self-attention mechanism that operates across reference and target features during the diffusion process. We enhance the vanilla attention mechanism by incorporating mask informa......

TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via  Gaussian Splatting

图片

Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, due to the difficulty in fitting steep appearance changes, the prevailing paradigm that presents facial motions by directly modifying point appearance may lead to distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance fields framework for high-fidelity talking head synthesis. Leveraging the point-based Gaussian Splatting, facial motions can be represented in our method by applying smooth and continuous deformations to persistent Gaussian primitives, without requiring to learn the difficult appearance change like previous methods. Due to this simplification, precise facial motions can be synthesized while keeping a highly intact facial feature. Under such a deformation paradigm, we further identify a face-mouth motion inconsistency that would affect the learning of detailed speaking motions. To address this c......

Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization

图片

We introduce a new system for Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with solver layers to estimate camera pose. The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose. The full system can connect disjoint sequences, perform visual odometry, and global optimization. Compared to existing approaches, our design is accurate and robust to catastrophic failures. Code is available at github.com/princeton-vl/MultiSlam_DiffPose ......

FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient  Descent

图片

This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce differentiable re-parameterizations of depth, intrinsics, and pose that are amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360-degree trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the dow......

Rethinking LLM Memorization through the Lens of Adversarial Compression

图片

Large language models (LLMs) trained on web-scale datasets raise substantial concerns regarding permissible data usage. One major question is whether these models "memorize" all their training data or they integrate many data sources in some way more akin to how a human would learn and synthesize information. The answer hinges, to a large degree, on $\textit{how we define memorization}$. In this work, we propose the Adversarial Compression Ratio (ACR) as a metric for assessing memorization in LLMs -- a given string from the training data is considered memorized if it can be elicited by a prompt shorter than the string itself. In other words, these strings can be "compressed" with the model by computing adversarial prompts of fewer tokens. We outline the limitations of existing notions of memorization and show how the ACR overcomes these challenges by (i) offering an adversarial view to measuring memorization, especially for monitoring unlearning and compliance; and (ii) allow......

CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation  Method

图片

Transforming large pre-trained low-resolution diffusion models to cater to higher-resolution demands, i.e., diffusion extrapolation, significantly improves diffusion adaptability. We propose tuning-free CutDiffusion, aimed at simplifying and accelerating the diffusion extrapolation process, making it more affordable and improving performance. CutDiffusion abides by the existing patch-wise extrapolation but cuts a standard patch diffusion process into an initial phase focused on comprehensive structure denoising and a subsequent phase dedicated to specific detail refinement. Comprehensive experiments highlight the numerous almighty advantages of CutDiffusion: (1) simple method construction that enables a concise higher-resolution diffusion process without third-party engagement; (2) fast inference speed achieved through a single-step higher-resolution diffusion process, and fewer inference patches required; (3) cheap GPU cost resulting from patch-wise inference and fewer patch......

Taming Diffusion Probabilistic Models for Character Control

图片

We present a novel character control framework that effectively utilizes motion diffusion probabilistic models to generate high-quality and diverse character animations, responding in real-time to a variety of dynamic user-supplied control signals. At the heart of our method lies a transformer-based Conditional Autoregressive Motion Diffusion Model (CAMDM), which takes as input the character's historical motion and can generate a range of diverse potential future motions conditioned on high-level, coarse user control. To meet the demands for diversity, controllability, and computational efficiency required by a real-time controller, we incorporate several key algorithmic designs. These include separate condition tokenization, classifier-free guidance on past motion, and heuristic future trajectory extension, all designed to address the challenges associated with taming motion diffusion probabilistic models for character control. As a result, our work represents the first mode......

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1623870.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

基于SpringBoot和Leaflet的地震台网信息预警可视化

目录 前言 一、后台管理设计与实现 1、Model层 2、业务层 3、控制层 二、前端预警可视化设计与实现 1、网页结构 2、数据绑定 三、效果展示 总结 前言 在之前的几篇博客中,我们讲解了如何在Leaflet中进行预警信息提示效果,以及基于XxlCrawler进…

【Linux笔记】基本指令(一)

一道残阳铺水中 半江瑟瑟半江红 目录 Linux基本指令 罗列目录内容:ls 指令 显示当前目录位置信息:pwd 指令 切换工作目录:cd 指令 创建文件修改时间戳:touch指令 创建空目录:mkdir指令 删除空目录:rmdir指…

【14-Ⅱ】Head First Java 学习笔记

HeadFirst Java 本人有C语言基础,通过阅读Java廖雪峰网站,简单速成了java,但对其中一些入门概念有所疏漏,阅读本书以弥补。 第一章 Java入门 第二章 面向对象 第三章 变量 第四章 方法操作实例变量 第五章 程序实战 第六章 Java…

uniapp如何隐藏默认的页面头部导航栏,uniapp开发小程序如何隐藏默认的页面头部导航栏

uniapp如何隐藏默认的页面头部导航栏 默认效果 隐藏后 在pages.json文件中插入 在uni-app中,设置navigationStyle为custom来自定义导航栏,可以隐藏默认的头部了。 {"path": "pages/index/index","name": "index&qu…

STM32与OLED显示屏通信(四针脚和七阵脚)

系列文章目录 STM32单片机系列专栏 C语言术语和结构总结专栏 文章目录 1. 单片机调试 2. OLED简介 3. 接线 4. OLED驱动函数 4.1 四针脚版本 OLED.c OLED.h OLED_Font.h 4.2 七针脚版本 引脚连接 OLED.c OLED.h OLED_Font.h 5. 主函数 工程文件模板 1. 单片机…

Selenium 保存会话信息避免重复登录实战!

前言 • 在一些实际开发场景中,我们在使用 Selenium 做自动化测试时需要保留用户的会话信息,从而避免重复登录,今天这篇文章就带大家实战如何使用 Selenium 保存会话信息。 版本 • Python 3.x 整体思路 • 当我们打开页面时,…

C语言-自定义类型结构体详细讲解

遇到困难时不要抱怨,既然改变不了过去,那么就努力改变未来!💓💓💓 目录 •🌙知识回顾 • 🍋知识点一:结构体类型的声明 ​编辑 • 🌰1.结构体的声明 ​编…

web自动化系列-selenium的基本方法介绍

web自动化 ,一个老生常谈的话题 ,很多人的自动化之路就是从它开始 。它学起来简单 ,但做起来又比较难以驾驭 ;它的执行效率慢 、但又是最接近于用户的操作场景 ; 1.web自动化中的三大亮点技术 我们先聊聊 &#xff0…

MySQL多版本并发控制mvcc原理浅析

文章目录 1.mvcc简介1.1mvcc定义1.2mvcc解决的问题1.3当前读与快照读 2.mvcc原理2.1隐藏字段2.2版本链2.3ReadView2.4读视图生成原则 3.rc和rr隔离级别下mvcc的不同 1.mvcc简介 1.1mvcc定义 mvcc(Multi Version Concurrency Control),多版本并发控制,是…

2024深圳杯C题的8页思路分析+所有代码可执行+参考文献+持续更新参考论文(已经更新了代码与图像)

比赛题目的完整版思路可执行代码数据参考论文都会在第一时间更新上传的,大家可以参考我往期的资料,所有的资料数据以及到最后更新的参考论文都是一次付费后续免费的。注意:(建议先下单占坑,因为随着后续我们更新资料数…

每周一算法:最短路计数

题目描述 给出一个 N N N个顶点 M M M 条边的无向无权图,顶点编号为 1 1 1 到 N N N。 问从顶点 1 1 1 开始,到其他每个点的最短路有几条。 输入格式 第一行包含 2 2 2 个正整数 N , M N,M N,M,为图的顶点数与边数。 接下来 M M …

vue3实现全局事件总线

1、vue3中使用全局事件总线是变化最大的。在vue2中,我们在new Vue中在beforeCreate钩子函数中使用vue.prototype.$busthis来创建全局事件总线。vue3中我需要借助第三方库来完成创建全局事件总线。 2、安装依赖 npm i mitt -s3、封装event-bus.js文件 import mitt …

Web前端开发之HTML_3

标签之表格Form表单块元素与行内元素&#xff08;内联元素&#xff09;HTML5新增标签 1. 标签之表格 <table></table> 1.1 表格&#xff08;快速生成&#xff1a;table>tr*2>td*3{单元格}&#xff09; 表格由行、列、单元格组成。单元格有同行等高、同列等…

【C++】:拷贝构造函数和赋值运算符重载

目录 一&#xff0c;拷贝构造函数1. 什么是拷贝构造函数2. 拷贝构造函数的特性3. 实践总结 二&#xff0c;赋值运算符重载2.1 运算符重载2.2 赋值运算符重载 一&#xff0c;拷贝构造函数 1. 什么是拷贝构造函数 拷贝构造函数是特殊的构造函数。是用一个已经存在的对象&#x…

ArtNeRF、Attention Control、Pixel is a Barrier、FilterPrompt

本文首发于公众号&#xff1a;机器感知 ArtNeRF、Attention Control、Pixel is a Barrier、FilterPrompt ArtNeRF: A Stylized Neural Field for 3D-Aware Cartoonized Face Synthesis Recent advances in generative visual models and neural radiance fields have greatly …

文件上传复习(upload-labs18-19关)

Pass-18&#xff08;条件竞争&#xff09; 代码和第17关大差不差&#xff0c;所以查看提示 需要用到代码审计 上传图片木马配合解析漏洞进行getshell 新建一句话木马 18.php&#xff0c;代码为&#xff1a; <?php fputs(fopen(../upload/shell18.php,w),<?php phpin…

js的算法-交换排序(冒泡)

交换排序 所谓交换排序&#xff0c;是指根据序列中两个元素关键字的比较结果来对换这两个记录在序列中的位置。基于交换的排序算法很多&#xff0c;本次介绍冒泡排序和快速排序。 冒泡 基本思想 从后往前&#xff08;或从前往后&#xff09;两两比较相邻元素的值&#xff0…

新风口下的必应bing国内广告投放该怎么做?

必应Bing作为全球搜索引擎市场的重要参与者&#xff0c;正逐渐显现出其在国内市场的独特价值和潜力。随着互联网生态的多元化发展&#xff0c;必应Bing凭借其高质量用户群和精准投放能力&#xff0c;成为了企业寻求新增长点的新风口。 一、洞察先机&#xff0c;精准定位市场 …

考研数学|跟武忠祥做《660》很吃力,要不要换张宇❓

不建议&#xff01; 我就是个妥妥的二战选手&#xff0c;一战听完汤家凤的课发现大量的题还是不会做&#xff0c;于是冒险把张宇的基础课听完一遍&#xff0c;毫不夸张的硕导致我最后没有强化阶段直接进入冲刺。可想而知&#xff0c;这能考好吗&#xff1f;21年数三87分&#…

ChromaDB教程

使用 Chroma DB&#xff0c;管理文本文档、将文本嵌入以及进行相似度搜索。 随着大型语言模型 &#xff08;LLM&#xff09; 及其应用的兴起&#xff0c;我们看到向量数据库越来越受欢迎。这是因为使用 LLM 需要一种与传统机器学习模型不同的方法。 LLM 的核心支持技术之一是…