HiRoPE、MoDiTalker、RecDiffusion、DreamSalon、InterDreamer、BAMM

news2025/1/11 7:04:21

本文首发于公众号:机器感知

HiRoPE、MoDiTalker、RecDiffusion、DreamSalon、InterDreamer、BAMM

Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D

图片

In recent years, there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets. At the same time, there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However, the availability of 3D or multiview data is still substantially limited compared to 2D image datasets, making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper, we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to predict unseen views on feature spaces generated by a few visual models (i.e. DINO and CLIP)......

Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design  Approach

图片

Parameter-efficient fine-tuning for pre-trained Vision Transformers aims to adeptly tailor a model to downstream tasks by learning a minimal set of new adaptation parameters while preserving the frozen majority of pre-trained parameters. Striking a balance between retaining the generalizable representation capacity of the pre-trained model and acquiring task-specific features poses a key challenge. Currently, there is a lack of focus on guiding this delicate trade-off. In this study, we approach the problem from the perspective of Singular Value Decomposition (SVD) of pre-trained parameter matrices, providing insights into the tuning dynamics of existing methods. Building upon this understanding, we propose a Residual-based Low-Rank Rescaling (RLRR) fine-tuning strategy. This strategy not only enhances flexibility in parameter tuning but also ensures that new parameters do not deviate excessively from the pre-trained model through a residual design. Extensive experiments demo......

Automated Black-box Prompt Engineering for Personalized Text-to-Image  Generation

图片

Prompt engineering is effective for controlling the output of text-to-image (T2I) generative models, but it is also laborious due to the need for manually crafted prompts. This challenge has spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, and produce non-intuitive prompts. In this work, we introduce PRISM, an algorithm that automatically identifies human-interpretable and transferable prompts that can effectively generate desired concepts given only black-box access to T2I models. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompts distribution for given reference images. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles and images across multiple T2I models......

HiRoPE: Length Extrapolation for Code Models

图片

Addressing the limitation of context length in large language models for code-related tasks is the primary focus of this paper. Existing LLMs are constrained by their pre-trained context lengths, leading to performance issues in handling long complex code sequences. Inspired by how human programmers navigate code, we introduce Hierarchical Rotary Position Embedding (HiRoPE), a novel approach that enhances the traditional rotary position embedding into a hierarchical format based on the hierarchical structure of source code. HiRoPE offers easy integration into existing LLMs without extra training costs. Our method is extensively evaluated with various LLMs, demonstrating stable performance in tasks such as language modeling and long code completion. We also introduce a new long code understanding task with real-world code projects, in hopes of promoting further development in this code-related field. Theoretically and experimentally, we find that HiRoPE also addresses the out-......

MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity  Talking Head Generation

图片

Conventional GAN-based models for talking head generation often suffer from limited quality and unstable training. Recent approaches based on diffusion models aimed to address these limitations and improve fidelity. However, they still face challenges, including extensive sampling times and difficulties in maintaining temporal consistency due to the high stochasticity of diffusion models. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, dubbed MoDiTalker. We introduce the two modules: audio-to-motion (AToM), designed to generate a synchronized lip motion from audio, and motion-to-video (MToV), designed to produce high-quality head video following the generated motion. AToM excels in capturing subtle lip movements by leveraging an audio attention mechanism. In addition, MToV enhances temporal consistency by leveraging an efficient tri-plane representation. Our experiments conducted on standard benchm......

RecDiffusion: Rectangling for Image Stitching with Diffusion Models

图片

Image stitching from different captures often results in non-rectangular boundaries, which is often considered unappealing. To solve non-rectangular boundaries, current solutions involve cropping, which discards image content, inpainting, which can introduce unrelated content, or warping, which can distort non-linear features and introduce artifacts. To overcome these issues, we introduce a novel diffusion-based learning framework, \textbf{RecDiffusion}, for image stitching rectangling. This framework combines Motion Diffusion Models (MDM) to generate motion fields, effectively transitioning from the stitched image's irregular borders to a geometrically corrected intermediary. Followed by Content Diffusion Models (CDM) for image detail refinement. Notably, our sampling process utilizes a weighted map to identify regions needing correction during each iteration of CDM. Our RecDiffusion ensures geometric accuracy and overall visual appeal, surpassing all previous methods in bot......

DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context  in Editable Face Generation

图片

While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centered images, novel challenges arise with a nuanced task of "identity fine editing": precisely modifying specific features of a subject while maintaining its inherent identity and context. Existing personalization methods either require time-consuming optimization or learning additional encoders, adept in "identity re-contextualization". However, they often struggle with detailed and sensitive tasks like human face editing. To address these challenges, we introduce DreamSalon, a noise-guided, staged-editing framework, uniquely focusing on detailed image manipulations and identity-context preservation. By discerning editing and boosting stages via the frequency and gradient of predicted noises, DreamSalon first performs detailed manipulations on specific features in the editing stage, guided by high-frequency information, and then employs stochastic denoising in the boosting sta......

Burst Super-Resolution with Diffusion Models for Improving Perceptual  Quality

图片

While burst LR images are useful for improving the SR image quality compared with a single LR image, prior SR networks accepting the burst LR images are trained in a deterministic manner, which is known to produce a blurry SR image. In addition, it is difficult to perfectly align the burst LR images, making the SR image more blurry. Since such blurry images are perceptually degraded, we aim to reconstruct the sharp high-fidelity boundaries. Such high-fidelity images can be reconstructed by diffusion models. However, prior SR methods using the diffusion model are not properly optimized for the burst SR task. Specifically, the reverse process starting from a random sample is not optimized for image enhancement and restoration methods, including burst SR. In our proposed method, on the other hand, burst LR features are used to reconstruct the initial burst SR image that is fed into an intermediate step in the diffusion model. This reverse process from the intermediate step 1) sk......

BAMM: Bidirectional Autoregressive Motion Model

图片

Generating human motion from text has been dominated by denoising motion models either through diffusion or generative masking process. However, these models face great limitations in usability by requiring prior knowledge of the motion length. Conversely, autoregressive motion models address this limitation by adaptively predicting motion endpoints, at the cost of degraded generation quality and editing capabilities. To address these challenges, we propose Bidirectional Autoregressive Motion Model (BAMM), a novel text-to-motion generation framework. BAMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and (2) a masked self-attention transformer that autoregressively predicts randomly masked tokens via a hybrid attention masking strategy. By unifying generative masked modeling and autoregressive modeling, BAMM captures rich and bidirectional dependencies among motion tokens, while learning the probab......

Break-for-Make: Modular Low-Rank Adaptations for Composable  Content-Style Customization

图片

Personalized generation paradigms empower designers to customize visual intellectual properties with the help of textual descriptions by tuning or adapting pre-trained text-to-image models on a few images. Recent works explore approaches for concurrently customizing both content and detailed visual style appearance. However, these existing approaches often generate images where the content and style are entangled. In this study, we reconsider the customization of content and style concepts from the perspective of parameter space construction. Unlike existing methods that utilize a shared parameter space for content and style, we propose a learning framework that separates the parameter space to facilitate individual learning of content and style, thereby enabling disentangled content and style. To achieve this goal, we introduce "partly learnable projection" (PLP) matrices to separate the original adapters into divided sub-parameter spaces. We propose "break-for-make" customi......

Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for  Communication

图片

In this paper, we introduce an innovative task focused on human communication, aiming to generate 3D holistic human motions for both speakers and listeners. Central to our approach is the incorporation of factorization to decouple audio features and the combination of textual semantic information, thereby facilitating the creation of more realistic and coordinated movements. We separately train VQ-VAEs with respect to the holistic motions of both speaker and listener. We consider the real-time mutual influence between the speaker and the listener and propose a novel chain-like transformer-based auto-regressive model specifically designed to characterize real-world communication scenarios effectively which can generate the motions of both the speaker and the listener simultaneously. These designs ensure that the results we generate are both coordinated and diverse. Our approach demonstrates state-of-the-art performance on two benchmark datasets. Furthermore, we introduce the H......

GANTASTIC: GAN-based Transfer of Interpretable Directions for  Disentangled Image Editing in Text-to-Image Diffusion Models

图片

The rapid advancement in image generation models has predominantly been driven by diffusion models, which have demonstrated unparalleled success in generating high-fidelity, diverse images from textual prompts. Despite their success, diffusion models encounter substantial challenges in the domain of image editing, particularly in executing disentangled edits-changes that target specific attributes of an image while leaving irrelevant parts untouched. In contrast, Generative Adversarial Networks (GANs) have been recognized for their success in disentangled edits through their interpretable latent spaces. We introduce GANTASTIC, a novel framework that takes existing directions from pre-trained GAN models-representative of specific, controllable attributes-and transfers these directions into diffusion-based models. This novel approach not only maintains the generative quality and diversity that diffusion models are known for but also significantly enhances their capability to pe......

InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction

图片

Text-conditioned human motion generation has experienced significant advancements with diffusion models trained on extensive motion capture data and corresponding textual annotations. However, extending such success to 3D dynamic human-object interaction (HOI) generation faces notable challenges, primarily due to the lack of large-scale interaction data and comprehensive descriptions that align with these interactions. This paper takes the initiative and showcases the potential of generating human-object interactions without direct training on text-interaction pair data. Our key insight in achieving this is that interaction semantics and dynamics can be decoupled. Being unable to learn interaction semantics through supervised training, we instead leverage pre-trained large models, synergizing knowledge from a large language model and a text-to-motion model. While such knowledge offers high-level control over interaction semantics, it cannot grasp the intricacies of low-level ......

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1554363.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

利用lidar生成深度图

前言 目前,深度图像的获取方法有:激光雷达深度成像法、计算机立体视觉成像、坐标测量机法、莫尔条纹法、结构光法等。针对深度图像的研究重点主要集中在以下几个方面:深度图像的分割技术,深度图像的边缘检测技术,基于…

HarmonyOS实战开发-实现自定义弹窗

介绍 本篇Codelab基于ArkTS的声明式开发范式实现了三种不同的弹窗,第一种直接使用公共组件,后两种使用CustomDialogController实现自定义弹窗,效果如图所示 相关概念 AlertDialog:警告弹窗,可设置文本内容和响应回调…

网络七层模型:理解网络通信的架构(〇)

🤍 前端开发工程师、技术日更博主、已过CET6 🍨 阿珊和她的猫_CSDN博客专家、23年度博客之星前端领域TOP1 🕠 牛客高级专题作者、打造专栏《前端面试必备》 、《2024面试高频手撕题》 🍚 蓝桥云课签约作者、上架课程《Vue.js 和 E…

【QT入门】 QListWidget各种常见用法详解之列表模式

往期回顾 【QT入门】 Qt代码创建布局之setLayout使用-CSDN博客 【QT入门】 Qt代码创建布局之多重布局变换与布局删除技巧-CSDN博客 【QT入门】 QTabWidget各种常见用法详解-CSDN博客 【QT入门】 QListWidget各种常见用法详解之列表模式 QListWidget有列表和图标两种显示模式&a…

数据结构刷题篇 之 【力扣二叉树基础OJ】详细讲解(含每道题链接及递归图解)

有没有一起拼用银行卡的,取钱的时候我用,存钱的时候你用 1、相同的树 难度等级:⭐ 直达链接:相同的树 2、单值二叉树 难度等级:⭐ 直达链接:单值二叉树 3、对称二叉树 难度等级:⭐⭐ 直达…

Delphi模式编程

文章目录 Delphi模式编程涉及以下几个关键方面:**设计模式的应用****Delphi特性的利用****实际开发中的实践** Delphi模式编程的实例 Delphi模式编程是指在使用Delphi这一集成开发环境(IDE)和Object Pascal语言进行软件开发时,采用…

vivado 器件编程

生成器件镜像后 , 下一步是将其下载到目标器件。 Vivado IDE 具有内置原生的系统内器件编程功能用于执行此操作。 Vivado Design Suite 和 Vivado Lab Edition 都包含相应的功能 , 支持您连接到包含 1 个或多个 FPGA 或 ACAP 的硬 件, 以…

9、鸿蒙学习-开发及引用静态共享包(API 9)

HAR(Harmony Archive)是静态共享包,可以包含代码、C库、资源和配置文件。通过HAR可以实现多个模块或多个工程共享ArkUI组件、资源等相关代码。HAR不同于HAP,不能独立安装运行在设备上,只能作为应用模块的依赖项被引用。…

使用Docker Compose一键部署前后端分离项目(图文保姆级教程)

一、安装Docker和docker Compose 1.Docker安装 //下载containerd.io包 yum install https://download.docker.com/linux/fedora/30/x86_64/stable/Packages/containerd.io-1.2.6-3.3.fc30.x86_64.rpm //安装依赖项 yum install -y yum-utils device-mapper-persistent-data l…

VTK 9.2.6 源码和VTK Examples 编译 Visual Studio 2022

对于编译 VTK 源码和编译详细的说明: VTK 源码编译: 下载源码: 从 VTK 官方网站或者 GitHub 获取源代码。官网目前最近的9.3.0有问题,见VTK 9.3.0 编译问题 Visual Studio 2022去gitlab上选择9.2.6分支进行clone CMake 配置&…

探索数据结构:链式队与循环队列的模拟、实现与应用

✨✨ 欢迎大家来到贝蒂大讲堂✨✨ 🎈🎈养成好习惯,先赞后看哦~🎈🎈 所属专栏:数据结构与算法 贝蒂的主页:Betty’s blog 1. 队列的定义 队列(queue)是一种只允许在一端进…

系统分析师-参考模型

前言 网络术语中的参考模型指的是OSI参考模型,由ISO(国际标准化组织)制定的一套普遍适用的规范集合,以使得全球范围的计算机平台可进行开放式通信。 ISO创建了一个有助于开发和理解计算机的通信模型,即开放系统互联OS…

openwrt 编译mysql数据库固件,并调用

前言 openwrt 编译源码mysql数据库,并编写demo调用 一、整体架构设计 作者要做一个项目,没有后端服务,只有一个电脑,需要在电脑上安装mysql服务端。然后在设备上安装mysql客户端。 二、PC安装mysql 1.官网链接 自行百度安装&a…

一文彻底搞懂spring循环依赖

文章目录 1. 什么是循环依赖2. Spring怎么解决循环依赖3. 无法处理的循环依赖 1. 什么是循环依赖 Spring 中的循环依赖是指两个或多个 Bean 之间相互依赖,形成一个循环引用的情况。在 Spring 容器中,循环依赖通常指的是单例(Singleton&#…

使用 Idea 快速搭建 SpringMVC 项目的详细步骤

一、开篇 SpringMVC 是一款当下流行的优秀的 MVC 框架,关于 MVC 的概念、作用、优点等内容介绍,在作者之前的一篇 Chat 《深入理解 MVC 框架原理:自定义 Struts2 框架》中有详细的描述。描述了关于另一款主流 MVC 框架的原理介绍,…

Docker-Container

Docker ①什么是容器②为什么需要容器③容器的生命周期容器 OOM容器异常退出容器暂停 ④容器命令清单总览docker createdocker rundocker psdocker logsdocker attachdocker execdocker startdocker stopdocker restartdocker killdocker topdocker statsdocker container insp…

unrealbuildtool 无法找到,执行 Generate Visual Studio Project 错误

参考链接 Generate cpp project Couldnt find UnrealBuildTool - Pipeline & Plugins / Plugins - Epic Developer Community Forums (unrealengine.com) 错误提示如下图: 解决方案: 打开 UnrealBuildTool,生成解决方案就可以了

学习Fast-LIO系列代码中相关概念理解

目录 一、流形和流形空间(姿态) 1.1 定义 1.2 为什么要有流形? 1.3 流形要满足什么性质? (1) 拓扑同胚 (2) 可微结构 1.4 欧式空间和流形空间的区别和联系? (1) 区别: (2) 联系: 1.5 将姿态定义在流形上比…

【Python实用标准库】argparser使用教程

argparser使用教程 1.介绍2.基本使用3.add_argument() 参数设置4.参考 1.介绍 (一)argparse 模块是 Python 内置的用于命令项选项与参数解析的模块,其用主要在两个方面: 一方面在python文件中可以将算法参数集中放到一起&#x…

CavalierContours 二维线操作

CavalierContours 二维线操作 2D polyline library for offsetting, combining, etc. 用于偏移、交并补等组合等操作的 2D 多折段线库。 Polyline Structure 多段线结构 Polylines are defined by a sequence of vertexes and a bool indicating whether the polyline is cl…