Stale Diffusion、Drag Your Noise、PhysReaction、CityGaussian

news 2024/11/20 4:50:25

This article was first published on the WeChat official account 机器感知 (Machine Perception).

Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation

Point-based interactive editing serves as an essential tool to complement the controllability of existing generative models. A concurrent work, DragDiffusion, updates the diffusion latent map in response to user inputs, causing global latent map alterations. This results in imprecise preservation of the original content and unsuccessful editing due to gradient vanishing. In contrast, we present DragNoise, offering robust and accelerated editing without retracing the latent map. The core rationale of DragNoise lies in utilizing the predicted noise output of each U-Net as a semantic editor. This approach is grounded in two critical observations: firstly, the bottleneck features of U-Net inherently possess semantically rich features ideal for interactive editing; secondly, high-level semantics, established early in the denoising process, show minimal variation in subsequent stages. Leveraging these insights, DragNoise edits diffusion semantics in a single denoising step and effi......

Stale Diffusion: Hyper-realistic 5D Movie Generation Using Old-school Methods

Two years ago, Stable Diffusion achieved super-human performance at generating images with super-human numbers of fingers. Following the steady decline of its technical novelty, we propose Stale Diffusion, a method that solidifies and ossifies Stable Diffusion in a maximum-entropy state. Stable Diffusion works analogously to a barn (the Stable) from which an infinite set of horses have escaped (the Diffusion). As the horses have long left the barn, our proposal may be seen as antiquated and irrelevant. Nevertheless, we vigorously defend our claim of novelty by identifying as early adopters of the Slow Science Movement, which will produce extremely important pearls of wisdom in the future. Our speed of contributions can also be seen as a quasi-static implementation of the recent call to pause AI experiments, which we wholeheartedly support. As a result of a careful archaeological expedition to 18-months-old Git commit histories, we found that naturally-accumulating errors have......

PhysReaction: Physically Plausible Real-Time Humanoid Reaction Synthesis via Forward Dynamics Guided 4D Imitation

Humanoid Reaction Synthesis is pivotal for creating highly interactive and empathetic robots that can seamlessly integrate into human environments, enhancing the way we live, work, and communicate. However, it is difficult to learn the diverse interaction patterns of multiple humans and generate physically plausible reactions. The kinematics-based approaches face challenges, including issues like floating feet, sliding, penetration, and other problems that defy physical plausibility. The existing physics-based method often relies on kinematics-based methods to generate reference states, which struggle with the challenges posed by kinematic noise during action execution. Constrained by their reliance on diffusion models, these methods are unable to achieve real-time inference. In this work, we propose a Forward Dynamics Guided 4D Imitation method to generate physically plausible human-like reactions. The learned policy is capable of generating physically plausible and human-li......

HairFastGAN: Realistic and Robust Hair Transfer with a Fast Encoder-Based Approach

Our paper addresses the complex task of transferring a hairstyle from a reference image to an input photo for virtual hair try-on. This task is challenging due to the need to adapt to various photo poses, the sensitivity of hairstyles, and the lack of objective metrics. The current state-of-the-art hairstyle transfer methods use an optimization process for different parts of the approach, making them inexcusably slow. At the same time, faster encoder-based models are of very low quality because they either operate in StyleGAN's W+ space or use other low-dimensional image generators. Additionally, both approaches have a problem with hairstyle transfer when the source pose is very different from the target pose, because they either don't consider the pose at all or deal with it inefficiently. In our paper, we present the HairFast model, which uniquely solves these problems and achieves high resolution, near real-time performance, and superior reconstruction compared to optimiza......

CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians

The advancement of real-time 3D scene reconstruction and novel view synthesis has been significantly propelled by 3D Gaussian Splatting (3DGS). However, effectively training large-scale 3DGS and rendering it in real time across various scales remains challenging. This paper introduces CityGaussian (CityGS), which employs a novel divide-and-conquer training approach and Level-of-Detail (LoD) strategy for efficient large-scale 3DGS training and rendering. Specifically, the global scene prior and adaptive training data selection enable efficient training and seamless fusion. Based on fused Gaussian primitives, we generate different detail levels through compression, and realize fast rendering across various scales through the proposed block-wise detail levels selection and aggregation strategy. Extensive experimental results on large-scale scenes demonstrate that our approach attains state-of-the-art rendering quality, enabling consistent real-time rendering of large-scale scenes......
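
As a rough illustration of what block-wise detail-level selection could look like, the sketch below picks a level of detail for each scene block from its distance to the camera. This is not the CityGS implementation: the function name `select_lod`, the distance thresholds, and the block centers are all hypothetical, and the abstract does not say which criterion the authors actually use.

```python
import bisect
import math

def select_lod(block_centers, camera_pos, thresholds):
    """Pick a detail level per block from its distance to the camera.

    Level 0 is the finest; each threshold crossed moves one level coarser.
    `thresholds` must be sorted in ascending order (hypothetical values).
    """
    levels = []
    for center in block_centers:
        dist = math.dist(center, camera_pos)
        # bisect_right counts how many thresholds the distance has passed
        levels.append(bisect.bisect_right(thresholds, dist))
    return levels

# Toy scene: three block centers at increasing distance from the camera
centers = [(0.0, 0.0, 0.0), (50.0, 0.0, 0.0), (200.0, 0.0, 0.0)]
camera = (0.0, 0.0, 10.0)
levels = select_lod(centers, camera, thresholds=[30.0, 100.0])
print(levels)  # [0, 1, 2]
```

Nearby blocks keep the finest level while distant blocks fall back to coarser, compressed levels, which is the general idea behind rendering a large scene at a steady frame rate.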

Condition-Aware Neural Network for Controlled Image Generation

We present Condition-Aware Neural Network (CAN), a new method for adding control to image generative models. In parallel to prior conditional control methods, CAN controls the image generation process by dynamically manipulating the weight of the neural network. This is achieved by introducing a condition-aware weight generation module that generates conditional weight for convolution/linear layers based on the input condition. We test CAN on class-conditional image generation on ImageNet and text-to-image generation on COCO. CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT. In particular, CAN combined with EfficientViT (CaT) achieves 2.78 FID on ImageNet 512x512, surpassing DiT-XL/2 while requiring 52x fewer MACs per sampling step. ......
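
The idea of generating layer weights from the input condition can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not the CAN architecture: the linear `generator` matrix, the function names, and the dimensions are hypothetical.

```python
def generate_weight(condition, generator):
    """Map a condition vector to a flat weight vector via a linear map.

    `generator` is a hypothetical stand-in for a weight generation module.
    """
    return [sum(g * c for g, c in zip(row, condition)) for row in generator]

def conditional_linear(x, condition, generator, out_dim):
    """Apply a linear layer whose weight matrix is produced from the condition."""
    flat = generate_weight(condition, generator)
    in_dim = len(x)
    # Reshape the flat vector into an (out_dim x in_dim) weight matrix
    weight = [flat[i * in_dim:(i + 1) * in_dim] for i in range(out_dim)]
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

# Toy setup: 2-d condition, 2-d input, 2-d output, so the generator has 4 rows
generator = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
x = [1.0, 2.0]
y_a = conditional_linear(x, [1.0, 0.0], generator, out_dim=2)
y_b = conditional_linear(x, [0.0, 1.0], generator, out_dim=2)
print(y_a, y_b)  # [1.0, 2.0] [2.0, 1.0]
```

The same input `x` yields different outputs because the condition changes the layer's weights themselves, rather than being added to or concatenated with the features.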

Uncovering the Text Embedding in Text-to-Image Diffusion Models

The correspondence between input text and the generated image exhibits opacity, wherein minor textual modifications can induce substantial deviations in the generated image. Meanwhile, text embedding, the pivotal intermediary between text and images, remains relatively underexplored. In this paper, we address this research gap by delving into the text embedding space, unleashing its capacity for controllable image editing and explicable semantic direction attributes within a learning-free framework. Specifically, we identify two critical insights regarding the importance of per-word embedding and their contextual correlations within text embedding, providing instructive principles for learning-free image editing. Additionally, we find that text embedding inherently possesses diverse semantic potentials, and further reveal this property through the lens of singular value decomposition (SVD). These uncovered properties offer practical utility for image editing and semantic disco......

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially-focused, large scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only ~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images while also improving the FID and CMMD scores. Secondly, we find that training......

Video Interpolation with Diffusion Models

We present VIDIM, a generative model for video interpolation, which creates short videos given a start and end frame. In order to achieve high fidelity and generate motions unseen in the input data, VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video. We compare VIDIM to previous state-of-the-art methods on video interpolation, and demonstrate how such works fail in most settings where the underlying motion is complex, nonlinear, or ambiguous while VIDIM can easily handle such cases. We additionally demonstrate how classifier-free guidance on the start and end frame and conditioning the super-resolution model on the original high-resolution frames without additional parameters unlocks high-fidelity results. VIDIM is fast to sample from as it jointly denoises all the frames to be generated, requires less than a billion parameters per diffusion mo......
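
For readers unfamiliar with classifier-free guidance, the sketch below shows one common way of combining an unconditional and a conditional noise prediction. The abstract does not spell out the exact variant VIDIM uses, so the function name `guided_noise`, the weight convention, and the toy values are assumptions for illustration only.

```python
def guided_noise(eps_uncond, eps_cond, guidance_weight):
    """Combine two noise predictions with classifier-free guidance.

    Uses the convention eps = eps_uncond + w * (eps_cond - eps_uncond),
    so w = 1 recovers the purely conditional prediction.
    """
    return [u + guidance_weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy predictions; in a video model the condition would be the start and end frames
eps_u = [0.0, 1.0]
eps_c = [1.0, 1.0]
guided = guided_noise(eps_u, eps_c, 2.0)
print(guided)  # [2.0, 1.0]
```

Weights above 1 push the sample further in the direction the conditioning suggests, typically trading some diversity for closer adherence to the condition.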

StructLDM: Structured Latent Diffusion for 3D Human Generation

Recent 3D human generative models have achieved remarkable progress by learning 3D-aware GANs from 2D images. However, existing 3D human generative methods model humans in a compact 1D latent space, ignoring the articulated structure and semantics of human body topology. In this paper, we explore more expressive and higher-dimensional latent space for 3D human modeling and propose StructLDM, a diffusion-based unconditional 3D human generative model, which is learned from 2D images. StructLDM solves the challenges imposed due to the high-dimensional growth of latent space with three key designs: 1) A semantic structured latent space defined on the dense surface manifold of a statistical human body template. 2) A structured 3D-aware auto-decoder that factorizes the global latent space into several semantic body parts parameterized by a set of conditional structured local NeRFs anchored to the body template, which embeds the properties learned from the 2D training data and can b......

A Unified and Interpretable Emotion Representation and Expression Generation

Canonical emotions, such as happy, sad, and fearful, are easy to understand and annotate. However, emotions are often compound, e.g. happily surprised, and can be mapped to the action units (AUs) used for expressing emotions, and trivially to the canonical ones. Intuitively, emotions are continuous as represented by the arousal-valence (AV) model. An interpretable unification of these four modalities, namely Canonical, Compound, AUs, and AV, is highly desirable for a better representation and understanding of emotions. However, such unification remains unknown in the current literature. In this work, we propose an interpretable and unified emotion model, referred to as C2A2. We also develop a method that leverages labels of the non-unified models to annotate the novel unified one. Finally, we modify the text-conditional diffusion models to understand continuous numbers, which are then used to generate continuous expressions using our unified emotion model. Through quan......

Large Motion Model for Unified Multi-Modal Motion Generation

Human motion generation, a cornerstone technique in animation and video production, has widespread applications in various tasks like text-to-motion and music-to-dance. Previous works focus on developing specialist models tailored for each task without scalability. In this work, we present Large Motion Model (LMM), a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model. A unified motion model is appealing since it can leverage a wide range of motion data to achieve broad generalization beyond a single task. However, it is also challenging due to the heterogeneous nature of substantially different motion data and tasks. LMM tackles these challenges from three principled aspects: 1) Data: We consolidate datasets with different modalities, formats and tasks into a comprehensive yet unified motion generation dataset, MotionVerse, comprising 10 tasks, 16 datasets, a total of 320k sequences, and 100 million frames. 2) Archite......

CosmicMan: A Text-to-Image Foundation Model for Humans

We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are the new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential for the final results from trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 million high-quality real-world human images in a mean resolution of 1488......

MagicMirror: Fast and High-Quality Avatar Generation with a Constrained Search Space

We introduce a novel framework for 3D human avatar generation and personalization, leveraging text prompts to enhance user engagement and customization. Central to our approach are key innovations aimed at overcoming the challenges in photo-realistic avatar synthesis. Firstly, we utilize a conditional Neural Radiance Fields (NeRF) model, trained on a large-scale unannotated multi-view dataset, to create a versatile initial solution space that accelerates and diversifies avatar generation. Secondly, we develop a geometric prior, leveraging the capabilities of Text-to-Image Diffusion Models, to ensure superior view invariance and enable direct optimization of avatar geometry. These foundational ideas are complemented by our optimization pipeline built on Variational Score Distillation (VSD), which mitigates texture loss and over-saturation issues. As supported by our extensive experiments, these strategies collectively enable the creation of custom avatars with unparalleled vis......
