CameraCtrl、EDTalk、Sketch3D、Diffusion^2、FashionEngine

news2024/9/21 4:40:45

本文首发于公众号:机器感知

CameraCtrl、EDTalk、Sketch3D、Diffusion^2、FashionEngine

NVINS: Robust Visual Inertial Navigation Fused with NeRF-augmented  Camera Pose Regressor and Uncertainty Quantification

图片

In recent years, Neural Radiance Fields (NeRF) have emerged as a powerful tool for 3D reconstruction and novel view synthesis. However, the computational cost of NeRF rendering and degradation in quality due to the presence of artifacts pose significant challenges for its application in real-time and robust robotic tasks, especially on embedded systems. This paper introduces a novel framework that integrates NeRF-derived localization information with Visual-Inertial Odometry(VIO) to provide a robust solution for robotic navigation in a real-time. By training an absolute pose regression network with augmented image data rendered from a NeRF and quantifying its uncertainty, our approach effectively counters positional drift and enhances system reliability. We also establish a mathematically sound foundation for combining visual inertial navigation with camera localization neural networks, considering uncertainty under a Bayesian framework. Experimental validation in the photore......

Is Model Collapse Inevitable? Breaking the Curse of Recursion by  Accumulating Real and Synthetic Data

图片

The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops discovered that such loops can lead to model collapse, a phenomenon where performance progressively degrades with each model-fitting iteration until the latest model becomes useless. However, several recent papers studying model collapse assumed that new data replace old data over time rather than assuming data accumulate over time. In this paper, we compare these two settings and show that accumulating data prevents model collapse. We begin by studying an analytically tractable setup in which a sequence of linear models are fit to the previous models' predictions. Previous work showed if data are replaced, the test error increases linearly with the number of model-fitting iterations; we extend this result by proving that if data instead acc......

Octopus: On-device language model for function calling of software APIs

图片

In the rapidly evolving domain of artificial intelligence, Large Language Models (LLMs) play a crucial role due to their advanced text processing and generation abilities. This study introduces a new strategy aimed at harnessing on-device LLMs in invoking software APIs. We meticulously compile a dataset derived from software API documentation and apply fine-tuning to LLMs with capacities of 2B, 3B and 7B parameters, specifically to enhance their proficiency in software API interactions. Our approach concentrates on refining the models' grasp of API structures and syntax, significantly enhancing the accuracy of API function calls. Additionally, we propose \textit{conditional masking} techniques to ensure outputs in the desired formats and reduce error rates while maintaining inference speeds. We also propose a novel benchmark designed to evaluate the effectiveness of LLMs in API interactions, establishing a foundation for subsequent research. Octopus, the fine-tuned model, is ......

EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

图片

Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the application and entertainment of the talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved to share with different modal input, both aspects often neglected in existing methods. To address this gap, this paper proposes a novel Efficient Disentanglement framework for Talking head generation (EDTalk). Our framework enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on video or audio inputs. Specifically, we employ three lightweight modules to decompose the facial dynamics into three distinct latent spaces representing mouth, pose, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions.......

FashionEngine: Interactive Generation and Editing of 3D Clothed Humans

图片

We present FashionEngine, an interactive 3D human generation and editing system that allows us to design 3D digital humans in a way that aligns with how humans interact with the world, such as natural languages, visual perceptions, and hand-drawing. FashionEngine automates the 3D human production with three key components: 1) A pre-trained 3D human diffusion model that learns to model 3D humans in a semantic UV latent space from 2D image training data, which provides strong priors for diverse generation and editing tasks. 2) Multimodality-UV Space encoding the texture appearance, shape topology, and textual semantics of human clothing in a canonical UV-aligned space, which faithfully aligns the user multimodal inputs with the implicit UV latent space for controllable 3D human editing. The multimodality-UV space is shared across different user inputs, such as texts, images, and sketches, which enables various joint multimodal editing tasks. 3) Multimodality-UV Aligned Sampler ......

GEARS: Local Geometry-aware Hand-object Interaction Synthesis

图片

Generating realistic hand motion sequences in interaction with objects has gained increasing attention with the growing interest in digital humans. Prior work has illustrated the effectiveness of employing occupancy-based or distance-based virtual sensors to extract hand-object interaction features. Nonetheless, these methods show limited generalizability across object categories, shapes and sizes. We hypothesize that this is due to two reasons: 1) the limited expressiveness of employed virtual sensors, and 2) scarcity of available training data. To tackle this challenge, we introduce a novel joint-centered sensor designed to reason about local object geometry near potential interaction regions. The sensor queries for object surface points in the neighbourhood of each hand joint. As an important step towards mitigating the learning complexity, we transform the points from global frame to hand template frame and use a shared module to process sensor features of each individual......

Sketch3D: Style-Consistent Guidance for Sketch-to-3D Generation

图片

Recently, image-to-3D approaches have achieved significant results with a natural image as input. However, it is not always possible to access these enriched color input samples in practical applications, where only sketches are available. Existing sketch-to-3D researches suffer from limitations in broad applications due to the challenges of lacking color information and multi-view content. To overcome them, this paper proposes a novel generation paradigm Sketch3D to generate realistic 3D assets with shape aligned with the input sketch and color matching the textual description. Concretely, Sketch3D first instantiates the given sketch in the reference image through the shape-preserving generation process. Second, the reference image is leveraged to deduce a coarse 3D Gaussian prior, and multi-view style-consistent guidance images are generated based on the renderings of the 3D Gaussians. Finally, three strategies are designed to optimize 3D Gaussians, i.e., structural optimiz......

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

图片

Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed to describe complex human movements with crucial appearance information. 2) Gestures and speech exhibit inherent dependencies and should be temporally aligned even of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features preserving essential appearance information. Then a transformer-based diffusion model is proposed to learn the temporal correlation between gestures and speech, and performs gener......

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

图片

Controllability plays a crucial role in video generation since it allows users to create desired content. However, existing models largely overlooked the precise control of camera pose that serves as a cinematic language to express deeper narrative nuances. To alleviate this issue, we introduce CameraCtrl, enabling accurate camera pose control for text-to-video(T2V) models. After precisely parameterizing the camera trajectory, a plug-and-play camera module is then trained on a T2V model, leaving others untouched. Additionally, a comprehensive study on the effect of various datasets is also conducted, suggesting that videos with diverse camera distribution and similar appearances indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise and domain-adaptive camera control, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs. Our proje......

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of  Orthogonal Diffusion Models

图片

Recent advancements in 3D generation are predominantly propelled by improvements in 3D-aware image diffusion models which are pretrained on Internet-scale image data and fine-tuned on massive 3D data, offering the capability of producing highly consistent multi-view images. However, due to the scarcity of synchronized multi-view video data, it is impractical to adapt this paradigm to 4D generation directly. Despite that, the available video and 3D data are adequate for training video and multi-view diffusion models that can provide satisfactory dynamic and geometric priors respectively. In this paper, we present Diffusion$^2$, a novel framework for dynamic 3D content creation that leverages the knowledge about geometric consistency and temporal smoothness from these models to directly sample dense multi-view and multi-frame images which can be employed to optimize continuous 4D representation. Specifically, we design a simple yet effective denoising strategy via score composi......

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1597221.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

【团体程序设计天梯赛 往年关键真题 25分题合集 详细分析完整AC代码】(L2-001 - L2-024)搞懂了赛场上拿下就稳了

L2-001 紧急救援 最短路路径打印 样例 输入1 4 5 0 3 20 30 40 10 0 1 1 1 3 2 0 3 3 0 2 2 2 3 2输出1 2 60 0 1 3分析 用一遍dijkstra算法。设立 n u m [ i ] num[i] num[i]和 w [ i ] w[i] w[i]表示从出发点到i结点拥有的路的条数,以及能够找到的救援队的数目…

吴恩达llama课程笔记:第六课code llama编程

羊驼Llama是当前最流行的开源大模型,其卓越的性能和广泛的应用领域使其成为业界瞩目的焦点。作为一款由Meta AI发布的开放且高效的大型基础语言模型,Llama拥有7B、13B和70B(700亿)三种版本,满足不同场景和需求。 吴恩…

Avalonia中MVVM模式下设置TextBox焦点

Avalonia中MVVM模式下设置TextBox焦点 前言引入Nuget库程序里面引入相关库修改前端代码#效果图 前言 我们在开发的过程中,经常会遇到比如我在进入某个页面的时候我需要让输入焦点聚焦在指定的文本框上面,或者点击某个按钮触发某个选项的时候也要自动将输入焦点聚焦到指定的文…

Linux中断(栈、上下部)

进程线程中断的核心:栈 进程切换时,需要将当前进程的寄存器参数保存在当前进程的栈中。要再次执行此进程时需要先从栈中恢复此进程的寄存器参数。 对于同个进程的不同线程,代码和数据是所有线程共享的,所以线程间可以通过全局变量…

白盒测试详解

🍅 视频学习:文末有免费的配套视频可观看 🍅 关注公众号:互联网杂货铺,回复1 ,免费获取软件测试全套资料,资料在手,涨薪更快 概念与定义 白盒测试:侧重于系统或部件内部机…

ASUS华硕ROG幻13笔记本电脑GV301R工厂模式原厂OEM预装Windows11系统,恢复出厂开箱状态

适用于型号:GV301RC、GV301RE、GV301RA 工厂模式安装包:https://pan.baidu.com/s/1gLme1VqidpUjCLocgm5ajQ?pwddnbk 提取码:dnbk 工厂模式Win11安装包带有ASUS RECOVERY恢复功能、自带所有驱动、出厂主题壁纸、系统属性专属联机支持标志…

Matlab|基于广义Benders分解法的综合能源系统优化规划

目录 1 主要内容 广义benders分解法流程图: 优化目标: 约束条件: 2 部分代码 3 程序结果 4 下载链接 1 主要内容 该程序复现文章《综合能源系统协同运行策略与规划研究》第四章内容基于广义Benders分解法的综合能源系统优化规划&…

Python基于flask的豆瓣电影分析可视化系统

博主介绍:✌程序员徐师兄、7年大厂程序员经历。全网粉丝12w、csdn博客专家、掘金/华为云/阿里云/InfoQ等平台优质作者、专注于Java技术领域和毕业项目实战✌ 🍅文末获取源码联系🍅 👇🏻 精彩专栏推荐订阅👇…

Nacos快速入门(windows)

Nacos是 Dynamic Naming and Configuration Service的简称。Nacos能快速实现动态服务发现、服务配置、服务元数据及流量管理 下载启动Nacos Nacos下载启动(windows) 以下两种随意选择一种即可从github上下载源码方式启动 git clone https://github.com…

业务与数据的终极对决:如何让大数据成为企业的超能力?

在数字化转型的浪潮中,企业如同在茫茫数据海洋中航行的船只,而数据资产管理就是指引航向的罗盘。但是,当业务需求与数据脱节、数据孤岛林立、业务流程与数据流程不同步、以及业务增长带来的数据管理挑战成为阻碍,我们该如何突破重…

c++取经之路(其五)——类和对象拷贝构造函数

概念:拷贝构造函数,只有单个形参,该形参是对本类类型对象的引用(一般常用const修饰),在用已存在的类类型对象创建新对象时由编译器自动调用。 特征: 1. 拷贝构造函数是构造函数的一个重载形式 如: 2. 拷贝…

springboot+Vue项目部署到云服务器上

一、下载配置ngnix 1.压缩包下载并上传 链接: https://pan.baidu.com/s/1m2LKV8ci4WXkAWdJXIeUFQ 提取码: 0415 2.解压 tar -xzvf 压缩包名 3.编译nginx 在解压好的文件夹下,依次执行: ./configure 来到nginx默认安装路径/usr/local/nginx 依次执行命令 mak…

【企业动态】瑞芯微高级业务总监来访东胜物联,共探新能源汽车市场合作

近日,瑞芯微高级业务总监阙金珍一行来访东胜物联参观交流,并就深化合作进行讨论。 东胜物联与瑞芯微建有长期稳固的合作关系,基于RK3588、RK3399、RK3568等处理器,推出了多款嵌入式核心硬件产品,包括核心板、网关等&a…

一文详解MES、ERP、SCM、WMS、APS、SCADA、PLM、QMS、CRM、EAM及其关系

经常遇到很多系统,比如:MES、ERP、SCM、WMS、APS、SCADA、PLM、QMS、CRM、EAM,这些都是什么系统?有什么功能和作用?它们之间的关系是怎样的? 今天就一文详细分享给大家。 10大系统之间的关系 ERP 和其他…

知名度最高的免费SSL证书

一、选证书类型 根据网站性质与安全需求,选定合适的SSL证书: - 域名验证证书(DV):快速验证域名所有权,适用于个人网站、博客,申请简便。 - 组织验证证书(OV):…

东方博宜 1152. 求n个数的最大值和最小值

东方博宜 1152. 求n个数的最大值和最小值 思路:贼拉简单,就是用排序函数sort()函数,注意点,一个是头文件,一个是for循环里面的起始值 sort(a,an) 从小到大排序,sort(a,an,greater()) 从大到小排序 #incl…

爬虫 | 垃圾处理设施数据的获取与保存

Hi,大家好,我是半亩花海。本项目通过发送网络请求(requests),从指定的 URL 获取垃圾处理设施的相关数据,并将数据保存到 CSV 文件中,以供后续分析和利用。 目录 一、项目结构 二、详细说明 三…

wangzherongyao 2024.04.15

第一局:百里陪那只重置技能CD的辅助,对面有兰陵王,妲己,然后我补位廉颇被自己人和对面一阵嘲讽,真的不想说啥,对面盾山和妲己估计都没明白,我一只就能破他们队伍,所以看到没先出魔抗…

【拦截器Interceptor】springboot拦截器的使用和原理

【拦截器Interceptor】springboot拦截器的使用和原理 【一】拦截器简介(1)简介【2】作用 【二】实现步骤【1】自定义拦截器,实现拦截器接口HandlerInterceptor【2】将拦截器添加到容器当中【3】配置拦截器的拦截规则【4】拦截器的执行顺序 【…

HCIP-Datacom(H12-821)111-120题

111、地址转换技术的优点不包括()? A.地址转换可以使内部网络用户(私有IP地址)方便地访问Internet B.地址转换可以屏蔽内部网络的用户提高内部网络的安全性 C.地址转换可以使内部局城网的许多主机共享一个IP地址上网D.地址转换能够处理IP报头加密的情况 解析:地址转换只处理…