WorldGPT、Pix2Pix-OnTheFly、StyleDyRF、ManiGaussian、Face SR

news2024/9/28 15:24:45

本文首发于公众号:机器感知

WorldGPT、Pix2Pix-OnTheFly、StyleDyRF、ManiGaussian、Face SR

HandGCAT: Occlusion-Robust 3D Hand Mesh Reconstruction from Monocular  Images

图片

We propose a robust and accurate method for reconstructing 3D hand mesh from monocular images. This is a very challenging problem, as hands are often severely occluded by objects. Previous works often have disregarded 2D hand pose information, which contains hand prior knowledge that is strongly correlated with occluded regions. Thus, in this work, we propose a novel 3D hand mesh reconstruction network HandGCAT, that can fully exploit hand prior as compensation information to enhance occluded region features. Specifically, we designed the Knowledge-Guided Graph Convolution (KGC) module and the Cross-Attention Transformer (CAT) module. KGC extracts hand prior information from 2D hand pose by graph convolution. CAT fuses hand prior into occluded regions by considering their high correlation.

Text-to-Audio Generation Synchronized with Videos

图片

In recent times, the focus on text-to-audio (TTA) generation has intensified, as researchers strive to synthesize audio from textual descriptions. However, most existing methods, though leveraging latent diffusion models to learn the correlation between audio and text embeddings, fall short when it comes to maintaining a seamless synchronization between the produced audio and its video. This often results in discernible audio-visual mismatches. To bridge this gap, we introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench. This benchmark distinguishes itself with three novel metrics dedicated to evaluating visual alignment and temporal consistency.

WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text  and Image Inputs

图片

Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content. However, it remains a formidable challenge pertaining to maintaining temporal consistency and ensuring action smoothness throughout the generated sequences. In this paper, we present an innovative video generation AI agent that harnesses the power of Sora-inspired multimodal learning to build skilled world models framework based on textual prompts and accompanying images. The framework includes two parts: prompt enhancer and full video translation.

Pix2Pix-OnTheFly: Leveraging LLMs for Instruction-Guided Image Editing

图片

The combination of language processing and image processing keeps attracting increased interest given recent impressive advances that leverage the combined strengths of both domains of research. Among these advances, the task of editing an image on the basis solely of a natural language instruction stands out as a most challenging endeavour. While recent approaches for this task resort, in one way or other, to some form of preliminary preparation, training or fine-tuning, this paper explores a novel approach: We propose a preparation-free method that permits instruction-guided image editing on the fly. This approach is organized along three steps properly orchestrated that resort to image captioning and DDIM inversion, followed by obtaining the edit direction embedding, followed by image editing proper.

CHAI: Clustered Head Attention for Efficient LLM Inference

图片

Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive, where a single request can require multiple GPUs and tens of Gigabytes of memory. Multi-Head Attention is one of the key components of LLMs, which can account for over 50% of LLMs memory and compute requirement. We observe that there is a high amount of redundancy across heads on which tokens they pay attention to. Based on this insight, we propose Clustered Head Attention (CHAI). CHAI combines heads with a high amount of correlation for self-attention at runtime, thus reducing both memory and compute.

Mechanics of Next Token Prediction with Self-Attention

图片

Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: $\textit{What}$ $\textit{does}$ $\textit{a}$ $\textit{single}$ $\textit{self-attention}$ $\textit{layer}$ $\textit{learn}$ $\textit{from}$ $\textit{next-token}$ $\textit{prediction?}$ We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: $\textbf{(1)}$ $\textbf{Hard}$ $\textbf{retrieval:}$ Given input sequence, self-attention precisely selects the $\textit{high-priority}$ $\textit{input}$ $\textit{tokens}$ associated with the last input token. $\textbf{(2)}$ $\textbf{Soft}$ $\textbf{composition:}$ It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCC) of this graph and self-attention learns to retrieve the tokens that belong to the highest-priority SCC available in the context window.

Make Me Happier: Evoking Emotions Through Image Diffusion Models

图片

Despite the rapid progress in image generation, emotional image editing remains under-explored. The semantics, context, and structure of an image can evoke emotional responses, making emotional image editing techniques valuable for various real-world applications, including treatment of psychological disorders, commercialization of products, and artistic design. For the first time, we present a novel challenge of emotion-evoked image generation, aiming to synthesize images that evoke target emotions while retaining the semantics and structures of the original scenes. To address this challenge, we propose a diffusion model capable of effectively understanding and editing source images to convey desired emotions and sentiments. Moreover, due to the lack of emotion editing datasets, we provide a unique dataset consisting of 340,000 pairs of images and their emotion annotations.

StyleDyRF: Zero-shot 4D Style Transfer for Dynamic Neural Radiance Fields

图片

4D style transfer aims at transferring arbitrary visual style to the synthesized novel views of a dynamic 4D scene with varying viewpoints and times. Existing efforts on 3D style transfer can effectively combine the visual features of style images and neural radiance fields (NeRF) but fail to handle the 4D dynamic scenes limited by the static scene assumption. Consequently, we aim to handle the novel challenging problem of 4D style transfer for the first time, which further requires the consistency of stylized results on dynamic objects. In this paper, we introduce StyleDyRF, a method that represents the 4D feature space by deforming a canonical feature volume and learns a linear style transformation matrix on the feature volume in a data-driven fashion. To obtain the canonical feature volume, the rays at each time step are deformed with the geometric prior of a pre-trained dynamic NeRF to render the feature map under the supervision of pre-trained visual encoders.

ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation

图片

Performing language-conditioned robotic manipulation tasks in unstructured environments is highly demanded for general intelligent robots. Conventional robotic manipulation methods usually learn semantic representation of the observation for action prediction, which ignores the scene-level spatiotemporal dynamics for human goal completion. In this paper, we propose a dynamic Gaussian Splatting method named ManiGaussian for multi-task robotic manipulation, which mines scene dynamics via future scene reconstruction. Specifically, we first formulate the dynamic Gaussian Splatting framework that infers the semantics propagation in the Gaussian embedding space, where the semantic representation is leveraged to predict the optimal robot action. Then, we build a Gaussian world model to parameterize the distribution in our dynamic Gaussian Splatting framework, which provides informative supervision in the interactive environment via future scene reconstruction. We evaluate our ManiGaussian on 10 RLBench tasks with 166 variations, and the results demonstrate our framework can outperform the state-of-the-art methods by 13.1\% in average success rate.

Tackling the Singularities at the Endpoints of Time Intervals in  Diffusion Models

图片

Most diffusion models assume that the reverse process adheres to a Gaussian distribution. However, this approximation has not been rigorously validated, especially at singularities, where t=0 and t=1. Improperly dealing with such singularities leads to an average brightness issue in applications, and limits the generation of images with extreme brightness or darkness. We primarily focus on tackling singularities from both theoretical and practical perspectives. Initially, we establish the error bounds for the reverse process approximation, and showcase its Gaussian characteristics at singularity time steps. Based on this theoretical insight, we confirm the singularity at t=1 is conditionally removable while it at t=0 is an inherent property. Upon these significant conclusions, we propose a novel plug-and-play method SingDiffusion to address the initial singular time step sampling, which not only effectively resolves the average brightness issue for a wide range of diffusion models without extra training efforts, but also enhances their generation capability in achieving notable lower FID scores.

Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation

图片

The pre-trained vision-language model, exemplified by CLIP, advances zero-shot semantic segmentation by aligning visual features with class embeddings through a transformer decoder to generate semantic masks. Despite its effectiveness, prevailing methods within this paradigm encounter challenges, including overfitting on seen classes and small fragmentation in masks. To mitigate these issues, we propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.Specifically, we leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings. Moreover, to circumvent noisy alignments from the vision part due to its redundant nature, we introduce route attention into self-attention for finding visual consensus, thereby enhancing semantic consistency within the same object. Equipped with a vision-language prompting strategy, our approach significantly boosts the generalization capacity of segmentation models for unseen classes. Experimental results underscore the effectiveness of our approach, showcasing mIoU gains of 4.5 on the PASCAL VOC 2012 and 3.6 on the COCO-Stuff 164k for unseen classes compared with the state-of-the-art methods.

PFStorer: Personalized Face Restoration and Super-Resolution

图片

Recent developments in face restoration have achieved remarkable results in producing high-quality and lifelike outputs. The stunning results however often fail to be faithful with respect to the identity of the person as the models lack necessary context. In this paper, we explore the potential of personalized face restoration with diffusion models. In our approach a restoration model is personalized using a few images of the identity, leading to tailored restoration with respect to the identity while retaining fine-grained details. By using independent trainable blocks for personalization, the rich prior of a base restoration model can be exploited to its fullest. To avoid the model relying on parts of identity left in the conditioning low-quality images, a generative regularizer is employed. With a learnable parameter, the model learns to balance between the details generated based on the input image and the degree of personalization. Moreover, we improve the training pipeline of face restoration models to enable an alignment-free approach. We showcase the robust capabilities of our approach in several real-world scenarios with multiple identities, demonstrating our method's ability to generate fine-grained details with faithful restoration. In the user study we evaluate the perceptual quality and faithfulness of the genereated details, with our method being voted best 61% of the time compared to the second best with 25% of the votes.

Towards Dense and Accurate Radar Perception Via Efficient Cross-Modal Diffusion Model

图片

Millimeter wave (mmWave) radars have attracted significant attention from both academia and industry due to their capability to operate in extreme weather conditions. However, they face challenges in terms of sparsity and noise interference, which hinder their application in the field of micro aerial vehicle (MAV) autonomous navigation. To this end, this paper proposes a novel approach to dense and accurate mmWave radar point cloud construction via cross-modal learning. Specifically, we introduce diffusion models, which possess state-of-the-art performance in generative modeling, to predict LiDAR-like point clouds from paired raw radar data. We also incorporate the most recent diffusion model inference accelerating techniques to ensure that the proposed method can be implemented on MAVs with limited computing resources.

Gaussian Splatting in Style

图片

Scene stylization extends the work of neural style transfer to three spatial dimensions. A vital challenge in this problem is to maintain the uniformity of the stylized appearance across a multi-view setting. A vast majority of the previous works achieve this by optimizing the scene with a specific style image. In contrast, we propose a novel architecture trained on a collection of style images, that at test time produces high quality stylized novel views. Our work builds up on the framework of 3D Gaussian splatting. For a given scene, we take the pretrained Gaussians and process them using a multi resolution hash grid and a tiny MLP to obtain the conditional stylised views. The explicit nature of 3D Gaussians give us inherent advantages over NeRF-based methods including geometric consistency, along with having a fast training and rendering regime.

OneVOS: Unifying Video Object Segmentation with All-in-One Transformer  Framework

图片

Contemporary Video Object Segmentation (VOS) approaches typically consist stages of feature extraction, matching, memory management, and multiple objects aggregation. Recent advanced models either employ a discrete modeling for these components in a sequential manner, or optimize a combined pipeline through substructure aggregation. However, these existing explicit staged approaches prevent the VOS framework from being optimized as a unified whole, leading to the limited capacity and suboptimal performance in tackling complex videos. In this paper, we propose OneVOS, a novel framework that unifies the core components of VOS with All-in-One Transformer. Specifically, to unify all aforementioned modules into a vision transformer, we model all the features of frames, masks and memory for multiple objects as transformer tokens, and integrally accomplish feature extraction, matching and memory management of multiple objects through the flexible attention mechanism. Furthermore, a Unidirectional Hybrid Attention is proposed through a double decoupling of the original attention operation, to rectify semantic errors and ambiguities of stored tokens in OneVOS framework. Finally, to alleviate the storage burden and expedite inference, we propose the Dynamic Token Selector, which unveils the working mechanism of OneVOS and naturally leads to a more efficient version of OneVOS.

VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

图片

We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3d pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions.

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1516833.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

Selenium 学习(0.20)——软件测试之单元测试

我又(浪完)回来了…… 很久没有学习了,今天忙完终于想起来学习了。没有学习的这段时间,主要是请了两个事假(5工作日和10工作日)放了个年假(13天),然后就到现在了。 看了下…

每周一算法:A*(A Star)算法

八数码难题 题目描述 在 3 3 3\times 3 33 的棋盘上,摆有八个棋子,每个棋子上标有 1 1 1 至 8 8 8 的某一数字。棋盘中留有一个空格,空格用 0 0 0 来表示。空格周围的棋子可以移到空格中。要求解的问题是:给出一种初始布局…

微信小程序购物/超市/餐饮/酒店商城开发搭建过程和需求

1. 商城开发的基本框架 a. 用户界面(Frontend) 页面设计:包括首页、商品列表、商品详情、购物车、下单界面、用户中心等。交云设计:如何让用户操作更加流畅,包括搜索、筛选、排序等功能的实现。响应式设计&#xff1…

【JAVA重要知识 | 第六篇】Java集合类使用总结(List、Set、Map接口及常见实现类)以及常见面试题

文章目录 6.Java集合类使用总结6.1概览6.1.1集合接口类特性6.1.2List接口和Set接口的区别6.1.3简要介绍(1)List接口(2)Set接口(3)Map接口 6.2Collection接口6.3List接口6.3.1ArrayList6.3.2LinkedList—不常…

网络模块使用Hilt注入

retrofit的异步回调方法已经做了线程切换&#xff0c;切换到了主线程 <?xml version"1.0" encoding"utf-8"?> <manifest xmlns:android"http://schemas.android.com/apk/res/android"><uses-permission android:name"andr…

Solidity 智能合约开发 - 基础:基础语法 基础数据类型、以及用法和示例

苏泽 大家好 这里是苏泽 一个钟爱区块链技术的后端开发者 本篇专栏 ←持续记录本人自学两年走过无数弯路的智能合约学习笔记和经验总结 如果喜欢拜托三连支持~ 本篇主要是做一个知识的整理和规划 作为一个类似文档的作用 更为简要和明了 具体的实现案例和用法 后续会陆续给出…

Milvus向量数据库检索

官方文档&#xff1a;https://milvus.io/docs/search.md   本节介绍如何使用 Milvus 搜索实体。   Milvus 中的向量相似度搜索会计算查询向量与具有指定相似度度量的集合中的向量之间的距离&#xff0c;并返回最相似的结果。您可以通过指定过滤标量字段或主键字段的布尔表达…

Redis及其数据类型和常用命令(一)

Redis 非关系型数据库&#xff0c;不需要使用sql语句对数据库进行操作&#xff0c;而是使用命令进行操作&#xff0c;在数据库存储时使用键值对进行存储&#xff0c;应用场景广泛&#xff0c;一般存储访问频率较高的数据。 一般关系型数据库&#xff08;使用sql语句进行操作的…

市场复盘总结 20240314

仅用于记录当天的市场情况&#xff0c;用于统计交易策略的适用情况&#xff0c;以便程序回测 短线核心&#xff1a;不参与任何级别的调整&#xff0c;采用龙空龙模式 一支股票 10%的时候可以操作&#xff0c; 90%的时间适合空仓等待 二进三&#xff1a; 进级率中 25% 最常用的…

二.递归及实例(汉诺塔问题)

目录 5.递归 6-递归实例:汉诺塔问题 思路: 详细过程: 代码: 5.递归 调用自身 结束条件 6-递归实例:汉诺塔问题 思路: 结果: 详细过程: 代码: #n为盘子的个数 a,b,c分别为3个地方. def hannuta(n,a,b,c): ​if n>0:hannuta(n-1,a,c,b) #将n-1个从a经过c移到到b(a…

【C语言_C语言语句_复习篇】

目录 一、C语言的语句有哪些 1.1 空语句 1.2 表达式语句 1.3 函数调用语句 1.4 复合语句 1.5 控制语句 二、分支语句&#xff08;两种&#xff09; 1.1 if语句 1.1.1 普通分支语句(if、if_else) 1.1.2 嵌套if语句 1.1.3 else嵌套if两种写法的比较 1.1.4 else悬空问题 1.1.…

代码随想录算法训练营第四十七天|动态规划|198.打家劫舍、213.打家劫舍II、337.打家劫舍III

198.打家劫舍 文章 你是一个专业的小偷&#xff0c;计划偷窃沿街的房屋。每间房内都藏有一定的现金&#xff0c;影响你偷窃的唯一制约因素就是相邻的房屋装有相互连通的防盗系统&#xff0c;如果两间相邻的房屋在同一晚上被小偷闯入&#xff0c;系统会自动报警。 给定一个代…

Docker容器化技术(使用Docker搭建论坛)

第一步&#xff1a;删除容器镜像文件 [rootlocalhost ~]# docker rm -f docker ps -aq b09ee6438986 e0fe8ebf3ba1第二步&#xff1a;使用docker拉取数据库 [rootlocalhost ~]# docker run -d --name db mysql:5.7 02a4e5bfffdc81cb6403985fe4cd6acb0c5fab0b19edf9f5b8274783…

NLP:HanLP的下载与使用

昨天说到要做一个自定义的训练模型&#xff0c;但是很快这个想法就被扑灭了&#xff0c;因为这个手工标记的成本太大&#xff0c;而且我的上级并不是想要我做这个场景&#xff0c;而是希望我通过这个场景展示出可以接下最终需求的能力。换句话来说&#xff1a;可以&#xff0c;…

leetcode代码记录(找到小镇的法官

目录 1. 题目&#xff1a;2. 我的代码&#xff1a;小结&#xff1a; 1. 题目&#xff1a; 小镇里有 n 个人&#xff0c;按从 1 到 n 的顺序编号。传言称&#xff0c;这些人中有一个暗地里是小镇法官。 如果小镇法官真的存在&#xff0c;那么&#xff1a; 小镇法官不会信任任何…

【数据结构】树与堆 (向上/下调整算法和复杂度的分析、堆排序以及topk问题)

文章目录 1.树的概念1.1树的相关概念1.2树的表示 2.二叉树2.1概念2.2特殊二叉树2.3二叉树的存储 3.堆3.1堆的插入&#xff08;向上调整&#xff09;3.2堆的删除&#xff08;向下调整&#xff09;3.3堆的创建3.3.1使用向上调整3.3.2使用向下调整3.3.3两种建堆方式的比较 3.4堆排…

新 树莓派4B 温湿度监测 基于debian12的树莓派OS

前言 本文旨在完成通过外接温湿度传感器至树莓派使得树莓派不断记录并存储温湿度数据 这个领域有很多文章&#xff0c;但是部分文章已经缺乏了时效性&#xff0c;在最新系统不适用&#xff0c;本文目前适用 硬件 硬件连接 温湿度传感器常选用DHT11和DHT22&#xff0c;淘宝…

MTK的flash_tool.exe中,“Format-Download”、“Firmware-Upgrade”和“Download”是三种不同的刷机模式

在MTK的flash_tool.exe中&#xff0c;“Format-Download”、“Firmware-Upgrade”和“Download”是三种不同的刷机模式。具体分析如下&#xff1a; Format-Download&#xff1a;这种模式会执行全擦除&#xff0c;即清除存储器中的所有数据&#xff0c;然后下载新的固件。这种方…

图片编辑器tui-image-editor

提示&#xff1a;图片编辑器tui-image-editor 文章目录 前言一、安装tui-image-editor二、新建components/ImageEditor.vue三、修改App.vue四、效果五、遇到问题 this.getResolve is not a function总结 前言 需求&#xff1a;图片编辑器tui-image-editor 一、安装tui-image-ed…

DB算法原理与构建

参考&#xff1a; https://aistudio.baidu.com/projectdetail/4483048 Real-Time Scene Text Detection with Differentiable Binarization 如何读论文-by 李沐 DB (Real-Time Scene Text Detection with Differentiable Binarization) 原理 DB是一个基于分割的文本检测算…