DreamPolisher, InternLM2, AniArtAvatar, PlainMamba, AniPortrait

This article first appeared on the WeChat official account 机器感知.

DreamPolisher: Towards High-Quality Text-to-3D Generation via Geometric Diffusion

We present DreamPolisher, a novel Gaussian Splatting based method with geometric guidance, tailored to learn cross-view consistency and intricate detail from textual descriptions. While recent progress on text-to-3D generation has been promising, prevailing methods often fail to ensure view consistency and textural richness. This problem is particularly noticeable for methods that work from text input alone. To address this, we propose a two-stage Gaussian Splatting based approach that enforces geometric consistency among views. Initially, a coarse 3D generation undergoes refinement via geometric optimization. Subsequently, we use a ControlNet-driven refiner coupled with a geometric consistency term to improve both the texture fidelity and overall consistency of the generated 3D asset. Empirical evaluations across diverse textual prompts spanning various object categories demonstrate the efficacy of DreamPolisher in generating consistent and realistic 3D objects that align closely with the semantics of the textual instructions.
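
The abstract does not spell out the geometric consistency term. As a rough illustration only, here is a minimal PyTorch sketch of one plausible cross-view consistency loss, assuming per-view embeddings from a frozen image encoder (the function name and setup are hypothetical, not the paper's actual formulation):

```python
import torch

def cross_view_consistency(view_embeddings: torch.Tensor) -> torch.Tensor:
    """Pull per-view feature embeddings toward their mean.

    view_embeddings: (V, D), one embedding per rendered view, e.g. from a
    frozen image encoder. Minimizing the spread encourages all rendered
    views of the 3D asset to agree on appearance.
    """
    mean = view_embeddings.mean(dim=0, keepdim=True)          # (1, D)
    return ((view_embeddings - mean) ** 2).sum(dim=1).mean()  # scalar loss
```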

InternLM2 Technical Report

The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies: it is initially trained on 4k-token sequences before advancing to 32k tokens during pre-training and fine-tuning, exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.
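
The 4k-to-32k context extension is a progressive schedule. Below is a hedged sketch of how such a curriculum is commonly implemented with rotary position embeddings; the RoPE base values are assumptions for illustration, not reported InternLM2 hyperparameters:

```python
import torch

def rope_frequencies(dim: int, base: float = 10000.0) -> torch.Tensor:
    """Inverse frequencies for rotary position embeddings (RoPE)."""
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

# Hypothetical two-phase schedule mirroring the 4k -> 32k curriculum:
# pre-train on short sequences first, then enlarge the context window
# (often together with a larger RoPE base) and continue training.
phases = [
    {"seq_len": 4096,  "rope_base": 10_000.0},
    {"seq_len": 32768, "rope_base": 1_000_000.0},  # assumed value
]
for phase in phases:
    freqs = rope_frequencies(128, base=phase["rope_base"])
    # ... continue training with sequences of phase["seq_len"] tokens ...
```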

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to dominate a broad range of NLP tasks. Despite their superior accuracy, LLMs present unique challenges in practical inference owing to their compute- and memory-intensive nature. Thanks to the autoregressive characteristic of LLM inference, KV caching for the attention layers in Transformers can effectively accelerate LLM inference by substituting quadratic-complexity computation with linear-complexity memory accesses. Yet, this approach requires increasing memory as demand grows for processing longer sequences. The overhead leads to reduced throughput due to I/O bottlenecks and even out-of-memory errors, particularly on resource-constrained systems like a single commodity GPU. In this paper, we propose ALISA, a novel algorithm-system co-design solution to address the challenges imposed by KV caching. On the algorithm level, ALISA prioritizes the tokens that matter most for generating a new token via a Sparse Window Attention (SWA) algorithm. SWA introduces high sparsity in attention layers and reduces the memory footprint of KV caching with negligible accuracy loss. On the system level, ALISA employs three-phase token-level dynamic scheduling and optimizes the trade-off between caching and recomputation, thus maximizing the overall performance in resource-constrained systems. In a single GPU-CPU system, we demonstrate that under varying workloads, ALISA improves the throughput of baseline systems such as FlexGen and vLLM by up to 3X and 1.9X, respectively.
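
Sparse Window Attention keeps a mixture of locally recent and globally important tokens. Here is a minimal sketch of one way such a KV-cache keep-mask could look, assuming a per-token importance score such as accumulated attention weight (a simplification of SWA, not the paper's exact algorithm):

```python
import torch

def sparse_window_keep_mask(importance: torch.Tensor, window: int, k: int) -> torch.Tensor:
    """Decide which cached KV entries to keep.

    importance: (T,) score per cached token, e.g. accumulated attention
    weight. Keeps the most recent `window` tokens plus the global top-k
    most important ones; everything else is a candidate for eviction.
    """
    T = importance.shape[0]
    keep = torch.zeros(T, dtype=torch.bool)
    keep[max(0, T - window):] = True                         # local window
    keep[torch.topk(importance, min(k, T)).indices] = True   # important tokens
    return keep
```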

AniArtAvatar: Animatable 3D Art Avatar from a Single Image

We present a novel approach for generating animatable 3D-aware art avatars from a single image, with controllable facial expressions, head poses, and shoulder movements. Unlike previous reenactment methods, our approach uses a view-conditioned 2D diffusion model to synthesize multi-view images from a single art portrait with a neutral expression. From the generated colors and normals, we synthesize a static avatar using an SDF-based neural surface. For avatar animation, we extract control points, transfer motion via these points, and deform the implicit canonical space. First, we render the front image of the avatar, extract the 2D landmarks, and project them into 3D space using a trained SDF network. We then extract 3D driving landmarks using a 3DMM and transfer their motion to the avatar landmarks. To animate the avatar's pose, we manually set the body height and bound the head and torso of the avatar with two cages; the head and torso can then be animated by transforming the two cages. Our approach is a one-shot pipeline that can be applied to various styles. Experiments demonstrate that our method can generate high-quality 3D art avatars with the desired control over different motions.
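
In its simplest reading, the head/torso cage animation reduces to applying a rigid transform per body part. A toy NumPy sketch under that simplification (real cage deformation interpolates with cage coordinates rather than a hard mask; all names are hypothetical):

```python
import numpy as np

def animate_with_cages(points, head_mask, head_T, torso_T):
    """Apply per-cage rigid transforms to avatar vertices.

    points: (N, 3) vertices; head_mask: (N,) bool, True for head vertices;
    head_T, torso_T: (4, 4) homogeneous transforms of the two cages.
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    moved_head = (homo @ head_T.T)[:, :3]
    moved_torso = (homo @ torso_T.T)[:, :3]
    return np.where(head_mask[:, None], moved_head, moved_torso)
```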

DiffFAE: Advancing High-fidelity One-shot Facial Appearance Editing with Space-sensitive Customization and Semantic Preservation

Facial Appearance Editing (FAE) aims to modify physical attributes of human facial images, such as pose, expression, and lighting, while preserving attributes like identity and background, and is of great importance in photography. Despite great progress in this area, current research generally faces three challenges: low generation fidelity, poor attribute preservation, and inefficient inference. To overcome these challenges, this paper presents DiffFAE, a one-stage, highly efficient diffusion-based framework tailored for high-fidelity FAE. For high-fidelity transfer of the query attributes, we adopt Space-sensitive Physical Customization (SPC), which ensures fidelity and generalization by utilizing rendered textures derived from a 3D Morphable Model (3DMM). To preserve source attributes, we introduce Region-responsive Semantic Composition (RSC). This module is guided to learn decoupled source-related features, thereby better preserving identity and alleviating artifacts from non-facial attributes such as hair, clothes, and background. We further introduce a consistency regularization for our pipeline to enhance editing controllability by leveraging prior knowledge in the attention matrices of the diffusion model. Extensive experiments demonstrate the superiority of DiffFAE over existing methods, achieving state-of-the-art performance in facial appearance editing.
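
The consistency regularization on attention matrices is not given in closed form here. One plausible shape for such a term, sketched under the assumption that it matches attention maps between a source and an edited denoising pass:

```python
import torch

def attention_consistency(attn_src: torch.Tensor, attn_edit: torch.Tensor) -> torch.Tensor:
    """Keep the edited pass's attention structure close to the source pass.

    attn_src, attn_edit: (heads, Q, K) attention maps taken from the same
    layer and timestep of two diffusion forward passes.
    """
    return torch.mean((attn_src.detach() - attn_edit) ** 2)
```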

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image. Our methodology is divided into two stages. Initially, we extract 3D intermediate representations from audio and project them into a sequence of 2D facial landmarks. Subsequently, we employ a robust diffusion model, coupled with a motion module, to convert the landmark sequence into photorealistic and temporally consistent portrait animation. Experimental results demonstrate the superiority of AniPortrait in terms of facial naturalness, pose diversity, and visual quality, thereby offering an enhanced perceptual experience. Moreover, our methodology exhibits considerable potential in terms of flexibility and controllability, which can be effectively applied in areas such as facial motion editing or face reenactment. We release code and model weights at https://github.com/scutzzj/AniPortrait
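
The first stage projects 3D facial representations to 2D landmark sequences. A minimal sketch of that projection step, assuming a standard pinhole camera model (the paper's exact projection setup may differ):

```python
import numpy as np

def project_landmarks(lm3d: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project 3D facial landmarks to 2D image coordinates.

    lm3d: (N, 3) landmarks in world space; K: (3, 3) camera intrinsics;
    R: (3, 3) rotation and t: (3,) translation of the camera pose.
    """
    cam = lm3d @ R.T + t            # world -> camera coordinates
    uv = cam @ K.T                  # camera -> image plane
    return uv[:, :2] / uv[:, 2:3]   # perspective divide -> (N, 2) pixels
```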

PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition

We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs, in particular, PlainMamba requires much less computing while maintaining high performance. Code and models are available at https://github.com/ChenhongyiYang/PlainMamba
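
The continuous 2D scanning keeps consecutive tokens spatially adjacent. A minimal sketch of a snake (boustrophedon) ordering over a token grid, which is one simple scan satisfying that adjacency property:

```python
def continuous_2d_scan(h: int, w: int) -> list[int]:
    """Snake-order flat indices of an h x w token grid: even rows run
    left-to-right, odd rows right-to-left, so each token in the scan
    sequence is spatially adjacent to the previous one."""
    order = []
    for row in range(h):
        cols = range(w) if row % 2 == 0 else range(w - 1, -1, -1)
        order.extend(row * w + c for c in cols)
    return order

print(continuous_2d_scan(3, 3))  # [0, 1, 2, 5, 4, 3, 6, 7, 8]
```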

Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

Diffusion models have recently brought a powerful revolution in image generation. Despite their impressive generative capabilities, most of these models rely on the current sample to denoise the next one, which can cause denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all prior samples. Instead of simply applying a moving average to the denoised samples at different timesteps, we first map the denoised samples to data space and then perform the moving average there, to avoid distribution shift across timesteps. Given that diffusion models evolve the recovery from low-frequency components to high-frequency details, we further decompose the samples into different frequency components and apply the moving average to each component separately. We name the complete approach "Moving Average Sampling in Frequency domain (MASF)". MASF can be seamlessly integrated into mainstream pre-trained diffusion models and sampling schedules. Extensive experiments on both unconditional and conditional diffusion models demonstrate that MASF achieves superior performance compared to the baselines, with almost negligible additional complexity cost.
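
Here is a hedged sketch of the band-wise moving average idea: split the predicted clean sample into low- and high-frequency parts with an FFT box filter and EMA each band separately. This is a simplification of MASF; the cutoff, EMA weight, and recomposition are illustrative choices:

```python
import torch

def masf_step(ema_low, ema_high, x0: torch.Tensor, cutoff: int, beta: float = 0.9):
    """One moving-average update on a predicted clean sample x0 of shape (C, H, W).

    Returns the updated per-band EMA states and the fused sample.
    """
    spec = torch.fft.fftshift(torch.fft.fft2(x0), dim=(-2, -1))
    H, W = x0.shape[-2:]
    low_mask = torch.zeros_like(spec, dtype=torch.bool)
    low_mask[..., H // 2 - cutoff:H // 2 + cutoff, W // 2 - cutoff:W // 2 + cutoff] = True
    low, high = spec * low_mask, spec * ~low_mask
    ema_low = low if ema_low is None else beta * ema_low + (1 - beta) * low
    ema_high = high if ema_high is None else beta * ema_high + (1 - beta) * high
    fused = torch.fft.ifft2(torch.fft.ifftshift(ema_low + ema_high, dim=(-2, -1))).real
    return ema_low, ema_high, fused
```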

Superior and Pragmatic Talking Face Generation with Teacher-Student Framework

Talking face generation technology creates talking videos from arbitrary appearance and motion signals; the "arbitrary" offers ease of use but also introduces challenges in practical applications. Existing methods work well with standard inputs but suffer serious performance degradation on intricate real-world ones. Moreover, efficiency is an important concern in deployment. To comprehensively address these issues, we introduce SuperFace, a teacher-student framework that balances quality, robustness, cost, and editability. We first propose a simple but effective teacher model capable of handling inputs of varying quality to generate high-quality results. Building on this, we devise an efficient distillation strategy to obtain an identity-specific student model that maintains quality with a significantly reduced computational load. Our experiments validate that SuperFace offers a more comprehensive solution than existing methods for the four objectives above, in particular reducing FLOPs by 99% with the student model. SuperFace can be driven by both video and audio and allows for localized editing of facial attributes.
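
The distillation strategy itself is not detailed in the abstract. A generic hedged sketch of a teacher-student objective for this setting, matching the teacher's output frames plus an intermediate feature map (names and weighting are hypothetical):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, student_feat, teacher_feat, alpha: float = 0.5):
    """Match the teacher's generated frames and one intermediate feature map.

    The teacher tensors are detached so gradients only flow into the student.
    """
    pixel = F.l1_loss(student_out, teacher_out.detach())
    feat = F.mse_loss(student_feat, teacher_feat.detach())
    return pixel + alpha * feat
```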

TC4D: Trajectory-Conditioned Text-to-4D Generation

Recent techniques for text-to-4D generation synthesize dynamic 3D scenes using supervision from pre-trained text-to-video models. However, existing representations for motion, such as deformation models or time-dependent neural representations, are limited in the amount of motion they can generate; they cannot synthesize motion extending far beyond the bounding box used for volume rendering. The lack of a more flexible motion model contributes to the gap in realism between 4D generation methods and recent, near-photorealistic video generation models. Here, we propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We represent the global motion of a scene's bounding box using rigid transformation along a trajectory parameterized by a spline. We learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion, which we evaluate qualitatively and through a user study. Video results can be viewed on our website: https://sherwinbahmani.github.io/tc4d.
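
The global motion component is a rigid transform along a spline. A toy NumPy sketch of sampling such a pose from a cubic Bézier trajectory, orienting the object along the tangent (this stands in for the paper's spline parameterization and assumes roughly planar motion):

```python
import numpy as np

def bezier(p0, p1, p2, p3, t: float) -> np.ndarray:
    """Point on a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    return u**3 * p0 + 3 * u**2 * t * p1 + 3 * u * t**2 * p2 + t**3 * p3

def pose_on_trajectory(ctrl, t: float, eps: float = 1e-3):
    """Rigid pose (R, T) at parameter t: translation from the curve,
    rotation aligning the object's +x axis with the curve tangent."""
    T = bezier(*ctrl, t)
    tangent = bezier(*ctrl, min(t + eps, 1.0)) - bezier(*ctrl, max(t - eps, 0.0))
    fwd = tangent / (np.linalg.norm(tangent) + 1e-8)
    up = np.array([0.0, 0.0, 1.0])
    side = np.cross(up, fwd)
    side /= np.linalg.norm(side) + 1e-8
    R = np.stack([fwd, side, np.cross(fwd, side)], axis=1)  # columns: x, y, z axes
    return R, T
```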

ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture  Synthesis

Gestures play a key role in human communication. Recent methods for co-speech gesture generation, while managing to produce beat-aligned motions, struggle to generate gestures that are semantically aligned with the utterance. Compared to beat gestures, which align naturally to the audio signal, semantically coherent gestures require modeling the complex interactions between language and human motion, and should be controllable by focusing on certain words. Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis, which can not only generate gestures based on multi-modal speech inputs but also facilitate controllability in gesture synthesis. Our method proposes two guidance objectives that allow users to modulate the impact of the different conditioning modalities (e.g., audio vs. text) and to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures. To further advance research on multi-party interactive gestures, we release the DnD Group Gesture dataset, which contains 6 hours of gesture data showing 5 people interacting with one another. We compare our method with several recent works and demonstrate its effectiveness on a variety of tasks. We urge the reader to watch our supplementary video on our website.
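
Per-modality guidance weights can be read as a multi-condition variant of classifier-free guidance. A minimal sketch of that combination rule, assuming separate noise predictions per conditioning (an interpretation of the two guidance objectives, not the paper's exact formulation):

```python
import torch

def multi_condition_guidance(eps_uncond, eps_audio, eps_text, w_audio: float, w_text: float):
    """Combine per-modality noise predictions with separate guidance weights,
    so the user can dial the influence of audio vs. text conditioning."""
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)
            + w_text * (eps_text - eps_uncond))
```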

Efficient Video Object Segmentation via Modulated Cross-Attention Memory

Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while efficiently maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks, LVOS, Long-Time Video, and DAVIS 2017, demonstrate the effectiveness of our proposed contributions, leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x while significantly reducing GPU memory by 87%, with comparable segmentation performance on short and long video datasets. Notably, on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS.
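
The key property of the MCA memory is that its size stays fixed, so a read costs the same regardless of video length. A minimal single-head sketch of reading such a constant-size memory with cross-attention (a simplification; the actual MCA adds modulation and multi-granularity features):

```python
import torch
import torch.nn.functional as F

def read_memory(query: torch.Tensor, mem_k: torch.Tensor, mem_v: torch.Tensor) -> torch.Tensor:
    """Single-head cross-attention read of a fixed-size memory.

    query: (Q, D) frame features; mem_k, mem_v: (M, D) memory keys/values.
    Because M is constant, cost does not grow with the number of frames.
    """
    attn = F.softmax(query @ mem_k.T / mem_k.shape[-1] ** 0.5, dim=-1)  # (Q, M)
    return attn @ mem_v                                                # (Q, D)
```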
