​Inf-DiT:Upsampling Any-Resolution Image、Vidu、MVDiff、Trio-ViT

news2025/1/22 1:11:16

本文首发于公众号:机器感知

​Inf-DiT:Upsampling Any-Resolution Image、Vidu、MVDiff、Trio-ViT

图片

Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

图片

Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory during generating ultra-high-resolution images (e.g. 4096*4096), the resolution of generated images is often limited to 1024*1024. In this work. we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during the inference process and handle global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves SOTA performance in generating ultra-high-resolution images in both machine and human evaluation. Compared to commonly used UNet structures, our model can save more than 5x memory when generating 4096*4096 images. The project URL is .......

Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

图片

We introduce Vidu, a high-performance text-to-video generator that is capable of producing 1080p videos up to 16 seconds in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which unlocks the scalability and the capability for handling long videos. Vidu exhibits strong coherence and dynamism, and is capable of generating both realistic and imaginative videos, as well as understanding some professional photography techniques, on par with Sora -- the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation, including canny-to-video generation, video prediction and subject-driven generation, which demonstrate promising results.......

Space-time Reinforcement Network for Video Object Segmentation

图片

Recently, video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching to memory frames. Despite these methods having superior performance, they suffer from two issues: 1) Challenging data can destroy the space-time coherence between adjacent video frames. 2) Pixel-level matching will lead to undesired mismatching caused by the noises or distractors. To address the aforementioned issues, we first propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query one. Next, we learn a prototype for each video object and prototype-level matching can be implemented between the query and memory. The experiment demonstrated that our network outperforms the state-of-the-art method on the DAVIS 2017, achieving a J&F score of 86.4%, and attains a competitive result 85.0% on YouTube VOS 2018. In addition, our network exhibits a high inference sp......

Structured Click Control in Transformer-based Interactive Segmentation

图片

Click-point-based interactive segmentation has received widespread attention due to its efficiency. However, it's hard for existing algorithms to obtain precise and robust responses after multiple clicks. In this case, the segmentation results tend to have little change or are even worse than before. To improve the robustness of the response, we propose a structured click intent model based on graph neural networks, which adaptively obtains graph nodes via the global similarity of user-clicked Transformer tokens. Then the graph nodes will be aggregated to obtain structured interaction features. Finally, the dual cross-attention will be used to inject structured interaction features into vision Transformer features, thereby enhancing the control of clicks over segmentation results. Extensive experiments demonstrated the proposed algorithm can serve as a general structure in improving Transformer-based interactive segmenta?tion performance. The code and data will be released at......

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

图片

In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language. SEED-Data-Edit is composed of three distinct types of data: (1) High-quality editing data produced by an automated pipeline, ensuring a substantial volume of diverse image editing pairs. (2) Real-world scenario data collected from the internet, which captures the intricacies of user intentions for promoting the practical application of image editing in the real world. (3) High-precision multi-turn editing data annotated by humans, which involves multiple rounds of edits for simulating iterative editing processes. The combination of these diverse data sources makes SEED-Data-Edit a comprehensive and versatile dataset for training language-guided image editing model. We fine-tune a pretrained Multimodal Large Language Model (MLLM) that unifies comprehension and generation with SEED-Data-Edit. T......

Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

图片

Current state-of-the-art diffusion models employ U-Net architectures containing convolutional and (qkv) self-attention layers. The U-Net processes images while being conditioned on the time embedding input for each sampling step and the class or caption embedding input corresponding to the desired conditional generation. Such conditioning involves scale-and-shift operations to the convolutional layers but does not directly affect the attention layers. While these standard architectural choices are certainly effective, not conditioning the attention layers feels arbitrary and potentially suboptimal. In this work, we show that simply adding LoRA conditioning to the attention layers without changing or tuning the other parts of the U-Net architecture improves the image generation quality. For example, a drop-in addition of LoRA conditioning to EDM diffusion model yields FID scores of 1.91/1.75 for unconditional and class-conditional CIFAR-10 generation, improving upon the baseli......

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

图片

Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model qual......

MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View

图片

Generating consistent multiple views for 3D reconstruction tasks is still a challenge to existing image-to-3D diffusion models. Generally, incorporating 3D representations into diffusion model decrease the model's speed as well as generalizability and quality. This paper proposes a general framework to generate consistent multi-view images from single image or leveraging scene representation transformer and view-conditioned diffusion model. In the model, we introduce epipolar geometry constraints and multi-view attention to enforce 3D consistency. From as few as one image input, our model is able to generate 3D meshes surpassing baselines methods in evaluation metrics, including PSNR, SSIM and LPIPS.......

Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer

图片

Motivated by the huge success of Transformers in the field of natural language processing (NLP), Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks. However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization. Unfortunately, due to the existence of hardware-unfriendly and quantization-sensitive non-linear operations, particularly {Softmax}, it is non-trivial to completely quantize all operations in ViTs, yielding either significant accuracy drops or non-negligible hardware costs. In response to challenges associated with \textit{standard ViTs}, we focus our attention towards the quantization and acceleration for \textit{efficient ViTs}, which not only eliminate the troublesome Softmax but also integrate linear attention with low computational complexity, and propose \emph{Trio-ViT} accordingl......

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1657737.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

C++:STL-string

前言 本文主要介绍STL六大组件中的容器之一:string,在学习C的过程中,我们要将C视为一个语言联邦(摘录于Effective C条款一)。如何理解这句话呢,我们学习C,可将其分为四个板块;分别为…

基于springboot实现医院药品管理系统项目【项目源码+论文说明】

基于springboot实现医院药品管理系统演示 摘要 身处网络时代,随着网络系统体系发展的不断成熟和完善,人们的生活也随之发生了很大的变化,人们在追求较高物质生活的同时,也在想着如何使自身的精神内涵得到提升,而读书就…

python-类和对象

1、设计一个 Circle类来表示圆,这个类包含圆的半径以及求面积和周长的函数。再使用这个类创建半径为1~10的圆,并计算出相应的面积和周长。 (1)源代码: import math class Circle: def __init__(self, r): self.r r #面积 def area(self): r…

嵌入式开发九:STM32时钟系统

时钟对于单片机来说是非常重要的,它为单片机工作提供一个稳定的机器周期从而使系统能够正常运行。时钟系统犹如人的心脏,一旦有问题整个系统就崩溃。我们知道 STM32 属于高级单片机,其内部有很多的外设,但不是所有外设都使用同一时…

IO 5.9号

创建一对父子进程&#xff1a; 父进程负责向文件中写入 长方形的长和宽 子进程负责读取文件中的长宽信息后&#xff0c;计算长方形的面积 #include <myhead.h>int main(int argc, const char *argv[]){int retvalfork();if(retval>0){float length,width;int wfdopen(…

【二维数组】

目录 作业 对比&#xff1a; 结果&#xff1a; 二维数组 二维数组的初始化 作业 作业 #define max(a,b)(a>b)?a:b #include<stdio.h> int main() {int x, y,c;scanf("%d %d", &x,&y);cmax(x, y);printf("%d", c);return 0; } 对比…

关于模型参数融合的思考

模型参数融合通常指的是在训练过程中或训练完成后将不同模型的参数以某种方式结合起来&#xff0c;以期望得到更好的性能。这种融合可以在不同的层面上进行&#xff0c;例如在神经网络的不同层之间&#xff0c;或者是在完全不同的模型之间。模型参数融合的目的是结合不同模型的…

震惊,现在面试都加科技与狠货了

震惊&#xff0c;现在面试都加科技与狠货了 生成式AI盛行的现在&#xff0c;程序员找工作变容易了吗我和老痒喝着大酒&#xff0c;吃着他的高升宴&#xff0c;听他说他面试的各种细节&#xff0c;老狗我只恨自己动作慢了一步&#xff0c;不然现在在那侃侃而谈的就是我了。 面试…

【深度学习】【Lora训练2】StabelDiffusion,Lora训练过程,秋叶包,Linux,SDXL Lora训练

文章目录 一、如何为图片打标1.1. 打标工具1.1.1. 秋叶中使用的WD1.41.1.2. 使用BLIP21.1.3. 用哪一种 二、 Lora训练数据的要求2.1 图片要求2.2 图片的打标要求 三、 Lora的其他问题qa1qa2qa3qa4qa5 四、 对图片的处理细节4.1. 图片尺寸问题4.2. 图片内容选取问题4.3. 什么是一…

深入浅出,一文搞懂向量数据库工作原理和应用

大家好&#xff0c;在今天这个数据复杂性日益增长和高维信息丰富的时代&#xff0c;传统数据库在高效处理和提取复杂数据集方面已显得捉襟见肘。向量数据库&#xff0c;作为一项应运而生的技术创新&#xff0c;成功解决了数据领域在不断扩展过程中所面临的挑战。 1.向量数据库…

常见的一些RELAXED MODEL CONCEPTS

释放一致性(release consistency, RC) RC的核心观点是&#xff1a;使用 FENCE 围绕所有同步操作是多余的 同步获取 (acquire) 只需要一个后续的 FENCE&#xff0c;同步释放 (release) 只需要一个前面的 FENCE。 对于表 5.4 的临界区示例&#xff0c;可以省略 FENCE F11、F14…

Vue3专栏项目 -- 一、第一个页面(下)

一、Dropdown 组件&#xff08;下拉菜单组件&#xff09;编码 1、基本功能&#xff1a;展示出下拉按钮和下拉菜单栏的样式 我们可以通过bootstrap来实现这个下拉框&#xff0c;需要注意它这个只是有样式&#xff0c;是没有行为的 然后这个下拉按钮的文字展示是根据用户名称展…

洗地机什么品牌好?洗地机怎么选?618洗地机选购指南

随着科技的飞速发展&#xff0c;洗地机以其高效的清洁能力、稳定的性能和用户友好的设计而闻名&#xff0c;不仅可以高效吸尘、拖地&#xff0c;还不用手动洗滚布&#xff0c;已经逐渐成为现代家庭不可或缺的清洁助手。然而&#xff0c;在众多品牌和型号中&#xff0c;如何选择…

Python专题:七、函数初探

代码的重用,重复的机械性功能 封装性,不用了解其组成原理 易于维护,更新 def是关键词,函数定义,add3函数名(自定义)三个数相加,a,b,c是函数的形式参数,需要注意的是,在出现三个点号之后,还需再输入一个回车,出现三个尖括号,才算函数定义完成,定义完之后就可以使…

MySQL 通过 systemd 启动时 hang 住了……

mysqld&#xff1a;哥&#xff0c;我起不来了…… 作者&#xff1a;贲绍华&#xff0c;爱可生研发中心工程师&#xff0c;负责项目的需求与维护工作。其他身份&#xff1a;柯基铲屎官。 爱可生开源社区出品&#xff0c;原创内容未经授权不得随意使用&#xff0c;转载请联系小编…

网工内推 | 技术支持工程师,最高15k,加班有补贴

01 星网信通 招聘岗位&#xff1a;售前技术支持 职责描述&#xff1a; 1、售前技术支持&#xff1a;技术交流、产品选型报价、方案制作等工作&#xff1b; 2、招投标支持&#xff1a;项目招标参数撰写、标书质疑、应标文件技术部分撰写及资质文件归纳准备、现场讲标及技术澄清…

95、动态规划-编辑距离

递归暴力解法 递归方法的基本思想是考虑最后一个字符的操作&#xff0c;然后根据这些操作递归处理子问题。 递归函数定义&#xff1a;定义一个递归函数 minDistance(i, j)&#xff0c;表示将 word1 的前 i 个字符转换成 word2 的前 j 个字符所需的最小操作数。 递归终止条件…

命运交织的节点:分布式事务最终一致性的心跳共鸣纪实

关注微信公众号 “程序员小胖” 每日技术干货&#xff0c;第一时间送达&#xff01; 引言 在当今云计算和微服务架构大行其道的时代&#xff0c;分布式系统成为了构建高可用、高性能应用的基石。然而&#xff0c;随着系统规模的扩张&#xff0c;数据的一致性问题如同幽灵般萦…

Linux字符设备驱动(一) - 框架

字符设备是Linux三大设备之一(另外两种是块设备&#xff0c;网络设备)&#xff0c;字符设备就是字节流形式通讯的I/O设备,绝大部分设备都是字符设备&#xff0c;常见的字符设备包括鼠标、键盘、显示器、串口等等&#xff0c;当我们执行ls -l /dev的时候&#xff0c;就能看到大量…

C++容器之vector类

目录 1.vector的介绍及使用1.1vector的介绍1.2vector的使用1.2.1 vector的定义1.2.2 vector iterator 的使用1.2.3 vector 空间增长问题1.2.4 vector 增删查改1.2.5vector 迭代器失效问题1.2.6 vector 在OJ中的使用。 2.vector深度剖析及模拟实现2.1 std::vector的核心框架接口…