Imagine Flash、StyleMamba 、FlexControl、Multi-Scene T2V、TexControl

news2024/11/18 3:45:28

本文首发于公众号:机器感知

Imagine Flash、StyleMamba 、FlexControl、Multi-Scene T2V、TexControl

图片

You Only Cache Once: Decoder-Decoder Architectures for Language Models

图片

We introduce a decoder-decoder architecture, YOCO, for large language models, which only caches key-value pairs once. It consists of two components, i.e., a cross-decoder stacked upon a self-decoder. The self-decoder efficiently encodes global key-value (KV) caches that are reused by the cross-decoder via cross-attention. The overall model behaves like a decoder-only Transformer, although YOCO only caches once. The design substantially reduces GPU memory demands, yet retains global attention capability. Additionally, the computation flow enables prefilling to early exit without changing the final output, thereby significantly speeding up the prefill stage. Experimental results demonstrate that YOCO achieves favorable performance compared to Transformer in various settings of scaling up model size and number of training tokens. We also extend YOCO to 1M context length with near-perfect needle retrieval accuracy. The profiling results show that YOCO improves inference memory, p......

Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models

图片

Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images. However, this exceptional performance comes at the cost of expensive architectural design, particularly due to the attention module heavily used in leading models. Existing works mainly adopt a retraining process to enhance DM efficiency. This is computationally expensive and not very scalable. To this end, we introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens, without the need for any retraining. Specifically, for single-denoising-step pruning, we develop a novel ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify redundant tokens, and a similarity-based recovery method to restore tokens for the convolution operation. In addition, we propose a Denoising-Steps-Aware Pruning (DSAP) approach to adjust the pruning budget across different den......

Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation

图片

Diffusion models are a powerful generative framework, but come with expensive inference. Existing acceleration methods often compromise image quality or fail under complex conditioning when operating in an extremely low-step regime. In this work, we propose a novel distillation framework tailored to enable high-fidelity, diverse sample generation using just one to three steps. Our approach comprises three key components: (i) Backward Distillation, which mitigates training-inference discrepancies by calibrating the student on its own backward trajectory; (ii) Shifted Reconstruction Loss that dynamically adapts knowledge transfer based on the current time step; and (iii) Noise Correction, an inference-time technique that enhances sample quality by addressing singularities in noise prediction. Through extensive experiments, we demonstrate that our method outperforms existing competitors in quantitative metrics and human evaluations. Remarkably, it achieves performance comparable......

Reviewing Intelligent Cinematography: AI research for camera-based video production

图片

This paper offers a comprehensive review of artificial intelligence (AI) research in the context of real camera content acquisition for entertainment purposes and is aimed at both researchers and cinematographers. Considering the breadth of computer vision research and the lack of review papers tied to intelligent cinematography (IC), this review introduces a holistic view of the IC landscape while providing the technical insight for experts across across disciplines. We preface the main discussion with technical background on generative AI, object detection, automated camera calibration and 3-D content acquisition, and link explanatory articles to assist non-technical readers. The main discussion categorizes work by four production types: General Production, Virtual Production, Live Production and Aerial Production. Note that for Virtual Production we do not discuss research relating to virtual content acquisition, including work on automated video generation, like Stable Di......

StyleMamba : State Space Model for Efficient Text-driven Image Style Transfer

图片

We present StyleMamba, an efficient image style transfer framework that translates text prompts into corresponding visual styles while preserving the content integrity of the original images. Existing text-guided stylization requires hundreds of training iterations and takes a lot of computing resources. To speed up the process, we propose a conditional State Space Model for Efficient Text-driven Image Style Transfer, dubbed StyleMamba, that sequentially aligns the image features to the target text prompts. To enhance the local and global style consistency between text and image, we propose masked and second-order directional losses to optimize the stylization direction to significantly reduce the training iterations by 5 times and the inference time by 3 times. Extensive experiments and qualitative evaluation confirm the robust and superior stylization performance of our methods compared to the existing baselines.......

FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

图片

Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate imag......

TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

图片

Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models often produce single-scene video clips that depict an entity performing a particular action (e.g., `a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real-world (e.g., `a red panda climbing a tree' followed by `the red panda sleeps on the top of the tree'). To generate multi-scene videos from the pretrained T2V model, we introduce Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video with the representations of the first scene description (e.g., `a red pan......

TexControl: Sketch-Based Two-Stage Fashion Image Generation Using Diffusion Model

图片

Deep learning-based sketch-to-clothing image generation provides the initial designs and inspiration in the fashion design processes. However, clothing generation from freehand drawing is challenging due to the sparse and ambiguous information from the drawn sketches. The current generation models may have difficulty generating detailed texture information. In this work, we propose TexControl, a sketch-based fashion generation framework that uses a two-stage pipeline to generate the fashion image corresponding to the sketch input. First, we adopt ControlNet to generate the fashion image from sketch and keep the image outline stable. Then, we use an image-to-image method to optimize the detailed textures of the generated images and obtain the final results. The evaluation results show that TexControl can generate fashion images with high-quality texture as fine-grained image generation.......

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1657948.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

C++从入门到入土(二)——初步认识类与对象

目录 前言 类与对象的引入 类的定义 类的访问限定符及封装 访问限定符: 封装: 类的作用域 类的实例化 类的大小 this指针 this指针的特性 前言 各位佬们,在开始本篇文章的内容之前,我想先向大家道个歉,由于…

Linux流量分析工具 | nethogs

在应急过程中,经常会遇到应用访问缓慢,网络阻塞的情况,分析原因可能会想到存在恶意程序把带宽占满的可能。通过这样一个小工具可以快速定位异常占用带宽程序的路径、PID、占用流量大小或是排除由带宽占满导致服务器缓慢的猜想。 一、简介 Ne…

GitHub Actions 手动触发方式

目录 前言 Star Webhook 手动触发按钮 前言 GitHub Actions 是 Microsoft 收购 GitHub 后推荐的一款 CI/​CD 工具早期可能是处于初级开发阶段,它的功能非常原生,甚至没有直接提供一个手动触发按钮一般的触发方式为代码变动(push 、pull…

Linux网络-PXE高效批量网络装机(命令+截图详细版)

目录 一.部署PXE远程安装服务 1.PXE概述 1.1.PXE批量部署的优点 1.2.要搭建PXE网络体系的前提条件 2.搭建PXE远程安装服务器 2.1.修改相关网络配置(仅主机模式) 2.2.关闭防火墙(老规矩) 2.3.保证挂载上 2.4.准备好配置文…

如何使用IntelliJ IDEA SSH连接本地Linux服务器远程开发

文章目录 1. 检查Linux SSH服务2. 本地连接测试3. Linux 安装Cpolar4. 创建远程连接公网地址5. 公网远程连接测试6. 固定连接公网地址7. 固定地址连接测试 本文主要介绍如何在IDEA中设置远程连接服务器开发环境,并结合Cpolar内网穿透工具实现无公网远程连接&#xf…

【面经】网络

了解TCP/IP协议,了解常用的网络协议:study-area 一、TCP/IP协议 TCP/IP协议是一组网络通信协议,旨在实现不同计算机之间的信息传输。 1、TCP/IP四层模型: 网络接口层、网络层、传输层和应用层。 网络接口层:定义了数据的格式和…

C++ 基础 输入输出

一 C 的基本IO 系统中的预定义流对象cin和cout: 输入流:cin处理标准输入,即键盘输入; 输出流:cout处理标准输出,即屏幕输出; 流:从某种IO设备上读入或写出的字符系列 使用cin、cout这两个流对…

【springboot基础】如何搭建一个web项目?

正在学习springboot,还是小白,今天分享一下如何搭建一个简单的springboot的web项目,只要写一个类就能实现最基础的前后端交互,实现web版helloworld ,哈哈,虽然十分简陋,但也希望对你理解web运作…

车载测试系列:车载蓝牙测试(三)

HFP测试内容与测试方法 2.3 接听来电:测试手机来电时,能否从车载蓝牙设备和手机侧正常接听】拒接、通话是否正常。 1、预置条件:待测手机与车载车载设备处于连接状态 2、测试步骤: 1)用辅助测试机拨打待测手机&…

【JavaWeb】Servlet+JSP+EL表达式+JSTL标签库+Filter过滤器+Listener监听器

需要提前准备了哪些技术,接下来的课才能听懂? JavaSE(Java语言的标准版,Java提供的最基本的类库) Java的开发环境搭建Java的基础语法Java的面向对象数组常用类异常集合多线程IO流反射机制注解Annotation… MySQL&…

CUDA流和事件

CUDA通过流来实现网格级并发。 流和事件 CUDA流是一系列异步的CUDA操作,这些操作按照主机代码确定的顺序在设备上执行。流可以封装这些操作,保持操作的顺序,允许操作在流中排队,并使他们在先前的所有操作之后执行。 这些操作包…

【Linux】在Linux中执行命令ifconfig, 报错-bash:ifconfig: command not found解决方案

一、报错信息 ifconfig 报错-bash:ifconfig: command not found 同时,通过ip addr查看,也看不到IP信息 二、解决方案 找到ifcfg-ens0文件,此文件的目录在/etc/sysconfig/network-scripts目录下 命令:cd /etc/sysconfig/network…

Windows系统本地部署DrawDB数据库设计工具并实现无公网IP远程访问

文章目录 1. Windows本地部署DrawDB2. 安装Cpolar内网穿透3. 实现公网访问DrawDB4. 固定DrawDB公网地址 开发中很多时候都会使用到数据库,所以选择一个好用的数据库设计工具会让工作效率翻倍。在当今数字化时代,数据库管理是许多企业和个人项目的核心。设…

buuctf-misc题目练习二

ningen 打开题目后是一张图片,放进winhex里面 发现PK,PK是压缩包ZIP 文件的文件头,下一步是想办法进行分离 Foremost可以依据文件内的文件头和文件尾对一个文件进行分离,或者识别当前的文件是什么文件。比如拓展名被删除、被附加…

Spring - 9 ( 10000 字 Spring 入门级教程 )

一: MyBatis XML 配置文件 Mybatis 的开发有两种方式: 注解XML 我们已经学习了注解的方式, 接下来我们学习 XML 的方式 MyBatis XML 的方式需要以下两步: 配置数据库连接字符串和 MyBatis写持久层代码 1.1 配置连接字符串和 MyBatis 此步骤需要进…

【经验分享】企业网站建设,不收录的原因有哪些

今天来聊一聊我们做好网站,但是网站排名不高,各大搜索引擎不收录网站的原因: 1.网站结构问题: 公司网站的结构是搜索引擎判断网站内容的关键因素之一。如果网站结构混乱、不清晰,搜索引擎可能难以准确抓取和理解网站的…

汇编--栈和寄存器

栈 栈是一种运算受限的线性表,其限定仅在表尾进行插入和删除操作的线性表,表尾也被叫做栈顶。简单概括就是我们对于元素的操作只能够在栈顶进行,也造就了其先进后出的结构特性。 栈 这种内存空间其实本质上有两种操作:将数据放入…

新款iPad Pro引领AI新纪元:M4芯片揭幕,每秒38万亿次运算惊艳业界

新款iPad Pro搭载了强大的M4芯片,拥有每秒高达38万亿次运算的神经处理单元,AI性能超越当今的AI PC。其外观设计更加接近笔记本电脑,展示了苹果对AI技术的全面拥抱。此次发布不仅是对iPad Pro的一次重大更新,更是为下个月的WWDC发布…

00后抛弃新氧、上游抗议低价,金星又被打脸了

作为“颜值焦虑”的受益者,新氧也面临自己的焦虑。 据新氧最近发布的年报,2023年营收14.98亿元,同比增长19.1%;净利2130万元,同比扭亏为盈。但是,这仅是源于2022年公司业绩的低基数对比,并不能…

Faiss核心解析:提升推荐系统的利器【AI写作免费】

首先,这篇文章是基于笔尖AI写作进行文章创作的,喜欢的宝子,也可以去体验下,解放双手,上班直接摸鱼~ 按照惯例,先介绍下这款笔尖AI写作,宝子也可以直接下滑跳过看正文~ 笔尖Ai写作:…