Task-Oriented Diffusion Model Compression; Image Generation from Any Modality; Dance-to-Music Generation; Long-Context Alignment for LLMs; LLM KV Cache Quantization; Motion-Field-Assisted Image Editing with Diffusion Models

This article was first published on the WeChat public account 机器感知 (Machine Perception).

Task-Oriented Diffusion Model Compression

As recent advancements in large-scale Text-to-Image (T2I) diffusion models have yielded remarkably high-quality image generation, diverse downstream Image-to-Image (I2I) applications have emerged. Despite the impressive results achieved by these I2I models, their practical utility is hampered by their large model size and the computational burden of the iterative denoising process. In this paper, we explore the compression potential of these I2I models in a task-oriented manner and introduce a novel method for reducing both model size and the number of timesteps. Through extensive experiments, we observe key insights and use our empirical knowledge to develop practical solutions that aim for near-optimal results with minimal exploration costs. We validate the effectiveness of our method by applying it to InstructPix2Pix for image editing and StableSR for image restoration. Our approach achieves satisfactory output quality with a 39.2% and 56.4% reduction in model footprint and an 81.4% and 68.7% decrease in latency for InstructPix2Pix and StableSR, respectively.
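
The abstract does not give the search procedure, but the trade-off it describes, model footprint versus denoising steps, can be pictured as a joint scan over the two axes. Below is a minimal Python sketch under that assumption; `run_fn` (the compressed pipeline) and `metric_fn` (the task metric) are hypothetical caller-supplied stand-ins, not the paper's API.

```python
import time

def cheapest_config(run_fn, metric_fn, width_ratios, step_counts, quality_floor):
    """Scan (width_ratio, num_steps) pairs; return the lowest-latency
    configuration whose task metric stays above quality_floor."""
    best = None
    for w in width_ratios:                 # e.g. [1.0, 0.75, 0.5]
        for s in step_counts:              # e.g. [50, 20, 10, 5]
            t0 = time.perf_counter()
            output = run_fn(width_ratio=w, num_steps=s)  # hypothetical pipeline call
            latency = time.perf_counter() - t0
            if metric_fn(output) >= quality_floor and (best is None or latency < best[0]):
                best = (latency, w, s)
    return best  # (latency, width_ratio, num_steps), or None if nothing qualifies
```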

Spatial-and-Frequency-aware Restoration method for Images based on Diffusion Models

Diffusion models have recently emerged as a promising framework for Image Restoration (IR), owing to their ability to produce high-quality reconstructions and their compatibility with established methods. Existing methods for solving noisy inverse problems in IR consider only pixel-wise data fidelity. In this paper, we propose SaFaRI, a spatial-and-frequency-aware diffusion model for IR with Gaussian noise. Our model encourages images to preserve data fidelity in both the spatial and frequency domains, resulting in enhanced reconstruction quality. We comprehensively evaluate the performance of our model on a variety of noisy inverse problems, including inpainting, denoising, and super-resolution. Our thorough evaluation demonstrates that SaFaRI achieves state-of-the-art performance on both the ImageNet and FFHQ datasets, outperforming existing zero-shot IR methods in terms of LPIPS and FID metrics.
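
The paper's core idea, enforcing data fidelity in both the spatial and frequency domains, can be illustrated with a short PyTorch sketch. This is an illustrative formulation, not the authors' exact objective; `forward_op` stands in for the known measurement operator of the inverse problem.

```python
import torch

def data_fidelity(x, y, forward_op, lam_freq=1.0):
    """x: current image estimate; y: noisy measurement;
    forward_op: the known degradation operator A (e.g. masking, blurring)."""
    residual = forward_op(x) - y
    spatial = residual.pow(2).mean()                     # pixel-wise L2 fidelity
    freq = torch.fft.fft2(residual).abs().pow(2).mean()  # Fourier-domain L2 fidelity
    return spatial + lam_freq * freq                     # combined guidance term
```

By Parseval's theorem the two terms agree up to a constant for a plain L2 residual; a method like SaFaRI presumably weights or filters the frequency term differently, which is exactly what `lam_freq` (or a per-frequency weight) would control.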

Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation

The multifaceted nature of human perception and comprehension indicates that, when we think, our body can naturally take in any combination of senses, a.k.a. modalities, and form a beautiful picture in our brain. For example, when we see a cattery and simultaneously perceive the cat's purring sound, our brain can construct a picture of a cat in the cattery. Intuitively, generative AI models should hold the versatility of humans and be capable of generating images from any combination of modalities efficiently and collaboratively. This paper presents ImgAny, a novel end-to-end multi-modal generative model that can mimic human reasoning and generate high-quality images. Our method is the first attempt capable of efficiently and flexibly taking any combination of seven modalities, ranging from language and audio to vision modalities, including image, point cloud, thermal, depth, and event data. Our key idea is inspired by human-level cognitive processes and involves the integration and harmonization of multiple input modalities at both the entity and attribute levels without specific tuning across modalities. Accordingly, our method brings two novel training-free technical branches: 1) the Entity Fusion Branch ensures the coherence between inputs and outputs, extracting entity features from the multi-modal representations powered by our specially constructed entity knowledge graph; 2) the Attribute Fusion Branch adeptly preserves and processes the attributes, efficiently amalgamating distinct attributes from diverse input modalities via our proposed attribute knowledge graph. Lastly, the entity and attribute features are adaptively fused as the conditional inputs to the pre-trained Stable Diffusion model for image generation. Extensive experiments under diverse modality combinations demonstrate its exceptional capability for visual content creation.
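
As a rough illustration of the training-free, adaptive fusion the abstract describes, the sketch below combines entity and attribute embeddings in the text-embedding space before conditioning. The fusion rule and all names here are assumptions for illustration only; in the diffusers library, such embeddings could then be passed to `StableDiffusionPipeline` via its `prompt_embeds` argument.

```python
import torch

def fuse_entity_attribute(entity_emb, attr_emb, alpha=0.5):
    """entity_emb, attr_emb: (batch, seq_len, dim) embeddings in the text space
    of the frozen text encoder. Training-free convex combination, rescaled so
    the result stays on the scale the diffusion model expects."""
    fused = alpha * entity_emb + (1.0 - alpha) * attr_emb
    scale = entity_emb.norm(dim=-1, keepdim=True) / fused.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return fused * scale  # (batch, seq_len, dim), ready to use as prompt embeddings
```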

Dance-to-Music Generation with Encoder-based Textual Inversion of Diffusion Models

The harmonious integration of music with dance movements is pivotal in vividly conveying the artistic essence of dance. This alignment also significantly elevates the immersive quality of gaming experiences and animation productions. While there has been remarkable advancement in creating high-fidelity music from textual descriptions, current methodologies mainly concentrate on modulating overarching characteristics such as genre and emotional tone. They often overlook the nuanced management of temporal rhythm, which is indispensable in crafting music for dance, since it intricately aligns the musical beats with the dancers' movements. Recognizing this gap, we propose an encoder-based textual inversion technique for augmenting text-to-music models with visual control, facilitating personalized music generation. Specifically, we develop dual-path rhythm-genre inversion to effectively integrate the rhythm and genre of a dance motion sequence into the textual space of a text-to-music model. Contrary to the classical textual inversion method, which directly updates text embeddings to reconstruct a single target object, our approach utilizes separate rhythm and genre encoders to obtain text embeddings for two pseudo-words, adapting to the varying rhythms and genres. To achieve a more accurate evaluation, we propose improved evaluation metrics for rhythm alignment. We demonstrate that our approach outperforms state-of-the-art methods across multiple evaluation metrics. Furthermore, our method seamlessly adapts to in-the-wild data and effectively integrates with the inherent text-guided generation capability of the pre-trained model. Samples are available at https://youtu.be/D7XDwtH1YwE.
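
A hedged sketch of the dual-path idea: two small encoders map a motion sequence to embeddings for two pseudo-words, which would replace the placeholder-token embeddings in the frozen text-to-music model's prompt. The architecture and dimensions below are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class PseudoWordEncoder(nn.Module):
    """Maps a motion sequence to one text-space embedding (one pseudo-word)."""
    def __init__(self, motion_dim, text_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(motion_dim, 256), nn.GELU(),
                                 nn.Linear(256, text_dim))

    def forward(self, motion):               # motion: (batch, frames, motion_dim)
        return self.net(motion.mean(dim=1))  # temporal pooling -> (batch, text_dim)

rhythm_enc = PseudoWordEncoder(motion_dim=72, text_dim=768)  # dims are assumptions
genre_enc = PseudoWordEncoder(motion_dim=72, text_dim=768)

motion = torch.randn(2, 120, 72)             # dummy batch of 120-frame motions
rhythm_tok, genre_tok = rhythm_enc(motion), genre_enc(motion)
# These two vectors would stand in for the embeddings of placeholder tokens in a
# prompt such as "a <genre> track with a <rhythm> beat" before the frozen
# text-to-music model decodes audio.
```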

Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation

In recent years, weakly supervised semantic segmentation using image-level labels as supervision has received significant attention in the field of computer vision. Most existing methods have addressed the challenges arising from the lack of spatial information in these labels by focusing on facilitating supervised learning through the generation of pseudo-labels from class activation maps (CAMs). Due to the localized pattern detection of Convolutional Neural Networks (CNNs), CAMs often emphasize only the most discriminative parts of an object, making it challenging to accurately distinguish foreground objects from each other and from the background. Recent studies have shown that Vision Transformer (ViT) features, due to their global view, are more effective at capturing the scene layout than CNNs. However, the use of hierarchical ViTs has not been extensively explored in this field. This work explores the use of the Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs by bringing local and global views together. SWTformer-V1 generates class probabilities and CAMs using only the patch tokens as features. SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information and utilizes a background-aware mechanism to generate more accurate localization maps with improved cross-object discrimination. In experiments on the PascalVOC 2012 dataset, SWTformer-V1 achieves localization accuracy 0.98% mAP higher than state-of-the-art models. Relying only on the classification network, it also generates initial localization maps that are on average 0.82% mIoU higher than those of other methods. SWTformer-V2 further improves the accuracy of the generated seed CAMs by 5.32% mIoU, further demonstrating the effectiveness of the local-to-global view provided by the Swin Transformer.
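
To make the CAM-generation step concrete, here is a minimal PyTorch sketch of deriving seed CAMs from transformer patch tokens with a 1x1-convolution classification head; the shapes and pooling choice are illustrative assumptions rather than the SWTformer implementation.

```python
import torch
import torch.nn as nn

num_classes, dim, h, w = 20, 768, 14, 14            # PascalVOC classes; assumed dims
tokens = torch.randn(2, h * w, dim)                 # (batch, num_patches, dim)

classifier = nn.Conv2d(dim, num_classes, kernel_size=1)
feat = tokens.transpose(1, 2).reshape(2, dim, h, w) # fold tokens back into a 2D map
cams = classifier(feat)                             # (batch, classes, h, w) score maps
logits = cams.flatten(2).mean(-1)                   # global average pool -> image-level logits
seed_cams = torch.relu(cams)                        # kept spatial; later thresholded
                                                    # into segmentation pseudo-labels
```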

LongAlign: A Recipe for Long Context Alignment of Large Language Models

Extending large language models to effectively handle long contexts requires instruction fine-tuning on input sequences of similar length. To address this, we present LongAlign -- a recipe covering instruction data, training, and evaluation for long-context alignment. First, we construct a long instruction-following dataset using Self-Instruct. To ensure data diversity, it covers a broad range of tasks from various long-context sources. Second, we adopt packing and sorted-batching strategies to speed up supervised fine-tuning on data with varied length distributions. Additionally, we develop a loss-weighting method to balance the contribution to the loss across different sequences during packing training. Third, we introduce the LongBench-Chat benchmark for evaluating instruction-following capabilities on queries of 10k-100k tokens in length. Experiments show that LongAlign outperforms existing recipes for LLMs on long-context tasks by up to 30%, while also maintaining their proficiency in handling short, generic tasks. The code, data, and long-aligned models are open-sourced at https://github.com/THUDM/LongAlign.
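
The loss-weighting idea can be made concrete: when sequences of very different lengths are packed into one batch, a naive token-mean loss lets long sequences dominate. A minimal sketch, assuming PyTorch and per-token cross-entropy losses as input (names are illustrative, not the LongAlign code):

```python
import torch

def weighted_packed_loss(token_losses, seq_ids):
    """token_losses: (total_tokens,) per-token cross-entropy over a packed batch;
    seq_ids: (total_tokens,) integer id of the sequence each token belongs to.
    Each sequence contributes equally, regardless of its length."""
    weights = torch.zeros_like(token_losses)
    for sid in seq_ids.unique():
        mask = seq_ids == sid
        weights[mask] = 1.0 / mask.sum()        # tokens of one sequence sum to 1
    return (token_losses * weights).sum() / seq_ids.unique().numel()

# Example: a 3-token sequence packed with a 1-token sequence.
losses = torch.tensor([1.0, 1.0, 1.0, 4.0])
print(weighted_packed_loss(losses, torch.tensor([0, 0, 0, 1])))  # (1 + 4) / 2 = 2.5
```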

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

LLMs are seeing growing use in applications such as document analysis and summarization that require large context windows, and with these large context windows, KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately at ultra-low precisions, such as sub-4-bit. In this work, we present KVQuant, which addresses this problem by incorporating novel methods for quantizing cached KV activations, including: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges; and (v) Q-Norm, where we normalize quantization centroids in order to mitigate distribution shift, providing additional benefits for 2-bit quantization. By applying our method to the LLaMA, LLaMA-2, and Mistral models, we achieve <0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving the LLaMA-7B model with a context length of up to 1 million tokens on a single A100-80GB GPU and up to 10 million on an 8-GPU system.
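
Component (i) is straightforward to sketch: quantize the key cache along the channel dimension, with one scale per channel, instead of per token. The code below is a simplified round-to-nearest illustration, not the authors' kernels, and it omits the pre-RoPE, non-uniform, and dense-and-sparse components.

```python
import torch

def quantize_keys_per_channel(keys, n_bits=3):
    """keys: (num_tokens, num_channels) cached Key activations.
    Returns int8 codes plus one scale per channel (symmetric quantization)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = keys.abs().amax(dim=0, keepdim=True).clamp_min(1e-8) / qmax  # per channel
    codes = torch.clamp(torch.round(keys / scale), -qmax - 1, qmax).to(torch.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.float() * scale

keys = torch.randn(1024, 128)                      # dummy cache: 1024 tokens, 128 channels
codes, scale = quantize_keys_per_channel(keys)
err = (dequantize(codes, scale) - keys).abs().mean()  # mean reconstruction error
```

The motivation is that Key activations tend to have a few outlier channels; a per-channel scale isolates each outlier channel in its own quantization range rather than letting it stretch the range for a whole token vector.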

Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

Diffusion models are capable of generating impressive images conditioned on text descriptions, and extensions of these models allow users to edit images at a relatively coarse scale. However, precisely editing the layout, position, pose, and shape of objects in images with diffusion models remains difficult. To this end, we propose motion guidance, a zero-shot technique that allows a user to specify dense, complex motion fields indicating where each pixel in an image should move. Motion guidance works by steering the diffusion sampling process with gradients through an off-the-shelf optical flow network. Specifically, we design a guidance loss that encourages the sample to have the desired motion, as estimated by the flow network, while also being visually similar to the source image. By simultaneously sampling from a diffusion model and guiding the sample toward low guidance loss, we obtain a motion-edited image. We demonstrate that our technique works on complex motions and produces high-quality edits of real and generated images.
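
The guidance loss lends itself to a compact sketch. Below, `flow_net` is a stand-in for an off-the-shelf differentiable optical-flow estimator, and the L1 penalties and weighting are illustrative assumptions rather than the paper's exact objective.

```python
import torch

def guidance_loss(sample, source, target_flow, flow_net, lam=1.0):
    """sample: current denoised estimate; source: original image;
    target_flow: (2, H, W) user-specified motion field;
    flow_net: differentiable optical-flow model (e.g. a RAFT-style network)."""
    flow = flow_net(source, sample)                  # estimated motion, kept differentiable
    motion_term = (flow - target_flow).abs().mean()  # match the desired motion field
    color_term = (sample - source).abs().mean()      # stay visually close to the source
    return motion_term + lam * color_term

# During sampling, the gradient of this loss with respect to the current sample
# would be added to the diffusion model's score at each step, steering the
# denoising trajectory toward a motion-edited image.
```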

