Single-Head ViT;Faster Whisper;Transformer KF;Pick-and-Draw

news2025/1/12 18:39:16

本文首发于公众号:机器感知

Single-Head ViT;Faster Whisper;Transformer KF;Pick-and-Draw

SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

图片

Recently, efficient Vision Transformers have shown great performance with low latency on resource-constrained devices. Conventionally, they use 4x4 patch embeddings and a 4-stage structure at the macro level, while utilizing sophisticated attention with multi-head configuration at the micro level. This paper aims to address computational redundancy at all design levels in a memory-efficient manner. We discover that using larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance by leveraging token representations with reduced spatial redundancy from the early stages. Furthermore, our preliminary analyses suggest that attention layers in the early stages can be substituted with convolutions, and several attention heads in the latter stages are computationally redundant. To handle this, we introduce a single-head attention module that inherently prevents head redundancy and simultaneously boosts accuracy by parallelly combining global and local information. Building upon our solutions, we introduce SHViT, a Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy tradeoff. For example, on ImageNet-1k, our SHViT-S4 is 3.3x, 8.1x, and 2.4x faster than MobileViTv2 x1.0 on GPU, CPU, and iPhone12 mobile device, respectively, while being 1.3% more accurate. For object detection and instance segmentation on MS COCO using Mask-RCNN head, our model achieves performance comparable to FastViT-SA12 while exhibiting 3.8x and 2.0x lower backbone latency on GPU and mobile device, respectively.

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on  E-Branchformer

图片

Recent studies have advocated for fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data and open-source toolkits. With the aim of reproducing Whisper, the previous OWSM v1 through v3 models were still based on Transformer, which might lead to inferior performance compared to other state-of-the-art speech encoders. In this work, we aim to improve the performance and efficiency of OWSM without extra training data. We present E-Branchformer based OWSM v3.1 models at two scales, i.e., 100M and 1B. The 1B model is the largest E-Branchformer based speech model that has been made publicly available. It outperforms the previous OWSM v3 in a vast majority of evaluation benchmarks, while demonstrating up to 25% faster inference speed.

OptiState: State Estimation of Legged Robots using Gated Networks with Transformer-based Vision and Kalman Filtering

图片

State estimation for legged robots is challenging due to their highly dynamic motion and limitations imposed by sensor accuracy. By integrating Kalman filtering, optimization, and learning-based modalities, we propose a hybrid solution that combines proprioception and exteroceptive information for estimating the state of the robot's trunk. Leveraging joint encoder and IMU measurements, our Kalman filter is enhanced through a single-rigid body model that incorporates ground reaction force control outputs from convex Model Predictive Control optimization. The estimation is further refined through Gated Recurrent Units, which also considers semantic insights and robot height from a Vision Transformer autoencoder applied on depth images. This framework not only furnishes accurate robot state estimates, including uncertainty evaluations, but can minimize the nonlinear errors that arise from sensor measurements and model simplifications through learning. The proposed methodology is evaluated in hardware using a quadruped robot on various terrains, yielding a 65% improvement on the Root Mean Squared Error compared to our VIO SLAM baseline. Code example: https://github.com/AlexS28/OptiState

Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

图片

Diffusion-based text-to-image personalization have achieved great success in generating subjects specified by users among various contexts. Even though, existing finetuning-based methods still suffer from model overfitting, which greatly harms the generative diversity, especially when given subject images are few. To this end, we propose Pick-and-Draw, a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods. Our approach consists of two components: appearance picking guidance and layout drawing guidance. As for the former, we construct an appearance palette with visual features from the reference image, where we pick local patterns for generating the specified subject with consistent identity. As for layout drawing, we outline the subject's contour by referring to a generative template from the vanilla diffusion model, and inherit the strong image prior to synthesize diverse contexts according to different text conditions. The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.

BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion

图片

Witnessing the evolution of text-to-image diffusion models, significant strides have been made in text-to-3D generation. Currently, two primary paradigms dominate the field of text-to-3D: the feed-forward generation solutions, capable of swiftly producing 3D assets but often yielding coarse results, and the Score Distillation Sampling (SDS) based solutions, known for generating high-fidelity 3D assets albeit at a slower pace. The synergistic integration of these methods holds substantial promise for advancing 3D generation techniques. In this paper, we present BoostDream, a highly efficient plug-and-play 3D refining method designed to transform coarse 3D assets into high-quality. The BoostDream framework comprises three distinct processes: (1) We introduce 3D model distillation that fits differentiable representations from the 3D assets obtained through feed-forward generation. (2) A novel multi-view SDS loss is designed, which utilizes a multi-view aware 2D diffusion model to refine the 3D assets. (3) We propose to use prompt and multi-view consistent normal maps as guidance in refinement.Our extensive experiment is conducted on different differentiable 3D representations, revealing that BoostDream excels in generating high-quality 3D assets rapidly, overcoming the Janus problem compared to conventional SDS-based methods.

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1423082.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

Docker多节点部署Minio分布式文件系统并测试

文章目录 一、前提准备二、文件配置1. .env2. env/minio.env3. docker-compose-minio.yml 三、测试四、Java测试1. 引入依赖2. 增删改 一、前提准备 准备如下文件夹和文件 ./ ├── docker-compose-minio.yml ├── .env ├── env │ ├── minio.env ├── minio │…

windows安装oracle之后怎么连接使用

目录 1.打开SQl Developer 2.选择JDK 3.登录 4.创建表空间,用户 安装oracle的详细教程 WINDOWS安装Oracle11.2.0.4-CSDN博客 1.打开SQl Developer 找到 SQl Developer 2.选择JDK 根据你安装的oracle版本,因为我的oracle是安装的32位的,所以这里jdk也要选择32位 选择到ja…

熟悉MATLAB 环境

一、问题描述 熟悉MATLAB 环境。 二、实验目的 了解Matlab 的主要功能,熟悉Matlab 命令窗口及文件管理,Matlab 帮助系统。掌握命令行的输入及编辑,用户目录及搜索路径的配置。了解Matlab 数据的特点,熟悉Matlab 变量的命名规则&a…

【JAVA】Long类型返回到前端,精度丢失

一. 问题阐述 20位long类型的数字,从后端接口返回到前端后【四舍五入】 MYSQL端 (1)bigint (20) (2)具体某一条数据 JAVA端 (1)实体类 (2)服务类 (3&…

DevExpress WinForms中文教程 - 如何创建可访问的WinForms应用?(一)

为用户创建易访问的Windows Forms应用程序不仅是最佳实践的体现,还是对包容性和以用户为中心的设计承诺。在应用程序开发生命周期的早期考虑与可访问性相关的需求可以节省长期运行的时间(因为它将决定设计决策和代码实现)。 一个可访问的WinForms应用程序提供了各种…

Unity之第一人称角色控制

目录 第一人称角色控制 😴1、准备工作 📺2、鼠标控制摄像机视角 🎮3、角色控制 😃4.杂谈 第一人称角色控制 专栏Unity之动画和角色控制-CSDN博客的这一篇也有讲到角色控制器,是第三人称视角的,以小编…

The Sandbox 新赛事开启|参加VoxEdit 战士装备设计比赛吧!

来参加VoxEdit 迷你竞赛吧! 按此下载 VoxEdit 即可开始。 https://www.sandbox.game/en/create/vox-edit/ 主题:设计以战士为主题的装备。你可以选择任何装备模板。你也可以选择制作动画,以更精彩地呈现你的创意。 你可以创建单件装备或整套…

【Transformer】Attention Is All You Need

Abstract 主要的序列转导模型是基于复杂的循环或卷积神经网络,包括一个编码器和一个解码器。表现最好的模型还通过注意机制连接编码器和解码器。我们提出了一个新的简单的网络架构,变压器,完全基于注意力机制,完全摒弃递归和卷积…

Atcoder ABC338 A-D题解

又是一篇姗姗来迟的atcoder题解。 Link:ABC338 Problem A: 妥妥的签到题。 #include <bits/stdc.h> using namespace std; int main(){string str;cin>>str;if(int(str[0])<65 || int(str[0])>90){cout<<"NO"<<endl;return 0;}for…

力扣hot100 买卖股票的最佳时机 贪心 经典题

Problem: 121. 买卖股票的最佳时机 文章目录 思路复杂度Code 思路 假设今天卖出&#xff0c;那怎么样收益最大呢&#xff1f;之前买入价是最低的 复杂度 ⏰ 时间复杂度: &#xff1a; O ( n ) O(n) O(n) &#x1f30e; 空间复杂度: O ( 1 ) O(1) O(1) Code class Solut…

微信小程序(二十七)列表渲染改变量名

注释很详细&#xff0c;直接上代码 上一篇 新增内容&#xff1a; 1.改变默认循环单元item变量名 2.改变默认循环下标index变量名 基础模板有问题可以先看上一篇 源码&#xff1a; index.wxml <view class"students"><view class"item"><te…

快速排序|超详细讲解|入门深入学习排序算法

快速排序介绍 快速排序(Quick Sort)使用分治法策略。 它的基本思想是&#xff1a;选择一个基准数&#xff0c;通过一趟排序将要排序的数据分割成独立的两部分&#xff1b;其中一部分的所有数据都比另外一部分的所有数据都要小。然后&#xff0c;再按此方法对这两部分数据分别进…

Unity 常见的图像压缩格式优缺点

在Unity中&#xff0c;将图像压缩至更小的大小&#xff0c;既可以加快加载速度&#xff0c;也可以减少内存的占用。根据不同的目标平台&#xff0c;Unity提供了以下几种常见的图像压缩格式&#xff1a; 1. RGBA Compressed: 是一种通过压缩的方式来存储RGBA&#xff08;红色、…

Flask框架开发学习笔记《6》前后端不分离基础框架

Flask框架开发学习笔记《6》前后端不分离基础框架 Flask是使用python的后端&#xff0c;由于小程序需要后端开发&#xff0c;遂学习一下后端开发。 主要包含如下文件&#xff1a; static 目录中存储了图片templates 目录中存储了 html 文件utils.py 包含了 log 函数server.p…

开发工具git分支冲突解决

在团队协作的软件开发过程中&#xff0c;Git是一款广泛使用的版本控制系统。然而&#xff0c;当多个开发者同时修改同一文件或代码段时&#xff0c;就会产生分支冲突。解决这些冲突需要仔细的协调和技术知识。本篇博客将介绍Git分支冲突的解决方法&#xff0c;以及开发工具和最…

Android 系统启动过程

当按下电源时&#xff0c;引导芯片代码会从预定义的地方(固化在ROM) 开始执行,加载引导程序BootLoader到RAM,然后执行。 启动内核的第一个进程idle(pid0),idle进程是Linux系统第一个进程&#xff0c;是init进程和kthreadd进程的父进程。 idle的主要作用 初始化进程以及内存管…

Modern C++ std::get<n>(tuple)的原理

1. 前言 前面我们讲过std::tuple的实现原理&#xff0c;但没有讲如何取出数据&#xff0c;本节着重讲讲这点。本节与之前的blog有较大关联&#xff0c;如果您没看&#xff0c;这里有链接&#xff0c;链接已按由浅入深排好序&#xff0c;您可以按顺序阅读。如果时间少可以直接看…

【新书推荐】4.2节 字符编码规则

本节内容&#xff1a;字符编码规则。 ■字符编码规则&#xff1a;ASCII码、ANSI字符集、Unicode字符集。 ■变形国标码&#xff1a;国标码是16位编码&#xff0c;高8位表示汉字符的区号&#xff0c;低8位表示汉字符的位号。 4.2.1 字符编码规则 计算机只能存储二进制数0和1&a…

API管理协作工具:Apipost

相信无论是前端&#xff0c;还是后端的测试和开发人员&#xff0c;都遇到过这样的困难。不同工具之间数据一致性非常困难、低效。多个系统之间数据不一致&#xff0c;导致协作低效、频繁出问题&#xff0c;开发测试人员痛苦不堪。 API管理的难点在哪&#xff1f; 开发人员在 …

【WPF.NET开发】优化性能:图形呈现层

本文内容 图形硬件呈现层定义其他资源 呈现层为运行 WPF 应用程序的设备定义图形硬件功能和性能级别。 1、图形硬件 对呈现层级别影响最大的图形硬件功能包括&#xff1a; 视频 RAM - 图形硬件中的视频内存量决定了可用于合成图形的缓冲区大小和数量。 像素着色器 - 像素着…