arXiv 2024.8.6 | LLaVA-OneVision: Easy Visual Task Transfer

Comment: Project Homepage: https://llava-vl.github.io/blog/2024-08-05-llava-onevision/

Paper title: LLaVA-OneVision: Easy Visual Task Transfer

Paper URL: https://arxiv.org/abs/2408.03326

GitHub: https://llava-vl.github.io/blog/llava-onevision

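I. Model Architecture

Objective:

Here Xv denotes the visual features, and Xq<i and Xa<i are the instruction and answer tokens from the earlier part of the dialogue.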
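Using the symbols just defined, the objective can be written as the usual autoregressive likelihood of the answer tokens. This is a reconstruction from the definitions above, so the names L (the length of the answer sequence) and x_i (the i-th predicted token) are assumptions rather than something stated in these notes:

```latex
p(\mathbf{X}_a \mid \mathbf{X}_v, \mathbf{X}_q)
  = \prod_{i=1}^{L} p\left(x_i \mid \mathbf{X}_v, \mathbf{X}_{q,<i}, \mathbf{X}_{a,<i}\right)
```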

Visual Representations

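For high-resolution images, the model uses Higher AnyRes with Bilinear Interpolation. Higher resolution improves performance but also sharply increases the number of visual tokens, so the authors make a trade-off: "To strike a balance of performance and cost, we observe that the scaling of resolution is more effective than that of token numbers, and recommend an AnyRes strategy with pooling."

AnyRes strategy: the image is split into a*b crops of equal size. If each crop yields T tokens, the total number of visual tokens is L = (a*b+1)*T, where the +1 accounts for the whole original image resized into a single overview. To limit the token count, a threshold t is set; when L exceeds it, the tokens are reduced with bilinear interpolation.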
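The token arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `anyres_token_budget` is hypothetical, and the reduction rule (each view is cut to the threshold divided by the number of views, rounded down) is an assumption about how the per-crop count is chosen once the threshold is exceeded.

```python
def anyres_token_budget(a: int, b: int, tokens_per_crop: int, threshold: int) -> tuple[int, int]:
    """Return (tokens per view after reduction, total visual tokens).

    a, b            -- the crop grid (a*b crops of equal size)
    tokens_per_crop -- T, the tokens produced for one crop
    threshold       -- t, the maximum number of visual tokens allowed
    """
    num_views = a * b + 1              # a*b crops plus the resized full image
    total = num_views * tokens_per_crop  # L = (a*b + 1) * T
    if total <= threshold:
        # Under the threshold: keep every token.
        return tokens_per_crop, total
    # Over the threshold: shrink each view equally (assumed rule, rounded down).
    reduced = threshold // num_views
    return reduced, num_views * reduced
```

For example, with a hypothetical 3x3 grid, 729 tokens per crop, and a threshold of 5000, the ten views would need 7290 tokens, so each view is cut to 500 tokens. The actual reduction would be done by bilinearly interpolating the feature map, which this sketch does not perform.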

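Strategies for different scenarios:

Single image: a large spatial configuration (a, b) is used, so a high-resolution image is represented by a long token sequence. This makes the transfer of capabilities from image to video understanding smoother, because a video is a sequence of frames and also requires long sequences.

Multi-image: only the base resolution is considered, which avoids cropping a high-resolution image into multiple crops.

Video: each frame is resized to the base resolution, and bilinear interpolation reduces the number of tokens per frame, so that more frames can be included.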
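To make the video trade-off concrete, here is a minimal pure-Python sketch of bilinear downsampling on a single-channel token grid, followed by a frame count under a fixed token budget. Both function names are hypothetical, the "align corners" sampling convention is an assumption, and a real model would apply a library interpolation routine to multi-channel feature tensors instead.

```python
def bilinear_resize(grid, out_h, out_w):
    """Resize a 2D list of floats to (out_h, out_w) by bilinear interpolation.

    Uses the 'align corners' convention, so the four corner values are kept.
    """
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row index to a fractional source row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            # Map the output column index to a fractional source column.
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend horizontally on the two neighbouring rows, then vertically.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def frames_within_budget(token_budget: int, tokens_per_frame: int) -> int:
    """Number of whole frames that fit in a fixed visual-token budget."""
    return token_budget // tokens_per_frame
```

With illustrative numbers only: shrinking a 27x27 grid (729 tokens) to 14x14 (196 tokens) raises the number of frames that fit in an 8000-token budget from 10 to 40.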

II. Data

1. High-Quality Knowledge

Prioritizing data quality maximizes compute efficiency. Three categories of data are considered:

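"Re-Captioned Detailed Description Data": LLaVA-NeXT-34B is used to generate new captions for COCO118K, BLIP558K, and CC3M, 3.5M samples in total.

"Document / OCR Data": the text-reading subset of the UReader dataset (100K samples) is used together with SynDOG EN/CN, 1.1M samples in total.

"Chinese and Language Data": using the original ShareGPT4V images, 92K image caption samples are generated with the GPT-4V API, and 143K samples are collected from the Evo-Instruct dataset.

Almost all (99.8%) of the high-quality knowledge data is synthetic.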

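2. Visual Instruction Tuning Data

Data Collection and Curation. The data are categorized along three dimensions: vision, instruction, and response.

Vision input: single-image, multi-image, video

Language Instruction: general QA, general OCR, document/chart/screen, math reasoning, language.

Language Response: divided into free-form and fixed-form. Free-form responses are generated with Gemini or GPT-4V/4o (the original answers are kept as well; how are they organized?), while fixed-form responses come from collected datasets (with manual corrections). The instruction data are split into two groups: single-image scenarios and all visual scenarios.

Single-image data:

All-visual-scenario data: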

III. Training Strategy

Three stages: "Language-Image Alignment", "High-Quality Knowledge Learning", and "Visual Instruction Tuning".

IV. Experiments

1. Single-Image Benchmarks

Chart, Diagram, and Document Understanding: AI2D, ChartQA, DocVQA, and InfoVQA

Perception and Multi-discipline Reasoning: MME, MMBench, MMVet, MathVerse, MathVista, and MMMU

Real-world Understanding and Visual Chat: RealworldQA, Vibe-Eval, MM-LiveBench, and LLaVA-Bench-Wilder

2. Multi-Image Benchmarks

Evaluation covers both in-domain and out-of-domain settings.

V. Emerging Capabilities with Task Transfer

1. Joint understanding of diagram and chart (Transfer from single-image to multi-image)

2. GUI for multi-modal agent (Transfer from single-image and multi-image)

3. Set-of-mark Prompting (Transfer from single-image task composition)

set-of-marks (SoM) reasoning

4. Image-to-Video Editing Instruction (Transfer from single-image and video)

5. Video-to-Video Difference (Transfer from multi-image and video)

6. Multi-camera Video Understanding in Self-driving (Transfer from single-image and multi-image to video)

7. Composed Sub-video Understanding (Transfer from multi-image to video)

8. Visual prompting in video (Task transfer from single-image to video)

9. Visual Referring in Image in Video Understanding

The LLaVA series remains simple and efficient.
