Qwen-VL全文翻译(from GPT-4o)


目录

  • Abstract
  • 1 Introduction
  • 2 Methodology
    • 2.1 Model Architecture
    • 2.2 Inputs and Outputs
  • 3 Training
    • 3.1 Pre-training
    • 3.2 Multi-task Pre-training
    • 3.3 Supervised Fine-tuning
  • 4 Evaluation
    • 4.1 Image Caption and General Visual Question Answering
    • 4.2 Text-oriented Visual Question Answering
    • 4.3 Refer Expression Comprehension
    • 4.4 Few-shot Learning on Vision-Language Tasks
    • 4.5 Instruction Following in Real-world User Behavior
  • 5 Related Work
  • 6 Conclusion and Future Work
  • A Dataset details
  • B Data Format Details of Training
  • C Hyperparameters
  • D Summary of the evaluation benchmarks
  • E Additional experimental details

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai Shuai Bai Shusheng Yang Shijie Wang Sinan Tan Peng Wang Junyang Lin Chang Zhou Jingren Zhou

Alibaba Group
Code & Demo & Models: https://github.com/QwenLM/Qwen-VL

Abstract

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. All models are public to facilitate future research.

在这项工作中,我们介绍了Qwen-VL系列,这是一组旨在感知和理解文本与图像的大规模视觉语言模型(LVLMs)。我们以Qwen-LM为基础,通过精心设计的(i)视觉接收器、(ii)输入输出接口、(iii)三阶段训练流程,以及(iv)经过清洗的多语言多模态语料库,为其赋予视觉能力。除了传统的图像描述和问答之外,我们还通过对齐图像-描述-边界框三元组,实现了Qwen-VL的视觉定位和文本阅读能力。所得到的模型,包括Qwen-VL和Qwen-VL-Chat,在广泛的以视觉为中心的基准测试(如图像描述、问答、视觉定位)和不同设置(如零样本、少样本)下,为同等规模的通用模型创下了新纪录。此外,在真实世界的对话基准测试中,我们的指令微调模型Qwen-VL-Chat相比现有的视觉语言聊天机器人也表现出优势。所有模型均已公开,以促进未来的研究。

Figure 1: Qwen-VL achieves state-of-the-art performance on a broad range of tasks compared with other generalist models.

图1:与其他通用模型相比,Qwen-VL在广泛的任务上实现了最先进的性能。

1 Introduction

Recently, Large Language Models (LLMs) (Brown et al., 2020; OpenAI, 2023; Anil et al., 2023; Gao et al., 2023; Qwen, 2023) have attracted wide attention due to their powerful capabilities in text generation and comprehension. These models can be further aligned with user intent through instruction fine-tuning, showcasing strong interactive capabilities and the potential to enhance productivity as intelligent assistants. However, native large language models live only in the pure-text world, lacking the ability to handle other common modalities (such as images, speech, and videos), which greatly restricts their application scope. Motivated by this, a group of Large Vision Language Models (LVLMs) (Alayrac et al., 2022; Chen et al., 2022; Li et al., 2023c; Dai et al., 2023; Huang et al., 2023; Peng et al., 2023; Zhu et al., 2023; Liu et al., 2023; Ye et al., 2023b,a; Chen et al., 2023a; Li et al., 2023a; Zhang et al., 2023; Sun et al., 2023; OpenAI, 2023) have been developed to enhance large language models with the ability to perceive and understand visual signals. These large-scale vision-language models demonstrate promising potential in solving real-world vision-centric problems.

最近,大型语言模型(LLMs)(Brown等,2020;OpenAI,2023;Anil等,2023;Gao等,2023;Qwen,2023)因其强大的文本生成和理解能力而备受关注。这些模型可以通过微调指令进一步与用户意图对齐,展现出强大的交互能力,并有潜力作为智能助手提高生产力。然而,原生的大型语言模型仅存在于纯文本世界中,缺乏处理其他常见模态(如图像、语音和视频)的能力,极大地限制了它们的应用范围。受此启发,一组大型视觉语言模型(LVLMs)(Alayrac等,2022;Chen等,2022;Li等,2023c;Dai等,2023;Huang等,2023;Peng等,2023;Zhu等,2023;Liu等,2023;Ye等,2023b,a;Chen等,2023a;Li等,2023a;Zhang等,2023;Sun等,2023;OpenAI,2023)被开发出来,以增强大型语言模型感知和理解视觉信号的能力。这些大规模的视觉语言模型在解决现实世界的视觉中心问题上展示了令人鼓舞的潜力。

Nevertheless, although many works have explored the limitations and potential of LVLMs, current open-source LVLMs often suffer from inadequate training and optimization and thus lag far behind the proprietary models (Chen et al., 2022, 2023b; OpenAI, 2023), which hinders further exploration and application of LVLMs in the open-source community. What’s more, as real-world visual scenarios are quite complicated, fine-grained visual understanding plays a crucial role in enabling LVLMs to assist people effectively and precisely. However, only a few attempts have been made in this direction (Peng et al., 2023; Chen et al., 2023a); the majority of open-source LVLMs still perceive images in a coarse-grained manner and lack the ability to perform fine-grained perception such as object grounding or text reading.

然而,尽管已经进行了大量工作来探索LVLMs的局限性和潜力,目前的开源LVLMs总是因训练和优化不足而远远落后于专有模型(Chen等,2022,2023b;OpenAI,2023),这阻碍了开源社区进一步探索和应用LVLMs。更重要的是,由于现实世界的视觉场景相当复杂,细粒度的视觉理解对于LVLMs有效且精确地帮助人们至关重要。但只有少数尝试朝着这个方向(Peng等,2023;Chen等,2023a)进行,大多数开源LVLMs仍在以粗粒度的方式感知图像,缺乏执行细粒度感知(如对象定位或文本阅读)的能力。

In this paper, we explore a way out and present the newest members of the open-sourced Qwen families: the Qwen-VL series. Qwen-VLs are a series of highly performant and versatile vision-language foundation models based on the Qwen-7B (Qwen, 2023) language model. We empower the LLM base with visual capacity by introducing a new visual receptor, including a language-aligned visual encoder and a position-aware adapter. The overall model architecture as well as the input-output interface are quite concise, and we carefully design a 3-stage training pipeline to optimize the whole model over a vast collection of image-text data.

在本文中,我们探索了一条出路,并介绍了开源Qwen系列的最新成员:Qwen-VL系列。Qwen-VLs是一系列基于Qwen-7B(Qwen,2023)语言模型的高性能且多功能的视觉语言基础模型。我们通过引入新的视觉接收器(包括语言对齐的视觉编码器和位置感知适配器),为LLM基础模型赋予了视觉能力。模型的整体架构和输入输出接口相当简洁,我们精心设计了三阶段训练流程,以便在大规模图像-文本语料库上优化整个模型。

Our pre-trained checkpoint, termed Qwen-VL, is capable of perceiving and understanding visual inputs, generating desired responses according to given prompts, and accomplishing various vision-language tasks such as image captioning, question answering, text-oriented question answering, and visual grounding. Qwen-VL-Chat is the instruction-tuned vision-language chatbot based on Qwen-VL. As shown in Fig. 2, Qwen-VL-Chat is able to interact with users and perceive the input images following the intention of users. Specifically, the features of the Qwen-VL series models include:

我们的预训练检查点称为Qwen-VL,能够感知和理解视觉输入,根据给定的提示生成所需的响应,并完成各种视觉语言任务,如图像描述、问答、面向文本的问答和视觉定位。Qwen-VL-Chat是基于Qwen-VL的指令微调视觉语言聊天机器人。如图2所示,Qwen-VL-Chat能够与用户互动,并根据用户的意图感知输入图像。具体来说,Qwen-VL系列模型的特点包括:

  • Leading performance: Qwen-VLs achieve top-tier accuracy on a wide range of vision-centric understanding benchmarks compared to counterparts of similar scale. Besides, Qwen-VL’s stunning performance covers not only the conventional benchmarks (e.g., captioning, question-answering, grounding), but also some recently introduced dialogue benchmarks.

领先的性能:与同类模型相比,Qwen-VLs在大量视觉中心理解基准测试中实现了顶级精度。此外,Qwen-VL的出色性能不仅涵盖了传统基准(如描述、问答、定位),还包括一些最近引入的对话基准。

  • Multi-lingual: Similar to Qwen-LM, Qwen-VLs are trained upon multilingual image-text data with a considerable amount of corpus being in English and Chinese. In this way, Qwen-VLs naturally support English, Chinese, and multilingual instructions.

多语言:与Qwen-LM类似,Qwen-VLs在多语言图像-文本数据上进行训练,大量语料为英文和中文。这样,Qwen-VLs自然支持英语、中文和多语言指令。

  • Multi-image: In the training phase, we allow arbitrary interleaved image-text data as Qwen-VL’s inputs. This feature allows our Qwen-VL-Chat to compare, understand, and analyze the context when multiple images are given.

多图像:在训练阶段,我们允许任意交错的图像-文本数据作为Qwen-VL的输入。此功能使我们的Qwen-VL-Chat在给出多张图像时能够比较、理解和分析上下文。

  • Fine-grained visual understanding: Thanks to the higher-resolution input size and fine-grained corpus we used in training, Qwen-VLs exhibit highly competitive fine-grained visual understanding ability. Compared to existing vision-language generalists, our Qwen-VLs possess much better grounding, text-reading, text-oriented question answering, and fine-grained dialog performance.

细粒度视觉理解:得益于我们在训练中使用的高分辨率输入大小和细粒度语料库,Qwen-VLs表现出高度竞争力的细粒度视觉理解能力。与现有的视觉语言通用模型相比,我们的Qwen-VLs在定位、文本阅读、面向文本的问答和细粒度对话性能方面表现更好。


Figure 2: Some qualitative examples generated by our Qwen-VL-Chat. Qwen-VL-Chat supports multiple image inputs, multi-round dialogue, multilingual conversation, text-reading, localization, fine-grained recognition and understanding ability.

图2:由我们的Qwen-VL-Chat生成的一些定性示例。Qwen-VL-Chat支持多图像输入、多轮对话、多语言对话、文本阅读、定位、细粒度识别和理解能力。

2 Methodology

2.1 Model Architecture

The overall network architecture of Qwen-VL consists of three components and the details of model parameters are shown in Table 1:

2.1 模型架构

Qwen-VL的整体网络架构由三个组件组成,模型参数的详细信息见表1:

Large Language Model: Qwen-VL adopts a large language model as its foundation component. The model is initialized with pre-trained weights from Qwen-7B (Qwen, 2023).

大型语言模型:Qwen-VL采用大型语言模型作为其基础组件。模型使用Qwen-7B(Qwen,2023)的预训练权重进行初始化。

Visual Encoder: The visual encoder of Qwen-VL uses the Vision Transformer (ViT) (Dosovitskiy et al., 2021) architecture, initialized with pre-trained weights from OpenCLIP’s ViT-bigG (Ilharco et al., 2021). During both training and inference, input images are resized to a specific resolution. The visual encoder processes images by splitting them into patches with a stride of 14, generating a set of image features.

视觉编码器:Qwen-VL的视觉编码器使用视觉Transformer(ViT)(Dosovitskiy等,2021)架构,使用Openclip的ViT-bigG(Ilharco等,2021)的预训练权重进行初始化。在训练和推理过程中,输入图像被调整为特定分辨率。视觉编码器通过将图像分割成步幅为14的块,生成一组图像特征。

Position-aware Vision-Language Adapter: To alleviate the efficiency issues arising from long image feature sequences, Qwen-VL introduces a vision-language adapter that compresses the image features. This adapter comprises a single-layer cross-attention module initialized randomly. The module uses a group of trainable vectors (Embeddings) as query vectors and the image features from the visual encoder as keys for cross-attention operations. This mechanism compresses the visual feature sequence to a fixed length of 256. The ablation about the number of queries is shown in Appendix E.2. Additionally, considering the significance of positional information for fine-grained image comprehension, 2D absolute positional encodings are incorporated into the cross-attention mechanism’s query-key pairs to mitigate the potential loss of positional details during compression. The compressed image feature sequence of length 256 is subsequently fed into the large language model.

位置感知的视觉语言适配器:为缓解长图像特征序列带来的效率问题,Qwen-VL引入了一个视觉语言适配器来压缩图像特征。这个适配器包括一个随机初始化的单层交叉注意模块。该模块使用一组可训练的向量(嵌入)作为查询向量,并使用来自视觉编码器的图像特征作为交叉注意操作的键。这个机制将视觉特征序列压缩到固定长度的256。关于查询数量的消融实验见附录E.2。此外,考虑到位置信息对于细粒度图像理解的重要性,在交叉注意机制的查询-键对中加入了二维绝对位置编码,以减轻压缩过程中可能丢失的位置细节。压缩后的长度为256的图像特征序列随后被输入大型语言模型。
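The following PyTorch sketch illustrates the kind of position-aware resampler described above: a fixed set of learnable queries cross-attends over the ViT patch features and compresses them to 256 tokens, with learned 2D absolute positional encodings added to the query-key pairs. Dimensions, module names, and the use of nn.MultiheadAttention are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PositionAwareVLAdapter(nn.Module):
    """Single-layer cross-attention resampler (illustrative sketch).

    A fixed set of learnable query embeddings attends over the ViT patch
    features and compresses them to `num_queries` tokens (256 in Qwen-VL).
    Learned 2D absolute positional encodings are added to the query-key pairs
    so that spatial information survives the compression.
    """

    def __init__(self, vis_dim=1664, llm_dim=4096, num_queries=256,
                 max_patches=1024, num_heads=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.query_pos = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # one positional embedding per ViT patch (1024 = (448/14)^2), flattened row-major
        self.key_pos = nn.Parameter(torch.randn(max_patches, llm_dim) * 0.02)
        self.kv_proj = nn.Linear(vis_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, patch_feats):              # patch_feats: [B, N, vis_dim]
        b, n, _ = patch_feats.shape
        kv = self.kv_proj(patch_feats)           # project ViT features to LLM width
        k = kv + self.key_pos[:n]                # add absolute positions to the keys
        q = self.queries.unsqueeze(0).expand(b, -1, -1) + self.query_pos
        out, _ = self.attn(query=q, key=k, value=kv)
        return out                               # [B, num_queries, llm_dim], fed to the LLM
```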

Table 1: Details of Qwen-VL model parameters.

表1:Qwen-VL模型参数详情。

Vision Encoder | VL Adapter | LLM | Total
1.9B | 0.08B | 7.7B | 9.6B


Figure 3: The training pipeline of the Qwen-VL series.

图3:Qwen-VL系列的训练流程。

2.2 Inputs and Outputs

Image Input: Images are processed through the visual encoder and adapter, yielding fixed-length sequences of image features. To differentiate between image feature input and text feature input, two special tokens (<img> and </img>) are added at the beginning and end of the image feature sequence respectively, signifying the start and end of image content.

2.2 输入和输出

图像输入:图像通过视觉编码器和适配器处理,产生固定长度的图像特征序列。为区分图像特征输入和文本特征输入,两个特殊词元 (<img>和</img>) 分别被添加到图像特征序列的开头和结尾,表示图像内容的开始和结束。

Bounding Box Input and Output: To enhance the model’s capacity for fine-grained visual understanding and grounding, Qwen-VL’s training involves data in the form of region descriptions, questions, and detections. Differing from conventional tasks involving image-text descriptions or questions, this task necessitates the model’s accurate understanding and generation of region descriptions in a designated format. For any given bounding box, a normalization process is applied (within the range [0, 1000)) and the box is transformed into a specified string format: “(Xtopleft, Ytopleft), (Xbottomright, Ybottomright)”. The string is tokenized as text and does not require an additional positional vocabulary. To distinguish detection strings from regular text strings, two special tokens (<box> and </box>) are added at the beginning and end of the bounding box string. Additionally, to appropriately associate bounding boxes with their corresponding descriptive words or sentences, another set of special tokens (<ref> and </ref>) is introduced, marking the content referred to by the bounding box.

边界框输入和输出:为增强模型对细粒度视觉理解和定位的能力,Qwen-VL的训练涉及区域描述、问题和检测形式的数据。与涉及图像文本描述或问题的传统任务不同,该任务要求模型以指定格式准确理解和生成区域描述。对于任何给定的边界框,进行归一化处理(在[0, 1000)范围内)并转换为指定的字符串格式:“(Xtopleft, Ytopleft), (Xbottomright, Ybottomright)”。字符串被作为文本进行词元化,不需要额外的位置词汇。为区分检测字符串和常规文本字符串,两个特殊词元(<box>和</box>)被添加到边界框字符串的开头和结尾。此外,为了适当地将边界框与其对应的描述性词语或句子关联,另一个特殊词元集(<ref>和</ref>)被引入,标记边界框所引用的内容。
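As a concrete illustration of the coordinate format above, the following Python helper normalizes a pixel-space box into the [0, 1000) range and wraps it with the <box>/<ref> special tokens; the function name and exact spacing are assumptions.

```python
def box_to_string(box, img_w, img_h, ref_text=None):
    """Serialize a pixel-space box into the text format described above.

    Coordinates are normalized into the integer range [0, 1000) and written as
    "(x1,y1),(x2,y2)" wrapped in <box>...</box>; an optional <ref>...</ref>
    span marks the text the box refers to. Helper name is illustrative.
    """
    x1, y1, x2, y2 = box
    nx1 = int(x1 / img_w * 1000)
    ny1 = int(y1 / img_h * 1000)
    nx2 = int(x2 / img_w * 1000)
    ny2 = int(y2 / img_h * 1000)
    s = f"<box>({nx1},{ny1}),({nx2},{ny2})</box>"
    if ref_text is not None:
        s = f"<ref>{ref_text}</ref>" + s
    return s

# e.g. box_to_string((120, 40, 680, 420), img_w=1280, img_h=720, ref_text="the dog")
# -> '<ref>the dog</ref><box>(93,55),(531,583)</box>'
```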

3 Training

As illustrated in Fig. 3, the training process of the Qwen-VL model consists of three stages: two stages of pre-training and a final stage of instruction fine-tuning training.

如图3所示,Qwen-VL模型的训练过程包括三个阶段:两个预训练阶段和一个最终的指令微调训练阶段。

3.1 Pre-training

In the first stage of pre-training, we mainly utilize a large-scale, weakly labeled, web-crawled set of image-text pairs. Our pre-training dataset is composed of several publicly accessible sources and some in-house data. We made an effort to clean the dataset of certain patterns. As summarized in Table 2, the original dataset contains a total of 5 billion image-text pairs, and after cleaning, 1.4 billion data remain, with 77.3% English (text) data and 22.7% Chinese (text) data.

在预训练的第一阶段,我们主要利用一个大规模、弱标记的、网络爬取的图像-文本对集合。我们的预训练数据集由几个公开可访问的来源和一些内部数据组成。我们努力清理数据集中的某些模式。如表2所示,原始数据集包含50亿图像-文本对,清理后剩余14亿数据,其中77.3%为英文(文本)数据,22.7%为中文(文本)数据。

Table 2: Details of Qwen-VL pre-training data. LAION-en and LAION-zh are the English and Chinese language subset of LAION-5B (Schuhmann et al., 2022a). LAION-COCO (Schuhmann et al., 2022b) is a synthetic dataset generated from LAION-en. DataComp (Gadre et al., 2023) and Coyo (Byeon et al., 2022) are collections of image-text pairs. CC12M (Changpinyo et al., 2021), CC3M (Sharma et al., 2018), SBU (Ordonez et al., 2011) and COCO Caption (Chen et al., 2015) are academic caption datasets.

表2:Qwen-VL预训练数据的详细信息。LAION-en和LAION-zh是LAION-5B(Schuhmann等,2022a)的英文和中文语言子集。LAION-COCO(Schuhmann等,2022b)是从LAION-en生成的合成数据集。DataComp(Gadre等,2023)和Coyo(Byeon等,2022)是图像-文本对的集合。CC12M(Changpinyo等,2021)、CC3M(Sharma等,2018)、SBU(Ordonez等,2011)和COCO Caption(Chen等,2015)是学术标题数据集。


We freeze the large language model and only optimize the vision encoder and VL adapter in this stage. The input images are resized to 224 × 224. The training objective is to minimize the cross-entropy of the text tokens. The maximum learning rate is 2e−4 and the training process uses a batch size of 30720 for the image-text pairs, and the entire first stage of pre-training lasts for 50,000 steps, consuming approximately 1.5 billion image-text samples. More hyperparameters are detailed in Appendix C and the convergence curve of this stage is shown in Figure 6.

在这个阶段,我们冻结大型语言模型,仅优化视觉编码器和VL适配器。输入图像被调整为224 × 224。训练目标是最小化文本词元的交叉熵。最大学习率为2e−4,训练过程使用批量大小为30720的图像-文本对,整个第一阶段的预训练持续50000步,大约消耗了15亿图像-文本样本。更多超参数细节见附录C,该阶段的收敛曲线见图6。
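A minimal sketch of the stage-1 objective described above: cross-entropy is computed over text tokens only, with image-feature positions masked out of the loss. The masking convention (ignore index -100) is an assumption borrowed from common causal-LM training code, not taken from the released implementation.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions excluded from the loss

def text_token_cross_entropy(logits, labels, image_token_mask):
    """Cross-entropy over text tokens only (sketch of the stage-1 objective).

    logits:            [B, T, vocab] next-token predictions from the LLM
    labels:            [B, T] target token ids
    image_token_mask:  [B, T] True where the position holds image features,
                       which receive no loss.
    """
    labels = labels.masked_fill(image_token_mask, IGNORE_INDEX)
    # standard causal-LM shift: predict token t+1 from position t
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1),
                           ignore_index=IGNORE_INDEX)
```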

3.2 Multi-task Pre-training

In the second stage of multi-task pre-training, we introduce high-quality and fine-grained VL annotation data with a larger input resolution and interleaved image-text data. As summarized in Table 3, we train Qwen-VL on 7 tasks simultaneously. For text generation, we use an in-house collected corpus to maintain the LLM’s ability. Captioning data is the same as in Table 2 except with far fewer samples and excluding LAION-COCO. We use a mixture of publicly available data for the VQA task, which includes GQA (Hudson and Manning, 2019), VGQA (Krishna et al., 2017), VQAv2 (Goyal et al., 2017), DVQA (Kafle et al., 2018), OCR-VQA (Mishra et al., 2019) and DocVQA (Mathew et al., 2021). We follow Kosmos-2 in using the GRIT (Peng et al., 2023) dataset for the grounding task with minor modifications. For the reference grounding and grounded captioning duality tasks, we construct training samples from GRIT (Peng et al., 2023), Visual Genome (Krishna et al., 2017), RefCOCO (Kazemzadeh et al., 2014), RefCOCO+, and RefCOCOg (Mao et al., 2016). To improve the text-oriented tasks, we collect PDF and HTML format data from Common Crawl and generate synthetic OCR data in English and Chinese with natural scenery backgrounds, following Kim et al. (2022). Finally, we simply construct interleaved image-text data by packing same-task data into sequences of length 2048.

在第二阶段的多任务预训练中,我们引入高质量和细粒度的VL注释数据,并使用更大分辨率的输入和交错的图像-文本数据。如表3所示,我们同时在7个任务上训练Qwen-VL。对于文本生成,我们使用内部收集的语料库,以保持LLM的能力。标题数据与表2相同,但样本数要少得多,并且不包括LAION-COCO。我们使用公开可用的数据集混合用于VQA任务,包括GQA(Hudson和Manning,2019)、VGQA(Krishna等,2017)、VQAv2(Goyal等,2017)、DVQA(Kafle等,2018)、OCR-VQA(Mishra等,2019)和DocVQA(Mathew等,2021)。我们按照Kosmos-2的方法使用GRIT(Peng等,2023)数据集进行定位任务,但进行了小的修改。对于参考定位和定位标题对偶任务,我们从GRIT(Peng等,2023)、Visual Genome(Krishna等,2017)、RefCOCO(Kazemzadeh等,2014)、RefCOCO+和RefCOCOg(Mao等,2016)构建训练样本。为了改进文本导向任务,我们从Common Crawl收集pdf和HTML格式数据,并生成带有自然风景背景的英文和中文OCR合成数据,跟随(Kim等,2022)。最后,我们通过将同一任务数据打包成长度为2048的序列,简单地构建交错的图像-文本数据。
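The paragraph above only states that same-task samples are packed into sequences of length 2048; the greedy packing below is one simple way this could be done and is purely illustrative.

```python
def pack_sequences(samples, max_len=2048, pad_id=0):
    """Greedily pack tokenized samples from the same task into sequences of at
    most `max_len` tokens, a simple way to build interleaved image-text data.
    This is a sketch; the actual packing strategy is not specified beyond the
    target length."""
    packed, current = [], []
    for tokens in samples:                      # each `tokens` is a list of ids
        tokens = tokens[:max_len]               # truncate oversized samples
        if current and len(current) + len(tokens) > max_len:
            packed.append(current + [pad_id] * (max_len - len(current)))
            current = []
        current.extend(tokens)
    if current:
        packed.append(current + [pad_id] * (max_len - len(current)))
    return packed
```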

Table 3: Details of Qwen-VL multi-task pre-training data.

表3:Qwen-VL多任务预训练数据的详细信息。


We increase the input resolution of the visual encoder from 224 × 224 to 448 × 448, reducing the information loss caused by image down-sampling. Besides, we ablate window attention and global attention for higher resolutions of the vision transformer in Appendix E.3. We unlock the large language model and train the whole model. The training objective is the same as in the pre-training stage.

我们将视觉编码器的输入分辨率从224×224增加到448×448,减少图像下采样导致的信息损失。此外,我们在附录E.3中消融了视觉Transformer在更高分辨率下的窗口注意和全局注意。我们解锁了大型语言模型并训练整个模型。训练目标与预训练阶段相同。

3.3 Supervised Fine-tuning

During this stage, we finetuned the Qwen-VL pre-trained model through instruction fine-tuning to enhance its instruction following and dialogue capabilities, resulting in the interactive Qwen-VL-Chat model. The multi-modal instruction tuning data primarily comes from caption data or dialogue data generated through LLM self-instruction, which often only addresses single-image dialogue and reasoning and is limited to image content comprehension. We construct an additional set of dialogue data through manual annotation, model generation, and strategy concatenation to incorporate localization and multi-image comprehension abilities into the Qwen-VL model. We confirm that the model effectively transfers these capabilities to a wider range of languages and question types. Additionally, we mix multi-modal and pure text dialogue data during training to ensure the model’s universality in dialogue capabilities. The instruction tuning data amounts to 350k. In this stage, we freeze the visual encoder and optimize the language model and adapter module. We demonstrate the data format of this stage in Appendix B.2.

在此阶段,我们通过指令微调对Qwen-VL预训练模型进行微调,以增强其指令遵循和对话能力,从而得到交互式的Qwen-VL-Chat模型。多模态指令调优数据主要来自通过LLM自我指令生成的标题数据或对话数据,这通常只涉及单图像对话和推理,并限于图像内容理解。我们通过手工注释、模型生成和策略串联构建了一组额外的对话数据,将定位和多图像理解能力纳入Qwen-VL模型。我们确认模型能够有效地将这些能力转移到更广泛的语言和问题类型。此外,我们在训练期间混合多模态和纯文本对话数据,以确保模型在对话能力方面的通用性。指令调优数据总量为35万。在此阶段,我们冻结视觉编码器并优化语言模型和适配器模块。我们在附录B.2中展示了该阶段的数据格式。

4 Evaluation

In this section, we conduct an overall evaluation on various multi-modal tasks to comprehensively assess our models’ visual understanding ability. In the following, Qwen-VL denotes the model after the multi-task training, and Qwen-VL-Chat denotes the model after supervised fine-tuning (SFT) stage.

在本节中,我们对各种多模态任务进行总体评估,以全面评估我们模型的视觉理解能力。在下文中,Qwen-VL表示多任务训练后的模型,Qwen-VL-Chat表示监督微调(SFT)阶段后的模型。

Table 9 provides a detailed summary of the used evaluation benchmarks and corresponding metrics.

表9提供了所用评估基准和相应指标的详细总结。

4.1 Image Caption and General Visual Question Answering

Image caption and general visual question answering (VQA) are two conventional tasks for vision-language models. Specifically, image caption requires the model to generate a description for a given image and general VQA requires the model to generate an answer for a given image-question pair.

图像描述和一般视觉问答(VQA)是视觉语言模型的两个传统任务。具体来说,图像描述要求模型生成给定图像的描述,而一般VQA要求模型生成给定图像-问题对的答案。

Table 4: Results on Image Captioning and General VQA.

表4:图像描述和一般VQA的结果。


For the image caption task, we choose Nocaps (Agrawal et al., 2019) and Flickr30K (Young et al., 2014) as benchmarks and report CIDEr score (Vedantam et al., 2015) as metric. We utilize greedy search for caption generation with a prompt of “Describe the image in English:”.

对于图像描述任务,我们选择Nocaps(Agrawal等,2019)和Flickr30K(Young等,2014)作为基准,并报告CIDEr得分(Vedantam等,2015)作为指标。我们利用贪心搜索生成标题,提示为“用英语描述图像:”。

For general VQA, we utilize five benchmarks: VQAv2 (Goyal et al., 2017), OKVQA (Marino et al., 2020), GQA (Hudson and Manning, 2019), ScienceQA, and VizWiz VQA (Gurari et al., 2018). For VQAv2, OKVQA, GQA and VizWiz VQA, we employ open-ended answer generation with a greedy decoding strategy and a prompt of “[question] Answer:”, without any constraint on the model’s output space. However, for ScienceQA, we constrain the model’s output to the possible options (instead of open-ended generation), choose the option with the highest confidence as the model’s prediction, and report the Top-1 accuracy.

对于一般VQA,我们使用五个基准:VQAv2(Goyal等,2017)、OKVQA(Marino等,2020)、GQA(Hudson和Manning,2019)、ScienceQA和VizWiz VQA(Gurari等,2018)。对于VQAv2、OKVQA、GQA和VizWiz VQA,我们采用开放式答案生成,使用贪心解码策略,提示为“[问题]答案:”,不对模型的输出空间施加任何约束。然而,对于ScienceQA,我们将模型的输出限制在给定的选项中(而不是开放式生成),选择置信度最高的选项作为模型的预测,并报告Top-1准确率。
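A sketch of the two evaluation protocols above, focusing on the constrained multiple-choice case: each candidate option is scored by its log-likelihood given the prompt, and the most confident option is chosen. The Hugging-Face-style causal-LM interface is an assumption, and image handling is omitted for brevity.

```python
import torch

@torch.no_grad()
def pick_option(model, tokenizer, prompt, options):
    """Constrained multiple-choice scoring (e.g., ScienceQA): score each
    candidate answer by its log-likelihood given the prompt and return the
    most confident one. Illustrative sketch, not the exact evaluation code;
    the prompt is assumed to already contain the image placeholder."""
    scores = []
    for opt in options:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + opt, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, :prompt_ids.shape[1]] = -100      # only score the option tokens
        loss = model(input_ids=full_ids, labels=labels).loss
        scores.append(-loss.item())                 # higher log-likelihood = better
    return options[int(torch.tensor(scores).argmax())]
```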

The overall performance on image caption and general VQA tasks is reported in Table 4. As the results show, Qwen-VL and Qwen-VL-Chat both achieve clearly better results than previous generalist models on both tasks. Specifically, on the zero-shot image captioning task, Qwen-VL achieves state-of-the-art performance (i.e., an 85.8 CIDEr score) on the Flickr30K karpathy-test split, even outperforming previous generalist models with far more parameters (e.g., Flamingo-80B with 80B parameters).

图像描述和一般VQA任务的总体表现如表4所示。结果显示,我们的Qwen-VL和Qwen-VL-Chat在这两个任务上均显著优于之前的通用模型。特别是在零样本图像描述任务中,Qwen-VL在Flickr30K karpathy-test分割上达到了最先进的性能(即85.8 CIDEr得分),甚至超过了参数量更多的通用模型(如Flamingo-80B,80B参数)。

On general VQA benchmarks, our models also exhibit distinct advantages compared to others. On VQAv2, OKVQA and GQA benchmarks, Qwen-VL achieves 79.5, 58.6 and 59.3 accuracy respectively, which surpasses recent proposed LVLMs by a large margin. It’s worth noting that Qwen-VL also shows strong zero-shot performance on ScienceQA and VizWiz datasets.

在一般VQA基准测试中,我们的模型相比其他模型也表现出明显的优势。在VQAv2、OKVQA和GQA基准测试中,Qwen-VL分别达到了79.5、58.6和59.3的准确率,远远超过了最近提出的LVLMs。值得注意的是,Qwen-VL在ScienceQA和VizWiz数据集上也显示出强大的零样本性能。

4.2 Text-oriented Visual Question Answering

Text-oriented visual understanding has a broad application prospect in real-world scenarios. We assess our models’ ability toward text-oriented visual question answering on several benchmarks including TextVQA (Sidorov et al., 2022), DocVQA (Mathew et al., 2021), ChartQA (Masry et al., 2022), AI2DDiagram (Kembhavi et al., 2016), and OCR-VQA (Mishra et al., 2019). Similarly, the results are shown in Table 5. Compared to previous generalist models and recent LVLMs, our models show better performance on most benchmarks, frequently by a large margin.

文本导向的视觉理解在现实世界场景中具有广泛的应用前景。我们在多个基准测试上评估了我们模型在文本导向的视觉问答方面的能力,包括TextVQA(Sidorov等,2022)、DocVQA(Mathew等,2021)、ChartQA(Masry等,2022)、AI2DDiagram(Kembhavi等,2016)和OCR-VQA(Mishra等,2019)。同样,结果如表5所示。与之前的通用模型和最近的LVLMs相比,我们的模型在大多数基准上表现更好,通常有很大差距。

Table 5: Results on Text-oriented VQA.

表5:文本导向VQA的结果。

4.3 Refer Expression Comprehension

We show our models’ fine-grained image understanding and localization ability by evaluating on a set of referring expression comprehension benchmarks such as RefCOCO (Kazemzadeh et al., 2014), RefCOCOg (Mao et al., 2016), RefCOCO+ (Mao et al., 2016), and GRIT (Gupta et al., 2022). Specifically, the referring expression comprehension task requires the model to localize the target object under the guidance of a description. The results are shown in Table 6. Compared to previous generalist models and recent LVLMs, our models obtain top-tier results on all benchmarks.

我们通过在一些参考表达理解基准上进行评估,展示了我们模型的细粒度图像理解和定位能力,如RefCOCO(Kazemzadeh等,2014)、RefCOCOg(Mao等,2016)、RefCOCO+(Mao等,2016)和GRIT(Gupta等,2022)。具体来说,参考表达理解任务要求模型在描述的指导下定位目标对象。结果如表6所示。与之前的通用模型或最近的LVLMs相比,我们的模型在所有基准上都取得了顶级结果。

Table 6: Results on Referring Expression Comprehension task.

表6:参考表达理解任务的结果。


4.4 Few-shot Learning on Vision-Language Tasks

Our model also exhibits satisfactory in-context learning (a.k.a., few-shot learning) ability. As shown in Figure 4, Qwen-VL achieves better performance through in-context few-shot learning on OKVQA (Marino et al., 2019), VizWiz (Gurari et al., 2018), TextVQA (Sidorov et al., 2020), and Flickr30K (Young et al., 2014) when compared with models with a similar number of parameters (Flamingo-9B (Alayrac et al., 2022), OpenFlamingo-9B, and IDEFICS-9B). Qwen-VL’s performance is even comparable with much larger models (Flamingo-80B and IDEFICS-80B). Note that we adopt naive random sampling to construct the few-shot exemplars; sophisticated exemplar construction methods such as RICES (Yang et al., 2022b) are not used, although they would yield better results.

我们的模型在上下文学习(即少样本学习)能力上也表现出令人满意的效果。如图4所示,与参数规模相近的模型(Flamingo-9B(Alayrac等,2022)、OpenFlamingo-9B和IDEFICS-9B)相比,Qwen-VL在OKVQA(Marino等,2019)、VizWiz(Gurari等,2018)、TextVQA(Sidorov等,2020)和Flickr30K(Young等,2014)上的上下文少样本学习中表现更好。Qwen-VL的性能甚至可以与更大规模的模型(Flamingo-80B和IDEFICS-80B)媲美。需要注意的是,我们采用简单的随机采样来构建少样本示例,并未使用RICES(Yang等,2022b)等复杂的少样本示例构建方法,尽管使用它们会取得更好的结果。

Figure 4: Few-shot learning results of Qwen-VL in comparison with other models.

图4:Qwen-VL与其他模型的少样本学习结果比较。
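A minimal sketch of the naive few-shot prompt construction mentioned above: exemplars are drawn by plain random sampling (no RICES-style retrieval) and concatenated before the query. The field names and the <img> markup used here are illustrative assumptions.

```python
import random

def build_few_shot_prompt(support_set, query, k=4, seed=0):
    """Naive random-sampling construction of in-context exemplars, as used in
    the few-shot evaluation above. `support_set` and `query` are dicts with
    'image_path', 'question', and (for exemplars) 'answer' keys; these names
    are assumptions for illustration."""
    rng = random.Random(seed)
    shots = rng.sample(support_set, k)
    parts = []
    for ex in shots:
        parts.append(f"<img>{ex['image_path']}</img>{ex['question']} Answer: {ex['answer']}")
    parts.append(f"<img>{query['image_path']}</img>{query['question']} Answer:")
    return "\n".join(parts)
```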

4.5 Instruction Following in Real-world User Behavior

In addition to the previous conventional vision-language evaluations, to assess our Qwen-VL-Chat model’s capacity under real-world user behavior, we further conduct evaluations on TouchStone (Bai et al., 2023), SEED-Bench (Li et al., 2023b), and MME (Fu et al., 2023). TouchStone is an open-ended vision-language instruction-following benchmark. We compare the instruction-following ability of Qwen-VL-Chat with other instruction-tuned LVLMs in both English and Chinese on the TouchStone benchmark. SEED-Bench consists of 19K multiple-choice questions with accurate human annotations for evaluating multimodal LLMs, covering 12 evaluation dimensions including both spatial and temporal understanding. MME measures both perception and cognition abilities on a total of 14 subtasks.

除了之前的传统视觉语言评估外,为了评估我们的Qwen-VL-Chat模型在现实世界用户行为下的能力,我们还在TouchStone(Bai等,2023)、SEED-Bench(Li等,2023b)和MME(Fu等,2023)上进行了评估。TouchStone是一个开放式的视觉语言指令跟随基准测试。我们在TouchStone基准测试中比较了Qwen-VL-Chat与其他指令微调LVLMs在英语和中文方面的指令跟随能力。SEED-Bench包含1.9万个带有准确人工注释的多项选择题,用于评估多模态LLMs,涵盖包括空间和时间理解在内的12个评估维度。MME在总共14个子任务上测量感知和认知能力。

Table 7: Results on Instruction-following benchmarks.

表7:指令跟随基准测试的结果。


The results on three benchmarks are shown in Table 7. Qwen-VL-Chat has achieved obvious advantages over other LVLMs on all three benchmarks, indicating that our model performs well for understanding and answering diverse user instructions. In SEED-Bench, we have found that our models’ visual capabilities can be effectively transferred to video tasks by simply sampling four frames. In terms of the overall scores presented in TouchStone, our model demonstrates a clear advantage compared to other LVLMs, especially in terms of its Chinese capabilities. In terms of the broad categories of abilities, our model exhibits a more pronounced advantage in understanding and recognition, particularly in areas such as text recognition and chart analysis. For more detailed information, please refer to the TouchStone dataset.

在三个基准测试中的结果如表7所示。Qwen-VL-Chat在所有三个基准测试中均表现出明显的优势,表明我们的模型在理解和回答各种用户指令方面表现良好。在SEED-Bench中,我们发现我们的模型的视觉能力可以通过简单地采样四帧图像有效地转移到视频任务中。根据TouchStone中展示的总体得分,我们的模型相比其他LVLMs表现出明显的优势,特别是在中文能力方面。在广泛的能力类别中,我们的模型在理解和识别方面表现出更明显的优势,特别是在文本识别和图表分析等领域。有关更多详细信息,请参阅TouchStone数据集。

5 Related Work

In recent years, researchers have shown considerable interest in vision-language learning (Su et al., 2019; Chen et al., 2020; Li et al., 2020; Zhang et al., 2021; Li et al., 2021b; Lin et al., 2021; Dou et al., 2022; Zeng et al., 2021; Li et al., 2021a, 2022), especially in the development of multi-task generalist models (Hu and Singh, 2021; Singh et al., 2022; Zhu et al., 2022; Yu et al., 2022; Wang et al., 2022a; Lu et al., 2022a; Bai et al., 2022). CoCa (Yu et al., 2022) proposes an encoder-decoder structure to address image-text retrieval and vision-language generation tasks simultaneously. OFA (Wang et al., 2022a) transforms specific vision-language tasks into sequence-to-sequence tasks using customized task instructions. Unified I/O (Lu et al., 2022a) further introduces more tasks like segmentation and depth estimation into a unified framework. Another category of research focuses on building vision-language representation models (Radford et al., 2021; Jia et al., 2021; Zhai et al., 2022; Yuan et al., 2021; Yang et al., 2022a). CLIP (Radford et al., 2021) leverages contrastive learning and large amounts of data to align images and language in a semantic space, resulting in strong generalization capabilities across a wide range of downstream tasks. BEiT-3 (Wang et al., 2022b) employs a mixture-of-experts (MOE) structure and unified masked token prediction objective, achieving state-of-the-art results on various visual-language tasks. In addition to vision-language learning, ImageBind (Girdhar et al., 2023) and ONE-PEACE (Wang et al., 2023) align more modalities such as speech into a unified semantic space, thus creating more general representation models.

5 相关工作

近年来,研究人员对视觉语言学习表现出了极大的兴趣(Su等,2019;Chen等,2020;Li等,2020;Zhang等,2021;Li等,2021b;Lin等,2021;Dou等,2022;Zeng等,2021;Li等,2021a,2022),尤其是在多任务通用模型的开发方面(Hu和Singh,2021;Singh等,2022;Zhu等,2022;Yu等,2022;Wang等,2022a;Lu等,2022a;Bai等,2022)。CoCa(Yu等,2022)提出了一种编码器-解码器结构,旨在同时解决图像-文本检索和视觉语言生成任务。OFA(Wang等,2022a)将特定的视觉语言任务转换为使用定制任务指令的序列到序列任务。Unified I/O(Lu等,2022a)进一步引入了更多的任务,如分割和深度估计到统一框架中。另一类研究侧重于构建视觉语言表示模型(Radford等,2021;Jia等,2021;Zhai等,2022;Yuan等,2021;Yang等,2022a)。CLIP(Radford等,2021)利用对比学习和大量数据,在语义空间中对齐图像和语言,从而在各种下游任务中具有很强的泛化能力。BEiT-3(Wang等,2022b)采用了一种专家混合(MOE)结构和统一的掩码词元预测目标,在各种视觉语言任务中实现了最先进的结果。除了视觉语言学习,ImageBind(Girdhar等,2023)和ONE-PEACE(Wang等,2023)将更多的模态(如语音)对齐到统一的语义空间,从而创建了更通用的表示模型。

Despite achieving significant progress, previous vision-language models still have several limitations such as poor robustness in instruction following, limited generalization capabilities in unseen tasks, and a lack of in-context abilities. With the rapid development of large language models (LLMs) (Brown et al., 2020; OpenAI, 2023; Anil et al., 2023; Gao et al., 2023; Qwen, 2023), researchers have started building more powerful large vision-language models (LVLMs) based on LLMs (Alayrac et al., 2022; Chen et al., 2022; Li et al., 2023c; Dai et al., 2023; Huang et al., 2023; Peng et al., 2023; Zhu et al., 2023; Liu et al., 2023a; Li et al., 2023a; Zhang et al., 2023; Sun et al., 2023). BLIP-2 (Li et al., 2023c) proposes Q-Former to align the frozen vision foundation models and LLMs. Meanwhile, LLaVA (Liu et al., 2023) and MiniGPT4 (Zhu et al., 2023) introduce visual instruction tuning to enhance instruction following capabilities in LVLMs. Additionally, mPLUG-DocOwl (Ye et al., 2023a) incorporates document understanding capabilities into LVLMs by introducing digital documents data. Kosmos-2 (Peng et al., 2023), Shikra (Chen et al., 2023a), and BuboGPT (Zhao et al., 2023) further enhance LVLMs with visual grounding abilities, enabling region description and localization. In this work, we integrate image captioning, visual question answering, OCR, document understanding, and visual grounding capabilities into Qwen-VL. The resulting model achieves outstanding performance on these diverse style tasks.

尽管取得了显著进展,以前的视觉语言模型仍存在一些限制,如指令跟随的鲁棒性差、在未见任务中的泛化能力有限以及缺乏上下文能力。随着大型语言模型(LLMs)的快速发展(Brown等,2020;OpenAI,2023;Anil等,2023;Gao等,2023;Qwen,2023),研究人员开始基于LLMs构建更强大的大型视觉语言模型(LVLMs)(Alayrac等,2022;Chen等,2022;Li等,2023c;Dai等,2023;Huang等,2023;Peng等,2023;Zhu等,2023;Liu等,2023a;Li等,2023a;Zhang等,2023;Sun等,2023)。BLIP-2(Li等,2023c)提出了Q-Former来对齐冻结的视觉基础模型和LLMs。同时,LLaVA(Liu等,2023)和MiniGPT4(Zhu等,2023)引入了视觉指令调优,以增强LVLMs的指令跟随能力。此外,mPLUG-DocOwl(Ye等,2023a)通过引入数字文档数据,将文档理解能力引入LVLMs。Kosmos-2(Peng等,2023)、Shikra(Chen等,2023a)和BuboGPT(Zhao等,2023)进一步增强了LVLMs的视觉定位能力,使其能够进行区域描述和定位。在这项工作中,我们将图像描述、视觉问答、OCR、文档理解和视觉定位能力整合到Qwen-VL中。所得到的模型在这些不同风格的任务中实现了卓越的性能。

6 Conclusion and Future Work

We release the Qwen-VL series, a set of large-scale multilingual vision-language models that aims to facilitate multimodal research. Qwen-VL outperforms similar models across various benchmarks, supporting multilingual conversations, multi-image interleaved conversations, grounding in Chinese, and fine-grained recognition. Moving forward, we are dedicated to further enhancing Qwen-VL’s capabilities in several key dimensions:

6 结论和未来工作

我们发布了Qwen-VL系列,这是一组大型多语言视觉语言模型,旨在促进多模态研究。Qwen-VL在各种基准测试中表现优于类似模型,支持多语言对话、多图像交错对话、中文定位和细粒度识别。展望未来,我们致力于在几个关键方面进一步增强Qwen-VL的能力:

  • Integrating Qwen-VL with more modalities, such as speech and video.

将Qwen-VL与更多的模态(如语音和视频)集成。

  • Augmenting Qwen-VL by scaling up the model size, training data, and higher resolution, enabling it to handle more complex and intricate relationships within multimodal data.

通过扩大模型规模、训练数据和更高分辨率来增强Qwen-VL,使其能够处理多模态数据中的更复杂和精细的关系。

  • Expanding Qwen-VL’s prowess in multi-modal generation, specifically in generating high-fidelity images and fluent speech.

扩展Qwen-VL在多模态生成方面的能力,特别是在生成高保真图像和流利语音方面。

A Dataset details

A.1 Image-text pairs

We use web-crawled image-text pairs dataset for pre-training, which includes LAION-en (Schuhmann et al., 2022a), LAION-zh (Schuhmann et al., 2022a), LAION-COCO (Schuhmann et al., 2022b), DataComp (Gadre et al., 2023) and Coyo (Byeon et al., 2022). We clean these noisy data by several steps:

A 数据集详情

A.1 图像-文本对

我们使用网络爬取的图像-文本对数据集进行预训练,包括LAION-en(Schuhmann等,2022a)、LAION-zh(Schuhmann等,2022a)、LAION-COCO(Schuhmann等,2022b)、DataComp(Gadre等,2023)和Coyo(Byeon等,2022)。我们通过以下几个步骤清理这些噪声数据:

  1. Removing pairs whose image has too large an aspect ratio

1. 删除图像长宽比过大的对

  2. Removing pairs whose image is too small

2. 删除图像过小的对

  3. Removing pairs with a low CLIP score (dataset-specific threshold)

3. 删除CLIP评分过低的对(特定数据集)

  4. Removing pairs whose text contains non-English or non-Chinese characters

4. 删除包含非英文或非中文字符的文本对

  5. Removing pairs whose text contains emoji characters

5. 删除包含表情符号的文本对

  6. Removing pairs whose text is too short or too long

6. 删除文本长度过短或过长的对

  7. Cleaning the HTML-tagged parts of the text

7. 清理文本中的HTML标签部分

  8. Cleaning the text of certain irregular patterns

8. 清理具有特定不规则模式的文本

For academic caption datasets, we remove pairs whose text contains the special tags in CC12M (Changpinyo et al., 2021) and SBU (Ordonez et al., 2011). If there is more than one text matching the same image, we select the longest one.

对于学术标题数据集,我们删除文本中包含CC12M(Changpinyo等,2021)和SBU(Ordonez等,2011)中特殊标签的对。如果有多个文本匹配同一图像,我们选择最长的一个。
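For illustration, the pair-level filters listed above might be expressed as a predicate like the one below. Every threshold value here is an assumption, since the paper names the filter types but not the exact cut-offs.

```python
import re

EMOJI = re.compile(r"[\U0001F300-\U0001FAFF]")

def clean_text(text):
    """Rules 7-8: strip HTML tags and collapse irregular whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def keep_pair(image_w, image_h, text, clip_score,
              clip_threshold=0.28, min_len=5, max_len=256,
              max_aspect_ratio=3.0, min_side=64):
    """Illustrative version of filters 1-3, 5 and 6 above; all thresholds are
    assumptions, not values from the paper."""
    aspect = max(image_w, image_h) / max(1, min(image_w, image_h))
    if aspect > max_aspect_ratio:            # 1. extreme aspect ratio
        return False
    if min(image_w, image_h) < min_side:     # 2. image too small
        return False
    if clip_score < clip_threshold:          # 3. dataset-specific CLIP-score cut
        return False
    if EMOJI.search(text):                   # 5. emoji characters
        return False
    return min_len <= len(text) <= max_len   # 6. text length bounds
```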

A.2 VQA

For the VQAv2 (Goyal et al., 2017) dataset, we select the answer annotation based on the maximum confidence. For other VQA datasets, we didn’t do anything special.

A.2 VQA

对于VQAv2(Goyal等,2017)数据集,我们根据最大置信度选择答案注释。对于其他VQA数据集,我们没有做特别处理。

A.3 Grounding

For the GRIT (Peng et al., 2023) dataset, we found that there are many recursive grounding box labels in one caption. We use the greedy algorithm to clean the caption to make sure each image contains the most box labels with no recursive box labels. For other grounding datasets, we simply concatenate the noun/phrase labels with respective bounding box coordinates.

A.3 定位

对于GRIT(Peng等,2023)数据集,我们发现一个标题中有许多递归的定位框标签。我们使用贪婪算法清理标题,确保每个图像包含最多的框标签且无递归框标签。对于其他定位数据集,我们仅将名词/短语标签与相应的边界框坐标连接起来。

A.4 OCR

We generated the synthetic OCR dataset using Synthdog (Kim et al., 2022). Specifically, we use the COCO (Lin et al., 2014) train2017 and unlabeled2017 dataset splits as the natural scenery background. Then we selected 41 English fonts and 11 Chinese fonts to generate text. We use the default hyperparameters as in Synthdog. We track the generated text locations in the image, convert them to quadrilateral coordinates, and use these coordinates as training labels. A visualization example is illustrated in the second row of Fig. 5.

A.4 OCR

我们使用Synthdog(Kim等,2022)生成合成OCR数据集。具体来说,我们使用COCO(Lin等,2014)train2017和unlabeled2017数据集分割作为自然风景背景。然后我们选择了41种英文字体和11种中文字体生成文本。我们使用与Synthdog中相同的默认超参数。我们跟踪生成的图像中的文本位置,并将其转换为四边形坐标,并将这些坐标用作训练标签。可视化示例在图5的第二行中说明。

For all the PDF data we collected, we follow the steps below to pre-process the data using PyMuPDF (Software, 2015) to get the rendering results of each page in a PDF file as well as all the text annotations with their bounding boxes (a minimal code sketch follows the list).

对于我们收集的所有PDF数据,我们遵循以下步骤使用PyMuPDF(Software,2015)预处理数据,以获取PDF文件中每页的渲染结果以及所有带有边界框的文本注释(列表后附有示意代码)。

  1. Extracting all texts and their bounding boxes for each page.

1. 提取每页的所有文本及其边界框。

  2. Rendering each page and saving it as an image file.

2. 渲染每页并将其保存为图像文件。

  3. Removing images that are too small.

3. 删除过小的图像。

  4. Removing images with too many or too few characters.

4. 删除字符过多或过少的图像。

  5. Removing images containing Unicode characters in the “Latin Extended-A” and “Latin Extended-B” blocks.

5. 删除包含“拉丁扩展-A”和“拉丁扩展-B”块中Unicode字符的图像。

  6. Removing images containing Unicode characters in the “Private Use Area (PUA)” block.

6. 删除包含“私用区域(PUA)”块中Unicode字符的图像。
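A minimal sketch of the PDF preprocessing steps above using PyMuPDF: each page is rendered to an image and word-level text with bounding boxes is collected, with simple character-count filtering. The rendering DPI and character thresholds are placeholders, not values from the paper.

```python
import fitz  # PyMuPDF

def extract_pdf_pages(pdf_path, dpi=144, min_chars=20, max_chars=4000):
    """Render each PDF page to an image and collect word-level text with
    bounding boxes (a sketch of the steps listed above)."""
    doc = fitz.open(pdf_path)
    samples = []
    for page_idx, page in enumerate(doc):
        words = page.get_text("words")        # tuples of (x0, y0, x1, y1, word, ...)
        text = " ".join(w[4] for w in words)
        if not (min_chars <= len(text) <= max_chars):
            continue                          # too few / too many characters
        pix = page.get_pixmap(dpi=dpi)        # render the page (needs a recent PyMuPDF)
        image_path = f"{pdf_path}.page{page_idx}.png"
        pix.save(image_path)
        boxes = [w[:4] for w in words]
        samples.append({"image": image_path, "text": text, "boxes": boxes})
    return samples
```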

For all HTML web pages we collected, we pre-process them in a similar approach to all the PDF data we collected, but we use Puppeteer (Google, 2023) instead of PyMuPDF to render these HTML pages and get the ground truth annotation. We follow the steps below to pre-process the data.

对于我们收集的所有HTML网页,我们以类似于我们收集的所有PDF数据的方法进行预处理,但我们使用Puppeteer(Google,2023)而不是PyMuPDF渲染这些HTML页面并获取真实注释。我们遵循以下步骤预处理数据。

  1. Extracting all texts for each webpage.

1. 提取每个网页的所有文本。

  2. Rendering each page and saving it as an image file.

2. 渲染每页并将其保存为图像文件。

  3. Removing images that are too small.

3. 删除过小的图像。

  4. Removing images with too many or too few characters.

4. 删除字符过多或过少的图像。

  5. Removing images containing Unicode characters in the “Private Use Area (PUA)” block.

5. 删除包含“私用区域(PUA)”块中Unicode字符的图像。


Figure 5: Visualization of the Grounding and OCR data used for training Qwen-VL

图5:用于训练Qwen-VL的定位和OCR数据的可视化

B Data Format Details of Training

B.1 Data Format of Multi-Task Pre-training

We visualize the multi-task pre-training data format in Box B.1. The box covers all 7 tasks, with black-colored text denoting the prefix sequence (no loss applied) and blue-colored text denoting the ground-truth labels (on which the loss is computed).

B 训练数据格式详情

B.1 多任务预训练数据格式

我们在框B.1中可视化了多任务预训练数据格式。该框包含所有7个任务,黑色文本为不计算损失的前缀序列,蓝色文本为计算损失的真实标签。


B.2 Data Format of Supervised Fine-tuning

To better accommodate multi-image dialogue and multiple image inputs, we add the string “Picture id:” before different images, where the id corresponds to the order of the image in the input dialogue. In terms of dialogue format, we construct our instruction tuning dataset using the ChatML (OpenAI) format, where each interaction’s statement is marked with two special tokens (<im_start> and <im_end>) to facilitate dialogue termination.

B.2 监督微调的数据格式

为更好地适应多图像对话和多图像输入,我们在不同图像前添加字符串“Picture id:”,其中id对应图像在输入对话中的顺序。在对话格式方面,我们使用ChatML(OpenAI)格式构建指令调优数据集,其中每次交互的语句都用两个特殊词元(<im_start>和<im_end>)标记,以便于对话终止。
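A sketch of how such a ChatML-style training string might be assembled, with "Picture {id}:" prefixes in image input order and each statement wrapped in the two special tokens. The token spellings follow the description above, and all helper and field names are assumptions.

```python
def to_chatml(dialogue, image_paths):
    """Build an SFT training string in the ChatML-style format described above.

    Each image is prefixed with "Picture {id}:" in input order, and every turn
    is wrapped in the <im_start>/<im_end> markers (spelling as written in the
    text). `dialogue` is a list of (role, text) pairs; names are illustrative.
    """
    picture_block = "".join(
        f"Picture {i + 1}: <img>{p}</img>\n" for i, p in enumerate(image_paths)
    )
    lines = []
    for turn_idx, (role, text) in enumerate(dialogue):
        content = text
        if turn_idx == 0 and role == "user":
            content = picture_block + text       # images precede the first question
        lines.append(f"<im_start>{role}\n{content}<im_end>")
    return "\n".join(lines)

# Example:
# to_chatml([("user", "What is in the two pictures?"), ("assistant", "A cat and a dog.")],
#           ["img1.jpg", "img2.jpg"])
```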

The Dataset Format Example of ChatML

ChatML的数据格式示例


During training, we ensure the consistency between prediction and training distributions by only supervising answers and special tokens (blue in the example), and not supervising role names or question prompts.

在训练期间,我们通过仅监督答案和特殊词元(示例中的蓝色部分),而不监督角色名称或问题提示,确保预测和训练分布的一致性。
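The supervision rule above (loss only on answers and special tokens) can be expressed as a simple label mask; the ignore index and function name below are assumptions borrowed from common training code.

```python
IGNORE_INDEX = -100

def make_labels(token_ids, is_answer_or_special):
    """Mask out everything except answer tokens and the special dialogue tokens,
    matching the supervision rule above. `is_answer_or_special` is a boolean
    list aligned with `token_ids`; how it is derived depends on the tokenizer."""
    return [tid if keep else IGNORE_INDEX
            for tid, keep in zip(token_ids, is_answer_or_special)]
```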

C Hyperparameters

We report the detailed training hyperparameter settings of Qwen-VL in Table 8.

C 超参数

我们在表8中报告了Qwen-VL的详细训练超参数设置。

Table 8: Training hyperparameters of Qwen-VL.

表8:Qwen-VL的训练超参数。


In the first pre-training stage, the model is trained using AdamW optimizer with β1 = 0.9, β2 = 0.98, eps = 1e−6. We use the cosine learning rate schedule and set the maximum learning rate of 2e−4 and minimum of 1e−6 with a linear warm-up of 500 steps. We use a weight decay of 5e−2 and a gradient clipping of 1.0. For the ViT image encoder, we apply a layer-wise learning rate decay strategy with a decay factor of 0.95. The training process uses a batch size of 30720 for the image-text pairs, and the entire first stage of pre-training lasts for 50,000 steps, consuming approximately 1.5 billion image-text samples and 500 billion image-text tokens.

在第一个预训练阶段,模型使用AdamW优化器进行训练,β1 = 0.9,β2 = 0.98,eps = 1e−6。我们使用余弦学习率调度,最大学习率设置为2e−4,最小学习率为1e−6,并进行500步的线性预热。我们使用5e−2的权重衰减和1.0的梯度裁剪。对于ViT图像编码器,我们应用分层学习率衰减策略,衰减因子为0.95。训练过程使用图像-文本对的批量大小为30720,整个第一个预训练阶段持续50,000步,大约消耗了15亿图像-文本样本和5000亿图像-文本词元。
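The optimizer setup described above might look like the following sketch: per-layer learning-rate scaling for the ViT with a decay factor of 0.95, plus a cosine schedule with linear warm-up matching the stated maximum and minimum learning rates. Attribute names such as `vit.blocks` are assumptions.

```python
import math

def vit_param_groups(vit, base_lr=2e-4, decay=0.95, weight_decay=0.05):
    """Layer-wise learning-rate decay for the ViT: layers closer to the input
    get smaller learning rates (base_lr * decay ** depth_from_top). Assumes the
    encoder exposes an ordered list of transformer blocks named `blocks`."""
    blocks = list(vit.blocks)
    groups = []
    for i, block in enumerate(blocks):
        scale = decay ** (len(blocks) - 1 - i)   # top block keeps the full base_lr
        groups.append({"params": list(block.parameters()),
                       "lr": base_lr * scale,
                       "weight_decay": weight_decay})
    return groups

def cosine_lr(step, max_steps=50_000, warmup=500, max_lr=2e-4, min_lr=1e-6):
    """Cosine schedule with a 500-step linear warm-up, matching the stated settings."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

# Usage (betas/eps from the stated settings):
# optimizer = torch.optim.AdamW(vit_param_groups(model.visual_encoder),
#                               betas=(0.9, 0.98), eps=1e-6)
```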

In the second multi-task training stage, we increase the input resolution of the visual encoder from 224 × 224 to 448 × 448, reducing the information loss caused by image down-sampling. We unlock the large language model and train the whole model. The training objective is the same as in the pre-training stage. We use the AdamW optimizer with β1 = 0.9, β2 = 0.98, eps = 1e−6. We train for 19,000 steps with 400 warm-up steps and a cosine learning rate schedule. Specifically, we use model parallelism for the ViT and the LLM.

在第二个多任务训练阶段,我们将视觉编码器的输入分辨率从224 × 224增加到448 × 448,减少图像下采样导致的信息损失。我们解锁了大型语言模型并训练了整个模型。训练目标与预训练阶段相同。我们使用AdamW优化器,β1 = 0.9,β2 = 0.98,eps = 1e−6。我们进行了19000步训练,并进行了400步的预热,使用余弦学习率调度。具体来说,我们使用了ViT和LLM的模型并行技术。

D Summary of the evaluation benchmarks

We provide a detailed summary of the used evaluation benchmarks and corresponding metrics in Table 9.

D 评估基准的总结

我们在表9中提供了所用评估基准和相应指标的详细总结。

Table 9: Summary of the evaluation benchmarks.

表9:评估基准的总结。

E Additional experimental details

E.1 Convergence of the Pre-training Stage

In Figure 6, we show the convergence of the pre-training stage (stage one). The whole model is trained using BFloat16 mixed precision, the batch size is 30720, and the learning rate is 2e−4. All images are trained on only once (one epoch). The training loss decreases steadily as the number of training images increases. Note that no VQA data is added in the pre-training stage (stage one), yet the zero-shot VQA score increases amidst fluctuations.

E 附加实验细节

E.1 预训练阶段的收敛

在图6中,我们展示了预训练阶段(第一阶段)的收敛情况。整个模型使用BFloat16混合精度进行训练,批量大小为30720,学习率为2e−4。所有图像仅训练一次(一个epoch)。随着训练图像数量的增加,训练损失稳步下降。注意,预训练阶段(第一阶段)没有添加VQA数据,但零样本VQA得分在波动中增加。

Figure 6: Visualization of the Convergence of the Pre-training Stage

图6:预训练阶段收敛情况的可视化


E.2 Number of Learnable Queries in the Vision-Language Adapter

The vision-language adapter uses cross-attention to compress the visual feature sequence to a fixed length using a set of learnable queries. Too few queries can lead to the loss of some visual information, while too many queries may result in greater convergence difficulty and computational cost.

E.2 视觉语言适配器中的可学习查询数量

视觉语言适配器使用交叉注意,通过一组可学习的查询将视觉特征序列压缩到固定长度。查询数量过少会导致部分视觉信息丢失,而查询数量过多则会增加收敛难度和计算成本。

An ablation experiment is conducted on the number of learnable queries in the vision-language adapter. We use ViT-L/14 as the visual encoder and 224 × 224 resolution images as input, so the sequence length of the ViT output is (224/14)² = 256. As shown in the left part of Figure 7, the fewer queries used at the beginning of training, the lower the initial loss. However, as training converges, too many or too few queries cause convergence to slow down, as shown in the right part of Figure 7. Considering that the second training stage (multi-task pre-training) uses 448 × 448 resolution, where the sequence length of the ViT output is (448/14)² = 1024, too few queries would result in more information being lost. We finally chose to use 256 queries for the vision-language adapter in Qwen-VL.

我们对视觉语言适配器中的可学习查询数量进行了消融实验。我们使用ViT-L/14作为视觉编码器,并使用224 × 224分辨率的图像作为输入,因此ViT的输出序列长度为(224/14)² = 256。如图7左侧部分所示,训练开始时使用的查询数量越少,初始损失越低。然而,随着训练收敛,查询数量过多或过少都会导致收敛速度减慢,如图7右侧部分所示。考虑到第二训练阶段(多任务预训练)采用448 × 448分辨率,此时ViT的输出序列长度为(448/14)² = 1024,查询数量过少会导致更多信息丢失。最终,我们选择在Qwen-VL中为视觉语言适配器使用256个查询。


Figure 7: Visualization of the training loss when using different compressed feature lengths of the vision-language adapter. The left depicts the initial training loss (within 50 steps), and the right depicts the loss in convergence (1k-5k steps). In the legend, L64 denotes that the adapter uses 64 queries to compress the visual feature sequence to a fixed length of 64, and so on. The loss curves have been smoothed to avoid shading owing to fluctuations.

图7:使用不同压缩特征长度的视觉语言适配器的训练损失可视化。左侧描述了初始训练损失(在50步内),右侧描述了收敛损失(1k-5k步)。图例中,L64表示适配器使用64个查询将视觉特征序列压缩到固定长度64,依此类推。损失曲线已被平滑以避免因波动而产生的阴影。

E.3 Window Attention vs Global Attention for Vision Transformer

Using a high-resolution Vision Transformer in the model significantly increases the computational cost. One possible way to reduce it is to use window attention in the Vision Transformer, i.e., to perform attention only within 224 × 224 windows in most layers of the ViT, and to perform attention over the full 448 × 448 or 896 × 896 image in a small number of layers (e.g., 1 out of every 4).

E.3 视觉Transformer的窗口注意与全局注意

在模型中使用高分辨率视觉Transformer将显著增加计算成本。减少模型计算成本的一种可能解决方案是在视觉Transformer中使用窗口注意,即仅在模型的ViT部分的大多数层中在224 × 224的窗口中执行注意,而在模型的ViT部分的少数层(例如每4层中的1层)中对整个448 × 448或896 × 896图像执行注意。
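To make the window-attention option concrete, the sketch below partitions the ViT patch grid into non-overlapping 16 × 16-patch windows (224 × 224 pixels at patch size 14) so that attention can be restricted to each window in most layers. It is an illustration of the idea, not the ablated implementation.

```python
import torch

def window_partition(patch_tokens, grid, window=16):
    """Partition a [B, grid*grid, C] sequence of ViT patch tokens into
    non-overlapping windows of `window` x `window` patches, so that
    self-attention can be applied within each window independently."""
    b, n, c = patch_tokens.shape
    assert n == grid * grid and grid % window == 0
    x = patch_tokens.view(b, grid, grid, c)
    x = x.view(b, grid // window, window, grid // window, window, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    # -> [B * num_windows, window*window, C]; feed each window to self-attention
    return x.view(-1, window * window, c)

# e.g. for 448x448 input: window_partition(tokens, grid=32) -> 4 windows of 256 tokens
```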

To this end, we conducted ablation experiments to compare the performance of the model when using Global Attention and Window Attention for ViT. We compare the experimental results for analysing the trade-off between computational efficiency and convergence of the model.

为此,我们进行了消融实验,比较了在ViT中使用全局注意和窗口注意时模型的性能。我们比较了实验结果,以分析模型的计算效率和收敛之间的权衡。

Table 10: Training speed of Window Attention vs Global Attention for different input image resolutions

表10:不同输入图像分辨率下窗口注意与全局注意的训练速度

As shown in Figure 8 and Table 10, the loss of the model is significantly higher when window attention is used instead of vanilla (global) attention, while the training speeds of the two are similar. Therefore, we decided to use global attention for the Vision Transformer when training Qwen-VL.

如图8和表10所示,当使用窗口注意而不是原生注意时,模型的损失显著增加。两者的训练速度相似。因此,我们决定在训练Qwen-VL时使用原生注意而不是窗口注意。


Figure 8: Visualization of the Loss when using Window Attention vs Global Attention

图8:使用窗口注意与全局注意时损失的可视化

The reason we don’t use window attention with 896 × 896 resolution is that its training speed is too slow for us. Although it reaches a loss value similar to that of the model with 448 × 448 resolution input at 5,000 steps, it takes almost 2.5 times longer to train.

我们不使用896 × 896分辨率窗口注意的原因是其训练速度对我们来说太慢。尽管它在5000步时达到了与448 × 448分辨率输入模型相似的损失值,但其训练时间几乎是后者的2.5倍。

E.4 Performance on Pure-text Tasks

In order to study the effect of multi-modal training on pure-text ability, we show the performance of pure-text tasks of Qwen-VL compared to open-source LLM in Table 11.

E.4 纯文本任务的性能

为了研究多模态训练对纯文本能力的影响,我们在表11中展示了Qwen-VL与开源LLM在纯文本任务中的性能对比。

Qwen-VL uses an intermediate checkpoint of Qwen-7B as the LLM initialization. The reason we did not use the final released checkpoint of Qwen-7B is that Qwen-VL and Qwen-7B were developed during a very similar period. Because Qwen-VL has a good LLM initialization from Qwen-7B, it is comparable to many text-only LLMs on pure-text tasks.

Qwen-VL使用Qwen-7B的中间检查点作为LLM初始化。我们没有使用Qwen-7B的最终发布检查点的原因是Qwen-VL和Qwen-7B在非常相似的时期开发。由于Qwen-VL在Qwen-7B上进行了良好的LLM初始化,因此在纯文本任务中可与许多仅文本LLM相媲美。

Due to the introduction of pure-text data in the multi-task training and SFT stages, Qwen-VL does not compromise any pure-text capability. Instead, it can compete against open-source LLMs.

由于在多任务训练和SFT阶段引入了纯文本数据,Qwen-VL并没有在纯文本能力上妥协。相反,它可以与开源LLM竞争。

Table 11: Performance on Pure-text Benchmarks of Qwen-VL compared to open-source LLM.

表11:Qwen-VL与开源LLM在纯文本基准测试中的性能对比。

Furthermore, in the multi-task training and SFT stages, Qwen-VL not only utilizes visual and language-related data but also incorporates pure-text data for training. The purpose of this is to prevent the catastrophic forgetting of text comprehension by leveraging the information from pure-text data. The results in Table 11 indicate that the Qwen-VL model does not exhibit any degradation in terms of its pure text capability and even demonstrates improvement after multi-task training.

此外,在多任务训练和SFT阶段,Qwen-VL不仅利用了视觉和语言相关数据,还引入了纯文本数据进行训练。这样做的目的是通过利用纯文本数据的信息来防止文本理解的灾难性遗忘。表11的结果表明,Qwen-VL模型在纯文本能力方面没有出现任何退化,甚至在多任务训练后表现出提升。
