A Survey on Multimodal Large Language Models (from GPT-4o)


Contents

    • A Survey on Multimodal Large Language Models
    • 1. INTRODUCTION
    • 2. ARCHITECTURE
      • 2.1 Modality encoder
      • 2.2 Pre-trained LLM
      • 2.3 Modality interface
    • 3. TRAINING STRATEGY AND DATA
      • 3.1 Pre-training
        • 3.1.1 Training Detail
        • 3.1.2 Data
      • 3.2 Instruction-tuning
        • 3.2.1 Introduction
        • 3.2.2 Training Detail
        • 3.2.3 Data Collection
        • 3.2.4 Data Quality
      • 3.3 Alignment tuning
        • 3.3.1 Introduction
        • 3.3.2 Training Detail
        • 3.3.3 Data
    • 4. EVALUATION
      • 4.1 Closed-set
      • 4.2 Open-set
    • 5. EXTENSIONS
      • Granularity Support
      • Modality Support
      • Language Support
      • Scenario/Task Extension
    • 6. MULTIMODAL HALLUCINATION
      • 6.1 Preliminaries
      • 6.2 Evaluation Methods
      • 6.3 Mitigation Methods
    • 7. EXTENDED TECHNIQUES
      • 7.1 Multimodal In-Context Learning
        • 7.1.1 Improvement on ICL capabilities
        • 7.1.2 Applications
      • 7.2 Multimodal Chain of Thought
        • 7.2.1 Learning Paradigms
        • 7.2.2 Chain Configuration
        • 7.2.3 Generation Patterns
      • 7.3 LLM-Aided Visual Reasoning
        • 7.3.1 Introduction
        • 7.3.2 Training Paradigms
        • 7.3.3 Functions
    • 8. CHALLENGES AND FUTURE DIRECTIONS
    • 9. CONCLUSION

A Survey on Multimodal Large Language Models

Abstract—Recently, the Multimodal Large Language Model (MLLM), represented by GPT-4V, has been a new rising research hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLM, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First of all, we present the basic formulation of MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal ICL (M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude the paper, we discuss existing challenges and point out promising research directions. In light of the fact that the era of MLLM has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub link collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.

Index Terms—Multimodal Large Language Model, Vision Language Model, Large Language Model.


1. INTRODUCTION

Recent years have seen remarkable progress in Large Language Models (LLMs) [1], [2], [3], [4], [5]. By scaling up data size and model size, these LLMs exhibit extraordinary emergent capabilities, typically including instruction following [6], [8], [9], In-Context Learning (ICL) [7], and Chain of Thought (CoT) [10]. Although LLMs have demonstrated surprising zero/few-shot reasoning performance on most Natural Language Processing (NLP) tasks, they are inherently “blind” to vision since they can only understand discrete text. Concurrently, Large Vision Models (LVMs) can see clearly [9], [10], [11], [12], but commonly fall behind in reasoning.


In light of this complementarity, LLM and LVM run towards each other, leading to the new field of Multimodal Large Language Model (MLLM). Formally, it refers to the LLM-based model with the ability to receive, reason, and respond with multimodal information. Prior to MLLM, there have been a lot of works devoted to multimodality, which can be divided into discriminative [13], [14], [15] and generative [16], [17], [18] paradigms. CLIP [13], as a representative of the former, projects visual and textual information into a unified representation space, building a bridge for downstream multimodal tasks. In contrast, OFA [16] is a representative of the latter, which unifies multimodal tasks in a sequence-to-sequence manner. MLLM can be classified as the latter according to the sequence operation, but it manifests two representative traits compared with its traditional counterparts: (1) MLLM is based on an LLM with billion-scale parameters, which is not available in previous models. (2) MLLM uses new training paradigms to unleash its full potential, such as multimodal instruction tuning [19], [20] to encourage the model to follow new instructions. Armed with the two traits, MLLM exhibits new capabilities, such as writing website code based on images [21], understanding the deep meaning of a meme [22], and OCR-free math reasoning [23].


Ever since the release of GPT-4, there has been an incessant research frenzy over MLLMs because of the amazing multimodal examples it shows. Rapid development is fueled by efforts from both academia and industry. Preliminary research on MLLMs focuses on text content generation, grounded in text prompts and image [20], [24]/video [25], [26]/audio [27]. Subsequent works have expanded the capabilities or the usage scenarios, including: (1) better granularity support. Finer control on user prompts is developed to support specific regions through boxes [28] or a certain object through a click [29]. (2) Enhanced support on input and output modalities [30], [31], such as image, video, audio, and point cloud. Besides input, projects like NEXT-GPT [32] further support output in different modalities. (3) Improved language support. Efforts have been made to extend the success of MLLMs to other languages (e.g. Chinese) with relatively limited training corpus [33], [34]. (4) Extension to more realms and user scenarios. Researchers transfer the strong capabilities of MLLMs to other domains such as medical image understanding [35], [36], [37] and document parsing [38], [39], [40]. Moreover, multimodal agents are developed to assist in real-world interaction, e.g. embodied agents [41], [42] and GUI agents [43], [44], [45]. An MLLM timeline is illustrated in Fig 1.


Fig. 1: A timeline of representative MLLMs. We are witnessing rapid growth in this field. More works can be found in our released GitHub page, which is updated daily.


In view of such rapid progress and the promising results of this field, we write this survey to provide researchers with a grasp of the basic idea, main method, and current progress of MLLMs. Note that we mainly focus on visual and language modalities, but also include works involving other modalities like audio and video. Specifically, we cover the most important aspects of MLLMs with corresponding summaries and open a GitHub page that would be updated in real time. To the best of our knowledge, this is the first survey on MLLM.


The remaining parts of the survey are structured as follows: the survey starts with a comprehensive review of the essential aspects of MLLMs, including (1) Mainstream architecture (S2); (2) A full picture of training strategy and data (S3); (3) Common practices of performance evaluation (S4). Then, we delve into a deep discussion on some important topics about MLLMs, each focusing on a main problem: (1) What aspects can be further improved or extended (S5)? (2) How to relieve the multimodal hallucination issue (S6)? The survey continues with the introduction of three key techniques (S7), each specialized in a specific scenario: M-ICL (S7.1) is an effective technique commonly used at the inference stage to boost few-shot performance. Another important technique is M-CoT (S7.2), which is typically used in complex reasoning tasks. Afterward, we delineate a general idea to develop LLM-based systems to solve complex reasoning tasks or to address common user queries (S7.3). Finally, we finish our survey with a summary and potential research directions.


2. ARCHITECTURE

A typical MLLM can be abstracted into three modules, i.e. a pre-trained modality encoder, a pre-trained LLM, and a modality interface to connect them. Drawing an analogy to humans, modality encoders such as image/audio encoders are human eyes/ears that receive and pre-process optical/acoustic signals, while LLMs are like human brains that understand and reason with the processed signals. In between, the modality interface serves to align different modalities. Some MLLMs also include a generator to output other modalities apart from text. A diagram of the architecture is plotted in Fig. 2. In this section, we introduce each module in sequence.


2.1 Modality encoder

The encoders compress raw information, such as images or audio, into a more compact representation. Rather than training from scratch, a common approach is to use a pre-trained encoder that has been aligned to other modalities. For example, CLIP [13] incorporates a visual encoder semantically aligned with the text through large-scale pre-training on image-text pairs. Therefore, it is easier to use such initially pre-aligned encoders to align with LLMs through alignment pre-training (see §3.1).
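For concreteness, the snippet below sketches how a pre-aligned CLIP vision encoder can be loaded and used to extract patch features for the downstream modules. It relies on the Hugging Face transformers library; the checkpoint name, placeholder image, and variable names are illustrative choices rather than the setup of any particular MLLM.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Load a CLIP vision tower that was pre-aligned with text during CLIP training.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.new("RGB", (224, 224))  # placeholder image for the sketch
pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    patch_features = encoder(pixel_values).last_hidden_state
print(patch_features.shape)  # e.g. torch.Size([1, 257, 1024]) for ViT-L/14 at 224px
```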


The series of commonly used image encoders are summarized in Table 1. Apart from vanilla CLIP image encoders [13], some works also explore using other variants. For example, MiniGPT-4 [21] adopts an EVA-CLIP [47], [48] (ViT-G/14) encoder, which is trained with improved training techniques. In contrast, Osprey [29] introduces a convolution-based ConvNext-L encoder [46] to utilize higher resolution and multi-level features. Some works also explore encoder-free architecture. For instance, the image patches of Fuyu-8b [49] are directly projected before sending to LLMs. Thus, the model naturally supports flexible image resolution input.


TABLE 1: A summary of commonly used image encoders.


Fig. 2: An illustration of the typical MLLM architecture. It includes an encoder, a connector, and an LLM. An optional generator can be attached to the LLM to generate more modalities besides text. The encoder takes in images, audio, or videos and outputs features, which are processed by the connector so that the LLM can better understand them. There are broadly three types of connectors: projection-based, query-based, and fusion-based connectors. The former two types adopt token-level fusion, processing features into tokens to be sent along with text tokens, while the last type enables feature-level fusion inside the LLM.

When choosing encoders, one often considers factors like resolution, parameter size, and pre-training corpus. Notably, many works have empirically verified that using higher resolution can achieve remarkable performance gains [34], [50], [51], [52]. The approaches for scaling up input resolution can be categorized into direct scaling and patch-division methods. The direct scaling way inputs images of higher resolutions to the encoder, which often involves further tuning the encoder [34] or replacing it with a pre-trained encoder of higher resolution [50]. Similarly, CogAgent [44] uses a dual-encoder mechanism, where two encoders process high- and low-resolution images, respectively. High-resolution features are injected into the low-resolution branch through cross-attention. Patch-division methods cut a high-resolution image into patches and reuse the low-resolution encoder. For example, Monkey [51] and SPHINX [53] divide a large image into smaller patches and send the sub-images together with a downsampled high-resolution image to the image encoder, where the sub-images and the low-resolution image capture local and global features, respectively. In contrast, empirical studies [52] find that parameter size and training data composition are of less importance compared with input resolution.
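As a rough illustration of the patch-division idea, the sketch below crops a high-resolution image into a grid of sub-images and adds a downsampled global view, so that all views can be fed through the same low-resolution encoder. The grid size, base resolution, and function name are hypothetical and not taken from Monkey or SPHINX.

```python
from PIL import Image

def patch_divide(image: Image.Image, grid=(2, 2), base_res=448):
    """Split a high-resolution image into grid sub-images (local views) plus a
    downsampled global view; every view is resized to the encoder's resolution."""
    w, h = image.size
    cols, rows = grid
    cell_w, cell_h = w // cols, h // rows
    sub_images = []
    for r in range(rows):
        for c in range(cols):
            box = (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
            sub_images.append(image.crop(box).resize((base_res, base_res)))
    global_view = image.resize((base_res, base_res))  # captures global context
    return sub_images, global_view

subs, global_view = patch_divide(Image.new("RGB", (1792, 1792)))
```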


Similar encoders are also available for other modalities. For example, Pengi [27] uses CLAP [54] model as the audio encoder. ImageBind-LLM [30] uses the ImageBind [55] encoder, which supports encoding image, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data. Equipped with the strong encoder, ImageBind-LLM can respond to the input of multiple modalities.


2.2 Pre-trained LLM

Instead of training an LLM from scratch, it is more efficient and practical to start with a pre-trained one. Through tremendous pre-training on web corpus, LLMs have been embedded with rich world knowledge, and demonstrate strong generalization and reasoning capabilities.


We summarize the commonly used and publicly available LLMs in Table 2. Notably, most LLMs fall in the causal decoder category, following GPT-3 [7]. Among them, the Flan-T5 series [56] are relatively early LLMs, adopted in works like BLIP-2 [59] and InstructBLIP [60]. The LLaMA series [5], [57] and the Vicuna family [4] are representative open-sourced LLMs that have attracted much academic attention. Since the two LLMs are predominantly pre-trained on English corpora, they are limited in multi-language support, such as Chinese. In contrast, Qwen [58] is a bilingual LLM that supports Chinese and English well.


TABLE 2: A summary of commonly used open-sourced LLMs. en, zh, fr, and de stand for English, Chinese, French, and German, respectively.


It should be noted that scaling up the parameter size of LLMs also brings additional gains, similar to the case of increasing input resolution. Specifically, Liu et al. [50], [61] find that simply scaling up the LLM from 7B to 13B brings comprehensive improvement on various benchmarks. Furthermore, when using a 34B LLM, the model shows emergent zero-shot Chinese capability, given that only English multimodal data are used during training. Lu et al. [62] see a similar phenomenon by scaling up LLMs from 13B to 35B and 65B/70B, whereby larger model size brings consistent gains on benchmarks specifically designed for MLLMs. There are also works that use smaller LLMs to facilitate deployment on mobile devices. For example, the MobileVLM series [63], [64] use downscaled LLaMA [5] (termed MobileLLaMA 1.4B/2.7B), enabling efficient inference on mobile processors.


Recently, explorations of the Mixture of Experts (MoE) architecture for LLMs have garnered rising attention [65], [66], [67]. Compared with dense models, the sparse architecture enables scaling up total parameter size without increasing computational cost, by selective activation of the parameters. Empirically, MM1 [52] and MoE-LLaVA [68] find that the MoE implementation achieves better performance than its dense counterpart on almost all the benchmarks.


2.3 Modality interface

Since LLMs can only perceive text, bridging the gap between natural language and other modalities is necessary. However, it would be costly to train a large multimodal model in an end-to-end manner. A more practical way is to introduce a learnable connector between the pre-trained visual encoder and LLM. The other approach is to translate images into languages with the help of expert models, and then send the language to LLM.


Learnable Connector: It is responsible for bridging the gap between different modalities. Specifically, the module projects information into the space that LLM can understand efficiently. Based on how multimodal information is fused, there are broadly two ways to implement such interfaces, i.e. token-level and feature-level fusion.


For token-level fusion, features output from the encoders are transformed into tokens and concatenated with text tokens before being sent into LLMs. A common and feasible solution is to leverage a group of learnable query tokens to seek cross-modal alignment. The Q-Former design has been implemented in BLIP-2 [59] and subsequently inherited by a variety of works [26], [60], [70]. Such Q-Former-style approaches compress visual tokens into a smaller number of representation vectors. In contrast, some methods simply use an MLP-based interface to bridge the modality gap [20], [37], [71], [72]. For example, the LLaVA series adopts a one-/two-layer linear MLP [20], [50] to project visual tokens and align the feature dimension with word embeddings.
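The following sketch shows token-level fusion with an MLP-style connector in the spirit of the LLaVA-like design mentioned above: visual patch features are projected to the LLM's embedding dimension and concatenated with text token embeddings. The dimensions, class name, and random tensors are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Project visual features into the LLM embedding space (token-level fusion)."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):      # (B, N_vis, vision_dim)
        return self.proj(patch_feats)    # (B, N_vis, llm_dim)

projector = MLPProjector()
patch_feats = torch.randn(1, 256, 1024)   # from the frozen vision encoder
text_embeds = torch.randn(1, 32, 4096)    # from the LLM's embedding table
# Visual tokens are prepended to the text tokens before entering the LLM.
inputs_embeds = torch.cat([projector(patch_feats), text_embeds], dim=1)
```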


On a related note, MM1 [52] has ablated on design choices of the connector and found that for token-level fusion, the type of modality adapter is far less important than the number of visual tokens and the input resolution. Nevertheless, Zeng et al. [73] compare the performance of token- and feature-level fusion, and empirically reveal that the token-level fusion variant performs better on VQA benchmarks. Regarding the performance gap, the authors suggest that cross-attention models might require a more complicated hyper-parameter searching process to achieve comparable performance.


As another line, feature-level fusion inserts extra modules that enable deep interaction and fusion between text features and visual features. For example, Flamingo [74] inserts extra cross-attention layers between frozen Transformer layers of LLMs, thereby augmenting language features with external visual cues. Similarly, CogVLM [75] plugs in a visual expert module in each Transformer layer to enable dual interaction and fusion between vision and language features. For better performance, the QKV weight matrix of the introduced module is initialized from the pre-trained LLM. Similarly, LLaMA-Adapter [76] introduces learnable prompts into Transformer layers. These prompts are embedded with visual knowledge and then concatenated with text features as prefixes.
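For feature-level fusion, a minimal sketch of a Flamingo-style gated cross-attention block is given below: text hidden states attend to visual features between frozen LLM layers, and a zero-initialized gate leaves the original LLM behavior untouched at the start of training. The hidden size, head count, and class name are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Inserted between frozen LLM layers: text queries attend to visual keys/values."""
    def __init__(self, d_model=4096, n_heads=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> no effect initially

    def forward(self, text_hidden, visual_feats):
        attended, _ = self.attn(query=text_hidden, key=visual_feats, value=visual_feats)
        return text_hidden + torch.tanh(self.gate) * attended

block = GatedCrossAttentionBlock()
text_hidden = torch.randn(1, 32, 4096)     # output of a frozen Transformer layer
visual_feats = torch.randn(1, 256, 4096)   # projected visual features
fused = block(text_hidden, visual_feats)   # passed on to the next frozen layer
```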


In terms of parameter size, learnable interfaces generally comprise a small portion compared with encoders and LLMs. Taking Qwen-VL [34] as an example, the parameter size of its Q-Former is about 0.08B, accounting for less than 1% of the whole model, while the encoder and the LLM account for about 19.8% (1.9B) and 80.2% (7.7B), respectively.


Expert Model. Apart from the learnable interface, using expert models, such as an image captioning model, is also a feasible way to bridge the modality gap [77], [78], [79], [80]. The basic idea is to convert multimodal inputs into languages without training. In this way, LLMs can understand multimodality by the converted languages. For example, VideoChat-Text [25] uses pre-trained vision models to extract visual information such as actions and enriches the descriptions using a speech recognition model. Though using expert models is straightforward, it may not be as flexible as adopting a learnable interface. The conversion of foreign modalities into text would cause information loss. For example, transforming videos into textual descriptions distorts spatial-temporal relationships [25].


3. TRAINING STRATEGY AND DATA

A full-fledged MLLM undergoes three stages of training, i.e. pre-training, instruction-tuning, and alignment tuning. Each phase of training requires different types of data and fulfills different objectives. In this section, we discuss training objectives, as well as data collection and characteristics for each training stage.


3.1 Pre-training

3.1.1 Training Detail

As the first training stage, pre-training mainly aims to align different modalities and learn multimodal world knowledge. Pre-training stage generally entails large-scale text-paired data, e.g. caption data. Typically, the caption pairs describe images/audio/videos in natural language sentences.


Here, we consider a common scenario where MLLMs are trained to align vision with text. As illustrated in Table 3, given an image, the model is trained to predict autoregressively the caption of the image, following a standard cross-entropy loss. A common approach for pre-training is to keep pre-trained modules (e.g. visual encoders and LLMs) frozen and train a learnable interface [20], [35], [72]. The idea is to align different modalities without losing pre-trained knowledge. Some methods [34], [81], [82] also unfreeze more modules (e.g. visual encoder) to enable more trainable parameters for alignment. It should be noted that these modules are trained with large-scale web data to cover as many scenarios as possible. Therefore, collected datasets should be diverse and cover various domains and applications.
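The snippet below sketches the freezing scheme described above: the pre-trained encoder and LLM are kept frozen while only the connector is trained, with an option to unfreeze the encoder as some works do. The modules, optimizer choice, and learning rate are placeholder assumptions, not the recipe of any specific model.

```python
import torch
import torch.nn as nn

def configure_pretraining(vision_encoder, connector, llm, unfreeze_encoder=False):
    """Freeze pre-trained modules, train the connector; optionally tune the encoder."""
    for p in vision_encoder.parameters():
        p.requires_grad = unfreeze_encoder
    for p in llm.parameters():
        p.requires_grad = False
    for p in connector.parameters():
        p.requires_grad = True
    trainable = [p for m in (vision_encoder, connector, llm)
                 for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-3)

# toy usage with placeholder modules standing in for the real components
optimizer = configure_pretraining(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8))
```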


TABLE 3: A simplified template to structure the caption data. <image> is the placeholder for the visual tokens, and {caption} is the caption for the image. Note that only the part marked in red is used for loss calculation.


It should be noted that the training scheme is closely related to the data quality. For short and noisy caption data, a lower resolution (e.g. 224) can be adopted to speed up the training process, while for longer and cleaner data, it is better to utilize higher resolutions (e.g. 448 or higher) to mitigate hallucinations. Besides, ShareGPT4V [83] finds that with high-quality caption data in the pre-training stage, unlocking the vision encoder promotes better alignment.


3.1.2 Data

Pretraining data mainly serve two purposes, i.e. (1) aligning different modalities and (2) providing world knowledge. The pretraining corpora can be divided into coarse-grained and fine-grained data according to granularities, which we will introduce sequentially. We summarize commonly used pretraining datasets in Table 4.


TABLE 4: Common datasets used for pre-training.


Coarse-grained caption data share some typical traits in common: (1) The data volume is large since samples are generally sourced from the internet. (2) Because of the web-crawled nature, the captions are usually short and noisy since they originate from the alt-text of the web images. These data can be cleaned and filtered via automatic tools, for example, using the CLIP [13] model to filter out image-text pairs whose similarities are lower than a pre-defined threshold. In what follows, we introduce some representative coarse-grained datasets.
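A minimal sketch of the CLIP-based cleaning step mentioned above is shown below: image-text pairs whose CLIP cosine similarity falls under a threshold are dropped. The checkpoint name, the 0.25 threshold, and the helper name are illustrative and not the settings of any particular dataset.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.25) -> bool:
    """Keep an image-text pair only if its CLIP cosine similarity is high enough."""
    inputs = processor(text=[caption], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    similarity = torch.cosine_similarity(img, txt).item()
    return similarity >= threshold

print(keep_pair(Image.new("RGB", (224, 224)), "a photo of a cat"))
```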


CC. CC-3M [84] is a web-scale caption dataset of 3.3M image-caption pairs, where the raw descriptions are derived from alt-text associated with images. The authors design a complicated pipeline to clean data: (1) For images, those with inappropriate content or aspect ratio are filtered. (2) For text, NLP tools are used to obtain text annotations, with samples filtered according to the designed heuristics. (3) For image-text pairs, images are assigned labels via classifiers. If text annotations do not overlap with image labels, the corresponding samples are dropped.


CC-12M [85] is a following work of CC-3M and contains 12.4M image-caption pairs. Compared with the previous work, CC-12M relaxes and simplifies the data-collection pipeline, thus collecting more data.


SBU Captions [86]. It is a captioned photo dataset containing 1M image-text pairs, with images and descriptions sourced from Flickr. Specifically, an initial set of images is acquired by querying the Flickr website with a large number of query terms. The descriptions attached to the images thus serve as captions. Then, to ensure that descriptions are relevant to the images, the retained images fulfill these requirements: (1) Descriptions of the images are of satisfactory length, decided by observation. (2) Descriptions of the images contain at least 2 words in the predefined term lists and a prepositional word (e.g. “on”, “under”) that generally suggests spatial relationships.


LAION. This series comprises large web-scale datasets, with images crawled from the internet and the associated alt-text used as captions. To filter the image-text pairs, the following steps are performed: (1) Text with short lengths or images with too small or too big sizes are dropped. (2) Image deduplication is performed based on URL. (3) CLIP [13] embeddings are extracted for images and text, and used to filter out possibly illegal content and image-text pairs with low cosine similarity between embeddings. Here we offer a brief summary of some typical variants:


  • LAION-5B [87]: It is a research-purpose dataset of 5.85B image-text pairs. The dataset adopts similar data filtration strategies, with a 2B English subset.


  • LAION-COCO [88]: It contains 600M images extracted from the English subset of LAION-5B. The captions are synthetic, using BLIP [89] to generate various image captions and using CLIP [13] to pick the best fit for the image.


  • COYO-700M [90]: It contains 747M image-text pairs, which are extracted from CommonCrawl. For data filtering, the authors design the following strategies: (1) For images, those with inappropriate size, content, format, or aspect ratio are filtered. Moreover, the images are filtered based on the pHash value to remove images overlapped with public datasets such as ImageNet and MS-COCO. (2) For text, only English text with satisfactory length, noun forms, and appropriate words are saved. Whitespace before and after the sentence will be removed, and consecutive whitespace characters will be replaced with a single whitespace. Moreover, text appearing more than 10 times (e.g. “image for”) will be dropped. (3) For image-text pairs, duplicated samples are removed based on (image pHash, text) tuple.
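As a toy illustration of the COYO-style deduplication and text cleanup described in the last item above, the sketch below normalizes whitespace and drops samples whose (image pHash, text) tuple has already been seen. It assumes the third-party imagehash library; the function names and the in-memory set are simplifications.

```python
import re
import imagehash
from PIL import Image

seen = set()  # (pHash, normalized text) tuples that have been kept so far

def normalize_text(text: str) -> str:
    """Trim leading/trailing whitespace and collapse consecutive whitespace."""
    return re.sub(r"\s+", " ", text).strip()

def keep_sample(image: Image.Image, text: str) -> bool:
    """Drop duplicated samples based on the (image pHash, text) tuple."""
    key = (str(imagehash.phash(image)), normalize_text(text))
    if key in seen:
        return False
    seen.add(key)
    return True

print(keep_sample(Image.new("RGB", (64, 64)), "  a   cat on  a sofa "))
```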


Recently, more works [83], [91], [92] have explored generating high-quality fine-grained data through prompting strong MLLMs (e.g. GPT-4V). Compared with coarse-grained data, these data generally contain longer and more accurate descriptions of the images, thus enabling finer-grained alignment between image and text modalities. However, since this approach generally requires calling commercial-use MLLMs, the cost is higher, and the data volume is relatively smaller. Notably, ShareGPT4V [83] strikes a balance by first training a captioner with GPT-4V-generated 100K data, then scaling up the data volume to 1.2M using the pre-trained captioner.


3.2 Instruction-tuning

3.2.1 Introduction

Instruction refers to the description of tasks. Intuitively, instruction tuning aims to teach models to better understand the instructions from users and fulfill the demanded tasks. Tuning in this way, LLMs can generalize to unseen tasks by following new instructions, thus boosting zero-shot performance. This simple yet effective idea has sparked the success of subsequent NLP works, such as ChatGPT [2], InstructGPT [95], FLAN [19], [56], and OPT-IML [96].


The comparisons between instruction tuning and related typical learning paradigms are illustrated in Fig. 3. The supervised fine-tuning approach usually requires a large amount of task-specific data to train a task-specific model. The prompting approach reduces the reliance on large-scale data and can fulfill a specialized task via prompt engineering. In such a case, though the few-shot performance has been improved, the zero-shot performance is still quite average [7]. Differently, instruction tuning learns how to generalize to unseen tasks rather than fitting specific tasks like the two counterparts. Moreover, instruction tuning is highly related to multi-task prompting [97].


In this section, we delineate the format of instruction samples, the training objectives, typical ways to gather instruction data, and corresponding commonly used datasets.


3.2.2 Training Detail

A multimodal instruction sample often includes an optional instruction and an input-output pair. The instruction is typically a natural language sentence describing the task, such as, “Describe the image in detail.” The input can be an image-text pair like the VQA task [99] or only an image like the image caption task [100]. The output is the answer to the instruction conditioned on the input. The instruction template is flexible and subject to manual designs [20], [25], [98], as exemplified in Table 5. Note that the instruction template can also be generalized to the case of multi-round conversations [20], [37], [71], [98].


TABLE 5: A simplified template to structure the multimodal instruction data. <instruction> is a textual description of the task. <image>, <text> and <output> are the input and output from the data sample. Note that <text> in the input may be missing for some datasets, e.g. image caption datasets merely have <image>. The example is adapted from [98].

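To make the template concrete, the sketch below assembles a single-turn instruction sample from an instruction, an optional <text> input, and an image placeholder token. The exact wording follows the spirit of Table 5 but is hypothetical, since templates are hand-designed and vary across works.

```python
from typing import Optional, Tuple

def build_sample(instruction: str, text: Optional[str], output: str,
                 image_token: str = "<image>") -> Tuple[str, str]:
    """Return (prompt, target): the prompt carries no loss, the target does."""
    parts = [image_token, f"Instruction: {instruction}"]
    if text is not None:                 # e.g. the question of a VQA sample
        parts.append(f"Input: {text}")
    parts.append("Response:")
    return "\n".join(parts), output

prompt, target = build_sample("Describe the image in detail.", None,
                              "A dog is running on the beach.")
print(prompt)
```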

Formally, a multimodal instruction sample can be denoted in a triplet form, i.e. ⟨I, M, R⟩, where I, M, R represent the instruction, the multimodal input, and the ground truth response, respectively. The MLLM predicts an answer given the instruction and the multimodal input:


$$A = f(I, M; \theta)$$

Here, A denotes the predicted answer, and θ are the parameters of the model. The training objective is typically the original auto-regressive objective used to train LLMs [20], [37], [71], [101], based on which the MLLM is encouraged to predict the next token of the response. The objective can be expressed as:


$$L(\theta) = -\sum_{i=1}^{N} \log p(R_i \mid I, M, R_{<i}; \theta)$$

where N is the length of the ground-truth response.
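In practice, the objective above is implemented as a standard next-token cross-entropy in which all prompt tokens (instruction and multimodal input) are masked out, so that only the response R contributes to the loss. The sketch below shows one way to do this with PyTorch; tensor shapes and the prompt length are toy values.

```python
import torch
import torch.nn.functional as F

def response_only_loss(logits, input_ids, prompt_len):
    """Cross-entropy over response tokens only; prompt positions are ignored (-100)."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100                 # no loss on the prompt part
    shift_logits = logits[:, :-1, :]              # predict token t+1 from position t
    shift_labels = labels[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)

# toy usage: vocabulary of 100, sequence of 10 tokens, first 6 tokens are the prompt
logits = torch.randn(1, 10, 100)
input_ids = torch.randint(0, 100, (1, 10))
print(response_only_loss(logits, input_ids, prompt_len=6))
```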


3.2.3 Data Collection

Since instruction data are more flexible in formats and varied in task formulations, it is usually trickier and more costly to collect data samples. In this section, we summarize three typical ways to harvest instruction data at scale, i.e. data adaptation, self-instruction, and data mixture.


Data Adaptation. Task-specific datasets are rich sources of high-quality data. Hence, abundant works [70], [76], [82], [101], [102], [103], [104] have utilized existing high-quality datasets to construct instruction-formatted datasets. Take the transformation of VQA datasets as an example: the original sample is an input-output pair where the input comprises an image and a natural language question, and the output is the textual answer to the question conditioned on the image. The input-output pairs of these datasets could naturally comprise the multimodal input and response of the instruction sample (see §3.2.2). The instructions, i.e. the descriptions of the tasks, can either derive from manual design or from semi-automatic generation aided by GPT. Specifically, some works [21], [35], [60], [70], [102], [105] hand-craft a pool of candidate instructions and sample one of them during training. We offer an example of instruction templates for the VQA datasets as shown in Table 6. The other works manually design some seed instructions and use these to prompt GPT to generate more [25], [82], [98].


Note that since the answers of existing VQA and caption datasets are usually concise, directly using these datasets for instruction tuning may limit the output length of MLLMs. There are two common strategies to tackle this problem. The first one is to specify the expected length explicitly in instructions. For example, ChatBridge [104] explicitly declares “short” and “brief” for short-answer data, as well as “a sentence” and “single sentence” for conventional coarse-grained caption data. The second one is to extend the length of existing answers [105]. For example, M3IT [105] proposes to rephrase the original answer by prompting ChatGPT with the original question, answer, and contextual information of the image (e.g. caption and OCR).


TABLE 6: Instruction templates for VQA datasets, cited from [60]. <Image> and [Question] are the image and the question in the original VQA datasets, respectively.


Self-Instruction. Although existing multi-task datasets can constitute a rich source of data, they usually do not meet human needs well in real-world scenarios, such as multiple rounds of conversations. To tackle this issue, some works collect samples through self-instruction [106], which utilizes LLMs to generate textual instruction-following data using a few hand-annotated samples. Specifically, some instruction-following samples are hand-crafted as demonstrations, after which ChatGPT/GPT-4 is prompted to generate more instruction samples with the demonstrations as guidance. LLaVA [20] extends this approach to the multimodal field by translating images into text of captions and bounding boxes, and prompting text-only GPT-4 to generate new data with the guidance of requirements and demonstrations. In this way, a multimodal instruction dataset is constructed, dubbed LLaVA-Instruct-150K. Following this idea, subsequent works such as MiniGPT-4 [21], ChatBridge [104], GPT4Tools [107], and DetGPT [72] develop different datasets catering for different needs. Recently, with the release of the more powerful multimodal model GPT-4V, many works have adopted GPT-4V to generate data of higher quality, as exemplified by LVIS-Instruct4V [91] and ALLaVA [92]. We summarize the popular datasets generated through self-instruction in Table 7.


TABLE 7: A summary of popular datasets generated by self-instruction. For input/output modalities, I: Image, T: Text, V: Video, A: Audio. For data composition, M-T and S-T denote multi-turn and single-turn, respectively.


Data Mixture. Apart from the multimodal instruction data, language-only user-assistant conversation data can also be used to improve conversational proficiency and instruction-following ability [81], [98], [101], [103]. LaVIN [101] directly constructs a minibatch by randomly sampling from both language-only and multimodal data. MultiInstruct [102] proposes different strategies for training with a fusion of single-modal and multimodal data, including mixed instruction tuning (combining both types of data and randomly shuffling them) and sequential instruction tuning (text data followed by multimodal data).


3.2.4 Data Quality

Recent research has revealed that the data quality of instruction-tuning samples is no less important than quantity. Zeng et al. [73] find that models pre-trained on larger but noisier image-text pairs do not perform as well as models pre-trained with smaller but cleaner datasets. Similarly, Wei et al. [108] find that less instruction-tuning data with higher quality can achieve better performance. For data filtering, the work proposes some metrics to evaluate data quality and, correspondingly, a method to automatically filter out inferior vision-language data. Here we discuss two important aspects regarding data quality.


Prompt Diversity. The diversity of instructions has been found to be critical for model performance. Zeng et al. [73] empirically verify that diverse prompts help improve model performance and generalization ability.


Task Coverage. In terms of tasks involved in training data, Du et al. [109] perform an empirical study and find that the visual reasoning task is superior to captioning and QA tasks for boosting model performance. Moreover, the study suggests that enhancing the complexity of instructions might be more beneficial than increasing task diversity and incorporating fine-grained spatial annotations.


3.3 Alignment tuning

3.3.1 Introduction

Alignment tuning is more often used in scenarios where models need to be aligned with specific human preferences, e.g. response with fewer hallucinations (see §6). Currently, Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO) are two main techniques for alignment tuning. In this section, we introduce the main ideas of the two techniques in sequence and offer some examples of how they are utilized in addressing practical problems, and finally, give a compilation of the related datasets.


3.3.2 Training Detail

RLHF [110], [111]. This technique aims to utilize reinforcement learning algorithms to align LLMs with human preferences, with human annotations as supervision in the training loop. As exemplified in InstructGPT [95], RLHF incorporates three key steps: (1) Supervised fine-tuning. This step aims to fine-tune a pre-trained model to present the preliminary desired output behavior. The fine-tuned model in the RLHF setting is called the policy model. Note that this step might be skipped since the supervised policy model $\pi^{SFT}$ can be initialized from an instruction-tuned model (see §3.2).


(2) Reward modeling. A reward model is trained using preference pairs in this step. Given a multimodal prompt $x$ (e.g. image and text) and a response pair $(y_w, y_l)$, the reward model $r_\theta$ learns to give a higher reward to the preferred response $y_w$ than to the rejected response $y_l$, according to the following objective:


$$L(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \frac{e^{r_\theta(x, y_w)}}{e^{r_\theta(x, y_w)} + e^{r_\theta(x, y_l)}}\right]$$

where $D = \{(x, y_w, y_l)\}$ is the comparison dataset annotated by humans. In practice, the reward model $r_\theta$ shares a similar structure with the policy model and is trained with supervision on the preference predictions.
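The reward-modeling objective above is equivalent to a pairwise logistic loss, -log σ(r_θ(x, y_w) - r_θ(x, y_l)). A minimal sketch is given below; the scalar rewards are toy values and the function name is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_preferred, reward_rejected):
    """Pairwise preference loss: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()

# toy usage with scalar rewards for a batch of two preference pairs
r_w = torch.tensor([1.2, 0.3])   # rewards of the preferred responses y_w
r_l = torch.tensor([0.4, 0.5])   # rewards of the rejected responses y_l
print(reward_model_loss(r_w, r_l))
```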

(3) Reinforcement learning. The reward model is used to optimize the RL policy model $\pi_\phi^{RL}$. A per-token KL penalty is often added to the training objective to avoid deviating too far from the original policy [95], resulting in the objective:


$$L(\phi) = -\mathbb{E}_{x \sim D,\, y \sim \pi_\phi^{RL}}\left[r_\theta(x, y) - \beta \cdot D_{KL}\!\left(\pi_\phi^{RL}(y \mid x)\,\|\,\pi^{SFT}(y \mid x)\right)\right]$$

where $\beta$ is the coefficient for the KL penalty term. Typically, both the RL policy $\pi_\phi^{RL}$ and the reference model are initialized from the supervised model $\pi^{SFT}$. The obtained RL policy model is expected to align with human preferences through this training process.


Researchers have explored using RLHF techniques for better multimodal alignment. For example, LLaVA-RLHF [112] collects human preference data and tunes a model with fewer hallucinations based on LLaVA [20].

DPO [113]. It learns from human preference labels utilizing a simple binary classification loss. Compared with the PPO-based RLHF algorithm, DPO is exempt from learning an explicit reward model, thus simplifying the whole pipeline to two steps, i.e. human preference data collection and preference learning. The learning objective is as follows:


$$L(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\!\left(\beta \log \frac{\pi_\phi^{RL}(y_w \mid x)}{\pi^{REF}(y_w \mid x)} - \beta \log \frac{\pi_\phi^{RL}(y_l \mid x)}{\pi^{REF}(y_l \mid x)}\right)\right]$$
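A minimal sketch of the DPO objective follows, operating on the summed log-probabilities of the preferred and rejected responses under the policy and the frozen reference model. The β value and all numbers are toy assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_policy_w, logp_policy_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO: a binary classification loss on implicit rewards of y_w vs. y_l."""
    reward_w = beta * (logp_policy_w - logp_ref_w)   # implicit reward of y_w
    reward_l = beta * (logp_policy_l - logp_ref_l)   # implicit reward of y_l
    return -F.logsigmoid(reward_w - reward_l).mean()

# toy usage: log-probabilities for a batch of two preference pairs
print(dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-15.0, -13.0]),
               torch.tensor([-11.0, -12.5]), torch.tensor([-14.0, -12.8])))
```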

RLHF-V [114] collects fine-grained (segment-level) preference data pairs by correcting hallucinations in the model response and uses the obtained data to perform dense DPO. Silkie [115] instead collects preference data via prompting GPT-4V and distills the preference supervision into an instruction-tuned model through DPO.


3.3.3 Data

The gist of data collection for alignment-tuning is to collect feedback for model responses, i.e. to decide which response is better, and the amount of data used in this phase is typically even less than that used in previous stages. In this part, we introduce three datasets and summarize them in Table 8.


TABLE 8: A summary of datasets for alignment-tuning. For input/output modalities, I: Image, T: Text.


LLaVA-RLHF [112]. It contains 10K preference pairs collected from human feedback in terms of honesty and helpfulness. The dataset mainly serves to reduce hallucinations in model responses.


RLHF-V [114]. It has 5.7K fine-grained human feedback data (segment-level hallucination corrections).


VLFeedback [115]. It utilizes AI to provide feedback on model responses. The dataset contains more than 380K comparison pairs scored by GPT-4V in terms of helpfulness, accuracy, and relevance.


4. EVALUATION

Evaluation is an essential part of developing MLLMs since it provides feedback for model optimization and helps compare the performance of different models. Compared with evaluation methods of traditional multimodal models, the evaluation of MLLMs exhibits several new traits: (1) Since MLLMs are generally versatile, it is important to evaluate MLLMs comprehensively. (2) MLLMs exhibit many emergent capabilities that require special attention (e.g. OCR-free math reasoning) and thus require new evaluation schemes. The evaluation of MLLMs can be broadly categorized into two types according to the question genres, including closed-set and open-set.


4.1 Closed-set

Closed-set questions refer to a type of question where the possible answer options are predefined and limited to a finite set. The evaluation is usually performed on task-specific datasets. In this case, the responses can be naturally judged by benchmark metrics [20], [60], [70], [76], [101], [102], [103], [104]. For example, InstructBLIP [60] reports the accuracy on ScienceQA [116], as well as the CIDEr score [117] on NoCaps [118] and Flickr30K [119]. The evaluation settings are typically zero-shot [60], [102], [104], [105] or finetuning [20], [35], [60], [70], [76], [101], [103], [105]. The first setting often selects a wide range of datasets covering different general tasks and splits them into held-in and held-out datasets. After tuning on the former, zero-shot performance is evaluated on the latter with unseen datasets or even unseen tasks. In contrast, the second setting is often observed in the evaluation of domain-specific tasks. For example, LLaVA [20] and LLaMA-Adapter [76] report fine-tuned performance on ScienceQA [116]. LLaVA-Med [35] reports results on biomedical VQA [120], [121], [122].


The above evaluation methods are usually limited to a small range of selected tasks or datasets, lacking a comprehensive quantitative comparison. To this end, some efforts have endeavored to develop new benchmarks specially designed for MLLMs [123], [124], [125], [126], [127], [128], [129]. For example, Fu et al. [123] construct a comprehensive evaluation benchmark MME that includes a total of 14 perception and cognition tasks. All instruction-answer pairs in MME are manually designed to avoid data leakage. MMBench [124] is a benchmark specifically designed for evaluating multiple dimensions of model capabilities, using ChatGPT to match open responses with pre-defined choices. Video-ChatGPT [130] and Video-Bench [131] focus on video domains and propose specialized benchmarks as well as evaluation tools for assessment. There are also evaluation strategies designed to evaluate a specific aspect of the model [102], as exemplified by POPE [132] for assessment of hallucination degree.
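For closed-set evaluation, free-form responses must first be mapped to one of the pre-defined options before accuracy can be computed. The sketch below uses simple rule-based matching as a stand-in for the ChatGPT-based matching adopted by benchmarks such as MMBench; the option set and matching rules are illustrative.

```python
def match_choice(response: str, choices: dict) -> str:
    """Map a free-form response to an option letter by prefix or option text."""
    resp = response.strip().lower()
    for letter, text in choices.items():
        if resp.startswith(letter.lower()) or text.lower() in resp:
            return letter
    return "unknown"

def accuracy(predictions, ground_truths) -> float:
    return sum(p == g for p, g in zip(predictions, ground_truths)) / len(ground_truths)

choices = {"A": "a cat", "B": "a dog", "C": "a horse"}
preds = [match_choice("B. The image shows a dog.", choices)]
print(accuracy(preds, ["B"]))  # 1.0
```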


4.2 Open-set

In contrast to the closed-set questions, the responses to open-set questions can be more flexible, where MLLMs usually play a chatbot role. Because the content of the chat can be arbitrary, it would be trickier to judge than the closed-ended output. The criterion can be classified into manual scoring, GPT scoring, and case study. Manual scoring requires humans to assess the generated responses. This kind of approach often involves hand-crafted questions that are designed to assess specific dimensions. For example, mPLUG-Owl [81] collects a visually related evaluation set to judge capabilities like natural image understanding, diagram, and flowchart understanding. Similarly, GPT4Tools [107] builds two sets for the finetuning and zero-shot performance, respectively, and evaluates the responses in terms of thought, action, arguments, and the whole.


Since manual assessment is labor intensive, some researchers have explored rating with GPT, namely GPT scoring. This approach is often used to evaluate performance on multimodal dialogue. LLaVA [20] proposes to score the responses via text-only GPT-4 in terms of different aspects, such as helpfulness and accuracy. Specifically, 30 images are sampled from the COCO [133] validation set, each associated with a short question, a detailed question, and a complex reasoning question via self-instruction on GPT-4. The answers generated by both the model and GPT-4 are sent to GPT-4 for comparison. Subsequent works follow this idea and prompt ChatGPT [81] or GPT-4 [35], [70], [101], [104], [105] to rate results [35], [70], [81], [101], [104] or judge which one is better [103].
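The sketch below assembles a judging prompt in the spirit of this GPT-scoring setup: the text-only judge receives the question, textual image context (e.g. captions or box coordinates), and two candidate answers, and is asked to rate them. The wording is illustrative and not the exact prompt used by LLaVA or follow-up works.

```python
def build_judge_prompt(question: str, image_context: str,
                       answer_a: str, answer_b: str) -> str:
    """Return a text-only judging prompt for rating two candidate answers."""
    return (
        "You are grading answers to a question about an image.\n"
        f"Image context (captions / boxes): {image_context}\n"
        f"Question: {question}\n"
        f"Assistant 1: {answer_a}\n"
        f"Assistant 2: {answer_b}\n"
        "Rate each assistant from 1 to 10 for helpfulness and accuracy, "
        "output the two scores on the first line, then give a short justification."
    )

print(build_judge_prompt("What is the man holding?",
                         "caption: a man holding a red umbrella on a street",
                         "A red umbrella.", "A blue bag."))
```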

由于人工评估劳动强度大,一些研究人员探索了使用GPT评分,即GPT评分。这种方法通常用于评估多模态对话的表现。LLaVA [20] 提议通过仅文本的GPT-4在不同方面对回答进行评分,例如有用性和准确性。具体来说,从COCO [133] 验证集中抽取30张图像,每张图像都通过GPT-4自我指令与一个简短问题、一个详细问题和一个复杂推理问题相关联。由模型和GPT-4生成的回答都被发送到GPT-4进行比较。后续工作遵循这一想法,并提示ChatGPT [81] 或 GPT-4 [35], [70], [101], [104], [105] 进行评分[35], [70], [81], [101], [104] 或判断哪个更好[103]。
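
A minimal sketch of this GPT-scoring setup is given below, assuming the judge only sees textual image context (captions, boxes), the reference answer, and the candidate answer. The rubric, prompt wording, and the `chat_completion` helper are illustrative assumptions rather than LLaVA's exact evaluation code.

```python
# Sketch of GPT-based scoring for multimodal dialogue in the spirit of LLaVA's
# evaluation: a text-only judge rates a reference answer and a candidate answer
# given a textual description of the image. `chat_completion` is hypothetical.
import json

ASPECTS = ["helpfulness", "relevance", "accuracy", "level of detail"]  # assumed rubric

def gpt_score(image_context: str, question: str, reference_answer: str,
              model_answer: str, chat_completion) -> dict:
    prompt = (
        "You are judging multimodal assistants. The image is described by the "
        f"following context (captions/boxes):\n{image_context}\n\n"
        f"Question: {question}\n"
        f"Assistant 1 (reference): {reference_answer}\n"
        f"Assistant 2 (candidate): {model_answer}\n\n"
        f"Rate each assistant from 1 to 10 considering {', '.join(ASPECTS)}. "
        'Answer in JSON as {"assistant_1": <score>, "assistant_2": <score>}.'
    )
    return json.loads(chat_completion(prompt))

# Usage with a stubbed judge; a real setup would call the GPT-4 API instead.
stub_judge = lambda prompt: '{"assistant_1": 9, "assistant_2": 7}'
print(gpt_score("Caption: two dogs playing on a beach.", "How many dogs are there?",
                "There are two dogs.", "I can see three dogs.", stub_judge))
```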

A main issue of applying text-only GPT-4 as an evaluator is that its judgment is based only on image-related text content, such as captions or bounding box coordinates, without access to the image [35]. Thus, it may be questionable to set GPT-4 as the performance upper bound in this case. With the release of the vision interface of GPT, some works [77], [134] exploit the more advanced GPT-4V model to assess the performance of MLLMs. For example, Woodpecker [77] adopts GPT-4V to judge the quality of model answers based on the image. The evaluation is expected to be more accurate than using text-only GPT-4 since GPT-4V has direct access to the image.

应用仅文本的GPT-4作为评估者的一个主要问题是,评判仅基于与图像相关的文本内容,如字幕或边界框坐标,而不访问图像[35]。因此,在这种情况下,将GPT-4设为性能上限可能会有问题。随着GPT视觉接口的发布,一些工作[77], [134] 利用更高级的GPT-4V模型来评估MLLM的性能。例如,Woodpecker [77] 采用GPT-4V根据图像判断模型回答的质量。由于GPT-4V可以直接访问图像,评估预计比仅使用文本的GPT-4更准确。

A supplementary approach is to compare the different capabilities of MLLMs through case studies. For instance, some studies evaluate two typical advanced commercial-use models, GPT-4V and Gemini. Yang et al. [135] perform in-depth qualitative analysis on GPT-4V by crafting a series of samples across various domains and tasks, spanning from preliminary skills, such as caption and object counting, to complex tasks that require world knowledge and reasoning, such as joke understanding and indoor navigation as an embodied agent. Wen et al. [136] make a more focused evaluation of GPT-4V by designing samples targeting automatic driving scenarios. Fu et al. [137] carry out a comprehensive evaluation on Gemini-Pro by comparing the model against GPT-4V. The results suggest that GPT-4V and Gemini exhibit comparable visual reasoning abilities in spite of different response styles.

一种补充方法是通过案例研究比较MLLM的不同能力。例如,一些研究评估了两个典型的高级商用模型,GPT-4V和Gemini。杨等[135]通过设计跨越各种领域和任务的一系列样本对GPT-4V进行深入的定性分析,从初步技能,如字幕和对象计数,到需要世界知识和推理的复杂任务,如笑话理解和作为具身代理的室内导航。Wen等[136]通过设计针对自动驾驶场景的样本对GPT-4V进行更集中的评估。傅等[137]通过将Gemini-Pro与GPT-4V进行比较,进行了全面评估。结果表明,尽管响应风格不同,GPT-4V和Gemini表现出相当的视觉推理能力。

5 Extensions

Recent studies have made significant strides in extending the capabilities of MLLMs, spanning from more potent foundational abilities to broader coverage of scenarios. We trace the principal development of MLLMs in this regard.

最近的研究在扩展MLLM的能力方面取得了显著进展,从更强的基础能力到更广泛的场景覆盖。我们追踪了MLLM在这方面的主要发展。

Granularity Support

To facilitate better interaction between agents and users, researchers have developed MLLMs with finer-grained support in terms of model inputs and outputs. On the input side, models that support finer control via user prompts have been developed progressively, evolving from image-level to region-level [28], [138], [139] and even pixel-level [29], [140], [141] inputs. Specifically, Shikra [28] supports region-level input and understanding. Users may interact with the assistant more flexibly by referring to specific regions, which are represented as bounding boxes in natural language form. Ferret [141] takes a step further and supports more flexible referring by devising a hybrid representation scheme. The model supports different forms of prompts, including point, box, and sketch. Similarly, Osprey [29] supports point input by utilizing a segmentation model [9]. Aided by the exceptional capabilities of the pre-trained segmentation model, Osprey enables specifying a single entity or part of it with a single click. On the output side, grounding capabilities have improved in line with the development of input support. Shikra [28] supports responses grounded in the image with box annotations, resulting in higher precision and a finer referring experience. LISA [142] further supports mask-level understanding and reasoning, which makes pixel-level grounding possible.

为了促进代理和用户之间更好的交互,研究人员开发了在模型输入和输出方面提供更细粒度支持的MLLM。在输入方面,支持通过用户提示进行更精细控制的模型逐步发展,从图像级到区域级[28], [138], [139],甚至像素级[29], [140], [141]。具体来说,Shikra [28] 支持区域级输入和理解。用户可以通过引用特定区域(以自然语言形式的边界框表示)与助手进行更灵活的交互。Ferret [141] 更进一步,通过设计混合表示方案支持更灵活的指代。模型支持不同形式的提示,包括点、框和草图。同样,Osprey [29] 通过利用分割模型[9]支持点输入。借助预训练分割模型的卓越能力,Osprey可以通过单击指定单个实体或其部分。在输出方面,定位(grounding)能力随着输入支持的发展而提升。Shikra [28] 支持以边界框标注将响应定位到图像中,从而带来更高的精度和更精细的指代体验。LISA [142] 进一步支持掩码级理解和推理,使像素级定位成为可能。

Modality Support

Increased support for modalities is a tendency for MLLM studies. On the one hand, researchers have explored adapting MLLMs to support the input of more multimodal content, such as 3D point cloud [41], [143], [144], [145]. On the other hand, MLLMs are also extended to generate responses of more modalities, such as image [32], [146], [147], [148], audio [32], [147], [149], [150], and video [32], [151]. For example, NExT-GPT [32] proposes a framework that supports inputs and outputs of mixed modalities, specifically, combinations of text, image, audio, and video, with the help of diffusion models [152], [153] attached to the MLLM. The framework applies an encoder-decoder architecture and puts LLM as a pivot for understanding and reasoning.

对模态的支持增加是MLLM研究的一个趋势。一方面,研究人员探索了适应MLLM以支持更多多模态内容输入的方法,例如3D点云[41], [143], [144], [145]。另一方面,MLLM也被扩展到生成更多模态的响应,例如图像[32], [146], [147], [148],音频[32], [147], [149], [150] 和视频[32], [151]。例如,NExT-GPT [32] 提出了一种框架,支持混合模态的输入和输出,特别是文本、图像、音频和视频的组合,并借助附加到MLLM的扩散模型[152], [153]。该框架采用编码器-解码器架构,并将LLM作为理解和推理的枢纽。

Language Support

Current models are predominantly monolingual, probably because high-quality non-English training corpora are scarce. Some works have been devoted to developing multilingual models so that a broader range of users can be covered. VisCPM [33] transfers model capabilities to the multilingual setting by designing a multi-stage training scheme. Specifically, the scheme takes English, with its abundant training corpus, as a pivot language. Utilizing a pre-trained bilingual LLM, the multimodal capabilities are transferred to Chinese by adding some translated samples during instruction tuning. Taking a similar approach, Qwen-VL [34] is developed from the bilingual LLM Qwen [58] and supports both Chinese and English. During pre-training, Chinese data is mixed into the training corpus to preserve the bilingual capabilities of the model, taking up 22.7% of the whole data volume.

目前的模型主要是单语的,可能是由于高质量的非英语训练语料库稀缺。一些工作致力于开发多语种模型,以便覆盖更广泛的用户。VisCPM [33] 通过设计多阶段训练方案将模型能力转移到多语言环境中。具体来说,该方案将英语作为关键语言,拥有丰富的训练语料库。利用预训练的双语LLM,通过在指令调优期间添加一些翻译样本,将多模态能力转移到中文。采取类似的方法,Qwen-VL [34] 是从双语LLM Qwen [58] 开发的,支持中文和英文。在预训练期间,中文数据被混入训练语料库中,以保留模型的双语能力,占整个数据量的22.7%。

Scenario/Task Extension

Apart from developing common general-purpose assistants, some studies have focused on more specific scenarios where practical conditions should be considered, while others extend MLLMs to downstream tasks with specific expertise.

除了开发常见的通用助手外,一些研究专注于需要考虑实际条件的更具体的场景,而另一些研究则将MLLM扩展到具有特定专业知识的下游任务。

A typical tendency is to adapt MLLMs to more specific real-life scenarios. MobileVLM [63] explores developing small-size variants of MLLMs for resource-limited scenarios. Some designs and techniques are utilized for deployment on mobile devices, such as LLMs of smaller size and quantization techniques to speed up computation. Other works develop agents that interact with the real world [41], [154], [155], e.g. user-friendly assistants specially designed for Graphical User Interfaces (GUIs), as exemplified by CogAgent [44], AppAgent [43], and Mobile-Agent [45]. These assistants excel in planning and guiding through each step to fulfill a task specified by users, acting as helpful agents for human-machine interaction. Another line is to augment MLLMs with specific skills for solving tasks in different domains, e.g. document understanding [38], [39], [156], [157] and the medical domain [35], [36], [37]. For document understanding, mPLUG-DocOwl [38] utilizes various forms of document-level data for tuning, resulting in an enhanced model for OCR-free document understanding. TextMonkey [39] incorporates multiple tasks related to document understanding to improve model performance. Apart from conventional document image and scene text datasets, position-related tasks are added to reduce hallucinations and help models learn to ground responses in the visual information. MLLMs can also be extended to the medical domain by instilling medical knowledge. For example, LLaVA-Med [158] injects medical knowledge into vanilla LLaVA [20] and develops an assistant specialized in medical image understanding and question answering.

一个典型的趋势是将MLLM适配到更具体的现实场景。MobileVLM [63] 探索开发适用于资源受限场景的小尺寸MLLM变体。其中利用了一些面向移动设备部署的设计和技术,例如更小尺寸的LLM和用于加速计算的量化技术。其他工作则开发与现实世界交互的代理[41], [154], [155],例如专为图形用户界面(GUI)设计的用户友好助手,如CogAgent [44]、AppAgent [43] 和 Mobile-Agent [45]。这些助手擅长规划并引导每一步以完成用户指定的任务,充当人机交互的有用代理。另一条路线是增强MLLM在不同领域解决任务的特定技能,例如文档理解[38], [39], [156], [157] 和医学领域[35], [36], [37]。对于文档理解,mPLUG-DocOwl [38] 利用各种形式的文档级数据进行调优,从而增强了无OCR文档理解能力。TextMonkey [39] 引入了与文档理解相关的多个任务,以提高模型性能。除了传统的文档图像和场景文本数据集,还增加了与位置相关的任务,以减少幻觉并帮助模型学会将响应落实到视觉信息上。通过灌输医学领域知识,MLLM还可以扩展到医学领域。例如,LLaVA-Med [158] 将医学知识注入到原始的LLaVA [20] 中,开发了专门用于医学图像理解和问答的助手。

6. MULTIMODAL HALLUCINATION

Multimodal hallucination refers to the phenomenon of responses generated by MLLMs being inconsistent with the image content [77]. As a fundamental and important problem, the issue has received increased attention. In this section, we briefly introduce some related concepts and research development.

多模态幻觉是指MLLM生成的响应与图像内容不一致的现象[77]。作为一个基本且重要的问题,这一问题受到越来越多的关注。在本节中,我们简要介绍一些相关概念和研究进展。

6.1 Preliminaries

Current research on multimodal hallucinations can be further categorized into three types [159]:

当前关于多模态幻觉的研究可以进一步分为三种类型[159]:

  1. Existence Hallucination is the most basic form, meaning that models incorrectly claim the existence of certain objects in the image.

1) 存在幻觉 是最基本的形式,意味着模型错误地声称图像中存在某些对象。

  2. Attribute Hallucination means describing the attributes of certain objects in a wrong way, e.g. failure to identify a dog’s color correctly. It is typically associated with existence hallucination since descriptions of the attributes should be grounded in objects present in the image.

2) 属性幻觉 是指以错误的方式描述某些对象的属性,例如未能正确识别狗的颜色。它通常与存在幻觉相关,因为属性的描述应该基于图像中存在的对象。

  3. Relationship Hallucination is a more complex type and is also based on the existence of objects. It refers to false descriptions of relationships between objects, such as relative positions and interactions.

3) 关系幻觉 是一种更复杂的类型,也是基于对象的存在。它指的是对对象之间关系的错误描述,例如相对位置和互动。

In what follows, we first introduce some specific evaluation methods (§6.2), which are useful to gauge the performance of methods for mitigating hallucinations (§6.3). Then, we will discuss in detail the current methods for reducing hallucinations, according to the main categories each method falls into.

在接下来的内容中,我们首先介绍一些具体的评估方法(§6.2),这些方法对于评估减轻幻觉的方法的性能很有用(§6.3)。然后,我们将根据每种方法所属的主要类别详细讨论当前减少幻觉的方法。

6.2 Evaluation Methods

CHAIR [160] is an early metric that evaluates hallucination levels in open-ended captions. The metric measures the proportion of sentences that contain hallucinated objects, or the proportion of hallucinated objects among all the objects mentioned. In contrast, POPE [132] is a method that evaluates closed-set choices. Specifically, multiple prompts with binary choices are formulated, each querying if a specific object exists in the image. The method also covers more challenging settings to evaluate the robustness of MLLMs, with data statistics taken into consideration. The final evaluation uses a simple keyword-matching mechanism, i.e. detecting the keywords “yes/no”, to convert open-ended responses into closed-set binary choices. With a similar evaluation approach, MME [123] provides a more comprehensive evaluation, covering aspects of existence, count, position, and color, as exemplified in [77].

CHAIR [160] 是一种早期的度量方法,用于评估开放式描述中的幻觉程度。该指标衡量包含幻觉对象的句子比例,或所有被提及对象中幻觉对象的比例。相比之下,POPE [132] 是一种评估闭集选择的方法。具体来说,构造了多个带二元选择的提示,每个提示询问图像中是否存在特定对象。该方法还涵盖了更具挑战性的设置以评估MLLM的鲁棒性,并考虑了数据统计。最终评估使用简单的关键词匹配机制,即通过检测关键词“是/否”,将开放式响应转换为闭集二元选择。采用类似的评估方法,MME [123] 提供了更全面的评估,涵盖存在、计数、位置和颜色等方面,如[77]所示。
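
The CHAIR-style statistics described above can be written down directly: the object-level score is the fraction of mentioned objects that are not actually in the image, and the sentence-level score is the fraction of captions containing at least one such object. The sketch below assumes objects have already been extracted from each caption and that ground-truth object sets are available, which are simplifications of the full metric.

```python
# Minimal sketch of CHAIR-style hallucination statistics.
# Assumes caption objects were already extracted (e.g. via a parser/synonym map)
# and ground-truth object sets come from annotations; both are simplifications.

def chair_scores(caption_objects: list, gt_objects: list) -> tuple:
    hallucinated, mentioned, bad_captions = 0, 0, 0
    for pred, gt in zip(caption_objects, gt_objects):
        fake = pred - gt                     # objects mentioned but not in the image
        hallucinated += len(fake)
        mentioned += len(pred)
        bad_captions += bool(fake)
    chair_i = hallucinated / max(mentioned, 1)             # per-object rate
    chair_s = bad_captions / max(len(caption_objects), 1)  # per-sentence rate
    return chair_i, chair_s

# Example: the second caption hallucinates a "frisbee".
preds = [{"dog", "beach"}, {"dog", "frisbee"}]
gts = [{"dog", "beach", "sea"}, {"dog", "ball"}]
print(chair_scores(preds, gts))  # -> (0.25, 0.5)
```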

Different from previous approaches that use matching mechanisms to detect and decide hallucinations, HaELM [161] proposes using text-only LLMs as a judge to automatically decide whether MLLMs’ captions are correct against reference captions. In light of the fact that text-only LLMs can only access limited image context and require reference annotations, Woodpecker [77] uses GPT-4V to directly assess model responses grounded in the image. FaithScore [162] is a more fine-grained metric based on a routine that breaks down descriptive sub-sentences and evaluates each sub-sentence separately. Based on previous studies, AMBER [163] is an LLM-free benchmark that encompasses both discriminative tasks and generative tasks and involves the three types of possible hallucinations (see §6.1).

不同于以前使用匹配机制来检测和判定幻觉的方法,HaELM [161] 提出使用仅文本的LLM作为评判,对照参考描述自动判断MLLM生成的描述是否正确。鉴于仅文本LLM只能访问有限的图像上下文并且需要参考标注,Woodpecker [77] 使用GPT-4V直接基于图像评估模型响应。FaithScore [162] 是一种更细粒度的指标,其流程将描述性子句分解并分别评估每个子句。在以往研究的基础上,AMBER [163] 是一个不依赖LLM的基准,涵盖判别任务和生成任务,并涉及三种可能的幻觉类型(见§6.1)。

6.3 Mitigation Methods

According to high-level ideas, the current methods can be roughly divided into three categories: pre-correction, in-process-correction, and post-correction.

根据高级别的想法,当前的方法大致可以分为三类:预先校正、过程校正和后期校正。

Pre-correction. An intuitive and straightforward solution for hallucination is to collect specialized data (e.g., negative data) and use the data for fine-tuning, thus resulting in models with fewer hallucinated responses. LRV-Instruction [164] introduces a visual instruction-tuning dataset. Apart from common positive instructions, the dataset incorporates delicately designed negative instructions at different semantic levels to encourage responses faithful to the image content. LLaVA-RLHF [112] collects human-preference pairs and finetunes models with reinforcement learning techniques, leading to models that are better aligned with human preferences and produce fewer hallucinated answers.

预先校正。针对幻觉的一种直观且简单的解决方案是收集专门的数据(例如负样本数据)并用于微调,从而得到幻觉响应更少的模型。LRV-Instruction [164] 引入了一个视觉指令调优数据集。除了常见的正面指令外,该数据集还包含在不同语义层次上精心设计的负面指令,以鼓励响应忠实于图像内容。LLaVA-RLHF [112] 收集人类偏好对,并使用强化学习技术微调模型,使模型与人类偏好更加一致,产生的幻觉回答更少。

In-process-correction. Another line is to make improvements in architectural design or feature representation. These works try to explore the reasons for hallucinations and design corresponding remedies to mitigate them during the generation process.
HallE-Switch [159] performs an empirical analysis of possible factors behind object existence hallucinations and hypothesizes that existence hallucinations derive from objects not grounded by the visual encoder but actually inferred from knowledge embedded in the LLM. Based on this assumption, a continuous controlling factor and a corresponding training scheme are introduced to control the extent of imagination in model output during inference.
VCD [165] suggests that hallucinations stem from two primary causes, i.e. statistical bias in the training corpus and the strong language prior embedded in LLMs. The authors notice that when noise is injected into the image, MLLMs tend to lean towards the language prior rather than the image content for response generation, leading to hallucinations. Correspondingly, this work designs an amplify-then-contrast decoding scheme to offset the false bias.

过程校正。另一条路线是改进架构设计或特征表示。这些工作试图探索幻觉产生的原因,并设计相应的补救措施以在生成过程中减轻幻觉。HallE-Switch [159] 对对象存在幻觉的可能因素进行了实证分析,并假设存在幻觉源自未被视觉编码器锚定、而实际上是基于LLM中嵌入的知识推断出来的对象。基于该假设,引入了一个连续控制因子和相应的训练方案,以在推理过程中控制模型输出中的想象程度。VCD [165] 认为幻觉源于两个主要原因,即训练语料中的统计偏差和LLM中嵌入的强语言先验。作者注意到,当向图像中注入噪声时,MLLM倾向于依赖语言先验而非图像内容来生成响应,从而导致幻觉。因此,该工作设计了一种先放大再对比的解码方案,以抵消这种错误偏差。
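
The amplify-then-contrast idea can be illustrated at the level of a single decoding step: the next-token distribution conditioned on the clean image is amplified while the distribution conditioned on a distorted image (which mostly reflects the language prior) is subtracted. The sketch below is a schematic in the spirit of VCD rather than its exact formulation; the `alpha` weight and the two logit vectors are assumed inputs.

```python
# Schematic sketch of an amplify-then-contrast decoding step in the spirit of
# VCD: logits from the clean image are amplified, logits from a noised image
# (dominated by the language prior) are subtracted. `alpha` is an assumed knob.
import numpy as np

def contrastive_next_token(logits_clean: np.ndarray,
                           logits_noisy: np.ndarray,
                           alpha: float = 1.0) -> int:
    # (1 + alpha) * clean - alpha * noisy down-weights tokens that stay likely
    # even when the visual evidence is destroyed.
    adjusted = (1.0 + alpha) * logits_clean - alpha * logits_noisy
    probs = np.exp(adjusted - adjusted.max())
    probs /= probs.sum()
    return int(np.argmax(probs))   # greedy pick, for illustration only

# Toy example: token 2 looks plausible from the language prior alone,
# but token 1 is better supported by the actual image.
clean = np.array([0.1, 2.0, 1.8])
noisy = np.array([0.1, 0.2, 1.9])
print(contrastive_next_token(clean, noisy))  # -> 1
```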

HACL [166] investigates the embedding space of vision and language. Based on observations of this space, a contrastive learning scheme is devised to pull paired cross-modal representations closer while pushing hallucinated text representations away from non-hallucinated ones.

HACL [166] 研究了视觉和语言的嵌入空间。基于对该空间的观察,设计了一种对比学习方案,将成对的跨模态表示拉近,同时将幻觉文本表示推离非幻觉文本表示。

Post-correction. Different from previous paradigms, post-correction mitigates hallucinations in a post-remedy way, correcting hallucinations after output generation. Woodpecker [77] is a training-free general framework for hallucination correction. Specifically, the method incorporates expert models to supplement contextual information about the image and crafts a pipeline to correct hallucinations step by step. The method is interpretable in that the intermediate results of each step can be checked, and objects are grounded in the image. Another method, LURE [167], trains a specialized revisor to mask objects with high uncertainty in the descriptions and then regenerates the responses.

后期校正。不同于以前的范式,后期校正以一种后补救的方式减轻幻觉,并在输出生成后纠正幻觉。Woodpecker [77] 是一个无需训练的幻觉纠正通用框架。具体来说,该方法结合专家模型以补充图像的上下文信息,并设计了一个管道逐步纠正幻觉。该方法是可解释的,因为可以检查每一步的中间结果,并在图像中锚定对象。另一种方法LURE [167] 训练一个专门的修正器,以掩盖描述中高不确定性的对象,并重新生成响应。

7. EXTENDED TECHNIQUES

7.1 Multimodal In-Context Learning

ICL is one of the important emergent abilities of LLMs. There are two good traits of ICL: (1) Different from traditional supervised learning paradigms that learn implicit patterns from abundant data, the crux of ICL is to learn from analogy [168]. Specifically, in the ICL setting, LLMs learn from a few examples along with an optional instruction and extrapolate to new questions, thereby solving complex and unseen tasks in a few-shot manner [22], [169], [170]. (2) ICL is usually implemented in a training-free manner [168] and thus can be flexibly integrated into different frameworks at the inference stage. A closely related technique to ICL is instruction-tuning (see §3.2), which is shown empirically to enhance the ICL ability [19].

ICL是LLM的重要新兴能力之一。ICL有两个优良特性:(1)不同于从大量数据中学习隐式模式的传统监督学习范式,ICL的关键在于从类比中学习[168]。具体来说,在ICL设置中,LLM从少量示例以及可选指令中学习,并外推到新问题,从而以小样本方式解决复杂和未见任务[22], [169], [170]。(2)ICL通常以免训练的方式实施[168],因此可以在推理阶段灵活地集成到不同的框架中。与ICL密切相关的一种技术是指令调优(见§3.2),实验证明其可以增强ICL能力[19]。
In the context of MLLM, ICL has been extended to more modalities, leading to Multimodal ICL (M-ICL). Building upon the setting in (§3.2), at inference time, M-ICL can be implemented by adding a demonstration set, i.e., a set of in-context samples, to the original sample. In this case, the template can be extended as illustrated in Table 9. Note that we list two in-context examples for illustration, but the number and the ordering of examples can be flexibly adjusted. In fact, models are commonly sensitive to the arrangement of demonstrations [168], [171].

在MLLM的背景下,ICL已扩展到更多模态,形成了多模态ICL(M-ICL)。基于(§3.2)中的设置,在推理时,M-ICL可以通过向原始样本添加示例集,即一组上下文样本来实现。在这种情况下,可以如表9所示扩展模板。请注意,我们列出了两个上下文示例以作说明,但示例的数量和排序可以灵活调整。事实上,模型通常对示例的排列很敏感[168], [171]。

TABLE 9: A simplified example of the template to structure an M-ICL query, adapted from [98]. For illustration, we list two in-context examples and a query divided by a dashed line. [instruction] and [response] are texts from the data sample. <image> is a placeholder to represent the multimodal input (an image in this case). <BOS> and <EOS> are tokens denoting the start and the end of the input to the LLM, respectively.

表9:用于构建M-ICL查询的模板简化示例,改编自[98]。为了说明,我们列出了两个上下文示例和一个由虚线分隔的查询。[指令]和[响应]是数据样本中的文本。<image>是表示多模态输入(在本例中为图像)的占位符。<BOS><EOS>分别是表示LLM输入的开始和结束的词元。

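Following the description of Table 9, an M-ICL query can be assembled by concatenating a few (image, instruction, response) demonstrations in front of the final query. The helper below is a simplified assumption of such a template rather than the exact format of any specific model; in practice the images referenced by each <image> placeholder are passed to the vision encoder separately.

```python
# Simplified sketch of assembling an M-ICL query as described for Table 9:
# a few in-context (image, instruction, response) examples followed by the
# query itself. Token names (<BOS>, <EOS>, <image>) mirror the table's notation.

def build_micl_prompt(demos: list, query_instruction: str) -> str:
    parts = ["<BOS>"]
    for d in demos:                      # each demo: {"instruction": ..., "response": ...}
        parts.append(f"<image> {d['instruction']} {d['response']}")
    parts.append(f"<image> {query_instruction}")   # the query; answer left for the model
    parts.append("<EOS>")
    return "\n".join(parts)

demos = [
    {"instruction": "What is the main object?", "response": "A red bicycle."},
    {"instruction": "What is the main object?", "response": "A black cat."},
]
print(build_micl_prompt(demos, "What is the main object?"))
```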

7.1.1 Improvement on ICL capabilities

Recently, a growing amount of work has focused on enhancing ICL performance under various scenarios. In this section, we trace the development of this field and summarize some relevant works.

最近,越来越多的工作集中在增强各种场景下的ICL性能。在本节中,我们追踪了这一领域的发展并总结了一些相关工作。

MIMIC-IT [172] combines in-context learning with instruction tuning by building an instruction dataset formatted with multimodal context. The model instruction tuned on the introduced dataset shows improved few-shot performance in the caption task. Emu [173] extends the idea of Flamingo [74] by introducing extra modalities in model generation and corresponding training corpus. Aided by the introduced vision decoder, i.e. Stable Diffusion, the model learns from extra vision supervision and supports more flexibility in output format and in-context reasoning. Specifically, apart from answering in pure text, the model can also give responses in the form of images. Sheng et al. [174] adopt a similar idea and try to extend output modalities into both text and image. Instead of adopting a specialized encoder for images, the work adopts a unified quantization scheme with a shared embedding layer.

MIMIC-IT [172] 通过构建一个以多模态上下文格式组织的指令数据集,将上下文学习与指令调优相结合。在该数据集上进行指令调优的模型在描述任务中表现出更好的小样本性能。Emu [173] 通过在模型生成和相应训练语料中引入额外模态,扩展了Flamingo [74] 的思路。在所引入的视觉解码器(即Stable Diffusion)的帮助下,模型从额外的视觉监督中学习,并支持更灵活的输出格式和上下文推理。具体来说,除了以纯文本回答外,模型还可以以图像形式给出响应。Sheng等人[174] 采用了类似的思路,尝试将输出模态扩展到文本和图像。该工作没有采用专门的图像编码器,而是采用带有共享嵌入层的统一量化方案。

Some other works explore improving few-shot learning performance under specific settings. Link-context learning [175] focuses on strengthening the causal link between image-label pairs and devises a contrastive training scheme by formulating positive and negative image-description pairs. MMICL [176] aims to augment the capability of reasoning with multiple related images. To strengthen the link between image and text, the work proposes a context scheme to transform interleaved image-text data into a uniform format. Jeong [177] finds that when a small fraction of incoherent images/text is inserted as noise, MLLMs can be misled into giving responses inconsistent with the context. Based on this observation, the work proposes a pre-filtering method to remove irrelevant context and facilitate more coherent responses.

其他一些工作探索了特定设置下的小样本学习性能改进。链接上下文学习[175] 聚焦于加强图像标签对之间的因果联系,并通过制定正负图像描述对进行对比训练方案。MMICL [176] 旨在增强多相关图像的推理能力。为了加强图像和文本之间的联系,该工作提出了一种上下文方案,将交错的图像文本数据转换为统一格式。Jeong [177] 发现,当插入一小部分不连贯的图像/文本作为噪声时,MLLM可能会被误导,从而给出与上下文不一致的响应。基于这一观察,该工作相应地提出了一种预过滤方法,以去除无关的上下文并促进更连贯的响应。

7.1.2 Applications

In terms of applications in multimodality, M-ICL is mainly used in two scenarios: (1) solving various visual reasoning tasks [22], [74], [178], [179], [180] and (2) teaching LLMs to use external tools [169], [170], [181]. The former usually involves learning from a few task-specific examples and generalizing to new but similar questions. In this way, from the few pieces of information provided in instructions and demonstrations, LLMs get a sense of what the task is doing and what the output template is. In contrast, examples of tool usage are more fine-grained. They typically comprise a list of steps that can be executed sequentially to fulfill the task. Thus, the second scenario is closely related to CoT (see §7.2).

在多模态应用方面,M-ICL主要用于两种场景:(1)解决各种视觉推理任务[22], [74], [178], [179], [180];(2)教LLM使用外部工具[169], [170], [181]。前者通常涉及从少量任务特定示例中学习,并泛化到新的但相似的问题。通过指令和示例中提供的少量信息,LLM能够感知任务要做什么以及输出模板是什么。相比之下,工具使用的示例更为细粒度。它们通常包含一个可以顺序执行以完成任务的步骤列表。因此,第二种场景与CoT密切相关(见§7.2)。

7.2 Multimodal Chain of Thought

As the pioneer work [8] points out, CoT is “a series of intermediate reasoning steps”, which has been proven to be effective in complex reasoning tasks [8], [182], [183]. The main idea of CoT is to prompt LLMs to output not only the final answer but also the reasoning process that leads to the answer, resembling the cognitive process of humans.

正如先驱工作[8]所指出的,CoT是“一系列中间推理步骤”,已被证明在复杂推理任务中有效[8], [182], [183]。CoT的主要思想是提示LLM输出不仅是最终答案,还包括导致答案的推理过程,类似于人类的认知过程。

Inspired by the success in NLP, multiple works [184], [185], [186], [187] have been proposed to extend the unimodal CoT to Multimodal CoT (M-CoT). We first introduce different paradigms for acquiring the M-CoT ability (§7.2.1). Then, we delineate more specific aspects of M-CoT, including the chain configuration (§7.2.2) and the pattern (§7.2.3).

受NLP成功的启发,多项工作[184], [185], [186], [187] 被提出将单模态CoT扩展到多模态CoT(M-CoT)。我们首先介绍获取M-CoT能力的不同范式(§7.2.1)。然后,我们描绘了M-CoT的更具体方面,包括链配置(§7.2.2)和模式(§7.2.3)。

7.2.1 Learning Paradigms

The learning paradigm is also an aspect worth investigating. There are broadly three ways to acquire the M-CoT ability, i.e., through finetuning, training-free few-shot learning, and zero-shot learning. The sample size requirement for the three ways is in descending order.

学习范式也是一个值得研究的方面。大体上,有三种方式可以获得M-CoT能力,即微调、免训练的小样本学习和零样本学习。这三种方式对样本量的要求依次递减。

Intuitively, the finetuning approach often involves curating specific datasets for M-CoT learning. For example, Lu et al. [116] construct a scientific question-answering dataset ScienceQA with lectures and explanations, which can serve as sources for learning CoT reasoning, and finetune the model on this proposed dataset. Multimodal-CoT [185] also uses the ScienceQA benchmark but generates the output in a two-step fashion, i.e., the rationale (the chain of reasoning steps) and then the final answer based on the rationale. CoT-PT [187] learns an implicit chain of reasoning through a combination of prompting and step-specific visual bias.

直观地,微调方法通常涉及为M-CoT学习策划特定的数据集。例如,Lu等人[116]构建了一个带有讲解和解释的科学问答数据集ScienceQA,可作为学习CoT推理的来源,并在该数据集上微调模型。Multimodal-CoT [185] 同样使用ScienceQA基准,但以两步方式生成输出,即先生成推理依据(推理步骤链),再基于该依据给出最终答案。CoT-PT [187] 通过组合提示和步骤特定的视觉偏置来学习隐式推理链。

Compared with finetuning, few/zero-shot learning is more computationally efficient. The main difference between them is that few-shot learning typically requires hand-crafting some in-context examples so that the model can learn to reason step by step more easily. In contrast, zero-shot learning does not require any specific example for CoT learning. In this case, models learn to use the embedded knowledge and reasoning abilities without explicit guidance, by prompting with designed instructions like “Let’s think frame by frame” or “What happened between these two keyframes” [184], [186]. Similarly, some works [22], [188] prompt models with descriptions of the task and tool usage to decompose complex tasks into sub-tasks.

与微调相比,小样本/零样本学习在计算效率上更高。它们之间的主要区别在于小样本学习通常需要手工制作一些上下文示例,以便模型可以更容易地逐步推理。相比之下,零样本学习不需要任何特定的CoT学习示例。在这种情况下,模型学习使用嵌入的知识和推理能力,而无需通过设计的指令(如“让我们逐帧思考”或“这两个关键帧之间发生了什么”[184], [186])进行明确指导。同样,一些工作[22], [188]通过任务描述和工具使用提示模型,将复杂任务分解为子任务。
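
The difference between the two training-free settings comes down to what goes into the prompt: few-shot M-CoT prepends hand-crafted worked examples, while zero-shot M-CoT only appends a trigger phrase such as “Let’s think frame by frame”. The helper below merely builds these two prompt variants; the exact wording of the demonstration and the trigger phrase is an assumption for illustration.

```python
# Sketch contrasting few-shot and zero-shot M-CoT prompting. Only the prompt
# text is built here; the trigger phrase and the worked example are assumptions.

COT_TRIGGER = "Let's think frame by frame."

FEW_SHOT_DEMO = (
    "<video> Q: Does the cup fall?\n"
    "Reasoning: In early frames the cup sits on the table edge; in later frames "
    "it lies on the floor, so it must have fallen.\n"
    "A: Yes.\n\n"
)

def zero_shot_mcot(question: str) -> str:
    # No examples: only a trigger phrase elicits step-by-step reasoning.
    return f"<video> Q: {question}\n{COT_TRIGGER}\nReasoning:"

def few_shot_mcot(question: str) -> str:
    # Hand-crafted demonstration shows the expected reasoning format.
    return FEW_SHOT_DEMO + f"<video> Q: {question}\nReasoning:"

print(zero_shot_mcot("What happened between these two keyframes?"))
print(few_shot_mcot("What happened between these two keyframes?"))
```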

7.2.2 Chain Configuration

Structure and length are two critical aspects of the reasoning chains. In terms of structure, current methods can be divided into single-chain and tree-shaped methods. Reasoning within a single chain is a paradigm widely used in various methods [116], [185]. Specifically, the step-by-step reasoning stems from a single question-rationale-answer chain. Recently, some methods have explored using a more complicated scheme, i.e., a tree-shaped chain, for reasoning. Specifically, DDCoT [188] breaks down a question into multiple sub-questions, each of which is solved by the LLM itself or by using experts to generate rationales. Then the LLM aggregates and reasons over the rationales to form the final answer. With respect to chain length, it can be categorized into adaptive and pre-defined formations. The former configuration requires LLMs to decide on their own when to halt the reasoning chains [22], [116], [169], [170], [185], [188], while the latter setting stops the chains at a pre-defined length [79], [184], [186], [187].

结构和长度是推理链的两个关键方面。在结构方面,当前方法可以分为单链和树形方法。单链内推理是各种方法中广泛使用的范式[116], [185]。具体来说,逐步推理源于单个“问题-推理依据-答案”链。最近,一些方法探索了使用更复杂的方案,即树形链进行推理。具体来说,DDCoT [188] 将一个问题分解为多个子问题,每个子问题由LLM自身解决,或借助专家生成推理依据。然后,LLM汇总这些推理依据并进行推理,得出最终答案。关于链的长度,可以分为自适应和预定义两种形式。前者要求LLM自行决定何时停止推理链[22], [116], [169], [170], [185], [188],而后者在预定义长度处停止链[79], [184], [186], [187]。

7.2.3 Generation Patterns

7.2.3 生成模式

How the chain is constructed is a question worth studying. We summarize the current works into (1) an infilling-based pattern and (2) a predicting-based pattern. Specifically, the infilling-based pattern demands deducing steps between surrounding context (previous and following steps) to fill the logical gaps [184], [186]. In contrast, the predicting-based pattern requires extending the reasoning chains given conditions such as instructions and previous reasoning history [22], [116], [169], [170], [185], [188]. The two types of patterns share a requirement that the generated steps should be consistent and correct.

链的构建方式是一个值得研究的问题。我们将当前的工作总结为(1)基于填充的模式和(2)基于预测的模式。具体来说,基于填充的模式需要推导周围上下文(前后步骤)之间的步骤,以填补逻辑差距[184], [186]。相比之下,基于预测的模式需要在给定条件下扩展推理链,如指令和以前的推理历史[22], [116], [169], [170], [185], [188]。这两种模式的一个共同要求是生成的步骤应一致且正确。

7.3 LLM-Aided Visual Reasoning

7.3.1 Introduction

Inspired by the success of tool-augmented LLMs [190], [191], [192], [193], some research has explored the possibilities of invoking external tools [22], [201], [169], [170] or vision foundation models [22], [79], [80], [188], [194], [195], [196] for visual reasoning tasks. Taking LLMs as helpers with different roles, these works build task-specific [79], [197], [198] or general-purpose [22], [169], [170], [181], [188] visual reasoning systems.

受到工具增强型LLM [190], [191], [192], [193] 成功的启发,一些研究探索了调用外部工具[22], [201], [169], [170] 或视觉基础模型[22], [79], [80], [188], [194], [195], [196] 来完成视觉推理任务的可能性。这些工作让LLM充当扮演不同角色的助手,构建了特定任务[79], [197], [198]或通用[22], [169], [170], [181], [188]的视觉推理系统。

Compared with conventional visual reasoning models [199], [200], [201], these works manifest several good traits: (1) Strong generalization abilities. Equipped with rich open-world knowledge learned from large-scale pretraining, these systems can easily generalize to unseen objects or concepts with remarkable zero/few-shot performance [169], [170], [159], [197], [198], [202]. (2) Emergent abilities. Aided by strong reasoning abilities of LLMs, these systems can perform complex tasks. For example, given an image, MM-REACT [22] can interpret the meaning beneath the surface, such as explaining why a meme is funny. (3) Better interactivity and control. Traditional models typically allow a limited set of control mechanisms and often entail expensive curated datasets [203], [204]. In contrast, LLM-based systems have the ability to make fine control in a user-friendly interface (e.g. chitchat and natural language queries).

与传统视觉推理模型[199], [200], [201]相比,这些工作表现出几个优良特性:(1)强大的泛化能力。配备了从大规模预训练中学习的丰富开放世界知识,这些系统可以轻松泛化到未见的对象或概念,表现出显著的零样本/少样本性能[169], [170], [159], [197], [198], [202]。(2)新兴能力。在LLM的强大推理能力的帮助下,这些系统可以执行复杂任务。例如,给定一张图像,MM-REACT [22] 可以解释表面之下的意义,例如解释为什么一个梗图很有趣。(3)更好的互动性和控制。传统模型通常允许有限的控制机制,并且通常需要昂贵的策划数据集[203], [204]。相比之下,基于LLM的系统具有在用户友好界面(例如闲聊和自然语言查询)中进行精细控制的能力。

For this part, we start with introducing different training paradigms employed in the construction of LLM-Aided Visual Reasoning systems (§7.3.2). Then, we delve into the primary roles that LLMs play within these systems (§7.3.3).

在这一部分,我们首先介绍用于构建LLM辅助视觉推理系统的不同训练范式(§7.3.2)。然后,我们深入探讨LLM在这些系统中扮演的主要角色(§7.3.3)。

7.3.2 Training Paradigms

According to training paradigms, LLM-Aided Visual Reasoning systems generally adopt two strategies, i.e., training-free and finetuning.

根据训练范式,LLM辅助视觉推理系统通常采用两种策略,即无训练和微调。

Training-free. With abundant prior knowledge stored in pre-trained LLMs, an intuitive and simple way is to repurpose pre-trained models and directly prompt LLMs to fulfill various needs. According to the setting, the reasoning systems can be further categorized into few-shot models [22], [169], [170], [181] and zero-shot models [79], [197]. The few-shot models entail a few hand-crafted in-context samples (see §7.1) to guide LLMs to generate a program or a sequence of execution steps. These programs or execution steps serve as instructions for corresponding foundation models or external tools/modules. The zero-shot models take a step further by directly utilizing LLMs’ linguistics/semantics knowledge or reasoning abilities. For example, PointCLIP V2 [197] prompts GPT-3 to generate descriptions with 3D-related semantics for better alignment with corresponding images. In CAT [79], LLMs are instructed to refine descriptions according to user queries.

无训练。借助预训练LLM中存储的大量先验知识,一种直观且简单的方法是复用预训练模型并直接提示LLM以满足各种需求。根据设置,推理系统可以进一步分为少样本模型[22], [169], [170], [181] 和零样本模型[79], [197]。少样本模型需要少量手工制作的上下文示例(见§7.1)来引导LLM生成程序或一系列执行步骤。这些程序或执行步骤作为相应基础模型或外部工具/模块的指令。零样本模型更进一步,直接利用LLM的语言/语义知识或推理能力。例如,PointCLIP V2 [197] 提示GPT-3生成具有3D相关语义的描述,以更好地与相应图像对齐。在CAT [79] 中,LLM被指示根据用户查询改进描述。
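
As a training-free, zero-shot illustration in the spirit of PointCLIP V2’s use of GPT-3, one can simply prompt an LLM to produce 3D-aware textual descriptions of a category and feed them to a vision-language model’s text encoder for alignment. The prompt wording and the `chat_completion` helper below are illustrative assumptions, not the method’s exact prompts.

```python
# Training-free sketch in the spirit of prompting an LLM for 3D-related
# descriptions of a category (as PointCLIP V2 does with GPT-3). The prompt
# wording and `chat_completion` helper are assumptions for illustration.

def describe_category_3d(category: str, chat_completion, n: int = 3) -> list:
    prompt = (
        f"Give {n} short captions describing the 3D shape and typical depth-map "
        f"appearance of a {category}, one per line."
    )
    return [line.strip() for line in chat_completion(prompt).splitlines() if line.strip()]

# Stubbed usage; a real system would pass these captions to the text encoder
# of a vision-language model for zero-shot matching against rendered views.
stub_llm = lambda prompt: ("A chair with four thin legs.\n"
                           "A chair rendered as a sparse depth map.\n"
                           "A chair with a flat backrest.")
print(describe_category_3d("chair", stub_llm))
```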

Finetuning. Some works adopt further finetuning to improve the planning abilities with respect to tool usage [107] or to improve localization capabilities [142], [205] of the system. For example, GPT4Tools [107] introduces the instruction-tuning approach (see §3.2). Accordingly, a new tool-related instruction dataset is collected and used to finetune the model.

微调。一些工作采用进一步的微调以提高工具使用方面的规划能力[107]或改进系统的定位能力[142], [205]。例如,GPT4Tools [107] 引入了指令调优方法(见§3.2)。相应地,收集了一个新的工具相关指令数据集,并用于微调模型。

7.3.3 Functions

In order to further inspect what roles LLMs exactly play in LLM-Aided Visual Reasoning systems, existing related works are divided into three types:

为了进一步检查LLM在LLM辅助视觉推理系统中究竟扮演什么角色,现有相关工作分为三种类型:

  • LLM as a Controller

  • LLM as a Decision Maker

  • LLM as a Semantics Refiner

  • LLM作为控制器

  • LLM作为决策者

  • LLM作为语义精炼器

The first two roles are related to CoT (see §7.2). CoT is frequently used here because complex tasks need to be broken down into simpler intermediate steps. When LLMs act as controllers, the systems often finish the task in a single round, while multiple rounds are more common when LLMs act as decision makers. We delineate how LLMs serve these roles in the following parts.

前两种角色与CoT有关(见§7.2)。之所以经常用到CoT,是因为复杂任务需要被分解为更简单的中间步骤。当LLM作为控制器时,系统通常在单轮内完成任务,而在LLM作为决策者的情况下,多轮更为常见。我们在以下部分详细说明LLM如何担任这些角色。

LLM as a Controller. In this case, LLM acts as a central controller that (1) breaks down a complex task into simpler sub-tasks/steps and (2) assigns these tasks to appropriate tools/modules. The first step is often finished by leveraging the CoT ability of LLMs. Specifically, LLMs are prompted explicitly to output task planning [181] or, more directly, the modules to call [107], [169], [170]. For example, VisProg [170] prompts GPT-3 to output a visual program, where each program line invokes a module to perform a sub-task. In addition, LLMs are required to output argument names for the module input. To handle these complex requirements, some hand-crafted in-context examples are used as references [169], [170], [181]. This is closely related to the optimization of reasoning chains (see §7.2), or more specifically, the least-to-most prompting [206] technique. In this way, complex problems are broken down into sub-problems that are solved sequentially.

LLM作为控制器。在这种情况下,LLM充当中央控制器,(1)将复杂任务分解为更简单的子任务/步骤,并(2)将这些任务分配给适当的工具/模块。第一步通常通过利用LLM的CoT能力完成。具体来说,LLM被明确提示输出任务规划[181],或更直接地输出要调用的模块[107], [169], [170]。例如,VisProg [170] 提示GPT-3输出一个视觉程序,其中每一行程序调用一个模块执行子任务。此外,LLM还需要输出模块输入的参数名称。为了处理这些复杂要求,使用了一些手工制作的上下文示例作为参考[169], [170], [181]。这与推理链的优化(见§7.2)密切相关,或更具体地说,与最少到最多提示(least-to-most prompting)[206] 技术相关。通过这种方式,复杂问题被分解为依次求解的子问题。
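
A controller of this kind can be sketched as two stages: the LLM is prompted (with a few in-context programs) to emit a line-by-line plan, and a thin dispatcher then routes each line to the named module. Everything below, including the module names, the program grammar, and the single in-context example, is a simplified assumption for illustration rather than VisProg’s actual format.

```python
# Simplified sketch of the LLM-as-controller pattern: the LLM writes a small
# "program" whose lines name modules and arguments, and a dispatcher executes
# them in order. Module names and grammar are assumptions, not VisProg's format.

IN_CONTEXT = (
    "Task: How many dogs are left of the tree?\n"
    "Program:\n"
    "BOX1 = detect(image, 'dog')\n"
    "BOX2 = detect(image, 'tree')\n"
    "ANS = count_left_of(BOX1, BOX2)\n\n"
)

def plan(task: str, chat_completion) -> list:
    prompt = IN_CONTEXT + f"Task: {task}\nProgram:\n"
    return [line for line in chat_completion(prompt).splitlines() if line.strip()]

def execute(program: list, modules: dict) -> dict:
    env = {}
    for line in program:
        target, call = [s.strip() for s in line.split("=", 1)]
        name, raw_args = call.split("(", 1)
        args = [a.strip(" '\"") for a in raw_args.rstrip(")").split(",")]
        # arguments may refer to results of earlier lines stored in env
        env[target] = modules[name.strip()](*[env.get(a, a) for a in args])
    return env

# Stubbed modules and LLM for a runnable end-to-end example.
modules = {"detect": lambda img, label: f"boxes[{label}]",
           "count_left_of": lambda a, b: 2}
stub_llm = lambda prompt: ("BOX1 = detect(image, 'dog')\n"
                           "BOX2 = detect(image, 'tree')\n"
                           "ANS = count_left_of(BOX1, BOX2)")
program = plan("How many dogs are left of the tree?", stub_llm)
print(execute(program, modules)["ANS"])  # -> 2
```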

LLM as a Decision Maker. In this case, complex tasks are solved in a multi-round manner, often in an iterative way [195]. Decision-makers often fulfill the following responsibilities: (1) Summarize the current context and the input information, and then use the information available to either plan the next step or answer the question or complete the task; (2) Organize and summarize the answer to present it in a user-friendly way.

LLM作为决策者。在这种情况下,复杂任务通常以多轮方式解决,通常以迭代方式进行[195]。决策者通常履行以下职责:(1)总结当前上下文和输入信息,然后使用可用信息计划下一步或回答问题或完成任务;(2)组织和总结答案,以用户友好的方式呈现。

LLM as a Semantics Refiner. When LLM is used as a Semantics Refiner, researchers mainly utilize its linguistics and semantics knowledge. Specifically, LLMs are often instructed to integrate information into consistent and fluent natural language sentences [202] or generate texts according to different specific needs [79], [197], [198].

LLM作为语义精炼器。当LLM用作语义精炼器时,研究人员主要利用其语言和语义知识。具体来说,经常指示LLM将信息整合成一致且流畅的自然语言句子[202]或根据不同的具体需求生成文本[79], [197], [198]。

8. CHALLENGES AND FUTURE DIRECTIONS

The development of MLLMs is still in a rudimentary stage and thus leaves much room for improvement, which we summarize below:

MLLM的发展仍处于初步阶段,因此还有很大的改进空间,我们总结如下:

  • Current MLLMs are limited in processing multimodal information of long context. This restricts the development of advanced models with more multimodal tokens, e.g. long-video understanding, and long documents interleaved with images and text.

- 当前MLLM在处理长上下文多模态信息方面存在局限性。这限制了具有更多多模态词元的高级模型的发展,例如长视频理解以及图像和文本交错的长文档。

  • MLLMs should be upgraded to follow more complicated instructions. For example, a mainstream approach to generating high-quality question-answer pair data is still to prompt the closed-source GPT-4V because of its advanced instruction-following capabilities, which other models generally fail to match.

- 应该升级MLLM以遵循更复杂的指令。例如,生成高质量问答对数据的主流方法仍然是提示闭源的GPT-4V,因为其具有先进的指令遵循能力,而其他模型通常难以达到同等水平。

  • There is still a large space for improvement in techniques like M-ICL and M-CoT. Current research on the two techniques is still rudimentary, and the related capabilities of MLLMs are weak. Thus, explorations of the underlying mechanisms and potential improvement are promising.

- 在M-ICL和M-CoT等技术上仍有很大的改进空间。目前对这两种技术的研究仍然很初步,MLLM的相关能力也很薄弱。因此,探索底层机制和潜在改进是有前途的。

  • Developing embodied agents based on MLLMs is a heated topic. It would be meaningful to develop such agents that can interact with the real world. Such endeavors require models with critical capabilities, including perception, reasoning, planning, and execution.

- 基于MLLM开发具身代理是一个热门话题。开发能够与现实世界互动的此类代理将是有意义的。这种努力需要具有关键能力的模型,包括感知、推理、规划和执行。

  • Safety issues. Similarly to LLMs, MLLMs can be vulnerable to crafted attacks [177], [207], [208]. In other words, MLLMs can be misled to output biased or undesirable responses. Thus, improving model safety will be an important topic.

- 安全问题。与LLM类似,MLLM可能容易受到精心设计的攻击[177], [207], [208]。换句话说,MLLM可能会被误导输出有偏见或不良的响应。因此,改进模型安全性将是一个重要的话题。

9. CONCLUSION

In this paper, we perform a survey of the existing MLLM literature and offer a broad view of its main directions, including the basic recipe and related extensions. Moreover, we underscore the current research gaps that need to be filled and point out some promising research directions. We hope this survey can offer readers a clear picture of the current progress of MLLM and inspire more works.

在本文中,我们对现有的MLLM文献进行了综述,并提供了其主要方向的广泛视角,包括基本配方和相关扩展。此外,我们强调了需要填补的当前研究空白,并指出了一些有前途的研究方向。我们希望这份综述能为读者提供当前MLLM进展的清晰图景,并激发更多的工作。
