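Preface: Automatic multimedia comment generation aims to use generative models to produce comments that fit a given context. In recent years, as cross-modal tasks such as image captioning have made significant breakthroughs, research on this topic has gradually emerged. Comments are an important part of interaction on social platforms and play a major role in guiding public opinion and improving user experience. Existing work on automatic multimedia comment generation is still relatively limited; it is surveyed below.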

Retrieval-based (Search-based) Automatic Image Commenting

[1]- Predicting Viewer Affective Comments Based on Image Content in Social Media (ICMR, 2014) National Taiwan University, Chen et al.
[2]- Assistive Image Comment Robot—A Novel Mid-Level Concept-Based Representation (TAC, 2015) FX Palo Alto Laboratory, Chen et al.

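As shown in the figure below, Chen et al. [1,2] use a Bayesian probabilistic model that analyzes image sentiment and predicts viewers' affective responses, and on that basis propose a model for generating image comments. Given a test image and its metadata, the method first estimates the image's publisher affect concepts (PAC), then selects training images whose PAC are similar to those of the test image and collects their comments into a candidate pool. Finally, it computes the inner product between each candidate comment's vector and the test image's vector and replies with the highest-scoring comments.
Figure 1 Affect-related model and its application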
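The retrieve-then-score steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function `recommend_comments`, the dictionary layout of `train_items`, the use of cosine similarity to compare PAC vectors, and all toy vectors are hypothetical. Only the final inner-product scoring is taken from the description.

```python
import math
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity (assumed here as the PAC similarity measure)."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def recommend_comments(test_pac, test_vec, train_items: List[Dict],
                       pool_size: int = 2, top_n: int = 2) -> List[str]:
    # Step 1: keep the training images whose PAC vectors are closest
    # to the PAC vector of the test image.
    neighbours = sorted(train_items,
                        key=lambda item: cosine(test_pac, item["pac"]),
                        reverse=True)[:pool_size]
    # Step 2: pool the comments attached to those images.
    pool = [c for item in neighbours for c in item["comments"]]
    # Step 3: score each candidate by its inner product with the test
    # image vector and return the best ones.
    ranked = sorted(pool, key=lambda c: dot(test_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:top_n]]


# Toy data: each training image has a PAC vector and (comment, vector) pairs.
train_items = [
    {"pac": [1.0, 0.0], "comments": [("so cute", [0.9, 0.1]),
                                     ("lovely", [0.5, 0.5])]},
    {"pac": [0.0, 1.0], "comments": [("so scary", [0.1, 0.9])]},
    {"pac": [0.9, 0.1], "comments": [("adorable", [1.0, 0.0])]},
]
print(recommend_comments([1.0, 0.0], [1.0, 0.2], train_items))
# prints ['adorable', 'so cute']
```

In this toy run the image with PAC `[0.0, 1.0]` never enters the candidate pool, so its comment cannot be returned no matter how it would score.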

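As the figure below shows, the automatic comments generally fit the image content well, but those in (c) and (d) clearly do not match the images, for example by mentioning wrong objects and actions.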

Figure 2 Examples of automatically generated comments

[3]- Object-Based Visual Sentiment Concept Analysis and Application (MM, 2014) Columbia University, Chen et al.

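To address the wrong objects and actions that appear in the comments produced by [1,2], work [3] adds object detection to the model, using the traditional DPM (Deformable Part Model) detector to find the objects in the test image.
Figure 3 Automatic comment generation based on object detection
Comment generation based on object detection improves comment quality, as shown in the figure below.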

Figure 4 Comparison of automatically generated comments

[4-1]- Share-and-Chat: Achieving Human-Level Video Commenting by Search and Multi-View Embedding (MM, 2016) Sun Yat-sen University, Li et al.
[4-2]- Video ChatBot: Triggering Live Social Interactions by Automatic Video Commenting (MM, 2016) Sun Yat-sen University, Li et al.

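Li et al. [4-1,4-2] carry the comment generation task over to the video domain. They first use a CNN to obtain video feature representations, retrieve similar videos through approximate nearest neighbor (ANN) search, and then dynamically rank the associated comments to pick suitable ones, as shown in the figure below.
Figure 5 Pipeline of the Share and Chat method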

[5]- See and chat: automatically generating viewer-level comments on images (Multimedia Tools and Applications, 2019) Sun Yat-sen University, Chen et al.

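Chen et al. [5] first use a CNN to obtain image representations, then use KNN over these features to select images similar to the test image, and finally rank the candidate comments with Ranking Canonical Correlation Analysis (RCCA), as shown in the figure below. The dataset is built with the Flickr API and post-processed with respect to image-text relevance, comment sentiment intensity, and comment length. It is split into subsets of 400K, 25K, and 1K images.
Figure 6 Pipeline of the See and Chat method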

Generative Automatic Image Commenting

[6]- Auto Image Comment via Deep Attention (ICIVC, 2017) Jiangxi Normal University, Shi et al.

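Shi et al. [6] were the first to propose a generative image commenting model. As shown in the figure below, the model uses an encoder-decoder framework that combines a CNN with an LSTM and an attention mechanism to generate suitable comment phrases.
Figure 7 Generative image comment generation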
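To make the role of attention in such an encoder-decoder model more concrete, here is a small sketch of one soft-attention step. It is an illustration under assumptions rather than the model from [6]: the region features and decoder state are toy vectors, the names `softmax` and `attend` are hypothetical, and dot-product scoring is assumed because the scoring function used in the paper is not described here.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(regions: List[List[float]],
           hidden: List[float]) -> Tuple[List[float], List[float]]:
    """One soft-attention step.

    regions: one feature vector per image region (CNN encoder output).
    hidden:  current decoder (LSTM) hidden state, same dimensionality.
    Returns the attention weights and the weighted context vector.
    """
    # Score each region against the decoder state (dot product, assumed).
    scores = [sum(r * h for r, h in zip(region, hidden)) for region in regions]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of region features.
    dim = len(regions[0])
    context = [sum(w * region[d] for w, region in zip(weights, regions))
               for d in range(dim)]
    return weights, context


weights, context = attend([[1.0, 0.0], [0.0, 1.0]], hidden=[2.0, 0.0])
print(weights, context)
```

At each decoding step the context vector would be fed to the LSTM together with the previous word, so the decoder can focus on different image regions while producing different words.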

[7]- Neural Visual Social Comment on Image-Text Content (IETE Technical Review, 2020) Shanghai University, Yin et al.

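Yin et al. [7] combine the input image with its accompanying text and fuse the multimodal information to generate comments. The dataset consists of posts crawled from Sina Weibo; each sample contains the post text, zero or more images, and the corresponding comments. A topic classification model is applied to both the generated and the ground-truth comments to build a perceptual loss, which is compared with the MLE loss. The novelty of the work lies in using the topic classification model so that generated comments share the topic of the original comments without losing diversity.
Figure 8 Generative commenting based on a topic classification model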
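The difference between a topic-level perceptual loss and a token-level MLE loss can be illustrated with a toy example. Everything below is a hypothetical sketch: the keyword-based `topic_distribution` merely stands in for a trained topic classifier, and the squared-distance form of `perceptual_loss` is an assumption, since the exact formulation in [7] is not given here.

```python
import math
from typing import List

# Hypothetical toy topics; a real system would use a trained classifier.
TOPIC_KEYWORDS = {
    "food": {"delicious", "tasty", "yummy"},
    "travel": {"beach", "trip", "view"},
}


def topic_distribution(tokens: List[str]) -> List[float]:
    """Toy topic classifier: softmax over per-topic keyword counts."""
    counts = [sum(t in kws for t in tokens) for kws in TOPIC_KEYWORDS.values()]
    exps = [math.exp(c) for c in counts]
    total = sum(exps)
    return [e / total for e in exps]


def perceptual_loss(generated: List[str], reference: List[str]) -> float:
    """Squared distance between the topic distributions of two comments."""
    p = topic_distribution(generated)
    q = topic_distribution(reference)
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mle_loss(token_probs: List[float]) -> float:
    """Mean negative log-likelihood the generator assigns to reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


reference = ["looks", "delicious"]
print(perceptual_loss(["so", "tasty"], reference))    # same topic, other words
print(perceptual_loss(["nice", "beach"], reference))  # different topic
print(mle_loss([0.5, 0.5]))
```

A comment that uses different words but stays on the same topic incurs no perceptual loss in this sketch, whereas the MLE loss penalizes any deviation from the reference tokens. This is one way to read the claim that topic guidance keeps the topic fixed while leaving room for diverse wording.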
[8]- Explainable Outfit Recommendation with Joint Outfit Matching and Comment Generation (TKDE, 2020) Shandong University, Lin et al.

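Lin et al. [8] use a CNN to extract image features and then apply a GRU with a cross-modal attention mechanism to generate natural comments for outfits, as shown in the figure below.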

Figure 9 Outfit comment generation: (a) image feature extraction, (b) mutual attention mechanism, (c) decoder generating comments

[9]- An Image Comment Method Based on Emotion Capture Module (ICFTIC, 2021) Beihang University, Li et al.

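Li et al. [9] first use a GAN to generate an image caption and then produce comments indirectly through text style transfer and text rewriting. They start from existing image captioning datasets and use text editing methods to build an image comment dataset. The target domain is then set to the comment corpus so that the model learns the language style of comments and rewrites captions into comments, as shown in the figure below.
Figure 10 Image comment generation based on text rewriting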

Automatic Live Video Commenting (Danmaku Generation)

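With the spread of short-video social apps, researchers have begun to study the generation of live video comments (danmaku). Several representative works are introduced below.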

[10]- LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts (AAAI, 2019) Peking University, Ma et al.

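This paper comes from Xu Sun's group at Peking University and is the first to propose the task of live video comment generation. Ma et al. present two baseline models for the task: the hierarchical Fusional RNN and the linear-structured Unified Transformer, as shown in the figure below.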

Open-source code: https://github.com/lancopku/livebot

Figure 11 The two baseline models

[11]- VideoIC: A Video Interactive Comments Dataset and Multimodal Multitask Learning for Comments Generation (MM, 2020) Renmin University of China, Wang et al.

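This paper comes from Qin Jin's team at Renmin University of China and adopts a multitask learning approach. A Transformer and an LSTM extract local and global visual features respectively, and a Bi-LSTM extracts text features. These features are fed into a Transformer-based encoder for multimodal fusion, after which a generation loss and a context discrimination loss are computed separately.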

Open-source code: https://github.com/AIM3-RUC/VideoIC
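As a rough illustration of how two task losses can be computed side by side in a multitask setup like the one described for [11], consider the sketch below. It is not taken from the VideoIC code base: treating the discrimination task as binary classification, combining the two losses as a weighted sum, and the names `generation_loss`, `discrimination_loss`, `multitask_loss`, and `weight` are all assumptions made for illustration.

```python
import math
from typing import List


def generation_loss(token_probs: List[float]) -> float:
    """Mean negative log-likelihood of the reference comment tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def discrimination_loss(pred_prob: float, is_true_context: bool,
                        eps: float = 1e-12) -> float:
    """Binary cross-entropy for 'does this comment belong to this context?'."""
    p = min(max(pred_prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    target = 1.0 if is_true_context else 0.0
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def multitask_loss(token_probs: List[float], pred_prob: float,
                   is_true_context: bool, weight: float = 1.0) -> float:
    """Weighted sum of the two task losses (the weighting is an assumption)."""
    return (generation_loss(token_probs)
            + weight * discrimination_loss(pred_prob, is_true_context))


print(multitask_loss([0.5, 0.5], pred_prob=0.5, is_true_context=True))
```

Both tasks would share the fused multimodal representation, so gradients from the discrimination task also shape the encoder that the generator relies on.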


[12]- PLVCG: A Pretraining Based Model for Live Video Comment Generation (PAKDD, 2021) Chinese Academy of Sciences, Zeng et al.
[13]- Knowing Where and What to Write in Automated Live Video Comments: A Unified Multi-Task Approach (ICMI, 2021) University College Dublin, Wu et al.
[14]- Sending or not? A multimodal framework for Danmaku comment prediction (IPM, 2021) Chinese Academy of Sciences, Xi et al.

Future Research Directions

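In summary, generative multimedia commenting still leaves much room for research. In my view, the following directions remain to be explored. (1) To make sure the commented object matches the image content, consider adding an object detection module so that fine-grained comments can target local image regions. (2) Add a sentiment module so that the generated comments stay emotionally consistent with the original comments.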

References

[1] Y.Y. Chen, et al. Predicting Viewer Affective Comments Based on Image Content in Social Media. ICMR, 2014.
[2] Y.Y. Chen, et al. Assistive Image Comment Robot: A Novel Mid-Level Concept-Based Representation. IEEE Transactions on Affective Computing (CCF-B), 2015.
[3] T. Chen, et al. Object-Based Visual Sentiment Concept Analysis and Application. ACM Multimedia, 2014.
[4-1] Li, et al. Share-and-Chat: Achieving Human-Level Video Commenting by Search and Multi-View Embedding. ACM Multimedia, 2016.
[4-2] Li, et al. Video ChatBot: Triggering Live Social Interactions by Automatic Video Commenting. ACM Multimedia, 2016.
[5] J.W. Chen, et al. See and chat: automatically generating viewer-level comments on images. Multimedia Tools and Applications, 2019.
[6] J.H. Shi, et al. Auto Image Comment via Deep Attention. IEEE 4th International Conference on Image, Vision and Computing (ICIVC), 2017.
[7] Y. Yin, et al. Neural Visual Social Comment on Image-Text Content. IETE Technical Review, 2020.
[8] Y.J. Lin, et al. Explainable Outfit Recommendation with Joint Outfit Matching and Comment Generation. TKDE, 2020.
[9] Q. Li, J. Yin and Y. Wang. An Image Comment Method Based on Emotion Capture Module. 2021 IEEE 3rd International Conference on Frontiers Technology of Information and Computer (ICFTIC), 2021, pp. 334-339.
[10] Ma, et al. LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts. AAAI, 2019.
[11] Wang, et al. VideoIC: A Video Interactive Comments Dataset and Multimodal Multitask Learning for Comments Generation. ACM Multimedia, 2020.
[12] Zeng, et al. PLVCG: A Pretraining Based Model for Live Video Comment Generation. PAKDD, 2021.
[13] Wu, et al. Knowing Where and What to Write in Automated Live Video Comments: A Unified Multi-Task Approach. ICMI, 2021.
[14] Xi, et al. Sending or not? A multimodal framework for Danmaku comment prediction. IPM, 2021.