github.com/ali-vilab/AnyDoor.
Foreword:
[Paper Quick Read] Following the #Ten Questions for a Paper# format, this post distills the paper's core points so that researchers in related fields can grasp its content quickly. The paper itself is not translated in the process; the blogger believes translation inevitably distorts the original meaning, and encourages entry-level researchers to read the original text.
Contents
- 01 What are the shortcomings of existing work?
- 02 What problem does the paper solve?
- 03 What is the key solution?
- 04 What are the main contributions?
- 05 What related work is there?
- 06 How is the method implemented?
- 07 How are the experiments designed?
- 08 What are the experimental results and comparisons?
- 09 What do the ablation studies tell us?
- 10 How could this work be further improved?
- References
01 What are the shortcomings of existing work?
Local image editing: these methods can only give coarse guidance for generation and often fail to synthesize ID-consistent results for untrained new concepts.
Customized image generation: although these methods can generate high-fidelity images, the user cannot specify the scenario or the location of the target object. Besides, the time-consuming fine-tuning prevents them from being used in large-scale applications.
Image harmonization: these methods only explore low-level changes; editing the structure, view, and pose of the foreground objects, or generating shadows and reflections, is not taken into consideration.
02 What problem does the paper solve?
This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations with desired shapes.
AnyDoor is able to generate ID-consistent compositions with high quality in a zero-shot manner.
03 What is the key solution?
Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object.
- We complement the commonly used identity feature with detail features, which are carefully designed to maintain appearance details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings.
- We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness.
04 What are the main contributions?
- We present AnyDoor for object teleportation. The core idea is to use a discriminative ID extractor and a frequency-aware detail extractor to characterize the target object.
- Trained on a large combination of video and image data, AnyDoor composites the object at a user-specified location of the scene image with effective shape control.
- AnyDoor provides a universal solution for general region-to-region mapping tasks and could be profitable for various applications.
05 What related work is there?
- Stable Diffusion [41]
- IP-Adapter [58]
- Paint-by-Example [56]
- Graphit [16]
- DreamBooth [42]
- Custom Diffusion [27]
- Cones [33]
06 How is the method implemented?
In this paper, we investigate “object teleportation”, which means accurately and seamlessly placing the target object into the desired location of the scene image.
We re-generate a box/mask-marked local region of a scene image by taking the target object as the template.
- We represent the target object with identity- and detail-related features, then composite them through interaction with the background scene.
- Concretely, we use an ID extractor to produce discriminative ID tokens and delicately design a frequency-aware detail extractor to obtain detail maps as a supplement (a sketch of each follows below).
- We inject the ID tokens and the detail maps into a pre-trained text-to-image diffusion model as guidance to generate the desired composition.
- To make the generated content more customizable, we explore leveraging additional controls (e.g., user-drawn masks) to indicate the shape/pose of the object.
- To learn customized object generation with high diversity, we collect image pairs of the same object from videos to learn appearance variations, and also leverage large-scale static images to guarantee scenario diversity.
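The ID extractor is built on a self-supervised vision backbone (DINOv2 [36]). Below is a rough, hedged sketch of what extracting ID tokens could look like with an off-the-shelf checkpoint; the checkpoint choice, preprocessing, and the omitted projection layer are assumptions of this summary, not the authors' released code.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# DINOv2 backbone standing in for the ID extractor; checkpoint choice is illustrative.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
backbone = AutoModel.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def extract_id_tokens(object_image: Image.Image) -> torch.Tensor:
    """Return the global token plus per-patch tokens of a background-removed
    object image; a linear projection (omitted here) would map them to the
    dimension expected by the diffusion UNet's cross-attention layers."""
    inputs = processor(images=object_image, return_tensors="pt")
    outputs = backbone(**inputs)
    return outputs.last_hidden_state  # shape: [1, 1 + num_patches, hidden_dim]
```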
High-frequency map
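The detail extractor consumes a high-frequency map of the target object as a spatial condition. As a minimal sketch, assuming an OpenCV pipeline, such a map might be computed with the Sobel operator [23]; the exact filtering and normalization here are illustrative, not the authors' implementation.

```python
import cv2
import numpy as np

def high_frequency_map(object_image: np.ndarray) -> np.ndarray:
    """Compute a Sobel-based high-frequency (edge/texture) map of an RGB object.

    The map keeps fine textures and contours that pin down identity, while
    discarding low-level color and lighting statistics, so the generator is
    still free to re-light and re-pose the object in the new scene.
    """
    gray = cv2.cvtColor(object_image, cv2.COLOR_RGB2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)  # horizontal gradients
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)  # vertical gradients
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    magnitude /= magnitude.max() + 1e-6              # normalize to [0, 1]
    return (magnitude * 255).astype(np.uint8)
```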
The training supervision is a mean squared error loss, as shown below.
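The formula itself does not survive in this summary; for reference, the standard denoising objective of latent diffusion models, which AnyDoor builds on, has the form below, where $\mathbf{z}_t$ is the noisy latent at timestep $t$ and $\mathbf{c}$ denotes the ID tokens and detail maps used as conditions (the notation is assumed here):

$$
\mathcal{L} = \mathbb{E}_{\mathbf{z}_t,\, t,\, \mathbf{c},\, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}
\left[ \big\| \boldsymbol{\epsilon} - \epsilon_\theta(\mathbf{z}_t, t, \mathbf{c}) \big\|_2^2 \right]
$$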
07 How are the experiments designed?
During inference, given a scene image and a location box, we expand the box into a square with an amplifier ratio of 2.0.
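As a minimal sketch of what this box expansion could look like (the helper name, the square-from-longer-side convention, and the image-boundary clipping are assumptions, not the authors' exact implementation):

```python
def expand_box_to_square(box, ratio=2.0, img_w=None, img_h=None):
    """Expand an (x1, y1, x2, y2) location box around its center into a square
    whose side is `ratio` times the longer side, optionally clipped to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    side = max(x2 - x1, y2 - y1) * ratio
    nx1, ny1 = cx - side / 2.0, cy - side / 2.0
    nx2, ny2 = cx + side / 2.0, cy + side / 2.0
    if img_w is not None and img_h is not None:       # keep the crop inside the image
        nx1, ny1 = max(0.0, nx1), max(0.0, ny1)
        nx2, ny2 = min(float(img_w), nx2), min(float(img_h), ny2)
    return int(round(nx1)), int(round(ny1)), int(round(nx2)), int(round(ny2))
```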
For quantitative results, we construct a new benchmark with 30 new concepts provided by DreamBooth [42] as the target images. For the scene images, we manually pick 80 images with boxes from COCO-Val [31]. Thus we generate 2,400 images for the object-scene combinations. We also conduct qualitative analysis on the VITON-HD test set [13] to validate the performance for virtual try-on.
We follow DreamBooth [42] to calculate the CLIP-Score and DINO-Score, as these metrics reflect the similarity between the generated region and the target object. We also organize user studies with a group of 15 annotators to rate the generated results from the perspectives of fidelity, quality, and diversity.
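A rough sketch of how such image-to-image CLIP-Score and DINO-Score could be computed with public checkpoints; the model IDs, cropping, and preprocessing here are assumptions and may differ from the authors' protocol.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
dino_model = AutoModel.from_pretrained("facebook/dinov2-base")
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def image_similarity(gen: Image.Image, ref: Image.Image) -> tuple[float, float]:
    """Cosine similarity between the generated region and the reference object."""
    # CLIP image embeddings
    clip_in = clip_proc(images=[gen, ref], return_tensors="pt")
    clip_emb = clip_model.get_image_features(**clip_in)
    clip_emb = clip_emb / clip_emb.norm(dim=-1, keepdim=True)
    clip_score = float(clip_emb[0] @ clip_emb[1])

    # DINOv2 [CLS] embeddings
    dino_in = dino_proc(images=[gen, ref], return_tensors="pt")
    dino_emb = dino_model(**dino_in).last_hidden_state[:, 0]
    dino_emb = dino_emb / dino_emb.norm(dim=-1, keepdim=True)
    dino_score = float(dino_emb[0] @ dino_emb[1])
    return clip_score, dino_score
```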
08 What are the experimental results and comparisons?
Extensive experiments demonstrate the superiority of our approach over existing alternatives, as well as its great potential in real-world applications such as virtual try-on, shape editing, and object swapping.
Comparisons with Reference-based methods.
Figure 5 shows that previous reference-based methods can only keep semantic consistency for distinguishing features like the dog face on the backpack, and a coarse granularity of patterns like the color of the sloth toy. However, as those new concepts are not included in the training categories, their generation results are far from ID-consistent. In contrast, AnyDoor shows promising performance for zero-shot image customization with highly faithful details.
Comparisons with Tuning-based methods.
User study.
09 What do the ablation studies tell us?
Core components.
ID extractor.
Detail extractor.
More applications.
10 How could this work be further improved?
AnyDoor still struggles with fine details such as small characters or logos. This issue might be mitigated by collecting related training data, enlarging the resolution, or training better VAE decoders.
References
[1] Ali Athar, Jonathon Luiten, Paul Voigtlaender, Tarasha Khurana, Achal Dave, Bastian Leibe, and Deva Ramanan. Burst: A benchmark for unifying object recognition, segmentation and tracking in video. In WACV, 2023.
[2] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In CVPR, 2022.
[3] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. In SIGGRAPH Asia, 2023.
[4] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. Salient object detection: A benchmark. TIP, 2015.
[5] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In ICCV, 2023.
[6] Arantxa Casanova, Marlène Careil, Adriana Romero-Soriano, Christopher J Pal, Jakob Verbeek, and Michal Drozdzal. Controllable image generation via collage representations. arXiv:2304.13722, 2023.
[7] Bor-Chun Chen and Andrew Kae. Toward realistic image compositing with adversarial learning. In CVPR, 2019.
[8] Haoxing Chen, Zhangxuan Gu, Yaohui Li, Jun Lan, Changhua Meng, Weiqiang Wang, and Huaxiong Li. Hierarchical dynamic image harmonization. In ACMMM, 2022.
[9] Hong Chen, Yipeng Zhang, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Disentangled parameter-efficient tuning for subject-driven text-to-image generation. arXiv:2305.03374, 2023.
[10] Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Rui, Xuhui Jia, Ming-Wei Chang, and William W Cohen. Subject-driven text-to-image generation via apprenticeship learning. In NeurIPS, 2023.
[11] Xi Chen, Zhiyan Zhao, Feiwu Yu, Yilei Zhang, and Manni Duan. Conditional diffusion for interactive segmentation. In ICCV, 2021.
[12] Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, and Hengshuang Zhao. Focalclick: Towards practical interactive image segmentation. In CVPR, 2022.
[13] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In CVPR, 2021.
[14] Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. Dovenet: Deep image harmonization via domain verification. In CVPR, 2020.
[15] Wenyan Cong, Xinhao Tao, Li Niu, Jing Liang, Xuesong Gao, Qihao Sun, and Liqing Zhang. High-resolution image harmonization via collaborative dual transformations. In CVPR, 2022.
[16] Graphit Contributors. Graphit: A unified framework for diverse image editing tasks. https://github.com/navervision/Graphit, 2023.
[17] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip HS Torr, and Song Bai. Mose: A new dataset for video object segmentation in complex scenes. In ICCV, 2023.
[18] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.
[19] Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. In NeurIPS, 2023.
[20] Zonghui Guo, Haiyong Zheng, Yufeng Jiang, Zhaorui Gu, and Bing Zheng. Intrinsic image harmonization. In CVPR, 2021.
[21] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
[22] Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. In ICML, 2023.
[23] Nick Kanopoulos, Nagesh Vasanthavada, and Robert L Baker. Design of an image edge detection filter using the Sobel operator. JSSC, 1988.
[24] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, 2023.
[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.
[27] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In CVPR, 2023.
[28] Dongxu Li, Junnan Li, and Steven CH Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. In NeurIPS, 2023.
[29] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
[30] Tianle Li, Max Ku, Cong Wei, and Wenhu Chen. Dreamedit: Subject-driven image editing. arXiv:2306.12624, 2023.
[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[32] Qin Liu, Zhenlin Xu, Gedas Bertasius, and Marc Niethammer. Simpleclick: Interactive image segmentation with simple vision transformers. In ICCV, 2023.
[33] Zhiheng Liu, Ruili Feng, Kai Zhu, Yifei Zhang, Kecheng Zheng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones: Concept neurons in diffusion models for customized generation. In ICML, 2023.
[34] Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones 2: Customizable image synthesis with multiple subjects. In NeurIPS, 2023.
[35] Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, and Yi Yang. Large-scale video panoptic segmentation in the wild: A benchmark. In CVPR, 2022.
[36] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2024.
[37] Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, et al. Unicontrol: A unified diffusion model for controllable visual generation in the wild. In NeurIPS, 2023.
[38] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. U2-net: Going deeper with nested u-structure for salient object detection. PR, 2020.
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[40] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022.
[41] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[42] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
[43] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
[44] Vishnu Sarukkai, Linden Li, Arden Ma, Christopher Ré, and Kayvon Fatahalian. Collage diffusion. In WACV, 2024.
[45] Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning. arXiv:2304.03411, 2023.
[46] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[47] Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, and Daniel Aliaga. Objectstitch: Object compositing with diffusion model. In CVPR, 2023.
[48] Kalyan Sunkavalli, Micah K Johnson, Wojciech Matusik, and Hanspeter Pfister. Multi-scale image harmonization. In SIGGRAPH, 2010.
[49] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In CVPR, 2017.
[50] Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, and Lijuan Wang. Disco: Disentangled control for referring human dance generation in real world. arXiv:2307.00040, 2023.
[51] Weiyao Wang, Matt Feiszli, Heng Wang, and Du Tran. Unidentified video objects: A benchmark for dense, open-world segmentation. In ICCV, 2021.
[52] Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo Durand, and Song Han. Fastcomposer: Tuning-free multi-subject image generation with localized attention. arXiv:2305.10431, 2023.
[53] Shaoan Xie, Zhifei Zhang, Zhe Lin, Tobias Hinz, and Kun Zhang. Smartbrush: Text and shape guided object inpainting with diffusion model. In CVPR, 2023.
[54] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv:1809.03327, 2018.
[55] Ben Xue, Shenghui Ran, Quan Chen, Rongfei Jia, Binqiang Zhao, and Xing Tang. Dccf: Deep comprehensible color filter learning framework for high-resolution image harmonization. In ECCV, 2022.
[56] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In CVPR, 2023.
[57] Linjie Yang, Yuchen Fan, and Ning Xu. Video instance segmentation. In ICCV, 2019.
[58] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv:2308.06721, 2023.
[59] Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything: Segment anything meets image inpainting. arXiv:2304.06790, 2023.
[60] Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-scale dataset of multi-view images. In CVPR, 2023.
[61] Ziyang Yuan, Mingdeng Cao, Xintao Wang, Zhongang Qi, Chun Yuan, and Ying Shan. Customnet: Zero-shot object customization with variable-viewpoints in text-to-image diffusion models. arXiv:2310.19784, 2023.
[62] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
[63] Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris N Metaxas, and Jian Ren. Sine: Single image editing with text-to-image diffusion models. In CVPR, 2023.
[64] Na Zheng, Xuemeng Song, Zhaozheng Chen, Linmei Hu, Da Cao, and Liqiang Nie. Virtually trying on new clothing with arbitrary poses. In ACMMM, 2019.