Domain-specific paper subscription
Scan the QR code on WeChat (VX) to follow {晓理紫|小李子} for daily paper updates. If you find them interesting, please forward this to classmates who may need it. Thank you for your support.
If you find it helpful, you can scan the code to follow, and the latest papers will be pushed to you on time every day.
Categories:
- Large language models (LLM)
- Vision models (VLM)
- Diffusion models
- Visual navigation
- Embodied intelligence, robotics
- Reinforcement learning
- Open vocabulary, detection and segmentation
== Human-Robot Interaction ==
Title: Chat Failures and Troubles: Reasons and Solutions
Authors: Manal Helal, Patrick Holthaus, Gabriella Lakatos
[UpdateTime:]2024-01-18
[Downlink:]http://arxiv.org/abs/2309.03708v2
Abstract: This paper examines some common problems in Human-Robot Interaction (HRI) causing failures and troubles in Chat. A given use case's design decisions start with the suitable robot, the suitable chatting model, identifying common problems that cause failures, identifying potential solutions, and planning continuous improvement. In conclusion, it is recommended to use a closed-loop control algorithm that guides the use of trained Artificial Intelligence (AI) pre-trained models and provides vocabulary filtering, re-train batched models on new datasets, learn online from data streams, and/or use reinforcement learning models to self-update the trained models and reduce errors.
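The closed-loop recommendation above can be made concrete with a minimal sketch: wrap a pre-trained chat model in a loop that filters its vocabulary and logs failures as feedback for later batch retraining or online/RL updates. The `generate_reply` callable, the `BLOCKED_WORDS` set, and the failure-log format below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a closed-loop chat wrapper with vocabulary filtering,
# assuming a hypothetical pre-trained model exposed as generate_reply(prompt).
from typing import Callable, List

BLOCKED_WORDS = {"badword1", "badword2"}  # placeholder vocabulary filter


def is_allowed(reply: str) -> bool:
    """Reject replies containing blocked vocabulary."""
    return not any(w in reply.lower() for w in BLOCKED_WORDS)


def closed_loop_chat(prompt: str,
                     generate_reply: Callable[[str], str],
                     failure_log: List[dict],
                     max_retries: int = 3) -> str:
    """Query the model, filter the output, and log failures for retraining."""
    for attempt in range(max_retries):
        reply = generate_reply(prompt)
        if is_allowed(reply):
            return reply
        # Record the failure so a batched model can later be retrained
        # (or an online/RL learner updated) on this feedback signal.
        failure_log.append({"prompt": prompt, "reply": reply, "attempt": attempt})
    return "Sorry, I cannot answer that right now."  # safe fallback
```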
Title: Biased-MPPI: Informing Sampling-Based Model Predictive Control by Fusing Ancillary Controllers
Authors: Elia Trevisan, Javier Alonso-Mora
[UpdateTime:]2024-01-17
[Downlink:]http://arxiv.org/abs/2401.09241v1
Abstract: Motion planning for autonomous robots in human-populated environments poses numerous challenges due to uncertainties in the robot's dynamics, environment, and interaction with other agents. Sampling-based MPC approaches, such as Model Predictive Path Integral (MPPI) control, have shown promise in addressing these complex motion planning problems. However, the performance of MPPI heavily relies on the choice of the sampling distribution. Existing literature often uses the previously computed input sequence as the mean of a Gaussian distribution for sampling, leading to potential failures and local minima. In this paper, we propose novel derivations of the MPPI method to enhance its efficiency, robustness, and convergence. Our approach includes a mathematical formulation allowing for arbitrary sampling distributions, addressing numerical issues, and alleviating the problem of local minima. We present an efficient importance sampling scheme that combines classical and learning-based ancillary controllers simultaneously, resulting in more informative sampling and control fusion. We demonstrate our proposed scheme's superior efficiency and robustness through experiments by handling model uncertainties, rapid environmental changes, and reducing susceptibility to local minima.
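As background for why the sampling distribution matters, the sketch below shows a generic MPPI-style update in which rollouts are seeded not only by the previous input sequence but also by proposals from ancillary controllers. The `dynamics`, `cost`, and `ancillary_controllers` callables are hypothetical (each controller is assumed to return a `(horizon, u_dim)` input plan), and the exponential weighting is the standard MPPI rule rather than the paper's Biased-MPPI derivation.

```python
# Generic MPPI-style update with ancillary-controller proposals mixed into the
# sampling distribution (a sketch, not the paper's Biased-MPPI formulation).
import numpy as np


def mppi_step(x0, u_prev, dynamics, cost, ancillary_controllers,
              horizon=20, n_samples=256, sigma=0.5, lam=1.0):
    """x0: current state; u_prev: (horizon, u_dim) previous input plan."""
    u_dim = u_prev.shape[1]
    # Candidate mean sequences: the previous plan plus each ancillary plan.
    means = [u_prev] + [ctrl(x0, horizon) for ctrl in ancillary_controllers]
    samples = np.zeros((n_samples, horizon, u_dim))
    costs = np.zeros(n_samples)
    for k in range(n_samples):
        mean = means[k % len(means)]                 # round-robin over proposals
        u = mean + sigma * np.random.randn(horizon, u_dim)
        x, c = x0, 0.0
        for t in range(horizon):                     # roll out and accumulate cost
            x = dynamics(x, u[t])
            c += cost(x, u[t])
        samples[k], costs[k] = u, c
    w = np.exp(-(costs - costs.min()) / lam)         # standard MPPI weights
    w /= w.sum()
    return np.tensordot(w, samples, axes=1)          # weighted average plan
```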
Title: Admittance Controller Complemented with Real-time Singularity Avoidance for Rehabilitation Parallel Robots
Authors: Jose L. Pulloquinga, Rafael J. Escarabajal, Marina Valles
[UpdateTime:]2024-01-17
[Downlink:]http://arxiv.org/abs/2401.09132v1
Abstract: Rehabilitation tasks demand robust and accurate trajectory-tracking performance, mainly achieved with parallel robots. In this field, limiting the value of the force exerted on the patient is crucial, especially when an injured limb is involved. In human-robot interaction studies, the admittance controller modifies the location of the robot according to the user efforts driving the end-effector to an arbitrary location within the workspace. However, a parallel robot has singularities within the workspace, making implementing a conventional admittance controller unsafe. Thus, this study proposes an admittance controller that overcomes the limitations of singular configurations by using a real-time singularity avoidance algorithm. The singularity avoidance algorithm modifies the original trajectory based on the actual location of the parallel robot. The complemented admittance controller is applied to a 4 degrees of freedom parallel robot for knee rehabilitation. In this case, the actual location is measured by a 3D tracking system because the location calculated by the forward kinematics is inaccurate in the vicinity of a singularity. The experimental results verify the effectiveness of the proposed admittance controller for safe knee rehabilitation exercises.
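For readers unfamiliar with admittance control, the standard update maps the measured interaction force to a compliant motion of the reference via M·ẍ + D·ẋ + K·(x − x_ref) = F_ext. The sketch below implements that textbook law with a hypothetical `near_singularity` predicate standing in for the paper's real-time avoidance algorithm, which is not reproduced here.

```python
# Textbook admittance update with a placeholder singularity check
# (a sketch; the paper's real-time avoidance algorithm is not reproduced).
import numpy as np


def admittance_step(x, x_dot, x_ref, f_ext, M, D, K, dt, near_singularity):
    """Advance the compliant reference by one time step.
    x, x_dot, x_ref, f_ext: (n,) arrays; M, D, K: (n, n) matrices."""
    # Solve M * x_ddot = F_ext - D * x_dot - K * (x - x_ref)
    x_ddot = np.linalg.solve(M, f_ext - D @ x_dot - K @ (x - x_ref))
    x_dot_new = x_dot + x_ddot * dt
    x_new = x + x_dot_new * dt
    if near_singularity(x_new):
        # Placeholder for the paper's trajectory modification: hold the last
        # safe location instead of letting the user drive into a singularity.
        return x, np.zeros_like(x_dot)
    return x_new, x_dot_new
```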
Title: Hands-On Robotics: Enabling Communication Through Direct Gesture Control
Authors: Max Pascher, Alia Saad, Jonathan Liebers
[UpdateTime:]2024-01-17
[Downlink:]http://arxiv.org/abs/2401.09077v1
Abstract: Effective Human-Robot Interaction (HRI) is fundamental to seamlessly integrating robotic systems into our daily lives. However, current communication modes require additional technological interfaces, which can be cumbersome and indirect. This paper presents a novel approach, using direct motion-based communication by moving a robot's end effector. Our strategy enables users to communicate with a robot by using four distinct gestures – two handshakes ('formal' and 'informal') and two letters ('W' and 'S'). As a proof-of-concept, we conducted a user study with 16 participants, capturing subjective experience ratings and objective data for training machine learning classifiers. Our findings show that the four different gestures performed by moving the robot's end effector can be distinguished with close to 100% accuracy. Our research offers implications for the design of future HRI interfaces, suggesting that motion-based interaction can empower human operators to communicate directly with robots, removing the necessity for additional hardware.
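A minimal version of the classification step could resample each recorded end-effector trajectory to a fixed length and train an off-the-shelf classifier on the flattened coordinates. The feature choice and SVM below are illustrative assumptions, not the study's actual pipeline.

```python
# Illustrative gesture classifier over recorded end-effector trajectories
# (assumed features and model; not the study's actual pipeline).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def resample(traj: np.ndarray, n: int = 50) -> np.ndarray:
    """Linearly resample a (T, 3) end-effector trajectory to n points."""
    t_old = np.linspace(0, 1, len(traj))
    t_new = np.linspace(0, 1, n)
    return np.column_stack([np.interp(t_new, t_old, traj[:, d])
                            for d in range(traj.shape[1])])


def train_gesture_classifier(trajectories, labels):
    """trajectories: list of (T_i, 3) arrays; labels: gesture names."""
    X = np.stack([resample(t).ravel() for t in trajectories])
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.25, random_state=0)
    clf = SVC(kernel="rbf").fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))
    return clf
```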
== Reinforcement Learning ==
Title: CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents
Authors: Siyuan Qi, Shuo Chen, Yexin Li
[UpdateTime:]2024-01-19
[Downlink:]http://arxiv.org/abs/2401.10568v1
[GitHub:]https://github.com/bigai-ai/civrealm
Abstract: The generalization of decision-making agents encompasses two fundamental elements: learning from past experiences and reasoning in novel contexts. However, the predominant emphasis in most interactive environments is on learning, often at the expense of complexity in reasoning. In this paper, we introduce CivRealm, an environment inspired by the Civilization game. Civilization's profound alignment with human history and society necessitates sophisticated learning, while its ever-changing situations demand strong reasoning to generalize. Particularly, CivRealm sets up an imperfect-information general-sum game with a changing number of players; it presents a plethora of complex features, challenging the agent to deal with open-ended stochastic environments that require diplomacy and negotiation skills. Within CivRealm, we provide interfaces for two typical agent types: tensor-based agents that focus on learning, and language-based agents that emphasize reasoning. To catalyze further research, we present initial results for both paradigms. The canonical RL-based agents exhibit reasonable performance in mini-games, whereas both RL- and LLM-based agents struggle to make substantial progress in the full game. Overall, CivRealm stands as a unique learning and reasoning challenge for decision-making agents. The code is available at https://github.com/bigai-ai/civrealm.
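For orientation, a tensor-based agent would interact with such an environment through a standard reset/step loop. The sketch below assumes a gym-style five-tuple interface and a hypothetical `agent.act` method; CivRealm's actual API may differ, so consult the repository linked above.

```python
# Generic agent-environment loop under an assumed gym-style interface.
# The actual CivRealm API may differ; see https://github.com/bigai-ai/civrealm.
def run_episode(env, agent, max_steps=1000):
    """Run one episode and return the cumulative reward."""
    obs, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)          # tensor-based or language-based policy
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```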
Title: LangProp: A code optimization framework using Language Models applied to driving
Authors: Shu Ishida, Gianluca Corrado, George Fedoseev
[UpdateTime:]2024-01-18
[Downlink:]http://arxiv.org/abs/2401.10314v1
[GitHub:]https://github.com/shuishida/LangProp
Abstract: LangProp is a framework for iteratively optimizing code generated by large language models (LLMs) in a supervised/reinforcement learning setting. While LLMs can generate sensible solutions zero-shot, the solutions are often sub-optimal. Especially for code generation tasks, it is likely that the initial code will fail on certain edge cases. LangProp automatically evaluates the code performance on a dataset of input-output pairs, as well as catches any exceptions, and feeds the results back to the LLM in the training loop, so that the LLM can iteratively improve the code it generates. By adopting a metric- and data-driven training paradigm for this code optimization procedure, one could easily adapt findings from traditional machine learning techniques such as imitation learning, DAgger, and reinforcement learning. We demonstrate the first proof of concept of automated code optimization for autonomous driving in CARLA, showing that LangProp can generate interpretable and transparent driving policies that can be verified and improved in a metric- and data-driven way. Our code will be open-sourced and is available at https://github.com/shuishida/LangProp.
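The training-loop idea can be sketched generically: ask an LLM for code, score it on input-output pairs while catching exceptions, and feed the failures back in the next prompt. The `llm_generate` callable and the convention that generated code defines a `solve` function are assumptions for illustration, not LangProp's actual interface.

```python
# Sketch of an iterative code-optimization loop in the spirit of LangProp:
# generate code with an LLM, score it on input-output pairs, feed failures back.
# llm_generate is a hypothetical wrapper around whatever LLM API is used.
import traceback


def optimize_code(llm_generate, dataset, prompt, iterations=5):
    """dataset: list of (inputs, expected) pairs, where inputs is a tuple."""
    code = llm_generate(prompt)
    for _ in range(iterations):
        failures = []
        for inputs, expected in dataset:
            try:
                namespace = {}
                exec(code, namespace)                  # generated code must define solve()
                result = namespace["solve"](*inputs)
                if result != expected:
                    failures.append((inputs, expected, result))
            except Exception:
                failures.append((inputs, expected, traceback.format_exc()))
        if not failures:
            break
        # Feed the score and concrete failure cases back so the LLM can revise.
        feedback = f"{len(failures)} of {len(dataset)} cases failed: {failures[:3]}"
        code = llm_generate(prompt + "\n\nPrevious code:\n" + code +
                            "\n\nFeedback:\n" + feedback)
    return code
```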
Title: CQLite: Communication-Efficient Multi-Robot Exploration Using Coverage-biased Distributed Q-Learning
Authors: Ehsan Latif, Ramviyas Parasuraman
[UpdateTime:]2024-01-18
[Downlink:]http://arxiv.org/abs/2307.00500v2
[GitHub:]https://github.com/herolab-uga/cqlite
Abstract: Frontier exploration and reinforcement learning have historically been used to solve the problem of enabling many mobile robots to autonomously and cooperatively explore complex surroundings. These methods need to keep an internal global map for navigation, but they do not take into consideration the high costs of communication and information sharing between robots. This study offers CQLite, a novel distributed Q-learning technique designed to minimize data communication overhead between robots while achieving rapid convergence and thorough coverage in multi-robot exploration. The proposed CQLite method uses ad hoc map merging, and selectively shares updated Q-values at recently identified frontiers to significantly reduce communication costs. The theoretical analysis of CQLite's convergence and efficiency, together with extensive numerical verification on simulated indoor maps utilizing several robots, demonstrates the method's novelty. With over 2x reductions in computation and communication alongside improved mapping performance, CQLite outperformed cutting-edge multi-robot exploration techniques like Rapidly Exploring Random Trees and Deep Reinforcement Learning. Related codes are open-sourced at https://github.com/herolab-uga/cqlite.
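A stripped-down version of the selective-sharing idea: apply a standard Q-learning update locally and broadcast only the Q-entries attached to newly identified frontiers. This is a simplification for illustration; CQLite's actual update rule and message format are given in the paper.

```python
# Sketch of a distributed Q-update with selective sharing, in the spirit of
# CQLite (simplified; not the paper's exact algorithm or message format).
def q_update_and_share(q_table, transition, new_frontiers,
                       actions=(0, 1, 2, 3), alpha=0.1, gamma=0.95):
    """Update the local Q-table from one (s, a, r, s_next) transition and
    return only the entries attached to newly identified frontiers."""
    s, a, r, s_next = transition
    best_next = max(q_table.get((s_next, b), 0.0) for b in actions)
    old = q_table.get((s, a), 0.0)
    q_table[(s, a)] = old + alpha * (r + gamma * best_next - old)
    # Share only frontier-related Q-values to keep inter-robot traffic small.
    return {k: v for k, v in q_table.items() if k[0] in new_frontiers}
```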
Title: Cooperative Edge Caching Based on Elastic Federated and Multi-Agent Deep Reinforcement Learning in Next-Generation Network
Authors: Qiong Wu, Wenhua Wang, Pingyi Fan
[UpdateTime:]2024-01-18
[Downlink:]http://arxiv.org/abs/2401.09886v1
[GitHub:]https://github.com/qiongwu86/Edge-Caching-Based-on-Multi-Agent-Deep-Reinforcement-Learning-and-Federated-Learning
Abstract: Edge caching is a promising solution for next-generation networks by empowering caching units in small-cell base stations (SBSs), which allows user equipments (UEs) to fetch users' requested contents that have been pre-cached in SBSs. It is crucial for SBSs to predict accurate popular contents through learning while protecting users' personal information. Traditional federated learning (FL) can protect users' privacy but the data discrepancies among UEs can lead to a degradation in model quality. Therefore, it is necessary to train personalized local models for each UE to predict popular contents accurately. In addition, the cached contents can be shared among adjacent SBSs in next-generation networks, thus caching predicted popular contents in different SBSs may affect the cost to fetch contents. Hence, it is critical to determine where the popular contents are cached cooperatively. To address these issues, we propose a cooperative edge caching scheme based on elastic federated and multi-agent deep reinforcement learning (CEFMR) to optimize the cost in the network. We first propose an elastic FL algorithm to train the personalized model for each UE, where an adversarial autoencoder (AAE) model is adopted for training to improve the prediction accuracy, then a popular content prediction algorithm is proposed to predict the popular contents for each SBS based on the trained AAE model. Finally, we propose a multi-agent deep reinforcement learning (MADRL) based algorithm to decide where the predicted popular contents are collaboratively cached among SBSs. Our experimental results demonstrate the superiority of our proposed scheme to existing baseline caching schemes.
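As a point of reference for the federated part of the scheme, a plain federated-averaging aggregation step is sketched below. The paper's elastic FL algorithm personalizes this per UE and combines it with an AAE predictor and MADRL placement, none of which is reproduced here.

```python
# Plain FedAvg aggregation step (a generic sketch; the paper's elastic FL
# variant adapts the aggregation per UE, which is not reproduced here).
import numpy as np


def fed_avg(local_weights, sample_counts):
    """local_weights: list of dicts {layer_name: np.ndarray} from each UE;
    sample_counts: number of local samples per UE."""
    total = float(sum(sample_counts))
    aggregated = {}
    for name in local_weights[0]:
        # Weight each UE's parameters by its share of the total data.
        aggregated[name] = sum(w[name] * (n / total)
                               for w, n in zip(local_weights, sample_counts))
    return aggregated
```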
Title: FREED++: Improving RL Agents for Fragment-Based Molecule Generation by Thorough Reproduction
Authors: Alexander Telepov, Artem Tsypin, Kuzma Khrabrov
[UpdateTime:]2024-01-18
[Downlink:]http://arxiv.org/abs/2401.09840v1
[Project:]https://www.jmlr.org/tmlr/
Abstract: A rational design of new therapeutic drugs aims to find a molecular structure with desired biological functionality, e.g., an ability to activate or suppress a specific protein via binding to it. Molecular docking is a common technique for evaluating protein-molecule interactions. Recently, Reinforcement Learning (RL) has emerged as a promising approach to generating molecules with the docking score (DS) as a reward. In this work, we reproduce, scrutinize and improve the recent RL model for molecule generation called FREED (arXiv:2110.01219). Extensive evaluation of the proposed method reveals several limitations and challenges despite the outstanding results reported for three target proteins. Our contributions include fixing numerous implementation bugs and simplifying the model while increasing its quality, significantly extending experiments, and conducting an accurate comparison with current state-of-the-art methods for protein-conditioned molecule generation. We show that the resulting fixed model is capable of producing molecules with superior docking scores compared to alternative approaches.
Title: BridgeData V2: A Dataset for Robot Learning at Scale
Authors: Homer Walke, Kevin Black, Abraham Lee
[UpdateTime:]2024-01-17
[Downlink:]http://arxiv.org/abs/2308.12952v3
[Project:]https://rail-berkeley.github.io/bridgedata
Abstract: We introduce BridgeData V2, a large and diverse dataset of robotic manipulation behaviors designed to facilitate research on scalable robot learning. BridgeData V2 contains 60,096 trajectories collected across 24 environments on a publicly available low-cost robot. BridgeData V2 provides extensive task and environment variability, leading to skills that can generalize across environments, domains, and institutions, making the dataset a useful resource for a broad range of researchers. Additionally, the dataset is compatible with a wide variety of open-vocabulary, multi-task learning methods conditioned on goal images or natural language instructions. In our experiments, we train 6 state-of-the-art imitation learning and offline reinforcement learning methods on our dataset, and find that they succeed on a suite of tasks requiring varying amounts of generalization. We also demonstrate that the performance of these methods improves with more data and higher capacity models, and that training on a greater variety of skills leads to improved generalization. By publicly sharing BridgeData V2 and our pre-trained models, we aim to accelerate research in scalable robot learning methods. Project page at https://rail-berkeley.github.io/bridgedata
== Open vocabulary detection ==
Title: ActAnywhere: Subject-Aware Video Background Generation
Authors: Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang
[UpdateTime:]2024-01-19
[Downlink:]http://arxiv.org/abs/2401.10822v1
[Project:]https://actanywhere.github.io
Abstract: Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing background that aligns with the motion and appearance of the foreground subject, while also complying with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process which traditionally requires tedious manual efforts. Our model leverages the power of large-scale video diffusion models, and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentation as input and an image that describes the desired scene as condition, to produce a coherent video with realistic foreground-background interactions while adhering to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations demonstrate the superior performance of our model, significantly outperforming baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io.
Title: ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation
Authors: Chen Liang, Yu Wu, Yawei Luo
[UpdateTime:]2024-01-19
[Downlink:]http://arxiv.org/abs/2103.10702v4
[Project:]https://ieeexplore.ieee.org/abstract/document/10083244
Abstract: Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language representation into segmentation models in a bottom-up manner, which merely conducts vision-language interaction within local receptive fields of ConvNets. We argue that such interaction is not fulfilled since the model can barely construct region-level relationships given partial observations, which is contrary to the description logic of natural language/referring expressions. In fact, people usually describe a target object using relations with other objects, which may not be easily understood without seeing the whole video. To address the issue, we introduce a novel top-down approach by imitating how we humans segment an object with language guidance. We first figure out all candidate objects in videos and then choose the referred one by parsing relations among those high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding, i.e., positional relation, text-guided semantic relation, and temporal relation. Extensive experiments on A2D Sentences and J-HMDB Sentences show our method outperforms state-of-the-art methods by a large margin. Qualitative results also show our results are more explainable.
Title: Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation
Authors: Chen Liang, Yu Wu, Tianfei Zhou
[UpdateTime:]2024-01-19
[Downlink:]http://arxiv.org/abs/2106.01061v2
[Project:]https://ieeexplore.ieee.org/abstract/document/10083244
Abstract: Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference. Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice. Such a bottom-up strategy fails to explore object-level cues, easily leading to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranks first place on the CVPR2021 Referring Youtube-VOS challenge.
Title: Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching
Authors: Yang Liu, Muzhi Zhu, Hengtao Li
[UpdateTime:]2024-01-19
[Downlink:]http://arxiv.org/abs/2305.13310v2
[GitHub:]https://github.com/aim-uofa/Matcher
Abstract: Powered by large-scale pre-training, vision foundation models exhibit significant potential in open-world image understanding. However, unlike large language models that excel at directly tackling various language tasks, vision foundation models require a task-specific model structure followed by fine-tuning on specific tasks. In this work, we present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks. Matcher can segment anything by using an in-context example without training. Additionally, we design three effective components within the Matcher framework to collaborate with these foundation models and unleash their full potential in diverse perception tasks. Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20^i with one example, surpassing the state-of-the-art specialist model by 1.6%. In addition, Matcher achieves 33.0% mIoU on the proposed LVIS-92^i for one-shot semantic segmentation, outperforming the state-of-the-art generalist model by 14.4%. Our visualization results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild. Our code can be found at https://github.com/aim-uofa/Matcher.
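The core matching intuition can be illustrated with plain cosine similarity between the reference object's patch features and the target image's patch features. Matcher's full pipeline additionally prompts a promptable segmenter and uses further components, so the generic sketch below only conveys the matching step, with assumed feature shapes.

```python
# Generic one-shot matching sketch: score target patches by cosine similarity
# to the reference object's patch features (not Matcher's full pipeline).
import numpy as np


def match_mask(ref_feats, ref_mask, tgt_feats, threshold=0.7, eps=1e-8):
    """ref_feats/tgt_feats: (H, W, C) patch features; ref_mask: (H, W) bool."""
    obj = ref_feats[ref_mask]                                # (N, C) object patches
    obj = obj / (np.linalg.norm(obj, axis=1, keepdims=True) + eps)
    tgt = tgt_feats.reshape(-1, tgt_feats.shape[-1])
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + eps)
    sim = tgt @ obj.T                                        # cosine similarities
    score = sim.max(axis=1).reshape(tgt_feats.shape[:2])     # best match per patch
    return score > threshold                                 # coarse binary mask
```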
Title: Local-Global Context Aware Transformer for Language-Guided Video Segmentation
Authors: Chen Liang, Wenguan Wang, Tianfei Zhou
[UpdateTime:]2024-01-19
[Downlink:]http://arxiv.org/abs/2203.09773v2
[GitHub:]https://github.com/leonnnop/Locater
Abstract: We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representation, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed to involve two components – one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant size memory, while Transformer-style self-attention computation scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that Locater outperforms previous state-of-the-arts. Further, we won the 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where Locater served as the foundation for the winning solution. Our code and dataset are available at: https://github.com/leonnnop/Locater
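A toy version of a constant-size memory: keep a persistent global summary via an exponential moving average plus a small rolling buffer of recent frame features. This is illustrative only; Locater's memory is read and written with learned Transformer operations, and the pooling and momentum here are assumptions.

```python
# Toy constant-size memory in the spirit of Locater: a persistent global
# summary plus a rolling buffer of recent frame features (illustrative only;
# the paper uses learned Transformer read/write operations).
from collections import deque
import numpy as np


class FiniteMemory:
    def __init__(self, local_size=4, momentum=0.9):
        self.global_mem = None                      # persistent video-level summary
        self.local_mem = deque(maxlen=local_size)   # recent temporal context
        self.momentum = momentum

    def write(self, frame_feat: np.ndarray):
        """frame_feat: (C,) pooled feature of the current frame."""
        if self.global_mem is None:
            self.global_mem = frame_feat.copy()
        else:
            # Exponential moving average keeps the memory size constant.
            self.global_mem = (self.momentum * self.global_mem
                               + (1 - self.momentum) * frame_feat)
        self.local_mem.append(frame_feat)

    def read(self) -> np.ndarray:
        """Return the concatenated local-global context for query adaptation."""
        return np.concatenate([self.global_mem, np.mean(self.local_mem, axis=0)])
```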
Title: Symbol as Points: Panoptic Symbol Spotting via Point-based Representation
Authors: Wenlong Liu, Tianyu Yang, Yuhan Wang
[UpdateTime:]2024-01-19
[Downlink:]http://arxiv.org/abs/2401.10556v1
[GitHub:]https://github.com/nicehuster/SymPoint
Abstract: This work studies the problem of panoptic symbol spotting, which is to spot and parse both countable object instances (windows, doors, tables, etc.) and uncountable stuff (wall, railing, etc.) from computer-aided design (CAD) drawings. Existing methods typically involve either rasterizing the vector graphics into images and using image-based methods for symbol spotting, or directly building graphs and using graph neural networks for symbol recognition. In this paper, we take a different approach, which treats graphic primitives as a set of 2D points that are locally connected and use point cloud segmentation methods to tackle it. Specifically, we utilize a point transformer to extract the primitive features and append a mask2former-like spotting head to predict the final output. To better use the local connection information of primitives and enhance their discriminability, we further propose the attention with connection module (ACM) and contrastive connection learning scheme (CCL). Finally, we propose a KNN interpolation mechanism for the mask attention module of the spotting head to better handle primitive mask downsampling, which is primitive-level in contrast to pixel-level for the image. Our approach, named SymPoint, is simple yet effective, outperforming the recent state-of-the-art method GAT-CADNet by an absolute increase of 9.6% PQ and 10.4% RQ on the FloorPlanCAD dataset. The source code and models will be available at https://github.com/nicehuster/SymPoint.
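KNN interpolation over point features is a standard point-cloud operation; the generic inverse-distance version below illustrates the kind of primitive-level upsampling the spotting head needs, though it is not SymPoint's exact module.

```python
# Generic KNN feature interpolation for downsampled point sets (a common
# point-cloud operation, shown as an illustration rather than SymPoint's module).
import numpy as np


def knn_interpolate(query_xy, support_xy, support_feats, k=3, eps=1e-8):
    """Interpolate features at query points from their k nearest support points.
    query_xy: (M, 2); support_xy: (N, 2); support_feats: (N, C)."""
    d = np.linalg.norm(query_xy[:, None, :] - support_xy[None, :, :], axis=-1)  # (M, N)
    idx = np.argsort(d, axis=1)[:, :k]                     # indices of k nearest supports
    nearest_d = np.take_along_axis(d, idx, axis=1)         # (M, k)
    w = 1.0 / (nearest_d + eps)                            # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)
    return (support_feats[idx] * w[..., None]).sum(axis=1)  # (M, C)
```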