[晓丽紫] Daily Paper Digest (with abstracts and source code or project links): Robotics

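Domain-specific paper subscriptions

Follow {晓理紫|小李子} on WeChat (VX) for daily paper updates. If you find this useful, please forward it to anyone who might need it. Thanks for your support.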

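Categories:

Large language models (LLM)
Vision-language models (VLM)
Diffusion models
Visual navigation
Embodied AI and robotics
Reinforcement learning
Open-vocabulary detection and segmentation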

== Embodied Artificial Intelligence ==

Title: Augmented Reality User Interface for Command, Control, and Supervision of Large Multi-Agent Teams

Authors: Frank Regal, Chris Suarez, Fabian Parra

Abstract: Multi-agent human-robot teaming makes it possible to gather information about various environments more efficiently by exploiting and combining the strengths of humans and robots. In industries such as defense, search and rescue, and first response, heterogeneous human-robot teams show promise to accelerate data collection and improve team safety by removing humans from unknown and potentially hazardous situations. This work builds upon AugRE, an Augmented Reality (AR) based scalable human-robot teaming framework. It enables users to localize and communicate with 50+ autonomous agents. Through our efforts, users are able to command, control, and supervise agents in large teams, both line-of-sight and non-line-of-sight, without the need to modify the environment beforehand and without requiring users to use typical hardware (e.g., joysticks, keyboards, laptops, tablets) in the field. The demonstrated work shows early indications that combining these AR-HMD-based user interaction modalities for command, control, and supervision will help improve human-robot team collaboration, robustness, and trust.

[Downlink:]http://arxiv.org/abs/2401.05665v1

[Project:]https://sites.google.com/view/xr-robotics-iros2023/home?authuser=0

Title: Unified Learning from Demonstrations, Corrections, and Preferences during Physical Human-Robot Interaction

Authors: Shaunak A. Mehta, Dylan P. Losey

Abstract: Humans can leverage physical interaction to teach robot arms. This physical interaction takes multiple forms depending on the task, the user, and what the robot has learned so far. State-of-the-art approaches focus on learning from a single modality, or combine multiple interaction types by assuming that the robot has prior information about the human’s intended task. By contrast, in this paper we introduce an algorithmic formalism that unites learning from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human’s inputs to nearby alternatives. We first derive a loss function that trains an ensemble of reward models to match the human’s demonstrations, corrections, and preferences. The type and order of feedback is up to the human teacher: we enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert our learned reward into a desired robot trajectory. Through simulations and a user study we demonstrate that our proposed approach more accurately learns manipulation tasks from physical human interaction than existing baselines, particularly when the robot is faced with new or unexpected objectives. Videos of our user study are available at: https://youtu.be/FSUJsTYvEKU

[Downlink:]http://arxiv.org/abs/2207.03395v2

[Project:]https://youtu.be/FSUJsTYvEKU
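The abstract above describes learning a reward model by comparing the human's input to nearby alternatives. One common way to score such comparisons is a Bradley-Terry (softmax) preference model; the sketch below uses it as a stand-in and is not the loss derived in the paper. The function names and the scalar rewards are hypothetical.

```python
import math
from typing import Sequence


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that option A is preferred over option B."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(reward_a, reward_b)
    exp_a = math.exp(reward_a - m)
    exp_b = math.exp(reward_b - m)
    return exp_a / (exp_a + exp_b)


def preference_loss(reward_chosen: float, reward_alternatives: Sequence[float]) -> float:
    """Negative log-likelihood that the human's input beats each nearby alternative."""
    return -sum(
        math.log(preference_probability(reward_chosen, r))
        for r in reward_alternatives
    )
```

A reward model that scores the human's input above its alternatives yields a lower loss, which is the signal a learner would minimize.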

Title: Transferability of HRI Research: Potential and Challenges

Authors: Wafa Johal

Abstract: With the advancement of robotics and artificial intelligence, applications for robotics are flourishing. Human-robot interaction (HRI) is an important area of robotics as it allows robots to work closer to humans (with them or for them). One crucial factor for the success of HRI research is transferability, which refers to the ability of research outputs to be adopted by industry and provide benefits to society. In this paper, we explore the potential and challenges of transferability in HRI research. Firstly, we examine the current state of HRI research and identify various types of contributions that could lead to successful outcomes. Secondly, we discuss the potential benefits for each type of contribution and identify factors that could facilitate industry adoption of HRI research. However, we also recognize that there are several challenges associated with transferability, such as the diversity of well-defined job/skill-sets required from HRI practitioners, the lack of industry-led research, and the lack of standardization in HRI research methods. We discuss these challenges and propose potential solutions to bridge the gap between industry expectations and academic research in HRI.

[Downlink:]http://arxiv.org/abs/2401.05802v1

Title: Theory of Mind abilities of Large Language Models in Human-Robot Interaction: An Illusion?

Authors: Mudit Verma, Siddhant Bhambri, Subbarao Kambhampati

Abstract: Large Language Models have shown exceptional generative abilities in various natural language and generation tasks. However, possible anthropomorphization and leniency towards failure cases have propelled discussions on emergent abilities of Large Language Models, especially on Theory of Mind (ToM) abilities. While several false-belief tests exist to verify the ability to infer and maintain mental models of another entity, we study a special application of ToM abilities that has higher stakes and possibly irreversible consequences: Human-Robot Interaction. In this work, we explore the task of Perceived Behavior Recognition, where a robot employs a Large Language Model (LLM) to assess the robot’s generated behavior in a manner similar to a human observer. We focus on four behavior types, namely explicable, legible, predictable, and obfuscatory behavior, which have been extensively used to synthesize interpretable robot behaviors. The LLM’s goal is therefore to be a human proxy to the agent, and to answer how a certain agent behavior would be perceived by the human in the loop, for example “Given a robot’s behavior X, would the human observer find it explicable?”. We conduct a human subject study to verify that the users are able to correctly answer such a question in the curated situations (robot setting and plan) across five domains. A first analysis of the belief test yields extremely positive results, inflating one’s expectations of LLMs possessing ToM abilities. We then propose and perform a suite of perturbation tests that break this illusion, i.e. Inconsistent Belief, Uninformative Context and Conviction Test. We conclude that the high score of LLMs on vanilla prompts showcases their potential use in HRI settings; however, possessing ToM demands invariance to trivial or irrelevant perturbations in the context, which LLMs lack.

[Downlink:]http://arxiv.org/abs/2401.05302v1

Title: Evaluating Gesture Recognition in Virtual Reality

Authors: Sandeep Reddy Sabbella, Sara Kaszuba, Francesco Leotta

Abstract: Human-Robot Interaction (HRI) has become increasingly important as robots are being integrated into various aspects of daily life. One key aspect of HRI is gesture recognition, which allows robots to interpret and respond to human gestures in real time. Gesture recognition plays an important role in non-verbal communication in HRI. To this end, there is ongoing research on how such non-verbal communication can strengthen verbal communication and improve the system’s overall efficiency, thereby enhancing the user experience with the robot. However, several challenges need to be addressed in gesture recognition systems, including data generation, transferability, scalability, generalizability, standardization, and the lack of benchmarking of gestural systems. In this preliminary paper, we address the challenge of data generation using virtual reality simulations, and the standardization issue by presenting gestures for a set of commands that can be used as a standard in ground robots.

[Downlink:]http://arxiv.org/abs/2401.04545v1

Title: Testing Human-Robot Interaction in Virtual Reality: Experience from a Study on Speech Act Classification

Authors: Sara Kaszuba, Sandeep Reddy Sabbella, Francesco Leotta

Abstract: In recent years, an increasing number of Human-Robot Interaction (HRI) approaches have been implemented and evaluated in Virtual Reality (VR), as it speeds up design iterations and makes it safer for the final user to evaluate and master the HRI primitives. However, identifying the most suitable VR experience is not straightforward. In this work, we evaluate how, in a smart agriculture scenario, immersive and non-immersive VR are perceived by users with respect to a speech act understanding task. In particular, we collect opinions and suggestions from the 81 participants involved in both experiments to highlight the strengths and weaknesses of these different experiences.

[Downlink:]http://arxiv.org/abs/2401.04534v1

== Reinforcement Learning ==

Title: Personalized Reinforcement Learning with a Budget of Policies

Authors: Dmitry Ivanov, Omer Ben-Porat

Abstract: Personalization in machine learning (ML) tailors models’ decisions to the individual characteristics of users. While this approach has seen success in areas like recommender systems, its expansion into high-stakes fields such as healthcare and autonomous driving is hindered by the extensive regulatory approval processes involved. To address this challenge, we propose a novel framework termed represented Markov Decision Processes (r-MDPs) that is designed to balance the need for personalization with the regulatory constraints. In an r-MDP, we cater to a diverse user population, each with unique preferences, through interaction with a small set of representative policies. Our objective is twofold: efficiently match each user to an appropriate representative policy and simultaneously optimize these policies to maximize overall social welfare. We develop two deep reinforcement learning algorithms that efficiently solve r-MDPs. These algorithms draw inspiration from the principles of classic K-means clustering and are underpinned by robust theoretical foundations. Our empirical investigations, conducted across a variety of simulated environments, showcase the algorithms’ ability to facilitate meaningful personalization even under constrained policy budgets. Furthermore, they demonstrate scalability, efficiently adapting to larger policy budgets.

[Downlink:]http://arxiv.org/abs/2401.06514v1

[GitHub:]https://github.com/dimonenka/RL_policy_budget
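The matching objective in the abstract above resembles the assignment step of K-means: each user is matched to whichever representative policy serves that user best. The sketch below shows only that assignment step on a hypothetical utility table; the policy-optimization step (the analogue of updating centroids) is omitted, and none of the names come from the authors' repository.

```python
from typing import List, Sequence, Tuple


def assign_users(utilities: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Match each user to their best policy and return the total welfare.

    utilities[u][p] is the (assumed known) expected return of user u
    under representative policy p.
    """
    assignment = []
    welfare = 0.0
    for row in utilities:
        best = max(range(len(row)), key=lambda p: row[p])
        assignment.append(best)
        welfare += row[best]
    return assignment, welfare


# Three hypothetical users, a budget of two policies.
utilities = [
    [1.0, 0.2],
    [0.9, 0.4],
    [0.1, 0.8],
]
assignment, welfare = assign_users(utilities)
```

Here the first two users share policy 0 and the third uses policy 1, so a budget of two policies still gives each user a reasonably well-matched policy.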

Title: Bridging RL Theory and Practice with the Effective Horizon

Authors: Cassidy Laidlaw, Stuart Russell, Anca Dragan

Abstract: Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We choose to focus on deterministic environments because they share many interesting properties of stochastic environments, but are easier to analyze. Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e. when it is optimal to be greedy on the random policy’s Q function), deep RL tends to succeed; when they don’t, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon

[Downlink:]http://arxiv.org/abs/2304.09853v3

[GitHub:]https://github.com/cassidylaidlaw/effective-horizon
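The property described in the abstract above (acting greedily on the random policy's Q function is already optimal) can be checked exactly on a small deterministic MDP. The sketch below does so at a single state for two hypothetical toy MDPs; it does not compute the effective horizon itself and is unrelated to the authors' code.

```python
def q_values(transitions, state, optimal):
    """Exact Q-values at `state` under the optimal or the uniform random policy.

    transitions[state] is a list with one (next_state, reward) pair per action;
    a state with an empty list is terminal.
    """
    qs = []
    for next_state, reward in transitions[state]:
        next_qs = q_values(transitions, next_state, optimal)
        if not next_qs:
            next_value = 0.0
        elif optimal:
            next_value = max(next_qs)
        else:
            next_value = sum(next_qs) / len(next_qs)
        qs.append(reward + next_value)
    return qs


def greedy_on_random_is_optimal(transitions, state):
    """True if the best action under the random policy's Q is also optimal."""
    q_rand = q_values(transitions, state, optimal=False)
    q_opt = q_values(transitions, state, optimal=True)
    greedy = max(range(len(q_rand)), key=lambda a: q_rand[a])
    return q_opt[greedy] == max(q_opt)


# Hypothetical MDP where random rollouts already point to the best action.
EASY_MDP = {
    "s0": [("s1", 0.0), ("s2", 1.0)],
    "s1": [("end", 5.0), ("end", 4.0)],
    "s2": [("end", 0.0), ("end", 1.0)],
    "end": [],
}

# Hypothetical MDP where a large penalty hides the best branch from random rollouts.
HARD_MDP = {
    "s0": [("s1", 0.0), ("s2", 1.0)],
    "s1": [("end", 10.0), ("end", -20.0)],
    "s2": [("end", 0.0), ("end", 0.0)],
    "end": [],
}
```

In `EASY_MDP` the check succeeds at `s0`, while in `HARD_MDP` it fails because random play after the first action averages a reward of 10 with a penalty of -20.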

Title: Enhancing variational quantum state diagonalization using reinforcement learning techniques

Authors: Akash Kundu, Przemysław Bedełek, Mateusz Ostaszewski

Abstract: Variational quantum algorithms are crucial for the application of NISQ computers. Such algorithms require short quantum circuits, which are more amenable to implementation on near-term hardware, and many such methods have been developed. One of particular interest is the so-called variational quantum state diagonalization method, which constitutes an important algorithmic subroutine and can be used directly to work with data encoded in quantum states. In particular, it can be applied to discern the features of quantum states, such as entanglement properties of a system, or in quantum machine learning algorithms. In this work, we tackle the problem of designing a very shallow quantum circuit, required in the quantum state diagonalization task, by utilizing reinforcement learning (RL). We use a novel encoding method for the RL-state, a dense reward function, and an $\epsilon$-greedy policy to achieve this. We demonstrate that the circuits proposed by the reinforcement learning methods are shallower than the standard variational quantum state diagonalization algorithm and thus can be used in situations where hardware capabilities limit the depth of quantum circuits. The methods we propose in the paper can be readily adapted to address a wide range of variational quantum algorithms.

[Downlink:]http://arxiv.org/abs/2306.11086v3

[GitHub:]https://github.com/iitis/RL_for_VQSD_ansatz_optimization
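The abstract above mentions an $\epsilon$-greedy policy. As a reminder of what that means, here is a minimal generic sketch; the action list is hypothetical (in the paper's setting an action would presumably correspond to a circuit-building choice), and the code is not taken from the authors' repository.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q = [0.1, 0.7, 0.3]  # hypothetical action-value estimates
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)
explore_action = epsilon_greedy(q, epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the choice is always the highest-valued action; with `epsilon=1.0` it is always random. Training schedules typically decay epsilon between those extremes.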

Title: Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint

Authors: Zhipeng Chen, Kun Zhou, Wayne Xin Zhao

Abstract: Reinforcement learning (RL) has been widely used in training large language models (LLMs) for preventing unexpected outputs, e.g., reducing harmfulness and errors. However, existing RL methods mostly adopt the instance-level reward, which is unable to provide fine-grained supervision for complex reasoning tasks, and cannot focus on the few key tokens that lead to the errors. To address this, we propose a new RL method named RLMEC that incorporates a generative model as the reward model, which is trained by the erroneous solution rewriting task under the minimum editing constraint, and can produce token-level rewards for RL training. Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing the RL process. Both objectives focus on the learning of the key tokens for the erroneous solution, reducing the effect of other unimportant tokens. The experiment results on mathematical tasks and question-answering tasks have demonstrated the effectiveness of our approach. Our code and data are available at https://github.com/RUCAIBox/RLMEC.

[Downlink:]http://arxiv.org/abs/2401.06081v1

[GitHub:]https://github.com/RUCAIBox/RLMEC
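To make the idea of token-level rewards from a minimally edited rewrite concrete, the toy sketch below diffs an erroneous solution against its corrected rewrite and penalizes only the tokens that had to change. This is a stand-in that uses a plain sequence diff; the paper uses a trained generative reward model, and the reward values and function name here are assumptions.

```python
import difflib


def token_level_rewards(wrong_tokens, rewritten_tokens):
    """Return +1.0 for tokens the rewrite kept and -1.0 for tokens it replaced or deleted."""
    rewards = [1.0] * len(wrong_tokens)
    matcher = difflib.SequenceMatcher(a=wrong_tokens, b=rewritten_tokens, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                rewards[i] = -1.0
    return rewards


wrong = "2 + 3 = 6".split()
fixed = "2 + 3 = 5".split()
rewards = token_level_rewards(wrong, fixed)
```

Only the final token of the wrong solution receives a negative reward, which is the kind of focus on key tokens that an instance-level reward cannot provide.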

Title: Open-Source Reinforcement Learning Environments Implemented in MuJoCo with Franka Manipulator

Authors: Zichun Xu, Yuntao Li, Xiaohang Yang

Abstract: This paper presents three open-source reinforcement learning environments developed on the MuJoCo physics engine with the Franka Emika Panda arm in MuJoCo Menagerie. Three representative tasks, push, slide, and pick-and-place, are implemented through the Gymnasium Robotics API, which inherits from the core of Gymnasium. Both the sparse binary and dense rewards are supported, and the observation space contains the keys of desired and achieved goals to follow the Multi-Goal Reinforcement Learning framework. Three different off-policy algorithms are used to validate the simulation attributes to ensure the fidelity of all tasks, and benchmark results are also given. Each environment and task are defined in a clean way, and the main parameters for modifying the environment are preserved to reflect the main difference. The repository, including all environments, is available at https://github.com/zichunxx/panda_mujoco_gym.

[Downlink:]http://arxiv.org/abs/2312.13788v2

[GitHub:]https://github.com/zichunxx/panda_mujoco_gym
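The difference between the sparse binary and dense rewards mentioned above can be sketched from the achieved and desired goals alone. The function below is loosely modeled on the usual multi-goal convention; its name, signature, and the 0.05 distance threshold are assumptions rather than values taken from the repository.

```python
import math


def compute_reward(achieved_goal, desired_goal, reward_type="sparse", threshold=0.05):
    """Reward from goal positions: sparse is 0/-1, dense is the negative distance."""
    distance = math.dist(achieved_goal, desired_goal)
    if reward_type == "sparse":
        # 0 when the goal is reached within the threshold, -1 otherwise.
        return -1.0 if distance > threshold else 0.0
    return -distance


near = compute_reward((0.0, 0.0, 0.0), (0.0, 0.0, 0.01))
far = compute_reward((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
dense = compute_reward((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), reward_type="dense")
```

The sparse variant only signals success or failure, whereas the dense variant gives a graded signal that shrinks toward zero as the object approaches the goal.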

Title: Edge Generation Scheduling for DAG Tasks Using Deep Reinforcement Learning

Authors: Binqi Sun, Mirco Theile, Ziyuan Qin

Abstract: Directed acyclic graph (DAG) tasks are currently adopted in the real-time domain to model complex applications from the automotive, avionics, and industrial domains that implement their functionalities through chains of intercommunicating tasks. This paper studies the problem of scheduling real-time DAG tasks by presenting a novel schedulability test based on the concept of trivial schedulability. Using this schedulability test, we propose a new DAG scheduling framework (edge generation scheduling, EGS) that attempts to minimize the DAG width by iteratively generating edges while guaranteeing the deadline constraint. We study how to efficiently solve the problem of generating edges by developing a deep reinforcement learning algorithm combined with a graph representation neural network to learn an efficient edge generation policy for EGS. We evaluate the effectiveness of the proposed algorithm by comparing it with state-of-the-art DAG scheduling heuristics and an optimal mixed-integer linear programming baseline. Experimental results show that the proposed algorithm outperforms the state-of-the-art by requiring fewer processors to schedule the same DAG tasks. The code is available at https://github.com/binqi-sun/egs.

[Downlink:]http://arxiv.org/abs/2308.14647v2

[GitHub:]https://github.com/binqi-sun/egs
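Adding an edge to a DAG task serializes two nodes, which can lengthen the longest (critical) path. A natural check when generating edges is therefore whether the critical path still fits within the deadline. The sketch below implements that check under the assumption that each node has a known execution time; it illustrates the deadline constraint only and is not the schedulability test proposed in the paper.

```python
from typing import Dict, List, Tuple


def critical_path_length(wcet: Dict[str, float], edges: List[Tuple[str, str]]) -> float:
    """Length of the longest path, where each node contributes its execution time."""
    successors = {node: [] for node in wcet}
    indegree = {node: 0 for node in wcet}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    # Kahn's algorithm: a node is processed only after all its predecessors.
    ready = [node for node in wcet if indegree[node] == 0]
    finish = {node: wcet[node] for node in wcet}
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for nxt in successors[node]:
            finish[nxt] = max(finish[nxt], finish[node] + wcet[nxt])
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if visited != len(wcet):
        raise ValueError("graph contains a cycle")
    return max(finish.values())


def edge_keeps_deadline(wcet, edges, new_edge, deadline):
    """True if adding new_edge keeps the graph acyclic and within the deadline."""
    try:
        return critical_path_length(wcet, edges + [new_edge]) <= deadline
    except ValueError:
        return False


# Hypothetical four-node task with execution times per node.
wcet = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 1.0}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
```

In this example, serializing `b` before `c` raises the critical path from 5 to 7, so the edge is acceptable for a deadline of 7 but not for a deadline of 6.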

== Open Vocabulary Detection ==

Title: Seeing the roads through the trees: A benchmark for modeling spatial dependencies with aerial imagery

Authors: Caleb Robinson, Isaac Corley, Anthony Ortiz

Abstract: Fully understanding a complex high-resolution satellite or aerial imagery scene often requires spatial reasoning over a broad relevant context. The human object recognition system is able to understand objects in a scene over a long-range relevant context. For example, if a human observes an aerial scene that shows sections of road broken up by tree canopy, then they will be unlikely to conclude that the road has actually been broken up into disjoint pieces by trees and instead think that the canopy of nearby trees is occluding the road. However, there is limited research being conducted to understand long-range context understanding of modern machine learning models. In this work we propose a road segmentation benchmark dataset, Chesapeake Roads Spatial Context (RSC), for evaluating the spatial long-range context understanding of geospatial machine learning models and show how commonly used semantic segmentation models can fail at this task. For example, we show that a U-Net trained to segment roads from background in aerial imagery achieves an 84% recall on unoccluded roads, but just 63.5% recall on roads covered by tree canopy despite being trained to model both the same way. We further analyze how the performance of models changes as the relevant context for a decision (unoccluded roads in our case) varies in distance. We release the code to reproduce our experiments and dataset of imagery and masks to encourage future research in this direction: https://github.com/isaaccorley/ChesapeakeRSC.

[Downlink:]http://arxiv.org/abs/2401.06762v1

[GitHub:]https://github.com/isaaccorley/ChesapeakeRSC
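The headline numbers in the abstract above are recall values computed separately over unoccluded and canopy-covered road pixels. The sketch below shows that per-subset recall on tiny hypothetical masks (flattened to lists); the data and the function name are illustrative assumptions and do not come from the benchmark.

```python
def recall(pred, truth, subset):
    """Recall restricted to pixels where subset is truthy: TP / (TP + FN)."""
    tp = sum(1 for p, t, s in zip(pred, truth, subset) if s and t and p)
    fn = sum(1 for p, t, s in zip(pred, truth, subset) if s and t and not p)
    return tp / (tp + fn) if tp + fn else float("nan")


# Hypothetical flattened masks: 1 = road, 0 = background.
truth = [1, 1, 1, 1, 1, 1, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 1]
occluded = [0, 0, 0, 0, 1, 1, 0, 0]  # 1 = pixel lies under tree canopy
unoccluded = [1 - o for o in occluded]

recall_unoccluded = recall(pred, truth, unoccluded)
recall_occluded = recall(pred, truth, occluded)
```

Splitting the metric this way exposes a gap that a single overall recall figure would average away.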

Title: Scalable 3D Panoptic Segmentation With Superpoint Graph Clustering

Authors: Damien Robert, Hugo Raguet, Loic Landrieu

Abstract: We introduce a highly efficient method for panoptic segmentation of large 3D point clouds by redefining this task as a scalable graph clustering problem. This approach can be trained using only local auxiliary tasks, thereby eliminating the resource-intensive instance-matching step during training. Moreover, our formulation can easily be adapted to the superpoint paradigm, further increasing its efficiency. This allows our model to process scenes with millions of points and thousands of objects in a single inference. Our method, called SuperCluster, achieves a new state-of-the-art panoptic segmentation performance for two indoor scanning datasets: 50.1 PQ (+7.8) for S3DIS Area 5, and 58.7 PQ (+25.2) for ScanNetV2. We also set the first state-of-the-art for two large-scale mobile mapping benchmarks: KITTI-360 and DALES. With only 209k parameters, our model is over 30 times smaller than the best-competing method and trains up to 15 times faster. Our code and pretrained models are available at https://github.com/drprojects/superpoint_transformer.

[Downlink:]http://arxiv.org/abs/2401.06704v1

[GitHub:]https://github.com/drprojects/superpoint_transformer
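The PQ figures quoted above refer to panoptic quality. Under its standard definition, predicted and ground-truth segments are matched when their IoU exceeds 0.5, and PQ divides the summed IoU of the matches by the number of matches plus half the unmatched predictions and half the unmatched ground-truth segments. The sketch below assumes the matching has already been done and uses made-up numbers.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """PQ = sum of IoUs over matched pairs / (TP + 0.5 * FP + 0.5 * FN)."""
    true_positives = len(matched_ious)
    denominator = true_positives + 0.5 * num_false_positives + 0.5 * num_false_negatives
    return sum(matched_ious) / denominator if denominator else 0.0


# Two matched segments, one spurious prediction, one missed object (hypothetical).
pq = panoptic_quality([0.8, 0.6], num_false_positives=1, num_false_negatives=1)
```

Benchmarks usually report this value averaged over classes and scaled to a percentage, which is presumably how the 50.1 and 58.7 figures should be read.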

Title: Shape-IoU: More Accurate Metric considering Bounding Box Shape and Scale

Authors: Hao Zhang, Shuaijie Zhang

Abstract: As an important component of the detector localization branch, bounding box regression loss plays a significant role in object detection tasks. The existing bounding box regression methods usually consider the geometric relationship between the GT box and the predicted box, and calculate the loss by using the relative position and shape of the bounding boxes, while ignoring the influence of inherent properties such as the shape and scale of the bounding boxes on bounding box regression. In order to make up for the shortcomings of existing research, this article proposes a bounding box regression method that focuses on the shape and scale of the bounding box itself. Firstly, we analyzed the regression characteristics of the bounding boxes and found that the shape and scale factors of the bounding boxes themselves will have an impact on the regression results. Based on the above conclusions, we propose the Shape-IoU method, which can calculate the loss by focusing on the shape and scale of the bounding box itself, thereby making the bounding box regression more accurate. Finally, we validated our method through a large number of comparative experiments, which showed that our method can effectively improve detection performance and outperform existing methods, achieving state-of-the-art performance in different detection tasks. Code is available at https://github.com/malagoutou/Shape-IoU

[Downlink:]http://arxiv.org/abs/2312.17663v2

[GitHub:]https://github.com/malagoutou/Shape-IoU
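To see why box scale matters for regression, it helps to look at plain IoU first: the same pixel offset costs a small box far more overlap than a large one. The sketch below computes the standard IoU only (not the Shape-IoU loss from the paper) on hypothetical boxes given as (x1, y1, x2, y2).

```python
def box_iou(box_a, box_b):
    """Standard intersection over union for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# The same 2-pixel horizontal shift applied to a small and a large box.
small_iou = box_iou((0, 0, 10, 10), (2, 0, 12, 10))
large_iou = box_iou((0, 0, 100, 100), (2, 0, 102, 100))
```

The small box loses about a third of its IoU while the large box barely changes, which is the kind of scale dependence the abstract argues a regression loss should take into account.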

Title: Vision Transformers with Hierarchical Attention

Authors: Yun Liu, Yu-Huan Wu, Guolei Sun

Abstract: This paper tackles the high computational/space complexity associated with Multi-Head Self-Attention (MHSA) in vanilla vision transformers. To this end, we propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches as commonly done, and each patch is viewed as a token. Then, the proposed H-MHSA learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies for the small number of the merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. Therefore, HAT-Net provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.

[Downlink:]http://arxiv.org/abs/2106.03180v4

[GitHub:]https://github.com/yun-liu/HAT-Net
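A rough count of query-key pairs shows why restricting attention to local windows plus a small set of merged tokens is cheaper than full self-attention. The sketch below follows one reading of the abstract (local attention inside each window, then attention among merged tokens); the feature-map size, window size, and merge factor are hypothetical, and the paper's exact scheme may differ.

```python
def full_attention_pairs(num_tokens):
    """Every token attends to every token."""
    return num_tokens ** 2


def hierarchical_attention_pairs(num_tokens, window_tokens, merge_factor):
    """Pairs for local attention within windows plus attention among merged tokens.

    window_tokens: tokens per local window.
    merge_factor: number of small tokens merged into one global token.
    """
    num_windows = num_tokens // window_tokens
    local_pairs = num_windows * window_tokens ** 2
    merged_tokens = num_tokens // merge_factor
    global_pairs = merged_tokens ** 2
    return local_pairs + global_pairs


# A 56 x 56 token map, 7 x 7 local windows, 8 x 8 tokens merged per global token.
tokens = 56 * 56
full = full_attention_pairs(tokens)
hierarchical = hierarchical_attention_pairs(tokens, window_tokens=49, merge_factor=64)
```

Under these assumed sizes the hierarchical scheme needs roughly 60 times fewer pairs than full attention, which illustrates the computational saving the abstract refers to.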

Title: SEMv2: Table Separation Line Detection Based on Instance Segmentation

Authors: Zhenrong Zhang, Pengfei Hu, Jiefeng Ma

Abstract: Table structure recognition is an indispensable element for enabling machines to comprehend tables. Its primary purpose is to identify the internal structure of a table. Nevertheless, due to the complexity and diversity of their structure and style, it is highly challenging to parse the tabular data into a structured format that machines can comprehend. In this work, we adhere to the principle of the split-and-merge based methods and propose an accurate table structure recognizer, termed SEMv2 (SEM: Split, Embed and Merge). Unlike the previous works in the “split” stage, we aim to address the table separation line instance-level discrimination problem and introduce a table separation line detection strategy based on conditional convolution. Specifically, we design the “split” in a top-down manner that detects the table separation line instance first and then dynamically predicts the table separation line mask for each instance. The final table separation line shape can be accurately obtained by processing the table separation line mask in a row-wise/column-wise manner. To comprehensively evaluate the SEMv2, we also present a more challenging dataset for table structure recognition, dubbed iFLYTAB, which encompasses multiple style tables in various scenarios such as photos, scanned documents, etc. Extensive experiments on publicly available datasets (e.g. SciTSR, PubTabNet and iFLYTAB) demonstrate the efficacy of our proposed approach. The code and iFLYTAB dataset are available at https://github.com/ZZR8066/SEMv2.

[Downlink:]http://arxiv.org/abs/2303.04384v2

[GitHub:]https://github.com/ZZR8066/SEMv2

Title: SamLP: A Customized Segment Anything Model for License Plate Detection

Authors: Haoxuan Ding, Junyu Gao, Yuan Yuan

Abstract: With the emergence of foundation models, this novel paradigm of deep learning has encouraged many powerful achievements in natural language processing and computer vision. Foundation models have many advantages, such as excellent feature extraction power, strong generalization ability, and great few-shot and zero-shot learning capacity, which are beneficial to vision tasks. License plates (LPs) are the unique identity of a vehicle, yet their styles and appearances vary across countries and regions, and even across vehicle types. However, recent deep learning based license plate detectors are mainly trained on specific datasets, and these limited datasets constrain the effectiveness and robustness of LP detectors. To alleviate the negative impact of limited data, an attempt to exploit the advantages of foundation models is implemented in this paper. We customize a vision foundation model, i.e. Segment Anything Model (SAM), for the LP detection task and propose the first LP detector based on a vision foundation model, named SamLP. Specifically, we design a Low-Rank Adaptation (LoRA) fine-tuning strategy to inject extra parameters into SAM and transfer SAM to the LP detection task. We then further propose a promptable fine-tuning step to provide SamLP with promptable segmentation capacity. The experiments show that our proposed SamLP achieves promising detection performance compared to other LP detectors. Meanwhile, the proposed SamLP has great few-shot and zero-shot learning ability, which shows the potential of transferring vision foundation models. The code is available at https://github.com/Dinghaoxuan/SamLP

[Downlink:]http://arxiv.org/abs/2401.06374v1

[GitHub:]https://github.com/Dinghaoxuan/SamLP
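Low-Rank Adaptation, mentioned in the abstract above, keeps a pretrained weight matrix W frozen and learns a low-rank update, so the effective weight becomes W + (alpha / r) * B A, where r is the adapter rank. The sketch below shows that arithmetic on tiny hypothetical matrices in pure Python; it does not reflect how SamLP wires LoRA into SAM.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical frozen 2 x 2 weight with a rank-1 adapter (B is 2 x 1, A is 1 x 2).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, 0.0]]
adapted = lora_weight(w, lora_b, lora_a, alpha=1.0)

# With B initialized to zeros the adapter leaves the pretrained weight unchanged.
unchanged = lora_weight(w, [[0.0], [0.0]], lora_a, alpha=1.0)
```

Only B and A are trained, so the number of new parameters grows with the rank rather than with the full size of W.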