[晓理紫] Daily Paper Digest (with abstracts and links to source code or project pages): Reinforcement Learning, Imitation Learning, Robotics

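Topic-specific paper subscriptions

Follow {晓理紫} for daily paper updates. If you find this interesting, please forward it to classmates who may need it. Thank you for your support.

If you find this helpful, please follow me; the latest papers are pushed to you on time every day.

To thank readers for their support, starting today 300 readers can subscribe to topic-specific paper updates for free. Just follow the official account on VX (WeChat) and reply with {email + paper topics} (e.g., 123456@xx.com + chatgpt@large language model @LLM). The topics must belong to the same field, with at most three keywords. The blogger reserves the right of final interpretation.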

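Categories:

Large language models (LLM)
Vision models (VLM)
Diffusion models
Vision-and-language navigation (VLN)
Reinforcement learning (RL)
Imitation learning (IL)
Robotics
Open vocabulary, detection and segmentation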

== RL @ RLHF ==

Title: Getting the Ball Rolling: Learning a Dexterous Policy for a Biomimetic Tendon-Driven Hand with Rolling Contact Joints

Authors: Yasunori Toshimitsu, Benedek Forrai, Barnabas Gavin Cangan

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2308.02453v3

Project: https://srl-ethz.github.io/get-ball-rolling/ | https://youtu.be/YahsMhqNU8o

GitHub: https://github.com/srl-ethz/faive_gym_oss

Abstract: Biomimetic, dexterous robotic hands have the potential to replicate much of the tasks that a human can do, and to achieve status as a general manipulation platform. Recent advances in reinforcement learning (RL) frameworks have achieved remarkable performance in quadrupedal locomotion and dexterous manipulation tasks. Combined with GPU-based highly parallelized simulations capable of simulating thousands of robots in parallel, RL-based controllers have become more scalable and approachable. However, in order to bring RL-trained policies to the real world, we require training frameworks that output policies that can work with physical actuators and sensors as well as a hardware platform that can be manufactured with accessible materials yet is robust enough to run interactive policies. This work introduces the biomimetic tendon-driven Faive Hand and its system architecture, which uses tendon-driven rolling contact joints to achieve a 3D printable, robust high-DoF hand design. We model each element of the hand and integrate it into a GPU simulation environment to train a policy with RL, and achieve zero-shot transfer of a dexterous in-hand sphere rotation skill to the physical robot hand.

Title: Sample-efficient Reinforcement Learning in Robotic Table Tennis

Authors: Jonas Tebbe, Lukas Krauch, Yapeng Gao

PubTime: 2024-01-04

Downlink: http://arxiv.org/abs/2011.03275v4

Project: https://youtu.be/uRAtdoL6Wpw

Abstract: Reinforcement learning (RL) has achieved some impressive recent successes in
various computer games and simulations. Most of these successes are based on
having large numbers of episodes from which the agent can learn. In typical
robotic applications, however, the number of feasible attempts is very limited.
In this paper we present a sample-efficient RL algorithm applied to the example
of a table tennis robot. In table tennis every stroke is different, with
varying placement, speed and spin. An accurate return therefore has to be found
depending on a high-dimensional continuous state space. To make learning in few
trials possible the method is embedded into our robot system. In this way we
can use a one-step environment. The state space depends on the ball at hitting
time (position, velocity, spin) and the action is the racket state
(orientation, velocity) at hitting. An actor-critic based deterministic policy
gradient algorithm was developed for accelerated learning. Our approach
performs competitively both in a simulation and on the real robot in a number
of challenging scenarios. Accurate results are obtained without pre-training in
under 200 episodes of training. The video presenting our experiments is
available at https://youtu.be/uRAtdoL6Wpw.
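
The one-step setting mentioned in the abstract can be illustrated with a generic temporal-difference target. The sketch below is not the authors' code: `td_target` and its arguments are hypothetical names, and the discount value is an assumption. It only shows that when every episode ends after a single stroke, the bootstrap term drops out and a critic would regress directly on the observed reward.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """Generic one-step TD target (hypothetical helper).

    When `done` is True the bootstrap term gamma * next_q is masked out,
    so the target is just the immediate reward.
    """
    return reward + gamma * (1.0 - float(done)) * next_q


# Multi-step environment: the target depends on the critic's own estimate.
multi_step = td_target(reward=1.0, next_q=5.0, done=False)

# One-step environment: every transition is terminal, so the target is the
# reward itself and errors in next_q cannot propagate.
one_step = td_target(reward=1.0, next_q=5.0, done=True)

print(multi_step, one_step)
```

Removing the bootstrap term is one plausible reason a one-step formulation helps when only a few hundred real-robot trials are available, although the abstract itself does not spell out the mechanism.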

Title: Bridging the Gap Between Target Networks and Functional Regularization

Authors: Alexandre Piche, Valentin Thomas, Joseph Marino

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2210.12282v2

Project: https://openreview.net/forum?id=BFvoemrmqX

Abstract: Bootstrapping is behind much of the successes of Deep Reinforcement Learning.
However, learning the value function via bootstrapping often leads to unstable
training due to fast-changing target values. Target Networks are employed to
stabilize training by using an additional set of lagging parameters to estimate
the target values. Despite the popularity of Target Networks, their effect on
the optimization is still misunderstood. In this work, we show that they act as
an implicit regularizer. This regularizer has disadvantages such as being
inflexible and non convex. To overcome these issues, we propose an explicit
Functional Regularization that is a convex regularizer in function space and
can easily be tuned. We analyze the convergence of our method theoretically and
empirically demonstrate that replacing Target Networks with the more
theoretically grounded Functional Regularization approach leads to better
sample efficiency and performance improvements.
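
To make the contrast in the abstract concrete, here is a minimal sketch of the two loss shapes on a linear Q-function. It is one reading of the abstract, not the paper's implementation: the function names, the feature matrices `phi` and `phi_next`, and the coefficient `kappa` are all assumptions, and the exact objective in the paper may differ. The target-network loss bootstraps from lagging parameters, while the explicit regularized loss bootstraps from the current parameters and adds a squared penalty on the distance to the lagging function's outputs.

```python
import numpy as np


def q_values(w, features):
    """Linear Q-function: one value per row of `features`."""
    return features @ w


def target_network_loss(w, w_lag, phi, phi_next, r, gamma=0.99):
    # The bootstrap target is computed with the lagging parameters w_lag.
    target = r + gamma * q_values(w_lag, phi_next)
    return np.mean((q_values(w, phi) - target) ** 2)


def functional_reg_loss(w, w_lag, phi, phi_next, r, gamma=0.99, kappa=1.0):
    # The bootstrap target uses the current parameters; in an autodiff
    # framework it would be treated as a constant (stop-gradient).
    target = r + gamma * q_values(w, phi_next)
    td_term = np.mean((q_values(w, phi) - target) ** 2)
    # Explicit penalty in function space: squared distance between the
    # current and lagging Q-values on the same inputs, weighted by kappa.
    reg_term = np.mean((q_values(w, phi) - q_values(w_lag, phi)) ** 2)
    return td_term + kappa * reg_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi, phi_next = rng.normal(size=(16, 4)), rng.normal(size=(16, 4))
    r = rng.normal(size=16)
    w, w_lag = rng.normal(size=4), rng.normal(size=4)
    print(target_network_loss(w, w_lag, phi, phi_next, r))
    print(functional_reg_loss(w, w_lag, phi, phi_next, r, kappa=0.5))
```

In this sketch `kappa` is the single knob that controls how strongly the current function is held near the lagging one, which matches the abstract's claim that the explicit regularizer "can easily be tuned".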

Title: Memory Gym: Towards Endless Tasks to Benchmark Memory Capabilities of Agents

Authors: Marco Pleines, Matthias Pallasch, Frank Zimmer

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2309.17207v3

GitHub: https://github.com/MarcoMeter/endless-memory-gym/

Abstract: Memory Gym presents a suite of 2D partially observable environments, namely
Mortar Mayhem, Mystery Path, and Searing Spotlights, designed to benchmark
memory capabilities in decision-making agents. These environments, originally
with finite tasks, are expanded into innovative, endless formats, mirroring the
escalating challenges of cumulative memory games such as "I packed my bag".
This progression in task design shifts the focus from merely assessing sample
efficiency to also probing the levels of memory effectiveness in dynamic,
prolonged scenarios. To address the gap in available memory-based Deep
Reinforcement Learning baselines, we introduce an implementation that
integrates Transformer-XL (TrXL) with Proximal Policy Optimization. This
approach utilizes TrXL as a form of episodic memory, employing a sliding window
technique. Our comparative study between the Gated Recurrent Unit (GRU) and
TrXL reveals varied performances across different settings. TrXL, on the finite
environments, demonstrates superior sample efficiency in Mystery Path and
outperforms in Mortar Mayhem. However, GRU is more efficient on Searing
Spotlights. Most notably, in all endless tasks, GRU makes a remarkable
resurgence, consistently outperforming TrXL by significant margins. Website and
Source Code: https://github.com/MarcoMeter/endless-memory-gym/

Title: DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality

Authors: Ankur Handa, Arthur Allshire, Viktor Makoviychuk

PubTime: 2024-01-02

Downlink: http://arxiv.org/abs/2210.13702v2

Project: https://dextreme.org/

Abstract: Recent work has demonstrated the ability of deep reinforcement learning (RL)
algorithms to learn complex robotic behaviours in simulation, including in the
domain of multi-fingered manipulation. However, such models can be challenging
to transfer to the real world due to the gap between simulation and reality. In
this paper, we present our techniques to train a) a policy that can perform
robust dexterous manipulation on an anthropomorphic robot hand and b) a robust
pose estimator suitable for providing reliable real-time information on the
state of the object being manipulated. Our policies are trained to adapt to a
wide range of conditions in simulation. Consequently, our vision-based policies
significantly outperform the best vision policies in the literature on the same
reorientation task and are competitive with policies that are given privileged
state information via motion capture systems. Our work reaffirms the
possibilities of sim-to-real transfer for dexterous manipulation in diverse
kinds of hardware and simulator setups, and in our case, with the Allegro Hand
and Isaac Gym GPU-based simulation. Furthermore, it opens up possibilities for
researchers to achieve such results with commonly-available, affordable robot
hands and cameras. Videos of the resulting policy and supplementary
information, including experiments and demos, can be found at
https://dextreme.org/

Title: Multi-agent Reinforcement Learning for Cooperative Lane Changing of Connected and Autonomous Vehicles in Mixed Traffic

Authors: Wei Zhou, Dong Chen, Jun Yan

PubTime: 2024-01-05

Downlink: http://arxiv.org/abs/2111.06318v2

Abstract: Autonomous driving has attracted significant research interests in the past
two decades as it offers many potential benefits, including releasing drivers
from exhausting driving and mitigating traffic congestion, among others.
Despite promising progress, lane-changing remains a great challenge for
autonomous vehicles (AV), especially in mixed and dynamic traffic scenarios.
Recently, reinforcement learning (RL), a powerful data-driven control method,
has been widely explored for lane-changing decision makings in AVs with
encouraging results demonstrated. However, the majority of those studies are
focused on a single-vehicle setting, and lane-changing in the context of
multiple AVs coexisting with human-driven vehicles (HDVs) have received scarce
attention. In this paper, we formulate the lane-changing decision making of
multiple AVs in a mixed-traffic highway environment as a multi-agent
reinforcement learning (MARL) problem, where each AV makes lane-changing
decisions based on the motions of both neighboring AVs and HDVs. Specifically,
a multi-agent advantage actor-critic network (MA2C) is developed with a novel
local reward design and a parameter sharing scheme. In particular, a
multi-objective reward function is proposed to incorporate fuel efficiency,
driving comfort, and safety of autonomous driving. Comprehensive experimental
results, conducted under three different traffic densities and various levels
of human driver aggressiveness, show that our proposed MARL framework
consistently outperforms several state-of-the-art benchmarks in terms of
efficiency, safety and driver comfort.

== Imitation Learning ==

Title: LangProp: A code optimization framework using Language Models applied to driving

Authors: Shu Ishida, Gianluca Corrado, George Fedoseev

PubTime: 2024-01-18

Downlink: http://arxiv.org/abs/2401.10314v1

GitHub: https://github.com/shuishida/LangProp

Abstract: LangProp is a framework for iteratively optimizing code generated by large language models (LLMs) in a supervised/reinforcement learning setting. While LLMs can generate sensible solutions zero-shot, the solutions are often sub-optimal. Especially for code generation tasks, it is likely that the initial code will fail on certain edge cases. LangProp automatically evaluates the code performance on a dataset of input-output pairs, as well as catches any exceptions, and feeds the results back to the LLM in the training loop, so that the LLM can iteratively improve the code it generates. By adopting a metric- and data-driven training paradigm for this code optimization procedure, one could easily adapt findings from traditional machine learning techniques such as imitation learning, DAgger, and reinforcement learning. We demonstrate the first proof of concept of automated code optimization for autonomous driving in CARLA, showing that LangProp can generate interpretable and transparent driving policies that can be verified and improved in a metric- and data-driven way. Our code will be open-sourced and is available at https://github.com/shuishida/LangProp.
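
The evaluation step described in the abstract (score generated code on input-output pairs, catch exceptions, and collect feedback for the next prompt) can be sketched as follows. This is not LangProp's API: `evaluate_candidate`, the `policy` entry point, and the feedback format are hypothetical, and the LLM call that would consume the feedback is omitted. Running generated code with `exec` is only reasonable here because the candidate string is hard-coded; a real system would need sandboxing.

```python
def evaluate_candidate(source, dataset, entry_point="policy"):
    """Score candidate source code on (input, expected_output) pairs.

    Returns (accuracy, feedback), where feedback is a list of messages
    describing wrong answers and raised exceptions.
    """
    namespace = {}
    try:
        exec(source, namespace)  # load the candidate; untrusted code needs a sandbox
        fn = namespace[entry_point]
    except Exception as exc:
        return 0.0, [f"failed to load candidate: {exc!r}"]

    feedback, correct = [], 0
    for x, expected in dataset:
        try:
            got = fn(x)
        except Exception as exc:
            # Exceptions are recorded rather than propagated, so they can be
            # reported back to the code generator.
            feedback.append(f"input {x!r} raised {exc!r}")
            continue
        if got == expected:
            correct += 1
        else:
            feedback.append(f"input {x!r}: expected {expected!r}, got {got!r}")
    return correct / max(len(dataset), 1), feedback


# A candidate that works on typical inputs but fails on an edge case.
candidate = "def policy(x):\n    return 10 // x\n"
pairs = [(1, 10), (2, 5), (0, 0)]
score, notes = evaluate_candidate(candidate, pairs)
print(score, notes)
```

In a training loop of the kind the abstract describes, `score` would rank candidates and `notes` would be appended to the next prompt so the model can repair the failing cases.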

Title: FMB: a Functional Manipulation Benchmark for Generalizable Robotic Learning

Authors: Jianlan Luo, Charles Xu, Fangchen Liu

PubTime: 2024-01-16

Downlink: http://arxiv.org/abs/2401.08553v1

Project: https://functional-manipulation-benchmark.github.io

Abstract: In this paper, we propose a real-world benchmark for studying robotic learning in the context of functional manipulation: a robot needs to accomplish complex long-horizon behaviors by composing individual manipulation skills in functionally relevant ways. The core design principles of our Functional Manipulation Benchmark (FMB) emphasize a harmonious balance between complexity and accessibility. Tasks are deliberately scoped to be narrow, ensuring that models and datasets of manageable scale can be utilized effectively to track progress. Simultaneously, they are diverse enough to pose a significant generalization challenge. Furthermore, the benchmark is designed to be easily replicable, encompassing all essential hardware and software components. To achieve this goal, FMB consists of a variety of 3D-printed objects designed for easy and accurate replication by other researchers. The objects are procedurally generated, providing a principled framework to study generalization in a controlled fashion. We focus on fundamental manipulation skills, including grasping, repositioning, and a range of assembly behaviors. The FMB can be used to evaluate methods for acquiring individual skills, as well as methods for combining and ordering such skills to solve complex, multi-stage manipulation tasks. We also offer an imitation learning framework that includes a suite of policies trained to solve the proposed tasks. This enables researchers to utilize our tasks as a versatile toolkit for examining various parts of the pipeline. For example, researchers could propose a better design for a grasping controller and evaluate it in combination with our baseline reorientation and assembly policies as part of a pipeline for solving multi-stage tasks. Our dataset, object CAD files, code, and evaluation videos can be found on our project website: https://functional-manipulation-benchmark.github.io

Title: Residual Q-Learning: Offline and Online Policy Customization without Value

Authors: Chenran Li, Chen Tang, Haruki Nishimura

PubTime: 2024-01-15

Downlink: http://arxiv.org/abs/2306.09526v3

Project: https://sites.google.com/view/residualq-learning

Abstract: Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations. It is especially appealing for solving complex real-world tasks where handcrafting reward function is difficult, or when the goal is to mimic human expert behavior. However, the learned imitative policy can only follow the behavior in the demonstration. When applying the imitative policy, we may need to customize the policy behavior to meet different requirements coming from diverse downstream tasks. Meanwhile, we still want the customized policy to maintain its imitative nature. To this end, we formulate a new problem setting called policy customization. It defines the learning task as training a policy that inherits the characteristics of the prior policy while satisfying some additional requirements imposed by a target downstream task. We propose a novel and principled approach to interpret and determine the trade-off between the two task objectives. Specifically, we formulate the customization problem as a Markov Decision Process (MDP) with a reward function that combines 1) the inherent reward of the demonstration; and 2) the add-on reward specified by the downstream task. We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy without knowing the inherent reward or value function of the prior policy. We derive a family of residual Q-learning algorithms that can realize offline and online policy customization, and show that the proposed algorithms can effectively accomplish policy customization tasks in various environments. Demo videos and code are available on our website: https://sites.google.com/view/residualq-learning.

Title: Multi-Stage Cable Routing through Hierarchical Imitation Learning

Authors: Jianlan Luo, Charles Xu, Xinyang Geng

PubTime: 2024-01-13

Downlink: http://arxiv.org/abs/2307.08927v5

Project: https://sites.google.com/view/cablerouting

Abstract: We study the problem of learning to perform multi-stage robotic manipulation tasks, with applications to cable routing, where the robot must route a cable through a series of clips. This setting presents challenges representative of complex multi-stage robotic manipulation scenarios: handling deformable objects, closing the loop on visual perception, and handling extended behaviors consisting of multiple steps that must be executed successfully to complete the entire task. In such settings, learning individual primitives for each stage that succeed with a high enough rate to perform a complete temporally extended task is impractical: if each stage must be completed successfully and has a non-negligible probability of failure, the likelihood of successful completion of the entire task becomes negligible. Therefore, successful controllers for such multi-stage tasks must be able to recover from failure and compensate for imperfections in low-level controllers by smartly choosing which controllers to trigger at any given time, retrying, or taking corrective action as needed. To this end, we describe an imitation learning system that uses vision-based policies trained from demonstrations at both the lower (motor control) and the upper (sequencing) level, present a system for instantiating this method to learn the cable routing task, and perform evaluations showing great performance in generalizing to very challenging clip placement variations. Supplementary videos, datasets, and code can be found at https://sites.google.com/view/cablerouting.
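
The compounding-failure argument in the abstract is easy to check numerically. The sketch below assumes that stages succeed independently with the same probability and that a failed stage can simply be retried from a recoverable state; both are simplifying assumptions for illustration, and the function names and numbers are not from the paper.

```python
def chain_success(p_stage, n_stages):
    """Probability that all stages succeed on the first try."""
    return p_stage ** n_stages


def chain_success_with_retries(p_stage, n_stages, max_attempts):
    """Same chain, but each stage may be retried up to max_attempts times."""
    # A stage fails overall only if every one of its attempts fails.
    p_eventual = 1.0 - (1.0 - p_stage) ** max_attempts
    return p_eventual ** n_stages


# Ten stages that each succeed 90% of the time.
no_retry = chain_success(0.9, 10)                     # about 0.349
with_retry = chain_success_with_retries(0.9, 10, 3)   # about 0.990
print(f"{no_retry:.3f} {with_retry:.3f}")
```

Even a fairly reliable primitive yields a poor end-to-end success rate over many stages, while a higher-level policy that can detect failure and retry recovers most of the loss. This is the motivation the abstract gives for learning the sequencing level as well as the motor level.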

Title: Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

Authors: Zipeng Fu, Tony Z. Zhao, Chelsea Finn

PubTime: 2024-01-04

Downlink: http://arxiv.org/abs/2401.02117v1

Project: https://mobile-aloha.github.io

Abstract: Imitation learning from human demonstrations has shown impressive performance
in robotics. However, most results focus on table-top manipulation, lacking the
mobility and dexterity necessary for generally useful tasks. In this work, we
develop a system for imitating mobile manipulation tasks that are bimanual and
require whole-body control. We first present Mobile ALOHA, a low-cost and
whole-body teleoperation system for data collection. It augments the ALOHA
system with a mobile base, and a whole-body teleoperation interface. Using data
collected with Mobile ALOHA, we then perform supervised behavior cloning and
find that co-training with existing static ALOHA datasets boosts performance on
mobile manipulation tasks. With 50 demonstrations for each task, co-training
can increase success rates by up to 90%, allowing Mobile ALOHA to autonomously
complete complex mobile manipulation tasks such as sauteing and serving a piece
of shrimp, opening a two-door wall cabinet to store heavy cooking pots, calling
and entering an elevator, and lightly rinsing a used pan using a kitchen
faucet. Project website: https://mobile-aloha.github.io

Title: Mimicking the Maestro: Exploring the Efficacy of a Virtual AI Teacher in Fine Motor Skill Acquisition

Authors: Hadar Mulian, Segev Shlomov, Lior Limonad

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2310.10280v2

Abstract: Motor skills, especially fine motor skills like handwriting, play an essential role in academic pursuits and everyday life. Traditional methods to teach these skills, although effective, can be time-consuming and inconsistent. With the rise of advanced technologies like robotics and artificial intelligence, there is increasing interest in automating such teaching processes using these technologies, via human-robot and human-computer interactions. In this study, we examine the potential of a virtual AI teacher in emulating the techniques of human educators for motor skill acquisition. We introduce an AI teacher model that captures the distinct characteristics of human instructors. Using a Reinforcement Learning environment tailored to mimic teacher-learner interactions, we tested our AI model against four guiding hypotheses, emphasizing improved learner performance, enhanced rate of skill acquisition, and reduced variability in learning outcomes. Our findings, validated on synthetic learners, revealed significant improvements across all tested hypotheses. Notably, our model showcased robustness across different learners and settings and demonstrated adaptability to handwriting. This research underscores the potential of integrating Reinforcement Learning and Imitation Learning models with robotics in revolutionizing the teaching of critical motor skills.

== Embodied Artificial Intelligence@robotic agent@human robot interaction ==

Title: The Conversation is the Command: Interacting with Real-World Autonomous Robot Through Natural Language

Authors: Linus Nwankwo, Elmar Rueckert

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2401.11838v1

Project: https://osf.io/wzyf6

GitHub: https://github.com/LinusNEP/TCC_IRoNL.git

Abstract: In recent years, autonomous agents have surged in real-world environments such as our homes, offices, and public spaces. However, natural human-robot interaction remains a key challenge. In this paper, we introduce an approach that synergistically exploits the capabilities of large language models (LLMs) and multimodal vision-language models (VLMs) to enable humans to interact naturally with autonomous robots through conversational dialogue. We leveraged the LLMs to decode the high-level natural language instructions from humans and abstract them into precise robot actionable commands or queries. Further, we utilised the VLMs to provide a visual and semantic understanding of the robot’s task environment. Our results with 99.13% command recognition accuracy and 97.96% commands execution success show that our approach can enhance human-robot interaction in real-world applications. The video demonstrations of this paper can be found at https://osf.io/wzyf6 and the code is available at our GitHub repository (https://github.com/LinusNEP/TCC_IRoNL.git).

Title: Augmented Reality User Interface for Command, Control, and Supervision of Large Multi-Agent Teams

Authors: Frank Regal, Chris Suarez, Fabian Parra

PubTime: 2024-01-11

Downlink: http://arxiv.org/abs/2401.05665v1

Project: https://sites.google.com/view/xr-robotics-iros2023/home?authuser=0

Abstract: Multi-agent human-robot teaming allows for the potential to gather information about various environments more efficiently by exploiting and combining the strengths of humans and robots. In industries like defense, search and rescue, first-response, and others alike, heterogeneous human-robot teams show promise to accelerate data collection and improve team safety by removing humans from unknown and potentially hazardous situations. This work builds upon AugRE, an Augmented Reality (AR) based scalable human-robot teaming framework. It enables users to localize and communicate with 50+ autonomous agents. Through our efforts, users are able to command, control, and supervise agents in large teams, both line-of-sight and non-line-of-sight, without the need to modify the environment prior and without requiring users to use typical hardware (i.e. joysticks, keyboards, laptops, tablets, etc.) in the field. The demonstrated work shows early indications that combining these AR-HMD-based user interaction modalities for command, control, and supervision will help improve human-robot team collaboration, robustness, and trust.

Title: Unified Learning from Demonstrations, Corrections, and Preferences during Physical Human-Robot Interaction

Authors: Shaunak A. Mehta, Dylan P. Losey

PubTime: 2024-01-09

Downlink: http://arxiv.org/abs/2207.03395v2

Project: https://youtu.be/FSUJsTYvEKU

Abstract: Humans can leverage physical interaction to teach robot arms. This physical interaction takes multiple forms depending on the task, the user, and what the robot has learned so far. State-of-the-art approaches focus on learning from a single modality, or combine multiple interaction types by assuming that the robot has prior information about the human’s intended task. By contrast, in this paper we introduce an algorithmic formalism that unites learning from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human’s inputs to nearby alternatives. We first derive a loss function that trains an ensemble of reward models to match the human’s demonstrations, corrections, and preferences. The type and order of feedback is up to the human teacher: we enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert our learned reward into a desired robot trajectory. Through simulations and a user study we demonstrate that our proposed approach more accurately learns manipulation tasks from physical human interaction than existing baselines, particularly when the robot is faced with new or unexpected objectives. Videos of our user study are available at: https://youtu.be/FSUJsTYvEKU

Title: StROL: Stabilized and Robust Online Learning from Humans

Authors: Shaunak A. Mehta, Forrest Meng, Andrea Bajcsy

PubTime: 2024-01-04

Downlink: http://arxiv.org/abs/2308.09863v2

GitHub: https://github.com/VT-Collab/StROL_RAL

Abstract: Robots often need to learn the human’s reward function online, during the
current interaction. This real-time learning requires fast but approximate
learning rules: when the human’s behavior is noisy or suboptimal, current
approximations can result in unstable robot learning. Accordingly, in this
paper we seek to enhance the robustness and convergence properties of gradient
descent learning rules when inferring the human’s reward parameters. We model
the robot’s learning algorithm as a dynamical system over the human preference
parameters, where the human’s true (but unknown) preferences are the
equilibrium point. This enables us to perform Lyapunov stability analysis to
derive the conditions under which the robot’s learning dynamics converge. Our
proposed algorithm (StROL) uses these conditions to learn robust-by-design
learning rules: given the original learning dynamics, StROL outputs a modified
learning rule that now converges to the human’s true parameters under a larger
set of human inputs. In practice, these autonomously generated learning rules
can correctly infer what the human is trying to convey, even when the human is
noisy, biased, and suboptimal. Across simulations and a user study we find that
StROL results in a more accurate estimate and less regret than state-of-the-art
approaches for online reward learning. See videos and code here:
https://github.com/VT-Collab/StROL_RAL
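
To make the "learning rule as a dynamical system" view concrete, here is a toy sketch that checks a Lyapunov decrease condition numerically. It is not the StROL algorithm: the update rule, the noiseless human input, the choice of squared error as the Lyapunov candidate, and the names `learning_step`, `lyapunov`, and `converges` are all illustrative assumptions.

```python
def learning_step(theta, human_input, alpha):
    """One update of a toy learning rule: move the estimate toward the human input."""
    return [t + alpha * (u - t) for t, u in zip(theta, human_input)]

def lyapunov(theta, theta_true):
    """Candidate Lyapunov function: squared distance to the true parameters."""
    return sum((t - s) ** 2 for t, s in zip(theta, theta_true))

def converges(theta0, theta_true, alpha, steps=30):
    """Return True if the Lyapunov value strictly decreases at every step."""
    theta = list(theta0)
    for _ in range(steps):
        v_before = lyapunov(theta, theta_true)
        # Assumption: the human input is noiseless and equals the true parameters.
        theta = learning_step(theta, theta_true, alpha)
        v_after = lyapunov(theta, theta_true)
        if v_before > 0 and v_after >= v_before:
            return False
    return True
```

For this toy rule the error is scaled by `1 - alpha` at each step, so the true parameters are a stable equilibrium only for step sizes between 0 and 2; `converges([1.0, -2.0], [0.0, 0.5], 0.5)` holds while a step size of 2.5 fails the check.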

标题: Motion Control of Interactive Robotic Arms Based on Mixed Reality Development

作者: Hanxiao Chen

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01644v1

Project: http://www.icca.net/|

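中文摘要: 混合现实（MR）正在不断发展，在第四次工业革命范式下为更先进的人机交互激发机器人操作的新模式。考虑到混合现实旨在连接物理世界和数字世界以提供特殊的沉浸式体验，有必要在所开发的MR场景中建立信息交换平台和机器人控制系统。在这项工作中，我们主要介绍了应用于不同交互式机械臂（如UR5、UR5e、myCobot）的多种有效运动控制方法，用于基于Unity的MR应用开发，包括GUI控制面板、文本输入控制面板、末端执行器目标动态跟踪以及ROS-Unity数字孪生连接。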

摘要: Mixed Reality (MR) is constantly evolving to inspire new patterns of robot manipulation for more advanced Human-Robot Interaction under the 4th Industrial Revolution Paradigm. Considering that Mixed Reality aims to connect physical and digital worlds to provide special immersive experiences, it is necessary to establish the information exchange platform and robot control systems within the developed MR scenarios. In this work, we mainly present multiple effective motion control methods applied on different interactive robotic arms (e.g., UR5, UR5e, myCobot) for the Unity-based development of MR applications, including GUI control panel, text input control panel, end-effector object dynamic tracking and ROS-Unity digital-twin connection.

标题: Chat Failures and Troubles: Reasons and Solutions

作者: Manal Helal, Patrick Holthaus, Gabriella Lakatos

PubTime: 2024-01-18

Downlink: http://arxiv.org/abs/2309.03708v2

中文摘要: 本文研究了人机交互（HRI）中导致聊天失败和麻烦的一些常见问题。给定用例的设计决策始于合适的机器人、合适的聊天模型、识别导致故障的常见问题、识别潜在的解决方案以及规划持续改进。总之，建议使用闭环控制算法来指导训练过的人工智能（AI）预训练模型的使用，并提供词汇过滤，在新数据集上重新训练批处理模型，从数据流中在线学习，和/或使用强化学习模型来自我更新训练过的模型并减少错误。

摘要: This paper examines some common problems in Human-Robot Interaction (HRI) causing failures and troubles in Chat. A given use case’s design decisions start with the suitable robot, the suitable chatting model, identifying common problems that cause failures, identifying potential solutions, and planning continuous improvement. In conclusion, it is recommended to use a closed-loop control algorithm that guides the use of trained Artificial Intelligence (AI) pre-trained models and provides vocabulary filtering, re-train batched models on new datasets, learn online from data streams, and/or use reinforcement learning models to self-update the trained models and reduce errors.

== Object Detection @ Segmentation @ Open vocabulary detection @ SAM ==

标题: MicroSegNet: A Deep Learning Approach for Prostate Segmentation on Micro-Ultrasound Images

作者: Hongxu Jiang, Muhammad Imran, Preethika Muralidharan

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2305.19956v3

Project: https://zenodo.org/records/10475293|

GitHub: https://github.com/mirthAI/MicroSegNet|

中文摘要: 微超声（micro-US）是一种新颖的29 MHz超声技术，其分辨率比传统超声高3-4倍，有可能实现低成本、准确的前列腺癌诊断。准确的前列腺分割对于前列腺体积测量、癌症诊断、前列腺活检和治疗计划至关重要。然而，由于存在伪影，且中线处前列腺、膀胱和尿道之间的边界模糊，micro-US图像上的前列腺分割具有挑战性。本文介绍了MicroSegNet，这是一个多尺度注释引导的Transformer UNet模型，专门用于解决这些挑战。在训练过程中，MicroSegNet更关注难以分割的区域（困难区域），其特征是专家和非专家注释之间存在差异。我们通过提出注释引导的二元交叉熵（AG-BCE）损失来实现这一点，该损失为困难区域中的预测误差分配较大的权重，为容易区域中的预测误差分配较小的权重。AG-BCE损失借助多尺度深度监督无缝集成到训练过程中，使MicroSegNet能够捕捉各种尺度的全局上下文依赖和局部信息。我们使用55名患者的micro-US图像训练了我们的模型，随后在20名患者上进行了评估。我们的MicroSegNet模型实现了0.939的Dice系数和2.02 mm的Hausdorff距离，优于几种最先进的分割方法，以及三名具有不同经验水平的人类注释者。我们的代码在 https://github.com/mirthAI/MicroSegNet 公开，我们的数据集在 https://zenodo.org/records/10475293 公开。

摘要: Micro-ultrasound (micro-US) is a novel 29-MHz ultrasound technique that provides 3-4 times higher resolution than traditional ultrasound, potentially enabling low-cost, accurate diagnosis of prostate cancer. Accurate prostate segmentation is crucial for prostate volume measurement, cancer diagnosis, prostate biopsy, and treatment planning. However, prostate segmentation on micro-US is challenging due to artifacts and indistinct borders between the prostate, bladder, and urethra in the midline. This paper presents MicroSegNet, a multi-scale annotation-guided transformer UNet model designed specifically to tackle these challenges. During the training process, MicroSegNet focuses more on regions that are hard to segment (hard regions), characterized by discrepancies between expert and non-expert annotations. We achieve this by proposing an annotation-guided binary cross entropy (AG-BCE) loss that assigns a larger weight to prediction errors in hard regions and a lower weight to prediction errors in easy regions. The AG-BCE loss was seamlessly integrated into the training process through the utilization of multi-scale deep supervision, enabling MicroSegNet to capture global contextual dependencies and local information at various scales. We trained our model using micro-US images from 55 patients, followed by evaluation on 20 patients. Our MicroSegNet model achieved a Dice coefficient of 0.939 and a Hausdorff distance of 2.02 mm, outperforming several state-of-the-art segmentation methods, as well as three human annotators with different experience levels. Our code is publicly available at https://github.com/mirthAI/MicroSegNet and our dataset is publicly available at https://zenodo.org/records/10475293.
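
The annotation-guided weighting described above can be sketched as a per-pixel weighted binary cross entropy, alongside the Dice coefficient used for evaluation. This is a minimal illustration under stated assumptions, not the released MicroSegNet code: the weight values `hard_weight=4.0` and `easy_weight=1.0`, the mean normalization, and the function names are hypothetical.

```python
import math

def ag_bce(pred, expert, non_expert, hard_weight=4.0, easy_weight=1.0, eps=1e-7):
    """Weighted BCE against expert labels on flat lists of pixels.

    A pixel is treated as "hard" when expert and non-expert labels disagree.
    The weight values are illustrative assumptions.
    """
    total = 0.0
    for p, y, y_ne in zip(pred, expert, non_expert):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        weight = hard_weight if y != y_ne else easy_weight
        total += -weight * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(pred)

def dice(pred_mask, true_mask):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    denominator = sum(pred_mask) + sum(true_mask)
    return 1.0 if denominator == 0 else 2.0 * intersection / denominator
```

With these weights, the same prediction error costs four times as much on a pixel where the annotators disagree as on one where they agree, which pushes training toward the ambiguous boundary regions.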

标题: OMG-Seg: Is One Model Good Enough For All Segmentation?

作者: Xiangtai Li, Haobo Yuan, Wei Li

PubTime: 2024-01-18

Downlink: http://arxiv.org/abs/2401.10229v1

Project: https://lxtgh.github.io/project/omg_seg/|

GitHub: https://github.com/lxtGH/OMG-Seg|

中文摘要: 在这项工作中，我们解决了各种分割任务，每个任务传统上都由不同的或部分统一的模型来解决。我们提出了OMG-Seg，这是一个足够好的模型，可以高效和有效地处理所有分割任务，包括图像语义、实例和全景分割，以及它们的视频对应任务、开放词汇设置、提示驱动的交互式分割（如SAM）和视频对象分割。据我们所知，这是第一个在一个模型中处理所有这些任务并实现令人满意的性能的模型。我们表明，OMG-Seg是一种基于Transformer的编码器-解码器架构，具有特定于任务的查询和输出，可以支持十多种不同的分割任务，同时显著降低各种任务和数据集的计算和参数开销。我们严格评估了联合训练中任务间的影响和相关性。代码和模型可在 https://github.com/lxtGH/OMG-Seg 获得。

摘要: In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.

标题: RAP-SAM: Towards Real-Time All-Purpose Segment Anything

作者: Shilin Xu, Haobo Yuan, Qingyu Shi

PubTime: 2024-01-18

Downlink: http://arxiv.org/abs/2401.10228v1

Project: https://xushilin1.github.io/rap_sam/|

GitHub: https://github.com/xushilin1/RAP-SAM/|

中文摘要: 在Transformer架构的推动下，视觉基础模型（VFMs）在性能和泛化能力方面取得了显著进步。Segment Anything模型（SAM）是一种能够实现通用分割的出色模型。然而，大多数VFM不能实时运行，这使得很难将它们应用到多种产品中。另一方面，目前的实时分割主要只有一个用途，比如对驾驶场景进行语义分割。我们认为实际应用需要多样化的输出。因此，本工作探索了一种新的实时分割设置，称为实时通用分割，以便将VFMs迁移到实时部署中。它包含三个不同的任务，包括交互式分割、全景分割和视频分割。我们的目标是使用一个模型来实时完成上述任务。我们首先对几个强基线进行基准测试。然后，我们提出了实时通用SAM（RAP-SAM）。它包含一个高效的编码器和一个高效的解耦解码器来执行提示驱动解码。此外，我们进一步探索不同的训练策略和调优方法，以进一步提高联合训练的性能。我们的代码和模型可在 https://github.com/xushilin1/RAP-SAM/ 获得。

摘要: Advanced by transformer architecture, vision foundation models (VFMs) achieve remarkable progress in performance and generalization ability. Segment Anything Model (SAM) is one remarkable model that can achieve generalized segmentation. However, most VFMs cannot run in realtime, which makes it difficult to transfer them into several products. On the other hand, current real-time segmentation mainly has one purpose, such as semantic segmentation on the driving scene. We argue that diverse outputs are needed for real applications. Thus, this work explores a new real-time segmentation setting, named all-purpose segmentation in real-time, to transfer VFMs in real-time deployment. It contains three different tasks, including interactive segmentation, panoptic segmentation, and video segmentation. We aim to use one model to achieve the above tasks in real-time. We first benchmark several strong baselines. Then, we present Real-Time All Purpose SAM (RAP-SAM). It contains an efficient encoder and an efficient decoupled decoder to perform prompt-driven decoding. Moreover, we further explore different training strategies and tuning methods to boost co-training performance further. Our code and model are available at https://github.com/xushilin1/RAP-SAM/.

标题: Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive

作者: Yumeng Li, Margret Keuper, Dan Zhang

PubTime: 2024-01-16

Downlink: http://arxiv.org/abs/2401.08815v1

Project: https://yumengli007.github.io/ALDM/|

GitHub: https://github.com/boschresearch/ALDM|

中文摘要: 尽管大规模扩散模型最近取得了进展，但布局到图像（L2I）合成任务进展甚微。当前的L2I模型要么通过文本的可编辑性差，要么生成的图像和输入布局之间的对齐弱。这限制了它们在实践中的可用性。为了缓解这一问题，我们建议将对抗性监督整合到L2I扩散模型的传统训练流程中（ALDM）。具体来说，我们采用基于分割的判别器，该判别器向扩散生成器提供关于去噪图像和输入布局之间像素级对齐的显式反馈。为了鼓励在采样步骤中一致地遵守输入布局，我们进一步引入了多步展开策略。我们不是只查看单个时间步，而是递归地展开几个步骤来模拟推理过程，并要求判别器在特定时间窗口内评估去噪图像与布局的对齐情况。我们的实验表明，ALDM能够实现生成图像的布局忠实性，同时允许通过文本提示进行广泛的编辑。此外，我们展示了它在实际应用中的有用性：通过文本控制合成目标分布样本，我们大幅提高了语义分割模型的领域泛化能力（约12个mIoU点）。

摘要: Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points).

标题: LESEN: Label-Efficient deep learning for Multi-parametric MRI-based Visual Pathway Segmentation

作者: Alou Diakite, Cheng Li, Lei Xie

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01654v1

GitHub: https://github.com/aldiak/Semi-Supervised-Multimodal-Visual-Pathway-Delineation|

中文摘要: 最近的研究显示了深度学习在基于多参数MRI的视觉通路（VP）分割中的潜力。然而，获取用于训练的标记数据既费力又耗时。因此，在标记样本有限的情况下开发有效的算法至关重要。在这项工作中，我们提出了一种标签高效的自集成深度学习方法（LESEN）。LESEN结合了监督损失和无监督损失，使学生和教师模型能够相互学习，形成一个自集成的平均教师（mean teacher）框架。此外，我们引入了可靠的无标记样本选择（RUSS）机制，以进一步提高LESEN的有效性。我们在人类连接组项目（HCP）数据集上的实验证明了我们的方法与最先进的技术相比的卓越性能，推进了临床和研究环境中用于综合分析的多模态VP分割。实现代码将在以下网址提供：https://github.com/aldiak/Semi-Supervised-Multimodal-Visual-Pathway-Delineation

摘要: Recent research has shown the potential of deep learning in multi-parametric MRI-based visual pathway (VP) segmentation. However, obtaining labeled data for training is laborious and time-consuming. Therefore, it is crucial to develop effective algorithms in situations with limited labeled samples. In this work, we propose a label-efficient deep learning method with self-ensembling (LESEN). LESEN incorporates supervised and unsupervised losses, enabling the student and teacher models to mutually learn from each other, forming a self-ensembling mean teacher framework. Additionally, we introduce a reliable unlabeled sample selection (RUSS) mechanism to further enhance LESEN’s effectiveness. Our experiments on the human connectome project (HCP) dataset demonstrate the superior performance of our method when compared to state-of-the-art techniques, advancing multimodal VP segmentation for comprehensive analysis in clinical and research settings. The implementation code will be available at: https://github.com/aldiak/Semi-Supervised-Multimodal-Visual-Pathway-Delineation.
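
For readers unfamiliar with the mean teacher setup mentioned above, the sketch below shows its two standard ingredients: a teacher whose weights are an exponential moving average (EMA) of the student's, and a combined supervised plus unsupervised (consistency) loss. This follows the generic mean teacher recipe and is not the LESEN code; the RUSS sample selection is not modeled, and the decay value, the loss weight, and all function names are assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight toward the matching student weight (EMA)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions on unlabeled data."""
    squared = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(squared) / len(squared)

def total_loss(supervised, consistency, weight=0.1):
    """Supervised loss on labeled data plus weighted consistency loss on unlabeled data."""
    return supervised + weight * consistency
```

In a training loop the student would be updated by gradient descent on `total_loss`, and `ema_update` would then be called once per step so the teacher tracks a smoothed copy of the student.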

标题: S3Net: Innovating Stereo Matching and Semantic Segmentation with a Single-Branch Semantic Stereo Network in Satellite Epipolar Imagery

作者: Qingyuan Yang, Guanzhou Chen, Xiaoliang Tan

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01643v1

GitHub: https://github.com/CVEO/S3Net|

中文摘要: 立体匹配和语义分割是双目卫星三维重建中的重要任务。然而，以前的研究主要将这些任务视为独立的并行任务，缺乏一个完整的多任务学习框架。本文介绍了一种解决方案，单分支语义立体网络（S3Net），它使用自融合（Self-Fuse）和互融合（Mutual-Fuse）模块，创新性地将语义分割和立体匹配结合起来。与以前独立利用语义信息或视差信息的方法不同，我们的方法识别并利用这两个任务之间的内在联系，从而更准确地理解语义信息并估计视差。在US3D数据集上的对比测试证明了我们的S3Net的有效性。我们的模型将语义分割中的mIoU从61.38提高到67.39，并将视差估计中的D1误差和平均端点误差（EPE）分别从10.051降低到9.579和从1.439降低到1.403，超过了现有的竞争方法。我们的代码可在以下网址查阅：https://github.com/CVEO/S3Net

摘要: Stereo matching and semantic segmentation are significant tasks in binocular satellite 3D reconstruction. However, previous studies primarily view these as independent parallel tasks, lacking an integrated multitask learning framework. This work introduces a solution, the Single-branch Semantic Stereo Network (S3Net), which innovatively combines semantic segmentation and stereo matching using Self-Fuse and Mutual-Fuse modules. Unlike preceding methods that utilize semantic or disparity information independently, our method identifies and leverages the intrinsic link between these two tasks, leading to a more accurate understanding of semantic information and disparity estimation. Comparative testing on the US3D dataset proves the effectiveness of our S3Net. Our model improves the mIoU in semantic segmentation from 61.38 to 67.39, and reduces the D1-Error and average endpoint error (EPE) in disparity estimation from 10.051 to 9.579 and 1.439 to 1.403 respectively, surpassing existing competitive methods. Our codes are available at: https://github.com/CVEO/S3Net.
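
The three metrics quoted above can be computed as in the following sketch, which works on flat lists of per-pixel values. The D1 definition used here (percentage of pixels whose disparity error exceeds 3 pixels) is an assumption based on common stereo benchmarks; the S3Net evaluation code may define it differently, and the function names are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def endpoint_error(pred_disp, true_disp):
    """Average endpoint error (EPE): mean absolute disparity error in pixels."""
    return sum(abs(p - t) for p, t in zip(pred_disp, true_disp)) / len(true_disp)

def d1_error(pred_disp, true_disp, threshold=3.0):
    """Percentage of pixels whose disparity error exceeds the threshold (assumed 3 px)."""
    bad = sum(1 for p, t in zip(pred_disp, true_disp) if abs(p - t) > threshold)
    return 100.0 * bad / len(true_disp)
```

Higher is better for mIoU, while lower is better for both EPE and D1, which matches the direction of the improvements reported in the abstract.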
