[Paper Notes] Transformer-based deep imitation learning for dual-arm robot manipulation

Problem: In a dual-arm manipulation setup, the additional robot manipulators increase the number of state dimensions, which distracts the neural network and degrades its performance.


We address this issue using a self-attention mechanism that computes dependencies between elements in a sequential input and focuses on important elements.




  1. Imitation Learning;
  2. Dual Arm Manipulation;
  3. Deep Learning in Grasping and Manipulation




However, we believe the problem to solve for deep imitation learning on dual-arm manipulation tasks lies in the distractions caused by the increased dimensions of the concatenated left/right robot arm kinematics states, which are essential for the collaboration of both arms in dual-arm manipulation.


Previous work: This study measured the human gaze position with an eye-tracker while teleoperating the robot to generate demonstration data and used only the foveated vision around the predicted gaze position for deep imitation learning to suppress visual distraction from task-unrelated objects.


For example, when the robot reaches its right arm to an object, the left arm kinematics states are not used to compute the policy and become distractions.



Because the Transformer dynamically generates attention based on the input state, this architecture can be applied to suppress distractions in the kinematics states without an explicit attention signal.


In this architecture, each element of the sensory inputs (gaze position, left arm state, and right arm state) as well as the foveated image embedding is input to the Transformer, which determines which elements should receive attention.



  1. an uncoordinated manipulation task (two arms executing different tasks)
  2. a goal-coordinated manipulation task (both arms solving the same task but not physically interacting with each other)
  3. a bi-manual task (both arms physically interacting to solve the task)


A. Imitation learning-based dual-arm manipulation

1. based on key-points selected by hidden Markov models (HMM) to reproduce the movements of arms

2. an imitation learning framework was proposed

  • segments motion primitives and learns task structure from segments for dual-arm structured tasks in simulation.
  • The proposed framework was tested on a pizza preparation scenario, which is an uncoordinated manipulation task.

3. a deep imitation learning model is designed

  • captures relational information in dual-arm manipulation tasks to improve bi-manual manipulation tasks in the simulated environment
  • their work required manually defined task primitives

To the best of our knowledge, the performance of a self-attention-based deep imitation learning method for dual-arm manipulation has not been studied in a real-world robot environment.



B. Transformer-based robot learning

1. the Transformer-based seq-to-seq architecture

  • to improve meta-imitation learning

  • they used the Transformer to capture temporal correspondences between the demonstration and the target task.

  • these studies did not apply the self-attention architecture to robot kinematics states



A. Robot system

  • a human operator teleoperates two UR5 (Universal Robots) manipulators
  • a head-mounted display (HMD) provides vision captured from a stereo camera mounted on the robot's head
  • During teleoperation, the human gaze is measured by an eye-tracker mounted in the HMD

In this research, the left camera image is resized to 256 × 256 (called the global image) and recorded at 10 Hz together with the two-dimensional gaze coordinate of the left eye and the robot kinematics states of both arms.


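  • the end-effector position: 3 dimensions (coordinates of a point in space)
  • the orientation of the end effector: 6 dimensions (sine and cosine of the three Euler angles)
  • the gripper angle: 1 dimension (open/close)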
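
As a rough illustration of the state listed above, the sketch below packs position, orientation, and gripper angle into one vector per arm. It assumes the 3 + 6 + 1 values are per arm, which is consistent with the 22-dimensional state used later (2 gaze values + 2 × 10). The function name `arm_state`, the Euler-angle ordering, and the interleaved sine/cosine layout are hypothetical choices for illustration, not taken from the paper.

```python
import math

def arm_state(position, euler_angles, gripper_angle):
    """Pack one arm's kinematics into a 10-dimensional state vector.

    position      : (x, y, z) end-effector position
    euler_angles  : three Euler angles in radians (ordering is an assumption)
    gripper_angle : scalar gripper opening
    """
    if len(position) != 3 or len(euler_angles) != 3:
        raise ValueError("position and euler_angles must each have 3 values")
    # Sine/cosine encoding avoids the discontinuity at +/- pi.
    orientation = [f(a) for a in euler_angles for f in (math.sin, math.cos)]
    return list(position) + orientation + [gripper_angle]

left = arm_state((0.3, 0.1, 0.2), (0.0, math.pi / 2, math.pi), 0.5)
right = arm_state((0.3, -0.1, 0.2), (0.0, 0.0, 0.0), 0.0)
dual_arm_state = left + right  # 20 values for the two arms
```

Encoding each angle as a sine/cosine pair keeps nearby orientations numerically close, which a raw angle does not do at the wrap-around point.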

B. Gaze position prediction

1. human gaze was used to achieve imitation learning of a robot manipulator that is robust against visual distractions.

  1. predicts gaze position
  2. crops the images around the predicted gaze position to remove task-irrelevant visual distractions

2. using a mixture density network

  1. a neural network architecture that fits a Gaussian mixture model (GMM) into the target, for estimating the probability distribution of the gaze position.

  2. The gaze predictor takes the entire $256 \times 256 \times 3$ RGB image as input and outputs $\mu \in \mathbb{R}^{2 \times N}$, $\sigma \in \mathbb{R}^{2 \times N}$, $\rho \in \mathbb{R}^{1 \times N}$, and $p \in \mathbb{R}^{N}$, which together define the probability distribution of the two-dimensional gaze coordinate, where $N$ is the number of Gaussian distributions that compose the GMM.

    In this work, $N = 8$.



  3. This network is trained by minimizing the negative log-likelihood of the probability distribution with the measured human gaze as target $e$
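
The training objective above can be sketched as follows. This is a minimal illustration that assumes each mixture component is a standard bivariate Gaussian with a correlation coefficient, and that the mixture weights already sum to 1. The function name `gmm_nll` and the small stabilizing constant are assumptions, not details from the paper.

```python
import math

def gmm_nll(gaze, mu, sigma, rho, weights):
    """Negative log-likelihood of a 2-D point under a bivariate GMM.

    gaze    : (x, y) measured gaze position (the target)
    mu      : list of N (mu_x, mu_y) means
    sigma   : list of N (sigma_x, sigma_y) standard deviations (> 0)
    rho     : list of N correlation coefficients in (-1, 1)
    weights : list of N mixture weights summing to 1
    """
    likelihood = 0.0
    for (mx, my), (sx, sy), r, w in zip(mu, sigma, rho, weights):
        dx = (gaze[0] - mx) / sx
        dy = (gaze[1] - my) / sy
        one_minus_r2 = 1.0 - r * r
        z = dx * dx + dy * dy - 2.0 * r * dx * dy
        norm = 2.0 * math.pi * sx * sy * math.sqrt(one_minus_r2)
        likelihood += w * math.exp(-z / (2.0 * one_minus_r2)) / norm
    # Small constant guards against log(0) for points far from every component.
    return -math.log(likelihood + 1e-12)

# One standard-normal component evaluated at its mean and away from it.
nll_center = gmm_nll((0.0, 0.0), [(0.0, 0.0)], [(1.0, 1.0)], [0.0], [1.0])
nll_far = gmm_nll((2.0, 2.0), [(0.0, 0.0)], [(1.0, 1.0)], [0.0], [1.0])
```

The loss is lowest when a heavily weighted component sits on the measured gaze, so minimizing it pulls the predicted mixture toward the human gaze data.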

C. Transformer-based dual-arm imitation learning

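  1. Predict the two-dimensional gaze position from the entire $256 \times 256$ RGB image

  2. Crop a $64 \times 64$ foveated image around the two-dimensional gaze position

  3. Feed the foveated image through five convolutional layers and one global pooling (GAP) layer

  4. Concatenate the predicted gaze position with the left and right manipulator states into a 22-dimensional state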
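
Steps 2 and 4 above can be sketched in plain Python. The sketch assumes the gaze is given in pixel coordinates and that the crop window is clamped at the image border. Both are assumptions, as are the helper names `crop_foveated` and `build_state`. The convolutional layers of step 3 are omitted.

```python
def crop_foveated(image, gaze_xy, size=64):
    """Crop a size x size window centered on the gaze, clamped to the image.

    image   : nested list indexed as image[row][col]
    gaze_xy : (x, y) gaze position in pixel coordinates (assumption)
    """
    height, width = len(image), len(image[0])
    half = size // 2
    # Clamp the top-left corner so the window stays inside the image.
    left = min(max(int(round(gaze_xy[0])) - half, 0), width - size)
    top = min(max(int(round(gaze_xy[1])) - half, 0), height - size)
    return [row[left:left + size] for row in image[top:top + size]]

def build_state(gaze_xy, left_arm, right_arm):
    """Concatenate gaze (2) and both arm states (10 each) into 22 values."""
    state = list(gaze_xy) + list(left_arm) + list(right_arm)
    if len(state) != 22:
        raise ValueError("expected 2 + 10 + 10 values")
    return state

# Toy 256 x 256 image whose pixel value records its own (row, col).
image = [[(r, c) for c in range(256)] for r in range(256)]
patch = crop_foveated(image, gaze_xy=(250.0, 10.0))
state = build_state((250.0, 10.0), [0.0] * 10, [0.0] * 10)
```

Clamping keeps the patch at a fixed 64 × 64 size even when the gaze falls near an edge, so the downstream layers always see the same input shape.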



We attach a 22-dimensional one-hot vector to each value as its positional embedding.


Finally, the encoded features are flattened and passed through an MLP with one hidden layer to predict the actions of both arms.
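
To make the encode, flatten, and MLP flow concrete, here is a minimal sketch using a single head of scaled dot-product self-attention over 23 tokens (22 state values plus one image embedding). The token width, hidden size, random weights, missing biases, and the 20-dimensional action output are all assumptions for illustration. The actual model stacks several Transformer encoder layers with learned parameters.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    Returns the attended features and the attention weight matrix.
    """
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
num_tokens, dim = 23, 8  # 22 state values + 1 image embedding; dim is arbitrary
tokens = random_matrix(num_tokens, dim, rng)
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))
encoded, attn = self_attention(tokens, w_q, w_k, w_v)

flat = [v for row in encoded for v in row]  # flatten the 23 x dim features
w1 = random_matrix(len(flat), 16, rng)      # one hidden layer (size assumed)
w2 = random_matrix(16, 20, rng)             # 20 action values (assumed)
hidden = [max(0.0, h) for h in matmul([flat], w1)[0]]  # ReLU
action = matmul([hidden], w2)[0]
```

Each row of `attn` sums to 1 and shows how strongly one token draws on every other token, which is the quantity the attention analysis later inspects.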


The gripper is controlled by the last element of the predicted action. However, this element only predicts the angle of the gripper and does not provide enough force to grasp any object. Therefore, a binary signal for the gripper open/close command is also predicted. If this binary signal predicts that the gripper should be closed, the gripper command additionally tries $5$ to provide enough force to grasp objects.


A. Task setup


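  1. In this task, the robot must pick up a toy apple with its left end effector and then pick up a toy orange, which is always placed to the right of the apple, with its right end effector.
  2. This task evaluates uncoordinated manipulation, because picking up the apple and picking up the orange are independent of each other.
  3. The apple and the orange are placed at random positions on the table within reach of the robot arms.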


  1. In this task, the robot must use both arms to push a box to the goal position.
  2. This task evaluates bimanual manipulation.


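  1. This task evaluates whether the robot can complete precise bimanual manipulation that involves grasping.
  2. The robot must first grasp the toy banana with its left arm, stand it upright, regrasp it with its right hand, and finally flip the banana.
  3. This regrasping behavior requires accurately predicting the right hand's grasp position from the position of the left hand.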


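  1. In this task, the robot manipulators attempt to tie a knot, a complex goal-coordinated manipulation task that involves a sequence of subtasks that must account for the geometry of the deformable knot.
  2. The rope is laid out in an $\alpha$ shape. The robot must pick up the center of the crossing part with its right hand (Pick Up), guide its left hand into the loop while avoiding collisions, grasp one end of the rope (Grasp), pull that end out of the loop (Pull Out), release the gripper, pick up the other end of the rope with its right hand (Regrasp), and finally tie the knot.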

The data are split into a $90\%$ training set and a $10\%$ test set.


Knot Tying uses more intermediate hidden layers, because the motions involved are more complex.

B. Baselines

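The method is compared against two baselines that do not use the Transformer encoder.

  • The first baseline (baseline-GAP) keeps the GAP layer but replaces each Transformer encoder layer with a fully connected layer.
  • The second baseline (baseline) includes neither the GAP used for image processing nor the Transformer encoder.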

C. Performance evaluation

The result indicates that the Transformer-based method outperforms both baseline methods in terms of success rates of picking up each object.

Because the baseline-GAP performed the worst among the three models, it was excluded from further experiments.







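We found that imitating the release behavior during Release And Pick was not successful, possibly because the exact release timing is not strongly related to the current sensory input.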



D. Attention weight assessment

To determine which sensory input the Transformer attends to, we investigated the attention weights for each sensory input.

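  1. Attention rollout is the recursive multiplication, across all Transformer layers, of the attention weights averaged over all attention heads.
  2. The attention values of the sensory inputs belonging to the foveated image, gaze position, left arm state, and right arm state are computed from the $23 \times 23$ attention rollout.
  3. Then, the time series of attention values in each input domain (Image, Gaze, Left, Right) was normalized, because we want to see how the attention values change within each input domain.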
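
A minimal sketch of attention rollout follows. It averages the heads in each layer and multiplies the results layer by layer. The `add_residual` option mixes in the identity matrix, as in the commonly used formulation of attention rollout. Whether the paper's variant does so is not stated in these notes, so treat that option, the function name, and the toy inputs as assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_rollout(layers, add_residual=True):
    """Compute attention rollout.

    layers : list over layers; each entry is a list over heads of
             n x n attention matrices whose rows sum to 1
    """
    n = len(layers[0][0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for heads in layers:
        # Average the attention weights over all heads of this layer.
        avg = [[sum(h[i][j] for h in heads) / len(heads) for j in range(n)]
               for i in range(n)]
        if add_residual:
            # Account for the skip connection; rows still sum to 1.
            avg = [[0.5 * avg[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
        rollout = matmul(avg, rollout)  # recursive multiplication across layers
    return rollout

# Toy example: two layers, two heads each, three tokens.
uniform = [[1.0 / 3] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
rollout = attention_rollout([[uniform, identity], [uniform, uniform]])
```

With 23 tokens the result is the $23 \times 23$ matrix mentioned above, from which the per-domain attention values can be read.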



  1. Our Transformer-based deep imitation learning architecture is not specialized for dual arms but instead can be expanded to more complicated robots such as multi-arm robots or humanoid robots by concatenating more sensory information into the state representation.

  2. In our experiment, closed-chain bimanual manipulation tasks such as moving heavy objects using both arms were not tested. In our teleoperation setup without force feedback, the counterforce from the object is not transferred back to the human, which causes failures when teleoperating closed-chain manipulation tasks.





