V-BEs: Visually-Grounded Library of Behaviors for Manipulating Diverse Objects across Diverse Configurations and Views

CoRL 2021

Carnegie Mellon University


We propose a method for manipulating diverse objects across a wide range of initial and goal configurations and camera placements. We disentangle the standard image-to-action mapping into two separate modules: (1) a behavior selector, which conditions on intrinsic and semantically-rich object appearance features to select the behaviors that can successfully perform the desired tasks on the object at hand, and (2) a library of behaviors, each of which conditions on extrinsic and abstract object properties to predict actions to execute over time. Our framework outperforms various learning-based and non-learning-based baselines in both simulated and real robot tasks.



Overview of method

Given an input RGB-D image $I_v$ of the scene and an object 3D bounding box $\mathrm{o}$, the selector $\mathrm{G}$ predicts, for each behavior in the library, the probability that applying it will successfully manipulate the object. It does so by using geometry-aware recurrent neural networks (GRNNs) to convert the image to a 3D scene representation $\mathbf{M}'$, cropping that representation to the object using the provided 3D bounding box, and computing the cosine similarity between the resulting object representation $\mathbf{F}(I_v, \mathrm{o}; \phi)$ and a learned behavioral key $\kappa_i$ for each behavior $\pi_i$ in the library. The behavior with the highest predicted success probability is then executed in the environment. We train the selector using interaction labels collected by running behaviors on random training objects and recording the binary success or failure outcome of each execution.
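The selection step can be illustrated with a minimal sketch. It assumes the object representation $\mathbf{F}(I_v, \mathrm{o}; \phi)$ has already been flattened into a single vector and that the behavioral keys $\kappa_i$ are rows of a matrix; the function names, the toy vectors, and the direct use of cosine similarity as the ranking score are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def cosine_similarity(feature, keys, eps=1e-8):
    """Cosine similarity between one feature vector and each row of `keys`."""
    feature = feature / (np.linalg.norm(feature) + eps)
    keys = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + eps)
    return keys @ feature


def select_behavior(object_feature, behavior_keys):
    """Return the index of the best-matching behavior and all scores.

    Hypothetical helper: ranks behaviors by the similarity between the
    object feature and each behavior's key, then picks the top one.
    """
    scores = cosine_similarity(object_feature, behavior_keys)
    return int(np.argmax(scores)), scores


# Toy stand-ins for F(I_v, o; phi) and the keys kappa_i (made-up values).
object_feature = np.array([1.0, 0.0, 0.0])
behavior_keys = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the object feature
    [2.0, 0.1, 0.0],   # nearly aligned with the object feature
    [-1.0, 0.0, 0.0],  # opposite to the object feature
])

best, scores = select_behavior(object_feature, behavior_keys)
print(best, np.round(scores, 3))
```

In this toy case the second key is the most closely aligned with the object feature, so its behavior would be the one executed.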


Paradigm 1: In contrast to state-to-action or image-to-action mapping, the proposed framework decomposes a policy into a behavior selection module and a library of behaviors to select from. The decomposition enables these modules to work on different representations: the selection module operates in a semantically-rich visual feature space, while the behaviors operate in an abstract object state space that facilitates efficient policy learning.
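The decomposition can be sketched as two kinds of callables with different inputs: a selector that sees only a visual feature, and behaviors that see only an abstract object state. Everything in the sketch is a hypothetical stand-in (the type aliases, `run_decomposed_policy`, the toy selector, and the two toy behaviors), and the list of precomputed states simplifies a real rollout, where each state would come from the environment after the previous action.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np

# Hypothetical interfaces: the selector maps a visual feature to a behavior
# index; each behavior maps an abstract object state to an action.
Selector = Callable[[np.ndarray], int]
Behavior = Callable[[np.ndarray], np.ndarray]


def run_decomposed_policy(
    selector: Selector,
    behaviors: Sequence[Behavior],
    visual_feature: np.ndarray,
    object_states: Sequence[np.ndarray],
) -> Tuple[int, List[np.ndarray]]:
    """Choose a behavior once from appearance, then act from abstract state."""
    index = selector(visual_feature)
    behavior = behaviors[index]
    actions = [behavior(state) for state in object_states]
    return index, actions


def toy_selector(visual_feature: np.ndarray) -> int:
    # Stand-in for a learned selector: thresholds the first feature value.
    return 0 if visual_feature[0] > 0 else 1


def move_toward_origin(object_state: np.ndarray) -> np.ndarray:
    # Toy behavior: action points from the object position to the origin.
    return -object_state


def lift(object_state: np.ndarray) -> np.ndarray:
    # Toy behavior: constant upward action, independent of the state.
    return np.array([0.0, 0.0, 1.0])


index, actions = run_decomposed_policy(
    toy_selector,
    [move_toward_origin, lift],
    visual_feature=np.array([1.0, 0.0]),
    object_states=[np.array([0.5, 0.0, 0.0]), np.array([0.25, 0.0, 0.0])],
)
print(index, actions)
```

Keeping the two interfaces separate means a behavior can be swapped or retrained without touching the selector, and the selector never needs to reason about low-level actions.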


Quantitative Results

Our method outperforms various baselines and ablations in simulated pushing and grasping tasks.

We also perform a real-robot experiment in a setup where the robot needs to execute various skills to transport diverse rigid, granular, and liquid objects to a plate.

Qualitative Results

Simulated Grasping

Simulated Pushing

Real Transporting


  title={Visually-Grounded Library of Behaviors for Manipulating Diverse Objects across Diverse Configurations and Views},
  author={Yang, Jingyun and Tung, Hsiao-Yu and Zhang, Yunchu and Pathak, Gaurav and Pokle, Ashwini and Atkeson, Christopher G and Fragkiadaki, Katerina},
  booktitle={5th Annual Conference on Robot Learning},