ATK-Policy

Motivation

Visuomotor policies often suffer from perceptual challenges: visual differences between training and evaluation environments degrade policy performance. Policies that rely on state estimates, such as 6D pose, require task-specific tracking and are difficult to scale, while policies that operate on raw sensor data may lack robustness to small visual disturbances. In this work, we use 2D keypoints (spatially consistent features in the image frame) as a flexible state representation for robust policy learning, and apply them to both sim-to-real transfer and real-world imitation learning. However, the right choice of keypoints varies across objects and tasks. We propose a novel method, ATK, that automatically selects keypoints in a task-driven manner so that the chosen keypoints are predictive of optimal behavior for the given task. Our method optimizes for a minimal set of keypoints that focus on task-relevant parts while preserving policy performance and robustness. We distill expert data (from either an expert policy in simulation or a human expert) into a policy that operates on RGB images while tracking the selected keypoints. By leveraging pre-trained visual modules, our system effectively encodes states and transfers policies to real-world evaluation scenarios despite wide scene variations and perceptual challenges such as transparent objects, fine-grained tasks, and deformable object manipulation.

Our Method

ATK automatically selects the minimal information necessary for task execution. It distills expert data (from either an expert policy in simulation or a human expert) into a policy that operates on a selected subset of keypoints, and optimizes the selection mask. Once the keypoints are identified, they are transferred from the training set to the real-world evaluation scenario. Finally, the keypoint-based policy is deployed in the evaluation scenario, taking RGB images as input while tracking the transferred keypoints.
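To make the idea of optimizing a selection mask concrete, here is a minimal sketch on synthetic data. It is not the ATK implementation: the linear policy, the sigmoid-parameterized soft mask, the sparsity penalty on the mask, the weight decay (added so the policy weights cannot simply grow to compensate for a shrinking mask), and the function name `select_keypoints` are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def select_keypoints(keypoints, actions, sparsity=0.05, weight_decay=0.01,
                     lr=0.2, steps=3000):
    """Fit a linear policy and a soft keypoint mask by gradient descent.

    keypoints: (N, K, 2) coordinates of K candidate keypoints per sample.
    actions:   (N, A) expert actions to imitate.
    Returns the soft mask in (0, 1) per keypoint and the policy weights.
    """
    n, k, _ = keypoints.shape
    logits = np.zeros(k)                          # mask starts at 0.5 everywhere
    weights = np.zeros((k, 2, actions.shape[1]))  # one 2xA block per keypoint
    for _ in range(steps):
        mask = sigmoid(logits)
        # Per-keypoint contribution to the action before masking: (N, K, A)
        contrib = np.einsum("nkd,kda->nka", keypoints, weights)
        pred = np.einsum("k,nka->na", mask, contrib)
        grad_pred = 2.0 * (pred - actions) / n    # gradient of mean squared error
        # Loss = imitation error + sparsity * sum(mask) + weight_decay * ||W||^2
        grad_mask = np.einsum("na,nka->k", grad_pred, contrib) + sparsity
        grad_weights = (np.einsum("k,nkd,na->kda", mask, keypoints, grad_pred)
                        + 2.0 * weight_decay * weights)
        logits -= lr * grad_mask * mask * (1.0 - mask)  # chain rule through sigmoid
        weights -= lr * grad_weights
    return sigmoid(logits), weights


# Synthetic stand-in for expert data: the expert depends only on keypoints 0 and 1.
rng = np.random.default_rng(0)
kps = rng.normal(size=(512, 6, 2))        # hypothetical normalized coordinates
expert_actions = kps[:, 0] - kps[:, 1]
mask, _ = select_keypoints(kps, expert_actions)
selected = np.flatnonzero(mask > 0.5)     # indices of keypoints kept by the mask
```

In this toy setup the sparsity term pushes every mask entry toward zero, while the imitation loss keeps open only the entries the expert's actions actually depend on. A real system would replace the linear map with a learned policy network and the synthetic coordinates with tracked keypoints.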

Experiments

1. Imitation Learning Robustness

ATK is robust to randomized backgrounds, distractors, and lighting.

3. Category-Level Generalization

Can our method generalize to different objects within the same category?

Towel Folding (videos: Static View, Moving View)

Hanging Blanket (videos: Static View, Moving View)

4. Viewpoint Generalization

Can our method generalize to different camera viewpoints?

Towel Folding (videos: Viewpoint 1, Viewpoint 2, Viewpoint 3)

Hanging Blanket (videos: Viewpoint 1, Viewpoint 2, Viewpoint 3)

Failure Cases

Using 2D keypoint pixel coordinates as input conveys little semantic information, which may restrict the system's understanding of the task context. Additionally, off-the-shelf tracking or correspondence modules may not always be robust enough for robotic applications, especially when dealing with uncommon or visually challenging data. In sim-to-real scenarios, discrepancies in dynamics between simulation and the real world are a further source of failures.

ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning

1. Imitation Learning Robustness

Videos for each example: Automatic Task-driven Keypoints Filtering, Keypoints Transfer during Inference, Keypoints Tracking during Rollout, followed by Random Position, Random Distractors, Random Backgrounds, and Random Lights.

Quantitative Results

2. Sim-to-real Transfer

Clock Hand Turning

Clock Button Press

GlassPot Lift

Sushi Pick

Quantitative Results

Failure Cases

Tracking Errors

Correspondence Mismatch

Dynamics Gap

Simulation and Real World (videos)

Representative Failure Trajectories in the Real World
