ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning

University of Washington

Motivation

Visuomotor policies often suffer from perceptual challenges: visual differences between training and evaluation environments degrade policy performance. Policies that rely on state estimates, such as 6D pose, require task-specific tracking and are difficult to scale, while policies that operate on raw sensor input may lack robustness to small visual disturbances. In this work, we leverage 2D keypoints (spatially consistent features in the image frame) as a flexible state representation for robust policy learning, and apply them to both sim-to-real transfer and real-world imitation learning. However, the right choice of keypoints varies across objects and tasks. We propose ATK, a method that automatically selects keypoints in a task-driven manner so that the chosen keypoints are predictive of optimal behavior for the given task. ATK optimizes for a minimal set of keypoints that focus on task-relevant parts while preserving policy performance and robustness. We distill expert data (from either an expert policy in simulation or a human expert) into a policy that operates on RGB images while tracking the selected keypoints. By leveraging pre-trained visual modules, our system effectively encodes states and transfers policies to real-world evaluation scenarios despite wide scene variations and perceptual challenges such as transparent objects, fine-grained tasks, and deformable object manipulation.
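To make the idea of a "minimal set of keypoints that preserves policy performance" concrete, here is a small sketch under loudly stated assumptions. It is not the ATK optimization itself, which this page does not spell out. It swaps in a greedy backward elimination over candidate keypoints, uses a least-squares linear policy as a stand-in for the learned policy, and runs on synthetic data. The names `imitation_error` and `select_keypoints`, the tolerance `tol`, and the toy expert are all hypothetical.

```python
import numpy as np

def imitation_error(keypoints, actions, keep):
    """MSE of a least-squares linear policy that only sees the kept keypoints.

    keypoints: (N, K, 2) array of 2D keypoint coordinates per frame.
    actions:   (N, A) array of expert actions.
    keep:      list of keypoint indices the policy is allowed to observe.
    """
    n = keypoints.shape[0]
    feats = keypoints[:, keep, :].reshape(n, -1)      # (N, 2 * len(keep))
    feats = np.hstack([feats, np.ones((n, 1))])       # append a bias column
    weights, *_ = np.linalg.lstsq(feats, actions, rcond=None)
    return float(np.mean((feats @ weights - actions) ** 2))

def select_keypoints(keypoints, actions, tol=1e-6):
    """Greedily drop keypoints while the imitation error stays below tol."""
    keep = list(range(keypoints.shape[1]))
    while len(keep) > 1:
        # Evaluate the error after removing each remaining keypoint in turn.
        trials = [
            (imitation_error(keypoints, actions, [k for k in keep if k != drop]), drop)
            for drop in keep
        ]
        err, drop = min(trials)
        if err > tol:      # every further removal would hurt the policy
            break
        keep.remove(drop)
    return keep

# Synthetic data (an assumption for illustration): 6 candidate keypoints,
# but the toy expert's action depends only on keypoints 0 and 3.
rng = np.random.default_rng(0)
kp = rng.uniform(0.0, 1.0, size=(500, 6, 2))
expert_actions = np.hstack([kp[:, 0, :] - kp[:, 3, :], kp[:, 3, :1]])

selected = select_keypoints(kp, expert_actions)
print(selected)
```

In this toy setting, the four keypoints that carry no information about the expert action can be removed without changing the imitation error, while removing either of the two informative ones makes the error jump. A learned selection mechanism trained jointly with the policy plays this role in practice; the greedy loop is used here only because it is short and easy to verify.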

Quantitative Results

Imitation policy success rates. Left: Aggregated results across diverse evaluation conditions show that ATK outperforms other methods based on different input modalities and selection strategies. Right: ATK demonstrates strong robustness under positional variation and various visual perturbations.

2. Sim-to-real Transfer

ATK enables sim-to-real policies to remain robust and generalizable under environmental disturbances.

Method

Qualitative visualization of task-relevant keypoint selection and transfer. Keypoints selected in simulation transfer to real-world scenes across various object positions, backgrounds, distractors, and lighting.

Clock Hand Turning

Clock Button Press
GlassPot Lift
Sushi Pick

Quantitative Results

Sim-to-real policy success rates in the real world. Left: Aggregated results across real-world evaluation conditions (random pose, background variation, distractor objects, and lighting changes) show that ATK outperforms other methods using different input modalities. Right: ATK demonstrates strong robustness under positional variation and various visual perturbations.

3. Category-Level Generalization

Can our method generalize to different objects within the same category?

Towel Folding

Static View

Moving View

Static View

Moving View

Hanging Blanket

Static View

Moving View

Static View

Moving View

4. Viewpoint Generalization

Can our method generalize to different camera viewpoints?

Towel Folding

Viewpoint 1

Viewpoint 2

Viewpoint 3

Hanging Blanket

Viewpoint 1

Viewpoint 2

Viewpoint 3

Failure Cases

Using 2D keypoint pixel coordinates as input provides limited semantic information, which can restrict the system's ability to understand the full context of the task. Additionally, off-the-shelf tracking or correspondence modules are not always robust enough for robotic applications, especially on uncommon or visually challenging data. In sim-to-real scenarios, discrepancies in dynamics between simulation and the real world further contribute to failures.

Tracking Errors
Correspondence Mismatch
Dynamics Gap

Simulation

Real World

Simulation

Real World

Representative failure trajectories in the real world

Team

Yunchu Zhang

University of Washington

Shubham Mittal

University of Washington

Zhengyu Zhang

University of Washington

Liyiming Ke

University of Washington

Siddhartha Srinivasa

University of Washington

Abhishek Gupta

University of Washington

BibTeX

            
@misc{zhang2025atkautomatictaskdrivenkeypoint,
    title={ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning}, 
    author={Yunchu Zhang and Shubham Mittal and Zhengyu Zhang and Liyiming Ke and Siddhartha Srinivasa and Abhishek Gupta},
    year={2025},
    eprint={2506.13867},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2506.13867}, 
}