Visuomotor policies often suffer from perceptual challenges, where visual differences between training and evaluation environments degrade policy performance. Policies that rely on state estimates, such as 6D pose, require task-specific tracking and are difficult to scale, while raw sensor-based policies may lack robustness to small visual disturbances. In this work, we leverage 2D keypoints (spatially consistent features in the image frame) as a flexible state representation for robust policy learning, and apply this representation to both sim-to-real transfer and real-world imitation learning. However, the choice of which keypoints to use can vary across objects and tasks. We propose a novel method, ATK, to automatically select keypoints in a task-driven manner so that the chosen keypoints are predictive of optimal behavior for the given task. Our method optimizes for a minimal set of keypoints that focus on task-relevant parts while preserving policy performance and robustness. We distill expert data (either from an expert policy in simulation or a human expert) into a policy that operates on RGB images while tracking the selected keypoints. By leveraging pre-trained visual modules, our system effectively encodes states and transfers policies to the real-world evaluation scenario despite wide scene variations and perceptual challenges such as transparent objects, fine-grained tasks, and deformable object manipulation.
ATK automatically selects the minimal information necessary for task execution by distilling expert data (either from an expert policy in simulation or a human expert) into a policy that operates on a selected subset of keypoints, while jointly optimizing the selection mask. Once the keypoints are identified, they are transferred from the training set to the real-world evaluation scenario. Finally, the keypoint-based policy is transferred to the evaluation scenario, taking RGB images as input while tracking the transferred keypoints.
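The selection-mask optimization described above can be illustrated with a toy sketch. It is not the authors' implementation: the linear policy, the sigmoid mask, the sparsity and weight-decay coefficients, the 0.5 selection threshold, and the name `fit_masked_policy` are all assumptions made for illustration, and each keypoint is reduced to a single scalar coordinate for brevity. The sketch fits a behavior-cloning loss on expert actions while a penalty on the mask values pushes uninformative keypoints toward zero.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_masked_policy(data, num_keypoints, steps=1000, lr=0.5,
                      sparsity=0.02, weight_decay=0.05):
    """Jointly fit a linear policy and a soft keypoint mask by gradient descent.

    Hypothetical helper for illustration; all hyperparameters are assumptions.
    """
    logits = [0.0] * num_keypoints   # mask logits; sigmoid(0) = 0.5 for every keypoint
    weights = [0.1] * num_keypoints  # one linear policy weight per keypoint
    n = len(data)
    for _ in range(steps):
        mask = [sigmoid(v) for v in logits]
        grad_mask = [0.0] * num_keypoints
        grad_w = [0.0] * num_keypoints
        for feats, action in data:
            pred = sum(m * w * x for m, w, x in zip(mask, weights, feats))
            err = 2.0 * (pred - action) / n  # d(mean squared error) / d(pred)
            for k in range(num_keypoints):
                grad_mask[k] += err * weights[k] * feats[k]
                grad_w[k] += err * mask[k] * feats[k]
        for k in range(num_keypoints):
            # Sparsity penalty on the mask value, chained through the sigmoid.
            logits[k] -= lr * (grad_mask[k] + sparsity) * mask[k] * (1.0 - mask[k])
            # Weight decay keeps the weights from compensating for a shrinking mask.
            weights[k] -= lr * (grad_w[k] + 2.0 * weight_decay * weights[k])
    return [sigmoid(v) for v in logits], weights


# Toy "expert": the action copies keypoint 0; keypoints 1-3 are distractors.
rng = random.Random(0)
demos = []
for _ in range(128):
    feats = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    demos.append((feats, feats[0]))

mask, weights = fit_masked_policy(demos, num_keypoints=4)
selected = [k for k, m in enumerate(mask) if m > 0.5]
print("mask:", [round(m, 3) for m in mask])
print("selected keypoints:", selected)
```

In this toy setting only the keypoint that determines the expert action keeps a mask value above the threshold, which mirrors the idea of retaining a minimal set of task-relevant keypoints. A real system would replace the linear policy with a learned network and the scalar features with tracked 2D pixel coordinates.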
ATK is robust to random backgrounds, distractors, and lighting changes
Video comparisons (one panel per method): ATK (static view), ATK (moving view), R3M, Depth, Point Cloud.
Imitation policy success rates. Left: Aggregated results across diverse evaluation conditions show that ATK outperforms other methods based on different input modalities and selection strategies. Right: ATK demonstrates strong robustness under positional variation and various visual perturbations.
ATK enables sim-to-real policies to remain robust and generalizable under environmental disturbances
Qualitative visualization of task-relevant keypoint selection and transfer. Keypoints selected in simulation transfer to real-world scenes across various object positions, backgrounds, distractors, and lighting.
Sim-to-real policy success rates in the real world. Left: Aggregated results across real-world evaluation conditions (random pose, background variation, distractor objects, and lighting changes) show that ATK outperforms other methods using different input modalities. Right: ATK demonstrates strong robustness under positional variation and various visual perturbations.
Can our method generalize to different objects within the same category?
Can our method generalize to different camera viewpoints?
Using 2D keypoint pixel coordinates as input provides limited semantic information, which may limit the system's ability to fully understand the context of the task. Additionally, off-the-shelf tracking or correspondence modules may not always be robust enough for robotic applications, especially when dealing with uncommon or visually challenging data. In sim-to-real scenarios, discrepancies in dynamics between simulation and the real world further contribute to performance failures.
Yunchu Zhang
University of Washington
Shubham Mittal
University of Washington
Zhengyu Zhang
University of Washington
Liyiming Ke
University of Washington
Siddhartha Srinivasa
University of Washington
Abhishek Gupta
University of Washington
@misc{zhang2025atkautomatictaskdrivenkeypoint,
title={ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning},
author={Yunchu Zhang and Shubham Mittal and Zhengyu Zhang and Liyiming Ke and Siddhartha Srinivasa and Abhishek Gupta},
year={2025},
eprint={2506.13867},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2506.13867},
}