DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

1Institute for AI, Peking University, 2PKU-PsiBot Joint Lab, 3HKUST (Guangzhou)
*Equal contribution, Corresponding authors

Abstract

Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on restrictive assumptions, such as single-object settings or limited environments, leading to constrained generalization. We present DexGraspVLA, a hierarchical framework for general dexterous grasping in cluttered scenes based on RGB image perception and language instructions. It utilizes a pre-trained Vision-Language model as the high-level task planner and learns a diffusion-based policy as the low-level Action controller. The key insight to achieve robust generalization lies in iteratively transforming diverse language and visual inputs into domain-invariant representations via foundation models, where imitation learning can be effectively applied due to the alleviation of domain shift. Notably, our method achieves a 90+% success rate under thousands of unseen object, lighting, and background combinations in a “zero-shot” environment. Empirical analysis confirms the consistency of internal model behavior across environmental variations, thereby validating our design and explaining its generalization performance. DexGraspVLA also demonstrates free-form long-horizon prompt execution, robustness to adversarial objects and human disturbance, and failure recovery, which are rarely achieved simultaneously in prior work. Extended application to nonprehensile object grasping further proves its generality.

Method

DexGraspVLA adopts a hierarchical architecture composed of an off-the-shelf VLM-based high-level planner and a diffusion-based low-level controller. Given a cluttered scene, the planner grounds the user prompt, e.g., "clear the table", in the observation and proposes grasping instructions \(\{l_i\}\) sequentially. For each instruction \(l\), e.g., "grasp the cookie", the planner identifies the target object \(A\) in the head image \(\mathbf{I}_{t_0}^h\) and marks its bounding box \((x_1^A, y_1^A, x_2^A, y_2^A)\) at the initial time \(t_0\).

The controller consists of four parts (a schematic data-flow sketch follows the list):

  1. Two segmentation models: SAM, which obtains the object's mask \(\mathbf{m}_{t_0}\) at \(t_0\), and Cutie, a video segmentation model that continuously tracks the mask \(\mathbf{m}_t\) throughout each grasping process.
  2. Three vision encoders: two frozen DINOv2 encoders that extract features from the third-view head-camera image \(\mathbf{I}_t^h\) and the first-view wrist-camera image \(\mathbf{I}_t^w\), and a trainable ViT that encodes the mask \(\mathbf{m}_t\).
  3. Three MLP projectors that map the visual features and the robot's proprioceptive state into a common feature space, forming a feature sequence.
  4. A DiT that predicts an action chunk \(\mathbf{a}_t, \ldots, \mathbf{a}_{t+H-1}\).
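To make the data flow concrete, here is a minimal PyTorch sketch of how these components could be wired together. The feature dimensions, the per-stream projector layout, the plain TransformerEncoder standing in for the DiT denoiser, and the omission of diffusion-timestep conditioning are simplifying assumptions of this sketch, not details of the released implementation.

```python
import torch
import torch.nn as nn

class ControllerSketch(nn.Module):
    """Illustrative wiring of the controller's feature streams; not the released code."""

    def __init__(self, d_head=1024, d_wrist=1024, d_mask=384, d_state=32,
                 d_model=512, horizon=16, action_dim=19):
        super().__init__()
        # MLP projectors mapping each feature stream into a shared token space.
        self.proj_head = nn.Sequential(nn.Linear(d_head, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        self.proj_wrist = nn.Sequential(nn.Linear(d_wrist, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        self.proj_mask = nn.Sequential(nn.Linear(d_mask, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        self.proj_state = nn.Linear(d_state, d_model)
        # Denoising transformer over [conditioning tokens | noisy action tokens]
        # (a plain TransformerEncoder stands in for the DiT here).
        self.action_in = nn.Linear(action_dim, d_model)
        self.denoiser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        self.action_out = nn.Linear(d_model, action_dim)
        self.horizon = horizon

    def forward(self, head_feat, wrist_feat, mask_feat, state, noisy_actions):
        # head_feat, wrist_feat: (B, N, d) patch features from the frozen DINOv2 encoders;
        # mask_feat: (B, N_m, d_mask) features of the mask m_t from the trainable ViT;
        # state: (B, d_state) proprioception; noisy_actions: (B, H, action_dim).
        cond = torch.cat([self.proj_head(head_feat),
                          self.proj_wrist(wrist_feat),
                          self.proj_mask(mask_feat),
                          self.proj_state(state).unsqueeze(1)], dim=1)
        tokens = torch.cat([cond, self.action_in(noisy_actions)], dim=1)
        out = self.denoiser(tokens)
        # Predict the noise on the last H tokens, i.e. the chunk a_t ... a_{t+H-1}.
        return self.action_out(out[:, -self.horizon:])
```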

During the controller's grasping process, the planner monitors the execution and triggers a scripted placing motion when grasping succeeds. After each grasping attempt, the planner resets the robot and proposes a new grasping instruction. This process continues until the user prompt is fully completed.
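The interplay between the planner and the controller can be summarized by the following hypothetical orchestration loop. Every interface name here (propose_instruction, predict_chunk, check_grasp_success, and so on) is an illustrative placeholder rather than the actual DexGraspVLA API.

```python
# Hypothetical orchestration of the planner-controller hierarchy; all method
# names and interfaces are illustrative placeholders, not the real API.
def run_user_prompt(planner, controller, sam, cutie, robot, prompt, max_steps=200):
    """Execute a free-form user prompt, e.g. "clear the table", to completion."""
    while not planner.prompt_completed(prompt, robot.head_image()):
        # The planner grounds the prompt in the observation, proposes the next
        # grasping instruction l_i, and marks the target's bounding box on I_{t0}.
        instruction, bbox = planner.propose_instruction(prompt, robot.head_image())
        mask = sam.segment(robot.head_image(), bbox)       # m_{t0}
        cutie.initialize(robot.head_image(), mask)

        for _ in range(max_steps):
            obs = robot.observe()                          # head/wrist images + proprioception
            mask = cutie.track(obs.head_image)             # m_t, tracked during the grasp
            actions = controller.predict_chunk(obs, mask)  # a_t ... a_{t+H-1}
            robot.execute(actions)
            if planner.check_grasp_success(robot.head_image()):
                robot.run_scripted_placing()               # scripted placing motion
                break

        robot.reset()  # reset before the next grasping instruction
```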

The controller is trained on a dataset consisting of 2,094 successful grasping episodes in cluttered scenes. These demonstrations are collected at typical human motion speeds, with each episode taking approximately 3.5 seconds. In total, this amounts to roughly two hours of data.
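To connect the demonstration data to the learning objective, the sketch below shows one standard diffusion-policy training step over H-step action chunks, assuming a DDPM-style noise-prediction loss; the noise schedule and batch format are illustrative, and the controller is the simplified module sketched above (so timestep conditioning is again omitted).

```python
import torch
import torch.nn.functional as F

def training_step(controller, batch, num_diffusion_steps=100):
    # batch: pre-extracted features plus the expert action chunk from one demo segment.
    head_feat, wrist_feat, mask_feat, state, actions = batch   # actions: (B, H, action_dim)
    B = actions.shape[0]

    # Sample a diffusion step and corrupt the expert action chunk with Gaussian noise
    # (toy cosine schedule; the actual schedule is an implementation detail).
    t = torch.randint(0, num_diffusion_steps, (B,))
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_diffusion_steps) ** 2
    noise = torch.randn_like(actions)
    noisy = (alpha_bar.sqrt().view(B, 1, 1) * actions
             + (1 - alpha_bar).sqrt().view(B, 1, 1) * noise)

    # The controller, conditioned on visual features, mask, and proprioception,
    # is trained to predict the injected noise.
    pred = controller(head_feat, wrist_feat, mask_feat, state, noisy)
    return F.mse_loss(pred, noise)
```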

Experiments

Large-Scale Generalization Evaluation

To thoroughly evaluate the generalization performance of DexGraspVLA, we curate (a) 360 unseen objects covering a diverse range of sizes, weights, geometries, textures, materials, and categories, and (b) 3 unseen lighting conditions and 6 unseen backgrounds that significantly differ from the training data.

Based on this setup, we design three types of grasping tasks in cluttered scenes, with each scene containing approximately six objects. The tasks involve grasping an unseen object from a random scene under the following conditions:

  • Unseen Objects: on a white table under white light.
  • Unseen Backgrounds: on an unseen background under white light.
  • Unseen Lightings: on a white table under an unseen lighting condition.

In total, we generate over 1,200 unseen cluttered scenes in a zero-shot environment to rigorously test our method.

We report the success rate of DexGraspVLA, where "Ours@\(k\)" denotes its performance when each test is given \(k\) attempts. DexGraspVLA achieves a 91.1% single-attempt success rate on Unseen Objects, 90.5% on Unseen Backgrounds, and 90.9% on Unseen Lightings, with an overall success rate of 90.8%. These results demonstrate its ability to grasp specified objects from clutter while remaining robust to environmental changes—all without domain-specific fine-tuning.

Notably, despite operating in unseen environments and tasks, DexGraspVLA maintains high success rates, addressing a key challenge in imitation learning: overfitting to a single domain. It dexterously adjusts to varying object geometries, sizes, and positions, and its closed-loop policy enables re-grasps, improving robustness. The method also withstands human-induced perturbations, tracking and securing moving objects until a successful grasp is achieved.

With multiple attempts, performance further improves—reaching 96.9% success within three tries. Additionally, our model takes ~6 seconds per grasp, comparable to human efficiency, ensuring practical real-world usability.
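To be explicit about how the multi-attempt metric is read: a test counts as successful at budget \(k\) if any of its first \(k\) attempts succeeds. A minimal tally over hypothetical per-test attempt logs looks as follows.

```python
def success_at_k(attempt_logs, k):
    """attempt_logs: per-test attempt outcomes, e.g. [[False, True], [True], ...] (made-up data)."""
    hits = sum(any(outcomes[:k]) for outcomes in attempt_logs)
    return hits / len(attempt_logs)

logs = [[True], [False, True, True], [False, False, False]]
print(success_at_k(logs, 1), success_at_k(logs, 3))   # 0.333..., 0.666...
```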

Comparison to Baselines on a Single-Object Grasping Benchmark

To compare DexGraspVLA with baselines whose controllers learn directly from raw visual inputs, we conduct single-object grasping experiments on 13 seen objects and 8 unseen objects. Each object is placed at five different table locations, ensuring coverage of the workspace and the robot's reach. At each location, the policy attempts two independent grasps, totaling 210 tests under consistent environmental conditions (white tabletop and lighting, in a zero-shot environment).

DexGraspVLA achieves over 98% success on both seen and unseen objects, significantly outperforming alternatives. Notably, it even performs slightly better on unseen objects, confirming that it generalizes beyond training data rather than overfitting. In contrast, baseline models struggle in the novel environment, as they directly map raw inputs to actions and are sensitive to perceptual shifts.

Internal Model Behavior Analysis

We analyze the internal model behavior of DexGraspVLA to show that it is robust to environmental variations. We visualize the same cluttered objects in four different environments (a white table, a calibration board, a tablecloth, and a tablecloth under disco light) and inspect the model's intermediate representations at each stage:

  • The cropped raw head images differ markedly in appearance across the four environments.
  • The DINOv2 features of these images are nonetheless consistent across the variations.
  • The masks of the target objects are accurately tracked by Cutie.
  • The averaged attention maps of the DiT over the head-image features are likewise consistent regardless of perceptual differences.
  • DexGraspVLA attends to the correct object in every case.

Therefore, we substantiate that DexGraspVLA indeed transforms perceptually diverse raw inputs into invariant representations, on which it effectively applies imitation learning to model the data distribution, explaining its superior generalization performance.
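As an illustration of how such averaged attention maps can be obtained, the snippet below averages a transformer's attention weights over layers, heads, and query tokens and keeps only the columns corresponding to head-image tokens; the tensor layout and token bookkeeping are assumptions of this sketch, not the paper's exact procedure.

```python
import torch

def averaged_head_image_attention(attn_per_layer, h_start, h_end, grid=(16, 16)):
    # attn_per_layer: list of per-layer attention weights, each (num_heads, N_query, N_key);
    # head-image features are assumed to occupy key positions [h_start, h_end).
    attn = torch.stack(attn_per_layer).mean(dim=(0, 1))   # average over layers and heads
    attn_to_head = attn[:, h_start:h_end].mean(dim=0)     # average over query tokens
    return attn_to_head.reshape(grid)                     # attention map over image patches

# Example with random weights: 6 layers, 8 heads, 300 queries, 560 keys,
# head-image tokens at key positions 0..255 (a 16x16 patch grid).
maps = averaged_head_image_attention(
    [torch.rand(8, 300, 560).softmax(-1) for _ in range(6)], 0, 256)
print(maps.shape)   # torch.Size([16, 16])
```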

Long-Horizon Task Evaluation

To evaluate DexGraspVLA's capability on complex, long-horizon tasks, we design four user prompts: "Clear the table", "Grasp all bottles", "Grasp all green objects", and "Grasp all food". These prompts require common-sense and physical knowledge to identify appropriate grasping targets sequentially. For each prompt, we generate 24 cluttered scenes using unseen objects placed on a white tabletop under white lighting. Tasks involve grasping multiple relevant targets in sequence, requiring both high-level planning and precise low-level control.

DexGraspVLA achieves an overall task success rate of 89.6%, completing all stages of the multi-step prompts reliably. The high-level planner accurately proposes current grasping instructions (94.3% success on average), marks bounding boxes with over 98% accuracy, and detects task completion to avoid redundant actions (94+% accuracy). The low-level controller executes grasps with over 91% success, enabling robust step-by-step task execution. These results underscore DexGraspVLA's strong reasoning capability, perceptual grounding, and action reliability—working in synergy to complete long-horizon tasks without any task-specific training.

Extended Application to Nonprehensile Grasping

To test its generality, we extend DexGraspVLA to nonprehensile grasping, which requires object repositioning before lifting, by applying the same hierarchical framework without architectural changes. Trained on over 1,000 demonstrations with flat, hard-to-grasp objects, DexGraspVLA achieves 84.7% success on 144 tests with 18 unseen objects across novel lighting and background conditions. It learns to push objects toward the table edge before executing a stable grasp, showing robust generalization to shape, texture, and pose variations. This result highlights DexGraspVLA's flexibility in handling complex manipulation skills beyond dexterous grasping, and its generality in both perception and control.

Performance Demonstrations

Dexterous Grasping in Unseen Cluttered Scenes



Lighting Generalization



Background Generalization



Grasping Small Objects



Grasping Industry Objects



Re-grasps



Human Disturbance



Long-horizon Grasping



Extended Application: Nonprehensile Grasping

To evaluate its generality, we extend DexGraspVLA to nonprehensile grasping, where the robot first repositions a flat, hard-to-grasp object toward the table edge before executing a stable grasp. The controller is trained on 1,029 human demonstrations. Below are some deployment results.



Robot Shakes Hand with Human

Conclusion

We present DexGraspVLA, a hierarchical VLA framework for general-purpose dexterous grasping. By leveraging a pre-trained VLM as the high-level planner and a diffusion-based low-level controller, the system transforms diverse multimodal inputs into domain-invariant representations and learns robust closed-loop grasping policies via imitation learning. Our large-scale evaluations show over 90% success across thousands of unseen cluttered scenes in a zero-shot setting, with empirical evidence of strong generalization and consistent internal behavior. DexGraspVLA also handles free-form long-horizon prompts, recovers from failures, and extends to nonprehensile grasping, demonstrating broad applicability.

BibTeX

@misc{zhong2025dexgraspvla,
      title={DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping}, 
      author={Yifan Zhong and Xuchuan Huang and Ruochong Li and Ceyao Zhang and Yitao Liang and Yaodong Yang and Yuanpei Chen},
      year={2025},
      eprint={2502.20900},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2502.20900}, 
}