DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

Institute for AI, Peking University; PKU-PsiBot Joint Lab; HKUST (Guangzhou)

Abstract

Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be able to grasp diverse objects in arbitrary scenarios. However, existing research typically relies on restrictive assumptions, such as single-object settings or limited environments, which constrain generalization. Our solution is DexGraspVLA, a hierarchical framework that uses a pre-trained vision-language model (VLM) as the high-level task planner and learns a diffusion-based policy as the low-level action controller. The key insight lies in iteratively transforming diverse language and visual inputs into domain-invariant representations, on which imitation learning can be applied effectively because domain shift is alleviated. This enables robust generalization across a wide range of real-world scenarios. Notably, our method achieves a success rate above 90% under thousands of unseen object, lighting, and background combinations in a "zero-shot" environment. Empirical analysis further confirms that internal model behavior remains consistent across environmental variations, validating our design and explaining its generalization performance. We hope our work is a step toward general dexterous grasping.

Method

DexGraspVLA adopts a hierarchical architecture composed of an off-the-shelf VLM-based high-level planner and a diffusion-based low-level controller. Given a cluttered scene, the planner reasons about the user prompt, e.g., "clear the table", decomposing it into multiple grasping instructions when necessary. For each instruction \(l\), e.g., "grasp the cookie", the planner identifies the target object \(A\) from the head-camera image \(\mathbf{I}_{t_0}^h\) and marks its bounding box \((x_1^A, y_1^A, x_2^A, y_2^A)\) at the initial time \(t_0\).

The controller consists of four parts (a minimal sketch of how they fit together is given after the list):

  1. Two segmentation models: SAM, which produces the target object's mask \(\mathbf{m}_{t_0}\) from the bounding box at \(t_0\), and Cutie, a video object segmentation model that continuously tracks the mask \(\mathbf{m}_t\) throughout each grasping process.
  2. Three vision encoders: two frozen DINOv2 encoders that extract features from the third-person head-camera image \(\mathbf{I}_t^h\) and the first-person wrist-camera image \(\mathbf{I}_t^w\), and a trainable ViT that encodes the mask \(\mathbf{m}_t\).
  3. Three MLP projectors that map the visual features and robot proprioceptive state into the same feature space, forming a feature sequence.
  4. A DiT (diffusion transformer) that, conditioned on this feature sequence, predicts an action chunk \((\mathbf{a}_t, \ldots, \mathbf{a}_{t+H-1})\).
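
To make the data flow concrete, here is a minimal PyTorch-style sketch of the controller. The module names, feature dimensions, action-chunk horizon, and the plain transformer standing in for the DiT (with diffusion-timestep conditioning omitted) are all illustrative assumptions rather than the released implementation.

    # Minimal sketch of the controller's data flow; names, dimensions, and the
    # simplified backbone are illustrative assumptions, not the authors' code.
    import torch
    import torch.nn as nn

    D, H, ACT_DIM = 512, 16, 13   # shared width, chunk horizon, action dim (assumed)

    class Projector(nn.Module):
        """MLP that maps one feature stream into the shared feature space."""
        def __init__(self, in_dim, out_dim=D):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                     nn.Linear(out_dim, out_dim))

        def forward(self, x):
            return self.net(x)

    class ControllerSketch(nn.Module):
        def __init__(self, head_dim=1024, wrist_dim=1024, mask_dim=384, state_dim=13):
            super().__init__()
            # Three MLP projectors for the visual features (head, wrist, mask);
            # embedding the proprioceptive state with a plain linear layer is an
            # assumption about how the streams are grouped.
            self.proj_head = Projector(head_dim)
            self.proj_wrist = Projector(wrist_dim)
            self.proj_mask = Projector(mask_dim)
            self.embed_state = nn.Linear(state_dim, D)
            self.embed_action = nn.Linear(ACT_DIM, D)
            # Stand-in for the DiT; diffusion-timestep conditioning is omitted.
            layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(D, ACT_DIM)

        def forward(self, feat_head, feat_wrist, feat_mask, state, noisy_chunk):
            # Project every input stream into the shared space and concatenate
            # them into one conditioning token sequence.
            cond = torch.cat([self.proj_head(feat_head),
                              self.proj_wrist(feat_wrist),
                              self.proj_mask(feat_mask),
                              self.embed_state(state).unsqueeze(1)], dim=1)
            tokens = torch.cat([cond, self.embed_action(noisy_chunk)], dim=1)
            out = self.backbone(tokens)
            # Denoising target for the action chunk a_t ... a_{t+H-1}.
            return self.head(out[:, -H:])

    # Example with random stand-in features: 256 head tokens, 256 wrist tokens,
    # 64 mask tokens, a 13-D proprioceptive state, and a noisy 16-step chunk.
    model = ControllerSketch()
    chunk = model(torch.randn(1, 256, 1024), torch.randn(1, 256, 1024),
                  torch.randn(1, 64, 384), torch.randn(1, 13),
                  torch.randn(1, H, ACT_DIM))
    print(chunk.shape)  # torch.Size([1, 16, 13])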

While the controller executes a grasp, the planner monitors the execution, checks whether the grasp succeeds, and assists with re-grasping upon failure. This process continues until the user prompt is fully completed.
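
The overall planner-controller interaction can be summarized as follows; every object and method name in this sketch (planner.decompose, segmenter.mask_from_box, and so on) is a hypothetical placeholder rather than the framework's actual API.

    # Hypothetical sketch of the planner-controller loop described above.
    # The planner, segmenter, controller, and camera objects are placeholders.
    def fulfill_prompt(user_prompt, planner, segmenter, controller, camera):
        # The planner decomposes the prompt into grasping instructions,
        # e.g. "clear the table" -> ["grasp the cookie", "grasp the cup", ...].
        for instruction in planner.decompose(user_prompt):
            succeeded = False
            while not succeeded:
                image = camera.head_image()
                # The planner marks the target's bounding box in the head image;
                # SAM converts it into a mask at t0, which Cutie then tracks.
                bbox = planner.locate(instruction, image)
                mask = segmenter.mask_from_box(image, bbox)
                # The controller executes closed-loop action chunks to grasp.
                controller.grasp(instruction, mask)
                # The planner checks the outcome and triggers a re-grasp on failure.
                succeeded = planner.verify(camera.head_image(), instruction)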

The controller is trained on a dataset consisting of 2,094 successful grasping episodes in cluttered scenes. These demonstrations are collected at typical human motion speeds, with each episode taking approximately 3.5 seconds. In total, this amounts to roughly two hours of data.
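
As a quick arithmetic check, the reported episode count and per-episode duration are consistent with the stated total: \[ 2{,}094\ \text{episodes} \times 3.5\ \text{s/episode} \approx 7{,}300\ \text{s} \approx 2.0\ \text{h}. \]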

Experiments

Large-Scale Generalization Evaluation

To thoroughly evaluate the generalization performance of DexGraspVLA, we curate (a) 360 unseen objects covering a diverse range of sizes, weights, geometries, textures, materials, and categories, and (b) 3 unseen lighting conditions and 6 unseen backgrounds that significantly differ from the training data.

Based on this setup, we design three types of grasping tasks in cluttered scenes, with each scene containing approximately six objects. The tasks involve grasping an unseen object from a random scene under the following conditions:

  • Unseen Objects: on a white table under white light.
  • Unseen Backgrounds: on an unseen background under white light.
  • Unseen Lightings: on a white table under an unseen lighting condition.

In total, we generate over 1,200 unseen cluttered scenes in a zero-shot environment to rigorously test our method.

In our evaluation, "Ours@\(k\)" denotes the performance of DexGraspVLA when each test is given \(k\) attempts. DexGraspVLA achieves a 91.1% single-attempt success rate on Unseen Objects, 90.5% on Unseen Backgrounds, and 90.9% on Unseen Lightings, for an overall success rate of 90.8%. These results demonstrate its ability to grasp specified objects from clutter while remaining robust to environmental changes, all without domain-specific fine-tuning.

Notably, despite operating in unseen environments and tasks, DexGraspVLA maintains high success rates, addressing a key challenge in imitation learning: overfitting to a single domain. It dexterously adjusts to varying object geometries, sizes, and positions, and its closed-loop policy enables re-grasps, improving robustness. The method also withstands human-induced perturbations, tracking and securing moving objects until a successful grasp is achieved.

With multiple attempts, performance further improves—reaching 96.9% success within three tries. Additionally, our model takes ~6 seconds per grasp, comparable to human efficiency, ensuring practical real-world usability.

Comparison to Baselines on a Single-Object Grasping Benchmark

To compare DexGraspVLA with baselines whose controllers learn directly from raw visual inputs, we conduct single-object grasping experiments on 13 seen and 8 unseen objects. Each object is placed at five different table locations, ensuring coverage of the workspace and the robot's reach. At each location, the policy attempts two independent grasps, totaling 210 tests under consistent environmental conditions (a white tabletop under white lighting, in a zero-shot environment).
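
The total number of tests follows directly from this setup: \[ (13 + 8)\ \text{objects} \times 5\ \text{locations} \times 2\ \text{attempts} = 210\ \text{tests}. \]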

DexGraspVLA achieves over 98% success on both seen and unseen objects, significantly outperforming alternatives. Notably, it even performs slightly better on unseen objects, confirming that it generalizes beyond training data rather than overfitting. In contrast, baseline models struggle in the novel environment, as they directly map raw inputs to actions and are sensitive to perceptual shifts.

Internal Model Behavior Analysis

We analyze the internal model behavior of DexGraspVLA to show that it is robust to environmental variations. The first row in the figure above presents the cropped raw head images of the same cluttered objects in four different environments: a white table, a calibration board, a tablecloth, and a tablecloth under disco light. The second row shows that the DINOv2 features of these images remain consistent across the variations. The third row shows the target-object masks accurately tracked by Cutie. The fourth row shows that the DiT's averaged attention maps over the head-image features are likewise consistent regardless of the perceptual differences. The fifth row confirms that DexGraspVLA attends to the correct object. Together, these observations substantiate that DexGraspVLA indeed transforms perceptually diverse raw inputs into invariant representations, on which it effectively applies imitation learning to model the data distribution, explaining its superior generalization performance.

Performance Demonstrations

Dexterous Grasping in Unseen Cluttered Scenes



Lighting Generalization



Background Generalization



Grasping Small Objects



Grasping Industry Objects



Re-grasps



Human Disturbance



Long-horizon Grasping



Robot Shakes Hand with Human

Conclusion

This paper presents DexGraspVLA, the first hierarchical vision-language-action framework that advances toward general dexterous grasping. It utilizes a pre-trained VLM as the high-level planner to orchestrate the grasping process and a diffusion-based policy as the low-level controller to perform closed-loop action prediction. Within this paradigm, DexGraspVLA capitalizes on the world knowledge of foundation models to understand diverse raw inputs and transform them into domain-invariant representations. Imitation learning is then applied to model the mapping from these representations to the action distribution, which is highly effective because domain shift is alleviated. Our large-scale evaluations show that it attains a success rate exceeding 90% under thousands of unseen object, lighting, and background combinations in a "zero-shot" test environment, demonstrating robust generalization. An empirical analysis of its internal model behavior further validates the framework design. Overall, DexGraspVLA demonstrates the promise of leveraging foundation models to enhance generalization in dexterous grasping. We plan to further refine its performance and broaden its applications in future work.