DexGraspVLA adopts a hierarchical architecture composed of an off-the-shelf VLM-based high-level
planner and a diffusion-based low-level controller.
Given a cluttered scene, the planner reasons about the user prompt, e.g., "clear the table", decomposing it into multiple grasping instructions when necessary.
For each instruction \(l\), e.g., "grasp the cookie", the planner identifies the target object \(A\) in the head-camera image \(\mathbf{I}_{t_0}^h\)
and marks its bounding box \((x_1^A, y_1^A, x_2^A, y_2^A)\) at the initial time \(t_0\).
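As a rough illustration of this planning step, the sketch below queries a generic VLM backend for the next grasping instruction and the target's bounding box. The `query_vlm` wrapper, the prompt wording, and the JSON response schema are illustrative assumptions, not the paper's actual interface.

```python
import json
from typing import Callable, Tuple

def plan_next_grasp(head_image_t0, user_prompt: str,
                    query_vlm: Callable) -> Tuple[str, Tuple[int, int, int, int]]:
    """Ask the VLM planner to pick the next target and localize it.

    `query_vlm(image, text) -> str` is a hypothetical wrapper around whatever
    VLM backend is used; the JSON schema below is assumed for illustration.
    """
    prompt = (
        f'User request: "{user_prompt}". '
        'Choose one object to grasp next and reply with JSON: '
        '{"instruction": "grasp the <object>", "bbox": [x1, y1, x2, y2]}'
    )
    reply = json.loads(query_vlm(head_image_t0, prompt))
    x1, y1, x2, y2 = reply["bbox"]            # bounding box of target A at time t0
    return reply["instruction"], (x1, y1, x2, y2)
```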
The controller consists of four parts:
- Two segmentation models: SAM, which obtains the object's mask \(\mathbf{m}_{t_0}\) from the marked bounding box at \(t_0\), and Cutie, a video object segmentation model that continuously tracks the mask \(\mathbf{m}_t\) throughout each grasping process.
- Three vision encoders: two frozen DINOv2 encoders that extract features from the third-person head-camera image \(\mathbf{I}_t^h\) and the first-person wrist-camera image \(\mathbf{I}_t^w\), and a trainable ViT that encodes the mask \(\mathbf{m}_t\).
- Three MLP projectors that map the visual features and the robot's proprioceptive state into a shared feature space, forming the conditioning feature sequence.
- A diffusion transformer (DiT) that, conditioned on this feature sequence, predicts the action chunk \((\mathbf{a}_t, \ldots, \mathbf{a}_{t+H-1})\); see the sketch after this list.
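To make the data flow concrete, the following PyTorch sketch shows one way the controller's observation features could be assembled into a single conditioning sequence. Module names, feature dimensions, and the grouping of the three MLP projectors (here, the two DINOv2 streams share one projector) are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def mlp_projector(d_in: int, d_model: int) -> nn.Module:
    """Small MLP mapping one feature stream into the shared d_model space."""
    return nn.Sequential(nn.Linear(d_in, d_model), nn.GELU(), nn.Linear(d_model, d_model))

class ControllerObsEncoder(nn.Module):
    """Illustrative assembly of the controller's conditioning sequence.

    `dino_head` and `dino_wrist` stand in for the two frozen DINOv2 encoders
    and `mask_vit` for the trainable ViT; each is assumed to return token
    features of shape (B, N_i, d_i).
    """

    def __init__(self, dino_head, dino_wrist, mask_vit,
                 d_vis=1024, d_mask=384, d_proprio=32, d_model=512):
        super().__init__()
        self.dino_head, self.dino_wrist, self.mask_vit = dino_head, dino_wrist, mask_vit
        # Three MLP projectors into the shared feature space (grouping assumed).
        self.proj_vis = mlp_projector(d_vis, d_model)        # head + wrist DINOv2 features
        self.proj_mask = mlp_projector(d_mask, d_model)      # mask ViT features
        self.proj_state = mlp_projector(d_proprio, d_model)  # robot proprioceptive state

    def forward(self, img_head, img_wrist, mask, proprio):
        with torch.no_grad():                       # DINOv2 encoders stay frozen
            f_head = self.dino_head(img_head)       # (B, N_h, d_vis)
            f_wrist = self.dino_wrist(img_wrist)    # (B, N_w, d_vis)
        f_mask = self.mask_vit(mask)                # (B, N_m, d_mask), trainable
        tokens = torch.cat([
            self.proj_vis(f_head),
            self.proj_vis(f_wrist),
            self.proj_mask(f_mask),
            self.proj_state(proprio).unsqueeze(1),  # proprioception as a single token
        ], dim=1)                                   # (B, N_h + N_w + N_m + 1, d_model)
        return tokens                               # conditioning sequence for the DiT
```

The DiT would then iteratively denoise a noisy action chunk of horizon \(H\) into \((\mathbf{a}_t, \ldots, \mathbf{a}_{t+H-1})\) conditioned on these tokens; that denoising step is omitted from the sketch.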
While the controller executes a grasp, the planner monitors the execution,
checks whether the grasp succeeds, and triggers re-grasping when it fails. This loop continues until the user prompt is fully completed.
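A minimal sketch of this outer planner loop might look like the following; `decompose`, `locate`, `grasp`, and `verify_success` are hypothetical method names chosen for illustration.

```python
def run_task(user_prompt, planner, controller, camera):
    """Hypothetical planner-monitored grasping loop."""
    for instruction in planner.decompose(user_prompt):         # e.g. "grasp the cookie"
        succeeded = False
        while not succeeded:
            head_img = camera.head_image()
            bbox = planner.locate(head_img, instruction)        # bounding box of target A
            controller.grasp(instruction, bbox)                 # executes action chunks
            succeeded = planner.verify_success(camera.head_image(), instruction)
            # on failure, retry with a freshly localized bounding box
```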
The controller is trained on a dataset consisting of 2,094 successful grasping episodes in cluttered scenes. These demonstrations are collected at typical human motion speeds, with each episode taking approximately 3.5 seconds. In total, this amounts to roughly two hours of data.
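As a quick consistency check on these figures, \(2{,}094 \times 3.5\,\mathrm{s} = 7{,}329\,\mathrm{s} \approx 2.0\,\mathrm{h}\), matching the stated total.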