VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation


Abstract

Vision-Language-Action (VLA) models can process instructions and visual perception to directly generate actions as output in an end-to-end fashion, owing to their strong multi-modal reasoning capabilities. While the performance of VLA models is promising, their computational cost can be substantial. This poses a challenge for applying them to robotic tasks, which require real-time decision-making to respond quickly to environmental changes. Since robotic control involves sequential decision-making, the visual input often exhibits minimal variation between successive steps. A natural idea is therefore to reuse the computational results of unchanged visual tokens from the previous step. Motivated by this idea, we propose VLA-Cache, an efficient vision-language-action model. VLA-Cache incorporates a token-selection mechanism that compares the visual input at each step with the input from the previous step, adaptively identifying visual tokens with minimal changes. The computational results for these unchanged tokens are then reused in subsequent steps via the KV-cache, significantly improving the efficiency of the VLA-Cache model. Experimental results in both simulation (the LIBERO benchmark and SIMPLER) and on a real-world robot validate that VLA-Cache achieves practical acceleration with minimal sacrifice in success rate.
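The token-reuse idea described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: it compares per-token visual features between consecutive steps by cosine similarity, marks tokens above a threshold as static, reuses their cached outputs, and recomputes only the changed tokens. The `expensive_transform` function and the `threshold` value are hypothetical placeholders for the real backbone computation and selection criterion.

```python
import numpy as np


def select_static_tokens(prev_tokens, curr_tokens, threshold=0.99):
    """Boolean mask of tokens whose features barely changed between
    consecutive steps (cosine similarity >= threshold). Shapes: (N, D)."""
    prev_n = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    curr_n = curr_tokens / np.linalg.norm(curr_tokens, axis=1, keepdims=True)
    cos_sim = np.sum(prev_n * curr_n, axis=1)
    return cos_sim >= threshold


def expensive_transform(tokens):
    """Stand-in for the costly per-token computation (hypothetical)."""
    d = tokens.shape[1]
    return np.tanh(tokens @ np.full((d, d), 0.1))


def step_with_cache(curr_tokens, prev_tokens, cache):
    """Reuse cached outputs for static tokens; recompute only changed ones."""
    static = select_static_tokens(prev_tokens, curr_tokens)
    out = np.empty_like(cache)
    out[static] = cache[static]                       # reuse from cache
    if (~static).any():
        out[~static] = expensive_transform(curr_tokens[~static])
    return out, static
```

In a real VLA model the cached quantities would be the transformer's key/value states rather than generic token outputs, but the selection-then-reuse control flow is the same.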

Experiments

Simulation Experiments

LIBERO Experiments

Comparison of VLA-Cache with other VLM caching acceleration algorithms on the LIBERO benchmark, using OpenVLA as the baseline model. We report the success rate, as well as the FLOPs and CUDA time during language-backbone decoding.


SIMPLER Experiments

Comparison of VLA-Cache with the baseline model CogACT on the SIMPLER environment. We report the success rate and inference efficiency. Results are presented for the Google robot arm under the two SIMPLER settings: Visual Matching and Variant Aggregation.


Real-world Experiments

Comparison of VLA-Cache with the baseline model OpenVLA in a real-world environment. We report the success rate and inference efficiency across four different tasks. For the real-world tasks, a Kinova Jaco2 robot arm was used to collect demonstrations and fine-tune OpenVLA as the baseline model. We also report the average performance across all tasks.


Rollouts of VLA-Cache


Real-World Tasks

PickPot (Baseline)
PlaceCube (Baseline)
PutSausage (Baseline)
WipeTable (Baseline)
PickPot (VLA-Cache)
PlaceCube (VLA-Cache)
PutSausage (VLA-Cache)
WipeTable (VLA-Cache)

LIBERO

LIBERO-Spatial (Baseline)
LIBERO-Object (Baseline)
LIBERO-Goal (Baseline)
LIBERO-Long (Baseline)
LIBERO-Spatial (VLA-Cache)
LIBERO-Object (VLA-Cache)
LIBERO-Goal (VLA-Cache)
LIBERO-Long (VLA-Cache)

BibTeX

@article{xu2025vla,
  title={VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation},
  author={Xu, Siyu and Wang, Yunke and Xia, Chenghao and Zhu, Dihao and Huang, Tao and Xu, Chang},
  journal={arXiv preprint arXiv:2502.02175},
  year={2025}
}