arXiv Papers

Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

Published: 2026-03-16 17:59:54

Authors: Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Feng, Jiale Yu, Shuo Gu, Peng Jia, Pheng-Ann Heng, Shanghang Zhang

Categories: cs.CV

Abstract:
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
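
The "shared attention" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, purely for illustration, that backbone queries attend over the concatenation of the backbone's own keys/values and those of the vision expert. The function name `shared_attention` and all argument names are hypothetical, and details such as per-expert projections or which layers are coupled are not specified in the abstract.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def shared_attention(
    queries: List[Vector],
    keys_backbone: List[Vector],
    values_backbone: List[Vector],
    keys_vision: List[Vector],
    values_vision: List[Vector],
) -> List[Vector]:
    """Scaled dot-product attention over the joint key/value set.

    Assumption: the two streams share one softmax, so each backbone query
    can draw on vision-expert tokens as well as backbone tokens.
    """
    keys = keys_backbone + keys_vision
    values = values_backbone + values_vision
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs
```

With a zero query, every key scores equally, so the output is simply the average of the backbone and vision-expert values; this makes it easy to check that both streams contribute to the result.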

Summary (gpt-4o-mini — added 2026-03-18 04:00 UTC)

{
  "tldr": "This paper introduces DeepVision-VLA, a framework that enhances Vision-Language-Action (VLA) models by injecting vision foundation model features into deeper layers of the backbone, where sensitivity to visual tokens otherwise declines, leading to better action prediction in robotic manipulation tasks.",
  "whats_new": [
    "• Proposes the Vision-Language Mixture-of-Transformers (VL-MoT) framework for improved visual grounding.",
    "• Introduces Action-Guided Visual Pruning (AGVP) to filter irrelevant visual tokens.",
    "• Achieves state-of-the-art performance, outperforming prior methods by 9.0% in simulations and 7.5% in real-world tasks.",
    "• Provides insights into the internal processing of visual information in VLA models."
  ],
  "method": "The authors analyze existing VLA models across different action-generation paradigms and find that sensitivity to visual tokens decreases in deeper layers. They propose DeepVision-VLA, which injects multi-level visual features from a vision expert into deeper layers of the VLA backbone through shared attention. AGVP then uses shallow-layer attention to prune irrelevant visual tokens while keeping task-relevant ones.",
  "results": [
    "• DeepVision-VLA achieves an 83% mean success rate on RLBench, outperforming all baselines.",
    "• Outperforms QwenVLA-OFT by 14% in simulated tasks.",
    "• Achieves a 100% success rate in the 'pour coke to bottle' task.",
    "• Maintains high performance under unseen background and lighting conditions."
  ],
  "limitations": [
    "• The framework's reliance on a specific vision expert model may limit generalizability.",
    "• Further exploration of alternative vision foundation models is needed.",
    "• The computational efficiency of the proposed methods in larger-scale applications remains to be evaluated."
  ],
  "why_useful": "This paper is significant for advancing robotic manipulation capabilities by enhancing the integration of visual and language information. The proposed methods can be applied to improve various VLA tasks, making robots more effective in real-world scenarios.",
  "provenance": {
    "page_citations": [],
    "code_data_availability": "Not provided."
  }
}
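
The pruning step described in the summary can be sketched as a top-k selection over attention scores. This is a hedged illustration, not the paper's AGVP procedure: it assumes the shallow-layer attention is available as a matrix of query rows by visual-token columns, scores each visual token by its mean received attention, and keeps a fixed fraction. The names `token_scores`, `prune_visual_tokens`, and `keep_ratio` are hypothetical.

```python
from typing import List, Sequence


def token_scores(attn: Sequence[Sequence[float]]) -> List[float]:
    """Mean attention each visual token receives across all query rows."""
    num_queries = len(attn)
    num_tokens = len(attn[0])
    return [sum(row[j] for row in attn) / num_queries for j in range(num_tokens)]


def prune_visual_tokens(
    attn: Sequence[Sequence[float]], keep_ratio: float = 0.5
) -> List[int]:
    """Return indices of the visual tokens to keep, in original order.

    Assumption: a fixed keep ratio; the source does not state how the
    number of retained tokens is chosen.
    """
    scores = token_scores(attn)
    keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Two query rows attending over four visual tokens (made-up numbers).
example_attn = [
    [0.10, 0.60, 0.05, 0.25],
    [0.20, 0.50, 0.10, 0.20],
]
kept = prune_visual_tokens(example_attn, keep_ratio=0.5)
```

Here tokens 1 and 3 receive the most attention on average, so `kept` holds those two indices. Returning indices in their original order preserves the spatial ordering of the surviving tokens for later layers.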
