iFlyBot-VLA

iFlyTek Research & Development Groups, LindenBot

iFlyBot-VLA, a robust bimanual manipulation policy for diverse tasks

Abstract

We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. Our main contributions are threefold: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; and (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perception and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to contribute directly to action generation. Experimental results on the LIBERO benchmark demonstrate the superiority of our framework, and real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. We also plan to open-source a portion of our self-constructed dataset to support future research in the community.
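
To make the "frequency-domain transformations of continuous control signals" concrete, here is a minimal sketch of one common realization: a per-dimension discrete cosine transform (DCT) over a short action chunk, followed by uniform quantization into discrete tokens. The chunk length, clipping range, and bin count below are illustrative assumptions, not the exact tokenizer used by iFlyBot-VLA.

```python
# Illustrative frequency-domain action tokenization (DCT + uniform binning).
import numpy as np
from scipy.fft import dct, idct

def actions_to_tokens(chunk, n_bins=256, clip=3.0):
    """chunk: (T, D) continuous actions -> flat array of T*D integer tokens."""
    coeffs = dct(chunk, axis=0, norm="ortho")      # per-dimension DCT over time
    coeffs = np.clip(coeffs, -clip, clip)          # bound coefficients before binning
    tokens = np.round((coeffs + clip) / (2 * clip) * (n_bins - 1))
    return tokens.astype(np.int64).ravel()

def tokens_to_actions(tokens, T, D, n_bins=256, clip=3.0):
    """Invert the tokenization back to a continuous action chunk."""
    coeffs = tokens.reshape(T, D) / (n_bins - 1) * (2 * clip) - clip
    return idct(coeffs, axis=0, norm="ortho")

chunk = 0.1 * np.random.randn(16, 7)               # e.g., 16 steps of 7-DoF actions
recon = tokens_to_actions(actions_to_tokens(chunk), T=16, D=7)
print("max reconstruction error:", np.abs(recon - chunk).max())
```

Because low-frequency DCT coefficients carry most of the energy of smooth trajectories, a tokenizer of this kind can also truncate high-frequency coefficients to shorten the token sequence.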

Architecture visualization

The architecture of iFlyBot-VLA consists primarily of a language transformer backbone and an action expert network. The model generates executable robot actions through a combination of explicit and implicit planning.
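
As a rough illustration of how these two components can fit together, the sketch below assumes the backbone's hidden states feed both a token head (explicit planning as discrete tokens) and a small cross-attention decoder that regresses a chunk of continuous actions (implicit planning). All module sizes, the placeholder vocabulary size, and the conditioning scheme are assumptions for illustration, not the released architecture.

```python
# Schematic forward pass: one backbone output, two action pathways.
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Toy action expert: learned queries cross-attend to VLM hidden states."""
    def __init__(self, d_model=512, horizon=16, action_dim=7):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(horizon, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, vlm_hidden):
        # vlm_hidden: (B, L, d_model) hidden states from the language backbone
        q = self.queries.unsqueeze(0).expand(vlm_hidden.size(0), -1, -1)
        x, _ = self.attn(q, vlm_hidden, vlm_hidden)  # cross-attend to VLM features
        return self.head(x)                          # (B, horizon, action_dim)

B, L, d = 2, 64, 512
vlm_hidden = torch.randn(B, L, d)            # stand-in for backbone output
token_head = nn.Linear(d, 32000)             # 32000 = placeholder vocab size
token_logits = token_head(vlm_hidden)        # explicit plan: discrete action tokens
actions = ActionExpert(d)(vlm_hidden)        # implicit plan: continuous action chunk
print(token_logits.shape, actions.shape)     # (2, 64, 32000) (2, 16, 7)
```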

Latent action model visualization

The latent actions mentioned above are produced by a latent action model that we pretrained following a LAPA-like framework, using a VQ-VAE pipeline.
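
For readers unfamiliar with this pipeline, the sketch below shows the idea in miniature: encode a (frame_t, frame_t+1) transition, quantize it against a learned codebook to obtain a discrete latent action, and train by reconstructing the next frame from the current frame plus that latent action. Network sizes, image resolution, and codebook configuration are illustrative assumptions, not the settings of our pretrained model.

```python
# Toy VQ-VAE that compresses a frame transition into one discrete latent action.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionVQVAE(nn.Module):
    def __init__(self, codebook_size=256, d=128, res=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d, 4, stride=2, padding=1), nn.AdaptiveAvgPool2d(1))
        self.codebook = nn.Embedding(codebook_size, d)
        self.dec = nn.Sequential(
            nn.Linear(d + 3 * res * res, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * res * res))

    def forward(self, frame_t, frame_tp1):
        # Encode the transition into a single continuous vector.
        z = self.enc(torch.cat([frame_t, frame_tp1], dim=1)).flatten(1)  # (B, d)
        # Nearest-neighbor lookup gives the discrete latent action.
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=1)         # (B,)
        zq = self.codebook(idx)
        zq = z + (zq - z).detach()                   # straight-through estimator
        # Reconstruct the next frame from the current frame + latent action.
        recon = self.dec(torch.cat([zq, frame_t.flatten(1)], dim=1)).view_as(frame_tp1)
        vq_loss = F.mse_loss(zq.detach(), z) + F.mse_loss(self.codebook(idx), z.detach())
        return recon, idx, F.mse_loss(recon, frame_tp1) + vq_loss

model = LatentActionVQVAE()
f_t, f_tp1 = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
recon, latent_action, loss = model(f_t, f_tp1)
print(latent_action.shape, loss.item())
```

In a LAPA-style setup, the resulting codebook indices serve as the latent-action targets that supervise the VLM.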

Data composition visualization

Overview of the data composition and proportions employed in the pre-training stage.

Experimental results visualization

Experimental results demonstrate that iFlyBot-VLA achieves strong performance in the LIBERO simulator.

Real-World Experiments

More From Our Team:

iFlyBot-VLM

iFlyBot-VLM trajectory inference demonstration.

Abstract

We introduce iFlyBot-VLM, a general-purpose Vision-Language Model (VLM) developed to advance the domain of embodied intelligence. The central objective of iFlyBot-VLM is to bridge the cross-modal semantic gap between high-dimensional environmental perception and low-level robotic motion control. To this end, the model abstracts complex visual and spatial information into a body-agnostic, transferable "Operational Language," enabling seamless closed-loop perception–action coordination across diverse robotic platforms. The architecture of iFlyBot-VLM is systematically designed to realize four key functional capabilities essential for embodied intelligence: 1) spatial understanding and metric reasoning; 2) interactive target grounding; 3) action abstraction and control-parameter generation; and 4) task planning and skill sequencing. We envision iFlyBot-VLM as a scalable, generalizable foundation model for embodied AI, facilitating the progression from specialized, task-oriented systems toward generalist, cognitively capable agents. We evaluated the model on ten mainstream VLM benchmarks related to embodied intelligence, such as BLINK and Where2Place, achieving state-of-the-art performance while preserving the model's generalization capabilities. Both the training data and model weights will be publicly released to foster further research and development in embodied intelligence.
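
The abstract does not specify the concrete form of this "Operational Language," but one plausible shape, covering the four capabilities above, is a structured per-step record. The sketch below is purely hypothetical: every field name is a placeholder for illustration, not iFlyBot-VLM's actual output schema.

```python
# Hypothetical body-agnostic "Operational Language" record (all names invented).
from dataclasses import dataclass, field

@dataclass
class OperationalStep:
    skill: str                 # task planning / skill sequencing, e.g. "grasp"
    target: str                # interactive target grounding, as a referring phrase
    bbox: tuple                # grounding box in pixels: (x1, y1, x2, y2)
    offset_m: tuple            # metric reasoning: target offset in meters (x, y, z)
    control_params: dict = field(default_factory=dict)  # abstracted control parameters

plan = [
    OperationalStep("grasp", "red mug on the left", (412, 230, 495, 310),
                    (0.32, -0.08, 0.11), {"grip_force": "low"}),
    OperationalStep("place", "coaster near the laptop", (640, 280, 720, 340),
                    (0.55, 0.12, 0.02)),
]
for step in plan:
    print(f"{step.skill} -> {step.target}")
```

A representation like this stays body-agnostic because it names targets and metric offsets rather than joint commands, leaving an embodiment-specific controller to consume it downstream.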

iFlyBot-VLM in action

BibTeX

@misc{2511.01914,
  author = {Yuan Zhang and Chenyu Xue and Wenjie Xu and Chao Ji and Jiajia Wu and Jia Pan},
  title  = {iFlyBot-VLA Technical Report},
  year   = {2025},
  eprint = {arXiv:2511.01914},
}