iFlyBot-VLA
Abstract
We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our framework, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community.
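To make the discrete action branch concrete, the sketch below shows one plausible frequency-domain tokenization: a chunk of continuous controls is DCT-transformed per action dimension and the coefficients are uniformly quantized into token ids. The choice of transform (DCT), the bin count, and the clipping range are illustrative assumptions, not the exact tokenizer used in iFlyBot-VLA.

```python
# Hypothetical sketch of frequency-domain action tokenization (illustrative only):
# a chunk of continuous control signals is DCT-transformed per action dimension,
# and the resulting coefficients are uniformly quantized into discrete tokens.
import numpy as np
from scipy.fft import dct, idct

NUM_BINS = 256        # assumed vocabulary size per coefficient
COEFF_RANGE = 4.0     # assumed clipping range for normalized DCT coefficients

def actions_to_tokens(actions: np.ndarray) -> np.ndarray:
    """actions: (chunk_len, action_dim) continuous controls, assumed normalized to [-1, 1]."""
    coeffs = dct(actions, type=2, axis=0, norm="ortho")   # frequency-domain view of the chunk
    coeffs = np.clip(coeffs, -COEFF_RANGE, COEFF_RANGE)
    bins = np.round((coeffs + COEFF_RANGE) / (2 * COEFF_RANGE) * (NUM_BINS - 1))
    return bins.astype(np.int64)                          # (chunk_len, action_dim) token ids

def tokens_to_actions(tokens: np.ndarray) -> np.ndarray:
    """Inverse mapping: dequantize the tokens and apply the inverse DCT."""
    coeffs = tokens.astype(np.float64) / (NUM_BINS - 1) * (2 * COEFF_RANGE) - COEFF_RANGE
    return idct(coeffs, type=2, axis=0, norm="ortho")

if __name__ == "__main__":
    chunk = np.random.uniform(-1, 1, size=(16, 7))        # 16-step chunk for a 7-DoF arm
    recon = tokens_to_actions(actions_to_tokens(chunk))
    print("max reconstruction error:", np.abs(chunk - recon).max())
```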
The architecture of iFlyBot-VLA consists primarily of a language transformer backbone and an action expert network. The model generates executable robot actions through a combination of explicit and implicit planning.
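As a rough illustration of this layout, the minimal PyTorch sketch below shows one way an action expert could consume the backbone's hidden states and regress a chunk of continuous actions. All module names, dimensions, and the cross-attention design are hypothetical and do not reflect the released architecture.

```python
# Minimal sketch (assumed, not the released architecture): a lightweight action expert
# cross-attends to the language transformer's hidden states to predict an action chunk.
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    def __init__(self, vlm_dim=1024, d_model=512, action_dim=7, chunk_len=16, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(chunk_len, d_model))  # one query per action step
        self.kv_proj = nn.Linear(vlm_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, vlm_hidden):                  # vlm_hidden: (B, seq_len, vlm_dim)
        kv = self.kv_proj(vlm_hidden)
        q = self.queries.unsqueeze(0).expand(vlm_hidden.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)       # attend over the vision-language context
        return self.head(fused)                     # (B, chunk_len, action_dim)

expert = ActionExpert()
actions = expert(torch.randn(2, 300, 1024))         # dummy VLM hidden states
print(actions.shape)                                 # torch.Size([2, 16, 7])
```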
The previously mentioned latent actions are produced by a latent action model that we pretrained following a LAPA-style framework, using a VQ-VAE pipeline.
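For readers unfamiliar with VQ-VAE latent action models, the sketch below shows only the core quantization step of such a pipeline: an encoder embedding of a frame pair is snapped to the nearest codebook entry, which serves as the discrete latent action. The codebook size, embedding dimension, and loss weighting are illustrative assumptions, not our pretrained model's configuration.

```python
# Hedged sketch of the latent-action quantization step in a LAPA-style VQ-VAE.
# Only the quantizer is shown; the frame encoder and future-frame decoder are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionQuantizer(nn.Module):
    def __init__(self, codebook_size=256, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.beta = beta

    def forward(self, z_e):                               # z_e: (B, code_dim) encoder output
        d = torch.cdist(z_e, self.codebook.weight)        # distances to all codebook entries
        idx = d.argmin(dim=-1)                            # discrete latent-action token ids
        z_q = self.codebook(idx)
        # standard VQ-VAE objective: codebook loss + commitment loss
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()                  # straight-through gradient estimator
        return z_q, idx, loss

quantizer = LatentActionQuantizer()
z_q, tokens, vq_loss = quantizer(torch.randn(4, 128))
print(tokens.shape, vq_loss.item())
```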
Overview of the data composition and proportions employed in the pre-training stage.
Experimental results demonstrate that the iFlyBot-VLA model achieves strong performance on the LIBERO benchmark.
Real-World Experiments
We evaluated the performance of iFlyBot-VLA on a general pick-and-place task, testing its robustness under disturbances from unseen objects, lighting variations, and unseen environments.
iFlyBot-VLA trained on our self-collected data achieves a higher real-world success rate than π₀.
iFlyBot-VLA demonstrates outstanding performance in complex long-horizon, dual-arm manipulation tasks within a simulated factory assembly-line environment.
Folding clothes that are randomly placed or partially unfolded is a highly challenging task for VLAs, requiring both precise execution and strong robustness. In real-world experiments, iFlyBot-VLA demonstrates remarkable robustness in performing such tasks.
Given the complexity of the folding task, we provide a more detailed comparison. Since locating the correct grasping points often requires multiple attempts, we imposed a 3-minute time limit on each full execution. The detailed results are presented in the bottom image, where the x-axis corresponds to the steps illustrated in the upper image.
More From Our Team:
iFlyBot-VLM
Abstract
We introduce iFlyBot-VLM, a general-purpose Vision-Language Model (VLM) developed to advance the domain of embodied intelligence. The central objective of iFlyBot-VLM is to bridge the cross-modal semantic gap between high-dimensional environmental perception and low-level robotic motion control. To this end, the model abstracts complex visual and spatial information into a body-agnostic and transferable “Operational Language,” enabling seamless perception–action closed-loop coordination across diverse robotic platforms. The architecture of iFlyBot-VLM is systematically designed to realize four key functional capabilities essential for embodied intelligence: 1) Spatial understanding and metric reasoning; 2) Interactive target grounding; 3) Action abstraction and control parameter generation; 4) Task planning and skill sequencing. We envision iFlyBot-VLM as a scalable and generalizable foundation model for embodied AI, facilitating the progression from specialized, task-oriented systems toward generalist, cognitively capable agents. We conducted evaluations on ten mainstream embodied intelligence–related VLM benchmark datasets, such as Blink and Where2Place, achieving state-of-the-art performance while preserving the model’s generalization capabilities. Both the training data and model weights will be publicly released to foster further research and development in the field of embodied intelligence.
iFlyBot-VLM inherits the robust, three-component "ViT-Projector-LLM" paradigm from established Vision-Language Models. It integrates a dedicated, incrementally pre-trained visual encoder with an advanced language model via a simple, randomly initialized MLP projector for efficient feature alignment.
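As a rough illustration, an MLP projector of this kind can be as simple as the sketch below, which maps ViT patch features into the LLM's embedding space so they can be consumed as ordinary input tokens. The layer shapes and dimensions are assumed for illustration and are not the released configuration.

```python
# Minimal sketch of a randomly initialized MLP projector (dimensions are hypothetical).
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(vit_dim),
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):          # (B, num_patches, vit_dim) from the ViT
        return self.mlp(patch_features)         # (B, num_patches, llm_dim) LLM-ready tokens

projector = VisionProjector()
print(projector(torch.randn(1, 256, 1024)).shape)   # torch.Size([1, 256, 4096])
```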
The rich composition of embodied AI domain data has significantly enhanced the performance of iFlyBot-VLM in spatial understanding, perception, and task planning.
iFlyBot-VLM achieves state-of-the-art (SOTA) or near-SOTA performance on spatial understanding, perception, and task planning benchmarks.
iFlyBot-VLM in Action
iFlyBot-VLM Applied to Object Picking and Placing
iFlyBot-VLM: Enabling Spatial Understanding-Enhanced Instruction Generalization
iFlyBot-VLM: Driving Advanced Object Generalization
iFlyBot-VLM: Boosting Advanced Scene Generalization
* Motion: Supported by cuRobo
* Interaction: Supported by iFlyTek Multimodal Interaction Backpack
BibTeX
@misc{2511.01914,
Author = {Yuan Zhang and Chenyu Xue and Wenjie Xu and Chao Ji and Jiajia Wu and Jia Pan},
Title = {iFlyBot-VLA Technical Report},
Year = {2025},
Eprint = {arXiv:2511.01914},
}