About GensPi

We are the first general world model ("GWM") company in the world to have built a GWM that unifies the digital world and the physical world. We are dedicated to building a unified intelligence framework capable of modeling, reasoning about, predicting and acting upon the underlying rules that govern both worlds. Guided by first-principles thinking, we train our foundation world model on visual and auditory information, which naturally encodes the physical world, to replicate the human process of perceiving, simulating and interacting with the world. Our ultimate goal is to enable AGI that connects the digital world with the physical world.

Our Foundation World Model

At the core of our technology stack is our foundation world model, which we train using our proprietary U-ViT architecture, an architecture that combines diffusion models with the transformer architecture. This approach replicates the human cognitive process of perceiving the physical world, thereby enabling our foundation world model to understand the fundamentals of how the world operates.

World Generation Model

We provide advanced multi-modal generation capabilities in the digital world with Vidu, our world generation model. Building on our foundation world model’s strong capabilities to understand and reconstruct the world, Vidu can decode and render audiovisual content for viewing, high-fidelity content creation and compositional innovation with creators. This enables strong text-to-video, image-to-video and reference-based generation capabilities and improves the consistency, reasonableness and interactivity of world generation. Vidu elevates video generation models from mere content generation to understanding the laws underlying the world’s operations, which builds a dynamic digital world and in turn enhances the generalization capabilities of our foundation world model.

World Action Model

We achieve scaling in the number of tasks and high data efficiency in embodied intelligence with Motus, our world action model. Motus adopts a Mixture-of-Transformers (“MoT”) framework that addresses the fragmentation of embodied agents across understanding, modeling and control by integrating these functions into a unified framework. Compared with the vision-language-action (“VLA”) pathway, world models are generally regarded as having stronger generality and being more suitable as the foundation model backbone for physical world applications. Leveraging our foundation world model’s understanding of physical laws, as well as its ability to predict states under sequences of actions, Motus translates predictions directly into executable actions for robots, smart devices and other embodiments. By incorporating action feedback from the physical world, Motus continuously optimizes its cognition, creating a positive closed loop between virtual modeling and iteration on physical action feedback.