Video foundation models have made striking progress in synthesizing visually compelling and temporally coherent content, yet their viability as world simulators hinges on whether they internalize the physical, logical, and spatial constraints that govern reality. Existing evaluation metrics—such as Fréchet Video Distance (FVD)—largely emphasize perceptual fidelity, leaving critical reasoning failures undetected, including hallucinations that violate causal structure, physical laws, and global consistency. To address this gap, we propose a principled evaluation framework grounded in five core reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal reasoning.
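As a point of reference (and not part of MMGR itself), FVD is the Fréchet distance between Gaussian approximations of the feature distributions of real and generated videos, with features typically extracted by a pretrained I3D network; writing $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ for the means and covariances of the real and generated feature sets, the standard formulation is:

$$
\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

Because this distance is computed purely over aggregate perceptual feature statistics, a generated video that violates causality, physics, or global consistency can still score well so long as its individual frames and motions are visually typical.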
Building on this framework, we introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a comprehensive benchmark suite designed to assess generative reasoning across three complementary domains: Abstract Reasoning (e.g., ARC-AGI, Sudoku), Embodied Navigation (e.g., real-world 3D navigation and localization), and Physical Commonsense (e.g., sports and compositional physical interactions). MMGR evaluates both video and image generative models using fine-grained, domain-specific metrics that require holistic correctness rather than partial success.
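To make the holistic-correctness criterion concrete, the sketch below contrasts all-or-nothing scoring with per-cell partial credit on an ARC-AGI-style grid task. This is a minimal illustration, assuming grids are nested lists of integers; the names holistic_score and partial_score are hypothetical and do not reflect the benchmark's actual API.

```python
from typing import List

Grid = List[List[int]]


def _shapes_match(pred: Grid, target: Grid) -> bool:
    """True if both grids have identical row and column counts."""
    return len(pred) == len(target) and all(
        len(pr) == len(tr) for pr, tr in zip(pred, target)
    )


def holistic_score(pred: Grid, target: Grid) -> float:
    """All-or-nothing: 1.0 only if every cell matches, else 0.0."""
    if not _shapes_match(pred, target):
        return 0.0
    return 1.0 if pred == target else 0.0


def partial_score(pred: Grid, target: Grid) -> float:
    """Per-cell accuracy, shown for contrast: rewards near-misses
    that holistic scoring deliberately does not."""
    if not _shapes_match(pred, target):
        return 0.0
    total = sum(len(row) for row in target)
    if total == 0:
        return 0.0
    matches = sum(
        p == t for pr, tr in zip(pred, target) for p, t in zip(pr, tr)
    )
    return matches / total
```

Under holistic scoring, a prediction that gets 99 of 100 cells right still scores 0, which is why accuracies on symbolic tasks drop so sharply relative to perceptual metrics that average over near-misses.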
We benchmark state-of-the-art video generation models—including Veo-3, Sora-2, and Wan-2.2—alongside leading image generation models such as Nano-banana, Nano-banana Pro, GPT-4o-image, and Qwen-image, revealing a pronounced performance asymmetry across modalities. While current models achieve moderate success on Physical Commonsense tasks, they fail catastrophically on Abstract Reasoning (below 10% accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings.
Through detailed quantitative analysis and human evaluation, we identify key limitations in existing training paradigms: a severe imbalance favoring perceptual data over symbolic reasoning, architectural weaknesses in maintaining global state consistency, and optimization objectives that reward visual plausibility over causal correctness. By unifying abstract logic, embodied interaction, and intuitive physics under a single evaluation framework, MMGR provides a diagnostic lens into the reasoning deficits of modern generative models and outlines a concrete roadmap toward physically grounded, logically consistent, and reasoning-aware world models.