Video foundation models have made striking progress in synthesizing visually compelling and temporally coherent content, yet their viability as world simulators hinges on whether they internalize the physical, logical, and spatial constraints that govern reality. Existing evaluation metrics—such as Fréchet Video Distance (FVD)—largely emphasize perceptual fidelity, leaving critical reasoning failures undetected, including hallucinations that violate causal structure, physical laws, and global consistency. To address this gap, we propose a principled evaluation framework grounded in five core reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal reasoning.
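As a point of reference (and not part of MMGR itself), FVD is the Fréchet distance between Gaussian approximations of the feature distributions of real and generated videos, with features typically extracted by a pretrained I3D network; writing $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ for the means and covariances of the real and generated feature sets, the standard formulation is:

$$
\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

Because this distance is computed purely over aggregate perceptual feature statistics, a generated video that violates causality, physics, or global consistency can still score well so long as its individual frames and motions are visually typical.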
Building on this framework, we introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a comprehensive benchmark suite designed to assess generative reasoning across three complementary domains: Abstract Reasoning (e.g., ARC-AGI, Sudoku), Embodied Navigation (e.g., real-world 3D navigation and localization), and Physical Commonsense (e.g., sports and compositional physical interactions). MMGR evaluates both video and image generative models using fine-grained, domain-specific metrics that require holistic correctness rather than partial success.
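To make the holistic-correctness criterion concrete, the sketch below contrasts all-or-nothing scoring with per-cell partial credit on an ARC-AGI-style grid task. This is a minimal illustration, assuming grids are nested lists of integers; the names holistic_score and partial_score are hypothetical and do not reflect the benchmark's actual API.

```python
from typing import List

Grid = List[List[int]]


def _shapes_match(pred: Grid, target: Grid) -> bool:
    """True if both grids have identical row and column counts."""
    return len(pred) == len(target) and all(
        len(pr) == len(tr) for pr, tr in zip(pred, target)
    )


def holistic_score(pred: Grid, target: Grid) -> float:
    """All-or-nothing: 1.0 only if every cell matches, else 0.0."""
    if not _shapes_match(pred, target):
        return 0.0
    return 1.0 if pred == target else 0.0


def partial_score(pred: Grid, target: Grid) -> float:
    """Per-cell accuracy, shown for contrast: rewards near-misses
    that holistic scoring deliberately does not."""
    if not _shapes_match(pred, target):
        return 0.0
    total = sum(len(row) for row in target)
    if total == 0:
        return 0.0
    matches = sum(
        p == t for pr, tr in zip(pred, target) for p, t in zip(pr, tr)
    )
    return matches / total
```

Under holistic scoring, a prediction that gets 99 of 100 cells right still scores 0, which is why accuracies on symbolic tasks drop so sharply relative to perceptual metrics that average over near-misses.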
We benchmark state-of-the-art video generation models—including Veo-3, Sora-2, and Wan-2.2—alongside leading image generation models such as Nano-banana, Nano-banana Pro, GPT-4o-image, and Qwen-image, revealing a pronounced performance asymmetry across modalities. While current models achieve moderate success on Physical Commonsense tasks, they fail catastrophically on Abstract Reasoning (below 10% accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings.
Through detailed quantitative analysis and human evaluation, we identify key limitations in existing training paradigms: a severe imbalance favoring perceptual data over symbolic reasoning, architectural weaknesses in maintaining global state consistency, and optimization objectives that reward visual plausibility over causal correctness. By unifying abstract logic, embodied interaction, and intuitive physics under a single evaluation framework, MMGR provides a diagnostic lens into the reasoning deficits of modern generative models and outlines a concrete roadmap toward physically grounded, logically consistent, and reasoning-aware world models.