MMGR: Multi-Modal Generative Reasoning

1 University of Wisconsin–Madison · 2 University of California, Los Angeles · 3 Michigan State University · 4 University of Illinois Urbana–Champaign · 5 University of Adelaide · 6 Salesforce AI Research · 7 Microsoft · 8 Adobe Research
* Equal contribution.

Abstract

Video foundation models have made striking progress in synthesizing visually compelling and temporally coherent content, yet their viability as world simulators hinges on whether they internalize the physical, logical, and spatial constraints that govern reality. Existing evaluation metrics—such as Fréchet Video Distance (FVD)—largely emphasize perceptual fidelity, leaving critical reasoning failures undetected, including hallucinations that violate causal structure, physical laws, and global consistency. To address this gap, we propose a principled evaluation framework grounded in five core reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal reasoning.

Building on this framework, we introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a comprehensive benchmark suite designed to assess generative reasoning across three complementary domains: Abstract Reasoning (e.g., ARC-AGI, Sudoku), Embodied Navigation (e.g., real-world 3D navigation and localization), and Physical Commonsense (e.g., sports and compositional physical interactions). MMGR evaluates both video and image generative models using fine-grained, domain-specific metrics that require holistic correctness rather than partial success.
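To make the holistic-correctness requirement concrete, the sketch below shows one way such a metric could be implemented for grid-based abstract reasoning tasks such as ARC-AGI or Sudoku: the model's final output is reduced to a discrete grid and scored as correct only if every cell matches the ground-truth solution. The function names, and the assumption that the generated frame has already been parsed into a grid, are illustrative and not the exact MMGR implementation.

```python
from typing import List

Grid = List[List[int]]

def holistic_grid_score(predicted: Grid, target: Grid) -> float:
    """All-or-nothing scoring: 1.0 only if every cell matches the target.

    Hypothetical sketch of a holistic-correctness metric for grid-based
    tasks such as ARC-AGI or Sudoku; not the exact MMGR implementation.
    """
    if len(predicted) != len(target):
        return 0.0
    for pred_row, tgt_row in zip(predicted, target):
        if pred_row != tgt_row:
            return 0.0
    return 1.0

def cellwise_accuracy(predicted: Grid, target: Grid) -> float:
    """Partial-credit baseline, shown only to contrast with holistic scoring."""
    cells = [(p, t) for pr, tr in zip(predicted, target) for p, t in zip(pr, tr)]
    return sum(p == t for p, t in cells) / max(len(cells), 1)
```

The contrast with the cell-wise baseline illustrates why a generation that gets most cells right but violates the global solution still receives no credit.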

We benchmark state-of-the-art video generation models—including Veo-3, Sora-2, and Wan2.2—alongside leading image generation models such as Nano-banana, Nano-banana Pro, GPT-4o-image, and Qwen-image, revealing a pronounced performance asymmetry across modalities. While current models achieve moderate success on Physical Commonsense tasks, they fail catastrophically on Abstract Reasoning (achieving <10% accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings.

Through detailed quantitative analysis and human evaluation, we identify key limitations in existing training paradigms: a severe imbalance favoring perceptual data over symbolic reasoning, architectural weaknesses in maintaining global state consistency, and optimization objectives that reward visual plausibility over causal correctness. By unifying abstract logic, embodied interaction, and intuitive physics under a single evaluation framework, MMGR provides a diagnostic lens into the reasoning deficits of modern generative models and outlines a concrete roadmap toward physically grounded, logically consistent, and reasoning-aware world models.

MMGR Overview

Overview of the MMGR benchmark and evaluation pipeline.

Benchmark Domains

MMGR benchmark domains and tasks.
Overview of the three domains in the MMGR benchmark. MMGR evaluates multi-modal generative reasoning across Domain 1: Abstract Reasoning, Domain 2: Embodied Navigation, and Domain 3: Physical Commonsense. (1) Abstract Reasoning includes Maze Solving, Sudoku Solving, ARC-AGI, and Math Challenge tasks, which test logical, 2D spatial, and temporal reasoning. (2) Embodied Navigation spans four environment-conditioned tasks: Panoramic View Last-Mile Navigation, Top-down View Real-World Navigation, 3D Real-World Navigation, and Simultaneous Localization and Generation (SLAG). The four tasks probe 2D/3D spatial reasoning, physical scene understanding, and coherent temporal planning. (3) Physical Commonsense covers Physical Concept scenarios and Sports activities, evaluating whether models produce videos that follow intuitive physics such as force, momentum, rotation, material behavior, and continuous motion. Together, these domains provide a comprehensive testbed for assessing a model's ability to generate physically plausible, spatially grounded, and logically coherent solutions.

Abstract Reasoning Case Studies

Four abstract reasoning tasks (Maze Solving, Sudoku Solving, ARC-AGI, and Math Challenge), with side-by-side generations from Veo-3, Sora-2, and Wan2.2.
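As a concrete example of how these generations could be verified, the sketch below checks a 4x4 Sudoku instance such as easy_4x4_001 (Task 02 below): a generation passes only if the final grid preserves the given clues and satisfies every row, column, and box constraint. The parsing of the final frame into a grid and the function name are assumptions made for illustration.

```python
from typing import List

Grid = List[List[int]]  # 4x4 grid with digits 1-4; 0 marks an empty clue cell

def is_valid_sudoku_solution(solution: Grid, clues: Grid, box: int = 2) -> bool:
    """Accept a generated 4x4 Sudoku only if it is fully and exactly solved.

    Hypothetical verification sketch: the generated video's final frame is
    assumed to have been parsed into `solution` beforehand.
    """
    n = box * box
    digits = set(range(1, n + 1))
    # Given clues must be preserved in the generated solution.
    for r in range(n):
        for c in range(n):
            if clues[r][c] != 0 and solution[r][c] != clues[r][c]:
                return False
    # Every row and column must contain each digit exactly once.
    for i in range(n):
        if set(solution[i]) != digits or {solution[r][i] for r in range(n)} != digits:
            return False
    # Every box must also contain each digit exactly once.
    for br in range(0, n, box):
        for bc in range(0, n, box):
            cells = {solution[r][c] for r in range(br, br + box) for c in range(bc, bc + box)}
            if cells != digits:
                return False
    return True
```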

Task 01 - Maze Solving

dfs_easy_3x3_n000_var1_fixed_both

Input

Input image for Maze Solving task.

Veo-3

Sora-2

Wan2.2

Task 02 - Sudoku Solving

easy_4x4_001

Input

Input image for Sudoku Solving task.

Veo-3

Sora-2

Wan2.2

Task 03 - ARC-AGI

arc_v1_match_easy_0a2355a6

Input

Input image for ARC-AGI task.

Veo-3

Sora-2

Wan2.2

Task 04 - Math Challenge

gsm8k_problem_1309

Input

Input image for Math Challenge task.

Veo-3

Sora-2

Wan2.2

Embodied Navigation Case Studies

Four navigation settings with side-by-side generations from Veo-3, Sora-2, and Wan2.2.
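One plausible way to score these navigation settings, sketched below, is to extract the agent's trajectory from the generated video and require that it stays in free space and ends near the goal. The trajectory-extraction step, the occupancy-grid coordinate convention, and the goal radius are assumptions for illustration rather than the exact MMGR metric.

```python
from typing import List, Sequence, Tuple
import math

Point = Tuple[float, float]

def navigation_success(trajectory: Sequence[Point],
                       goal: Point,
                       occupancy: List[List[int]],
                       goal_radius: float = 0.5) -> bool:
    """Hypothetical success check for a generated navigation rollout.

    Assumes the trajectory has already been extracted from the generated
    video and expressed in the occupancy grid's coordinate frame, where
    occupancy[y][x] == 1 marks an obstacle cell.
    """
    if not trajectory:
        return False
    # The whole path must stay in free space (no clipping through walls).
    for x, y in trajectory:
        if occupancy[int(y)][int(x)] == 1:
            return False
    # The final position must land within a small radius of the goal.
    end_x, end_y = trajectory[-1]
    return math.hypot(end_x - goal[0], end_y - goal[1]) <= goal_radius
```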

Task 01 - Panoramic View Last-Mile Navigation (L.M.Nav.)

floor02plus_quality04_oneturn_color

A 360-degree panoramic, over-the-shoulder view that stresses short-range navigation in a challenging layout.

Input

Task 01 input panorama.

Veo-3

Sora-2

Wan2.2

Task 02 - Top-down View Real-World Navigation (T.V.R.-W.Nav.)

floor01_quality05_noturn_color

A bird's-eye view for global planning with long-horizon path prediction, shown here at the hardest quality level.

Input

Task 02 input top-down view.

Veo-3

Sora-2

Wan2.2

Task 03 - 3D Real-World Navigation (3D R.-W.Nav.)

floor02plus_quality04_noturn_color

Cutaway/dollhouse renderings that expose full 3D structure, testing grounding across multi-room, multi-level layouts.

Input

Task 03 input dollhouse rendering.

Veo-3

Sora-2

Wan2.2

Task 04 - Simultaneous Localization and Generation (SLAG)

floor01_quality04_noturn_color

Joint localization and scene layout generation with paired 3D and top-down views to test holistic spatial reasoning.

Input

Task 04 input scene view.

Veo-3

Sora-2

Wan2.2

Physical Commonsense Case Studies

Two physical commonsense task families, Physical Concept and Sports, with side-by-side generations from Veo-3, Sora-2, and Wan2.2.
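Physical plausibility is harder to verify programmatically, and the evaluation relies in part on human judgment; the sketch below shows one simple way per-clip rubric ratings might be aggregated into a model-level physical-commonsense score. The rubric dimensions and field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class ClipRating:
    """One rater's judgment of a generated clip (hypothetical rubric)."""
    physics_plausible: bool   # obeys intuitive physics (force, momentum, ...)
    motion_continuous: bool   # no teleportation or popping artifacts
    prompt_followed: bool     # depicts the requested action

def clip_score(ratings: List[ClipRating]) -> float:
    """A clip passes only if a majority of raters accept every rubric item."""
    passes = [r.physics_plausible and r.motion_continuous and r.prompt_followed
              for r in ratings]
    return 1.0 if sum(passes) > len(passes) / 2 else 0.0

def model_score(per_clip_ratings: List[List[ClipRating]]) -> float:
    """Fraction of clips that pass the all-or-nothing rubric check."""
    return mean(clip_score(r) for r in per_clip_ratings) if per_clip_ratings else 0.0
```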

Task 01 - Physical Concept

videophy_v1_003

Input

Tying a rope to a pole.

Veo-3

Sora-2

Wan2.2

Task 02 - Sports

SPORT_002

Input

A ballerina executes a series of pirouettes, spinning multiple times on her toes.

Veo-3

Sora-2

Wan2.2