New benchmark confirms AI video generators look stunning but still can't reason about the world – the-decoder.com


Modern video generators like Sora 2, Seedance 2.0, and Veo 3.1 produce increasingly impressive clips. But a new benchmark from Tsinghua University confirms a recurring finding: visual quality and actual world understanding are two different things.
Instead of focusing on image quality, WorldReasonBench tests whether a model can take a starting scene and continue it in a way that makes sense: physically, socially, logically, and informationally.
Consider a basic test case: give a generator an image of an apple on a branch and tell it to drop the apple. The result might look great—smooth motion, realistic textures, nice lighting—and still get the physics fundamentally wrong. The apple might fly upward, pop like a balloon, or fall in a straight line instead of curving. Standard quality metrics would still reward that video for its realism. That’s the gap WorldReasonBench is designed to catch.
WorldReasonBench includes about 400 test cases across four areas: world knowledge (physics, weather, cultural norms), human-centered scenes (object handling, social interaction), logical reasoning (math, geometry, science experiments), and information-based reasoning (reading data and diagrams).
Scoring works in two stages. First, a process-aware method uses structured questions to check whether the video reaches the right end state in a plausible way. Then a second pass rates reasoning quality, temporal consistency, and visual aesthetics. Alongside the benchmark, the team also released WorldRewardBench, a dataset of about 6,000 video comparisons ranked by trained annotators.
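To make the first stage concrete, here is a minimal sketch of checklist-style scoring for the apple example. It is not the benchmark's implementation: the `Check` class, the `process_score` function, the sample questions, and the simple pass-rate formula are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One structured yes/no question about a generated video (hypothetical schema)."""
    question: str
    expected: bool  # answer a correct continuation should produce
    observed: bool  # answer a judge extracted from the generated video


def process_score(checks: list[Check]) -> float:
    """Fraction of structured questions the video answers as expected."""
    if not checks:
        raise ValueError("need at least one check")
    return sum(c.expected == c.observed for c in checks) / len(checks)


# Illustrative answers for a clip that looks fine but gets the motion wrong.
checks = [
    Check("Does the apple detach from the branch?", True, True),
    Check("Does the apple move downward after release?", True, False),
    Check("Does the apple end up on the ground?", True, True),
    Check("Does the apple change shape in mid-air?", False, False),
]
score = process_score(checks)
print(score)
```

A clip can reach the right end state and still lose points on the intermediate questions, which is the kind of failure a pure image-quality metric would miss.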
The researchers tested five commercial systems (Sora 2, Kling, Wan 2.6, Seedance 2.0, Veo 3.1-Fast) and six open-source models (LTX 2.3, Wan 2.2-14B, UniVideo, HunyuanVideo 1.5, Cosmos-Predict 2.5, LongCat-Video). Commercial generators scored roughly double what open-source models managed on the core reasoning metric, with no statistical overlap between the two groups.
ByteDance’s Seedance 2.0 came out on top, finishing first in nearly nine out of ten statistical re-runs. Veo 3.1-Fast did best on world knowledge, while Sora 2 led on human-centered scenes. Seedance 2.0 also beat Veo 3.1-Fast, Kling, and Wan 2.6 in human ratings.
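A "first in nine out of ten re-runs" figure is the kind of number a bootstrap over test cases produces. The sketch below assumes that is roughly what the re-runs are; the article does not describe the exact procedure. The function name `first_place_rate`, the model names, and the scores are hypothetical.

```python
import random


def first_place_rate(per_case_scores, n_resamples=1000, seed=0):
    """Resample test cases with replacement and count how often each model ranks first.

    per_case_scores maps a model name to its per-test-case scores; all lists
    must have the same length and refer to the same cases in the same order.
    Ties go to the model listed first.
    """
    rng = random.Random(seed)
    models = list(per_case_scores)
    n_cases = len(per_case_scores[models[0]])
    wins = {m: 0 for m in models}
    for _ in range(n_resamples):
        idx = [rng.randrange(n_cases) for _ in range(n_cases)]
        means = {m: sum(per_case_scores[m][i] for i in idx) / n_cases for m in models}
        wins[max(means, key=means.get)] += 1
    return {m: wins[m] / n_resamples for m in models}


# Made-up per-case pass/fail results for two hypothetical models.
toy_scores = {
    "model_a": [1, 1, 1, 0, 1, 1],
    "model_b": [0, 0, 1, 0, 0, 1],
}
rates = first_place_rate(toy_scores)
print(rates)
```

A model that wins in most resamples leads robustly; one that wins in only half of them leads mostly by luck of the test-case draw.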
More important than the rankings is a shared weakness: logical reasoning is the hardest category for every model tested. Even the best commercial systems drop well below their overall averages here, and most open-source models fail it almost entirely. Information-based reasoning is the second-toughest area, particularly when tasks require physically grounded transitions or exact preservation of text and numbers.
The study also introduces a metric that tracks how many correct answers come from dynamic, process-based phases rather than static snapshots. Commercial models score much higher here, which points to where open-source models really fall short: not in how things look, but in understanding cause and effect.
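One plausible reading of this metric is the share of a model's correct answers that belong to process-type questions. The sketch below implements that reading only; the phase labels `"process"` and `"static"`, the function name `dynamic_share`, and the sample data are assumptions, and the study's actual definition may differ.

```python
def dynamic_share(results):
    """Share of correct answers that come from process-phase questions.

    results is a list of (phase, is_correct) pairs, where phase is either
    "process" (something that unfolds over time) or "static" (a single snapshot).
    """
    correct_phases = [phase for phase, is_correct in results if is_correct]
    if not correct_phases:
        return 0.0
    return sum(phase == "process" for phase in correct_phases) / len(correct_phases)


# Made-up results: both static checks pass, only one of two process checks does.
results = [
    ("static", True),
    ("static", True),
    ("process", True),
    ("process", False),
]
share = dynamic_share(results)
print(round(share, 3))
```

Under this reading, a low share means a model earns most of its points from what a single frame shows rather than from how the scene evolves.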
When models get more detailed prompts that spell out what should happen step by step, open-source generators improve the most. They’re simply more dependent on prompt quality than their commercial rivals, which may itself be a side effect of the commercial models’ stronger reasoning ability.
To validate their approach, the team compared their metrics against rankings from human video comparisons. The core metric tracks closely with human judgment and clearly outperforms traditional AI judges that compare videos in pairs.
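A simple way to check a metric against human video comparisons is pairwise agreement: for each pair where annotators preferred one video, does the metric score that video higher? The sketch below shows this generic check, not necessarily the statistic the team reports; `pairwise_agreement`, the video IDs, and the scores are hypothetical.

```python
def pairwise_agreement(metric_scores, human_prefs):
    """Fraction of human-ranked pairs where the metric prefers the same video.

    metric_scores maps a video ID to its automatic score.
    human_prefs is a list of (winner_id, loser_id) pairs from annotators.
    Pairs the metric scores identically are skipped.
    """
    decided = [
        (winner, loser)
        for winner, loser in human_prefs
        if metric_scores[winner] != metric_scores[loser]
    ]
    if not decided:
        raise ValueError("metric does not separate any of the compared pairs")
    hits = sum(metric_scores[winner] > metric_scores[loser] for winner, loser in decided)
    return hits / len(decided)


# Made-up scores and preferences: the metric disagrees with annotators on one pair.
metric_scores = {"a": 0.8, "b": 0.5, "c": 0.3}
human_prefs = [("a", "b"), ("a", "c"), ("c", "b")]
agreement = pairwise_agreement(metric_scores, human_prefs)
print(round(agreement, 3))
```

An agreement near 0.5 would mean the metric is no better than a coin flip at predicting human preferences.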
The conclusion fits a growing body of evidence: despite real progress in resolution, length, and controllability, the jump from pixel generator to reliable world model hasn’t happened. Getting there will likely depend less on visual polish and more on a better grasp of causal mechanisms and the ability to keep information consistent over time. The benchmark, data, and code are available on GitHub.
An international team of researchers recently reached a similar conclusion: Sora 2 and Veo 3.1 fall well short of human performance on reasoning tasks. Whether video generators even qualify as “world models” remains a contested question in AI research. Meta’s Yann LeCun considers systems like Sora a dead end, while DeepMind CEO Demis Hassabis sees Google’s Veo as a step toward a world model. OpenAI shut down Sora as a commercial video generator but kept the team intact to focus on world model research. A proposed definition called OpenWorldLib explicitly rules out pure text-to-video models from the category.