Description
After conducting empirical tests and a code audit, I found that the current evaluation metrics (VBVR-Bench) fail to objectively reflect the reasoning capabilities of video models. In many cases, videos that completely violate the logic of the task are still awarded high scores, rendering the benchmark results unreliable.
For example, in the task grid_highest_cost, the evaluator should compute the total cost of the path in the generated video. However, in "vbvrevalkit/eval/vbvr_bench/evaluators/grid_highest_cost.py", I found that the total cost is merely approximated by the coverage of the path rather than the actual cost. More surprisingly, the function is never even called anywhere in the repository.
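To illustrate why coverage cannot stand in for cost, here is a minimal sketch. All names, the grid, and the paths below are my own illustration, not taken from vbvrevalkit: two paths that visit the same number of cells (identical coverage) can differ sharply in total cost, so a coverage-based score cannot distinguish them.

```python
def path_coverage(grid, path):
    """Fraction of distinct grid cells visited by the path (coverage-style metric)."""
    total_cells = len(grid) * len(grid[0])
    return len(set(path)) / total_cells

def path_total_cost(grid, path):
    """Actual total cost: sum of the cell costs along the path."""
    return sum(grid[r][c] for r, c in path)

# Toy 3x3 cost grid: the middle column of the top two rows is expensive.
grid = [
    [1, 9, 1],
    [1, 9, 1],
    [1, 1, 1],
]

# Two paths visiting 7 distinct cells each, from (0, 0) toward (0, 2) / (2, 2):
cheap  = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2)]
costly = [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (2, 1), (2, 2)]

print(path_coverage(grid, cheap) == path_coverage(grid, costly))  # True: identical coverage
print(path_total_cost(grid, cheap))   # 7
print(path_total_cost(grid, costly))  # 23
```

Under a coverage metric the two paths score identically, while their real costs (7 vs. 23) differ by more than a factor of three, which is exactly the kind of discrepancy a grid_highest_cost task needs to measure.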
Based on the evidence above, I strongly suspect the evaluation code was written by AI without being strictly reviewed.
If any of my statements are wrong, please let me know. Looking forward to your explanation and feedback on these questions.