Description
After conducting empirical tests and a code audit, I found that the current evaluation metrics (VBVR-Bench) fail to objectively reflect the reasoning capabilities of video models. In many cases, videos that completely violate the logic of the task are still awarded high scores, rendering the benchmark results unreliable.
For example, in the task grid_highest_cost, the evaluator should compute the total cost of the path in the generated video. However, in "vbvrevalkit/eval/vbvr_bench/evaluators/grid_highest_cost.py", I found that the total cost is merely approximated by the coverage of the path rather than the actual cost. More surprisingly, the function is never even called anywhere in the repository.
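To illustrate why coverage cannot stand in for cost, here is a minimal sketch. All names, the grid, and the paths below are my own illustration, not taken from vbvrevalkit: two paths that visit the same number of cells (identical coverage) can differ sharply in total cost, so a coverage-based score cannot distinguish them.

```python
def path_coverage(grid, path):
    """Fraction of distinct grid cells visited by the path (coverage-style metric)."""
    total_cells = len(grid) * len(grid[0])
    return len(set(path)) / total_cells

def path_total_cost(grid, path):
    """Actual total cost: sum of the cell costs along the path."""
    return sum(grid[r][c] for r, c in path)

# Toy 3x3 cost grid: the middle column of the top two rows is expensive.
grid = [
    [1, 9, 1],
    [1, 9, 1],
    [1, 1, 1],
]

# Two paths visiting 7 distinct cells each, from (0, 0) toward (0, 2) / (2, 2):
cheap  = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2)]
costly = [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (2, 1), (2, 2)]

print(path_coverage(grid, cheap) == path_coverage(grid, costly))  # True: identical coverage
print(path_total_cost(grid, cheap))   # 7
print(path_total_cost(grid, costly))  # 23
```

Under a coverage metric the two paths score identically, while their real costs (7 vs. 23) differ by more than a factor of three, which is exactly the kind of discrepancy a grid_highest_cost task needs to measure.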
Based on the evidence above, I strongly suspect the evaluation code was written by AI without being strictly reviewed.
If any of my statements are wrong, please let me know. Looking forward to your explanation and feedback on these questions.