visual-cot-eval

Evaluation framework for measuring the faithfulness of visual chain-of-thought (CoT) reasoning in vision-language models (VLMs).

Motivation

VLMs can arrive at correct answers through spurious reasoning: a model may claim "I can see a red car" when no car is present and still answer the question correctly. This tool checks whether intermediate reasoning steps are actually grounded in visual evidence, rather than plausible-sounding text that merely happens to lead to the correct answer.

Installation

git clone https://github.com/fengrui128/visual-cot-eval.git
cd visual-cot-eval
pip install -e .

Dependencies: nltk, spacy (en_core_web_sm), numpy.
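
The spaCy pipeline expects the en_core_web_sm model to be available; if it is missing, it can be fetched with spaCy's standard download command:

python -m spacy download en_core_web_sm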

Basic Usage

from vcot_eval.parser import CoTParser
from vcot_eval.grounding import GroundingChecker
from vcot_eval.faithfulness import FaithfulnessEvaluator

parser = CoTParser()
checker = GroundingChecker()
evaluator = FaithfulnessEvaluator(checker)

model_output = """
Step 1: I can see a dog sitting on the grass.
Step 2: The dog appears to be brown in color.
Step 3: Therefore, the answer is brown.
"""

detected_objects = ["dog", "grass", "fence"]
image_regions = [{"label": "dog", "bbox": [10, 20, 100, 150]}, {"label": "grass", "bbox": [0, 130, 300, 200]}]

steps = parser.parse(model_output)
result = evaluator.evaluate(steps, detected_objects, image_regions)

print(result)
# {'step_scores': [0.9, 0.85, 0.4], 'chain_score': 0.72, 'grounding_rate': 0.67}

Input Format

See data/sample_annotation.json for a full example. Each record should contain the following fields (a minimal sketch of a record follows this list):

  • image_id — string identifier
  • question — the question posed to the model
  • model_output — raw text output including reasoning steps
  • ground_truth — expected answer
  • detected_objects — list of object labels from an object detector
  • image_regions — list of {label, bbox} dicts (bbox as [x, y, w, h])
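
As a rough sketch, a single record loaded into Python might look like the following (the image_id, question, and ground_truth values are illustrative, not copied from data/sample_annotation.json):

record = {
    "image_id": "img_000123",
    "question": "What color is the dog?",
    "model_output": "Step 1: I can see a dog sitting on the grass. ...",
    "ground_truth": "brown",
    "detected_objects": ["dog", "grass", "fence"],
    "image_regions": [
        {"label": "dog", "bbox": [10, 20, 100, 150]},    # bbox as [x, y, w, h]
        {"label": "grass", "bbox": [0, 130, 300, 200]},
    ],
}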

Running Evaluation

python scripts/evaluate.py --input data/my_outputs.json --output results/eval_out.json
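
Conceptually, the batch script applies the per-record API from Basic Usage to every record in the input file. The sketch below shows that loop under the assumption that the input JSON holds a list of records in the format above; the exact output schema of scripts/evaluate.py may differ:

import json

from vcot_eval.parser import CoTParser
from vcot_eval.grounding import GroundingChecker
from vcot_eval.faithfulness import FaithfulnessEvaluator

parser = CoTParser()
evaluator = FaithfulnessEvaluator(GroundingChecker())

with open("data/my_outputs.json") as f:
    records = json.load(f)  # assumed: a list of records as described in Input Format

results = []
for rec in records:
    steps = parser.parse(rec["model_output"])
    scores = evaluator.evaluate(steps, rec["detected_objects"], rec["image_regions"])
    results.append({"image_id": rec["image_id"], **scores})

with open("results/eval_out.json", "w") as f:
    json.dump(results, f, indent=2)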

Output Metrics

  • grounding_rate: fraction of steps that reference at least one detected object
  • step_faithfulness_score: per-step score in [0, 1] based on object and region overlap
  • chain_consistency: detects contradictory claims across steps (e.g., "red car" then "blue car")
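
The contradiction check behind chain_consistency can be pictured as tracking which attributes each step assigns to each object. The helper below is only a hypothetical illustration using spaCy's dependency parse (detect_contradictions is not a vcot_eval function); it flags adjective-noun pairs that conflict across steps:

import spacy

nlp = spacy.load("en_core_web_sm")

def detect_contradictions(steps):
    # Illustrative only: record the adjective attached to each noun (amod relation)
    # and flag a later step that attaches a different adjective to the same noun.
    seen = {}       # noun lemma -> (adjective lemma, step index)
    conflicts = []
    for i, step in enumerate(steps):
        for token in nlp(step):
            if token.dep_ == "amod" and token.head.pos_ == "NOUN":
                noun, adj = token.head.lemma_, token.lemma_
                if noun in seen and seen[noun][0] != adj:
                    conflicts.append((seen[noun][1], i, noun, seen[noun][0], adj))
                seen[noun] = (adj, i)
    return conflicts

print(detect_contradictions(["There is a red car on the street.", "The blue car is parked."]))
# [(0, 1, 'car', 'red', 'blue')]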

Notes

  • Grounding is NLP-based (token matching + lemmatization), not vision-based. It checks whether the text references things that are actually in the image, not whether the visual reasoning is correct; a minimal sketch follows this list.
  • Designed for offline evaluation of saved model outputs, not real-time inference.
  • Tested on LLaVA-1.5 and InstructBLIP outputs.
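
As a minimal sketch of the text-side grounding check described above (the is_grounded helper is hypothetical, not the GroundingChecker API), lemmatized tokens from a step can be matched against detector labels like this:

import spacy

nlp = spacy.load("en_core_web_sm")

def is_grounded(step_text, detected_objects):
    # Illustrative only: a step counts as grounded if any lemma matches a detector label.
    labels = {label.lower() for label in detected_objects}
    return any(tok.lemma_.lower() in labels for tok in nlp(step_text))

print(is_grounded("I can see two dogs sitting on the grass.", ["dog", "grass", "fence"]))
# True  ("dogs" lemmatizes to "dog", which is among the detector labels)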

License

MIT
